
      Joint head pose and facial landmark regression from depth images

Computational Visual Media, 2017, Issue 3 (published online 2017-09-15)

Jie Wang, Juyong Zhang, Changwei Luo, and Falai Chen

© The Author(s) 2017. This article is published with open access at Springerlink.com


This paper presents a joint head pose and facial landmark regression method with input from depth images for real-time application. Our main contributions are: firstly, a joint optimization method to estimate head pose and facial landmarks, i.e., the pose regression result provides supervised initialization for cascaded facial landmark regression, while the regression result for the facial landmarks can also help to further refine the head pose at each stage. Secondly, we classify the head pose space into 9 sub-spaces, and then use a cascaded random forest with a global shape constraint for training facial landmarks in each specific space. This classification-guided method can effectively handle the problems of large pose changes and occlusion. Lastly, we have built a 3D face database containing 73 subjects, each with 14 expressions in various head poses. Experiments on challenging databases show our method achieves state-of-the-art performance on both head pose estimation and facial landmark regression.

head pose; facial landmarks; depth images

      1 Introduction

Estimation of human head pose and detection of facial landmarks such as eye corners, nose tip, mouth, and chin are of central importance to facial animation [1], expression analysis [2], face recognition, etc. These two problems have been studied separately for many years [3–7], with significant progress for images [8–11]. However, image-based methods are always subject to illumination and pose angle variations [12], which lead to many limitations. In recent years, with the development of RGBD technology such as Microsoft Kinect and Intel RealSense, head pose estimation and facial landmark detection methods based on depth data have attracted more and more attention due to the rich geometric information in depth images.

In recent years, cascaded regression approaches have been successfully applied to pose estimation [13, 14] and facial landmark regression [15–17]. Initialization is a key factor in such regression methods: if the initial value lies far from the actual shape, the final regression result will not be satisfactory. In existing methods, the mean shape is often used for initialization [13, 18], but it often fails in challenging cases with extreme head pose or partial occlusion. Tracking methods [7, 19, 20] can handle the problems caused by large pose variation and occlusion, and achieve fast speed at the same time, but leave problems such as drift and initialization to be overcome. The main problems in depth images are noise and even missing data, so a good initial value is quite important for facial landmark detection. On the other hand, the structural information provided by facial landmarks [21] is a major clue for head pose estimation. Therefore, joint head pose and facial landmark regression can be helpful for mutual optimization, improving the accuracy of both tasks.

In this paper, we propose a joint optimization approach which regresses the head pose and facial landmarks together via the random forest framework. Figure 1 shows one result of joint 3D head pose and facial landmark regression. The main contributions of our approach are as follows:

Fig. 1 Joint 3D head pose and facial landmark regression. Red points represent the ground truth shape while blue points represent the estimated shape. The orange axis indicates the head orientation.

• Joint optimization. The pose regression result provides supervised initialization for cascaded facial landmark regression, while the estimated facial landmarks can also help to refine the head pose in each stage. The head pose and facial landmarks are jointly optimized stage by stage, and iteratively converge to optimal results.

• Classification-guided. Unlike existing methods, which train one general model for a variety of input types, our proposed method divides the input data into 9 types according to head orientation, and trains a specific model for each type. In this way, our classification-guided facial landmark regression method achieves more accurate results than existing methods, and it can handle challenging cases including large pose variations and occlusion.

• A new database. Since depth images of faces with well-labelled head pose and facial landmarks are rarely available, we have constructed a database of such data with 73 persons, each with 14 different expressions in various head poses. It contains 130,000 well-labelled depth images.

The remainder of this paper is structured as follows: Section 2 briefly introduces related work. In Section 3, we present our method of head detection, head pose estimation, and facial landmark localization. Experimental results are given in Section 4, and we conclude the paper in Section 5.

      2 Related work

      2.1 Head pose estimation

Fanelli et al. [22–24] adopted a voting method to directly determine head pose. However, their feature selection method for depth images degenerates into using 2D features, i.e., the RGB information used in 2D images was replaced by xyz-coordinate values in depth images. Tan et al. [20] pointed out that when the head is translated along the camera's viewing direction, 2D features cannot determine the relationships between the object's projective view and the corresponding transformation parameters. Therefore, they captured subject-specific structures by combining a generalized model-based tracker with an online learning method. Papazov et al. [25] also used a random forest-based framework, in a similar way to the methods in Refs. [22–24]. They replaced depth features by more elaborate triangular surface patch (TSP) features to ensure view-invariance. Their method achieved better results, but with more computation because of the calculation of TSP features.

      2.2 Facial landmark localization

Active appearance models (AAMs) [26–30] fit the input image by minimizing the error between the reconstructed texture model and the input image. These methods always require global features, and thus lose details in some cases. We therefore mainly consider methods based on independent features, which lie in small patches around each landmark, such as active shape models (ASMs), constrained local models (CLMs), or models implicitly learned from the data by machine learning or deep learning. Ruiz and Illingworth [31] extended ASMs to 3D shapes for automatic localization of landmarks. Their model was adapted to individual faces via a guided search, which guided the specific shape index models to match local surface patches. Gilani et al. [32] detected a large number of anthropometric landmarks in 3D faces, in common with Ruiz and Illingworth. Instead of using simple geometric matching, their algorithm builds dense correspondences. Baltrušaitis et al. [19] extended CLMs to 3D (CLM-Z) for robust facial feature tracking under varying pose. Since methods based on direct shape-matching are often time-consuming, decisions from the local models are always fused with a global shape model. Jourabloo and Liu [33, 34] estimated both 2D and 3D landmarks and their 2D visibilities for a face image with an arbitrary pose.

      3 Method

The pipeline and structure of our method are presented in Fig. 2. To facilitate our exposition, two concepts frequently used in this paper are explained first.

Fig. 2 Pipeline of our method: (a) head detection, (b) head pose estimation, (c) supervised initialization for cascaded facial landmark localization, (d) pose-space selection in the cascaded regression phase, and (e) cascaded facial landmark localization via the classification-guided approach.

Bounding box: The bounding box of a point cloud is defined as the smallest 3D cuboid which contains it, measured in terms of cuboid volume. For example, the bounding box in Fig. 3(c) is the blue cuboid. The bounding box of the reference face template shown in Fig. 6 is denoted R_b in our method.

3D patch: Just as a patch always represents a region in a 2D image, a 3D patch is a region that belongs to a 3D point cloud. See, e.g., Fig. 4: the red, green, and blue regions are all 3D patches.

3.1 Head detection

The discriminative random regression forest (DRRF) [23] method can be used to discriminate depth patches that belong to a head, ensuring that only such patches are used to predict head localization and head orientation. We use the DRRF method only to estimate the head localization, which is defined as the barycenter of the head point cloud and called the head center in the following. We illustrate the head detection process in Fig. 3. We first estimate the head center by DRRF as the center of a sphere with an empirically fixed radius of 60 mm (small blue sphere). We then compute the mean of the points which lie within the blue sphere, set it as a new center (black point), and generate a further sphere of radius d = 75 mm (red sphere) around it. All points which lie within the red sphere are collected; they constitute the head point cloud, whose bounding box is then determined.
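This two-sphere procedure can be sketched as follows. This is an illustrative sketch in plain Python, not the paper's implementation; the helper names, the radii defaults, and the synthetic point cloud are ours:

```python
def sphere_points(points, center, radius):
    """Return the points lying within a sphere of the given radius."""
    r2 = radius * radius
    return [p for p in points
            if sum((a - b) ** 2 for a, b in zip(p, center)) <= r2]

def detect_head(points, coarse_center, r_small=60.0, r_large=75.0):
    """Two-stage head extraction (units: mm): refine the coarse center
    (from DRRF in the paper) by averaging the points inside the small
    sphere, then collect the head point cloud inside the larger sphere."""
    inner = sphere_points(points, coarse_center, r_small)
    if not inner:
        return coarse_center, []
    refined = tuple(sum(c) / len(inner) for c in zip(*inner))
    head_cloud = sphere_points(points, refined, r_large)
    return refined, head_cloud

def bounding_box(points):
    """Axis-aligned bounding box (min corner, max corner) of a cloud."""
    mins = tuple(min(p[i] for p in points) for i in range(3))
    maxs = tuple(max(p[i] for p in points) for i in range(3))
    return mins, maxs
```

The refined center here plays the role of the black point in Fig. 3, and the final bounding box corresponds to the blue cuboid in Fig. 3(c).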

Fig. 3 Head detection: (a) fixing the head center and generating the red sphere, (b) head point cloud, and (c) bounding box of the head point cloud.

Fig. 4 Feature selection. Top: feature used in head pose estimation. Middle, bottom: two kinds of features used in the facial landmark localization framework. Note that the latter features are well aligned.

      3.2 Feature selection

Feature selection is a critical factor in the random forest framework [35–38]. A good feature selection method can help to appropriately classify the training samples in each non-leaf node, so we first consider the feature selection method before head pose regression and facial landmark localization.

      3.2.1 Feature selection for pose regression

In the head pose estimation phase, each tree in the forest is constructed from a set of 3D patches {L_i} randomly sampled from the training samples. The scale of each L_i depends on the scale of the facial bounding box to which it belongs. Since each L_i is also a group of 3D points, we can get its bounding box B_i. At each non-leaf node, starting from the root, suppose there are M 3D patches (e.g., the red 3D patches in the top row of Fig. 4), each with its corresponding bounding box B_j, j = 1, ..., M. In order to divide these M 3D patches into two categories, and to ensure that the difference between the 3D patches that fall into the same child node is as small as possible, we randomly choose K pairs of small 3D patches {l^{k,1}, l^{k,2}} within R_b, and map them to each B_j with the following equation:

A(Box(l_j^{k,c})) / A(B_j) = A(Box(l^{k,c})) / A(R_b),  c = 1, 2,  k = 1, ..., K    (1)

where Box(l) represents the bounding box of l, and A represents one attribute of the bounding box, i.e., its length, height, or width; in addition, when locating a small cube in a bounding box, A also represents the distance between each side of this cube and the corresponding side of the big bounding box to which it belongs. Each pair of blue and green parts in Fig. 4 shows a pair of small 3D patches belonging to {l_j^{k,c}} for specific j, k, and c. It should be noted that R_b is only used as an arbitrary reference within which we can choose 3D feature patches: we could use any other cube instead of R_b. Then, we can compute the feature difference f_j^k = D(l_j^{k,1}) − D(l_j^{k,2}) for the j-th training sample, where D(l) is the mean depth of the point cloud of the 3D patch l.

      3.2.2 Feature selection for facial landmark regression

In the facial landmark regression phase, each tree in the forest is constructed from training samples (a set of head point clouds). When using a cascaded random forest to locate facial landmarks, we hope that the selected features for each landmark are unaffected by the head pose. Consider Fig. 4, where each row corresponds to one feature: each pair of features has similar localization in different samples. Denote the transformation of the estimated head pose by R_t; R_t^{-1} is its inverse. The influence of head pose on feature selection can be eliminated by transforming each head point cloud by R_t^{-1}. Having chosen features around each landmark in the template, these features are put into correspondence with each transformed head point cloud using the scale transformation between the bounding box of the template and that of the head point cloud. The feature difference is computed as f = D(l^1) − D(l^2), where D(l) is the mean depth of the point cloud in a 3D patch l.

      3.3 Head pose estimation

Our supervised initialization for the cascaded random forest is based on an accurately estimated head pose, obtained using a random forest framework. The training database is pre-processed and annotated with ground truth poses. In this paper, we use the axis–angle representation to describe the head orientation, and the position of the head is represented by the tip of the nose. Compared to using Euler angles, the axis–angle representation makes it easier to represent the direction vector of the head orientation, and we do not need to consider the problem of gimbal lock which arises in the Euler representation. The head pose P can be denoted by a 6-dimensional vector P = (ψ_x, ψ_y, ψ_z, t_x, t_y, t_z), where (t_x, t_y, t_z) is the location of the tip of the nose, while the head orientation is given by (ψ_x, ψ_y, ψ_z). The orientation can be decomposed as (ψ_x, ψ_y, ψ_z) = θ(e_x, e_y, e_z) = (θe_x, θe_y, θe_z), where (e_x, e_y, e_z) is a rotation axis encoded as a unit vector, and θ is the rotation angle.
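The packing of the unit axis and rotation angle into a single 3-vector, and its inversion, can be sketched in plain Python (the function names are ours, used purely for illustration):

```python
import math

def axis_angle_to_vector(axis, theta):
    """Pack a rotation axis and angle theta into the 3-vector
    (theta*e_x, theta*e_y, theta*e_z); the axis is normalized first."""
    n = math.sqrt(sum(a * a for a in axis))
    return tuple(theta * a / n for a in axis)

def vector_to_axis_angle(psi):
    """Recover (unit axis, angle) from the packed 3-vector: the angle is
    the vector's norm and the axis is its direction."""
    theta = math.sqrt(sum(p * p for p in psi))
    if theta == 0.0:                     # identity rotation: axis arbitrary
        return (1.0, 0.0, 0.0), 0.0
    return tuple(p / theta for p in psi), theta
```

Because the angle is simply the norm of (ψ_x, ψ_y, ψ_z), this representation stays smooth and singularity-free, which is the gimbal-lock advantage mentioned above.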

      3.3.1 Training

Beginning with a set of depth images, we can get the corresponding head point clouds using the method in Section 3.1. Each tree in the forest is constructed from a set of 3D patches {L_i, Θ_i} randomly sampled from these head point clouds, where Θ_i = {ψ_x^i, ψ_y^i, ψ_z^i, Δt_x^i, Δt_y^i, Δt_z^i} contains the head orientation {ψ_x^i, ψ_y^i, ψ_z^i}, and {Δt_x^i, Δt_y^i, Δt_z^i} is the offset between the i-th 3D patch's center and the ground truth localization of the tip of the nose.

We define the binary test in each non-leaf node as β_{L_i,τ}:

β_{L_i,τ}: D(l_i^1) − D(l_i^2) > τ    (2)

where D(l) is the mean depth of the points in the 3D patch l, and l_i^1, l_i^2 belong to L_i, as described in Section 3.2; τ is a threshold. Using the mean depth instead of a single point makes the feature less sensitive to noise.

Each non-leaf node chooses the best splitting function β* out of {β_k} by minimizing:

E(β) = |L| (σ_ψ(L) + σ_Δt(L)) + |R| (σ_ψ(R) + σ_Δt(R))    (3)

where |L| and |R| are the numbers of training samples that are routed to the left and right child nodes according to β_{L_i,τ}, and σ_ψ and σ_Δt are the traces of the covariance matrices of the rotation angles and offsets. After choosing the best binary test β* in one non-leaf node, the training samples in this node are split using Eq. (2). This process is iterated, a leaf being created when either the maximum tree depth is reached or fewer than a certain number of 3D patches are left. When the training process is finished for one tree, each leaf node stores several training samples with their ground truth labels {Θ_i}. We compute the mean vector and the corresponding determinant of the covariance matrix Σ of these ground truth labels and save them in each leaf node.
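The split-selection rule of Eq. (3) can be sketched as follows. This is a simplified illustration, assuming each sample carries one scalar feature value (the depth difference of Eq. (2)) and one label vector standing in for Θ; the helper names are ours:

```python
def trace_of_covariance(labels):
    """Sum of per-dimension variances (trace of the covariance matrix)
    of a list of label vectors."""
    n = len(labels)
    if n == 0:
        return 0.0
    total = 0.0
    for d in range(len(labels[0])):
        mean = sum(v[d] for v in labels) / n
        total += sum((v[d] - mean) ** 2 for v in labels) / n
    return total

def best_split(samples, candidate_taus):
    """Pick the threshold whose left/right partition minimizes the
    size-weighted label scatter, in the spirit of Eq. (3).  Each sample
    is (feature_value, label_vector); it goes left when value <= tau."""
    best_tau, best_score = None, float("inf")
    for tau in candidate_taus:
        left = [lab for f, lab in samples if f <= tau]
        right = [lab for f, lab in samples if f > tau]
        score = (len(left) * trace_of_covariance(left)
                 + len(right) * trace_of_covariance(right))
        if score < best_score:
            best_tau, best_score = tau, score
    return best_tau, best_score
```

In the paper the scatter is computed separately for the orientation and offset parts of the label; the sketch treats the label as one vector for brevity.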

Equation (3) ensures that the difference between samples that fall into the same child node is as small as possible, while the difference between samples in different child nodes is as large as possible. We compare our entropy in Eq. (3) with the classical information gain function [23] in Eq. (12) in Section 4.

      3.3.2 Testing

In the testing phase, 3D patches {L_k} are densely extracted from one testing head point cloud, each with a label k = 1, ..., K, where (c_x^k, c_y^k, c_z^k) is the center localization of the 3D patch L_k in the real world (camera coordinates). All of these patches are sent into the forest, the binary tests stored in the non-leaf nodes guiding each 3D patch all the way to a leaf node of each tree in the forest. After all 3D patches have travelled through the forest, a set of leaves with their mean vectors and the determinants of their covariance matrices remains as a basis for predicting the head pose. However, not all leaves can be considered for pose regression: only those whose covariance matrix determinant is less than the maximum allowed determinant should be taken into account. Suppose that Θ_m = (ψ_x^m, ψ_y^m, ψ_z^m, Δt_x^m, Δt_y^m, Δt_z^m) is the mean vector for the head pose stored in the m-th leaf, and the 3D patch L_k lies in this leaf. Then, the head pose predicted by the chosen leaf and 3D patch L_k can be expressed as

P_k = (ψ_x^m, ψ_y^m, ψ_z^m, c_x^k + Δt_x^m, c_y^k + Δt_y^m, c_z^k + Δt_z^m)    (4)

In order to further reduce the influence of outliers, we perform a k-means algorithm on the remaining leaves, and the cluster with the largest number of leaves is used for determining the head pose.
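This filter-then-cluster vote fusion might be sketched as below. It is a simplified illustration with a deterministic k-means initialization; the function names, the cluster count, and the thresholds are our choices, not the paper's:

```python
def kmeans_largest(votes, k=2, iters=10):
    """Plain k-means on pose vote vectors with a crude deterministic
    initialization; returns the members of the largest cluster."""
    centers = list(votes[:k])
    clusters = [list(votes)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in votes:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(v, centers[c])))
            clusters[i].append(v)
        centers = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return max(clusters, key=len)

def fuse_votes(leaf_votes, leaf_dets, max_det):
    """Keep only votes from leaves whose covariance determinant is below
    max_det, cluster the survivors, and average the largest cluster."""
    kept = [v for v, d in zip(leaf_votes, leaf_dets) if d < max_det]
    if not kept:
        return None
    biggest = kmeans_largest(kept, k=min(2, len(kept)))
    n = len(biggest)
    return tuple(sum(x) / n for x in biggest and zip(*biggest))
```

The determinant filter discards uncertain leaves, and taking the largest cluster discards geometrically isolated votes, so a single far-off leaf prediction cannot drag the averaged pose away.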

      3.4 Pose-based 3D facial landmark localization

3.4.1 Pose classification

Our classification-guided method divides the samples into 9 pose views (upper-left, upper-center, upper-right, left, center, right, lower-left, lower-center, lower-right) and trains 9 models, one in each sub-space. The training samples are split into these 9 sub-spaces before the cascaded facial landmark regression phase, and the training phase is carried out in each sub-space in turn. When testing with a new sample, we must first determine which category it belongs to. The distance to each sub-space is measured with Gaussian probabilities: the orientation space is discretized into disjoint sets {P_i}, i = 1, ..., 9, and for an estimated head pose P, the Gaussian kernel between P and P_i is computed as

K(P, P_i) = exp(−‖P − w_i‖² / (2ε²))    (5)

where w_i is the centroid of P_i, and ε is a bandwidth parameter. We choose the view space P_i which has the highest Gaussian probability when estimating the facial landmarks.
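The sub-space choice can be sketched as follows (an illustrative helper of ours; the bandwidth value is arbitrary and, since the Gaussian weight is monotone in the distance to the centroid, the choice reduces to nearest-centroid selection):

```python
import math

def choose_pose_space(pose, centroids, bandwidth=0.5):
    """Return the index of the pose sub-space whose centroid w_i gives
    the highest Gaussian weight exp(-||P - w_i||^2 / (2*bandwidth^2))
    for the estimated orientation, as in Eq. (5)."""
    def weight(w):
        d2 = sum((p - c) ** 2 for p, c in zip(pose, w))
        return math.exp(-d2 / (2.0 * bandwidth ** 2))
    return max(range(len(centroids)), key=lambda i: weight(centroids[i]))
```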

      3.4.2 Supervised initialization via head pose

In Section 3.3, we presented our method of head pose estimation. Our estimated result is accurate, and as a result, provides important and helpful initialization information at the beginning of the cascaded shape regression phase.

Supervised initialization is based on the estimated head pose and a mean template marked with 33 facial landmarks, as shown in Fig. 6. The estimated head orientation is first converted into the rotation matrix R_t, then the 33 facial landmarks in the mean template are rotated by R_t and translated to the estimated nose tip. Thus, a coarse initialization of the facial landmarks is given by the transformed facial landmarks on the template. In order to refine the initial face shape, we further move the face shape along the direction of the head orientation using a rigid translation, so that the initial location of the nose tip is moved to a point belonging to the head point cloud.
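The coarse step (rotate the template landmarks, then translate them onto the estimated nose tip) can be sketched as below. The z-axis rotation stands in for an arbitrary estimated R_t, and we assume for illustration that the template's nose tip sits at the origin; both assumptions are ours:

```python
import math

def rotation_z(theta):
    """Rotation matrix about the z-axis (a stand-in for the estimated R_t)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply(R, p):
    """Multiply a 3x3 matrix by a 3D point."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) for i in range(3))

def init_landmarks(template, R, nose_tip):
    """Rotate the template landmarks by the estimated head rotation and
    translate them so the template nose tip (assumed at the origin)
    lands on the estimated nose tip."""
    return [tuple(r + t for r, t in zip(apply(R, p), nose_tip))
            for p in template]
```

The subsequent refinement (sliding the shape along the head orientation until the nose tip touches the head point cloud) is a simple line search over the cloud and is omitted here.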

      3.4.3 Joint head pose and facial landmark regression

We now present our facial landmark localization method in one specific pose space. An n-point shape in 3D can be defined by concatenating all the points' coordinates into a k = n × 3 vector:

S = (x_1, y_1, z_1, ..., x_n, y_n, z_n)    (6)

where (x_j, y_j, z_j) represents the j-th landmark. Beginning with an initial shape S^0 and the target shape ~S, the main idea of cascaded shape regression is to obtain a regressor R^t at each stage t, so that by a certain stage T:

S^T = S^0 + Σ_{t=1}^{T} R^t(F, S^{t−1}) ≈ ~S    (7)

where F denotes the input head point cloud.

In machine learning terms, given a training database with N samples, where {F_i} is the set of training samples and {~S_i} is the set of corresponding ground truth shapes, Eq. (7) is typically satisfied by obtaining a set of regressors (R^1, ..., R^t, ..., R^T) by solving:

R^t = arg min_R Σ_{i=1}^{N} ‖~S_i − (S_i^{t−1} + R(F_i, S_i^{t−1}))‖²    (8)

Learning Φ^t. When learning Φ^t using Eq. (8), we begin with N training samples {F_i, S_i^0, P_i^0, ~S_i}, where F_i is the i-th head point cloud with its estimated head pose P_i^0, while S_i^0 and ~S_i are the initial shape (see Section 3.4.2) and the ground truth shape for the i-th sample respectively. When training one tree using these samples, a group of 3D patches {L_k} is chosen for each non-leaf node as described in Section 3.2, while the binary test in each non-leaf node is given by

β_{L_k,τ}: D(l_k^1) − D(l_k^2) > τ    (9)

The feature and binary test that give rise to the minimum entropy (Eq. (3)) are chosen to split the samples in each node. The training process for one tree is iterated until a leaf is created, which happens when either the maximum tree depth is reached or fewer than a certain number of training samples are left.

After training of the random forest for the t-th stage has finished, each component of φ_j^t(F_i, S_i^{t−1}, P_i^{t−1}) for an arbitrary sample {F_i, ~S_i} is determined to be either 0 or 1 by passing the sample through the random forest, where 1 indicates that the corresponding leaf contains this sample while 0 indicates otherwise. All the φ_j^t are concatenated to form a high-dimensional binary feature Φ^t: Φ^t = [φ_1^t, ..., φ_33^t], where φ_j^t (j = 1, ..., 33) are the local binary features for each landmark. Let D_j represent the total number of leaf nodes in the random forest for the j-th landmark. Then φ_j^t is a D_j-dimensional binary vector, and W^t, which stores the displacement associated with each corresponding leaf node, is a matrix in which each column is a 3D vector.
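The construction of this sparse binary feature can be sketched as follows (an illustrative encoding of ours: one one-hot vector per tree/landmark, concatenated into Φ^t):

```python
def local_binary_feature(leaf_index, n_leaves):
    """One-hot encoding of the leaf a sample reached in one forest:
    exactly one of the n_leaves entries is 1."""
    phi = [0] * n_leaves
    phi[leaf_index] = 1
    return phi

def concat_features(leaf_indices_per_landmark, leaves_per_landmark):
    """Concatenate the per-landmark one-hot vectors into the global
    binary feature Phi = [phi_1, ..., phi_33]."""
    phi = []
    for idx, n in zip(leaf_indices_per_landmark, leaves_per_landmark):
        phi.extend(local_binary_feature(idx, n))
    return phi
```

Because each per-landmark block contains a single 1, multiplying W^t by Φ^t amounts to summing one stored displacement column per landmark forest, which is why the global regression step is cheap at test time.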

Learning W^t. Since W^t does not contain global shape information, we re-compute W^t with the estimated Φ^t by performing the following minimization:

W^t = arg min_W Σ_{i=1}^{N} ‖~S_i − S_i^{t−1} − W Φ^t(F_i, S_i^{t−1}, P_i^{t−1})‖²    (10)

Variable update in each stage. After building the forest in stage t, we can get the local binary features Φ^t = [φ_1^t, ..., φ_33^t] for each training sample and the global linear regression matrix W^t as described above. Then we update the face shapes of all training samples using:

S_i^t = S_i^{t−1} + W^t Φ^t(F_i, S_i^{t−1}, P_i^{t−1})    (11)
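The stage-by-stage additive update can be sketched abstractly as follows, modelling each stage as a function that returns the shape increment W^t Φ^t (the names and the toy stages are ours):

```python
def run_cascade(shape0, stages):
    """Apply the additive cascade update S <- S + delta(S) stage by
    stage, where each stage stands in for W^t * Phi^t(F, S, P)."""
    shape = list(shape0)
    for stage in stages:
        delta = stage(shape)
        shape = [s + d for s, d in zip(shape, delta)]
    return shape
```

With stages that each remove half of the remaining residual toward a target shape, the cascade converges geometrically, which mirrors how the landmark estimates in Fig. 5 improve from the initial guess.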

      4 Experiments

      4.1 Experimental setup

Fig. 5 Cascaded facial landmark localization. Starting from the initial guess, the locations of the landmarks are gradually optimized.

We performed training and testing on a Windows 64-bit desktop with 4 GB RAM and a 3.3 GHz Intel Core i5 CPU. We mainly used three databases. Two were from the ETHZ Computer Vision Lab [22–24] and are available online: the BIWI Kinect Head Pose database (BIWI) and the BIWI 3D Audiovisual Corpus of Affective Communication database (B3D(AC)2). In order to test our algorithm on different kinds of head pose and facial expression, we constructed our own 3D Faces with Expression and Pose database (3DFEP).

      4.1.1 ETHZ databases

The BIWI database contains 24 sequences acquired with a Kinect sensor. The ground truth for the head pose indicates the head center in 3D space and the head orientation encoded as a 3×3 rotation matrix. In this paper, we acquire the ground truth localization of the nose tip with the aid of the iterative closest point (ICP) algorithm, and the head orientation is transformed into axis–angle representation.

The B3D(AC)2 database contains 14 persons, each captured in 40 poses using a 3D scanner. We manually marked 33 landmarks on the mean template, as shown in Fig. 6. We mapped these marked points to each sequence to obtain ground truth facial landmarks, while the ground truth head orientation in axis–angle representation and the position of the nose tip were computed at the same time.

      4.1.2 3DFEP database

Fig. 6 Ground truth facial landmarks for one example in the B3D(AC)2 database.

In order to test our algorithm on different kinds of head pose and facial expression, we built a database with 73 persons, each with 14 expressions in various head poses. We scanned the subjects using an Intel RealSense sensor. The format of the depth data and RGB images is the same as that used to encode the BIWI database. To provide ground truth head poses and facial landmarks, we used 3D face modeling and tracking techniques, and poor calibration results were eliminated by checking each one in turn. Again, 33 landmarks are used, and the ground truth head pose is also given by axis–angle representation and nose tip.

      4.2 Head pose estimation

Our proposed head pose estimation method was evaluated on the above three databases. Various fixed parameters are given in Table 1. Further parameters were: maximum variance to eliminate stray leaf nodes: 300; ratio of large 3D patch to face bounding box: 0.7; ratio of sub-3D patch to face bounding box: 0.3; number of large 3D patches sampled from each head point cloud: 25.

      Table 1 Parameters used for training

Figure 7 (left) shows the relationship between the average error in head pose and the depth of the tree: error reduces as tree depth increases. Figure 7 (middle) shows the relationship between accuracy and tree depth: accuracy increases with tree depth. Figure 7 (right) shows that as the maximum depth of the tree increases, the number of samples in each leaf node reduces, while the time consumption grows slowly. All experiments in Fig. 7 were done with the BIWI database.

Table 2 is in three parts. The first four rows show head pose estimation results using our method with different databases. For each database, we use 70% of the samples for training and the remaining 30% for testing. Note that the head pose estimation result for the B3D(AC)2 database is more accurate than for the other databases, because all samples in the B3D(AC)2 database belong to the same pose space. We also trained a head pose estimation model by combining the BIWI and B3D(AC)2 databases; the results were similar to those obtained by only using the BIWI database. The results in the next three rows of Table 2 show that by using axis–angle representation, head pose estimation is more accurate than using the Euler angle form directly. Moreover, using entropy (Eq. (3)) is more effective than using information gain [23]:

IG = log|Σ| − Σ_{i∈{L,R}} (|i| / (|L| + |R|)) log|Σ_i|    (12)

where Σ, Σ_L, and Σ_R are the covariance matrices of the labels at the parent node and its left and right children respectively.

Table 2 Head pose estimation results. The first four rows give results using different databases and our method; the associated training parameters are given in Table 1. The next three rows show results for head pose obtained by transforming the estimated pose from axis–angle representation to Euler angle form (A→E&N). We also compare results with different loss functions: entropy and information gain. The last four rows give head pose estimation results from other methods.

Fig. 7 Left: relationship between the average error of each head pose component and tree depth. The left-hand vertical axis measures the positional error of the nose (shown as solid lines) while the right-hand vertical axis represents the orientation error (dotted lines). Center: relationship between accuracy and tree depth. Orange: ratio of samples whose position error satisfies [tx<5, ty<4, tz<4]. Blue: ratio of samples whose position error satisfies [tx<10, ty<10, tz<10]. Right: relationship between tree depth and time.

Table 3 Estimation errors for facial landmark localization for the B3D(AC)2 and 3DFEP databases. M-initial: cascaded facial landmark localization with the mean shape as initialization. V-based: cascaded facial landmark localization with the position-aware shape as initialization. (Unit: mm)

We also later show some head pose estimation results for the BIWI database in cases with much missing head data (see Fig. 10).

      4.3 Facial landmark localization

      We used the B3D(AC)2and 3DFEP databases to perform some facial landmark localization experiments.

Since all samples in the B3D(AC)2 database belong to the same pose space, they need not be classified with respect to pose space. Errors for each facial landmark are computed in the Euclidean norm. Table 3 lists the errors for each landmark using three methods. V-based gives results of the pose-based facial landmark localization method; it shows that the maximum estimation error of the facial landmarks is less than 1.6 mm. In order to show the importance of pose-based initialization in cascaded shape regression, in the M-initial method we simply place the mean shape in the bounding box as the initial shape of each training sample, while the other parameters in the training phase are held the same as for the pose-based method. The error is two to three times larger than the errors for the V-based method. We also compare our results with the results reported by Fanelli et al. [40] for the B3D(AC)2 database.

Fig. 8 Success rate with varying occlusion for the B3D(AC)2 database.

Fig. 9 Success rate of facial landmark localization in different pose-spaces for the 3DFEP database.

Fig. 10 Robustness of the proposed method to occlusion. Top two rows: results of head pose estimation on the BIWI database. Next two rows: facial landmark results on the B3D(AC)2 database. Red points: ground truth shape. Blue points: estimated shape. Blank areas in the depth images in the second and fourth rows represent parts removed from the input data. Last row: facial landmark localization results for large pose angles, using 3DFEP. To show the face shapes clearly, we have adjusted the corresponding point clouds to a suitable view.

Figure 8 shows the rate of success for different degrees of occlusion, using three thresholds for assessing success. We also show some facial landmark localization results for the B3D(AC)2 database in the third and fourth rows of Fig. 10. It should be noted that the estimation results are accurate even for samples with missing data.

We also tested our algorithm on the 3DFEP database built by ourselves, containing different kinds of head poses and facial expressions. We chose 6867 training samples from 3DFEP and classified the head pose space into 9 categories. The facial landmark localization model was then trained in each sub-space. Figure 9 shows the success rate of facial landmark localization when the mean error threshold of the facial landmarks is set to 7 mm. The dataset was divided based on the projection of the head unit direction vector along the x- and y-axes, in areas of 0.25×0.25. The color-coding represents the amount of data contained in each angle range, as indicated by the color bar on the right. The performance decreases in the corner regions, where the rotations are very large or the number of images is low. The last row of Fig. 10 shows facial landmark localization results for the 3DFEP database. Our pose-based 3D facial landmark localization method achieves good results on this database. Table 3 shows the error of face shapes with 3DFEP in the neutral-view space. Since the ground truth shapes in our database are inherently noisy, the errors are larger than those for the B3D(AC)2 database. When testing with the 3DFEP database and 9 training models, each training model contained 198 trees, with 15 layers for each tree. The average time taken for each stage in the cascaded facial landmark regression phase was about 35 ms.

      5 Conclusions

We have presented a joint 3D head pose and facial landmark regression method for depth images. Our head pose estimation result is accurate and robust even for low quality depth data; it also provides a supervised initialization for the cascaded facial landmark estimation. Moreover, the head pose can be further refined using the estimated facial landmarks at each stage with our joint optimization framework. Our cascaded regression is based on a classification-guided approach, which divides the original complex problem into several easier sub-problems. In this way, our facial landmark localization approach can handle challenging cases including large pose angles and occlusion. We have also built a 3D face database which contains 73 persons, each with 14 expressions in various head poses, which can be used to train models on different kinds of head poses and facial expressions. The experiments demonstrate that the proposed method achieves a significant improvement on head pose estimation and facial landmark localization compared to existing methods.

      Acknowledgements

      We thank Luo Jiang and Boyi Jiang for their help in constructing the 3DFEP database.We thank the ETHZ-Computer Vision Lab for permission to use the BIWI Kinect Head Pose database and BIWI 3D Audiovisual Corpus of Affective Communication database.This work was supported by the National Key Technologies R&D Program of China(No. 2016YFC0800501), and the National Natural Science Foundation of China(No.61672481).

References

[1] Cao, C.; Weng, Y.; Lin, S.; Zhou, K. 3D shape regression for real-time facial animation. ACM Transactions on Graphics Vol. 32, No. 4, Article No. 41, 2013.

[2] Cao, C.; Hou, Q.; Zhou, K. Displaced dynamic expression regression for real-time facial tracking and animation. ACM Transactions on Graphics Vol. 33, No. 4, Article No. 43, 2014.

[3] Breitenstein, M. D.; Kuettel, D.; Weise, T.; van Gool, L.; Pfister, H. Real-time face pose estimation from single range images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–8, 2008.

[4] Meyer, G. P.; Gupta, S.; Frosio, I.; Reddy, D.; Kautz, J. Robust model-based 3D head pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, 3649–3657, 2015.

[5] Padeleris, P.; Zabulis, X.; Argyros, A. A. Head pose estimation on depth data based on particle swarm optimization. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 42–49, 2012.

[6] Seemann, E.; Nickel, K.; Stiefelhagen, R. Head pose estimation using stereo vision for human–robot interaction. In: Proceedings of the 6th IEEE International Conference on Automatic Face and Gesture Recognition, 626–631, 2004.

[7] Tulyakov, S.; Vieriu, R. L.; Semeniuta, S.; Sebe, N. Robust real-time extreme head pose estimation. In: Proceedings of the 22nd International Conference on Pattern Recognition, 2263–2268, 2014.

[8] Burgos-Artizzu, X. P.; Perona, P.; Dollár, P. Robust face landmark estimation under occlusion. In: Proceedings of the IEEE International Conference on Computer Vision, 1513–1520, 2013.

[9] Cao, X.; Wei, Y.; Wen, F.; Sun, J. Face alignment by explicit shape regression. International Journal of Computer Vision Vol. 107, No. 2, 177–190, 2014.

[10] Dantone, M.; Gall, J.; Fanelli, G.; van Gool, L. Real-time facial feature detection using conditional regression forests. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2578–2585, 2012.

[11] Zhang, Z.; Zhang, W.; Liu, J.; Tang, X. Multiview facial landmark localization in RGB-D images via hierarchical regression with binary patterns. IEEE Transactions on Circuits and Systems for Video Technology Vol. 24, No. 9, 1475–1485, 2014.

[12] Zhu, Z.; Martin, R. R.; Pepperell, R.; Burleigh, A. 3D modeling and motion parallax for improved videoconferencing. Computational Visual Media Vol. 2, No. 2, 131–142, 2016.

[13] Dollár, P.; Welinder, P.; Perona, P. Cascaded pose regression. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1078–1085, 2010.

[14] Sun, X.; Wei, Y.; Liang, S.; Tang, X.; Sun, J. Cascaded hand pose regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 824–832, 2015.

[15] Chen, D.; Ren, S.; Wei, Y.; Cao, X.; Sun, J. Joint cascade face detection and alignment. In: Computer Vision – ECCV 2014. Fleet, D.; Pajdla, T.; Schiele, B.; Tuytelaars, T. Eds. Springer International Publishing Switzerland, 109–122, 2014.

[16] Lee, D.; Park, H.; Yoo, C. D. Face alignment using cascade Gaussian process regression trees. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4204–4212, 2015.

[17] Tzimiropoulos, G. Project-out cascaded regression with an application to face alignment. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3659–3667, 2015.

[18] Ren, S.; Cao, X.; Wei, Y.; Sun, J. Face alignment at 3000 FPS via regressing local binary features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1685–1692, 2014.

[19] Baltrušaitis, T.; Robinson, P.; Morency, L. P. 3D constrained local model for rigid and non-rigid facial tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2610–2617, 2012.

[20] Tan, D. J.; Tombari, F.; Navab, N. A combined generalized and subject-specific 3D head pose estimation. In: Proceedings of the International Conference on 3D Vision, 500–508, 2015.

[21] Borovikov, E. Human head pose estimation by facial features location. arXiv preprint arXiv:1510.02774, 2015.

[22] Fanelli, G.; Dantone, M.; Gall, J.; Fossati, A.; van Gool, L. Random forests for real time 3D face analysis. International Journal of Computer Vision Vol. 101, No. 3, 437–458, 2013.

[23] Fanelli, G.; Gall, J.; van Gool, L. Real time head pose estimation with random regression forests. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 617–624, 2011.

[24] Fanelli, G.; Weise, T.; Gall, J.; van Gool, L. Real time head pose estimation from consumer depth cameras. In: Pattern Recognition. Mester, R.; Felsberg, M. Eds. Springer-Verlag Berlin Heidelberg, 101–110, 2011.

[25] Papazov, C.; Marks, T. K.; Jones, M. Real-time 3D head pose and facial landmark estimation from depth images using triangular surface patch features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4722–4730, 2015.

[26] Cootes, T. F.; Edwards, G. J.; Taylor, C. J. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 23, No. 6, 681–685, 2001.

[27] Cristinacce, D.; Cootes, T. Boosted regression active shape models. In: Proceedings of the British Machine Vision Conference, 79.1–79.10, 2007.

[28] Sauer, P.; Cootes, T.; Taylor, C. Accurate regression procedures for active appearance models. In: Proceedings of the British Machine Vision Conference, 30.1–30.11, 2011.

[29] Tzimiropoulos, G.; Pantic, M. Optimization problems for fast AAM fitting in-the-wild. In: Proceedings of the IEEE International Conference on Computer Vision, 593–600, 2013.

[30] Xiao, J.; Baker, S.; Matthews, I.; Kanade, T. Real-time combined 2D+3D active appearance models. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 535–542, 2004.

[31] Ruiz, M. C.; Illingworth, J. Automatic landmarking of faces in 3D – ALF3D. In: Proceedings of the 5th International Conference on Visual Information Engineering, 41–46, 2008.

[32] Gilani, S. Z.; Shafait, F.; Mian, A. Shape-based automatic detection of a large number of 3D facial landmarks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4639–4648, 2015.

[33] Jourabloo, A.; Liu, X. Pose-invariant 3D face alignment. In: Proceedings of the IEEE International Conference on Computer Vision, 3694–3702, 2015.

[34] Jourabloo, A.; Liu, X. Large-pose face alignment via CNN-based dense 3D model fitting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4188–4196, 2016.

[35] Breiman, L. Random forests. Machine Learning Vol. 45, No. 1, 5–32, 2001.

[36] Schulter, S.; Leistner, C.; Wohlhart, P.; Roth, P. M.; Bischof, H. Alternating regression forests for object detection and pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, 417–424, 2013.

[37] Schulter, S.; Wohlhart, P.; Leistner, C.; Saffari, A.; Roth, P. M.; Bischof, H. Alternating decision forests. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 508–515, 2013.

[38] Wan, C.; Yao, A.; van Gool, L. Direction matters: Hand pose estimation from local surface normals. arXiv preprint arXiv:1604.02657, 2016.

[39] Ren, S.; Cao, X.; Wei, Y.; Sun, J. Global refinement of random forest. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 723–730, 2015.

[40] Fanelli, G.; Dantone, M.; van Gool, L. Real time 3D face alignment with random forests-based active appearance models. In: Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, 1–8, 2013.

Jie Wang is currently an M.S. student in the School of Computer Science at the University of Science and Technology of China. Her research interests are in computer vision and machine learning.

Juyong Zhang is an associate professor in the Department of Mathematics at the University of Science and Technology of China. He received his Ph.D. degree from the School of Computer Science and Engineering at Nanyang Technological University, Singapore. Before that, he received his B.S. degree in computer science and engineering from the University of Science and Technology of China in 2006. His research interests fall into the areas of computer graphics, computer vision, and machine learning.

Changwei Luo is a research assistant in the Department of Automation at the University of Science and Technology of China. His research interests cover computer vision and human–computer interaction.

Falai Chen is a professor in the Department of Mathematics at the University of Science and Technology of China. He received his B.S., M.S., and Ph.D. degrees from the University of Science and Technology of China in 1987, 1989, and 1994, respectively. His current research interests are in computer aided geometric design and geometric modeling, specifically, algebraic methods in geometric modeling, splines over T-meshes, and their applications to isogeometric analysis.

Open Access The articles published in this journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.

1 University of Science and Technology of China, Hefei, Anhui, 230026, China. E-mail: juyong@ustc.edu.cn ().

Manuscript received: 2017-01-16; accepted: 2017-03-08
