Zhen Wei, Yao Sun, Junyu Lin, and Si Liu
Abstract  In this paper, we introduce a novel approach to automatically regulate receptive fields in deep image parsing networks. Unlike previous work, which placed much importance on obtaining better receptive fields using manually selected dilated convolutional kernels, our approach uses two affine transformation layers in the network's backbone and operates on feature maps. Feature maps are inflated or shrunk by the new layer, thereby changing the receptive fields in the following layers. By use of end-to-end training, the whole framework is data-driven, without laborious manual intervention. The proposed method is generic across datasets and different tasks. We have conducted extensive experiments on both general image parsing tasks and face parsing tasks as concrete examples, to demonstrate the method's superior regulation ability compared to manual designs.
Keywords  semantic segmentation; receptive field; data-driven; face parsing
1 Introduction

In deep neural networks, the notion of a receptive field refers to the region of input data that is path-connected to a neuron [1]. After the introduction of fully convolutional networks (FCNs) [2], receptive fields have become especially important for deep image parsing networks; they can significantly affect a network's performance. As discussed in Ref. [3], a small receptive field may lead to inconsistent parsing results for large objects, while a large receptive field may ignore small objects and classify them as background. Even if such extreme problems do not arise, unsuitable receptive fields can still impair performance.
Recent works such as Refs. [4, 5] have already discussed adapting network structures to use different receptive fields. Dilated convolutional kernels are often used for this purpose: the kernels' receptive field size can be controlled by appropriate choice of dilation values (typically integers). However, this approach has several drawbacks. Firstly, dilation values are treated as hyper-parameters in network design; they are selected based on the designer's observations or on a series of trials on a certain dataset, which is laborious and time-consuming. Secondly, such choices are not generic across different image parsing tasks, or even across datasets for the same task; during network transfer, the selection procedure must be performed again. Thirdly, dilated convolutional kernels only provide discrete receptive field sizes. When a dilation value is incremented, the corresponding receptive field (e.g., of the fc6 layer in VGG [6]) may expand by tens or even hundreds of pixels, making it hard to control the receptive field accurately.
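To make the third point concrete, the following short calculation (our own illustration, not from the paper) applies the standard receptive field formula to a backbone with overall stride 8 and a pool5 receptive field of 212 pixels (the values used in Section 3), assuming a 3×3 fc6 kernel:

```python
# Receptive field of the fc6 layer for a 3x3 kernel (assumed size) with
# dilation d, on a backbone with overall stride 8 and a pool5 receptive
# field of 212 pixels (values stated in Section 3).
def fc6_receptive_field(d, base_kernel=3, pool5_rf=212, stride=8):
    k_eff = 1 + (base_kernel - 1) * d       # effective dilated kernel extent
    return pool5_rf + stride * (k_eff - 1)

for d in (2, 4, 6, 8, 10, 12):
    print(d, fc6_receptive_field(d))        # 244, 276, 308, 340, 372, 404
```

Each unit increment of the dilation grows this receptive field by 16 pixels, so intermediate sizes (e.g., 292, discussed in Section 4.3.1) are unreachable with integer dilations.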
The contribution of this paper is a learning-based, data-driven method for automatically regulating receptive fields in deep image parsing networks. The main idea is to introduce a novel affine transformation layer, the inflation layer, before the convolutional layer whose receptive field is to be regulated. This inflation layer uses interpolation algorithms to enlarge or shrink feature maps. The following layers perform inference on these inflated features, thus changing the receptive fields after the inflation layer. The inference results (before softmax normalization) are then resized to a fixed size by an interpolation layer. During training, the inflation factor f, embedded in both the inflation layer and the interpolation layer, has computable derivatives and is trained end-to-end together with the network backbone. As f may be a real number, the inflation layer can produce a more fine-grained receptive field, and it needs to be trained only once.
To corroborate the method's effectiveness, we have conducted experiments on both general image parsing and face parsing tasks. With proper initialization, the proposed method can achieve comparable, or even superior, results compared to the best manually selected dilated convolutions. In particular, due to the strong regulation ability brought by our method, the improved model achieves state-of-the-art face parsing accuracy on the Helen dataset [7, 8].
The rest of this paper is organized as follows. In Section 2, we review related work on image parsing, especially focusing on issues relevant to receptive fields. Section 3 provides details of the new affine transformation layers and the derivatives of the inflation factor f. Section 4 describes the experimental settings, while Section 5 discusses our experimental results. Section 6 concludes the paper.
This paper extends our former conference publication [9]. Additional content here mainly includes: (i) a more elaborate discussion of several issues arising during optimization (in Section 5), (ii) detailed network settings used in the experiments (in Table 1), and (iii) further qualitative and quantitative results (in Tables 2–5 and Figs. 2 and 3).
2 Related work

This section provides a brief review and discussion of related work.
The introduction of FCNs [2] emphasised the importance of receptive fields. The forward process by which an FCN generates dense classification results is equivalent to a series of inferences using sliding windows on the input image. With fixed-stride sliding, inference at the pixel level is based solely on the data inside the window; the window is, in this case, the receptive field of the network. In Ref. [2], the authors discuss dilated convolution, but do not make use of it in their network. DeepLab [4] uses dilated convolutions to reduce pooling strides while expanding receptive fields and reducing the number of parameters in the fc6 layer. In Ref. [5], the authors append a series of dilated convolutional layers after an FCN backbone (the frontend) to expand the receptive field. Recently, in DeepLab v2 [10], the authors manually designed four different dilated convolutions which are used in parallel to achieve multi-scale parsing.
However, these dilation designs are all based on trials or the designers' observations of the dataset. This is not difficult, but it is nevertheless laborious and time-consuming. This paper offers the first way to replace such a process with an automatic method.
Adding input variability can also be used to provide dynamic receptive fields for a network. Zoomout [11] uses four inputs at different scales during inference to capture both contextual and local information. DeconvNet [3] applies prepared detection bounding boxes and crops out object instances; inference is conducted on both these sub-images and the whole image.
Such approaches require complex pre- and post-processing. Furthermore, they are computationally expensive, as tens or even hundreds of forward propagations may be needed for each input image.
Affine transformations are commonly seen in deep networks. The spatial transformer network (STN) [12] for character recognition uses a side branch to regress a set of affine parameters and applies the corresponding transformation to feature maps. In Ref. [13], the network predicts facial landmarks in both the original image and the transformed sub-images; affine transformation parameters are then obtained by projecting between these two sets of landmarks.
Our method intrinsically differs from such related work. Taking STN as an example:
• Affine transformation is only a tool used by STN to solve various problems; in particular, STN uses affine transformation to correct spatial variability of input data for recognition. Our method instead regulates the receptive field of the parsing network.
• The different aims result in different network structures. The affine parameters used in STN are data-dependent, as each input is different. The parameter f in our method is embedded and knowledge-dependent (obtained by training): the receptive field should be stable during inference. Our work focuses on replacing the manual receptive field selection process; studies on the use of dynamic receptive fields are not considered here.
• As the receptive field depends only on size, rotation functionality is discarded in this work, unlike in other work.
In Ref. [14], deformable convolutions are used to reformulate the sampling process in convolutions in a learning-based approach. Deformable convolutions can also be regarded as a way of reallocating convolutional weights: if nearby weights in lower layers are increased, the receptive fields of the corresponding weights in higher layers become smaller, and vice versa.
3 Method

In this section, we provide details of our method, including the modified network structure, the implementation of the inflation and interpolation layers, and the loss guidance for our multi-path network. These allow us to realize multi-scale inference with our data-driven method.
We use both single-path and multi-path structures. Almost all state-of-the-art deep image parsing networks are either single-path [2, 4, 5, 15] or multi-path [10], so we use these two structures to show that our method is effective and compatible with such state-of-the-art methods.
Figure 1 presents the details of our framework. The specific settings for the network backbone are provided in Table 1. Using dilated convolutions, the pooling strides in pool4 and pool5 are removed. The extent of the receptive field for the fc6 layer is 212×212. Note that we still use dilated convolutions in the fc6 layer to generate different initial receptive fields.
In the single-path network, the inflation layer and the interpolation layer are inserted before layer fc6 and after layer fc8 respectively. The receptive field is regulated by operating on the pool5 features. To reduce feature variability and increase robustness during optimization, we add a batch normalization (BN) [16] layer before the inflation layer.
Fig. 1  Framework. (a) Modified single-path network: new layers are inserted before layer fc6 and after layer fc8. (b) Modified multi-path network, in which all branches have the same structure and initialization; weighted gradient layers are used to break symmetry during training. The specific settings for the single-path network are given in Table 1.
Table 1  Network structures used, including the network backbone, the single-path baseline model, and the single-path modified model
In the multi-path version, the layers from the BN layer to the interpolation layer are duplicated and followed by a summation operation for feature fusion. Each duplicate is initialized in the same way. In order to break this symmetry and achieve discriminative multi-scale inference, a loss guidance layer is added to force each duplicate to focus on a different scale. These issues are explained in detail in the following subsections.
The affine transformation layers include the inflation layer and the interpolation layer.
The inflation layer learns a parameter f, the inflation factor. The feature map is enlarged by the factor f before the following convolution operations. Unlike other deep networks with affine operations [12, 13], regulating receptive fields does not require cropping or rotation, so only one parameter is needed in the inflation layer.
There are two steps in the inflation operation: coordinate transformation and sampling. To formulate the first step, let $(x^{\mathrm{s}}, y^{\mathrm{s}})$ and $(x^{\mathrm{t}}, y^{\mathrm{t}})$ be coordinates in the source feature map (input) and target feature map (output) respectively. The inflation process performs element-wise coordinate projection using:
$$(x^{\mathrm{t}}, y^{\mathrm{t}}) = f\,(x^{\mathrm{s}}, y^{\mathrm{s}}) \tag{1}$$
The size of the feature map changes accordingly:
$$H^{\mathrm{t}} = f H^{\mathrm{s}}, \qquad W^{\mathrm{t}} = f W^{\mathrm{s}} \tag{2}$$
where $H$ and $W$ are the height and width of the feature maps, and superscripts s and t denote "source" and "target" respectively.
In the second step, we use a sampling kernel $k(\cdot)$ to assign pixel values in the target feature map. A target pixel value is denoted by $V_i^c$, where $i$ is the pixel index and $c$ is the channel index. Let $U_{nm}^c$ be the pixel value at location $(n, m)$ in the source feature map. Then we have
$$V_i^c = \sum_{n=1}^{H^{\mathrm{s}}} \sum_{m=1}^{W^{\mathrm{s}}} U_{nm}^c\, k(x_i^{\mathrm{t}}, f, m)\, k(y_i^{\mathrm{t}}, f, n) \tag{3}$$
This operation is identical for each input channel. The sampling kernel $k(\cdot)$ can be any differentiable image interpolation kernel. Here we use the bilinear kernel $k(x, f, m) = \max(0,\, 1 - |x/f - m|)$, giving
$$V_i^c = \sum_{n=1}^{H^{\mathrm{s}}} \sum_{m=1}^{W^{\mathrm{s}}} U_{nm}^c \max(0,\, 1 - |x_i^{\mathrm{t}}/f - m|)\, \max(0,\, 1 - |y_i^{\mathrm{t}}/f - n|) \tag{4}$$
Differentiating with respect to $f$ gives
$$\frac{\partial V_i^c}{\partial f} = \sum_{n=1}^{H^{\mathrm{s}}} \sum_{m=1}^{W^{\mathrm{s}}} U_{nm}^c \left[ \frac{\partial k(x_i^{\mathrm{t}}, f, m)}{\partial f}\, k(y_i^{\mathrm{t}}, f, n) + k(x_i^{\mathrm{t}}, f, m)\, \frac{\partial k(y_i^{\mathrm{t}}, f, n)}{\partial f} \right] \tag{5}$$
where
$$\frac{\partial k(x, f, m)}{\partial f} = \begin{cases} 0, & |x/f - m| \geqslant 1 \\ x/f^{2}, & m < x/f \\ -x/f^{2}, & m > x/f \end{cases} \tag{6}$$
Using the chain rule, the gradient from the inflation layer, $G_{\mathrm{inf}}$, is
$$G_{\mathrm{inf}} = \frac{\partial \mathrm{Loss}}{\partial f} = \sum_i \sum_c \frac{\partial \mathrm{Loss}}{\partial V_i^c}\, \frac{\partial V_i^c}{\partial f} \tag{7}$$
Additionally, we normalize $G_{\mathrm{inf}}$ by dividing it by $H^{\mathrm{t}} W^{\mathrm{t}}$, the number of pixels in a target feature map.
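For concreteness, the following is a minimal PyTorch sketch of such an inflation layer (our own illustration, not the authors' code; the paper does not specify a framework, and all names here are ours). It realizes Eqs. (1)–(4) with `grid_sample`, whose gradient with respect to the sampling grid provides Eqs. (5)–(7) automatically:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InflationLayer(nn.Module):
    """Sketch of the inflation layer: enlarges (f > 1) or shrinks (f < 1)
    a feature map by a learnable scalar factor f, using the bilinear
    sampling of Eqs. (3) and (4) so that gradients flow back to f."""
    def __init__(self, f_init=1.0, f_min=0.25, f_max=4.0):
        super().__init__()
        self.f = nn.Parameter(torch.tensor(float(f_init)))
        self.f_min, self.f_max = f_min, f_max

    def forward(self, x):
        # Restrict f as in Section 4.2 to avoid numerical problems
        # and excessive memory usage.
        f = self.f.clamp(self.f_min, self.f_max)
        n, _, hs, ws = x.shape
        ht = max(2, int(round(hs * f.item())))   # target size, Eq. (2)
        wt = max(2, int(round(ws * f.item())))
        # Source sampling positions for each target pixel (Eq. 1):
        # x^s = x^t / f. Keeping f inside the graph makes the grid, and
        # hence the output, differentiable w.r.t. f (Eqs. 5-7).
        ys = torch.arange(ht, device=x.device, dtype=x.dtype) / f
        xs = torch.arange(wt, device=x.device, dtype=x.dtype) / f
        # Convert absolute source coordinates to grid_sample's [-1, 1] range.
        gy = ys / (hs - 1) * 2 - 1
        gx = xs / (ws - 1) * 2 - 1
        gyy, gxx = torch.meshgrid(gy, gx, indexing="ij")
        grid = torch.stack((gxx, gyy), dim=-1).unsqueeze(0).expand(n, ht, wt, 2)
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)
```

Note that the target size is rounded to an integer, so the gradient with respect to f flows only through the sampling positions, matching Eq. (6).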
The interpolation layer has almost the opposite functionality: in this layer, feature maps are resized back to a fixed size. The resizing factor $f'$ used in the interpolation layer is
$$f' = F / f \tag{8}$$
where $F$ is a constant determined by the desired output size. In our implementation, $F$ is 8.11, so as to resize the final result to be as large as the label map or input image.
The interpolation layer provides a further contribution, $G_{\mathrm{itp}}$, to the inflation factor's gradient:
$$G_{\mathrm{itp}} = \frac{\partial \mathrm{Loss}}{\partial f'}\, \frac{\partial f'}{\partial f} = -\frac{F}{f^{2}}\, \frac{\partial \mathrm{Loss}}{\partial f'} \tag{9}$$
where $\partial \mathrm{Loss}/\partial f'$ has exactly the same form as in Eq. (7). In practice, we simply add these two gradients together to update the inflation factor $f$:
$$\frac{\partial \mathrm{Loss}}{\partial f} = G_{\mathrm{inf}} + G_{\mathrm{itp}} \tag{10}$$
When considering the specific layers in our network, we obtain:
$$\frac{\partial \mathrm{Loss}}{\partial f} = \frac{G_{\mathrm{inf}}}{C H_{\mathrm{bn}} W_{\mathrm{bn}}} + \frac{G_{\mathrm{itp}}}{H_{\mathrm{img}} W_{\mathrm{img}}} \tag{11}$$
where $C$ is the number of channels in the BN layer, and the subscripts bn and img refer to the BN layer and the input image respectively.
In this way, the inflation factor can be learned during end-to-end training.
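Continuing the sketch above, the modified single-path head of Fig. 1(a) might be wired as follows (channel widths, class count, and output size are our assumptions, not the exact settings of Table 1):

```python
class SinglePathHead(nn.Module):
    """Sketch of Fig. 1(a): BN -> inflation -> fc6/fc7/fc8 -> interpolation
    back to a fixed output size. Shapes are illustrative assumptions."""
    def __init__(self, in_ch=512, num_classes=21, out_size=500,
                 dilation=8, f_init=1.0):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_ch)
        self.inflate = InflationLayer(f_init)
        # fc6 keeps a dilated kernel to set the *initial* receptive field.
        self.fc6 = nn.Conv2d(in_ch, 1024, 3, padding=dilation, dilation=dilation)
        self.fc7 = nn.Conv2d(1024, 1024, 1)
        self.fc8 = nn.Conv2d(1024, num_classes, 1)
        self.out_size = out_size

    def forward(self, pool5):
        x = self.inflate(self.bn(pool5))     # regulate the receptive field
        x = torch.relu(self.fc7(torch.relu(self.fc6(x))))
        x = self.fc8(x)                      # scores, before softmax
        # Interpolation layer: resize back to a fixed size (f' = F / f).
        # Note: resizing by target size here does not back-propagate the
        # G_itp term of Eq. (9); the paper adds that contribution as well.
        return F.interpolate(x, size=(self.out_size, self.out_size),
                             mode="bilinear", align_corners=True)
```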
To calculate the extents of the new receptive fields, we can transform the question into one of obtaining an equivalent kernel size for the fc6 layer while leaving the feature maps unchanged. Denoting the original kernel extent by k, Eq. (2) gives the new equivalent extent k′ = (k − 1)/f + 1. Thus the extent of the new receptive field is 212 + 8×(k′ − 1), where 212 is the receptive field of the pool5 layer, and 8 is the overall stride from the conv1_1 layer to the pool5 layer in the network backbone.
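As a worked example of this formula (our own arithmetic), a 3×3 fc6 kernel with dilation 12 has extent k = 25, and a learned inflation factor then yields receptive field sizes that integer dilation cannot produce:

```python
def new_receptive_field(k, f, pool5_rf=212, stride=8):
    """Receptive field of fc6 after inflating pool5 features by factor f."""
    k_eq = (k - 1) / f + 1                  # equivalent kernel extent
    return pool5_rf + stride * (k_eq - 1)

print(new_receptive_field(k=25, f=1.0))    # 404.0, the un-inflated case
print(new_receptive_field(k=25, f=2.4))    # ~292, unreachable by integer dilation
```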
Deep networks with multi-scale receptive fields have brought performance improvements in image parsing tasks [10]. Such networks usually use several slightly different parallel paths to achieve multiple receptive fields. Our method can also be used in similar structures to realize further improvements, taking the place of hand-crafted dilated convolutional kernels.
To achieve this, as shown in Fig. 1(b), layers fc6, fc7, and fc8 are first copied in parallel. The outputs of the fc8 layers are fused by summation. Then, inflation and interpolation layers are inserted before each fc6 layer and after each fc8 layer. A shared BN layer is appended after pool5.
However, this framework is symmetric and is unsuited to learning discriminative features. To break this symmetry, a weighted gradient layer is added behind each interpolation layer during training. Following the class-rebalancing strategy in Ref. [17] and the use of weighted loss in Ref. [18], the weighted gradient layer weights the gradient value $G_i^c$ if the ground truth label $l_i$ of the corresponding pixel (the $i$th pixel in the $c$th channel) is in a given label set $S$. The weight $w$ is usually greater than 1. Thus
$$\tilde{G}_i^c = \begin{cases} w\, G_i^c, & l_i \in S \\ G_i^c, & \text{otherwise} \end{cases} \tag{12}$$
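A minimal sketch of this loss guidance, reusing the imports above (scaling the per-pixel loss by w is equivalent to scaling its gradient; the label set assigned to each path is our own placeholder):

```python
def guided_loss(logits, labels, label_set, w=1.2):
    """Cross-entropy whose gradients are amplified by w on pixels whose
    ground-truth label falls in label_set (one set per parallel path)."""
    per_pixel = F.cross_entropy(logits, labels, reduction="none")  # (N, H, W)
    weights = torch.ones_like(per_pixel)
    for lbl in label_set:
        weights[labels == lbl] = w
    return (per_pixel * weights).mean()
```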
4 Experiments

We conducted experiments to show the superiority of our method in its ability to select a finer receptive field. The experiments consisted of three parts:
• We first reproduced the receptive field search process by using dilated convolutional kernels, finding the optimal receptive field manually.
• Leaving the network backbone intact, the single-path network was modified by inserting the new affine transformation layers. The inflation factor was learned with different initial dilation values.
• We used the best two and three receptive field settings according to the results of the first experiment to build a bi-path network and a tri-path network as baseline models. For the modified models, parallel paths were constructed with the same structure. By deploying loss guidance, each parallel path learned a discriminative inflation factor and features.
The results demonstrate the effectiveness of the proposed method in learning and obtaining better receptive fields with little manual intervention.
4.1 Datasets

The Helen dataset [7, 8] was used for the face parsing task. It contains 2330 facial images with 11 manually labelled facial components, including eyes, eyebrows, nose, lips, and mouth. The hair region is also annotated, but not accurately enough for comparison. We adopted the same dataset division as in Refs. [18, 19], using 100 images for testing.
All images were aligned using similar steps to those in Ref. [19]. We used the method of Ref. [20] to generate facial landmarks and align each image to a canonical position. After alignment, each image was cropped or padded and then resized to 500×500 pixels.
The augmented PASCAL VOC 2012 segmentation dataset was used for the general image parsing task. It is based on the PASCAL VOC 2012 segmentation benchmark [21] with extra annotations provided by Ref. [22]. It has 12,031 images for training and 1449 images for validation, covering 20 foreground object classes and one background class.
4.2 Implementation details

The structures of the models modified by our method are shown in Table 1 and Fig. 1.
In the face parsing task, we trained each model with mini-batch gradient descent. The momentum, weight decay, and batch size were set to 0.9, 0.0005, and 2 respectively. The base learning rate was 1e−7, while the softmax loss was normalized by batch size. Training was run for up to 55,000 iterations, and stopped after 50,000 iterations.
The batch normalization layer used default settings. Inflation factors were initialized to 1 and their learning rates were the base learning rate multiplied by a weight ranging from 3×10^4 to 9×10^4. No weight decay was applied to inflation factors during training. Inflation factors were restricted to the range [0.25, 4] in order to avoid numerical problems or excessive memory usage.
In the general image parsing task, we realized the single-path version with a batch size of 20 and a learning rate multiplier for f of 3×10^5. A total of 9600 iterations were used, with 3 learning-rate steps. The great data variability in the VOC dataset, as well as the data shuffling and random cropping strategies, posed significant obstacles to optimizing f. To increase robustness, the following strategies were used during training: (a) clipping exceptional ∂Loss/∂f values; (b) when updating f, gradients from background areas were masked by multiplying them by a weight less than 1, preventing them from becoming dominant; (c) the original learning-rate step using a γ value of 0.1 was replaced by two smaller steps, 200 iterations apart, with γ values of 0.32.
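Strategies (a) and (b) can be illustrated as follows (a sketch under the PyTorch assumptions above; bounds and weights are placeholders, and in the paper the masking applies only to the update of f, which would require a separate backward pass to reproduce exactly):

```python
# (a) Clip exceptional gradients on the inflation factor.
inflation = InflationLayer(f_init=1.0)
inflation.f.register_hook(lambda g: g.clamp(-1e-3, 1e-3))  # placeholder bound

# (b) Down-weight background pixels (label 0) so that their gradients do
# not dominate; cf. the guided loss above, here with a weight below 1.
def masked_loss(logits, labels, bg_weight=0.3):
    per_pixel = F.cross_entropy(logits, labels, reduction="none")
    weights = torch.where(labels == 0,
                          torch.full_like(per_pixel, bg_weight),
                          torch.ones_like(per_pixel))
    return (per_pixel * weights).mean()
```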
4.3 Results

4.3.1 Single-path models
In the face parsing task, we quantitatively evaluated our model and compared it with the baseline models using F-measures: see Table 2. First, we manually determined the best receptive field using dilated convolutional kernels based on the baseline models, trying a series of dilation values for each model and selecting the one providing the highest F-score as the optimal manually designed model.
Next, the other, unselected networks were modified using our proposed method, with their receptive fields used for initialization. The results in Table 2 show that almost all modified models (except for a dilation value of 2, which is further discussed in Section 5.1) show improvement, providing results comparable to those from the optimal manually designed model. The new receptive fields, e.g., of size 292, are more fine-grained and cannot be obtained using the dilation approach. Their results equal, or even surpass, those of the best manually designed models.
Table 2  Quantitative evaluation of baseline models and modified models on the Helen dataset. Key: dilation: dilation value in the fc6 layer; rf-fc6: extent of the receptive field of the fc6 layer; *: the inflation factor only starts being updated after 10,000 training iterations
Qualitative comparisons for the face parsing task are provided in Fig. 2. The results in Figs. 2(d) and 2(e) show the improvements brought by our method: smaller semantic areas, especially the eyebrows and nose, are parsed better, and face boundaries are smoother and more accurate. The results in Figs. 2(c) and 2(d) show that the proposed models provide results comparable to manually designed models: our method can replace previous manual receptive field selection processes.
For the general image parsing task, a similar process was used. Evaluation was conducted on the VOC validation set using the mean IOU metric (average Jaccard index).
Table 3 provides quantitative results. Modified models with initial dilation values of 16, 18, and 20 show noticeably improved results that are comparable with those of the best manually designed models, with receptive fields adjusted into an optimal range. Note that with the current network backbone, dilated convolutional kernels cannot generate a receptive field of size 396, showing that the proposed method can generate receptive fields with a finer granularity.
Choosing different dilation values when initializing the modified models helps to evaluate the potential of the proposed method. Modified models with small initial dilation values have improved parsing accuracy but still perform worse than the best manually designed one, mainly due to the shrinkage of features and the resulting information loss. On the other hand, models with large initial dilation values perform better than the optimal baseline model. The reasons may vary, but one possibility is that the modified models learn from data at dynamic scales while f is changing, with effects similar to those of data augmentation. These phenomena are further discussed in Section 5.1.
Qualitative comparisons for the general image parsing task are provided in Fig. 3. The results in (d) and (e), and in (f) and (g), show the improvements brought by our method. With finer receptive fields, results from the modified models are generally more consistent; results in (d) have clearer shapes and boundaries than those in (e). Results in (c), (f), and (g) show that if an unsuitable initial receptive field is used, the modified models improve but are still not comparable to the best manually designed models. Results in (c) and (d) show that, if the initial receptive field is appropriately set, our models provide results very close to those of manually designed models: our method can replace previous manual methods of receptive field design.
Table 3  Quantitative evaluation of baseline models and modified models on the PASCAL VOC 2012 validation set
Fig. 2  Face parsing results on the Helen dataset. (a) Original images. (b) Ground truth. (c) Baseline model with dilation value of 4 (the best manually selected receptive field). (d) Modified model with initial dilation value of 12. (e) Baseline model with dilation value of 12. (d) and (e) show the improvements brought by our method: smaller semantic areas, especially the eyebrows and nose, are parsed better, and face boundaries are smoother and more accurate. (c) and (d) show that our models have very similar ability to manually designed models: our method can replace manual receptive field design processes.
These results demonstrate that, with proper initial settings, the proposed method helps deep image parsing networks find better receptive fields automatically, providing results equivalent to, or better than, those of the best manually designed models.
4.3.2 Multi-path models
A bi-path network and a tri-path network were built for the face parsing experiment. As baseline models, the dilated convolutional kernels with the best accuracy were selected: kernels with dilation values of 4 (best overall results, with the highest eye F-score) and 6 (highest nose and mouth F-scores) for the bi-path network, and dilation values of 4, 6, and 8 (providing the highest face F-score) for the tri-path network.
Fig. 3  General image parsing results on the PASCAL VOC 2012 validation set. (a) Original images. (b) Ground truth. (c) Baseline model with dilation value of 12 (the best manually selected receptive field). (d) Modified model with initial dilation value of 20. (e) Baseline model with dilation value of 20. (f) Modified model with initial dilation value of 4. (g) Baseline model with dilation value of 4. Results in (d) and (e), and in (f) and (g), show the improvements brought by our method. With finer receptive fields, results from the modified models are generally more consistent; results in (d) have clearer shapes and boundaries than those in (e). Results in (c), (f), and (g) show that with poor initial receptive fields, modified models still improve but are not as good as the best manually designed models. Results in (c) and (d) show that, if the initial receptive field is properly set, our model has comparable performance to the manually designed model: our method can replace previous receptive field design processes.
For comparison, the parallel paths in both the modified bi-path and tri-path networks were symmetric, using an initial dilation value of 8. The weight w used in the weighted gradient layer was 1.2.
The results in Table 4 show that the proposed method is able to obtain better receptive fields for each parallel path, providing superior results to the manually designed network. We observe that the loss guidance manages to break the symmetry of the network structure and learn discriminative features.
Table 4  Quantitative evaluation of multi-path versions of baseline models and modified models on the Helen dataset [7, 8]. Each parallel path in the modified networks was initialized with a dilation value of 8
4.3.3 Comparison with previous face parsing methods
Table 5 shows a quantitative comparison for face parsing between our method and other state-of-the-art methods; we use the reported results from Refs. [8, 19, 23]. Our method used a single-path network with an initial dilation value of 8. Even without CRF or RNN post-processing, our method still achieves the highest accuracy.
5 Discussion

5.1 Effect of initialization

Although our method has a strong ability to regulate receptive fields, suitable initial dilation values must be chosen to get the best results. Figures 4 and 5 show typical fluctuations in f during training for both tasks.
With an initial receptive field much smaller than the desired one, f is hard to optimize, as the network attempts to keep it larger than 1 (see "dilation 2" in Fig. 4). The shrinkage of features results in information loss, impairing parsing performance. In the face parsing task, even with additional strategies, e.g., only beginning to update f after 10k iterations (see "dilation 2 after 10k" in Fig. 4), f decreases but does not reach the expected value. Consequently, modified models with small initial receptive fields provide improved results, but these are still not comparable to those from the best manually designed models. In the general image parsing task, models with small initial dilation values are sometimes trapped in local minima where f fluctuates around a value larger than 1 (see Fig. 6). On the other hand, using extremely large initial dilations requires more extensive learning of f, leading to unaffordable memory loads and time costs, as the feature maps become correspondingly much larger. In summary, our suggestion is to use moderately large, but not excessively large, dilation values for initialization.
Unlike in the face parsing task, in which images are coarsely aligned and semantic constituents from different images are of similar size (e.g., eyes, lips), object sizes in general datasets vary much more, making optimizing f rather more difficult. Even with proper initialization and identical network settings, while f stays within a certain range, it does not converge to a specific value (see Fig. 7). The results shown in Table 3 are typical examples.
6 Conclusions

In this paper, we have introduced a new automatic regulation method for receptive fields in deep image parsing networks. This data-driven approach is able to replace existing hand-crafted receptive field selection methods. It enables deep image parsing networks to obtain better receptive fields with finer granularity in a single training process. Experimental results on the Helen and PASCAL VOC 2012 datasets demonstrate the effectiveness of our method in comparison to existing methods.
Table 5  Quantitative comparison between our method and other state-of-the-art models on the face parsing task. Our method performs best
Fig. 5  Typical fluctuations in f during training for the general image parsing task. Modified models used initial dilation values of: (a) 4, (b) 6, (c) 16, (d) 18, (e) 20. Unlike in training for the face parsing task, f shows more noticeable fluctuation due to the high data variability of the VOC dataset.
Fig. 6  Fluctuation of f during training in the general image parsing task, using the same initial network settings. Only changes during the first 2500 iterations are plotted. The initial dilation value was 4, much smaller than the optimal value. In this case, f may become trapped in local minima and stay near 1. Small initial dilation values are to be avoided.
Fig. 7  Fluctuation of f during training for the general image parsing task, with identical initial network settings. Only changes during the first 3000 iterations are plotted. The initial dilation value was 18. Due to the great variability during optimization, f reaches a range of values, instead of stopping at a specific number.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. U1536203, 61572493), the Cutting Edge Technology Research Program of the Institute of Information Engineering, CAS (No. Y7Z0241102), the Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of the Ministry of Education (No. Y6Z0021102), and Nanjing University of Science and Technology (No. JYB201702).