Liu Sen, Zhang Zhizheng, Yu Tao, Chen Zhibo
CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application System, University of Science and Technology of China, Hefei 230027, China
Abstract: Video inpainting aims to fill holes across different frames given limited spatio-temporal context. Existing schemes still struggle to achieve precise spatio-temporal coherence, especially in hole areas, due to inaccurate modeling of motion trajectories. In this paper, we introduce the flexible shape-adaptive mesh as the basic processing unit and mesh flow as the motion representation, which can describe complex motions in hole areas more precisely and efficiently. We propose a Mesh Oriented Video Inpainting nEtwork, dubbed MOVIE, to estimate mesh flows and then complete the hole regions in the video. Specifically, we first design a mesh flow estimation module and a mesh flow completion module that estimate the mesh flow for visible contents and holes in a sequential way, which decouples the mesh flow estimation for visible and corrupted contents for easier optimization. A hybrid loss function is further introduced to optimize the flow estimation performance for the visible regions, the entire frames and the inpainted regions respectively. We then design a polishing network to correct the distortion of the inpainted results caused by the mesh flow transformation. Extensive experiments show that MOVIE not only achieves an over four-times speed-up in completing the missing area, but also yields more promising results with much better inpainting quality in both quantitative and perceptual metrics.
Keywords: mesh flow; deep neural networks; video inpainting
Video inpainting is of high importance for many professional video post-production applications, including video editing, repair of scratched or damaged videos, logo or watermark removal in broadcast videos, etc[1-4]. The goal of video inpainting is to fill the missing regions of a given video sequence with spatially and temporally consistent results. More challenging than image inpainting, where only spatial consistency needs to be considered, achieving such consistency requires us not only to exploit spatial contexts but also to exploit the contents of nearby frames. To this end, solving the temporal misalignment problem plays a dominant role.
For video data, temporal motions are complex due to local human or object motions, global camera motions, and other environmental dynamics. Previous works targeting video tasks such as video super-resolution and video stabilization[5-8] achieve promising results by explicitly taking advantage of motion information, including optical flow, motion vectors, homography, etc. Despite this, aligning features of adjacent frames in video inpainting is more challenging than in other video tasks due to the absence of pixels in the hole areas: the missing contents introduce noise and make the motion information unreliable.
For the task of video inpainting, several temporal alignment solutions have been investigated, including patch matching methods, motion-based methods, 3D convolutional neural networks and attention-based neural networks. ① Patch matching methods select the most similar patch from adjacent frames for temporal coherence[9-12]. These methods may give rise to block artifacts in scenarios with complex textures. ② Motion-based methods estimate motion information, including optical flow and homography, for copying corresponding contents from nearby frames. Some works compute optical flows between two adjacent frames directly, which fails to provide precise flow prediction in the absence of pixel information[13-15]. A state-of-the-art optical flow-based work uses a sequence of flow maps from consecutive frames to complete the target optical flow[16]. However, the dense computation of pixel-level flow is very time-consuming, so efficiency remains under-explored. Homography-based methods use global affine transformation matrices to align two frames[13,17]. They only work well for planar motions or motions caused by camera rotations. ③ 3D convolutional neural networks use 3D filters to convolve the features from reference frames to the target frame[18-20]. They have a limited window size and suffer from high computation cost. ④ Attention-based neural networks compute similarities between the hole boundary pixels in the target and the non-hole pixels in the references[21] via an attention module. They are unstable because the context information of the hole boundary is insufficient for similarity computation. So far, an efficient and accurate temporal alignment solution for video inpainting remains under-investigated.
To address the temporal alignment problem, we propose to introduce the flexible shape-adaptive mesh as the basic processing unit and mesh flow as the motion representation. We argue that, for videos with holes, mesh flow is more efficient and effective than other motion representations. First, mesh flow can represent more complex and coherent motions than homography and motion vectors, because it describes the motion trajectory of each pixel with a multi-parameter model. Second, mesh flow can represent more accurate and robust motions in the hole area than optical flow, because it introduces richer spatial context information. Third, mesh flow is more computationally efficient than optical flow, because it is a sparse motion field. A more detailed theoretical analysis is given in Section 3.1.
To take advantage of mesh flow for more effective and efficient video inpainting, we propose a Mesh Oriented Video Inpainting nEtwork, called MOVIE, which consists of a sequential mesh flow estimation network and a polishing network. Since computing the mesh flow directly from the frames is unreliable and easily causes misalignments due to the holes in the video, we design two sequential modules in the sequential mesh flow estimation network. Specifically, the mesh flow estimation module predicts mesh flows for the visible contents of the frames to guarantee the accuracy of the computed motions. The mesh flow completion module then completes the mesh flow in the hole regions of the target frame by learning from a sequence of adjacent mesh flows. We design a hybrid loss function to optimize the flow estimation performance for the visible regions, the entire frames and the inpainted regions respectively, and train the sequential mesh flow estimation network in an end-to-end and self-supervised manner. For the polishing network, we align the frames with the estimated mesh flows in a propagation manner and feed them into the network for further refinement. The polishing network is trained to correct the distortion of the hole regions of each frame caused by the mesh flow transformation. Experimental results reveal the superiority of our method over the state-of-the-art schemes.
We summarize our contributions as follows:
(Ⅰ) We are the first to exploit mesh flows as motion representations to address the misalignment problem in video inpainting, and we demonstrate their superiority in both effectiveness and efficiency compared with other motion representations.
(Ⅱ) We propose a simple yet effective model and a hybrid loss function to better estimate mesh flows for videos with holes and to better exploit the mesh flow for video inpainting.
(Ⅲ) We evaluate our method on various challenging videos and demonstrate that our proposed approach achieves impressive improvements in both effectiveness and efficiency compared with the state-of-the-art approaches.
Recent years have witnessed remarkable progress in video inpainting driven by deep learning-based approaches. In this section, we provide an overview of the literature in terms of alignment techniques, given their effectiveness in handling temporal consistency. We classify the existing video inpainting methods into four categories: patch matching methods, motion-based methods, 3D convolutional neural networks and attention-based neural networks.
Early works are mainly patch-based optimization methods[9-12]. They split each frame of the video into small patches and recover the hole region by pasting the most similar patch from other frames in the video. Since the patch-based alignment only describes a simple translation for each patch, blocking artifacts in the completed videos are obvious. Furthermore, the computation of patch similarity suffers from a large search space, which makes the completion process extremely slow.
Motion-based methods estimate motion information first, and then warp the content from nearby frames to the target frame. The motion representations that have been investigated in deep learning-based approaches are optical flow and homography.
Optical flow. The widely used motion representation in deep learning-based video inpainting is optical flow, which describes per-pixel motions between frames for warping the visible content of reference frames into the hole area of the target frame. Five optical flow-related works have been published, which explore various strategies to exploit optical flow field information[13-16,22]. Three works estimated flows only between adjacent frames[13-15]. Chang et al. proposed a selective scheme to combine an optical flow warping model and an image-based inpainting model[15]. Ding et al. considered two branches of optical flows generated from images and deep features separately[13]. Woo et al. used the computed flow field between the previously completed frame and the target frame as an auxiliary model to enforce temporal consistency[13]. However, estimating optical flow on the hole region directly easily leads to incorrect flow prediction. Further, Kim et al. proposed to estimate flows of feature maps between the source and five reference frames at multiple scales, and to complete the hole based on the aggregation of the five aligned features[22]. Despite considering long-range frames, the flows are still computed on the hole area, which addresses the above issue with little success. Xu et al. proposed a Deep Flow Completion network to complete the optical flow by watching a sequence of flow maps from consecutive frames, which is further used to guide the propagation of pixels to fill up the missing regions in the video[16]. This strategy provides more accurate optical flow estimation than previous approaches. Despite its significant performance improvements, the dense computation of pixel-level flow is very time-consuming, so efficiency remains under-explored.
Homography. Two homography-based methods have been proposed to predict global transformation parameters for aligning frames[13,17]. They both compute affine matrices between multiple reference frames and the target frame for alignment, followed by an aggregation and refinement process. In the aggregation stage, Woo et al. proposed a non-local attention model to pick the best matching patches in the aligned frames[13], while Lee et al. proposed a context matching module to assign weights to each aligned frame[17]. However, homography cannot describe complex motions, which limits its application. Besides, to ensure long-time dependency, they complete the current frame by visiting reference frames over long-range distances, even the whole video shot. This strategy results in intensive computational cost despite the simplicity of homography.
Several 3D convolutional networks have been proposed to use 3D filters to convolve the features from reference frames to the target frame, which is equivalent to temporal alignment[18-20]. Wang et al. proposed a 3D-2D encoder-decoder network, which uses the output of the 3D completion network to guide the 2D completion network[18]. Chang et al. performed video inpainting with 3D gated convolutions and a temporal patch discriminator[18], and further introduced a learnable gated temporal shift module to replace the computation-intensive 3D convolutional layer, which leads to a 3x reduction in computation[20]. However, the computation cost is still heavy. The temporal window size of these 3D convolution-based methods is limited, hence they lack the ability to handle the long-time dependency challenge.
Since attention modules can be used for feature matching, Oh et al. proposed an asymmetric attention block to compute similarities between the hole boundary pixels in the target and the non-hole pixels in the references in a non-local manner[21]. The results are unstable in complex situations because the hole boundary offers only a small number of pixels for similarity computation.
Towards a better understanding of mesh flow, we first formulate the concept of mesh flow and compare it with other motion representations.
Given a target frame $T(x,y)$ and a reference frame $R(x,y)$, we aim to find mapping functions $x'=f(x,y)$ and $y'=g(x,y)$ that minimize the following objective function:

$$E = d\{T(x,y),\; R(x',y')\} \tag{1}$$
To compute the mesh flow, we first partition the frame into non-overlapping regular blocks and treat each block as a basic processing unit, a mesh. Each mesh is a flexible shape-adaptive quadrilateral and can be arbitrarily transformed according to its four vertices. We then compute the motions of the pixels at the vertices of the mesh quadrilaterals, and interpolate the motions of the other pixels from the vertex motions with a bilinear interpolation kernel. The motion model of the vertices of the mesh quadrilaterals can be expressed as follows:
$$x'_v = x_v + u_v, \qquad y'_v = y_v + v_v \tag{2}$$

where $v$ denotes a vertex of the mesh quadrilaterals and $(u_v, v_v)$ is the motion vector of that vertex.
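To make this interpolation step concrete, the following is a minimal PyTorch sketch (our illustration, not the authors' released code) of upsampling the sparse vertex motions to a dense per-pixel flow field with a bilinear kernel; the tensor layout and the function name are our own assumptions.

```python
import torch
import torch.nn.functional as F

def mesh_flow_to_dense(vertex_flow, frame_h, frame_w):
    """Bilinearly upsample vertex motions to a per-pixel flow field.

    vertex_flow: (B, 2, Gh + 1, Gw + 1) tensor holding the (u, v) motion
    of each vertex on a grid of Gh x Gw mesh quadrilaterals.
    Returns a dense flow of shape (B, 2, frame_h, frame_w).
    """
    # Bilinear interpolation reproduces the kernel described above:
    # each pixel's motion is a weighted mix of the motions of the four
    # vertices of the quadrilateral it belongs to.
    return F.interpolate(vertex_flow, size=(frame_h, frame_w),
                         mode='bilinear', align_corners=True)

# Example: a 16x16 mesh grid (17x17 vertices) upsampled to a 256x256 frame.
dense = mesh_flow_to_dense(torch.randn(1, 2, 17, 17), 256, 256)
print(dense.shape)  # torch.Size([1, 2, 256, 256])
```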
Mesh flow possesses several characteristics: ① mesh flow describes the motion trajectory of each pixel with a multi-parameter model, namely, the motion of each pixel is computed from the four nodal motions of the quadrilateral it belongs to; ② mesh flow represents coherent motion trajectories across mesh quadrilaterals (see Figure 2); ③ mesh flow computes the motion of each vertex according to the four quadrilaterals it belongs to, which introduces rich context information for each vertex; ④ mesh flow is a sparse motion field.
Figure 1. Comparison of the state-of-the-art methods in terms of quality and speed on 37 video sequences with 2139 frames in total.
Figure 2. Motion representations, including homography, optical flow, motion vector and mesh flow.
Homography describes an affine mapping between two images, which is formulated as:

$$x' = a_0 + a_1 x + a_2 y, \qquad y' = b_0 + b_1 x + b_2 y \tag{3}$$

where all pixels share the same affine transformation parameters $a_0, a_1, a_2, b_0, b_1, b_2$.
Since homography can only represent camera motion over a stationary scene, it cannot handle complex scenarios with multiple object motions.
Optical flow computes motion information at the pixel level; the motion model of each pixel can be expressed as follows:

$$x' = x + u(x,y), \qquad y' = y + v(x,y) \tag{4}$$

where $(u(x,y), v(x,y))$ is the flow vector at pixel $(x,y)$.
It is worth noting that what we need are the motion trajectories of the pixels in the hole area. Since optical flow cannot introduce rich context information for these pixels as mesh flow does, optical flow estimation in the hole area is unreliable. In addition, optical flow is a dense motion field, which makes flow estimation computation-intensive and time-consuming.
Motion vector describes motion information in a patch-based manner. Since the realistic motion within a patch may be more complicated than a translation, motion vectors cannot provide a precise motion representation. Furthermore, using motion vectors to describe motion easily produces blocking artifacts in the predicted image due to discontinuities across block boundaries (see Figure 2). In comparison, mesh flow can represent more complex non-linear motion transformations and produce more continuous results.
Overall, mesh flow is more efficient and effective than the other motion representations for the video inpainting task, in which the video contains holes.
Given a sequence of frames $\{I_t \mid t=1,\dots,n\}$ with holes $\{H_t \mid t=1,\dots,n\}$, our goal is to estimate the mesh flows $\{M_t \mid t=1,\dots,n-1\}$ between frames, wherein we also need to take the corrupted contents (corresponding to the holes) into account.
Since the spatial information is insufficient due to the holes in the video, computing the mesh flows directly from the frames is unreliable and easily causes misalignments. Thus, we propose to estimate the mesh flow of each entire frame in a sequential manner. The first step is to compute the mesh flow corresponding to the visible regions of the frames. Afterwards, we estimate the mesh flow corresponding to the corrupted parts (i.e., holes) based on adjacent mesh flows. The framework is illustrated in Figure 3.
Figure 3. Overview of the sequential mesh flow estimation network. The network consists of two modules: the mesh flow estimation module and the mesh flow completion module. The mesh flow estimation module predicts mesh flows for the visible regions of the frames, and the mesh flow completion module completes the missing area of the mesh flow of the target frame by learning from a sequence of adjacent mesh flows.
We optimize the sequential mesh flow estimation network in a self-supervised manner. First, we upsample the mesh flows to pixel-level flow maps by interpolating the nodal motions with a bilinear interpolation kernel. Then we align the frames based on the pixel-level flow maps, and minimize the $L_1$ distance between the target frames and the aligned frames.
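As a concrete illustration of the alignment step, below is a minimal sketch of backward warping a frame with the upsampled dense flow; the target-to-reference flow convention and the helper name are assumptions on our part.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Warp a reference `frame` (B, C, H, W) to the target view using a
    dense flow (B, 2, H, W) that maps target pixels to reference pixels."""
    B, _, H, W = frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    base = torch.stack((xs, ys), dim=0).float().to(frame.device)  # (2, H, W)
    coords = base.unsqueeze(0) + flow                             # (B, 2, H, W)
    # Normalize to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                          # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)
```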
For the first mesh flow estimation module, we adopt an $L_1$ loss over the visible regions. Since $(2N+1)$ mesh flows are needed for completing one mesh flow, we average the corresponding $(2N+1)$ $L_1$ distances:
$$L_1(\text{visible region}) = \frac{1}{2N+1}\sum_{i=t-N}^{t+N} \big\| (1-H_t)\odot\big(I_t - \omega'(I_i,\,\theta'_i)\big) \big\|_1 \tag{5}$$

where $I$ denotes the input frame, $H$ denotes the holes, and $\omega'$ denotes the warping function with the predicted mesh flow $\theta'$.
For the second mesh flow completion module, we propose two $L_1$ loss functions: a loss for the entire frame and a loss for the inpainted regions.
$$L_1(\text{the frame}) = \big\| I - \omega''(I,\,\theta'') \big\|_1, \qquad L_1(\text{inpainted region}) = \big\| H \odot \big(I - \omega''(I,\,\theta'')\big) \big\|_1 \tag{6}$$

where $I$ denotes the input frame, $H$ denotes the holes, and $\omega''$ denotes the warping function with the final completed mesh flow $\theta''$.
In summary, the hybrid loss function of the sequential mesh flow estimation network is as follows:
$$L = L_1(\text{visible region}) + L_1(\text{the frame}) + L_1(\text{inpainted region}) \tag{7}$$
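A compact sketch of how the three terms of Eq. (7) could be assembled is shown below; the tensor names are ours, and the visible-region term is simplified to a single reference frame instead of the $(2N+1)$-flow average of Eq. (5).

```python
import torch

def hybrid_loss(target, warped_vis, warped_full, hole):
    """Sketch of the hybrid loss of Eq. (7); variable names are ours.

    target:      frame with visible content, (B, 3, H, W)
    warped_vis:  frame aligned with the predicted mesh flow theta'
    warped_full: frame aligned with the completed mesh flow theta''
    hole:        mask (B, 1, H, W), 1 inside the missing region
    """
    masked_l1 = lambda a, b, m: (m * (a - b)).abs().mean()
    visible = torch.ones_like(hole) - hole
    loss_visible   = masked_l1(target, warped_vis,  visible)               # Eq. (5)
    loss_frame     = masked_l1(target, warped_full, torch.ones_like(hole)) # Eq. (6)
    loss_inpainted = masked_l1(target, warped_full, hole)                  # Eq. (6)
    return loss_visible + loss_frame + loss_inpainted
```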
We illustrate the procedure of the polishing network in Figure 4. The whole procedure has three steps: forward propagation, backward propagation and refinement. In the first step, given the mesh flows estimated by the sequential mesh flow estimation network, we warp the visible content of the first frame into the hole of the second frame and update the mask of the second frame; we repeat this warping operation pair by pair from the first frame to the last frame. In the second step, we perform the same procedure backwards. In the final step, we concatenate the aligned frames and holes with the input frames and holes, and feed them into a residual block-based polishing network to generate the final refined output.
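The sketch below shows one plausible realization of this three-step procedure, reusing the `warp` helper from the earlier sketch; all names and the exact mask-update rule are our assumptions rather than the authors' implementation.

```python
def propagate_and_polish(frames, holes, flows_fw, flows_bw, polish_net):
    """Bidirectional propagation followed by refinement (a sketch).

    frames[t]: (B, 3, H, W); holes[t]: (B, 1, H, W), 1 = missing.
    flows_fw[t] maps frame t to frame t-1; flows_bw[t] maps t to t+1.
    """
    n = len(frames)
    # Forward sweep: pull visible content of frame t-1 into frame t.
    for t in range(1, n):
        aligned = warp(frames[t - 1], flows_fw[t])
        src_vis = 1 - warp(holes[t - 1], flows_fw[t])  # valid source pixels
        fill = holes[t] * src_vis                      # fillable hole pixels
        frames[t] = frames[t] * (1 - fill) + aligned * fill
        holes[t] = holes[t] * (1 - src_vis)            # shrink the mask
    # Backward sweep: same procedure from the last frame to the first.
    for t in range(n - 2, -1, -1):
        aligned = warp(frames[t + 1], flows_bw[t])
        src_vis = 1 - warp(holes[t + 1], flows_bw[t])
        fill = holes[t] * src_vis
        frames[t] = frames[t] * (1 - fill) + aligned * fill
        holes[t] = holes[t] * (1 - src_vis)
    # Refinement: the residual polishing network consumes the propagated
    # frames together with the remaining masks (the paper concatenates
    # the aligned results with the original inputs).
    return [polish_net(f, h) for f, h in zip(frames, holes)]
```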
Figure 4. The process of video polishing. We first use the estimated mesh flows to warp the contents forward in a propagation manner, and then repeat the same procedure backwards. Finally, we concatenate the aligned frames and holes with the input frames and holes, and feed them into a residual block-based polishing network to generate the final refined output.
Figure 5. Video sequences for quantitative comparisons. The masks are selected from other videos in the DAVIS dataset.
Figure 6. Qualitative results compared with the state-of-the-art methods on the DAVIS dog-agility video sequence. Our method completes the white and red pole with a more precise and coherent structure.
Figure 7. Qualitative results compared with the state-of-the-art methods on the DAVIS motocross-bumps video sequence. Our method completes the two white-stripe warning lines more continuously.
We design two loss functions, specific to the inpainted region and the entire frame respectively, to train the polishing network. The $L_1$-based loss function is designed for refining the warped contents in the hole regions of the frame, while the adversarial loss[24] is designed to make the completed contents more realistic and more consistent with the visible regions.
$$L_1(\text{inpainted region}) = \big\| H \odot (\hat{I} - I) \big\|_1 \tag{8}$$

where $\hat{I}$ denotes the output of the polishing network, and $L_{adv}$ follows the adversarial formulation in [24].
In summary, the total loss of the polishing network is as follows:
$$L = L_1(\text{inpainted region}) + L_{adv} \tag{9}$$
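For illustration, a hedged sketch of the polishing objective follows. The paper only states that an $L_1$ hole term is combined with an adversarial loss[24], so the discriminator interface and the non-saturating GAN form used here are our assumptions.

```python
import torch
import torch.nn.functional as F

def polishing_loss(output, target, hole, disc):
    """Sketch of Eqs. (8)-(9): L1 on the inpainted region plus an
    assumed generator-side adversarial term; `disc` returns logits."""
    l1_hole = (hole * (output - target)).abs().mean()
    logits = disc(output)
    adv = F.binary_cross_entropy_with_logits(
        logits, torch.ones_like(logits))  # push the output towards "real"
    return l1_hole + adv
```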
We train our method in two stages: first the sequential mesh flow estimation network, then the polishing network. Both models run on hardware with an Intel(R) Xeon(R) CPU E5-2620 v4 and GeForce GTX 1080Ti GPUs.
We train the sequential mesh flow estimation network on 3471 videos of the YouTube-VOS[25] dataset. For each sample, we select 12 frames from one video with a random frame step between 1 and 5. To collect the hole masks, we use the irregular mask dataset provided by the image inpainting work PartialConv[26], which contains 12,000 mask files. We further augment the mask dataset to 480,000 mask files by performing random translation, flipping and rotation. During training, we randomly select 12 masks for each sample.
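A minimal sketch of such mask augmentation with Pillow is given below; the rotation and translation ranges are our assumptions, since the paper does not report them.

```python
import random
from PIL import Image

def augment_mask(mask: Image.Image) -> Image.Image:
    """Randomly translate, flip and rotate a binary hole mask
    (parameter ranges are illustrative assumptions)."""
    if random.random() < 0.5:
        mask = mask.transpose(Image.FLIP_LEFT_RIGHT)
    angle = random.uniform(-30, 30)
    dx, dy = random.randint(-32, 32), random.randint(-32, 32)
    # New pixels exposed by the transform stay 0 (no hole).
    return mask.rotate(angle, translate=(dx, dy), fillcolor=0)
```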
We use the Adam optimizer with $\beta=(0.9, 0.999)$ and a learning rate of $10^{-4}$. Using one GeForce GTX 1080Ti GPU, early convergence takes about 8 hours, and final convergence takes about one week (the PSNR improves by about 1 dB).
We then build the training data for the polishing network by performing temporal alignment operations on the 3471 videos of the YouTube-VOS dataset. For each video, the alignment operation propagates forwards and backwards with the output of the sequential mesh flow estimation network. The randomly selected masks are also saved with the aligned results.
We use the Adam optimizer with $\beta=(0.9, 0.999)$ and a learning rate of $10^{-4}$. Using one GeForce GTX 1080Ti GPU, early convergence takes about 2 hours, and final convergence takes about 3 days (the PSNR improves by about 1 dB).
To demonstrate the qualitative and quantitative performance of our proposed method MOVIE, we evaluate it on the DAVIS[27,28] dataset, which provides pixel-wise foreground object annotations.
For qualitative evaluation, we test on several video sequences with large motions and use the labeled pixel-wise foreground objects as holes. For quantitative evaluation, we randomly select 37 video sequences with 2139 frames in total from the DAVIS dataset. Since the ground truths of the removed regions are not available when removing objects directly from the video, we randomly select a mask sequence for each video from other videos in the DAVIS dataset. Figure 5 shows two samples of the test sequences, which contain large foreground objects in the hole region and complicated motions. We report the evaluation in terms of PSNR and SSIM, which are commonly used in video inpainting tasks. We also conduct ablation studies on these 37 video sequences. Inference speed is computed on an NVIDIA GTX 1080 Ti GPU for frames of 256×256 pixels.
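For reference, the sketch below shows one way to compute per-video PSNR and SSIM with scikit-image, averaging over frames; the exact averaging protocol is our assumption.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_sequence(pred_frames, gt_frames):
    """PSNR/SSIM averaged over the frames of one video.
    Both inputs are lists of uint8 RGB arrays of shape (H, W, 3)."""
    psnr = np.mean([peak_signal_noise_ratio(g, p, data_range=255)
                    for g, p in zip(gt_frames, pred_frames)])
    ssim = np.mean([structural_similarity(g, p, channel_axis=2, data_range=255)
                    for g, p in zip(gt_frames, pred_frames)])
    return psnr, ssim
```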
We compare our approach with the state-of-the-art approaches, which can be categorized as follows:
Patch-based. The patch-based method[11] completes the hole of a video via patch-based similarity matching.
Optical flow-based. DVI[22] is a two-frame optical flow-based method, and Flow-guided[16] uses a sequence of flow maps from consecutive frames to complete the target optical flow.
Homography-based. CPnet[17] performs temporal alignment by computing affine matrices between two frames.
3D convolution-based. GateTSM[20] and 3DGatedConv[18] both introduce new modules into 3D convolution layers for better performance and faster computation.
Attention-based. OnionPeel[21] uses an asymmetric attention block to compute similarities between the hole boundary pixels in the target and the non-hole pixels in the references in a non-local manner.
We illustrate the qualitative results in Figures 6 and 7, where both video sequences contain large motions. As shown in the figures, our method completes the white and red pole with a more precise and coherent structure (Figure 6), and completes the two white-stripe warning lines more continuously (Figure 7). Our method is able to handle these complicated situations, while the state-of-the-art methods have limitations in inpainting consistent results, and obvious artifacts can be observed.
The performance improvement comes from two aspects: the superiority of mesh flow and the design of the sequential mesh flow estimation network. Mesh flow can represent complex non-linear motion transformations and coherent motion trajectories across mesh quadrilaterals. Further, the design of the sequential mesh flow estimation network guarantees precise estimation of the mesh flow in the hole areas of the video.
We show the quantitative results in Table 1. Our method produces a significant improvement (more than 1 dB PSNR) over the current state-of-the-art methods on these challenging data containing complex motions of foreground objects, and shows speedups of about 4x over the fastest method and more than 20x over the best-performing method.
Table 1. Quantitative comparison with the state-of-the-art methods. Our method produces a significant improvement (more than 1 dB PSNR) over the current state-of-the-art methods, and shows speedups of about 4x over the fastest method and more than 20x over the best-performing method.
The reconstruction values demonstrate that several methods fail to produce high reconstruction quality due to the complex foreground motions in the hole region, including the patch-based method[11], the two-frame optical flow-based method DVI[22], the homography-based method CPnet[17], and the 3D convolution-based methods GateTSM[20] and 3DGatedConv[18]. Their weaknesses have been analyzed in detail in the related work section. Flow-guided[16] and OnionPeel[21] can inpaint with higher PSNR but are not reliable: Flow-guided[16] is easily affected by the noise introduced by the missing contents when estimating motion trajectories of the hole area at the pixel level, while OnionPeel[21] fails because of the small dimension of the hole boundary used for computing similarities.
In comparison, our method handles complex motions more precisely and efficiently. The results show that mesh flow, combined with the well-designed sequential mesh flow estimation network, provides more precise temporal alignment. Meanwhile, our method is a computationally efficient solution.
In this section, we conduct a series of ablation studies to analyze the effectiveness of each component of the sequential mesh flow estimation network. Quantitative analyses are conducted on the 37 video sequences described in Section 4.1, whose mask sequences are selected from other videos. We train each model for 50,000 iterations and optimize all models with the same training settings for a fair comparison. The training process takes about 8 hours.
(Ⅰ) The effectiveness of the sequential mesh flow estimation: Our model estimates mesh flows in a sequential manner: it first estimates mesh flows for the visible contents of the frames, and then completes the mesh flows of the hole areas by learning from the adjacent mesh flows. To analyze the effectiveness of this sequential strategy, we compare it with a direct mesh flow estimation model, which estimates the mesh flows directly from a sequence of frames and holes. As illustrated in Figure 8, estimating mesh flows in a sequential manner generates more accurate mesh flows, while estimating mesh flows directly fails completely.
Figure 8. Ablation study on the effectiveness of the sequential mesh flow estimation strategy.
Figure 9. Ablation study on the hybrid loss function of the sequential mesh flow estimation network. The symbols in the top-left corner of each frame represent the three loss functions respectively: the L1 loss for the visible region, the L1 loss for the entire frame, and the L1 loss for the inpainted region.
(Ⅱ) Ablation study on mesh size: We analyze the influence of the mesh size, as shown in Table 2.
Table 2. Ablation study on different mesh sizes.
Table 3. Ablation study on number of references.
The results show that setting the mesh size to 8 achieves the best performance. A mesh size larger than 8 may fail to describe the complex motions in the holes, while a mesh size smaller than 8 cannot exploit sufficient information for aligning frames.
(Ⅲ) Ablation study on the number of references: We further analyze the influence of the number of references. Note that using more than 12 mesh flows leads to an out-of-memory error, hence we set the number of mesh flows up to 10. As shown in Table 3, a larger number leads to better performance.
(Ⅳ) Ablation study on the hybrid loss function: To evaluate the hybrid loss function of the sequential mesh flow estimation network, we train the model with different combinations of the three loss functions in the hybrid loss. As shown in Table 4, each of the three loss functions makes a positive contribution to the final performance.
Table 4. Ablation study on the hybrid loss function of the sequential mesh flow estimation network.
Table 5. Ablation study on Polishing Network. The polishing network can achieve 1dB PSNR improvement.
The aligned results are illustrated in Figure 9. Specifically, the two results in the left column indicate that the results fail completely without the $L_1$ loss for the entire frame. The comparison between the results in the right column and the middle column shows that the $L_1$ loss for the inpainted region leads to more consistent texture.
Figure 10. Ablation study on the polishing network. The polishing network can smooth the artifacts of the result and make it more visually plausible.
The effectiveness of the $L_1$ loss for the visible region is illustrated by the comparison between the two results in the right column.
In this section, we conduct an ablation study on the polishing network. The results in Table 5 indicate that the polishing network achieves a 1 dB PSNR improvement. Figure 10 shows that the polishing network can smooth the artifacts of the result and make it more visually plausible.
In this paper, we propose an efficient and effective method for video inpainting. In essence, our main idea is to introduce mesh flow as a more proper representation of motion information so as to better target the temporal misalignment problem in video inpainting. Specifically, we design a sequential mesh flow estimation network, which first predicts mesh flows only for the visible regions of the frames, and then completes the holes of the mesh flows by learning from the adjacent mesh flows. We further design a polishing network to polish the aligned results. Experimental results show that our method yields more promising results with higher inpainting quality in both quantitative and perceptual metrics, and achieves at least a four-times speed-up in completing the missing area.