DynaFill Demo

Dynamic objects have a significant impact on a robot's perception of the environment, which degrades the performance of essential tasks such as localization and mapping. In this work, we address this problem by synthesizing plausible color, texture and geometry in regions occluded by dynamic objects. We propose the novel geometry-aware DynaFill architecture that follows a coarse-to-fine topology and incorporates our gated recurrent feedback mechanism to adaptively fuse information from previous timesteps. We optimize our architecture using adversarial training to synthesize fine, realistic textures, which enables it to hallucinate color and depth structure in occluded regions online in a spatially and temporally coherent manner, without relying on future frame information.

Casting our inpainting problem as an image-to-image translation task, our model also corrects regions correlated with the presence of dynamic objects in the scene, such as shadows or reflections. We introduce a large-scale hyperrealistic dataset with RGB-D images, semantic segmentation labels, camera poses as well as ground truth RGB-D information of occluded regions. Extensive quantitative and qualitative evaluations show that our approach achieves state-of-the-art performance, even in challenging weather conditions. Furthermore, we present results for retrieval-based visual localization with the synthesized images that demonstrate the utility of our approach.

How Does It Work?

We propose an end-to-end deep learning architecture for dynamic object removal and inpainting from temporal RGB-D sequences. Our coarse-to-fine model, trained under the generative adversarial framework, synthesizes spatially coherent, realistic color and texture. It enforces temporal consistency with a gated recurrent feedback mechanism that adaptively fuses information from previously inpainted frames, which are aligned to the current view using odometry and the previously inpainted depth map. Our model encourages geometric consistency during end-to-end training by conditioning the depth completion on the inpainted image and simultaneously using the previously inpainted depth map in the feedback mechanism. As opposed to existing video inpainting methods, our model does not utilize future frame information and produces more accurate and visually appealing results by also removing shadows or reflections from regions surrounding dynamic objects.
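As a rough illustration of the gating idea only (not the DynaFill implementation), the sketch below blends two feature maps with a per-pixel gate. In the actual model the gate is predicted by a learned gating network; here the hypothetical `gate_logits` array is simply passed in, and the function name `gated_fusion` is likewise an assumption made for this example.

```python
import numpy as np

def gated_fusion(coarse_feat, warped_feat, gate_logits):
    """Blend two feature maps of shape (C, H, W) with a per-pixel gate.

    gate_logits has shape (1, H, W) and stands in for the output of a
    learned gating network (hypothetical input for this sketch).
    """
    gate = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid, values in (0, 1)
    # gate near 1 trusts the warped previous frame, gate near 0 trusts
    # the coarse inpainting of the current frame.
    return gate * warped_feat + (1.0 - gate) * coarse_feat
```

A gate of this form lets the network fall back on the current coarse result wherever the warped previous frame is unreliable, for example in newly revealed regions.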

Network architecture
Figure: Schematic representation of our DynaFill architecture. The image is first coarsely inpainted based on the spatial context in regions occluded by dynamic objects, which are identified from the semantic segmentation mask. Subsequently, the inpainted image from the previous timestep is warped into the current timestep using odometry and the inpainted depth map in our gated recurrent feedback mechanism. The coarsely inpainted image and the warped image are then input to the refinement stream that fuses feature maps through a gating network using the learned mask. An image discriminator is employed to train the network in an adversarial manner to yield the final inpainted image. Simultaneously, the depth completion network fills the regions containing dynamic objects in the depth map, conditioned on the inpainted image.
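The warping step in the caption can be pictured with a small geometric sketch. The code below is an illustration under stated assumptions, not the network's differentiable warping layer: it assumes an ideal pinhole camera with intrinsics `K`, depth measured along the optical axis, and a relative pose `T_cur_prev` (from odometry) that maps points from the previous camera frame to the current one. The function name and nearest-pixel splatting are choices made for this example.

```python
import numpy as np

def warp_previous_frame(prev_rgb, prev_depth, K, T_cur_prev):
    """Forward-warp prev_rgb (H, W, 3) into the current camera view.

    prev_depth: (H, W) depth along the optical axis.
    K: (3, 3) pinhole intrinsics (assumed, no lens distortion).
    T_cur_prev: (4, 4) transform from previous to current camera frame.
    Returns the warped image and a boolean mask of pixels that were hit.
    """
    h, w = prev_depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Back-project every pixel to a 3D point in the previous camera frame.
    pts_prev = (np.linalg.inv(K) @ pix.T) * prev_depth.reshape(1, -1)
    # Move the points into the current camera frame.
    pts_cur = T_cur_prev[:3, :3] @ pts_prev + T_cur_prev[:3, 3:4]
    z = pts_cur[2]
    in_front = z > 1e-6
    z_safe = np.where(in_front, z, 1.0)  # avoid dividing by zero or negatives
    proj = K @ pts_cur
    u_cur = np.round(proj[0] / z_safe).astype(np.int64)
    v_cur = np.round(proj[1] / z_safe).astype(np.int64)
    keep = in_front & (u_cur >= 0) & (u_cur < w) & (v_cur >= 0) & (v_cur < h)
    # For each target pixel keep only the nearest source point (z-buffer).
    tgt = v_cur[keep] * w + u_cur[keep]
    src = np.flatnonzero(keep)
    order = np.lexsort((z[keep], tgt))  # primary key: target pixel, then depth
    tgt, src = tgt[order], src[order]
    first = np.unique(tgt, return_index=True)[1]
    tgt, src = tgt[first], src[first]
    warped = np.zeros((h * w, 3), dtype=prev_rgb.dtype)
    valid = np.zeros(h * w, dtype=bool)
    warped[tgt] = prev_rgb.reshape(-1, 3)[src]
    valid[tgt] = True
    return warped.reshape(h, w, 3), valid.reshape(h, w)
```

Pixels that receive no source point stay marked invalid in the returned mask. These are the regions where a fusion step would have to rely on the coarse inpainting of the current frame instead.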

We performed extensive experiments that show that our DynaFill model exceeds the performance of state-of-the-art image and video inpainting methods with a runtime suitable for real-time applications (~20 FPS). Additionally, we present detailed ablation studies, qualitative analysis and visualizations that highlight the improvement brought about by the various components of our architecture. Furthermore, we present experiments that employ our model as a preprocessor for retrieval-based visual localization, demonstrating the utility of our approach as an out-of-the-box front end for localization and mapping systems.

DynaFill Dataset

Our hyperrealistic synthetic dataset, generated using the CARLA simulator, consists of 6-DoF ground truth poses and aligned RGB-D images with and without dynamic objects, as well as ground truth semantic segmentation labels. The dataset was collected in several weather conditions, including ClearNoon, CloudyNoon, WetNoon, WetCloudyNoon, ClearSunset, CloudySunset, WetSunset, and WetCloudySunset. The images were acquired at 10 Hz at a resolution of 512 × 512 pixels with a field of view of 90° using a front-facing camera mounted on the car. We split the data into training and validation sets: the training set was collected in the Town01 map and consists of 77,742 RGB-D images, while the validation set was collected in the Town02 map and consists of 23,722 RGB-D images.
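For anyone who needs camera intrinsics to work with the RGB-D images, the stated resolution and field of view are enough to derive them under an ideal pinhole model. The sketch below assumes square pixels, no lens distortion, a principal point at the image center, and that the 90° value is the horizontal field of view. None of these assumptions are confirmed by the dataset description, and the helper name is hypothetical.

```python
import math

def pinhole_intrinsics(width, height, fov_deg):
    """3x3 intrinsics of an ideal pinhole camera (assumed model).

    fov_deg is treated as the horizontal field of view in degrees.
    """
    focal = width / (2.0 * math.tan(math.radians(fov_deg) / 2.0))
    return [
        [focal, 0.0, width / 2.0],
        [0.0, focal, height / 2.0],
        [0.0, 0.0, 1.0],
    ]

# Values from the dataset description: 512 x 512 pixels, 90 degree FOV.
K = pinhole_intrinsics(512, 512, 90.0)
```

With these numbers the focal length comes out at approximately 256 pixels, since tan(45°) = 1.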

License Agreement

The data is provided for non-commercial use only. By downloading the data, you accept the license agreement which can be downloaded here. If you report results based on the DynaFill dataset, please consider citing the paper mentioned in the Publications section.

Publications

Borna Bešić, Abhinav Valada
Dynamic Object Removal and Spatio-Temporal RGB-D Inpainting via Geometry-Aware Adversarial Learning
arXiv preprint arXiv:2008.05058, 2020.

(PDF) (BibTeX)