EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision

Jiawei Yang^*,¶, Boris Ivanovic^¶, Or Litany^†,¶, Xinshuo Weng^¶, Seung Wook Kim^¶, Boyi Li^¶, Tong Che^¶,

Danfei Xu^$,¶, Sanja Fidler^§,¶, Marco Pavone^‡,¶, Yue Wang^*,¶

^*University of Southern California, ^$Georgia Institute of Technology, ^§University of Toronto, ^‡Stanford University, ^†Technion, ^¶Nvidia Research



Using only self-supervision, EmerNeRF effectively decomposes dynamic scenes into static and dynamic components.
Importantly, EmerNeRF derives scene flows without explicit flow supervision.

Abstract

We present EmerNeRF, a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes. Grounded in neural fields, EmerNeRF simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping. EmerNeRF hinges upon two core components: First, it stratifies scenes into static and dynamic fields. This decomposition emerges purely from self-supervision, enabling our model to learn from general, in-the-wild data sources. Second, EmerNeRF parameterizes an induced flow field from the dynamic field and uses this flow field to further aggregate multi-frame features, amplifying the rendering precision of dynamic objects. Coupling these three fields (static, dynamic, and flow) enables EmerNeRF to represent highly dynamic scenes self-sufficiently, without relying on ground truth object annotations or pre-trained models for dynamic object segmentation or optical flow estimation. Our method achieves state-of-the-art performance in sensor simulation, significantly outperforming previous methods when reconstructing static (+2.93 PSNR) and dynamic (+3.70 PSNR) scenes. In addition, to bolster EmerNeRF’s semantic generalization, we lift 2D visual foundation model features into 4D space-time and address a general positional bias in modern Transformers, significantly boosting 3D perception performance (e.g., 37.50% relative improvement in occupancy prediction accuracy on average). Finally, we construct a diverse and challenging 120-sequence dataset to benchmark neural fields under extreme and highly dynamic settings.
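To make the idea of coupling a static and a dynamic field more concrete, the following minimal sketch blends the two fields at each sample point in proportion to their densities. The blending rule and every name in it (`compose_fields`, `sigma_s`, `sigma_d`, and so on) are assumptions chosen for illustration; this page does not specify how the fields are combined.

```python
import numpy as np

def compose_fields(sigma_s, color_s, sigma_d, color_d, eps=1e-8):
    """Blend static and dynamic samples by density (an illustrative assumption).

    sigma_*: (N,) densities, color_*: (N, 3) colors at the same sample points.
    """
    sigma_s = np.asarray(sigma_s, dtype=float)
    sigma_d = np.asarray(sigma_d, dtype=float)
    color_s = np.asarray(color_s, dtype=float)
    color_d = np.asarray(color_d, dtype=float)

    # Total density is the sum of both fields.
    sigma = sigma_s + sigma_d
    # Each field contributes color in proportion to its share of the density.
    denom = np.maximum(sigma, eps)
    w_s = sigma_s / denom
    w_d = sigma_d / denom
    color = w_s[..., None] * color_s + w_d[..., None] * color_d
    return sigma, color

# Two sample points: the first is mostly static, the second purely dynamic.
sigma, color = compose_fields(
    [3.0, 0.0], [[1, 0, 0], [1, 0, 0]],
    [1.0, 2.0], [[0, 0, 1], [0, 0, 1]],
)
```

Under this rule, a point with zero dynamic density is rendered entirely from the static field, which is one simple way a static/dynamic split could show up in the rendered output.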

Exploring the Potential of EmerNeRF representations

Self-supervised Scene Flow Estimation

EmerNeRF exhibits emergent flow estimation without any explicit flow supervision. Instead, its flow estimates arise from optimizing scene reconstruction losses together with temporal aggregation: by integrating temporally consistent features across multiple frames, it achieves accurate scene flow predictions.
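The temporal aggregation described above can be sketched roughly as follows: a point at time t is warped to the previous and next timestamps using a backward and a forward flow, and the features queried at the three locations are averaged. The callables `feature_fn` and `flow_fn`, the three-frame window, and the weights are all hypothetical choices for this sketch, not details given on this page.

```python
import numpy as np

def aggregate_features(feature_fn, flow_fn, x, t, weights=(0.25, 0.5, 0.25)):
    """Average the feature at (x, t) with features at its flow-warped
    positions at t-1 and t+1. Window and weights are assumptions."""
    x = np.asarray(x, dtype=float)
    forward, backward = flow_fn(x, t)
    feats = [
        feature_fn(x + backward, t - 1),  # where the point was one step earlier
        feature_fn(x, t),                 # the point itself
        feature_fn(x + forward, t + 1),   # where the point will be one step later
    ]
    w = np.asarray(weights, dtype=float)
    return sum(wi * fi for wi, fi in zip(w, feats)) / w.sum()

# Toy scene: everything moves +1 unit along x per frame.
STEP = np.array([1.0, 0.0, 0.0])

def toy_flow(x, t):
    # Returns (forward flow, backward flow) for the toy motion.
    return STEP, -STEP

def toy_feature(x, t):
    # A feature attached to the moving content: constant along its trajectory.
    return x - t * STEP

agg = aggregate_features(toy_feature, toy_flow, [2.0, 0.0, 0.0], t=5)
```

In this toy setup the flow is exactly right, so the three queried features agree and the average equals the feature at (x, t). With an incorrect flow the warped queries would land on different content, which hints at why a reconstruction loss on aggregated features can push the flow toward the true motion.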


^*Interact with the plot using the mouse. To optimize page load times, results are displayed every second, showcasing a sampled 1/6 of the points per frame.

Novel View Synthesis

EmerNeRF effectively reconstructs spatial-temporal scenes, generating high-quality views of both static and dynamic elements. For demonstration, a novel trajectory is rendered by:

1. Following the camera's initial video path.
2. Synthesizing novel views while pausing time.
3. Returning to the camera's initial path.
4. Fixing the ego-center and allowing time to progress.

However, it's worth noting that once the ego-center is set in place (step 4), subsequent synthesized views may exhibit increased noise due to the absence of further observations in training data.
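The four-segment demonstration trajectory can be written down as a sequence of (camera pose, timestamp) pairs. The sketch below is one possible encoding; the function name, the pose labels `"recorded"` and `"novel"`, and the parameters `pause_at` and `fix_at` are hypothetical, and it assumes that time resumes when the camera returns to its recorded path.

```python
def build_demo_trajectory(num_frames, pause_at, fix_at, num_novel_views):
    """Return a list of (pose, time_index) pairs for the four demo segments.

    A pose is ("recorded", i) for the i-th recorded camera pose or
    ("novel", k) for the k-th synthesized viewpoint (labels are hypothetical).
    """
    if not 0 <= pause_at <= fix_at < num_frames:
        raise ValueError("expected 0 <= pause_at <= fix_at < num_frames")

    traj = []
    # 1. Follow the recorded camera path; pose and time advance together.
    traj += [(("recorded", i), i) for i in range(pause_at + 1)]
    # 2. Pause time and sweep through novel viewpoints.
    traj += [(("novel", k), pause_at) for k in range(num_novel_views)]
    # 3. Return to the recorded path; time resumes (an assumption).
    traj += [(("recorded", i), i) for i in range(pause_at + 1, fix_at + 1)]
    # 4. Fix the ego-center and let time progress on its own.
    traj += [(("recorded", fix_at), i) for i in range(fix_at + 1, num_frames)]
    return traj

traj = build_demo_trajectory(num_frames=6, pause_at=1, fix_at=3, num_novel_views=2)
```

In the last segment the pose stays at `("recorded", fix_at)` while the time index keeps increasing, so the rendered timestamps are viewed from a position the camera never occupied at those times.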

EmerNeRF synthesizes high-quality novel appearances, depth and object motions.

Positional Embedding Decomposition

We observe prominent and undesired positional embedding (PE) patterns when using current state-of-the-art foundation models, notably DINOv2. These patterns persist across images regardless of 3D viewpoint shifts, violating 3D multi-view consistency. EmerNeRF offers a solution to this problem.

EmerNeRF frees foundation models from positional embedding artifacts.

Without the introduced PE decomposition, NeRF exhibits noticeable PE artifacts, leading to foggy floaters and ghosting in the rendered features.
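To illustrate what separating a view-independent pattern from image features means, the sketch below treats each observed feature map as per-image content plus one pattern shared by all images, and estimates the shared part by averaging over images. Both the additive model and the averaging estimator are assumptions made for this example: averaging is a crude stand-in, not the decomposition proposed here, and it would also absorb any content that happens to sit at the same pixels in every image.

```python
import numpy as np

def split_shared_pattern(feature_maps):
    """Split (N, H, W, C) feature maps into per-image residuals and a single
    shared (H, W, C) pattern, estimated as the mean over images.
    This averaging estimator is an illustrative stand-in only."""
    feature_maps = np.asarray(feature_maps, dtype=float)
    shared = feature_maps.mean(axis=0)   # image-space pattern common to all views
    residuals = feature_maps - shared    # what is left varies from view to view
    return residuals, shared

# Synthetic check: random zero-mean "content" plus a fixed image-space pattern.
rng = np.random.default_rng(0)
pattern = rng.normal(size=(4, 4, 8))
content = rng.normal(size=(500, 4, 4, 8))
observed = content + pattern

residuals, shared = split_shared_pattern(observed)
```

On this synthetic data the recovered `shared` map is close to `pattern` because the content averages out over many images. Real features offer no such guarantee, which is one reason a decomposition learned jointly with a 3D representation is a more principled choice than plain averaging.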

Comparison of EmerNeRF with and without our proposed positional embedding decomposition. PCA has randomness, so the colors may vary across scene reconstructions and novel view synthesis.
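For readers unfamiliar with this kind of visualization, here is a minimal sketch of mapping high-dimensional features to RGB colors with PCA. The function name `pca_to_rgb` and the min-max normalization are assumptions; the exact visualization pipeline is not described on this page.

```python
import numpy as np

def pca_to_rgb(features):
    """Project (N, C) features onto their top-3 principal components and
    rescale each component to [0, 1] so it can be shown as an RGB channel."""
    features = np.asarray(features, dtype=float)
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T
    lo = projected.min(axis=0)
    hi = projected.max(axis=0)
    return (projected - lo) / np.maximum(hi - lo, 1e-12)

rng = np.random.default_rng(0)
rgb = pca_to_rgb(rng.normal(size=(100, 16)))
```

Independently of any solver randomness, the principal directions are fit to whichever set of features is passed in and their signs are arbitrary, so the same semantic content need not receive the same color in two separately fitted visualizations.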

Spatial-Temporal Foundation Feature Fields

EmerNeRF leverages the robust semantics of 2D vision foundation models, overcoming their positional embedding limitations. We visualize the lifted spatial-temporal features.


^*Interact with the plot using the mouse. To optimize page load times, results are displayed every second. Note: 2D and 3D features are visualized distinctly and may have different color representations. Voxel size is 0.15m.
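One common way to obtain a voxelized feature visualization at a given resolution is to average the features of all points that fall into the same voxel. The sketch below does this for a 0.15 m voxel size; the function `voxelize` and the mean-pooling rule are assumptions for illustration rather than the procedure used to produce the plots.

```python
import numpy as np

def voxelize(points, features, voxel_size=0.15):
    """Average per-point features inside each occupied voxel.

    points: (N, 3) positions in meters, features: (N, C).
    Returns integer voxel coordinates (M, 3) and pooled features (M, C).
    """
    points = np.asarray(points, dtype=float)
    features = np.asarray(features, dtype=float)
    # Integer voxel index of every point.
    coords = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard against shape differences across NumPy versions
    # Sum the features per voxel, then divide by the number of points in it.
    sums = np.zeros((len(uniq), features.shape[1]))
    np.add.at(sums, inverse, features)
    counts = np.bincount(inverse, minlength=len(uniq))
    return uniq, sums / counts[:, None]

pts = [[0.01, 0.01, 0.01], [0.10, 0.05, 0.02], [0.20, 0.00, 0.00]]
feats = [[1.0, 0.0], [3.0, 0.0], [5.0, 5.0]]
voxels, pooled = voxelize(pts, feats)
```

Here the first two points share a voxel and their features are averaged, while the third point occupies a voxel of its own.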

Citation

@article{yang2023emernerf,
    title={EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision},
    author={Jiawei Yang and Boris Ivanovic and Or Litany and Xinshuo Weng and Seung Wook Kim and Boyi Li and Tong Che and Danfei Xu and Sanja Fidler and Marco Pavone and Yue Wang},
    journal={arXiv preprint arXiv:2311.02077},
    year={2023}
}