Future Dynamic 3D Reconstruction:
A 3D World Model with Disentangled Ego-Motion

ICML 2026

Nils Morbitzer^*,1,2 Jonathan Evers^*,1,3 Artem Savkin³ Thomas Stauner³
Nassir Navab^1,2 Federico Tombari^1,2 Stefano Gasperini^1,2,4

¹ TUM ² MCML ³ BMW Group ⁴ Visualais

^* Equal contribution.

arXiv ICML Paper

Abstract

Forecasting the evolution of dynamic environments is crucial for autonomous agents. While generative world models have recently achieved high photorealism in 2D video synthesis by mixing ego-motion and environmental dynamics within the image plane, they exhibit physical inconsistencies, such as morphing or vanishing objects, especially over long time horizons. In this paper, we propose FR3D, a world model that predicts a persistent 3D latent representation for future dynamic 3D reconstruction. Unlike prior works that treat the world as a sequence of image-based features, FR3D explicitly decouples the 3D evolution of the scene from the agent's trajectory, treating the inferred ego-motion as a latent proxy for action. This disentanglement resolves the ambiguities between self-motion and world-motion, ensuring geometric consistency into the future. Furthermore, we introduce a teacher-student distillation strategy that leverages the spatial "common sense" of off-the-shelf foundation models, leading to robust zero-shot generalization. Extensive experiments demonstrate FR3D's strong performance for future dynamic 3D reconstruction from monocular observations across multiple datasets, even 2 seconds into the future.

Method Overview

Overview of the FR3D architecture and forecasting pipeline

Our approach targets predicting how a 3D scene evolves over time and forecasting the most likely camera motion. Displayed above, we propose a world modeling approach for dynamic 3D reconstruction. Our key idea is to learn how to temporally operate in the latent space of a pre-trained, feed-forward 3D reconstruction model, thereby effectively disentangling camera motion from the reconstructed 3D scene structure and potentially induced motion. For that, we propose an autoregressive teacher-student distillation training strategy that yields the first 3D world model for dynamic 3D reconstruction learning to propagate scene states and agent poses within a unified 3D latent space.

After training our pose and spatial forecasting models, FR3D can temporally operate on the latent space of the frozen, feed-forward 3D reconstruction model (i.e. CUT3R). In contrast to prior works that rely on dataset-specific decoders (e.g., DINO-Foresight, VFMF), our frameworks' main advantage is that we can leverage CUT3R's full pipeline, i.e., encoder, decoders, and heads, which are pre-trained for approximately a month on 32 datasets, thereby inheriting its strong generalization capability.

Interactive Results

Our unified framework tackles future dynamic 3D reconstruction on several outdoor datasets (Waymo, KITTI, nuScenes). Select each tab below to explore the results for each dataset. Since our forecasting model does not predict RGB, we augment the point clouds (context and future) with the RGB frames.

Waymo

Our method can forecast how dynamic 3D scenes evolve in a feed-forward manner.

Left-click and drag to rotate

Right-click and drag or WASD to move

Scroll to zoom

Click to pause

KITTI

Our method exhibits strong zero-shot generalization on KITTI.

Left-click and drag to rotate

Right-click and drag or WASD to move

Scroll to zoom

Click to pause

nuScenes

Our method exhibits strong zero-shot generalization on nuScenes.

Left-click and drag to rotate

Right-click and drag or WASD to move

Scroll to zoom

Click to pause

In the qualitative results, we show the context frames as RGB and the 3D reconstruction produced by our model and the oracle. We indicate the ego-camera poses estimated by the oracle in green, both for the context and the future frames. Instead, the forecasted ego trajectory is indicated in blue. As an upper bound, we show the 3D reconstruction performance of the oracle, given the future input images.

Qualitative FR3D reconstruction results on nuScenes and KITTI

The figure above shows qualitative zero-shot results of FR3D on nuScenes and KITTI, with two different challenging dynamic scenes for each dataset. Significant domain shifts occur between Waymo (where the FR3D forecasting model is trained) and the two datasets here. For example, the aspect ratio of the KITTI inputs is significantly wider than that of Waymo (see Figure at the top of the project page). Yet, all scenes show remarkable forecasting performance by FR3D, closely matching its oracle model.

FR3D forecast of camera trajectory during a left turn

In the figure above, we show another interesting qualitative result with a zero-shot prediction of our model on the nuScenes. The scene exhibits a left-hand turn. In the context, it can be seen that the turn has just started (visible in the camera poses). Our model correctly predicts a smooth completion of the turn throughout the rollout (blue cameras). Remarkably, the pose prediction of the oracle (right) is more jittery than that of our model, although our model is trained to mimic the oracle predictions. However, we can explain the improvement as our model learns on noisy tokens thanks to our autoregressive strategy. This makes it robust against the oracle tokens, too.

Quantitative Results

FR3D delivers strong results for depth and pose forecasting, also at long time horizons.

Zero-Shot Results

FR3D exhibits strong zero-shot results on KITTI and nuScenes for depth and pose forecasting.

Quantitative zero-shot depth forecasting results on KITTI and nuScenes

Quantitative zero-shot pose forecasting results on KITTI and nuScenes

Static vs. Dynamic

We also evaluate FR3D's depth forecasting performance for static and dynamic regions against Copy Last and CUT3R-Foresight on the Waymo Open dataset. FR3D outperforms both of them being especially strong for static regions supporting our assumption that sharing information between pose and depth primarily benefits the depth prediction of static scene parts. Although FR3D outperforms the baselines in depth forecasting for dynamic regions, accurately modeling scene dynamics remains challenging.

Depth forecasting comparison for static regions on Waymo

Depth forecasting comparison for dynamic regions on Waymo

Acknowledgements

This project was supported by the BMW Group. In addition, Stefano Gasperini was supported by the German Federal Ministry of Research, Technology and Space (BMFTR) through the Robotics Institute Germany (RIG).

BibTeX

@inproceedings{morbitzer_evers2026fr3d,
                    title     = {Future Dynamic 3D Reconstruction: A 3D World Model with Disentangled Ego-Motion},
                    author    = {Morbitzer, Nils and Evers, Jonathan and Savkin, Artem and Stauner, Thomas and Navab, Nassir and Tombari, Federico and Gasperini, Stefano},
                    booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
                    series    = {Proceedings of Machine Learning Research},
                    publisher = {PMLR},
                    year      = {2026},
                }

Future Dynamic 3D Reconstruction:
A 3D World Model with Disentangled Ego-Motion

TL;DR: A world model for dynamic 3D reconstruction operating on a pre-trained, feed-forward reconstruction model's latent space.

Abstract

Method Overview

Interactive Results

Waymo

KITTI

nuScenes

More Qualitative Results

Quantitative Results

Zero-Shot Results

Static vs. Dynamic

Acknowledgements

BibTeX

Future Dynamic 3D Reconstruction:A 3D World Model with Disentangled Ego-Motion

TL;DR: A world model for dynamic 3D reconstruction operating on a pre-trained, feed-forward reconstruction model's latent space.

Abstract

Method Overview

Interactive Results

Waymo

KITTI

nuScenes

More Qualitative Results

Quantitative Results

Zero-Shot Results

Static vs. Dynamic

Acknowledgements

BibTeX

Future Dynamic 3D Reconstruction:
A 3D World Model with Disentangled Ego-Motion