AnyView:
Synthesizing Any Novel View in Dynamic Scenes

AnyView Logo

¹Toyota Research Institute

²Amazon Web Services


In Submission

Summary

We introduce AnyView, a diffusion-based video generation framework for dynamic view synthesis. Given a single video from any camera trajectory, the task is to predict a temporally synchronized video of the same scene from any other camera trajectory. Unlike most existing methods, which rely on explicit 3D reconstructions, require costly test-time optimization, or are limited to narrow viewpoint changes, AnyView operates end-to-end and supports extreme camera displacements where there may be little overlap between the input and output viewpoints.

Method

We adopt Cosmos, a latent diffusion transformer, as our base model. Rather than using warped depth maps as explicit conditioning, we rely solely on an implicitly learned 4D representation. We encode all camera parameters into a unified Plücker representation \( \boldsymbol{P} = (\boldsymbol{r}, \boldsymbol{m}) \), combining extrinsics and intrinsics into dense per-pixel ray and moment vectors. These embeddings are concatenated along the channel dimension, while both viewpoints are concatenated along the sequence dimension to form the overall set of tokens.
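As a concrete illustration, the sketch below computes such a per-pixel Plücker embedding \( \boldsymbol{P} = (\boldsymbol{r}, \boldsymbol{m}) \) from a pinhole intrinsics matrix and a camera-to-world pose. The function name, tensor layout, and conventions are our own assumptions for exposition, not AnyView's actual implementation.

```python
# Minimal sketch: per-pixel Plücker ray embeddings from a 3x3 intrinsics matrix K
# and a 4x4 camera-to-world pose c2w. Conventions here (pixel centers at +0.5,
# normalized ray directions) are assumptions, not AnyView's exact code.
import torch

def plucker_embedding(K: torch.Tensor, c2w: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Returns a (6, H, W) tensor holding (r, m) = (ray direction, moment) per pixel."""
    device, dtype = K.device, K.dtype
    v, u = torch.meshgrid(
        torch.arange(H, device=device, dtype=dtype) + 0.5,
        torch.arange(W, device=device, dtype=dtype) + 0.5,
        indexing="ij",
    )
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)   # (H, W, 3) homogeneous pixels

    # Back-project to camera-frame directions, then rotate into the world frame.
    dirs_cam = pix @ torch.linalg.inv(K).T                  # (H, W, 3)
    dirs_world = dirs_cam @ c2w[:3, :3].T
    r = torch.nn.functional.normalize(dirs_world, dim=-1)   # unit ray directions

    # Moment m = o x r, where o is the camera center in world coordinates.
    o = c2w[:3, 3].expand_as(r)
    m = torch.cross(o, r, dim=-1)

    return torch.cat([r, m], dim=-1).permute(2, 0, 1)       # (6, H, W)
```

Each viewpoint's resulting 6-channel map can then be concatenated channel-wise with that viewpoint's tokens, with the two viewpoints stacked along the sequence dimension, mirroring the conditioning described above.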

AnyView Method Overview

Representative Results

Extreme View Synthesis & Challenging Camera Trajectories

We run AnyView on DROID, Kubric, and Ego-Exo4D below, with the corresponding camera trajectories visualized on the right. AnyView performs consistent, high-fidelity extreme monocular dynamic view synthesis even when the camera poses become very complex (2nd row).

Baseline Comparison

While the AnyView generation does not match the ground truth precisely, it is still a plausible, realistic output that is coherent with the input. Existing methods tend to fail to extrapolate, largely copying the input view under large shifts in perspective. Meanwhile, our method preserves scene geometry, appearance, and dynamics, despite working with drastically different target poses and highly "incomplete" visual observations.

AnyView (Ours):
Depth reprojection:
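For context, the sketch below illustrates the generic idea behind a depth-reprojection baseline: lift each input pixel to 3D using a depth map, then splat it into the target camera, leaving holes wherever the target view observes unseen regions. It is a simplified forward warp under assumed pinhole conventions, not the exact baseline implementation compared against in the paper.

```python
# Minimal sketch of generic depth reprojection (forward warp with a z-buffer).
# K is a shared 3x3 intrinsics matrix; src_c2w / tgt_c2w are 4x4 camera-to-world
# poses. These conventions are assumptions for illustration only.
import numpy as np

def reproject(image, depth, K, src_c2w, tgt_c2w):
    """image: (H, W, 3), depth: (H, W). Returns the input warped into the target view."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W].astype(np.float64)
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(u)], axis=-1)     # (H, W, 3)

    # Unproject to world coordinates: X = c2w * (depth * K^-1 * pix).
    pts_cam = (pix @ np.linalg.inv(K).T) * depth[..., None]
    pts_world = pts_cam @ src_c2w[:3, :3].T + src_c2w[:3, 3]

    # Project into the target camera.
    w2c = np.linalg.inv(tgt_c2w)
    pts_tgt = pts_world @ w2c[:3, :3].T + w2c[:3, 3]
    z = pts_tgt[..., 2]
    uv = (pts_tgt @ K.T)[..., :2] / np.clip(z[..., None], 1e-6, None)

    # Nearest-pixel splat with a z-buffer; holes remain where nothing projects.
    out = np.zeros_like(image)
    zbuf = np.full((H, W), np.inf)
    ui = np.round(uv[..., 0] - 0.5).astype(int)
    vi = np.round(uv[..., 1] - 0.5).astype(int)
    valid = (z > 0) & (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)
    for sv, su in zip(*np.nonzero(valid)):
        tv, tu = vi[sv, su], ui[sv, su]
        if z[sv, su] < zbuf[tv, tu]:
            zbuf[tv, tu] = z[sv, su]
            out[tv, tu] = image[sv, su]
    return out
```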

DyCheck iPhone & Ego-Exo4D Results

While the scenes below are not highly dynamic in terms of content, they do contain subtle, intricate motions and hand-object interactions, often with relatively small objects, as well as shaky input poses.

In videos similar to the examples below, the background often has to be “guessed” from the other camera viewpoint, but the inpainted regions integrate harmoniously with the rest of the scene.

Object Permanence & Advanced Reasoning

We show anecdotal examples of AnyView leveraging subtle visual cues to improve generation accuracy in unobserved areas, as evidence of advanced common-sense and spatiotemporal reasoning. In the first example, the red car arriving at the intersection is predicted in the output (left) view before it becomes visible in the input (front) view, showing that AnyView has learned to maintain spatiotemporal consistency, which improves performance in areas that would otherwise be ill-defined.

Here, while the white van is partially visible in the first few frames, and thus gets depicted accurately in the remainder of the video, the white car is never observed in any input frame. Instead, AnyView appears to pick up on the headlights reflecting on the road. Although the reconstructed car does not have the correct appearance, the model indirectly estimates its trajectory by tracking the reflection over time.

Finally, the zero-shot ArgoVerse scene below depicts the ego vehicle pausing for a moment and then turning left. The model correctly hallucinates the black car passing by on the left of the generated video, despite never observing it, presumably inferring that the driver must be waiting at the green light for oncoming traffic to clear before executing the unprotected left turn.

Spatial Uncertainty Analysis

In the robotics video below, the model cannot see what the black bin contains, since its contents are occluded; it resorts to predicting fruit (objects that are common in LBM), and additionally spawns spurious objects out of frame on the left. Video layout:

Row 1: Input | Ground truth | Diversity heatmap
Row 2: Sample 1 | Sample 2 | Sample 3
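One simple way to obtain such a diversity heatmap is to measure per-pixel disagreement across the drawn samples. The sketch below uses the standard deviation over samples, averaged over color channels; this is an illustrative choice, not necessarily the exact measure used for the figure.

```python
# Minimal sketch of a per-pixel diversity heatmap over N sampled videos.
import numpy as np

def diversity_heatmap(samples: np.ndarray) -> np.ndarray:
    """samples: (N, T, H, W, 3) array of N generated videos in [0, 1].

    Returns a (T, H, W) map where high values mark regions the model is
    uncertain about (e.g. occluded bin contents or out-of-frame areas)."""
    # Standard deviation across the N samples, averaged over color channels.
    return samples.std(axis=0).mean(axis=-1)
```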

Next, in Kubric, we often observe variation in object positions along the input viewing direction, which presumably stems primarily from uncertainty in the model's implicit depth estimation.

Real-World Driving Scene Completion

While these target viewpoints only exist in synthetic training data, almost every car that AnyView can see is reconstructed with high fidelity and accurate dynamics.

Datasets

For training, we combined 12 different 4D (multi-view video) datasets across four distinct domains: Robotics, Driving, 3D, and Other. During training, we perform weighted sampling so that each domain is seen equally often (i.e., comprises 25% of each batch), creating a balanced representation.
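As a sketch of what such domain-balanced sampling can look like, the snippet below draws an equal share of each batch from every domain and then a random dataset within that domain. The dataset grouping, names, and loader API are placeholders, not the actual training pipeline.

```python
# Minimal sketch of domain-balanced batch sampling. Dataset names and the
# .sample() loader API are hypothetical stand-ins, not AnyView's real code.
import random

DOMAINS = {
    "robotics": ["robotics_a", "robotics_b", "robotics_c"],
    "driving":  ["driving_a", "driving_b", "driving_c"],
    "3d":       ["static_3d_a", "static_3d_b", "static_3d_c"],
    "other":    ["other_a", "other_b", "other_c"],
}

def sample_batch(loaders: dict, batch_size: int = 32) -> list:
    """Draw 25% of the batch from each domain, picking a random dataset
    (and a random example from it) for every slot."""
    assert batch_size % len(DOMAINS) == 0
    batch = []
    for domain, datasets in DOMAINS.items():
        for _ in range(batch_size // len(DOMAINS)):
            name = random.choice(datasets)
            batch.append(loaders[name].sample())  # stand-in loader call
    return batch
```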

AnyView Datasets Figure

Kubric-5D

We introduce Kubric-5D, a newly generated variation of Kubric-4D that vastly increases the diversity of camera trajectories, incorporating advanced filmmaking effects such as the dolly zoom. These scenes contain multi-object interactions with rich visual appearance and complicated dynamics, with synchronized videos from multiple viewpoints covering a diverse range of camera motions.

Download link coming soon!

AnyViewBench

We propose AnyViewBench, a multi-faceted benchmark that covers datasets across multiple domains.

Download link coming soon!

Paper

Abstract
Modern generative video models excel at producing convincing, high-quality outputs, but struggle to maintain multi-view and spatiotemporal consistency in highly dynamic real-world environments. In this work, we introduce AnyView, a diffusion-based video generation framework for dynamic view synthesis with minimal inductive biases or geometric assumptions. We leverage multiple data sources with various levels of supervision, including monocular (2D), multi-view static (3D) and multi-view dynamic (4D) datasets, to train a generalist spatiotemporal implicit representation capable of producing zero-shot novel videos from arbitrary camera locations and trajectories. We evaluate AnyView on standard benchmarks, showing competitive results with the current state of the art, and propose AnyViewBench, a challenging new benchmark tailored towards extreme dynamic view synthesis in diverse real-world scenarios. In this more dramatic setting, we find that most baselines drastically degrade in performance, as they require significant overlap between viewpoints, while AnyView maintains the ability to produce realistic, plausible, and spatiotemporally consistent videos when prompted from any viewpoint.
BibTeX Citation
@inproceedings{vanhoorick2026anyview,
  title={AnyView: Synthesizing Any Novel View in Dynamic Scenes},
  author={Van Hoorick, Basile and Chen, Dian and Iwase, Shun and Tokmakov, Pavel and Irshad, Muhammad Zubair and Vasiljevic, Igor and Gupta, Swati and Cheng, Fangzhou and Zakharov, Sergey and Guizilini, Vitor Campagnolo},
  booktitle={In Submission},
  year={2026}
}

All Figures

Below is a comprehensive gallery of videos corresponding to all figures in the paper, including comparisons with baseline methods and supplementary visualizations.

Video Presentation

Coming soon!

Acknowledgements

The webpage template was inspired by this project page.