Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas

CVPR 2026 Findings

Google · University of Oxford · MCML · Technical University of Munich
Stepper teaser image showing panoramic scene generation

Abstract

The synthesis of immersive 3D scenes from text is rapidly maturing, driven by advances in video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches suffer from a trade-off between visual fidelity and explorability: autoregressive expansion is prone to context drift, while panoramic video generation is limited to low resolution. We present Stepper, a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion. Stepper combines a novel multi-view 360° diffusion model that enables consistent, high-resolution exploration with a geometry reconstruction pipeline that enforces geometric coherence. Trained on a new large-scale multi-view panorama dataset, Stepper achieves state-of-the-art fidelity and structural consistency, outperforming prior approaches and setting a new standard for immersive scene generation.

Key Contributions

  • Multi-view Panorama Diffusion Model: Novel diffusion model that enables iterative scene expansion by "stepping" into generated scenes using full panoramic contexts, minimizing context drift while maintaining high resolution
  • Robust Reconstruction Framework: Pipeline that converts multi-view panoramas to perspective views, processes them with feed-forward SfM (MapAnything) for dense point cloud recovery, and optimizes 3D Gaussian Splatting for real-time exploration
  • Large-Scale Multi-view Panorama Dataset: 230,000 samples at 4096×2048 resolution across 5,000 procedurally generated scenes from Infinigen, addressing the scarcity of multi-view panoramic data for training and evaluation

Qualitative Results

Baseline comparisons showing Stepper vs. existing methods

Scenes: Barbershop, Blue Room, Classroom, Mountain Cabin, Valley

Qualitative examples: Stepper flythroughs

Scenes: Castle, Rosetrees

Method Overview

Stepper follows a four-stage pipeline for immersive scene generation:

  1. Multi-view Panorama Generation: Synthesizes novel panoramic views by stepping ~25 cm forward from the input panorama, using a cubemap representation (6 faces × 2 viewpoints = 12 views in total)
  2. Perspective View Extraction: Converts cubemap faces to perspective views suitable for 3D reconstruction
  3. Feed-forward 3D Reconstruction: Uses MapAnything model for initial 3D scene estimation
  4. 3D Gaussian Splatting Optimization: Refines reconstruction for real-time exploration and rendering


Method pipeline: From input panorama to stepwise 3D scene generation

Dataset

We introduce a large-scale synthetic dataset for training panorama-based scene generation:

  • Scale: 230,000 panorama pairs at 4096×2048 resolution
  • Generation: 5,000 unique scenes created with Infinigen procedural generator
  • Test Set: 6 photorealistic Blender scenes + 10 Infinigen scenes for evaluation
  • Stepping Distance: ~25cm forward movement between consecutive panoramas
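The ~25 cm stepping distance implies a simple chain of camera centres for a trajectory of consecutive panoramas. A small sketch, assuming a +z "forward" axis (Stepper's actual coordinate convention is not specified):

```python
import numpy as np

STEP_M = 0.25  # ~25 cm between consecutive panoramas, per the dataset spec

def step_positions(n_steps: int, forward=None) -> np.ndarray:
    """Camera centres for a chain of panoramas, each STEP_M along `forward`.

    `forward` defaults to +z; the axis choice is an assumption here.
    """
    f = np.array([0.0, 0.0, 1.0]) if forward is None else np.asarray(forward, float)
    f = f / np.linalg.norm(f)
    return np.arange(n_steps)[:, None] * STEP_M * f
```

Four panoramas thus span 75 cm of forward travel, which bounds how far a single generation step can explore before the model is re-queried.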

Dataset visualization showing Infinigen-generated scenes with panoramic views

Quantitative Results

Stepper consistently outperforms baseline methods across all evaluation metrics:

Method          PSNR ↑   SSIM ↑   LPIPS ↓
WorldExplorer   11.864    0.674     0.739
LayerPano3D     18.305    0.783     0.509
Matrix-3D       18.532    0.753     0.502
Stepper (Ours)  21.775    0.797     0.430

Average Results

Method          PSNR ↑   SSIM ↑   LPIPS ↓
WorldExplorer   13.145    0.624     0.648
LayerPano3D     17.931    0.688     0.503
Matrix-3D       18.133    0.665     0.515
Stepper (Ours)  21.426    0.735     0.385
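For reference, the PSNR figures above follow the standard definition from mean squared error. A minimal numpy sketch (SSIM and LPIPS require learned or windowed comparisons and are omitted here):

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio (dB) between images with values in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Higher PSNR is better, so Stepper's ~3 dB margin over the closest baseline corresponds to roughly a 2× reduction in mean squared error.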

BibTeX

@misc{wimbauer2026stepper,
  title={Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas},
  author={Wimbauer, Felix and Manhardt, Fabian and Oechsle, Michael and Kalischek, Nikolai and Rupprecht, Christian and Cremers, Daniel and Tombari, Federico},
  year={2026},
  eprint={2603.xxxxx},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Acknowledgments

We thank the open-source community for providing the Infinigen, MapAnything, and 3D Gaussian Splatting frameworks that made this work possible.