Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas

Abstract

The synthesis of immersive 3D scenes from text is rapidly maturing, driven by novel video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches face a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution. We present Stepper, a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion. Stepper leverages a novel multi-view 360° diffusion model that enables consistent, high-resolution exploration, coupled with a geometry reconstruction pipeline that enforces geometric coherence. Trained on a new large-scale, multi-view panorama dataset, Stepper achieves state-of-the-art fidelity and structural consistency, outperforming prior approaches on PSNR, SSIM, and LPIPS.

Key Contributions

Multi-view Panorama Diffusion Model: A novel diffusion model that enables iterative scene expansion by "stepping" into generated scenes with full panoramic context, minimizing context drift while maintaining high resolution.
Robust Reconstruction Framework: A pipeline that converts multi-view panoramas to perspective views, processes them with a feed-forward SfM model (MapAnything) to recover a dense point cloud, and optimizes a 3D Gaussian Splatting representation for real-time exploration.
Large-Scale Multi-view Panorama Dataset: 230,000 samples at 4096×2048 resolution across 5,000 scenes procedurally generated with Infinigen, addressing the scarcity of multi-view panoramic data for training and evaluation.

Qualitative Results

Baseline comparisons showing Stepper vs. existing methods

Qualitative examples: Stepper flythroughs

Method Overview

Stepper follows a four-stage pipeline for immersive scene generation:

Multi-view Panorama Generation: Synthesizes a novel panoramic view ~25 cm forward of the input panorama, using a cubemap representation (6 faces × 2 viewpoints = 12 views in total)
Perspective View Extraction: Converts cubemap faces to perspective views suitable for 3D reconstruction
Feed-forward 3D Reconstruction: Uses the MapAnything model for an initial 3D scene estimate
3D Gaussian Splatting Optimization: Refines reconstruction for real-time exploration and rendering

Method pipeline: From input panorama to stepwise 3D scene generation
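
The perspective view extraction stage can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `perspective_from_equirect` and `cubemap_views`, the `CUBE_FACES` yaw/pitch table, the axis conventions, and the nearest-neighbour sampling are all assumptions chosen for brevity. The sketch samples six 90° pinhole views, one per cubemap face, from an equirectangular panorama, so two panoramas would yield the 12 views mentioned above.

```python
import numpy as np

# Hypothetical face table: (yaw, pitch) in degrees for each cubemap face.
CUBE_FACES = {
    "front": (0.0, 0.0), "right": (90.0, 0.0), "back": (180.0, 0.0),
    "left": (-90.0, 0.0), "up": (0.0, 90.0), "down": (0.0, -90.0),
}

def perspective_from_equirect(pano, yaw_deg, pitch_deg, fov_deg=90.0, size=256):
    """Sample a square pinhole view from an equirectangular panorama.

    Camera axes: x right, y down, z forward. Positive yaw turns right,
    positive pitch looks up. Sampling is nearest neighbour.
    """
    h, w = pano.shape[:2]
    f = 0.5 * size / np.tan(np.radians(fov_deg) / 2.0)  # focal length in pixels

    # One unit ray per pixel centre, in camera coordinates.
    u, v = np.meshgrid(np.arange(size) + 0.5, np.arange(size) + 0.5)
    dirs = np.stack([u - size / 2.0, v - size / 2.0, np.full_like(u, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate rays into the panorama frame: pitch about x, then yaw about y.
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(pitch), -np.sin(pitch)],
                   [0.0, np.sin(pitch), np.cos(pitch)]])
    ry = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(yaw), 0.0, np.cos(yaw)]])
    dirs = dirs @ (ry @ rx).T

    # Ray direction -> longitude/latitude -> panorama pixel.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])          # 0 = forward
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))     # positive = down
    px = ((lon / (2.0 * np.pi) + 0.5) * w).astype(int) % w
    py = np.clip(((lat / np.pi + 0.5) * h).astype(int), 0, h - 1)
    return pano[py, px]

def cubemap_views(pano, size=256):
    """Return the six 90-degree face views of one panorama."""
    return {name: perspective_from_equirect(pano, yaw, pitch, 90.0, size)
            for name, (yaw, pitch) in CUBE_FACES.items()}
```

Calling `cubemap_views` on the input panorama and on the generated one gives 6 + 6 perspective images. A practical pipeline would likely use bilinear interpolation and possibly more than six views per panorama; the source does not specify either.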

Dataset

We introduce a large-scale synthetic dataset for training panorama-based scene generation:

Scale: 230,000 panorama pairs at 4096×2048 resolution
Generation: 5,000 unique scenes created with the Infinigen procedural generator
Test Set: 6 photorealistic Blender scenes and 10 Infinigen scenes for evaluation
Stepping Distance: ~25cm forward movement between consecutive panoramas

Dataset visualization showing Infinigen-generated scenes with panoramic views
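
As a quick sanity check on the dataset figures, the snippet below derives a few quantities from the numbers stated above. It assumes the 2:1 panoramas are equirectangular (360° × 180°), which the resolution suggests but the page does not state, and it reports only a per-scene average, since the actual distribution of samples across scenes is not given.

```python
NUM_SAMPLES = 230_000
NUM_SCENES = 5_000
WIDTH, HEIGHT = 4096, 2048

# Average number of samples per scene (the real distribution is not stated).
samples_per_scene = NUM_SAMPLES / NUM_SCENES

# Pixel count of one panorama.
pixels_per_panorama = WIDTH * HEIGHT

# Horizontal angular resolution, assuming an equirectangular 360-degree layout.
degrees_per_pixel = 360.0 / WIDTH

print(f"samples per scene (average): {samples_per_scene:.0f}")
print(f"pixels per panorama: {pixels_per_panorama:,}")
print(f"horizontal resolution: {degrees_per_pixel:.4f} degrees/pixel")
```

This works out to an average of 46 samples per scene and about 8.4 megapixels per panorama.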

Quantitative Results

Stepper achieves the best PSNR, SSIM, and LPIPS among all compared methods on each of the three test splits and on average:

Infinigen Indoors

Method	PSNR ↑	SSIM ↑	LPIPS ↓
WorldExplorer	11.864	0.674	0.739
LayerPano3D	18.305	0.783	0.509
Matrix-3D	18.532	0.753	0.502
Stepper (Ours)	21.775	0.797	0.430

Infinigen Outdoors

Method	PSNR ↑	SSIM ↑	LPIPS ↓
WorldExplorer	13.912	0.561	0.594
LayerPano3D	17.364	0.590	0.537
Matrix-3D	17.970	0.581	0.529
Stepper (Ours)	20.507	0.646	0.384

Blender Scenes

Method	PSNR ↑	SSIM ↑	LPIPS ↓
WorldExplorer	13.659	0.637	0.611
LayerPano3D	18.124	0.692	0.463
Matrix-3D	17.898	0.660	0.515
Stepper (Ours)	21.995	0.762	0.342

Average Results

Method	PSNR ↑	SSIM ↑	LPIPS ↓
WorldExplorer	13.145	0.624	0.648
LayerPano3D	17.931	0.688	0.503
Matrix-3D	18.133	0.665	0.515
Stepper (Ours)	21.426	0.735	0.385
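
The averages can be reproduced from the three per-split tables. The sketch below assumes they are unweighted means over the three splits, which the page does not state explicitly; the variable and function names are illustrative. Each tuple holds (PSNR, SSIM, LPIPS) for one split, in the order the tables appear.

```python
# Per-split results copied from the three tables: (PSNR, SSIM, LPIPS).
PER_SPLIT = {
    "WorldExplorer": [(11.864, 0.674, 0.739), (13.912, 0.561, 0.594), (13.659, 0.637, 0.611)],
    "LayerPano3D": [(18.305, 0.783, 0.509), (17.364, 0.590, 0.537), (18.124, 0.692, 0.463)],
    "Matrix-3D": [(18.532, 0.753, 0.502), (17.970, 0.581, 0.529), (17.898, 0.660, 0.515)],
    "Stepper (Ours)": [(21.775, 0.797, 0.430), (20.507, 0.646, 0.384), (21.995, 0.762, 0.342)],
}

# Averages as reported in the "Average Results" table.
REPORTED_AVERAGE = {
    "WorldExplorer": (13.145, 0.624, 0.648),
    "LayerPano3D": (17.931, 0.688, 0.503),
    "Matrix-3D": (18.133, 0.665, 0.515),
    "Stepper (Ours)": (21.426, 0.735, 0.385),
}

def mean_metrics(rows):
    """Unweighted mean of each metric column over the given splits."""
    return tuple(sum(column) / len(rows) for column in zip(*rows))

def agrees(computed, reported, tol=0.0005):
    """True if every metric matches the reported value up to 3-decimal rounding."""
    return all(abs(c - r) <= tol + 1e-9 for c, r in zip(computed, reported))

computed_average = {name: mean_metrics(rows) for name, rows in PER_SPLIT.items()}

for name, metrics in computed_average.items():
    psnr, ssim, lpips = metrics
    status = "matches" if agrees(metrics, REPORTED_AVERAGE[name]) else "differs"
    print(f"{name}: PSNR {psnr:.3f}, SSIM {ssim:.3f}, LPIPS {lpips:.3f} ({status})")
```

Under this assumption, every computed mean agrees with the reported average to three decimal places.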

BibTeX

@misc{wimbauer2026stepper,
  title={Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas},
  author={Wimbauer, Felix and Manhardt, Fabian and Oechsle, Michael and Kalischek, Nikolai and Rupprecht, Christian and Cremers, Daniel and Tombari, Federico},
  year={2026},
  eprint={2603.xxxxx},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Acknowledgments

We thank the open-source community for providing the Infinigen, MapAnything, and 3D Gaussian Splatting frameworks that made this work possible.