We introduce Stereo CARLA, a challenging dataset designed for stereo depth estimation. Collected using the CARLA simulator, our dataset captures diverse driving scenarios under varying lighting conditions, weather, and dynamic environments. The use of simulation enables the generation of high-quality multi-modal sensor data with precise ground truth, including stereo RGB images and depth maps. To enhance data diversity, we create numerous virtual environments featuring distinct scene styles, varied viewpoints, and complex angles, which are difficult to achieve with real-world data collection. We assess the influence of different factors on stereo depth estimation and visual perception tasks. Experiments with state-of-the-art models demonstrate that methods trained on traditional datasets like KITTI struggle with more challenging scenarios. While our dataset originates from simulation, it aims to advance stereo depth estimation in real-world applications by serving as both a rigorous benchmark for evaluation and a large-scale training resource for learning-based methods.
In/Out/Dy/W/Acc./Divers. refer to Indoor/Outdoor/Dynamic/Weather/Accuracy/Diversity. FL refers to focal length. Range refers to the disparity range. Ave./Med. refer to the average/median disparity.
Number of Samples Collected Across Different Towns and Camera Configurations. Each camera configuration includes baselines of 10, 54, 100, 200, and 300cm.
Relative to the baseline (top row), green/red indicates a performance improvement/decline. Best results are in bold, second-best underlined.
EPE is used for disparity evaluation. AR denotes the Absolute Relative Error (Abs Rel), and δ1 denotes the threshold accuracy δ < 1.25 for depth evaluation. Best results are in bold.
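For reference, below is a minimal sketch of how these metrics (EPE for disparity, Abs Rel and δ1 for depth) are commonly computed. The function names and the validity-mask convention are illustrative assumptions, not part of our released evaluation code.

```python
# Minimal sketch of the metrics referenced above, assuming NumPy arrays of
# predicted and ground-truth disparity/depth. Mask handling is an assumption.
import numpy as np

def epe(pred_disp, gt_disp, mask=None):
    """End-point error: mean absolute disparity difference over valid pixels."""
    if mask is None:
        mask = gt_disp > 0
    return np.abs(pred_disp[mask] - gt_disp[mask]).mean()

def abs_rel(pred_depth, gt_depth, mask=None):
    """Absolute relative error: mean of |pred - gt| / gt over valid pixels."""
    if mask is None:
        mask = gt_depth > 0
    return (np.abs(pred_depth[mask] - gt_depth[mask]) / gt_depth[mask]).mean()

def delta1(pred_depth, gt_depth, mask=None, thresh=1.25):
    """Threshold accuracy: fraction of pixels with max(pred/gt, gt/pred) < thresh."""
    if mask is None:
        mask = gt_depth > 0
    ratio = np.maximum(pred_depth[mask] / gt_depth[mask],
                       gt_depth[mask] / pred_depth[mask])
    return (ratio < thresh).mean()
```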
Accurate depth estimation is a critical capability for autonomous driving systems, enabling perception, obstacle detection, and scene understanding. With the widespread availability of stereo camera setups, binocular depth estimation has become an essential component in autonomous navigation. Significant progress has been made using both geometric-based and learning-based approaches. However, achieving robust and reliable depth estimation in real-world autonomous driving remains a challenging task. Real-world driving environments present numerous difficulties, such as varying lighting conditions, dynamic traffic participants, and texture-less road surfaces. Existing popular benchmarks like KITTI and other stereo datasets cover only a limited range of scenarios and motion patterns compared to real-world driving conditions.
We construct a large-scale binocular dataset for autonomous driving scenarios using the CARLA simulator. To narrow the sim-to-real gap, we generate data from diverse urban and suburban environments with various road layouts and dynamic elements. Our dataset particularly emphasizes challenging conditions, including varying illumination, adverse weather, and moving obstacles.
The three most important features of our dataset are:
- Six viewpoint configurations (normal, pitch00, pitch30, roll05, roll15, and roll30), including large pitch and roll angles that are difficult to capture with real-world stereo rigs.
- Multiple weather and lighting conditions for every scene.
- Five stereo baselines per frame (10 cm to 300 cm) with dense ground-truth depth for both left and right cameras.
The dataset is released via Baidu Netdisk. Please download the data here.
The dataset is composed of 8 different outdoor autonomous driving scenes containing transparent or reflective objects. Each scene is acquired under several weather conditions and 6 viewpoint conditions, for a total of 8 × 6 × 2500 frames at 1600 × 900 resolution; because each frame provides 5 right images in addition to the left image, over 1.8M stereo pairs can be constructed from all pairwise camera combinations. For every frame, we release high-resolution stereo images, left and right depth ground truth, and calibration parameters.
The dataset consists of seven folders, including six different viewpoint settings: normal, pitch00, pitch30, roll05, roll15, and roll30, as well as a weather folder that provides variations under different weather conditions.
Each folder contains sequences from eight different urban scenes, with each sequence consisting of approximately 2500 frames captured at 0.05s intervals.
For each frame, the dataset includes a left image and five right images with different baselines (10cm, 54cm, 100cm, 200cm, and 300cm). Each image set includes RGB images, depth maps, and camera extrinsic parameters. The camera intrinsic parameters remain unchanged across all sequences and are stored in intrinsic.txt.
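As an illustration of how a frame might be loaded and its depth converted to disparity via the standard stereo relation disparity = f_x · B / Z, here is a minimal sketch. The file and folder names below (e.g., rgb_left.png, depth_left.npy, town01) are hypothetical placeholders; only intrinsic.txt and the five baselines come from the description above.

```python
# Minimal loading sketch. File layout and naming are assumptions for illustration;
# the dataset's actual conventions may differ.
import os
import numpy as np
import cv2

BASELINES_CM = [10, 54, 100, 200, 300]

def load_intrinsics(root):
    """Read the shared camera intrinsics (assumed to be a 3x3 matrix in intrinsic.txt)."""
    return np.loadtxt(os.path.join(root, "intrinsic.txt")).reshape(3, 3)

def depth_to_disparity(depth_m, fx, baseline_cm):
    """disparity = fx * B / Z, with the baseline converted from cm to meters."""
    baseline_m = baseline_cm / 100.0
    disp = np.zeros_like(depth_m)
    valid = depth_m > 0
    disp[valid] = fx * baseline_m / depth_m[valid]
    return disp

# Hypothetical usage for one frame of one sequence:
root = "StereoCARLA"                                            # dataset root (assumption)
K = load_intrinsics(root)
frame_dir = os.path.join(root, "normal", "town01", "000000")    # viewpoint/scene/frame (assumption)
left = cv2.imread(os.path.join(frame_dir, "rgb_left.png"))
depth_left = np.load(os.path.join(frame_dir, "depth_left.npy")) # metric depth in meters (assumption)
disp_54 = depth_to_disparity(depth_left, fx=K[0, 0], baseline_cm=54)
```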
Using the depth maps, the dataset therefore supports the construction of more than 6 × 8 × 2500 × 15 stereo pairs in total (6 viewpoint settings × 8 scenes × 2500 frames × 15 camera pairs per frame), i.e., over 1.8M pairs.
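The factor of 15 follows from the fact that each frame provides 6 images (1 left + 5 right), and any two of them can form a stereo pair, giving C(6, 2) = 15 pairs per frame. A minimal sketch of this arithmetic, with illustrative camera names:

```python
# Pair-count arithmetic only; the camera names are illustrative placeholders.
from itertools import combinations

cameras = ["left", "right_10cm", "right_54cm", "right_100cm", "right_200cm", "right_300cm"]
pairs_per_frame = list(combinations(cameras, 2))
assert len(pairs_per_frame) == 15  # C(6, 2)

viewpoints, scenes, frames = 6, 8, 2500
total_pairs = viewpoints * scenes * frames * len(pairs_per_frame)
print(total_pairs)  # 1,800,000
```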
Please read our paper for details.
@article{StereoAnything,
title={Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data},
author={Xianda Guo and Chenming Zhang and Youmin Zhang and Dujun Nie and Ruilin Wang and Wenzhao Zheng and Matteo Poggi and Long Chen},
year={2024}
}