We introduce Stereo CARLA, a challenging dataset designed for stereo depth estimation. Collected using the CARLA simulator, our dataset captures diverse driving scenarios under varying lighting conditions, weather, and dynamic environments. The use of simulation enables the generation of high-quality multi-modal sensor data with precise ground truth, including stereo RGB images and depth maps. To enhance data diversity, we create numerous virtual environments featuring distinct scene styles, varied viewpoints, and complex angles, which are difficult to achieve with real-world data collection. We assess the influence of different factors on stereo depth estimation and visual perception tasks. Experiments with state-of-the-art models demonstrate that methods trained on traditional datasets like KITTI struggle with more challenging scenarios. While our dataset originates from simulation, it aims to advance stereo depth estimation in real-world applications by serving as both a rigorous benchmark for evaluation and a large-scale training resource for learning-based methods.
In/Out/Dy/W/Acc./Divers. refer to Indoor/Outdoor/Dynamic/Weather/Accuracy/Diversity. FL refers to focal length. Range refers to the disparity range. Ave./Med. refer to the average/median disparity.
Number of Samples Collected Across Different Towns and Camera Configurations. Each camera configuration includes baselines of 10, 54, 100, 200, and 300cm.
Relative to the baseline (top row), green/red indicates a performance improvement/decline. Best results are in bold, second-best underlined.
EPE is used for disparity evaluation. AR denotes the Absolute Relative Error (Abs Rel), and δ1 denotes the threshold accuracy δ < 1.25 for depth evaluation. Best results are in bold.
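For reference, below is a minimal sketch of how these metrics (EPE for disparity, Abs Rel and δ1 for depth) are commonly computed. The function names and the validity-mask convention are illustrative assumptions, not part of our released evaluation code.

```python
# Minimal sketch of the metrics referenced above, assuming NumPy arrays of
# predicted and ground-truth disparity/depth. Mask handling is an assumption.
import numpy as np

def epe(pred_disp, gt_disp, mask=None):
    """End-point error: mean absolute disparity difference over valid pixels."""
    if mask is None:
        mask = gt_disp > 0
    return np.abs(pred_disp[mask] - gt_disp[mask]).mean()

def abs_rel(pred_depth, gt_depth, mask=None):
    """Absolute relative error: mean of |pred - gt| / gt over valid pixels."""
    if mask is None:
        mask = gt_depth > 0
    return (np.abs(pred_depth[mask] - gt_depth[mask]) / gt_depth[mask]).mean()

def delta1(pred_depth, gt_depth, mask=None, thresh=1.25):
    """Threshold accuracy: fraction of pixels with max(pred/gt, gt/pred) < thresh."""
    if mask is None:
        mask = gt_depth > 0
    ratio = np.maximum(pred_depth[mask] / gt_depth[mask],
                       gt_depth[mask] / pred_depth[mask])
    return (ratio < thresh).mean()
```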
Accurate depth estimation is a critical capability for autonomous driving systems, enabling perception, obstacle detection, and scene understanding. With the widespread availability of stereo camera setups, binocular depth estimation has become an essential component in autonomous navigation. Significant progress has been made using both geometric-based and learning-based approaches. However, achieving robust and reliable depth estimation in real-world autonomous driving remains a challenging task. Real-world driving environments present numerous difficulties, such as varying lighting conditions, dynamic traffic participants, and texture-less road surfaces. Existing popular benchmarks like KITTI and other stereo datasets cover only a limited range of scenarios and motion patterns compared to real-world driving conditions.
We construct a large-scale binocular dataset for autonomous driving scenarios using the CARLA simulator. To narrow the sim-to-real gap, we generate data from diverse urban and suburban environments with various road layouts and dynamic elements. Our dataset particularly emphasizes challenging conditions, including varying illumination, adverse weather, and moving obstacles.
The three most important features of our dataset are:
- Six viewpoint configurations (normal, pitch00, pitch30, roll05, roll15, and roll30), including large pitch and roll angles that are difficult to capture with real-world stereo rigs.
- Multiple weather and lighting conditions for every scene.
- Five stereo baselines per frame (10 cm to 300 cm) with dense ground-truth depth for both left and right cameras.
The dataset is released via Baidu Netdisk. Please download the data here.
The dataset is composed of 8 different outdoor autonomous driving scenes containing transparent or reflective objects. Each scene is acquired under several weather conditions and 6 viewpoint conditions, for a total of 8 × 6 × 2500 frames at 1600 × 900 resolution; because each frame provides 5 right images in addition to the left image, over 1.8M stereo pairs can be constructed from all pairwise camera combinations. For every frame, we release high-resolution stereo images, left and right depth ground truth, and calibration parameters.
The dataset consists of seven folders, including six different viewpoint settings: normal, pitch00, pitch30, roll05, roll15, and roll30, as well as a weather folder that provides variations under different weather conditions.
Each folder contains sequences from eight different urban scenes, with each sequence consisting of approximately 2500 frames captured at 0.05s intervals.
For each frame, the dataset includes a left image and five right images with different baselines (10cm, 54cm, 100cm, 200cm, and 300cm). Each image set includes RGB images, depth maps, and camera extrinsic parameters. The camera intrinsic parameters remain unchanged across all sequences and are stored in intrinsic.txt.
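As an illustration of how a frame might be loaded and its depth converted to disparity via the standard stereo relation disparity = f_x · B / Z, here is a minimal sketch. The file and folder names below (e.g., rgb_left.png, depth_left.npy, town01) are hypothetical placeholders; only intrinsic.txt and the five baselines come from the description above.

```python
# Minimal loading sketch. File layout and naming are assumptions for illustration;
# the dataset's actual conventions may differ.
import os
import numpy as np
import cv2

BASELINES_CM = [10, 54, 100, 200, 300]

def load_intrinsics(root):
    """Read the shared camera intrinsics (assumed to be a 3x3 matrix in intrinsic.txt)."""
    return np.loadtxt(os.path.join(root, "intrinsic.txt")).reshape(3, 3)

def depth_to_disparity(depth_m, fx, baseline_cm):
    """disparity = fx * B / Z, with the baseline converted from cm to meters."""
    baseline_m = baseline_cm / 100.0
    disp = np.zeros_like(depth_m)
    valid = depth_m > 0
    disp[valid] = fx * baseline_m / depth_m[valid]
    return disp

# Hypothetical usage for one frame of one sequence:
root = "StereoCARLA"                                            # dataset root (assumption)
K = load_intrinsics(root)
frame_dir = os.path.join(root, "normal", "town01", "000000")    # viewpoint/scene/frame (assumption)
left = cv2.imread(os.path.join(frame_dir, "rgb_left.png"))
depth_left = np.load(os.path.join(frame_dir, "depth_left.npy")) # metric depth in meters (assumption)
disp_54 = depth_to_disparity(depth_left, fx=K[0, 0], baseline_cm=54)
```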
Using the depth maps, the dataset therefore supports the construction of more than 6 × 8 × 2500 × 15 stereo pairs in total (6 viewpoint settings × 8 scenes × 2500 frames × 15 camera pairs per frame), i.e., over 1.8M pairs.
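The factor of 15 follows from the fact that each frame provides 6 images (1 left + 5 right), and any two of them can form a stereo pair, giving C(6, 2) = 15 pairs per frame. A minimal sketch of this arithmetic, with illustrative camera names:

```python
# Pair-count arithmetic only; the camera names are illustrative placeholders.
from itertools import combinations

cameras = ["left", "right_10cm", "right_54cm", "right_100cm", "right_200cm", "right_300cm"]
pairs_per_frame = list(combinations(cameras, 2))
assert len(pairs_per_frame) == 15  # C(6, 2)

viewpoints, scenes, frames = 6, 8, 2500
total_pairs = viewpoints * scenes * frames * len(pairs_per_frame)
print(total_pairs)  # 1,800,000
```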
Please read our paper for details.
@article{StereoAnything,
title={Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data},
author={Xianda Guo and Chenming Zhang and Youmin Zhang and Dujun Nie and Ruilin Wang and Wenzhao Zheng and Matteo Poggi and Long Chen},
year={2024}
}