Stereo Anything

Unifying Stereo Matching with Large-Scale Mixed Data


Xianda Guo1*     Chenming Zhang2,3*    Youmin Zhang4,5    Dujun Nie6    Ruilin Wang6    Wenzhao Zheng7 Matteo Poggi4 Long Chen6,2,3†
1Wuhan University              2Xi'an Jiaotong University              3Waytous             4University of Bologna              5Rock Universe             6Institute of Automation, Chinese Academy of Sciences             7University of California, Berkeley

This work presents Stereo Anything, a highly practical solution for stereo estimation, trained on a combination of labeled stereo data and 30M+ unlabeled monocular images, with the following contributions:

  • A study of how different synthetic datasets affect the performance of trained stereo models.
  • A new synthetic dataset, StereoCarla, featuring unique view angles and baselines.
  • Scaled-up training data that incorporates both synthetic stereo data and diverse unlabeled monocular images.
  • A stereo model exhibiting the strongest zero-shot capability among existing networks.

Zero-Shot Comparison between Stereo Anything and NMRFStereo*

Abstract

Stereo matching has been a pivotal component in 3D vision, aiming to find corresponding points between pairs of stereo images to recover depth information. In this work, we introduce StereoAnything, a highly practical solution for robust stereo matching. Rather than focusing on a specialized model, our goal is to develop a versatile foundational model capable of handling stereo images across diverse environments. To this end, we scale up the dataset by collecting labeled stereo images and generating synthetic stereo pairs from unlabeled monocular images. To further enrich the model’s ability to generalize across different conditions, we introduce a novel synthetic dataset that complements existing data by adding variability in baselines, camera angles, and scene types. We extensively evaluate the zero-shot capabilities of our model on five public datasets, showcasing its impressive ability to generalize to new, unseen data.
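The pseudo-stereo idea described above (turning unlabeled monocular images into stereo training pairs) can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: we assume depth has already been predicted by some monocular depth model, and the helper names (`depth_to_disparity`, `pseudo_right_view`) are ours.

```python
import numpy as np

def depth_to_disparity(depth, focal_px, baseline_m, max_disp=192.0):
    """Convert metric depth to disparity via the pinhole relation d = f * B / Z."""
    return np.clip(focal_px * baseline_m / np.maximum(depth, 1e-6), 0.0, max_disp)

def pseudo_right_view(left, disparity):
    """Forward-warp a left image into a pseudo right view.

    Each left pixel (y, x) is shifted to (y, x - d). When several source
    pixels land on the same target, the one with the larger disparity
    (i.e. the nearer surface) wins, which resolves occlusions in favour
    of the foreground. Unfilled target pixels are left at zero; a real
    pipeline would inpaint or mask these holes.
    """
    h, w = disparity.shape
    right = np.zeros_like(left)
    best = np.full((h, w), -np.inf)  # largest disparity written to each target so far
    for y in range(h):
        for x in range(w):
            tx = int(round(x - disparity[y, x]))
            if 0 <= tx < w and disparity[y, x] > best[y, tx]:
                right[y, tx] = left[y, x]
                best[y, tx] = disparity[y, x]
    return right
```

The warped pair (left image, pseudo right view, disparity) can then be used as a labeled training sample alongside real and synthetic stereo data.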

Data Coverage

Our Stereo Anything is trained on a combination of 12 labeled datasets (1.3M+ images) and 5 unlabeled datasets (30M+ images), and tested on 5 labeled datasets.

* 17 labeled datasets, with 12 used for training and the remaining 5 reserved for zero-shot evaluation.
* 5 unlabeled datasets with over 30M images.

StereoCarla Dataset

To expand the diversity and quantity of existing stereo matching datasets, we utilized the CARLA simulator to collect new synthetic stereo data. Compared to previous stereo datasets, our approach offers more varied settings, providing different baselines and novel camera configurations that enhance the richness of stereo data.

* The first row illustrates the left-eye image (1st column) and right-eye images at varying baselines (2nd-6th columns). The second row shows the depth map (1st column) and the corresponding disparity maps (2nd-6th columns). The third row depicts left images from varied horizontal viewing angles and elevated viewpoints.
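Because the renders above share a single depth map across baselines, the per-baseline disparity maps in the second row follow directly from the pinhole relation d = f · B / Z. A small sketch of that relation, assuming a CARLA-style pinhole camera whose focal length is derived from the image width and horizontal field of view (function names and the specific values below are ours):

```python
import numpy as np

def focal_from_fov(width_px, fov_deg):
    """Pinhole focal length in pixels from image width and horizontal FOV:
    f = (W / 2) / tan(FOV / 2)."""
    return (width_px / 2.0) / np.tan(np.radians(fov_deg) / 2.0)

def disparity_maps(depth, focal_px, baselines_m):
    """One disparity map per baseline; disparity scales linearly with B."""
    return {b: focal_px * b / np.maximum(depth, 1e-6) for b in baselines_m}
```

Doubling the baseline doubles every disparity value, which is why a single rendered depth map yields ground truth for the whole set of stereo rigs at once.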

Comparison with SOTA methods

We compare StereoAnything with SOTA methods on KITTI12, KITTI15, Middlebury, ETH3D, and DrivingStereo.

* Lower is better. NMRFStereo refers to NMRFStereo-SwinT. Bold: best.
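For reference, "lower is better" here refers to standard stereo error metrics; a sketch of the two conventionally reported on these benchmarks (we assume end-point error and a bad-pixel rate; the simplified variant below omits KITTI D1's additional relative-error threshold):

```python
import numpy as np

def epe(pred, gt, valid):
    """End-point error: mean absolute disparity error over valid ground-truth pixels."""
    return np.abs(pred - gt)[valid].mean()

def bad_pixel_rate(pred, gt, valid, thresh=3.0):
    """Fraction of valid pixels whose disparity error exceeds `thresh` pixels."""
    err = np.abs(pred - gt)[valid]
    return (err > thresh).mean()
```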

Ablation study on different models

We conduct an ablation study that underscores the substantial impact of our proposed training strategy. Our results show that applying this strategy to various stereo backbones leads to significant performance improvements across all evaluated datasets.

* NMRFStereo refers to NMRFStereo-SwinT. † denotes our training strategy.

Cross-domain evaluation with fine-tuning on different single training sets

* Relative to the baseline (top row), green (red) indicates performance improvement (decline). Bold: best.

MIX setups

The table below showcases the detailed mix setups for cross-domain evaluation.


Cross-domain evaluation on different combinations

The results across these benchmarks for each mix configuration are shown below.


Ablation study of different monocular depth estimation methods


Ablation study of using pseudo-stereo generated from different datasets


Cross-domain evaluation combining mixed labeled stereo and pseudo-stereo data generated from different datasets


Citation

@article{guo2024stereo,
  title={Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data},
  author={Guo, Xianda and Zhang, Chenming and Zhang, Youmin and Nie, Dujun and Wang, Ruilin and Zheng, Wenzhao and Poggi, Matteo and Chen, Long},
  journal={arXiv preprint arXiv:2411.14053},
  year={2024}
}