Stereo matching has been a pivotal component in 3D vision, aiming to find corresponding points between pairs of stereo images to recover depth information. In this work, we introduce StereoAnything, a highly practical solution for robust stereo matching. Rather than focusing on a specialized model, our goal is to develop a versatile foundational model capable of handling stereo images across diverse environments. To this end, we scale up the dataset by collecting labeled stereo images and generating synthetic stereo pairs from unlabeled monocular images. To further enrich the model’s ability to generalize across different conditions, we introduce a novel synthetic dataset that complements existing data by adding variability in baselines, camera angles, and scene types. We extensively evaluate the zero-shot capabilities of our model on five public datasets, showcasing its impressive ability to generalize to new, unseen data.
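The synthetic stereo pairs mentioned above are commonly produced by estimating depth for a single image with a monocular network, converting it to disparity, and forward-warping the image into a pseudo right view. The sketch below illustrates that general recipe in NumPy; the function name, disparity scaling, and hole handling are illustrative assumptions rather than the paper's exact pipeline.

```python
import numpy as np

def synthesize_right_view(left, mono_depth, max_disp_frac=0.15):
    """Forward-warp a single image into a pseudo right view using monocular
    depth. This is a hypothetical sketch of the common recipe for turning
    unlabeled monocular images into stereo training pairs."""
    h, w, _ = left.shape

    # Convert relative depth to a disparity map scaled to a plausible baseline.
    inv_depth = 1.0 / np.clip(mono_depth, 1e-3, None)
    disp = (inv_depth - inv_depth.min()) / (inv_depth.max() - inv_depth.min() + 1e-6)
    disp = disp * max_disp_frac * w  # disparity in pixels

    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)

    # Forward warp row by row, far-to-near, so closer pixels (larger
    # disparity) overwrite farther ones where they collide.
    order = np.argsort(disp, axis=1)
    for y in range(h):
        for x in order[y]:
            xr = int(round(x - disp[y, x]))
            if 0 <= xr < w:
                right[y, xr] = left[y, x]
                filled[y, xr] = True

    # Occlusion holes remain unfilled here; in practice they are typically
    # inpainted or filled with random background texture.
    return right, disp, filled
```

The returned disparity map then serves as the pseudo ground truth for the generated pair.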
Our StereoAnything is trained on a mixture of 12 labeled datasets (1.3M+ images) and 5 unlabeled datasets (30M+ images), and evaluated on 5 labeled datasets.
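Training on such a mixture is commonly implemented by concatenating the labeled and pseudo-labeled datasets and re-balancing the sampler so the much larger pseudo-labeled pool does not dominate. A minimal PyTorch sketch, using stub datasets in place of the real ones (names, sizes, and weighting are illustrative assumptions):

```python
import torch
from torch.utils.data import Dataset, ConcatDataset, DataLoader, WeightedRandomSampler

class StereoDatasetStub(Dataset):
    """Stand-in for a real stereo dataset (labeled or pseudo-labeled);
    returns (left, right, disparity) tensors."""
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        left = torch.rand(3, 256, 512)
        right = torch.rand(3, 256, 512)
        disp = torch.rand(256, 512) * 192.0
        return left, right, disp

labeled = [StereoDatasetStub(1_000) for _ in range(3)]   # ground-truth disparity
pseudo = [StereoDatasetStub(10_000) for _ in range(2)]   # pairs generated from monocular images

mixed = ConcatDataset(labeled + pseudo)

# Weight samples inversely to their dataset size so each source contributes
# roughly equally per epoch.
weights = torch.cat([torch.full((len(d),), 1.0 / len(d)) for d in labeled + pseudo])
sampler = WeightedRandomSampler(weights, num_samples=len(mixed))
loader = DataLoader(mixed, batch_size=8, sampler=sampler)
```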
We compare StereoAnything with state-of-the-art methods on KITTI12, KITTI15, Middlebury, ETH3D, and DrivingStereo.
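Zero-shot performance on these benchmarks is usually reported as the end-point error (EPE) and a bad-pixel outlier rate (D1 on KITTI). A small NumPy sketch of both metrics is given below; the thresholds follow the common KITTI convention and are not necessarily the exact evaluation code used here.

```python
import numpy as np

def epe(pred, gt, valid):
    """Mean end-point error over valid ground-truth pixels."""
    return np.abs(pred - gt)[valid].mean()

def bad_pixel_rate(pred, gt, valid, thresh=3.0, rel=0.05):
    """KITTI D1-style outlier rate: a pixel is an outlier if its disparity
    error exceeds `thresh` pixels AND `rel` of the ground-truth disparity."""
    err = np.abs(pred - gt)
    bad = (err > thresh) & (err > rel * np.abs(gt))
    return bad[valid].mean()
```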
We conduct an ablation study that underscores the substantial impact of our proposed training strategy. Our results show that applying this strategy to various stereo backbones leads to significant performance improvements across all evaluated datasets.
The table below details the dataset mix setups used for cross-domain evaluation, together with the results each mix configuration achieves across these benchmarks.
@article{guo2024stereo,
title={Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data},
author={Guo, Xianda and Zhang, Chenming and Zhang, Youmin and Wang, Ruilin and Nie, Dujun and Zheng, Wenzhao and Poggi, Matteo and Zhao, Hao and Ye, Mang and Zou, Qin and Chen, Long},
journal={arXiv preprint arXiv:2411.14053},
year={2024}
}