Stereo matching has been a pivotal component in 3D vision, aiming to find corresponding points between pairs of stereo images to recover depth information. In this work, we introduce StereoAnything, a highly practical solution for robust stereo matching. Rather than focusing on a specialized model, our goal is to develop a versatile foundational model capable of handling stereo images across diverse environments. To this end, we scale up the dataset by collecting labeled stereo images and generating synthetic stereo pairs from unlabeled monocular images. To further enrich the model’s ability to generalize across different conditions, we introduce a novel synthetic dataset that complements existing data by adding variability in baselines, camera angles, and scene types. We extensively evaluate the zero-shot capabilities of our model on five public datasets, showcasing its impressive ability to generalize to new, unseen data.
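The synthetic stereo pairs mentioned above are commonly produced by estimating depth for a single image with a monocular network, converting it to disparity, and forward-warping the image into a pseudo right view. The sketch below illustrates that general recipe in NumPy; the function name, disparity scaling, and hole handling are illustrative assumptions rather than the paper's exact pipeline.

```python
import numpy as np

def synthesize_right_view(left, mono_depth, max_disp_frac=0.15):
    """Forward-warp a single image into a pseudo right view using monocular
    depth. This is a hypothetical sketch of the common recipe for turning
    unlabeled monocular images into stereo training pairs."""
    h, w, _ = left.shape

    # Convert relative depth to a disparity map scaled to a plausible baseline.
    inv_depth = 1.0 / np.clip(mono_depth, 1e-3, None)
    disp = (inv_depth - inv_depth.min()) / (inv_depth.max() - inv_depth.min() + 1e-6)
    disp = disp * max_disp_frac * w  # disparity in pixels

    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)

    # Forward warp row by row, far-to-near, so closer pixels (larger
    # disparity) overwrite farther ones where they collide.
    order = np.argsort(disp, axis=1)
    for y in range(h):
        for x in order[y]:
            xr = int(round(x - disp[y, x]))
            if 0 <= xr < w:
                right[y, xr] = left[y, x]
                filled[y, xr] = True

    # Occlusion holes remain unfilled here; in practice they are typically
    # inpainted or filled with random background texture.
    return right, disp, filled
```

The returned disparity map then serves as the pseudo ground truth for the generated pair.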
Our StereoAnything is trained on a mixture of 12 labeled datasets (1.3M+ images) and 5 unlabeled datasets (30M+ images), and evaluated on 5 labeled datasets.
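Training on such a mixture is commonly implemented by concatenating the labeled and pseudo-labeled datasets and re-balancing the sampler so the much larger pseudo-labeled pool does not dominate. A minimal PyTorch sketch, using stub datasets in place of the real ones (names, sizes, and weighting are illustrative assumptions):

```python
import torch
from torch.utils.data import Dataset, ConcatDataset, DataLoader, WeightedRandomSampler

class StereoDatasetStub(Dataset):
    """Stand-in for a real stereo dataset (labeled or pseudo-labeled);
    returns (left, right, disparity) tensors."""
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        left = torch.rand(3, 256, 512)
        right = torch.rand(3, 256, 512)
        disp = torch.rand(256, 512) * 192.0
        return left, right, disp

labeled = [StereoDatasetStub(1_000) for _ in range(3)]   # ground-truth disparity
pseudo = [StereoDatasetStub(10_000) for _ in range(2)]   # pairs generated from monocular images

mixed = ConcatDataset(labeled + pseudo)

# Weight samples inversely to their dataset size so each source contributes
# roughly equally per epoch.
weights = torch.cat([torch.full((len(d),), 1.0 / len(d)) for d in labeled + pseudo])
sampler = WeightedRandomSampler(weights, num_samples=len(mixed))
loader = DataLoader(mixed, batch_size=8, sampler=sampler)
```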
We compare StereoAnything with state-of-the-art methods on KITTI12, KITTI15, Middlebury, ETH3D, and DrivingStereo.
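Zero-shot performance on these benchmarks is usually reported as the end-point error (EPE) and a bad-pixel outlier rate (D1 on KITTI). A small NumPy sketch of both metrics is given below; the thresholds follow the common KITTI convention and are not necessarily the exact evaluation code used here.

```python
import numpy as np

def epe(pred, gt, valid):
    """Mean end-point error over valid ground-truth pixels."""
    return np.abs(pred - gt)[valid].mean()

def bad_pixel_rate(pred, gt, valid, thresh=3.0, rel=0.05):
    """KITTI D1-style outlier rate: a pixel is an outlier if its disparity
    error exceeds `thresh` pixels AND `rel` of the ground-truth disparity."""
    err = np.abs(pred - gt)
    bad = (err > thresh) & (err > rel * np.abs(gt))
    return bad[valid].mean()
```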
We conduct an ablation study that underscores the substantial impact of our proposed training strategy. Our results show that applying this strategy to various stereo backbones leads to significant performance improvements across all evaluated datasets.
The table below details the dataset mix setups used for cross-domain evaluation, together with the results each mix configuration achieves across these benchmarks.
@article{guo2024stereo,
title={Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data},
author={Guo, Xianda and Zhang, Chenming and Zhang, Youmin and Wang, Ruilin and Nie, Dujun and Zheng, Wenzhao and Poggi, Matteo and Zhao, Hao and Ye, Mang and Zou, Qin and Chen, Long},
journal={arXiv preprint arXiv:2411.14053},
year={2024}
}