Xianda Guo

I am interested in computer vision and autonomous driving. My current research focuses on:

  • Stereo Matching
  • Multimodal Large Language Models (MLLMs)

    If you want to work with me (in person or remotely) as an intern, feel free to drop me an email at xianda_guo@163.com. I can provide GPU resources if we are a good fit.

    Email  /  Google Scholar  /  GitHub

    News

  • 2026-01: One paper is accepted to ICLR 2026.
  • 2025-12: One paper is accepted to T-MM.
  • 2025-09: Three papers are accepted to NeurIPS 2025.
  • 2025-05: Two papers are accepted to IROS 2025.
  • 2025-05: One paper is accepted to T-MM.
  • 2025-02: One paper is accepted to T-PAMI.
  • 2025-02: One paper is accepted to T-CSVT.
  • 2025-01: One paper is accepted to ICRA 2025.
  • *Equal contribution.    Project leader / corresponding author.

    Selected Papers [Full List]

    🚙 Stereo Matching

    OpenStereo: A Comprehensive Benchmark for Stereo Matching and Strong Baseline
    Xianda Guo, Chenming Zhang, Juntao Lu, Yiqun Duan, Yiqi Wang, Tian Yang, Zheng Zhu, Long Chen
    arXiv, 2024.
    [arXiv] [Code]

    OpenStereo includes training and inference code for more than 10 network models, making it, to our knowledge, the most complete stereo matching toolbox available.

    LightStereo: Channel Boost Is All You Need for Efficient 2D Cost Aggregation
    Xianda Guo*, Chenming Zhang*, Youmin Zhang, Wenzhao Zheng, Dujun Nie, Matteo Poggi, Long Chen
    ICRA, 2025.
    [arXiv] [Code]

    We present LightStereo, a cutting-edge stereo-matching network crafted to accelerate the matching process.

    Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data
    Xianda Guo*, Chenming Zhang*, Youmin Zhang, Ruilin Wang, Dujun Nie, Wenzhao Zheng, Matteo Poggi, Hao Zhao, Mang Ye, Qin Zou, Long Chen
    arXiv, 2025.
    [arXiv] [Code]

    We introduce a novel synthetic dataset that complements existing data by adding variability in baselines, camera angles, and scene types. We extensively evaluate the zero-shot capabilities of our model on five public datasets, showcasing its impressive ability to generalize to new, unseen data.

    StereoCarla: A High-Fidelity Driving Dataset for Generalizable Stereo
    Xianda Guo*, Chenming Zhang*, Ruilin Wang, Youmin Zhang, Wenzhao Zheng, Matteo Poggi, Hao Zhao, Qin Zou, Long Chen
    arXiv, 2025.
    [arXiv] [Code]

    We present StereoCarla, a high-fidelity synthetic stereo dataset specifically designed for autonomous driving scenarios. Built on the CARLA simulator, StereoCarla incorporates a wide range of camera configurations, including diverse baselines, viewpoints, and sensor placements, as well as varied environmental conditions such as lighting changes, weather effects, and road geometries.

    🚙 Depth Estimation

    MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer
    Chaoqiang Zhao*, Youmin Zhang*, Matteo Poggi, Fabio Tosi, Xianda Guo, Tao Huang, Zheng Zhu, Guan Huang, Tian Yang, Stefano Mattoccia
    3DV, 2022.
    [arXiv] [Code]

    In light of the recent successes achieved by Vision Transformers (ViTs), we propose MonoViT, a brand-new framework combining the global reasoning enabled by ViT models with the flexibility of self-supervised monocular depth estimation.

    CompletionFormer: Depth Completion with Convolutions and Vision Transformers
    Youmin Zhang, Xianda Guo, Matteo Poggi, Zheng Zhu, Guan Huang, Stefano Mattoccia
    CVPR, 2023.
    [arXiv] [Code]

    This paper proposes a Joint Convolutional Attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure.

    DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation
    Yiqun Duan, Xianda Guo, Zheng Zhu
    ECCV, 2024.
    [arXiv] [Code]

    We propose DiffusionDepth, a new approach that reformulates monocular depth estimation as a denoising diffusion process: it learns to iteratively denoise a random depth distribution into a depth map under the guidance of monocular visual conditions.

    🚙 End-to-End Driving

    MaskFuser: Masked Fusion of Joint Multi-Modal Tokenization for End-to-End Autonomous Driving
    Yiqun Duan, Xianda Guo, Zheng Zhu, Yao Zheng*, Zhen Wang, Yu-Kai Wang, Chin-Teng Lin
    arXiv, 2024.
    [arXiv] [Code]

    This paper proposes MaskFuser, which tokenizes various modalities into a unified semantic feature space and provides a joint representation for further behavior cloning in driving contexts. Given the unified token representation, MaskFuser is the first work to introduce cross-modality masked auto-encoder training.

    GenAD: Generative End-to-End Autonomous Driving
    Wenzhao Zheng*, Ruiqi Song*, Xianda Guo*, Chenming Zhang, Long Chen
    ECCV, 2024.
    [arXiv] [Code]

    GenAD casts end-to-end autonomous driving as a generative modeling problem.

    🚙 LLM & MLLM

    Instruct Large Language Models to Drive like Humans
    Ruijun Zhang*, Xianda Guo*, Wenzhao Zheng*, Chenming Zhang, Kurt Keutzer, Long Chen
    arXiv, 2024.
    [arXiv] [Code]

    In this paper, we propose InstructDriver, a method that transforms an LLM into a motion planner through explicit instruction tuning, aligning its behavior with human drivers.

    DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving
    Xianda Guo*, Ruijun Zhang*, Yiqun Duan*, Yuhang He, Dujun Nie, Wenke Huang, Chenming Zhang, Shuai Liu, Hao Zhao, Long Chen
    NeurIPS, 2025.
    [arXiv] [Code]

    We introduce DriveMLLM, a benchmark specifically designed to evaluate the spatial understanding capabilities of multimodal large language models (MLLMs) in autonomous driving.

    Academic Services

  • Conference Reviewer: ECCV 2024, ACM MM 2025, NeurIPS 2025
  • Journal Reviewer: T-IP, T-MM, T-CSVT, RAL

  • Website Template


    © Xianda Guo | Last updated: Mar. 1, 2025.