Xianda Guo

I am currently a Ph.D. student at Wuhan University.

I am interested in computer vision and autonomous driving. My current research focuses on:

  • Stereo Matching
  • Multimodal Large Language Models (MLLMs)

If you would like to work with me as an intern (in person or remotely), feel free to drop me an email at xianda_guo@163.com. I can provide GPU resources if we are a good fit.

    Email  /  Google Scholar  /  GitHub

    News

  • 2025-02: One paper is accepted to T-PAMI.
  • 2025-02: One paper is accepted to T-CSVT.
  • 2025-01: One paper is accepted to ICRA 2025.
  • 2024-07: Two papers are accepted to ECCV 2024.
    *Equal contribution    †Project leader/Corresponding author.

    Newest Papers

    Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data
    Xianda Guo*, Chenming Zhang*, Youmin Zhang, Dujun Nie, Ruilin Wang, Wenzhao Zheng, Matteo Poggi, Long Chen
    arXiv, 2025.
    [arXiv] [Code]

    We introduce a novel synthetic dataset that complements existing data by adding variability in baselines, camera angles, and scene types. We extensively evaluate the zero-shot capabilities of our model on five public datasets, showcasing its impressive ability to generalize to new, unseen data.

    Selected Papers [Full List]

    🚙 Gait Recognition

    Gait Recognition in the Wild: A Large-scale Benchmark and NAS-based Baseline
    Xianda Guo*, Zheng Zhu*, Tian Yang, Beibei Lin, Junjie Huang, Jiankang Deng, Guan Huang, Jiwen Lu, Jie Zhou
    T-PAMI, 2025.
    [arXiv] [Code]

    The proposed GREW benchmark proves essential for both training and evaluating gait recognizers in unconstrained scenarios. In addition, we propose SPOSGait, a Single Path One-Shot neural architecture search with uniform sampling for gait recognition, which is the first NAS-based gait recognition model.
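
    For context, Single Path One-Shot (SPOS) search trains one weight-sharing supernet while uniformly sampling a single candidate path per training step. The sketch below illustrates that general recipe in PyTorch; the layer choices and shapes are hypothetical stand-ins, not SPOSGait's actual search space.

        # Generic single-path one-shot supernet (illustrative only).
        import random
        import torch
        import torch.nn as nn

        class SuperLayer(nn.Module):
            """One searchable layer holding several candidate ops."""
            def __init__(self, channels):
                super().__init__()
                self.ops = nn.ModuleList([
                    nn.Conv2d(channels, channels, 3, padding=1),
                    nn.Conv2d(channels, channels, 5, padding=2),
                    nn.Sequential(nn.Conv2d(channels, channels, 1), nn.ReLU()),
                ])

            def forward(self, x, choice):
                return self.ops[choice](x)

        class Supernet(nn.Module):
            def __init__(self, channels=32, depth=4):
                super().__init__()
                self.layers = nn.ModuleList(SuperLayer(channels) for _ in range(depth))

            def forward(self, x, path):
                for layer, choice in zip(self.layers, path):
                    x = layer(x, choice)
                return x

        net = Supernet()
        x = torch.randn(2, 32, 16, 16)
        # Uniform sampling: each step draws one random architecture (path),
        # so only the sampled ops receive gradients for that step.
        path = [random.randrange(len(layer.ops)) for layer in net.layers]
        out = net(x, path)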

    GaitC3I: Robust Cross-Covariate Gait Recognition via Causal Intervention
    Jilong Wang, Saihui Hou, Xianda Guo, Yan Huang, Yongzhen Huang, Tianzhu Zhang, Liang Wang
    T-CSVT, 2025.
    [Paper] [Code]

    We propose a Cross-Covariate Causal Intervention (GaitC3I) framework, a unified causality-inspired approach aimed at enhancing the robustness of gait recognition across diverse conditions.

    Gait Recognition in the Wild: A Benchmark
    Zheng Zhu*, Xianda Guo*, Tian Yang, Xin Tao, Junjie Huang, Jiankang Deng, Guan Huang, Dalong Du, Jiwen Lu, Jie Zhou
    ICCV, 2021.
    [Paper] [Code]

    To the best of our knowledge, this is the first large-scale dataset for gait recognition in the wild. The proposed GREW benchmark proves to be essential for both training and evaluating gait recognizers in unconstrained scenarios.

    DyGait: Exploiting Dynamic Representations for High-performance Gait Recognition
    Ming Wang*, Xianda Guo*, Beibei Lin, Tian Yang, Borui Zhang, Zheng Zhu, Lincheng Li, Shunli Zhang, Xin Yu
    ICCV, 2023.
    [arXiv] [Code]

    We propose a novel, high-performance framework named DyGait, the first gait recognition framework designed to focus on the extraction of dynamic features.

    🚙 Depth Estimation

    MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer
    Chaoqiang Zhao*, Youmin Zhang*, Matteo Poggi, Fabio Tosi, Xianda Guo, Tao Huang, Zheng Zhu, Guan Huang, Tian Yang, Stefano Mattoccia
    3DV, 2022.
    [arXiv] [Code]

    In light of the recent successes achieved by Vision Transformers (ViTs), we propose MonoViT, a brand-new framework combining the global reasoning enabled by ViT models with the flexibility of self-supervised monocular depth estimation.

    CompletionFormer: Depth Completion with Convolutions and Vision Transformers
    Youmin Zhang, Xianda Guo, Matteo Poggi, Zheng Zhu, Guan Huang, Stefano Mattoccia
    CVPR, 2023.
    [arXiv] [Code]

    This paper proposes a Joint Convolutional Attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure.
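
    As a rough illustration of coupling a convolutional attention branch with a Transformer layer inside one block, here is a minimal PyTorch sketch; the module names and the fusion step are hypothetical stand-ins, not CompletionFormer's actual JCAT implementation.

        # Hedged sketch: run a convolutional (local) attention branch and a
        # self-attention (global) branch in parallel, then fuse them.
        import torch
        import torch.nn as nn

        class ConvAttention(nn.Module):
            """Spatial attention via a conv-produced gate (illustrative stand-in)."""
            def __init__(self, dim):
                super().__init__()
                self.gate = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.Sigmoid())
                self.proj = nn.Conv2d(dim, dim, 3, padding=1)

            def forward(self, x):
                return self.proj(x) * self.gate(x)

        class JCATLikeBlock(nn.Module):
            def __init__(self, dim, heads=4):
                super().__init__()
                self.conv_attn = ConvAttention(dim)
                self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.norm = nn.LayerNorm(dim)

            def forward(self, x):                      # x: (B, C, H, W)
                b, c, h, w = x.shape
                local = self.conv_attn(x)              # local, convolutional cues
                tokens = self.norm(x.flatten(2).transpose(1, 2))  # (B, H*W, C)
                glob, _ = self.mhsa(tokens, tokens, tokens)
                glob = glob.transpose(1, 2).reshape(b, c, h, w)   # global cues
                return x + local + glob                # residual fusion

        block = JCATLikeBlock(32)
        y = block(torch.randn(2, 32, 16, 16))          # same shape out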

    DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation
    Yiqun Duan, Xianda Guo†, Zheng Zhu
    ECCV, 2024.
    [arXiv] [Code]

    We propose DiffusionDepth, a new approach that reformulates monocular depth estimation as a denoising diffusion process. It learns an iterative denoising process that refines a random depth distribution into a depth map under the guidance of monocular visual conditions.
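
    A minimal sketch of the inference idea, assuming a generic conditional denoiser: start from random depth and iteratively refine it under image-feature guidance. Everything below is a placeholder (it even omits the diffusion noise schedule), not DiffusionDepth's code.

        # Iteratively "denoise" a random depth map conditioned on visual features.
        import torch
        import torch.nn as nn

        class DepthDenoiser(nn.Module):
            """Placeholder denoiser: refines depth given image features."""
            def __init__(self, feat_ch=32):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Conv2d(feat_ch + 1, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 1, 3, padding=1),
                )

            def forward(self, noisy_depth, feats):
                return self.net(torch.cat([noisy_depth, feats], dim=1))

        denoiser = DepthDenoiser()
        feats = torch.randn(1, 32, 64, 64)   # monocular visual condition
        depth = torch.randn(1, 1, 64, 64)    # start from a random depth map
        for _ in range(10):                  # iterative refinement steps
            depth = denoiser(depth, feats)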

    🚙 Stereo Matching

    Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data
    Xianda Guo*, Chenming Zhang*, Youmin Zhang, Dujun Nie, Ruilin Wang, Wenzhao Zheng, Matteo Poggi, Long Chen
    arXiv, 2025.
    [arXiv] [Code]

    We introduce a novel synthetic dataset that complements existing data by adding variability in baselines, camera angles, and scene types. We extensively evaluate the zero-shot capabilities of our model on five public datasets, showcasing its impressive ability to generalize to new, unseen data.

    OpenStereo: A Comprehensive Benchmark for Stereo Matching and Strong Baseline
    Xianda Guo, Chenming Zhang, Juntao Lu, Yiqun Duan, Yiqi Wang, Tian Yang, Zheng Zhu, Long Chen
    arXiv, 2024.
    [arXiv] [Code]

    OpenStereo includes training and inference code for more than 10 network models, making it, to our knowledge, the most complete stereo matching toolbox available.

    LightStereo: Channel Boost Is All You Need for Efficient 2D Cost Aggregation
    Xianda Guo*, Chenming Zhang*, Youmin Zhang, Wenzhao Zheng, Dujun Nie, Matteo Poggi, Long Chen
    ICRA, 2025.
    [arXiv] [Code]

    We present LightStereo, a cutting-edge stereo-matching network crafted to accelerate the matching process.

    🚙 End-to-End Driving

    MaskFuser: Masked Fusion of Joint Multi-Modal Tokenization for End-to-End Autonomous Driving
    Yiqun Duan, Xianda Guo, Zheng Zhu, Yao Zheng*, Zhen Wang, Yu-Kai Wang, Chin-Teng Lin
    arXiv, 2024.
    [arXiv] [Code]

    This paper proposes MaskFuser, which tokenizes various modalities into a unified semantic feature space and provides a joint representation for further behavior cloning in driving contexts. Given the unified token representation, MaskFuser is the first work to introduce cross-modality masked auto-encoder training.
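
    To give a flavor of cross-modality masked auto-encoder training in general, here is a hedged PyTorch sketch with hypothetical token shapes and a plain reconstruction loss; MaskFuser's actual tokenization, masking strategy, and objectives are described in the paper.

        # Tokens from two modalities are jointly masked; a shared encoder is
        # trained to reconstruct the masked ones.
        import torch
        import torch.nn as nn

        dim = 64
        encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        head = nn.Linear(dim, dim)                    # reconstruction head
        mask_token = nn.Parameter(torch.zeros(1, 1, dim))

        img_tok = torch.randn(2, 16, dim)             # hypothetical camera tokens
        lidar_tok = torch.randn(2, 16, dim)           # hypothetical LiDAR tokens
        tokens = torch.cat([img_tok, lidar_tok], 1)   # unified token sequence

        mask = torch.rand(tokens.shape[:2]) < 0.5     # mask ~half the tokens
        inp = torch.where(mask[..., None], mask_token.expand_as(tokens), tokens)
        rec = head(encoder(inp))
        loss = ((rec - tokens)[mask] ** 2).mean()     # reconstruct masked tokens only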

    GenAD: Generative End-to-End Autonomous Driving
    Wenzhao Zheng*, Ruiqi Song*, Xianda Guo*†, Chenming Zhang, Long Chen
    ECCV, 2024.
    [arXiv] [Code]

    GenAD casts end-to-end autonomous driving as a generative modeling problem.

    🚙 LLM & MLLM

    Instruct Large Language Models to Drive like Humans
    Ruijun Zhang*, Xianda Guo*†, Wenzhao Zheng*, Chenming Zhang, Kurt Keutzer, Long Chen
    arXiv, 2024.
    [arXiv] [Code]

    In this paper, we propose InstructDriver, a method that transforms an LLM into a motion planner through explicit instruction tuning, aligning its behavior with human driving.

    DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving
    Xianda Guo*, Ruijun Zhang*, Yiqun Duan*, Yuhang He, Chenming Zhang, Shuai Liu, Long Chen
    arXiv, 2024.
    [arXiv] [Code]

    We introduce DriveMLLM, a benchmark specifically designed to evaluate the spatial understanding capabilities of multimodal large language models (MLLMs) in autonomous driving.

    Academic Services

  • Conference Reviewer: ECCV 2024, ACM MM 2025, NeurIPS 2025
  • Journal Reviewer: T-IP, T-MM, T-CSVT, RAL

  • Website Template


    © Xianda Guo | Last updated: Mar. 1, 2025.