Xianda Guo

I am currently a Ph.D. student at Wuhan University.

I am interested in computer vision and autonomous driving. My current research focuses on:

  • Stereo Matching
  • Multimodal Large Language Models (MLLMs)

If you would like to work with me as an intern (in person or remotely), feel free to drop me an email at xianda_guo@163.com. I can provide GPU resources if we are a good fit.

    Email  /  Google Scholar  /  GitHub

    News

  • 2025-02: One paper is accepted to T-PAMI.
  • 2025-02: One paper is accepted to T-CSVT.
  • 2025-01: One paper is accepted to ICRA 2025.
  • 2024-07: Two papers are accepted to ECCV 2024.
    *Equal contribution    †Project leader/Corresponding author.

    Newest Papers

    Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data
    Xianda Guo*, Chenming Zhang*, Youmin Zhang, Dujun Nie, Ruilin Wang, Wenzhao Zheng, Matteo Poggi, Long Chen
    arXiv, 2025.
    [arXiv] [Code]

    We introduce a novel synthetic dataset that complements existing data by adding variability in baselines, camera angles, and scene types. We extensively evaluate the zero-shot capabilities of our model on five public datasets, showcasing its impressive ability to generalize to new, unseen data.

    Selected Papers [Full List]

    🚙 Gait Recognition

    Gait Recognition in the Wild: A Large-scale Benchmark and NAS-based Baseline
    Xianda Guo*, Zheng Zhu*, Tian Yang, Beibei Lin, Junjie Huang, Jiankang Deng, Guan Huang, Jiwen Lu, Jie Zhou
    T-PAMI, 2025.
    [arXiv] [Code]

    The proposed GREW benchmark proves essential for both training and evaluating gait recognizers in unconstrained scenarios. In addition, we propose SPOSGait, a Single Path One-Shot neural architecture search with uniform sampling for gait recognition, which is the first NAS-based gait recognition model.
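
    For context, Single Path One-Shot (SPOS) search trains one weight-sharing supernet while uniformly sampling a single candidate path per training step. The sketch below illustrates that general recipe in PyTorch; the layer choices and shapes are hypothetical stand-ins, not SPOSGait's actual search space.

        # Generic single-path one-shot supernet (illustrative only).
        import random
        import torch
        import torch.nn as nn

        class SuperLayer(nn.Module):
            """One searchable layer holding several candidate ops."""
            def __init__(self, channels):
                super().__init__()
                self.ops = nn.ModuleList([
                    nn.Conv2d(channels, channels, 3, padding=1),
                    nn.Conv2d(channels, channels, 5, padding=2),
                    nn.Sequential(nn.Conv2d(channels, channels, 1), nn.ReLU()),
                ])

            def forward(self, x, choice):
                return self.ops[choice](x)

        class Supernet(nn.Module):
            def __init__(self, channels=32, depth=4):
                super().__init__()
                self.layers = nn.ModuleList(SuperLayer(channels) for _ in range(depth))

            def forward(self, x, path):
                for layer, choice in zip(self.layers, path):
                    x = layer(x, choice)
                return x

        net = Supernet()
        x = torch.randn(2, 32, 16, 16)
        # Uniform sampling: each step draws one random architecture (path),
        # so only the sampled ops receive gradients for that step.
        path = [random.randrange(len(layer.ops)) for layer in net.layers]
        out = net(x, path)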

    GaitC3I: Robust Cross-Covariate Gait Recognition via Causal Intervention
    Jilong Wang, Saihui Hou, Xianda Guo, Yan Huang, Yongzhen Huang, Tianzhu Zhang, Liang Wang
    T-CSVT, 2025.
    [Paper] [Code]

    We propose a Cross-Covariate Causal Intervention (GaitC3I) framework, a unified causality-inspired approach aimed at enhancing the robustness of gait recognition across diverse conditions.

    Gait Recognition in the Wild: A Benchmark
    Zheng Zhu*, Xianda Guo*, Tian Yang, Xin Tao, Junjie Huang, Jiankang Deng, Guan Huang, Dalong Du, Jiwen Lu, Jie Zhou
    ICCV, 2021.
    [Paper] [Code]

    To the best of our knowledge, this is the first large-scale dataset for gait recognition in the wild. The proposed GREW benchmark proves to be essential for both training and evaluating gait recognizers in unconstrained scenarios.

    DyGait: Exploiting Dynamic Representations for High-performance Gait Recognition
    Ming Wang*, Xianda Guo*, Beibei Lin, Tian Yang, Borui Zhang, Zheng Zhu, Lincheng Li, Shunli Zhang, Xin Yu
    ICCV, 2023.
    [arXiv] [Code]

    We propose a novel, high-performance framework named DyGait, the first gait recognition framework designed to focus on the extraction of dynamic features.

    🚙 Depth Estimation

    MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer
    Chaoqiang Zhao*, Youmin Zhang*, Matteo Poggi, Fabio Tosi, Xianda Guo, Tao Huang, Zheng Zhu, Guan Huang, Tian Yang, Stefano Mattoccia
    3DV, 2022.
    [arXiv] [Code]

    In light of the recent successes achieved by Vision Transformers (ViTs), we propose MonoViT, a brand-new framework combining the global reasoning enabled by ViT models with the flexibility of self-supervised monocular depth estimation.

    CompletionFormer: Depth Completion with Convolutions and Vision Transformers
    Youmin Zhang, Xianda Guo, Matteo Poggi, Zheng Zhu, Guan Huang, Stefano Mattoccia
    CVPR, 2023.
    [arXiv] [Code]

    This paper proposes a Joint Convolutional Attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure.
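
    As a rough illustration of coupling a convolutional attention branch with a Transformer layer inside one block, here is a minimal PyTorch sketch; the module names and the fusion step are hypothetical stand-ins, not CompletionFormer's actual JCAT implementation.

        # Hedged sketch: run a convolutional (local) attention branch and a
        # self-attention (global) branch in parallel, then fuse them.
        import torch
        import torch.nn as nn

        class ConvAttention(nn.Module):
            """Spatial attention via a conv-produced gate (illustrative stand-in)."""
            def __init__(self, dim):
                super().__init__()
                self.gate = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.Sigmoid())
                self.proj = nn.Conv2d(dim, dim, 3, padding=1)

            def forward(self, x):
                return self.proj(x) * self.gate(x)

        class JCATLikeBlock(nn.Module):
            def __init__(self, dim, heads=4):
                super().__init__()
                self.conv_attn = ConvAttention(dim)
                self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.norm = nn.LayerNorm(dim)

            def forward(self, x):                      # x: (B, C, H, W)
                b, c, h, w = x.shape
                local = self.conv_attn(x)              # local, convolutional cues
                tokens = self.norm(x.flatten(2).transpose(1, 2))  # (B, H*W, C)
                glob, _ = self.mhsa(tokens, tokens, tokens)
                glob = glob.transpose(1, 2).reshape(b, c, h, w)   # global cues
                return x + local + glob                # residual fusion

        block = JCATLikeBlock(32)
        y = block(torch.randn(2, 32, 16, 16))          # same shape out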

    DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation
    Yiqun Duan, Xianda Guo†, Zheng Zhu
    ECCV, 2024.
    [arXiv] [Code]

    We propose DiffusionDepth, a new approach that reformulates monocular depth estimation as a denoising diffusion process. It learns an iterative denoising process that refines a random depth distribution into a depth map under the guidance of monocular visual conditions.
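
    A minimal sketch of the inference idea, assuming a generic conditional denoiser: start from random depth and iteratively refine it under image-feature guidance. Everything below is a placeholder (it even omits the diffusion noise schedule), not DiffusionDepth's code.

        # Iteratively "denoise" a random depth map conditioned on visual features.
        import torch
        import torch.nn as nn

        class DepthDenoiser(nn.Module):
            """Placeholder denoiser: refines depth given image features."""
            def __init__(self, feat_ch=32):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Conv2d(feat_ch + 1, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 1, 3, padding=1),
                )

            def forward(self, noisy_depth, feats):
                return self.net(torch.cat([noisy_depth, feats], dim=1))

        denoiser = DepthDenoiser()
        feats = torch.randn(1, 32, 64, 64)   # monocular visual condition
        depth = torch.randn(1, 1, 64, 64)    # start from a random depth map
        for _ in range(10):                  # iterative refinement steps
            depth = denoiser(depth, feats)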

    🚙 Stereo Matching

    Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data
    Xianda Guo*, Chenming Zhang*, Youmin Zhang, Dujun Nie, Ruilin Wang, Wenzhao Zheng, Matteo Poggi, Long Chen
    arXiv, 2025.
    [arXiv] [Code]

    We introduce a novel synthetic dataset that complements existing data by adding variability in baselines, camera angles, and scene types. We extensively evaluate the zero-shot capabilities of our model on five public datasets, showcasing its impressive ability to generalize to new, unseen data.

    OpenStereo: A Comprehensive Benchmark for Stereo Matching and Strong Baseline
    Xianda Guo, Chenming Zhang, Juntao Lu, Yiqun Duan, Yiqi Wang, Tian Yang, Zheng Zhu, Long Chen
    arXiv, 2024.
    [arXiv] [Code]

    OpenStereo includes training and inference code for more than 10 network models, making it, to our knowledge, the most complete stereo matching toolbox available.

    LightStereo: Channel Boost Is All You Need for Efficient 2D Cost Aggregation
    Xianda Guo*, Chenming Zhang*, Youmin Zhang, Wenzhao Zheng, Dujun Nie, Matteo Poggi, Long Chen
    ICRA, 2025.
    [arXiv] [Code]

    We present LightStereo, a cutting-edge stereo-matching network crafted to accelerate the matching process.

    🚙 End-to-End Driving

    MaskFuser: Masked Fusion of Joint Multi-Modal Tokenization for End-to-End Autonomous Driving
    Yiqun Duan, Xianda Guo, Zheng Zhu, Yao Zheng*, Zhen Wang, Yu-Kai Wang, Chin-Teng Lin
    arXiv, 2024.
    [arXiv] [Code]

    This paper proposes MaskFuser, which tokenizes various modalities into a unified semantic feature space and provides a joint representation for further behavior cloning in driving contexts. Given the unified token representation, MaskFuser is the first work to introduce cross-modality masked auto-encoder training.
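
    To give a flavor of cross-modality masked auto-encoder training in general, here is a hedged PyTorch sketch with hypothetical token shapes and a plain reconstruction loss; MaskFuser's actual tokenization, masking strategy, and objectives are described in the paper.

        # Tokens from two modalities are jointly masked; a shared encoder is
        # trained to reconstruct the masked ones.
        import torch
        import torch.nn as nn

        dim = 64
        encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        head = nn.Linear(dim, dim)                    # reconstruction head
        mask_token = nn.Parameter(torch.zeros(1, 1, dim))

        img_tok = torch.randn(2, 16, dim)             # hypothetical camera tokens
        lidar_tok = torch.randn(2, 16, dim)           # hypothetical LiDAR tokens
        tokens = torch.cat([img_tok, lidar_tok], 1)   # unified token sequence

        mask = torch.rand(tokens.shape[:2]) < 0.5     # mask ~half the tokens
        inp = torch.where(mask[..., None], mask_token.expand_as(tokens), tokens)
        rec = head(encoder(inp))
        loss = ((rec - tokens)[mask] ** 2).mean()     # reconstruct masked tokens only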

    GenAD: Generative End-to-End Autonomous Driving
    Wenzhao Zheng*, Ruiqi Song*, Xianda Guo*†, Chenming Zhang, Long Chen
    ECCV, 2024.
    [arXiv] [Code]

    GenAD casts end-to-end autonomous driving as a generative modeling problem.

    🚙 LLM & MLLM

    Instruct Large Language Models to Drive like Humans
    Ruijun Zhang*, Xianda Guo*†, Wenzhao Zheng*, Chenming Zhang, Kurt Keutzer, Long Chen
    arXiv, 2024.
    [arXiv] [Code]

    In this paper, we propose InstructDriver, a method that transforms an LLM into a motion planner through explicit instruction tuning, aligning its behavior with human driving.

    DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving
    Xianda Guo*, Ruijun Zhang*, Yiqun Duan*, Yuhang He, Chenming Zhang, Shuai Liu, Long Chen
    arXiv, 2024.
    [arXiv] [Code]

    We introduce DriveMLLM, a benchmark specifically designed to evaluate the spatial understanding capabilities of multimodal large language models (MLLMs) in autonomous driving.

    Academic Services

  • Conference Reviewer: ECCV 2024, ACM MM 2025, NeurIPS 2025
  • Journal Reviewer: T-IP, T-MM, T-CSVT, RAL

  • Website Template


    © Xianda Guo | Last updated: Mar. 1, 2025.