Xianda Guo
I am currently a Ph.D. student at Wuhan University.
I am interested in computer vision and autonomous driving. My current research focuses on:
Stereo Matching
MLLM
If you would like to work with me as an intern (in person or remotely), feel free to email me at xianda_guo@163.com. I can provide GPU resources if we are a good fit.
Email  / 
Google Scholar  / 
GitHub
|
|
News
2025-02: One paper is accepted to T-PAMI.
2025-02: One paper is accepted to T-CSVT.
2025-01: One paper is accepted to ICRA 2025.
2024-07: Two papers are accepted to ECCV 2024.
|
*Equal contribution. †Project leader/Corresponding author.
|
|
Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data
Xianda Guo*,
Chenming Zhang*,
Youmin Zhang,
Dujun Nie,
Ruilin Wang,
Wenzhao Zheng,
Matteo Poggi,
Long Chen
arXiv, 2025.
[arXiv]
[Code]
We introduce a novel synthetic dataset that complements existing data by adding variability in baselines, camera angles, and scene types. We extensively evaluate the zero-shot capabilities of our model on five public datasets, showcasing its impressive ability to generalize to new, unseen data.
|
🚙 Gait Recognition
|
Gait Recognition in the Wild: A Large-scale Benchmark and NAS-based Baseline
Xianda Guo*,
Zheng Zhu*,
Tian Yang,
Beibei Lin,
Junjie Huang,
Jiankang Deng,
Guan Huang,
Jiwen Lu,
Jie Zhou
T-PAMI, 2025.
[arXiv]
[Code]
The proposed GREW benchmark proves to be essential for both training and evaluating gait recognizers in unconstrained scenarios. In addition, we propose the Single Path One-Shot neural architecture search with uniform sampling for Gait recognition, named SPOSGait, which is the first NAS-based gait recognition model.
|
|
GaitC3I: Robust Cross-Covariate Gait Recognition via Causal Intervention
Jilong Wang,
Saihui Hou,
Xianda Guo,
Yan Huang,
Yongzhen Huang,
Tianzhu Zhang,
Liang Wang
T-CSVT, 2025.
[Paper]
[Code]
We propose a Cross-Covariate Causal Intervention (GaitC3I) framework, a unified causality-inspired approach aimed at enhancing the robustness of gait recognition across diverse conditions.
|
|
Gait Recognition in the Wild: A Benchmark
Zheng Zhu*,
Xianda Guo*,
Tian Yang,
Xin Tao,
Junjie Huang,
Jiankang Deng,
Guan Huang,
Dalong Du,
Jiwen Lu,
Jie Zhou
ICCV, 2021.
[Paper]
[Code]
To the best of our knowledge, this is the first large-scale dataset for gait recognition in the wild. The proposed GREW benchmark proves to be essential for both training and evaluating gait recognizers in unconstrained scenarios.
|
|
DyGait: Exploiting Dynamic Representations for High-performance Gait Recognition
Ming Wang*,
Xianda Guo*,
Beibei Lin,
Tian Yang,
Borui Zhang,
Zheng Zhu,
Lincheng Li,
Shunli Zhang,
Xin Yu
ICCV, 2023.
[arXiv]
[Code]
We propose a novel, high-performance framework named DyGait. It is the first gait recognition framework designed to focus on extracting dynamic features.
|
🚙 Depth Estimation
|
MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer
Chaoqiang Zhao*,
Youmin Zhang*,
Matteo Poggi,
Fabio Tosi,
Xianda Guo,
Tao Huang,
Zheng Zhu,
Guan Huang,
Tian Yang,
Stefano Mattoccia
3DV, 2022.
[arXiv]
[Code]
In light of the recent successes achieved by Vision Transformers (ViTs), we propose MonoViT, a brand-new framework combining the global reasoning enabled by ViT models with the flexibility of self-supervised monocular depth estimation.
|
|
CompletionFormer: Depth Completion with Convolutions and Vision Transformers
Youmin Zhang,
Xianda Guo,
Matteo Poggi,
Zheng Zhu,
Guan Huang,
Stefano Mattoccia
CVPR, 2023.
[arXiv]
[Code]
This paper proposes a Joint Convolutional Attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure.
|
|
DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation
Yiqun Duan,
Xianda Guo†,
Zheng Zhu
ECCV, 2024.
[arXiv]
[Code]
We propose DiffusionDepth, a new approach that reformulates monocular depth estimation as a denoising diffusion process. It learns an iterative denoising process that "denoises" a random depth distribution into a depth map, guided by monocular visual conditions.
|
🚙 Stereo Matching
|
Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data
Xianda Guo*,
Chenming Zhang*,
Youmin Zhang,
Dujun Nie,
Ruilin Wang,
Wenzhao Zheng,
Matteo Poggi,
Long Chen
arXiv, 2025.
[arXiv]
[Code]
We introduce a novel synthetic dataset that complements existing data by adding variability in baselines, camera angles, and scene types. We extensively evaluate the zero-shot capabilities of our model on five public datasets, showcasing its impressive ability to generalize to new, unseen data.
|
|
OpenStereo: A Comprehensive Benchmark for Stereo Matching and Strong Baseline
Xianda Guo,
Chenming Zhang,
Juntao Lu,
Yiqun Duan,
Yiqi Wang,
Tian Yang,
Zheng Zhu,
Long Chen
arXiv, 2024.
[arXiv]
[Code]
OpenStereo includes training and inference code for more than 10 network models, making it, to our knowledge, the most complete stereo matching toolbox available.
|
|
LightStereo: Channel Boost Is All You Need for Efficient 2D Cost Aggregation
Xianda Guo*,
Chenming Zhang*,
Youmin Zhang,
Wenzhao Zheng,
Dujun Nie,
Matteo Poggi,
Long Chen
ICRA, 2025.
[arXiv]
[Code]
We present LightStereo, a stereo-matching network designed to accelerate the matching process.
|
🚙 End-to-End Driving
|
MaskFuser: Masked Fusion of Joint Multi-Modal Tokenization for End-to-End Autonomous Driving
Yiqun Duan,
Xianda Guo,
Zheng Zhu,
Yao Zheng*,
Zhen Wang,
Yu-Kai Wang,
Chin-Teng Lin
arXiv, 2024.
[arXiv]
[Code]
This paper proposes MaskFuser, which tokenizes various modalities into a unified semantic feature space and provides a joint representation for further behavior cloning in driving contexts. Given the unified token representation, MaskFuser is the first work to introduce cross-modality masked auto-encoder training.
|
|
GenAD: Generative End-to-End Autonomous Driving
Wenzhao Zheng*,
Ruiqi Song*,
Xianda Guo*†,
Chenming Zhang,
Long Chen
ECCV, 2024.
[arXiv]
[Code]
GenAD casts end-to-end autonomous driving as a generative modeling problem.
|
🚙 LLM & MLLM
|
Instruct Large Language Models to Drive like Humans
Ruijun Zhang*,
Xianda Guo*†,
Wenzhao Zheng*,
Chenming Zhang,
Kurt Keutzer,
Long Chen
arXiv, 2024.
[arXiv]
[Code]
In this paper, we propose InstructDriver, a method that transforms an LLM into a motion planner through explicit instruction tuning to align its behavior with that of human drivers.
|
|
DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving
Xianda Guo*,
Ruijun Zhang*,
Yiqun Duan*,
Yuhang He,
Chenming Zhang,
Shuai Liu,
Long Chen
arXiv, 2024.
[arXiv]
[Code]
We introduce DriveMLLM, a benchmark specifically designed to evaluate the spatial understanding capabilities of multimodal large language models (MLLMs) in autonomous driving.
|
Academic Services
Conference Reviewer: ECCV 2024, ACM MM 2025, NeurIPS 2025
Journal Reviewer: T-IP, T-MM, T-CSVT, RAL
|
© Xianda Guo | Last updated: Mar. 1, 2025.
|