Yuxin Xiong

Hi 👋 I’m Yuxin Xiong, a Master’s student in Computer Science at UC San Diego, advised by Prof. Julian McAuley. Before that, I received my Bachelor’s degree from Shanghai Jiao Tong University, where I was advised by Prof. Siheng Chen.

My research focuses on Reasoning, Reinforcement Learning and Large Language Model (LLM) Agents 🤖 — aiming to make language models more intelligent, interpretable, and aligned with human reasoning.

Previously, I collaborated with Prof. Zhiting Hu on large-scale reasoning model training at UC San Diego. I also interned at AWS AI Lab, working with Minjie Wang on Chain-of-Thought (CoT) agents and tool-augmented LLM reasoning.

I’m currently open to Ph.D. positions (Fall 2026) 🎓 and new grad opportunities in AI/ML research and development 🚀.

News 🔥

Jan 22, 2026	Our paper CTRLS was accepted to AISTATS 2026! 🌟
Sep 09, 2025	We contributed to the release of K2-Think — a general reasoning model! 🎉
Aug 20, 2025	Two of our papers were accepted to EMNLP 2025! 🎊
Jan 22, 2025	Our work OCEAN was accepted to ICLR 2025! 🌊
Jun 30, 2024	I graduated from Shanghai Jiao Tong University (SJTU)! 🎓
Jun 12, 2024	Our work MATRIX was selected as an ICML 2024 Spotlight! 🌟

Selected Publications

AISTATS
CTRLS: Chain-of-Thought Reasoning via Latent State-Transition

Junda Wu^*, Yuxin Xiong^*, Xintong Li, Zhengmian Hu, Tong Yu, Rui Wang, Xiang Chen, Jingbo Shang, and Julian McAuley

arXiv preprint arXiv:2507.08182, 2025

Chain-of-thought (CoT) reasoning enables large language models (LLMs) to break down complex problems into interpretable intermediate steps, significantly enhancing model transparency and performance in reasoning tasks. However, conventional CoT methods rely on heuristic sampling without structured modeling of reasoning transitions, constraining their ability to systematically explore and discover diverse and effective reasoning trajectories. In this work, we introduce CTRLS, a framework that formulates CoT reasoning as a Markov decision process (MDP) with latent state transitions, enabling principled and state-aware exploration via distributional reinforcement learning. By modeling reasoning actions as explicit probability distributions in latent space, our approach captures epistemic uncertainty, facilitating robust exploration of the reasoning space. As part of our framework, we introduce an on-policy reinforcement learning strategy incorporating epsilon-greedy exploration and entropy-based regularization to iteratively refine latent state transitions without requiring additional fine-tuning of the underlying LLM. Theoretical analyses provide evidence lower bounds (ELBO), grounding our transition-aware modeling of latent reasoning dynamics. Further experiments demonstrate improvements in reasoning accuracy, diversity, and exploration efficiency across benchmark reasoning tasks.
@article{wu2025ctrls, title = {{CTRLS}: Chain-of-thought reasoning via latent state-transition}, author = {Wu, Junda and Xiong, Yuxin and Li, Xintong and Hu, Zhengmian and Yu, Tong and Wang, Rui and Chen, Xiang and Shang, Jingbo and McAuley, Julian}, journal = {arXiv preprint arXiv:2507.08182}, year = {2025}, }
EMNLP
Explainable Chain-of-Thought Reasoning: An Empirical Analysis on State-Aware Reasoning Dynamics

Sheldon Yu^*, Yuxin Xiong^*, Junda Wu, Xintong Li, Tong Yu, Xiang Chen, Ritwik Sinha, Jingbo Shang, and Julian McAuley

In Findings of the Association for Computational Linguistics: EMNLP 2025, 2025

Recent advances in chain-of-thought (CoT) prompting have enabled large language models (LLMs) to perform multi-step reasoning. However, the explainability of such reasoning remains limited, with prior work primarily focusing on local token-level attribution, such that the high-level semantic roles of reasoning steps and their transitions remain underexplored. In this paper, we introduce a state-aware transition framework that abstracts CoT trajectories into structured latent dynamics. Specifically, to capture the evolving semantics of CoT reasoning, each reasoning step is represented via spectral analysis of token-level embeddings and clustered into semantically coherent latent states. To characterize the global structure of reasoning, we model their progression as a Markov chain, yielding a structured and interpretable view of the reasoning process. This abstraction supports a range of analyses, including semantic role identification, temporal pattern visualization, and consistency evaluation.
@inproceedings{yu2025explainable, title = {Explainable Chain-of-Thought Reasoning: An Empirical Analysis on State-Aware Reasoning Dynamics}, author = {Yu, Sheldon and Xiong, Yuxin and Wu, Junda and Li, Xintong and Yu, Tong and Chen, Xiang and Sinha, Ritwik and Shang, Jingbo and McAuley, Julian}, booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025}, pages = {16660--16667}, year = {2025}, }
EMNLP
Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent

Junda Wu, Yuxin Xiong, Xintong Li, Yu Xia, Ruoyu Wang, Yu Wang, Tong Yu, Sungchul Kim, Ryan A Rossi, Lina Yao, Jingbo Shang, and Julian McAuley

In Findings of the Association for Computational Linguistics: EMNLP 2025, 2025

Recent MLLMs have shown emerging visual understanding and reasoning abilities after being pre-trained on large-scale multimodal datasets. Unlike pre-training, where MLLMs receive rich visual-text alignment, instruction-tuning is often text-driven with weaker visual supervision, leading to the degradation of pre-trained visual understanding and causing visual forgetting. Existing approaches, such as direct fine-tuning and continual learning methods, fail to explicitly address this issue, often compressing visual representations and prioritizing task alignment over visual retention, which further worsens visual forgetting. To overcome this limitation, we introduce a novel perspective leveraging effective rank to quantify the degradation of visual representation richness, interpreting this degradation through the information bottleneck principle as excessive compression that leads to the degradation of crucial pre-trained visual knowledge. Building on this view, we propose a modality-decoupled gradient descent (MDGD) method that regulates gradient updates to maintain the effective rank of visual representations while mitigating the over-compression effects described by the information bottleneck. By explicitly disentangling the optimization of visual understanding from task-specific alignment, MDGD preserves pre-trained visual knowledge while enabling efficient task adaptation. To enable lightweight instruction-tuning, we further develop a memory-efficient fine-tuning approach using gradient masking, which selectively updates a subset of model parameters to enable parameter-efficient fine-tuning (PEFT), reducing computational overhead while preserving rich visual representations. Extensive experiments across various downstream tasks and backbone MLLMs demonstrate that MDGD effectively mitigates visual forgetting from pre-trained tasks while enabling strong adaptation to new tasks.
@inproceedings{wu2025mitigating, title = {Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent}, author = {Wu, Junda and Xiong, Yuxin and Li, Xintong and Xia, Yu and Wang, Ruoyu and Wang, Yu and Yu, Tong and Kim, Sungchul and Rossi, Ryan A and Yao, Lina and Shang, Jingbo and McAuley, Julian}, booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025}, year = {2025}, }
ICML
Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation

Xianghe Pang, Shuo Tang, Rui Ye, Yuxin Xiong, Bolun Zhang, Yanfeng Wang, and Siheng Chen

In Forty-first International Conference on Machine Learning, 2024

Spotlight

Aligning large language models (LLMs) with human values is imperative to mitigate potential adverse effects resulting from their misuse. Drawing from the sociological insight that acknowledging all parties’ concerns is a key factor in shaping human values, this paper proposes a novel direction to align LLMs by themselves: social scene simulation. To achieve this, we present MATRIX, a novel social scene simulator that emulates realistic scenes around a user’s input query, enabling the LLM to take social consequences into account before responding. MATRIX serves as a virtual rehearsal space, akin to a Monopolylogue, where the LLM performs diverse roles related to the query and practices by itself. To inject this alignment, we fine-tune the LLM with MATRIX-simulated data, ensuring adherence to human values without compromising inference speed. We theoretically show that the LLM with MATRIX outperforms existing methods under mild assumptions. Finally, extensive experiments validate that our method outperforms over 10 baselines across 4 benchmarks. As evidenced by 875 user ratings, our tuned 13B-size LLM exceeds GPT-4 in aligning with human values. See our project page at https://shuotang123.github.io/MATRIX.
@inproceedings{pangself, title = {Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation}, author = {Pang, Xianghe and Tang, Shuo and Ye, Rui and Xiong, Yuxin and Zhang, Bolun and Wang, Yanfeng and Chen, Siheng}, booktitle = {Forty-first International Conference on Machine Learning}, year = {2024}, note = {Spotlight}, }
ICLR
OCEAN: Offline Chain-of-thought Evaluation and Alignment in Large Language Models

Junda Wu, Xintong Li, Ruoyu Wang, Yu Xia, Yuxin Xiong, Jianing Wang, Tong Yu, Xiang Chen, Branislav Kveton, Lina Yao, Jingbo Shang, and Julian McAuley

In The Thirteenth International Conference on Learning Representations, 2025

Offline evaluation of LLMs is crucial for understanding their capacities, yet it remains underexplored in existing research. In this work, we focus on the offline evaluation of the chain-of-thought capabilities and show how to optimize LLMs based on the proposed evaluation method. To enable offline feedback with rich knowledge and reasoning paths, we use knowledge graphs (e.g., Wikidata5m) to provide feedback on the generated chain of thoughts. Due to the heterogeneity between LLM reasoning and KG structures, direct interaction and feedback from KGs on LLM behavior are challenging, as they require accurate entity linking and grounding of LLM-generated chains of thought in the KG. To address the above challenge, we propose an offline chain-of-thought evaluation framework, OCEAN, which models chain-of-thought reasoning in LLMs as an MDP and evaluates the policy’s alignment with KG preference modeling. To overcome the reasoning heterogeneity and grounding problems, we leverage on-policy KG exploration and RL to model a KG policy that generates token-level likelihood distributions for LLM-generated chain-of-thought reasoning paths, simulating KG reasoning preference. Then we incorporate the knowledge-graph feedback on the validity and alignment of the generated reasoning paths into inverse propensity scores and propose the KG-IPS estimator. Theoretically, we prove the unbiasedness of the proposed KG-IPS estimator and provide a lower bound on its variance. With the off-policy evaluated value function, we can directly enable off-policy optimization to further enhance chain-of-thought alignment. Our empirical study shows that OCEAN can be efficiently optimized for generating chain-of-thought reasoning paths with higher estimated values without affecting LLMs’ general abilities in downstream tasks or their internal knowledge.
@inproceedings{wuocean, title = {OCEAN: Offline Chain-of-thought Evaluation and Alignment in Large Language Models}, author = {Wu, Junda and Li, Xintong and Wang, Ruoyu and Xia, Yu and Xiong, Yuxin and Wang, Jianing and Yu, Tong and Chen, Xiang and Kveton, Branislav and Yao, Lina and Shang, Jingbo and McAuley, Julian}, booktitle = {The Thirteenth International Conference on Learning Representations}, year = {2025}, }
NeurIPS
Emergent Communication in Interactive Sketch Question Answering

Zixing Lei, Yiming Zhang, Yuxin Xiong, and Siheng Chen

In Thirty-seventh Conference on Neural Information Processing Systems, 2023

Vision-based emergent communication (EC) aims to learn to communicate through sketches and demystify the evolution of human communication. Ironically, previous works neglect multi-round interaction, which is indispensable in human communication. To fill this gap, we first introduce a novel Interactive Sketch Question Answering (ISQA) task, where two collaborative players interact through sketches to answer a question about an image. To accomplish this task, we design a new and efficient interactive EC system, which can achieve an effective balance among three evaluation factors: question answering accuracy, drawing complexity, and human interpretability. Our experimental results demonstrate that a multi-round interactive mechanism facilitates targeted and efficient communication between intelligent agents. The code will be released.
@inproceedings{leiemergent, title = {Emergent Communication in Interactive Sketch Question Answering}, author = {Lei, Zixing and Zhang, Yiming and Xiong, Yuxin and Chen, Siheng}, booktitle = {Thirty-seventh Conference on Neural Information Processing Systems}, year = {2023}, }

Services

Reviewer for: ICLR 2026, NeurIPS 2023/2025, ICML 2024/2025, AAAI 2025/2026