publications
2025
- [arXiv] K2-Think: A parameter-efficient reasoning system. Zhoujun Cheng, Richard Fan, Shibo Hao, Taylor W. Killian, Haonan Li, Suqi Sun, Hector Ren, Alexander Moreno, Daqian Zhang, Tianjun Zhong, Yuxin Xiong, Yuanzhe Hu, Yutao Xie, Xudong Han, Yuqi Wang, and 16 more authors. arXiv preprint arXiv:2509.07604, 2025.
K2-Think is a reasoning system that achieves state-of-the-art performance with a 32B parameter model, matching or surpassing much larger models like GPT-OSS 120B and DeepSeek v3.1. Built on the Qwen2.5 base model, our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time computation techniques. The approach is based on six key technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware, all using publicly available open-source datasets. K2-Think excels in mathematical reasoning, achieving state-of-the-art scores on public benchmarks for open-source models, while also performing strongly in other areas such as Code and Science. Our results confirm that a more parameter-efficient model like K2-Think 32B can compete with state-of-the-art systems through an integrated post-training recipe that includes long chain-of-thought training and strategic inference-time enhancements, making open-source reasoning systems more accessible and affordable. K2-Think is freely available at this http URL, offering best-in-class inference speeds of over 2,000 tokens per second per request via the Cerebras Wafer-Scale Engine.
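Among the six pillars, test-time scaling is commonly realized as best-of-N sampling: draw several candidate answers and keep the one a verifier prefers. The abstract gives no implementation details, so the sketch below is a generic illustration; `generate` and `score` are hypothetical stand-ins for a model and a reward/verifier, not names from the paper.

```python
def best_of_n(generate, score, prompt, n=4):
    """Best-of-N test-time scaling: sample n candidate answers from a
    (stochastic) generator, then return the candidate the verifier
    scores highest. Both callables are hypothetical stand-ins."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

In a full system this step would sit downstream of the post-trained model, alongside the other inference-time pillars such as speculative decoding.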
@article{cheng2025k2,
  title   = {K2-Think: A parameter-efficient reasoning system},
  author  = {Cheng, Zhoujun and Fan, Richard and Hao, Shibo and Killian, Taylor W. and Li, Haonan and Sun, Suqi and Ren, Hector and Moreno, Alexander and Zhang, Daqian and Zhong, Tianjun and Xiong, Yuxin and Hu, Yuanzhe and Xie, Yutao and Han, Xudong and Wang, Yuqi and Pimpalkhute, Varad and Zhuang, Yonghao and Singh, Aaryamonvikram and Liang, Xuezhi and Xie, Anze and She, Jianshu and Fan, Desai and Gao, Chengqian and Ma, Liqun and Yurochkin, Mikhail and Maggs, John and Ma, Xuezhe and He, Guowei and Hu, Zhiting and Liu, Zhengzhong and Xing, Eric P.},
  journal = {arXiv preprint arXiv:2509.07604},
  year    = {2025},
}
- [AISTATS] CTRLS: Chain-of-thought Reasoning via Latent State-transition. Junda Wu*, Yuxin Xiong*, Xintong Li, Zhengmian Hu, Tong Yu, Rui Wang, Xiang Chen, Jingbo Shang, and Julian McAuley. arXiv preprint arXiv:2507.08182, 2025.
Chain-of-thought (CoT) reasoning enables large language models (LLMs) to break down complex problems into interpretable intermediate steps, significantly enhancing model transparency and performance in reasoning tasks. However, conventional CoT methods rely on heuristic sampling without structured modeling of reasoning transitions, constraining their ability to systematically explore and discover diverse and effective reasoning trajectories. In this work, we introduce CTRLS, a framework that formulates CoT reasoning as a Markov decision process (MDP) with latent state transitions, enabling principled and state-aware exploration via distributional reinforcement learning. By modeling reasoning actions as explicit probability distributions in latent space, our approach captures epistemic uncertainty, facilitating robust exploration of the reasoning space. As part of our framework, we introduce an on-policy reinforcement learning strategy incorporating epsilon-greedy exploration and entropy-based regularization to iteratively refine latent state transitions without requiring additional fine-tuning of the underlying LLM. Theoretical analyses derive evidence lower bounds (ELBOs) that ground our transition-aware modeling of latent reasoning dynamics. Further experiments demonstrate improvements in reasoning accuracy, diversity, and exploration efficiency across benchmark reasoning tasks.
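The exploration strategy the abstract names can be pictured in miniature: epsilon-greedy selection over a categorical distribution of latent next-states, plus an entropy bonus on the objective. This is a sketch of the general recipe, not the paper's actual implementation; all names are illustrative.

```python
import math
import random

def epsilon_greedy_transition(probs, eps=0.1, rng=random):
    """Epsilon-greedy choice of the next latent reasoning state:
    with probability eps explore a uniformly random state, otherwise
    exploit the most probable transition under `probs`."""
    if rng.random() < eps:
        return rng.randrange(len(probs))
    return max(range(len(probs)), key=probs.__getitem__)

def entropy_bonus(probs, beta=0.01):
    """Entropy regularizer added to the RL objective so the
    transition distribution keeps exploring instead of collapsing."""
    return -beta * sum(p * math.log(p) for p in probs if p > 0)
```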
@article{wu2025ctrls,
  title   = {CTRLS: Chain-of-thought Reasoning via Latent State-transition},
  author  = {Wu, Junda and Xiong, Yuxin and Li, Xintong and Hu, Zhengmian and Yu, Tong and Wang, Rui and Chen, Xiang and Shang, Jingbo and McAuley, Julian},
  journal = {arXiv preprint arXiv:2507.08182},
  year    = {2025},
}
- [EMNLP] Explainable Chain-of-Thought Reasoning: An Empirical Analysis on State-Aware Reasoning Dynamics. Sheldon Yu*, Yuxin Xiong*, Junda Wu, Xintong Li, Tong Yu, Xiang Chen, Ritwik Sinha, Jingbo Shang, and Julian McAuley. In Findings of the Association for Computational Linguistics: EMNLP 2025, 2025.
Recent advances in chain-of-thought (CoT) prompting have enabled large language models (LLMs) to perform multi-step reasoning. However, the explainability of such reasoning remains limited, with prior work primarily focusing on local token-level attribution, leaving the high-level semantic roles of reasoning steps and their transitions underexplored. In this paper, we introduce a state-aware transition framework that abstracts CoT trajectories into structured latent dynamics. Specifically, to capture the evolving semantics of CoT reasoning, each reasoning step is represented via spectral analysis of token-level embeddings and clustered into semantically coherent latent states. To characterize the global structure of reasoning, we model the progression of these latent states as a Markov chain, yielding a structured and interpretable view of the reasoning process. This abstraction supports a range of analyses, including semantic role identification, temporal pattern visualization, and consistency evaluation.
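Once reasoning steps are clustered into latent states, the Markov-chain abstraction reduces to counting transitions between consecutive state labels and normalizing each row. A minimal sketch (the spectral-clustering step that produces the labels is omitted):

```python
def transition_matrix(sequences, n_states):
    """Estimate a row-stochastic Markov transition matrix from
    sequences of latent reasoning-state labels, one sequence per
    chain-of-thought trajectory."""
    counts = [[0.0] * n_states for _ in range(n_states)]
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):  # consecutive state pairs
            counts[a][b] += 1
    for row in counts:                  # normalize rows with mass
        total = sum(row)
        if total:
            for j in range(n_states):
                row[j] /= total
    return counts
```

The resulting matrix supports the analyses the abstract lists, e.g. visualizing which semantic roles tend to follow one another.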
@inproceedings{yu2025explainable,
  title     = {Explainable Chain-of-Thought Reasoning: An Empirical Analysis on State-Aware Reasoning Dynamics},
  author    = {Yu, Sheldon and Xiong, Yuxin and Wu, Junda and Li, Xintong and Yu, Tong and Chen, Xiang and Sinha, Ritwik and Shang, Jingbo and McAuley, Julian},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025},
  pages     = {16660--16667},
  year      = {2025},
}
- [EMNLP] Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent. Junda Wu, Yuxin Xiong, Xintong Li, Yu Xia, Ruoyu Wang, Yu Wang, Tong Yu, Sungchul Kim, Ryan A. Rossi, Lina Yao, Jingbo Shang, and Julian McAuley. In Findings of the Association for Computational Linguistics: EMNLP 2025, 2025.
Recent MLLMs have shown emerging visual understanding and reasoning abilities after being pre-trained on large-scale multimodal datasets. Unlike pre-training, where MLLMs receive rich visual-text alignment, instruction-tuning is often text-driven with weaker visual supervision, leading to the degradation of pre-trained visual understanding and causing visual forgetting. Existing approaches, such as direct fine-tuning and continual learning methods, fail to explicitly address this issue, often compressing visual representations and prioritizing task alignment over visual retention, which further worsens visual forgetting. To overcome this limitation, we introduce a novel perspective leveraging effective rank to quantify the degradation of visual representation richness, interpreting it through the information bottleneck principle as excessive compression that discards crucial pre-trained visual knowledge. Building on this view, we propose a modality-decoupled gradient descent (MDGD) method that regulates gradient updates to maintain the effective rank of visual representations while mitigating the over-compression effects described by the information bottleneck. By explicitly disentangling the optimization of visual understanding from task-specific alignment, MDGD preserves pre-trained visual knowledge while enabling efficient task adaptation. To enable lightweight instruction-tuning, we further develop a memory-efficient fine-tuning approach using gradient masking, which selectively updates a subset of model parameters to enable parameter-efficient fine-tuning (PEFT), reducing computational overhead while preserving rich visual representations. Extensive experiments across various downstream tasks and backbone MLLMs demonstrate that MDGD effectively mitigates visual forgetting from pre-trained tasks while enabling strong adaptation to new tasks.
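The gradient-masking step can be sketched generically: zero out the updates of frozen parameters so they keep their pre-trained values, while the selected subset adapts. This illustrates gradient masking in general, under the assumption of a plain SGD update; it is not MDGD's actual masking criterion, which derives the mask from effective-rank considerations.

```python
def masked_sgd_step(params, grads, mask, lr=0.1):
    """One SGD step where only parameters with a truthy mask entry
    are updated; masked-out parameters keep their pre-trained values,
    preserving the representations they encode."""
    return [p - lr * g if m else p
            for p, g, m in zip(params, grads, mask)]
```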
@inproceedings{wu2025mitigating,
  title     = {Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent},
  author    = {Wu, Junda and Xiong, Yuxin and Li, Xintong and Xia, Yu and Wang, Ruoyu and Wang, Yu and Yu, Tong and Kim, Sungchul and Rossi, Ryan A. and Yao, Lina and Shang, Jingbo and McAuley, Julian},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025},
  year      = {2025},
}
- OCEAN: Offline Chain-of-thought Evaluation and Alignment in Large Language Models. Junda Wu, Xintong Li, Ruoyu Wang, Yu Xia, Yuxin Xiong, Jianing Wang, Tong Yu, Xiang Chen, Branislav Kveton, Lina Yao, Jingbo Shang, and Julian McAuley. In The Thirteenth International Conference on Learning Representations, 2025.
Offline evaluation of LLMs is crucial in understanding their capacities, though current methods remain underexplored in existing research. In this work, we focus on the offline evaluation of the chain-of-thought capabilities and show how to optimize LLMs based on the proposed evaluation method. To enable offline feedback with rich knowledge and reasoning paths, we use knowledge graphs (e.g., Wikidata5m) to provide feedback on the generated chain of thoughts. Due to the heterogeneity between LLM reasoning and KG structures, direct interaction and feedback from KGs on LLM behavior are challenging, as they require accurate entity linking and grounding of LLM-generated chains of thought in the KG. To address the above challenge, we propose an offline chain-of-thought evaluation framework, OCEAN, which models chain-of-thought reasoning in LLMs as an MDP and evaluates the policy's alignment with KG preference modeling. To overcome the reasoning heterogeneity and grounding problems, we leverage on-policy KG exploration and RL to model a KG policy that generates token-level likelihood distributions for LLM-generated chain-of-thought reasoning paths, simulating KG reasoning preference. Then we incorporate the knowledge-graph feedback on the validity and alignment of the generated reasoning paths into inverse propensity scores and propose the KG-IPS estimator. Theoretically, we prove the unbiasedness of the proposed KG-IPS estimator and provide a lower bound on its variance. With the off-policy evaluated value function, we can directly enable off-policy optimization to further enhance chain-of-thought alignment. Our empirical study shows that OCEAN can be efficiently optimized for generating chain-of-thought reasoning paths with higher estimated values without affecting LLMs' general abilities in downstream tasks or their internal knowledge.
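KG-IPS builds on the classical inverse propensity scoring estimator, whose basic form is easy to state. The sketch below is vanilla IPS for off-policy value estimation, without the knowledge-graph feedback term that distinguishes KG-IPS; names are illustrative.

```python
def ips_estimate(rewards, target_probs, behavior_probs):
    """Vanilla inverse propensity scoring: reweight logged rewards by
    the ratio of target-policy to behavior (logging) policy
    propensities to estimate the target policy's value offline."""
    weighted = (r * pt / pb for r, pt, pb
                in zip(rewards, target_probs, behavior_probs))
    return sum(weighted) / len(rewards)
```

The estimator is unbiased when the behavior propensities are correct and nonzero wherever the target policy has support; its variance grows with the propensity ratios, which motivates the variance bound the paper proves for KG-IPS.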
@inproceedings{wuocean,
  title     = {OCEAN: Offline Chain-of-thought Evaluation and Alignment in Large Language Models},
  author    = {Wu, Junda and Li, Xintong and Wang, Ruoyu and Xia, Yu and Xiong, Yuxin and Wang, Jianing and Yu, Tong and Chen, Xiang and Kveton, Branislav and Yao, Lina and Shang, Jingbo and McAuley, Julian},
  booktitle = {The Thirteenth International Conference on Learning Representations},
  year      = {2025},
}
2024
- [ICLR Workshop] Self-alignment of large language models via multi-agent social simulation. Xianghe Pang, Shuo Tang, Rui Ye, Yuxin Xiong, Bolun Zhang, Yanfeng Wang, and Siheng Chen. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024.
Aligning large language models (LLMs) with human values is imperative to mitigate potential adverse effects resulting from their misuse. Drawing from the sociological insight that acknowledging all parties' concerns is a key factor in shaping human values, this paper proposes a novel direction to align LLMs by themselves: social scene simulation. To achieve this, we present MATRIX, a novel social scene simulator that emulates realistic scenes around a user's input query, enabling the LLM to take social consequences into account before responding. MATRIX serves as a virtual rehearsal space, akin to a Monopolylogue, where the LLM performs diverse roles related to the query and practices by itself. To inject this alignment, we fine-tune the LLM with MATRIX-simulated data, ensuring adherence to human values without compromising inference speed. We theoretically show that the LLM with MATRIX outperforms Constitutional AI under mild assumptions. Finally, extensive experiments validate that our method outperforms over 10 baselines across 4 benchmarks. As evidenced by 875 user ratings, our tuned 13B-size LLM exceeds GPT-4 in aligning with human values. Code will be available.
@inproceedings{pang2024self,
  title     = {Self-alignment of large language models via multi-agent social simulation},
  author    = {Pang, Xianghe and Tang, Shuo and Ye, Rui and Xiong, Yuxin and Zhang, Bolun and Wang, Yanfeng and Chen, Siheng},
  booktitle = {ICLR 2024 Workshop on Large Language Model (LLM) Agents},
  year      = {2024},
}
- Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation. Xianghe Pang, Shuo Tang, Rui Ye, Yuxin Xiong, Bolun Zhang, Yanfeng Wang, and Siheng Chen. In Forty-first International Conference on Machine Learning, 2024. Spotlight.
Aligning large language models (LLMs) with human values is imperative to mitigate potential adverse effects resulting from their misuse. Drawing from the sociological insight that acknowledging all parties' concerns is a key factor in shaping human values, this paper proposes a novel direction to align LLMs by themselves: social scene simulation. To achieve this, we present MATRIX, a novel social scene simulator that emulates realistic scenes around a user's input query, enabling the LLM to take social consequences into account before responding. MATRIX serves as a virtual rehearsal space, akin to a Monopolylogue, where the LLM performs diverse roles related to the query and practices by itself. To inject this alignment, we fine-tune the LLM with MATRIX-simulated data, ensuring adherence to human values without compromising inference speed. We theoretically show that the LLM with MATRIX outperforms existing methods under mild assumptions. Finally, extensive experiments validate that our method outperforms over 10 baselines across 4 benchmarks. As evidenced by 875 user ratings, our tuned 13B-size LLM exceeds GPT-4 in aligning with human values. See our project page at https://shuotang123.github.io/MATRIX.
@inproceedings{pangself,
  title     = {Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation},
  author    = {Pang, Xianghe and Tang, Shuo and Ye, Rui and Xiong, Yuxin and Zhang, Bolun and Wang, Yanfeng and Chen, Siheng},
  booktitle = {Forty-first International Conference on Machine Learning},
  year      = {2024},
  note      = {<b>Spotlight</b>},
}
2023
- Emergent Communication in Interactive Sketch Question Answering. Zixing Lei, Yiming Zhang, Yuxin Xiong, and Siheng Chen. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Vision-based emergent communication (EC) aims to learn to communicate through sketches and demystify the evolution of human communication. Ironically, previous works neglect multi-round interaction, which is indispensable in human communication. To fill this gap, we first introduce a novel Interactive Sketch Question Answering (ISQA) task, where two collaborative players are interacting through sketches to answer a question about an image. To accomplish this task, we design a new and efficient interactive EC system, which can achieve an effective balance among three evaluation factors, including the question answering accuracy, drawing complexity and human interpretability. Our experimental results demonstrate that the multi-round interactive mechanism facilitates targeted and efficient communication between intelligent agents. The code will be released.
@inproceedings{leiemergent,
  title     = {Emergent Communication in Interactive Sketch Question Answering},
  author    = {Lei, Zixing and Zhang, Yiming and Xiong, Yuxin and Chen, Siheng},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year      = {2023},
}