publications
2025
- R^2-CoD: Understanding Text-Graph Complementarity in Relational Reasoning via Knowledge Co-Distillation. Zhen Wu, Ritam Dutt, Luke M. Breitfeller, and 3 more authors. 2025. Under submission.
Relational reasoning lies at the core of many NLP tasks, often drawing on complementary signals from text and graphs. While this complementarity has been leveraged for many research objectives, the field lacks a detailed and systematic understanding of text-graph interplay and its effect on hybrid models. In this work, we take an analysis-driven approach to investigate text–graph representation complementarity via a unified architecture that supports knowledge co-distillation (CoD). We study five tasks involving relational reasoning that each differs in how text and graph structures encode and support the respective task objectives. By tracking how text and graph-based representations evolve over training epochs, we uncover interpretable patterns of alignment and divergence, and provide insights into when and why their integration through CoD is most beneficial.
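The abstract does not spell out the training objective, but as a minimal illustrative sketch of what a symmetric co-distillation loss between a text branch and a graph branch could look like (assuming PyTorch; the function name and the weights alpha and temp are hypothetical, not taken from the paper):

import torch.nn.functional as F

# Illustrative sketch only: each branch is supervised on the task labels and
# additionally distilled toward the other branch's softened predictions.
def co_distillation_loss(text_logits, graph_logits, labels, alpha=0.5, temp=2.0):
    ce = F.cross_entropy(text_logits, labels) + F.cross_entropy(graph_logits, labels)
    # Teachers are detached so gradients flow only into the student branch.
    kl_to_graph = F.kl_div(F.log_softmax(graph_logits / temp, dim=-1),
                           F.softmax(text_logits.detach() / temp, dim=-1),
                           reduction="batchmean") * temp ** 2
    kl_to_text = F.kl_div(F.log_softmax(text_logits / temp, dim=-1),
                          F.softmax(graph_logits.detach() / temp, dim=-1),
                          reduction="batchmean") * temp ** 2
    return ce + alpha * (kl_to_graph + kl_to_text)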
- On Code-Induced Reasoning in LLMs. Zhen Wu, Abdul Waheed, Carolyn Rosé, and 1 more author. 2025. Under submission.
Code data has been shown to enhance the reasoning capabilities of large language models (LLMs), but it remains unclear which aspects of code are most responsible. We investigate this question with a systematic, data-centric framework. We construct parallel instruction datasets in ten programming languages and apply controlled perturbations that selectively disrupt structural or semantic properties of code. We then finetune LLMs from five model families and eight scales on each variant and evaluate their performance on natural language, math, and code tasks. Across 3,331 experiments, our results show that LLMs are more vulnerable to structural perturbations than semantic ones, particularly on math and code tasks. Appropriate abstractions like pseudocode and flowcharts can be as effective as code, while encoding the same information with fewer tokens without adhering to the original syntax can often retain or even improve performance. Remarkably, even corrupted code with misleading signals remains competitive when surface-level regularities persist. Finally, syntactic styles also shape task-specific gains, with Python favoring natural language reasoning and lower-level languages such as Java and Rust favoring math. Through our systematic framework, we aim to provide insight into how different properties of code influence reasoning and inform the design of training data for enhancing LLM reasoning capabilities.
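As a minimal illustrative sketch of the kind of controlled perturbation described above (not the paper's actual code; the function names are hypothetical), a structural perturbation can scramble line order while a semantic one can flip meaning while leaving structure intact:

import random

def structural_perturbation(code: str, seed: int = 0) -> str:
    # Disrupt structure: shuffle the order of lines, keeping tokens intact.
    lines = code.splitlines()
    random.Random(seed).shuffle(lines)
    return "\n".join(lines)

def semantic_perturbation(code: str) -> str:
    # Disrupt semantics while preserving surface structure,
    # e.g. by flipping comparison operators.
    return code.replace("==", "!=")

snippet = "a = 1\nb = 2\nprint(a == b)"
print(structural_perturbation(snippet))  # same tokens, broken structure
print(semantic_perturbation(snippet))    # same structure, altered meaning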
- Can KBQA Models Predict Their Reasoning Paths? Isomorphism Prediction Task as a Proxy. Zhen Wu, Ritam Dutt, Dhruv Gupta, and 1 more author. 2025. Under submission.
Despite achieving correct answers, we find that existing Knowledge Base Question Answering (KBQA) models struggle to follow the expected reasoning structures. We introduce the task of isomorphism prediction to enhance reasoning fidelity beyond answer generation, with a focus on generalization. We propose a contrastive knowledge co-distillation framework that unifies textual and graphical KBQA paradigms, improving overall isomorphism prediction and model generalization. Furthermore, incorporating isomorphism prediction as an auxiliary task also improves KBQA performance.
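As a minimal illustrative sketch of what checking a predicted reasoning path against a gold path for isomorphism could look like (assuming networkx; the triple format and entity names are hypothetical, not the paper's representation):

import networkx as nx

def path_to_graph(triples):
    # Each reasoning step (head, relation, tail) becomes a labeled edge.
    g = nx.DiGraph()
    for head, relation, tail in triples:
        g.add_edge(head, tail, rel=relation)
    return g

gold = path_to_graph([("q", "founded_by", "x"), ("x", "born_in", "a")])
pred = path_to_graph([("q", "founded_by", "y"), ("y", "born_in", "a")])

# Structure and relation labels must match; entity names may differ.
matcher = nx.algorithms.isomorphism.DiGraphMatcher(
    gold, pred, edge_match=lambda e1, e2: e1["rel"] == e2["rel"])
print(matcher.is_isomorphic())  # True: the reasoning shape is the same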
- VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding. Abdul Waheed, Zhen Wu, Dareen Alharthi, and 2 more authors. 2025. Under submission.
Precisely evaluating video understanding models remains challenging: commonly used metrics such as BLEU, ROUGE, and BERTScore fail to capture the fineness of human judgment, while obtaining such judgments through manual evaluation is costly. Recent work has explored using large language models (LLMs) or multimodal LLMs (MLLMs) as evaluators, but their extension to video understanding remains relatively unexplored. In this work, we introduce VideoJudge, a 3B- and 7B-sized MLLM judge specialized to evaluate outputs from video understanding models (i.e., text responses conditioned on videos). To train VideoJudge, our recipe builds on the interplay between a generator and an evaluator: the generator is prompted to produce responses conditioned on a target rating, and responses not matching the evaluator's rating are discarded. Across three out of four meta-evaluation benchmarks, VideoJudge-7B outperforms larger MLLM judge baselines such as Qwen2.5-VL (32B and 72B). Notably, we find that LLM judges (Qwen3) perform worse than MLLM judges (Qwen2.5-VL), and that long chain-of-thought reasoning does not improve performance, indicating that providing video inputs is crucial for evaluating video understanding tasks.
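As a minimal illustrative sketch of the generator-evaluator filtering recipe described above (the function signature and rating scale are hypothetical; the generator and judge are passed in as callables rather than real model APIs):

def bootstrap_training_data(examples, generate_response, judge_rating,
                            ratings=(1, 2, 3, 4, 5)):
    kept = []
    for video, prompt in examples:
        for target in ratings:
            # Ask the generator for a response of a given target quality.
            response = generate_response(video, prompt, target_rating=target)
            # Keep the example only if the evaluator agrees with the target.
            if judge_rating(video, prompt, response) == target:
                kept.append({"video": video, "prompt": prompt,
                             "response": response, "rating": target})
    return kept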
- [CSCL] LLM Bazaar: A Service Design for Supporting Collaborative Learning with an LLM-Powered Multi-Party Collaboration Infrastructure. Zhen Wu, Jiaxin Shi, R. Charles Murray, and 2 more authors. In Proceedings of the 18th International Conference on Computer-Supported Collaborative Learning (CSCL 2025), pp. 108–115, 2025.
For nearly two decades, conversational agents have played a critical role in structuring interactions in collaborative learning, shaping group dynamics, and supporting student engagement. The recent integration of large language models (LLMs) into these agents offers new possibilities for fostering critical thinking and collaborative problem solving. In this work, we begin with an open-source collaboration support architecture called Bazaar and integrate an LLM-agent shell that enables the introduction of LLM-empowered, real-time, context-sensitive collaborative support for group learning. This design and infrastructure pave the way for exploring how tailored LLM-empowered environments can reshape collaborative learning outcomes and interaction patterns.
2024
- [ACL] Leveraging Machine-Generated Rationales to Facilitate Social Meaning Detection in Conversations. Ritam Dutt, Zhen Wu, Kelly Shi, and 3 more authors. In Proceedings of ACL, 2024.
We present a generalizable classification approach that leverages Large Language Models (LLMs) to facilitate the detection of implicitly encoded social meaning in conversations. We design a multi-faceted prompt to extract a textual explanation of the reasoning that connects visible cues to underlying social meanings. These extracted explanations or rationales serve as augmentations to the conversational text to facilitate dialogue understanding and transfer. Our empirical results over 2,340 experimental settings demonstrate the significant positive impact of adding these rationales. Our findings hold true for in-domain classification, zero-shot, and few-shot domain transfer for two different social meaning detection tasks, each spanning two different corpora.
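As a minimal illustrative sketch of the rationale-as-augmentation idea above (the prompt wording and function names are hypothetical; `llm` stands in for any completion function):

RATIONALE_PROMPT = (
    "Explain, step by step, which cues in the conversation below signal the "
    "underlying social meaning, and why.\n\nConversation:\n{dialogue}"
)

def augment_with_rationale(dialogue: str, llm) -> str:
    # The extracted rationale is appended to the conversational text that
    # the downstream classifier sees.
    rationale = llm(RATIONALE_PROMPT.format(dialogue=dialogue))
    return f"{dialogue}\n\nRationale: {rationale}"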
@inproceedings{dutt2024leveragingmachinegeneratedrationalesfacilitate,
  title = {Leveraging Machine-Generated Rationales to Facilitate Social Meaning Detection in Conversations},
  author = {Dutt, Ritam and Wu, Zhen and Shi, Kelly and Sheth, Divyanshu and Gupta, Prakhar and Rose, Carolyn Penstein},
  year = {2024},
  eprint = {2406.19545},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
}
- [HuCLLM, ACL] Evaluating Large Language Models on Social Signal Sensitivity: An Appraisal Theory Approach. Zhen Wu, Ritam Dutt, and Carolyn Rosé. In Proc. of the 1st Human-Centered Large Language Modeling Workshop at ACL, 2024.
We present a framework to assess the sensitivity of Large Language Models (LLMs) to textually embedded social signals using an Appraisal Theory perspective. We report on an experiment that uses prompts encoding three dimensions of social signals: Affect, Judgment, and Appreciation. In response to the prompt, an LLM generates both an analysis (Insight) and a conversational Response, which are analyzed in terms of sensitivity to the signals. We quantitatively evaluate the output text through topical analysis of the Insight and predicted social intelligence scores of the Response in terms of empathy and emotional polarity. Key findings show that LLMs are more sensitive to positive signals, and that personas impact the Response but not the Insight. We discuss how our framework can be extended to a broader set of social signals, personas, and scenarios to evaluate LLM behaviors under various conditions.
@inproceedings{wuhucllm,
  title = {Evaluating Large Language Models on Social Signal Sensitivity: An Appraisal Theory Approach},
  author = {Wu, Zhen and Dutt, Ritam and Rosé, Carolyn},
  booktitle = {Proc. of the 1st Human-Centered Large Language Modeling Workshop at ACL},
  year = {2024},
}