Enabling Conversational Behavior Reasoning Capabilities in Full-Duplex Speech
Shuchang Pan, Siddharth Banerjee, Dhruv Hebbar, Siddhant Patel, Akshaj Gupta, Kan Jen Cheng, Hanjo Kim, Zeyi Austin Li, Martin Q. Ma, Tingle Li, Gopala Anumanchipalli, Jiachen Lian

TL;DR
This paper presents a novel framework for modeling and reasoning about conversational behaviors in full-duplex speech systems using causal inference and a Graph-of-Thoughts, enabling more natural and interpretable dialogue interactions.
Contribution
It introduces a hierarchical causal inference model with a Graph-of-Thoughts structure for conversational reasoning, trained on a hybrid corpus of simulated and real dialogues.
Findings
Robust behavior detection in synthetic and real dialogues
Interpretable reasoning chains for conversational actions
Foundation for benchmarking conversational reasoning
Abstract
Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this causal pathway is key to building natural full-duplex interactive systems. We introduce a framework that enables reasoning over conversational behaviors by modeling this process as causal inference within a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a hybrid corpus that pairs controllable, event-rich simulations with human-annotated rationales and real conversational speech. The GoT framework structures streaming predictions as an evolving graph, enabling a multimodal transformer to forecast the next speech act, generate concise justifications for its…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper proposes a meaningful shift from black-box sequence prediction to causal reasoning over conversational behavior, arguing that next-behavior reasoning is a more human-aligned formulation for full-duplex systems. The paper carefully constructs a dataset combining a simulation corpus with real data.
1. Methodology: The method section is difficult to follow in parts. It is unclear how OpenIE triples (subject–relation–object) are incorporated into the graph and how they interact with the speech-act nodes. The paper describes multiple node types (text nodes, high-level acts, low-level acts) but does not provide a clear illustrative example showing how these are connected or how causal dependencies are inferred. The rationale generation mechanism could be better illustrated with an explicit e
The paper introduces a clear conceptual distinction between pattern matching and reasoning in dialogue systems. The proposed "Perception → Reasoning → Generation" framework provides structure to the problem of interpretable dialogue modeling. The hierarchical behavior taxonomy is grounded in established linguistic theory.
**Mismatch Between Claims and Implementation** This is correlation mining, not causal inference in the formal sense. The paper repeatedly claims to perform "causal inference", but the implementation uses frequency-based co-occurrence graphs. The adjacency matrix is symmetric (undirected graph), but causation has inherent directionality (A causes B does not imply B causes A). Without directed edges, the graph cannot represent causal pathways. Furthermore, observational correlation does not disting
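The reviewer's point about symmetry can be illustrated with a minimal sketch. The speech-act sequence, the label names, and both helper functions below are hypothetical and are not taken from the paper's implementation. The sketch contrasts undirected co-occurrence counts, which are symmetric by construction, with directed transition counts, which preserve temporal order. Directed counts still reflect only observed ordering and do not by themselves establish causation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical speech-act sequence, one label per chunk (illustrative only).
acts = ["question", "answer", "backchannel", "question", "answer", "answer"]


def cooccurrence_counts(seq, window=2):
    """Undirected counts: a pair is counted once per window containing both labels."""
    counts = Counter()
    for i in range(len(seq) - window + 1):
        labels_in_window = sorted(set(seq[i:i + window]))
        for a, b in combinations(labels_in_window, 2):
            # Store both orientations so the resulting matrix is symmetric.
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def transition_counts(seq):
    """Directed counts: (prev, next) is counted only in temporal order."""
    counts = Counter()
    for prev, nxt in zip(seq, seq[1:]):
        counts[(prev, nxt)] += 1
    return counts


co = cooccurrence_counts(acts)
tr = transition_counts(acts)

# Symmetric: the graph cannot tell which act tends to precede the other.
print(co[("question", "answer")], co[("answer", "question")])  # 2 2
# Asymmetric: questions are followed by answers, never the reverse here.
print(tr[("question", "answer")], tr[("answer", "question")])  # 2 0
```

In this toy sequence the co-occurrence table assigns the same weight to both orientations of the question/answer pair, while the transition table separates them.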
The paper makes an original technical contribution by reframing the full-duplex challenge from a black-box prediction task (next segment/token) to an explicit, interpretable reasoning task (perceive -> reason -> act). The application of a Graph-of-Thoughts (GoT) framework to model the evolving conversational state is also a novel and well-motivated architectural choice. The authors have developed and released a substantial new dataset, complete with a detailed analysis of its statistical properties compared to
The framework's representation of conversation is coarse, both temporally and semantically. Quantizing the dialogue into one-second chunks and assigning a single discrete speech-act label per chunk oversimplifies the fluid and often ambiguous nature of human interaction, a limitation the authors acknowledge.
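The coarseness described above can be made concrete with a small sketch. It assumes a maximum-overlap rule for picking each chunk's label; the review does not state which rule the paper uses, and the segments, label names, and `quantize` helper are all hypothetical. The sketch shows how a short overlapping backchannel disappears once each one-second chunk keeps only a single label.

```python
import math


def quantize(segments, duration, chunk=1.0):
    """Assign one label per fixed-length chunk by largest temporal overlap.

    segments: list of (start_sec, end_sec, label) tuples; they may overlap.
    Returns a list with one label per chunk, or None for an empty chunk.
    """
    n_chunks = math.ceil(duration / chunk)
    labels = []
    for k in range(n_chunks):
        lo, hi = k * chunk, (k + 1) * chunk
        overlap = {}
        for start, end, label in segments:
            shared = min(end, hi) - max(start, lo)
            if shared > 0:
                overlap[label] = overlap.get(label, 0.0) + shared
        labels.append(max(overlap, key=overlap.get) if overlap else None)
    return labels


# Hypothetical annotation: a 0.3 s backchannel overlaps the end of a statement.
segments = [
    (0.0, 1.3, "statement"),
    (1.2, 1.5, "backchannel"),
    (1.5, 3.0, "question"),
]
labels = quantize(segments, duration=3.0)
print(labels)  # ['statement', 'question', 'question']
```

The backchannel and the tail of the statement both fall in the second chunk, but the question covers more of that chunk, so neither survives quantization.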
Incorporating comprehensive reasoning in duplex systems is challenging because of the latency it introduces. While many previous works focus on low-level signal-based evaluation, this paper rightly argues for the importance of addressing cognitive and high-level features.
1. **Data synthesis**: The paper needs to provide a more detailed and systematic description of the data synthesis pipeline, including how reproducibility and verification are ensured. Several key implementation details are omitted, for instance, how candidate backchannels are generated (e.g., whether they rely entirely on GPT prompts). The statement "we deliberately introduce controlled overlap between speakers" is also ambiguous and should be clarified. Evaluation on unverified datasets
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · Multimodal Machine Learning Applications
