Introducing Visual Scenes and Reasoning: A More Realistic Benchmark for Spoken Language Understanding

Di Wu; Liting Jiang; Ruiyu Fang; Bianjing; Hongyan Xie; Haoxiang Su; Hao Huang; Zhongjiang He; Shuangyong Song; Xuelong Li

arXiv:2511.19005 · cs.AI · November 25, 2025

TL;DR

This paper introduces VRSLU, a new SLU dataset that incorporates visual context and explicit reasoning to improve real-world applicability, interpretability, and performance of spoken language understanding models.

Contribution

The paper presents VRSLU, a novel dataset integrating visual images and reasoning, and proposes LR-Instruct, a two-step instruction-based approach for better SLU performance and interpretability.

Findings

1. Visual information improves SLU accuracy.
2. Explicit reasoning enhances model interpretability.
3. The two-step instruction approach reduces reasoning bias.

Abstract

Spoken Language Understanding (SLU) consists of two sub-tasks: intent detection (ID) and slot filling (SF). Given its broad range of real-world applications, enhancing SLU for practical deployment is increasingly critical. Profile-based SLU addresses ambiguous user utterances by incorporating context awareness (CA), user profiles (UP), and knowledge graphs (KG) to support disambiguation, thereby advancing SLU research toward real-world applicability. However, existing SLU datasets still fall short in representing real-world scenarios. Specifically, (1) CA uses one-hot vectors for representation, which is overly idealized, and (2) models typically focus solely on predicting intents and slot labels, neglecting the reasoning process that could enhance performance and interpretability. To overcome these limitations, we introduce VRSLU, a novel SLU dataset that integrates both Visual…
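
For readers new to the task, the two sub-tasks and the one-hot context representation mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the VRSLU format or the authors' code: the utterance, the intent and slot labels, the `CONTEXT_STATES` inventory, and the helper names `one_hot` and `extract_slots` are all hypothetical, and BIO tagging is assumed here only because it is a common convention for slot filling.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical context inventory, invented for illustration only.
CONTEXT_STATES = ["at_home", "in_car", "outdoors"]


@dataclass
class SLUExample:
    tokens: List[str]
    intent: str       # intent detection (ID): one label per utterance
    slots: List[str]  # slot filling (SF): one BIO tag per token


def one_hot(state: str, inventory: List[str]) -> List[int]:
    """Encode a categorical context state as a one-hot vector."""
    if state not in inventory:
        raise ValueError(f"unknown state: {state}")
    return [1 if s == state else 0 for s in inventory]


def extract_slots(example: SLUExample) -> List[Tuple[str, str]]:
    """Group BIO tags into (slot_name, text) spans."""
    spans: List[Tuple[str, str]] = []
    name, words = None, []
    for token, tag in zip(example.tokens, example.slots):
        if tag.startswith("B-"):
            # A new span starts; close any span that was open.
            if name is not None:
                spans.append((name, " ".join(words)))
            name, words = tag[2:], [token]
        elif tag.startswith("I-") and name == tag[2:]:
            # Continuation of the currently open span.
            words.append(token)
        else:
            # "O" tag or an inconsistent "I-" tag: close the open span.
            if name is not None:
                spans.append((name, " ".join(words)))
            name, words = None, []
    if name is not None:
        spans.append((name, " ".join(words)))
    return spans


# Hypothetical annotated utterance.
example = SLUExample(
    tokens=["play", "yellow", "submarine"],
    intent="play_music",
    slots=["O", "B-song", "I-song"],
)

context_vector = one_hot("in_car", CONTEXT_STATES)
song_spans = extract_slots(example)
```

The one-hot vector reduces the whole situation to a single category index, which is the kind of idealization the abstract criticizes.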

Code & Models

Datasets

Tele-AI/TeleVRSLU

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling