VLASCD: A Visual Language Action Model for Simultaneous Chatting and Decision Making

Zuojin Tang; Bin Hu; Chenyang Zhao; De Ma; Gang Pan; Bin Liu

arXiv:2410.15885 · cs.AI · August 26, 2025

TL;DR

This paper introduces MIMO-VLA (VLASCD), a unified training framework that lets a multimodal model produce several task outputs in parallel, such as dialogue and driving decisions, rather than routing every task through the single output channel of conventional multi-input single-output (MISO) architectures.

Contribution

The paper proposes MIMO-VLA, a unified training framework that supports parallel multi-task outputs, inspired by human cognition, and demonstrates superior performance in MIMO scenarios.

Findings

1. MIMO-VLA outperforms state-of-the-art MISO models in MIMO tasks.
2. It enables concurrent dialogue generation and decision-making.
3. Experimental results on CARLA show significant performance gains.

Abstract

Recent large pretrained models such as LLMs (e.g., GPT series) and VLAs (e.g., OpenVLA) have achieved notable progress on multimodal tasks, yet they are built upon a multi-input single-output (MISO) paradigm. We show that this paradigm fundamentally limits performance in multi-input multi-output (MIMO) scenarios, where parallel task execution is required. In MISO architectures, tasks compete for a shared output channel, creating mutual exclusion effects that cause unbalanced optimization and degraded performance. To address this gap, we introduce MIMO-VLA (VLASCD), a unified training framework that enables concurrent multi-task outputs, exemplified by simultaneous dialogue generation and decision-making. Inspired by human cognition, MIMO-VLA eliminates interference between tasks and supports efficient parallel processing. Experiments on the CARLA autonomous driving platform demonstrate…
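The contrast the abstract draws between a shared output channel and concurrent outputs can be illustrated with a toy sketch. This is not the authors' architecture or training objective: the two linear heads, the summed loss, the `action_weight` parameter, and the two-dimensional (steer, throttle) action are all assumptions made for illustration. It only shows the general idea of one shared feature vector feeding a text head and an action head at the same time, with both losses optimized jointly.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, x):
    # Plain matrix-vector product plus bias, one output per weight row.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def mimo_forward(features, text_head, action_head):
    # Hypothetical MIMO forward pass: the same shared features feed two
    # separate heads, so neither task has to wait for the other's output.
    token_logits = linear(text_head["W"], text_head["b"], features)
    action = linear(action_head["W"], action_head["b"], features)
    return token_logits, action

def joint_loss(token_logits, target_token, action, target_action,
               action_weight=1.0):
    # Assumed joint objective: cross-entropy for the next dialogue token
    # plus mean squared error for the continuous action.
    probs = softmax(token_logits)
    text_loss = -math.log(probs[target_token])
    action_loss = sum((a - t) ** 2
                      for a, t in zip(action, target_action)) / len(action)
    return text_loss + action_weight * action_loss, text_loss, action_loss

# Toy inputs: a 3-dim shared feature, a 4-token vocabulary, and a
# 2-dim action (steer, throttle). All values are made up.
features = [0.5, -1.0, 2.0]
text_head = {"W": [[0.0] * 3 for _ in range(4)], "b": [0.0] * 4}
action_head = {"W": [[1.0, 0.0, 0.0], [0.0, 0.0, 0.5]], "b": [0.0, 0.0]}

token_logits, action = mimo_forward(features, text_head, action_head)
total, text_loss, action_loss = joint_loss(
    token_logits, target_token=2, action=action, target_action=[0.5, 0.0]
)
print(f"text loss={text_loss:.3f}, action loss={action_loss:.3f}, total={total:.3f}")
```

In a MISO setup the action would instead have to be serialized into the same token stream as the dialogue, so both tasks would share one head and one loss term; the sketch keeps them separate so each loss can be inspected and weighted on its own.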

Videos

VLASCD: A Visual Language Action Model for Simultaneous Chatting and Decision Making · underline

Taxonomy

Topics: Speech and dialogue systems

Methods: Entropy Regularization · Proximal Policy Optimization · CARLA: An Open Urban Driving Simulator