BiTAgent: A Task-Aware Modular Framework for Bidirectional Coupling between Multimodal Large Language Models and World Models

Yu-Wei Zhan; Xin Wang; Pengzhe Mao; Tongtong Feng; Ren Wang; Wenwu Zhu

arXiv:2512.04513·cs.AI·December 5, 2025

TL;DR

BiTAgent is a novel framework that tightly couples multimodal large language models with world models, enabling task-aware, bidirectional interaction for improved embodied agent performance across diverse tasks.

Contribution

BiTAgent introduces a task-aware, bidirectional coupling mechanism between multimodal large language models (MLLMs) and world models (WMs), enhancing multi-task learning and generalization in embodied agents.

Findings

01 Outperforms state-of-the-art baselines in multi-task settings

02 Demonstrates improved stability and generalization

03 Enables effective semantic and dynamic integration

Abstract

Building generalist embodied agents requires a unified system that can interpret multimodal goals, model environment dynamics, and execute reliable actions across diverse real-world tasks. Multimodal large language models (MLLMs) offer strong semantic priors and cross-modal generalization, while world models (WMs) provide actionable latent dynamics for prediction and control. Their combination holds promise for open-ended embodied intelligence, yet introduces two key challenges: (1) establishing a tight coupling between the semantic intent from MLLMs and the dynamic state representations within the WM's latent space, and (2) achieving task-aware adaptability that supports multi-task learning and cross-environment generalization. To address these challenges, we propose BiTAgent, a task-aware dynamic joint framework that enables bidirectional coupling between MLLMs and WMs. BiTAgent…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning