TokenCom: Vision-Language Model for Multimodal and Multitask Token Communications

Feibo Jiang; Siwei Tu; Li Dong; Xiaolong Li; Kezhi Wang; Cunhua Pan; Zhu Han; Jiangzhou Wang

arXiv:2603.00482·cs.CV·March 3, 2026

Open Access

TL;DR

This paper introduces TaiChi, a novel vision-language model framework that enhances token communication by capturing multi-scale visual features and achieving precise cross-modal alignment, improving multimodal understanding and task performance.

Contribution

The paper presents TaiChi, a new VLM framework with dual-visual tokenization, Bilateral Attention Network, and KAN-based modality projection for improved multimodal token communication.

Findings

01. TaiChi outperforms existing models in visual understanding tasks.

02. The system demonstrates effective multimodal and multitask token communication.

03. Experimental results confirm the model's superior performance and feasibility.

Abstract

Vision-Language Models (VLMs), with their strong capabilities in image and text understanding, offer a solid foundation for intelligent communications. However, their effectiveness is constrained by limited token granularity, overlong visual token sequences, and inadequate cross-modal alignment. To overcome these challenges, we propose TaiChi, a novel VLM framework designed for token communications. TaiChi adopts a dual-visual tokenizer architecture that processes both high- and low-resolution images to collaboratively capture pixel-level details and global conceptual features. A Bilateral Attention Network (BAN) is introduced to intelligently fuse multi-scale visual tokens, thereby enhancing visual understanding and producing compact visual tokens. In addition, a Kolmogorov-Arnold Network (KAN)-based modality projector with learnable activation functions is employed to achieve precise…
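To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that multi-scale fusion can be approximated by plain cross-attention in which the few low-resolution tokens act as queries over the many high-resolution tokens (so the output stays compact), and that a KAN-style projector can be illustrated with one learnable 1-D function per input-output edge built from Gaussian bumps. The function names `fuse_tokens` and `kan_style_projector`, all shapes, and the Gaussian basis are hypothetical choices for illustration; the paper's BAN and projector may differ substantially.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_tokens(low_res, high_res):
    """Hypothetical stand-in for multi-scale fusion via cross-attention.

    low_res:  (n_low, d)  few global tokens, used as queries
    high_res: (n_high, d) many detail tokens, used as keys and values
    Returns (n_low, d): the sequence length stays that of the low-res tokens.
    """
    d = low_res.shape[-1]
    scores = low_res @ high_res.T / np.sqrt(d)   # (n_low, n_high)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return low_res + weights @ high_res          # residual connection

def kan_style_projector(tokens, coeffs, centers, width=1.0):
    """KAN-like layer: every input-output edge has its own learnable function.

    Assumption: each edge function is a weighted sum of Gaussian bumps;
    published KANs typically use B-spline bases instead.
    tokens: (n, d_in), coeffs: (d_in, d_out, k), centers: (k,)
    """
    basis = np.exp(-((tokens[..., None] - centers) / width) ** 2)  # (n, d_in, k)
    # Sum the edge functions over input dimensions and basis elements.
    return np.einsum("nik,iok->no", basis, coeffs)                 # (n, d_out)

# Toy usage with random data (all sizes are arbitrary).
rng = np.random.default_rng(0)
low = rng.normal(size=(16, 32))      # 16 low-resolution tokens
high = rng.normal(size=(256, 32))    # 256 high-resolution tokens
fused = fuse_tokens(low, high)       # 16 compact visual tokens
coeffs = rng.normal(size=(32, 64, 8)) * 0.1
centers = np.linspace(-2.0, 2.0, 8)
projected = kan_style_projector(fused, coeffs, centers)
print(fused.shape, projected.shape)
```

In this sketch the compactness comes only from using the shorter sequence as queries: 256 detail tokens are summarized into 16 fused tokens before projection into a (hypothetical) 64-dimensional text embedding space.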

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Data Compression Techniques