ModRWKV: Transformer Multimodality in Linear Time
Jiale Kang, Ziyin Yue, Qingyu Yin, Jiang Rui, Weile Li, Zening Lu, Zhouran Ji

TL;DR
ModRWKV is a lightweight RNN-based multimodal framework initialized from pretrained RWKV7 weights. It reports competitive performance and faster training than traditional Transformer-based models on multimodal tasks.
Contribution
This work presents the first effective multimodal framework based on RNN architectures, specifically RWKV7, with a decoupled design and extensive experiments validating its efficiency and performance.
Findings
With the configuration identified in the experiments, ModRWKV balances performance against computational cost.
Initializing from pretrained RWKV7 weights significantly improves multimodal understanding.
The configuration of the lightweight multimodal modules was selected through extensive experiments.
Abstract
Currently, most multimodal studies are based on large language models (LLMs) with quadratic-complexity Transformer architectures. While linear models like RNNs enjoy low inference costs, their application has been largely limited to the text-only modality. This work explores the capabilities of modern RNN architectures in multimodal contexts. We propose ModRWKV, a decoupled multimodal framework built upon the RWKV7 architecture as its LLM backbone, which achieves multi-source information fusion through dynamically adaptable heterogeneous modality encoders. We designed the multimodal modules in ModRWKV with an extremely lightweight architecture and, through extensive experiments, identified a configuration that achieves an optimal balance between performance and computational efficiency. ModRWKV leverages the pretrained weights of the RWKV7 LLM for initialization, which significantly…
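The decoupled design described in the abstract can be illustrated with a small, purely hypothetical sketch. It assumes that a modality encoder emits feature vectors, that a single linear adapter projects them to the backbone width, and that the projected features are prepended to the text embeddings before a recurrent backbone scans the sequence once. None of these specifics come from the paper: `fuse`, `recurrent_scan`, the dimensions, and the decay-based state update are illustrative placeholders, and the update shown is not the RWKV7 rule.

```python
import random


def linear(x, weight):
    """Project vector x (length d_in) with weight (d_out rows of length d_in)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def make_weight(d_out, d_in, rng):
    """Small random projection matrix standing in for a trained adapter."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]


def fuse(modality_features, text_embeddings, adapter_weight):
    """Project encoder features to the backbone width, then prepend them to the text.

    Assumption: a single linear adapter and simple prepending; the paper's
    actual adapter design is not specified in the abstract.
    """
    projected = [linear(f, adapter_weight) for f in modality_features]
    return projected + text_embeddings


def recurrent_scan(sequence, decay=0.9):
    """Toy recurrent backbone: one fixed-size state, updated once per token.

    This is NOT the RWKV7 update rule. It only shows why the cost grows
    linearly with sequence length: each token touches a constant-size state.
    """
    state = [0.0] * len(sequence[0])
    outputs = []
    for token in sequence:
        state = [decay * s + (1.0 - decay) * t for s, t in zip(state, token)]
        outputs.append(list(state))
    return outputs


rng = random.Random(0)
D_ENC, D_MODEL = 6, 4  # hypothetical encoder and backbone widths

image_patches = [[rng.random() for _ in range(D_ENC)] for _ in range(3)]
text_tokens = [[rng.random() for _ in range(D_MODEL)] for _ in range(5)]
adapter = make_weight(D_MODEL, D_ENC, rng)

fused = fuse(image_patches, text_tokens, adapter)
hidden = recurrent_scan(fused)
print(len(fused), len(hidden), len(hidden[0]))  # prints: 8 8 4
```

Because the encoder and adapter only need to agree with the backbone on the embedding width, swapping in a different modality encoder in this sketch means changing `D_ENC` and the adapter weights while leaving `recurrent_scan` untouched, which is one plausible reading of "decoupled".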
Taxonomy
Topics: Industrial Technology and Control Systems · Power Systems and Technologies
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Softmax
