ModRWKV: Transformer Multimodality in Linear Time

Jiale Kang; Ziyin Yue; Qingyu Yin; Jiang Rui; Weile Li; Zening Lu; Zhouran Ji

arXiv:2505.14505 · cs.CL · May 21, 2025

TL;DR

ModRWKV introduces a lightweight, efficient RNN-based multimodal framework leveraging pretrained RWKV7 weights, demonstrating competitive performance and faster training compared to traditional Transformer-based models in multimodal tasks.

Contribution

This work presents the first effective multimodal framework based on RNN architectures, specifically RWKV7, with a decoupled design and extensive experiments validating its efficiency and performance.

Findings

01. ModRWKV achieves a good balance between performance and computational efficiency.

02. Pretrained RWKV7 weights significantly improve multimodal understanding.

03. The architecture's configuration can be optimized systematically for best results.

Abstract

Currently, most multimodal studies are based on large language models (LLMs) with quadratic-complexity Transformer architectures. While linear models like RNNs enjoy low inference costs, their application has been largely limited to the text-only modality. This work explores the capabilities of modern RNN architectures in multimodal contexts. We propose ModRWKV, a decoupled multimodal framework that uses the RWKV7 architecture as its LLM backbone and achieves multi-source information fusion through dynamically adaptable heterogeneous modality encoders. We designed the multimodal modules in ModRWKV with an extremely lightweight architecture and, through extensive experiments, identified a configuration that achieves an optimal balance between performance and computational efficiency. ModRWKV leverages the pretrained weights of the RWKV7 LLM for initialization, which significantly…
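To make the decoupled design and the linear-time claim concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the single-matrix adapter, the names `fuse` and `recurrent_scan`, the dimensions, and the decayed-sum recurrence are all assumptions chosen for illustration, and the recurrence is a generic stand-in rather than the RWKV7 update rule. The sketch projects hypothetical encoder features into the backbone width, prepends them to the text embeddings, and processes the fused sequence in one pass with a fixed-size state.

```python
import random


def linear(x, weight):
    """Apply a bias-free linear map; weight has shape [out_dim][in_dim]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def make_weight(out_dim, in_dim, rng):
    """Hypothetical random initialization, for illustration only."""
    scale = in_dim ** -0.5
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]


def fuse(modality_feats, text_embeds, adapter_weight):
    """Project encoder features to the backbone width and prepend them to the text."""
    projected = [linear(f, adapter_weight) for f in modality_feats]
    return projected + text_embeds


def recurrent_scan(sequence, decay=0.9):
    """Generic decayed-sum recurrence (an assumption, NOT the RWKV7 update rule).

    The state has a fixed size, so each token costs O(d) and the whole
    pass costs O(T * d): linear in the sequence length T.
    """
    state = [0.0] * len(sequence[0])
    outputs = []
    for token in sequence:
        state = [decay * s + x for s, x in zip(state, token)]
        outputs.append(list(state))
    return outputs


rng = random.Random(0)
enc_dim, model_dim = 6, 4  # hypothetical encoder and backbone widths
adapter = make_weight(model_dim, enc_dim, rng)

# Stand-ins for encoder outputs (3 features) and text embeddings (5 tokens).
image_feats = [[rng.random() for _ in range(enc_dim)] for _ in range(3)]
text_embeds = [[rng.random() for _ in range(model_dim)] for _ in range(5)]

fused = fuse(image_feats, text_embeds, adapter)
outputs = recurrent_scan(fused)
print(len(fused), len(outputs[-1]))  # 8 fused tokens, each state of width 4
```

Because the encoder only meets the backbone through the adapter, swapping in an encoder for another modality changes `enc_dim` and the adapter shape but leaves the scan untouched. A Transformer backbone would instead compare every pair of the 8 fused tokens, which is the quadratic cost the abstract contrasts against.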

Code & Models

Repositories

jl-er/modrwkv (PyTorch, official)

Videos

ModRWKV: Transformer Multimodality in Linear Time · underline

Taxonomy

Topics: Industrial Technology and Control Systems · Power Systems and Technologies

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Softmax