Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding

Boyu Chen; Siran Chen; Kunchang Li; Qinglin Xu; Yu Qiao; Yali Wang

arXiv:2506.07576 · cs.CV · December 23, 2025

Open Access

TL;DR

The paper introduces a Super Encoding Network (SEN) that recursively associates the multimodal encoders of foundation models to build deeper multimodal interactions for video understanding, with reported gains on tasks such as pixel-level tracking and one-shot video editing.

Contribution

It proposes a novel SEN framework that fuses multi-modal encoders recursively, enabling deeper multimodal interactions for improved video understanding performance.

Findings

01. Improves the Jaccard index for pixel-level tracking by 2.7%

02. Reduces temporal coherence error by 8.8%

03. Improves alignment in one-shot video editing by 6.4%

Abstract

Video understanding is considered a critical step towards world modeling, an important long-term problem in AI research. Recently, multimodal foundation models have shown such potential via large-scale pretraining. These models effectively align encoders of different modalities via contrastive learning, but alignment alone does not provide the deeper multimodal interactions that are critical for understanding complex target movements in diversified video scenes. To fill this gap, we propose a unified Super Encoding Network (SEN) for video understanding, which builds up such distinct interactions through the recursive association of multimodal encoders in the foundation models. Specifically, we creatively treat those well-trained encoders as "super neurons" in our SEN. Via…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis

Methods: ALIGN