Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding
Boyu Chen, Siran Chen, Kunchang Li, Qinglin Xu, Yu Qiao, Yali Wang

TL;DR
The paper introduces a Super Encoding Network (SEN) that recursively associates multi-modal encoders to enhance video understanding across various tasks, significantly improving performance over existing models.
Contribution
It proposes a novel SEN framework that fuses multi-modal encoders recursively, enabling deeper multimodal interactions for improved video understanding performance.
Findings
Boosts pixel-level tracking accuracy by 2.7% in Jaccard index
Reduces temporal coherence error by 8.8%
Improves one-shot video editing alignment by 6.4%
Abstract
Video understanding is considered a critical step towards world modeling, an important long-term problem in AI research. Recently, multimodal foundation models have shown such potential via large-scale pretraining. These models align encoders of different modalities via contrastive learning, but that alignment alone does not supply the deeper multimodal interactions that are critical for understanding complex target movements in diverse video scenes. To fill this gap, we propose a unified Super Encoding Network (SEN) for video understanding, which builds up such distinct interactions through the recursive association of multimodal encoders in the foundation models. Specifically, we treat those well-trained encoders as "super neurons" in our SEN. Via…
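For readers unfamiliar with the contrastive alignment mentioned in the abstract, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over paired video and text embeddings. It is not the authors' code and does not implement SEN itself; the function names, the temperature value of 0.07, and the toy embeddings are all assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # video_emb[i] and text_emb[i] are assumed to form a matched pair;
    # every other combination in the batch is treated as a negative.
    v = [l2_normalize(e) for e in video_emb]
    t = [l2_normalize(e) for e in text_emb]
    n = len(v)
    # Cosine similarities scaled by the (hypothetical) temperature.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Video-to-text direction: each row should peak on its diagonal entry.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-video direction: same computation on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


# Toy batch: matched pairs point in the same direction, mismatched ones do not.
aligned_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                             [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each video embedding matches its paired text embedding and large when the pairs are swapped. An objective of this kind aligns the two encoders' outputs without any interaction between their internal features.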
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
Methods: ALIGN
