TL;DR
This paper investigates the mechanisms behind video understanding in large multimodal models, introduces key insights to improve training efficiency, and presents Apollo, a new family of models achieving state-of-the-art performance in long video perception.
Contribution
The paper uncovers transferability principles in video-LMMs, analyzes critical design choices, and introduces the Apollo models, which outperform existing models on video understanding benchmarks.
Findings
Scaling consistency allows transfer of design decisions from small to large models.
Sampling frames at a fixed fps during training, rather than drawing a fixed number of uniformly spaced frames, significantly improves video representation.
Apollo models achieve state-of-the-art results on multiple long-video benchmarks.
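The difference between the two frame-sampling strategies named in the findings can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the segment-centre rule for uniform sampling, and the example clip lengths and rates are all assumptions made for the example.

```python
def uniform_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick a fixed number of indices spread evenly over the clip.

    The sample count is constant, so the time gap between samples
    grows with the clip's duration.
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, num_frames)
    step = num_frames / num_samples
    # Take the centre of each of the num_samples equal-width segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


def fps_frame_indices(num_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Pick indices at a fixed temporal rate.

    The time gap between samples is constant, so the sample count
    grows with the clip's duration.
    """
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Never sample faster than the video's own frame rate.
    stride = max(native_fps / target_fps, 1.0)
    indices = []
    k = 0
    while int(k * stride) < num_frames:
        indices.append(int(k * stride))
        k += 1
    return indices


if __name__ == "__main__":
    # Hypothetical clips: 10 s and 100 s of video at 30 frames per second.
    for num_frames in (300, 3000):
        uniform = uniform_frame_indices(num_frames, num_samples=8)
        fixed_rate = fps_frame_indices(num_frames, native_fps=30.0, target_fps=2.0)
        print(num_frames, len(uniform), len(fixed_rate))
```

With uniform sampling, both clips yield 8 frames, so the effective playback speed the model sees depends on clip length. With fps sampling, the spacing stays at half a second and the longer clip simply yields more frames.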
Abstract
Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explored…
Code & Models
- 🤗 manysuch-cases/Apollo-Github-Files · model · ♡ 10
- 🤗 GoodiesHere/Apollo-LMMs-Apollo-1_5B-t32 · model · 11 dl · ♡ 10
- 🤗 GoodiesHere/Apollo-LMMs-Apollo-3B-t32 · model · 11 dl · ♡ 21
- 🤗 GoodiesHere/Apollo-LMMs-Apollo-7B-t32 · model · 24 dl · ♡ 57
- 🤗 Sri-Vigneshwar-DJ/Apollo-LMMs-Apollo-1.5B-t32 · model · 6 dl · ♡ 1
- 🤗 Sri-Vigneshwar-DJ/Apollo-LMMs-Apollo-3B-t32 · model · 9 dl
- 🤗 Sri-Vigneshwar-DJ/Apollo-LMMs-Apollo-7B-t32 · model · 6 dl · ♡ 1
Taxonomy
Methods: Adaptive Parameter-wise Diagonal Quasi-Newton Method
