OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models
Minseok Kang, Minhyeok Lee, Jungho Lee, Minjung Kim, Donghyeong Kim, Dayeon Lee, Heeseung Choi, Ig-jae Kim, Sangyoun Lee

TL;DR
OTT-Vid is a training-free, optimal-transport-based framework for temporal token compression in Video-LLMs. It preserves semantically important tokens and adapts compression strength to each pair of neighboring frames, reducing inference cost while retaining most task performance.
Contribution
The paper proposes an optimal-transport-based method for semantic-aware, adaptive temporal token compression in Video-LLMs and reports that it outperforms existing training-free compression methods.
Findings
Preserves 95.8% of VQA performance while keeping only 10% of the tokens.
Retains 73.9% of VTG performance while keeping only 10% of the tokens.
Outperforms state-of-the-art training-free compression methods.
Abstract
As Video Large Language Models (Video-LLMs) scale to longer and more complex videos, their inference cost grows rapidly due to the large volume of visual tokens accumulated across frames. Training-free token compression has emerged as a practical solution to this bottleneck. However, existing temporal compression methods rely primarily on cross-frame token similarity or segmentation heuristics, overlooking each token's semantic role within its frame and failing to adapt compression strength to the compressibility of each frame pair. In this work, we propose OTT-Vid, a transport-derived allocation framework for temporal token compression. Our approach consists of two stages: spatial pruning identifies representative content within each frame, and optimal transport (OT) is then solved between neighboring frames to estimate temporal compressibility. We formulate this OT with non-uniform…
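The second stage described in the abstract, solving OT between neighboring frames to estimate temporal compressibility, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes entropic OT solved with Sinkhorn iterations, a cosine-distance cost, and uniform token masses, since the paper's non-uniform formulation is truncated in the abstract above and is not reproduced here. The function names (`cosine_cost`, `sinkhorn`, `transport_cost`, `allocate_budget`) and the proportional budget rule are hypothetical and chosen only for illustration.

```python
import numpy as np

def cosine_cost(x, y):
    """Pairwise cosine distance between rows of x (n, d) and y (m, d)."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - xn @ yn.T

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic OT plan between marginals a (n,) and b (m,)."""
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # match column marginals
        u = a / (K @ v)              # match row marginals
    return u[:, None] * K * v[None, :]

def transport_cost(prev_tokens, next_tokens, eps=0.1):
    """Transport cost between two frames' token sets.

    A low cost suggests the frames are redundant (more compressible).
    ASSUMPTION: uniform token masses; the paper uses a non-uniform
    formulation that is not specified in the truncated abstract.
    """
    cost = cosine_cost(prev_tokens, next_tokens)
    n, m = cost.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    plan = sinkhorn(cost, a, b, eps)
    return float((plan * cost).sum())

def allocate_budget(costs, total):
    """Hypothetical rule: split `total` tokens in proportion to cost."""
    costs = np.asarray(costs, dtype=float)
    if costs.sum() > 0:
        weights = costs / costs.sum()
    else:
        weights = np.full(len(costs), 1.0 / len(costs))
    raw = weights * total
    budget = np.floor(raw).astype(int)
    # Give leftover tokens to the largest fractional remainders.
    leftover = int(total - budget.sum())
    order = np.argsort(-(raw - budget))
    budget[order[:leftover]] += 1
    return budget
```

Under these assumptions, a frame pair with nearly identical tokens yields a small transport cost and would receive a smaller share of the token budget, while a pair with large content change yields a higher cost and keeps more tokens.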
