FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding
Janghoon Cho, Jungsoo Lee, Munawar Hayat, Kyuwoong Hwang, Fatih Porikli, Sungha Choi

TL;DR
FLoC is a training-free, model-agnostic visual token compression method for long video understanding that efficiently selects representative tokens, significantly reducing computational load while maintaining high performance across various benchmarks.
Contribution
The paper introduces FLoC, a novel facility location-based token compression framework that is efficient, versatile, and guarantees near-optimal selection without additional training.
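The summary does not spell out the objective or the guarantee, so the following is the textbook formulation rather than the paper's exact one. The symbols $V$ (all visual tokens), $S$ (the selected subset), $K$ (the token budget), and $\operatorname{sim}$ (a non-negative pairwise similarity) are notation assumed here for illustration.

```latex
\[
f(S) \;=\; \sum_{i \in V} \max_{j \in S} \operatorname{sim}(i, j),
\qquad
S^{\star} \;=\; \arg\max_{S \subseteq V,\; |S| \le K} f(S).
\]
% For non-negative similarities, f is monotone and submodular, so the
% greedy solution S_K built by adding one token at a time satisfies
\[
f(S_K) \;\ge\; \left(1 - \frac{1}{e}\right) f(S^{\star}) \;\approx\; 0.632\, f(S^{\star}).
\]
```

This $(1 - 1/e)$ bound is the standard result for greedy maximization of a monotone submodular function under a cardinality constraint, and it is presumably what "near-optimal" refers to. Lazy greedy makes the same choices as plain greedy (up to ties), so it inherits the bound.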
Findings
Outperforms recent compression techniques on multiple benchmarks.
Drastically reduces visual tokens while maintaining performance.
Efficiently integrates with diverse video-LMMs and workflows.
Abstract
Recent studies in long video understanding have harnessed the advanced visual-language reasoning capabilities of Large Multimodal Models (LMMs), driving the evolution of video-LMMs specialized for processing extended video sequences. However, the scalability of these models is severely limited by the overwhelming volume of visual tokens generated from extended video sequences. To address this challenge, we propose FLoC, an efficient visual token compression framework based on the facility location function, a principled approach that swiftly selects a compact yet highly representative and diverse subset of visual tokens within a predefined budget on the number of visual tokens. By integrating the lazy greedy algorithm, our method achieves remarkable efficiency gains by swiftly selecting a compact subset of tokens, drastically reducing the number of visual tokens while guaranteeing…
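The selection step described in the abstract can be illustrated with a minimal sketch of lazy greedy facility location maximization. This is not the authors' implementation: the function names, the use of cosine similarity rescaled to [0, 1], and the NumPy/heapq structure are all assumptions made for illustration.

```python
import heapq
import numpy as np


def cosine_similarity_01(tokens):
    """Pairwise cosine similarity rescaled to [0, 1] (an assumed similarity choice)."""
    norms = np.maximum(np.linalg.norm(tokens, axis=1, keepdims=True), 1e-12)
    unit = tokens / norms
    return np.clip((unit @ unit.T + 1.0) / 2.0, 0.0, 1.0)


def facility_location_value(sim, selected):
    """f(S) = sum over all tokens of their best similarity to a selected token."""
    if len(selected) == 0:
        return 0.0
    return float(sim[:, selected].max(axis=1).sum())


def lazy_greedy_select(sim, budget):
    """Pick up to `budget` column indices of `sim` with the lazy greedy rule."""
    n = sim.shape[0]
    budget = min(budget, n)
    # coverage[i] is the best similarity of token i to anything selected so far.
    coverage = np.zeros(n)

    # Max-heap (via negated keys) of upper bounds on each candidate's marginal gain.
    heap = [(-float(sim[:, j].sum()), j) for j in range(n)]
    heapq.heapify(heap)

    selected = []
    while len(selected) < budget:
        _, j = heapq.heappop(heap)
        # Refresh only the candidate whose (possibly stale) bound is largest.
        gain = float(np.maximum(sim[:, j] - coverage, 0.0).sum())
        # Submodularity means stale bounds can only overestimate, so a fresh
        # gain that still beats the next bound is the true greedy choice.
        if not heap or gain >= -heap[0][0]:
            selected.append(j)
            coverage = np.maximum(coverage, sim[:, j])
        else:
            heapq.heappush(heap, (-gain, j))
    return selected


# Usage on hypothetical data: 200 tokens of dimension 64, keep 16.
tokens = np.random.default_rng(0).normal(size=(200, 64))
sim = cosine_similarity_01(tokens)
kept = lazy_greedy_select(sim, budget=16)
```

The saving over plain greedy comes from the heap: most candidates are never re-evaluated in a given round, because their stale bound already rules them out.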
Peer Reviews
Decision·ICLR 2026 Poster
- The motivation is solid and the paper is easy to follow.
- The proposed algorithm achieves good performance on various benchmarks such as Video-MME, MLVU, and LVBench.
- The citations are misformatted throughout the paper. For example, in L124-125, \citep should be used in the LaTeX source.
- In addition to Algorithm 1, a more detailed explanation of the optimal subset search should be provided in the main text.
- I think the novelty of the proposed method is somewhat limited. Sampling a token subset is not a new idea, and simply applying the well-known lazy greedy algorithm for this sampling seems to offer only a modest contribution. If there are additiona
1. Submodular Optimization via Facility Location: FLoC is the first visual token compression algorithm based on the facility location function and submodular optimization for long video understanding. This interprets token selection as maximizing a utility (or coverage) function that rewards tokens for preserving the essential information and diversity of the entire visual token set within a strict budget constraint (K).
2. Targeting Rare Information Loss: The facility location framework is spec
1. The primary operational weakness of FLoC is the reliance on the empirical determination of a single hyperparameter: the block length ($T$). The paper explicitly states that the choice of T involves a critical trade-off that impacts both performance and computational efficiency. The optimal setting for the block length is acknowledged to be content-dependent; for instance, a static video (e.g., a lecture) benefits from a longer block length, while a highly dynamic video requires a shorter one.
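The block-length trade-off raised above can be made concrete with a small sketch. It assumes tokens are grouped into blocks of `block_len` consecutive frames and that each block is compressed independently with the same keep ratio. The function names, the `keep_ratio` parameter, the rescaled cosine similarity, and the use of plain (non-lazy) greedy are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np


def greedy_facility_location(sim, budget):
    """Plain greedy selection over a non-negative similarity matrix."""
    n = sim.shape[0]
    coverage = np.zeros(n)
    taken = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(min(budget, n)):
        # Marginal gain of every candidate column given the current coverage.
        gains = np.maximum(sim - coverage[:, None], 0.0).sum(axis=0)
        gains[taken] = -1.0  # never re-pick a token
        j = int(np.argmax(gains))
        selected.append(j)
        taken[j] = True
        coverage = np.maximum(coverage, sim[:, j])
    return selected


def compress_by_blocks(frame_tokens, block_len, keep_ratio):
    """frame_tokens has shape (num_frames, tokens_per_frame, dim).

    Returns the global indices of the kept tokens in temporal order.
    """
    num_frames, per_frame, dim = frame_tokens.shape
    kept = []
    for start in range(0, num_frames, block_len):
        block = frame_tokens[start:start + block_len].reshape(-1, dim)
        norms = np.maximum(np.linalg.norm(block, axis=1, keepdims=True), 1e-12)
        unit = block / norms
        # The similarity matrix is quadratic in the number of tokens per block,
        # which is where a large block length becomes expensive.
        sim = np.clip((unit @ unit.T + 1.0) / 2.0, 0.0, 1.0)
        budget = max(1, int(round(keep_ratio * len(block))))
        local = greedy_facility_location(sim, budget)
        # Map block-local indices back to global token positions.
        kept.extend(sorted(start * per_frame + j for j in local))
    return np.array(kept)


# Usage on hypothetical data: 32 frames, 16 tokens each, dimension 64.
video = np.random.default_rng(0).normal(size=(32, 16, 64))
kept_indices = compress_by_blocks(video, block_len=4, keep_ratio=0.25)
```

In this sketch, a larger `block_len` lets one selection pass remove redundancy across more frames but grows the per-block similarity matrix quadratically. A smaller `block_len` is cheaper, but near-identical tokens in neighbouring blocks can each be kept, because blocks never see one another.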
- The compression process does not rely on specific model architectures, nor is it tailored to particular queries or tasks; a single compression can support multiple downstream applications.
- A visualization video of the lazy-greedy algorithm is provided in the supplementary material, making it more persuasive.
- The experiments are sufficient, with the method’s performance verified on multiple models and benchmarks.
- The method performs diversity selection within individual blocks, with no information exchange between blocks. Therefore, the choice of the hyperparameter T is crucial: a T that is too large may cause redundant computation, while a T that is too small may leave redundant, similar tokens across blocks. In Tab. 4, the optimal T differs under different compression rates, making it difficult to select an appropriate T across settings with different models, compression rates, and benchmarks.
- FLo
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques
