# EMCompress: Video-LLMs with Endomorphic Multimodal Compression

**Authors:** Zheyu Fan, Jiateng Liu, Yuji Zhang, Zihan Wang, Yi R. Fung, Manling Li, Heng Ji

arXiv: 2508.21094 · 2026-04-28

## TL;DR

This paper introduces Endomorphic Multimodal Compression (EMC), a cognitively inspired task that compresses a VideoQA input (V, Q) into a smaller pair (v, q) of the same form while keeping the answer unchanged, so that Video-LLMs reason over less task-irrelevant input.

## Contribution

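The paper formulates Endomorphic Multimodal Compression (EMC) as a structurally constrained transformation F_EMC : (V, Q) -> (v, q) that preserves answer invariance across reasonable downstream models. Because the compressed output stays in the pipeline's native task space, EMC works as a modular front-end for both Video Instruction Tuning and Video Question Answering pipelines. The authors also release the first dedicated EMC benchmark and a baseline, ReSimplifyIt.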
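To make the endomorphic constraint concrete, here is a minimal sketch. It is not the authors' implementation: every name in it (`VideoQAInput`, `emc_compress`, `answer_invariant`, and the `toy_*` helpers) is hypothetical, frames are represented as short text descriptions, and the frame selector, query rewriter, and downstream model are hard-coded toys. It only illustrates that the output has the same type as the input, so a downstream model that accepts (V, Q) accepts (v, q) unchanged.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class VideoQAInput:
    """A (video, question) pair; frames are text stand-ins for real frames."""
    frames: Sequence[str]
    question: str


def emc_compress(
    x: VideoQAInput,
    keep_frame: Callable[[str, str], bool],
    rewrite_query: Callable[[str], str],
) -> VideoQAInput:
    """Hypothetical endomorphic map (V, Q) -> (v, q): same type in and out."""
    kept = [f for f in x.frames if keep_frame(f, x.question)]
    return VideoQAInput(frames=kept, question=rewrite_query(x.question))


def answer_invariant(
    model: Callable[[VideoQAInput], str],
    original: VideoQAInput,
    compressed: VideoQAInput,
) -> bool:
    """Check that one downstream model gives the same answer on both inputs."""
    return model(original) == model(compressed)


# --- Toy placeholders (assumptions for illustration only) ---
def toy_keep(frame: str, question: str) -> bool:
    # Keep frames mentioning the question's subject (hard-coded for the toy).
    return "dog" in frame


def toy_rewrite(question: str) -> str:
    # Strip a filler prefix and re-capitalize the question.
    return question.replace("Um, so ", "").capitalize()


def toy_model(x: VideoQAInput) -> str:
    # Stand-in for a Video-LLM: answers "ball" if any frame shows one.
    return "ball" if any("ball" in f for f in x.frames) else "unknown"


original = VideoQAInput(
    frames=["a dog runs", "empty street", "the dog catches a ball"],
    question="Um, so what does the dog catch?",
)
compressed = emc_compress(original, toy_keep, toy_rewrite)
print(compressed)
print(answer_invariant(toy_model, original, compressed))
```

Checking invariance against a single model is weaker than the paper's requirement of invariance across reasonable downstream models; in practice one would evaluate several models.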

## Key findings

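- ReSimplifyIt, the proposed EMC baseline, surpasses prior methods by 0.40 F-1 on the newly released EMC benchmark, with competitive query rewriting.
- Integrating EMC yields a relative gain of 7.33% in training for video-language understanding.
- Integrating EMC yields a relative gain of 33.7% in inference for video-language understanding.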

## Abstract

Video-LLMs face a fundamental tension in long-video reasoning: static, sparse frame sampling either dilutes evidence across task-irrelevant segments at significant cost or misses fine-grained temporal semantics altogether. We propose a novel, cognitively-inspired task -- Endomorphic Multimodal Compression (EMC) -- as a structurally-constrained sufficient-statistic problem for VideoQA, and formulate it as an endomorphic transformation F_EMC : (V, Q) -> (v, q) that compresses the multimodal input while preserving answer invariance across reasonable downstream models. The endomorphic form keeps the compressed output in the downstream pipeline's native task space -- a structural mirror of the filter-then-reason mechanism in the cognitive literature motivating EMC -- distinguishing it from latent-code compression (IB / VIB) and making the formulation extensible to other multimodal settings. Under the Markov chain A -> (V, Q) -> (v, q), EMC realizes the classical sufficiency condition I((v, q); A) = I((V, Q); A) in its VideoQA-natural form. As a modular front-end, EMC plugs into both Video Instruction Tuning and Video Question Answering pipelines. We release the first dedicated benchmark and propose ReSimplifyIt, an EMC baseline surpassing prior methods by 0.40 F-1 with competitive query rewriting. Integrating EMC yields relative gains of 7.33% in training and 33.7% in inference for video-language understanding.
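The sufficiency condition in the abstract can be unpacked with a standard information-theoretic identity. The following is a sketch under one assumption, not the paper's own derivation: the compressed pair (v, q) = F_EMC(V, Q) is taken to be a deterministic function of (V, Q), which is what makes A -> (V, Q) -> (v, q) a Markov chain.

```latex
% Assumption: (v, q) = F_EMC(V, Q) is a deterministic function of (V, Q),
% so adding (v, q) alongside (V, Q) carries no extra information about A.
\begin{align}
I\big((V,Q);A\big)
  &= I\big((V,Q),(v,q);A\big) \\
  &= I\big((v,q);A\big) + I\big((V,Q);A \mid (v,q)\big)
  && \text{(chain rule)} \\
\Longrightarrow\quad
I\big((v,q);A\big) &\le I\big((V,Q);A\big)
  && \text{(data processing inequality)}
\end{align}
% Equality, i.e. sufficiency, holds exactly when the residual term vanishes:
\begin{equation}
I\big((v,q);A\big) = I\big((V,Q);A\big)
\iff
I\big((V,Q);A \mid (v,q)\big) = 0 .
\end{equation}
```

Read this way, the condition says that once (v, q) is known, the discarded part of (V, Q) tells a downstream model nothing further about the answer A.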

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21094/full.md

## Figures

17 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21094/full.md

## References

64 references — full list in the complete paper: https://tomesphere.com/paper/2508.21094/full.md

---
Source: https://tomesphere.com/paper/2508.21094