OmniEncoder: See, Hear, and Feel Continuous Motion Like Humans With One Encoder
Detao Bai, Shimin Yao, Weixuan Chen, Chengen Lai, Yuanming Li, Zhiheng Ma, Xihan Wei

TL;DR
OmniEncoder is a unified Transformer model that processes visual and audio signals together at 25 fps, aiming for a more holistic, fine-grained understanding of continuous motion that is closer to human perception.
Contribution
The paper introduces Omni-Encoder, a novel architecture that co-embeds visual and audio signals at 25 fps, improving cross-modal interaction and motion understanding.
Findings
Outperforms modality-specific baselines on sign language recognition.
Achieves better results on fine-grained sports action analysis.
Maintains competitive performance on AVQA and speaker localization.
Abstract
Recent advances in omni-modal large language models have enabled remarkable progress in joint vision-audio understanding. However, prevailing architectures rely on modality-specific encoders with a video-coarse, audio-dense design (sampling visual frames at 1-2 fps while processing audio waveforms at 25 fps), resulting in systems that perceive video frame by frame, modality by modality, rather than holistically as humans do. Such a discrepancy leaves models with impoverished cross-modal interaction during encoding and an inability to capture fine-grained visual motion. To bridge this gap, we present Omni-Encoder, a unified Transformer backbone designed to co-embed visual and audio signals at a symmetrical 25 fps within a shared latent space. This architecture leverages three core innovations: the Omni-Encoder Token Template, Omni-RoPE, and Temporal Window…
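The rate mismatch described in the abstract can be made concrete with a little arithmetic. The following is a minimal sketch, not the authors' implementation: the helper names (`timestamps`, `audio_steps_per_frame`) and the 4-second clip are hypothetical, and it illustrates only how time steps line up at each sampling rate, not the Token Template, Omni-RoPE, or any other component of the model.

```python
def timestamps(duration_s: float, rate_hz: float) -> list[float]:
    """Start time (seconds) of each sample taken at rate_hz over duration_s."""
    n = int(round(duration_s * rate_hz))
    return [i / rate_hz for i in range(n)]


def audio_steps_per_frame(video_fps: float, audio_fps: float = 25.0) -> float:
    """How many audio time steps fall within one sampled video frame interval."""
    return audio_fps / video_fps


clip_s = 4.0  # hypothetical clip length, chosen only for illustration

# "Video-coarse, audio-dense": video at 2 fps, audio at 25 steps per second.
coarse_video = timestamps(clip_s, 2.0)
audio = timestamps(clip_s, 25.0)

# Symmetric setting: both modalities at 25 steps per second.
dense_video = timestamps(clip_s, 25.0)

# With equal rates, every video step has exactly one audio step at the same time.
pairs = list(zip(dense_video, audio))

print(len(coarse_video), len(dense_video), len(audio))
print(audio_steps_per_frame(2.0), audio_steps_per_frame(25.0))
```

At 2 fps each sampled frame has to stand in for 12.5 audio steps, whereas at a symmetric 25 fps each 40 ms video step lines up with a single audio step, which is the precondition for step-by-step cross-modal interaction during encoding.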
