OmniEncoder: See, Hear, and Feel Continuous Motion Like Humans With One Encoder

Detao Bai; Shimin Yao; Weixuan Chen; Chengen Lai; Yuanming Li; Zhiheng Ma; Xihan Wei

arXiv:2605.01506 · cs.CV · May 5, 2026

TL;DR

OmniEncoder is a unified Transformer encoder that processes visual and audio signals together at 25 fps, so that continuous motion is perceived holistically and at fine temporal granularity rather than frame by frame and modality by modality.

Contribution

The paper introduces Omni-Encoder, a unified Transformer backbone that co-embeds visual and audio signals at a symmetrical 25 fps in a shared latent space, aiming to improve cross-modal interaction during encoding and the capture of fine-grained visual motion.

Findings

01. Outperforms modality-specific baselines on sign language recognition.

02. Achieves better results on fine-grained sports action analysis.

03. Maintains competitive performance on AVQA and speaker localization.

Abstract

Recent advances in omni-modal large language models have enabled remarkable progress in joint vision-audio understanding. However, prevailing architectures rely on modality-specific encoders with a video-coarse, audio-dense design (sampling visual frames at 1-2 fps while processing audio waveforms at 25 fps), resulting in systems that perceive video frame by frame, modality by modality, rather than holistically as humans do. Such a discrepancy leaves models with impoverished cross-modal interaction during encoding and an inability to capture fine-grained visual motion. To bridge this gap, we present Omni-Encoder, a unified Transformer backbone designed to co-embed visual and audio signals at a symmetrical 25 fps within a shared latent space. This architecture leverages three core innovations: the Omni-Encoder Token Template, Omni-RoPE, and Temporal Window…
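
To make the sampling-rate gap in the abstract concrete, the following sketch compares how many visual frames a clip yields at 1 fps, 2 fps, and 25 fps, and how much time passes between consecutive frames. It is plain arithmetic, not the paper's pipeline; the 10-second clip length and the helper names `frames_sampled` and `frame_gap_ms` are hypothetical choices made for illustration.

```python
def frames_sampled(duration_s: float, fps: float) -> int:
    """Number of frames obtained by sampling a clip uniformly at `fps`."""
    return round(duration_s * fps)


def frame_gap_ms(fps: float) -> float:
    """Time in milliseconds between two consecutive sampled frames."""
    return 1000.0 / fps


clip_seconds = 10.0  # hypothetical clip length, not taken from the paper

# 1-2 fps is the coarse visual rate the abstract attributes to prevailing
# designs; 25 fps is the symmetrical rate the abstract states for Omni-Encoder.
for fps in (1, 2, 25):
    print(
        f"{fps:>2} fps: {frames_sampled(clip_seconds, fps):>3} frames, "
        f"{frame_gap_ms(fps):.0f} ms between frames"
    )
```

At 25 fps each visual frame covers a 40 ms step, which is the same step size as an audio stream processed at 25 fps, so the two modalities can be paired one-to-one in time. At 1-2 fps, motion that happens within the 500-1000 ms between sampled frames is not observed at all.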
