HyperCLOVA X 8B Omni
NAVER Cloud HyperCLOVA X Team

TL;DR
HyperCLOVA X 8B Omni is an 8B-scale omnimodal model that takes text, audio, and vision as inputs and produces them as outputs, unifying multimodal understanding and generation in a single model rather than separate modality-specific pipelines.
Contribution
The report introduces the first any-to-any omnimodal model in the HyperCLOVA X family, which handles all three modalities through a shared next-token prediction interface over an interleaved multimodal sequence.
Findings
Achieves competitive performance against comparably sized models across text, audio, and vision tasks.
Supports both Korean and English inputs and outputs.
Covers both understanding and generation across diverse input-output combinations of the three modalities.
Abstract
In this report, we present HyperCLOVA X 8B Omni, the first any-to-any omnimodal model in the HyperCLOVA X family that supports text, audio, and vision as both inputs and outputs. By consolidating multimodal understanding and generation into a single model rather than separate modality-specific pipelines, HyperCLOVA X 8B Omni serves as an 8B-scale omni-pathfinding point toward practical any-to-any omni assistants. At a high level, the model unifies modalities through a shared next-token prediction interface over an interleaved multimodal sequence, while vision and audio encoders inject continuous embeddings for fine-grained understanding and grounding. Empirical evaluations demonstrate competitive performance against comparably sized models across diverse input-output combinations spanning text, audio, and vision, in both Korean and English. We anticipate that the open-weight release of…
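The abstract's "shared next-token prediction interface over an interleaved multimodal sequence" can be pictured with a minimal sketch. Everything in it is an illustrative assumption and not the report's actual design: the vocabulary ranges, the Segment type, and the helper names are hypothetical. Discrete tokens from each modality are mapped into one shared ID space, concatenated in order, and trained with ordinary shifted targets, while continuous encoder outputs are spliced in as input vectors.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

# Hypothetical layout: one shared ID space, partitioned by modality.
RANGES: Dict[str, Tuple[int, int]] = {
    "text": (0, 1000),
    "audio": (1000, 1500),
    "vision": (1500, 2000),
}

@dataclass
class Segment:
    modality: str         # "text", "audio", or "vision"
    token_ids: List[int]  # IDs local to that modality's own codebook

def to_shared_ids(seg: Segment) -> List[int]:
    """Offset local IDs into the shared vocabulary range of the modality."""
    start, end = RANGES[seg.modality]
    if any(t < 0 or t >= end - start for t in seg.token_ids):
        raise ValueError(f"token out of range for {seg.modality}")
    return [start + t for t in seg.token_ids]

def interleave(segments: List[Segment]) -> List[int]:
    """Concatenate segments, in order, into one multimodal token sequence."""
    seq: List[int] = []
    for seg in segments:
        seq.extend(to_shared_ids(seg))
    return seq

def next_token_pairs(seq: List[int]) -> List[Tuple[int, int]]:
    """(input, target) pairs: each position predicts the token after it."""
    return list(zip(seq[:-1], seq[1:]))

Vector = List[float]

def embed_inputs(items: List[Union[int, Vector]],
                 table: Dict[int, Vector]) -> List[Vector]:
    """Look up discrete IDs in an embedding table; pass continuous
    encoder outputs (already vectors) through unchanged."""
    return [table[x] if isinstance(x, int) else x for x in items]

segments = [
    Segment("text", [5, 6]),
    Segment("vision", [0, 1]),
    Segment("audio", [2]),
]
seq = interleave(segments)       # [5, 6, 1500, 1501, 1002]
pairs = next_token_pairs(seq)    # the same objective at every position
```

Because every modality lands in the same sequence, a single prediction objective covers text, audio, and vision outputs alike. The embed_inputs helper only illustrates the idea that encoder embeddings can sit next to ordinary token embeddings on the input side.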
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · ICT in Developing Communities
