Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs

Hongcheng Liu; Yuhao Wang; Zhe Chen; Pingjie Wang; Zhiyuan Zhu; Yixuan Hou; Yanfeng Wang; Yu Wang

arXiv:2604.05522·cs.CL·April 8, 2026

TL;DR

This paper introduces CrossOmni, a dataset and methods to improve cross-modal coreference understanding in Omni-LLMs, addressing a key gap in multi-modal reasoning capabilities.

Contribution

The paper formalizes cross-modal coreference as a new challenge, provides a dedicated dataset (CrossOmni), and proposes both training-free and training-based methods to improve this ability in Omni-LLMs.

Findings

1. Experiments reveal systematic weaknesses in cross-modal coreference in existing Omni-LLMs.
2. Both proposed methods significantly improve cross-modal coreference performance.
3. Enhanced models generalize better to collaborative reasoning tasks.

Abstract

Omni Large Language Models (Omni-LLMs) have demonstrated impressive capabilities in holistic multi-modal perception, yet they consistently falter in complex scenarios requiring synergistic omni-modal reasoning. Beyond understanding global multimodal context, effective reasoning also hinges on fine-grained cross-modal alignment, especially identifying shared referents across modalities, yet this aspect has been largely overlooked. To bridge this gap, we formalize the challenge as a cross-modal coreference problem, where a model must localize a referent in a source modality and re-identify it in a target modality. Building on this paradigm, we introduce CrossOmni, a dataset comprising nine tasks equipped with human-designed reasoning rationales to evaluate and enhance this capability. Experiments on 13 Omni-LLMs reveal systematic weaknesses in cross-modal coreference, which we attribute…
