Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs
Hongcheng Liu, Yuhao Wang, Zhe Chen, Pingjie Wang, Zhiyuan Zhu, Yixuan Hou, Yanfeng Wang, Yu Wang

TL;DR
This paper introduces CrossOmni, a dataset and methods to improve cross-modal coreference understanding in Omni-LLMs, addressing a key gap in multi-modal reasoning capabilities.
Contribution
The paper formalizes cross-modal coreference as a new challenge, provides a dedicated dataset (CrossOmni), and proposes both training-free and training-based methods to strengthen this ability in Omni-LLMs.
Findings
Experiments reveal systematic weaknesses in cross-modal coreference in existing Omni-LLMs.
Both proposed methods significantly improve cross-modal coreference performance.
Enhanced models generalize better to collaborative reasoning tasks.
Abstract
Omni Large Language Models (Omni-LLMs) have demonstrated impressive capabilities in holistic multi-modal perception, yet they consistently falter in complex scenarios requiring synergistic omni-modal reasoning. Beyond understanding global multimodal context, effective reasoning also hinges on fine-grained cross-modal alignment, especially identifying shared referents across modalities, an aspect that has been largely overlooked. To bridge this gap, we formalize the challenge as a cross-modal coreference problem, where a model must localize a referent in a source modality and re-identify it in a target modality. Building on this paradigm, we introduce CrossOmni, a dataset comprising nine tasks equipped with human-designed reasoning rationales to evaluate and enhance this capability. Experiments on 13 Omni-LLMs reveal systematic weaknesses in cross-modal coreference, which we attribute…
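To make the two-step formulation concrete (localize a referent in a source modality, then re-identify it in a target modality), here is a minimal sketch of how such an item might be represented and scored. Everything in it is an illustrative assumption: the `CorefSample` and `Prediction` types, the string referent IDs, and the three accuracy figures are hypothetical and are not taken from CrossOmni or from the paper's evaluation protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorefSample:
    """One hypothetical cross-modal coreference item (not the CrossOmni schema)."""
    source_modality: str  # e.g. "audio"
    target_modality: str  # e.g. "image"
    source_referent: str  # gold referent ID in the source modality
    target_referent: str  # gold ID of the same entity in the target modality


@dataclass(frozen=True)
class Prediction:
    """A model's answer for both steps of one item."""
    source_referent: str
    target_referent: str


def score(samples, predictions):
    """Return per-step and joint accuracy.

    Assumption: an item counts as jointly correct only when both the
    source localization and the target re-identification match the gold IDs.
    """
    if len(samples) != len(predictions):
        raise ValueError("samples and predictions must have the same length")
    n = len(samples)
    if n == 0:
        return {"localize": 0.0, "reidentify": 0.0, "joint": 0.0}
    loc = reid = joint = 0
    for s, p in zip(samples, predictions):
        loc_ok = p.source_referent == s.source_referent
        reid_ok = p.target_referent == s.target_referent
        loc += loc_ok
        reid += reid_ok
        joint += loc_ok and reid_ok
    return {"localize": loc / n, "reidentify": reid / n, "joint": joint / n}


# Toy data: four audio-to-image items with made-up referent IDs.
samples = [
    CorefSample("audio", "image", "speaker_1", "person_left"),
    CorefSample("audio", "image", "speaker_2", "person_right"),
    CorefSample("audio", "image", "dog_bark", "dog"),
    CorefSample("audio", "image", "car_horn", "taxi"),
]
predictions = [
    Prediction("speaker_1", "person_left"),   # both steps correct
    Prediction("speaker_2", "person_left"),   # localized, wrong re-identification
    Prediction("cat_meow", "dog"),            # wrong localization, right target
    Prediction("car_horn", "taxi"),           # both steps correct
]
result = score(samples, predictions)
print(result)
```

Reporting the two steps separately, alongside the joint figure, is one way to see whether errors come from finding the referent or from carrying it across modalities; the paper may well measure this differently.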
