MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention

Yuqi Pang; Bowen Yang; Yun Cao; Rong Fan; Xiaoyu Li; Chen He

arXiv:2507.22805·cs.CV·November 18, 2025

Open Access

TL;DR

MoCHA is a novel vision-language framework that combines multiple visual backbones with dynamic expert selection and hierarchical attention to enhance visual understanding while reducing costs.

Contribution

It introduces a multi-backbone visual extraction method with MoECs and HGA modules, improving performance and robustness in vision-language tasks.

Findings

01. Outperforms state-of-the-art models on various benchmarks.

02. Reduces hallucination and improves visual instruction following.

03. Demonstrates robustness through ablation studies.

Abstract

Vision large language models (VLLMs) focus primarily on handling complex and fine-grained visual information by incorporating advanced vision encoders and scaling up visual models. However, these approaches face high training and inference costs, as well as challenges in extracting visual details and effectively bridging modalities. In this work, we propose a novel visual framework, MoCHA, to address these issues. Our framework integrates four vision backbones (i.e., CLIP, SigLIP, DINOv2, and ConvNeXt) to extract complementary visual features and is equipped with a sparse Mixture of Experts Connectors (MoECs) module to dynamically select experts tailored to different visual dimensions. To mitigate redundant or insufficient use of the visual information encoded by the MoECs module, we further design a Hierarchical Group Attention (HGA) with intra- and inter-group operations and…
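To make the idea of sparse expert selection concrete, here is a minimal sketch of generic top-k routing in a connector, written in plain Python. It is not the authors' implementation: the abstract does not specify the router, the expert architecture, or the value of k, so the class name `SparseMoEConnector`, the single-linear-layer experts, the dimensions, and `top_k=2` are all assumptions chosen for illustration. The weights are random stand-ins for trained parameters.

```python
import math
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class SparseMoEConnector:
    """Hypothetical top-k connector: routes one visual token to k of n experts."""

    def __init__(self, in_dim, out_dim, num_experts=4, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.router = random_matrix(num_experts, in_dim, rng)
        # Assumption: each expert is a single linear projection.
        self.experts = [random_matrix(out_dim, in_dim, rng) for _ in range(num_experts)]

    def route(self, token):
        """Return (expert_index, gate_weight) pairs for the top-k experts."""
        logits = matvec(self.router, token)
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = ranked[: self.top_k]
        # Renormalize the gate weights over the selected experts only.
        gates = softmax([logits[i] for i in chosen])
        return list(zip(chosen, gates))

    def forward(self, token):
        """Gate-weighted sum of the selected experts' projections."""
        out = [0.0] * len(self.experts[0])
        for idx, gate in self.route(token):
            projected = matvec(self.experts[idx], token)
            out = [o + gate * p for o, p in zip(out, projected)]
        return out


# Usage: project one 8-dimensional visual token into a 4-dimensional space.
connector = SparseMoEConnector(in_dim=8, out_dim=4)
token = [0.5, -1.0, 0.25, 0.0, 1.5, -0.75, 0.1, 0.9]
routing = connector.route(token)
projected_token = connector.forward(token)
```

Only the selected experts are evaluated for each token, which is where a sparse design saves compute relative to running every expert. How MoCHA assigns experts to "different visual dimensions" and how HGA consumes the result are not recoverable from the truncated abstract, so the sketch stops at the routing step.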

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)