Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games

Hanling Yi; Feng Lin; Mao Luo; Yifan Yang; Xiaotian Yu; Rong Xiao

arXiv:2604.16785·cs.CV·April 21, 2026

TL;DR

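HyMOR is a hybrid framework that pairs a multimodal large language model (MLLM) with a CLIP model: the MLLM handles open-ended, coarse-grained recognition, and CLIP handles fine-grained identification of domain-specific objects, giving educational games accurate recognition across semantic levels.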

Contribution

The paper introduces HyMOR, a novel hybrid approach integrating MLLMs and CLIP models for open-ended, multi-granularity object recognition in educational and interactive scenarios.

Findings

01. HyMOR narrows the fine-grained recognition gap to 0.2%.

02. HyMOR improves general object recognition by 2.5%.

03. HyMOR achieves a 23.2% overall improvement in semantic similarity.

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enabled open-ended object recognition, yet they struggle with fine-grained tasks. In contrast, CLIP-style models excel at fine-grained recognition but lack broad coverage of general object categories. To bridge this gap, we propose HyMOR, a Hybrid Multi-granularity open-ended Object Recognition framework that integrates an MLLM with a CLIP model. In HyMOR, the MLLM performs open-ended and coarse-grained object recognition, while the CLIP model specializes in fine-grained identification of domain-specific objects such as animals and plants. This hybrid design enables accurate object understanding across multiple semantic granularities, serving as a robust perceptual foundation for downstream multi-modal content generation and interactive gameplay. To support evaluation in…
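The division of labor described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how HyMOR decides when to hand off to CLIP, so the routing rule here (refine only when the coarse label falls in a known fine-grained domain) is an assumption. All names (`FINE_DOMAINS`, `recognize`, `fine_grained_label`, `label_bank`) are hypothetical, the MLLM is replaced by a caller-supplied function, and CLIP-style matching is imitated with cosine similarity over toy vectors.

```python
import math
from typing import Callable, Dict, Optional, Sequence

# Hypothetical set of domains that get fine-grained refinement (assumption;
# the abstract names animals and plants as examples).
FINE_DOMAINS = {"animal", "plant"}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fine_grained_label(
    image_vec: Sequence[float],
    label_vecs: Dict[str, Sequence[float]],
) -> str:
    """CLIP-style step: pick the label whose embedding is closest to the image."""
    return max(label_vecs, key=lambda name: cosine(image_vec, label_vecs[name]))


def recognize(
    image_vec: Sequence[float],
    coarse_fn: Callable[[Sequence[float]], str],
    label_bank: Dict[str, Dict[str, Sequence[float]]],
) -> Dict[str, Optional[str]]:
    """Coarse label from the MLLM stand-in, refined only for fine-grained domains."""
    coarse = coarse_fn(image_vec)  # stand-in for the open-ended MLLM answer
    result: Dict[str, Optional[str]] = {"coarse": coarse, "fine": None}
    if coarse in FINE_DOMAINS and coarse in label_bank:
        result["fine"] = fine_grained_label(image_vec, label_bank[coarse])
    return result


# Toy label embeddings; a real system would use CLIP text embeddings.
label_bank = {"animal": {"red fox": [1.0, 0.0, 0.0], "arctic fox": [0.0, 1.0, 0.0]}}

animal_result = recognize([0.9, 0.1, 0.0], lambda v: "animal", label_bank)
chair_result = recognize([0.9, 0.1, 0.0], lambda v: "chair", label_bank)
```

In this sketch a general object such as a chair keeps only its coarse label, while an image labeled as an animal is additionally matched against the fine-grained label bank for that domain.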
