Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games
Hanling Yi, Feng Lin, Mao Luo, Yifan Yang, Xiaotian Yu, Rong Xiao

TL;DR
HyMOR is a hybrid framework that combines a multimodal large language model (MLLM) with a CLIP model to improve multi-granularity object recognition for educational games, enabling accurate recognition across semantic levels.
Contribution
The paper introduces HyMOR, a novel hybrid approach integrating MLLMs and CLIP models for open-ended, multi-granularity object recognition in educational and interactive scenarios.
Findings
HyMOR narrows the fine-grained recognition gap to 0.2%.
HyMOR improves general object recognition by 2.5%.
HyMOR achieves a 23.2% overall improvement in semantic similarity.
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enabled open-ended object recognition, yet they struggle with fine-grained tasks. In contrast, CLIP-style models excel at fine-grained recognition but lack broad coverage of general object categories. To bridge this gap, we propose HyMOR, a Hybrid Multi-granularity open-ended Object Recognition framework that integrates an MLLM with a CLIP model. In HyMOR, the MLLM performs open-ended and coarse-grained object recognition, while the CLIP model specializes in fine-grained identification of domain-specific objects such as animals and plants. This hybrid design enables accurate object understanding across multiple semantic granularities, serving as a robust perceptual foundation for downstream multimodal content generation and interactive gameplay. To support evaluation in…
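The division of labor described in the abstract can be illustrated with a small sketch. The abstract does not say how the two models are actually combined, so the routing rule here is an assumption: a coarse label (standing in for the MLLM output) is refined only when it belongs to a domain that has fine-grained candidates, and the refinement picks the candidate whose text embedding is most similar to the image embedding, in the style of CLIP zero-shot matching. The function `recognize`, the table `FINE_LABELS`, and the two-dimensional toy embeddings are all hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recognize(
    image_embedding: Vector,
    coarse_label: str,
    fine_labels: Dict[str, Dict[str, Vector]],
) -> str:
    """Return a fine-grained label when one is available, else the coarse label.

    coarse_label stands in for the MLLM's open-ended answer (assumed input).
    fine_labels maps a coarse category to {fine label: text embedding};
    in a real system these embeddings would come from a CLIP text encoder.
    """
    candidates = fine_labels.get(coarse_label)
    if not candidates:
        # General object with no fine-grained vocabulary: keep the coarse answer.
        return coarse_label
    # Domain-specific object: pick the fine label closest to the image embedding.
    return max(candidates, key=lambda name: cosine(image_embedding, candidates[name]))


# Toy, hand-made embeddings purely for illustration (hypothetical values).
FINE_LABELS = {
    "bird": {"sparrow": [1.0, 0.0], "robin": [0.0, 1.0]},
}

print(recognize([0.9, 0.1], "bird", FINE_LABELS))   # refined within the "bird" domain
print(recognize([0.9, 0.1], "chair", FINE_LABELS))  # no fine-grained vocabulary, stays coarse
```

Keeping the coarse label as the fallback means objects outside the fine-grained domains still receive an answer, which mirrors the abstract's claim that the MLLM covers general categories while the CLIP model handles domain-specific ones.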
