PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration
Chen Duan, Zhentao Guo, Pei Fu, Zining Wang, Kai Zhou, Pengfei Yan

TL;DR
PositionOCR is a hybrid model that combines text spotting specialists with large language models to improve positional reasoning in multi-modal tasks such as text grounding and text spotting, outperforming traditional MLLMs on these tasks with only 131M trainable parameters.
Contribution
This paper introduces PositionOCR, a parameter-efficient hybrid architecture that integrates specialist positional modules with LLMs for improved multi-modal visual-text reasoning.
Findings
Outperforms traditional MLLMs in text grounding and spotting tasks
Uses only 131M trainable parameters for efficient training
Demonstrates strong multi-modal processing capabilities
Abstract
In recent years, Multi-modal Large Language Models (MLLMs) have achieved strong performance in OCR-centric Visual Question Answering (VQA) tasks, illustrating their capability to process heterogeneous data and exhibit adaptability across varied contexts. However, these MLLMs rely on a Large Language Model (LLM) as the decoder, which is primarily designed for linguistic processing, and thus inherently lacks the positional reasoning required for precise visual tasks, such as text spotting and text grounding. Additionally, the extensive parameters of MLLMs necessitate substantial computational resources and large-scale data for effective training. Conversely, text spotting specialists achieve state-of-the-art coordinate predictions but lack semantic reasoning capabilities. This dichotomy motivates our key research question: Can we synergize the efficiency of specialists with the contextual…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques
