PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration

Chen Duan; Zhentao Guo; Pei Fu; Zining Wang; Kai Zhou; Pengfei Yan

arXiv:2602.19188 · cs.CV · February 24, 2026

TL;DR

PositionOCR is a hybrid model that combines text spotting specialists with large language models to improve positional reasoning in multi-modal tasks such as text grounding and text spotting. It outperforms traditional MLLMs on these tasks while using only 131M trainable parameters.

Contribution

This paper introduces PositionOCR, a parameter-efficient hybrid architecture that integrates specialist positional modules with LLMs for improved multi-modal visual-text reasoning.

Findings

01. Outperforms traditional MLLMs in text grounding and spotting tasks

02. Uses only 131M trainable parameters for efficient training

03. Demonstrates strong multi-modal processing capabilities

Abstract

In recent years, Multi-modal Large Language Models (MLLMs) have achieved strong performance in OCR-centric Visual Question Answering (VQA) tasks, illustrating their capability to process heterogeneous data and exhibit adaptability across varied contexts. However, these MLLMs rely on a Large Language Model (LLM) as the decoder, which is primarily designed for linguistic processing, and thus inherently lacks the positional reasoning required for precise visual tasks, such as text spotting and text grounding. Additionally, the extensive parameters of MLLMs necessitate substantial computational resources and large-scale data for effective training. Conversely, text spotting specialists achieve state-of-the-art coordinate predictions but lack semantic reasoning capabilities. This dichotomy motivates our key research question: Can we synergize the efficiency of specialists with the contextual…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques