HAUR: Human Annotation Understanding and Recognition Through Text-Heavy Images

Yuchen Yang; Haoran Yan; Yanhao Chen; Qingqiang Wu; Qingqi Hong

arXiv:2412.18327 · cs.CV · December 25, 2024

TL;DR

This paper introduces HAUR, a new task and dataset focused on understanding human annotations in text-heavy images for vision question answering, along with a model that outperforms existing approaches.

Contribution

The paper presents the HAUR task, the HAUR-5 dataset, and the OCR-Mix model, addressing limitations of current models in understanding human annotations on text-heavy images.

Findings

1. OCR-Mix outperforms the other models compared on the HAUR task.

2. The HAUR-5 dataset covers five common types of human annotation.

3. The authors state that the dataset and model will be publicly released.

Abstract

Vision Question Answering (VQA) tasks use images to convey critical information to answer text-based questions, which is one of the most common forms of question answering in real-world scenarios. Numerous vision-text models exist today and have performed well on certain VQA tasks. However, these models exhibit significant limitations in understanding human annotations on text-heavy images. To address this, we propose the Human Annotation Understanding and Recognition (HAUR) task. As part of this effort, we introduce the Human Annotation Understanding and Recognition-5 (HAUR-5) dataset, which encompasses five common types of human annotations. Additionally, we developed and trained our model, OCR-Mix. Through comprehensive cross-model comparisons, our results demonstrate that OCR-Mix outperforms other models in this task. Our dataset and model will be released soon.

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Handwritten Text Recognition Techniques