HAUR: Human Annotation Understanding and Recognition Through Text-Heavy Images
Yuchen Yang, Haoran Yan, Yanhao Chen, Qingqiang Wu, Qingqi Hong

TL;DR
This paper introduces HAUR, a new task and dataset focused on understanding human annotations in text-heavy images for vision question answering, along with a model that outperforms existing approaches.
Contribution
The paper presents the HAUR task, the HAUR-5 dataset, and the OCR-Mix model, addressing limitations of current models in understanding human annotations on text-heavy images.
Findings
In cross-model comparisons, OCR-Mix outperforms the other models on the HAUR task.
The HAUR-5 dataset covers five common annotation types.
The dataset and model will be publicly released.
Abstract
Vision Question Answering (VQA) tasks use images to convey critical information to answer text-based questions, which is one of the most common forms of question answering in real-world scenarios. Numerous vision-text models exist today and have performed well on certain VQA tasks. However, these models exhibit significant limitations in understanding human annotations on text-heavy images. To address this, we propose the Human Annotation Understanding and Recognition (HAUR) task. As part of this effort, we introduce the Human Annotation Understanding and Recognition-5 (HAUR-5) dataset, which encompasses five common types of human annotations. Additionally, we developed and trained our model, OCR-Mix. Through comprehensive cross-model comparisons, our results demonstrate that OCR-Mix outperforms other models in this task. Our dataset and model will be released soon.
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Handwritten Text Recognition Techniques
