Describe Anything Model for Visual Question Answering on Text-rich Images

Yen-Linh Vu; Dinh-Thang Duong; Truong-Binh Duong; Anh-Khoi Nguyen; Thanh-Huy Nguyen; Le Thien Phuc Nguyen; Jianhua Xing; Xingjian Li; Tianyang Wang; Ulas Bagci; Min Xu

arXiv:2507.12441·cs.CV·August 5, 2025

Open Access · 1 Repo · 1 Dataset

TL;DR

The paper introduces DAM-QA, a novel framework that leverages the region-aware Describe Anything Model for improved visual question answering on text-rich images, especially those with dense textual information.

Contribution

It develops DAM-QA, a new approach that enhances VQA performance on text-rich images by utilizing DAM's region-aware capabilities with a specialized aggregation mechanism.
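The summary above does not say how the aggregation mechanism works. As a purely illustrative sketch, assuming the region-aware model is queried once per image region and returns one short answer string per region, a simple majority vote could combine those answers. The function names, the `unanswerable` abstain token, and the voting rule are all hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so near-identical answers match."""
    return " ".join(answer.lower().strip().split())


def aggregate_answers(region_answers, abstain="unanswerable"):
    """Majority vote over per-region answers (hypothetical aggregation rule).

    region_answers: list of answer strings, one per queried region.
    abstain: assumed token a region returns when it cannot answer.
    """
    votes = Counter()
    first_seen = {}  # normalized answer -> first original surface form
    for raw in region_answers:
        key = normalize(raw)
        if not key or key == abstain:
            continue  # abstaining or empty regions do not vote
        votes[key] += 1
        first_seen.setdefault(key, raw.strip())
    if not votes:
        return abstain
    # max() keeps the earliest-seen answer when vote counts tie
    best_key, _ = max(votes.items(), key=lambda kv: kv[1])
    return first_seen[best_key]


if __name__ == "__main__":
    print(aggregate_answers(["Paris", "paris ", "unanswerable", "London"]))
```

Under these assumptions, regions that contain no relevant text abstain instead of diluting the vote, which is one reason a region-level scheme might help on dense documents.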

Findings

01. Outperforms the baseline DAM by more than 7 points on DocVQA

02. Achieves the best performance among region-aware models while using fewer parameters

03. Narrows the gap with strong generalist VLMs

Abstract

Recent progress has been made in region-aware vision-language modeling, particularly with the emergence of the Describe Anything Model (DAM). DAM is capable of generating detailed descriptions of any specific image areas or objects without the need for additional localized image-text alignment supervision. We hypothesize that such region-level descriptive capability is beneficial for the task of Visual Question Answering (VQA), especially in challenging scenarios involving images with dense text. In such settings, the fine-grained extraction of textual information is crucial to producing correct answers. Motivated by this, we introduce DAM-QA, a framework with a tailored evaluation protocol, developed to investigate and harness the region-aware capabilities from DAM for the text-rich VQA problem that requires reasoning over text-based information within images. DAM-QA incorporates a…

Code & Models

Repositories

Linvyl/DAM-QA
PyTorch · Official

Datasets

VLAI-AIVN/DAM-QA-annotations
dataset · 10 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling