ROSA: Addressing text understanding challenges in photographs via ROtated SAmpling

Hernán Maina; Guido Ivetta; Mateo Lione Stuto; Julian Martin Eisenschlos; Jorge Sánchez; Luciana Benotti

arXiv:2506.03665·cs.CL·June 5, 2025

TL;DR

ROSA (ROtated SAmpling) is a decoding strategy that improves VQA performance on text-rich images with misaligned text, such as the photos visually impaired users commonly take.

Contribution

The paper introduces ROSA, a decoding method that addresses incorrectly oriented text in VQA inputs, a challenge that existing benchmarks, built mostly from well-oriented photos taken by sighted users, under-represent.

Findings

1. ROSA outperforms Greedy decoding by 11.7 absolute points in the best-performing model.

2. It improves VQA performance on text-rich images with rotated or misaligned text.

3. The approach could help visually impaired users interpret text in their surroundings.

Abstract

Visually impaired people could benefit from Visual Question Answering (VQA) systems to interpret text in their surroundings. However, current models often struggle with recognizing text in the photos taken by this population. Through in-depth interviews with visually impaired individuals, we identified common framing conventions that frequently result in misaligned text. Existing VQA benchmarks primarily feature well-oriented text captured by sighted users, under-representing these challenges. To address this gap, we introduce ROtated SAmpling (ROSA), a decoding strategy that enhances VQA performance in text-rich images with incorrectly oriented text. ROSA outperforms Greedy decoding by 11.7 absolute points in the best-performing model.
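
The abstract names the strategy but does not spell out its steps. As a rough illustration only, the sketch below shows one plausible reading of "rotated sampling": query a VQA model on several rotated copies of the photo and keep the answer the model scores highest. Everything here is an assumption rather than the paper's algorithm: the function names `rotate90` and `rotated_sampling`, the restriction to quarter turns, the max-score selection rule, and the toy `toy_model` that stands in for a real VQA model returning an answer with a confidence score.

```python
from typing import Callable, List, Sequence, Tuple

# Toy stand-in for a photo: rows of pixel values (assumption for illustration).
Image = List[List[int]]
# Hypothetical model interface: (image, question) -> (answer, confidence score).
AnswerFn = Callable[[Image, str], Tuple[str, float]]


def rotate90(image: Image, turns: int) -> Image:
    """Rotate a grid clockwise by `turns` quarter turns."""
    for _ in range(turns % 4):
        image = [list(row) for row in zip(*image[::-1])]
    return image


def rotated_sampling(
    image: Image,
    question: str,
    answer_fn: AnswerFn,
    turns: Sequence[int] = (0, 1, 2, 3),
) -> Tuple[str, int]:
    """Query the model on each rotated copy and keep the highest-scoring answer.

    Returns the chosen answer and the clockwise rotation (in degrees) that
    produced it. The max-score rule is an assumed selection criterion.
    """
    candidates = []
    for t in turns:
        answer, score = answer_fn(rotate90(image, t), question)
        candidates.append((score, answer, t))
    _, best_answer, best_turn = max(candidates, key=lambda c: c[0])
    return best_answer, best_turn * 90


def toy_model(image: Image, question: str) -> Tuple[str, float]:
    """Pretend the text is readable only when the marker pixel is top-left."""
    if image[0][0] == 1:
        return "EXIT", 0.9
    return "unreadable", 0.1


upright = [[1, 0], [0, 0]]
photo = rotate90(upright, 1)  # simulate a photo taken a quarter turn off
result = rotated_sampling(photo, "What does the sign say?", toy_model)
```

In this toy setup a single greedy pass on `photo` would return "unreadable", while the rotated variant recovers "EXIT" after a further 270 degrees of clockwise rotation. A real system would replace `toy_model` with an actual VQA model and would need a principled way to compare scores across rotations.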

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Text Readability and Simplification