ROSA: Addressing text understanding challenges in photographs via ROtated SAmpling
Hernán Maina, Guido Ivetta, Mateo Lione Stuto, Julian Martin Eisenschlos, Jorge Sánchez, Luciana Benotti

TL;DR
ROSA (ROtated SAmpling) is a decoding strategy that improves VQA performance on text-rich images with misaligned text, a common problem in photos taken by visually impaired users.
Contribution
The paper introduces ROSA, a decoding method for VQA on images with incorrectly oriented text. Existing benchmarks under-represent this challenge because they mostly feature well-oriented text captured by sighted users.
Findings
ROSA outperforms Greedy decoding by 11.7 absolute points in the best-performing model.
It improves VQA performance on images with rotated or misaligned text.
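In-depth interviews with visually impaired individuals revealed framing conventions that frequently result in misaligned text, so the approach could help this population interpret text in their surroundings.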
Abstract
Visually impaired people could benefit from Visual Question Answering (VQA) systems to interpret text in their surroundings. However, current models often struggle with recognizing text in the photos taken by this population. Through in-depth interviews with visually impaired individuals, we identified common framing conventions that frequently result in misaligned text. Existing VQA benchmarks primarily feature well-oriented text captured by sighted users, under-representing these challenges. To address this gap, we introduce ROtated SAmpling (ROSA), a decoding strategy that enhances VQA performance in text-rich images with incorrectly oriented text. ROSA outperforms Greedy decoding by 11.7 absolute points in the best-performing model.
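The abstract does not spell out how ROSA decodes, so the following is only a minimal sketch of one plausible reading of "rotated sampling": query the model once per 90-degree rotation of the image and keep the answer the model scores highest. Everything here is an assumption for illustration, not the authors' implementation: the `answer_with_score` callable, the choice of four rotations, the highest-score selection rule, and the toy grid used as an image.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]  # toy stand-in for a pixel grid (assumption)


def rotate90(image: Image) -> Image:
    """Rotate a row-major grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotated_sampling(
    image: Image,
    question: str,
    answer_with_score: Callable[[Image, str], Tuple[str, float]],
    num_rotations: int = 4,
) -> Tuple[str, float, int]:
    """Decode once per rotation and keep the highest-scoring answer.

    `answer_with_score` is a hypothetical wrapper around a VQA model that
    returns (answer, confidence score), e.g. a sequence log-probability.
    Returns (answer, score, clockwise rotation in degrees).
    """
    best = None
    current = image
    for k in range(num_rotations):
        answer, score = answer_with_score(current, question)
        if best is None or score > best[1]:
            best = (answer, score, 90 * k)
        current = rotate90(current)  # next candidate orientation
    return best


def toy_model(image: Image, question: str) -> Tuple[str, float]:
    """Hypothetical model: confident only when the grid is upright."""
    if image[0][0] == 1:
        return ("EXIT", -0.1)
    return ("??", -5.0)


# The upright grid is [[1, 2], [3, 4]]; this input is that grid rotated
# 90 degrees clockwise, mimicking a photo with misaligned text.
misaligned = [[3, 1], [4, 2]]
result = rotated_sampling(misaligned, "What does the sign say?", toy_model)
```

In this toy setting, greedy decoding on the input as given would return the low-confidence answer, while the rotated variant recovers the upright orientation after three further clockwise rotations. A real system would replace `toy_model` with an actual VQA model and might use a different selection rule.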
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Text Readability and Simplification
