TextMatcher: Cross-Attentional Neural Network to Compare Image and Text
Valentina Arrigoni, Luisa Repele, Dario Marino Saccavino

TL;DR
TextMatcher is a neural network that uses cross-attention to decide whether the text in a single-line image matches a candidate transcription. On the IAM dataset it reports higher performance and faster inference than a baseline and models designed for related problems.
Contribution
The paper introduces the first machine-learning model specifically designed for text matching between an image and a candidate transcription. The model compares the two inputs by applying cross-attention over their embedding representations and is trained end to end.
Findings
Outperforms a baseline and existing models designed for related problems on the IAM dataset.
Achieves higher performance across a variety of configurations.
Runs faster at inference time than the compared models.
Abstract
We study a novel multimodal-learning problem, which we call text matching: given an image containing a single-line text and a candidate text transcription, the goal is to assess whether the text represented in the image corresponds to the candidate text. We devise the first machine-learning model specifically designed for this problem. The proposed model, termed TextMatcher, compares the two inputs by applying a cross-attention mechanism over the embedding representations of image and text, and it is trained in an end-to-end fashion. We extensively evaluate the empirical performance of TextMatcher on the popular IAM dataset. Results attest that, compared to a baseline and existing models designed for related problems, TextMatcher achieves higher performance on a variety of configurations, while at the same time running faster at inference time. We also showcase TextMatcher in a…
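The cross-attention comparison described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the choice of text embeddings as queries and image features as keys and values, the element-wise comparison, the mean pooling, and the names `cross_attention`, `match_probability`, `w`, and `b` are all assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_emb, image_emb):
    """Scaled dot-product cross-attention (hypothetical helper).

    text_emb:  (T, d) one embedding per character, used as queries.
    image_emb: (W, d) one feature vector per image column, used as keys and values.
    """
    d = text_emb.shape[-1]
    scores = text_emb @ image_emb.T / np.sqrt(d)  # (T, W) similarity scores
    weights = softmax(scores, axis=-1)            # each row sums to 1
    attended = weights @ image_emb                # (T, d) image content per character
    return attended, weights

def match_probability(text_emb, image_emb, w, b):
    # Assumed scoring head: compare each character embedding with the image
    # content it attends to, average over characters, then apply a
    # logistic layer to obtain a match probability.
    attended, _ = cross_attention(text_emb, image_emb)
    features = (attended * text_emb).mean(axis=0)  # (d,)
    logit = features @ w + b
    return 1.0 / (1.0 + np.exp(-logit))

# Random stand-ins for the outputs of an image encoder and a text encoder.
rng = np.random.default_rng(0)
T, W, d = 5, 12, 8
text_emb = rng.normal(size=(T, d))
image_emb = rng.normal(size=(W, d))
w = rng.normal(size=d)

attended, weights = cross_attention(text_emb, image_emb)
p = match_probability(text_emb, image_emb, w, 0.0)
print(attended.shape, weights.shape)
```

In an end-to-end setting, the encoders and the scoring head would be trained jointly on matching and non-matching image-text pairs; the sketch only shows the forward pass.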
Taxonomy
Topics: Handwritten Text Recognition Techniques · Topic Modeling · Natural Language Processing Techniques
