# Learning to Associate Words and Images Using a Large-scale Graph

**Authors:** Heqing Ya, Haonan Sun, Jeffrey Helt, Tai Sing Lee

arXiv: 1705.07768 · 2017-05-23

## TL;DR

This paper introduces an unsupervised method that uses a large-scale graph to associate words and images. Applied to the image captcha of China's railroad system, it reaches 77% accuracy in 2 seconds on average, without teaching labels or pre-trained recognition networks.

## Contribution

The authors develop an unsupervised association learning approach built on a large graph. It links deformed Chinese phrases to the images they co-occur with, without teaching labels or pre-trained networks.

## Key findings

- Solved captchas with 77% accuracy in 2 seconds on average
- Constructed an association graph of over 6 million vertices from 2.6 million captchas containing over 21 million images
- Demonstrated the effectiveness of unsupervised association learning for practical tasks
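
The abstract attributes the graph's links to co-occurrence information and the principle of suspicious coincidence, that is, events that appear together more often than chance would predict. A minimal sketch of that idea follows. It assumes each phrase and image has already been reduced to a discrete id, which the paper does not do (it works from raw deformed phrases and low-resolution images); the function name `coincidence_scores` and the toy ids are hypothetical.

```python
from collections import Counter


def coincidence_scores(captchas):
    """Score phrase/image pairs by how far co-occurrence exceeds chance.

    `captchas` is a list of (phrase_id, [image_id, ...]) tuples. The ids are
    illustrative stand-ins, not the paper's vertex representation.
    The score is P(phrase, image) / (P(phrase) * P(image)); values above 1
    mean the pair co-occurs more often than independence would predict.
    """
    n = len(captchas)
    phrase_count = Counter()
    image_count = Counter()
    pair_count = Counter()
    for phrase, images in captchas:
        phrase_count[phrase] += 1
        for image in set(images):  # count each image once per captcha
            image_count[image] += 1
            pair_count[(phrase, image)] += 1

    scores = {}
    for (phrase, image), joint in pair_count.items():
        # Expected number of joint appearances if the two were independent.
        expected = phrase_count[phrase] * image_count[image] / n
        scores[(phrase, image)] = joint / expected
    return scores


toy = [
    ("tea", ["cup", "dog"]),
    ("tea", ["cup", "dog"]),
    ("cat", ["kitten", "dog"]),
    ("cat", ["kitten", "dog"]),
]
scores = coincidence_scores(toy)
print(scores[("tea", "cup")])  # 2.0: "cup" appears only alongside "tea"
print(scores[("tea", "dog")])  # 1.0: "dog" appears everywhere, so no signal
```

In this toy data the distractor image that shows up in every captcha scores exactly 1, while the image that appears only with one phrase scores 2, which is the kind of contrast an edge weight would need to capture.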

## Abstract

We develop an approach for unsupervised learning of associations between co-occurring perceptual events using a large graph. We applied this approach to successfully solve the image captcha of China's railroad system. The approach is based on the principle of suspicious coincidence. In this particular problem, a user is presented with a deformed picture of a Chinese phrase and eight low-resolution images. They must quickly select the relevant images in order to purchase their train tickets. This problem presents several challenges: (1) the teaching labels for both the Chinese phrases and the images were not available for supervised learning, (2) no pre-trained deep convolutional neural networks are available for recognizing these Chinese phrases or the presented images, and (3) each captcha must be solved within a few seconds. We collected 2.6 million captchas, with 2.6 million deformed Chinese phrases and over 21 million images. From these data, we constructed an association graph, composed of over 6 million vertices, and linked these vertices based on co-occurrence information and feature similarity between pairs of images. We then trained a deep convolutional neural network to learn a projection of the Chinese phrases onto a 230-dimensional latent space. Using label propagation, we computed the likelihood of each of the eight images conditioned on the latent space projection of the deformed phrase for each captcha. The resulting system solved captchas with 77% accuracy in 2 seconds on average. Our work, in answering this practical challenge, illustrates the power of this class of unsupervised association learning techniques, which may be related to the brain's general strategy for associating language stimuli with visual objects on the principle of suspicious coincidence.
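
The abstract names label propagation as the step that turns the graph into a ranking of the eight candidate images, but this summary does not give the update rule. The sketch below is therefore a generic propagation-with-restart scheme, not the authors' formulation. It assumes a small weighted graph stored as nested dictionaries and seeds a single phrase vertex directly, whereas the paper conditions on a 230-dimensional latent projection of the phrase. The names `propagate`, `alpha`, and the vertex ids are hypothetical.

```python
def propagate(adjacency, seeds, alpha=0.8, iterations=50):
    """Spread seed scores over a weighted undirected graph.

    adjacency: {vertex: {neighbour: weight}}, symmetric, every neighbour
               also present as a key.
    seeds:     {vertex: initial score} for the vertices known up front.
    Each round applies
        score <- alpha * weighted mean of neighbour scores
                 + (1 - alpha) * seed score.
    """
    scores = {v: seeds.get(v, 0.0) for v in adjacency}
    for _ in range(iterations):
        updated = {}
        for v, neighbours in adjacency.items():
            total = sum(neighbours.values())
            if total:
                mean = sum(w * scores[u] for u, w in neighbours.items()) / total
            else:
                mean = 0.0
            updated[v] = alpha * mean + (1 - alpha) * seeds.get(v, 0.0)
        scores = updated
    return scores


# Toy graph: co-occurrence edges (phrase to image) and a similarity edge
# (image to image). All vertices and weights are made up for illustration.
graph = {
    "phrase:tea": {"img:cup_a": 2.0},
    "img:cup_a": {"phrase:tea": 2.0, "img:cup_b": 1.0},
    "img:cup_b": {"img:cup_a": 1.0},
    "phrase:cat": {"img:dog": 2.0},
    "img:dog": {"phrase:cat": 2.0},
}
scores = propagate(graph, {"phrase:tea": 1.0})
candidates = ["img:cup_b", "img:dog"]
print(max(candidates, key=scores.get))  # img:cup_b
```

Here `img:cup_b` never co-occurred with the phrase itself, yet it receives a positive score through its similarity edge to `img:cup_a`, while `img:dog` stays at zero because it sits in a disconnected component. That indirect reach is the reason for combining co-occurrence edges with feature-similarity edges in one graph.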

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1705.07768/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/1705.07768/full.md

## References

29 references — full list in the complete paper: https://tomesphere.com/paper/1705.07768/full.md

---
Source: https://tomesphere.com/paper/1705.07768