Disentanglement and Compositionality of Letter Identity and Letter Position in Variational Auto-Encoder Vision Models

Bruno Bianchi; Aakash Agrawal; Stanislas Dehaene; Emmanuel Chemla; Yair Lakretz

arXiv:2412.10446·cs.CV·December 17, 2024

TL;DR

This paper investigates whether deep neural models can disentangle letter identity and position in written words, revealing current models' limitations compared to human capabilities and introducing a new benchmark for evaluation.

Contribution

The study introduces CompOrth, a novel benchmark for assessing disentanglement and compositionality in visual models of orthography, and evaluates the performance of beta-VAE models on this benchmark.

Findings

01 Models effectively disentangle surface features like retinal location.

02 Models fail to disentangle letter position from letter identity.

03 Models lack understanding of word length and compositional structure.

Abstract

Human readers can accurately count how many letters are in a word (e.g., 7 in "buffalo"), remove a letter from a given position (e.g., "bufflo") or add a new one. The human brain of readers must have therefore learned to disentangle information related to the position of a letter and its identity. Such disentanglement is necessary for the compositional, unbounded, ability of humans to create and parse new strings, with any combination of letters appearing in any positions. Do modern deep neural models also possess this crucial compositional ability? Here, we tested whether neural models that achieve state-of-the-art on disentanglement of features in visual input can also disentangle letter position and letter identity when trained on images of written words. Specifically, we trained beta variational autoencoder ($\beta$-VAE) to reconstruct images of letter strings and evaluated…
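For readers unfamiliar with the $\beta$-VAE objective mentioned in the abstract, the following is a minimal sketch of its general form: a reconstruction term plus a KL term weighted by $\beta$. It is not the authors' implementation. The squared-error reconstruction term, the default `beta=4.0`, and the function names `gaussian_kl` and `beta_vae_loss` are all assumptions made for illustration; the excerpt does not state the paper's actual choices.

```python
import math


def gaussian_kl(mu, logvar):
    """KL divergence from N(mu, diag(exp(logvar))) to N(0, I).

    Summed over latent dimensions; mu and logvar are equal-length lists.
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Generic beta-VAE loss for one flattened image.

    Assumption: squared error stands in for the reconstruction term;
    beta=4.0 is an arbitrary illustrative value, not taken from the paper.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + beta * gaussian_kl(mu, logvar)


# Hypothetical toy input: a 2-pixel "image" and a 1-dimensional latent code.
loss = beta_vae_loss([1.0, 0.0], [0.5, 0.0], mu=[1.0], logvar=[0.0], beta=2.0)
```

Setting `beta` above 1 penalizes the latent code more heavily for deviating from the standard normal prior, which is the pressure usually credited with encouraging disentangled latent dimensions.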

Taxonomy

Topics: Image Retrieval and Classification Techniques · Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction

Methods: Sparse Evolutionary Training