TextTeacher: What Can Language Teach About Images?

Tobias Christian Nauen; Stanislav Frolov; Brian Bernhard Moser; Federico Raue; Ahmed Anwar; Andreas Dengel

arXiv:2605.22098 · cs.CV · May 22, 2026

TL;DR

TextTeacher uses a pre-trained, frozen text encoder to inject semantic knowledge into vision models: text embeddings of image captions guide image classification training, improving accuracy and efficiency while leaving the inference-time model unchanged.

Contribution

The paper introduces TextTeacher, a simple auxiliary objective that uses text embeddings to enhance vision model training and outperforms vision knowledge distillation with minimal overhead.

Findings

1. TextTeacher improves ImageNet accuracy by up to +2.7 percentage points (p.p.) with standard ViT backbones.

2. It yields consistent transfer gains, averaging +1.0 p.p., under the same recipe and compute.

3. It outperforms vision knowledge distillation in accuracy and speed.

Abstract

The platonic representation hypothesis suggests that sufficiently large models converge to a shared representation geometry, even across modalities. Motivated by this, we ask: Can the semantic knowledge of a language model efficiently improve a vision model? As an answer, we introduce TextTeacher, a simple auxiliary objective that injects text embeddings as additional information into image classification training. TextTeacher uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged. On ImageNet with standard ViT backbones, TextTeacher improves accuracy by up to +2.7 percentage points (p.p.) and yields consistent transfer gains (on average +1.0 p.p.) under the same recipe and compute. It outperforms vision…
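To make the idea of an auxiliary text objective concrete, here is a minimal NumPy sketch of one way such a loss could be combined with the usual classification loss. It is not the authors' implementation. The cosine-distance alignment term, the direction of the projection (text embeddings mapped into the image feature space), the weight `aux_weight`, and every function and parameter name below are assumptions made for illustration; the abstract only states that captions, a frozen text encoder, and a lightweight projection produce semantic anchors used during training.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy for logits of shape (B, C) and integer labels of shape (B,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def cosine_alignment(image_feats, anchors, eps=1e-8):
    """Mean cosine distance (1 - cosine similarity) between paired rows.

    ASSUMPTION: cosine distance is a stand-in; the paper's actual
    alignment term is not given in the abstract.
    """
    a = image_feats / (np.linalg.norm(image_feats, axis=1, keepdims=True) + eps)
    b = anchors / (np.linalg.norm(anchors, axis=1, keepdims=True) + eps)
    return (1.0 - (a * b).sum(axis=1)).mean()


def text_anchor_loss(logits, labels, image_feats, text_embs, proj_w, aux_weight=0.5):
    """Classification loss plus a weighted text-anchor alignment term.

    text_embs: (B, D_text) caption embeddings from a frozen text encoder,
               treated here as constant inputs.
    proj_w:    (D_text, D_img) hypothetical lightweight linear projection.
    Returns (total, cross_entropy, alignment) so each part can be logged.
    """
    anchors = text_embs @ proj_w  # semantic anchors in image feature space
    ce = softmax_cross_entropy(logits, labels)
    aux = cosine_alignment(image_feats, anchors)
    return ce + aux_weight * aux, ce, aux


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text_embs = rng.normal(size=(4, 6))    # placeholder caption embeddings
    proj_w = rng.normal(size=(6, 3))       # placeholder projection weights
    image_feats = rng.normal(size=(4, 3))  # placeholder image features
    logits = rng.normal(size=(4, 5))
    labels = np.array([0, 1, 2, 3])
    total, ce, aux = text_anchor_loss(logits, labels, image_feats, text_embs, proj_w)
    print(f"total={total:.4f} ce={ce:.4f} aux={aux:.4f}")
```

Because the alignment term only enters the training loss, the text encoder and the projection can be dropped after training in a setup like this, which matches the abstract's statement that the inference-time model is left unchanged.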

Code & Models

Repositories

https://nauen-it.de/publications/text-teacher
