TL;DR
TextTeacher uses text embeddings from a frozen language model to inject semantic knowledge into vision models during training, improving image classification accuracy and efficiency while leaving the inference-time model unchanged.
Contribution
Introduces TextTeacher, a simple auxiliary objective that uses text embeddings to enhance vision model training, outperforming vision knowledge distillation with minimal overhead.
Findings
TextTeacher improves ImageNet accuracy by up to +2.7 percentage points.
It yields consistent transfer gains averaging +1.0 percentage point.
It outperforms vision knowledge distillation in both accuracy and speed.
Abstract
The platonic representation hypothesis suggests that sufficiently large models converge to a shared representation geometry, even across modalities. Motivated by this, we ask: Can the semantic knowledge of a language model efficiently improve a vision model? As an answer, we introduce TextTeacher, a simple auxiliary objective that injects text embeddings as additional information into image classification training. TextTeacher uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged. On ImageNet with standard ViT backbones, TextTeacher improves accuracy by up to +2.7 percentage points (p.p.) and yields consistent transfer gains (on average +1.0 p.p.) under the same recipe and compute. It outperforms vision…
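The training setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the form of the auxiliary loss, so the cosine-distance alignment term, the direction of the projection (text embedding mapped into the image feature space), the weight `aux_weight`, and all function names here are assumptions made for the example. The only properties taken from the abstract are that the text embedding comes from a frozen encoder, that a lightweight projection produces a semantic anchor, and that the auxiliary term is used only during training.

```python
import math


def softmax_cross_entropy(logits, label):
    """Standard classification loss: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def project(matrix, vec):
    """Lightweight linear projection (hypothetical): matrix @ vec."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def cosine_distance(a, b):
    """1 - cosine similarity; about 0 when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-12)


def training_loss(logits, label, image_feature, text_embedding,
                  proj, aux_weight=0.5):
    """Classification loss plus an assumed text-alignment auxiliary term.

    text_embedding is treated as a constant: it would come from a frozen
    text encoder applied to the image caption, so only the vision model
    and the projection would receive gradients in a real training loop.
    """
    anchor = project(proj, text_embedding)  # semantic anchor
    ce = softmax_cross_entropy(logits, label)
    aux = cosine_distance(image_feature, anchor)
    return ce + aux_weight * aux


# Toy example: 3-class logits, 2-d image feature, 3-d caption embedding.
logits = [2.0, 0.5, -1.0]
image_feature = [0.6, 0.8]
text_embedding = [3.0, 4.0, 5.0]
proj = [[1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0]]  # hypothetical 2x3 projection matrix

loss = training_loss(logits, 0, image_feature, text_embedding, proj)
```

At inference time only the vision model's forward pass producing `logits` would be needed; the caption, text encoder, and projection are dropped, which is consistent with the abstract's claim that the inference-time model is unchanged.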
