Does learning the right latent variables necessarily improve in-context learning?
Sarthak Mittal, Eric Elmoznino, Leo Gagnon, Sangnie Bhardwaj, Tom Marty, Dhanya Sridhar, Guillaume Lajoie

TL;DR
This paper investigates whether explicitly inferring task-relevant latent variables in Transformers improves in-context learning, finding that such structured approaches do not significantly enhance out-of-distribution performance or robustness.
Contribution
The study introduces a minimal bottleneck modification to Transformers to explicitly extract task latents and compares it with standard models, revealing limited benefits for generalization.
Findings
Bottleneck effectively extracts task latents from context.
No significant out-of-distribution performance gain from latent inference.
Transformers struggle to utilize inferred latents for robust prediction.
Abstract
Large autoregressive models like Transformers can solve tasks through in-context learning (ICL) without learning new weights, suggesting avenues for efficiently solving new tasks. For many tasks, e.g., linear regression, the data factorizes: examples are independent given a task latent that generates the data, e.g., linear coefficients. While an optimal predictor leverages this factorization by inferring task latents, it is unclear if Transformers implicitly do so or if they instead exploit heuristics and statistical shortcuts enabled by attention layers. Both scenarios have inspired active ongoing work. In this paper, we systematically investigate the effect of explicitly inferring task latents. We minimally modify the Transformer architecture with a bottleneck designed to prevent shortcuts in favor of more structured solutions, and then compare performance against standard…
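The factorization the abstract mentions can be made concrete with its own linear regression example. The sketch below is illustrative and is not the paper's code: it assumes a standard Gaussian prior over the coefficients and Gaussian observation noise, and the names `sample_task`, `infer_latent`, `noise_std`, and `prior_std` are hypothetical. It shows what "inferring the task latent" means for a predictor that exploits the factorization: first estimate the coefficients from the context, then predict the query from that estimate alone.

```python
import numpy as np

def sample_task(rng, dim, n_context, noise_std=0.1):
    """Draw one regression task: a latent w, then examples that are
    independent of each other given w."""
    w = rng.standard_normal(dim)               # task latent (assumed N(0, I) prior)
    X = rng.standard_normal((n_context, dim))  # context inputs
    y = X @ w + noise_std * rng.standard_normal(n_context)
    return w, X, y

def infer_latent(X, y, noise_std=0.1, prior_std=1.0):
    """Posterior mean of w under the Gaussian prior and noise assumed above
    (equivalent to ridge regression)."""
    dim = X.shape[1]
    precision = X.T @ X / noise_std**2 + np.eye(dim) / prior_std**2
    return np.linalg.solve(precision, X.T @ y / noise_std**2)

rng = np.random.default_rng(0)
w, X, y = sample_task(rng, dim=3, n_context=50)

# Step 1: infer the task latent from the context.
w_hat = infer_latent(X, y)

# Step 2: predict the query using only the inferred latent.
x_query = rng.standard_normal(3)
y_pred = x_query @ w_hat

print("max coefficient error:", np.abs(w_hat - w).max())
```

Here the prediction depends on the context only through `w_hat`, which is the kind of structured solution a bottleneck is meant to encourage. A standard Transformer has no such constraint and may attend to context examples directly.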
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Statistics Education and Methodologies
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
