Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA
Hai Huang, Yann LeCun, Randall Balestriero

TL;DR
This paper introduces Semantic Tube Prediction (STP), a geometric regularizer that improves language model data efficiency by constraining hidden states to geodesic paths, enabling models to achieve comparable accuracy with significantly less training data.
Contribution
The paper proposes STP, a novel geometric regularizer based on the Geodesic Hypothesis, which improves LLM data efficiency and challenges the data-efficiency bounds implied by existing scaling laws.
Findings
STP improves signal-to-noise ratio in LLM training.
Models with STP match baseline accuracy using 16x less data.
Models trained with STP exceed the data-efficiency bounds implied by existing scaling laws.
Abstract
Large Language Models (LLMs) obey consistent scaling laws -- empirical power-law fits that predict how loss decreases with compute, data, and parameters. While predictive, these laws are descriptive rather than prescriptive: they characterize typical training, not optimal training. Surprisingly few works have successfully challenged the data-efficiency bounds implied by these laws -- which is our primary focus. To that end, we introduce the Geodesic Hypothesis, positing that token sequences trace geodesics on a smooth semantic manifold and are therefore locally linear. Building on this principle, we propose a novel Semantic Tube Prediction (STP) task, a JEPA-style regularizer that confines hidden-state trajectories to a tubular neighborhood of the geodesic. STP generalizes JEPA to language without requiring explicit multi-view augmentations. We show this constraint improves…
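To make the "tubular neighborhood" idea concrete, here is a minimal NumPy sketch of a tube-style penalty. It is not the paper's STP objective, which this summary does not spell out. It assumes the locally linear geodesic can be approximated by the straight line through the first and last hidden states of a short window, and it penalizes only the part of each state's perpendicular distance to that line that exceeds a chosen radius. The function name `tube_penalty`, the `radius` parameter, and the squared-hinge form are all illustrative assumptions.

```python
import numpy as np


def tube_penalty(hidden, radius=0.0):
    """Mean squared excess distance of hidden states from a straight-line 'tube'.

    Illustrative assumption, not the paper's loss: the tube axis is the
    infinite line through hidden[0] and hidden[-1].

    hidden: array of shape (T, d), one hidden state per token position.
    radius: tube radius; deviations smaller than this are not penalized.
    """
    hidden = np.asarray(hidden, dtype=float)
    start = hidden[0]
    chord = hidden[-1] - start
    length = np.linalg.norm(chord)
    offsets = hidden - start

    if length == 0.0:
        # Degenerate window: both endpoints coincide, so measure the
        # distance to that single point instead of to a line.
        dist = np.linalg.norm(offsets, axis=1)
    else:
        unit = chord / length
        along = offsets @ unit                    # signed position along the axis
        perp = offsets - np.outer(along, unit)    # component orthogonal to the axis
        dist = np.linalg.norm(perp, axis=1)

    excess = np.maximum(dist - radius, 0.0)       # only deviations outside the tube count
    return float(np.mean(excess ** 2))


# States on a straight line incur no penalty.
straight = np.linspace(0.0, 1.0, 5)[:, None] * np.array([1.0, 2.0, 3.0])
print(tube_penalty(straight))

# A trajectory that bends away from the line is penalized unless the tube is wide enough.
bent = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
print(tube_penalty(bent), tube_penalty(bent, radius=1.0))
```

In a training setup, a term like this would presumably be added to the usual next-token loss with a weighting coefficient; that combination, and the choice of window and radius, are likewise assumptions here.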
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Topic Modeling · Multimodal Machine Learning Applications
