Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA

Hai Huang; Yann LeCun; Randall Balestriero

arXiv:2602.22617·cs.LG·February 27, 2026

Open Access

TL;DR

This paper introduces Semantic Tube Prediction (STP), a geometric regularizer that confines a language model's hidden-state trajectories to a tubular neighborhood of a geodesic. The constraint improves data efficiency: models trained with STP reach accuracy comparable to the baseline with 16x less training data.

Contribution

The paper proposes STP, a geometric regularizer derived from the Geodesic Hypothesis. The regularizer improves LLM data efficiency and challenges the data-efficiency bounds implied by existing scaling laws.

Findings

01  STP improves the signal-to-noise ratio in LLM training.

02  Models with STP match baseline accuracy using 16x less data.

03  STP violates the data-efficiency bounds of existing scaling laws.
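
To make the size of a 16x saving concrete, here is a minimal arithmetic sketch. It assumes a generic power law in data, loss = floor + coeff / tokens^exponent. This page does not state the form or the fitted constants used in the paper, so the function names and every number below are hypothetical. The finding itself is reported in terms of accuracy, while this sketch reasons about loss.

```python
def loss(tokens, floor, coeff, exponent):
    """Generic power-law fit in data (assumed form, not taken from the paper)."""
    return floor + coeff / tokens ** exponent


def equivalent_coeff(coeff, exponent, data_factor):
    """Coefficient that yields the same loss with `data_factor` times fewer tokens.

    Solving floor + coeff / D**b == floor + c / (D / k)**b for c
    gives c = coeff / k**b.
    """
    return coeff / data_factor ** exponent


# Hypothetical constants, chosen only for illustration.
FLOOR, COEFF, EXPONENT = 1.7, 400.0, 0.3

baseline = loss(1.6e9, FLOOR, COEFF, EXPONENT)
efficient = loss(1.0e8, FLOOR, equivalent_coeff(COEFF, EXPONENT, 16), EXPONENT)

# Fraction of the baseline coefficient left after a 16x data saving.
shrink = equivalent_coeff(COEFF, EXPONENT, 16) / COEFF
```

Under these assumed constants, matching the baseline loss with one sixteenth of the tokens requires the data coefficient to shrink by a factor of 16^0.3, which is roughly 2.3. A fit with a fixed coefficient cannot express that shift, which is one way to read the claim that the bound is violated.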

Abstract

Large Language Models (LLMs) obey consistent scaling laws -- empirical power-law fits that predict how loss decreases with compute, data, and parameters. While predictive, these laws are descriptive rather than prescriptive: they characterize typical training, not optimal training. Surprisingly few works have successfully challenged the data-efficiency bounds implied by these laws -- which is our primary focus. To that end, we introduce the Geodesic Hypothesis, positing that token sequences trace geodesics on a smooth semantic manifold and are therefore locally linear. Building on this principle, we propose a novel Semantic Tube Prediction (STP) task, a JEPA-style regularizer that confines hidden-state trajectories to a tubular neighborhood of the geodesic. STP generalizes JEPA to language without requiring explicit multi-view augmentations. We show this constraint improves…
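
The abstract describes the geometric idea only in words, so the following is a rough sketch of what a tube-style penalty could look like. It is not the paper's loss. It assumes that local linearity can be approximated by the straight chord between the first and last hidden state of a short window, and it penalizes interior states that drift farther than a chosen radius from that chord. The names `tube_penalty`, `states`, and `radius` are hypothetical.

```python
import math


def tube_penalty(states, radius=0.0):
    """Mean squared excess distance of interior states from the chord.

    `states` is a list of equal-length vectors (lists of floats). The chord
    is the straight segment joining the first and last state. Distances up to
    `radius` are free; anything beyond that is penalized quadratically.
    """
    if len(states) < 3:
        return 0.0
    start, end = states[0], states[-1]
    chord = [e - s for s, e in zip(start, end)]
    chord_sq = sum(c * c for c in chord)
    if chord_sq == 0.0:
        return 0.0

    total = 0.0
    for h in states[1:-1]:
        offset = [x - s for x, s in zip(h, start)]
        # Position of the closest point on the chord, clamped to the segment.
        t = sum(o * c for o, c in zip(offset, chord)) / chord_sq
        t = min(1.0, max(0.0, t))
        residual = [o - t * c for o, c in zip(offset, chord)]
        dist = math.sqrt(sum(r * r for r in residual))
        excess = max(0.0, dist - radius)
        total += excess * excess
    return total / (len(states) - 2)


straight = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
bent = [[0.0, 0.0], [1.0, 2.0], [2.0, 0.0]]

print(tube_penalty(straight))          # 0.0
print(tube_penalty(bent))              # 4.0
print(tube_penalty(bent, radius=0.5))  # 2.25
```

In a training setup, a term like this would presumably be added to the next-token loss with a small weight. Whether the paper uses a chord, a learned predictor, or another construction is not stated on this page.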

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Topic Modeling · Multimodal Machine Learning Applications