Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds

Yunbei Xu; Yuzhe Yuan; Ruohan Zhan

arXiv:2605.12316·cs.LG·May 13, 2026

TL;DR

This paper analyzes autoregressive sequence learning under model misspecification, measured in joint KL divergence, and characterizes how the sequence horizon \(H\) affects the approximation and estimation errors.

Contribution

It gives what the authors describe as, to their knowledge, the first complete characterization of long-horizon error behavior under joint KL, with matching upper and lower bounds and a horizon-free approximation factor.

Findings

01

Joint KL admits a horizon-free approximation factor, unlike Hellinger-based metrics.

02

The estimation error has a fundamental lower bound of order \(H\), which matches the upper bounds.

03

Joint KL guarantees imply policy learning regret bounds similar to existing imitation learning results.

Abstract

We study the fundamental and timely problem of learning long sequences in autoregressive modeling and next-token prediction under model misspecification, measured by the joint Kullback–Leibler (KL) divergence. Our goal is to characterize how the sequence horizon \(H\) affects both approximation and estimation errors in this joint-distribution, sequence-level regime. By establishing matching upper and lower bounds, we provide, to our knowledge, the first complete characterization of long-horizon error behavior under the natural joint KL objective, with improved rates and optimality justification relative to existing work. On the approximation side, we show that joint KL admits a horizon-free approximation factor, in sharp contrast to Hellinger-based analyses that exhibit an \(\Omega(H)\) dependence for computationally efficient methods; this isolates the choice of divergence as the…
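
For readers unfamiliar with the term, the joint KL divergence between two sequence distributions can be written out using the standard chain rule for KL. The notation below is an assumption made for illustration and is not taken from the paper: \(P\) and \(Q\) are distributions over sequences \(x_{1:H}\), \(x_{<h}\) denotes the prefix \(x_1, \dots, x_{h-1}\), and the direction of the divergence (which distribution comes first) may differ from the paper's convention.

\[
\mathrm{KL}\big(P(x_{1:H}) \,\|\, Q(x_{1:H})\big)
= \sum_{h=1}^{H} \mathbb{E}_{x_{<h} \sim P}\Big[ \mathrm{KL}\big(P(\cdot \mid x_{<h}) \,\|\, Q(\cdot \mid x_{<h})\big) \Big]
\]

Under this identity the sequence-level divergence is a sum of \(H\) expected next-token divergences, which is one way to see why the horizon \(H\) enters the analysis of next-token prediction.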
