Agnostic Language Identification and Generation

Mikael Møller Høgsgaard; Chirag Pabbaraju

arXiv:2601.23258·cs.LG·April 23, 2026

TL;DR

This paper studies language identification and generation without the realizability assumption, placing no restrictions on the distribution of the input data, and obtains novel characterizations and nearly tight rates in this more general agnostic setting.

Contribution

It relaxes the realizability assumption entirely for both language identification and generation, proposing objectives for the agnostic setting and obtaining new characterizations with nearly tight statistical rates.

Findings

01

Develops objectives for agnostic language identification and generation.

02

Provides novel characterizations of both problems in the agnostic setting.

03

Achieves nearly tight statistical rates.

Abstract

Recent works on language identification and generation have established tight statistical rates at which these tasks can be achieved. These works typically operate under a strong realizability assumption: that the input data is drawn from an unknown distribution necessarily supported on some language in a given collection. In this work, we relax this assumption of realizability entirely, and impose no restrictions on the distribution of the input data. We propose objectives to study both language identification and generation in this more general "agnostic" setup. Across both problems, we obtain novel interesting characterizations and nearly tight rates.
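
As a reading aid, the contrast between the two settings described in the abstract can be written out as below. The notation is assumed here for illustration and may differ from the paper's own: $\mathcal{X}$ denotes the domain of strings, $\mathcal{C}$ the given collection of languages (each a subset of $\mathcal{X}$), and $\mathcal{D}$ the unknown distribution of the input data. The paper's specific agnostic objectives are not stated in the abstract and are not reproduced here.

```latex
% Hypothetical notation: X = domain, C = collection of languages, D = input distribution.
\[
\text{Realizable:}\quad \exists\, L^\star \in \mathcal{C}\ \text{ with }\ \Pr_{x \sim \mathcal{D}}\bigl[x \in L^\star\bigr] = 1,
\qquad
\text{Agnostic:}\quad \mathcal{D}\ \text{is an arbitrary distribution over } \mathcal{X}.
\]
```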
