MLRegTest: A Benchmark for the Machine Learning of Regular Languages

Sam van der Poel; Dakotah Lambert; Kalina Kostyszyn; Tiantian Gao; Rahul Verma; Derek Andersen; Joanne Chau; Emily Peterson; Cody St. Clair; Paul Fodor; Chihiro Shibata; Jeffrey Heinz

arXiv:2304.07687 · cs.LG · September 4, 2024 · 1 citation

TL;DR

MLRegTest is a comprehensive benchmark for evaluating machine learning models on sequence classification tasks involving 1,800 regular languages, focusing on their ability to learn long-distance dependencies.

Contribution

Introduces MLRegTest, a new benchmark with diverse formal languages organized by logical complexity to systematically assess ML models' capacity to learn long-distance dependencies.

Findings

1. Performance varies significantly across neural network architectures.
2. Long-distance dependencies pose a challenge for ML systems.
3. Different language classes affect model generalization.

Abstract

Synthetic datasets constructed from formal languages allow fine-grained examination of the learning and generalization capabilities of machine learning systems for sequence classification. This article presents MLRegTest, a new benchmark for machine learning systems on sequence classification, which contains training, development, and test sets from 1,800 regular languages. Different kinds of formal languages represent different kinds of long-distance dependencies, and correctly identifying long-distance dependencies in sequences is a known obstacle to successful generalization by ML systems. MLRegTest organizes its languages according to their logical complexity (monadic second order, first order, propositional, or monomial expressions) and the kind of logical literals (string, tier-string, subsequence, or combinations thereof). The logical complexity and choice of literal…
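To make the three kinds of literals named in the abstract concrete, here is a minimal Python sketch of one common reading of them: a string literal as a contiguous substring, a tier-string literal as a contiguous substring after all symbols outside a chosen tier are erased, and a subsequence literal as an ordered but possibly non-contiguous match. The function names, the example alphabet, and the toy language in `accepts` are illustrative assumptions and are not taken from the MLRegTest code or data.

```python
def has_substring(word: str, factor: str) -> bool:
    """String literal: `factor` occurs as a contiguous block of `word`."""
    return factor in word


def project_tier(word: str, tier: set) -> str:
    """Erase every symbol of `word` that is not on the tier."""
    return "".join(ch for ch in word if ch in tier)


def has_tier_substring(word: str, factor: str, tier: set) -> bool:
    """Tier-string literal: `factor` is contiguous once off-tier symbols are erased."""
    return factor in project_tier(word, tier)


def has_subsequence(word: str, factor: str) -> bool:
    """Subsequence literal: symbols of `factor` occur in order, at any distance."""
    remaining = iter(word)
    # `ch in remaining` consumes the iterator up to the first match,
    # so later symbols must be found strictly after earlier ones.
    return all(ch in remaining for ch in factor)


def accepts(word: str) -> bool:
    """Hypothetical toy language: no 'b' may follow an 'a' at any distance."""
    return not has_subsequence(word, "ab")


if __name__ == "__main__":
    word = "acccb"
    print(has_substring(word, "ab"))                  # contiguous match only
    print(has_tier_substring(word, "ab", {"a", "b"}))  # 'c' is erased first
    print(has_subsequence(word, "ab"))                # distance is ignored
    print(accepts(word))
```

In this sketch the word `acccb` contains no `ab` substring, yet it does contain `ab` both on the tier `{a, b}` and as a subsequence. A classifier that only inspects adjacent symbols would therefore mislabel it under the toy language, which is the kind of long-distance dependency the benchmark is designed to probe.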

Code & Models

Repositories

heinz-jeffrey/subregular-learning · tf · Official

Taxonomy

Topics: Natural Language Processing Techniques · Machine Learning in Bioinformatics · Topic Modeling

Methods: Test · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Gated Recurrent Unit