Filtered Corpus Training (FiCT) Shows that Language Models can Generalize from Indirect Evidence

Abhinav Patil; Jaap Jumelet; Yu Ying Chiu; Andy Lapastora; Peter Shen; Lexie Wang; Clevis Willrich; Shane Steinert-Threlkeld

arXiv:2405.15750 · cs.CL · August 8, 2024

TL;DR

This study introduces Filtered Corpus Training to evaluate whether language models can make linguistic generalizations from indirect evidence. Both LSTM and Transformer models generalize well, and equally well, even though Transformers achieve better perplexity.

Contribution

The paper presents Filtered Corpus Training, a method that trains language models on corpora from which certain linguistic constructions have been filtered out, in order to measure how well the models generalize to those constructions from indirect evidence.

Findings

01 Transformers outperform LSTMs in perplexity.

02 Both models perform equally well on linguistic generalization.

03 Language models can generalize from indirect evidence.
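The first finding compares models by perplexity. As a reminder of what that number means, here is a minimal sketch of the standard definition (the exponential of the negative mean per-token log-probability). The function name `perplexity` and the toy inputs are illustrative assumptions, not code from the paper or its repository.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp(-mean(log p)) for natural-log token probabilities.

    Hypothetical helper for illustration only; lower values mean the
    model assigns higher probability to the observed tokens.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_log_prob = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-mean_log_prob)


# Toy input (an assumption): a model that assigns probability 0.25
# to every one of 8 tokens.
uniform_log_probs = [math.log(0.25)] * 8
ppl = perplexity(uniform_log_probs)
```

A model that spreads probability uniformly over four choices at each step has a perplexity of about 4, which is why lower perplexity is read as a better language model.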

Abstract

This paper introduces Filtered Corpus Training, a method that trains language models (LMs) on corpora with certain linguistic constructions filtered out from the training data, and uses it to measure the ability of LMs to perform linguistic generalization on the basis of indirect evidence. We apply the method to both LSTM and Transformer LMs (of roughly comparable size), developing filtered corpora that target a wide range of linguistic phenomena. Our results show that while transformers are better qua LMs (as measured by perplexity), both models perform equally and surprisingly well on linguistic generalization measures, suggesting that they are capable of generalizing from indirect evidence.
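The filtering step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the regular expression, the `contains_target` predicate, and the toy corpus are all hypothetical, and the pattern is only a stand-in for whatever detector identifies a targeted construction.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical stand-in for a targeted construction: existential "there".
# The pattern is an illustrative assumption, not a filter from the paper.
EXISTENTIAL_THERE = re.compile(r"\bthere\s+(is|are|was|were)\b", re.IGNORECASE)


def contains_target(sentence: str) -> bool:
    """Return True if the sentence matches the illustrative construction."""
    return EXISTENTIAL_THERE.search(sentence) is not None


def filter_corpus(
    sentences: Iterable[str], is_target: Callable[[str], bool]
) -> List[str]:
    """Keep only the sentences that do NOT contain the targeted construction."""
    return [s for s in sentences if not is_target(s)]


# Toy corpus (an assumption) standing in for the training data.
corpus = [
    "There are three books on the table.",
    "The books are on the table.",
    "She said there was a problem.",
    "Nobody noticed the problem.",
]
filtered = filter_corpus(corpus, contains_target)
```

A model trained only on `filtered` never sees the construction directly, so any correct behavior on held-out examples of it has to come from indirect evidence in the remaining sentences.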

Code & Models

Repositories

clmbrs/corpus-filtering (Official)

Videos

Filtered Corpus Training (FiCT) Shows that Language Models can Generalize from Indirect Evidence · underline

Taxonomy

Topics: Topic Modeling

Methods: Attention Is All You Need · Sigmoid Activation · Linear Layer · Tanh Activation · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention