TL;DR
This paper evaluates pre-trained language models on eight reasoning tasks to understand their symbolic reasoning capabilities, revealing differences among models and limitations in their reasoning abilities, especially in abstract contexts.
Contribution
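It introduces an evaluation protocol that combines zero-shot evaluation (no fine-tuning) with a comparison of the fine-tuned LM's learning curve against the learning curves of multiple controls, in order to separate what the pre-trained representations contribute from what fine-tuning on the task data contributes.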
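As a rough illustration of how zero-shot results and learning curves might be read together, the Python sketch below compares a fine-tuned LM's accuracy at several training-set sizes with a control's accuracy at the same sizes. The function names, the chance baseline, the choice of summary statistics, and all accuracy values are hypothetical; they are not the paper's metrics or results.

```python
from typing import Dict, List, Sequence


def learning_curve_gap(lm_acc: Sequence[float],
                       control_acc: Sequence[float]) -> List[float]:
    """Accuracy gap (LM minus control) at each training-set size."""
    if len(lm_acc) != len(control_acc):
        raise ValueError("curves must be measured at the same training sizes")
    return [lm - ctrl for lm, ctrl in zip(lm_acc, control_acc)]


def summarize(zero_shot_acc: float,
              lm_curve: Sequence[float],
              control_curve: Sequence[float],
              chance: float) -> Dict[str, float]:
    """Combine a zero-shot score with a learning-curve comparison.

    These summary statistics are illustrative assumptions only.
    """
    gaps = learning_curve_gap(lm_curve, control_curve)
    return {
        # How far the untuned model sits above random guessing.
        "zero_shot_above_chance": zero_shot_acc - chance,
        # Average advantage of the LM over the control across sizes.
        "mean_gap": sum(gaps) / len(gaps),
        # Best accuracy the fine-tuned LM reaches on the curve.
        "max_lm_acc": max(lm_curve),
    }


# Made-up numbers for a two-way multiple-choice task (chance = 0.5).
report = summarize(
    zero_shot_acc=0.75,
    lm_curve=[0.5, 0.75, 1.0],
    control_curve=[0.5, 0.5, 0.5],
    chance=0.5,
)
print(report)
```

Under this reading, a large zero-shot margin or a large gap at small training sizes would point to the pre-trained representations, while a gap that only opens with many examples would point to fine-tuning.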
Findings
RoBERTa outperforms BERT in reasoning tasks.
Models are context-dependent and lack abstract reasoning.
Half of the reasoning tasks see complete failure across models.
Abstract
Recent success of pre-trained language models (LMs) has spurred widespread interest in the language capabilities that they possess. However, efforts to understand whether LM representations are useful for symbolic reasoning tasks have been limited and scattered. In this work, we propose eight reasoning tasks, which conceptually require operations such as comparison, conjunction, and composition. A fundamental challenge is to understand whether the performance of a LM on a task should be attributed to the pre-trained representations or to the process of fine-tuning on the task data. To address this, we propose an evaluation protocol that includes both zero-shot evaluation (no fine-tuning), as well as comparing the learning curve of a fine-tuned LM to the learning curve of multiple controls, which paints a rich picture of the LM capabilities. Our main findings are that: (a) different LMs…
Taxonomy
Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · RoBERTa · Dense Connections · Adam · WordPiece
