Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning
William T. Redman, Erik C. Johnson, Brian Robinson

TL;DR
This paper studies how the Transformer models BERT and ALBERT perform in continual learning. It finds that BERT learns shortcut solutions that limit generalization, while ALBERT's recurrent structure shows more promise for continual compositional reasoning.
Contribution
The study extends the LEGO framework to a continual learning setting ("continual LEGO"), systematically compares feedforward and recurrent Transformers, and analyzes how well each learns and generalizes across experiences.
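For readers unfamiliar with LEGO (Learning Equality and Group Operations), the following is a minimal sketch of what a LEGO-style task instance might look like. It assumes the commonly described form of the task: a chain of variable assignments over the sign group {+1, -1}, presented in shuffled order, where every variable must be resolved. The function names `make_lego_chain` and `resolve` are hypothetical and this is not the authors' code. How the paper splits such tasks into continual-learning experiences is not specified in this summary, so it is not modeled here.

```python
import random
import string


def make_lego_chain(n_vars, rng):
    """Build one LEGO-style chain over the sign group {+1, -1}.

    Assumed format (illustrative only): the first clause assigns a
    constant, e.g. "a=+1"; each later clause refers to the previous
    variable, e.g. "b=-a". Clauses are returned in shuffled order.
    """
    names = rng.sample(string.ascii_lowercase, n_vars)
    values = {}
    clauses = []
    for i, name in enumerate(names):
        sign = rng.choice([+1, -1])
        symbol = "+" if sign > 0 else "-"
        if i == 0:
            values[name] = sign
            clauses.append(f"{name}={symbol}1")
        else:
            prev = names[i - 1]
            values[name] = sign * values[prev]
            clauses.append(f"{name}={symbol}{prev}")
    shuffled = clauses[:]
    rng.shuffle(shuffled)
    return shuffled, values


def resolve(clauses):
    """Reference solver: follow the chain until every variable has a value."""
    pending = {}
    for clause in clauses:
        lhs, rhs = clause.split("=")
        sign = 1 if rhs[0] == "+" else -1
        pending[lhs] = (sign, rhs[1:])
    values = {}
    while len(values) < len(pending):
        for var, (sign, src) in pending.items():
            if var in values:
                continue
            if src == "1":
                values[var] = sign          # root clause: constant assignment
            elif src in values:
                values[var] = sign * values[src]  # compose with resolved parent
    return values


rng = random.Random(0)
clauses, truth = make_lego_chain(6, rng)
solved = resolve(clauses)
```

The solver must step through the chain one link at a time, which is the kind of compositional computation a model has to approximate. A shortcut solution, in the sense the summary uses, would be one that gets the right answers on the training distribution without implementing this step-by-step composition.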
Findings
BERT learns shortcut solutions that hinder generalization.
ALBERT, a recurrent Transformer, learns more robust solutions.
Training strategies can improve ALBERT's continual learning performance.
Abstract
Identifying and exploiting common features across domains is at the heart of the human ability to make analogies, and is believed to be crucial for the ability to continually learn. To do this successfully, general and flexible computational strategies must be developed. While the extent to which Transformer neural network models can perform compositional reasoning has been the subject of intensive recent investigation, little work has been done to systematically understand how well these models can leverage their representations to learn new, related experiences. To address this gap, we expand the previously developed Learning Equality and Group Operations (LEGO) framework to a continual learning (CL) setting ("continual LEGO"). Using this continual LEGO experimental paradigm, we study the capability of feedforward and recurrent Transformer models to perform CL. We find that BERT, a…
