TL;DR
This paper probes how English BERT and GPT-2 represent grammatical roles in sentences where lexical cues alone are insufficient. It finds that word order becomes crucial in these non-prototypical cases and that its influence shows up in later model layers.
Contribution
The study systematically probes how much LLMs rely on word order in non-prototypical sentences, showing that grammatical role classification depends on context and not only on lexical semantics.
Findings
Early layer embeddings are mostly lexical.
Word order influences later-layer representations.
Models use context critically when lexical cues are ambiguous.
Abstract
Because meaning can often be inferred from lexical semantics alone, word order is often a redundant cue in natural language. For example, the words chopped, chef, and onion are more likely used to convey "The chef chopped the onion," not "The onion chopped the chef." Recent work has shown large language models to be surprisingly word order invariant, but crucially has largely considered natural prototypical inputs, where compositional meaning mostly matches lexical expectations. To overcome this confound, we probe grammatical role representation in English BERT and GPT-2, on instances where lexical expectations are not sufficient, and word order knowledge is necessary for correct classification. Such non-prototypical instances are naturally occurring English sentences with inanimate subjects or animate objects, or sentences where we systematically swap the arguments to make sentences…
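The argument swap described in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure, which is not specified here: the helper name `swap_arguments`, the whitespace tokenization, and the hand-supplied span indices are all assumptions made for illustration. It exchanges the subject and object spans of a tokenized sentence to produce a non-prototypical counterpart.

```python
from typing import List, Tuple

# Half-open [start, end) token indices. Hypothetical convention for this sketch.
Span = Tuple[int, int]


def swap_arguments(tokens: List[str], subj: Span, obj: Span) -> List[str]:
    """Return a copy of `tokens` with the subject and object spans exchanged.

    Assumes the subject span precedes the object span and that the two
    spans do not overlap (as in a simple English SVO clause).
    """
    (s0, s1), (o0, o1) = subj, obj
    if not (0 <= s0 < s1 <= o0 < o1 <= len(tokens)):
        raise ValueError("expected non-overlapping spans, subject before object")
    return (
        tokens[:s0]        # material before the subject
        + tokens[o0:o1]    # object moved into subject position
        + tokens[s1:o0]    # material between the arguments (verb, determiner)
        + tokens[s0:s1]    # subject moved into object position
        + tokens[o1:]      # material after the object
    )


# The example sentence from the abstract, swapping only the head nouns.
tokens = "The chef chopped the onion".split()
swapped = swap_arguments(tokens, subj=(1, 2), obj=(4, 5))
print(" ".join(swapped))  # The onion chopped the chef
```

In the swapped sentence the lexical cues still suggest that the chef is the agent, so only word order identifies "onion" as the subject. That is the kind of instance on which a grammatical role probe has to rely on position.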
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Byte Pair Encoding · Linear Warmup With Cosine Annealing · Dense Connections · Residual Connection · Weight Decay · Discriminative Fine-Tuning
