How much do contextualized representations encode long-range context?
Simeng Sun, Cheng-Ping Hsieh

TL;DR
This paper investigates how well neural language models encode long-range context, revealing differences across architectures and suggesting avenues for enhancing their long-term information processing capabilities.
Contribution
It introduces a novel metric and methodology to quantify long-range contextualization in neural language models, providing insights into their encoding capacities and architectural differences.
Findings
Recurrent models rely heavily on local context.
Hybrid models better encode entire sequence structure.
Long-range encoding varies significantly across architectures.
Abstract
We analyze contextual representations in neural autoregressive language models, emphasizing long-range contexts that span several thousand tokens. Our methodology employs a perturbation setup and the metric Anisotropy-Calibrated Cosine Similarity to capture the degree of contextualization of long-range patterns from the perspective of representation geometry. We begin the analysis with a case study on standard decoder-only Transformers, demonstrating that models with similar perplexity can exhibit markedly different downstream task performance, which can be explained by the difference in contextualization of long-range content. Next, we extend the analysis to other models, covering recent novel architectural designs and various training configurations. The representation-level results illustrate a reduced capacity for high-complexity (i.e., less compressible) sequences across architectures,…
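The abstract names the metric but does not define it here, so the following is only an illustrative sketch of one plausible reading, not the authors' formulation. It assumes that "anisotropy calibration" means subtracting a baseline (the mean pairwise cosine similarity among representations of unrelated inputs) from the raw cosine similarity between a representation computed on the original context and one computed on a perturbed context. The function names, the subtraction form, and the toy vectors are all hypothetical.

```python
import numpy as np


def cosine(a, b):
    """Plain cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def anisotropy_baseline(reps):
    """Mean pairwise cosine similarity among representations of unrelated inputs.

    Assumption: this average serves as the anisotropy estimate. In an
    anisotropic space it is well above zero even for unrelated inputs.
    """
    reps = np.asarray(reps, dtype=float)
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sims = unit @ unit.T
    off_diagonal = sims[~np.eye(len(reps), dtype=bool)]  # drop self-similarities
    return float(off_diagonal.mean())


def calibrated_similarity(h_original, h_perturbed, baseline):
    """Raw cosine similarity minus the anisotropy baseline (assumed form)."""
    return cosine(h_original, h_perturbed) - baseline


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for hidden states: a shared offset makes the space anisotropic.
    unrelated = rng.normal(size=(50, 16)) + 3.0
    baseline = anisotropy_baseline(unrelated)

    h_original = rng.normal(size=16) + 3.0
    h_perturbed = h_original + 0.1 * rng.normal(size=16)  # mildly perturbed context
    print("baseline:", baseline)
    print("calibrated:", calibrated_similarity(h_original, h_perturbed, baseline))
```

Under this reading, a calibrated score near zero would mean the perturbed-context representation is no more similar to the original than two unrelated representations are, while a score well above zero would mean the perturbation changed the representation only slightly.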
Taxonomy
Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
