CausalLM is not optimal for in-context learning
Nan Ding, Tomer Levinboim, Jialin Wu, Sebastian Goodman, Radu Soricut

TL;DR
This paper provides a theoretical analysis showing that prefix language models outperform causal language models in in-context learning, with empirical evidence confirming the superiority of prefixLM across various tasks.
Contribution
It offers a theoretical explanation for the observed empirical performance gap, demonstrating that prefixLM converges to the optimal solution while causalLM does not.
Findings
PrefixLM converges to the optimal linear regression solution.
CausalLM's convergence dynamics follow those of online gradient descent, which is not guaranteed to reach the optimal solution.
Empirical results show causalLM underperforms compared to prefixLM in all tested settings.
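The contrast between the first two findings can be illustrated with a toy least-squares problem. The sketch below is only an illustration under assumed data, not the paper's parameter construction: it compares repeated full-batch gradient steps over all examples with a single ordered pass of online updates at a fixed step size. The data, step sizes, and function names are hypothetical.

```python
# Toy comparison (illustrative only; not the paper's construction).
# All names and data below are hypothetical.

X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)]  # in-context inputs
Y = [2.0, -1.0, 1.5, 2.5]                               # targets, not exactly linear in x


def predict(w, x):
    return w[0] * x[0] + w[1] * x[1]


def mse(w):
    return sum((y - predict(w, x)) ** 2 for x, y in zip(X, Y)) / len(X)


def batch_gd(steps=200, lr=0.5):
    """Each step moves along the descent direction averaged over ALL examples."""
    w = [0.0, 0.0]
    n = len(X)
    for _ in range(steps):
        direction = [0.0, 0.0]  # negative gradient of 0.5 * MSE
        for x, y in zip(X, Y):
            err = y - predict(w, x)
            direction[0] += err * x[0] / n
            direction[1] += err * x[1] / n
        w = [w[0] + lr * direction[0], w[1] + lr * direction[1]]
    return w


def online_gd(lr=0.5):
    """One pass: each example is used once, in order, with a fixed step size."""
    w = [0.0, 0.0]
    for x, y in zip(X, Y):
        err = y - predict(w, x)
        w = [w[0] + lr * err * x[0], w[1] + lr * err * x[1]]
    return w


w_batch = batch_gd()
w_online = online_gd()
print("batch :", w_batch, mse(w_batch))
print("online:", w_online, mse(w_online))
```

On this toy data the batch iterate approaches the least-squares solution (2, -2/3), while the single online pass ends at (2, -0.5) with a higher mean squared error. The gap depends on the data order and step size, so it should be read as an intuition aid rather than as a reproduction of the paper's result.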
Abstract
Recent empirical evidence indicates that transformer-based in-context learning performs better when using a prefix language model (prefixLM), in which in-context samples can all attend to each other, compared to causal language models (causalLM), which use auto-regressive attention that prohibits in-context samples from attending to future samples. While this result is intuitive, it is not understood from a theoretical perspective. In this paper we take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction. Our analysis shows that both LM types converge to their stationary points at a linear rate, but that while prefixLM converges to the optimal solution of linear regression, causalLM convergence dynamics follow those of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of…
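The attention patterns described in the abstract can be sketched as boolean masks. This is a minimal illustration, assuming one token position per in-context sample and a hypothetical `prefix_len` parameter marking where the in-context prefix ends; the function names are not from the paper.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_mask(n, prefix_len):
    """Positions inside the prefix see the whole prefix; later positions stay causal."""
    return [[j < prefix_len or j <= i for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as rows of 'x' (may attend) and '.' (blocked)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


# Three in-context positions followed by one query position (hypothetical sizes).
print(show(causal_mask(4)))
print()
print(show(prefix_mask(4, 3)))
```

In the causal mask every row is lower triangular, so an early in-context sample never sees later ones. In the prefix mask the first `prefix_len` rows are fully filled across the prefix columns, which is the "all attend to each other" behavior the abstract attributes to prefixLM.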
Peer Reviews
Decision: ICLR 2024 poster
Primarily a study of ICL's limitations in standard transformer architectures, which use causal self-attention masking and hence causal language modeling objectives. The superiority of prefixLMs is also described, and the empirical results are interesting.
While I find the results and the paper itself interesting, as I mentioned in my summary of the paper, I'm unsure of the relevance of the problem. I think it would help improve the paper's impact if the motivations were clarified better: specifically, stating which models in the current literature actually use a PrefixLM architecture, and demonstrating or citing papers which show that these models behave qualitatively differently than a standard causalLM architecture.
1. A theoretical understanding of the different solutions found by prefix and causal LMs is provided under the linear regression setting, which seems novel. 2. The paper also provides an empirical study comparing the solutions found by prefix and causal LMs on different tasks, which verifies the theoretical intuitions.
1. The experimental setting is limited to a given number of in-context examples, which seems to naturally favor prefix LMs. Causal LMs would train the model with different numbers of in-context samples simultaneously, while prefix LMs use all possible in-context and query partitions with the same in-context length. Testing with fewer in-context examples could be beneficial to provide more comprehensive results. 2. Another unfairness in the experimental setting is that the nature of causal LMs wou
The paper follows an emerging line of papers showing the equivalence of gradient descent to in-context learning in a very specific setup where the self-attention is linear, the objective is linear regression, and the parameter matrices are hand-constructed. Abstracting away the limited setup, the paper does a good job extending the theory of von Oswald et al. The theoretical argument is clear in my opinion and the evidence supporting the main thesis of the work is convincing enough. In particu
My understanding is that this work aims at demonstrating that prefixLM is superior to causalLM for in-context learning. While I believe they do a good job at it, I am under the impression that most projects already use prefixLM when possible. For example, InstructBLIP and Llama2 use prefixLM as far as I can understand. If it is the case that most influential language models or VLMs already use prefixLM, then I am unclear about the intended impact of this work. The models used in this work are i
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
