Bitune: Leveraging Bidirectional Attention to Improve Decoder-Only LLMs
Dawid J. Kopiczko, Tijmen Blankevoort, Yuki M. Asano

TL;DR
Bitune adds bidirectional attention to prompt processing in decoder-only large language models, significantly improving performance on reasoning and language understanding tasks.
Contribution
Bitune is presented as the first method to incorporate bidirectional attention into pretrained decoder-only LLMs, improving their expressiveness and task performance without extensive retraining.
Findings
Improved performance on commonsense reasoning tasks
Enhanced arithmetic and language understanding results
Compatible with various finetuning techniques
Abstract
Decoder-only large language models typically rely solely on masked causal attention, which limits their expressiveness by restricting information flow to one direction. We propose Bitune, a method that enhances pretrained decoder-only LLMs by incorporating bidirectional attention into prompt processing. We evaluate Bitune in instruction-tuning and question-answering settings, showing significant improvements in performance on commonsense reasoning, arithmetic, and language understanding tasks. Furthermore, extensive ablation studies validate the role of each component of the method, and demonstrate that Bitune is compatible with various parameter-efficient finetuning techniques and full model finetuning.
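To make the contrast between causal and bidirectional prompt attention concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `prompt_bidirectional_mask` and `masked_attention` are hypothetical, the example uses a single attention head with random inputs, and it leaves out every other part of the method (for example the mixing ratios that the reviews mention).

```python
import numpy as np


def prompt_bidirectional_mask(prompt_len: int, total_len: int) -> np.ndarray:
    """Boolean (total_len, total_len) mask; True means query i may attend to key j.

    Hypothetical helper: prompt tokens attend to the whole prompt,
    while the remaining (generated) tokens stay strictly causal.
    """
    if not 0 <= prompt_len <= total_len:
        raise ValueError("prompt_len must lie in [0, total_len]")
    # Start from the usual causal (lower-triangular) mask.
    mask = np.tril(np.ones((total_len, total_len), dtype=bool))
    # Open up the prompt block so information flows in both directions.
    mask[:prompt_len, :prompt_len] = True
    return mask


def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     mask: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)        # block disallowed positions
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Usage: 2 prompt tokens followed by 2 answer tokens, 8-dimensional features.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))

causal = prompt_bidirectional_mask(prompt_len=0, total_len=4)
bidir = prompt_bidirectional_mask(prompt_len=2, total_len=4)

out_causal = masked_attention(q, k, v, causal)
out_bidir = masked_attention(q, k, v, bidir)
```

Under the causal mask the first prompt token can only attend to itself, so its output equals its own value vector. Under the prompt-bidirectional mask it also draws on the second prompt token. The answer rows of the mask are identical in both cases, so generation remains autoregressive.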
Peer Reviews
Decision · Submitted to ICLR 2025
- The proposed method is simple but effective.
- Comprehensive ablation results with reasonable baselines.
Some existing works focus on prefix language models (non-causal decoder-only model) that utilize bidirectional attention for processing prefixes (i.e., inputs and instructions). Representative ones include U-PaLM [1] and UniLM [2]. How does the proposed method differ from these? While these models incorporate bidirectional attention during pre-training, the proposed method applies bidirectional attention specifically during instruction tuning. What would happen if we applied simple instruction t
The paper includes many ablations that make a convincing case for the efficacy of the approach. As I was reading, I kept wondering whether the observed effect was due to X or Y rather than the bidirectional attention, and each time they had an ablation covering that possibility!
In Figure 2 they show that the initial value of the mixing ratio can cause large differences in the final average ratio once convergence is reached. Figure 3 shows some histograms of mixing ratios at different layers, and Table 6 has experiments that explore how the initial value affects performance. However, the scale of these experiments is small, and the results are inconclusive. Given the initial value causes such a large difference in the ratio, it would have been nice to see a deeper dive
1. The proposed idea addresses the innate limitation of decoder-only LLMs while preserving their generation efficiency.
2. The experiments are comprehensive, covering various LLMs and downstream tasks.
3. The ablation study helps identify the contribution of different design choices.

1. As indicated by the Related Work section, there are several prior works investigating the idea of bi-directional attention in decoder-only LMs. The novelty of this work is thus somewhat limited. It might be helpful if the authors could further discuss how their method stands out.
2. It is unclear why decoder-only LLMs would still face the limitation of representing language using causal attention given that they have undergone extensive pre-training and fine-tuning based on causal attention.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
