Bitune: Leveraging Bidirectional Attention to Improve Decoder-Only LLMs

Dawid J. Kopiczko; Tijmen Blankevoort; Yuki M. Asano

arXiv:2405.14862·cs.CL·August 29, 2025·1 cites

TL;DR

Bitune introduces bidirectional attention into the prompt processing of decoder-only large language models, improving their performance on reasoning and language understanding tasks.

Contribution

It incorporates bidirectional attention into the prompt processing of pretrained decoder-only LLMs, improving their expressiveness and task performance without extensive retraining.

Findings

01 Improved performance on commonsense reasoning tasks

02 Enhanced arithmetic and language understanding results

03 Compatible with various finetuning techniques

Abstract

Decoder-only large language models typically rely solely on masked causal attention, which limits their expressiveness by restricting information flow to one direction. We propose Bitune, a method that enhances pretrained decoder-only LLMs by incorporating bidirectional attention into prompt processing. We evaluate Bitune in instruction-tuning and question-answering settings, showing significant improvements in performance on commonsense reasoning, arithmetic, and language understanding tasks. Furthermore, extensive ablation studies validate the role of each component of the method, and demonstrate that Bitune is compatible with various parameter-efficient finetuning techniques and full model finetuning.
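To make the contrast in the abstract concrete, the sketch below builds two attention masks in plain Python: a standard causal mask, and a mask in which prompt tokens attend to the whole prompt while generated tokens remain causal. This illustrates only the general idea of bidirectional attention over a prompt; the abstract does not specify how Bitune combines it with the causal pathway, so none of that is modeled here. The function names `causal_mask`, `prompt_bidirectional_mask`, and `show` are hypothetical and chosen for illustration.

```python
def causal_mask(total_len):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(total_len)] for i in range(total_len)]


def prompt_bidirectional_mask(prompt_len, total_len):
    """Causal mask, except that prompt positions attend to the entire prompt.

    Assumption for illustration: the first `prompt_len` positions are the
    prompt and the remaining positions are generated tokens.
    """
    if not 0 <= prompt_len <= total_len:
        raise ValueError("prompt_len must be between 0 and total_len")
    return [
        [j <= i or (i < prompt_len and j < prompt_len) for j in range(total_len)]
        for i in range(total_len)
    ]


def show(mask):
    """Render a mask as text: 'x' where attention is allowed, '.' where it is not."""
    return "\n".join(
        "".join("x" if allowed else "." for allowed in row) for row in mask
    )


if __name__ == "__main__":
    print(show(causal_mask(5)))
    print()
    print(show(prompt_bidirectional_mask(3, 5)))
```

With a prompt of three tokens in a sequence of five, the first three rows of the second mask are fully filled across the prompt columns, while the last two rows match the causal mask, so generation stays autoregressive.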

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The proposed method is simple but effective.
- Comprehensive ablation results with reasonable baselines.

Weaknesses

Some existing works focus on prefix language models (non-causal decoder-only models) that use bidirectional attention for processing prefixes (i.e., inputs and instructions). Representative ones include U-PaLM [1] and UniLM [2]. How does the proposed method differ from these? While these models incorporate bidirectional attention during pre-training, the proposed method applies bidirectional attention specifically during instruction tuning. What would happen if we applied simple instruction t

Reviewer 02 · Rating 10 · Confidence 4

Strengths

The paper includes many ablations that make a convincing case for the efficacy of the approach. As I was reading, I kept wondering whether the observed effect was due to X or Y rather than the bidirectional attention, and each time they had an ablation covering that possibility!

Weaknesses

In Figure 2 they show that the initial value of the mixing ratio can cause large differences in the final average ratio once convergence is reached. Figure 3 shows some histograms of mixing ratios at different layers, and Table 6 has experiments that explore how the initial value affects performance. However, the scale of these experiments is small, and the results are inconclusive. Given the initial value causes such a large difference in the ratio, it would have been nice to see a deeper dive

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. The proposed idea addresses the innate limitation of decoder-only LLMs while preserving their generation efficiency.
2. The experiments are comprehensive, covering various LLMs and downstream tasks.
3. The ablation study helps identify the contribution of different design choices.

Weaknesses

1. As indicated by the Related Work section, there are several prior works investigating the idea of bi-directional attention in decoder-only LMs. The novelty of this work is thus somewhat limited. It might be helpful if the authors could further discuss how their method stands out.
2. It is unclear why decoder-only LLMs would still face the limitation of representing language using causal attention given that they have undergone extensive pre-training and fine-tuning based on causal attention.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications