LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

Yukang Chen; Shengju Qian; Haotian Tang; Xin Lai; Zhijian Liu; Song Han; Jiaya Jia

arXiv:2309.12307 · cs.CL · March 11, 2024 · 39 citations

TL;DR

LongLoRA introduces an efficient method for extending the context size of large language models using sparse attention and parameter-efficient fine-tuning, significantly reducing computational costs while maintaining performance.

Contribution

The paper presents a novel approach combining shifted sparse attention and LoRA for efficient long-context fine-tuning of LLMs, enabling large context extension with minimal additional training complexity.
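To make the parameter-efficient half of this combination concrete, here is a minimal NumPy sketch of a generic LoRA-style linear layer: the pre-trained weight stays frozen and only a low-rank pair of matrices is trained. This illustrates the general LoRA idea and is not the authors' implementation; the class name `LoRALinear`, the `alpha / rank` scaling, and the initialisation constants are assumptions chosen for illustration.

```python
import numpy as np


class LoRALinear:
    """Hypothetical LoRA-style layer: y = x W^T + scale * x A^T B^T."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # frozen pre-trained weight, never updated
        # Low-rank factors: A starts small and random, B starts at zero,
        # so the adapted layer initially matches the frozen layer exactly.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank  # assumed scaling convention

    def __call__(self, x):
        frozen = x @ self.weight.T
        low_rank = (x @ self.A.T) @ self.B.T
        return frozen + self.scale * low_rank

    def trainable_params(self):
        # Only A and B would receive gradients during fine-tuning.
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(6, 10))
layer = LoRALinear(W, rank=2, alpha=4.0)
x = rng.normal(size=(3, 10))
print(layer(x).shape, layer.trainable_params(), W.size)
```

With rank 2, the layer trains 2 × (6 + 10) = 32 parameters instead of the 60 in the full weight matrix; the gap widens quickly at real model sizes.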

Findings

01. Extends Llama2 7B from 4k to 100k context length.

02. Extends Llama2 70B to 32k context on a single 8x A100 machine; the shifted sparse attention itself needs only two lines of code in training.

03. Maintains similar performance to vanilla attention-based fine-tuning.

Abstract

We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs), with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on the context length of 8192 needs 16x computational costs in self-attention layers as that of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be effectively and efficiently done by sparse local attention. The proposed shifted sparse attention effectively enables context extension, leading to non-trivial computation saving with similar performance to fine-tuning with vanilla attention. Particularly, it can be implemented with only two lines of…
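The two technical claims in the abstract can be sketched in a few lines of NumPy. The first helper reproduces the 16x figure from the quadratic cost of dense self-attention. The second is a simplified, non-causal illustration of shifted sparse attention: tokens are split into groups, attention runs inside each group, and half of the heads see the sequence rolled by half a group so that information crosses group boundaries. All function names here are hypothetical, and the wrap-around at the sequence ends is left unmasked for brevity, so treat this as a sketch under those assumptions and not as the authors' code.

```python
import numpy as np


def dense_attention_cost_ratio(new_len, old_len):
    # Dense self-attention cost grows with the square of sequence length.
    return (new_len / old_len) ** 2


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def group_attention(q, k, v):
    # q, k, v: (groups, tokens_per_group, heads, head_dim); non-causal.
    scores = np.einsum("bqhd,bkhd->bhqk", q, k) / np.sqrt(q.shape[-1])
    return np.einsum("bhqk,bkhd->bqhd", softmax(scores), v)


def shifted_sparse_attention(q, k, v, group_size):
    # q, k, v: (batch, tokens, heads, head_dim)
    B, N, H, D = q.shape
    assert N % group_size == 0 and H % 2 == 0
    half, shift = H // 2, group_size // 2

    def shift_and_group(x):
        x = x.copy()
        # Roll the second half of the heads by half a group along tokens.
        x[:, :, half:] = np.roll(x[:, :, half:], -shift, axis=1)
        return x.reshape(B * N // group_size, group_size, H, D)

    out = group_attention(shift_and_group(q), shift_and_group(k), shift_and_group(v))
    out = out.reshape(B, N, H, D)
    # Roll the shifted heads back so outputs line up with the input tokens.
    out[:, :, half:] = np.roll(out[:, :, half:], shift, axis=1)
    return out


print(dense_attention_cost_ratio(8192, 2048))  # 16.0
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(2, 16, 4, 8)) for _ in range(3))
print(shifted_sparse_attention(q, k, v, group_size=4).shape)  # (2, 16, 4, 8)
```

Each group attends over only `group_size` tokens, so the attention cost per token no longer grows with the full sequence length; the shifted heads are what let neighbouring groups exchange information.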

Peer Reviews

Decision: ICLR 2024 oral

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

- The authors propose an extremely simple method that performs well and is applicable to existing pretrained models

Weaknesses

- The authors only evaluate perplexity and a retrieval setting

Reviewer 02 · Rating 8 · accept, good paper · Confidence 3

Strengths

- The proposed method builds on previous work and shows strong empirical results on long-range language modelling and a retrieval task
- The proposed approach is conceptually simple and can be implemented in a few lines of code (as demonstrated by the authors)
- The proposed approach can be combined with existing approaches for context extension such as positional interpolation
- The authors provide a detailed discussion of related work

Weaknesses

- The efficiency aspect could be more prominently discussed in the main body of the paper
- The presentation of the work could be improved. See below for suggestions

Reviewer 03 · Rating 8 · accept, good paper · Confidence 4

Strengths

(1) The method seems useful and impactful, and the evaluation is thorough with strong results. (2) The authors perform very thorough ablations and isolate key design decisions (attention shift, modifying the norm & embedding layers) that enable the method to match full fine-tuning. (3) The paper is well-written.

Weaknesses

No major weaknesses.

Code & Models

Datasets

Yukang/LongAlpaca-12k
dataset· 1.0k dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
