Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression

Haowen Hou; Fei Ma; Binwen Bai; Xinxin Zhu; Fei Yu

arXiv:2408.15491·cs.CL·August 29, 2024

TL;DR

This paper proposes Instruction-Aware Contextual Compression to filter irrelevant information in retrieval-augmented LLMs, significantly reducing costs and latency while maintaining comparable performance.

Contribution

It introduces a novel method for filtering context in retrieval-augmented LLMs, improving efficiency without sacrificing accuracy.

Findings

01 50% reduction in context-related costs
02 2.2-fold increase in inference speed
03 0.047 drop in Rouge-1 score

Abstract

Large Language Models (LLMs) have garnered widespread attention due to their remarkable performance across various tasks. To mitigate hallucinations, LLMs are often paired with a retrieval-augmented pipeline that supplies rich external knowledge and context. However, the context returned by the retriever is often inaccurate and coarse-grained. Supplying irrelevant context to an LLM can result in poorer responses, increased inference latency, and higher costs. This paper introduces a method called Instruction-Aware Contextual Compression, which filters out less informative content, thereby accelerating and enhancing the use of LLMs. The experimental results demonstrate that Instruction-Aware Contextual Compression notably reduces memory consumption and minimizes generation latency while maintaining performance levels comparable to those achieved…
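
The filtering idea described in the abstract can be illustrated with a toy sketch. This is not the paper's compressor, whose details are not given on this page: the scoring rule here (word overlap between the instruction and each retrieved sentence) is a stand-in assumption, and the names `tokenize`, `compress_context`, and `keep_ratio` are hypothetical.

```python
import re


def tokenize(text):
    """Return the set of lowercase word tokens (a stand-in for a real tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def compress_context(instruction, sentences, keep_ratio=0.5):
    """Keep the sentences most related to the instruction, in original order.

    Assumption: relevance is the fraction of instruction tokens that also
    appear in the sentence. A learned scorer would replace this in practice.
    """
    query = tokenize(instruction)
    scored = []
    for idx, sentence in enumerate(sentences):
        overlap = len(query & tokenize(sentence))
        score = overlap / len(query) if query else 0.0
        scored.append((score, idx))

    # Keep at least one sentence whenever any are available.
    n_keep = max(1, round(len(sentences) * keep_ratio))
    # Highest score first; ties are broken by original position.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:n_keep]
    keep_idx = sorted(idx for _, idx in top)
    return [sentences[i] for i in keep_idx]


if __name__ == "__main__":
    retrieved = [
        "The Eiffel Tower was built between 1887 and 1889.",
        "Paris has many cafes.",
        "The tower is made of wrought iron.",
        "Croissants are a popular breakfast.",
    ]
    for line in compress_context("When was the Eiffel Tower built?", retrieved):
        print(line)
```

With `keep_ratio=0.5`, half of the retrieved sentences are dropped before the prompt is built, which is the kind of context reduction the findings above refer to. The speed and Rouge-1 figures reported for the paper come from its own method and do not apply to this sketch.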

Code & Models

Repositories

howard-hou/instruction-aware-contextual-compressor
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need