QuickLLaMA: Query-aware Inference Acceleration for Large Language Models

Jingyao Li; Han Shi; Xin Jiang; Zhenguo Li; Hong Xu; Jiaya Jia

arXiv:2406.07528 · cs.LG · August 23, 2024

TL;DR

QuickLLaMA introduces a query-aware inference method for large language models that enhances long-context understanding and accuracy without additional training, significantly improving performance on multiple benchmarks.

Contribution

It presents Q-LLM, a system that processes long sequences by focusing on query-relevant memory, seamlessly integrating with existing LLMs and improving accuracy without extra training.
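The idea of keeping only query-relevant memory inside a fixed window can be illustrated with a toy sketch. This is not the authors' implementation: the function name `select_query_relevant_blocks`, the block-mean scoring, the dot-product relevance measure, and the parameters `block_size` and `window_size` are all assumptions chosen for illustration. It splits stored context vectors into blocks, scores each block against a query vector, and keeps the highest-scoring blocks that fit the window budget.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, used here as a stand-in relevance score."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors: List[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def select_query_relevant_blocks(
    memory: List[Sequence[float]],
    query: Sequence[float],
    block_size: int,
    window_size: int,
) -> List[int]:
    """Return token indices kept in the window (hypothetical helper).

    Blocks are ranked by the dot product between the query and the block's
    mean vector; blocks are added greedily while they fit the token budget.
    Kept indices are returned in their original order.
    """
    blocks = [
        list(range(start, min(start + block_size, len(memory))))
        for start in range(0, len(memory), block_size)
    ]
    scored = [
        (dot(mean_vector([memory[i] for i in blk]), query), idx)
        for idx, blk in enumerate(blocks)
    ]
    # Highest score first; ties broken by earlier position.
    scored.sort(key=lambda t: (-t[0], t[1]))

    budget = window_size
    chosen = []
    for _score, idx in scored:
        if len(blocks[idx]) <= budget:
            chosen.append(idx)
            budget -= len(blocks[idx])
    chosen.sort()
    return [i for idx in chosen for i in blocks[idx]]


# Toy data: 8 "tokens" with 2-dimensional vectors (made up for the example).
memory = [
    [0.0, 1.0], [0.0, 1.0],
    [1.0, 0.0], [1.0, 0.0],
    [0.0, 1.0], [0.0, 1.0],
    [0.9, 0.1], [0.8, 0.2],
]
query = [1.0, 0.0]
kept = select_query_relevant_blocks(memory, query, block_size=2, window_size=4)
print(kept)
```

With this toy data, the two blocks whose vectors point in the query's direction are kept and the rest are dropped, so the window never exceeds four tokens regardless of how long the memory grows.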

Findings

01

Improves by 7.17% over the current state-of-the-art on LLaMA3 on widely recognized benchmarks.

02

Improves by 3.26% on Mistral on $\infty$-bench.

03

Enhances performance on Needle-in-a-Haystack and BABILong tasks.

Abstract

The capacity of Large Language Models (LLMs) to comprehend and reason over long contexts is pivotal for advancements in diverse fields. Yet, they still struggle with capturing long-distance dependencies within sequences to deeply understand semantics. To address this issue, we introduce Query-aware Inference for LLMs (Q-LLM), a system designed to process extensive sequences akin to human cognition. By focusing on memory data relevant to a given query, Q-LLM can accurately capture pertinent information within a fixed window size and provide precise answers to queries. It doesn't require extra training and can be seamlessly integrated with any LLM. Q-LLM using LLaMA3 (QuickLLaMA) can read Harry Potter within 30s and accurately answer the questions. On widely recognized benchmarks, Q-LLM improved by 7.17% compared to the current state-of-the-art on LLaMA3, and by 3.26% on Mistral on the…

Code & Models

Repositories

dvlab-research/q-llm
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Data Quality and Management