Layer-Condensed KV Cache for Efficient Inference of Large Language Models

Wu, Haoyi; Tu, Kewei

(Submitted on 17 May 2024)

Abstract: Huge memory consumption has been a major bottleneck for deploying high-throughput large language models in real-world applications. In addition to the large number of parameters, the key-value (KV) cache for the attention mechanism in the transformer architecture consumes a significant amount of memory, especially when the number of layers is large for deep language models. In this paper, we propose a novel method that only computes and caches the KVs of a small number of layers, thus significantly saving memory consumption and improving inference throughput. Our experiments on large language models show that our method achieves up to 26$\times$ higher throughput than standard transformers and competitive performance in language modeling and downstream tasks. In addition, our method is orthogonal to existing transformer memory-saving techniques, so it is straightforward to integrate them with our model, achieving further improvement in inference efficiency. Our code is available at this https URL
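The abstract's memory argument can be illustrated with back-of-the-envelope arithmetic. The sketch below is not taken from the paper: the function name `kv_cache_bytes`, the model dimensions, and the choice of caching 2 of 32 layers are all hypothetical values picked for illustration. It only shows that KV cache size grows linearly with the number of layers whose KVs are cached; it does not model or reproduce the throughput figures reported above.

```python
def kv_cache_bytes(num_cached_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    The leading factor of 2 counts one key tensor and one value tensor
    per cached layer. bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * num_cached_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration (assumed, not from the paper).
config = dict(num_kv_heads=32, head_dim=128, seq_len=4096, batch_size=8)

# Standard transformer: every one of the 32 layers caches its KVs.
full = kv_cache_bytes(num_cached_layers=32, **config)

# Condensed variant: assume only 2 layers have their KVs cached.
condensed = kv_cache_bytes(num_cached_layers=2, **config)

ratio = full / condensed
print(f"full cache:      {full / 2**30:.1f} GiB")
print(f"condensed cache: {condensed / 2**30:.1f} GiB")
print(f"reduction:       {ratio:.0f}x")
```

Under these assumed numbers the cache shrinks in proportion to the number of cached layers. The freed memory can in principle be spent on larger batches, which is one way a smaller cache can translate into higher throughput.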

Comments:	Accepted to the ACL 2024 main conference
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2405.10637 [cs.CL]
	(or arXiv:2405.10637v1 [cs.CL] for this version)

Submission history

From: Haoyi Wu
[v1] Fri, 17 May 2024 08:59:46 GMT (8097kb,D)
