Title: Layer-Condensed KV Cache for Efficient Inference of Large Language Models

Authors: Haoyi Wu, Kewei Tu
Abstract: Huge memory consumption has been a major bottleneck for deploying high-throughput large language models in real-world applications. In addition to the large number of parameters, the key-value (KV) cache for the attention mechanism in the transformer architecture consumes a significant amount of memory, especially when the number of layers is large for deep language models. In this paper, we propose a novel method that only computes and caches the KVs of a small number of layers, thus significantly saving memory consumption and improving inference throughput. Our experiments on large language models show that our method achieves up to 26$\times$ higher throughput than standard transformers and competitive performance in language modeling and downstream tasks. In addition, our method is orthogonal to existing transformer memory-saving techniques, so it is straightforward to integrate them with our model, achieving further improvement in inference efficiency. Our code is available at this https URL
Comments: Accepted to ACL 2024 main conference
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2405.10637 [cs.CL]
  (or arXiv:2405.10637v1 [cs.CL] for this version)
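
The abstract's memory argument can be made concrete with a rough back-of-the-envelope calculation. The sketch below is not the authors' code. It only applies the standard observation that a KV cache stores one key tensor and one value tensor per cached layer, so its size scales linearly with the number of layers whose KVs are kept. The function name `kv_cache_bytes` and every model dimension used here (32 layers, 32 heads, head size 128, 2 cached layers) are hypothetical values chosen for illustration and are not taken from the paper.

```python
def kv_cache_bytes(num_cached_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Approximate KV cache size in bytes (hypothetical helper).

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes 16-bit floating point storage.
    """
    return (2 * num_cached_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape; these numbers are assumptions, not from the paper.
shape = dict(num_kv_heads=32, head_dim=128, seq_len=2048, batch_size=8)

# Standard transformer: every layer caches its own keys and values.
full = kv_cache_bytes(num_cached_layers=32, **shape)

# Condensed setting: only a small number of layers are cached (2 is assumed).
condensed = kv_cache_bytes(num_cached_layers=2, **shape)

ratio = full / condensed
print(f"full cache:      {full / 2**30:.2f} GiB")
print(f"condensed cache: {condensed / 2**30:.2f} GiB")
print(f"reduction:       {ratio:.0f}x")
```

This estimate covers cache memory only. The throughput figure quoted in the abstract also depends on factors such as the achievable batch size, which this sketch does not model.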

Submission history

From: Haoyi Wu
[v1] Fri, 17 May 2024 08:59:46 GMT (8097kb,D)
