NVIDIA researchers introduce Dynamic Memory Sparsification (DMS) for 8× KV cache compression in transformer LLMs

by Brenden Burgess


As demand for reasoning-heavy tasks grows, large language models (LLMs) are increasingly expected to generate longer sequences or parallel chains of reasoning. However, inference-time performance is severely limited by the memory footprint of the key-value (KV) cache, not just by the number of tokens generated. In a recent paper, researchers from NVIDIA and the University of Edinburgh introduce Dynamic Memory Sparsification (DMS), a data-efficient, retrofit-friendly method that compresses the KV cache and unlocks inference-time hyper-scaling without degrading model accuracy.

The bottleneck: the KV cache in transformer inference

Transformer-based models such as GPT, Llama, and Qwen use the KV cache to store representations of past tokens for autoregressive generation. This cache grows linearly with sequence length and width (parallel threads), consuming large amounts of GPU memory and slowing inference due to frequent memory accesses.
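To make the scale of the problem concrete, here is a rough back-of-the-envelope sketch in Python of how the cache grows; the layer count, head configuration, and fp16 storage below are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope KV cache size: 2 tensors (key and value) per layer,
# each of shape (batch, kv_heads, seq_len, head_dim), stored in fp16.
def kv_cache_bytes(num_layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    return 2 * num_layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical 7B-class configuration: 32 layers, 8 KV heads of dimension 128.
gib = kv_cache_bytes(num_layers=32, kv_heads=8, head_dim=128,
                     seq_len=32_768, batch=8) / 2**30
print(f"{gib:.1f} GiB")  # ~32 GiB just for the cache of 8 parallel 32K-token sequences
```

Doubling either the sequence length or the number of parallel reasoning threads doubles this figure, which is exactly the scaling pressure DMS targets.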

Existing techniques for optimizing the KV cache either rely on training-free heuristics, such as evicting tokens based on attention weights, or require heavy post-training retrofits such as Dynamic Memory Compression (DMC). Both have significant drawbacks: the former tends to hurt accuracy, while the latter is computationally expensive.

Dynamic Memory Sparsification (DMS): compression without compromise

Dynamic Memory Sparsification (DMS) addresses these limitations with a hybrid approach: it sparsifies the KV cache like traditional pruning methods, but does so with minimal additional training (~1,000 steps) and delayed eviction, which keeps tokens around temporarily after they are marked for removal. This design preserves important contextual information and avoids abrupt accuracy drops.

The key idea is to make eviction decisions differentiable during training using a Gumbel-sigmoid-based sampling mechanism. Tokens predicted for future eviction remain usable for a sliding-window period before being discarded, allowing the model to absorb their informational value more effectively.
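For readers who want to see what a differentiable eviction decision looks like, here is a minimal PyTorch sketch of a Gumbel-sigmoid gate with a straight-through estimator; the shapes, temperature, and the way the logits are produced are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def gumbel_sigmoid(logits, tau=1.0, hard=True):
    # Logistic noise (the difference of two Gumbel samples) makes the
    # binary keep/evict decision stochastic yet differentiable.
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)
    y_soft = torch.sigmoid((logits + noise) / tau)
    if hard:
        # Straight-through estimator: discrete 0/1 decision in the forward
        # pass, while gradients flow through the soft relaxation.
        y_hard = (y_soft > 0.5).float()
        return y_hard + (y_soft - y_soft.detach())
    return y_soft

# Hypothetical usage: one eviction logit per cached token. Tokens flagged
# here would only be dropped after a sliding-window delay, as in DMS.
eviction_logits = torch.randn(4, 128)               # (batch, cached_tokens)
evict = gumbel_sigmoid(eviction_logits, tau=0.5)    # 1 = scheduled for eviction
keep_mask = 1.0 - evict
```

At inference time, such a gate would typically drop the sampling noise and reduce to a simple threshold on the predicted logit.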

Efficient retrofitting with minimal data

Unlike DMC, which requires thousands of training steps and complex gradient-based optimization, DMS introduces no additional parameters per attention head. It reuses a small part of the attention mechanism (a single neuron) to predict eviction. This makes DMS ideal for retrofitting existing models without architectural changes.
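The paper describes the exact wiring; purely as a hypothetical illustration of how an eviction score can be obtained without adding parameters, the sketch below borrows one neuron of an existing key projection as the eviction logit and uses the remaining units as the key itself.

```python
import torch
import torch.nn as nn

class EvictionAwareKeyProjection(nn.Module):
    """Illustrative only: repurposes one neuron of an existing key
    projection as an eviction logit, adding zero new parameters."""
    def __init__(self, d_model: int, head_dim: int):
        super().__init__()
        self.k_proj = nn.Linear(d_model, head_dim, bias=False)  # unchanged layer

    def forward(self, hidden_states: torch.Tensor):
        k = self.k_proj(hidden_states)
        # Borrow the first unit of each key vector as the eviction logit;
        # the remaining head_dim - 1 units serve as the key itself.
        eviction_logit, keys = k[..., 0], k[..., 1:]
        return keys, eviction_logit  # the logit feeds the Gumbel-sigmoid gate above
```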

Empirical results show that with as few as 1K training steps, DMS can achieve 8× KV cache compression while preserving, or even improving, model performance on reasoning tasks.

Benchmark results: scaling performance at no scaling cost

The research team evaluated DMS on reasoning benchmarks such as:

  • AIME 2025 (advanced mathematics)
  • MATH 500 (mathematical problem solving)
  • GPQA Diamond (hard science QA)
  • LiveCodeBench (code generation)

Across model sizes (Qwen-R1 1.5B, 7B, and 32B), DMS improved exact-match performance by 9.1 points on AIME, 7.6 on GPQA, and 9.6 on LiveCodeBench, all within the same memory and compute budgets.

Compared to top-performing baselines such as Quest and TOVA, DMS consistently outperformed them in both KV cache read efficiency (a runtime proxy) and peak memory usage, achieving better Pareto frontiers.

General-purpose utility

DMS also holds up on non-reasoning tasks. On short-context benchmarks such as MMLU, GSM8K, and HellaSwag, DMS maintained performance at compression ratios of up to 4× with minimal degradation (~3.5 points). On long-context tasks such as Needle-in-a-Haystack and variable tracking, DMS even surpassed the vanilla models, suggesting its potential to mitigate issues such as information over-squashing in long sequences.

Conclusion

In conclusion, Dynamic Memory Sparsification (DMS) presents a practical and scalable solution for improving the inference-time efficiency of transformer-based language models. By intelligently compressing the KV cache with minimal retraining, DMS enables models to reason over longer sequences or in parallel without increasing runtime or memory requirements. Its consistent gains across a range of reasoning and general-purpose tasks highlight its versatility and effectiveness. As LLMs are increasingly deployed in resource-constrained environments, DMS offers a compelling path forward, balancing compression, accuracy, and ease of integration for real-world workloads.


Check out the paper. All credit for this research goes to the researchers of this project.



