RWKV-X Combines Sparse Attention and Recurrent Memory to Enable Efficient 1M-Token Decoding with Linear Complexity

by Brenden Burgess


LLMs built on Transformer architectures face significant scaling challenges due to their quadratic complexity in sequence length when processing long-context inputs. Methods such as linear attention models, state space models like Mamba, and linear RNNs like DeltaNet and RWKV address this problem. However, these linear architectures struggle with long-context understanding. For example, RWKV-7 (2.9B) achieves high accuracy on passkey retrieval up to 28K tokens, but suffers rapid performance degradation beyond that point. Even with continual pretraining on 128K-length data, the long-context limitations persist. This issue extends beyond RWKV to other architectures such as Mamba, representing a fundamental challenge for this class of models.

Linear-complexity language models have emerged as alternatives to Transformer-based architectures, which suffer from quadratic compute costs when processing long sequences. The RWKV model series combines Transformer-style parallelizable training with an RNN-like recurrent state representation. RWKV has evolved through multiple iterations, from the original RWKV-4 to RWKV-5, RWKV-6, and RWKV-7. Hybrid language models, including Jamba, Zamba, and MiniMax, each refine hybrid designs in their own way. In addition, Native Sparse Attention (NSA) organizes tokens into temporal blocks with three distinct attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local contextual information. Other sparse attention approaches include SeerAttention and Mixture of Block Attention (MoBA).
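To make the block-based sparse attention idea concrete, here is a minimal sketch of a single-query step: mean-pooled block summaries provide a coarse relevance score, the top-k scoring blocks contribute fine-grained tokens, and a sliding window keeps recent local context. This is an illustrative simplification, not the NSA or RWKV-X implementation (in NSA, for instance, the compressed summaries form their own attention path rather than serving only as a scoring device); the function name and parameters are hypothetical.

```python
import torch
import torch.nn.functional as F

def sparse_attention_single_query(q, K, V, block_size=64, top_k=2, window=64):
    """Toy single-query block-sparse attention (illustrative only).

    q: (d,) query for the current position
    K, V: (T, d) keys/values for the preceding context
    The query attends to (1) fine-grained tokens from the top-k blocks,
    chosen by a coarse score on mean-pooled block summaries, and
    (2) a local sliding window of recent tokens.
    """
    T, d = K.shape
    n_blocks = (T + block_size - 1) // block_size

    # Coarse path: score mean-pooled block summaries against the query.
    pad = n_blocks * block_size - T
    K_pad = F.pad(K, (0, 0, 0, pad))
    block_summaries = K_pad.view(n_blocks, block_size, d).mean(dim=1)  # (n_blocks, d)
    block_scores = block_summaries @ q                                 # (n_blocks,)

    # Selection path: keep fine-grained tokens only from the top-k blocks.
    keep = torch.zeros(T, dtype=torch.bool)
    for b in torch.topk(block_scores, k=min(top_k, n_blocks)).indices.tolist():
        keep[b * block_size : min((b + 1) * block_size, T)] = True

    # Local path: always keep a sliding window of the most recent tokens.
    keep[max(0, T - window):] = True

    # Standard softmax attention restricted to the selected tokens.
    idx = keep.nonzero(as_tuple=True)[0]
    attn = torch.softmax((K[idx] @ q) / d ** 0.5, dim=0)
    return attn @ V[idx]

# Example: 1,000 context tokens, 64-dim head.
torch.manual_seed(0)
K, V = torch.randn(1000, 64), torch.randn(1000, 64)
out = sparse_attention_single_query(torch.randn(64), K, V)
print(out.shape)  # torch.Size([64])
```

Because only a fixed number of blocks plus a local window are attended to, the per-token cost stays roughly constant as the context grows, which is the property these sparse variants trade full attention for.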

Researchers from the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Hohai University, Nanjing, Shenzhen University, and Qinghai University, Xining, have proposed a new hybrid architecture called RWKV-X that combines RWKV's efficiency for short-range context with a sparse attention mechanism for capturing long-range context. Unlike previous hybrid approaches, RWKV-X achieves linear complexity during training and constant-time complexity during inference decoding. It shows near-perfect accuracy on the 64K passkey retrieval benchmark when continually pretrained on 64K-token sequences. The model consistently outperforms previous RWKV-7 models on long-context benchmarks while maintaining strong performance on short-context tasks.

RWKV-X is a hybrid architecture that interleaves RWKV-7 blocks with sparse attention blocks. Rather than training from scratch, RWKV-X builds on existing models using an interleaved block expansion approach and a zero-initialization mechanism inspired by LLaMA Pro. Training follows a two-stage process; a minimal code sketch of the expansion and stage-one setup follows the list below:

  • First, the model is trained on short 1024-token contexts from the MiniPile dataset, freezing all parameters except the newly added blocks.
  • The second stage involves continual long-context pretraining on the ProLong-64K dataset with 64K-token sequences, processing approximately 1 billion tokens in total. During this phase, all parameters are unfrozen and optimized jointly. Training uses a Long-context Cross-Entropy (LongCE) loss, which dynamically weights tokens according to their importance.
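As a rough illustration of the interleaved block expansion and zero-initialization described above, the following PyTorch sketch inserts a new attention block after every few RWKV-7 blocks, zero-initializes its output projection so the expanded model initially behaves like the base model, and shows the stage-one freezing of base parameters. The module names are hypothetical, a dense attention layer stands in for the actual sparse attention, and this is not the authors' code.

```python
import torch.nn as nn

class SparseAttentionBlock(nn.Module):
    """Hypothetical stand-in for an RWKV-X sparse attention block."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Dense attention used here only as a placeholder for the sparse mechanism.
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)
        # Zero-init the output projection so the residual branch starts as an
        # identity mapping and the expanded model reproduces the base model.
        nn.init.zeros_(self.out_proj.weight)
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, x):
        h = self.norm(x)
        h, _ = self.attn(h, h, h)
        return x + self.out_proj(h)

def expand_with_sparse_blocks(rwkv_blocks, dim, every=4):
    """Interleave a new, zero-initialized block after every `every` RWKV-7 blocks."""
    layers = []
    for i, block in enumerate(rwkv_blocks, start=1):
        layers.append(block)
        if i % every == 0:
            layers.append(SparseAttentionBlock(dim))
    return nn.ModuleList(layers)

def freeze_base_parameters(layers):
    """Stage 1: freeze the pretrained RWKV-7 parameters, train only the new blocks."""
    for layer in layers:
        trainable = isinstance(layer, SparseAttentionBlock)
        for p in layer.parameters():
            p.requires_grad = trainable
```

Zero-initializing the residual branch's output projection lets the expanded network start from exactly the base model's function, so early training only has to learn how much the new blocks should contribute.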

Short-context evaluation shows that RWKV-X maintains competitive performance across standard benchmarks. The smaller RWKV-X (0.22B) reaches an average score of 51.0, comparable to RWKV-7's 51.8. At a larger scale, RWKV-X (3.6B) reaches 71.9, closely matching RWKV-7 (2.9B, 72.8) and Qwen2.5-3B (71.4), while exceeding Llama-3.2-3B (69.7). These results confirm RWKV-X's effectiveness as a general-purpose LLM backbone without sacrificing performance on shorter contexts. In addition, the efficiency analysis demonstrates RWKV-X's superior scaling on long sequences: at 128K tokens, RWKV-X achieves a 1.37x speedup over Flash-Attention v3, and this advantage grows as the context length increases.

In this article, the researchers introduced RWKV-X, a hybrid language model that successfully combines RWKV's efficiency for short-range dependency modeling with a novel sparse attention mechanism designed specifically for long-range context modeling. While RWKV-X shows strong performance and efficiency in long-context language modeling, several limitations remain. First, its sparse attention mechanism, which relies on top-chunk selection, uses a heuristic approach that may overlook semantically relevant dependencies. Second, in the current implementation, sparse attention decoding runs more slowly than vanilla RWKV, indicating that further engineering effort is needed to optimize performance.


Check out the Paper. Also, don't forget to follow us on Twitter.



Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible way.
