This AI paper introduces ParScale (Parallel Scaling): a parallel computation method for efficient and scalable language model deployment

by Brenden Burgess


Over time, the pursuit of better language model performance has pushed researchers to scale these models up, which generally means increasing the number of parameters or expanding their computation. As a result, the development and deployment of language models now depend heavily on the availability of substantial compute and memory resources.

Despite these advances, increasing model size or generating more tokens to improve reasoning introduces significant challenges. Parameter scaling methods such as dense scaling and mixture-of-experts scaling, which increase the number of trained weights, demand far greater memory. Inference-time scaling, on the other hand, requires models to generate longer sequences or perform multiple reasoning steps, which adds latency and slows deployment. Although effective, these approaches do not adapt to every scenario and fail to address deployment efficiency in low-resource settings such as mobile devices or embedded systems.

Researchers from Zhejiang University and the Alibaba Group have proposed a new approach called ParScale, short for parallel scaling. Rather than increasing the model's size or output length, this method increases the model's parallel computation during both training and inference. By applying several learnable transformations to the input, the model performs several forward passes in parallel and aggregates their outputs dynamically. ParScale preserves the model's original parameter count while boosting computational diversity, making it an adaptable solution for various tasks and model architectures without requiring specialized data or changes to training protocols.
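
The core recipe can be sketched in a few lines of illustrative Python. The function names and stand-in components below are placeholders rather than the authors' implementation; the point is simply that one shared model is run P times on differently transformed copies of the input, and the P outputs are combined with weights that depend on the input.

```python
import numpy as np

def parscale_forward(x, model, input_transforms, weight_fn):
    """Illustrative parallel-scaling forward pass (not the authors' code).

    The same model is applied to P differently transformed copies of the
    input, and the P outputs are combined with input-dependent weights.
    """
    stream_outputs = [model(t(x)) for t in input_transforms]  # P forward passes
    weights = weight_fn(x)                                     # P weights summing to 1
    return sum(w * out for w, out in zip(weights, stream_outputs))

# Toy demonstration with stand-in components.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))                      # stand-in "model": a single linear map
model = lambda z: W @ z
transforms = [lambda z, b=rng.normal(size=4): z + b   # P = 3 learned input shifts
              for _ in range(3)]
weight_fn = lambda z: np.ones(3) / 3             # uniform here; learned and dynamic in ParScale

x = rng.normal(size=4)
print(parscale_forward(x, model, transforms, weight_fn).shape)  # (4,)
```

In the actual method, the input transformations and the weighting function are learned rather than fixed as in this toy example.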

At the technical level, ParScale appends several distinct, learnable prefixes to the same input, producing several parallel versions of it. The model processes these simultaneously, and the outputs are aggregated with a dynamic weighted sum computed by a multilayer perceptron (MLP). This structure adds only about 0.2% extra parameters per stream, a minor addition compared with parameter scaling. The model uses prefix tuning to distinguish each parallel stream through its own key-value cache, allowing efficient memory reuse. The approach also benefits from GPU-friendly parallelization, which keeps latency low despite the additional computation. This design scales without modifying the base architecture and can even be applied to frozen pretrained models by training only the new prefix and aggregation parameters.
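
A minimal PyTorch sketch of this mechanism is shown below, under simplifying assumptions: the learned prefix is simply prepended to the input embeddings rather than injected into each attention layer's key-value cache as prefix tuning does, the streams are processed in a Python loop rather than batched, and the gating MLP operates on the streams' logits. These are illustrative choices, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParScaleWrapper(nn.Module):
    """Sketch of parallel scaling around a frozen base model.

    Only the per-stream prefixes and the aggregation MLP are trainable,
    mirroring the idea that the pretrained weights stay unchanged.
    """

    def __init__(self, base_model, d_model, vocab_size, num_streams=4, prefix_len=8):
        super().__init__()
        self.base_model = base_model                       # assumed to map embeddings -> logits
        for param in self.base_model.parameters():         # keep the pretrained weights frozen
            param.requires_grad_(False)
        self.num_streams = num_streams
        # One learnable prefix per stream (simplified: prepended to the embeddings).
        self.prefixes = nn.Parameter(0.02 * torch.randn(num_streams, prefix_len, d_model))
        # Small MLP producing a dynamic weight per stream and position from the logits.
        self.gate = nn.Sequential(
            nn.Linear(vocab_size, d_model), nn.SiLU(), nn.Linear(d_model, 1)
        )

    def forward(self, input_embeds):                        # (batch, seq, d_model)
        batch, seq, _ = input_embeds.shape
        stream_logits = []
        for p in range(self.num_streams):
            prefix = self.prefixes[p].unsqueeze(0).expand(batch, -1, -1)
            logits = self.base_model(torch.cat([prefix, input_embeds], dim=1))
            stream_logits.append(logits[:, -seq:])          # drop the prefix positions
        stacked = torch.stack(stream_logits, dim=1)         # (batch, P, seq, vocab)
        weights = F.softmax(self.gate(stacked).squeeze(-1), dim=1)   # softmax over streams
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)          # dynamic weighted sum
```

In practice, the P streams would share a single batched forward pass that differs only in per-stream key-value caches, which is what makes the extra computation GPU-friendly; the loop above is written out only for readability.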

The researchers conducted extensive experiments on models ranging from 0.5B to 4.4B parameters, with the number of parallel streams P set from 1 to 8. When trained on 42 billion tokens, models with P = 8 matched the performance of models with up to 4.4 billion parameters while requiring significantly less memory and lower latency. Specifically, for a 1.6B model, ParScale incurred a 22× smaller memory increase and a 6× smaller latency increase than parameter scaling at the same performance. On downstream tasks, ParScale delivered improvements of up to 34% on GSM8K and 23% on MMLU. Coding performance also improved markedly: models with 1.6B parameters and P = 8 achieved results comparable to a 4.4B-parameter model. The method proved effective during post-training and parameter-efficient fine-tuning as well, maintaining strong performance even when the base model's parameters were left unchanged.

This paper introduces a strategy that rethinks how language models can be scaled. Instead of inflating model size or adding inference steps, it focuses on reusing existing computation effectively. The researchers' approach addresses time and memory inefficiencies while matching or improving performance. It marks a compelling shift in scaling methodology and charts a direction for deploying advanced models in constrained environments by making efficient use of parallel computation.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, join our 95k+ ML SubReddit, and subscribe to our newsletter.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
