Video generation models have become a foundational technology for creating dynamic content by transforming text prompts into high-quality video sequences. Diffusion models, in particular, have established themselves as a leading approach for this task. These models work by starting from random noise and iteratively refining it into realistic video frames. Text-to-video (T2V) models extend this capability by incorporating temporal dynamics and aligning generated content with textual prompts, producing videos that are both visually convincing and semantically accurate. Despite advances in architecture design, such as latent diffusion models and temporally aware attention modules, a significant challenge remains: ensuring consistent, high-quality video generation across runs, particularly when the only change is the initial random noise seed. This challenge has highlighted the need for smarter, model-aware noise selection strategies that avoid unpredictable outputs and wasted computational resources.
The core problem lies in how diffusion models initialize their generation process from Gaussian noise. The specific noise seed used can have a considerable impact on final video quality, temporal coherence, and prompt fidelity. For example, the same text prompt can generate entirely different videos depending on the random seed. Current approaches often try to address this with hand-crafted noise priors or frequency-based adjustments. Methods like FreeInit and FreqPrior apply external filtering techniques, while others like PYoCo introduce structured noise priors. However, these methods rely on assumptions that may not hold across datasets or models, require multiple full sampling passes (resulting in high computational cost), and fail to exploit the model's internal attention signals, which could indicate which seeds are most promising for generation. Consequently, there is a need for a more principled, model-aware approach that can guide noise selection without incurring heavy computational penalties or relying on hand-crafted priors.
Researchers at Samsung Research introduced ANSE (Active Noise SElection for GENeration), an active noise selection framework for video diffusion models. ANSE tackles the noise selection problem by using internal model signals, specifically attention-based uncertainty estimates, to guide the selection of noise seeds. At the core of ANSE is BANSA (Bayesian Active Noise Selection via Attention), a new acquisition function that quantifies the consistency and confidence of the model's attention maps under stochastic perturbations. The research team designed BANSA to operate efficiently at inference time by approximating its computation with Bernoulli-masked attention sampling, which injects randomness directly into the attention computation without requiring multiple full forward passes. This stochastic method allows the model to estimate the stability of its attention behavior across different noise seeds and to select those that promote more confident and coherent attention maps, which are empirically linked to improved video quality.
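To illustrate the idea of Bernoulli-masked attention, the sketch below applies a Bernoulli keep-mask to the softmax attention weights of a single attention call, so that repeated calls with the same inputs give stochastically perturbed attention maps. This is a minimal illustration, not the authors' implementation: the exact point where ANSE applies the mask inside the diffusion model's attention layers may differ.

```python
import numpy as np

def bernoulli_masked_attention(q, k, v, drop_p=0.1, rng=None):
    """Scaled dot-product attention with a Bernoulli mask on the
    attention weights, injecting randomness into one forward pass.
    (Illustrative sketch; the masking point in ANSE may differ.)"""
    rng = rng or np.random.default_rng()
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    # Standard softmax over the key dimension
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    # Bernoulli mask: keep each weight with probability 1 - drop_p
    keep = rng.random(w.shape) >= drop_p
    w = w * keep
    w /= w.sum(axis=-1, keepdims=True) + 1e-12  # renormalize surviving weights
    return w @ v, w
```

Calling this K times with different mask draws yields the K stochastic attention maps over which BANSA measures consistency.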

BANSA works by evaluating the entropy of attention maps generated at specific layers during the early denoising steps. The researchers found that layer 14 for the CogVideoX-2B model and layer 19 for the CogVideoX-5B model provided sufficient correlation (above a threshold of 0.7) with the full-layer uncertainty estimate, considerably reducing computational overhead. The BANSA score is computed by comparing the average entropy of individual attention maps with the entropy of their average, where a lower BANSA score indicates higher confidence and consistency in the attention patterns. This score is used to rank candidate noise seeds from a pool of 10 (M = 10), each evaluated using 10 stochastic forward passes (K = 10). The noise seed with the lowest BANSA score is then used to generate the final video, achieving improved quality without requiring model retraining or external priors.
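Based on the description above, the score can be sketched as the gap between the entropy of the averaged attention map and the average of the individual maps' entropies, a BALD-style disagreement quantity that is zero when all K maps agree. The function names and exact sign convention here are assumptions for illustration, not the paper's code.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy of attention distributions (rows sum to 1)."""
    return -(p * np.log(p + eps)).sum(axis=axis)

def bansa_score(attn_maps):
    """BANSA-style consistency score for one noise seed.

    attn_maps: shape (K, Q, T) -- K stochastic forward passes, each
    giving a (queries x tokens) attention map at the chosen layer.
    Identical maps give ~0; disagreeing maps give a positive score
    (by concavity of entropy). Lower = more consistent/confident.
    (Sketch under stated assumptions, not the authors' implementation.)"""
    mean_map = attn_maps.mean(axis=0)        # (Q, T) average map
    h_of_mean = entropy(mean_map).mean()     # entropy of the average
    mean_of_h = entropy(attn_maps).mean()    # average per-pass entropy
    return h_of_mean - mean_of_h

def select_seed(attn_maps_per_seed):
    """Rank the M candidate seeds and return the lowest-scoring one."""
    return min(attn_maps_per_seed,
               key=lambda s: bansa_score(attn_maps_per_seed[s]))
```

In ANSE's setup, each of the M = 10 candidate seeds would be scored from K = 10 Bernoulli-masked passes, and the seed with the lowest score used for the final generation.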
On the CogVideoX-2B model, the total VBench score increased from 81.03 to 81.66 (+0.63), with a +0.48 gain in quality score and a +1.23 gain in semantic alignment. On the larger CogVideoX-5B model, ANSE raised the total VBench score from 81.52 to 81.71 (+0.25), with a +0.17 gain in quality and a +0.60 gain in semantic alignment. Notably, these improvements came with only an 8.68% increase in inference time for CogVideoX-2B and 13.78% for CogVideoX-5B. By contrast, prior methods such as FreeInit and FreqPrior required a 200% increase in inference time, making ANSE far more efficient. Qualitative evaluations also highlighted the benefits, showing that ANSE improved visual clarity, semantic coherence, and motion portrayal. For example, videos for prompts like "a koala playing the piano" and "a zebra running" showed more natural and anatomically correct motion under ANSE, while for prompts such as "explode," ANSE-generated videos captured dynamic transitions more effectively.

The research also explored different acquisition functions, comparing BANSA against random noise selection and entropy-based methods. BANSA with Bernoulli-masked attention achieved the highest total score (81.66 on CogVideoX-2B), outperforming the random (81.03) and entropy-based (81.13) baselines. The study also found that increasing the number of stochastic forward passes (K) improved performance up to K = 10, beyond which gains plateaued. Similarly, performance saturated at a noise pool size of M = 10. A control experiment in which the model deliberately selected the seeds with the highest BANSA scores led to degraded video quality, confirming that lower BANSA scores correlate with better generation results.

While ANSE improves noise selection, it does not modify the generation process itself, which means some low-BANSA seeds can still produce suboptimal videos. The team acknowledged this limitation and suggested that BANSA is best viewed as a practical surrogate for more computationally intensive alternatives, such as sampling many seeds with post-hoc filtering. They also proposed that future work could incorporate information-theoretic refinements or active learning strategies to further improve generation quality.
Several key takeaways from the research include:
- ANSE improves total VBench scores for video generation: from 81.03 to 81.66 on CogVideoX-2B and from 81.52 to 81.71 on CogVideoX-5B.
- Gains in quality and semantic alignment are +0.48 and +1.23 for CogVideoX-2B, and +0.17 and +0.60 for CogVideoX-5B, respectively.
- Inference time increases are modest: +8.68% for CogVideoX-2B and +13.78% for CogVideoX-5B.
- BANSA scores derived from Bernoulli-masked attention outperform random and entropy-based noise selection methods.
- The layer selection strategy reduces computational load by estimating uncertainty at layer 14 for CogVideoX-2B and layer 19 for CogVideoX-5B.
- ANSE achieves efficiency by avoiding multiple full sampling passes, unlike methods such as FreeInit, which require a 200% increase in inference time.
- The research confirms that low BANSA scores reliably correlate with higher video quality, making them an effective criterion for seed selection.
In conclusion, the research addressed the challenge of unpredictable video generation in diffusion models by introducing a model-aware noise selection framework that leverages internal attention signals. By quantifying uncertainty via BANSA and selecting noise seeds that minimize it, the researchers delivered an effective and efficient method for improving video quality and semantic alignment in text-to-video models. ANSE's design, which combines attention-based uncertainty estimation with computational efficiency, allows it to scale across model sizes without incurring significant runtime costs, offering a practical solution for improving video generation in T2V systems.
Check out the Paper and Project page. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent venture is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a broad audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
