Meet Yambda: The World's Largest Event Dataset to Accelerate Recommender Systems

by Brenden Burgess

Yandex recently made a significant contribution to the recommender systems community by releasing Yambda, the world's largest publicly available dataset for recommender system research and development. The dataset is designed to bridge the gap between academic research and industry-scale applications, offering nearly 5 billion anonymized user interaction events from Yandex Music, one of the company's flagship streaming services with more than 28 million monthly users.

Why Yambda matters: bridging a critical data gap in recommender systems

Recommender systems power the personalized experiences of many digital services today, from e-commerce and social networks to streaming platforms. These systems rely heavily on massive volumes of behavioral data, such as clicks, likes and listens, to infer user preferences and deliver tailored content.

However, the field of recommender systems has lagged behind other areas of AI, such as natural language processing, largely because large, openly accessible datasets are scarce. Unlike large language models (LLMs), which learn from publicly available text sources, recommender systems need sensitive behavioral data, which is commercially valuable and difficult to anonymize. Consequently, companies have traditionally guarded this data closely, limiting researchers' access to real-world-scale datasets.

Existing datasets such as Spotify's Million Playlist Dataset, the Netflix Prize data and Criteo's click logs are either too small, lack temporal detail, or are too poorly documented to support the development of production-grade recommendation models. Yandex's release of Yambda addresses these challenges by providing a large, high-quality dataset with a rich set of features and anonymization guarantees.

What Yambda contains: scale, richness and privacy

The Yambda dataset includes 4.79 billion anonymized user interactions collected over a 10-month period. These events come from around 1 million users interacting with nearly 9.4 million tracks on Yandex Music. The dataset includes:

  • User interactions: Both implicit feedback (listens) and explicit feedback (likes, dislikes and their removals).
  • Anonymized audio embeddings: Vector representations of tracks derived from convolutional neural networks, allowing models to exploit audio content similarity.
  • Organic interaction flags: An “is_organic” flag indicates whether users discovered a track on their own or through recommendations, facilitating behavioral analysis.
  • Precise timestamps: Each event is timestamped to preserve temporal order, which is crucial for modeling sequential user behavior.

All user and track identifiers are anonymized using numeric IDs to comply with privacy standards, ensuring that no personally identifiable information is exposed.

The dataset is provided in Apache Parquet format, which is optimized for big data processing frameworks such as Apache Spark and Hadoop, and is also compatible with analytical libraries such as pandas and Polars. This makes Yambda accessible to researchers and developers working in a variety of environments.
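As a minimal sketch of what working with one of these Parquet files might look like in pandas: the column names `uid`, `item_id` and `timestamp` are assumptions made for illustration, and only the organic flag is named in the description above. The helper names are hypothetical, and a toy frame stands in for a real file.

```python
import pandas as pd


def load_events(path: str) -> pd.DataFrame:
    """Read one Parquet file of events (needs pyarrow or fastparquet installed)."""
    return pd.read_parquet(path)


def organic_share(events: pd.DataFrame) -> float:
    """Fraction of events that the user reached without a recommendation."""
    return float(events["is_organic"].astype(bool).mean())


# Toy frame standing in for a real file; the column names are assumptions.
toy = pd.DataFrame(
    {
        "uid": [1, 1, 2, 2],
        "item_id": [10, 11, 10, 12],
        "timestamp": [100, 160, 120, 300],
        "is_organic": [1, 0, 1, 1],
    }
)

# Sort by time so downstream sequential models see events in order.
toy = toy.sort_values("timestamp", kind="stable")
share = organic_share(toy)
```

With a real file, `load_events` would replace the toy frame; the rest of the code stays the same as long as the columns match.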

Evaluation method: Global Temporal Split

A key innovation in Yandex's dataset is the adoption of a Global Temporal Split (GTS) evaluation strategy. In typical recommender system research, the widely used leave-one-out method holds out each user's last interaction for testing. However, this approach disrupts the temporal continuity of user interactions, creating unrealistic training conditions.

GTS, on the other hand, splits the data by timestamp, preserving the entire sequence of events. This approach more closely mimics real-world recommendation scenarios because it prevents future data from leaking into training and allows models to be tested on chronologically later interactions.

This temporal evaluation is essential for benchmarking algorithms under realistic constraints and understanding their practical effectiveness.
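To make the contrast concrete, here is a small sketch of both strategies: a single global cutoff versus holding out each user's last interaction (often called leave-one-out). The `Event` record, the function names and the cutoff value are illustrative assumptions, not the dataset's actual schema or Yandex's evaluation code.

```python
from typing import NamedTuple


class Event(NamedTuple):
    uid: int
    item_id: int
    timestamp: int


def global_temporal_split(events, cutoff):
    """Train on everything strictly before `cutoff`, test on everything after."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    train = [e for e in ordered if e.timestamp < cutoff]
    test = [e for e in ordered if e.timestamp >= cutoff]
    return train, test


def leave_one_out_split(events):
    """Hold out each user's last event, regardless of when it happened."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    last_by_user = {}
    for e in ordered:
        last_by_user[e.uid] = e  # later events overwrite earlier ones
    test = list(last_by_user.values())
    held_out = set(test)
    train = [e for e in ordered if e not in held_out]
    return train, test


events = [
    Event(1, 10, 100), Event(1, 11, 200),
    Event(2, 10, 500), Event(2, 12, 900),
]
gts_train, gts_test = global_temporal_split(events, cutoff=400)
loo_train, loo_test = leave_one_out_split(events)
```

In this toy data, the leave-one-out training set contains user 2's event at time 500, which is later than user 1's held-out event at time 200, so the model trains on the future relative to part of the test set. The global split never allows that: every training timestamp precedes every test timestamp.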

Baseline models and metrics included

To support benchmarking and accelerate innovation, Yandex provides baseline recommendation models implemented on the dataset, including:

  • MostPop: A popularity-based model that recommends the most popular items.
  • DecayPop: A time-decayed popularity model.
  • ItemKNN: A neighborhood-based collaborative filtering method.
  • iALS: Implicit Alternating Least Squares matrix factorization.
  • BPR: Bayesian Personalized Ranking, a pairwise ranking method.
  • SANSA and SASRec: More advanced models, including a sequential model (SASRec) that uses self-attention.
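The two popularity baselines at the top of this list are simple enough to sketch in a few lines. The exponential half-life weighting below is one common way to decay popularity over time and is an assumption here, not necessarily the formula used in the Yambda baselines; the function names are hypothetical.

```python
import math
from collections import defaultdict


def decay_pop_scores(events, now, half_life):
    """Score items by interaction counts that halve in weight every `half_life`.

    events: iterable of (item_id, timestamp) pairs.
    An infinite half-life gives plain counts, i.e. a MostPop-style score.
    """
    scores = defaultdict(float)
    for item_id, ts in events:
        age = now - ts
        scores[item_id] += 0.5 ** (age / half_life)
    return dict(scores)


def top_k(scores, k):
    """Return the k highest-scoring items, breaking ties by item id."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


# Item "a" has more interactions overall, but item "b" is more recent.
events = [("a", 0), ("a", 0), ("a", 0), ("b", 90), ("b", 100)]
most_pop = top_k(decay_pop_scores(events, now=100, half_life=math.inf), k=1)
decay_pop = top_k(decay_pop_scores(events, now=100, half_life=10), k=1)
```

Without decay the older item "a" wins on raw counts; with a short half-life the recently played item "b" ranks first, which is the behavior a time-decayed popularity model is meant to capture.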

These baselines are evaluated using standard recommendation metrics such as:

  • NDCG@k (Normalized Discounted Cumulative Gain): Measures ranking quality, emphasizing the position of relevant items.
  • Recall@k: Assesses the fraction of relevant items retrieved.
  • Coverage@k: Indicates the diversity of recommendations across the catalog.
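As a rough sketch, the three metrics above are commonly computed as follows for binary relevance. Exact definitions vary between papers (for example, some normalize Recall@k by min(k, number of relevant items)), so treat this as one reasonable convention rather than the exact formulas used in the Yambda benchmark. It assumes each recommendation list contains no duplicates.

```python
import math


def recall_at_k(recommended, relevant, k):
    """Share of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """Discounted gain of the hits, divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def coverage_at_k(all_recommendations, catalog_size, k):
    """Share of the catalog that shows up in at least one user's top k."""
    shown = set()
    for recs in all_recommendations:
        shown.update(recs[:k])
    return len(shown) / catalog_size


recommended = ["a", "b", "c"]
relevant = {"a", "c"}
recall = recall_at_k(recommended, relevant, k=3)
ndcg = ndcg_at_k(recommended, relevant, k=3)
coverage = coverage_at_k([["a", "b", "c"], ["a", "b", "d"]], catalog_size=8, k=3)
```

Here both relevant items are retrieved, so recall is 1.0, but NDCG is below 1 because the second hit sits at rank 3 instead of rank 2.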

Providing these benchmarks helps researchers quickly assess how new algorithms perform against established methods.

Broad applicability beyond music streaming

While the dataset comes from a music streaming service, its value extends far beyond that domain. The interaction types, user behavior dynamics and large scale make Yambda a universal benchmark for recommender systems in sectors such as e-commerce, video platforms and social networks. Algorithms validated on this dataset can be generalized or adapted to various recommendation tasks.

Benefits for different stakeholders

  • Academia: Enables rigorous testing of theories and new algorithms at an industry-relevant scale.
  • Startups and SMEs: Offers a resource comparable to what tech giants have, leveling the playing field and accelerating the development of advanced recommendation engines.
  • End users: Benefit indirectly from smarter recommendation algorithms that improve content discovery, reduce search time and increase engagement.

My Wave: Yandex's personalized recommender system

Yandex Music uses a proprietary recommender system called My Wave, which incorporates deep neural networks and AI to personalize music suggestions. My Wave analyzes thousands of factors, including:

  • Sequences of user interactions and listening history.
  • Customizable preferences such as mood and language.
  • Real-time music analysis of spectrograms, rhythm, vocals, frequency ranges and genres.

This system adapts dynamically to individual tastes by identifying audio similarities and predicting preferences, demonstrating the kind of complex recommendation pipeline that benefits from large-scale datasets like Yambda.

Ensuring privacy and ethical use

The release of Yambda underscores the importance of privacy in recommender systems research. Yandex anonymizes all data with numeric IDs and omits personally identifiable information. The dataset contains only interaction signals, without revealing exact user identities or sensitive attributes.

This balance between openness and privacy enables robust research while protecting individual users' data, a critical consideration for the ethical advancement of AI technologies.

Access and versions

Yandex offers the Yambda dataset in three sizes to suit different research needs and computational capacities:

  • Full version: ~5 billion events.
  • Medium version: ~500 million events.
  • Small version: ~50 million events.

All versions are available on Hugging Face, a popular platform for hosting datasets and machine learning models, allowing easy integration into research workflows.

Conclusion

Yandex's release of the Yambda dataset marks a pivotal moment in recommender systems research. By providing an unprecedented scale of anonymized interaction data paired with a temporal evaluation protocol and baselines, it sets a new standard for benchmarking models and accelerating innovation. Researchers, startups and enterprises can now explore and develop recommender systems that better reflect real-world usage and deliver improved personalization.

As recommender systems continue to shape countless online experiences, datasets like Yambda play a fundamental role in pushing the boundaries of AI-powered personalization.

Check out the Yambda dataset on Hugging Face.


Note: Thanks to the Yandex team for the thought leadership and resources for this article. The Yandex team supported and sponsored this content.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has more than 2 million monthly views, illustrating its popularity among readers.