Stochastic Cluster Embedding – a new method for visualizing large data sets

by Brenden Burgess


A remarkable ability of the human brain is spotting differences even in huge amounts of visual information. This capacity is invaluable when studying large data sets, because the content of the data must be compressed into a form that human intelligence can grasp. For visual analysis, dimensionality reduction remains the central problem.

Scientists from Aalto University and the University of Helsinki at the Finnish Center for Artificial Intelligence (FCAI) led a study in which they tested the best-known visual analysis methods and found that none of them works well when the data volume grows considerably. For example, t-SNE, LargeVis and UMAP could no longer distinguish even extremely strongly separated groups of observations in the data once the number of observations reached the hundreds of thousands.

The researchers developed a new nonlinear dimensionality reduction method called Stochastic Cluster Embedding (SCE) for better cluster visualization. It aims to display data sets as clearly as possible, and is designed to make data clusters and other macroscopic features as distinct, easy to observe and easy to understand as possible. SCE uses GPU acceleration, similar to the neural-network computations of modern artificial intelligence methods.

The discovery of the Higgs boson motivated the invention of this algorithm. The data set from the associated experiments contained more than 11 million feature vectors, and these data required a clear, practical visualization. This inspired the scientists to develop the new method.

The researchers generalized SNE using a family of I-divergences, parameterized by a scale factor s, between non-normalized similarities in the input and output space. SNE is the special case of the family where s is chosen as the normalizing factor of the output similarities. During testing, however, it was found that the best value of s for cluster visualization often differs from the value SNE chooses. Consequently, to overcome this shortcoming of t-SNE, the new SCE method uses a different approach that mixes in the input similarities when calculating s. The coefficient is adjusted adaptively while optimizing the new learning objective, so the data points become better grouped. The researchers also developed an efficient optimization algorithm using asynchronous stochastic block coordinate descent. The new algorithm can exploit parallel computing hardware and suits mega-scale tasks with large amounts of data.
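The divergence family described above can be sketched in code. The following is an illustrative reconstruction, not the authors' implementation: it assumes non-normalized similarity matrices P (input space) and Q (output space) and the generalized I-divergence D_I(P || s·Q); the exact way SCE mixes input similarities into s is described in the paper and is not reproduced here.

```python
import numpy as np

def i_divergence_objective(P, Q, s):
    """Generalized I-divergence D_I(P || s*Q) between non-normalized
    similarity matrices, parameterized by the scale factor s.
    Illustrative sketch only; the SCE paper's exact formulation
    may differ in details."""
    sQ = s * Q
    mask = P > 0  # skip zero similarities to avoid log(0)
    return np.sum(P[mask] * np.log(P[mask] / sQ[mask])) - P.sum() + sQ.sum()

def sne_scale(Q):
    """SNE's special case: s is the normalizing factor of the
    output similarities."""
    return 1.0 / Q.sum()

# Toy symmetric similarity matrices for three points (hypothetical values)
P = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.1],
              [0.1, 0.1, 0.0]])
Q = np.array([[0.0, 0.5, 0.3],
              [0.5, 0.0, 0.3],
              [0.3, 0.3, 0.0]])

value = i_divergence_objective(P, Q, sne_scale(Q))
```

With P = Q and s = 1 the divergence is zero, which is the sanity check one would expect: the objective penalizes mismatch between the (scaled) output similarities and the input similarities.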

During the project, the scientists tested the method on various real data sets and compared it to other modern NLDR methods. Users participating in the tests selected the most suitable visualizations, which corresponded to a range of s values for cluster visualization. The researchers then compared the values chosen by SCE and t-SNE to see which was closer to the human choice. The four smallest data sets, IJCNN, Tomoradar, Shuttle and MNIST, were used for the tests. For each data set, participants were shown a series of visualizations in which they moved a slider to indicate an s value and examined the corresponding precomputed visualization; each user then chose their favorite value of s for cluster visualization.

The test results clearly show that the s chosen by SNE lies to the right of the human median (solid green line) for all data sets. This suggests that for humans, generalized SNE with a smaller s is often better than t-SNE for cluster visualization. By contrast, the SCE selection (red dotted lines) is closer to the human median for all four data sets.

By applying the stochastic cluster embedding method to the Higgs boson data, its most important physical characteristics were clearly identified. The new nonlinear dimensionality reduction method runs several orders of magnitude faster than previous methods and is also much more reliable in complex applications. It modifies t-SNE using an adaptive and efficient trade-off between attraction and repulsion. In addition, the scientists provided a simple and fast optimization algorithm that can easily be implemented on modern parallel computing platforms: efficient software was developed that uses asynchronous stochastic block gradient descent to optimize the new family of objective functions. The experimental results showed that the method consistently and significantly improves the visualization of data clusters compared with state-of-the-art stochastic neighbor embedding approaches.
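The attraction/repulsion trade-off mentioned above can be illustrated with a small sketch. This is a hypothetical, simplified gradient using t-SNE-style Cauchy output similarities q = 1/(1 + d²), where the scale s weights the repulsive term; SCE's actual adaptive update of s and its asynchronous block-wise optimization are in the paper, not reproduced here.

```python
import numpy as np

def attraction_repulsion_gradient(Y, P, s):
    """Simplified full-batch gradient over an embedding Y with a
    data-driven attraction term (weighted by P) and a repulsion
    term scaled by s. Illustrative sketch, not the SCE algorithm."""
    diff = Y[:, None, :] - Y[None, :, :]       # pairwise differences y_i - y_j
    d2 = np.sum(diff ** 2, axis=-1)
    q = 1.0 / (1.0 + d2)                       # Cauchy output similarity
    np.fill_diagonal(q, 0.0)                   # no self-similarity
    w_attract = P * q                          # pulls similar points together
    w_repel = s * q ** 2                       # pushes all pairs apart
    return 4.0 * np.sum((w_attract - w_repel)[:, :, None] * diff, axis=1)
```

With s = 0 only attraction acts and gradient descent pulls neighboring points together; increasing s strengthens the repulsion that spreads clusters apart, which is the trade-off SCE tunes adaptively.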

The method's source code is publicly available on GitHub.
