Summit Supercomputer algorithm breaks the Exabyte Barrier
A machine-learning algorithm demonstrated the capability to process data that exceeds a computer’s available memory by identifying a massive data set’s key features and dividing them into manageable batches that don’t choke computer hardware.
Developed at Los Alamos National Laboratory, the algorithm set a world record for factorizing huge data sets during a test run on Oak Ridge National Laboratory’s Summit, the world’s fifth-fastest supercomputer.
Equally efficient on laptops and supercomputers, the highly scalable algorithm solves hardware bottlenecks that prevent processing information from data-rich applications in cancer research, satellite imagery, social media networks, national security science and earthquake research, to name just a few.
“We developed an ‘out-of-memory’ implementation of the non-negative matrix factorization method that allows you to factorize larger data sets than previously possible on a given hardware,” said Ismael Boureima, a computational physicist at Los Alamos National Laboratory. Boureima is first author of the paper in The Journal of Supercomputing on the record-breaking algorithm. “Our implementation simply breaks down the big data into smaller units that can be processed with the available resources. Consequently, it’s a useful tool for keeping up with exponentially growing data sets.”
“Traditional data analysis demands that data fit within memory constraints. Our approach challenges this notion,” said Manish Bhattarai, a machine learning scientist at Los Alamos and co-author of the paper. “We have introduced an out-of-memory solution. When the data volume exceeds the available memory, our algorithm breaks it down into smaller segments. It processes these segments one at a time, cycling them in and out of the memory. This technique equips us with the unique ability to manage and analyze extremely large data sets efficiently.”
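The segment-by-segment idea described above can be illustrated with a small sketch. This is not the Los Alamos implementation: it assumes the classic multiplicative-update rule for non-negative matrix factorization (the article does not say which update rule the team uses), runs on a single CPU, and uses hypothetical names such as batched_nmf and load_batch. Only one block of rows of the data matrix X is touched at a time, and the small factor H is updated from sums accumulated over all blocks.

```python
import numpy as np

def batched_nmf(load_batch, n_batches, n_cols, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor X ~= W @ H with W, H >= 0, reading X one row block at a time.

    load_batch(b) is a hypothetical callback that returns row block b of X;
    in a real out-of-memory setting it would read that block from disk.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((rank, n_cols))
    W_batches = [None] * n_batches  # W is kept as one piece per row block

    for _ in range(n_iter):
        WtX = np.zeros((rank, n_cols))  # accumulates W^T X over blocks
        WtW = np.zeros((rank, rank))    # accumulates W^T W over blocks
        HHt = H @ H.T
        for b in range(n_batches):
            Xb = load_batch(b)          # only this block of X is needed now
            if W_batches[b] is None:
                W_batches[b] = rng.random((Xb.shape[0], rank))
            Wb = W_batches[b]
            # Multiplicative update for this block's rows of W (H held fixed).
            Wb *= (Xb @ H.T) / (Wb @ HHt + eps)
            WtX += Wb.T @ Xb
            WtW += Wb.T @ Wb
        # Multiplicative update for H using the accumulated sums.
        H *= WtX / (WtW @ H + eps)
    return W_batches, H

# Toy demonstration: a 60 x 40 matrix of exact rank 3, split into 4 row blocks.
rng = np.random.default_rng(42)
X = rng.random((60, 3)) @ rng.random((3, 40))
batch_rows = 15

def load_batch(b):
    # Stand-in for a disk read; here the block is just a slice of X.
    return X[b * batch_rows:(b + 1) * batch_rows]

W_batches, H = batched_nmf(load_batch, n_batches=4, n_cols=40, rank=3, n_iter=500)
W = np.vstack(W_batches)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.4f}")
```

In this sketch the peak working set is one row block plus the small factors, which is why the block size can be chosen to fit whatever memory is available. The distributed, GPU-accelerated version described in the article involves considerably more than this.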
The distributed algorithm for modern and heterogeneous high-performance computer systems can be useful on hardware as small as a desktop computer, or as large and complex as Chicoma, Summit or the upcoming Venado supercomputers, Boureima said.
“The question is no longer whether it is possible to factorize a larger matrix, rather how long is the factorization going to take,” Boureima said.
The Los Alamos implementation takes advantage of hardware features such as GPUs to accelerate computation and fast interconnects to move data efficiently between computers. At the same time, the algorithm runs multiple tasks simultaneously.
Non-negative matrix factorization is another installment of the high-performance algorithms developed under the SmartTensors project at Los Alamos.
In machine learning, non-negative matrix factorization can be used as a form of unsupervised learning to pull meaning from data, Boureima said. “That’s very important for machine learning and data analytics because the algorithm can identify explainable latent features in the data that have a particular meaning to the user.”
The record-breaking run
In the record-breaking run by the Los Alamos team, the algorithm processed a 340-terabyte dense matrix and an 11-exabyte sparse matrix, using 25,000 GPUs.
“We’re reaching exabyte factorization, which no one else has done, to our knowledge,” said Boian Alexandrov, a co-author of the new paper and a theoretical physicist at Los Alamos who led the team that developed the SmartTensors artificial intelligence platform.
Decomposing or factoring data is a specialized data-mining technique aimed at extracting pertinent information, simplifying the data into understandable formats.
Bhattarai further emphasized the scalability of their algorithm, remarking, “In contrast, conventional methods often grapple with bottlenecks, mainly due to the lag in data transfer between a computer’s processors and its memory.”
“We also showed you don’t necessarily need big computers,” Boureima said. “Scaling to 25,000 GPUs is great if you can afford it, but our algorithm will be useful on desktop computers for something you couldn’t process before.”
The paper: “Distributed Out-of-Memory NMF on CPU/GPU Architectures.” The Journal of Supercomputing. DOI: 10.1007/s11227-023-05587-4
The funding: This research was funded by DNN R&D and by the Laboratory Directed Research and Development program at Los Alamos National Laboratory.
















