Alluxio Enhances ML/AI Support for Its Multi-Cloud Data Orchestration Platform

Alluxio has announced the immediate availability of version 2.7 of its Data Orchestration Platform. According to the company, the new release delivers 5x better I/O efficiency for machine learning (ML) training by running data loading, data pre-processing, and training in parallel.

Alluxio 2.7 also offers enhanced performance insights and support for open table formats, including Apache Iceberg and Apache Hudi, to scale access to data lakes for faster Presto- and Spark-based analytics. The new capabilities in the community and enterprise editions of Alluxio 2.7 include the following:

Insight-Driven Dynamic Cache Sizing for Presto

The new Shadow Cache feature helps balance cost against performance by dynamically providing insights into how cache size affects response time. It also reduces management overhead through self-managing capabilities for multi-tenant Presto environments at scale.
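The idea of measuring how cache size affects hit rate without actually resizing the cache can be sketched as follows. This is a minimal illustration, not Alluxio's implementation: it assumes a simple LRU policy, tracks keys only (no cached data), and the names `ShadowLRU` and `estimate_hit_ratios` are hypothetical.

```python
from collections import OrderedDict


class ShadowLRU:
    """Tracks keys only (no data) to estimate the hit ratio of a hypothetical cache size."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.keys = OrderedDict()
        self.hits = 0
        self.requests = 0

    def access(self, key):
        self.requests += 1
        if key in self.keys:
            self.hits += 1
            self.keys.move_to_end(key)  # mark as most recently used
        else:
            self.keys[key] = None
            if len(self.keys) > self.capacity:
                self.keys.popitem(last=False)  # evict the least recently used key

    def hit_ratio(self):
        return self.hits / self.requests if self.requests else 0.0


def estimate_hit_ratios(trace, candidate_sizes):
    """Replay one access trace against several hypothetical cache sizes."""
    shadows = {size: ShadowLRU(size) for size in candidate_sizes}
    for key in trace:
        for shadow in shadows.values():
            shadow.access(key)
    return {size: shadow.hit_ratio() for size, shadow in shadows.items()}


if __name__ == "__main__":
    # Hypothetical access trace; each letter stands for a cached object.
    trace = ["a", "b", "c", "a", "b", "d", "a", "b", "c", "d"]
    for size, ratio in estimate_hit_ratios(trace, [2, 3, 4]).items():
        print(f"cache size {size}: estimated hit ratio {ratio:.2f}")
```

Comparing the estimated hit ratios across candidate sizes shows where adding cache capacity stops paying off, which is the kind of insight a dynamic sizing decision would rely on.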

Alluxio and NVIDIA’s DALI for ML

NVIDIA’s Data Loading Library (DALI) is a commonly used Python library that supports CPU and GPU execution for data loading and preprocessing to accelerate deep learning. Alluxio 2.7 has been optimized to work with DALI for Python-based ML applications, which include data loading and preprocessing steps as a precursor to model training and inference.

By accelerating the I/O-heavy stages and letting them run in parallel with the compute-intensive training that follows, end-to-end training on the Alluxio data platform can achieve significant performance gains over traditional solutions. The solution also scales out, in contrast to alternatives that are suited only to smaller data sets.
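Overlapping I/O-heavy loading with compute-heavy training can be sketched with a background prefetch thread. The example below uses only the Python standard library and does not call the DALI or Alluxio APIs. The functions `prefetching_loader`, `load_batch`, and `train_step` are hypothetical stand-ins for a real loader and training step.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the batch stream


def prefetching_loader(load_batch, num_batches, buffer_size=2):
    """Yield batches while a background thread loads the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for index in range(num_batches):
                buffer.put(load_batch(index))  # blocks while the buffer is full
        finally:
            buffer.put(_DONE)  # always signal completion so the consumer cannot hang

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buffer.get()
        if batch is _DONE:
            return
        yield batch


def load_batch(index):
    # Stand-in for reading and preprocessing one batch from storage.
    return [index * 10 + offset for offset in range(3)]


def train_step(batch):
    # Stand-in for a compute-intensive training step.
    return sum(batch)


if __name__ == "__main__":
    results = [train_step(batch) for batch in prefetching_loader(load_batch, num_batches=4)]
    print(results)
```

While `train_step` works on one batch, the producer thread is already loading the next, so storage latency is hidden behind compute as long as the buffer stays non-empty.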

Ease of Use on Kubernetes

Alluxio now supports a native Container Storage Interface (CSI) driver for Kubernetes, as well as a Kubernetes operator for ML, making it easier to operate ML pipelines on the Alluxio platform in containerized environments. The Alluxio volume type is now natively available in Kubernetes environments. With the 2.7 release, the company is emphasizing agility and ease of use.

Data loading at scale

Alluxio's data management capabilities complement its caching and its unification of disparate data sources. As Alluxio deployments have grown to span compute and storage across multiple geographical locations, the software continues to evolve, and it now scales further using a technique for batching data management jobs. Batched jobs run on an embedded execution engine and handle tasks such as data loading. Batching also reduces the resource requirements of the management controller, lowering provisioning costs.
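The general idea of batching data management work can be illustrated with a short sketch. This is not Alluxio's job API: `batch_paths`, `submit_load_jobs`, and the `submit` callback are hypothetical names. The point is simply that grouping many per-file load requests into fewer jobs reduces the number of items a controller has to track.

```python
from itertools import islice


def batch_paths(paths, batch_size):
    """Group an iterable of file paths into lists of at most batch_size entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def submit_load_jobs(paths, batch_size, submit):
    """Submit one job per batch instead of one per file; return the job count."""
    job_count = 0
    for batch in batch_paths(paths, batch_size):
        submit(batch)  # hypothetical callback that hands the batch to an executor
        job_count += 1
    return job_count


if __name__ == "__main__":
    submitted = []
    files = [f"/data/part-{i}" for i in range(5)]  # hypothetical file paths
    jobs = submit_load_jobs(files, batch_size=2, submit=submitted.append)
    print(f"{len(files)} files loaded with {jobs} jobs: {submitted}")
```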

Haoyuan Li, Founder and CEO of Alluxio, said that Alluxio 2.7 further strengthens Alluxio’s position as a key component for AI, machine learning, and deep learning in the cloud. He added that with growing datasets and increased computing power from CPUs and GPUs, machine learning and deep learning have become popular techniques for AI. He concluded that the rise of these techniques advances the state of the art for AI but also exposes challenges in accessing data and storage systems.

“Data teams with large-scale analytics and AI/ML computing frameworks are under increasing pressure to make a growing number of data sources more easily accessible, while also maintaining performance levels as data locality, network IO, and rising costs come into play,” said Mike Leone, Analyst, ESG. “Organizations want to use more affordable and scalable storage options like cloud object stores, but they want peace of mind knowing they don’t have to make costly application changes or experience new performance issues. Alluxio is helping organizations address these challenges by abstracting away storage details while bringing data closer to compute, especially in hybrid cloud and multi-cloud environments.”