In the age of AI, efficient data-centric storage architectures have become crucial to any organization’s data strategy. The rise of large-scale AI workloads demands storage that can keep up – not just in terms of raw capacity but also with performance and flexibility. Without the right storage infrastructure, the data fueling AI initiatives can become a bottleneck rather than an asset.
Why Data-Centric Storage Matters for AI
The journey to efficient data-centric storage for AI isn’t without its challenges. Many organizations still equate “data-driven” with structured data: databases, ERP systems, and the like. In the AI world, however, unstructured data such as text, images, and video is increasingly valuable, because model training relies on massive amounts of it.
To address these needs, AI workloads require a shift towards data-centric design, where the focus is on making data available where and when it’s needed. AI data flows are also heterogeneous from a storage protocol standpoint: data ingestion and archiving usually rely on object storage (the S3 protocol), while data processing steps tend to rely on file protocols (usually NFS). AI data pipelines need the flexibility to work with both protocols seamlessly, but many infrastructures struggle to support this, especially when it comes to concurrent multi-protocol access.
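To make the multi-protocol requirement concrete, here is a minimal Python sketch of such a pipeline: the same dataset is ingested as objects over S3 and then read through an NFS mount of the same namespace. It assumes a storage layer that exposes one namespace over both protocols; the endpoint, bucket name, and mount point are hypothetical.

```python
# Minimal sketch of a multi-protocol AI data pipeline: the same dataset is
# ingested via S3 and then read via a file protocol (NFS). It assumes a
# storage layer exposing one namespace over both protocols; the endpoint,
# bucket name, and mount point below are hypothetical.
import boto3

# Ingestion stage: land raw data as objects over S3.
s3 = boto3.client("s3", endpoint_url="https://storage.example.com")
s3.upload_file("raw/sample-0001.jpg", "training-data", "images/sample-0001.jpg")

# Processing stage: the same data is visible through the NFS mount of the
# shared namespace, so training code reads it as an ordinary file.
with open("/mnt/training-data/images/sample-0001.jpg", "rb") as f:
    payload = f.read()
print(f"read {len(payload)} bytes via NFS after ingesting via S3")
```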
This shift brings additional complexity, as different personas within an organization approach the challenge from different angles: AI and application teams have a different perspective than infrastructure teams. In addition, data compliance initiatives (often driven by internal policy or external regulations) are nowadays helmed by the CDO (Chief Data Officer), who, while also focused on meeting business objectives, approaches data and storage differently.
Lastly, when we look at GPU servers, local storage (especially in configurations like NVIDIA DGX systems) often ends up siloed and underutilized. Valuable, high-performance storage capacity sits idle while still consuming power, turning into a sunk cost.
Characteristics of Data-Centric Storage for AI
To overcome these issues, organizations need to rethink how they manage data across their storage infrastructure, and should look for the following capabilities:
- Seamless Data Orchestration: Moving beyond traditional storage to more flexible, data-centric architectures is crucial to maximizing data availability, while ensuring that data is stored according to its value and criticality.
- Concurrent Multi-Protocol Access: With AI pipelines spanning ingestion, processing, and archiving, ensuring smooth transitions between different storage protocols is essential. Datasets may need to be accessed via different protocols through the same namespace, and with the same security access control lists, regardless of the method of access.
- Cost Efficiency: Leveraging modern orchestration tools and management platforms can pool unused capacity and turn it into active, valuable storage resources for AI workloads.
How Hammerspace Optimizes Data Storage for AI
Hammerspace addresses these challenges with a storage solution built for the demands of AI and data-centric workflows. The solution is completely transparent to users, who continue to see their folders and file structure as usual, while on the back end administrators can configure policies and storage tiers and determine the lifecycle of data.
Hammerspace covers very broad storage needs thanks to its multi-protocol access and support for cloud workloads, but its ability to deliver ultra-fast storage tiers (via pNFS 4.2 support) makes it very well suited for AI workloads, due to two characteristics:
- Objective-Based Policy Engine: Hammerspace’s policy engine enables organizations to set data lifecycle objectives across storage types and locations. It includes a built-in cost optimization mechanism that intelligently selects the best storage tier based on performance, cost, and compliance requirements. This allows Hammerspace to automate data placement based on policies and lifecycle needs, optimizing where and how data is stored (a hypothetical sketch of such a placement decision follows this list).
- Global Data Orchestration: Hammerspace empowers administrators with global control over data orchestration, enabling them to manage storage resources and policies without disrupting user access. This results in a more agile, responsive data infrastructure that’s aligned with AI’s unique demands.
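The sketch below illustrates the general idea of objective-based placement: score the available tiers against a dataset’s performance and compliance requirements, then pick the cheapest tier that satisfies them. This is illustrative only; the tier names, attributes, and logic are assumptions and do not reflect Hammerspace’s actual policy syntax or internals.

```python
# Hypothetical sketch of an objective-based placement decision, scoring
# storage tiers against a dataset's requirements. Illustrative only; it
# does not reflect Hammerspace's actual policy syntax or internals.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    throughput_gbps: float    # sustained read throughput
    cost_per_tb_month: float  # storage cost
    compliant: bool           # meets the data's residency/compliance rules

def place(required_gbps: float, tiers: list[Tier]) -> Tier:
    """Pick the cheapest tier that meets performance and compliance objectives."""
    eligible = [t for t in tiers
                if t.compliant and t.throughput_gbps >= required_gbps]
    if not eligible:
        raise ValueError("no tier satisfies the stated objectives")
    return min(eligible, key=lambda t: t.cost_per_tb_month)

tiers = [
    Tier("tier-0-local-nvme", throughput_gbps=40.0, cost_per_tb_month=25.0, compliant=True),
    Tier("tier-1-external-flash", throughput_gbps=10.0, cost_per_tb_month=15.0, compliant=True),
    Tier("object-archive", throughput_gbps=1.0, cost_per_tb_month=2.0, compliant=True),
]

# Hot training data needs high throughput; cold data lands on the archive tier.
print(place(required_gbps=20.0, tiers=tiers).name)  # tier-0-local-nvme
print(place(required_gbps=0.5, tiers=tiers).name)   # object-archive
```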
Building a Tier-0 Ultra-Fast Shared Storage with Hammerspace
One of the most compelling use cases for Hammerspace is transforming underutilized GPU server storage into Tier-0 ultra-fast shared storage. For example, Hammerspace presented a use case involving a large-scale AI training setup with 1,000 GPU servers and 8,000 GPUs.
In a typical environment, this local storage would go unused, and a Tier-1 NVMe flash storage system would be attached to the cluster instead. Moreover, ready-to-use GPU systems such as NVIDIA DGX / HGX are often sold with their storage slots fully populated.
In the case presented by Hammerspace, a cluster of this size would come with 20 PB of external storage if deployed without Hammerspace. Thanks to Hammerspace’s capabilities, however, organizations can pool the local storage on each GPU server to create a new Tier-0 ultra-fast shared storage tier accounting for 30 PB of storage (in the use case presented to TECHunplugged; slides will be shared once received).
Furthermore, thanks to the solution’s data orchestration capabilities, Hammerspace will also use the Tier-1 external storage and present users with a 50 PB unified storage namespace, consisting of 30 PB of Tier-0 and 20 PB of Tier-1 storage. Data can flow across both tiers without any limitation or disruption to users: data is automatically protected, and placement is seamlessly orchestrated by Hammerspace’s policy engine.
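As a back-of-the-envelope check of these figures, the short sketch below reconstructs the capacity math. The per-server NVMe capacity is an assumption chosen to be consistent with the stated 30 PB Tier-0 total; actual DGX/HGX configurations vary.

```python
# Back-of-the-envelope check of the capacity figures above. The per-server
# NVMe capacity (30 TB) is an assumption consistent with the stated 30 PB
# Tier-0 total; actual server configurations vary.
gpu_servers = 1000
local_nvme_tb_per_server = 30          # assumed per-node local NVMe capacity
tier0_pb = gpu_servers * local_nvme_tb_per_server / 1000  # pooled local tier
tier1_pb = 20                          # external flash, as stated in the use case

print(f"Tier-0 (pooled local NVMe): {tier0_pb:.0f} PB")
print(f"Tier-1 (external flash):    {tier1_pb} PB")
print(f"Unified namespace:          {tier0_pb + tier1_pb:.0f} PB")
```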
Key benefits to customers include:
- Better TCO by unlocking previously unusable local storage, potentially avoiding the cost of external storage or allowing organizations to procure better-sized external systems.
- Improved AI architecture efficiency by significantly reducing checkpointing times (up to 20x faster compared to solutions relying only on external storage); a simplified checkpointing sketch follows this list.
- Power consumption reduction by leveraging ultra-fast storage that is already present and would otherwise remain idle.
- Faster time-to-value by immediately pooling available local storage, without having to wait on external storage.
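The sketch below illustrates why Tier-0 checkpointing helps: with a unified namespace, redirecting checkpoint traffic to the pooled local NVMe tier is just a change of target path. PyTorch is used only as a familiar example of checkpoint traffic, the mount point is hypothetical, and the 20x figure above comes from Hammerspace’s comparison, not from this sketch.

```python
# Simplified illustration of checkpointing to a Tier-0 (pooled local NVMe)
# path instead of slower external storage. The mount point is hypothetical,
# and PyTorch is used only as a familiar source of checkpoint traffic.
import time
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096)  # stand-in for a large training model

def checkpoint(model: nn.Module, path: str) -> float:
    """Write a checkpoint and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    torch.save(model.state_dict(), path)
    return time.perf_counter() - start

# With a unified namespace, only the target path changes; the training
# code is otherwise identical for Tier-0 and Tier-1 destinations.
elapsed = checkpoint(model, "/mnt/tier0/checkpoints/step_000100.pt")
print(f"checkpoint written in {elapsed:.2f}s")
```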
These benefits compound when operating clusters at scale: for clusters in the range of hundreds or thousands of nodes (such as NVIDIA SuperPOD architectures), savings can reach millions of dollars and significantly improve energy efficiency, freeing up enough power budget to add 10% – 15% more GPU capacity annually.
Conclusion
Building efficient, data-centric storage for AI is a complex but essential task. With Hammerspace, organizations can turn unused storage into a powerful resource, saving on costs, reducing energy consumption, and accelerating their AI initiatives. By focusing on data orchestration, multi-protocol support, and resource optimization, enterprises can ensure they’re well-positioned to handle the demands of AI at scale.
Efficient AI storage is ultimately about more than just hardware – it’s about creating a seamless, accessible, and performant environment where data flows without bottlenecks, whether performance-based or due to compatibility constraints. In a competitive landscape, having the right storage infrastructure can be a key differentiator.