A few weeks ago, we attended AI Data Infrastructure Field Day 1, an online event where vendors showcased their latest innovations in the AI data infrastructure space. A presentation that particularly caught my attention came from HPE, which introduced its Private Cloud AI solution. Truth be told, both Arjan Timmerman (the other half of TECHunplugged) and I were remote delegates, and it was late evening, so it took some time for the initial understanding to sink in.
What the heck is an AI Stack?
Designing and deploying infrastructures is hardly a challenge these days, but when it comes to AI, things get a bit more complicated: it is no longer only about the usual hardware stack; GPUs and fast Ethernet interconnects must also be factored in. This is nothing new to anyone who has already built HPC infrastructures, but those who have also know that the challenge doesn't stop at the hardware layer: a cluster manager and a grid scheduler now become your problem too, and you may also have to think about other middleware components, specialized software, and so on.
Similar challenges exist in the AI world, and what matters to developers and scientists is not only the hardware infrastructure, but also the AI stack. So what the heck is an AI stack? In addition to the hardware (or cloud platform), it consists of a set of applications, each in charge of a specific activity within the stack, working together to deliver the desired business outcome.
To simplify greatly, if we take GenAI as an example, you will need a base model (for example GPT or Llama), but also additional components to perform training, inferencing, and fine-tuning. You will also need specific databases, plus software for deployment and observability. It is a brave new world, so you can head to sources such as this and this article to understand a bit more.
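To make the layering a bit more concrete, here is a toy sketch of the components a GenAI stack might declare and a simple completeness check. All component names here are illustrative examples, not a real product manifest:

```python
# Illustrative only: a toy description of the layers a GenAI stack
# might need. All component names are hypothetical examples.
REQUIRED_LAYERS = {"model", "inference", "fine_tuning", "vector_db", "observability"}

genai_stack = {
    "model": "llama-3-8b",          # base model (e.g. GPT, Llama)
    "inference": "triton",          # serving / inference engine
    "fine_tuning": "peft-lora",     # fine-tuning tooling
    "vector_db": "pgvector",        # database for RAG-style retrieval
    "observability": "prometheus",  # metrics and monitoring
}

def missing_layers(stack: dict) -> set:
    """Return the required layers the stack does not yet cover."""
    return REQUIRED_LAYERS - stack.keys()

print(missing_layers(genai_stack))  # an empty set means every layer is covered
```

The point of the sketch is simply that "the AI stack" is a set of moving parts, each with its own job, and that someone has to know whether all of them are present and accounted for.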
The Challenge of Building and Maintaining an AI stack
Organizations may either build their own stack or use an existing one. With existing stacks, one option is to rely on an external platform where these components come pre-packaged in a coherent, readily available stack; alternatively, the organization can decide that it is in its best interest (for data privacy compliance reasons, or a variety of others) to build and maintain such a stack in house.
Now, taking the route of building and maintaining your own stack is not necessarily a crown of thorns, but there needs to be a solid rationale behind it. Perhaps the organization primarily focuses on AI development and has dedicated teams and the skills to maintain the stack.
Furthermore, it is simplistic to talk about a single stack when multiple teams may each be using different tools or different variations of a given stack (anyone with hands-on experience of the CNCF landscape will know; by the way, have you heard about the CNAI?). Not only is complexity growing, but those stacks now also need an owner: someone, or a team, dedicated to nothing but keeping the stacks up to date, troubleshooting issues, performing dependency validation, and so on. What was initially seen as a cost-saving measure (or a good attempt at meeting data privacy requirements) is now turning into a liability.
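To illustrate why dependency validation alone becomes a chore at scale, here is a minimal sketch: each team pins its own component versions, and the stack owner has to check every stack against a validated compatibility baseline. All component names and version numbers below are hypothetical:

```python
# Toy dependency validation across multiple team stacks.
# All component names and version numbers are hypothetical.

def parse(version: str) -> tuple:
    """Turn '12.1' into (12, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

# Minimum versions the stack owner has validated as compatible.
compatibility_matrix = {"cuda": "12.1", "driver": "535.0", "framework": "2.1"}

team_stacks = {
    "team-a": {"cuda": "12.2", "driver": "550.0", "framework": "2.3"},
    "team-b": {"cuda": "11.8", "driver": "535.0", "framework": "2.1"},
}

def violations(stack: dict) -> list:
    """List components that fall below the validated minimum version."""
    return [comp for comp, vmin in compatibility_matrix.items()
            if parse(stack.get(comp, "0")) < parse(vmin)]

for team, stack in team_stacks.items():
    print(team, violations(stack))
```

Multiply this by every tool in every team's variation of the stack, and the "owner" role quickly stops being a part-time job.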
HPE Private Cloud AI: Hassle-free AI Consumption
Enter HPE Private Cloud AI, a solution delivered as an as-a-service offering via HPE GreenLake. HPE Private Cloud AI is a full AI stack solution co-engineered by HPE and NVIDIA that brings organizations the best of both worlds: the ability to run a fully managed AI infrastructure stack on premises, without the hassles of management, while remaining compliant with stringent data privacy requirements.
Management activities such as maintenance, patching, and keeping AI stacks up to date are handled by HPE in the background. Users of the AI platform can focus on consuming its capabilities without worrying about hardware and stack-related dependencies, and without having to maintain anything in the environment, while data infrastructure teams can focus their efforts on other management activities within the organization.
HPE Private Cloud AI is available in Small, Medium, Large, and Extra-Large configurations, with each currently offered in two sub-configurations. Small configurations are best suited for AI inference; Medium configurations handle inference and Retrieval-Augmented Generation (RAG), while Large configurations also handle model fine-tuning.
All configurations come with NVIDIA Spectrum-X based networking, starting at 100 GbE for the smallest deployments and reaching 800 GbE for top-of-the-line configurations. As the configuration table below shows, the sub-configurations differ in GPU count. It is also noteworthy that Small and Medium configurations come with NVIDIA L40S GPUs, Large configurations with H100 NVL GPUs, and Extra-Large configurations with GH200 NVL2 GPUs.
TECHunplugged’s Opinion
Manually building and managing an AI stack may be doable when starting small, but managing AI stacks at scale becomes a daunting task unless designing and running this dedicated type of infrastructure is the organization’s core business. HPE’s Private Cloud AI solution, co-engineered with NVIDIA, provides a turnkey approach that delivers multiple benefits to organizations, including:
- simplified deployment and management by delivering pre-validated configurations that work out of the box, including monitoring, observability, and a comprehensive control plane
- time-to-value acceleration with pre-built stacks and the ability to build or import custom stacks, with HPE claiming up to 90% higher developer productivity compared to self-managed stacks
- data security by keeping data on-premises, maintaining full control over data, and adhering to data privacy and classification policies with multi-layered controls to protect data and models
- cloud-like scalability and economics through HPE GreenLake's delivery model
Overall, HPE impressed us with a practical and comprehensive approach to accelerating time-to-value in enterprise AI initiatives: HPE Private Cloud AI is a robust and well-designed solution worth looking at.
Additional Resources
HPE presented three different segments, available on this page. We are also sharing links to the individual videos below: