7 considerations when building your ML architecture
Andreea Munteanu
on 17 February 2025
Tags: AI/ML, AIML, Infrastructure, MLOps

As the number of organizations moving their ML projects to production grows, the need to build reliable, scalable architecture becomes a more pressing concern. According to BCG (Boston Consulting Group), only 6% of organizations are investing in upskilling their workforce in AI. For any organization seeking to reach AI maturity, this skills gap is likely to cause disruption.
Whilst it may seem that this skills gap is purely a concern for data scientists or ML engineers, in reality it affects a much wider section of your workforce than you might imagine. Security, privacy and infrastructure are three areas in particular where machine learning is reshaping processes and best practices, meaning that professionals need to reshape their skill sets and ways of working.
In this blog, we will go through the most important considerations when building your ML architecture. We will detail the different avenues that enterprises can take to achieve their goals without huge upfront investment or getting stuck in the experimentation phase.
1. Build on what you already have
It is often the case that organizations already have the physical infrastructure they need to get started with AI/ML. Rather than building from scratch, you should start by extending the capacity of your current infrastructure. So what does this mean in practice? The name of the game is to identify components that are being underutilized, as they likely have the spare capacity needed to support your ML workloads. Perhaps this means GPUs that are not used to maximum capacity, or public cloud access that could be extended to new machines.
By extending your existing infrastructure, you can reduce your initial investment, accelerate exploration and give business stakeholders time to better comprehend the possible return on investment.
Also, it’s important to keep in mind that GPUs are not always needed. Whilst machine learning has been a huge driver of GPU adoption, in the initial stages you might find you can get started without them. This also buys time for GPUs to be delivered (given the current scarcity) before organizations are ready to scale their ML projects.
Indeed, when the time comes to scale AI/ML initiatives and go beyond adoption, the underlying ML infrastructure needs to scale as well. That usually means adding more compute power – and whilst you may need to purchase additional hardware or storage, you should always seek to optimize your existing infrastructure first. You might find that you can free up some of the capacity you need to scale up.
2. Avoid GPU underutilization and optimize your infrastructure
The 2024 State of AI Infrastructure report found that less than 50% of surveyed organizations’ available GPUs are in use. Most respondents stated that they use queue management, job schedulers, multi-instance GPUs and quotas. These are all solutions designed to maximize efficiency in the face of GPU scarcity: in effect, to continue scaling even if you can’t purchase additional hardware.
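Before investing in schedulers or new hardware, it helps to measure how busy your GPUs actually are. Below is a minimal sketch of such a check using the NVIDIA Management Library bindings (the pynvml package); it assumes NVIDIA GPUs and drivers are present, and the 20% threshold is an arbitrary example.

```python
import pynvml

# Minimal GPU utilization check using NVIDIA's NVML bindings (pip install nvidia-ml-py).
# Assumes NVIDIA GPUs and drivers; the 20% threshold below is an arbitrary example.
pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # instantaneous % busy
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i} ({name}): {util.gpu}% busy, "
              f"{mem.used / mem.total:.0%} memory used")
        if util.gpu < 20:
            print(f"  -> GPU {i} looks underutilized; consider sharing or scheduling it")
finally:
    pynvml.nvmlShutdown()
```

Sampling this over time, for example from a cron job or a metrics exporter, gives a far more realistic picture of utilization than a single reading.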
To optimize the utilization of GPUs, organizations can adjust their architecture at different layers. Firstly, adopting a container orchestration solution ensures easy access to the tooling necessary for developing and deploying ML models. Kubernetes is the de facto standard for container orchestration; however, there are multiple downstream Kubernetes distributions available on the market. Organizations should keep in mind their long-term plans and where they think their infrastructure will evolve, so that their chosen distribution can run wherever their ML journey takes them.
To increase GPU utilization further, organizations can adopt a GPU scheduler. When computing power is limited, schedulers help distribute workloads and make effective use of resources. There are multiple solutions available, such as run.ai or Volcano, which are compatible with different hardware platforms and Kubernetes distributions.
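As an illustration of what this looks like at the workload level, here is a hedged sketch using the Kubernetes Python client to submit a pod that requests a single GPU through the standard `nvidia.com/gpu` device-plugin resource. The image, namespace and pod names are hypothetical, and a batch scheduler such as Volcano would be referenced via the pod’s scheduler name if installed.

```python
from kubernetes import client, config

# Sketch: submit a pod that requests one GPU via the nvidia.com/gpu resource.
# Image, namespace and names are hypothetical; adjust for your cluster.
config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-training-job"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        # scheduler_name="volcano",  # uncomment if a batch scheduler such as Volcano is installed
        containers=[
            client.V1Container(
                name="trainer",
                image="registry.example.com/ml/train:latest",  # hypothetical image
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    # request a whole GPU, or a MIG slice (e.g. nvidia.com/mig-1g.5gb) where configured
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ml-team", body=pod)
```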
Click here to learn more about Canonical’s distribution of Kubernetes >
3. Adopt an MLOps platform
Machine Learning Operations (MLOps) is often defined as DevOps, but for machine learning. It aims to standardize machine learning initiatives within a team or organization, and ultimately automate the deployment of workloads.
There are multiple benefits to the MLOps approach. Very often, organizations are siloed, using different tools in various departments, which hinders collaboration. From an efficiency perspective, using a single platform across your organization will not only reduce the time spent by data scientists and machine learning engineers on repetitive tasks but will also encourage collaboration. From an infrastructure perspective, the adoption of DevOps principles means that the ML architecture will be simpler and therefore easier to maintain, leaving organizations with reduced risk and a lower overhead. In terms of skills gaps, reducing the complexity of the ML infrastructure and building a standardized architecture that can be used by various teams will help streamline adoption.
So which MLOps platform should you adopt? There is a lot of choice on the market, comprising both closed source platforms (like SageMaker, Weights & Biases or Vertex AI) and open source ones, like Charmed Kubeflow.
Charmed Kubeflow is Canonical’s distribution of Kubeflow, one of the most popular and widely used MLOps platforms. It is fully open source, secure and portable. It runs in different environments, including public or private clouds, supporting hybrid or multi-cloud scenarios. It runs on any CNCF-conformant Kubernetes, giving organizations the option to run it on their existing infrastructure.
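To give a flavour of what standardization looks like in practice, here is a minimal, hedged sketch of a pipeline written with the Kubeflow Pipelines SDK (kfp v2). The component and pipeline names are illustrative; the compiled YAML can then be uploaded to a Kubeflow deployment such as Charmed Kubeflow.

```python
from kfp import dsl, compiler

# Illustrative component: in a real project this would train and evaluate a model.
@dsl.component(base_image="python:3.11")
def train_model(learning_rate: float) -> float:
    # Stand-in for real training logic; returns a dummy metric.
    return 1.0 - learning_rate

# A one-step pipeline; real pipelines chain data prep, training, evaluation and deployment.
@dsl.pipeline(name="demo-training-pipeline")
def demo_pipeline(learning_rate: float = 0.01):
    train_model(learning_rate=learning_rate)

# Compile to a package that a Kubeflow Pipelines backend can run.
compiler.Compiler().compile(demo_pipeline, package_path="demo_pipeline.yaml")
```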
Learn more about the deployment options available in Charmed Kubeflow >.
4. Adopt a hybrid cloud strategy
Organizations often think in binary terms about either public or private clouds, and build their AI strategy around one or the other. However, there are benefits to using both. Let’s quickly compare what they bring to the table.
Private clouds are often more cost-effective, easier to pass through audits, and more focused on the privacy concerns that AI entails. This is because host organizations are fully sovereign over their private cloud, meaning they have full control over the data. At the same time, GPU scarcity can pose a significant barrier to adopting or building an AI cloud, whilst the upfront investment is often a source of concern for organizations.
Public clouds offer easy-to-access compute resources and spot instances, which allow for quick scalability and avoid underutilization of infrastructure. At the same time, public clouds are often expensive when running multiple ML projects, and due to regulations in different parts of the world, not all data can be processed on them.
All these considerations serve as motivators for organizations to build a hybrid cloud strategy – as it can offer you the “best of both worlds”. For example, you could use the public cloud to quickly perform experiments and validate ML projects, thanks to its easy access to compute resources. Once these projects are ready to be rolled out to production or run at a larger scale, for example with a wider dataset, your models can be migrated to a private cloud deployment.
When building such a strategy, organizations need to choose tooling that is portable, so that it can run on both public and private environments. This avoids the need for upskilling on different infrastructure solutions and enables professionals to benefit from the same features, regardless of where the project runs. It will also simplify the migration process, ensuring a smooth rollout to production for any ML project. Usually, open source solutions are suitable for such use cases, as they can run anywhere.
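One simple way to keep code portable is to avoid hard-coding cloud-specific endpoints. The sketch below, with hypothetical environment variable and bucket names, uploads a model artifact to any S3-compatible object store, so the same script works against AWS S3 on a public cloud or a store such as MinIO or Ceph on a private one.

```python
import os
import boto3

# The same upload code runs on public or private cloud; only environment variables change.
# Variable and bucket names below are hypothetical examples.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ.get("OBJECT_STORE_ENDPOINT"),  # e.g. https://minio.internal:9000; None => AWS S3
    aws_access_key_id=os.environ["OBJECT_STORE_ACCESS_KEY"],
    aws_secret_access_key=os.environ["OBJECT_STORE_SECRET_KEY"],
)

bucket = os.environ.get("MODEL_BUCKET", "ml-artifacts")
s3.upload_file("model.pkl", bucket, "models/v1/model.pkl")
print(f"Uploaded model to {bucket}")
```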
Learn more about hybrid clouds >
5. Prioritize security
Security and privacy are two of the most important concerns when it comes to AI/ML. They need to be considered at different layers of the architecture, to ensure full and comprehensive protection of both the data and ML model. Therefore, organizations should build an ML architecture that provides security across the lifecycle, from data and models to tooling and hardware.
What does this mean for your infrastructure? Organizations should start by building an architecture that mitigates vulnerabilities, especially critical or high-severity ones. That means having a patching strategy, either in-house or through a third-party vendor. Best practices such as regular scans and updates will give you a clear understanding of the packages you use for your ML projects, their versions and their dependencies.
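Regular scans start with knowing exactly what is installed. As a small, hedged illustration (not a substitute for a proper vulnerability scanner), the snippet below inventories the Python packages and versions in an ML environment so they can be compared against CVE advisories or fed into a scanning tool.

```python
import importlib.metadata

# Inventory installed Python packages and versions; feed this into a vulnerability
# scanner or compare against CVE advisories as part of a patching strategy.
packages = sorted(
    (dist.metadata["Name"], dist.version)
    for dist in importlib.metadata.distributions()
)
for name, version in packages:
    print(f"{name}=={version}")
```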
At the same time, organizations often run more than one AI/ML project with access to the same data or ML pipelines, so workloads and teams need to be kept isolated from each other. Using ML tooling that provides network isolation and integrates with identity management providers is the foundation of such a setup.
Read more about security risks and how to address them in our overview of machine learning risks.
6. Upskill as you grow
Getting started with a new technology is often intimidating. There are a lot of skills to gain in a short period of time, so prioritization is key. Collaborating with partners who already have expertise in the field will accelerate your AI adoption and reduce the initial overhead. For example, building an ML architecture from scratch is difficult, and adjusting an existing one can be even more challenging if other workload types are already running in production. By drawing on the knowledge of partners who have been there and helped others in the same position as you, you can avoid common pitfalls and give your engineers confidence.
For example, Canonical’s MLOps workshop is a five-day, in-person engagement where our experts help you design your ML architecture based on your use cases, constraints and existing infrastructure.
After design comes maintenance. Managing an MLOps platform such as Kubeflow requires a new set of skills that organizations often do not have yet. This is why they should consider options that, once again, draw on the knowledge of others. In an enterprise context, this could be enterprise support (in the form of automated patching and maintenance, as is offered by Canonical through Ubuntu Pro) or fully-fledged managed services that remove the burden of maintaining an MLOps platform from your in-house engineers. Managed services also carry the advantage of giving them the time to upskill and eventually take over from the managed service provider.
Learn more about enterprise AI solutions from Canonical >
7. Build an observable architecture
Observability is the foundation of any infrastructure that runs in production. Organizations should observe both the hardware and the software stack. ML models need continuous development and data is often messy, so a reliable ML infrastructure includes observability tools that address these challenges. Alerting and dashboards are the foundation of an observable ML architecture, but organizations should also ensure that logs and tracing are enabled.
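As a concrete and deliberately minimal sketch of what “observable” means at the model-serving layer, the snippet below uses the Prometheus Python client to expose prediction counts and latency. Metric names and the port are illustrative; a real deployment would also wire these metrics into dashboards and alerting rules.

```python
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metrics for a model-serving process; names and port are examples.
PREDICTIONS = Counter("model_predictions_total", "Total predictions served")
LATENCY = Histogram("model_prediction_latency_seconds", "Prediction latency in seconds")

def predict(features):
    with LATENCY.time():             # record how long each prediction takes
        PREDICTIONS.inc()            # count every prediction
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real inference
        return 0.0

if __name__ == "__main__":
    start_http_server(8000)          # Prometheus scrapes metrics from :8000/metrics
    while True:
        predict([1.0, 2.0])
```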
Learn more about observable MLOps in our introductory blog >
Conclusion and next steps
If you are curious about where to start or want to accelerate your ML journey, our MLOps workshop will help you design AI infrastructure for any use case. Our MLOps workshop is delivered by Canonical’s MLOps field engineering team, a team of experts trained to architect, design and deploy AI infrastructure at all scales, across industries.
In this workshop, we spend five days with your team onsite and work with you to build high-level and low-level architecture based on your existing infrastructure, using open source tooling. You can customize the workshop agenda based on your needs and the topics that are most valuable for your organization.
Learn more about the MLOps workshop: https://ubuntu.com/ai/mlops-workshop