
Scaling MLOps with Platform Engineering


The “Last Mile” Problem in AI

Everybody loves an analogy to help picture a story, and for me, building a machine learning model is a bit like building a sports (kit) car in your garage: the engine roars perfectly on the test stand, but getting it onto the open road? That’s a different story. This is AI’s infamous “last mile.”

The “last mile problem” in AI refers to the challenges organisations face when moving AI models from successful testing into effective real-world applications. Despite billions invested in data and brilliant talent, most ML models never make it into production. And those that do? They often run on brittle, hand-assembled infrastructure: custom scripts, mismatched environments, and manual hacks. Data scientists, masters of maths and modelling, are forced into DevOps roles, wrestling with Kubernetes, GPUs, and cloud networking. It’s slow, costly, and a colossal waste of talent.

MLOps promises to fix this. MLOps, or Machine Learning Operations, is a set of practices that combines machine learning, software engineering, and data engineering to manage the machine learning lifecycle efficiently. Just like DevOps and SRE, it focuses on deploying, monitoring, and maintaining machine learning models in production. However, it’s a methodology, not a turnkey solution. Enter an ML-ready internal developer platform (IDP) – the output of Platform Engineering and a game-changer that turns MLOps chaos into a production-ready AI factory.


 

Why AI Deployment Is Hard

Machine learning systems aren’t just software — they’re living, evolving systems with unique operational challenges:

  • Compute-Hungry Workloads: Training deep learning models requires expensive GPUs and TPUs. Scheduling and sharing them efficiently is a challenge.
  • Complex, Multi-Stage Pipelines: ML workflows aren’t “build, test, deploy.” They’re graphs of tasks: ingest → clean → feature engineer → train → evaluate → version → deploy → monitor → retrain.
  • Versioning Chaos: Every component (code, models, and data) needs version control. Many organisations don’t version these well, so a change in any one of them can silently break the system.
  • Data and Model Drift: Unlike software, models degrade over time. Teams must monitor for shifts in data or drops in predictive performance.

Without a standardised approach, every project becomes a bespoke infrastructure nightmare, forcing data scientists to reinvent the wheel repeatedly.

 

Platform Engineering: Creating the “Golden Path” for AI

Platform Engineering builds internal developer platforms (IDPs): curated, self-service environments that abstract away the complexity of software delivery, from someone tinkering with data on their laptop to serving real customers with apps in the cloud. It therefore makes sense that a ready-to-go, already familiar way of delivering AI training and serving provides clear and obvious value. This is our IDP for MLOps. Think of it as building a factory where your data scientists focus on the craft, while the platform handles the machinery.

  • Training Environment: Pre-configured with the right libraries, data access, and compute.
  • Hardware Access: A simple request reserves the right GPU, TPU or FPGA — no infrastructure or Terraform knowledge needed.
  • Deployment: Push code to Git, and the platform handles testing, validation, and secure, compliant deployment to production.

This creates a “golden path” — a standardised, automated route from model conception to production — reducing cognitive load and operational friction on data specialists and software developers.

 

A Modern MLOps Blueprint: Hybrid Platforms for the Full ML Lifecycle

In this section, I want to look at what a modern MLOps blueprint could look like. In this example, we leverage Kubernetes for training, complex workflows and high-performance inference, and serverless for event-driven inference. One of the most interesting Kubernetes developments of recent years arrived in version 1.33: Dynamic Resource Allocation (DRA). DRA is a Kubernetes framework that allows container workloads to request and consume specialised, shared resources, such as high-performance storage or hardware accelerators, in a manner that’s decoupled from the core scheduling process. I want to touch on the benefit it can bring and how an IDP feature can be built to make using it transparent to data specialists and software engineers.

 

First, let’s look at a summary of a potential solution:

1. Kubernetes for Training & Complex Workflows

Training deep learning models requires sustained computational power. Kubernetes provides a robust foundation:

  • Acceleration: Hardware accelerators like specialised GPUs can be attached to Kubernetes worker nodes to optimise the speed and cost of model training.
  • ML Workflows: Tools like Kubeflow or Argo Workflows provide a common and familiar way to manage complex, multi-stage pipelines (see the sketch after this list).
  • Experiment Tracking: Kubernetes, particularly managed offerings, provides a reasonable set of metrics and logging out of the box. Add GitOps tools like FluxCD or Config Sync, and experiment events, versions and models are tracked automatically, letting developers compare results without extra work.
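
To make the pipeline idea concrete, here is a minimal sketch of a multi-stage training run expressed as an Argo Workflows manifest. It assumes Argo Workflows is installed in the cluster; the container image and step names are hypothetical placeholders, and a real pipeline would add feature engineering, versioning and deployment stages.

  # Minimal sketch: a three-stage training pipeline as an Argo Workflow DAG.
  apiVersion: argoproj.io/v1alpha1
  kind: Workflow
  metadata:
    generateName: training-pipeline-
  spec:
    entrypoint: pipeline
    templates:
    - name: pipeline
      dag:
        tasks:
        - name: ingest
          template: run-step
          arguments:
            parameters: [{name: step, value: ingest}]
        - name: train
          template: run-step
          dependencies: [ingest]
          arguments:
            parameters: [{name: step, value: train}]
        - name: evaluate
          template: run-step
          dependencies: [train]
          arguments:
            parameters: [{name: step, value: evaluate}]
    - name: run-step
      inputs:
        parameters:
        - name: step
      container:
        image: registry.example.com/ml/pipeline-steps:1.0.0  # hypothetical image
        command: ["python", "-m", "pipeline"]
        args: ["{{inputs.parameters.step}}"]

Kubeflow Pipelines offers an equivalent, higher-level SDK if your teams prefer writing pipelines in Python rather than YAML.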

 

2. Serverless for Event-Driven Inference

For sporadic inference workloads (e.g., a user uploads a photo), serverless is cost-effective:

  • Google Cloud Run, AWS Lambda, or Knative scale to zero on demand.
  • Reduces idle costs and simplifies resource management.

These services generally also come with reasonable metrics and logging included, but still require some setup. As we’re looking to leverage the benefits of Kubernetes, we recommend staying in the same specification model and using tools like Config Connector or AWS Controllers for Kubernetes (ACK) to deploy them.
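
As an illustration, a minimal Knative Service for the photo-upload example might look like the following. The image is a hypothetical placeholder, and the autoscaling annotations are what allow it to scale to zero when idle.

  # Minimal sketch: an event-driven inference endpoint on Knative that scales to zero.
  apiVersion: serving.knative.dev/v1
  kind: Service
  metadata:
    name: photo-classifier
  spec:
    template:
      metadata:
        annotations:
          autoscaling.knative.dev/min-scale: "0"   # scale to zero when idle
          autoscaling.knative.dev/max-scale: "20"  # cap burst capacity
      spec:
        containers:
        - image: registry.example.com/ml/photo-classifier:1.0.0  # hypothetical image
          ports:
          - containerPort: 8080
          resources:
            requests:
              cpu: "1"
              memory: 2Gi

If you would rather use the managed equivalents, the same Git repository can hold Config Connector or ACK resources describing a Cloud Run service or a Lambda function, keeping everything in the Kubernetes specification model described above.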

 

3. Kubernetes for High-Performance Inference

For low-latency, high-throughput applications, we want to use specialist hardware such as a TPU or FPGA. Doing so needs some extra wiring in our cloud provider set-up. Some serverless services now offer the ability to attach accelerators, but we’re going to stick with the Kubernetes approach because it is robust, consistent and provider-agnostic:

  • Through our managed Kubernetes offering, we define dynamic node pools with specific hardware available.
  • We provide hardware classes to simplify their implementation.
  • The platform automatically selects the best node pool target for the use case.
  • Using OPA Gatekeeper, we also get to define all our business compliance rules in the same way we define everything else (see the example constraint below).
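
To give a flavour of what those compliance rules look like, here is a minimal sketch of a Gatekeeper constraint. It assumes a K8sRequiredLabels ConstraintTemplate along the lines of the one in the Gatekeeper getting-started documentation is installed, and the cost-centre label is a hypothetical business rule.

  # Minimal sketch: require every Pod to declare a cost-centre label.
  # Assumes the K8sRequiredLabels ConstraintTemplate from the Gatekeeper docs is installed.
  apiVersion: constraints.gatekeeper.sh/v1beta1
  kind: K8sRequiredLabels
  metadata:
    name: pods-must-have-cost-centre
  spec:
    match:
      kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    parameters:
      labels: ["cost-centre"]   # hypothetical required label

Because the constraint is just another manifest, it is versioned, reviewed and rolled out through exactly the same GitOps flow as the workloads it governs.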
 

 

Technical Deep Dive: Why Kubernetes 1.33+ Changes the Game

As you may have noted, Kubernetes becomes a fundamental tool of our MLOps. As well as deploying our apps to it in the traditional way, we can use it for workflows, for security and compliance, and for deploying our cloud serverless infrastructure. But the part I really want to highlight is how we can use it to consistently manage our AI hardware: DRA.

 

Dynamic Resource Allocation (DRA) for AI Accelerators

The Old Way: Rigid device allocation forced fragile workarounds; data scientists couldn’t easily request a specific GPU, or just a slice of one, for small tests.

The New Way: Kubernetes 1.33+ introduces DRA:

  • ResourceSlice: Exposes GPU attributes (model, VRAM, architecture).
  • DeviceClass: Platform teams provide simple classes to categorise GPUs (e.g., gpu-high-performance-training).
  • ResourceClaim: Pipelines simply request a GPU from the appropriate DeviceClass.
  • ResourceClaimTemplate: Automates scaling; every pod gets a unique GPU claim automatically.

This flexibility reduces operational complexity and improves resource utilisation dramatically.
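
Pulled together, the building blocks above might look something like this. It is a minimal sketch using the resource.k8s.io/v1beta1 API (check which version your cluster serves); the class name, CEL selector, driver name and container image are hypothetical, and the attributes you can select on depend on the DRA driver your platform installs.

  # Minimal sketch: a platform-provided DeviceClass, and a training Pod that claims a GPU from it.
  apiVersion: resource.k8s.io/v1beta1
  kind: DeviceClass
  metadata:
    name: gpu-high-performance-training   # hypothetical platform class
  spec:
    selectors:
    - cel:
        expression: device.driver == "gpu.example.com"   # driver name depends on your DRA driver
  ---
  apiVersion: resource.k8s.io/v1beta1
  kind: ResourceClaimTemplate
  metadata:
    name: training-gpu-template
  spec:
    spec:
      devices:
        requests:
        - name: gpu
          deviceClassName: gpu-high-performance-training
  ---
  apiVersion: v1
  kind: Pod
  metadata:
    name: train-job
  spec:
    restartPolicy: Never
    containers:
    - name: trainer
      image: registry.example.com/ml/trainer:2.1.0   # hypothetical image
      resources:
        claims:
        - name: gpu              # binds the container to the claim below
    resourceClaims:
    - name: gpu
      resourceClaimTemplateName: training-gpu-template

A pipeline step asks for “a GPU from the training class” and nothing more; which physical device it gets, and on which node, is the scheduler’s problem, not the data scientist’s.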

 

 

Topology-Aware Routing

Topology-aware routing tells Kubernetes to send AI requests to a GPU or TPU in the same physical zone as the request origin. This makes inference faster and reduces cloud costs without complex networking.

  • DRA finds a specific “slice” of hardware (e.g., “I need 1 TPU v4 chip”) and places your Pod on a node that has that specific device available.
  • Setting trafficDistribution: PreferClose on a Kubernetes Service ensures that once DRA has placed that Pod in a specific zone, network requests are routed to that exact zone, reducing latency and cross-zone cloud egress costs (see the Service sketch after this list).
  • Without both, DRA might place your workload on a high-speed TPU in Zone A while Kubernetes routes requests to it from Zone B, adding network lag and potentially negating some of the performance gains.
  • Applied correctly, this means no networking expertise is required from data scientists. They just ask and get.
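
On the Service side, that routing preference is a one-line addition. A minimal sketch, assuming an inference Deployment labelled app: inference is already running in the cluster:

  # Minimal sketch: keep inference traffic in the same zone as the serving Pods where possible.
  apiVersion: v1
  kind: Service
  metadata:
    name: inference
  spec:
    selector:
      app: inference              # assumes the serving Pods carry this label
    ports:
    - port: 80
      targetPort: 8080
    trafficDistribution: PreferClose   # prefer endpoints in the caller's zone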
 

 


Full-Lifecycle MLOps: Automating Day 2 Operations

A robust IDP doesn’t just deploy models; it manages their ongoing health and hardware stability through the entire lifecycle.

 

1. GitOps-Driven Deployment

Deployments are defined as Configuration-as-Data manifests, not complex IaC. Tools like Config Sync or FluxCD apply changes as they land in Git, keeping a constant, synchronised implementation across each SDLC environment.

  • As new code for training or inference is developed, it is tested and delivered as a new image version in an image registry. To deploy, update the version in the manifest and the platform does the rest.
  • The same applies to the models. Data scientists release new model versions simply by opening a Pull Request; the IDP runs any necessary validations (such as safety checks), and when the new version is ready, the model version in the manifest is updated in exactly the same way (see the sketch below).
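
A minimal sketch of the wiring with Flux is below; the repository URL and path are hypothetical. Once this is in place, “deploying” a new image or model version really is just a Pull Request that edits a version field somewhere under that path.

  # Minimal sketch: Flux watches a Git repository and keeps the cluster in sync with it.
  apiVersion: source.toolkit.fluxcd.io/v1
  kind: GitRepository
  metadata:
    name: ml-platform-config
    namespace: flux-system
  spec:
    interval: 1m
    url: https://github.com/example/ml-platform-config   # hypothetical repository
    ref:
      branch: main
  ---
  apiVersion: kustomize.toolkit.fluxcd.io/v1
  kind: Kustomization
  metadata:
    name: inference-production
    namespace: flux-system
  spec:
    interval: 5m
    sourceRef:
      kind: GitRepository
      name: ml-platform-config
    path: ./environments/production   # hypothetical path holding the manifests
    prune: true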

 

2. Observability “Out-of-the-Box”

The platform provides a pre-configured monitoring stack covering three distinct layers of reliability:

Operational Health

  • Tooling: Prometheus or Zabbix (with Grafana for a common UI).
  • Purpose: Monitors standard metrics (latency, error rates, CPU/RAM usage).

Model Intelligence

  • Tooling: Evidently AI.
  • Purpose: Tracks “Model Drift” and “Data Drift” to ensure predictions remain accurate over time.

Device Health

  • Tooling: Node Problem Detector; with managed services like GKE it is used under the hood, so Google Cloud Monitoring is a low-touch alternative.
  • Purpose: Managed hardware reliability; automatically tracks TPU/GPU health, utilisation, and memory bandwidth to proactively catch silicon-level failures.
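
For the operational-health layer, here is a minimal sketch of how the platform might wire a model-serving Service into the scrape configuration, assuming the Prometheus Operator is part of the monitoring stack and the Service exposes a named metrics port; names and namespace are hypothetical.

  # Minimal sketch: tell the Prometheus Operator to scrape the inference Service's metrics port.
  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
    name: inference-metrics
    namespace: ml-serving            # hypothetical namespace
  spec:
    selector:
      matchLabels:
        app: inference               # assumes the Service carries this label
    endpoints:
    - port: metrics                  # assumes a named 'metrics' port on the Service
      interval: 30s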

 

The Business Case: Turning Infrastructure into a Competitive Advantage

Platform engineering shifts IT from a support function to a strategic driver, directly impacting the bottom line in four key areas:

1. Accelerated Time-to-Market

  • The Shift: Moves deployment timelines from months to days.
  • How: Pre-approved “Golden Paths” automate the complex journey from code to production. This removes bottlenecks, allowing you to launch AI products whilst competitors are still configuring servers.

2. Maximised Talent ROI

  • The Shift: High-value resources stop doing low-value work.
  • How: Data Scientists are expensive. The platform abstracts away the infrastructure complexity, ensuring your PhDs spend their time optimising models, not wrestling with Kubernetes YAML or networking configurations.

3. Automated Governance & Security

  • The Shift: Compliance becomes a default state, not a manual hurdle.
  • How: Because every deployment is defined as code, security standards and audit trails are baked in automatically. This reduces risk without slowing down innovation.

4. Strategic Cost Efficiency

The platform actively drives down the cost of AI at scale using intelligent orchestration:

  • Maximise Hardware Value (via DRA): Ensures expensive GPUs and TPUs are fully utilised (“sliced” correctly) rather than sitting idle or under-allocated.
  • Cut Hidden Cloud Costs (via Topology-Aware Routing): Intelligently keeps data traffic local, eliminating the “silent budget killer” of cross-zone egress fees.
  • Right-Sizing: Uses a hybrid approach (Serverless + Kubernetes) to match the cheapest compute option to the specific workload intensity.

The Platform Impact

The gains can be significant. I’ve summarised some simple comparisons from articles, seminar talks and direct conversations I’ve had over the last couple of years:

Time-to-Market

  • Before: Months — Innovation is stalled by manual ticketing queues, environment configuration, and “works on my machine” issues.
  • After: Days — Self-service “Golden Paths” allow teams to deploy compliant, production-ready AI models automatically via a Pull Request.

Talent Focus

  • Before: Wasted — Expensive Data Scientists can spend 30–50% of their time fighting IaC, learning networking, and doing general infrastructure plumbing.
  • After: Optimised — Scientists focus 100% on high-value modelling and experimentation, treating infrastructure as a transparent utility.

Governance

  • Before: Retroactive — Security and compliance are manual hurdles checked just before launch, often forcing last-minute rewrites.
  • After: By Design — Compliance, audit trails, and security policies are baked into the platform. It is secure by default.

Hardware Efficiency

  • Before: Static & Wasteful — GPUs are often exclusively locked to teams even when idle, and cross-zone traffic incurs hidden “egress tax.”
  • After: Dynamic & Efficient — The platform ensures silicon is sliced and shared dynamically, and Topology-Aware Routing eliminates unnecessary data travel costs.

Reliability

  • Before: Reactive — Operations teams scramble to fix crashes after users report them. Logs are siloed and hard to correlate.
  • After: Proactive — Automated Monitoring (Google Cloud Monitoring) detects silicon degradation (ECC errors) and moves workloads before the hardware fails.
 

 

Conclusion: Build an AI Factory, Not One-Off Projects

MLOps tells you what to do to productionise AI. Platform Engineering tells you how to do it at scale.

Stop treating each model as a unique, artisanal project. Build a factory: an Internal ML Platform that standardises workflows, automates operations, and frees your data scientists to innovate. Kubernetes 1.33+ and modern platform engineering finally make this possible.

 


Ready to take your AI platform to the next level?

We design modern MLOps and AI platforms that deliver results. Reach out to Mesoform today.