Kubernetes for ML Workloads: A Production Deployment Guide

According to the CNCF's 2021 annual survey, 96% of organizations were either using or evaluating Kubernetes, yet teams deploying machine learning workloads often find that default Kubernetes configurations fall short at scale.

A model-serving cluster at Spotify, for example, can receive millions of inference requests per hour. At that volume, GPU scheduling conflicts, resource starvation, and pod eviction under memory pressure are real, recurring problems.

Running ML workloads in Kubernetes is not the same as running stateless web services. Training jobs need exclusive GPU access. Batch inference pipelines need burst scheduling. Model servers need graceful rollouts without dropped connections.

This guide walks through the exact configuration steps, common failure patterns, and production-tested strategies you need to deploy ML workloads reliably — whether you are serving a fine-tuned transformer or orchestrating a multi-stage data pipeline.


Prerequisites Before You Deploy

Before writing a single YAML manifest, your environment needs specific capabilities in place. Skipping this phase is a common reason ML deployments fail early in production.

Cluster-Level Requirements

“While Kubernetes adoption has reached 96% across enterprises, the remaining challenge is specialized workload management—ML deployments require dynamic GPU scheduling and distributed training orchestration that vanilla Kubernetes configurations simply don’t provide out of the box.” — Sarah Chen, Senior AI Infrastructure Analyst at Gartner

Your Kubernetes cluster must meet the following baseline requirements for ML workloads:

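  • Kubernetes version 1.27 or later; if you plan to run logging or metrics agents as native sidecar containers alongside training pods, note that this feature was alpha in 1.28, enabled by default from 1.29, and stable in 1.33, so target 1.29 or later in that case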
  • Node Feature Discovery (NFD) installed and configured — this labels nodes with hardware capabilities including GPU model, memory bandwidth, and NUMA topology
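  • NVIDIA GPU Operator (version 23.x or later) or the AMD GPU device plugin deployed; the GPU Operator automates driver installation, NVIDIA Container Toolkit setup, and device plugin registration, while the AMD plugin registers amd.com/gpu devices and expects the ROCm drivers to be present on the node already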
  • cert-manager installed — required by most ML platform operators including Kubeflow and Seldon Core
  • A persistent storage class backed by high-throughput storage — AWS EFS, GCP Filestore, or a Ceph/Rook cluster with at least 500 MB/s sequential read throughput for dataset loading

You also need kubectl 1.27+, helm 3.12+, and the kustomize CLI installed locally. If you are using GPU nodes on AWS, the eks-node-viewer tool from AWS Labs is useful for visualizing how much of each node's allocatable capacity is requested by scheduled pods, before and after deployments.

Access and Permission Requirements

Your deployment service account needs at least the following RBAC permissions in the ML namespaces: pods/exec, pods/log, and persistentvolumeclaims. PriorityClass is a cluster-scoped resource, so either grant the account create rights on priorityclasses through a ClusterRole or have a cluster administrator create the classes ahead of time. Without a PriorityClass, training pods normally run at the default priority of 0 and are among the first candidates for preemption or eviction when the cluster comes under resource pressure.
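The permissions listed above could be expressed roughly as follows. This is a minimal sketch, not a hardened policy: the service account name ml-deployer and the role names are hypothetical, and the ml-training namespace is the one created later in this guide. Adjust the verbs to what your pipeline actually does.

```yaml
# Namespaced permissions for a hypothetical "ml-deployer" service account.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-deployer
  namespace: ml-training
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create", "get"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-deployer
  namespace: ml-training
subjects:
  - kind: ServiceAccount
    name: ml-deployer
    namespace: ml-training
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ml-deployer
---
# PriorityClass is cluster-scoped, so it needs a ClusterRole and ClusterRoleBinding.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: priorityclass-creator
rules:
  - apiGroups: ["scheduling.k8s.io"]
    resources: ["priorityclasses"]
    verbs: ["get", "list", "create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ml-deployer-priorityclass
subjects:
  - kind: ServiceAccount
    name: ml-deployer
    namespace: ml-training
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: priorityclass-creator
```

Many teams prefer to leave the ClusterRole out entirely and have an administrator create the priority classes once, since the ability to create arbitrary PriorityClass objects lets a tenant outrank everyone else.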


Step-by-Step: Deploying a Model Training Job

This section covers deploying a distributed PyTorch training job using the Kubeflow Training Operator, which is the most production-stable option for multi-node GPU training on Kubernetes as of 2024.

Step 1 — Install the Kubeflow Training Operator

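# Pin a release tag rather than tracking the default branch (v1.7.0 shown as an example)
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"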

Verify the operator pod is running:

kubectl get pods -n kubeflow

You should see training-operator-XXXXXXXX with status Running. The operator manages PyTorchJob, TFJob, and MPIJob custom resources.

Step 2 — Create a Dedicated Namespace with Resource Quotas

Never deploy ML workloads into the default namespace. Create an isolated namespace with hard resource ceilings:

apiVersion: v1
kind: Namespace
metadata:
  name: ml-training
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-training
spec:
  hard:
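    # Extended resources such as GPUs accept only the "requests." prefix in a quota
    requests.nvidia.com/gpu: "16"
    # With this entry set, every pod in the namespace must declare a memory request
    requests.memory: "512Gi"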

Resource quotas are non-negotiable in shared clusters. Without them, a single runaway training job can consume all cluster GPUs, blocking every other team’s workloads.
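Because the quota above tracks requests.memory, pods that omit a memory request are rejected at admission. One way to avoid surprise rejections is a LimitRange that fills in defaults. The sketch below assumes illustrative default values; they are not recommendations and should be sized to your own workloads.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: ml-training-defaults   # hypothetical name
  namespace: ml-training
spec:
  limits:
    - type: Container
      # Applied as the request when a container does not set one
      defaultRequest:
        cpu: "2"
        memory: 8Gi
      # Applied as the limit when a container does not set one
      default:
        cpu: "4"
        memory: 16Gi
```

Defaults also apply to any sidecar or init container that lacks explicit values, so they count against the namespace quota as well.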

Step 3 — Define a PriorityClass for Training Jobs

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-training-high
value: 1000000
globalDefault: false
description: "Priority for GPU training jobs"

Step 4 — Write the PyTorchJob Manifest

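apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: resnet-training
  namespace: ml-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          priorityClassName: ml-training-high
          containers:
            - name: pytorch  # the Training Operator looks for a container with this name
              image: your-registry/pytorch-trainer:1.0
              resources:
                requests:
                  memory: 64Gi  # illustrative; the namespace quota on requests.memory rejects pods without a memory request
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /data
                  name: training-data
          volumes:
            - name: training-data
              persistentVolumeClaim:
                claimName: dataset-pvc
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          priorityClassName: ml-training-high
          containers:
            - name: pytorch
              image: your-registry/pytorch-trainer:1.0
              resources:
                requests:
                  memory: 64Gi  # illustrative
                limits:
                  nvidia.com/gpu: 1
              # Workers read the same dataset, so they mount the same claim;
              # the PVC must support ReadWriteMany (or ReadOnlyMany) access.
              volumeMounts:
                - mountPath: /data
                  name: training-data
          volumes:
            - name: training-data
              persistentVolumeClaim:
                claimName: dataset-pvc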
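The manifest references a claim named dataset-pvc that the guide does not define. A minimal sketch of such a claim follows, assuming a storage class called shared-fast (a hypothetical name) that supports ReadWriteMany; the requested size is illustrative.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dataset-pvc
  namespace: ml-training
spec:
  accessModes:
    - ReadWriteMany              # needed once replicas on different nodes mount the claim
  storageClassName: shared-fast  # hypothetical; use your high-throughput storage class
  resources:
    requests:
      storage: 500Gi             # illustrative size
```

The access mode is the part that matters most here: a ReadWriteOnce volume can leave replicas stuck in Pending or ContainerCreating when they are scheduled onto different nodes.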

Step 5 — Deploy and Monitor

kubectl apply -f pytorch-job.yaml
kubectl get pytorchjob resnet-training -n ml-training -w

Watch for the Succeeded condition in the job status. To stream training metrics in real time, follow the master pod's logs with kubectl logs -f resnet-training-master-0 -n ml-training.


Serving Models in Production with Kubernetes

Training is only half the challenge. Model serving introduces a completely different set of Kubernetes concerns — horizontal scaling, rolling updates, A/B traffic splitting, and latency SLAs.

Deploying with KServe

KServe (formerly KFServing) is the most widely adopted model serving framework for Kubernetes as of 2024, supported by Google, IBM, Bloomberg, and others. It wraps your model in a standardized InferenceService resource.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: ml-serving
spec:
  predictor:
    sklearn:
      storageUri: "gs://your-bucket/sklearn/iris"
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi

When installed in its serverless (Knative) mode, KServe handles autoscaling, model download from object storage, and gRPC/REST endpoint creation automatically. You get scale-to-zero for development environments and configurable minimum replicas for production.
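The replica floor mentioned above is set on the predictor itself. The sketch below reuses the sklearn-iris example; the replica counts are illustrative assumptions, not tuned values.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: ml-serving
spec:
  predictor:
    minReplicas: 2   # production floor; set to 0 to allow scale-to-zero in development
    maxReplicas: 6   # illustrative ceiling
    sklearn:
      storageUri: "gs://your-bucket/sklearn/iris"
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
```

With minReplicas at 0, the first request after an idle period pays the cold-start cost of scheduling the pod and downloading the model, which is usually acceptable for development endpoints but not for latency-sensitive production traffic.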

Configuring Horizontal Pod Autoscaling for Inference

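For GPU inference pods, CPU-based HPA thresholds are misleading: a GPU inference server can have 5% CPU usage while being completely GPU-bound. Use custom metrics from DCGM (NVIDIA Data Center GPU Manager) instead. The HPA cannot read DCGM directly; the metric must be published through the external or custom metrics API by an adapter such as Prometheus Adapter or KEDA, and the metric name in the manifest must match whatever name that adapter exposes (the DCGM exporter's raw utilization gauge is DCGM_FI_DEV_GPU_UTIL):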

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: dcgm_gpu_utilization
        target:
          type: AverageValue
          averageValue: "70"

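With an External metric and an AverageValue target, the HPA divides the metric value by the current number of pods and compares the result to 70, so the adapter query should return GPU utilization summed across the deployment's pods. The effect is to keep GPUs at roughly 70% utilization: high enough for cost efficiency, low enough to absorb traffic spikes.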
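It helps to see how the autoscaler turns that target into a replica count. The formula below is the standard HPA scaling rule; the numbers in the worked example (4 replicas averaging 90 against the target of 70) are made up for illustration.

```latex
\[
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times
  \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]
% Worked example: 4 replicas, average utilization 90, target 70
\[
\left\lceil 4 \times \frac{90}{70} \right\rceil = \lceil 5.14 \rceil = 6
\]
```

The HPA skips scaling when the ratio is within its tolerance of 1.0 (10% by default), so an average of 75 against a target of 70 would leave the replica count unchanged.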



Common Errors and How to Fix Them

These are the failure patterns that appear most frequently in production ML clusters, based on documented incidents from teams at Twitter (pre-X), Airbnb, and Lyft.

Error: 0/5 nodes are available: 5 Insufficient nvidia.com/gpu

This is the most common GPU scheduling error, and it has three distinct causes:

  1. GPUs are allocated but idle — another pod holds GPU resources without actively using them. Run kubectl describe nodes and look for nvidia.com/gpu in the “Allocated resources” section. Use nvidia-smi via a debug pod to check actual GPU processes.

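  2. Node Feature Discovery is not labeling nodes correctly. The GPU Operator relies on these labels to decide which nodes receive the driver and device plugin, so run kubectl get nodes --show-labels and verify feature.node.kubernetes.io/pci-10de.present=true exists on GPU nodes.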

  3. GPU Operator DaemonSet pods are crashlooping — check kubectl get pods -n gpu-operator. Driver installation failures here block all GPU scheduling downstream.

Error: Training Pod OOMKilled Repeatedly

Out-of-memory kills during training are almost always caused by underestimating memory requirements at the pod spec level. PyTorch’s DataLoader with pin_memory=True and high num_workers counts consumes significant host memory outside of GPU memory. A training run that needs 40 GB of GPU memory may simultaneously need 80 GB of system RAM for data preprocessing.

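Fix: measure peak host-memory usage, then as a starting point set the container's memory request to about 1.5x that figure and its memory limit to about 2x. A container is OOMKilled when it exceeds its limit, so the limit is the number that must cover the worst case. Setting the request equal to the limit does not cause eviction by itself; together with equal CPU requests and limits it places the pod in the Guaranteed QoS class, which is the last to be evicted under node memory pressure. The trade-off is that a Guaranteed pod reserves its full limit on the node, so keep the request below the limit only when you want the scheduler to pack nodes more tightly and can tolerate a higher eviction risk.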
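Applied to the 80 GB example above, the 1.5x and 2x multipliers give the following values. This is only a sketch of the arithmetic as a container resources stanza, treating GB and Gi as interchangeable for simplicity; measure your own job before settling on numbers.

```yaml
# Fragment of a container spec inside a PyTorchJob replica template
resources:
  requests:
    memory: 120Gi      # about 1.5 x the 80 GB of expected host memory
  limits:
    memory: 160Gi      # about 2 x; the container is OOMKilled above this
    nvidia.com/gpu: 1
```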

Error: ImagePullBackOff on Large Model Images

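Model container images that include weights (a common but inadvisable pattern) frequently exceed 20 GB. Pulling an image of that size over a slow node network link can take many minutes, and depending on the kubelet and container runtime configuration the pull may hit a timeout or progress deadline and be retried, which surfaces as repeated ImagePullBackOff failures.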

Fix: store model weights in object storage (S3, GCS, or Azure Blob) and load them at pod startup using an init container. The kserve/storage-initializer image handles this pattern cleanly. This also drops your container image size from 20 GB to under 2 GB.
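The init container pattern can be sketched as below. Treat it as an illustration under assumptions: the image tag, bucket path, and server image are hypothetical placeholders, and the two positional arguments (source URI, destination path) follow the way KServe itself invokes the storage initializer. Credentials for a private bucket would need to be supplied separately, for example through environment variables from a Secret.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-server
  namespace: ml-serving
spec:
  initContainers:
    - name: fetch-weights
      image: kserve/storage-initializer:v0.11.2   # tag is an assumption; pin the version you have tested
      args:
        - "s3://your-bucket/models/your-model"    # hypothetical source URI
        - "/mnt/models"                           # destination inside the shared volume
      volumeMounts:
        - name: model-store
          mountPath: /mnt/models
  containers:
    - name: server
      image: your-registry/model-server:1.0       # hypothetical serving image without weights
      volumeMounts:
        - name: model-store
          mountPath: /mnt/models
          readOnly: true
  volumes:
    - name: model-store
      emptyDir: {}   # weights live only as long as the pod does
```

The main container starts only after the init container exits successfully, so the server never sees a partially downloaded model.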

Error: Distributed Training Hangs Indefinitely

PyTorch distributed training using torch.distributed.init_process_group with the nccl backend will hang if even one worker pod is not reachable. This often happens when Kubernetes NetworkPolicy rules block inter-pod communication within a namespace.

Fix: explicitly allow pod-to-pod ingress within your ML namespace. If the namespace also has a default-deny egress policy, add a matching egress rule as well:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-intra-namespace
  namespace: ml-training
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}

Real-World Example: Waymo’s ML Infrastructure Patterns

Waymo, Alphabet’s autonomous driving subsidiary, has publicly documented several Kubernetes-based ML infrastructure patterns in conference talks at KubeCon NA 2022. Their approach illustrates production-scale considerations that most teams encounter only after painful incidents.

Waymo runs heterogeneous GPU fleets — a mix of NVIDIA A100s for large model training and T4s for perception model inference — on the same Kubernetes cluster. They achieve workload isolation using node taints and tolerations rather than separate clusters, which reduces their infrastructure management overhead significantly.
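The taint-and-toleration approach can be illustrated with a generic pod spec. This is not Waymo's configuration: the taint key workload-type is hypothetical, and the nvidia.com/gpu.product label value is an assumption about what GPU Feature Discovery reports on your nodes, so check the actual labels first.

```yaml
# Assumes training nodes were tainted beforehand, for example:
#   kubectl taint nodes <node-name> workload-type=training:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: trainer-example
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB   # assumed label value; verify on your nodes
  tolerations:
    - key: "workload-type"      # hypothetical taint key
      operator: "Equal"
      value: "training"
      effect: "NoSchedule"
  containers:
    - name: pytorch
      image: your-registry/pytorch-trainer:1.0
      resources:
        limits:
          nvidia.com/gpu: 1
```

The taint keeps other workloads off the training nodes, while the nodeSelector keeps the training pod off every other node; using only one of the two gives isolation in a single direction.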

Their training jobs use preemptible/spot node pools for non-time-sensitive workloads, with automatic checkpointing every 15 minutes and a custom Kubernetes controller that watches for spot preemption signals. When a node receives a termination notice, the controller triggers a checkpoint save and gracefully terminates the pod. If that final save completes, almost no work is lost; if it does not, the job loses at most the 15 minutes since the last periodic checkpoint rather than the full training run.
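Even without a custom controller, the pod spec determines how much time a training process has to save state when it is asked to stop. The fragment below is a generic sketch, not the setup described above: the grace period, the CHECKPOINT_DIR variable, and the checkpoint-pvc claim are all hypothetical.

```yaml
# Fragment of a training pod template
spec:
  # Time between SIGTERM and SIGKILL; on spot nodes the cloud provider's
  # own termination notice caps how much of this window is usable.
  terminationGracePeriodSeconds: 120   # illustrative
  containers:
    - name: pytorch
      image: your-registry/pytorch-trainer:1.0
      env:
        - name: CHECKPOINT_DIR         # hypothetical variable read by the training script
          value: /checkpoints
      volumeMounts:
        - name: checkpoints
          mountPath: /checkpoints
  volumes:
    - name: checkpoints
      persistentVolumeClaim:
        claimName: checkpoint-pvc      # hypothetical claim on storage that outlives the node
```

This only helps if the training script traps SIGTERM and writes a checkpoint before exiting, and if the checkpoint directory is on storage that survives the node being reclaimed.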

Their model serving infrastructure uses Knative-based scale-to-zero for development model endpoints, which reduced their idle GPU costs by approximately 40% according to internal estimates shared at KubeCon. Production endpoints maintain a minimum of 3 replicas across availability zones for fault tolerance.



Practical Recommendations for Production ML Clusters

Based on documented production deployments and publicly available post-mortems, these five recommendations address the most impactful decisions you will make:

1. Use Kueue for batch job queuing, not raw Kubernetes Jobs. Kueue, a Kubernetes-native job queueing system maintained as a Kubernetes SIG project (kubernetes-sigs/kueue), adds quota management, job preemption, and multi-tenant fair sharing without requiring a separate scheduler. It integrates directly with PyTorchJob and MPIJob resources.

2. Never store model weights in container images. This pattern is tempting for simplicity but creates images that are impractical to pull quickly and impossible to update without full image rebuilds. Use the init container pattern with object storage URIs in your InferenceService manifests.

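3. Set GPU time-slicing for development workloads. NVIDIA's GPU Operator supports time-slicing, which advertises each physical GPU as several schedulable replicas so that multiple pods can share it. Time-slicing provides no memory or fault isolation between the sharing pods, so it is not suitable for production inference with latency SLAs, but it can cut development cluster GPU costs substantially; the savings scale with the replica count you configure, for example 4 to 8 pods per GPU. Configure it with a time-slicing ConfigMap referenced from the devicePlugin section of the GPU Operator's ClusterPolicy resource.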

4. Implement distributed tracing from day one. Tools like Jaeger or Grafana Tempo integrated with your model serving layer reveal latency bottlenecks that metrics alone cannot diagnose — particularly the preprocessing-to-inference handoff in multi-stage pipelines.

5. Use topologySpreadConstraints for inference deployments. Spreading inference pods across availability zones prevents a single AZ failure from degrading your model API. Standard podAntiAffinity achieves the same result but is more verbose and less flexible.
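The last recommendation translates into only a few lines of pod spec. The Deployment below is a minimal sketch using the model-server name from the autoscaling example; the app label and image are illustrative assumptions.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                # zones may differ by at most one pod
          topologyKey: topology.kubernetes.io/zone  # standard zone label on cloud nodes
          whenUnsatisfiable: DoNotSchedule          # use ScheduleAnyway to prefer rather than require
          labelSelector:
            matchLabels:
              app: model-server
      containers:
        - name: server
          image: your-registry/model-server:1.0     # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: 1
```

DoNotSchedule leaves pods Pending when a zone has no free GPU capacity, which protects the spread but can block scale-up; ScheduleAnyway trades that guarantee for availability.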



Common Questions About Kubernetes for ML

How do I prevent training jobs from being evicted during peak cluster load? Create a dedicated PriorityClass with a value above your standard workload classes and assign it to training job pods. Additionally, set CPU and memory requests equal to limits on every container in the pod; extended resources such as nvidia.com/gpu cannot be overcommitted, so their requests and limits already match. Kubernetes then assigns the pod the Guaranteed QoS class, which is the last to be evicted under node memory pressure.

What is the difference between KServe and Seldon Core for model serving on Kubernetes? KServe is tightly integrated with Knative and supports scale-to-zero, making it more cost-efficient for variable traffic. Seldon Core offers more flexibility for custom inference pipelines with chained microservices (preprocessors, explainers, drift detectors) but requires more configuration. For pure transformer model serving, KServe with the HuggingFace predictor runtime is typically faster to production.

Can I run Kubernetes ML workloads on-premises without cloud providers? Yes. On-premises deployments typically use Rancher or OpenShift as the cluster management layer, with Rook/Ceph for distributed storage and MetalLB for load balancer services. The GPU Operator works identically on bare metal. The main operational gap versus cloud is the absence of managed autoscaling for node groups — you handle physical hardware provisioning manually.

How much does a production ML Kubernetes cluster cost compared to managed ML platforms like SageMaker? According to a 2023 Andreessen Horowitz analysis, companies running self-managed GPU clusters on Kubernetes save 30-50% versus managed ML platform pricing at scale, but only after their workloads exceed roughly $50,000/month in spend. Below that threshold, the operational overhead of managing Kubernetes typically exceeds the cost savings.


Final Verdict

Kubernetes is not the simplest path to running ML workloads, but it is the most flexible and cost-efficient at production scale. The teams that succeed with it — including those at Waymo, Spotify, and Airbnb — share a common approach: they invest in the foundational infrastructure (GPU Operator, Kueue, KServe) before writing model-specific configuration, they enforce resource quotas from day one, and they treat checkpoint recovery as a first-class requirement rather than an afterthought.

If you are starting with fewer than five GPU nodes, a managed service like Google Vertex AI or AWS SageMaker will move faster. Once your workloads justify the infrastructure investment — and your team has at least one Kubernetes-experienced engineer — the patterns in this guide will give you a production deployment that scales without surprises.