AI/ML Workloads with Kubeflow: Enterprise Machine Learning Platform on Kubernetes
Kubeflow provides a comprehensive machine learning platform on Kubernetes, enabling end-to-end ML workflows from experimentation to production. This guide covers implementing enterprise-grade ML infrastructure with Kubeflow, including pipeline orchestration, distributed training, model serving, and operational best practices.
Executive Summary
Kubeflow brings machine learning workflows to Kubernetes, providing tools for experiment tracking, pipeline orchestration, distributed training, hyperparameter tuning, and model serving. This guide provides practical implementation strategies for deploying production-grade ML platforms that enable data scientists and ML engineers to build, train, and deploy models at scale.
Understanding Kubeflow Architecture
Kubeflow Components Overview
Kubeflow Platform Architecture:
# kubeflow-architecture.yaml
# Descriptive outline of the platform; KubeflowArchitecture is not an installable resource type.
apiVersion: architecture.kubeflow.org/v1
kind: KubeflowArchitecture
metadata:
  name: enterprise-ml-platform
spec:
  coreComponents:
    kubeflowPipelines:
      description: "ML workflow orchestration"
      features:
        - "DAG-based pipeline definitions"
        - "Containerized steps"
        - "Artifact tracking"
        - "Pipeline versioning"
        - "Scheduled and triggered runs"
      storage:
        - "MinIO for artifacts"
        - "MySQL for metadata"
    notebooks:
      description: "Interactive development environment"
      features:
        - "JupyterLab interface"
        - "Pre-configured ML frameworks"
        - "GPU support"
        - "Persistent storage"
        - "Shared workspaces"
    training:
      operators:
        - name: "TFJob"
          framework: "TensorFlow"
          features: ["Distributed training", "Parameter server", "AllReduce"]
        - name: "PyTorchJob"
          framework: "PyTorch"
          features: ["Distributed training", "Gloo/NCCL backend"]
        - name: "MXNetJob"
          framework: "MXNet"
          features: ["Distributed training", "Parameter server"]
        - name: "XGBoostJob"
          framework: "XGBoost"
          features: ["Distributed XGBoost"]
    hyperparameterTuning:
      name: "Katib"
      features:
        - "Neural architecture search"
        - "Hyperparameter optimization"
        - "Early stopping"
        - "Multiple search algorithms"
      algorithms:
        - "Random search"
        - "Grid search"
        - "Bayesian optimization"
        - "Hyperband"
    modelServing:
      options:
        - name: "KServe"
          features:
            - "Multi-framework support"
            - "Auto-scaling"
            - "Canary rollouts"
            - "A/B testing"
            - "Explainability"
        - name: "Seldon Core"
          features:
            - "Advanced routing"
            - "Model monitoring"
            - "Outlier detection"
    metadata:
      name: "ML Metadata"
      features:
        - "Experiment tracking"
        - "Artifact lineage"
        - "Model registry"
        - "Dataset versioning"
  supportingServices:
    storage:
      - name: "MinIO"
        purpose: "Artifact storage"
      - name: "NFS/Ceph"
        purpose: "Shared datasets"
    databases:
      - name: "MySQL"
        purpose: "Pipeline metadata"
      - name: "PostgreSQL"
        purpose: "Katib experiments"
    monitoring:
      - "Prometheus for metrics"
      - "Grafana for dashboards"
      - "TensorBoard for training"
  mlWorkflow:
    dataPreparation:
      - "Data ingestion pipelines"
      - "Feature engineering"
      - "Data validation"
      - "Dataset versioning"
    experimentation:
      - "Interactive notebooks"
      - "Experiment tracking"
      - "Hyperparameter tuning"
      - "Model comparison"
    training:
      - "Single-node training"
      - "Distributed training"
      - "GPU acceleration"
      - "Training monitoring"
    evaluation:
      - "Model validation"
      - "Performance metrics"
      - "Model comparison"
      - "Fairness analysis"
    deployment:
      - "Model serving"
      - "A/B testing"
      - "Canary releases"
      - "Production monitoring"
ML Platform Technology Stack
Technology Selection Matrix:
// ml_platform_stack.go
package mlplatform
import (
"fmt"
)
type MLComponent struct {
Name string
Purpose string
Alternatives []string
Recommendation string
UseCases []string
}
func GetMLPlatformStack() []MLComponent {
return []MLComponent{
{
Name: "Kubeflow Pipelines",
Purpose: "Workflow orchestration",
Alternatives: []string{
"Apache Airflow",
"Argo Workflows",
"MLflow",
"Prefect",
},
Recommendation: "Best for ML-specific workflows with artifact tracking",
UseCases: []string{
"End-to-end ML pipelines",
"Reproducible experiments",
"Production model training",
},
},
{
Name: "KServe (KFServing)",
Purpose: "Model serving",
Alternatives: []string{
"TensorFlow Serving",
"TorchServe",
"Seldon Core",
"BentoML",
},
Recommendation: "Best for multi-framework serving with auto-scaling",
UseCases: []string{
"Production inference",
"A/B testing",
"Canary deployments",
},
},
{
Name: "Katib",
Purpose: "Hyperparameter tuning",
Alternatives: []string{
"Ray Tune",
"Optuna",
"Hyperopt",
"Azure ML HyperDrive",
},
Recommendation: "Best for Kubernetes-native HPO",
UseCases: []string{
"Model optimization",
"Neural architecture search",
"AutoML",
},
},
{
Name: "Jupyter Notebooks",
Purpose: "Interactive development",
Alternatives: []string{
"VS Code",
"RStudio",
"Zeppelin",
"Google Colab",
},
Recommendation: "Standard for data science workflows",
UseCases: []string{
"Exploratory data analysis",
"Model prototyping",
"Visualization",
},
},
{
Name: "MLflow",
Purpose: "Experiment tracking",
Alternatives: []string{
"Weights & Biases",
"Neptune.ai",
"Comet.ml",
"TensorBoard",
},
Recommendation: "Can integrate with Kubeflow for tracking",
UseCases: []string{
"Experiment logging",
"Model registry",
"Artifact storage",
},
},
}
}
func PrintMLStack() {
stack := GetMLPlatformStack()
// Println already appends a newline; a trailing "\n" in the argument fails go vet.
fmt.Println("===== Enterprise ML Platform Stack =====")
fmt.Println()
for _, component := range stack {
fmt.Printf("%s\n", component.Name)
fmt.Printf(" Purpose: %s\n", component.Purpose)
fmt.Printf(" Recommendation: %s\n", component.Recommendation)
fmt.Printf(" Use Cases:\n")
for _, uc := range component.UseCases {
fmt.Printf(" - %s\n", uc)
}
fmt.Printf(" Alternatives: %v\n\n", component.Alternatives)
}
}
// ML workload resource requirements
type ResourceProfile struct {
Name string
CPUCores int
MemoryGB int
GPUCount int
GPUType string
StorageGB int
NetworkGbps int
}
func GetMLResourceProfiles() []ResourceProfile {
return []ResourceProfile{
{
Name: "Data Preprocessing",
CPUCores: 16,
MemoryGB: 64,
GPUCount: 0,
StorageGB: 500,
NetworkGbps: 10,
},
{
Name: "Small Model Training",
CPUCores: 8,
MemoryGB: 32,
GPUCount: 1,
GPUType: "NVIDIA T4",
StorageGB: 100,
NetworkGbps: 10,
},
{
Name: "Large Model Training",
CPUCores: 32,
MemoryGB: 128,
GPUCount: 4,
GPUType: "NVIDIA A100",
StorageGB: 1000,
NetworkGbps: 100,
},
{
Name: "Distributed Training",
CPUCores: 64,
MemoryGB: 256,
GPUCount: 8,
GPUType: "NVIDIA A100",
StorageGB: 2000,
NetworkGbps: 200,
},
{
Name: "Model Serving",
CPUCores: 4,
MemoryGB: 16,
GPUCount: 1,
GPUType: "NVIDIA T4",
StorageGB: 50,
NetworkGbps: 10,
},
{
Name: "Batch Inference",
CPUCores: 16,
MemoryGB: 64,
GPUCount: 2,
GPUType: "NVIDIA V100",
StorageGB: 200,
NetworkGbps: 25,
},
}
}
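The resource profiles above map fairly directly onto the resources stanza of a pod spec. The following is a minimal Python sketch of that mapping, under two stated assumptions: memory figures are written as Gi, and requests are set equal to limits. `to_k8s_resources` is a hypothetical helper for illustration, not part of any Kubeflow SDK.

```python
from dataclasses import dataclass


@dataclass
class ResourceProfile:
    """Subset of the Go ResourceProfile fields needed for a pod spec."""
    name: str
    cpu_cores: int
    memory_gb: int
    gpu_count: int = 0


def to_k8s_resources(profile: ResourceProfile) -> dict:
    """Build a Kubernetes `resources` dict with requests equal to limits."""
    limits = {
        "cpu": str(profile.cpu_cores),
        # Assumption: the MemoryGB figure is expressed as Gi for simplicity.
        "memory": f"{profile.memory_gb}Gi",
    }
    if profile.gpu_count > 0:
        # The GPU extended resource is only added when the profile asks for one.
        limits["nvidia.com/gpu"] = str(profile.gpu_count)
    return {"requests": dict(limits), "limits": limits}


small_training = ResourceProfile("Small Model Training", 8, 32, 1)
resources = to_k8s_resources(small_training)
print(resources)
```

Setting requests equal to limits keeps the sketch short; real workloads may prefer lower requests for CPU and memory, while GPU requests and limits must match.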
Kubeflow Installation and Configuration
Production Kubeflow Deployment
Complete Kubeflow Installation:
#!/bin/bash
# install-kubeflow.sh
# Deploy production-grade Kubeflow on Kubernetes
set -euo pipefail
KUBEFLOW_VERSION="1.8.0"
KUSTOMIZE_VERSION="5.0.0"
echo "Installing Kubeflow ${KUBEFLOW_VERSION}..."
# Install kustomize
echo "Installing kustomize..."
wget https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2Fv${KUSTOMIZE_VERSION}/kustomize_v${KUSTOMIZE_VERSION}_linux_amd64.tar.gz
tar -xzf kustomize_v${KUSTOMIZE_VERSION}_linux_amd64.tar.gz
sudo mv kustomize /usr/local/bin/
rm kustomize_v${KUSTOMIZE_VERSION}_linux_amd64.tar.gz
# Clone Kubeflow manifests
echo "Cloning Kubeflow manifests..."
git clone https://github.com/kubeflow/manifests.git
cd manifests
git checkout v${KUBEFLOW_VERSION}
# Install cert-manager (required for Kubeflow)
echo "Installing cert-manager..."
kustomize build common/cert-manager/cert-manager/base | kubectl apply -f -
kubectl wait --for=condition=ready pod -l 'app in (cert-manager,webhook)' \
--timeout=180s -n cert-manager
kustomize build common/cert-manager/kubeflow-issuer/base | kubectl apply -f -
# Install Istio for service mesh
echo "Installing Istio..."
kustomize build common/istio-1-17/istio-crds/base | kubectl apply -f -
kustomize build common/istio-1-17/istio-namespace/base | kubectl apply -f -
kustomize build common/istio-1-17/istio-install/base | kubectl apply -f -
# Install Dex for authentication
echo "Installing Dex..."
kustomize build common/dex/overlays/istio | kubectl apply -f -
# Install OIDC AuthService
echo "Installing AuthService..."
kustomize build common/oidc-authservice/base | kubectl apply -f -
# Install Knative Serving (optional, for KServe)
echo "Installing Knative Serving..."
kustomize build common/knative/knative-serving/overlays/gateways | kubectl apply -f -
kustomize build common/istio-1-17/cluster-local-gateway/base | kubectl apply -f -
# Install Kubeflow Namespace
echo "Creating Kubeflow namespace..."
kustomize build common/kubeflow-namespace/base | kubectl apply -f -
# Install Kubeflow Roles
echo "Installing Kubeflow roles..."
kustomize build common/kubeflow-roles/base | kubectl apply -f -
# Install Kubeflow Istio Resources
echo "Installing Kubeflow Istio resources..."
kustomize build common/istio-1-17/kubeflow-istio-resources/base | kubectl apply -f -
# Install Kubeflow Pipelines
echo "Installing Kubeflow Pipelines..."
kustomize build apps/pipeline/upstream/env/cert-manager/platform-agnostic-multi-user | kubectl apply -f -
# Install KServe
echo "Installing KServe..."
kustomize build contrib/kserve/kserve | kubectl apply -f -
kustomize build contrib/kserve/models-web-app/overlays/kubeflow | kubectl apply -f -
# Install Katib
echo "Installing Katib..."
kustomize build apps/katib/upstream/installs/katib-with-kubeflow | kubectl apply -f -
# Install Central Dashboard
echo "Installing Central Dashboard..."
kustomize build apps/centraldashboard/upstream/overlays/kserve | kubectl apply -f -
# Install Admission Webhook
echo "Installing Admission Webhook..."
kustomize build apps/admission-webhook/upstream/overlays/cert-manager | kubectl apply -f -
# Install Notebook Controller
echo "Installing Notebook Controller..."
kustomize build apps/jupyter/notebook-controller/upstream/overlays/kubeflow | kubectl apply -f -
# Install Jupyter Web App
echo "Installing Jupyter Web App..."
kustomize build apps/jupyter/jupyter-web-app/upstream/overlays/istio | kubectl apply -f -
# Install Profiles + KFAM
echo "Installing Profiles..."
kustomize build apps/profiles/upstream/overlays/kubeflow | kubectl apply -f -
# Install Volumes Web App
echo "Installing Volumes Web App..."
kustomize build apps/volumes-web-app/upstream/overlays/istio | kubectl apply -f -
# Install Tensorboard Controller
echo "Installing Tensorboard Controller..."
kustomize build apps/tensorboard/tensorboard-controller/upstream/overlays/kubeflow | kubectl apply -f -
# Install Tensorboard Web App
echo "Installing Tensorboard Web App..."
kustomize build apps/tensorboard/tensorboards-web-app/upstream/overlays/istio | kubectl apply -f -
# Install Training Operator
echo "Installing Training Operator..."
kustomize build apps/training-operator/upstream/overlays/kubeflow | kubectl apply -f -
# Install User Namespace
echo "Creating user namespace..."
kustomize build common/user-namespace/base | kubectl apply -f -
echo "Waiting for Kubeflow components to be ready..."
kubectl wait --for=condition=ready pod --all -n kubeflow --timeout=600s
echo "Kubeflow installation complete!"
echo ""
echo "Access Kubeflow Dashboard:"
echo "kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80"
echo "Then open: http://localhost:8080"
echo ""
echo "Default credentials (static test user; change before exposing the dashboard):"
echo "Email: user@example.com"
echo "Password: 12341234"
Kubeflow Configuration:
# kubeflow-configuration.yaml
---
# GPU Node Pool Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-config
  namespace: kubeflow
data:
  gpu-allocation.yaml: |
    # NVIDIA GPU node labels
    nodeSelector:
      cloud.google.com/gke-accelerator: nvidia-tesla-a100
      # or
      node.kubernetes.io/instance-type: p3.8xlarge
    # GPU resource limits
    resources:
      limits:
        nvidia.com/gpu: 1
    # NVIDIA device plugin daemonset
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
---
# Notebook Server Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: notebook-images
  namespace: kubeflow
data:
  spawner_ui_config.yaml: |
    spawnerFormDefaults:
      image:
        value: gcr.io/kubeflow-images-public/tensorflow-2.11.0-notebook-cpu:1.8.0
        options:
          - gcr.io/kubeflow-images-public/tensorflow-2.11.0-notebook-cpu:1.8.0
          - gcr.io/kubeflow-images-public/tensorflow-2.11.0-notebook-gpu:1.8.0
          - gcr.io/kubeflow-images-public/pytorch-1.13.0-notebook-cpu:1.8.0
          - gcr.io/kubeflow-images-public/pytorch-1.13.0-notebook-gpu:1.8.0
          - company/custom-ml-notebook:latest
      cpu:
        value: '2.0'
        readOnly: false
      memory:
        value: 4.0Gi
        readOnly: false
      workspaceVolume:
        value:
          mount: /home/jovyan
          newPvc:
            metadata:
              name: '{notebook-name}-workspace'
            spec:
              resources:
                requests:
                  storage: 10Gi
              accessModes:
                - ReadWriteOnce
      dataVolumes:
        value: []
        readOnly: false
      gpus:
        value:
          num: '0'
          vendors:
            - limitsKey: nvidia.com/gpu
              uiName: NVIDIA
          vendor: ''
---
# Pipeline Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: pipeline-install-config
  namespace: kubeflow
data:
  config.yaml: |
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: pipeline-install-config
    data:
      # MinIO configuration
      bucketName: mlpipeline
      minioServiceRegion: us-west-1
      minioServicePort: "9000"
      # MySQL configuration
      mysqlDatabase: mlpipeline
      mysqlPort: "3306"
      # Cache configuration
      cacheEnabled: "true"
      cacheImage: gcr.io/google-containers/busybox
---
# Resource Quotas per Namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-team-quota
  namespace: kubeflow-ml-team
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 500Gi
    requests.nvidia.com/gpu: "10"
    persistentvolumeclaims: "50"
    services.loadbalancers: "5"
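---
# Network Policies
# Namespaces are matched on kubernetes.io/metadata.name, which Kubernetes sets
# automatically; a plain "name" label only matches if you add it yourself.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: kubeflow-network-policy
  namespace: kubeflow
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kubeflow
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kubeflow
    # Allow DNS lookups to any namespace
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 53
        - protocol: UDP
          port: 53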
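Before submitting a large job it can help to check its total requests against the namespace quota. The sketch below hard-codes the limits from the ml-team-quota ResourceQuota and sums the requests of the four-replica PyTorchJob shown later in this guide, as if that job were submitted to the quota'd namespace. `check_quota` and the dictionary layout are hypothetical, chosen only for illustration.

```python
# Hard limits copied from the ml-team-quota ResourceQuota (memory in Gi).
QUOTA = {"cpu": 100, "memory_gi": 500, "gpu": 10}


def check_quota(workloads, quota=QUOTA):
    """Sum per-replica requests and report every resource that exceeds quota."""
    totals = {key: 0 for key in quota}
    for workload in workloads:
        for key in quota:
            totals[key] += workload[key] * workload["replicas"]
    # Only resources above their limit appear in the overage map.
    overage = {k: totals[k] - quota[k] for k in quota if totals[k] > quota[k]}
    return totals, overage


# 1 master + 3 workers, each requesting 8 CPU, 32Gi memory and 4 GPUs.
pytorch_job = {"replicas": 4, "cpu": 8, "memory_gi": 32, "gpu": 4}
totals, overage = check_quota([pytorch_job])
print(totals)
print(overage)
```

With these numbers the job fits the CPU and memory limits but asks for 16 GPUs against a quota of 10, so Kubernetes would reject some of its pods in a namespace governed by this quota.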
ML Pipeline Development
Kubeflow Pipelines SDK
Complete ML Pipeline Example:
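# ml_pipeline.py
# Written against the KFP v1 SDK (pip install "kfp<2"):
# create_component_from_func and dsl.VolumeOp are not available in the v2 SDK.
import kfp
from kfp import dsl
from kfp.components import create_component_from_func
from typing import NamedTuple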
# Define data preprocessing component
@create_component_from_func
def preprocess_data(
    input_data_path: str,
    output_data_path: str,
    test_size: float = 0.2
) -> NamedTuple('Outputs', [('num_samples', int), ('num_features', int)]):
    """Preprocess training data."""
    import pandas as pd
    from sklearn.model_selection import train_test_split
    import pickle
    # Load data
    df = pd.read_csv(input_data_path)
    # Feature engineering
    X = df.drop('target', axis=1)
    y = df['target']
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    # Save processed data
    with open(f'{output_data_path}/train.pkl', 'wb') as f:
        pickle.dump((X_train, y_train), f)
    with open(f'{output_data_path}/test.pkl', 'wb') as f:
        pickle.dump((X_test, y_test), f)
    outputs = NamedTuple('Outputs', [('num_samples', int), ('num_features', int)])
    return outputs(len(df), len(X.columns))
# Define training component
@create_component_from_func
def train_model(
    data_path: str,
    model_path: str,
    learning_rate: float = 0.001,
    epochs: int = 10,
    batch_size: int = 32
) -> NamedTuple('Outputs', [('accuracy', float), ('loss', float)]):
    """Train ML model."""
    import tensorflow as tf
    from tensorflow import keras
    import pickle
    import os
    # Load data
    with open(f'{data_path}/train.pkl', 'rb') as f:
        X_train, y_train = pickle.load(f)
    # Build model
    model = keras.Sequential([
        keras.layers.Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(64, activation='relu'),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    # Train model
    history = model.fit(
        X_train, y_train,
        epochs=epochs,
        batch_size=batch_size,
        validation_split=0.2,
        verbose=1
    )
    # Save model
    os.makedirs(model_path, exist_ok=True)
    model.save(f'{model_path}/model.h5')
    final_accuracy = history.history['val_accuracy'][-1]
    final_loss = history.history['val_loss'][-1]
    outputs = NamedTuple('Outputs', [('accuracy', float), ('loss', float)])
    return outputs(float(final_accuracy), float(final_loss))
# Define evaluation component
@create_component_from_func
def evaluate_model(
    data_path: str,
    model_path: str
) -> NamedTuple('Outputs', [('test_accuracy', float), ('test_loss', float)]):
    """Evaluate trained model."""
    import tensorflow as tf
    from tensorflow import keras
    import pickle
    # Load test data
    with open(f'{data_path}/test.pkl', 'rb') as f:
        X_test, y_test = pickle.load(f)
    # Load model
    model = keras.models.load_model(f'{model_path}/model.h5')
    # Evaluate
    test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
    outputs = NamedTuple('Outputs', [('test_accuracy', float), ('test_loss', float)])
    return outputs(float(test_accuracy), float(test_loss))
# Define deployment component
@create_component_from_func
def deploy_model(
    model_path: str,
    model_name: str,
    namespace: str = 'kubeflow'
):
    """Deploy model to KServe."""
    from kubernetes import client, config
    import json
    config.load_incluster_config()
    # Create InferenceService
    inference_service = {
        'apiVersion': 'serving.kserve.io/v1beta1',
        'kind': 'InferenceService',
        'metadata': {
            'name': model_name,
            'namespace': namespace
        },
        'spec': {
            'predictor': {
                'tensorflow': {
                    'storageUri': model_path,
                    'resources': {
                        'limits': {
                            'cpu': '1',
                            'memory': '2Gi'
                        },
                        'requests': {
                            'cpu': '500m',
                            'memory': '1Gi'
                        }
                    }
                }
            }
        }
    }
    # Apply inference service
    api = client.CustomObjectsApi()
    api.create_namespaced_custom_object(
        group='serving.kserve.io',
        version='v1beta1',
        namespace=namespace,
        plural='inferenceservices',
        body=inference_service
    )
    print(f'Model {model_name} deployed successfully')
# Define complete pipeline
@dsl.pipeline(
    name='End-to-End ML Pipeline',
    description='Complete ML pipeline with training and deployment'
)
def ml_pipeline(
    input_data_path: str = 'gs://my-bucket/data/input.csv',
    learning_rate: float = 0.001,
    epochs: int = 10,
    batch_size: int = 32,
    model_name: str = 'my-model'
):
    """Complete ML pipeline."""
    # Create volumes for data and model storage
    vop = dsl.VolumeOp(
        name='data-volume',
        resource_name='data-pvc',
        size='10Gi',
        modes=dsl.VOLUME_MODE_RWO
    )
    model_vop = dsl.VolumeOp(
        name='model-volume',
        resource_name='model-pvc',
        size='5Gi',
        modes=dsl.VOLUME_MODE_RWO
    )
    # Step 1: Preprocess data
    preprocess = preprocess_data(
        input_data_path=input_data_path,
        output_data_path='/data'
    ).add_pvolumes({'/data': vop.volume})
    # Step 2: Train model
    train = train_model(
        data_path='/data',
        model_path='/model',
        learning_rate=learning_rate,
        epochs=epochs,
        batch_size=batch_size
    ).add_pvolumes({
        '/data': preprocess.pvolume,
        '/model': model_vop.volume
    }).after(preprocess)
    # Step 3: Evaluate model
    evaluate = evaluate_model(
        data_path='/data',
        model_path='/model'
    ).add_pvolumes({
        '/data': preprocess.pvolume,
        '/model': train.pvolume
    }).after(train)
    # Step 4: Deploy model (conditional on accuracy threshold)
    # Note: no step here uploads the trained model to this bucket path; an
    # export/upload step is needed before the InferenceService can load it.
    with dsl.Condition(evaluate.outputs['test_accuracy'] > 0.85):
        deploy = deploy_model(
            # Format the pipeline parameter into the string; str + parameter is unsafe.
            model_path=f'gs://my-bucket/models/{model_name}',
            model_name=model_name
        ).after(evaluate)
# Compile and submit pipeline
if __name__ == '__main__':
    import kfp.compiler as compiler
    # Compile pipeline
    compiler.Compiler().compile(
        ml_pipeline,
        'ml_pipeline.yaml'
    )
    # Submit to Kubeflow
    client = kfp.Client(host='http://localhost:8080')
    experiment = client.create_experiment(name='ml-experiments')
    run = client.run_pipeline(
        experiment.id,
        'ml-pipeline-run',
        'ml_pipeline.yaml',
        params={
            'input_data_path': 'gs://my-bucket/data/input.csv',
            'learning_rate': 0.001,
            'epochs': 20,
            'batch_size': 64,
            'model_name': 'production-model-v1'
        }
    )
    print(f'Pipeline run created: {run.id}')
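The deployment step above mixes manifest construction with a cluster call, which makes it hard to test outside a cluster. One possible refactoring, sketched here with hypothetical helper names (`build_inference_service`, `should_deploy`), separates the pure parts so they can be unit-tested without Kubernetes access. The manifest fields and the 0.85 threshold mirror the pipeline above; nothing else is implied about the KServe API.

```python
def build_inference_service(model_name, storage_uri, namespace="kubeflow"):
    """Return the InferenceService manifest used by deploy_model as a plain dict."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": model_name, "namespace": namespace},
        "spec": {
            "predictor": {
                "tensorflow": {
                    "storageUri": storage_uri,
                    "resources": {
                        "limits": {"cpu": "1", "memory": "2Gi"},
                        "requests": {"cpu": "500m", "memory": "1Gi"},
                    },
                }
            }
        },
    }


def should_deploy(test_accuracy, threshold=0.85):
    """Same gate as the dsl.Condition in the pipeline: strictly above threshold."""
    return test_accuracy > threshold


manifest = build_inference_service(
    "production-model-v1", "gs://my-bucket/models/production-model-v1"
)
print(manifest["metadata"], should_deploy(0.91), should_deploy(0.85))
```

Inside the component, the returned dict could then be passed as the body of the custom-object create call, leaving only that call dependent on the cluster.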
Distributed Training with Kubeflow
PyTorch Distributed Training
PyTorchJob Configuration:
# pytorch-distributed-training.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-distributed-training
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
              imagePullPolicy: IfNotPresent
              command:
                - python
                - -m
                - torch.distributed.launch
                - --nproc_per_node=4
                - --nnodes=4
                - --node_rank=$(RANK)
                - --master_addr=$(MASTER_ADDR)
                - --master_port=$(MASTER_PORT)
                - train.py
                - --epochs=100
                - --batch-size=256
                - --lr=0.01
              env:
                - name: NCCL_DEBUG
                  value: INFO
              resources:
                limits:
                  nvidia.com/gpu: 4
                  memory: 64Gi
                  cpu: 16
                requests:
                  nvidia.com/gpu: 4
                  memory: 32Gi
                  cpu: 8
              volumeMounts:
                - name: training-data
                  mountPath: /data
                - name: model-output
                  mountPath: /output
          volumes:
            - name: training-data
              persistentVolumeClaim:
                claimName: training-data-pvc
            - name: model-output
              persistentVolumeClaim:
                claimName: model-output-pvc
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
              command:
                - python
                - -m
                - torch.distributed.launch
                - --nproc_per_node=4
                - --nnodes=4
                - --node_rank=$(RANK)
                - --master_addr=$(MASTER_ADDR)
                - --master_port=$(MASTER_PORT)
                - train.py
                - --epochs=100
                - --batch-size=256
                - --lr=0.01
              resources:
                limits:
                  nvidia.com/gpu: 4
                  memory: 64Gi
                  cpu: 16
                requests:
                  nvidia.com/gpu: 4
                  memory: 32Gi
                  cpu: 8
              volumeMounts:
                - name: training-data
                  mountPath: /data
                - name: model-output
                  mountPath: /output
          volumes:
            - name: training-data
              persistentVolumeClaim:
                claimName: training-data-pvc
            - name: model-output
              persistentVolumeClaim:
                claimName: model-output-pvc
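The launcher flags in this job (--nnodes=4, --nproc_per_node=4) determine how many training processes exist and which global rank each one receives. A small sketch of that arithmetic, assuming the usual launcher convention that global rank equals node rank times processes per node plus local rank; `rank_layout` is a hypothetical helper used only to illustrate the numbers:

```python
def rank_layout(nnodes, nproc_per_node):
    """Return the world size and the global ranks hosted on each node."""
    world_size = nnodes * nproc_per_node
    layout = {
        node_rank: [
            node_rank * nproc_per_node + local_rank
            for local_rank in range(nproc_per_node)
        ]
        for node_rank in range(nnodes)
    }
    return world_size, layout


# 1 master + 3 workers, each running 4 processes (one per GPU).
world_size, layout = rank_layout(nnodes=4, nproc_per_node=4)
print(world_size)
for node_rank, ranks in layout.items():
    print(node_rank, ranks)
```

For this job the world size is 16, with node 0 (the master) hosting global ranks 0 to 3 and the last worker hosting ranks 12 to 15. Whether --batch-size=256 is per process or global depends on train.py, which this guide does not show.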
Model Serving with KServe
Production Model Deployment:
# kserve-inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris-model
  namespace: kubeflow
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 10
    scaleTarget: 80
    scaleMetric: concurrency
    sklearn:
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
        limits:
          cpu: 1
          memory: 2Gi
    # Canary deployment
    canaryTrafficPercent: 10
  transformer:
    minReplicas: 1
    containers:
      - name: transformer
        image: company/feature-transformer:v1.0
        env:
          - name: PROTOCOL
            value: v2
  explainer:
    minReplicas: 1
    containers:
      - name: explainer
        image: kserve/alibi-explainer:latest
        args:
          - --model_name=sklearn-iris
          - --predictor_host=sklearn-iris-predictor
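With canaryTrafficPercent set to 10, roughly one request in ten is routed to the newest revision and the rest to the previously rolled-out one. The toy simulation below illustrates what that split looks like over many requests. It is not how the serving layer implements routing; `simulate_canary` is a hypothetical function and the seed is fixed only to make the run repeatable.

```python
import random
from collections import Counter


def simulate_canary(num_requests, canary_percent, seed=0):
    """Assign each request to 'canary' with probability canary_percent / 100."""
    rng = random.Random(seed)
    counts = Counter({"stable": 0, "canary": 0})
    for _ in range(num_requests):
        # rng.random() is in [0, 1), so 0 percent never selects the canary
        # and 100 percent always does.
        target = "canary" if rng.random() * 100 < canary_percent else "stable"
        counts[target] += 1
    return counts


counts = simulate_canary(10_000, canary_percent=10)
canary_share = counts["canary"] / 10_000
print(counts, f"{canary_share:.1%}")
```

Because the split is probabilistic, a small sample can deviate noticeably from 10 percent, which is worth keeping in mind when comparing canary and stable error rates over short windows.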
Conclusion
Kubeflow provides enterprises with:
- End-to-End ML Platform: From experimentation to production
- Distributed Training: Scale training across multiple GPUs and nodes
- Pipeline Orchestration: Reproducible ML workflows
- Model Serving: Production-grade inference at scale
- Hyperparameter Tuning: Automated model optimization
- Multi-Tenancy: Isolated workspaces for teams
By implementing Kubeflow with the patterns in this guide, organizations can build scalable ML platforms that accelerate model development and deployment.
For more information on Kubeflow and MLOps, visit support.tools.