Run a PyTorchJob
This page shows how to leverage Kueue’s scheduling and resource management capabilities when running Training Operator PyTorchJobs.
This guide is for batch users who have a basic understanding of Kueue. For more information, see Kueue’s overview.
Before you begin
Check "Administer cluster quotas" for details on the initial cluster setup.
Check the Training Operator installation guide.
Note that the minimum required training-operator version is v1.7.0.
You can modify the Kueue configuration of an installed release to include PyTorchJobs as an allowed workload.
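As a rough sketch of that last point, the Kueue controller configuration lists the enabled job integrations under `integrations.frameworks`. The fragment below assumes the `config.kueue.x-k8s.io/v1beta1` configuration API and shows only the relevant fields; the API version and the other entries in the list may differ in your installed release, so treat it as illustrative rather than a complete configuration.

```yaml
# Fragment of the Kueue controller configuration (other fields omitted).
# Assumption: the installed release uses the v1beta1 configuration API.
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
integrations:
  frameworks:
  - "batch/job"
  # Enables Kueue management of Training Operator PyTorchJobs.
  - "kubeflow.org/pytorchjob"
```

The Kueue controller typically needs to be restarted after its configuration changes for the new integration to take effect.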
PyTorchJob definition
a. Queue selection
The target local queue should be specified in the metadata.labels section of the PyTorchJob configuration.
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: user-queue
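The queue named in this label has to exist as a LocalQueue in the same namespace as the PyTorchJob. A minimal sketch of such a queue is shown here, assuming the `kueue.x-k8s.io/v1beta1` API, the `default` namespace, and a ClusterQueue called `cluster-queue` (a hypothetical name; use whatever your administrator created during cluster setup).

```yaml
# Hypothetical LocalQueue matching the queue-name label above.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default        # must match the PyTorchJob's namespace
  name: user-queue          # referenced by kueue.x-k8s.io/queue-name
spec:
  clusterQueue: cluster-queue   # assumed ClusterQueue name
```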
b. Optionally set the suspend field in PyTorchJobs
    spec:
      runPolicy:
        suspend: true
By default, Kueue sets suspend to true via a webhook and unsuspends the PyTorchJob when it is admitted.
Sample PyTorchJob
This example is based on https://github.com/kubeflow/training-operator/blob/855e0960668b34992ba4e1fd5914a08a3362cfb1/examples/pytorch/simple.yaml.
    apiVersion: kubeflow.org/v1
    kind: PyTorchJob
    metadata:
      name: pytorch-simple
      namespace: default
      labels:
        kueue.x-k8s.io/queue-name: user-queue
    spec:
      pytorchReplicaSpecs:
        Master:
          replicas: 1
          restartPolicy: OnFailure
          template:
            spec:
              containers:
                - name: pytorch
                  image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
                  imagePullPolicy: Always
                  command:
                    - "python3"
                    - "/opt/pytorch-mnist/mnist.py"
                    - "--epochs=1"
                  resources:
                    requests:
                      cpu: 1
                      memory: "200Mi"
        Worker:
          replicas: 1
          restartPolicy: OnFailure
          template:
            spec:
              containers:
                - name: pytorch
                  image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
                  imagePullPolicy: Always
                  command:
                    - "python3"
                    - "/opt/pytorch-mnist/mnist.py"
                    - "--epochs=1"
                  resources:
                    requests:
                      cpu: 1
                      memory: "200Mi"