Training Runtime Images
Curated TrainingRuntime images for Kubeflow Trainer v2. Each image bundles a specific PyTorch + accelerator stack so users can submit TrainJobs without rebuilding.
TOC
- Available runtimes
- Picking a runtime
- Apply a TrainingRuntime
- Submit a TrainJob
- Device resource requests
  - NVIDIA GPU
  - Huawei Ascend NPU
- Image caveats
- Build your own
Available runtimes
CUDA images are amd64-only; CANN images are arm64-only.
Picking a runtime
- torchrun on GPU → torch2.6-cu126-amd64
- torchrun on NPU → torch2.6-cann8.5-arm64 (set runtimeClassName: ascend)
- LLM SFT / LoRA with LLaMA-Factory → llamafactory0.9-cu126-amd64 (GPU) or llamafactory0.9-cann8.5-arm64 (NPU)
- TRL / PEFT SFT / OSFT / DPO → traininghub0.1-cu126-amd64
- Megatron-style training on Ascend → mindspeed-llm-cann8.5-arm64
Apply a TrainingRuntime
Ready-to-apply YAMLs live in assets/training-runtimes/. Each YAML pins :v0.1.0; override the tag to track a different release. The YAMLs default to a Kubeflow Profile namespace — change metadata.namespace to the namespace where you submit jobs.
Submit a TrainJob
A shared smoke template applies to any runtime — set spec.runtimeRef.name to the runtime you want to exercise:
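The template itself is not reproduced here, so the following is only a minimal sketch of what such a TrainJob might look like. It assumes the Kubeflow Trainer v2 `trainer.kubeflow.org/v1alpha1` API and assumes the runtime object is named after its image; the job name, namespace, and smoke command are hypothetical placeholders.

```yaml
# Sketch only: not the repository's shared smoke template.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: smoke-torch-gpu          # hypothetical job name
  namespace: my-namespace        # placeholder: the namespace holding the TrainingRuntime
spec:
  runtimeRef:
    apiGroup: trainer.kubeflow.org
    kind: TrainingRuntime        # namespaced runtime; omit only for a ClusterTrainingRuntime
    name: torch2.6-cu126-amd64   # assumed to match the applied runtime's metadata.name
  trainer:
    numNodes: 1
    command:                     # hypothetical smoke check, replaces the runtime's command
      - python
      - -c
      - "import torch; print(torch.__version__, torch.cuda.is_available())"
```

Switching runtimes should then only require changing `spec.runtimeRef.name`, plus the device resource request that matches the target accelerator.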
Device resource requests
NVIDIA GPU
Whole-device request:
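As an illustration, a whole-GPU request might look like the fragment below. It assumes the request is placed in the TrainJob under `spec.trainer.resourcesPerNode` and that the NVIDIA device plugin advertises the usual `nvidia.com/gpu` resource.

```yaml
# Fragment of a TrainJob spec; placement under resourcesPerNode is an assumption.
spec:
  trainer:
    resourcesPerNode:
      limits:
        nvidia.com/gpu: 1   # one whole GPU per node
```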
HAMI vGPU slice:
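A slice request could be sketched as follows. It assumes HAMI's default resource names (`nvidia.com/gpumem`, `nvidia.com/gpucores`), which a cluster may have reconfigured; the memory and core values are illustrative only.

```yaml
# Fragment of a TrainJob spec; resource names assume HAMI defaults, values are illustrative.
spec:
  trainer:
    resourcesPerNode:
      limits:
        nvidia.com/gpu: 1         # number of vGPUs
        nvidia.com/gpumem: 8192   # device memory per vGPU, in MiB
        nvidia.com/gpucores: 30   # share of GPU compute, in percent
```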
Huawei Ascend NPU
Always set runtimeClassName: ascend so the host driver libs and DCMI sockets are injected.
Standard Huawei device-plugin:
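A request of this kind might look like the fragment below. The resource name `huawei.com/Ascend910` is an assumption about what the device plugin advertises on your nodes; check the node's allocatable resources and substitute the name you see there.

```yaml
# Fragment of a TrainJob spec; the resource name is an assumption, verify it on the node.
# runtimeClassName: ascend must still be set on the pod spec, as noted above.
spec:
  trainer:
    resourcesPerNode:
      limits:
        huawei.com/Ascend910: 1   # one whole NPU
```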
HAMI vNPU (each 910B4 slices into 20 cores / 32 GiB):
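A vNPU slice request could be sketched as follows. `huawei.com/Ascend910B4` is the resource named in this section; the companion `huawei.com/Ascend910B4-memory` resource and its MiB unit are assumptions based on HAMI's usual naming, and the value is illustrative.

```yaml
# Fragment of a TrainJob spec; the -memory resource name and its unit are assumptions.
spec:
  trainer:
    resourcesPerNode:
      limits:
        huawei.com/Ascend910B4: 1              # one vNPU
        huawei.com/Ascend910B4-memory: 16384   # about half of the 32 GiB device, in MiB
```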
With HAMI, the node's allocatable value for huawei.com/Ascend910B4 reads 0 because HAMI allocates through its scheduler extender. If pods stay Pending with the message "hami-scheduler: 1 node unregistered", confirm that the host driver is loaded (/sys/bus/pci/drivers/davinci exists and npu-smi info reports healthy) and that the node is labeled ascend=on.
Image caveats
- All CANN images: runtimeClassName: ascend bind-mounts the host's /usr/local/Ascend but does not export the CANN env vars (LD_LIBRARY_PATH, ASCEND_HOME_PATH, …). Every entrypoint that imports torch_npu must first source /usr/local/Ascend/ascend-toolkit/set_env.sh (and optionally /usr/local/Ascend/nnal/atb/set_env.sh); otherwise the import fails with libhccl.so: cannot open shared object file. The published runtime YAMLs already do this, so keep the source lines in any derived runtime.
- traininghub0.1-cu126-amd64: ships the CUDA runtime but not the toolkit. DeepSpeed JIT op compilation needs nvcc; mount or install nvidia-cuda-toolkit and set CUDA_HOME if you use those ops.
- mindspeed-llm-cann8.5-arm64: megatron.core needs pkg_resources, so install setuptools<81 in the job entrypoint. import mindspeed_llm currently fails on the mindspeed_llm master / core_r0.8.0 mismatch; the underlying torch + torch_npu + megatron.core + mindspeed stack trains correctly without the adapter shim.
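For the CANN caveat, a derived runtime's container entry might keep the source step along these lines. This is a sketch, not a copy of the published YAMLs: the container name, the registry path for the CANN image, and train.py are hypothetical, and it assumes bash is available in the image.

```yaml
# Hypothetical container fragment for a derived CANN runtime.
containers:
  - name: node                                                      # placeholder name
    image: docker.io/alaudadockerhub/torch2.6-cann8.5-arm64:v0.1.0  # assumed registry path
    command:
      - bash
      - -c
      - |
        # Export LD_LIBRARY_PATH, ASCEND_HOME_PATH, etc. before torch_npu is imported.
        source /usr/local/Ascend/ascend-toolkit/set_env.sh
        # Optional second script mentioned in the caveat:
        # source /usr/local/Ascend/nnal/atb/set_env.sh
        exec torchrun train.py                                      # train.py is a placeholder
```

Sourcing inside the same shell that launches the trainer matters here, because the exported variables only exist for that shell and its children.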
Build your own
The Containerfiles, multi-arch buildkitd helper, e2e harness, and post-fix scan evidence are in kubeflow-plugin/training-runtimes. Each framework image is a thin layer on torch2.6-cu126-amd64 or torch2.6-cann8.5-arm64, so deriving a new runtime is mostly FROM docker.io/alaudadockerhub/torch2.6-cu126-amd64:v0.1.0 plus framework installs.