Efficient Transparent Checkpointing of AI/ML Workloads in Kubernetes
Location
Teaching room (277), Oxford e-Research Centre, 7 Keble Road, Oxford
Date & Time
Wednesday 02 Apr 2025, 13:00 - 13:45
The Oxford e-Research Centre welcomes you to a seminar entitled "Efficient Transparent Checkpointing of AI/ML Workloads in Kubernetes".
Abstract:
As long-running AI/ML workloads become more common in cloud-native environments, the need for efficient checkpointing mechanisms to provide fault tolerance becomes increasingly important. However, current state-of-the-art techniques for transparent GPU checkpointing rely on intercepting and logging device API calls (e.g., CUDA runtime) as well as capturing input data and object handles (e.g., events, streams). This approach inevitably introduces steady-state overhead and requires replaying the entire recorded execution, potentially with nondeterministic operations, to recover from failures. This talk will cover how the Kubernetes container checkpointing functionality has been extended with recently introduced CRIU plugins to enable transparent checkpoint/restore of GPU computations without the overhead of API interception, logging, or re-execution. This talk will also discuss how these mechanisms can be utilized to improve resource utilization in large-scale GPU clusters.
Speakers: Viktória Spišaková (Masaryk University) and Radostin Stoyanov (University of Oxford)
Location: Oxford e-Research Centre Teaching room (277), 7 Keble Road, Oxford, OX1 3QG
Date & Time: Wednesday, 2 April 2025, 13:00 - 13:45
Viktória and Radostin will present at KubeCon + CloudNativeCon Europe 2025 in April. Find out more: https://sched.co/1tx7i