Efficient Transparent Checkpointing of AI/ML Workloads in Kubernetes

Location

Teaching room (277), Oxford e-Research Centre, 7 Keble Road, Oxford

Date & Time

Wednesday 02 Apr 2025 13:00 - Wednesday 02 Apr 2025 13:45

Availability

Open to all. Tea, coffee and biscuits provided. You are welcome to bring your own lunch.

Oxford e-Research Centre welcomes you to a seminar entitled "Efficient Transparent Checkpointing of AI/ML Workloads in Kubernetes".

Abstract:

As long-running AI/ML workloads become more common in cloud-native environments, the need for efficient checkpointing mechanisms to provide fault tolerance becomes increasingly important. However, current state-of-the-art techniques for transparent GPU checkpointing rely on intercepting and logging device API calls (e.g., CUDA runtime) as well as capturing input data and object handles (e.g., events, streams). This approach inevitably introduces steady-state overhead and requires replaying the entire recorded execution, potentially with nondeterministic operations, to recover from failures. This talk will cover how the Kubernetes container checkpointing functionality has been extended with recently introduced CRIU plugins to enable transparent checkpoint/restore of GPU computations without the overhead of API interception, logging, or re-execution. This talk will also discuss how these mechanisms can be utilized to improve resource utilization in large-scale GPU clusters.

Speakers: Viktória Spišaková (Masaryk University) and Radostin Stoyanov (University of Oxford)

Location: Oxford e-Research Centre Teaching room (277), 7 Keble Road, Oxford, OX1 3QG

Date & Time: Wednesday, 2 April 2025 at 13:00

Viktória and Radostin will present at KubeCon + CloudNativeCon Europe 2025 in April. Find out more: https://sched.co/1tx7i