Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning
This paper was accepted at EuroSys 20 and is a follow-up to Gandiva (OSDI 18).
Previous schedulers such as Tiresias (NSDI 19), Optimus (EuroSys 18), and Gandiva (OSDI 18) do not consider user fairness. In heterogeneous clusters, which contain different types of GPUs, guaranteeing a fair share across users is even more difficult. Deep learning training (DLT) jobs are gang-scheduled and long-running, which makes general-purpose schedulers like YARN a poor fit. This paper introduces Gandiva_fair, which achieves fairness and efficiency by using a split scheduler, a load balancer with migration, and an automatic resource trading mechanism.
- A GPU cluster with heterogeneous GPUs
- Different DLT jobs submitted by users
- Allocation for each job
- Satisfy inter-user fairness (this implies fairness inside each server)
1. Split Stride Gang-scheduling
2. Load balancing
3. Scheduler Efficiency
4. Handling GPU heterogeneity transparently
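To make item 1 more concrete, here is a minimal sketch of stride scheduling extended to gangs. In classic stride scheduling each client holds tickets, its stride is inversely proportional to its tickets, and the scheduler always runs the client with the smallest pass value. The sketch assumes a single server, and the names `Job`, `schedule_quantum`, and `STRIDE1` are hypothetical. Charging a job `stride * gpus` per quantum and packing greedily in pass order are assumptions of this sketch, not necessarily the paper's exact mechanism.

```python
from dataclasses import dataclass

STRIDE1 = 1_000_000  # large constant used to derive integer strides


@dataclass
class Job:
    name: str
    tickets: int         # share of GPU time this job is entitled to
    gpus: int            # gang size: all of these GPUs are needed at once
    pass_value: int = 0  # virtual time consumed so far
    quanta_run: int = 0  # number of quanta in which the job was scheduled

    @property
    def stride(self) -> int:
        # More tickets means a smaller stride, so the job is picked more often.
        return STRIDE1 // self.tickets


def schedule_quantum(jobs, num_gpus):
    """Pick the jobs that run in the next time quantum on one server."""
    free = num_gpus
    chosen = []
    # Visit jobs in order of increasing pass; ties are broken by name.
    for job in sorted(jobs, key=lambda j: (j.pass_value, j.name)):
        if job.gpus <= free:  # the whole gang must fit, or the job waits
            chosen.append(job)
            free -= job.gpus
    for job in chosen:
        # Assumption: a gang of g GPUs consumes g GPU-quanta per quantum.
        job.pass_value += job.stride * job.gpus
        job.quanta_run += 1
    return chosen


if __name__ == "__main__":
    a = Job("A", tickets=3, gpus=1)
    b = Job("B", tickets=1, gpus=1)
    for _ in range(40):
        schedule_quantum([a, b], num_gpus=1)
    print(a.quanta_run, b.quanta_run)  # prints: 30 10
```

With 3 tickets against 1, job A receives three quarters of the quanta on the single GPU. Because the job with the smallest pass is always considered first, a large gang cannot be starved by smaller jobs that would otherwise keep filling the server.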
Different jobs have varying marginal utility for faster GPUs.
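The idea behind trading on marginal utility can be illustrated with a small sketch. It assumes a linear throughput model with two GPU types, where `speedup` is a job's throughput on one fast GPU relative to one slow GPU. The `User` class, the `trade_one_fast_gpu` function, and the midpoint price are all hypothetical choices for illustration; the paper's actual pricing rule may differ.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    speedup: float  # throughput on one fast GPU / throughput on one slow GPU
    fast: float     # fast GPUs held (fractions stand for time-shared GPUs)
    slow: float     # slow GPUs held

    def throughput(self) -> float:
        # Aggregate throughput, measured in slow-GPU equivalents.
        return self.slow + self.speedup * self.fast


def trade_one_fast_gpu(seller: User, buyer: User):
    """Move one fast GPU from seller to buyer in exchange for slow GPUs.

    Returns the price in slow GPUs, or None when no beneficial trade exists.
    """
    # A trade helps both sides only if the buyer values fast GPUs more.
    if seller.fast < 1 or seller.speedup >= buyer.speedup:
        return None
    # Assumption: split the gain evenly by pricing at the midpoint.
    price = (seller.speedup + buyer.speedup) / 2
    if buyer.slow < price:
        return None  # the buyer cannot afford the trade
    seller.fast -= 1
    seller.slow += price
    buyer.fast += 1
    buyer.slow -= price
    return price


if __name__ == "__main__":
    a = User("A", speedup=1.25, fast=2, slow=4)
    b = User("B", speedup=5.0, fast=2, slow=4)
    print(a.throughput(), b.throughput())  # prints: 6.5 14.0
    print(trade_one_fast_gpu(seller=a, buyer=b))  # prints: 3.125
    print(a.throughput(), b.throughput())  # prints: 8.375 15.875
```

Any price strictly between the two speedups leaves both users better off: the seller receives more slow GPUs than the fast GPU was worth to it, and the buyer pays fewer slow GPUs than the fast GPU is worth to it.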
My review and thoughts
There is no doubt that AI systems are popular. DLT cluster scheduling always requires choosing a fairness objective, and the requirements may differ completely from one company to another.