Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning
This paper was accepted at EuroSys '20 and is a follow-up to Gandiva (OSDI '18).
Background
Previous schedulers such as Tiresias (NSDI '19), Optimus (EuroSys '18), and Gandiva (OSDI '18) do not consider inter-user fairness. In heterogeneous clusters, which contain several generations of GPUs, guaranteeing each user's fair share becomes even harder. Deep learning training (DLT) jobs are gang-scheduled and long-running, which makes traditional schedulers like YARN a poor fit. This paper introduces Gandiva_fair, which achieves both fairness and efficiency through a split stride gang scheduler, a migration-based load balancer, and an automatic resource-trading mechanism.
Problem definition
Inputs:
- A GPU cluster with heterogeneous GPUs
- Different DLT jobs submitted by users
- Tickets assigned to each user (their fair-share weight)
Outputs:
- Allocation for each job
Constraints:
- Satisfy inter-user fairness: each user should receive at least their ticket-weighted share of the cluster's GPU time, with local per-server fairness composing into this cluster-wide guarantee
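To make the setting concrete, here is a minimal sketch of these inputs and outputs as Python data structures; all names (User, Job, Server, Allocation) are my own illustration, not types from the paper.

```python
# Hypothetical data structures for the scheduling problem (not the paper's code).
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    tickets: int  # fair-share weight across the whole cluster

@dataclass
class Job:
    job_id: str
    user: User
    num_gpus: int  # gang size: all GPUs must be granted together

@dataclass
class Server:
    server_id: str
    gpu_model: str  # e.g. "K80" or "V100" in a heterogeneous cluster
    num_gpus: int

@dataclass
class Allocation:
    # job_id -> (server_id, GPU indices); a job absent here is queued
    placements: dict[str, tuple[str, list[int]]] = field(default_factory=dict)
```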
Methodology
1. Split stride gang-scheduling: each server runs a local stride scheduler that time-slices its jobs in proportion to their tickets, extended to be gang-aware so that all GPUs of a multi-GPU job are scheduled together (first sketch after this list).
2. Load balancing: stride scheduling is only fair within one server, so a central load balancer migrates jobs between servers to even out the ticket load per GPU, letting per-server fairness compose into cluster-wide fairness (second sketch after this list).
3. Scheduler efficiency: time-slicing and migration are kept cheap so that the fairness machinery does not noticeably reduce cluster throughput.
4. Handling GPU heterogeneity transparently:
- Varying marginal utility: different models gain very different speedups from faster GPUs. Gandiva_fair profiles jobs transparently and trades resources automatically: a user whose jobs benefit little from fast GPUs hands them to a user whose jobs benefit a lot, in exchange for a larger number of slow GPUs, and at the right exchange rate both users end up better off (third sketch after this list).
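First sketch: a gang-aware stride scheduler for one server. This is a minimal illustration under my own simplifications (fixed-length rounds, a job charged one stride per round it runs); the class name and API are hypothetical, not the paper's code.

```python
import heapq

STRIDE1 = 1 << 20  # large constant, as in classic stride scheduling

class StrideGangScheduler:
    """Time-slices gangs on one server in proportion to their tickets."""

    def __init__(self, server_gpus: int):
        self.server_gpus = server_gpus
        self.heap = []  # entries: (pass_value, job_id, gang_size, stride)

    def add_job(self, job_id: str, gang_size: int, tickets: int):
        stride = STRIDE1 // tickets
        # Simplification: new jobs start at pass = stride; a real scheduler
        # would start at the current minimum pass to avoid unfair catch-up.
        heapq.heappush(self.heap, (stride, job_id, gang_size, stride))

    def next_round(self) -> list[str]:
        """Pick the lowest-pass jobs whose whole gangs fit this round."""
        chosen, skipped, free = [], [], self.server_gpus
        while self.heap and free > 0:
            pass_val, job_id, gang, stride = heapq.heappop(self.heap)
            if gang <= free:  # gang scheduling is all-or-nothing
                free -= gang
                # Charge the job for the quantum it is about to receive.
                chosen.append((pass_val + stride, job_id, gang, stride))
            else:
                skipped.append((pass_val, job_id, gang, stride))
        for entry in chosen + skipped:  # re-insert everything
            heapq.heappush(self.heap, entry)
        return [job_id for _, job_id, _, _ in chosen]
```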
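Second sketch: migration-based load balancing. The load metric (tickets per GPU) matches the paper's goal of equalizing ticket load across servers, but the selection heuristic and the migrate() hook are my assumptions; for brevity, jobs here carry their tickets directly.

```python
# Hypothetical rebalancer: move a job from the most- to the least-loaded server.
def rebalance(servers, migrate):
    """servers: objects with .jobs (each job has .tickets and .num_gpus)
    and .num_gpus; migrate(job, src, dst) performs the actual migration."""
    def load(s):
        return sum(j.tickets for j in s.jobs) / s.num_gpus

    src = max(servers, key=load)
    dst = min(servers, key=load)
    for job in sorted(src.jobs, key=lambda j: j.tickets):
        # Servers are oversubscribed and time-sliced, so the gang only has
        # to fit the destination physically, not into currently idle GPUs.
        fits = job.num_gpus <= dst.num_gpus
        # Only migrate if the move narrows the imbalance without flipping it.
        narrows = (load(src) - job.tickets / src.num_gpus
                   >= load(dst) + job.tickets / dst.num_gpus)
        if fits and narrows:
            migrate(job, src, dst)  # e.g. checkpoint on src, resume on dst
            src.jobs.remove(job)
            dst.jobs.append(job)
            break
```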
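Third sketch: automatic resource trading driven by varying marginal utility. The speedup values would come from the paper's transparent profiling; the function, the midpoint exchange rate, and the example numbers are all illustrative assumptions.

```python
# Hypothetical trade proposal between two jobs over fast vs. slow GPUs.
def propose_trade(job_low: str, job_high: str, speedup: dict[str, float]):
    """speedup[j] = job j's throughput on a fast GPU / on a slow GPU.
    job_low holds a fast GPU but benefits little; job_high would benefit a lot."""
    s_low, s_high = speedup[job_low], speedup[job_high]
    if s_high <= s_low:
        return None  # no mutually beneficial trade exists
    # Any rate k with s_low < k < s_high helps both sides: job_low's user
    # receives k slow GPUs worth more to them than the fast GPU, while
    # job_high's user pays fewer slow GPUs than the fast GPU is worth to them.
    k = (s_low + s_high) / 2
    return {"fast_gpu_from": job_low, "slow_gpus_from": job_high, "rate": k}

# Illustrative numbers only: a model with 1.5x fast-GPU speedup trades its
# fast GPU to a model with 5x speedup at ~3.25 slow GPUs per fast GPU.
trade = propose_trade("low_speedup_job", "high_speedup_job",
                      {"low_speedup_job": 1.5, "high_speedup_job": 5.0})
```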
Advantages
Disadvantages
Conclusion
My reviews and thoughts
There is no doubt that AI systems are a hot topic. For DLT cluster scheduling, some fairness objective always has to be set explicitly, and the requirements may be completely different across companies.