Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning


This paper was accepted at EuroSys '20. It is also a follow-up to Gandiva (OSDI '18).

Background

Previous schedulers such as Tiresias (NSDI '19), Optimus (EuroSys '18), and Gandiva (OSDI '18) do not consider user fairness. In heterogeneous clusters, which include different types of GPUs, it is even more difficult to guarantee a fair share across users. Deep learning training (DLT) jobs are gang-scheduled and long-running, which makes schedulers like YARN unsuitable. This paper introduces Gandiva_fair, which achieves fairness and efficiency by using a split scheduler, a load balancer with migration, and an automatic resource trading mechanism.

Problem definition

Inputs:

  • A GPU cluster with heterogeneous GPUs
  • Different DLT jobs submitted by users
  • Tickets

Outputs:

  • Allocation for each job

Constraints:

  • Satisfy inter-user fairness (which implies fairness within each server)
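
To make the role of tickets concrete, here is a minimal sketch of a ticket-proportional fair share. It assumes a homogeneous pool and ignores per-user demand, so it is only an illustration of the input/output relationship above, not the paper's allocation logic. The function name `fair_share` and the user names are hypothetical.

```python
from fractions import Fraction

def fair_share(tickets, total_gpus):
    """Return each user's GPU entitlement, proportional to its tickets.

    Assumption: all GPUs are identical and every user can use its full
    share. Fractions are used so the shares sum exactly to total_gpus.
    """
    total_tickets = sum(tickets.values())
    return {
        user: Fraction(t * total_gpus, total_tickets)
        for user, t in tickets.items()
    }

# Hypothetical users and ticket counts, on an 8-GPU cluster.
shares = fair_share({"alice": 100, "bob": 100, "carol": 200}, total_gpus=8)
for user, share in shares.items():
    print(user, float(share))
```

With twice the tickets, carol is entitled to twice as many GPUs as alice or bob.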

Methodology

1. Split Stride Gang-scheduling
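
As background for this step, the following is a rough sketch of classic single-resource stride scheduling, where each job advances a "pass" value by a stride inversely proportional to its tickets and the job with the smallest pass runs next. It is not the split, gang-aware variant used in Gandiva_fair; the constant `STRIDE1`, the function name, and the job names are assumptions made for illustration.

```python
import heapq

STRIDE1 = 10_000  # arbitrary large constant; stride = STRIDE1 // tickets

def stride_schedule(tickets, quanta):
    """Return the order in which jobs receive time quanta.

    Classic stride scheduling on one resource: always run the job with
    the smallest pass value, then advance its pass by its stride.
    Ties are broken by job name so the result is deterministic.
    """
    # Heap entries are (pass, name, stride); every job starts at pass 0.
    heap = [(0, name, STRIDE1 // t) for name, t in tickets.items()]
    heapq.heapify(heap)
    order = []
    for _ in range(quanta):
        pass_value, name, stride = heapq.heappop(heap)
        order.append(name)
        heapq.heappush(heap, (pass_value + stride, name, stride))
    return order

# Hypothetical jobs: A holds twice as many tickets as B.
order = stride_schedule({"A": 200, "B": 100}, quanta=6)
print(order)
```

Over six quanta, A runs four times and B twice, matching the 2:1 ticket ratio.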

2. Load balancing

3. Scheduler Efficiency

4. Handling GPU heterogeneity transparently

varying marginal utility
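
The idea behind trading under varying marginal utility can be illustrated with simple arithmetic. The sketch below assumes two users whose jobs speed up by different factors on a fast GPU relative to a slow one, and a hypothetical exchange rate between the two. It does not reproduce the paper's pricing rule; all numbers and names are made up for illustration.

```python
def trade_gain(speedup_a, speedup_b, rate):
    """Net gain for each user when A hands one fast GPU to B
    and receives `rate` slow GPUs from B in return.

    Throughput is measured in slow-GPU equivalents, so one fast GPU
    is worth `speedup_x` slow GPUs to user x.
    """
    gain_a = rate - speedup_a  # A gains `rate` slow GPUs, loses one fast GPU
    gain_b = speedup_b - rate  # B gains one fast GPU, loses `rate` slow GPUs
    return gain_a, gain_b

# Hypothetical speedups: A's job barely benefits from the fast GPU,
# B's job benefits a lot. Any rate strictly between them helps both.
gain_a, gain_b = trade_gain(speedup_a=1.2, speedup_b=4.0, rate=2.0)
print(gain_a > 0, gain_b > 0)

# A rate above B's speedup makes the trade a loss for B.
bad_a, bad_b = trade_gain(speedup_a=1.2, speedup_b=4.0, rate=5.0)
print(bad_b < 0)
```

Under these assumptions, a trade is mutually beneficial exactly when the exchange rate lies strictly between the two users' speedups.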

Advantages

Disadvantages

Conclusion

My reviews and thoughts

There is no doubt that AI systems are a popular topic. For DLT cluster scheduling, one always has to set an objective for fairness, and the requirements may be completely different from one company to another.
