Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning

    This paper was accepted at EuroSys '20. It is a follow-up to Gandiva (OSDI '18).

    Background

    Previous schedulers such as Tiresias (NSDI '19), Optimus (EuroSys '18), and Gandiva (OSDI '18) do not consider user fairness. In heterogeneous clusters, which include different types of GPUs, it is even harder to guarantee a fair share across users. Deep learning training (DLT) jobs are gang-scheduled and long-running, which makes general-purpose schedulers like YARN unsuitable. This paper introduces Gandiva_fair, which achieves both fairness and efficiency by using a split scheduler, a load balancer with migration, and an automatic resource trading mechanism.

    Problem definition

    Inputs:

    • A GPU cluster with heterogeneous GPUs
    • Different DLT jobs submitted by users
    • Tickets

    Outputs:

    • Allocation for each job

    Constraints:

    • Satisfy inter-user fairness (which implies fairness within each server)
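To make the role of tickets concrete, here is a minimal sketch of a ticket-proportional fair share. It assumes that tickets act as proportional-share weights, as in other ticket-based schedulers; the function name `fair_share`, the user names, and the cluster size are all hypothetical and not taken from the paper. It also treats every GPU as identical, so it ignores heterogeneity.

```python
from fractions import Fraction


def fair_share(tickets, total_gpus):
    """Return each user's GPU entitlement in proportion to their tickets.

    Assumption: tickets are plain proportional-share weights and all GPUs
    are interchangeable (heterogeneity is ignored in this sketch).
    """
    total_tickets = sum(tickets.values())
    if total_tickets <= 0:
        raise ValueError("at least one ticket is required")
    # Fractions keep the shares exact, so they always sum to total_gpus.
    return {
        user: Fraction(count * total_gpus, total_tickets)
        for user, count in tickets.items()
    }


# Hypothetical users and cluster size, for illustration only.
shares = fair_share({"alice": 100, "bob": 100, "carol": 200}, total_gpus=16)
for user, share in shares.items():
    print(user, float(share))
```

Under this assumption a user holding twice as many tickets is entitled to twice as many GPUs, regardless of how many jobs each user submits.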

    Methodology

    1. Split Stride Gang-scheduling

    2. Load balancing

    3. Scheduler Efficiency

    4. Handling GPU heterogeneity transparently

    varying marginal utility
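The stride scheduling named in item 1 can be illustrated with the classic single-resource algorithm. This is only a sketch of plain stride scheduling, not the split, gang-aware variant used by Gandiva_fair; the constant `STRIDE1`, the function name `stride_schedule`, and the job names are assumptions made for the example.

```python
import heapq

STRIDE1 = 1 << 20  # large constant used to derive integer strides


def stride_schedule(tickets, quanta):
    """Classic stride scheduling over one resource.

    Each job gets stride = STRIDE1 // tickets. At every time quantum the job
    with the smallest pass value runs, then its pass advances by its stride.
    Returns the list of job names in the order they were scheduled.
    """
    # Heap entries are (pass, name, stride); pass starts at one stride.
    heap = [
        (STRIDE1 // count, name, STRIDE1 // count)
        for name, count in sorted(tickets.items())
    ]
    heapq.heapify(heap)
    order = []
    for _ in range(quanta):
        pass_value, name, stride = heapq.heappop(heap)
        order.append(name)
        heapq.heappush(heap, (pass_value + stride, name, stride))
    return order


# Hypothetical jobs: A holds 3 tickets, B holds 1 ticket.
order = stride_schedule({"A": 3, "B": 1}, quanta=8)
print(order)
```

With a 3:1 ticket ratio, job A is picked three times as often as job B over the 8 quanta. A gang-aware version would additionally have to run all GPUs of a multi-GPU job in the same quantum, which this sketch does not attempt.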

    Advantages

    Disadvantages

    Conclusion

    My reviews and thoughts

    There is no doubt that AI systems are a popular topic. For DLT cluster scheduling, one always has to set an objective for fairness, and the requirements may differ completely from one company to another.
