开源软件名称(OpenSource Name):microsoft/hivedscheduler开源软件地址(OpenSource Url):https://github.com/microsoft/hivedscheduler开源编程语言(OpenSource Language):Go 96.1%开源软件介绍(OpenSource Introduction):Microsoft OpenPAI HiveDSchedulerHiveD is a scheduler for deep learning workloads. As one standalone component of Microsoft OpenPAI, HiveD is designed to be a Kubernetes Scheduler Extender for Multi-Tenant GPU clusters. A multi-tenant GPU cluster assumes multiple tenants (teams) share the same GPU pool in a single physical cluster (PC) and provides some resource guarantees to each tenant. HiveD models each tenant as a virtual cluster (VC), so that one tenant can use its own VC as if it is a private cluster, while it can also use other VCs' free resource at lower priority. Why You Need HiveDHiveD provides several key features for deep learning workloads as follows. Topology-Aware Resource GuaranteeThe killer feature that distinguishes HiveD is that it provides resource guarantee to each VC, not only in terms of quantity, a numeric value, but also in terms of topology, a key requirement of GPU-based training jobs. For example, a traditional scheduler guarantees that a VC can use 8 GPUs. However, it does not know the topology of these 8 GPUs. It is possible that an 8-GPU training job which has to run within a single node, cannot be allocated even if its VC still has 8 free GPUs. This is because these 8 free GPUs may belong to multiple nodes. HiveD protects VCs' resources in terms of cell, a user-defined resource type that encodes both the quantity and other kinds of information, such as topology and hardware type. In the above example, a user can define a cell type of 8-GPU node, and the VC can be assigned one of such cell. Then, HiveD will ensure that there is always one 8-GPU node available for the VC, regardless of the other workloads in the cluster. HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different device models (e.g., NVIDIA V100 GPU, AMD Radeon MI100 GPU, Cloud TPU v3), or networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them. Gang SchedulingHiveD optimizes the performance of gang scheduling, a typical scheduling requirement for deep learning training jobs, where all containers should be allocated before the training job can begin. Multiple gang-scheduled jobs competing for the same set of resource may lead to starvation, where each job only gets partial resource and has to wait indefinitely. HiveD schedules all containers within a job in a transactional manner, i.e., all these containers' requirements will be granted or denied as a whole, thus avoiding partial resource allocation and starvation. PrioritiesHiveD supports multiple job priorities. Higher-priority jobs can preempt lower-priority jobs. HiveD also introduces opportunistic jobs, i.e., jobs with the lowest priority which can use other VCs' free resource when possible (without breaking the resource guarantees to other VCs). Feature
Prerequisite
Quick StartDocOfficial ImageRelated Project
ContributingThis project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com. When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA. This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments. ReferencePlease cite HiveD in your publications if it helps your research:
|
2023-10-27
2022-08-15
2022-08-17
2022-09-23
2022-08-13
请发表评论