Title: Machine Learning for Optimization: Coflow Scheduling in Cloud High-Performance Computing
Speaker: Hong Shen (沈鸿)
Time: June 11, 2022, 10:00-12:00
Venue: Room 302, Building 4, 开云全站中国有限公司
Host: 开云全站中国有限公司
Abstract:
Statistics show that the transmission delay of dependent data flows among computation tasks accounts for more than 50% of the total execution time of an HPC job. Scheduling these data flows effectively is therefore crucial for reducing the execution time of application jobs and increasing the profit of HPC systems, which is particularly important when managing shared HPC resources in a cloud computing environment. In this talk, I will address the problem of scheduling groups of parallel data flows, namely coflows, among computation tasks in HPC jobs. Because this problem is known to be NP-hard even in its simplest single-stage case, various greedy strategies, heuristics, and machine-learning-based approaches have been proposed to obtain sub-optimal solutions. As an example of applying machine learning techniques to optimization problems, I will present our recent work on combining different machine learning models and techniques to schedule coflows effectively. I will begin with an overview of machine learning and traditional optimization strategies for problem solving, our approaches to combining them in different settings, and the coflow scheduling problem and its research status. I will then introduce our work on offline scheduling of single-stage coflows, which combines meta-learning with a deep neural network (DNN) to balance efficiency and fairness, that is, to minimize coflow completion time while ensuring the desired fairness among coflows. Next, I will present our work on online scheduling of multi-stage coflows, a more challenging but more realistic situation in cloud HPC environments, which combines a graph neural network (GNN), deep reinforcement learning (DRL), and a self-attention mechanism to improve scalability and efficiency as measured by job completion time. Finally, I will conclude the talk by discussing some challenges in applying ML techniques to optimization problems and our future work.
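About the speaker:
Professor Hong Shen received his bachelor's degree from the University of Science and Technology Beijing, his master's degree from the University of Science and Technology of China, and his PhD from Åbo Akademi University, Finland. He is currently a Professor at Macao Polytechnic University, a Distinguished Professor at Sun Yat-sen University, and an Adjunct Professor at the University of Adelaide, where he served as Professor (Chair) of Computer Science for 15 years. His main research areas include parallel and distributed computing, privacy-preserving computing, and high-performance networks, and he has led research centers and projects in several countries. He has published more than 400 papers, including more than 100 in major international journals such as IEEE and ACM Transactions. Professor Shen has received many honors and awards and has held various positions in professional societies, journal editorial boards, and conference committees.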