Project Shad: Scheduling in Hadoop across Data Centers



This project aims to develop an efficient scheduler for a Hadoop YARN system that spans multiple data centers and serves a large volume of jobs. Unlike in a typical Hadoop cluster, the input data sets are distributed across multiple sites. Our scheduler is expected to dynamically derive the best strategy for assigning computation tasks and migrating source data. The decision depends on several interacting factors, including data locality, network overhead, the computing capacity of each data center, dependencies between tasks and jobs, and each job's progress and urgency.
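
To make the trade-off between data locality, network overhead, and computing capacity concrete, here is a minimal sketch of a per-task placement estimate. It is not the project's scheduler: the `Site` fields, the cost model (queue wait plus transfer time plus run time), and all function names are hypothetical and chosen only for illustration. Task dependencies and job urgency are left out.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical view of one data center, as seen by the scheduler."""
    name: str
    slots: int          # task slots available at this site
    queued_tasks: int   # tasks already waiting for those slots
    local_bytes: int    # bytes of this task's input already stored here


def placement_cost(site, input_bytes, bandwidth_bps, task_seconds):
    """Estimated seconds until the task finishes if it runs at `site`.

    Assumed model: wait for queued work, fetch missing input over the
    inter-site link, then run. Real costs would be measured, not assumed.
    """
    if site.slots <= 0:
        return float("inf")  # no capacity at this site
    wait_seconds = site.queued_tasks / site.slots * task_seconds
    remote_bytes = max(input_bytes - site.local_bytes, 0)
    transfer_seconds = remote_bytes * 8 / bandwidth_bps
    return wait_seconds + transfer_seconds + task_seconds


def choose_site(sites, input_bytes, bandwidth_bps, task_seconds):
    """Pick the site with the lowest estimated completion time."""
    return min(
        sites,
        key=lambda s: placement_cost(s, input_bytes, bandwidth_bps, task_seconds),
    )


if __name__ == "__main__":
    sites = [
        Site("dc-a", slots=10, queued_tasks=40, local_bytes=1_000_000_000),
        Site("dc-b", slots=10, queued_tasks=0, local_bytes=0),
    ]
    best = choose_site(
        sites,
        input_bytes=1_000_000_000,
        bandwidth_bps=1_000_000_000,
        task_seconds=5,
    )
    print(best.name)
```

In this example the data is local to `dc-a`, but its backlog makes moving the input to the idle `dc-b` cheaper under the assumed model, which is the kind of decision a locality-only scheduler would miss.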

Under construction ...