دانلود رایگان مقاله انگلیسی + خرید ترجمه فارسی | |
عنوان فارسی مقاله: | رویکرد برنامه ریزی زمان واقعی ترکیبی برای پلت فرمهای چند هسته ای در مقیاس بزرگ |
عنوان انگلیسی مقاله: | A Hybrid Real-Time Scheduling Approach for Large-Scale Multicore Platforms |
دانلود مقاله انگلیسی: | برای دانلود رایگان مقاله انگلیسی با فرمت pdf اینجا کلیک نمائید |
مشخصات مقاله انگلیسی (PDF) و ترجمه مقاله (Word) | |
سال انتشار مقاله | 2007 |
تعداد صفحات مقاله انگلیسی | 10 صفحه با فرمت pdf |
تعداد صفحات ترجمه مقاله | 19 صفحه با فرمت ورد |
رشته های مرتبط | فناوری اطلاعات و کامپیوتر |
دانشگاه تهیه کننده | گروه علوم کامپیوتر، دانشگاه کارولینای شمالی در چاپل هیل (Department of Computer Science, The University of North Carolina at Chapel Hill) |
کلمات کلیدی این مقاله | زمانبندی بلادرنگ، EDF انحصاری، خوشه بندی |
بخشی از ترجمه:
در این مقاله قصد داریم روشی را برای زمانبندی وظایف بلادرنگ بر روی پلت فرمهای چند هسته ای و بزرگ مقیاس، و با استفاده از کشهای سلسله مراتبی اشتراکی ارائه دهیم. در این روش، ۱ پلت فرم چند هسته ای در داخل کلاستر یا خوشهها بخش بندی شده است. وظایت نیز به صورت پویا به این خوشهها تخصیص داده شده و در داخل هر خوشه با استفاده از الگوریتم زمانبندی EDF انحصاری زمانبندی میشوند. نشان دادهایم که این روش ترکیبی در بخش بندی و زمابندی میتواند نسبت به پلت فرمهای بزرگ مقیاس عملکرد بهتری داشته باشد. همچنین اندازهی مناسبی را برای خوشه در نظر گرفتهایم تا بتوانیم به بهترین کارائی ممکن دست پیدا کنیم، البته با این شرط که مشخصه های ۱ مجموعه وظیفه را بتوان پشتیبانی کرد.
بخشی از مقاله انگلیسی:
Abstract We propose a hybrid approach for scheduling real-time tasks on large-scale multicore platforms with hierarchical shared caches. In this approach, a multicore platform is partitioned into clusters. Tasks are statically assigned to these clusters, and scheduled within each cluster using the preemptive global EDF scheduling algorithm. We show that this hybrid of partitioning and global scheduling performs better on large-scale platforms than either approach alone. We also determine the appropriate cluster size to achieve the best performance possible, given the characteristics of the task set to be supported. 1 Introduction Multicore architectures, which include several processors on a single chip, are being widely touted as a solution to the “thermal roadblock” imposed by single-core designs. Most chip makers have released dual-core chips, and a few designs with more than two cores have been released as well. For instance, both Intel and AMD have released four-core chips, Sun recently released its eight-core Niagara chip, and Intel is expected to release chips with 80 cores within five years [6]. Azul, a company that builds Java appliances, has created 48-core chips that are used in systems with up to 768 total cores [1]. These appliances are used to process large numbers of transactions with soft real-time requirements. To summarize, largescale multicore platforms with tens or even hundreds of cores per chip may become a reality fairly soon and applications with (soft) real-time constraints will likely be deployed on them. In this paper, we consider the issue of how to efficiently schedule soft real-time workloads on such large platforms. In most proposed multicore platforms, different cores share on-chip caches. For example, by the end of 2007, both Intel and AMD plan to have chips with a cache shared by four cores, and the aforementioned Azul chip ∗Work supported by a grant from Intel Corp., by NSF grants CNS 0408996, CCF 0541056, and CNS 0615197 and by ARO grant W911NF-06-1-0425. From other eight cores… From other 48 cores… L1 L1 Core 4 L1 Core 1 Core 13 Core 16 L3 L4 L1 L2 L2 Figure 1: Large-scale multicore architecture with a four-level cache hierarchy. Three of the levels contain shared caches. has a shared cache for each group of eight cores. To effectively exploit the available parallelism in these systems, such caches must not become performance bottlenecks. In fact, the issue of efficient cache usage on multicore platforms is one of the most important problems with which chip makers are currently grappling. This will become an even more pressing issue as the number of cores on a chip increases. For example, Intel envisions a 32- core platform where all cores share a cache. To alleviate issues related to cache contention and coherence, a treelike cache hierarchy will likely exist. To reasonably constrain the focus of this paper, we henceforth take the 64- core platform shown in Fig. 1 to be the “canonical” large platform under consideration. In this platform, all cores are symmetric, single-threaded, and share an L4 cache. Note that groups of cores also share L2 and L3 caches, which reduces contention at the L4 cache. Given such a platform, the question of whether to use partitioning or global scheduling approaches when scheduling soft real-time applications becomes more complicated. While either approach might be viable, each has serious drawbacks on this platform, and neither will likely utilize the system very well. Globalscheduling algorithms are better able than partitioning approaches to utilize multiprocessor systems when system overheads are negligible. For example, on a system with M cores, the global earliest-deadline-first (EDF) algorithm can ensure bounded deadline tardiness (which is sufficient for many soft real-time applications) for any such task system if total utilization is at most M [5, 10]. On the other hand, global algorithms are susceptible to large overheads on large platforms. These overheads are due to scheduling-related costs when scheduling a large task set on a large number of cores, high contention for the global run queue, and the cost of migrating data between two cores that share only a low-level cache (or, on some platforms, no cache at all). Partitioning approaches result in no task migrations and reduced scheduling costs; however, due to bin-packing limitations, there exist task systems with total utilization of approximately M/2 that no such approach can correctly schedule, even if bounded deadline tardiness is allowed. This becomes an even greater concern on large-scale platforms with relatively simple cores (a likely scenario, since using simple cores enables more cores to be placed onto a chip). This is because task utilizations on such simple cores may be high, which makes partitioning more difficult. Contributions. Driven by the above considerations, we propose the use of a hybrid approach that exploits the natural groupings of cores around different levels of shared caches. In our hybrid approach, we partition the platform into clusters of cores that share a cache. We then statically assign tasks to clusters, and schedule tasks within each cluster using a global scheduling algorithm, namely, preemptive global EDF. In this approach, migration costs within a cluster are a function of the access time (and size) of the shared cache of that cluster. When tasks have large working sets (WSs), cluster sizes can be kept small in order to keep migration costs low. By partitioning the system into clusters instead of individual cores, we alleviate bin-packing limitations by effectively increasing bin sizes in comparison to item sizes. For example, with four-core clusters, a task can occupy at most 25% of a bin. As a result, it is much easier to partition such tasks onto clusters than onto individual cores. Note that by tuning the cluster size, we can mitigate the weaknesses of (and exploit the advantages of) each approach. The “ideal” cluster size depends on both the maximum task utilization and task working set size (WSS), as well as the overall system utilization of the real-time workload. One of the main contributions of this paper is to devise rules of thumb for choosing cluster sizes. These rules were devised based upon a series of schedulability experiments that were conducted for our canonical system under a variety of different real-time workloads. We used SESC, a cycle-accurate architecture simulator that supports the MIPS instruction set [8], to obtain realistic overheads for our schedulability experiments. We found that larger cluster sizes improve bin packings, and are preferable when task utilizations are high. When task utilizations are lower, migration and scheduling costs are usually the greater concern, particularly when WSSs are large, and therefore smaller cluster sizes are preferable. We show that for many of the workloads that we investigated, a cluster size of four is ideal for our platform, as the maximum size of a bin item is reduced to a size that makes bin packing much easier while keeping scheduling and migration costs low. There also exist scenarios where a pure partitioning or global scheduling approach is the best choice, and we do not rule out the possibility of using these approaches when it is beneficial to do so. For example, a pure partitioning approach might be beneficial when the real-time workload is very small, as the bin-packing problem would not be a concern. Alternately, when task utilizations are very high and WSSs are very low, a pure global approach might prove to be best. Our hybrid approach simply provides greater flexibility when determining how to most efficiently schedule a real-time workload. One limitation of our experiments is worth noting. While we believe that many of our conclusions are of a general nature, these conclusions have been drawn based on empirical data taken from one simulated test platform, that shown in Fig. 1 and further elaborated upon in Sec. 3. SESC produces very detailed simulations, but is quite slow, so it was only feasible to explore one such platform in this paper. While it is difficult to predict exactly what future 64-core platforms will look like, we believe that our chosen platform is a reasonable approximation of what can be expected. In future work, we hope to evaluate other platforms, most notably those exhibiting either core or cache asymmetry. Another difficultly in performing these experimentsfor soft real-time systems is the need to generate averagecase, rather than worst-case, measurements. This is particularly difficult when determining system overheads, as discussed further in Sec. 3.1. Furthermore, when provisioning a system based on average-case execution costs, overruns can occur. We assume that such overruns will be temporary—if an execution cost is truly an average-case measurement, then underruns should occur sufficiently frequently to counterbalance overruns in the long run. In terms of scheduling, there are numerous ways of handling overruns. In this paper, we assume that a job that does not complete by the end of its provisioned time uses time allocated to the next job of the same task. This is the simplest way of handling overruns, as it does not require modification to the scheduling algorithm and prevents an overrunning task from having a negative impact on other tasks in the system. Unfortunately, this approach can lead to increased deadline tardiness. It also makes it difficult to consider truly non-preemptive scheduling algorithms, as it allows overrunning jobs to be preempted. Issues related to using non-preemptive scheduling algorithms assuming average-case execution costs are left as future work. Related work. Little work has explored the impact of large-scale multicore platforms on real-time scheduling algorithms. Recent work [3] (by our research group) has compared partitioning and global approaches using real scheduling and system overheads. However, this work was performed on a conventional (non-multicore) fourprocessor SMP platform. Unfortunately, the platform of interest in this paper, or any approximation thereof, may not exist for at least several years, so we must resort to simulation. What real-time workload needs 64 cores? Before continuing, we explain the need for real-time support on a platform with 64 cores. One envisioned application of multicore platforms is as multi-purpose home appliances, with one machine serving many of the computing needs of a home. These may include supporting, for example, HDTV video streaming, a videoconferencing session, and several “virtual” user terminals simultaneously. In such a system, applications with soft real-time requirements must run alongside other (soft real-time and non-real-time) applications. If our hybrid approach were used to implement such a system, then some tasks of the real-time workload could actually be server tasks for supporting non-real-time processing. In such a scenario, our approach would increase the capacity of the system for handling both real-time and non-real-time applications. It is worth noting that, in these home appliances, many of the envisioned real-time applications might be very processor-intensive (e.g., streaming from an HDTV video source), and therefore might contain tasks with utilizations higher than are typically considered “normal” for real-time tasks, especially if processing cores get simpler as the number of cores on a chip increases. For this reason, we consider task sets containing such highutilization tasks in our experiments. Another application of these large-scale platforms is real-time transaction processing. As noted earlier, Azul [9] is a company that has developed a variety of large-scale multicore platforms explicitly for the purpose of handling transaction-oriented, Java-based workloads. Processors with 48 cores on a chip are being placed into systems with hundreds of cores in order to handle the demand created by businesses such as financial institutions. While developing more powerful hardware is certainly an important part of satisfying the demand for processing real-time transactions, it will also be important to most ef- ficiently make use of currently-existing hardware through the use of appropriate scheduling algorithms. This will allow timing guarantees to be made while supporting the largest real-time workload possible, in turn allowing businesses to achieve the largest possible return on their hardware investment. Organization. The rest of this paper is organized as follows. In Sec. 2, we present a brief introduction to EDF-based partitioned and global scheduling and a description of our hybrid EDF scheduling approach. In Sec. 3, we present the results of experiments performed to determine the ideal cluster size for our target platform under a variety of real-time workloads. In Sec. 4, we state several “rules of thumb” based on these results that can be employed when determining an appropriate cluster size for a real-time workload running on a large-scale multicore platform. Finally, in Sec. 5, we conclude.
دانلود رایگان مقاله انگلیسی + خرید ترجمه فارسی | |
عنوان فارسی مقاله: | رویکرد برنامه ریزی زمان واقعی ترکیبی برای پلت فرمهای چند هسته ای در مقیاس بزرگ |
عنوان انگلیسی مقاله: | A Hybrid Real-Time Scheduling Approach for Large-Scale Multicore Platforms |