| ![]() | |||||||||
Published in Proceedings of US/Japan Workshop on Parallel Symbolic Computing: Languages, Systems, and
Applications, Cambridge, MA, USA, October 1992.
Also appeard in LNCS 748, pp. 380-401, Springer Verlag, October 1992.
Customizable Policy Management
in the
Sting Operating System
James Philbin
NEC Research Institute
Princeton, NJ 08540
philbin@research.nj.nec.com
Abstract. Sting is an operating system designed to serve as a highly efficient substrate for modern programming languages. It is designed to run on both parallel and distributed systems. This paper describes one of the important contributions of Sting - customizable policy management at both the kernel and user level of the operating system. Two well defined interfaces separate control issues from policy issues. These interfaces allow different, customized policy management modules to be implemented without changing the rest of the system. Customizable policy management makes Sting suitable for many different operating environments including real time, interactive, and computationally intensive. It also allows the user to choose or implement a policy management strategy that is best suited to a particular program.
1 Introduction
Sting is an operating system designed to serve as a highly efficient substrate for modern programming languages. It is designed to run on MIMD parallel computers, with either shared or disjoint memory, as well as distributed machines composed of networks of workstations.
Sting was designed as an experimental platform for exploring and comparing different models of parallel programming. We have used Sting to implement many different algorithms corresponding to different paradigms of parallelism including: result parallelism, master/slave parallelism, and speculative parallelism.
We have also implemented several different parallel programming models on top of Sting including: futures [Hal85], first class tuple spaces [Jag91], and engines [HF84] and compared them using various parallel algorithms that use these paradigms.
One of the fundamental design goals of Sting was to separate control issues from policy issues to the extent possible. This separation occurs at two different abstraction levels in the system: in the Abstract Physical Processor and in the Virtual Processor. Each of these abstractions is separated into two components, the ?controller? which
implements the control part of the abstraction and the ?policy manager? which makes policy decisions for the controller. Separating control from policy allows us to define different behaviors for what is functionally the same system by modifying only the policy manager part of the abstraction.
Traditionally there have been several classes of operating system including: real time, interactive, and batch. These three classes have provided different interfaces to the user and thus porting a program from one class of OS to another has been difficult. Additionally, since the scheduling decisions made by each class are different it is difficult to debug a program for one, say a real time application, on another, say an interactive development system, and have confidence that the application will run correctly and efficiently on the target system.
The situation is complicated further by the number of different scheduling regimes used in each of these classes of systems. For example, some real time systems use a fixed scheduling order for processes, some use a priority discipline, some use a running quantum discipline, and still others use a combination of these. Interactive operating systems or batch operating systems have at least as many scheduling alternatives if not more.
Separating control from policy allows us to build one operating system that can be easily customized for various classes of operating system. In Sting the modules implementing policy managers are very small relative to the size of the system - consisting in general of less than one hundred lines of code. Thus, we only have to write a small piece of code to build a new system with different policy behavior. Additionally, since the policy manager presents a well defined interface we need only test the new policy manager and not the entire system when we change policy behavior.
Hydra [WLH81] was the first operating system designed with the separation of control and policy in mind, but Hydra only allowed policy customization at the kernel level. Sting goes further than this by allowing the programmer to customize policy decisions as they relate to a particular program. Thus an interactive program such as an editor or window manager can have very different policies from a compute intensive program such as a fluid dynamics simulation or finite elements computation. Separation of control from policy in Hydra was also expensive since it required several context switches between the kernel and the policy manager. Since the Sting policy managers are linked directly into the appropriate address space and require no context switching, they are as efficient as traditional (non-customizable) operating system policy management.
In Sting, virtual processors are multiplexed on abstract physical processors and threads are multiplexed on virtual processors. All policy decisions relating to this multiplexing are decided by policy managers. Decisions relating to the multiplexing of virtual processors on physical processors are made by the Virtual Processor Policy Manager (VPPM) while decisions relating to the multiplexing of threads on virtual processors are made by the Thread Policy Manager (TPM).
Policy managers make three types of decisions: which multiplexor to map a new object (VP or thread) to when it is created or resumed, what order to run the objects mapped on a particular processor, and when to remap or move an object from one pro-
cessor to another. The large decision spaces that relate to these two levels of policy managers are discussed in detail below.
2 Overview of Sting
Sting is an operating system designed to support modern programming languages such as Scheme, SmallTalk, ML, Modula3, or Haskell. It provides a foundation of low level, orthogonal constructs, that allows the language designer or implementor to build the various constructs required by these languages easily and efficiently.
Modern programming languages have more extensive requirements than traditional programming languages such as Cobol, Fortran, C, or Pascal1. The list below identifies some of the requirements that distinguish modern from traditional languages.
Parallelism - The growing availability of general purpose multi-processors has lead to increased interest in building efficient and expressive platforms for concurrent programming. Most efforts to incorporate concurrency into high-level programming languages involve the addition of special purpose primitives to the language.
Multiple Synchronization Models - There are many synchronization protocols used in parallel or asynchronous programming. A modern operating environment should as far as possible provide the primitives to support the various protocols.
Lazy and Eager Evaluation - Many modern languages support either lazy or eager evaluation or both. It is important for the operating system to provide the full range of evaluation strategies from lazy to eager.
Automatic Storage Management - This has become a fundamental feature of many modern languages, because automatic storage management allows more expressive programs, while at the same time reducing both the number of errors in and the complexity of programs.
Topology Mapping - While not yet supported in many programming languages, the ability to control the mapping of processes to processors so as to reduce the communication overhead of a program will become more important as the size of multi-processor computer systems continues to grow and the topologies become more complex.
Sting supports these various requirements efficiently. It does so in an architectural framework that is more general and more efficient than those currently available. It also provides the programmer with an increased level of expressiveness and control, and an unparalleled level of customizability.
The Sting operating system architecture is composed of several layers of abstraction, as shown in Figure 2-a. The first layer is the abstract physical machine, which is composed of abstract physical processors (PP) connected in some physical topology (PT). This layer corresponds to the micro-kernel in newer operating systems. The sec-
1. Even though Sting is designed to support modern programming languages it accommodates traditional programming languages just as efficiently.
ond layer consists of virtual machines and virtual processors (VPs). A virtual machine is composed of a virtual address space into which a set of virtual processors are mapped and connected in a virtual topology (VT). Virtual machines are mapped onto abstract physical machines, with each virtual processor mapped onto a physical processor. The third layer of abstraction defines threads (T). Threads are lightweight processes that are run on virtual processors.
The threads in a virtual machine are multiplexed on the virtual processors of that machine. At the same time, the virtual processors are multiplexed on the various physical processors. A virtual machine can have a unlimited number of threads and virtual processors, while a physical machine can have only as many abstract physical processors as there are actual hardware processors in the machine.
The virtual processors of a virtual machine are connected together in a virtual topology. The virtual topology maps each virtual processor to a physical processor. For example, a virtual topology might represent a tree of virtual processors that is mapped onto physical processors that are physically connected in a mesh topology. Virtual topologies allow the programmer to express a program in a (virtual) topology that is suitable to the algorithm being implemented, while at the same time obtaining an effi-
Figure 2-a: Sting Architecture
P
VPC VPPMk
VPC VPPM0
P
Abstract Physical
PT
TC TPML
TC TPM0
VP0 VPL
VT
T0 Tm Tm+1 Tn
Virtual Machine N
cient mapping onto the actual physical topology of the target machine. Virtual topologies also allow parallel programs to be portable across different physical topologies.
Physical processors and virtual processors each have two components a controller and a policy manager. The controller is a state transition machine for the objects being multiplexed on that processor, i.e. physical processors multiplex virtual processors and virtual processors multiplex threads. The policy manager for a processor makes policy decisions when they are requested by the controller.
Since abstract physical processors multiplex virtual processors, they are composed of a virtual processor controller (VPC) and a virtual processor policy manager (VPPM). Virtual processors, which multiplex threads, are composed of a thread controller (TC) and a thread policy manager (TPM). Together the virtual processor policy managers and the thread policy managers make all policy decisions for a Sting system.
Sting is implemented in T [RA82] a dialect of Scheme using a modified version of the Orbit compiler. Readers interested in a more details about Sting should see [JP92a], [JP92b], and [Phi93].
Since the thread policy manager and the virtual processor policy manager are structurally similar in many ways and since users will in general be more interested in customizing the thread policy manager we will discuss it in detail below. After that we will discuss aspects of the virtual processor policy manager that are different from the thread policy manager.
3 Virtual Processors
A Virtual Processor (VP) is an abstraction of a hardware processor. As such, it is responsible for the creation, destruction, scheduling, and migration of lightweight threads. It also handles interrupts (hardware and software) and virtual processor controller up calls, i.e. software interrupts generated by the abstract physical machine, for example, when a thread blocks in the abstract physical machine.
Each VP is associated with both a virtual machine and an abstract physical processor. A physical processor may run VPs associated with many different virtual machines. More than one VP from the same virtual machine can also run on the same physical processor.
Sting?s VP?s are first class objects. This means they can be passed to and returned from procedure calls, and can be stored in data structures. Being first class, VP?s provides Sting with several capabilities that other operating systems lack:
? the user can explicitly map a thread to a particular virtual processor;
? the user can build abstract topologies using VP self-relative addressing; and,
? the policies of any VP can be easily customized.
As explained above, control and policy are separated in the virtual processor. Each virtual processor is composed of two software components: the thread controller and the thread policy manager. Figure 3-a shows the relationship between the virtual processor, the virtual machine, and user code. The thread policy manager is completely contained within the virtual processor. User code and threads interact with the thread
controller and the thread controller calls the thread policy manager to make policy decisions for it.
The thread controller handles the virtual processor?s interaction with other system components such as physical processors and threads. The most important function of the thread controller is to handle the state transitions of threads. Whenever a thread makes a state transition that results in it yielding the virtual processor on which it is currently running, the thread controller calls the thread policy manager to determine which thread to run next. The thread controller is discussed in more detail in [JP92a], and {Phi93].
4 Thread Policy Manager
Each virtual processor contains a thread policy manager. The thread policy manager makes all policy decisions relating to the scheduling and migration of threads on virtual processors. The thread controller is a client of the thread policy manager and it is inaccessible to user code. The thread controller calls the thread policy manager whenever it needs to make a decision concerning:
? the initial mapping of a thread to a virtual processor;
? which thread a virtual processor should run next when the current thread releases the virtual processor for some reason; or
? when and which threads to migrate to/from a virtual processor.
While all virtual processors have the same thread controller, each virtual processor may have a different policy manager. This ability is particularly important for real time applications where each processor may be controlling a different subsystem with different scheduling requirements.
The thread policy manager presents a well-defined interface to the thread controller. The data structures that the thread policy managers use to make their decisions are
Figure 3-a: Separation of Control and Policy in the Virtual Processor
Thread Thread
Policy ManagerController
Virtual Processor
Virtual Machine
User Code
completely private to them. They may be local to a particular thread policy manager or shared among the various instances of the thread policy manager, or some combination thereof, but they are never available to any other part of the system. The thread policy manager can thus be customized to provide different behaviors for different virtual machines. This allows the user to customize policy decisions depending on the type of program being run. For example, a computationally intensive program such as a fluid dynamics simulator, might use a non-preemptible lifo scheduling policy, because each thread should run as long as possible and because the lifo scheduling order is optimal for the particular structure of the algorithm being used, while a window manager or user shell might use a priority based fifo policy for the obvious reasons.
It is also worth noting that each thread has an associated priority and quantum. These values are supplied by the user/programmer when a thread is created or scheduled. The only part of the Sting system that uses these values is the thread policy manager. Associating priority and quantum with each thread allows the implementation of the full gamut of scheduling strategies from quantum based real time scheduling to priority based interactive scheduling.
4.1 Thread Policy Manager Interface
The thread policy manager communicates with the thread controller through a simple and easily implemented interface. Whenever the thread controller needs to make a policy decision it calls on the thread policy manager to make that decision. A thread policy manager can be implemented by any module that conforms to the interface.
Figure 4-a shows the interface to the thread policy manager. The interface can be divided into five components: policy manager initialization, initial placement policy,
VP Initialization
(tpm-initialize-vp vp) -> no-value
Initial Placement Policy
(tpm-allocate-vp vp thread) -> vp
Thread Scheduling Policy
(tpm-dequeue-ready-thread vp) -> thread | #F
(tpm-enqueue-ready-thread vp thread) -> no-value
(tpm-wakeup-suspended-threads vp) -> no-value
Scheduling Data Integrity
(tpm-ensure-priority priority) -> no-value
(tpm-ensure-quantum quantum) -> no-value
Migration Policy
(tpm-vp-idle vp) -> no-value
Figure 4-a: Thread Policy Manager Interface
scheduling policy, guards used by the thread controller to ensure that priority and quantum data are of the correct type, and migration policy. We emphasize that these policy interface procedures are called only by the thread controller and never by user code. Below we briefly describe the functionality of each of the interface procedures.
tpm-initialize-vp - This procedure is called when a VP is created. It is responsible for initializing any data structures associated with the thread policy manager on the VP that is its argument.
tpm-allocate-vp - This procedure is called whenever a thread is to be scheduled on a VP. While the user may request that a thread be scheduled on a particular VP, the thread policy manager has the final say regarding the VP on which a thread is scheduled.
tpm-dequeue-ready-thread - This procedure returns the next thread that is ready to run or false if there is no ready thread. This procedure is called from the thread controller when a thread has yielded the processor for whatever reason.
tpm-enqueue-ready-thread - This procedure is called when a thread becomes ready to run. It is responsible for enqueuing the thread according to the priority and quantum associated with the thread.
tpm-wakeup-suspended-threads - This procedure is responsible for moving any suspended threads into the appropriate ready queue if the real clock time has passed the requested wakeup time.
tpm-ensure-priority - This procedure ensures that it argument conforms to the thread policy manager?s notion of a valid priority.
tpm-ensure-quantum - This procedure ensures that it argument conforms to the thread policy manager?s notion of a valid quantum.
tpm-vp-idle - This procedure is called if there is no thread runnable. It can do several things:
? It can call the abstract physical processor to inform it that the VP is idle.
? It can migrate threads from either a local or global queue associated with some VP.
? It can decide to do housekeeping chores for the thread policy manager.
We should point out that the categories in the above classification of the thread policy manager ?s interface procedures (Figure 4-a) are not orthogonal. For example, both tpm-enqueue-ready-thread and tpm-dequeue-ready-thread could be used to handle load balancing and thread migration on the virtual processor. tpmenqueue-ready-thread could also be used to handle initial mapping of a new thread to a virtual processor. Thus tpm-allocate-vp and tpm-vp-idle are redundant procedures, but they are useful because they simplify the implementation of both the thread controller and the thread policy manager.
A discussion of memory management and communication in Sting is beyond the scope of this paper, but it should be pointed out that thread policy managers can communicate with each other using any Sting mechanism, in particular shared memory or
polymorphic ports.
5 Policy Dimensions
As mentioned above, thread policy managers make three types of policy decisions: (1) which processor to schedule a thread on; (2) which thread to run next on a given processor; and (3) when to migrate a thread to/from another processor and onto which processor to migrate it. There are many different strategies for making these decisions, depending on both the application environment and the implementation of a particular application, but we believe that Sting?s thread policy manager interface can support all of them. Below we discuss various strategies for handling these three policy dimensions.
5.1 Initial Thread Placement Decisions
There are many possible criteria for deciding on which virtual processor to run a thread. The two principle criteria that can be used to make this decision are load balancing and ?nearness? in the communications topology, to other threads with which data is shared. The reasons for load balancing are obvious, although load balancing strategies are not necessarily so. The reason nearness matters is that it can significantly reduce the cost of communication overhead. A third, though less important reason for mapping a thread to a particular processor is that the processor has a hardware device connected to it that the thread needs to access efficiently.
Below we describe several different load balancing strategies that we have implemented and several that we have not implemented but that may be appropriate under certain circumstances.
Parent?s VP - This strategy involves mapping a new thread to the same VP that its parent was running on when it was created. This strategy does not attempt to balance the load on the machine as threads are created, rather it relies on some migration strategy, discussed below, to balance the load on the processor. It does take advantage of the locality normally associated with a parent thread and its children.
Local with Threshold Overflow to Global - This strategy is similar to the Parent?s VP strategy. The difference is that when the number of ready threads on a VP reaches a threshold, some number (typically half) of the threads are moved to a global queue. When the local ready queue is empty the VP requests threads out of the global queue. This strategy exploits locality just as the Parent?s VP strategy does. It has the further advantage that the load on an individual processor never goes above a particular threshold. The disadvantage of this strategy is possible contention on the global queue.
Topology Mapping - This strategy relies on the programmer to specify the VP onto which a thread is mapped. It has the significant advantage that the programmer can take advantage of the data sharing and data distribution attributes of a particular program to improve its efficiency. It has the disadvantage that it requires more work on the part of the programmer. Any of the other placement strategies can be implemented in such a way so that they honor programmer
specified mappings, but make appropriate decisions when the programmer doesn?t specify the mapping.1
Random Placement - This strategy involves randomly mapping a new thread to a VP. It relies on the randomness to create an even distribution of threads on the processors. This strategy, however, does not take any advantage of the memory or communications architecture of the machine.
Round Robin - This strategy involves relying on a global counter to distribute the threads evenly across all VPs. It has the advantage of being simple and guaranteeing a good distribution. It has the disadvantage that the global counter must have a lock and thus may become a bottleneck, and like random placement, it does not take any advantage of the memory or communications architecture of the machine.
Idle VP - While there are many possible load based strategies the idle processor strategy is perhaps the simplest. Whenever a VP is idle it places itself on the idle VP queue and whenever a new thread is created it is mapped onto the VP at the head of the queue. If no VP is idle some other strategy is used map the thread to a VP. The advantage of this strategy is that idle processors quickly receive new work that is created. The disadvantage is that the number of threads per VP may be very unbalanced, and it does not take any advantage of the memory or communications architecture of the machine.
Load Based - This strategy relies on each VP maintaining some notion of its load and inserting itself in global queue ordered from the least loaded to the most loaded VP. When a thread is mapped to a VP it is always mapped to the least loaded VP. This strategy has obvious advantages, but its disadvantages are that the global queue may be a significant bottleneck if the threads are very lightweight and it does not take any advantage of the memory or communications architecture of the machine.
There are many other possible initial placement strategies. Some can be devised by combining elements of those mentioned above. The performance of a program or algorithm can depend crucially on the initial thread placement strategy used. It is also clear that for programs where computation costs completely dominate communications costs, i.e. embarrassingly parallel programs, initial placement decisions are less important, because they do not need to take advantage of the communications topology; however, initial placement decisions will still effect load balancing. For fine grained parallel programs, especially if run on a machine with a complex topology or distributed hierarchical memory, initial placement decisions can be fundamental to good performance.
5.2 Scheduling
While there are many different strategies for initially mapping a thread to a virtual processor, there are also many other possible strategies for scheduling those threads
1. We discuss topology mapping in more detail in [Phi93].
once they are mapped. Rather than discuss these strategies in detail we will discuss various dimensions along which decisions about the scheduler design must be made.
Local or Global Queues - The first decision the scheduler designer must make is whether the queue(s) will be global, i.e. associated with the virtual machine, or local, i.e. associated with each virtual processor, or some combination of the two. Global queues have the advantage of being more fair, but the disadvantage of being a potential source of contention, and therefore a bottleneck.
Number and Class of Queues - Ready threads can be divided into three classes for the purposes of scheduling:
? Those that have been scheduled but never been run and thus have no execution context.
? Those that have been run before and were in the running state immediately prior to going into the ready state.
? And those that have been run before and were in the blocked state immediately prior to going into the ready state.
Ready threads of all classes can be scheduled in one queue, but it may be advantageous to separate threads in different classes into different queues. For example, consider an implementation in which the scheduler maintains one queue for threads that have been scheduled but never run and another queue for threads which have been run.1 The policy manager can quickly migrate threads that have never been run to other VPs without having to migrate the thread?s execution context (because it doesn?t have one). The scheduler designer may also wish to discriminate between threads that were running and threads that were blocked, since the former, having been run more recently, are likely to have more cache/memory locality than the latter.
Ordering Within a Queue - There are various strategies for ordering threads within a particular queue. The two simplest are lifo and fifo, but since threads can have both a priority and a quantum associated with them much more complicated orderings can be created.
Locking Discipline - Another decision the scheduler designer needs to make is locking discipline required for the queues. If the queues are global and have multiple readers and writers, they must be locked on every access. If the queues are local, and have only one reader, i.e. the local VP, and many writers, they only need to be locked for writing. Finally, it is possible that a queue could be completely private to a processor and not need a lock. This is possible if the queue contains only threads that were previously running on that processor and those threads can never be migrated to another processor.
Quantum Discipline - The final decision the scheduler designer must make concerns the amount of time each thread will run before yielding the processor,
1. This sort of distinction is interesting in Sting because threads when created are very small data structures (approximately 10 words) and do not acquire an execution context (stack, heap,...) until they begin evaluating. For more information see [Phi93].
i.e. its quantum. There are several possibilities. The quantum for each thread may be determined statically by the programmer. This is generally done when the schedule for the threads is predetermined prior to the execution of a program. Other possibilities are that the quantum is the same for all threads; it is the same for all threads at a given priority, but different for threads at different priorities; or the quantum can vary dynamically under programmer control or under policy manager control.
These various scheduler dimensions allow the thread policy manager to be designed for various real time, computationally intensive, and interactive environments. Another interesting aspect of the thread policy manager is that it is possible to implement a completely static scheduler for a particular program if an optimal or near optimal schedule can be determined for the threads.
5.3 Example Thread Policy Managers
To date, several different thread scheduling strategies have been implemented and tested.
? Global LIFO - This scheduler has one global queue in a lifo order.
? Global FIFO - This scheduler has one global queue in a fifo order.
? Local LIFO 1 Queue - This scheduler has one local queue per VP that is ordered in a lifo manner.
? Local FIFO 1 Queue - This scheduler has one local queue per VP that is ordered in a fifo manner.
? Local LIFO 2 Queue - This scheduler has two local queues per VP. One contains threads that have never been run and the other contains threads that have been. Both queues are ordered in a lifo manner.
? Local FIFO 2 Queue - This scheduler has three local queues per VP. One contains threads that have never been run, one contains threads that have been and were running, and the third contains threads that were blocked. All three queues are ordered in a fifo manner.
? Local LIFO 3 Queue - This scheduler has three local queues per VP. One contains threads that have never been run, one contains threads that have been and were running, and the third contains threads that were blocked. All three queues are ordered in a lifo manner.
Each of these thread policy managers were implemented with less than 230 lines of code. 160 of those lines were in a library shared by all of them. Given this library each schedular was implemented with less than 70 lines of code. We think this demonstrates how easily a thread policy manager can be customized. Benchmarks of these policy managers are discussed in the next section.
Each of these thread policy managers that use a local queue can have either of two initial thread placement strategies: round robin or random. Round robin placement uses a global counter that is used to determine the VP onto which the thread should be mapped. Round robin placement does a reasonably good job of load balancing, but the
global counter is a potential bottleneck. Random placement is done using a random number generator. It gives a reasonably balanced load without the contention of a global resource. One potential weakness with each of these strategies is that they completely ignore the topology of the physical machine.
These thread policy managers are provided as part of a library of standard thread policy managers that are delivered with the system. Thus, most Sting users will not find it necessary to implement a policy manager, rather they will simply load the appropriate policy manager for their application from the library. The standard policy mangers implement not only scheduling decisions, but also initial placement and migrations decisions.
5.4 Migration Decisions
The final issue in thread policy manager design concerns strategies for thread migration. Thread can be migrated for various reasons. The most common is to balance the load on a virtual machine across the different processors. But there are two other important reasons for migrating a thread from one VP to another. The first is performance. It may be much more efficient to move a thread closer to the resources it is using, whenever possible. The second reason is reliability. If a processor fails, then a checkpointed thread can be migrated to another processor and resumed there.
As with the other policy decisions there are many possible strategies for migrating threads. We discuss a few that relate to load balancing below.
Migration for load balancing usually occurs when a VP is idle. This is because there is little or no overhead1 in having an idle VP search for work to do, since it would not be doing work otherwise. The strategy does have to be careful to ensure that a VP does not continue to search for work if new work has arrived in the VP?s ready queue(s).
Random Search - This strategy entails picking a VP at random and grabbing some number of threads, usually half, from it?s queue(s). This strategy can be tuned in several way. The thread policy manager may try to migrate threads that have not yet started evaluation first. In order to reduce the cost of migration. If there are no threads that have not started evaluation then the policy manager may choose to migrate threads that are evaluating, or to look for another processor with threads that have not yet started evaluating. The advantage of this strategy is that the amortized cost of finding work is good. The disadvantage is that it may decrease the locality of the threads in the virtual machine.
Ordered Search - Another strategy is for the thread policy manager to conduct an ordered search starting with VP?s that are close to it and gradually moving to those that are further away. This approach has the advantage that it has a better chance of maintaining the locality of threads.
Load Based - If a queue of processors ordered by their load is being kept to improve the initial mapping of threads to VPs, then it can be used by an idle processor to find the most heavily loaded processor and take some, again usually
1. There is the potential to create more contention for shared resources however.
half, of its threads. Because there are many possible ways of measuring the ?load? on a virtual processor, the thread policy manager is responsible for calculating it in a manner suitable to the policy being implemented.
There are many other possible strategies for using migration to balance the work on a machine and improve performance. The appropriate strategy will depend not only on a particular program, but also on the memory hierarchy and the communications topology of the physical machine.
6 Benchmark Results
We have benchmarked six different thread policy managers running a program called Abisort.1 Abisort performs an ?adaptive? bitonic sort [BN89] of n = 16,384 numbers. The adaptive algorithm achieves optimal complexity (O(n log n) rather than the O(n log2 n) of the standard bitonic sort algorithm) by storing bitonic sequences in a special tree data structure. Adaptive bitonic sort performs about twice as many comparisons as a merge sort, and has somewhat greater bookkeeping costs. However, its parallel divide-and-conquer merge operation allows virtually linear speedup when n >> p. Such speedup is not possible with straightforward implementations (on MIMD machines) of other divide-and-conquer sorts such as merge sort and quicksort which contain significant sequential phases. Abisort creates a tree containing 106497 threads.
The program run is exactly the same for each policy manager, i.e. the source is not changed and it is not recompiled. The following thread policy managers were tested:
GLIFO - A policy manager that uses one global queue for scheduling all the threads on virtual processors. The queue is organized in a last in first out manner.
GFIFO - A policy manager that uses one global queue for scheduling all the threads on virtual processors. The queue is organized in a first in first out manner.
L1 LIFO Random - A policy manager that uses one local ready queue on each virtual processor. The queue is organized in a last in first out manner. The initial mapping of thread to virtual processor is done by picking a processor at random.
L1 LIFO Round Robin- A policy manager that uses one local ready queue on each virtual processor. The queue is organized in a last in first out manner. The initial mapping of thread to virtual processor in a round robin manner using a global counter access to which is synchronized.
L1 FIFO Random - A policy manager that uses one local ready queue on each virtual processor. The queue is organized in a first in first out manner. The initial mapping of thread to virtual processor is done by picking a processor at random.
L1 LIFO Round Robin - A policy manager that uses one local ready queue on each virtual processor. The queue is organized in a last in first out manner. The
1. The abisort we used is one of Mohr?s benchmarks and the description of the algorithm is largely taken from [Moh91].
initial mapping of thread to virtual processor in a round robin manner using a global counter access to which is synchronized.
The benchmarks were run on an eight processor Silicon Graphics PowerSeries 480 with 256 Mb of main memory. Each processor is a MIPS R3000 running at 40Mhz with a 64kb data cache and a 64kb instruction cache. Table 6-a through Table 6-f show the results of running abisort on a Sting virtual machine composed of eight virtual processors using the various policy managers. These tables give a processor by processor breakdown of various statistics.
The following statistics about threads are recorded by each VP:
Created - The number of thread created by the program.
Scheduled - The number of threads scheduled to evaluate.
Absorbed - The number of threads absorbed by other threads.
Blocked - The number of times a thread blocked for any reason.
Resumed - The number of times a blocked thread resumed execution.
Determined - The number of threads that determined (i.e. completed) a value.
Idle - The number of times the virtual processor had no work to do. When a virtual processor has no work it runs its root thread which increments this counter and then spins until there is more work to do.
The following statistics about thread execution contexts are recorded by each VP:
Created - The number of thread control blocked created. This occurs when the virtual processors pool of thread control blocks is empty and it must create a new thread control block.
Allocated - The number of thread control blocks allocated from the virtual processor ?s pool of thread control blocks.
Re-used - The number of thread control blocks that were reused because an unevaluating thread was proceeded by a thread which had just terminated.
The following statistics about mutexes are maintained by the system:
Created - The total number of mutexes created by the system.
Acquired - The number of times a mutex was acquired.
Released - The number of times a mutex was released.
An examination of the tables shows that each of the policy managers does a good job of scheduling threads evenly across the processors with the variance in load under 5% for all of them. The round robin schedulers distribute the threads with essentially no variance. However, this even distribution comes at the cost of acquiring an extra lock for each thread scheduled. All of the policy managers show good balance in other respects. The the number of threads absorbed by each processor is well balanced as is the amount of blocking and the number of times a processor goes idle.
Threads VP1 VP2 VP3 VP4 VP5 VP6 VP7 VP8 Total
Created 13274 13449 12997 13268 13385 13525 13251 13347 106497
Scheduled 13274 13449 12997 13268 13385 13525 13251 13347 106497
Absorbed 10889 11102 10600 10837 10895 11076 10938 10959 87296
Blocked 1472 1526 1547 1506 1514 1576 1491 1546 12178
Resumed 1500 1570 1568 1533 1491 1495 1520 1501 12178
Determined 13191 13527 13123 13209 13372 13443 13357 13275 106497
Idle 30 33 33 22 35 27 21 28 229
Execution Contexts
Created 50
Allocated 1164 1206 1226 1207 1177 1286 1187 1201 9654
Reused 1181 1191 1180 1214 1148 1191 1204 1238 9547
Mutexes
Created 13274 13449 12997 13268 13385 13525 13251 13347 106497
Acquired 33351 34073 33363 33524 33540 34156 33611 33729 269347
Released 33351 34073 33363 33524 33540 34156 33611 33729 269347
Table 6-a: Abisort with Global LIFO Policy
Threads VP1 VP2 VP3 VP4 VP5 VP6 VP7 VP8 Total
Created 13541 13370 13167 13510 13220 13155 13198 13335 106497
Scheduled 13541 13370 13167 13510 13220 13155 13198 13335 106497
Absorbed 13512 13335 13128 13468 13189 13127 13171 13288 106218
Blocked 29 30 27 40 34 30 25 36 251
Resumed 50 26 21 25 39 35 27 28 251
Determined 13572 13364 13148 13493 13244 13161 13200 13315 106497
Idle 21 25 25 22 19 18 27 25 182
Execution Contexts
Created 38
Allocated 18 21 21 22 20 17 19 21 159
Reused 20 12 13 14 13 18 15 15 120
Mutexes
Created 13541 13370 13167 13510 13220 13155 13198 13335 106497
Acquired 27259 26849 26412 27143 26607 26448 26507 26775 214000
Released 27259 26849 26412 27143 26607 26448 26507 26775 214000
Table 6-b: Abisort with Global FIFO Policy
Threads VP1 VP2 VP3 VP4 VP5 VP6 VP7 VP8 Total
Created 13227 13259 13396 13240 13499 13332 13347 13196 106497
Scheduled 13313 13312 13312 13312 13312 13312 13312 13312 106497
Absorbed 10889 11102 10600 10837 10895 11076 10938 10959 87296
Blocked 251 254 258 237 256 243 237 235 1971
Resumed 244 247 267 226 272 244 238 233 1971
Determined 13163 13339 13410 13184 13546 13413 13326 13116 106497
Idle 80 69 75 42 59 85 57 51 518
Execution Contexts
Created 36
Allocated 257 254 264 237 265 250 246 237 2010
Reused 55 51 52 44 46 68 45 37 398
Mutexes
Created 13227 13259 13396 13240 13499 13332 13347 13196 106497
Acquired 27386 27696 27906 27350 28088 27810 27647 27228 221111
Released 27386 27696 27906 27350 28088 27810 27647 27228 221111
Table 6-c: Abisort with Local LIFO and Round Robin
Threads VP1 VP2 VP3 VP4 VP5 VP6 VP7 VP8 Total
Created 13340 13444 13244 13238 13269 13304 13375 13282 106497
Scheduled 13313 13312 13312 13312 13312 13312 13312 13312 106497
Absorbed 13264 13391 13164 13140 13193 13262 13298 13236 105948
Blocked 48 61 56 48 42 57 52 56 420
Resumed 28 59 47 49 64 64 50 59 420
Determined 13296 13458 13211 13198 13260 13366 13381 13327 106497
Idle 22 11 16 23 21 13 22 21 149
Execution Contexts
Created 38
Allocated 40 42 39 35 29 43 43 41 306
Reused 22 31 28 26 30 30 30 34 243
Mutexes
Created 13340 13444 13244 13238 13269 13304 13375 13282 106497
Acquired 26798 27166 26659 26594 26707 26942 26962 26867 214695
Released 26798 27166 26659 26594 26707 26942 26962 26867 214695
Table 6-d: Abisort with Local FIFO and Round Robin
Threads VP1 VP2 VP3 VP4 VP5 VP6 VP7 VP8 Total
Created 13332 13322 13258 13191 13309 13332 13440 13312 106497
Scheduled 13318 13400 13233 13096 13297 13504 13023 13254 106497
Absorbed 12835 12800 12709 12727 12867 12845 13023 12745 102551
Blocked 397 405 401 384 411 433 365 415 3211
Resumed 371 359 420 417 385 432 402 425 3211
Determined 13277 13234 13287 13264 13351 13381 13479 13224 106497
Idle 170 115 164 191 175 120 144 122 1201
Execution Contexts
Created 52
Allocated 419 400 411 408 432 445 378 410 3303
Reused 80 95 89 82 82 76 64 75 643
Mutexes
Created 13332 13322 13258 13191 13309 13332 13440 13312 106497
Acquired 28265 28233 28238 28167 28450 28563 28538 28204 226658
Released 28265 28233 28238 28167 28450 28563 28538 28204 226658
Table 6-e: Abisort with Local LIFO and Random
Threads VP1 VP2 VP3 VP4 VP5 VP6 VP7 VP8 Total
Created 13326 13398 13370 13342 13233 13344 13371 13311 106497
Scheduled 13302 13388 13229 13106 13305 13512 13413 13112 106497
Absorbed 13260 13272 13275 13257 13117 13253 13292 13017 105743
Blocked 81 80 78 72 82 77 75 63 608
Resumed 78 91 97 66 78 56 91 51 608
Determined 13366 13385 13392 13314 13217 13322 13425 13077 106497
Idle 43 27 16 35 19 24 27 15 206
Execution Contexts
Created 40
Allocated 61 57 54 51 54 58 44 43 422
Reused 45 40 40 41 48 32 54 33 333
Mutexes
Created 13326 13398 13370 13342 13233 13344 13371 13311 106497
Acquired 27059 27094 27092 26953 26764 26959 27133 26412 215466
Released 27059 27094 27092 26953 26764 26959 27133 26412 215466
Table 6-f: Abisort with Local FIFO and Random
Table 6-g show a comparison of the results for the six systems tested. This table shows that for abisort a fifo strategy works better than the lifo strategy and that the global queue works much better than the local queues. The thread policy manager that works best, GFIFO, as we might expect, has less idle time, less blocking, and creates fewer execution contexts than any of the others. The one that performs worst, L1 LIFO Random, not surprisingly has much more blocking, more idle time, and creates more execution contexts than the others.
The machine we ran these benchmarks on is a physically shared memory machine and the results of these policy managers show less variance than they might on a physically disjoint memory machine. Abisort is only one benchmark; to understand the behavior of various thread policy managers many more benchmarks need to be studied. Finally, there are many other thread policy managers that we would like to test in the future.
7 Virtual Processor Management
As mentioned above there is a second level of policy management in the Sting operating system, i.e. kernel or virtual processor policy management. Each physical
Threads G
LIFO
G
FIFO
L
LIFO
Random
L
LIFO
RR
L
FIFO
Random
L
FIFO
RR
Created 106497 106497 106497 106497 106497 106497
Scheduled 106497 106497 106497 106497 106497 106497
Absorbed 87296 106218 102551 87296 105743 105948
Blocked 12178 251 3211 1971 608 420
Resumed 12178 251 3211 1971 608 420
Determined 106497 106497 106497 106497 106497 106497
Idle 229 182 1201 518 206 149
Execution Contexts
Created 50 38 52 36 40 38
Allocated 9654 159 3303 2010 422 306
Reused 9547 120 643 398 333 243
Mutexes
Created 106497 106497 106497 106497 106497 106497
Acquired 269347 214000 226658 221111 215466 214695
Released 269347 214000 226658 221111 215466 214695
Timing (secs)
14.77 11.40 16.94 14.76 14.14 14.68
Table 6-g: Abisort with Various TPMs
processor abstraction is composed of a virtual processor controller (VPC) and a virtual processor policy manager (VPPM). The relationship between the VP controller and the VP policy manager is similar to that between the thread controller and the thread policy manager, i.e. the VP controller is a client of the VP policy manager. Whenever the VP controller needs to make a policy decision it calls the VP policy manager to make that decision.
While all physical processor runs the same VP controller, they can run different VP policy managers. This allows a multiprocessor system to customize the system?s use of each physical processor. It is also possible for the system to run the same VP policy manager on each of the physical processors.
When a virtual machine wishes to schedule a virtual processor on an abstract physical processor it calls the virtual processor controller on that physical processor. Likewise, when a virtual machine wishes to remove a virtual processor from an abstract physical processor it calls the virtual processor controller on that physical processor. Each VP controller manages the virtual processors which are mapped onto its physical processor, including all virtual processor state changes.
The VP policy manager makes all policy decisions relating to the scheduling and migration of virtual processors on physical processors. There are three types of decision: First it determines the VP to PP map. This mapping takes place at two distinct times, when the VP is run for the first time and when a VP which has been blocked is rerun. Second, the policy manager also determines the order in and duration for which VPs on a PP are run. Finally, the VP policy manager decides when a VP should be moved (migrated) from one processor to another.
These three decisions allow the VP policy manager to balance the work load on a machine and determine the fairness properties of the physical machine with respect to virtual machines. They also allow VP policy managers to decide where to move the VPs of a fault tolerant VM when a physical processor fails.
Like the thread policy manager the VP policy manager presents a well-defined interface to the VP controller. The data structures which the VP policy manager uses to make its decisions are completely private to it. These data structures may be local to a particular VP policy manager or shared among the various instances of the VP policy manager, or some combination thereof, but no other component of the system has access to them. The VP policy manager can be customized to provide different behaviors to different instances of Sting. This functionality allows it to be customized for different operating system environments as diverse as real time, interactive, or computationally intensive systems.
The principle difference between the thread policy manager and the VP policy manager is that the VP policy manger must be built into the physical machine, i.e. kernel. This means that it must be linked into the operating system when it is built.
Finally, while the thread policy manger is concerned with load balancing and fairness among threads, the virtual processor policy manager is concerned with load balancing and fairness among virtual machines and virtual processors.
8 Conclusion
The fundamental contribution of this work is the development of simple and efficient abstractions that allow the separation of policy management from control in an operation system. Although our system is implemented in Scheme the abstractions developed could be implemented in a more traditional language such as C. It would be relatively easy to add these abstractions to a more traditional operating system such as Unix.
A second important contribution of this work lies in providing the user with the ability to customize the policy management of threads. The ability to customize policy management at the virtual machine level allows the user/programmer to exploit memory and communications properties of a particular algorithm. It also allows the user/ programmer to exploit memory and communications properties when porting a program from one hardware architecture to another without modifying the program itself.
The approach to policy management described in this paper has proven to be extremely flexible. The interfaces to the policy managers are simple and implementing the various policy managers described was easy, requiring less than a day each to code and debug them.
It is unfortunate that we only had a shared memory multiprocessor at our disposal for testing these policy managers. We believe that policy management issues are much more interesting and important on disjoint memory machines. We are currently building a distributed shared memory machine using high performance workstations connected with a fast interconnect, on which to test this hypothesis.
9 References
[BN89] G. Bilardi and A. Nicolau. Adaptive Bitonic Sorting: An Optimal Parallel Algorithm for Shared Memory Machines. SIAM J. Computing, 18:2, April 1989, pages 216-228.
[CDD+91] Mark Crovella, Prakash Das, Czarek Dubnicki, Thomas LeBlanc, Evangelos Markatos. Multiprogramming on Multiprocessors. Technical Report 385, Computer Science Department, University of Rochester, May 1991.
[Hal85] Robert Halstead. Multilisp: A Language for Concurrent Symbolic Computation. Transactions on Programming Languages and Systems, 7(4):501-538, October 1985.
[HF84] Christopher T. Haynes and Danial P. Friedman. Engines Build Process Abstractions. Proceedings of the 1984 ACM Lisp and Functional Programming Conference, pages 18-24, 1984.
[Jag91] Suresh Jagannathan. Customization of First Class Tuple Spaces in a Higher Order Language. Conference on Parallel Languages and Architectures Europe, pages 254-276, June 1991.
[JP92a] Suresh Jagannathan and James Philbin. A Foundation for an Efficient Multi-Threaded Scheme System. Proceedings of the 1992 Conference on Programming Language Design and Implementation, June 1992.
[JP92b] Suresh Jagannathan and James Philbin. STING: A Customizable Substrate for Concurrent Symbolic Computing. Technical Report 91-003-3- 0050-1, NEC Research Institute, 1992.
[MP90] Henry Massalin and Calton Pu. Fine-Grain Adaptive Scheduling using Feedback. Computing Systems, 3:1, Winter 1990.
[Moh91] Eric Mohr. Dynamic Partitioning of Parallel Lisp Programs. Technical Report YaleU/DCS/RR-869, Yale University, October 1991.
[MVZ90] Cathy McCann, Raj Vaswani, and John Zahorjan. A Dynamic Processor Allocation Policy for Multiprogrammed, Shared Memory Multiprocessors. Technical Report 90-03-02, Department of Computer Science, University of Washington, March 1990 (revised February 1991).
[Phi93] James Philbin. The Design of an Operating System for Modern Programming Languages. PhD thesis, Department of Computer Science, Yale University, 1993.
{RA82] Jonathan A. Rees and Norman I. Adams. T : A Dialect of Lisp or, LAMBDA: The Ultimate Software Tool. Proceedings of the ACM Symposium on Lisp and Functional Programming, pages 114-122, 1982.
[TG89] Andrew Tucker and Anoop Gupta. Process Control and Scheduling Issures for Multiprogrammed Shard-Memory Multiprocessors. Proceedings of the Twelfth ACM Symposium on Operating System Principles, December, 1989.
[WLH81] William Wulf, Roy Levin, and Samuel Harbison. HYDRA/C.mmp: An Experimental Computer System. McGraw-Hill, 1991.
10 Acknowledgments
Sting was designed jointly by the author and Suresh Jagannathan. Alvaro Campos implemented several of the early policy managers.