%A M. L. Scott %A T. J. LeBlanc %T Psyche: A General-Purpose Operating System for Shared-Memory Multiprocessors %R BPR 19, TR 223 %K bpr19 tr223 %I URCSD %D July 1987 %X The Psyche project at the University of Rochester aims to develop a high-performance operating system to support a wide variety of models for parallel programming on a shared-memory multiprocessor. It is predicated on the assumption that no one model of process state or style of communication will prove appropriate for all applications, but that a shared-memory machine can and should support all models. Through a system of keys and access lists, incremental changes to address spaces, and direct execution of remote operations, Psyche will allow individual applications to balance the conflicting goals of efficiency, flexibility, and security. A pilot implementation is under construction on the BBN Butterfly Parallel Processor. %X This paper is superseded by the 1988 ICPP paper, and does not appear in this directory. %T Bridge: A High-Performance File System for Parallel Processors %A P. C. Dibble %A M. L. Scott %A C. S. Ellis %K dcs %J PROC of the Eighth ICDCS %P 154-161 %D 13-17 June 1988 %C San Jose, CA %X Faster storage devices cannot solve the I/O bottleneck problem for large multiprocessor systems if data passes through a file system on a single processor. Implementing the file system as a parallel program can significantly improve performance. Selectively revealing this parallel structure to utility programs can produce additional improvements, particularly on machines in which interprocessor communication is slow compared to aggregate I/O bandwidth. %X We have designed and prototyped a parallel file system that distributes each file across multiple storage devices and processors. 
Naive programs are able to access files just as they would with a conventional file system, while more sophisticated programs may export pieces of their code to the processors managing the data, for optimum performance. Theoretical and early empirical data indicate nearly linear speedup on critical operations for up to several dozen devices. %X 88.ICDCS.Bridge.ps.Z Copyright 1988, IEEE %A T. J. LeBlanc %A M. L. Scott %A C. M. Brown %T Large-Scale Parallel Programming: Experience with the BBN Butterfly Parallel Processor %J PROC of the First PPEALS %C New Haven, CT %D 19-21 July 1988 %P 161-172 %O Also available as BPR 22, URCSD, September 1988 %X For three years, members of the Computer Science Department at the University of Rochester have used a collection of BBN Butterfly\*[TM] Parallel Processors to conduct research in parallel systems and applications. For most of that time, Rochester's 128-node machine has had the distinction of being the largest shared-memory multiprocessor in the world. In the course of our work with the Butterfly we have ported three compilers, developed five major and several minor library packages, built two different operating systems, and implemented dozens of applications. Our experience clearly demonstrates the practicality of large-scale shared-memory multiprocessors with non-uniform memory access times. It also demonstrates that the problems inherent in programming such machines are far from adequately solved. Both locality and Amdahl's law become increasingly important with a very large number of nodes. The availability of multiple programming models is also a concern; truly general-purpose parallel computing will require the development of environments that allow programs written under different models to coexist and interact. 
Most important, there is a continuing need for high-quality programming tools; widespread acceptance of parallel machines will require the development of programming environments comparable to those available on sequential computers. %X This paper contains several cut-and-paste figures, and is not easily placed in ftp-able form; it doesn't appear in this directory. Copyright 1988, ACM %A M. L. Scott %A T. J. LeBlanc %A B. D. Marsh %T Design Rationale for Psyche, a General-Purpose Multiprocessor Operating System %J PROC of the 1988 ICPP %C St. Charles, IL %D 15-19 August 1988 %P 255-262 %V II \(mi Software %O Also published in the \f2University of Rochester 1988-89 Computer Science and Computer Engineering Research Review\fP. %X The Psyche project at the University of Rochester aims to develop a high-performance operating system to support a wide variety of models for parallel programming. It is predicated on the conviction that no one model of process state or style of communication will prove appropriate for all applications, but that shared-memory multiprocessors (particularly the scalable ``NUMA'' variety) can and should support all models. Conventional approaches, such as shared memory or message passing, can be regarded as points on a continuum that reflects the degree of sharing between processes. Psyche facilitates dynamic sharing by providing a user interface based on passive data abstractions in a uniform virtual address space. It ensures that users pay for protection only when it is required by permitting lazy evaluation of protection policies implemented with keys and access lists. The data abstractions define conventions for sharing the uniform address space; the tradeoff between protection and performance determines the degree to which those conventions are enforced. In the absence of protection boundaries, access to a shared abstraction can be as efficient as a procedure call or a pointer dereference. 
%X The postscript in this directory has been reconstructed from old troff source, and doesn't precisely match the formatting of the version in the conference proceedings. The references have also been re-built from a newer database, and may not be exactly the same. 88.ICPP.Psyche_Rationale.ps.Z Copyright 1988, The Pennsylvania State University %A M. L. Scott %A T. J. LeBlanc %A B. D. Marsh %T Evolution of an Operating System for Large-Scale Shared-Memory Multiprocessors %K psyche tr309 %R TR 309 %I URCSD %D March 1989 %X Scalable shared-memory multiprocessors (those with non-uniform memory access times) are among the most flexible architectures for high-performance parallel computing, admitting efficient implementations of a wide range of process models, communication mechanisms, and granularities of parallelism. Such machines present opportunities for general-purpose parallel computing that cannot be exploited by existing operating systems, because the traditional approach to operating system design presents a virtual machine in which the definition of processes, communication, and grain size are outside the control of the user. Psyche is an operating system designed to enable the most effective use possible of large-scale shared memory multiprocessors. The Psyche project is characterized by (1) a design that permits the implementation of multiple models of parallelism, both within and among applications, (2) the ability to trade protection for performance, with information sharing as the default, rather than the exception, (3) explicit, user-level control of process structure and scheduling, and (4) a kernel implementation that uses shared memory itself, and that provides users with the illusion of uniform memory access times. %X The postscript here was reconstructed from old troff source, and does not match the formatting of the hard-copy TR. 
In particular, the bibliography has been re-built from a newer database, and in several cases cites newer versions of papers -- versions that postdate the TR. 89.TR309.Psyche_Evolution.ps.Z %A T. J. LeBlanc %A B. D. Marsh %A M. L. Scott %T Memory Management for Large-Scale NUMA Multiprocessors %K psyche tr311 %R TR 311 %I URCSD %D March 1989 %X Large-scale shared-memory multiprocessors such as the BBN Butterfly and IBM RP3 introduce a new level in the memory hierarchy: multiple physical memories with different memory access times. An operating system for these NUMA (NonUniform Memory Access) multiprocessors should provide traditional virtual memory management, facilitate dynamic and widespread memory sharing, and minimize the apparent disparity between local and nonlocal memory. In addition, the implementation must be scalable to configurations with hundreds or thousands of processors. %X This paper describes memory management in the Psyche multiprocessor operating system, under development at the University of Rochester. The Psyche kernel manages a multi-level memory hierarchy consisting of local memory, nonlocal memory, and backing store. Local memory stores private data and serves as a cache for shared data; nonlocal memory stores shared data and serves as a disk cache. The system structure isolates the policies and mechanisms that manage different layers in the memory hierarchy, so that customized data structures and policies can be constructed for each layer. Local memory management policies are implemented using mechanisms that are independent of the architectural configuration; global policies are implemented using multiple processes that increase in number as the architecture scales. Psyche currently runs on the BBN Butterfly Plus multiprocessor. %X This paper describes an ambitious early design for a VM system for Psyche. 
Our plans for VM were scaled back substantially after implementation was about half completed, due to changing interests of graduate students. Further discussion can be found in the WEBDMS paper. 89.TR311.Psyche_VM.ps.Z %A M. L. Scott %A T. J. LeBlanc %A B. D. Marsh %T A Multi-User, Multi-Language Open Operating System %K Psyche %J PROC of the Second Workshop on Workstation Operating Systems %C Pacific Grove, CA %D 27-29 September 1989 %P 125-129 %X An open operating system provides a high degree of programming flexibility and efficiency. At the same time, it generally requires that all programs be written in a single language, and provides no protection other than that which is available from the compiler. These limitations may be tolerated by the user of a personal computer, but they become unacceptable on a workstation that must run untrusted software written in many different languages. Psyche is an operating system designed to make the most effective possible use of shared-memory multiprocessors and uniprocessor machines. It combines the flexibility of an open operating system with the ability to write in multiple languages and to establish solid protection boundaries. It also provides the efficiency of an open operating system for programs that do not require protection. %X A 5-page position paper. 89.WWOS.Psyche_Open_System.ps.Z Copyright 1989, IEEE %A P. C. Dibble %A M. L. Scott %T Beyond Striping: The Bridge Multiprocessor File System %J Computer Architecture News %V 17 %N 5 %D September 1989 %P 32-39 %X High-performance parallel computers require high-performance file systems. Exotic I/O hardware will be of little use if file system software runs on a single processor of a many-processor machine. We believe that cost-effective I/O for large multiprocessors can best be obtained by spreading both data and file system computation over a large number of processors and disks. 
To assess the effectiveness of this approach, we have implemented a prototype system called Bridge, and have studied its performance on several data-intensive applications, among them external sorting. A detailed analysis of our sorting algorithm indicates that Bridge can profitably be used on configurations in excess of one hundred processors with disks. Empirical results on a 32-processor implementation agree with the analysis, providing us with a high degree of confidence in this prediction. Based on our experience, we argue that file systems such as Bridge will satisfy the I/O needs of a wide range of parallel architectures and applications. %X 89.CAN.Bridge.ps.Z %A W. J. Bolosky %A R. P. Fitzgerald %A M. L. Scott %T Simple But Effective Techniques for NUMA Memory Management %J PROC of the Twelfth SOSP %C Litchfield Park, AZ %D 3-6 December 1989 %P 19-31 %O In \f2\&OSR 23\fP:5 %X Multiprocessors with non-uniform memory access times introduce the problem of placing data near the processes that use them, in order to improve performance. We have implemented an automatic page placement strategy in the Mach operating system on the IBM ACE multiprocessor workstation. Our experience indicates that even very simple automatic strategies can produce nearly optimal page placement. It also suggests that the greatest leverage for further performance improvement lies in reducing {\em false sharing}, which occurs when the same page contains objects that would best be placed in different memories. %X 89.SOSP.ACE_NUMA.ps.Z Copyright 1989, ACM %A M. L. Scott %A T. J. LeBlanc %A B. D. Marsh %A T. G. Becker %A C. Dubnicki %A E. P. Markatos %A N. G. Smithline %T Implementation Issues for the Psyche Multiprocessor Operating System %K psyche WEBDMS %J Computing Systems %V 3 %N 1 %D Winter 1990 %P 101-137 %O Earlier version presented at the \f2First Workshop on Experiences with Building Distributed and Multiprocessor Systems\fP, Ft. 
Lauderdale, FL, 5-6 October, 1989, 227-236 %X Psyche is a parallel operating system under development at the University of Rochester. The Psyche user interface is designed to allow programs with widely differing concepts of process, sharing, protection, and communication to run efficiently on the same machine, and to interact productively. In addition, the Psyche development effort is addressing a host of implementation issues for large-scale shared-memory multiprocessors, including the organization of kernel functions, data structures, and address maps for machines with non-uniform memory; the migration and replication of pages to maximize locality; the introduction of user-level device drivers, memory management, and scheduling; and remote source-level kernel debugging. We focus in this paper on our implementation of Psyche for the BBN Butterfly Plus multiprocessor, though many of the issues we consider apply to any operating system kernel on a large-scale shared-memory machine. We describe our major design decisions, the results of our initial experience with the implementation, and our plans for continued evaluation and experimentation with kernel implementation techniques. %X The postscript in this directory has been reconstructed from old troff source; the references have been re-built from a newer database, and may not be exactly right. 90.Comp_Sys.Psyche_Impl.ps.Z Copyright 1990, The Regents of the University of California %A M. L. Scott %A T. J. LeBlanc %A B. D. Marsh %T Multi-Model Parallel Programming in Psyche %J PROC of the Second PPOPP %C Seattle, WA %D 14-16 March, 1990 %P 70-78 %O In \f2\&ACM SIGPLAN Notices 25\fP:3 %X Many different parallel programming models, including lightweight processes that communicate with shared memory and heavyweight processes that communicate with messages, have been used to implement parallel applications. Unfortunately, operating systems and languages designed for parallel programming typically support only one model. 
Multi-model parallel programming is the simultaneous use of several different models, both across programs and within a single program. This paper describes multi-model parallel programming in the Psyche multiprocessor operating system. We explain why multi-model programming is desirable and present an operating system interface designed to support it. Through a series of three examples, we illustrate how the Psyche operating system supports different models of parallelism and how the different models are able to interact. %X The postscript in this directory has been reconstructed from old troff source, and may not match the conference proceedings precisely; the references have been re-built from a newer database, and may not be exactly right. 90.PPOPP.Multi_Model_Psyche.ps.Z Copyright 1990, ACM %A T. J. LeBlanc %A E. P. Markatos %T Operating System Support for Adaptable Real-Time Systems %J PROC of the Seventh IEEE Workshop on Real-Time Operating Systems and Software %C Charlottesville, VA %D May 1990 %P 1-10 %K multi-model real-time satisfice user-level scheduling %O In \f2IEEE Real-Time Systems Newsletter\fP %X This paper outlines our plans for a real-time systems research program to support the long-term goal of developing intelligent robots. The distinguishing characteristic of our approach to real-time systems is an emphasis on system \fIadaptability\fP in a dynamic real-world environment. We achieve adaptability by allowing multiple real-time process models, with different known properties and timing constraints, to coexist within a single system and application. Using a hardware architecture in which a large-scale multiprocessor controls a behavioral system with vision and manipulation capabilities, we are constructing a prototype software environment that provides a predictable schedule for \fIreflexive\fP robot tasks and a flexible environment for implementing adaptive \fIcognitive\fP robot tasks. 
We will exploit multiple processors, user-level scheduling, and user-defined process and communication models in the construction of real-time robotics applications. %X 90.RTOSS.Adaptable_Real_Time.ps.Z %A P. C. Dibble %A M. L. Scott %T The Parallel Interleaved File System: A Solution to the Multiprocessor I/O Bottleneck %K bridge %J IEEETPDS %D to appear, pending revision %X Parallel computers with non-parallel file systems are limited by the performance of the processor running the file system. We have designed and implemented a parallel file system called Bridge that eliminates this problem by spreading both data and file system computation over a large number of processors and disks. To assess the effectiveness of Bridge we have used it to implement several data-intensive applications, including a parallel external merge sort. The merge sort is a particularly demanding application; it requires significant amounts of interprocessor communication and data movement. A detailed analysis of this application indicates that Bridge can profitably be used on configurations in which disks are attached to more than 150 processors. Empirical results on a 32-processor implementation agree with the analysis, providing us with a high degree of confidence in this prediction. Based on our experience, we argue that file systems such as Bridge will satisfy the I/O needs of a wide range of parallel architectures and applications. %X This paper has been through one round of reviewing for IEEE TPDS, and is currently undergoing revision. The postscript in this directory is the version submitted to the journal in May of 1990. 90.TPDS.Bridge.ps.Z %A M. L. 
Scott %T The Lynx Distributed Programming Language: Motivation, Design, and Experience %K tr308 %J COMPLANG %V 16 %N 3/4 %D 1991 %P 209-233 %O Earlier version published as TR 308, ``An Overview of Lynx,'' URCSD, August 1989 %X A programming language has clear advantages over a library package for communication between processes in a distributed environment. Unfortunately, most existing distributed languages are better suited to communication between the processes of a single application than they are to communication between processes that are developed independently. Such independent development is characteristic both of the systems software for multicomputers and of the applications software for geographically-distributed networks. Lynx is a message-based language designed to support both application and system software in a single conceptual framework. It extends the advantages of high-level communication to processes designed in isolation, and compiled and placed in operation without knowledge of their peers. A virtual-circuit abstraction called the \fIlink\fP supports changes in communication topology at run time, under user control. A lightweight \fIthread\fP mechanism is integrated with message passing in a way that makes it easy for server processes to maintain context for interleaved conversations with an arbitrary number of clients. Experience with Lynx under two very different implementations has yielded important insights into the interface between a language run-time package and the underlying operating system, and has helped to identify the inherent costs of message-passing systems. %X This is the place to start if you're interested in Lynx. It cites, and summarizes, most of the other papers. The postscript here is the version from which the galleys were prepared. 91.Comp_Lang.Lynx.ps.Z Copyright 1991, Pergamon Press, plc %A L. A. Crowl %A T. J. 
LeBlanc %T Architectural Adaptability in Parallel Programming via Control Abstraction %I URCSD %R TR 359 %D January 1991 %K architectural independence closure Matroshka %X Parallel programming involves finding the potential parallelism in an application, choosing an algorithm, and mapping it to the architecture at hand. Since a typical algorithm has much more potential parallelism than any single architecture can effectively exploit, we usually program the parallelism that the available control constructs easily express and that the given architecture efficiently exploits. This approach produces programs that exhibit much less parallelism than the original algorithm and whose performance depends entirely on the underlying architecture. To port such a program to a new architecture, we must rewrite the program to remove any ineffective parallelism and to recover any lost parallelism appropriate for the new machine. %X In this paper we show how to adapt a parallel program to different architectures using control abstraction. With control abstraction we can define and use a rich variety of control constructs to represent an algorithm's potential parallelism. Since control abstraction separates the definition of a construct from its implementation, a construct may have several different implementations, each exploiting a different subset of the parallelism admitted by the construct. By selecting an implementation for each control construct using annotations, we can vary the parallelism we choose to exploit without otherwise changing the source code. This approach produces programs that exhibit most of, if not all, the potential parallelism in an algorithm, and whose performance can be tuned for a specific architecture simply by choosing among the various implementations for the control constructs in use. %X 91.TR359.Control_Abstraction.ps.Z %A J. M. Mellor-Crummey %A M. L. 
Scott %T Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors %K mcs lock MellorCrummeyScott90 %J TOCS %V 9 %N 1 %P 21-65 %D February 1991 %O Earlier version published as TR 342, URCSD, April 1990, and COMP TR90-114, Center for Research on Parallel Computation, Rice UNIV, May 1990 %X Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting tend to produce large amounts of memory and interconnect contention, introducing performance bottlenecks that become markedly more pronounced as applications scale. We argue that this problem is not fundamental, and that one can in fact construct busy-wait synchronization algorithms that induce no memory or interconnect contention. The key to these algorithms is for every processor to spin on separate {\em locally-accessible} flag variables, and for some other processor to terminate the spin with a single remote write operation at an appropriate time. Flag variables may be locally-accessible as a result of coherent caching, or by virtue of allocation in the local portion of physically distributed shared memory. %X We present a new scalable algorithm for spin locks that generates $O(1)$ remote references per lock acquisition, independent of the number of processors attempting to acquire the lock. Our algorithm provides reasonable latency in the absence of contention, requires only a constant amount of space per lock, and requires no hardware support other than a swap-with-memory instruction. We also present a new scalable barrier algorithm that generates $O(1)$ remote references per processor reaching the barrier, and observe that two previously-known barriers can likewise be cast in a form that spins only on locally-accessible flag variables. None of these barrier algorithms requires hardware support beyond the usual atomicity of memory reads and writes. 
%X We compare the performance of our scalable algorithms with other software approaches to busy-wait synchronization on both a Sequent Symmetry and a BBN Butterfly. Our principal conclusion is that {\em contention due to synchronization need not be a problem in large-scale shared-memory multiprocessors}. The existence of scalable algorithms greatly weakens the case for costly special-purpose hardware support for synchronization, and provides a case against so-called ``dance hall'' architectures, in which shared memory locations are equally far from all processors. %X The postscript here is the version from which the galleys were prepared. 91.TOCS.Scalable_Sync.ps.Z Copyright 1991, ACM %A M. Crovella %A P. Das %A C. Dubnicki %A T. LeBlanc %A E. Markatos %T Multiprogramming on Multiprocessors %R TR 385 %I URCSD %D February 1991 (revised May 1991) %K scheduling coscheduling time-slicing hardware partitions %X Several studies have shown that applications may suffer significant performance degradation unless the scheduling policy minimizes the overhead due to multiprogramming. This overhead includes context switching among applications, waiting time incurred by one process due to the preemption of another, and various migration costs associated with moving a process from one processor to another. Many different multiprogramming solutions have been proposed, but each has limited applicability or fails to address an important source of overhead. In addition, there has been little experimental comparison of the various solutions in the presence of applications with varying degrees of parallelism and synchronization. %X In this paper we explore the tradeoffs between different approaches to multiprogramming a multiprocessor. We modified an existing operating system to implement three different multiprogramming options: time-slicing, coscheduling, and dynamic hardware partitions. 
Using these three options, we implemented applications that vary in the degree of parallelism, and the frequency and type of synchronization. We show that in most cases coscheduling is preferable to time-slicing. We also show that although there are cases where coscheduling is beneficial, dynamic hardware partitions do no worse, and will often do better. We conclude that under most circumstances, hardware partitioning is the best strategy for multiprogramming a multiprocessor, no matter how much parallelism applications employ or how frequently synchronization occurs. %X 91.TR385.Multiprogramming_Multiprocessors.ps.Z %A J. M. Mellor-Crummey %A M. L. Scott %T Synchronization Without Contention %J PROC of the Fourth ASPLOS %C Santa Clara, CA %D 8-11 April 1991 %P 269-278 %O In \f2\&CAN 19\fP:2, \f2\&OSR 25\fP (special issue), and \f2\&ACM SIGPLAN Notices 26\fP:4 %X Conventional wisdom holds that contention due to busy-wait synchronization is a major obstacle to scalability and acceptable performance in large shared-memory multiprocessors. We argue the contrary, and present fast, simple algorithms for contention-free mutual exclusion, reader-writer control, and barrier synchronization. These algorithms, based on widely available fetch_and_\(*F instructions, exploit local access to shared memory to avoid contention. We compare our algorithms to previous approaches in both qualitative and quantitative terms, presenting their performance on the Sequent Symmetry and BBN Butterfly multiprocessors. Our results highlight the importance of local access to shared memory, provide a case against the construction of so-called ``dance hall'' machines, and suggest that special-purpose hardware support for synchronization is unlikely to be cost effective on machines with sequentially consistent memory. %X This file is a little ugly; the version for the conference proceedings was formatted for oversize mats. This reconstruction was made with the scaling facility of dvi2ps. 
Its references were regenerated in the reconstruction, and may not match the proceedings exactly. 91.ASPLOS.Contention-Free_Sync.ps.Z Copyright 1991, ACM %A W. J. Bolosky %A M. L. Scott %A R. P. Fitzgerald %A R. J. Fowler %A A. L. Cox %T NUMA Policies and Their Relation to Memory Architecture %J PROC of the Fourth ASPLOS %C Santa Clara, CA %D 8-11 April 1991 %P 212-221 %O In \f2\&CAN 19\fP:2, \f2\&OSR 25\fP (special issue), and \f2\&ACM SIGPLAN Notices 26\fP:4 %X Multiprocessor memory reference traces provide a wealth of information on the behavior of parallel programs. We have used this information to explore the relationship between kernel-based NUMA management policies and multiprocessor memory architecture. Our trace analysis techniques employ an off-line, optimal cost policy as a baseline against which to compare on-line policies, and as a policy-insensitive tool for evaluating architectural design alternatives. We compare the performance of our optimal policy with that of three implementable policies (two of which appear in previous work), on a variety of applications, with varying relative speeds for page moves and local, global, and remote memory references. Our results indicate that a good NUMA policy must be chosen to match its machine, and confirm that such policies can be both simple and effective. They also indicate that programs for NUMA machines must be written with care to obtain the best performance. %X 91.ASPLOS.NUMA_Policies.ps.Z Copyright 1991, ACM %A J. M. Mellor-Crummey %A M. L. Scott %T Scalable Reader-Writer Synchronization for Shared-Memory Multiprocessors %J PROC of the Third PPOPP %C Williamsburg, VA %D 21-24 April 1991 %P 106-113 %X Reader-writer synchronization relaxes the constraints of mutual exclusion to permit more than one process to inspect a shared object concurrently, as long as none of them changes its value. 
On uniprocessors, mutual exclusion and reader-writer locks are typically designed to de-schedule blocked processes; however, on shared-memory multiprocessors it is often advantageous to have processes busy wait. Unfortunately, implementations of busy-wait locks on shared-memory multiprocessors typically cause memory and network contention that degrades performance. Several researchers have shown how to implement scalable mutual exclusion locks that exploit locality in the memory hierarchies of shared-memory multiprocessors to eliminate contention for memory and for the processor-memory interconnect. In this paper we present reader-writer locks that similarly exploit locality to achieve scalability, with variants for reader preference, writer preference, and reader-writer fairness. Performance results on a BBN TC2000 multiprocessor demonstrate that our algorithms provide low latency and excellent scalability. %X 91.PPOPP.Read_Write_Sync.ps.Z Copyright 1991, ACM %A L. A. Crowl %T Architectural Adaptability in Parallel Programming %I URCSD %R Ph.\|D. thesis, TR 381 %D May 1991 %K architectural independence closure Matroshka Natasha %X To create a parallel program, programmers must decide what parallelism to exploit, and choose the associated data distribution and communication. Since a typical algorithm has much more potential parallelism than any single architecture can effectively exploit, programmers usually express only the exploitation of parallelism appropriate to a single machine. Unfortunately, parallel architectures vary widely. A program that executes efficiently on one architecture may execute badly, if at all, on another architecture. To port such a program to a new architecture, we must rewrite the program to remove any ineffective parallelism, to introduce any parallelism appropriate for the new machine, to re-distribute data and processing, and to alter the form of communication. 
%X Architectural adaptability is the ease with which programmers can tune or port a program to a different architecture. The thesis of this dissertation is that control abstraction is fundamental to architectural adaptability for parallel programs. With control abstraction, we can define and use a rich variety of control constructs to represent an algorithm's potential parallelism. Since control abstraction separates the definition of a construct from its implementation, a construct may have several different implementations, each providing different exploitations of parallelism. By using annotations to select an implementation for each use of a control construct, we can vary the parallelism we choose to exploit without otherwise changing the source code. %X We present Matroshka, a programming model that supports architectural adaptability in parallel programs through object-based data abstraction and closure-based control abstraction. Using the model, we develop several working example programs, and show that they adapt well to different architectures. We also outline a programming method based on abstraction. To show the implementation feasibility of our approach, we describe a prototype language based on Matroshka, describe its implementation, and compare the performance of the prototype with existing programs. %X 91.TR381.Crowl_thesis.ps.Z %A E. Markatos %A M. Crovella %A P. Das %A C. Dubnicki %A T. LeBlanc %T The Effects of Multiprogramming on Barrier Synchronization %K tr380 tree barriers %R TR 380 %I URCSD %D May 1991 %X One of the most common ways to share a multiprocessor among several applications is to give each application a set of dedicated processors. To ensure fairness, an application may receive fewer processors than it has processes. Unless an application can easily adjust the number of processes it employs during execution, several processes from the same application may have to share a processor.
In this paper we quantify the performance penalty that arises when more than one process from the same application runs on a single processor of a NUMA (NonUniform Memory Access) multiprocessor. We consider programs that use coarse-grain parallelism and barrier synchronization because they are particularly sensitive to multiprogramming. We quantify the impact of quantum size, frequency of synchronization, and type of barrier on the performance of an application. We conclude that dedicating processors to an application, even without migration or dynamic adjustment of the number of processes, is an effective scheduling policy even for programs that synchronize frequently using barriers. %X 91.TR380.Barrier_Multiprogramming.ps.Z %A E. P. Markatos %T Multiprocessor Synchronization Primitives with Priorities %J PROC of the Eighth IEEE Workshop on Real-Time Operating Systems and Software %D May 1991 %O Also published in the PROC of the 1991 IFAC workshop on Real-Time Programming, Pergamon Press %X Low-level multiprocessor synchronization primitives, such as spinlocks, are usually designed with little or no consideration of timing constraints, which makes them inappropriate for real-time systems. In this paper, we propose a new synchronization mechanism, the priority spinlock, that takes into account the priorities of the processes that want to acquire it, and favors high-priority processes. We define what a priority spinlock is, and propose two algorithms to implement priority spinlocks with local spinning. Priority spinlocks can be used to provide prioritized mutually exclusive access to shared resources in real-time multiprocessor systems. They can also be used as building blocks for higher-level priority synchronization primitives, such as priority semaphores. %X 91.RTOS.Priority_Spinlocks.ps.Z Copyright 1991, Pergamon Press, plc %A E. Chaves %A T. J. LeBlanc %A B. D. Marsh %A M. L.
Scott %T Kernel-Kernel Communication in a Shared-Memory Multiprocessor %K psyche tr368 %R submitted for publication %D 1991 %O Earlier version published as TR 368, URCSD, January 1991, and presented at the \f2Second Symposium on Distributed and Multiprocessor Systems\fP, Atlanta, GA, 21-22 March 1991, pp. 105-116. %X In the standard kernel organization on a shared-memory multiprocessor, all processors share the code and data of the operating system; explicit synchronization is used to control access to kernel data structures. Distributed-memory multicomputers use an alternative approach, in which each instance of the kernel performs local operations directly and uses remote invocation to perform remote operations. Either approach to inter-kernel communication can be used in a NonUniform Memory Access (NUMA) multiprocessor, although the performance and conceptual tradeoffs may not be apparent in advance. %X In this paper we compare the use of remote memory access and remote invocation in the kernel of a NUMA multiprocessor operating system. We discuss the issues and architectural features that must be considered when choosing an inter-kernel communication scheme, and describe a series of experiments on the BBN Butterfly Plus designed to empirically evaluate the tradeoffs between remote memory access and remote invocation. We conclude that the Butterfly architecture is biased towards the use of remote invocation for most kernel operations, but that a small set of frequently executed operations can benefit from the use of remote memory access. %X 91.SEDMS.Psyche_Kernel_Comm.ps.Z %A W. J. Bolosky %A M. L. Scott %T Evaluation of Multiprocessor Memory Systems Using Off-Line Optimal Behavior %K algorithm tracing trace based analysis %R TR 403 %I URCSD %D September 1991 %O Submitted for publication %X In recent years, much effort has been devoted to analyzing the performance of distributed memory systems for multiprocessors.
Such systems usually consist of a set of memories or caches, some device such as a bus or switch to connect the memories and processors, and a policy for determining when to put which addressable objects in which memories. In attempting to evaluate such systems, it has generally proven difficult to separate the performance implications of the hardware architecture from those of the policy that controls the hardware (whether implemented in software or hardware). In this paper we describe the use of off-line optimal analysis to achieve this separation. Using a trace-driven dynamic programming algorithm, we compute the policy decisions that would maximize overall memory system performance for a given program execution. The result allows us to eliminate the artifacts of any arbitrarily chosen policy when evaluating hardware performance, and provides a baseline against which to compare the performance of particular, realizable, policies. We illustrate this technique in the context of software-controlled page migration and replication, and argue for its applicability to other forms of multiprocessor memory management. %X 91.TR403.Optimal_Analysis.ps.Z %A B. D. Marsh %A M. L. Scott %A T. J. LeBlanc %A E. P. Markatos %T First-Class User-Level Threads %K psyche %J PROC of the Thirteenth SOSP %C Pacific Grove, CA %D 14-16 October 1991 %P 110-121 %O In \f2\&OSR 25\fP:5 %X It is often desirable, for reasons of clarity, portability, and efficiency, to write parallel programs in which the number of processes is independent of the number of available processors. Several modern operating systems support more than one process in an address space, but the overhead of creating and synchronizing kernel processes can be high. Many runtime environments implement lightweight processes (threads) in user space, but this approach usually results in second-class status for threads, making it difficult or impossible to perform scheduling operations at appropriate times (e.g. 
when the current thread blocks in the kernel). In addition, a lack of common assumptions may make it difficult for parallel programs or library routines that use dissimilar thread packages to communicate with each other, or to synchronize access to shared data. %X We describe a set of kernel mechanisms and conventions designed to accord \f2first-class status\fP to user-level threads, allowing them to be used in any reasonable way that traditional kernel-provided processes can be used, while leaving the details of their implementation to user-level code. The key features of our approach are (1) shared memory for asynchronous communication between the kernel and the user, (2) software interrupts for events that might require action on the part of a user-level scheduler, and (3) a scheduler interface convention that facilitates interactions in user space between dissimilar kinds of threads. We have incorporated these mechanisms in the Psyche parallel operating system, and have used them to implement several different kinds of user-level threads. We argue for our approach in terms of both flexibility and performance. %X 91.SOSP.Psyche_First_Class_Threads.ps.Z Copyright 1991, ACM %A M. L. Scott %A W. Garrett %T Shared Memory Ought to be Commonplace (position paper) %R submitted for publication %D November 1991 %X Shared memory as a programming abstraction is widely used within parallel applications. It is \f2not\fP widely used \f2between\fP applications, but we argue that it should be. Specifically, we suggest that shared memory is both faster and more intuitive than the principal alternatives in many cases, and that the general disuse of shared-memory facilities in systems such as Unix is due in large part to a simple lack of amenities, rather than to any fundamental limitation. We are pursuing a series of measures to make shared memory more convenient. %X 92.Shared_Memory.ps.Z %A W. J. Bolosky %A M. L.
Scott %T A Trace-Based Comparison of Shared Memory Multiprocessors Using Optimal Off-Line Analysis %K algorithm tracing trace based %R submitted for publication %D December 1991 %X There are three major classes of MIMD multiprocessors: cache-coherent machines, NUMA (non-uniform memory access) machines without cache coherence, and distributed-memory multicomputers. All three classes can be used to run shared-memory applications, though the third requires software support in order to do so, and the second requires software support in order to do so well. We use trace-driven simulation to compare the performance of these classes, in an attempt to determine the effect of various architectural features and parameters on overall program performance. For those systems whose hardware or software supports both coherent caching (migration, replication) and remote reference, we use optimal off-line analysis to make the correct decision in all cases. This technique allows us to evaluate architectural alternatives without worrying that the results may be biased by a poor data placement policy. We find that the size of the unit of coherence (page or cache line) is the dominant factor in performance; that NUMA systems can have performance comparable to that of cache-coherent machines; and that even relatively expensive, software-implemented remote reference is beneficial in distributed shared-memory machines. %X 91.Architectural_Comparison.ps.Z %A B. D. Marsh %A C. M. Brown %A T. J. LeBlanc %A M. L. Scott %A T. G. Becker %A P. Das %A J. Karlsson %A C. A. Quiroz %T Multi-Model Parallel Programming for Animate Vision %K checkers psyche %J IEEEC %V 25 %N 2 %D February 1992 %X Animate vision systems couple computer vision, robotics, and cognition. These activities need cooperative parallel processing at several granularities. The Psyche system provides this multi-model capability. %X The postscript in this directory is missing a black-and-white photo.
There will be minor editorial changes in the final Computer version. 92.Computer.Psyche_Animate_Vision.ps.Z Copyright 1992, IEEE %A B. D. Marsh %A C. M. Brown %A T. J. LeBlanc %A M. L. Scott %A T. G. Becker %A P. Das %A J. Karlsson %A C. A. Quiroz %T Operating System Support for Animate Vision %K tr374 checkers psyche %J JPDC %D to appear %O Earlier version published as TR 374, URCSD, June 1991 %X Animate vision systems couple computer vision and robotics to achieve robust and accurate vision, as well as other complex behavior. These systems combine low-level sensory processing and effector output with high-level cognitive planning -- all computationally intensive tasks that can benefit from parallel processing. A typical animate vision application will likely consist of many tasks, each of which may require a different parallel programming model, and all of which must cooperate to achieve the desired behavior. These \f2multi-model\fP programs require an underlying software system that not only supports several different models of parallel computation simultaneously, but also allows tasks implemented in different models to interact. %X This paper describes the Psyche multiprocessor operating system, which was designed to support multi-model programming, and the Rochester Checkers Player, a multi-model robotics program that plays checkers against a human opponent. Psyche supports a variety of parallel programming models within a single operating system by according first-class status to processes implemented in user space. It also supports interactions between programming models using model-independent communication, wherein different types of processes communicate and synchronize without relying on the semantics or implementation of a particular programming model.
The implementation of the Checkers Player, in which different parallel programming models are used for vision, robot motion planning, and strategy, illustrates the use of the Psyche mechanisms in an application program, and demonstrates many of the advantages of multi-model programming for animate vision systems. %X The postscript in this directory is missing a black-and-white photo. 92.JPDC.Psyche_Checkers.ps.Z
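The off-line optimal analysis described in the Bolosky and Scott entries above can be illustrated with a toy dynamic program. The model here is a deliberate simplification of ours, not the papers' implementation: one page, migration only (no replication), and made-up relative costs; the real analysis handles full traces, replication, and machine-specific cost parameters:

```python
# Toy off-line optimal placement for a single page via dynamic
# programming over a reference trace.  Costs are hypothetical.
LOCAL_REF, REMOTE_REF, MOVE = 1, 5, 20   # assumed relative costs

def optimal_cost(trace, nprocs):
    # cost[p] = cheapest cost of serving the trace so far with the
    # page currently resident at processor p
    cost = [0] * nprocs
    for proc in trace:                   # proc issues the next reference
        new = [0] * nprocs
        for p in range(nprocs):
            # either the page already sits at p, or we move it there
            stay_or_move = min(cost[q] + (0 if q == p else MOVE)
                               for q in range(nprocs))
            ref = LOCAL_REF if p == proc else REMOTE_REF
            new[p] = stay_or_move + ref
        cost = new
    return min(cost)
```

Because each step minimizes over every possible prior residence of the page, the result is a cost no on-line policy can beat, which is exactly the role the optimal analysis plays as a baseline for comparing implementable policies.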