%A M. L. Scott %A T. J. LeBlanc %T Psyche: A General-Purpose Operating System for Shared-Memory Multiprocessors %R BPR 19, TR 223 %K bpr19 tr223 %I URCSD %D July 1987 %X The Psyche project at the University of Rochester aims to develop a high-performance operating system to support a wide variety of models for parallel programming on a shared-memory multiprocessor. It is predicated on the assumption that no one model of process state or style of communication will prove appropriate for all applications, but that a shared-memory machine can and should support all models. Through a system of keys and access lists, incremental changes to address spaces, and direct execution of remote operations, Psyche will allow individual applications to balance the conflicting goals of efficiency, flexibility, and security. A pilot implementation is under construction on the BBN Butterfly Parallel Processor. %X This paper is superseded by the 1988 ICPP paper, and does not appear in this directory. %T Bridge: A High-Performance File System for Parallel Processors %A P. C. Dibble %A M. L. Scott %A C. S. Ellis %K dcs %J PROC of the Eighth ICDCS %P 154-161 %D 13-17 June 1988 %C San Jose, CA %X Faster storage devices cannot solve the I/O bottleneck problem for large multiprocessor systems if data passes through a file system on a single processor. Implementing the file system as a parallel program can significantly improve performance. Selectively revealing this parallel structure to utility programs can produce additional improvements, particularly on machines in which interprocessor communication is slow compared to aggregate I/O bandwidth. %X We have designed and prototyped a parallel file system that distributes each file across multiple storage devices and processors. 
Naive programs are able to access files just as they would with a conventional file system, while more sophisticated programs may export pieces of their code to the processors managing the data, for optimum performance. Theoretical and early empirical data indicate nearly linear speedup on critical operations for up to several dozen devices. %X 88.ICDCS.Bridge.ps.Z Copyright 1988, IEEE %A T. J. LeBlanc %A M. L. Scott %A C. M. Brown %T Large-Scale Parallel Programming: Experience with the BBN Butterfly Parallel Processor %J PROC of the First PPEALS %C New Haven, CT %D 19-21 July 1988 %P 161-172 %O Also available as BPR 22, URCSD, September 1988 %X For three years, members of the Computer Science Department at the University of Rochester have used a collection of BBN Butterfly\*[TM] Parallel Processors to conduct research in parallel systems and applications. For most of that time, Rochester's 128-node machine has had the distinction of being the largest shared-memory multiprocessor in the world. In the course of our work with the Butterfly we have ported three compilers, developed five major and several minor library packages, built two different operating systems, and implemented dozens of applications. Our experience clearly demonstrates the practicality of large-scale shared-memory multiprocessors with non-uniform memory access times. It also demonstrates that the problems inherent in programming such machines are far from adequately solved. Both locality and Amdahl's law become increasingly important with a very large number of nodes. The availability of multiple programming models is also a concern; truly general-purpose parallel computing will require the development of environments that allow programs written under different models to coexist and interact. 
Most important, there is a continuing need for high-quality programming tools; widespread acceptance of parallel machines will require the development of programming environments comparable to those available on sequential computers. %X This paper contains several cut-and-paste figures, and is not easily placed in ftp-able form; it doesn't appear in this directory. Copyright 1988, ACM %A M. L. Scott %A T. J. LeBlanc %A B. D. Marsh %T Design Rationale for Psyche, a General-Purpose Multiprocessor Operating System %J PROC of the 1988 ICPP %C St. Charles, IL %D 15-19 August 1988 %P 255-262 %V II \(mi Software %O Also published in the \f2University of Rochester 1988-89 Computer Science and Computer Engineering Research Review\fP. %X The Psyche project at the University of Rochester aims to develop a high-performance operating system to support a wide variety of models for parallel programming. It is predicated on the conviction that no one model of process state or style of communication will prove appropriate for all applications, but that shared-memory multiprocessors (particularly the scalable ``NUMA'' variety) can and should support all models. Conventional approaches, such as shared memory or message passing, can be regarded as points on a continuum that reflects the degree of sharing between processes. Psyche facilitates dynamic sharing by providing a user interface based on passive data abstractions in a uniform virtual address space. It ensures that users pay for protection only when it is required by permitting lazy evaluation of protection policies implemented with keys and access lists. The data abstractions define conventions for sharing the uniform address space; the tradeoff between protection and performance determines the degree to which those conventions are enforced. In the absence of protection boundaries, access to a shared abstraction can be as efficient as a procedure call or a pointer dereference. 
%X The postscript in this directory has been reconstructed from old troff source, and doesn't precisely match the formatting of the version in the conference proceedings. The references have also been re-built from a newer database, and may not be exactly the same. 88.ICPP.Psyche_Rationale.ps.Z Copyright 1988, The Pennsylvania State University %A M. L. Scott %A T. J. LeBlanc %A B. D. Marsh %T Evolution of an Operating System for Large-Scale Shared-Memory Multiprocessors %K psyche tr309 %R TR 309 %I URCSD %D March 1989 %X Scalable shared-memory multiprocessors (those with non-uniform memory access times) are among the most flexible architectures for high-performance parallel computing, admitting efficient implementations of a wide range of process models, communication mechanisms, and granularities of parallelism. Such machines present opportunities for general-purpose parallel computing that cannot be exploited by existing operating systems, because the traditional approach to operating system design presents a virtual machine in which the definition of processes, communication, and grain size are outside the control of the user. Psyche is an operating system designed to enable the most effective use possible of large-scale shared memory multiprocessors. The Psyche project is characterized by (1) a design that permits the implementation of multiple models of parallelism, both within and among applications, (2) the ability to trade protection for performance, with information sharing as the default, rather than the exception, (3) explicit, user-level control of process structure and scheduling, and (4) a kernel implementation that uses shared memory itself, and that provides users with the illusion of uniform memory access times. %X The postscript here was reconstructed from old troff source, and does not match the formatting of the hard-copy TR. 
In particular, the bibliography has been re-built from a newer database, and in several cases cites newer versions of papers -- versions that postdate the TR. 89.TR309.Psyche_Evolution.ps.Z %A T. J. LeBlanc %A B. D. Marsh %A M. L. Scott %T Memory Management for Large-Scale NUMA Multiprocessors %K psyche tr311 %R TR 311 %I URCSD %D March 1989 %X Large-scale shared-memory multiprocessors such as the BBN Butterfly and IBM RP3 introduce a new level in the memory hierarchy: multiple physical memories with different memory access times. An operating system for these NUMA (NonUniform Memory Access) multiprocessors should provide traditional virtual memory management, facilitate dynamic and widespread memory sharing, and minimize the apparent disparity between local and nonlocal memory. In addition, the implementation must be scalable to configurations with hundreds or thousands of processors. %X This paper describes memory management in the Psyche multiprocessor operating system, under development at the University of Rochester. The Psyche kernel manages a multi-level memory hierarchy consisting of local memory, nonlocal memory, and backing store. Local memory stores private data and serves as a cache for shared data; nonlocal memory stores shared data and serves as a disk cache. The system structure isolates the policies and mechanisms that manage different layers in the memory hierarchy, so that customized data structures and policies can be constructed for each layer. Local memory management policies are implemented using mechanisms that are independent of the architectural configuration; global policies are implemented using multiple processes that increase in number as the architecture scales. Psyche currently runs on the BBN Butterfly Plus multiprocessor. %X This paper describes an ambitious early design for a VM system for Psyche. 
Our plans for VM were scaled back substantially after implementation was about half completed, due to changing interests of graduate students. Further discussion can be found in the WEBDMS paper. 89.TR311.Psyche_VM.ps.Z %A M. L. Scott %A T. J. LeBlanc %A B. D. Marsh %T A Multi-User, Multi-Language Open Operating System %K Psyche %J PROC of the Second Workshop on Workstation Operating Systems %C Pacific Grove, CA %D 27-29 September 1989 %P 125-129 %X An open operating system provides a high degree of programming flexibility and efficiency. At the same time, it generally requires that all programs be written in a single language, and provides no protection other than that which is available from the compiler. These limitations may be tolerated by the user of a personal computer, but they become unacceptable on a workstation that must run untrusted software written in many different languages. Psyche is an operating system designed to make the most effective possible use of shared-memory multiprocessors and uniprocessor machines. It combines the flexibility of an open operating system with the ability to write in multiple languages and to establish solid protection boundaries. It also provides the efficiency of an open operating system for programs that do not require protection. %X A 5-page position paper. 89.WWOS.Psyche_Open_System.ps.Z Copyright 1989, IEEE %A P. C. Dibble %A M. L. Scott %T Beyond Striping: The Bridge Multiprocessor File System %J Computer Architecture News %V 17 %N 5 %D September 1989 %P 32-39 %X High-performance parallel computers require high-performance file systems. Exotic I/O hardware will be of little use if file system software runs on a single processor of a many-processor machine. We believe that cost-effective I/O for large multiprocessors can best be obtained by spreading both data and file system computation over a large number of processors and disks. 
To assess the effectiveness of this approach, we have implemented a prototype system called Bridge, and have studied its performance on several data-intensive applications, among them external sorting. A detailed analysis of our sorting algorithm indicates that Bridge can profitably be used on configurations in excess of one hundred processors with disks. Empirical results on a 32-processor implementation agree with the analysis, providing us with a high degree of confidence in this prediction. Based on our experience, we argue that file systems such as Bridge will satisfy the I/O needs of a wide range of parallel architectures and applications. %X 89.CAN.Bridge.ps.Z %A W. J. Bolosky %A R. P. Fitzgerald %A M. L. Scott %T Simple But Effective Techniques for NUMA Memory Management %J PROC of the Twelfth SOSP %C Litchfield Park, AZ %D 3-6 December 1989 %P 19-31 %O In \f2\&OSR 23\fP:5 %X Multiprocessors with non-uniform memory access times introduce the problem of placing data near the processes that use them, in order to improve performance. We have implemented an automatic page placement strategy in the Mach operating system on the IBM ACE multiprocessor workstation. Our experience indicates that even very simple automatic strategies can produce nearly optimal page placement. It also suggests that the greatest leverage for further performance improvement lies in reducing {\em false sharing}, which occurs when the same page contains objects that would best be placed in different memories. %X 89.SOSP.ACE_NUMA.ps.Z Copyright 1989, ACM %A M. L. Scott %A T. J. LeBlanc %A B. D. Marsh %A T. G. Becker %A C. Dubnicki %A E. P. Markatos %A N. G. Smithline %T Implementation Issues for the Psyche Multiprocessor Operating System %K psyche WEBDMS %J Computing Systems %V 3 %N 1 %D Winter 1990 %P 101-137 %O Earlier version presented at the \f2First Workshop on Experiences with Building Distributed and Multiprocessor Systems\fP, Ft. 
Lauderdale, FL, 5-6 October, 1989, 227-236 %X Psyche is a parallel operating system under development at the University of Rochester. The Psyche user interface is designed to allow programs with widely differing concepts of process, sharing, protection, and communication to run efficiently on the same machine, and to interact productively. In addition, the Psyche development effort is addressing a host of implementation issues for large-scale shared-memory multiprocessors, including the organization of kernel functions, data structures, and address maps for machines with non-uniform memory; the migration and replication of pages to maximize locality; the introduction of user-level device drivers, memory management, and scheduling; and remote source-level kernel debugging. We focus in this paper on our implementation of Psyche for the BBN Butterfly Plus multiprocessor, though many of the issues we consider apply to any operating system kernel on a large-scale shared-memory machine. We describe our major design decisions, the results of our initial experience with the implementation, and our plans for continued evaluation and experimentation with kernel implementation techniques. %X The postscript in this directory has been reconstructed from old troff source; the references have been re-built from a newer database, and may not be exactly right. 90.Comp_Sys.Psyche_Impl.ps.Z Copyright 1990, The Regents of the University of California %A M. L. Scott %A T. J. LeBlanc %A B. D. Marsh %T Multi-Model Parallel Programming in Psyche %J PROC of the Second PPOPP %C Seattle, WA %D 14-16 March, 1990 %P 70-78 %O In \f2\&ACM SIGPLAN Notices 25\fP:3 %X Many different parallel programming models, including lightweight processes that communicate with shared memory and heavyweight processes that communicate with messages, have been used to implement parallel applications. Unfortunately, operating systems and languages designed for parallel programming typically support only one model. 
Multi-model parallel programming is the simultaneous use of several different models, both across programs and within a single program. This paper describes multi-model parallel programming in the Psyche multiprocessor operating system. We explain why multi-model programming is desirable and present an operating system interface designed to support it. Through a series of three examples, we illustrate how the Psyche operating system supports different models of parallelism and how the different models are able to interact. %X The postscript in this directory has been reconstructed from old troff source, and may not match the conference proceedings precisely; the references have been re-built from a newer database, and may not be exactly right. 90.PPOPP.Multi_Model_Psyche.ps.Z Copyright 1990, ACM %A T. J. LeBlanc %A E. P. Markatos %T Operating System Support for Adaptable Real-Time Systems %J PROC of the Seventh IEEE Workshop on Real-Time Operating Systems and Software %C Charlottesville, VA %D May 1990 %P 1-10 %K multi-model real-time satisfice user-level scheduling %O In \f2IEEE Real-Time Systems Newsletter\fP %X This paper outlines our plans for a real-time systems research program to support the long-term goal of developing intelligent robots. The distinguishing characteristic of our approach to real-time systems is an emphasis on system \fIadaptability\fP in a dynamic real-world environment. We achieve adaptability by allowing multiple real-time process models, with different known properties and timing constraints, to coexist within a single system and application. Using a hardware architecture in which a large-scale multiprocessor controls a behavioral system with vision and manipulation capabilities, we are constructing a prototype software environment that provides a predictable schedule for \fIreflexive\fP robot tasks and a flexible environment for implementing adaptive \fIcognitive\fP robot tasks. 
We will exploit multiple processors, user-level scheduling, and user-defined process and communication models in the construction of real-time robotics applications. %X 90.RTOSS.Adaptable_Real_Time.ps.Z %A P. C. Dibble %A M. L. Scott %T The Parallel Interleaved File System: A Solution to the Multiprocessor I/O Bottleneck %K bridge %J IEEETPDS %D to appear, pending revision %X Parallel computers with non-parallel file systems are limited by the performance of the processor running the file system. We have designed and implemented a parallel file system called Bridge that eliminates this problem by spreading both data and file system computation over a large number of processors and disks. To assess the effectiveness of Bridge we have used it to implement several data-intensive applications, including a parallel external merge sort. The merge sort is a particularly demanding application; it requires significant amounts of interprocessor communication and data movement. A detailed analysis of this application indicates that Bridge can profitably be used on configurations in which disks are attached to more than 150 processors. Empirical results on a 32-processor implementation agree with the analysis, providing us with a high degree of confidence in this prediction. Based on our experience, we argue that file systems such as Bridge will satisfy the I/O needs of a wide range of parallel architectures and applications. %X This paper has been through one round of reviewing for IEEE TPDS, and is currently undergoing revision. The postscript in this directory is the version submitted to the journal in May of 1990. 90.TPDS.Bridge.ps.Z %A M. L. 
Scott %T The Lynx Distributed Programming Language: Motivation, Design, and Experience %K tr308 %J COMPLANG %V 16 %N 3/4 %D 1991 %P 209-233 %O Earlier version published as TR 308, ``An Overview of Lynx,'' URCSD, August 1989 %X A programming language has clear advantages over a library package for communication between processes in a distributed environment. Unfortunately, most existing distributed languages are better suited to communication between the processes of a single application than they are to communication between processes that are developed independently. Such independent development is characteristic both of the systems software for multicomputers and of the applications software for geographically-distributed networks. Lynx is a message-based language designed to support both application and system software in a single conceptual framework. It extends the advantages of high-level communication to processes designed in isolation, and compiled and placed in operation without knowledge of their peers. A virtual-circuit abstraction called the \fIlink\fP supports changes in communication topology at run time, under user control. A lightweight \fIthread\fP mechanism is integrated with message passing in a way that makes it easy for server processes to maintain context for interleaved conversations with an arbitrary number of clients. Experience with Lynx under two very different implementations has yielded important insights into the interface between a language run-time package and the underlying operating system, and has helped to identify the inherent costs of message-passing systems. %X This is the place to start if you're interested in Lynx. It cites, and summarizes, most of the other papers. The postscript here is the version from which the galleys were prepared. 91.Comp_Lang.Lynx.ps.Z Copyright 1991, Pergamon Press, plc %A L. A. Crowl %A T. J. 
LeBlanc %T Architectural Adaptability in Parallel Programming via Control Abstraction %I URCSD %R TR 359 %D January 1991 %K architectural independence closure Matroshka %X Parallel programming involves finding the potential parallelism in an application, choosing an algorithm, and mapping it to the architecture at hand. Since a typical algorithm has much more potential parallelism than any single architecture can effectively exploit, we usually program the parallelism that the available control constructs easily express and that the given architecture efficiently exploits. This approach produces programs that exhibit much less parallelism than the original algorithm and whose performance depends entirely on the underlying architecture. To port such a program to a new architecture, we must rewrite the program to remove any ineffective parallelism and to recover any lost parallelism appropriate for the new machine. %X In this paper we show how to adapt a parallel program to different architectures using control abstraction. With control abstraction we can define and use a rich variety of control constructs to represent an algorithm's potential parallelism. Since control abstraction separates the definition of a construct from its implementation, a construct may have several different implementations, each exploiting a different subset of the parallelism admitted by the construct. By selecting an implementation for each control construct using annotations, we can vary the parallelism we choose to exploit without otherwise changing the source code. This approach produces programs that exhibit most of, if not all, the potential parallelism in an algorithm, and whose performance can be tuned for a specific architecture simply by choosing among the various implementations for the control constructs in use. %X 91.TR359.Control_Abstraction.ps.Z %A J. M. Mellor-Crummey %A M. L. 
Scott %T Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors %K mcs lock MellorCrummeyScott90 %J TOCS %V 9 %N 1 %P 21-65 %D February 1991 %O Earlier version published as TR 342, URCSD, April 1990, and COMP TR90-114, Center for Research on Parallel Computation, Rice UNIV, May 1990 %X Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting tend to produce large amounts of memory and interconnect contention, introducing performance bottlenecks that become markedly more pronounced as applications scale. We argue that this problem is not fundamental, and that one can in fact construct busy-wait synchronization algorithms that induce no memory or interconnect contention. The key to these algorithms is for every processor to spin on separate {\em locally-accessible} flag variables, and for some other processor to terminate the spin with a single remote write operation at an appropriate time. Flag variables may be locally-accessible as a result of coherent caching, or by virtue of allocation in the local portion of physically distributed shared memory. %X We present a new scalable algorithm for spin locks that generates $O(1)$ remote references per lock acquisition, independent of the number of processors attempting to acquire the lock. Our algorithm provides reasonable latency in the absence of contention, requires only a constant amount of space per lock, and requires no hardware support other than a swap-with-memory instruction. We also present a new scalable barrier algorithm that generates $O(1)$ remote references per processor reaching the barrier, and observe that two previously-known barriers can likewise be cast in a form that spins only on locally-accessible flag variables. None of these barrier algorithms requires hardware support beyond the usual atomicity of memory reads and writes. 
%X We compare the performance of our scalable algorithms with other software approaches to busy-wait synchronization on both a Sequent Symmetry and a BBN Butterfly. Our principal conclusion is that {\em contention due to synchronization need not be a problem in large-scale shared-memory multiprocessors}. The existence of scalable algorithms greatly weakens the case for costly special-purpose hardware support for synchronization, and provides a case against so-called ``dance hall'' architectures, in which shared memory locations are equally far from all processors. %X The postscript here is the version from which the galleys were prepared. 91.TOCS.Scalable_Sync.ps.Z Copyright 1991, ACM %A M. Crovella %A P. Das %A C. Dubnicki %A T. LeBlanc %A E. Markatos %T Multiprogramming on Multiprocessors %R TR 385 %I URCSD %D February 1991 (revised May 1991) %K scheduling coscheduling time-slicing hardware partitions %X Several studies have shown that applications may suffer significant performance degradation unless the scheduling policy minimizes the overhead due to multiprogramming. This overhead includes context switching among applications, waiting time incurred by one process due to the preemption of another, and various migration costs associated with moving a process from one processor to another. Many different multiprogramming solutions have been proposed, but each has limited applicability or fails to address an important source of overhead. In addition, there has been little experimental comparison of the various solutions in the presence of applications with varying degrees of parallelism and synchronization. %X In this paper we explore the tradeoffs between different approaches to multiprogramming a multiprocessor. We modified an existing operating system to implement three different multiprogramming options: time-slicing, coscheduling, and dynamic hardware partitions. 
Using these three options, we implemented applications that vary in the degree of parallelism, and the frequency and type of synchronization. We show that in most cases coscheduling is preferable to time-slicing. We also show that although there are cases where coscheduling is beneficial, dynamic hardware partitions do no worse, and will often do better. We conclude that under most circumstances, hardware partitioning is the best strategy for multiprogramming a multiprocessor, no matter how much parallelism applications employ or how frequently synchronization occurs. %X 91.TR385.Multiprogramming_Multiprocessors.ps.Z %A J. M. Mellor-Crummey %A M. L. Scott %T Synchronization Without Contention %J PROC of the Fourth ASPLOS %C Santa Clara, CA %D 8-11 April 1991 %P 269-278 %O In \f2\&CAN 19\fP:2, \f2\&OSR 25\fP (special issue), and \f2\&ACM SIGPLAN Notices 26\fP:4 %X Conventional wisdom holds that contention due to busy-wait synchronization is a major obstacle to scalability and acceptable performance in large shared-memory multiprocessors. We argue the contrary, and present fast, simple algorithms for contention-free mutual exclusion, reader-writer control, and barrier synchronization. These algorithms, based on widely available fetch_and_\(*F instructions, exploit local access to shared memory to avoid contention. We compare our algorithms to previous approaches in both qualitative and quantitative terms, presenting their performance on the Sequent Symmetry and BBN Butterfly multiprocessors. Our results highlight the importance of local access to shared memory, provide a case against the construction of so-called ``dance hall'' machines, and suggest that special-purpose hardware support for synchronization is unlikely to be cost effective on machines with sequentially consistent memory. %X This file is a little ugly; the version for the conference proceedings was formatted for oversize mats. This reconstruction was made with the scaling facility of dvi2ps. 
Its references were regenerated in the reconstruction, and may not match the proceedings exactly. 91.ASPLOS.Contention-Free_Sync.ps.Z Copyright 1991, ACM %A W. J. Bolosky %A M. L. Scott %A R. P. Fitzgerald %A R. J. Fowler %A A. L. Cox %T NUMA Policies and Their Relation to Memory Architecture %J PROC of the Fourth ASPLOS %C Santa Clara, CA %D 8-11 April 1991 %P 212-221 %O In \f2\&CAN 19\fP:2, \f2\&OSR 25\fP (special issue), and \f2\&ACM SIGPLAN Notices 26\fP:4 %X Multiprocessor memory reference traces provide a wealth of information on the behavior of parallel programs. We have used this information to explore the relationship between kernel-based NUMA management policies and multiprocessor memory architecture. Our trace analysis techniques employ an off-line, optimal cost policy as a baseline against which to compare on-line policies, and as a policy-insensitive tool for evaluating architectural design alternatives. We compare the performance of our optimal policy with that of three implementable policies (two of which appear in previous work), on a variety of applications, with varying relative speeds for page moves and local, global, and remote memory references. Our results indicate that a good NUMA policy must be chosen to match its machine, and confirm that such policies can be both simple and effective. They also indicate that programs for NUMA machines must be written with care to obtain the best performance. %X 91.ASPLOS.NUMA_Policies.ps.Z Copyright 1991, ACM %A J. M. Mellor-Crummey %A M. L. Scott %T Scalable Reader-Writer Synchronization for Shared-Memory Multiprocessors %J PROC of the Third PPOPP %C Williamsburg, VA %D 21-24 April 1991 %P 106-113 %X Reader-writer synchronization relaxes the constraints of mutual exclusion to permit more than one process to inspect a shared object concurrently, as long as none of them changes its value. 
On uniprocessors, mutual exclusion and reader-writer locks are typically designed to de-schedule blocked processes; however, on shared-memory multiprocessors it is often advantageous to have processes busy wait. Unfortunately, implementations of busy-wait locks on shared-memory multiprocessors typically cause memory and network contention that degrades performance. Several researchers have shown how to implement scalable mutual exclusion locks that exploit locality in the memory hierarchies of shared-memory multiprocessors to eliminate contention for memory and for the processor-memory interconnect. In this paper we present reader-writer locks that similarly exploit locality to achieve scalability, with variants for reader preference, writer preference, and reader-writer fairness. Performance results on a BBN TC2000 multiprocessor demonstrate that our algorithms provide low latency and excellent scalability. %X 91.PPOPP.Read_Write_Sync.ps.Z Copyright 1991, ACM %A L. A. Crowl %T Architectural Adaptability in Parallel Programming %I URCSD %R Ph.\|D. thesis, TR 381 %D May 1991 %K architectural independence closure Matroshka Natasha %X To create a parallel program, programmers must decide what parallelism to exploit, and choose the associated data distribution and communication. Since a typical algorithm has much more potential parallelism than any single architecture can effectively exploit, programmers usually express only the exploitation of parallelism appropriate to a single machine. Unfortunately, parallel architectures vary widely. A program that executes efficiently on one architecture may execute badly, if at all, on another architecture. To port such a program to a new architecture, we must rewrite the program to remove any ineffective parallelism, to introduce any parallelism appropriate for the new machine, to re-distribute data and processing, and to alter the form of communication. 
%X Architectural adaptability is the ease with which programmers can tune or port a program to a different architecture. The thesis of this dissertation is that control abstraction is fundamental to architectural adaptability for parallel programs. With control abstraction, we can define and use a rich variety of control constructs to represent an algorithm's potential parallelism. Since control abstraction separates the definition of a construct from its implementation, a construct may have several different implementations, each providing different exploitations of parallelism. By using annotations to select an implementation for each use of a control construct, we can vary the parallelism we choose to exploit without otherwise changing the source code. %X We present Matroshka, a programming model that supports architectural adaptability in parallel programs through object-based data abstraction and closure-based control abstraction. Using the model, we develop several working example programs, and show that they adapt well to different architectures. We also outline a programming method based on abstraction. To show the implementation feasibility of our approach, we describe a prototype language based on Matroshka, describe its implementation, and compare the performance of the prototype with existing programs. %X 91.TR381.Crowl_thesis.ps.Z %A E. Markatos %A M. Crovella %A P. Das %A C. Dubnicki %A T. LeBlanc %T The Effects of Multiprogramming on Barrier Synchronization %K tr380 tree barriers %R TR 380 %I URCSD %D May 1991 %X One of the most common ways to share a multiprocessor among several applications is to give each application a set of dedicated processors. To ensure fairness, an application may receive fewer processors than it has processes. Unless an application can easily adjust the number of processes it employs during execution, several processes from the same application may have to share a processor.
In this paper we quantify the performance penalty that arises when more than one process from the same application runs on a single processor of a NUMA (NonUniform Memory Access) multiprocessor. We consider programs that use coarse-grain parallelism and barrier synchronization because they are particularly sensitive to multiprogramming. We quantify the impact of quantum size, frequency of synchronization, and type of barrier on the performance of an application. We conclude that dedicating processors to an application, even without migration or dynamic adjustment of the number of processes, is an effective scheduling policy even for programs that synchronize frequently using barriers. %X 91.TR380.Barrier_Multiprogramming.ps.Z %A E. P. Markatos %T Multiprocessor Synchronization Primitives with Priorities %J PROC of the Eighth IEEE Workshop on Real-Time Operating Systems and Software %D May 1991 %O Also published in the PROC of the 1991 IFAC workshop on Real-Time Programming, Pergamon Press %X Low-level multiprocessor synchronization primitives, such as spinlocks, are usually designed with little or no consideration of timing constraints, which makes them inappropriate for real-time systems. In this paper, we propose a new synchronization mechanism, the priority spinlock, that takes into account the priorities of the processes that want to acquire it, and favors high-priority processes. We define what a priority spinlock is, and propose two algorithms to implement priority spinlocks with local spinning. Priority spinlocks can be used to provide prioritized mutually exclusive access to shared resources in real-time multiprocessor systems. They can also be used as building blocks for higher-level priority synchronization primitives, such as priority semaphores. %X 91.RTOS.Priority_Spinlocks.ps.Z Copyright 1991, Pergamon Press, plc %A E. Chaves %A T. J. LeBlanc %A B. D. Marsh %A M. L.
Scott %T Kernel-Kernel Communication in a Shared-Memory Multiprocessor %K psyche tr368 %R submitted for publication %D 1991 %O Earlier version published as TR 368, URCSD, January 1991, and presented at the \f2Second Symposium on Distributed and Multiprocessor Systems\fP, Atlanta, GA, 21-22 March 1991, pp. 105-116. %X In the standard kernel organization on a shared-memory multiprocessor, all processors share the code and data of the operating system; explicit synchronization is used to control access to kernel data structures. Distributed-memory multicomputers use an alternative approach, in which each instance of the kernel performs local operations directly and uses remote invocation to perform remote operations. Either approach to inter-kernel communication can be used in a NonUniform Memory Access (NUMA) multiprocessor, although the performance and conceptual tradeoffs may not be apparent in advance. %X In this paper we compare the use of remote memory access and remote invocation in the kernel of a NUMA multiprocessor operating system. We discuss the issues and architectural features that must be considered when choosing an inter-kernel communication scheme, and describe a series of experiments on the BBN Butterfly Plus designed to empirically evaluate the tradeoffs between remote memory access and remote invocation. We conclude that the Butterfly architecture is biased towards the use of remote invocation for most kernel operations, but that a small set of frequently executed operations can benefit from the use of remote memory access. %X 91.SEDMS.Psyche_Kernel_Comm.ps.Z %A W. J. Bolosky %A M. L. Scott %T Evaluation of Multiprocessor Memory Systems Using Off-Line Optimal Behavior %K algorithm tracing trace based analysis %R TR 403 %I URCSD %D September 1991 %O Submitted for publication %X In recent years, much effort has been devoted to analyzing the performance of distributed memory systems for multiprocessors.
Such systems usually consist of a set of memories or caches, some device such as a bus or switch to connect the memories and processors, and a policy for determining when to put which addressable objects in which memories. In attempting to evaluate such systems, it has generally proven difficult to separate the performance implications of the hardware architecture from those of the policy that controls the hardware (whether implemented in software or hardware). In this paper we describe the use of off-line optimal analysis to achieve this separation. Using a trace-driven dynamic programming algorithm, we compute the policy decisions that would maximize overall memory system performance for a given program execution. The result allows us to eliminate the artifacts of any arbitrarily chosen policy when evaluating hardware performance, and provides a baseline against which to compare the performance of particular, realizable, policies. We illustrate this technique in the context of software-controlled page migration and replication, and argue for its applicability to other forms of multiprocessor memory management. %X 91.TR403.Optimal_Analysis.ps.Z %A B. D. Marsh %A M. L. Scott %A T. J. LeBlanc %A E. P. Markatos %T First-Class User-Level Threads %K psyche %J PROC of the Thirteenth SOSP %C Pacific Grove, CA %D 14-16 October 1991 %P 110-121 %O In \f2\&OSR 25\fP:5 %X It is often desirable, for reasons of clarity, portability, and efficiency, to write parallel programs in which the number of processes is independent of the number of available processors. Several modern operating systems support more than one process in an address space, but the overhead of creating and synchronizing kernel processes can be high. Many runtime environments implement lightweight processes (threads) in user space, but this approach usually results in second-class status for threads, making it difficult or impossible to perform scheduling operations at appropriate times (e.g. 
when the current thread blocks in the kernel). In addition, a lack of common assumptions may make it difficult for parallel programs or library routines that use dissimilar thread packages to communicate with each other, or to synchronize access to shared data. %X We describe a set of kernel mechanisms and conventions designed to accord \f2first-class status\fP to user-level threads, allowing them to be used in any reasonable way that traditional kernel-provided processes can be used, while leaving the details of their implementation to user-level code. The key features of our approach are (1) shared memory for asynchronous communication between the kernel and the user, (2) software interrupts for events that might require action on the part of a user-level scheduler, and (3) a scheduler interface convention that facilitates interactions in user space between dissimilar kinds of threads. We have incorporated these mechanisms in the Psyche parallel operating system, and have used them to implement several different kinds of user-level threads. We argue for our approach in terms of both flexibility and performance. %X 91.SOSP.Psyche_First_Class_Threads.ps.Z Copyright 1991, ACM %A M. L. Scott %A W. Garrett %T Shared Memory Ought to be Commonplace (position paper) %R submitted for publication %D November 1991 %X Shared memory as a programming abstraction is widely used within parallel applications. It is \f2not\fP widely used \f2between\fP applications, but we argue that it should be. Specifically, we suggest that shared memory is both faster and more intuitive than the principal alternatives in many cases, and that the general disuse of shared-memory facilities in systems such as Unix is due in large part to a simple lack of amenities, rather than to any fundamental limitation. We are pursuing a series of measures to make shared memory more convenient. %X 92.Shared_Memory.ps.Z %A W. J. Bolosky %A M. L.
Scott %T A Trace-Based Comparison of Shared Memory Multiprocessors Using Optimal Off-Line Analysis %K algorithm tracing trace based %R submitted for publication %D December 1991 %X There are three major classes of MIMD multiprocessors: cache-coherent machines, NUMA (non-uniform memory access) machines without cache coherence, and distributed-memory multicomputers. All three classes can be used to run shared-memory applications, though the third requires software support in order to do so, and the second requires software support in order to do so well. We use trace-driven simulation to compare the performance of these classes, in an attempt to determine the effect of various architectural features and parameters on overall program performance. For those systems whose hardware or software supports both coherent caching (migration, replication) and remote reference, we use optimal off-line analysis to make the correct decision in all cases. This technique allows us to evaluate architectural alternatives without worrying that the results may be biased by a poor data placement policy. We find that the size of the unit of coherence (page or cache line) is the dominant factor in performance; that NUMA systems can have performance comparable to that of cache-coherent machines; and that even relatively expensive, software-implemented remote reference is beneficial in distributed shared-memory machines. %X 91.Architectural_Comparison.ps.Z %A B. D. Marsh %A C. M. Brown %A T. J. LeBlanc %A M. L. Scott %A T. G. Becker %A P. Das %A J. Karlsson %A C. A. Quiroz %T Multi-Model Parallel Programming for Animate Vision %K checkers psyche %J IEEEC %V 25 %N 2 %D February 1992 %X Animate vision systems couple computer vision, robotics, and cognition. These activities need cooperative parallel processing at several granularities. The Psyche system provides this multi-model capability. %X The postscript in this directory is missing a black-and-white photo.
There will be minor editorial changes in the final Computer version. 92.Computer.Psyche_Animate_Vision.ps.Z Copyright 1992, IEEE %A B. D. Marsh %A C. M. Brown %A T. J. LeBlanc %A M. L. Scott %A T. G. Becker %A P. Das %A J. Karlsson %A C. A. Quiroz %T Operating System Support for Animate Vision %K tr374 checkers psyche %J JPDC %D to appear %O Earlier version published as TR 374, URCSD, June 1991 %X Animate vision systems couple computer vision and robotics to achieve robust and accurate vision, as well as other complex behavior. These systems combine low-level sensory processing and effector output with high-level cognitive planning -- all computationally intensive tasks that can benefit from parallel processing. A typical animate vision application will likely consist of many tasks, each of which may require a different parallel programming model, and all of which must cooperate to achieve the desired behavior. These \f2multi-model\fP programs require an underlying software system that not only supports several different models of parallel computation simultaneously, but also allows tasks implemented in different models to interact. %X This paper describes the Psyche multiprocessor operating system, which was designed to support multi-model programming, and the Rochester Checkers Player, a multi-model robotics program that plays checkers against a human opponent. Psyche supports a variety of parallel programming models within a single operating system by according first-class status to processes implemented in user space. It also supports interactions between programming models using model-independent communication, wherein different types of processes communicate and synchronize without relying on the semantics or implementation of a particular programming model.
The implementation of the Checkers Player, in which different parallel programming models are used for vision, robot motion planning, and strategy, illustrates the use of the Psyche mechanisms in an application program, and demonstrates many of the advantages of multi-model programming for animate vision systems. %X The postscript in this directory is missing a black-and-white photo. 92.JPDC.Psyche_Checkers.ps.Z
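The off-line optimal analysis described in the Bolosky and Scott entries above can be illustrated with a toy dynamic program. The model here is a deliberate simplification of ours, not the papers' implementation: one page, migration only (no replication), and made-up relative costs; the real analysis handles full traces, replication, and machine-specific cost parameters:

```python
# Toy off-line optimal placement for a single page via dynamic
# programming over a reference trace.  Costs are hypothetical.
LOCAL_REF, REMOTE_REF, MOVE = 1, 5, 20   # assumed relative costs

def optimal_cost(trace, nprocs):
    # cost[p] = cheapest cost of serving the trace so far with the
    # page currently resident at processor p
    cost = [0] * nprocs
    for proc in trace:                   # proc issues the next reference
        new = [0] * nprocs
        for p in range(nprocs):
            # either the page already sits at p, or we move it there
            stay_or_move = min(cost[q] + (0 if q == p else MOVE)
                               for q in range(nprocs))
            ref = LOCAL_REF if p == proc else REMOTE_REF
            new[p] = stay_or_move + ref
        cost = new
    return min(cost)
```

Because each step minimizes over every possible prior residence of the page, the result is a cost no on-line policy can beat, which is exactly the role the optimal analysis plays as a baseline for comparing implementable policies.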