By: P.E. McKenney
Published in: PLoPD2
Pages: 501-535
Category: Parallel Programming
Locking designs for parallel programs.
Url: http://c2.com/ppr/mutex/mutexpat.html
Sequential Program
Pages: 510-511
To eliminate the complexity of parallelization, construct an entirely sequential program, eliminating all synchronization primitives and the overhead and complexity associated with them.
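A minimal C sketch of the idea (the histogram and its names are illustrative, not from the paper): with a single flow of control, plain loads and stores on shared data are safe, so no locks, atomics, or thread machinery appear anywhere.

```c
/* A purely sequential histogram: no locks, no atomics, no thread
 * overhead, because only one flow of control ever touches the data. */
static int counts[4];

static void record(int bucket)
{
    counts[bucket % 4]++;       /* plain increment is safe: no concurrency */
}

long total_recorded(void)
{
    long sum = 0;
    for (int i = 0; i < 4; i++)
        sum += counts[i];
    return sum;
}

void run_sequential_demo(int n)
{
    for (int i = 0; i < n; i++)
        record(i);
}
```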
Code Locking
Pages: 511-513
Use code locking when a small amount of execution time is spent in critical sections or when you only need modest scaling. This is the simplest locking design.
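A minimal pthreads sketch of code locking (counter and thread counts are illustrative): one global mutex serializes every critical section, so correctness is easy to see, at the cost of all threads funneling through a single lock.

```c
#include <pthread.h>

/* Code Locking: a single global lock guards all shared state. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* every thread funnels through here */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

long run_code_locking_demo(int nthreads)
{
    pthread_t tid[16];
    counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return counter;
}
```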
Data Locking
Pages: 513-515
To get better speedups than straightforward parallelizations such as Code Locking, partition data structures so that each portion can be processed in parallel. Each portion has its own independent critical section.
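A hedged pthreads sketch of data locking (the hash table and bucket count are illustrative): the table is partitioned into buckets, each guarded by its own lock, so operations on different buckets form independent critical sections and proceed in parallel.

```c
#include <pthread.h>

/* Data Locking: per-bucket locks instead of one table-wide lock. */
#define NBUCKETS 8

struct bucket {
    pthread_mutex_t lock;
    long count;                 /* stand-in for a per-bucket item list */
};

static struct bucket table[NBUCKETS];

void table_init(void)
{
    for (int i = 0; i < NBUCKETS; i++) {
        pthread_mutex_init(&table[i].lock, NULL);
        table[i].count = 0;
    }
}

void table_add(int key)
{
    struct bucket *b = &table[key % NBUCKETS];
    pthread_mutex_lock(&b->lock);     /* only this bucket is serialized */
    b->count++;
    pthread_mutex_unlock(&b->lock);
}

static void *worker(void *arg)
{
    int base = *(int *)arg;
    for (int i = 0; i < 100000; i++)
        table_add(base + i);          /* threads mostly hit different buckets */
    return NULL;
}

long run_data_locking_demo(void)
{
    pthread_t tid[4];
    static int base[4] = { 0, 1, 2, 3 };
    long total = 0;

    table_init();
    for (int i = 0; i < 4; i++)
        pthread_create(&tid[i], NULL, worker, &base[i]);
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    for (int i = 0; i < NBUCKETS; i++)
        total += table[i].count;
    return total;
}
```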
Data Ownership
Pages: 515-519
Each CPU or thread should own its own data and shouldn't need locking primitives to access it, but must use a communication mechanism to access another CPU's or thread's data.
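A minimal sketch of data ownership (names and counts are illustrative): each thread updates only the counter slot it owns, so the fastpath needs no locks or atomics at all. Here the "communication mechanism" for reading another thread's data is simplified to summing after pthread_join, which supplies the required happens-before ordering; a real design would use a message queue or similar.

```c
#include <pthread.h>

/* Data Ownership: each thread touches only its own padded slot. */
#define NTHREADS 4

struct owned {
    long counter;
    char pad[64 - sizeof(long)];   /* keep slots on separate cache lines */
};
static struct owned slot[NTHREADS];

static void *worker(void *arg)
{
    int me = *(int *)arg;
    for (int i = 0; i < 1000000; i++)
        slot[me].counter++;        /* owner-only access: no lock needed */
    return NULL;
}

long run_ownership_demo(void)
{
    pthread_t tid[NTHREADS];
    static int id[NTHREADS];
    long sum = 0;

    for (int i = 0; i < NTHREADS; i++) {
        id[i] = i;
        pthread_create(&tid[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);  /* join = the communication point */
    for (int i = 0; i < NTHREADS; i++)
        sum += slot[i].counter;
    return sum;
}
```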
Parallel Fastpath
Pages: 520-521
To get speedups in programs that can't use aggressive locking patterns throughout, use an aggressive locking pattern for most of the workload (the fastpath) and a more conservative approach for the rest.
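A hedged sketch of a parallel fastpath (the batching counter and its constants are illustrative): the common case increments a per-thread counter with no locking (aggressive data ownership), and only the rare flush of a full batch takes the conservative global lock.

```c
#include <pthread.h>

/* Parallel Fastpath: lock-free owned counters, locked only on flush. */
#define NTHREADS 4
#define BATCH    1024

static pthread_mutex_t slow_lock = PTHREAD_MUTEX_INITIALIZER;
static long global_total;

struct local {
    long pending;
    char pad[64 - sizeof(long)];
};
static struct local local_cnt[NTHREADS];

static void count_event(int me)
{
    if (++local_cnt[me].pending >= BATCH) {  /* fastpath: owned data */
        pthread_mutex_lock(&slow_lock);      /* slowpath: 1 in BATCH calls */
        global_total += local_cnt[me].pending;
        pthread_mutex_unlock(&slow_lock);
        local_cnt[me].pending = 0;
    }
}

static void *worker(void *arg)
{
    int me = *(int *)arg;
    for (int i = 0; i < 100 * BATCH; i++)    /* multiple of BATCH: all flushed */
        count_event(me);
    return NULL;
}

long run_fastpath_demo(void)
{
    pthread_t tid[NTHREADS];
    static int id[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        id[i] = i;
        pthread_create(&tid[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return global_total;
}
```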
Reader/Writer Locking
Pages: 521-523
To speed up programs that rarely modify shared data, allow readers to proceed in parallel, but require writers to exclude both readers and other writers.
Hierarchical Locking
Pages: 523-524
To speed up programs whose updates are complex and expensive but whose infrequent structural operations, such as insertion and deletion, require coarse-grained locking, partition data structures into coarse- and fine-grained portions. For example, use a single lock for the internal nodes and links of a search tree, but a separate lock for each leaf.
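A hedged sketch using a linked list in place of a tree (all names are illustrative): one coarse lock guards the links for the rare structural changes, while each node carries its own lock for its payload, so expensive updates to different nodes can overlap.

```c
#include <stdlib.h>
#include <pthread.h>

/* Hierarchical Locking: coarse lock for structure, fine lock per node. */
struct node {
    struct node *next;         /* protected by list_lock */
    pthread_mutex_t lock;      /* protects value only */
    int key;
    long value;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct node *head;

void insert(int key)
{
    struct node *n = malloc(sizeof(*n));
    pthread_mutex_init(&n->lock, NULL);
    n->key = key;
    n->value = 0;
    pthread_mutex_lock(&list_lock);   /* coarse: structural change */
    n->next = head;
    head = n;
    pthread_mutex_unlock(&list_lock);
}

void update(int key, long delta)
{
    struct node *n;
    pthread_mutex_lock(&list_lock);   /* coarse lock only while traversing */
    for (n = head; n && n->key != key; n = n->next)
        ;
    if (n)
        pthread_mutex_lock(&n->lock); /* hand over to the fine lock */
    pthread_mutex_unlock(&list_lock);
    if (n) {
        n->value += delta;            /* long update holds only leaf lock */
        pthread_mutex_unlock(&n->lock);
    }
}

long get(int key)
{
    struct node *n;
    long v = -1;
    pthread_mutex_lock(&list_lock);
    for (n = head; n && n->key != key; n = n->next)
        ;
    if (n)
        pthread_mutex_lock(&n->lock);
    pthread_mutex_unlock(&list_lock);
    if (n) {
        v = n->value;
        pthread_mutex_unlock(&n->lock);
    }
    return v;
}
```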
Allocator Caches
Pages: 524-527
To get speedups in global memory allocators, create a per-CPU or per-process cache of data structure instances. A CPU owns the instances in its cache, so it can allocate and free them without incurring synchronization overhead or contention.
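A hedged sketch of an allocator cache (struct and constant names are illustrative): each owner keeps a small free list it alone touches, so the common alloc/free path takes no lock; only a cache miss or overflow touches the locked global pool.

```c
#include <stdlib.h>
#include <pthread.h>

/* Allocator Caches: lock-free per-owner free list over a locked pool. */
#define CACHE_MAX 8

struct block { struct block *next; };

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static struct block *global_pool;

struct cache {                 /* one per CPU/thread; owner-only access */
    struct block *head;
    int len;
};

struct block *cache_alloc(struct cache *c)
{
    struct block *b;

    if (c->head) {                     /* fastpath: owned cache, no lock */
        b = c->head;
        c->head = b->next;
        c->len--;
        return b;
    }
    pthread_mutex_lock(&pool_lock);    /* slowpath: refill from global pool */
    b = global_pool;
    if (b)
        global_pool = b->next;
    pthread_mutex_unlock(&pool_lock);
    return b ? b : malloc(sizeof(struct block));
}

void cache_free(struct cache *c, struct block *b)
{
    if (c->len < CACHE_MAX) {          /* fastpath: stash in owned cache */
        b->next = c->head;
        c->head = b;
        c->len++;
        return;
    }
    pthread_mutex_lock(&pool_lock);    /* slowpath: return to global pool */
    b->next = global_pool;
    global_pool = b;
    pthread_mutex_unlock(&pool_lock);
}
```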
Critical-Section Fusing
Pages: 527-529
To get speedups in programs with frequent, small critical sections on machines with high synchronization overheads, combine small critical sections into larger ones.
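A minimal before/after sketch (the counter is illustrative): when acquiring the lock costs more than the work inside it, one acquisition per batch amortizes the synchronization overhead, at the cost of holding the lock longer.

```c
#include <pthread.h>

/* Critical-Section Fusing: amortize lock overhead across a batch. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

/* Before: one lock round-trip per tiny critical section. */
void add_one_at_a_time(int n)
{
    for (int i = 0; i < n; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
}

/* After: the n small sections fused into one; a single round-trip. */
void add_batched(int n)
{
    pthread_mutex_lock(&lock);
    for (int i = 0; i < n; i++)
        counter++;
    pthread_mutex_unlock(&lock);
}

long get_counter(void)
{
    long v;
    pthread_mutex_lock(&lock);
    v = counter;
    pthread_mutex_unlock(&lock);
    return v;
}
```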
Critical-Section Partitioning
Pages: 529-530
To get speedups in programs with infrequent, large critical sections on machines with low synchronization overheads, split large critical sections into smaller ones.
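The inverse of fusing, in a minimal before/after sketch (names are illustrative): a long critical section that covers two independent structures is split so each gets its own short section under its own lock, and neither lock is held across unrelated work.

```c
#include <pthread.h>

/* Critical-Section Partitioning: split one long section into two short ones. */
static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t log_lock   = PTHREAD_MUTEX_INITIALIZER;
static long stats_hits, log_entries;

/* Before: one big section; log_entries is held hostage by stats_lock. */
void record_fused(void)
{
    pthread_mutex_lock(&stats_lock);
    stats_hits++;
    log_entries++;                 /* unrelated data under the wrong lock */
    pthread_mutex_unlock(&stats_lock);
}

/* After: each independent structure gets its own short section. */
void record_partitioned(void)
{
    pthread_mutex_lock(&stats_lock);
    stats_hits++;
    pthread_mutex_unlock(&stats_lock);

    pthread_mutex_lock(&log_lock);
    log_entries++;
    pthread_mutex_unlock(&log_lock);
}

long get_stats(void) { return stats_hits; }
long get_logs(void)  { return log_entries; }
```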