Preface
About the authors
List of acronyms and abbreviations
1 Modern processors
1.1 Stored-program computer architecture
1.2 General-purpose cache-based microprocessor architecture
1.2.1 Performance metrics and benchmarks
1.2.2 Transistors galore: Moore’s Law
1.2.3 Pipelining
1.2.4 Superscalarity
1.2.5 SIMD
1.3 Memory hierarchies
1.3.1 Cache
1.3.2 Cache mapping
1.3.3 Prefetch
1.4 Multicore processors
1.5 Multithreaded processors
1.6 Vector processors
1.6.1 Design principles
1.6.2 Maximum performance estimates
1.6.3 Programming for vector architectures
2 Basic optimization techniques for serial code
2.1 Scalar profiling
2.1.1 Function- and line-based runtime profiling
2.1.2 Hardware performance counters
2.2 Common sense optimizations
2.2.1 Do less work!
2.2.2 Avoid expensive operations!
2.2.3 Shrink the working set!
2.3 Simple measures, large impact
2.3.1 Elimination of common subexpressions
2.3.2 Avoiding branches
2.3.3 Using SIMD instruction sets
2.4 The role of compilers
2.4.1 General optimization options
2.4.2 Inlining
2.4.3 Aliasing
2.4.4 Computational accuracy
2.4.5 Register optimizations
2.4.6 Using compiler logs
2.5 C++ optimizations
2.5.1 Temporaries
2.5.2 Dynamic memory management
2.5.3 Loop kernels and iterators
3 Data access optimization
3.1 Balance analysis and lightspeed estimates
3.1.1 Bandwidth-based performance modeling
3.1.2 The STREAM benchmarks
3.2 Storage order
3.3 Case study: The Jacobi algorithm
3.4 Case study: Dense matrix transpose
3.5 Algorithm classification and access optimizations
3.5.1 O(N)/O(N)
3.5.2 O(N²)/O(N²)
3.5.3 O(N³)/O(N²)
3.6 Case study: Sparse matrix-vector multiply
3.6.1 Sparse matrix storage schemes
3.6.2 Optimizing JDS sparse MVM
4 Parallel computers
4.1 Taxonomy of parallel computing paradigms
4.2 Shared-memory computers
4.2.1 Cache coherence
4.2.2 UMA
4.2.3 ccNUMA
4.3 Distributed-memory computers
4.4 Hierarchical (hybrid) systems
4.5 Networks
4.5.1 Basic performance characteristics of networks
4.5.2 Buses
4.5.3 Switched and fat-tree networks
4.5.4 Mesh networks
4.5.5 Hybrids
5 Basics of parallelization
5.1 Why parallelize?
5.2 Parallelism
5.2.1 Data parallelism
5.2.2 Functional parallelism
5.3 Parallel scalability
5.3.1 Factors that limit parallel execution
5.3.2 Scalability metrics
5.3.3 Simple scalability laws
5.3.4 Parallel efficiency
5.3.5 Serial performance versus strong scalability
5.3.6 Refined performance models
5.3.7 Choosing the right scaling baseline
5.3.8 Case study: Can slower processors compute faster?
5.3.9 Load imbalance
6 Shared-memory parallel programming with OpenMP
6.1 Short introduction to OpenMP
6.1.1 Parallel execution
6.1.2 Data scoping
6.1.3 OpenMP worksharing for loops
6.1.4 Synchronization
6.1.5 Reductions
6.1.6 Loop scheduling
6.1.7 Tasking
6.1.8 Miscellaneous
6.2 Case study: OpenMP-parallel Jacobi algorithm
6.3 Advanced OpenMP: Wavefront parallelization
7 Efficient OpenMP programming
7.1 Profiling OpenMP programs
7.2 Performance pitfalls
7.2.1 Ameliorating the impact of OpenMP worksharing constructs
7.2.2 Determining OpenMP overhead for short loops
7.2.3 Serialization
7.2.4 False sharing
7.3 Case study: Parallel sparse matrix-vector multiply
8 Locality optimizations on ccNUMA architectures
8.1 Locality of access on ccNUMA
8.1.1 Page placement by first touch
8.1.2 Access locality by other means
8.2 Case study: ccNUMA optimization of sparse MVM
8.3 Placement pitfalls
8.3.1 NUMA-unfriendly OpenMP scheduling
8.3.2 File system cache
8.4 ccNUMA issues with C++
8.4.1 Arrays of objects
8.4.2 Standard Template Library
9 Distributed-memory parallel programming with MPI
9.1 Message passing
9.2 A short introduction to MPI
9.2.1 A simple example
9.2.2 Messages and point-to-point communication
9.2.3 Collective communication
9.2.4 Nonblocking point-to-point communication
9.2.5 Virtual topologies
9.3 Example: MPI parallelization of a Jacobi solver
9.3.1 MPI implementation
9.3.2 Performance properties
10 Efficient MPI programming
10.1 MPI performance tools
10.2 Communication parameters
10.3 Synchronization, serialization, contention
10.3.1 Implicit serialization and synchronization
10.3.2 Contention
10.4 Reducing communication overhead
10.4.1 Optimal domain decomposition
10.4.2 Aggregating messages
10.4.3 Nonblocking vs. asynchronous communication
10.4.4 Collective communication
10.5 Understanding intranode point-to-point communication
11 Hybrid parallelization with MPI and OpenMP
11.1 Basic MPI/OpenMP programming models
11.1.1 Vector mode implementation
11.1.2 Task mode implementation
11.1.3 Case study: Hybrid Jacobi solver
11.2 MPI taxonomy of thread interoperability
11.3 Hybrid decomposition and mapping
11.4 Potential benefits and drawbacks of hybrid programming
A Topology and affinity in multicore environments
A.1 Topology
A.2 Thread and process placement
A.2.1 External affinity control
A.2.2 Affinity under program control
A.3 Page placement beyond first touch
B Solutions to the problems
Bibliography