1 Introduction
1.1 Classical Use of Parallelism
1.2 Parallelism in Today’s Hardware
1.3 Basic Concepts
1.4 Overview of the Book
2 Parallel Computer Architecture
2.1 Processor Architecture and Technology Trends
2.2 Flynn’s Taxonomy of Parallel Architectures
2.3 Memory Organization of Parallel Computers
2.3.1 Computers with Distributed Memory Organization
2.3.2 Computers with Shared Memory Organization
2.3.3 Reducing Memory Access Times
2.4 Thread-Level Parallelism
2.4.1 Simultaneous Multithreading
2.4.2 Energy Consumption of Processors
2.4.3 Multicore Processors
2.4.4 Architecture of Multicore Processors
2.4.5 Example: Architecture of the Intel Core i7
2.5 Interconnection Networks
2.5.1 Properties of Interconnection Networks
2.5.2 Direct Interconnection Networks
2.5.3 Embeddings
2.5.4 Dynamic Interconnection Networks
2.6 Routing and Switching
2.6.1 Routing Algorithms
2.6.2 Routing in the Omega Network
2.6.3 Switching
2.6.4 Flow Control Mechanisms
2.7 Caches and Memory Hierarchy
2.7.1 Characteristics of Caches
2.7.2 Write Policy
2.7.3 Cache Coherency
2.7.4 Memory Consistency
2.8 Example: IBM Blue Gene Supercomputer
2.9 Exercises for Chapter 2
3 Parallel Programming Models
3.1 Models for Parallel Systems
3.2 Parallelization of Programs
3.3 Levels of Parallelism
3.3.1 Parallelism at Instruction Level
3.3.2 Data Parallelism
3.3.3 Loop Parallelism
3.3.4 Functional Parallelism
3.3.5 Explicit and Implicit Representation of Parallelism
3.3.6 Parallel Programming Patterns
3.4 SIMD Computations
3.4.1 Execution of Vector Operations
3.4.2 SIMD Instructions
3.5 Data Distributions for Arrays
3.5.1 Data Distribution for One-Dimensional Arrays
3.5.2 Data Distribution for Two-Dimensional Arrays
3.5.3 Parameterized Data Distribution
3.6 Information Exchange
3.6.1 Shared Variables
3.6.2 Communication Operations
3.7 Parallel Matrix-Vector Product
3.7.1 Parallel Computation of Scalar Products
3.7.2 Parallel Computation of the Linear Combinations
3.8 Processes and Threads
3.8.1 Processes
3.8.2 Threads
3.8.3 Synchronization Mechanisms
3.8.4 Developing Efficient and Correct Thread Programs
3.9 Further Parallel Programming Approaches
3.9.1 Approaches for New Parallel Languages
3.9.2 Transactional Memory
3.10 Exercises for Chapter 3
4 Performance Analysis of Parallel Programs
4.1 Performance Evaluation of Computer Systems
4.1.1 Evaluation of CPU Performance
4.1.2 MIPS and MFLOPS
4.1.3 Performance of Processors with a Memory Hierarchy
4.1.4 Benchmark Programs
4.2 Performance Metrics for Parallel Programs
4.2.1 Speedup and Efficiency
4.2.2 Scalability of Parallel Programs
4.3 Asymptotic Times for Global Communication
4.3.1 Implementing Global Communication Operations
4.3.2 Communication Operations on a Hypercube
4.4 Analysis of Parallel Execution Times
4.4.1 Parallel Scalar Product
4.4.2 Parallel Matrix-Vector Product
4.5 Parallel Computational Models
4.5.1 PRAM Model
4.5.2 BSP Model
4.5.3 LogP Model
4.6 Loop Scheduling and Loop Tiling
4.6.1 Loop Scheduling
4.6.2 Loop Tiling
4.7 Exercises for Chapter 4
5 Message-Passing Programming
5.1 Introduction to MPI
5.1.1 MPI Point-to-Point Communication
5.1.2 Deadlocks with Point-to-Point Communications
5.1.3 Nonblocking Operations and Communication Modes
5.1.4 Communication Mode
5.2 Collective Communication Operations
5.2.1 Collective Communication in MPI
5.2.2 Deadlocks with Collective Communication
5.3 Process Groups and Communicators
5.3.1 Process Groups in MPI
5.3.2 Process Topologies
5.3.3 Timings and Aborting Processes
5.4 Introduction to MPI-2
5.4.1 Dynamic Process Generation and Management
5.4.2 One-Sided Communication
5.5 Exercises for Chapter 5
6 Thread Programming
6.1 Programming with Pthreads
6.1.1 Creating and Merging Threads
6.1.2 Thread Coordination with Pthreads
6.1.3 Condition Variables
6.1.4 Extended Lock Mechanism
6.1.5 One-Time Initialization
6.1.6 Implementation of a Task Pool
6.1.7 Parallelism by Pipelining
6.1.8 Implementation of a Client-Server Model
6.1.9 Thread Attributes and Cancellation
6.1.10 Thread Scheduling with Pthreads
6.1.11 Priority Inversion
6.1.12 Thread-specific Data
6.2 Java Threads
6.2.1 Thread Generation in Java
6.2.2 Synchronization of Java Threads
6.2.3 Wait and Notify
6.2.4 Extended Synchronization Patterns
6.2.5 Thread Scheduling in Java
6.2.6 Package java.util.concurrent
6.3 OpenMP
6.3.1 Compiler Directives
6.3.2 Execution Environment Routines
6.3.3 Coordination and Synchronization of Threads
6.4 Exercises for Chapter 6
7 General Purpose GPU Programming
7.1 The Architecture of GPUs
7.2 Introduction to CUDA Programming
7.3 Synchronization and Shared Memory
7.4 CUDA Thread Scheduling
7.5 Efficient Memory Access and Tiling Technique
7.6 Introduction to OpenCL
7.7 Exercises for Chapter 7
8 Algorithms for Systems of Linear Equations
8.1 Gaussian Elimination
8.1.1 Gaussian Elimination and LU Decomposition
8.1.2 Parallel Row-Cyclic Implementation
8.1.3 Parallel Implementation with Checkerboard Distribution
8.1.4 Analysis of the Parallel Execution Time
8.2 Direct Methods for Linear Systems with Banded Structure
8.2.1 Discretization of the Poisson Equation
8.2.2 Tridiagonal Systems
8.2.3 Generalization to Banded Matrices
8.2.4 Solving the Discretized Poisson Equation
8.3 Iterative Methods for Linear Systems
8.3.1 Standard Iteration Methods
8.3.2 Parallel Implementation of the Jacobi Iteration
8.3.3 Parallel Implementation of the Gauss-Seidel Iteration
8.3.4 Gauss-Seidel Iteration for Sparse Systems
8.3.5 Red-Black Ordering
8.4 Conjugate Gradient Method
8.4.1 Sequential CG method
8.4.2 Parallel CG Method
8.5 Cholesky Factorization for Sparse Matrices
8.5.1 Sequential Algorithm
8.5.2 Storage Scheme for Sparse Matrices
8.5.3 Implementation for Shared Variables
8.6 Exercises for Chapter 8
References
Index