Parallel and High Performance Computing
Main Author:
Other Authors:
Format: eBook
Language: English
Published: New York : Manning Publications Co. LLC, 2021.
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009633549506719
Table of Contents:
- Intro
- Parallel and High Performance Computing
- Copyright
- Dedication
- contents
- front matter
- foreword
- Yulie Zamora, University of Chicago, Illinois
- How we came to write this book
- acknowledgments
- about this book
- Who should read this book
- Part 1 Introduction to parallel computing
- 1 Why parallel computing?
- 1.1 Why should you learn about parallel computing?
- 1.1.1 What are the potential benefits of parallel computing?
- 1.1.2 Parallel computing cautions
- 1.2 The fundamental laws of parallel computing
- 1.2.1 The limit to parallel computing: Amdahl's Law
- 1.2.2 Breaking through the parallel limit: Gustafson-Barsis's Law
- 1.3 How does parallel computing work?
- 1.3.1 Walking through a sample application
- 1.3.2 A hardware model for today's heterogeneous parallel systems
- 1.3.3 The application/software model for today's heterogeneous parallel systems
- 1.4 Categorizing parallel approaches
- 1.5 Parallel strategies
- 1.6 Parallel speedup versus comparative speedups: Two different measures
- 1.7 What will you learn in this book?
- 1.7.1 Additional reading
- 1.7.2 Exercises
- Summary
- 2 Planning for parallelization
- 2.1 Approaching a new project: The preparation
- 2.1.1 Version control: Creating a safety vault for your parallel code
- 2.1.2 Test suites: The first step to creating a robust, reliable application
- 2.1.3 Finding and fixing memory issues
- 2.1.4 Improving code portability
- 2.2 Profiling: Probing the gap between system capabilities and application performance
- 2.3 Planning: A foundation for success
- 2.3.1 Exploring with benchmarks and mini-apps
- 2.3.2 Design of the core data structures and code modularity
- 2.3.3 Algorithms: Redesign for parallel
- 2.4 Implementation: Where it all happens
- 2.5 Commit: Wrapping it up with quality
- 2.6 Further explorations
- 2.6.1 Additional reading
- 2.6.2 Exercises
- Summary
- 3 Performance limits and profiling
- 3.1 Know your application's potential performance limits
- 3.2 Determine your hardware capabilities: Benchmarking
- 3.2.1 Tools for gathering system characteristics
- 3.2.2 Calculating theoretical maximum flops
- 3.2.3 The memory hierarchy and theoretical memory bandwidth
- 3.2.4 Empirical measurement of bandwidth and flops
- 3.2.5 Calculating the machine balance between flops and bandwidth
- 3.3 Characterizing your application: Profiling
- 3.3.1 Profiling tools
- 3.3.2 Empirical measurement of processor clock frequency and energy consumption
- 3.3.3 Tracking memory during run time
- 3.4 Further explorations
- 3.4.1 Additional reading
- 3.4.2 Exercises
- Summary
- 4 Data design and performance models
- 4.1 Performance data structures: Data-oriented design
- 4.1.1 Multidimensional arrays
- 4.1.2 Array of Structures (AoS) versus Structures of Arrays (SoA)
- 4.1.3 Array of Structures of Arrays (AoSoA)
- 4.2 Three Cs of cache misses: Compulsory, capacity, conflict
- 4.3 Simple performance models: A case study
- 4.3.1 Full matrix data representations
- 4.3.2 Compressed sparse storage representations
- 4.4 Advanced performance models
- 4.5 Network messages
- 4.6 Further explorations
- 4.6.1 Additional reading
- 4.6.2 Exercises
- Summary
- 5 Parallel algorithms and patterns
- 5.1 Algorithm analysis for parallel computing applications
- 5.2 Performance models versus algorithmic complexity
- 5.3 Parallel algorithms: What are they?
- 5.4 What is a hash function?
- 5.5 Spatial hashing: A highly-parallel algorithm
- 5.5.1 Using perfect hashing for spatial mesh operations
- 5.5.2 Using compact hashing for spatial mesh operations
- 5.6 Prefix sum (scan) pattern and its importance in parallel computing
- 5.6.1 Step-efficient parallel scan operation
- 5.6.2 Work-efficient parallel scan operation
- 5.6.3 Parallel scan operations for large arrays
- 5.7 Parallel global sum: Addressing the problem of associativity
- 5.8 Future of parallel algorithm research
- 5.9 Further explorations
- 5.9.1 Additional reading
- 5.9.2 Exercises
- Summary
- Part 2 CPU: The parallel workhorse
- 6 Vectorization: FLOPs for free
- 6.1 Vectorization and single instruction, multiple data (SIMD) overview
- 6.2 Hardware trends for vectorization
- 6.3 Vectorization methods
- 6.3.1 Optimized libraries provide performance for little effort
- 6.3.2 Auto-vectorization: The easy way to vectorization speedup (most of the time)
- 6.3.3 Teaching the compiler through hints: Pragmas and directives
- 6.3.4 Crappy loops, we got them: Use vector intrinsics
- 6.3.5 Not for the faint of heart: Using assembler code for vectorization
- 6.4 Programming style for better vectorization
- 6.5 Compiler flags relevant for vectorization for various compilers
- 6.6 OpenMP SIMD directives for better portability
- 6.7 Further explorations
- 6.7.1 Additional reading
- 6.7.2 Exercises
- Summary
- 7 OpenMP that performs
- 7.1 OpenMP introduction
- 7.1.1 OpenMP concepts
- 7.1.2 A simple OpenMP program
- 7.2 Typical OpenMP use cases: Loop-level, high-level, and MPI plus OpenMP
- 7.2.1 Loop-level OpenMP for quick parallelization
- 7.2.2 High-level OpenMP for better parallel performance
- 7.2.3 MPI plus OpenMP for extreme scalability
- 7.3 Examples of standard loop-level OpenMP
- 7.3.1 Loop-level OpenMP: Vector addition example
- 7.3.2 Stream triad example
- 7.3.3 Loop-level OpenMP: Stencil example
- 7.3.4 Performance of loop-level examples
- 7.3.5 Reduction example of a global sum using OpenMP threading
- 7.3.6 Potential loop-level OpenMP issues
- 7.4 Variable scope importance for correctness in OpenMP
- 7.5 Function-level OpenMP: Making a whole function thread parallel
- 7.6 Improving parallel scalability with high-level OpenMP
- 7.6.1 How to implement high-level OpenMP
- 7.6.2 Example of implementing high-level OpenMP
- 7.7 Hybrid threading and vectorization with OpenMP
- 7.8 Advanced examples using OpenMP
- 7.8.1 Stencil example with a separate pass for the x and y directions
- 7.8.2 Kahan summation implementation with OpenMP threading
- 7.8.3 Threaded implementation of the prefix scan algorithm
- 7.9 Threading tools essential for robust implementations
- 7.9.1 Using Allinea/ARM MAP to get a quick high-level profile of your application
- 7.9.2 Finding your thread race conditions with Intel® Inspector
- 7.10 Example of a task-based support algorithm
- 7.11 Further explorations
- 7.11.1 Additional reading
- 7.11.2 Exercises
- Summary
- 8 MPI: The parallel backbone
- 8.1 The basics for an MPI program
- 8.1.1 Basic MPI function calls for every MPI program
- 8.1.2 Compiler wrappers for simpler MPI programs
- 8.1.3 Using parallel startup commands
- 8.1.4 Minimum working example of an MPI program
- 8.2 The send and receive commands for process-to-process communication
- 8.3 Collective communication: A powerful component of MPI
- 8.3.1 Using a barrier to synchronize timers
- 8.3.2 Using the broadcast to handle small file input
- 8.3.3 Using a reduction to get a single value from across all processes
- 8.3.4 Using gather to put order in debug printouts
- 8.3.5 Using scatter and gather to send data out to processes for work
- 8.4 Data parallel examples
- 8.4.1 Stream triad to measure bandwidth on the node
- 8.4.2 Ghost cell exchanges in a two-dimensional (2D) mesh
- 8.4.3 Ghost cell exchanges in a three-dimensional (3D) stencil calculation
- 8.5 Advanced MPI functionality to simplify code and enable optimizations
- 8.5.1 Using custom MPI data types for performance and code simplification
- 8.5.2 Cartesian topology support in MPI
- 8.5.3 Performance tests of ghost cell exchange variants
- 8.6 Hybrid MPI plus OpenMP for extreme scalability
- 8.6.1 The benefits of hybrid MPI plus OpenMP
- 8.6.2 MPI plus OpenMP example
- 8.7 Further explorations
- 8.7.1 Additional reading
- 8.7.2 Exercises
- Summary
- Part 3 GPUs: Built to accelerate
- 9 GPU architectures and concepts
- 9.1 The CPU-GPU system as an accelerated computational platform
- 9.1.1 Integrated GPUs: An underused option on commodity-based systems
- 9.1.2 Dedicated GPUs: The workhorse option
- 9.2 The GPU and the thread engine
- 9.2.1 The compute unit is the streaming multiprocessor (or subslice)
- 9.2.2 Processing elements are the individual processors
- 9.2.3 Multiple data operations by each processing element
- 9.2.4 Calculating the peak theoretical flops for some leading GPUs
- 9.3 Characteristics of GPU memory spaces
- 9.3.1 Calculating theoretical peak memory bandwidth
- 9.3.2 Measuring the GPU stream benchmark
- 9.3.3 Roofline performance model for GPUs
- 9.3.4 Using the mixbench performance tool to choose the best GPU for a workload
- 9.4 The PCI bus: CPU to GPU data transfer overhead
- 9.4.1 Theoretical bandwidth of the PCI bus
- 9.4.2 A benchmark application for PCI bandwidth
- 9.5 Multi-GPU platforms and MPI
- 9.5.1 Optimizing the data movement between GPUs across the network
- 9.5.2 A higher performance alternative to the PCI bus
- 9.6 Potential benefits of GPU-accelerated platforms
- 9.6.1 Reducing time-to-solution
- 9.6.2 Reducing energy use with GPUs
- 9.6.3 Reduction in cloud computing costs with GPUs
- 9.7 When to use GPUs
- 9.8 Further explorations
- 9.8.1 Additional reading
- 9.8.2 Exercises