Parallel and High Performance Computing
Main Author:
Other Authors:
Format: eBook
Language: English
Published: New York : Manning Publications Co. LLC, 2021.
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009633549506719
Table of Contents:
- Intro
- Parallel and High Performance Computing
- Copyright
- Dedication
- contents
- front matter
- foreword
- Yulie Zamora, University of Chicago, Illinois
- How we came to write this book
- acknowledgments
- about this book
- Who should read this book
- Part 1 Introduction to parallel computing
- 1 Why parallel computing?
- 1.1 Why should you learn about parallel computing?
- 1.1.1 What are the potential benefits of parallel computing?
- 1.1.2 Parallel computing cautions
- 1.2 The fundamental laws of parallel computing
- 1.2.1 The limit to parallel computing: Amdahl's Law
- 1.2.2 Breaking through the parallel limit: Gustafson-Barsis's Law
- 1.3 How does parallel computing work?
- 1.3.1 Walking through a sample application
- 1.3.2 A hardware model for today's heterogeneous parallel systems
- 1.3.3 The application/software model for today's heterogeneous parallel systems
- 1.4 Categorizing parallel approaches
- 1.5 Parallel strategies
- 1.6 Parallel speedup versus comparative speedups: Two different measures
- 1.7 What will you learn in this book?
- 1.7.1 Additional reading
- 1.7.2 Exercises
- Summary
- 2 Planning for parallelization
- 2.1 Approaching a new project: The preparation
- 2.1.1 Version control: Creating a safety vault for your parallel code
- 2.1.2 Test suites: The first step to creating a robust, reliable application
- 2.1.3 Finding and fixing memory issues
- 2.1.4 Improving code portability
- 2.2 Profiling: Probing the gap between system capabilities and application performance
- 2.3 Planning: A foundation for success
- 2.3.1 Exploring with benchmarks and mini-apps
- 2.3.2 Design of the core data structures and code modularity
- 2.3.3 Algorithms: Redesign for parallel
- 2.4 Implementation: Where it all happens
- 2.5 Commit: Wrapping it up with quality
- 2.6 Further explorations
- 2.6.1 Additional reading
- 2.6.2 Exercises
- Summary
- 3 Performance limits and profiling
- 3.1 Know your application's potential performance limits
- 3.2 Determine your hardware capabilities: Benchmarking
- 3.2.1 Tools for gathering system characteristics
- 3.2.2 Calculating theoretical maximum flops
- 3.2.3 The memory hierarchy and theoretical memory bandwidth
- 3.2.4 Empirical measurement of bandwidth and flops
- 3.2.5 Calculating the machine balance between flops and bandwidth
- 3.3 Characterizing your application: Profiling
- 3.3.1 Profiling tools
- 3.3.2 Empirical measurement of processor clock frequency and energy consumption
- 3.3.3 Tracking memory during run time
- 3.4 Further explorations
- 3.4.1 Additional reading
- 3.4.2 Exercises
- Summary
- 4 Data design and performance models
- 4.1 Performance data structures: Data-oriented design
- 4.1.1 Multidimensional arrays
- 4.1.2 Array of Structures (AoS) versus Structures of Arrays (SoA)
- 4.1.3 Array of Structures of Arrays (AoSoA)
- 4.2 Three Cs of cache misses: Compulsory, capacity, conflict
- 4.3 Simple performance models: A case study
- 4.3.1 Full matrix data representations
- 4.3.2 Compressed sparse storage representations
- 4.4 Advanced performance models
- 4.5 Network messages
- 4.6 Further explorations
- 4.6.1 Additional reading
- 4.6.2 Exercises
- Summary
- 5 Parallel algorithms and patterns
- 5.1 Algorithm analysis for parallel computing applications
- 5.2 Performance models versus algorithmic complexity
- 5.3 Parallel algorithms: What are they?
- 5.4 What is a hash function?
- 5.5 Spatial hashing: A highly-parallel algorithm
- 5.5.1 Using perfect hashing for spatial mesh operations
- 5.5.2 Using compact hashing for spatial mesh operations
- 5.6 Prefix sum (scan) pattern and its importance in parallel computing
- 5.6.1 Step-efficient parallel scan operation
- 5.6.2 Work-efficient parallel scan operation
- 5.6.3 Parallel scan operations for large arrays
- 5.7 Parallel global sum: Addressing the problem of associativity
- 5.8 Future of parallel algorithm research
- 5.9 Further explorations
- 5.9.1 Additional reading
- 5.9.2 Exercises
- Summary
- Part 2 CPU: The parallel workhorse
- 6 Vectorization: FLOPs for free
- 6.1 Vectorization and single instruction, multiple data (SIMD) overview
- 6.2 Hardware trends for vectorization
- 6.3 Vectorization methods
- 6.3.1 Optimized libraries provide performance for little effort
- 6.3.2 Auto-vectorization: The easy way to vectorization speedup (most of the time)
- 6.3.3 Teaching the compiler through hints: Pragmas and directives
- 6.3.4 Crappy loops, we got them: Use vector intrinsics
- 6.3.5 Not for the faint of heart: Using assembler code for vectorization
- 6.4 Programming style for better vectorization
- 6.5 Compiler flags relevant for vectorization for various compilers
- 6.6 OpenMP SIMD directives for better portability
- 6.7 Further explorations
- 6.7.1 Additional reading
- 6.7.2 Exercises
- Summary
- 7 OpenMP that performs
- 7.1 OpenMP introduction
- 7.1.1 OpenMP concepts
- 7.1.2 A simple OpenMP program
- 7.2 Typical OpenMP use cases: Loop-level, high-level, and MPI plus OpenMP
- 7.2.1 Loop-level OpenMP for quick parallelization
- 7.2.2 High-level OpenMP for better parallel performance
- 7.2.3 MPI plus OpenMP for extreme scalability
- 7.3 Examples of standard loop-level OpenMP
- 7.3.1 Loop-level OpenMP: Vector addition example
- 7.3.2 Stream triad example
- 7.3.3 Loop-level OpenMP: Stencil example
- 7.3.4 Performance of loop-level examples
- 7.3.5 Reduction example of a global sum using OpenMP threading
- 7.3.6 Potential loop-level OpenMP issues
- 7.4 Variable scope importance for correctness in OpenMP
- 7.5 Function-level OpenMP: Making a whole function thread parallel
- 7.6 Improving parallel scalability with high-level OpenMP
- 7.6.1 How to implement high-level OpenMP
- 7.6.2 Example of implementing high-level OpenMP
- 7.7 Hybrid threading and vectorization with OpenMP
- 7.8 Advanced examples using OpenMP
- 7.8.1 Stencil example with a separate pass for the x and y directions
- 7.8.2 Kahan summation implementation with OpenMP threading
- 7.8.3 Threaded implementation of the prefix scan algorithm
- 7.9 Threading tools essential for robust implementations
- 7.9.1 Using Allinea/ARM MAP to get a quick high-level profile of your application
- 7.9.2 Finding your thread race conditions with Intel® Inspector
- 7.10 Example of a task-based support algorithm
- 7.11 Further explorations
- 7.11.1 Additional reading
- 7.11.2 Exercises
- Summary
- 8 MPI: The parallel backbone
- 8.1 The basics for an MPI program
- 8.1.1 Basic MPI function calls for every MPI program
- 8.1.2 Compiler wrappers for simpler MPI programs
- 8.1.3 Using parallel startup commands
- 8.1.4 Minimum working example of an MPI program
- 8.2 The send and receive commands for process-to-process communication
- 8.3 Collective communication: A powerful component of MPI
- 8.3.1 Using a barrier to synchronize timers
- 8.3.2 Using the broadcast to handle small file input
- 8.3.3 Using a reduction to get a single value from across all processes
- 8.3.4 Using gather to put order in debug printouts
- 8.3.5 Using scatter and gather to send data out to processes for work
- 8.4 Data parallel examples
- 8.4.1 Stream triad to measure bandwidth on the node
- 8.4.2 Ghost cell exchanges in a two-dimensional (2D) mesh
- 8.4.3 Ghost cell exchanges in a three-dimensional (3D) stencil calculation
- 8.5 Advanced MPI functionality to simplify code and enable optimizations
- 8.5.1 Using custom MPI data types for performance and code simplification
- 8.5.2 Cartesian topology support in MPI
- 8.5.3 Performance tests of ghost cell exchange variants
- 8.6 Hybrid MPI plus OpenMP for extreme scalability
- 8.6.1 The benefits of hybrid MPI plus OpenMP
- 8.6.2 MPI plus OpenMP example
- 8.7 Further explorations
- 8.7.1 Additional reading
- 8.7.2 Exercises
- Summary
- Part 3 GPUs: Built to accelerate
- 9 GPU architectures and concepts
- 9.1 The CPU-GPU system as an accelerated computational platform
- 9.1.1 Integrated GPUs: An underused option on commodity-based systems
- 9.1.2 Dedicated GPUs: The workhorse option
- 9.2 The GPU and the thread engine
- 9.2.1 The compute unit is the streaming multiprocessor (or subslice)
- 9.2.2 Processing elements are the individual processors
- 9.2.3 Multiple data operations by each processing element
- 9.2.4 Calculating the peak theoretical flops for some leading GPUs
- 9.3 Characteristics of GPU memory spaces
- 9.3.1 Calculating theoretical peak memory bandwidth
- 9.3.2 Measuring the GPU stream benchmark
- 9.3.3 Roofline performance model for GPUs
- 9.3.4 Using the mixbench performance tool to choose the best GPU for a workload
- 9.4 The PCI bus: CPU to GPU data transfer overhead
- 9.4.1 Theoretical bandwidth of the PCI bus
- 9.4.2 A benchmark application for PCI bandwidth
- 9.5 Multi-GPU platforms and MPI
- 9.5.1 Optimizing the data movement between GPUs across the network
- 9.5.2 A higher performance alternative to the PCI bus
- 9.6 Potential benefits of GPU-accelerated platforms
- 9.6.1 Reducing time-to-solution
- 9.6.2 Reducing energy use with GPUs
- 9.6.3 Reduction in cloud computing costs with GPUs
- 9.7 When to use GPUs
- 9.8 Further explorations
- 9.8.1 Additional reading
- 9.8.2 Exercises