Multicore and GPU Programming: An Integrated Approach
Multicore and GPU Programming offers broad coverage of the key parallel computing skillsets: multicore CPU programming and manycore "massively parallel" computing. Using threads, OpenMP, MPI, and CUDA, it teaches the design and development of software capable of taking advantage of today’s...
| Field | Value |
|---|---|
| Other Authors: | |
| Format: | Electronic book |
| Language: | English |
| Published: | Amsterdam : Morgan Kaufmann, [2015] |
| Edition: | First edition |
| Subjects: | |
| View at Biblioteca Universitat Ramon Llull: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009629823706719 |
Table of Contents:
- Front Cover
- Multicore and GPU Programming: An Integrated Approach
- Copyright
- Dedication
- Contents
- List of Tables
- Preface
- What Is in This Book
- Using This Book as a Textbook
- Software and Hardware Requirements
- Sample Code
- Chapter 1: Introduction
- 1.1 The era of multicore machines
- 1.2 A taxonomy of parallel machines
- 1.3 A glimpse of contemporary computing machines
- 1.3.1 The Cell BE processor
- 1.3.2 Nvidia's Kepler
- 1.3.3 AMD's APUs
- 1.3.4 Multicore to many-core: Tilera's TILE-Gx8072 and Intel's Xeon Phi
- 1.4 Performance metrics
- 1.5 Predicting and measuring parallel program performance
- 1.5.1 Amdahl's law
- 1.5.2 Gustafson-Barsis's rebuttal
- Exercises
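
For orientation, the two performance laws listed under Section 1.5 are conventionally stated as follows. This is the standard textbook notation, not necessarily the book's own symbols.

```latex
% Amdahl's law: speedup on N processors when a fraction f of the
% sequential run time can be parallelized.
S_A(N) = \frac{1}{(1 - f) + \frac{f}{N}}

% Gustafson-Barsis's scaled speedup, where s is the serial fraction
% measured on the N-processor run (the workload grows with N).
S_G(N) = N - (N - 1)\, s
```
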
- Chapter 2: Multicore and parallel program design
- 2.1 Introduction
- 2.2 The PCAM methodology
- 2.3 Decomposition patterns
- 2.3.1 Task parallelism
- 2.3.2 Divide-and-conquer decomposition
- 2.3.3 Geometric decomposition
- 2.3.4 Recursive data decomposition
- 2.3.5 Pipeline decomposition
- 2.3.6 Event-based coordination decomposition
- 2.4 Program structure patterns
- 2.4.1 Single-program, multiple-data
- 2.4.2 Multiple-program, multiple-data
- 2.4.3 Master-worker
- 2.4.4 Map-reduce
- 2.4.5 Fork/join
- 2.4.6 Loop parallelism
- 2.5 Matching decomposition patterns with program structure patterns
- Exercises
- Chapter 3: Shared-memory programming: threads
- 3.1 Introduction
- 3.2 Threads
- 3.2.1 What is a thread?
- 3.2.2 What are threads good for?
- 3.2.3 Thread creation and initialization
- 3.2.3.1 Implicit thread creation
- 3.2.4 Sharing data between threads
- 3.3 Design concerns
- 3.4 Semaphores
- 3.5 Applying semaphores in classical problems
- 3.5.1 Producers-consumers
- 3.5.2 Dealing with termination
- 3.5.2.1 Termination using a shared data item
- 3.5.2.2 Termination using messages
- 3.5.3 The barbershop problem: introducing fairness
- 3.5.4 Readers-writers
- 3.5.4.1 A solution favoring the readers
- 3.5.4.2 Giving priority to the writers
- 3.5.4.3 A fair solution
- 3.6 Monitors
- 3.6.1 Design approach 1: critical section inside the monitor
- 3.6.2 Design approach 2: monitor controls entry to critical section
- 3.7 Applying monitors in classical problems
- 3.7.1 Producers-consumers revisited
- 3.7.1.1 Producers-consumers: buffer manipulation within the monitor
- 3.7.1.2 Producers-consumers: buffer insertion/extraction exterior to the monitor
- 3.7.2 Readers-writers
- 3.7.2.1 A solution favoring the readers
- 3.7.2.2 Giving priority to the writers
- 3.7.2.3 A fair solution
- 3.8 Dynamic vs. static thread management
- 3.8.1 Qt's thread pool
- 3.8.2 Creating and managing a pool of threads
- 3.9 Debugging multithreaded applications
- 3.10 Higher-level constructs: multithreaded programming without threads
- 3.10.1 Concurrent map
- 3.10.2 Map-reduce
- 3.10.3 Concurrent filter
- 3.10.4 Filter-reduce
- 3.10.5 A case study: multithreaded sorting
- 3.10.6 A case study: multithreaded image matching
- Exercises
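
To give a flavor of the producers-consumers material listed above (Sections 3.5 and 3.7), here is a minimal monitor-style bounded buffer written with standard C++ threads. It is an illustrative sketch only, not code from the book, whose own examples are built around semaphores, monitors, and Qt's threading classes (cf. Section 3.8.1 and Appendix A); all identifiers are mine.

```cpp
// Illustrative sketch: a monitor-style bounded buffer with one producer and
// one consumer, using standard C++ threads. Not code from the book.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

std::queue<int> buffer;                 // shared bounded buffer
const std::size_t CAPACITY = 8;
std::mutex m;
std::condition_variable notFull, notEmpty;

void producer(int count) {
    for (int i = 0; i < count; ++i) {
        std::unique_lock<std::mutex> lock(m);
        notFull.wait(lock, [] { return buffer.size() < CAPACITY; });
        buffer.push(i);                 // deposit an item
        notEmpty.notify_one();
    }
}

void consumer(int count) {
    for (int i = 0; i < count; ++i) {
        std::unique_lock<std::mutex> lock(m);
        notEmpty.wait(lock, [] { return !buffer.empty(); });
        int item = buffer.front();      // extract an item
        buffer.pop();
        notFull.notify_one();
        lock.unlock();
        std::printf("consumed %d\n", item);
    }
}

int main() {
    const int N = 100;                  // fixed item count handles termination
    std::thread p(producer, N), c(consumer, N);
    p.join();
    c.join();
    return 0;
}
```
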
- Chapter 4: Shared-memory programming: OpenMP
- 4.1 Introduction
- 4.2 Your First OpenMP Program
- 4.3 Variable Scope
- 4.3.1 OpenMP Integration V.0: Manual Partitioning
- 4.3.2 OpenMP Integration V.1: Manual Partitioning Without a Race Condition
- 4.3.3 OpenMP Integration V.2: Implicit Partitioning with Locking
- 4.3.4 OpenMP Integration V.3: Implicit Partitioning with Reduction
- 4.3.5 Final Words on Variable Scope
- 4.4 Loop-Level Parallelism
- 4.4.1 Data Dependencies
- 4.4.1.1 Flow Dependencies
- 4.4.1.2 Antidependencies
- 4.4.1.3 Output Dependencies
- 4.4.2 Nested Loops
- 4.4.3 Scheduling
- 4.5 Task Parallelism
- 4.5.1 The sections Directive
- 4.5.1.1 Producers-Consumers in OpenMP
- 4.5.2 The task Directive
- 4.6 Synchronization Constructs
- 4.7 Correctness and Optimization Issues
- 4.7.1 Thread Safety
- 4.7.2 False Sharing
- 4.8 A Case Study: Sorting in OpenMP
- 4.8.1 Bottom-Up Mergesort in OpenMP
- 4.8.2 Top-Down Mergesort in OpenMP
- 4.8.3 Performance Comparison
- Exercises
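
As a taste of the loop-level parallelism and reduction topics listed above (Sections 4.3.4 and 4.4), here is a minimal OpenMP parallel sum. It is an illustrative sketch, not code from the book; the data and identifiers are mine.

```cpp
// Illustrative sketch: a parallel sum using an OpenMP reduction clause.
// Compile with, e.g., g++ -fopenmp. Not code from the book.
#include <cstdio>
#include <vector>

int main() {
    const int N = 1000000;
    std::vector<double> data(N, 0.5);   // dummy input data

    double sum = 0.0;
    // Each thread accumulates a private partial sum; OpenMP combines them.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum += data[i];

    std::printf("sum = %f\n", sum);
    return 0;
}
```
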
- Chapter 5: Distributed memory programming
- 5.1 Communicating Processes
- 5.2 MPI
- 5.3 Core concepts
- 5.4 Your first MPI program
- 5.5 Program architecture
- 5.5.1 SPMD
- 5.5.2 MPMD
- 5.6 Point-to-Point communication
- 5.7 Alternative Point-to-Point communication modes
- 5.7.1 Buffered Communications
- 5.8 Non-blocking communications
- 5.9 Point-to-Point Communications: Summary
- 5.10 Error reporting and handling
- 5.11 Collective communications
- 5.11.1 Scattering
- 5.11.2 Gathering
- 5.11.3 Reduction
- 5.11.4 All-to-All Gathering
- 5.11.5 All-to-All Scattering
- 5.11.6 All-to-All Reduction
- 5.11.7 Global Synchronization
- 5.12 Communicating objects
- 5.12.1 Derived Datatypes
- 5.12.2 Packing/Unpacking
- 5.13 Node management: communicators and groups
- 5.13.1 Creating Groups
- 5.13.2 Creating Intra-Communicators
- 5.14 One-sided communications
- 5.14.1 RMA Communication Functions
- 5.14.2 RMA Synchronization Functions
- 5.15 I/O considerations
- 5.16 Combining MPI processes with threads
- 5.17 Timing and Performance Measurements
- 5.18 Debugging and profiling MPI programs
- 5.19 The Boost.MPI library
- 5.19.1 Blocking and Non-Blocking Communications
- 5.19.2 Data Serialization
- 5.19.3 Collective Operations
- 5.20 A case study: diffusion-limited aggregation
- 5.21 A case study: brute-force encryption cracking
- 5.21.1 Version #1: "plain-vanilla" MPI
- 5.21.2 Version #2: combining MPI and OpenMP
- 5.22 A Case Study: MPI Implementation of the Master-Worker Pattern
- 5.22.1 A Simple Master-Worker Setup
- 5.22.2 A Multithreaded Master-Worker Setup
- Exercises
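
For a flavor of the point-to-point and collective material listed above (e.g., Sections 5.4 and 5.11.3), here is a minimal SPMD program in which every rank contributes one integer to a sum reduced onto rank 0. It is an illustrative sketch, not code from the book; identifiers are mine.

```cpp
// Illustrative sketch: a minimal MPI reduction. Build with mpicxx and
// run with mpirun -np 4 ./a.out (for example). Not code from the book.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int contribution = rank;            // each rank's value
    int total = 0;
    // Collective reduction: rank 0 receives the sum of all contributions.
    MPI_Reduce(&contribution, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("Sum of ranks 0..%d = %d\n", size - 1, total);

    MPI_Finalize();
    return 0;
}
```
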
- Chapter 6: GPU programming
- 6.1 GPU Programming
- 6.2 CUDA's programming model: threads, blocks, and grids
- 6.3 CUDA's execution model: streaming multiprocessors and warps
- 6.4 CUDA compilation process
- 6.5 Putting together a CUDA project
- 6.6 Memory hierarchy
- 6.6.1 Local Memory/Registers
- 6.6.2 Shared Memory
- 6.6.3 Constant Memory
- 6.6.4 Texture and Surface Memory
- 6.7 Optimization techniques
- 6.7.1 Block and Grid Design
- 6.7.2 Kernel Structure
- 6.7.3 Shared Memory Access
- 6.7.4 Global Memory Access
- 6.7.5 Page-Locked and Zero-Copy Memory
- 6.7.6 Unified Memory
- 6.7.7 Asynchronous Execution and Streams
- 6.7.7.1 Stream Synchronization: Events and Callbacks
- 6.8 Dynamic parallelism
- 6.9 Debugging CUDA programs
- 6.10 Profiling CUDA programs
- 6.11 CUDA and MPI
- 6.12 Case studies
- 6.12.1 Fractal Set Calculation
- 6.12.1.1 Version #1: One thread per pixel
- 6.12.1.2 Version #2: Pinned host and pitched device memory
- 6.12.1.3 Version #3: Multiple pixels per thread
- 6.12.1.4 Evaluation
- 6.12.2 Block Cipher Encryption
- 6.12.2.1 Version #1: The case of a standalone GPU machine
- 6.12.2.2 Version #2: Overlapping GPU communication and computation
- 6.12.2.3 Version #3: Using a cluster of GPU machines
- 6.12.2.4 Evaluation
- Exercises
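
As a flavor of the thread/block/grid and memory-hierarchy topics listed above (Sections 6.2, 6.6, and 6.7.6), here is the canonical one-thread-per-element vector-addition kernel using managed memory. It is an illustrative CUDA C++ sketch, not code from the book; identifiers and sizes are mine.

```cuda
// Illustrative sketch: vector addition with one thread per element,
// using unified (managed) memory. Not code from the book.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail block
        c[i] = a[i] + b[i];
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);       // managed memory keeps the sketch short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    const int block = 256;
    const int grid = (N + block - 1) / block;  // enough blocks to cover N
    vecAdd<<<grid, block>>>(a, b, c, N);
    cudaDeviceSynchronize();

    std::printf("c[0] = %f\n", c[0]);   // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```
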
- Chapter 7: The Thrust template library
- 7.1 Introduction
- 7.2 First steps in Thrust
- 7.3 Working with Thrust datatypes
- 7.4 Thrust algorithms
- 7.4.1 Transformations
- 7.4.2 Sorting and searching
- 7.4.3 Reductions
- 7.4.4 Scans/prefix sums
- 7.4.5 Data management and manipulation
- 7.5 Fancy iterators
- 7.6 Switching device back ends
- 7.7 Case studies
- 7.7.1 Monte Carlo integration
- 7.7.2 DNA Sequence alignment
- Exercises
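
To illustrate the container-and-algorithm style covered above (Sections 7.3-7.4), here is a minimal Thrust sketch that sorts and reduces a device vector. It is illustrative only, not code from the book; the data is made up.

```cpp
// Illustrative sketch: sorting and reducing on the device with Thrust.
// Build with nvcc. Not code from the book.
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>

int main() {
    thrust::host_vector<int> h(1000);
    for (int i = 0; i < (int)h.size(); ++i)
        h[i] = ((int)h.size() - i) % 37;          // some unsorted data

    thrust::device_vector<int> d = h;             // copy host -> device
    thrust::sort(d.begin(), d.end());             // parallel sort on the device
    int sum = thrust::reduce(d.begin(), d.end(), 0);  // parallel reduction

    std::printf("min = %d, sum = %d\n", (int)d[0], sum);
    return 0;
}
```
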
- Chapter 8: Load balancing
- 8.1 Introduction
- 8.2 Dynamic load balancing: the Linda legacy
- 8.3 Static Load Balancing: The Divisible Load Theory Approach
- 8.3.1 Modeling Costs
- 8.3.2 Communication Configuration
- 8.3.3 Analysis
- 8.3.3.1 N-Port, Block-Type, Single-Installment Solution
- 8.3.3.2 One-Port, Block-Type, Single-Installment Solution
- 8.3.4 Summary - Short Literature Review
- 8.4 DLTlib: A library for partitioning workloads
- 8.5 Case studies
- 8.5.1 Hybrid Computation of a Mandelbrot Set "Movie": A Case Study in Dynamic Load Balancing
- 8.5.2 Distributed Block Cipher Encryption: A Case Study in Static Load Balancing
- Appendix A: Compiling Qt programs
- A.1 Using an IDE
- A.2 The qmake Utility
- Appendix B: Running MPI programs
- B.1 Preparatory Steps
- B.2 Computing Nodes Discovery for MPI Program Deployment
- B.2.1 Host Discovery with the nmap Utility
- B.2.2 Automatic Generation of a Hostfile
- Appendix C: Time measurement
- C.1 Introduction
- C.2 POSIX High-Resolution Timing
- C.3 Timing in Qt
- C.4 Timing in OpenMP
- C.5 Timing in MPI
- C.6 Timing in CUDA
- Appendix D: Boost.MPI
- D.1 Mapping from MPI C to Boost.MPI
- Appendix E: Setting up CUDA
- E.1 Installation
- E.2 Issues with GCC
- E.3 Running CUDA without an Nvidia GPU
- E.4 Running CUDA on Optimus-Equipped Laptops
- E.5 Combining CUDA with Third-Party Libraries
- Appendix F: DLTlib
- F.1 DLTlib Functions
- F.1.1 Class Network: Generic Methods
- F.1.2 Class Network: Query Processing
- F.1.3 Class Network: Image Processing
- F.1.4 Class Network: Image Registration
- F.2 DLTlib Files
- Glossary
- Bibliography
- Index