CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming
CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming shows how high-performance application developers can leverage the power of GPUs using Fortran, the familiar language of scientific computing and supercomputer performance benchmarking. The authors presu...
Other authors:
Format: Electronic book
Language: English
Published: Cambridge, MA : Morgan Kaufmann, [2024]
Edition: Second edition
Subjects:
View in the Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009840465706719
Table of Contents:
- Front Cover
- CUDA Fortran for Scientists and Engineers
- Copyright
- Contents
- Preface to the Second Edition
- Preface to the First Edition
- Acknowledgments
- 1 CUDA Fortran programming
- 1 Introduction
- 1.1 A brief history of GPU computing
- 1.2 Parallel computation
- 1.3 Basic concepts
- 1.3.1 A first CUDA Fortran program
- 1.3.1.1 CUDA Fortran compilation
- 1.3.2 Extending to larger arrays
- 1.3.3 Multidimensional arrays
- 1.3.4 Interfaces for device code
- 1.3.5 Managed data
- 1.3.6 Kernel loop directives and CUF kernels
- 1.4 Determining CUDA hardware features and limits
- 1.4.1 Choosing a device to run on
- 1.4.2 Floating point precision
- 1.4.2.1 Accommodating variable precision
- 1.5 Error handling
- 1.6 Compiling CUDA Fortran code
- 1.7 CUDA Driver, Toolkit, and compatibility
- 2 Correctness, accuracy, and debugging
- 2.1 Assessing correctness of results
- 2.1.1 Non-associativity of floating point arithmetic
- 2.1.2 Fused multiply-add
- 2.1.3 Flags affecting floating-point accuracy
- 2.2 Debugging
- 2.2.1 Printing from device code
- 2.2.2 Debugging with cuda-gdb
- 2.2.2.1 System requirements
- 2.2.2.2 Compilation
- 2.2.2.3 Setting breakpoints
- 2.2.2.4 Focus - software and hardware coordinates
- 2.2.2.5 CUDA activity status
- 2.2.2.6 Single-stepping in device code
- 2.2.2.7 Examining program state
- 2.2.3 compute-sanitizer
- 3 Performance measurement and metrics
- 3.1 Measuring execution time
- 3.1.1 Host-device synchronization and CPU timers
- 3.1.2 Timing via CUDA events
- 3.1.3 Nsight Systems command-line interface nsys
- 3.1.3.1 Nsight Systems graphical user interface nsys-ui
- 3.1.4 Customizing profiling with nvtx
- 3.1.4.1 Basic NVTX tooling interfaces
- 3.1.4.2 Advanced NVTX tooling interfaces
- 3.1.4.3 Automated NVTX instrumentation
- 3.2 Instruction, bandwidth, and latency bound kernels
- 3.3 Memory bandwidth
- 3.3.1 Theoretical peak bandwidth
- 3.3.2 Effective bandwidth
- 3.3.3 Actual data throughput vs. effective bandwidth
- 4 Synchronization
- 4.1 Synchronization of kernel execution and data transfers
- 4.1.1 Pageable versus pinned host memory
- 4.1.2 Streams
- 4.1.2.1 Creating streams
- 4.1.3 Asynchronous transfers via cudaMemcpyAsync()
- 4.1.4 Synchronization barriers
- 4.1.4.1 cudaDeviceSynchronize()
- 4.1.4.2 cudaStreamSynchronize()
- 4.1.4.3 cudaEventSynchronize()
- 4.1.4.4 Querying streams and events
- 4.1.5 Advanced stream topics
- 4.1.5.1 The default stream
- 4.1.5.2 Non-blocking streams
- 4.1.5.3 Stream priorities
- 4.2 Synchronization of kernel threads on the device
- 4.2.1 Shared memory
- 4.2.2 Synchronizing threads within a block
- 4.2.3 Warps, warp synchronization, and warp-level primitives
- 4.2.3.1 SHFL functions
- 4.2.4 Atomics
- 4.2.5 Memory fences
- 4.2.6 Cooperative groups
- 4.2.6.1 Grid synchronization
- 4.2.6.2 Thread block clusters
- 5 Optimization
- 5.1 Transfers between host and device
- 5.1.1 Pinned memory
- 5.2 Device memory
- 5.2.1 ECC (Error Correcting Code)
- 5.2.2 Global memory
- 5.2.2.1 Declaring global array arguments in kernels
- 5.2.2.2 Coalesced global memory access
- 5.2.3 Local memory
- 5.2.4 Constant memory
- 5.2.5 L1 and L2 caches
- 5.2.6 Shared memory
- 5.2.6.1 Configuring shared memory
- 5.2.6.2 Global memory coalescing through shared memory
- 5.2.6.3 Shared memory bank conflicts
- 5.2.7 Registers
- 5.2.7.1 Exchanging register data between threads in a warp
- 5.3 Execution configuration
- 5.3.1 Thread-level parallelism
- 5.3.2 Instruction-level parallelism
- 5.3.2.1 Asynchronous data transfers between global and shared memory
- 5.3.2.2 Instruction-level parallelism in CUF kernels
- 5.4 Instruction optimization
- 5.4.1 Device intrinsics
- 5.4.1.1 Directed rounding
- 5.4.1.2 C intrinsics
- 5.4.1.3 Fast math intrinsics
- 5.4.1.4 Compiler options
- 5.4.2 Divergent warps
- 6 Porting tips and techniques
- 6.1 CUF kernels
- 6.2 Conditional inclusion of code
- 6.3 Renaming variables
- 6.3.1 Renaming via use statements
- 6.3.2 Renaming via the associate construct
- 6.4 Minimizing memory footprint for work arrays
- 6.5 Array compaction
- 7 Interfacing with CUDA C code and CUDA libraries
- 7.1 Calling user-written CUDA C code
- 7.1.1 The ignore_tkr directive
- 7.2 cuBLAS
- 7.2.1 Legacy cuBLAS API
- 7.2.2 New cuBLAS API
- 7.2.3 Batched cuBLAS routines
- 7.2.4 GEMM with tensor cores
- 7.3 cuSPARSE
- 7.4 cuSOLVER
- 7.5 cuTENSOR
- 7.5.1 Low-level cuTENSOR interfaces
- 7.6 Thrust
- 8 Multi-GPU programming
- 8.1 CUDA multi-GPU features
- 8.1.1 Peer-to-peer communication
- 8.1.1.1 Requirements for peer-to-peer communication
- 8.1.2 Peer-to-peer direct transfers
- 8.1.3 Peer-to-peer transpose
- 8.2 Multi-GPU programming with MPI
- 8.2.1 Assigning devices to MPI ranks
- 8.2.2 MPI transpose
- 8.2.3 GPU-aware MPI transpose
- 2 Case studies
- 9 Monte Carlo method
- 9.1 CURAND
- 9.2 Computing π with CUF kernels
- 9.2.1 IEEE-754 precision
- 9.3 Computing π with reduction kernels
- 9.3.1 Reductions with SHFL instructions
- 9.3.2 Reductions with atomic locks
- 9.3.3 Reductions using the grid_group cooperative group
- 9.4 Accuracy of summation
- 9.5 Option pricing
- 10 Finite difference method
- 10.1 Nine-point 1D finite difference stencil
- 10.1.1 Data reuse and shared memory
- 10.1.2 The x-derivative kernel
- 10.1.2.1 Performance of the x-derivative kernel
- 10.1.3 Derivatives in y and z
- 10.1.4 Nonuniform grids
- 10.2 2D Laplace equation
- 11 Applications of the fast Fourier transform
- 11.1 CUFFT
- 11.2 Spectral derivatives
- 11.3 Convolution
- 11.4 Poisson solver
- 11.4.1 Vortex dynamics
- 12 Ray tracing
- 12.1 Generating an image file
- PPM file format
- Implementation
- 12.2 Vectors in CUDA Fortran
- Implementation of the RGB module
- 12.3 Rays, a simple camera, and background
- First ray-tracing code
- 12.4 Adding a sphere
- Theory
- Implementation
- 12.5 Surface normals and multiple objects
- Surface normals
- Multiple objects
- 12.6 Antialiasing
- Implementation
- CUDA implementation
- 12.7 Material types
- 12.7.1 Diffuse materials
- Implementation
- Random number generation
- Modifications to the color() routine
- 12.7.2 Metal
- Theory
- Implementation
- Material derived types
- Initialization of spheres and material types
- The scatter() function and Cray pointers
- 12.7.3 Dielectrics
- Theory
- Implementation
- 12.8 Positionable camera
- 12.9 Defocus blur
- Implementation
- 12.10 Where next?
- 12.11 Triangles
- Theory
- Implementation
- 12.12 Lights
- Implementation
- 12.13 Textures
- Theory
- Implementation
- 3 Appendices
- A System and environment management
- A.1 Environment variables
- A.1.1 General
- A.1.2 Just-in-time compilation
- A.2 nvidia-smi - System Management Interface
- A.2.1 Enabling and disabling ECC
- A.2.2 Compute mode
- A.2.3 Persistence mode
- A.2.4 Topology
- References
- Index
- Back Cover