CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming

CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming shows how high-performance application developers can leverage the power of GPUs using Fortran, the familiar language of scientific computing and supercomputer performance benchmarking. The authors presu...
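
As a taste of the programming model the book teaches (see section 1.3.1, "A first CUDA Fortran program"), here is a minimal, illustrative sketch, not taken from the book, of a CUDA Fortran program that increments an array on the GPU. Module and variable names are hypothetical, and such code would typically be built with the NVIDIA HPC SDK compiler (e.g., nvfortran -cuda incr.cuf):

    module simple_ops_m
    contains
      ! Kernel: each thread increments one array element by b
      attributes(global) subroutine increment(a, b)
        implicit none
        integer, intent(inout) :: a(:)   ! array resides in device memory
        integer, value :: b              ! scalar passed by value from the host
        integer :: i
        i = threadIdx%x                  ! one thread per element (single block)
        a(i) = a(i) + b
      end subroutine increment
    end module simple_ops_m

    program main
      use cudafor
      use simple_ops_m
      implicit none
      integer, parameter :: n = 256
      integer :: a(n), b
      integer, device :: a_d(n)          ! device-resident copy of a

      a = 1
      b = 3
      a_d = a                            ! host-to-device transfer via assignment
      call increment<<<1, n>>>(a_d, b)   ! launch one block of n threads
      a = a_d                            ! device-to-host transfer
      print *, merge('Test Passed', 'Test Failed', all(a == 1 + b))
    end program main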


Bibliographic Details
Other Authors: Ruetsch, Gregory (author); Fatica, Massimiliano (author)
Format: eBook
Language: English
Published: Cambridge, MA : Morgan Kaufmann, [2024]
Edition: Second edition
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009840465706719
Table of Contents:
  • Front Cover
  • CUDA Fortran for Scientists and Engineers
  • Copyright
  • Contents
  • Preface to the Second Edition
  • Preface to the First Edition
  • Acknowledgments
  • 1 CUDA Fortran programming
  • 1 Introduction
  • 1.1 A brief history of GPU computing
  • 1.2 Parallel computation
  • 1.3 Basic concepts
  • 1.3.1 A first CUDA Fortran program
  • 1.3.1.1 CUDA Fortran compilation
  • 1.3.2 Extending to larger arrays
  • 1.3.3 Multidimensional arrays
  • 1.3.4 Interfaces for device code
  • 1.3.5 Managed data
  • 1.3.6 Kernel loop directives and CUF kernels
  • 1.4 Determining CUDA hardware features and limits
  • 1.4.1 Choosing a device to run on
  • 1.4.2 Floating point precision
  • 1.4.2.1 Accommodating variable precision
  • 1.5 Error handling
  • 1.6 Compiling CUDA Fortran code
  • 1.7 CUDA Driver, Toolkit, and compatibility
  • 2 Correctness, accuracy, and debugging
  • 2.1 Assessing correctness of results
  • 2.1.1 Non-associativity of floating point arithmetic
  • 2.1.2 Fused-multiply add
  • 2.1.3 Flags affecting floating-point accuracy
  • 2.2 Debugging
  • 2.2.1 Printing from device code
  • 2.2.2 Debugging with cuda-gdb
  • 2.2.2.1 System requirements
  • 2.2.2.2 Compilation
  • 2.2.2.3 Setting breakpoints
  • 2.2.2.4 Focus - software and hardware coordinates
  • 2.2.2.5 CUDA activity status
  • 2.2.2.6 Single-stepping in device code
  • 2.2.2.7 Examining program state
  • 2.2.3 compute-sanitizer
  • 3 Performance measurement and metrics
  • 3.1 Measuring execution time
  • 3.1.1 Host-device synchronization and CPU timers
  • 3.1.2 Timing via CUDA events
  • 3.1.3 Nsight Systems command-line interface nsys
  • 3.1.3.1 Nsight Systems graphical user interface nsys-ui
  • 3.1.4 Customizing profiling with nvtx
  • 3.1.4.1 Basic NVTX tooling interfaces
  • 3.1.4.2 Advanced NVTX tooling interfaces
  • 3.1.4.3 Automated NVTX instrumentation
  • 3.2 Instruction, bandwidth, and latency bound kernels
  • 3.3 Memory bandwidth
  • 3.3.1 Theoretical peak bandwidth
  • 3.3.2 Effective bandwidth
  • 3.3.3 Actual data throughput vs. effective bandwidth
  • 4 Synchronization
  • 4.1 Synchronization of kernel execution and data transfers
  • 4.1.1 Pageable versus pinned host memory
  • 4.1.2 Streams
  • 4.1.2.1 Creating streams
  • 4.1.3 Asynchronous transfers via cudaMemcpyAsync()
  • 4.1.4 Synchronization barriers
  • 4.1.4.1 cudaDeviceSynchronize()
  • 4.1.4.2 cudaStreamSynchronize()
  • 4.1.4.3 cudaEventSynchronize()
  • 4.1.4.4 Querying streams and events
  • 4.1.5 Advanced stream topics
  • 4.1.5.1 The default stream
  • 4.1.5.2 Non-blocking streams
  • 4.1.5.3 Stream priorities
  • 4.2 Synchronization of kernel threads on the device
  • 4.2.1 Shared memory
  • 4.2.2 Synchronizing threads within a block
  • 4.2.3 Warps, warp synchronization, and warp-level primitives
  • 4.2.3.1 SHFL functions
  • 4.2.4 Atomics
  • 4.2.5 Memory fences
  • 4.2.6 Cooperative groups
  • 4.2.6.1 Grid synchronization
  • 4.2.6.2 Thread block clusters
  • 5 Optimization
  • 5.1 Transfers between host and device
  • 5.1.1 Pinned memory
  • 5.2 Device memory
  • 5.2.1 ECC (Error Correcting Code)
  • 5.2.2 Global memory
  • 5.2.2.1 Declaring global array arguments in kernels
  • 5.2.2.2 Coalesced global memory access
  • 5.2.3 Local memory
  • 5.2.4 Constant memory
  • 5.2.5 L1 and L2 caches
  • 5.2.6 Shared memory
  • 5.2.6.1 Configuring shared memory
  • 5.2.6.2 Global memory coalescing through shared memory
  • 5.2.6.3 Shared memory bank conflicts
  • 5.2.7 Registers
  • 5.2.7.1 Exchanging register data between threads in a warp
  • 5.3 Execution configuration
  • 5.3.1 Thread-level parallelism
  • 5.3.2 Instruction-level parallelism
  • 5.3.2.1 Asynchronous data transfers between global and shared memory
  • 5.3.2.2 Instruction-level parallelism in CUF kernels
  • 5.4 Instruction optimization
  • 5.4.1 Device intrinsics
  • 5.4.1.1 Directed rounding
  • 5.4.1.2 C intrinsics
  • 5.4.1.3 Fast math intrinsics
  • 5.4.1.4 Compiler options
  • 5.4.2 Divergent warps
  • 6 Porting tips and techniques
  • 6.1 CUF kernels
  • 6.2 Conditional inclusion of code
  • 6.3 Renaming variables
  • 6.3.1 Renaming via use statements
  • 6.3.2 Renaming via the associate construct
  • 6.4 Minimizing memory footprint for work arrays
  • 6.5 Array compaction
  • 7 Interfacing with CUDA C code and CUDA libraries
  • 7.1 Calling user-written CUDA C code
  • 7.1.1 The ignore_tkr directive
  • 7.2 cuBLAS
  • 7.2.1 Legacy cuBLAS API
  • 7.2.2 New cuBLAS API
  • 7.2.3 Batched cuBLAS routines
  • 7.2.4 GEMM with tensor cores
  • 7.3 cuSPARSE
  • 7.4 cuSOLVER
  • 7.5 cuTENSOR
  • 7.5.1 Low-level cuTENSOR interfaces
  • 7.6 Thrust
  • 8 Multi-GPU programming
  • 8.1 CUDA multi-GPU features
  • 8.1.1 Peer-to-peer communication
  • 8.1.1.1 Requirements for peer-to-peer communication
  • 8.1.2 Peer-to-peer direct transfers
  • 8.1.3 Peer-to-peer transpose
  • 8.2 Multi-GPU programming with MPI
  • 8.2.1 Assigning devices to MPI ranks
  • 8.2.2 MPI transpose
  • 8.2.3 GPU-aware MPI transpose
  • 2 Case studies
  • 9 Monte Carlo method
  • 9.1 CURAND
  • 9.2 Computing π with CUF kernels
  • 9.2.1 IEEE-754 precision
  • 9.3 Computing π with reduction kernels
  • 9.3.1 Reductions with SHFL instructions
  • 9.3.2 Reductions with atomic locks
  • 9.3.3 Reductions using the grid_group cooperative group
  • 9.4 Accuracy of summation
  • 9.5 Option pricing
  • 10 Finite difference method
  • 10.1 Nine-point 1D finite difference stencil
  • 10.1.1 Data reuse and shared memory
  • 10.1.2 The x-derivative kernel
  • 10.1.2.1 Performance of the x-derivative kernel
  • 10.1.3 Derivatives in y and z
  • 10.1.4 Nonuniform grids
  • 10.2 2D Laplace equation
  • 11 Applications of the fast Fourier transform
  • 11.1 CUFFT
  • 11.2 Spectral derivatives
  • 11.3 Convolution
  • 11.4 Poisson solver
  • 11.4.1 Vortex dynamics
  • 12 Ray tracing
  • 12.1 Generating an image file
  • PPM file format
  • Implementation
  • 12.2 Vectors in CUDA Fortran
  • Implementation of the RGB module
  • 12.3 Rays, a simple camera, and background
  • First ray-tracing code
  • 12.4 Adding a sphere
  • Theory
  • Implementation
  • 12.5 Surface normals and multiple objects
  • Surface normals
  • Multiple objects
  • 12.6 Antialiasing
  • Implementation
  • CUDA implementation
  • 12.7 Material types
  • 12.7.1 Diffuse materials
  • Implementation
  • Random number generation
  • Modifications to the color() routine
  • 12.7.2 Metal
  • Theory
  • Implementation
  • Material derived types
  • Initialization of spheres and material types
  • The scatter() function and Cray pointers
  • 12.7.3 Dielectrics
  • Theory
  • Implementation
  • 12.8 Positionable camera
  • 12.9 Defocus blur
  • Implementation
  • 12.10 Where next?
  • 12.11 Triangles
  • Theory
  • Implementation
  • 12.12 Lights
  • Implementation
  • 12.13 Textures
  • Theory
  • Implementation
  • 3 Appendices
  • A System and environment management
  • A.1 Environment variables
  • A.1.1 General
  • A.1.2 Just-in-time compilation
  • A.2 nvidia-smi - System Management Interface
  • A.2.1 Enabling and disabling ECC
  • A.2.2 Compute mode
  • A.2.3 Persistence mode
  • A.2.4 Topology
  • References
  • Index
  • Back Cover