Advances in GPU Research and Practice

Advances in GPU Research and Practice focuses on research and practice in GPU-based systems. The topics covered range from hardware and architectural issues to high-level concerns such as application systems, parallel programming, middleware, and power and energy...


Bibliographic Details
Other Authors: Sarbazi-Azad, Hamid (author, editor)
Format: Electronic book
Language: English
Published: Amsterdam : Elsevier [2017]
Edition: First edition
Series: Emerging trends in computer science & applied computing.
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630332406719
Table of Contents:
  • Front Cover
  • Advances in GPU Research and Practice
  • Copyright
  • Dedication
  • Contents
  • List of Contributors
  • Preface
  • Acknowledgments
  • Part 1: Programming and tools
  • Chapter 1: Formal analysis techniques for reliable GPU programming: current solutions and call to action
  • 1 GPUs in Support of Parallel Computing
  • Bugs in parallel and GPU code
  • 2 A quick introduction to GPUs
  • Organization of threads
  • Memory spaces
  • Barrier synchronization
  • Warps and lock-step execution
  • Dot product example
  • 3 Correctness issues in GPU programming
  • Data races
  • Lack of forward progress guarantees
  • Floating-point accuracy
  • 4 The need for effective tools
  • 4.1 A Taxonomy of Current Tools
  • 4.2 Canonical Schedules and the Two-Thread Reduction
  • Race freedom implies determinism
  • Detecting races: "all for one and one for all"
  • Restricting to a canonical schedule
  • Reduction to a pair of threads
  • 4.3 Symbolic Bug-Finding Case Study: GKLEE
  • 4.4 Verification Case Study: GPUVerify
  • 5 Call to Action
  • GPUs will become more pervasive
  • Current tools show promise
  • Solving basic correctness issues
  • Equivalence checking
  • Clarity from vendors and standards bodies
  • User validation of tools
  • Acknowledgments
  • References
  • Chapter 2: SnuCL: A unified OpenCL framework for heterogeneous clusters
  • 1 Introduction
  • 2 OpenCL
  • 2.1 Platform Model
  • 2.2 Execution Model
  • 2.3 Memory Model
  • 2.4 Synchronization
  • 2.5 Memory Consistency
  • 2.6 OpenCL ICD
  • 3 Overview of SnuCL framework
  • 3.1 Limitations of OpenCL
  • 3.2 SnuCL CPU
  • 3.3 SnuCL Single
  • 3.4 SnuCL Cluster
  • 3.4.1 Processing synchronization commands
  • 4 Memory management in SnuCL Cluster
  • 4.1 Space Allocation to Memory Objects
  • 4.2 Minimizing Copying Overhead
  • 4.3 Processing Memory Commands
  • 4.4 Consistency Management
  • 4.5 Detecting Memory Objects Written by a Kernel
  • 5 SnuCL extensions to OpenCL
  • 6 Performance evaluation
  • 6.1 Evaluation Methodology
  • 6.2 Performance
  • 6.2.1 Scalability on the medium-scale GPU cluster
  • 6.2.2 Scalability on the large-scale CPU cluster
  • 7 Conclusions
  • Acknowledgments
  • References
  • Chapter 3: Thread communication and synchronization on massively parallel GPUs
  • 1 Introduction
  • 2 Coarse-Grained Communication and Synchronization
  • 2.1 Global Barrier at the Kernel Level
  • 2.2 Local Barrier at the Work-Group Level
  • 2.3 Implicit Barrier at the Wavefront Level
  • 3 Built-In Atomic Functions on Regular Variables
  • 4 Fine-Grained Communication and Synchronization
  • 4.1 Memory Consistency Model
  • 4.1.1 Sequential consistency
  • 4.1.2 Relaxed consistency
  • 4.2 The OpenCL 2.0 Memory Model
  • 4.2.1 Relationships between two memory operations
  • 4.2.2 Special atomic operations and stand-alone memory fence
  • 4.2.3 Release and acquire semantics
  • 4.2.4 Memory order parameters
  • 4.2.5 Memory scope parameters
  • 5 Conclusion and Future Research Direction
  • References
  • Chapter 4: Software-level task scheduling on GPUs
  • 1 Introduction, Problem Statement, and Context
  • 2 Nondeterministic behaviors caused by the hardware
  • 2.1 P1: Greedy
  • 2.2 P2: Not Round-Robin
  • 2.3 P3: Nondeterministic Across Runs
  • 2.4 P4: Oblivious to Nonuniform Data Sharing
  • 2.5 P5: Serializing Multikernel Co-Runs
  • 3 SM-centric transformation
  • 3.1 Core Ideas
  • 3.1.1 SM-centric task selection
  • Correctness issues
  • 3.1.2 Filling-retreating scheme
  • 3.2 Implementation
  • 3.2.1 Details
  • 4 Scheduling-enabled optimizations
  • 4.1 Parallelism Control
  • 4.2 Affinity-Based Scheduling
  • 4.2.1 Evaluation
  • 4.3 SM Partitioning for Multi-Kernel Co-Runs
  • 4.3.1 Evaluation
  • 5 Other scheduling work on GPUs
  • 5.1 Software Approaches
  • 5.2 Hardware Approaches
  • 6 Conclusion and future work
  • Acknowledgments
  • References
  • Chapter 5: Data placement on GPUs
  • 1 Introduction
  • 2 Overview
  • 3 Memory specification through MSL
  • 4 Compiler support
  • 4.1 Deriving Data Access Patterns: A Hybrid Approach
  • 4.1.1 Reuse distance model
  • 4.1.2 Staging code to be agnostic to placement
  • 5 Runtime support
  • 5.1 Lightweight Performance Model
  • 5.2 Finding the Best Placements
  • 5.2.1 Dealing with phases
  • 6 Results
  • 6.1 Results With Irregular Benchmarks
  • 6.2 Results With Regular Benchmarks
  • 7 Related Work
  • 8 Summary
  • References
  • Part 2: Algorithms and applications
  • Chapter 6: Biological sequence analysis on GPU
  • 1 Introduction
  • 2 Pairwise Sequence Comparison and Sequence-Profile Comparison
  • 2.1 Pairwise Sequence Comparison
  • 2.1.1 Smith-Waterman algorithm
  • Phase 1: Create the similarity matrix
  • Phase 2: Obtain the best local alignment
  • Smith-Waterman variations
  • 2.1.2 Basic Local Alignment Tool
  • Phase 1: Seeding
  • Phase 2: Extension
  • Phase 3: Evaluation
  • BLAST
  • 2.2 Sequence-Profile Comparison
  • 2.2.1 Hidden Markov models
  • 2.2.2 The Viterbi algorithm
  • 2.2.3 The MSV algorithm
  • 3 Design Aspects of GPU Solutions for Biological Sequence Analysis
  • 3.1 Task-Parallelism vs Data-Parallelism
  • 3.2 Sorting Sequence Database to Achieve Load Balance
  • 3.3 Use of GPU Memory Hierarchy
  • 3.4 GPU Solution Used as a Filter
  • 4 GPU Solutions for Pairwise Sequence Comparison
  • 4.1 GPU Solutions Using Exact Algorithms
  • 4.1.1 Manavski and Valle [27]
  • 4.1.2 Ligowski and Rudnicki [38]
  • 4.1.3 Blazewicz et al. [29]
  • 4.1.4 Li et al. [52]
  • 4.1.5 Ino et al. [28,30]
  • 4.1.6 Liu et al. [45-47]
  • 4.1.7 Sandes and de Melo [39-41] and Sandes et al. [42]
  • 4.1.8 Comparative overview
  • 4.2 GPU Solutions Using BLAST
  • 4.2.1 Vouzis and Sahinidis [31]
  • 4.2.2 Liu et al. [48]
  • 4.2.3 Zhang et al. [43]
  • 4.2.4 Comparative overview
  • 5 GPU Solutions for Sequence-Profile Comparison
  • 5.1 GPU Solutions Using the Viterbi Algorithm
  • 5.1.1 Horn et al. [32]
  • 5.1.2 Du et al. [44]
  • 5.1.3 Walters et al. [33]
  • 5.1.4 Yao et al. [34]
  • 5.1.5 Ganesan et al. [49]
  • 5.1.6 Ferraz and Moreano [36]
  • 5.2 GPU Solutions Using the MSV Algorithm
  • 5.2.1 Li et al. [35]
  • 5.2.2 Cheng and Butler [37]
  • 5.2.3 Araújo Neto and Moreano [50]
  • 5.3 Comparative Overview
  • 6 Conclusion and Perspectives
  • References
  • Chapter 7: Graph algorithms on GPUs
  • 1 Graph representation for GPUs
  • 1.1 Adjacency Matrices
  • 1.2 Adjacency Lists
  • 1.3 Edge Lists
  • 2 Graph traversal algorithms: the breadth first search (BFS)
  • 2.1 The Frontier-Based Parallel Implementation of BFS
  • 2.2 BFS-4K
  • 3 The single-source shortest path (SSSP) problem
  • 3.1 The SSSP Implementations for GPUs
  • 3.2 H-BF: An Efficient Implementation of the Bellman-Ford Algorithm
  • 4 The APSP problem
  • 4.1 The APSP Implementations for GPUs
  • 5 Load Balancing and Memory Accesses: Issues and Management Techniques
  • 5.1 Static Mapping Techniques
  • 5.1.1 Work-items to threads
  • 5.1.2 Virtual warps
  • 5.2 Semidynamic Mapping Techniques
  • 5.2.1 Dynamic virtual warps + dynamic parallelism
  • 5.2.2 CTA + warp + scan
  • 5.3 Dynamic Mapping Techniques
  • 5.3.1 Direct search
  • 5.3.2 Local warp search
  • 5.3.3 Block search
  • 5.3.4 Two-phase search
  • 5.4 The Multiphase Search Technique
  • 5.5 Coalesced Expansion
  • 5.6 Iterated Searches
  • References
  • Chapter 8: GPU alignment of two and three sequences
  • 1 Introduction
  • 1.1 Pairwise alignment
  • 1.2 Alignment of Three Sequences
  • 2 GPU architecture
  • 3 Pairwise alignment
  • 3.1 Smith-Waterman Algorithm
  • 3.2 Computing the Score of the Best Local Alignment
  • 3.3 Computing the Best Local Alignment
  • 3.3.1 StripedAlignment
  • 3.3.2 ChunkedAlignment1
  • 3.3.3 ChunkedAlignment2
  • 3.3.4 Memory requirements
  • 3.4 Experimental Results
  • StripedScore
  • StripedAlignment
  • ChunkedAlignment1
  • 4 Alignment of three sequences
  • 4.1 Three-Sequence Alignment Algorithm
  • 4.2 Computing the Score of the Best Alignment
  • 4.2.1 Layered algorithm
  • GPU computational strategy
  • Analysis
  • 4.2.2 Sloped algorithm
  • GPU computational strategy
  • Analysis
  • 4.3 Computing the Best Alignment
  • 4.3.1 Layered-BT1
  • 4.3.2 Layered-BT2
  • 4.3.3 Layered-BT3
  • 4.4 Experimental Results
  • 4.4.1 Computing the score of the best alignment
  • 4.4.2 Computing the alignment
  • 5 Conclusion
  • References
  • Chapter 9: Augmented block Cimmino distributed algorithm for solving tridiagonal systems on GPU
  • 1 Introduction
  • 2 ABCD Solver for tridiagonal systems
  • 3 GPU implementation and optimization
  • 3.1 QR Method and Givens Rotation
  • 3.2 Sparse Storage Format
  • 3.3 Coalesced Memory Access
  • 3.4 Boundary Padding
  • 3.4.1 Padding of the augmented matrix
  • 3.4.2 Padding for Givens rotation
  • 4 Performance evaluation
  • 4.1 Comparison With CPU Implementation
  • 4.2 Speedup by Memory Coalescing
  • 4.3 Boundary Padding
  • 5 Conclusion and future work
  • References
  • Chapter 10: GPU computing applied to linear and mixed-integer programming
  • 1 Introduction
  • 2 Operations Research in Practice
  • 3 Exact Optimization Algorithms
  • 3.1 The Simplex Method
  • 3.2 Dynamic Programming
  • 3.2.1 Knapsack problems
  • 3.2.2 Multiple-choice knapsack problem
  • 3.3 Branch-and-Bound
  • 3.3.1 Knapsack problem
  • 3.3.2 Flow-shop scheduling problem
  • 3.3.3 Traveling salesman problem
  • 4 Metaheuristics
  • 4.1 Genetic Algorithms
  • 4.1.1 The traveling salesman problem
  • 4.1.2 Scheduling problems
  • 4.1.3 Knapsack problems
  • 4.2 Ant Colony Optimization.