Distributed machine learning with Python : accelerating model training and serving with distributed systems

Chapter 2: Parameter Server and All-Reduce -- Technical requirements -- Parameter server architecture -- Communication bottleneck in the parameter server architecture -- Sharding the model among parameter servers -- Implementing the parameter server -- Defining model layers -- Defining the parameter...


Bibliographic Details
Other Authors: Wang, Guanhua (author)
Format: Electronic book
Language: English
Published: Birmingham ; Mumbai : Packt Publishing, 2022.
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009660437406719
Table of Contents:
  • Intro
  • Title page
  • Copyright and Credits
  • Dedication
  • Contributors
  • Table of Contents
  • Preface
  • Section 1 - Data Parallelism
  • Chapter 1: Splitting Input Data
  • Single-node training is too slow
  • The mismatch between data loading bandwidth and model training bandwidth
  • Single-node training time on popular datasets
  • Accelerating the training process with data parallelism
  • Data parallelism - the high-level bits
  • Stochastic gradient descent
  • Model synchronization
  • Hyperparameter tuning
  • Global batch size
  • Learning rate adjustment
  • Model synchronization schemes
  • Summary
  • Chapter 2: Parameter Server and All-Reduce
  • Technical requirements
  • Parameter server architecture
  • Communication bottleneck in the parameter server architecture
  • Sharding the model among parameter servers
  • Implementing the parameter server
  • Defining model layers
  • Defining the parameter server
  • Defining the worker
  • Passing data between the parameter server and worker
  • Issues with the parameter server
  • The parameter server architecture introduces a high coding complexity for practitioners
  • All-Reduce architecture
  • Reduce
  • All-Reduce
  • Ring All-Reduce
  • Collective communication
  • Broadcast
  • Gather
  • All-Gather
  • Summary
  • Chapter 3: Building a Data Parallel Training and Serving Pipeline
  • Technical requirements
  • The data parallel training pipeline in a nutshell
  • Input pre-processing
  • Input data partition
  • Data loading
  • Training
  • Model synchronization
  • Model update
  • Single-machine multi-GPUs and multi-machine multi-GPUs
  • Single-machine multi-GPU
  • Multi-machine multi-GPU
  • Checkpointing and fault tolerance
  • Model checkpointing
  • Load model checkpoints
  • Model evaluation and hyperparameter tuning
  • Model serving in data parallelism
  • Summary
  • Chapter 4: Bottlenecks and Solutions
  • Communication bottlenecks in data parallel training
  • Analyzing the communication workloads
  • Parameter server architecture
  • The All-Reduce architecture
  • The inefficiency of state-of-the-art communication schemes
  • Leveraging idle links and host resources
  • Tree All-Reduce
  • Hybrid data transfer over PCIe and NVLink
  • On-device memory bottlenecks
  • Recomputation and quantization
  • Recomputation
  • Quantization
  • Summary
  • Section 2 - Model Parallelism
  • Chapter 5: Splitting the Model
  • Technical requirements
  • Single-node training error - out of memory
  • Fine-tuning BERT on a single GPU
  • Trying to pack a giant model inside one state-of-the-art GPU
  • ELMo, BERT, and GPT
  • Basic concepts
  • RNN
  • ELMo
  • BERT
  • GPT
  • Pre-training and fine-tuning
  • State-of-the-art hardware
  • P100, V100, and DGX-1
  • NVLink
  • A100 and DGX-2
  • NVSwitch
  • Summary
  • Chapter 6: Pipeline Input and Layer Split
  • Vanilla model parallelism is inefficient
  • Forward propagation
  • Backward propagation
  • GPU idle time between forward and backward propagation
  • Pipeline input
  • Pros and cons of pipeline parallelism
  • Advantages of pipeline parallelism
  • Disadvantages of pipeline parallelism
  • Layer split
  • Notes on intra-layer model parallelism
  • Summary
  • Chapter 7: Implementing Model Parallel Training and Serving Workflows
  • Technical requirements
  • Wrapping up the whole model parallelism pipeline
  • A model parallel training overview
  • Implementing a model parallel training pipeline
  • Specifying communication protocol among GPUs
  • Model parallel serving
  • Fine-tuning transformers
  • Hyperparameter tuning in model parallelism
  • Balancing the workload among GPUs
  • Enabling/disabling pipeline parallelism
  • NLP model serving
  • Summary
  • Chapter 8: Achieving Higher Throughput and Lower Latency
  • Technical requirements
  • Freezing layers
  • Freezing layers during forward propagation
  • Reducing computation cost during forward propagation
  • Freezing layers during backward propagation
  • Exploring memory and storage resources
  • Understanding model decomposition and distillation
  • Model decomposition
  • Model distillation
  • Reducing bits in hardware
  • Summary
  • Section 3 - Advanced Parallelism Paradigms
  • Chapter 9: Hybrid of Data and Model Parallelism
  • Technical requirements
  • Case study of Megatron-LM
  • Layer split for model parallelism
  • Row-wise trial-and-error approach
  • Column-wise trial-and-error approach
  • Cross-machine for data parallelism
  • Implementation of Megatron-LM
  • Case study of Mesh-TensorFlow
  • Implementation of Mesh-TensorFlow
  • Pros and cons of Megatron-LM and Mesh-TensorFlow
  • Summary
  • Chapter 10: Federated Learning and Edge Devices
  • Technical requirements
  • Sharing knowledge without sharing data
  • Recapping the traditional data parallel model training paradigm
  • No input sharing among workers
  • Communicating gradients for collaborative learning
  • Case study: TensorFlow Federated
  • Running edge devices with TinyML
  • Case study: TensorFlow Lite
  • Summary
  • Chapter 11: Elastic Model Training and Serving
  • Technical requirements
  • Introducing adaptive model training
  • Traditional data parallel training
  • Adaptive model training in data parallelism
  • Adaptive model training (AllReduce-based)
  • Adaptive model training (parameter server-based)
  • Traditional model-parallel model training paradigm
  • Adaptive model training in model parallelism
  • Implementing adaptive model training in the cloud
  • Elasticity in model inference
  • Serverless
  • Summary
  • Chapter 12: Advanced Techniques for Further Speed-Ups
  • Technical requirements
  • Debugging and performance analytics
  • General concepts in the profiling results
  • Communication results analysis
  • Computation results analysis
  • Job migration and multiplexing
  • Job migration
  • Job multiplexing
  • Model training in a heterogeneous environment
  • Summary
  • Index
  • About Packt
  • Other Books You May Enjoy