Distributed Machine Learning with Python: Accelerating model training and serving with distributed systems
| Other Authors: | |
| --- | --- |
| Format: | eBook |
| Language: | English |
| Published: | Birmingham ; Mumbai : Packt Publishing, 2022. |
| Subjects: | |
| View in Biblioteca Universitat Ramon Llull: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009660437406719 |
Table of Contents:
- Intro
- Title page
- Copyright and Credits
- Dedication
- Contributors
- Table of Contents
- Preface
- Section 1 - Data Parallelism
- Chapter 1: Splitting Input Data
- Single-node training is too slow
- The mismatch between data loading bandwidth and model training bandwidth
- Single-node training time on popular datasets
- Accelerating the training process with data parallelism
- Data parallelism - the high-level bits
- Stochastic gradient descent
- Model synchronization
- Hyperparameter tuning
- Global batch size
- Learning rate adjustment
- Model synchronization schemes
- Summary
- Chapter 2: Parameter Server and All-Reduce
- Technical requirements
- Parameter server architecture
- Communication bottleneck in the parameter server architecture
- Sharding the model among parameter servers
- Implementing the parameter server
- Defining model layers
- Defining the parameter server
- Defining the worker
- Passing data between the parameter server and worker
- Issues with the parameter server
- The parameter server architecture introduces a high coding complexity for practitioners
- All-Reduce architecture
- Reduce
- All-Reduce
- Ring All-Reduce
- Collective communication
- Broadcast
- Gather
- All-Gather
- Summary
- Chapter 3: Building a Data Parallel Training and Serving Pipeline
- Technical requirements
- The data parallel training pipeline in a nutshell
- Input pre-processing
- Input data partition
- Data loading
- Training
- Model synchronization
- Model update
- Single-machine multi-GPUs and multi-machine multi-GPUs
- Single-machine multi-GPU
- Multi-machine multi-GPU
- Checkpointing and fault tolerance
- Model checkpointing
- Load model checkpoints
- Model evaluation and hyperparameter tuning
- Model serving in data parallelism
- Summary
- Chapter 4: Bottlenecks and Solutions
- Communication bottlenecks in data parallel training
- Analyzing the communication workloads
- Parameter server architecture
- The All-Reduce architecture
- The inefficiency of state-of-the-art communication schemes
- Leveraging idle links and host resources
- Tree All-Reduce
- Hybrid data transfer over PCIe and NVLink
- On-device memory bottlenecks
- Recomputation and quantization
- Recomputation
- Quantization
- Summary
- Section 2 - Model Parallelism
- Chapter 5: Splitting the Model
- Technical requirements
- Single-node training error - out of memory
- Fine-tuning BERT on a single GPU
- Trying to pack a giant model inside one state-of-the-art GPU
- ELMo, BERT, and GPT
- Basic concepts
- RNN
- ELMo
- BERT
- GPT
- Pre-training and fine-tuning
- State-of-the-art hardware
- P100, V100, and DGX-1
- NVLink
- A100 and DGX-2
- NVSwitch
- Summary
- Chapter 6: Pipeline Input and Layer Split
- Vanilla model parallelism is inefficient
- Forward propagation
- Backward propagation
- GPU idle time between forward and backward propagation
- Pipeline input
- Pros and cons of pipeline parallelism
- Advantages of pipeline parallelism
- Disadvantages of pipeline parallelism
- Layer split
- Notes on intra-layer model parallelism
- Summary
- Chapter 7: Implementing Model Parallel Training and Serving Workflows
- Technical requirements
- Wrapping up the whole model parallelism pipeline
- A model parallel training overview
- Implementing a model parallel training pipeline
- Specifying communication protocol among GPUs
- Model parallel serving
- Fine-tuning transformers
- Hyperparameter tuning in model parallelism
- Balancing the workload among GPUs
- Enabling/disabling pipeline parallelism
- NLP model serving
- Summary
- Chapter 8: Achieving Higher Throughput and Lower Latency
- Technical requirements
- Freezing layers
- Freezing layers during forward propagation
- Reducing computation cost during forward propagation
- Freezing layers during backward propagation
- Exploring memory and storage resources
- Understanding model decomposition and distillation
- Model decomposition
- Model distillation
- Reducing bits in hardware
- Summary
- Section 3 - Advanced Parallelism Paradigms
- Chapter 9: Hybrid of Data and Model Parallelism
- Technical requirements
- Case study of Megatron-LM
- Layer split for model parallelism
- Row-wise trial-and-error approach
- Column-wise trial-and-error approach
- Cross-machine for data parallelism
- Implementation of Megatron-LM
- Case study of Mesh-TensorFlow
- Implementation of Mesh-TensorFlow
- Pros and cons of Megatron-LM and Mesh-TensorFlow
- Summary
- Chapter 10: Federated Learning and Edge Devices
- Technical requirements
- Sharing knowledge without sharing data
- Recapping the traditional data parallel model training paradigm
- No input sharing among workers
- Communicating gradients for collaborative learning
- Case study: TensorFlow Federated
- Running edge devices with TinyML
- Case study: TensorFlow Lite
- Summary
- Chapter 11: Elastic Model Training and Serving
- Technical requirements
- Introducing adaptive model training
- Traditional data parallel training
- Adaptive model training in data parallelism
- Adaptive model training (AllReduce-based)
- Adaptive model training (parameter server-based)
- Traditional model-parallel model training paradigm
- Adaptive model training in model parallelism
- Implementing adaptive model training in the cloud
- Elasticity in model inference
- Serverless
- Summary
- Chapter 12: Advanced Techniques for Further Speed-Ups
- Technical requirements
- Debugging and performance analytics
- General concepts in the profiling results
- Communication results analysis
- Computation results analysis
- Job migration and multiplexing
- Job migration
- Job multiplexing
- Model training in a heterogeneous environment
- Summary
- Index
- About Packt
- Other Books You May Enjoy