Large Language Model-Based Solutions: How to Deliver Value with Cost-Effective Generative AI Applications
Learn to build cost-effective apps using Large Language Models. In Large Language Model-Based Solutions: How to Deliver Value with Cost-Effective Generative AI Applications, Shreyas Subramanian, Principal Data Scientist at Amazon Web Services, delivers a practical guide for developers and data scientists...
Format: E-book
Language: English
Published: Hoboken, New Jersey : John Wiley, 2024.
Edition: 1st ed.
Series: Tech Today Series
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009811313906719
Table of Contents:
- Cover
- Contents At A Glance
- Title Page
- Copyright Page
- Dedication Page
- About the Author
- About the Technical Editor
- Contents
- Introduction
- GenAI Applications and Large Language Models
- Importance of Cost Optimization
- Challenges and Opportunities
- Micro Case Studies
- OpenAI: Leading the Way
- Hugging Face: Open-Source Community Building
- Bloomberg GPT: LLMs in Large Commercial Institutions
- Who Is This Book For?
- Summary
- Chapter 1 Introduction
- Overview of GenAI Applications and Large Language Models
- The Rise of Large Language Models
- Neural Networks, Transformers, and Beyond
- GenAI vs. LLMs: What's the Difference?
- The Three-Layer GenAI Application Stack
- The Infrastructure Layer
- The Model Layer
- The Application Layer
- Paths to Productionizing GenAI Applications
- Sample LLM-Powered Chat Application
- The Importance of Cost Optimization
- Cost Assessment of the Model Inference Component
- Cost Assessment of the Vector Database Component
- Benchmarking Setup and Results
- Other Factors to Consider
- Cost Assessment of the Large Language Model Component
- Summary
- Chapter 2 Tuning Techniques for Cost Optimization
- Fine-Tuning and Customizability
- Basic Scaling Laws You Should Know
- Parameter-Efficient Fine-Tuning Methods
- Adapters Under the Hood
- Prompt Tuning
- Prefix Tuning
- P-tuning
- IA3
- Low-Rank Adaptation
- Cost and Performance Implications of PEFT Methods
- Summary
- Chapter 3 Inference Techniques for Cost Optimization
- Introduction to Inference Techniques
- Prompt Engineering
- Impact of Prompt Engineering on Cost
- Estimating Costs for Other Models
- Clear and Direct Prompts
- Adding Qualifying Words for Brief Responses
- Breaking Down the Request
- Example of Using Claude for PII Removal
- Conclusion
- Providing Context
- Examples of Providing Context
- RAG and Long Context Models
- Recent Work Comparing RAG with Long Context Models
- Conclusion
- Context and Model Limitations
- Indicating a Desired Format
- Example of Formatted Extraction with Claude
- Trade-Off Between Verbosity and Clarity
- Caching with Vector Stores
- What Is a Vector Store?
- How to Implement Caching Using Vector Stores
- Conclusion
- Chains for Long Documents
- What Is Chaining?
- Implementing Chains
- Example Use Case
- Common Components
- Tools That Implement Chains
- Comparing Results
- Conclusion
- Summarization
- Summarization in the Context of Cost and Performance
- Efficiency in Data Processing
- Cost-Effective Storage
- Enhanced Downstream Applications
- Improved Cache Utilization
- Summarization as a Preprocessing Step
- Enhanced User Experience
- Conclusion
- Batch Prompting for Efficient Inference
- Batch Inference
- Experimental Results
- Using the accelerate Library
- Using the DeepSpeed Library
- Batch Prompting
- Example of Using Batch Prompting
- Model Optimization Methods
- Quantization
- Code Example
- Recent Advancements: GPTQ
- Parameter-Efficient Fine-Tuning Methods
- Recap of PEFT Methods
- Code Example
- Cost and Performance Implications
- Summary
- References
- Chapter 4 Model Selection and Alternatives
- Introduction to Model Selection
- Motivating Example: The Tale of Two Models
- The Role of Compact and Nimble Models
- Examples of Successful Smaller Models
- Quantization for Powerful but Smaller Models
- Text Generation with Mistral 7B
- Zephyr 7B and Aligned Smaller Models
- CogVLM for Language-Vision Multimodality
- Prometheus for Fine-Grained Text Evaluation
- Orca 2 and Teaching Smaller Models to Reason
- Breaking Traditional Scaling Laws with Gemini and Phi
- Phi 1, 1.5, and 2 Models
- Gemini Models
- Domain-Specific Models
- Step 1 - Training Your Own Tokenizer
- Step 2 - Training Your Own Domain-Specific Model
- More References for Fine-Tuning
- Evaluating Domain-Specific Models vs. Generic Models
- The Power of Prompting with General-Purpose Models
- Summary
- Chapter 5 Infrastructure and Deployment Tuning Strategies
- Introduction to Tuning Strategies
- Hardware Utilization and Batch Tuning
- Memory Occupancy
- Strategies to Fit Larger Models in Memory
- KV Caching
- PagedAttention
- How Does PagedAttention Work?
- Comparisons, Limitations, and Cost Considerations
- AlphaServe
- How Does AlphaServe Work?
- Impact of Batching
- Cost and Performance Considerations
- S3: Scheduling Sequences with Speculation
- How Does S3 Work?
- Performance and Cost
- Streaming LLMs with Attention Sinks
- Fixed to Sliding Window Attention
- Extending the Context Length
- Working with Infinite Length Context
- How Does StreamingLLM Work?
- Performance and Results
- Cost Considerations
- Batch Size Tuning
- Frameworks for Deployment Configuration Testing
- Cloud-Native Inference Frameworks
- Deep Dive into Serving Stack Choices
- Batching Options
- Options in DJL Serving
- High-Level Guidance for Selecting Serving Parameters
- Automatically Finding Good Inference Configurations
- Creating a Generic Template
- Defining an HPO Space
- Searching the Space for Optimal Configurations
- Results of Inference HPO
- Inference Acceleration Tools
- TensorRT and GPU Acceleration Tools
- CPU Acceleration Tools
- Monitoring and Observability
- LLMOps and Monitoring
- Why Is Monitoring Important for LLMs?
- Monitoring and Updating Guardrails
- Summary
- Conclusion
- Index
- EULA