Large Language Model-Based Solutions: How to Deliver Value with Cost-Effective Generative AI Applications
Learn to build cost-effective apps using Large Language Models. In Large Language Model-Based Solutions: How to Deliver Value with Cost-Effective Generative AI Applications, Shreyas Subramanian, Principal Data Scientist at Amazon Web Services, delivers a practical guide for developers and data scientists...
Format: E-book
Language: English
Published: Hoboken, New Jersey : John Wiley, 2024.
Edition: 1st ed.
Series: Tech Today Series
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009811313906719
Table of Contents:
- Cover
- Contents At A Glance
- Title Page
- Copyright Page
- Dedication Page
- About the Author
- About the Technical Editor
- Contents
- Introduction
- GenAI Applications and Large Language Models
- Importance of Cost Optimization
- Challenges and Opportunities
- Micro Case Studies
- OpenAI: Leading the Way
- Hugging Face: Open-Source Community Building
- Bloomberg GPT: LLMs in Large Commercial Institutions
- Who Is This Book For?
- Summary
- Chapter 1 Introduction
- Overview of GenAI Applications and Large Language Models
- The Rise of Large Language Models
- Neural Networks, Transformers, and Beyond
- GenAI vs. LLMs: What's the Difference?
- The Three-Layer GenAI Application Stack
- The Infrastructure Layer
- The Model Layer
- The Application Layer
- Paths to Productionizing GenAI Applications
- Sample LLM-Powered Chat Application
- The Importance of Cost Optimization
- Cost Assessment of the Model Inference Component
- Cost Assessment of the Vector Database Component
- Benchmarking Setup and Results
- Other Factors to Consider
- Cost Assessment of the Large Language Model Component
- Summary
- Chapter 2 Tuning Techniques for Cost Optimization
- Fine-Tuning and Customizability
- Basic Scaling Laws You Should Know
- Parameter-Efficient Fine-Tuning Methods
- Adapters Under the Hood
- Prompt Tuning
- Prefix Tuning
- P-tuning
- IA3
- Low-Rank Adaptation
- Cost and Performance Implications of PEFT Methods
- Summary
- Chapter 3 Inference Techniques for Cost Optimization
- Introduction to Inference Techniques
- Prompt Engineering
- Impact of Prompt Engineering on Cost
- Estimating Costs for Other Models
- Clear and Direct Prompts
- Adding Qualifying Words for Brief Responses
- Breaking Down the Request
- Example of Using Claude for PII Removal
- Conclusion
- Providing Context
- Examples of Providing Context
- RAG and Long Context Models
- Recent Work Comparing RAG with Long Context Models
- Conclusion
- Context and Model Limitations
- Indicating a Desired Format
- Example of Formatted Extraction with Claude
- Trade-Off Between Verbosity and Clarity
- Caching with Vector Stores
- What Is a Vector Store?
- How to Implement Caching Using Vector Stores
- Conclusion
- Chains for Long Documents
- What Is Chaining?
- Implementing Chains
- Example Use Case
- Common Components
- Tools That Implement Chains
- Comparing Results
- Conclusion
- Summarization
- Summarization in the Context of Cost and Performance
- Efficiency in Data Processing
- Cost-Effective Storage
- Enhanced Downstream Applications
- Improved Cache Utilization
- Summarization as a Preprocessing Step
- Enhanced User Experience
- Conclusion
- Batch Prompting for Efficient Inference
- Batch Inference
- Experimental Results
- Using the accelerate Library
- Using the DeepSpeed Library
- Batch Prompting
- Example of Using Batch Prompting
- Model Optimization Methods
- Quantization
- Code Example
- Recent Advancements: GPTQ
- Parameter-Efficient Fine-Tuning Methods
- Recap of PEFT Methods
- Code Example
- Cost and Performance Implications
- Summary
- References
- Chapter 4 Model Selection and Alternatives
- Introduction to Model Selection
- Motivating Example: The Tale of Two Models
- The Role of Compact and Nimble Models
- Examples of Successful Smaller Models
- Quantization for Powerful but Smaller Models
- Text Generation with Mistral 7B
- Zephyr 7B and Aligned Smaller Models
- CogVLM for Language-Vision Multimodality
- Prometheus for Fine-Grained Text Evaluation
- Orca 2 and Teaching Smaller Models to Reason
- Breaking Traditional Scaling Laws with Gemini and Phi
- Phi 1, 1.5, and 2 Models
- Gemini Models
- Domain-Specific Models
- Step 1 - Training Your Own Tokenizer
- Step 2 - Training Your Own Domain-Specific Model
- More References for Fine-Tuning
- Evaluating Domain-Specific Models vs. Generic Models
- The Power of Prompting with General-Purpose Models
- Summary
- Chapter 5 Infrastructure and Deployment Tuning Strategies
- Introduction to Tuning Strategies
- Hardware Utilization and Batch Tuning
- Memory Occupancy
- Strategies to Fit Larger Models in Memory
- KV Caching
- PagedAttention
- How Does PagedAttention Work?
- Comparisons, Limitations, and Cost Considerations
- AlphaServe
- How Does AlphaServe Work?
- Impact of Batching
- Cost and Performance Considerations
- S3: Scheduling Sequences with Speculation
- How Does S3 Work?
- Performance and Cost
- Streaming LLMs with Attention Sinks
- Fixed to Sliding Window Attention
- Extending the Context Length
- Working with Infinite Length Context
- How Does StreamingLLM Work?
- Performance and Results
- Cost Considerations
- Batch Size Tuning
- Frameworks for Deployment Configuration Testing
- Cloud-Native Inference Frameworks
- Deep Dive into Serving Stack Choices
- Batching Options
- Options in DJL Serving
- High-Level Guidance for Selecting Serving Parameters
- Automatically Finding Good Inference Configurations
- Creating a Generic Template
- Defining an HPO Space
- Searching the Space for Optimal Configurations
- Results of Inference HPO
- Inference Acceleration Tools
- TensorRT and GPU Acceleration Tools
- CPU Acceleration Tools
- Monitoring and Observability
- LLMOps and Monitoring
- Why Is Monitoring Important for LLMs?
- Monitoring and Updating Guardrails
- Summary
- Conclusion
- Index
- EULA