Modern Big Data Processing with Hadoop: expert techniques for architecting end-to-end big data solutions to get valuable insights
A comprehensive guide to design, build, and execute effective Big Data strategies using Hadoop.

About This Book: Get an in-depth view of the Apache Hadoop ecosystem and an overview of the architectural patterns pertaining to the popular Big Data platform. Conquer different data processing and analytics...
Format: eBook
Language: English
Published: Birmingham; Mumbai: Packt Publishing, 2018
Edition: 1st edition
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009631753406719
Table of Contents:
- Cover
- Title Page
- Copyright and Credits
- Packt Upsell
- Contributors
- Table of Contents
- Preface
- Chapter 1: Enterprise Data Architecture Principles
- Data architecture principles
- Volume
- Velocity
- Variety
- Veracity
- The importance of metadata
- Data governance
- Fundamentals of data governance
- Data security
- Application security
- Input data
- Big data security
- RDBMS security
- BI security
- Physical security
- Data encryption
- Secure key management
- Data as a Service
- Evolution of data architecture with Hadoop
- Hierarchical database architecture
- Network database architecture
- Relational database architecture
- Employees
- Devices
- Department
- Department and employee mapping table
- Hadoop data architecture
- Data layer
- Data management layer
- Job execution layer
- Summary
- Chapter 2: Hadoop Life Cycle Management
- Data wrangling
- Data acquisition
- Data structure analysis
- Information extraction
- Unwanted data removal
- Data transformation
- Data standardization
- Data masking
- Substitution
- Static
- Dynamic
- Encryption
- Hashing
- Hiding
- Erasing
- Truncation
- Variance
- Shuffling
- Data security
- What is Apache Ranger?
- Apache Ranger installation using Ambari
- Ambari admin UI
- Add service
- Service placement
- Service client placement
- Database creation on master
- Ranger database configuration
- Configuration changes
- Configuration review
- Deployment progress
- Application restart
- Apache Ranger user guide
- Login to UI
- Access manager
- Service details
- Policy definition and auditing for HDFS
- Summary
- Chapter 3: Hadoop Design Considerations
- Understanding data structure principles
- Installing Hadoop cluster
- Configuring Hadoop on NameNode
- Format NameNode
- Start all services
- Exploring HDFS architecture
- Defining NameNode
- Secondary NameNode
- NameNode safe mode
- DataNode
- Data replication
- Rack awareness
- HDFS WebUI
- Introducing YARN
- YARN architecture
- Resource manager
- Node manager
- Configuration of YARN
- Configuring HDFS high availability
- During Hadoop 1.x
- During Hadoop 2.x and onwards
- HDFS HA cluster using NFS
- Important architecture points
- Configuration of HA NameNodes with shared storage
- HDFS HA cluster using the quorum journal manager
- Important architecture points
- Configuration of HA NameNodes with QJM
- Automatic failover
- Important architecture points
- Configuring automatic failover
- Hadoop cluster composition
- Typical Hadoop cluster
- Best practices for Hadoop deployment
- Hadoop file formats
- Text/CSV file
- JSON
- Sequence file
- Avro
- Parquet
- ORC
- Which file format is better?
- Summary
- Chapter 4: Data Movement Techniques
- Batch processing versus real-time processing
- Batch processing
- Real-time processing
- Apache Sqoop
- Sqoop import
- Import into HDFS
- Import a MySQL table into an HBase table
- Sqoop export
- Flume
- Apache Flume architecture
- Data flow using Flume
- Flume complex data flow architecture
- Flume setup
- Log aggregation use case
- Apache NiFi
- Main concepts of Apache NiFi
- Apache NiFi architecture
- Key features
- Real-time log capture dataflow
- Kafka Connect
- Kafka Connect - a brief history
- Why Kafka Connect?
- Kafka Connect features
- Kafka Connect architecture
- Kafka Connect worker modes
- Standalone mode
- Distributed mode
- Kafka Connect cluster distributed architecture
- Example 1
- Example 2
- Summary
- Chapter 5: Data Modeling in Hadoop
- Apache Hive
- Apache Hive and RDBMS
- Supported datatypes
- How Hive works
- Hive architecture
- Hive data model management
- Hive tables
- Managed tables
- External tables
- Hive table partition
- Hive static partitions and dynamic partitions
- Hive partition bucketing
- How Hive bucketing works
- Creating buckets in a non-partitioned table
- Creating buckets in a partitioned table
- Hive views
- Syntax of a view
- Hive indexes
- Compact index
- Bitmap index
- JSON documents using Hive
- Example 1 - Accessing simple JSON documents with Hive (Hive 0.14 and later versions)
- Example 2 - Accessing nested JSON documents with Hive (Hive 0.14 and later versions)
- Example 3 - Schema evolution with Hive and Avro (Hive 0.14 and later versions)
- Apache HBase
- Differences between HDFS and HBase
- Differences between Hive and HBase
- Key features of HBase
- HBase data model
- Difference between an RDBMS table and a column-oriented data store
- HBase architecture
- HBase architecture in a nutshell
- HBase rowkey design
- Example 4 - Loading data from a MySQL table to an HBase table
- Example 5 - Incrementally loading data from a MySQL table to an HBase table
- Example 6 - Loading the MySQL customer changed data into the HBase table
- Example 7 - Hive HBase integration
- Summary
- Chapter 6: Designing Real-Time Streaming Data Pipelines
- Real-time streaming concepts
- Data stream
- Batch processing versus real-time data processing
- Complex event processing
- Continuous availability
- Low latency
- Scalable processing frameworks
- Horizontal scalability
- Storage
- Real-time streaming components
- Message queue
- So what is Kafka?
- Kafka features
- Kafka architecture
- Kafka architecture components
- Kafka Connect deep dive
- Kafka Connect architecture
- Kafka Connect workers: standalone versus distributed mode
- Install Kafka
- Create topics
- Generate messages to verify the producer and consumer
- Kafka Connect using file Source and Sink
- Kafka Connect using JDBC and file Sink Connectors
- Apache Storm
- Features of Apache Storm
- Storm topology
- Storm topology components
- Installing Storm on a single-node cluster
- Developing a real-time streaming pipeline with Storm
- Streaming a pipeline from Kafka to Storm to MySQL
- Streaming a pipeline from Kafka to Storm to HDFS
- Other popular real-time data streaming frameworks
- Kafka Streams API
- Spark Streaming
- Apache Flink
- Apache Flink versus Spark
- Apache Spark versus Storm
- Summary
- Chapter 7: Large-Scale Data Processing Frameworks
- MapReduce
- Hadoop MapReduce
- Streaming MapReduce
- Java MapReduce
- Summary
- Apache Spark 2
- Installing Spark using Ambari
- Service selection in Ambari Admin
- Add Service Wizard
- Server placement
- Clients and Slaves selection
- Service customization
- Software deployment
- Spark installation progress
- Service restarts and cleanup
- Apache Spark data structures
- RDDs, DataFrames, and Datasets
- Apache Spark programming
- Sample data for analysis
- Interactive data analysis with pyspark
- Standalone application with Spark
- Spark streaming application
- Spark SQL application
- Summary
- Chapter 8: Building Enterprise Search Platform
- The data search concept
- The need for an enterprise search engine
- Tools for building an enterprise search engine
- Elasticsearch
- Why Elasticsearch?
- Elasticsearch components
- Index
- Document
- Mapping
- Cluster
- Type
- How to index documents in Elasticsearch?
- Elasticsearch installation
- Installation of Elasticsearch
- Create index
- Primary shard
- Replica shard
- Ingest documents into index
- Bulk Insert
- Document search
- Meta fields
- Mapping
- Static mapping
- Dynamic mapping
- Elasticsearch-supported data types
- Mapping example
- Analyzer
- Elasticsearch stack components
- Beats
- Logstash
- Kibana
- Use case
- Summary
- Chapter 9: Designing Data Visualization Solutions
- Data visualization
- Bar/column chart
- Line/area chart
- Pie chart
- Radar chart
- Scatter/bubble chart
- Other charts
- Practical data visualization in Hadoop
- Apache Druid
- Druid components
- Other required components
- Apache Druid installation
- Add service
- Select Druid and Superset
- Service placement on servers
- Choose Slaves and Clients
- Service configurations
- Service installation
- Installation summary
- Sample data ingestion into Druid
- MySQL database
- Sample database
- Download the sample dataset
- Copy the data to MySQL
- Verify integrity of the tables
- Single normalized table
- Apache Superset
- Accessing the Superset application
- Superset dashboards
- Understanding Wikipedia edits data
- Create Superset Slices using Wikipedia data
- Unique users count
- Word Cloud for top US regions
- Sunburst chart - top 10 cities
- Top 50 channels and namespaces via directed force layout
- Top 25 countries/channels distribution
- Creating a Wikipedia edits dashboard from Slices
- Apache Superset with RDBMS
- Supported databases
- Understanding employee database
- Employees table
- Departments table
- Department manager table
- Department employees table
- Titles table
- Salaries table
- Normalized employees table
- Superset Slices for employees database
- Register MySQL database/table
- Slices and Dashboard creation
- Department salary breakup
- Salary diversity
- Salary change per role per year
- Dashboard creation
- Summary
- Chapter 10: Developing Applications Using the Cloud
- What is the Cloud?
- Available technologies in the Cloud
- Planning the Cloud infrastructure
- Dedicated servers versus shared servers
- Dedicated servers
- Shared servers
- High availability
- Business continuity planning