Modern Big Data Processing with Hadoop: expert techniques for architecting end-to-end big data solutions to get valuable insights
A comprehensive guide to design, build, and execute effective Big Data strategies using Hadoop.

About This Book: Get an in-depth view of the Apache Hadoop ecosystem and an overview of the architectural patterns pertaining to the popular Big Data platform. Conquer different data processing and analytics...
Format: eBook
Language: English
Published: Birmingham; Mumbai: Packt Publishing, 2018
Edition: 1st edition
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009631753406719
Table of Contents:
- Cover
- Title Page
- Copyright and Credits
- Packt Upsell
- Contributors
- Table of Contents
- Preface
- Chapter 1: Enterprise Data Architecture Principles
- Data architecture principles
- Volume
- Velocity
- Variety
- Veracity
- The importance of metadata
- Data governance
- Fundamentals of data governance
- Data security
- Application security
- Input data
- Big data security
- RDBMS security
- BI security
- Physical security
- Data encryption
- Secure key management
- Data as a Service
- Evolution of data architecture with Hadoop
- Hierarchical database architecture
- Network database architecture
- Relational database architecture
- Employees
- Devices
- Department
- Department and employee mapping table
- Hadoop data architecture
- Data layer
- Data management layer
- Job execution layer
- Summary
- Chapter 2: Hadoop Life Cycle Management
- Data wrangling
- Data acquisition
- Data structure analysis
- Information extraction
- Unwanted data removal
- Data transformation
- Data standardization
- Data masking
- Substitution
- Static
- Dynamic
- Encryption
- Hashing
- Hiding
- Erasing
- Truncation
- Variance
- Shuffling
- Data security
- What is Apache Ranger?
- Apache Ranger installation using Ambari
- Ambari admin UI
- Add service
- Service placement
- Service client placement
- Database creation on master
- Ranger database configuration
- Configuration changes
- Configuration review
- Deployment progress
- Application restart
- Apache Ranger user guide
- Login to UI
- Access manager
- Service details
- Policy definition and auditing for HDFS
- Summary
- Chapter 3: Hadoop Design Considerations
- Understanding data structure principles
- Installing Hadoop cluster
- Configuring Hadoop on NameNode
- Format NameNode
- Start all services
- Exploring HDFS architecture
- Defining NameNode
- Secondary NameNode
- NameNode safe mode
- DataNode
- Data replication
- Rack awareness
- HDFS WebUI
- Introducing YARN
- YARN architecture
- Resource manager
- Node manager
- Configuration of YARN
- Configuring HDFS high availability
- During Hadoop 1.x
- During Hadoop 2.x and onwards
- HDFS HA cluster using NFS
- Important architecture points
- Configuration of HA NameNodes with shared storage
- HDFS HA cluster using the quorum journal manager
- Important architecture points
- Configuration of HA NameNodes with QJM
- Automatic failover
- Important architecture points
- Configuring automatic failover
- Hadoop cluster composition
- Typical Hadoop cluster
- Best practices for Hadoop deployment
- Hadoop file formats
- Text/CSV file
- JSON
- Sequence file
- Avro
- Parquet
- ORC
- Which file format is better?
- Summary
- Chapter 4: Data Movement Techniques
- Batch processing versus real-time processing
- Batch processing
- Real-time processing
- Apache Sqoop
- Sqoop import
- Import into HDFS
- Import a MySQL table into an HBase table
- Sqoop export
- Flume
- Apache Flume architecture
- Data flow using Flume
- Flume complex data flow architecture
- Flume setup
- Log aggregation use case
- Apache NiFi
- Main concepts of Apache NiFi
- Apache NiFi architecture
- Key features
- Real-time log capture dataflow
- Kafka Connect
- Kafka Connect - a brief history
- Why Kafka Connect?
- Kafka Connect features
- Kafka Connect architecture
- Kafka Connect worker modes
- Standalone mode
- Distributed mode
- Kafka Connect cluster distributed architecture
- Example 1
- Example 2
- Summary
- Chapter 5: Data Modeling in Hadoop
- Apache Hive
- Apache Hive and RDBMS
- Supported datatypes
- How Hive works
- Hive architecture
- Hive data model management
- Hive tables
- Managed tables
- External tables
- Hive table partition
- Hive static partitions and dynamic partitions
- Hive partition bucketing
- How Hive bucketing works
- Creating buckets in a non-partitioned table
- Creating buckets in a partitioned table
- Hive views
- Syntax of a view
- Hive indexes
- Compact index
- Bitmap index
- JSON documents using Hive
- Example 1 - Accessing simple JSON documents with Hive (Hive 0.14 and later versions)
- Example 2 - Accessing nested JSON documents with Hive (Hive 0.14 and later versions)
- Example 3 - Schema evolution with Hive and Avro (Hive 0.14 and later versions)
- Apache HBase
- Differences between HDFS and HBase
- Differences between Hive and HBase
- Key features of HBase
- HBase data model
- Difference between an RDBMS table and a column-oriented data store
- HBase architecture
- HBase architecture in a nutshell
- HBase rowkey design
- Example 4 - Loading data from a MySQL table to an HBase table
- Example 5 - Incrementally loading data from a MySQL table to an HBase table
- Example 6 - Loading the MySQL customer changed data into the HBase table
- Example 7 - Hive HBase integration
- Summary
- Chapter 6: Designing Real-Time Streaming Data Pipelines
- Real-time streaming concepts
- Data stream
- Batch processing versus real-time data processing
- Complex event processing
- Continuous availability
- Low latency
- Scalable processing frameworks
- Horizontal scalability
- Storage
- Real-time streaming components
- Message queue
- So what is Kafka?
- Kafka features
- Kafka architecture
- Kafka architecture components
- Kafka Connect deep dive
- Kafka Connect architecture
- Kafka Connect workers: standalone versus distributed mode
- Install Kafka
- Create topics
- Generate messages to verify the producer and consumer
- Kafka Connect using file Source and Sink
- Kafka Connect using JDBC and file Sink Connectors
- Apache Storm
- Features of Apache Storm
- Storm topology
- Storm topology components
- Installing Storm on a single-node cluster
- Developing a real-time streaming pipeline with Storm
- Streaming a pipeline from Kafka to Storm to MySQL
- Streaming a pipeline from Kafka to Storm to HDFS
- Other popular real-time data streaming frameworks
- Kafka Streams API
- Spark Streaming
- Apache Flink
- Apache Flink versus Spark
- Apache Spark versus Storm
- Summary
- Chapter 7: Large-Scale Data Processing Frameworks
- MapReduce
- Hadoop MapReduce
- Streaming MapReduce
- Java MapReduce
- Summary
- Apache Spark 2
- Installing Spark using Ambari
- Service selection in Ambari Admin
- Add Service Wizard
- Server placement
- Clients and Slaves selection
- Service customization
- Software deployment
- Spark installation progress
- Service restarts and cleanup
- Apache Spark data structures
- RDDs, DataFrames, and Datasets
- Apache Spark programming
- Sample data for analysis
- Interactive data analysis with pyspark
- Standalone application with Spark
- Spark streaming application
- Spark SQL application
- Summary
- Chapter 8: Building Enterprise Search Platform
- The data search concept
- The need for an enterprise search engine
- Tools for building an enterprise search engine
- Elasticsearch
- Why Elasticsearch?
- Elasticsearch components
- Index
- Document
- Mapping
- Cluster
- Type
- How to index documents in Elasticsearch?
- Elasticsearch installation
- Installation of Elasticsearch
- Create index
- Primary shard
- Replica shard
- Ingest documents into index
- Bulk Insert
- Document search
- Meta fields
- Mapping
- Static mapping
- Dynamic mapping
- Elasticsearch-supported data types
- Mapping example
- Analyzer
- Elasticsearch stack components
- Beats
- Logstash
- Kibana
- Use case
- Summary
- Chapter 9: Designing Data Visualization Solutions
- Data visualization
- Bar/column chart
- Line/area chart
- Pie chart
- Radar chart
- Scatter/bubble chart
- Other charts
- Practical data visualization in Hadoop
- Apache Druid
- Druid components
- Other required components
- Apache Druid installation
- Add service
- Select Druid and Superset
- Service placement on servers
- Choose Slaves and Clients
- Service configurations
- Service installation
- Installation summary
- Sample data ingestion into Druid
- MySQL database
- Sample database
- Download the sample dataset
- Copy the data to MySQL
- Verify integrity of the tables
- Single normalized table
- Apache Superset
- Accessing the Superset application
- Superset dashboards
- Understanding Wikipedia edits data
- Create Superset Slices using Wikipedia data
- Unique users count
- Word Cloud for top US regions
- Sunburst chart - top 10 cities
- Top 50 channels and namespaces via directed force layout
- Top 25 countries/channels distribution
- Creating a Wikipedia edits dashboard from Slices
- Apache Superset with RDBMS
- Supported databases
- Understanding employee database
- Employees table
- Departments table
- Department manager table
- Department employees table
- Titles table
- Salaries table
- Normalized employees table
- Superset Slices for employees database
- Register MySQL database/table
- Slices and Dashboard creation
- Department salary breakup
- Salary diversity
- Salary change per role per year
- Dashboard creation
- Summary
- Chapter 10: Developing Applications Using the Cloud
- What is the Cloud?
- Available technologies in the Cloud
- Planning the Cloud infrastructure
- Dedicated servers versus shared servers
- Dedicated servers
- Shared servers
- High availability
- Business continuity planning