Mastering Spark for Data Science: master the techniques and sophisticated analytics used to construct Spark-based solutions that scale to deliver production-grade data science products

About This Book: Develop and apply advanced analytical techniques with Spark. Learn how to tell a compelling story with data science using Spark's...
Other Authors: | |
---|---|
Format: | eBook |
Language: | English |
Published: | Birmingham, England: Packt Publishing, 2017 |
Edition: | 1st edition |
Subjects: | |
View at the Universitat Ramon Llull Library: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630128206719 |
Table of Contents:
- Cover
- Copyright
- Credits
- Foreword
- About the Authors
- About the Reviewer
- www.PacktPub.com
- Customer Feedback
- Table of Contents
- Preface
- Chapter 1: The Big Data Science Ecosystem
- Introducing the Big Data ecosystem
- Data management
- Data management responsibilities
- The right tool for the job
- Overall architecture
- Data Ingestion
- Data Lake
- Reliable storage
- Scalable data processing capability
- Data science platform
- Data Access
- Data technologies
- The role of Apache Spark
- Companion tools
- Apache HDFS
- Advantages
- Disadvantages
- Installation
- Amazon S3
- Advantages
- Disadvantages
- Installation
- Apache Kafka
- Advantages
- Disadvantages
- Installation
- Apache Parquet
- Advantages
- Disadvantages
- Installation
- Apache Avro
- Advantages
- Disadvantages
- Installation
- Apache NiFi
- Advantages
- Disadvantages
- Installation
- Apache YARN
- Advantages
- Disadvantages
- Installation
- Apache Lucene
- Advantages
- Disadvantages
- Installation
- Kibana
- Advantages
- Disadvantages
- Installation
- Elasticsearch
- Advantages
- Disadvantages
- Installation
- Accumulo
- Advantages
- Disadvantages
- Installation
- Summary
- Chapter 2: Data Acquisition
- Data pipelines
- Universal ingestion framework
- Introducing the GDELT news stream
- Discovering GDELT in real-time
- Our first GDELT feed
- Improving with publish and subscribe
- Content registry
- Choices and more choices
- Going with the flow
- Metadata model
- Kibana dashboard
- Quality assurance
- Example 1 - Basic quality checking, no contending users
- Example 2 - Advanced quality checking, no contending users
- Example 3 - Basic quality checking, 50% utility due to contending users
- Summary
- Chapter 3: Input Formats and Schema
- A structured life is a good life
- GDELT dimensional modeling
- GDELT model
- First look at the data
- Core global knowledge graph model
- Hidden complexity
- Denormalized models
- Challenges with flattened data
- Issue 1 - Loss of contextual information
- Issue 2 - Re-establishing dimensions
- Issue 3 - Including reference data
- Loading your data
- Schema agility
- Reality check
- GKG ELT
- Position matters
- Avro
- Spark-Avro method
- Pedagogical method
- When to perform Avro transformation
- Parquet
- Summary
- Chapter 4: Exploratory Data Analysis
- The problem, principles and planning
- Understanding the EDA problem
- Design principles
- General plan of exploration
- Preparation
- Introducing mask based data profiling
- Introducing character class masks
- Building a mask based profiler
- Setting up Apache Zeppelin
- Constructing a reusable notebook
- Exploring GDELT
- GDELT GKG datasets
- The files
- Special collections
- Reference data
- Exploring the GKG v2.1
- The Translingual files
- A configurable GCAM time series EDA
- Plot.ly charting on Apache Zeppelin
- Exploring translation sourced GCAM sentiment with plot.ly
- Concluding remarks
- A configurable GCAM Spatio-Temporal EDA
- Introducing GeoGCAM
- Does our spatial pivot work?
- Summary
- Chapter 5: Spark for Geographic Analysis
- GDELT and oil
- GDELT events
- GDELT GKG
- Formulating a plan of action
- GeoMesa
- Installing
- GDELT Ingest
- GeoMesa Ingest
- MapReduce to Spark
- Geohash
- GeoServer
- Map layers
- CQL
- Gauging oil prices
- Using the GeoMesa query API
- Data preparation
- Machine learning
- Naive Bayes
- Results
- Analysis
- Summary
- Chapter 6: Scraping Link-Based External Data
- Building a web scale news scanner
- Accessing the web content
- The Goose library
- Integration with Spark
- Scala compatibility
- Serialization issues
- Creating a scalable, production-ready library
- Build once, read many
- Exception handling
- Performance tuning
- Named entity recognition
- Scala libraries
- NLP walkthrough
- Extracting entities
- Abstracting methods
- Building a scalable code
- Build once, read many
- Scalability is also a state of mind
- Performance tuning
- GIS lookup
- GeoNames dataset
- Building an efficient join
- Offline strategy - Bloom filtering
- Online strategy - Hash partitioning
- Content deduplication
- Context learning
- Location scoring
- Names de-duplication
- Functional programming with Scalaz
- Our de-duplication strategy
- Using the mappend operator
- Simple clean
- DoubleMetaphone
- News index dashboard
- Summary
- Chapter 7: Building Communities
- Building a graph of persons
- Contact chaining
- Extracting data from Elasticsearch
- Using the Accumulo database
- Setup Accumulo
- Cell security
- Iterators
- Elasticsearch to Accumulo
- A graph data model in Accumulo
- Hadoop input and output formats
- Reading from Accumulo
- AccumuloGraphxInputFormat and EdgeWritable
- Building a graph
- Community detection algorithm
- Louvain algorithm
- Weighted Community Clustering (WCC)
- Description
- Preprocessing stage
- Initial communities
- Message passing
- Community back propagation
- WCC iteration
- Gathering community statistics
- WCC Computation
- WCC iteration
- GDELT dataset
- The Bowie effect
- Smaller communities
- Using Accumulo cell level security
- Summary
- Chapter 8: Building a Recommendation System
- Different approaches
- Collaborative filtering
- Content-based filtering
- Custom approach
- Uninformed data
- Processing bytes
- Creating a scalable code
- From time to frequency domain
- Fast Fourier transform
- Sampling by time window
- Extracting audio signatures
- Building a song analyzer
- Selling data science is all about selling cupcakes
- Using Cassandra
- Using the Play framework
- Building a recommender
- The PageRank algorithm
- Building a Graph of Frequency Co-occurrence
- Running PageRank
- Building personalized playlists
- Expanding our cupcake factory
- Building a playlist service
- Leveraging the Spark job server
- User interface
- Summary
- Chapter 9: News Dictionary and Real-Time Tagging System
- The mechanical Turk
- Human intelligence tasks
- Bootstrapping a classification model
- Learning from Stack Exchange
- Building text features
- Training a Naive Bayes model
- Laziness, impatience, and hubris
- Designing a Spark Streaming application
- A tale of two architectures
- The CAP theorem
- The Greeks are here to help
- Importance of the Lambda architecture
- Importance of the Kappa architecture
- Consuming data streams
- Creating a GDELT data stream
- Creating a Kafka topic
- Publishing content to a Kafka topic
- Consuming Kafka from Spark Streaming
- Creating a Twitter data stream
- Processing Twitter data
- Extracting URLs and hashtags
- Keeping popular hashtags
- Expanding shortened URLs
- Fetching HTML content
- Using Elasticsearch as a caching layer
- Classifying data
- Training a Naive Bayes model
- Thread safety
- Predict the GDELT data
- Our Twitter mechanical Turk
- Summary
- Chapter 10: Story De-duplication and Mutation
- Detecting near duplicates
- First steps with hashing
- Standing on the shoulders of the Internet giants
- Simhashing
- The Hamming weight
- Detecting near duplicates in GDELT
- Indexing the GDELT database
- Persisting our RDDs
- Building a REST API
- Area of improvement
- Building stories
- Building term frequency vectors
- The curse of dimensionality, the data science plague
- Optimizing KMeans
- Story mutation
- The Equilibrium state
- Tracking stories over time
- Building a streaming application
- Streaming KMeans
- Visualization
- Building story connections
- Summary
- Chapter 11: Anomaly Detection on Sentiment Analysis
- Following the US elections on Twitter
- Acquiring data in stream
- Acquiring data in batch
- The search API
- Rate limit
- Analysing sentiment
- Massaging Twitter data
- Using the Stanford NLP
- Building the Pipeline
- Using Timely as a time series database
- Storing data
- Using Grafana to visualize sentiment
- Number of processed tweets
- Give me my Twitter account back
- Identifying the swing states
- Twitter and the Godwin point
- Learning context
- Visualizing our model
- Word2Graph and Godwin point
- Building a Word2Graph
- Random walks
- A Small Step into sarcasm detection
- Building features
- #LoveTrumpsHates
- Scoring Emojis
- Training a KMeans model
- Detecting anomalies
- Summary
- Chapter 12: TrendCalculus
- Studying trends
- The TrendCalculus algorithm
- Trend windows
- Simple trend
- User Defined Aggregate Functions
- Simple trend calculation
- Reversal rule
- Introducing the FHLS bar structure
- Visualize the data
- FHLS with reversals
- Edge cases
- Zero values
- Completing the gaps
- Stackable processing
- Practical applications
- Algorithm characteristics
- Advantages
- Disadvantages
- Possible use cases
- Chart annotation
- Co-trending
- Data reduction
- Indexing
- Fractal dimension
- Streaming proxy for piecewise linear regression
- Summary
- Chapter 13: Secure Data
- Data security
- The problem
- The basics
- Authentication and authorization
- Access control lists (ACL)
- Role-based access control (RBAC)
- Access
- Encryption
- Data at rest