Mastering Spark for data science: master the techniques and sophisticated analytics used to construct Spark-based solutions that scale to deliver production-grade data science products

Master the techniques and sophisticated analytics used to construct Spark-based solutions that scale to deliver production-grade data science products. About This Book: Develop and apply advanced analytical techniques with Spark. Learn how to tell a compelling story with data science using Spark's...


Bibliographic Details
Other Authors: Morgan, Andrew (author)
Format: E-book
Language: English
Published: Birmingham, England : Packt Publishing 2017.
Edition: 1st edition
Subjects:
View in the Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630128206719
Table of Contents:
  • Cover
  • Copyright
  • Credits
  • Foreword
  • About the Authors
  • About the Reviewer
  • www.PacktPub.com
  • Customer Feedback
  • Table of Contents
  • Preface
  • Chapter 1: The Big Data Science Ecosystem
  • Introducing the Big Data ecosystem
  • Data management
  • Data management responsibilities
  • The right tool for the job
  • Overall architecture
  • Data Ingestion
  • Data Lake
  • Reliable storage
  • Scalable data processing capability
  • Data science platform
  • Data Access
  • Data technologies
  • The role of Apache Spark
  • Companion tools
  • Apache HDFS
  • Advantages
  • Disadvantages
  • Installation
  • Amazon S3
  • Advantages
  • Disadvantages
  • Installation
  • Apache Kafka
  • Advantages
  • Disadvantages
  • Installation
  • Apache Parquet
  • Advantages
  • Disadvantages
  • Installation
  • Apache Avro
  • Advantages
  • Disadvantages
  • Installation
  • Apache NiFi
  • Advantages
  • Disadvantages
  • Installation
  • Apache YARN
  • Advantages
  • Disadvantages
  • Installation
  • Apache Lucene
  • Advantages
  • Disadvantages
  • Installation
  • Kibana
  • Advantages
  • Disadvantages
  • Installation
  • Elasticsearch
  • Advantages
  • Disadvantages
  • Installation
  • Accumulo
  • Advantages
  • Disadvantages
  • Installation
  • Summary
  • Chapter 2: Data Acquisition
  • Data pipelines
  • Universal ingestion framework
  • Introducing the GDELT news stream
  • Discovering GDELT in real-time
  • Our first GDELT feed
  • Improving with publish and subscribe
  • Content registry
  • Choices and more choices
  • Going with the flow
  • Metadata model
  • Kibana dashboard
  • Quality assurance
  • Example 1 - Basic quality checking, no contending users
  • Example 2 - Advanced quality checking, no contending users
  • Example 3 - Basic quality checking, 50% utility due to contending users
  • Summary
  • Chapter 3: Input Formats and Schema
  • A structured life is a good life
  • GDELT dimensional modeling
  • GDELT model
  • First look at the data
  • Core global knowledge graph model
  • Hidden complexity
  • Denormalized models
  • Challenges with flattened data
  • Issue 1 - Loss of contextual information
  • Issue 2: Re-establishing dimensions
  • Issue 3: Including reference data
  • Loading your data
  • Schema agility
  • Reality check
  • GKG ELT
  • Position matters
  • Avro
  • Spark-Avro method
  • Pedagogical method
  • When to perform Avro transformation
  • Parquet
  • Summary
  • Chapter 4: Exploratory Data Analysis
  • The problem, principles and planning
  • Understanding the EDA problem
  • Design principles
  • General plan of exploration
  • Preparation
  • Introducing mask based data profiling
  • Introducing character class masks
  • Building a mask based profiler
  • Setting up Apache Zeppelin
  • Constructing a reusable notebook
  • Exploring GDELT
  • GDELT GKG datasets
  • The files
  • Special collections
  • Reference data
  • Exploring the GKG v2.1
  • The Translingual files
  • A configurable GCAM time series EDA
  • Plot.ly charting on Apache Zeppelin
  • Exploring translation sourced GCAM sentiment with plot.ly
  • Concluding remarks
  • A configurable GCAM Spatio-Temporal EDA
  • Introducing GeoGCAM
  • Does our spatial pivot work?
  • Summary
  • Chapter 5: Spark for Geographic Analysis
  • GDELT and oil
  • GDELT events
  • GDELT GKG
  • Formulating a plan of action
  • GeoMesa
  • Installing
  • GDELT Ingest
  • GeoMesa Ingest
  • MapReduce to Spark
  • Geohash
  • GeoServer
  • Map layers
  • CQL
  • Gauging oil prices
  • Using the GeoMesa query API
  • Data preparation
  • Machine learning
  • Naive Bayes
  • Results
  • Analysis
  • Summary
  • Chapter 6: Scraping Link-Based External Data
  • Building a web scale news scanner
  • Accessing the web content
  • The Goose library
  • Integration with Spark
  • Scala compatibility
  • Serialization issues
  • Creating a scalable, production-ready library
  • Build once, read many
  • Exception handling
  • Performance tuning
  • Named entity recognition
  • Scala libraries
  • NLP walkthrough
  • Extracting entities
  • Abstracting methods
  • Building a scalable code
  • Build once, read many
  • Scalability is also a state of mind
  • Performance tuning
  • GIS lookup
  • GeoNames dataset
  • Building an efficient join
  • Offline strategy - Bloom filtering
  • Online strategy - Hash partitioning
  • Content deduplication
  • Context learning
  • Location scoring
  • Names de-duplication
  • Functional programming with Scalaz
  • Our de-duplication strategy
  • Using the mappend operator
  • Simple clean
  • DoubleMetaphone
  • News index dashboard
  • Summary
  • Chapter 7: Building Communities
  • Building a graph of persons
  • Contact chaining
  • Extracting data from Elasticsearch
  • Using the Accumulo database
  • Setup Accumulo
  • Cell security
  • Iterators
  • Elasticsearch to Accumulo
  • A graph data model in Accumulo
  • Hadoop input and output formats
  • Reading from Accumulo
  • AccumuloGraphxInputFormat and EdgeWritable
  • Building a graph
  • Community detection algorithm
  • Louvain algorithm
  • Weighted Community Clustering (WCC)
  • Description
  • Preprocessing stage
  • Initial communities
  • Message passing
  • Community back propagation
  • WCC iteration
  • Gathering community statistics
  • WCC Computation
  • WCC iteration
  • GDELT dataset
  • The Bowie effect
  • Smaller communities
  • Using Accumulo cell level security
  • Summary
  • Chapter 8: Building a Recommendation System
  • Different approaches
  • Collaborative filtering
  • Content-based filtering
  • Custom approach
  • Uninformed data
  • Processing bytes
  • Creating a scalable code
  • From time to frequency domain
  • Fast Fourier transform
  • Sampling by time window
  • Extracting audio signatures
  • Building a song analyzer
  • Selling data science is all about selling cupcakes
  • Using Cassandra
  • Using the Play framework
  • Building a recommender
  • The PageRank algorithm
  • Building a Graph of Frequency Co-occurrence
  • Running PageRank
  • Building personalized playlists
  • Expanding our cupcake factory
  • Building a playlist service
  • Leveraging the Spark job server
  • User interface
  • Summary
  • Chapter 9: News Dictionary and Real-Time Tagging System
  • The mechanical Turk
  • Human intelligence tasks
  • Bootstrapping a classification model
  • Learning from Stack Exchange
  • Building text features
  • Training a Naive Bayes model
  • Laziness, impatience, and hubris
  • Designing a Spark Streaming application
  • A tale of two architectures
  • The CAP theorem
  • The Greeks are here to help
  • Importance of the Lambda architecture
  • Importance of the Kappa architecture
  • Consuming data streams
  • Creating a GDELT data stream
  • Creating a Kafka topic
  • Publishing content to a Kafka topic
  • Consuming Kafka from Spark Streaming
  • Creating a Twitter data stream
  • Processing Twitter data
  • Extracting URLs and hashtags
  • Keeping popular hashtags
  • Expanding shortened URLs
  • Fetching HTML content
  • Using Elasticsearch as a caching layer
  • Classifying data
  • Training a Naive Bayes model
  • Thread safety
  • Predict the GDELT data
  • Our Twitter mechanical Turk
  • Summary
  • Chapter 10: Story De-duplication and Mutation
  • Detecting near duplicates
  • First steps with hashing
  • Standing on the shoulders of the Internet giants
  • Simhashing
  • The hamming weight
  • Detecting near duplicates in GDELT
  • Indexing the GDELT database
  • Persisting our RDDs
  • Building a REST API
  • Area of improvement
  • Building stories
  • Building term frequency vectors
  • The curse of dimensionality, the data science plague
  • Optimizing KMeans
  • Story mutation
  • The Equilibrium state
  • Tracking stories over time
  • Building a streaming application
  • Streaming KMeans
  • Visualization
  • Building story connections
  • Summary
  • Chapter 11: Anomaly Detection on Sentiment Analysis
  • Following the US elections on Twitter
  • Acquiring data in stream
  • Acquiring data in batch
  • The search API
  • Rate limit
  • Analysing sentiment
  • Massaging Twitter data
  • Using the Stanford NLP
  • Building the Pipeline
  • Using Timely as a time series database
  • Storing data
  • Using Grafana to visualize sentiment
  • Number of processed tweets
  • Give me my Twitter account back
  • Identifying the swing states
  • Twitter and the Godwin point
  • Learning context
  • Visualizing our model
  • Word2Graph and Godwin point
  • Building a Word2Graph
  • Random walks
  • A Small Step into sarcasm detection
  • Building features
  • #LoveTrumpsHates
  • Scoring Emojis
  • Training a KMeans model
  • Detecting anomalies
  • Summary
  • Chapter 12: TrendCalculus
  • Studying trends
  • The TrendCalculus algorithm
  • Trend windows
  • Simple trend
  • User Defined Aggregate Functions
  • Simple trend calculation
  • Reversal rule
  • Introducing the FHLS bar structure
  • Visualize the data
  • FHLS with reversals
  • Edge cases
  • Zero values
  • Completing the gaps
  • Stackable processing
  • Practical applications
  • Algorithm characteristics
  • Advantages
  • Disadvantages
  • Possible use cases
  • Chart annotation
  • Co-trending
  • Data reduction
  • Indexing
  • Fractal dimension
  • Streaming proxy for piecewise linear regression
  • Summary
  • Chapter 13: Secure Data
  • Data security
  • The problem
  • The basics
  • Authentication and authorization
  • Access control lists (ACL)
  • Role-based access control (RBAC)
  • Access
  • Encryption
  • Data at rest