Apache Spark machine learning blueprints: develop a range of cutting-edge machine learning projects with Apache Spark using this actionable guide
Develop a range of cutting-edge machine learning projects with Apache Spark using this actionable guide. About This Book: Customize Apache Spark and R to fit your analytical needs in customer research, fraud detection, risk analytics, and recommendation engine development. Develop a set of practical Ma...
Other Authors: | |
---|---|
Format: | E-book |
Language: | English |
Published: | Birmingham : Packt Publishing, 2016. |
Edition: | 1st edition |
Series: | Community experience distilled. |
Subjects: | |
View in Biblioteca Universitat Ramon Llull: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630193006719 |
Table of Contents:
- Cover
- Copyright
- Credits
- About the Author
- About the Reviewer
- www.PacktPub.com
- Table of Contents
- Preface
- Chapter 1: Spark for Machine Learning
- Spark overview and Spark advantages
- Spark overview
- Spark advantages
- Spark computing for machine learning
- Machine learning algorithms
- MLlib
- Other ML libraries
- Spark RDD and dataframes
- Spark RDD
- Spark dataframes
- Dataframes API for R
- ML frameworks, RM4Es and Spark computing
- ML frameworks
- RM4Es
- The Spark computing framework
- ML workflows and Spark pipelines
- ML as a step-by-step workflow
- ML workflow examples
- Spark notebooks
- Notebook approach for ML
- Step 1: Getting the software ready
- Step 2: Installing the Knitr package
- Step 3: Creating a simple report
- Spark notebooks
- Summary
- Chapter 2: Data Preparation for Spark ML
- Accessing and loading datasets
- Accessing publicly available datasets
- Loading datasets into Spark
- Exploring and visualizing datasets
- Data cleaning
- Dealing with data incompleteness
- Data cleaning in Spark
- Data cleaning made easy
- Identity matching
- Identity issues
- Identity matching on Spark
- Entity resolution
- Short string comparison
- Long string comparison
- Record deduplication
- Identity matching made better
- Crowdsourced deduplication
- Configuring the crowd
- Using the crowd
- Dataset reorganizing
- Dataset reorganizing tasks
- Dataset reorganizing with Spark SQL
- Dataset reorganizing with R on Spark
- Dataset joining
- Dataset joining and its tool - the Spark SQL
- Dataset joining in Spark
- Dataset joining with the R data table package
- Feature extraction
- Feature development challenges
- Feature development with Spark MLlib
- Feature development with R
- Repeatability and automation
- Dataset preprocessing workflows
- Spark pipelines for dataset preprocessing
- Dataset preprocessing automation
- Summary
- Chapter 3: A Holistic View on Spark
- Spark for a holistic view
- The use case
- Fast and easy computing
- Methods for a holistic view
- Regression modeling
- The SEM approach
- Decision trees
- Feature preparation
- PCA
- Grouping by category to use subject knowledge
- Feature selection
- Model estimation
- MLlib implementation
- The R notebooks' implementation
- Model evaluation
- Quick evaluations
- RMSE
- ROC curves
- Results explanation
- Impact assessments
- Deployment
- Dashboard
- Rules
- Summary
- Chapter 4: Fraud Detection on Spark
- Spark for fraud detection
- The use case
- Distributed computing
- Methods for fraud detection
- Random forest
- Decision trees
- Feature preparation
- Feature extraction from LogFile
- Data merging
- Model estimation
- MLlib implementation
- R notebooks implementation
- Model evaluation
- A quick evaluation
- Confusion matrix and false positive ratios
- Results explanation
- Big influencers and their impacts
- Deploying fraud detection
- Rules
- Scoring
- Summary
- Chapter 5: Risk Scoring on Spark
- Spark for risk scoring
- The use case
- Apache Spark notebooks
- Methods of risk scoring
- Logistic regression
- Preparing coding in R
- Random forest and decision trees
- Preparing coding
- Data and feature preparation
- OpenRefine
- Model estimation
- The DataScientistWorkbench for R notebooks
- R notebooks implementation
- Model evaluation
- Confusion matrix
- ROC
- Kolmogorov-Smirnov
- Results explanation
- Big influencers and their impacts
- Deployment
- Scoring
- Summary
- Chapter 6: Churn Prediction on Spark
- Spark for churn prediction
- The use case
- Spark computing
- Methods for churn prediction
- Regression models
- Decision trees and Random forest
- Feature preparation
- Feature extraction
- Feature selection
- Model estimation
- Spark implementation with MLlib
- Model evaluation
- Results explanation
- Calculating the impact of interventions
- Deployment
- Scoring
- Intervention recommendations
- Summary
- Chapter 7: Recommendations on Spark
- Apache Spark for a recommendation engine
- The use case
- SPSS on Spark
- Methods for recommendation
- Collaborative filtering
- Preparing coding
- Data treatment with SPSS
- Missing data nodes on SPSS modeler
- Model estimation
- SPSS on Spark - the SPSS Analytics server
- Model evaluation
- Recommendation deployment
- Summary
- Chapter 8: Learning Analytics on Spark
- Spark for attrition prediction
- The use case
- Spark computing
- Methods of attrition prediction
- Regression models
- About regression
- Preparing for coding
- Decision trees
- Preparing for coding
- Feature preparation
- Feature development
- Feature selection
- Principal components analysis
- ML feature selection
- Model estimation
- Spark implementation with the Zeppelin notebook
- Model evaluation
- A quick evaluation
- The confusion matrix and error ratios
- Results explanation
- Calculating the impact of interventions
- Calculating the impact of main causes
- Deployment
- Rules
- Scoring
- Summary
- Chapter 9: City Analytics on Spark
- Spark for service forecasting
- The use case
- Spark computing
- Methods of service forecasting
- Regression models
- About regression
- Preparing for coding
- Time series modeling
- About time series
- Preparing for coding
- Data and feature preparation
- Data merging
- Feature selection
- Model estimation
- Spark implementation with the Zeppelin notebook
- Spark implementation with the R notebook
- Model evaluation
- RMSE calculation with MLlib
- RMSE calculation with R
- Explanations of the results
- Biggest influencers
- Visualizing trends
- The rules of sending out alerts
- Scores to rank city zones
- Summary
- Chapter 10: Learning Telco Data on Spark
- Spark for using Telco Data
- The use case
- Spark computing
- Methods for learning from Telco Data
- Descriptive statistics and visualization
- Linear and logistic regression models
- Decision tree and random forest
- Data and feature development
- Data reorganizing
- Feature development and selection
- Model estimation
- SPSS on Spark - SPSS Analytics Server
- Model evaluation
- RMSE calculations with MLlib
- RMSE calculations with R
- Confusion matrix and error ratios with MLlib and R
- Results explanation
- Descriptive statistics and visualizations
- Biggest influencers
- Special insights
- Visualizing trends
- Model deployment
- Rules to send out alerts
- Scores subscribers for churn and for Call Center calls
- Scores subscribers for purchase propensity
- Summary
- Chapter 11: Modeling Open Data on Spark
- Spark for learning from open data
- The use case
- Spark computing
- Methods for scoring and ranking
- Cluster analysis
- Principal component analysis
- Regression models
- Score resembling
- Data and feature preparation
- Data cleaning
- Data merging
- Feature development
- Feature selection
- Model estimation
- SPSS on Spark - SPSS Analytics Server
- Model evaluation
- RMSE calculations with MLlib
- RMSE calculations with R
- Results explanation
- Comparing ranks
- Biggest influencers
- Deployment
- Rules for sending out alerts
- Scores for ranking school districts
- Summary
- Index