R data mining: implement data mining techniques through practical use cases and real-world datasets

Mine valuable insights from your data using popular tools and techniques in R. About This Book: Understand the basics of data mining and why R is a perfect tool for it. Manipulate your data using popular R packages such as ggplot2, dplyr, and so on to gather valuable business insights from it. Apply e...


Bibliographic Details
Other Authors:	Cirillo, Andrea, author
Format:	E-book
Language:	English
Published:	Birmingham, England ; Mumbai, [India] : Packt Publishing 2017.
Edition:	1st edition
Subjects:	R (Computer program language); Data mining.
View in the Universitat Ramon Llull Library:	https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630080406719

Table of Contents:

Cover
Copyright
Credits
About the Author
About the Reviewers
www.PacktPub.com
Customer Feedback
Table of Contents
Preface
Chapter 1: Why to Choose R for Your Data Mining and Where to Start
What is R?
A bit of history
R's points of strength
Open source inside
Plugin ready
Data visualization friendly
Installing R and writing R code
Downloading R
R installation for Windows and macOS
R installation for Linux OS
Main components of a base R installation
Possible alternatives to write and run R code
RStudio (all OSs)
The Jupyter Notebook (all OSs)
Visual Studio (Windows users only)
R foundational notions
A preliminary R session
Executing R interactively through the R console
Creating an R script
Executing an R script
Vectors
Lists
Creating lists
Subsetting lists
Data frames
Functions
R's weaknesses and how to overcome them
Learning R effectively and minimizing the effort
The tidyverse
Leveraging the R community to learn R
Where to find the R community
Engaging with the community to learn R
Handling large datasets with R
Further references
Summary
Chapter 2: A First Primer on Data Mining - Analysing Your Bank Account Data
Acquiring and preparing your banking data
Data model
Summarizing your data with pivot-like tables
A gentle introduction to the pipe operator
An even more gentle introduction to the dplyr package
Installing the necessary packages and loading your data into R
Installing and loading the necessary packages
Importing your data into R
Defining the monthly and daily sum of expenses
Visualizing your data with ggplot2
Basic data visualization principles
Less but better
Not every chart is good for your message
Scatter plot
Line chart
Bar plot
Other advanced charts
Colors have to be chosen carefully
A bit of theory - chromatic circle, hue, and luminosity
Visualizing your data with ggplot
One more gentle introduction - the grammar of graphics
A layered grammar of graphics - ggplot2
Visualizing your banking movements with ggplot2
Visualizing the number of movements per day of the week
Further references
Summary
Chapter 3: The Data Mining Process - CRISP-DM Methodology
The CRISP-DM methodology data mining cycle
Business understanding
Data understanding
Data collection
How to perform data collection with R
Data import from TXT and CSV files
Data import from different types of format already structured as tables
Data import from unstructured sources
Data description
How to perform data description with R
Data exploration
What to use in R to perform this task
The summary() function
Box plot
Histograms
Data preparation
Modelling
Defining a data modeling strategy
How similar problems were solved in the past
Emerging techniques
Classification of modeling problems
How to perform data modeling with R
Evaluation
Clustering evaluation
Classification evaluation
Regression evaluation
How to judge the adequacy of a model's performance
What to use in R to perform this task
Deployment
Deployment plan development
Maintenance plan development
Summary
Chapter 4: Keeping the House Clean - The Data Mining Architecture
A general overview
Data sources
Types of data sources
Unstructured data sources
Structured data sources
Key issues of data sources
Databases and data warehouses
The third wheel - the data mart
One-level database
Two-level database
Three-level database
Technologies
SQL
MongoDB
Hadoop
The data mining engine
The interpreter
The interface between the engine and the data warehouse
The data mining algorithms
User interface
Clarity
Clarity and mystery
Clarity and simplicity
Efficiency
Consistency
Syntax highlight
Auto-completion
How to build a data mining architecture in R
Data sources
The data warehouse
The data mining engine
The interface between the engine and the data warehouse
The data mining algorithms
The user interface
Further references
Summary
Chapter 5: How to Address a Data Mining Problem - Data Cleaning and Validation
On a quiet day
Data cleaning
Tidy data
Analysing the structure of our data
The str function
The describe function
head, tail, and View functions
Evaluating your data tidiness
Every row is a record
Every column shows an attribute
Every table represents an observational unit
Tidying our data
The tidyr package
Long versus wide data
The spread function
The gather function
The separate function
Applying tidyr to our dataset
Validating our data
Fitness for use
Conformance to standards
Data quality controls
Consistency checks
Data type checks
Logical checks
Domain checks
Uniqueness checks
Performing data validation on our data
Data type checks with str()
Domain checks
The final touch - data merging
left_join function
moving beyond left_join
Further references
Summary
Chapter 6: Looking into Your Data Eyes - Exploratory Data Analysis
Introducing summary EDA
Describing the population distribution
Quartiles and Median
Mean
The mean and phenomenon going on within sub populations
The mean being biased by outlier values
Computing the mean of our population
Variance
Standard deviation
Skewness
Measuring the relationship between variables
Correlation
The Pearson correlation coefficient
Distance correlation
Weaknesses of summary EDA - the Anscombe quartet
Graphical EDA
Visualizing a variable distribution
Histogram
Reporting date histogram
Geographical area histogram
Cash flow histogram
Boxplot
Checking for outliers
Visualizing relationships between variables
Scatterplots
Adding title, subtitle, and caption to the plot
Setting axis and legend
Adding explicative text to the plot
Final touches on colors
Further references
Summary
Chapter 7: Our First Guess - a Linear Regression
Defining a data modelling strategy
Data modelling notions
Supervised learning
Unsupervised learning
The modeling strategy
Applying linear regression to our data
The intuition behind linear regression
The math behind the linear regression
Ordinary least squares technique
Model requirements - what to look for before applying the model
Residuals' uncorrelation
Residuals' homoscedasticity
How to apply linear regression in R
Fitting the linear regression model
Validating model assumption
Visualizing fitted values
Preparing the data for visualization
Developing the data visualization
Further references
Summary
Chapter 8: A Gentle Introduction to Model Performance Evaluation
Defining model performance
Fitting versus interpretability
Making predictions with models
Measuring performance in regression models
Mean squared error
R-squared
R-squared meaning and interpretation
R-squared computation in R
Adjusted R-squared
R-squared misconceptions
The R-squared doesn't measure the goodness of fit
A low R-squared doesn't mean your model is not statistically significant
Measuring the performance in classification problems
The confusion matrix
Confusion matrix in R
Accuracy
How to compute accuracy in R
Sensitivity
How to compute sensitivity in R
Specificity
How to compute specificity in R
How to choose the right performance statistics
A final general warning - training versus test datasets
Further references
Summary
Chapter 9: Don't Give up - Power up Your Regression Including Multiple Variables
Moving from simple to multiple linear regression
Notation
Assumptions
Variables' collinearity
Tolerance
Variance inflation factors
Addressing collinearity
Dimensionality reduction
Stepwise regression
Backward stepwise regression
From the full model to the n-1 model
Forward stepwise regression
Double direction stepwise regression
Principal component regression
Fitting a multiple linear model with R
Model fitting
Variable assumptions validation
Residual assumptions validation
Dimensionality reduction
Principal component regression
Stepwise regression
Linear model cheat sheet
Further references
Summary
Chapter 10: A Different Outlook to Problems with Classification Models
What is classification and why do we need it?
Linear regression limitations for categorical variables
Common classification algorithms and models
Logistic regression
The intuition behind logistic regression
The logistic function estimates a response variable enclosed within an upper and lower bound
The logistic function estimates the probability of an observation pertaining to one of the two available categories
The math behind logistic regression
Maximum likelihood estimator
Model assumptions
Absence of multicollinearity between variables
Linear relationship between explanatory variables and log odds
Large enough sample size
How to apply logistic regression in R
Fitting the model
Reading the glm() estimation output
The level of statistical significance of the association between the explanatory variable and the response variable
