R data mining: implement data mining techniques through practical use cases and real-world datasets
Mine valuable insights from your data using popular tools and techniques in R. About This Book: Understand the basics of data mining and why R is a perfect tool for it. Manipulate your data using popular R packages such as ggplot2, dplyr, and so on to gather valuable business insights from it. Apply e...
Other authors: | |
---|---|
Format: | E-book |
Language: | English |
Published: | Birmingham, England ; Mumbai, [India] : Packt Publishing, 2017. |
Edition: | 1st edition |
Subjects: | |
View in the Universitat Ramon Llull Library: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630080406719 |
Table of Contents:
- Cover
- Copyright
- Credits
- About the Author
- About the Reviewers
- www.PacktPub.com
- Customer Feedback
- Table of Contents
- Preface
- Chapter 1: Why to Choose R for Your Data Mining and Where to Start
- What is R?
- A bit of history
- R's points of strength
- Open source inside
- Plugin ready
- Data visualization friendly
- Installing R and writing R code
- Downloading R
- R installation for Windows and macOS
- R installation for Linux OS
- Main components of a base R installation
- Possible alternatives to write and run R code
- RStudio (all OSs)
- The Jupyter Notebook (all OSs)
- Visual Studio (Windows users only)
- R foundational notions
- A preliminary R session
- Executing R interactively through the R console
- Creating an R script
- Executing an R script
- Vectors
- Lists
- Creating lists
- Subsetting lists
- Data frames
- Functions
- R's weaknesses and how to overcome them
- Learning R effectively and minimizing the effort
- The tidyverse
- Leveraging the R community to learn R
- Where to find the R community
- Engaging with the community to learn R
- Handling large datasets with R
- Further references
- Summary
- Chapter 2: A First Primer on Data Mining - Analysing Your Bank Account Data
- Acquiring and preparing your banking data
- Data model
- Summarizing your data with pivot-like tables
- A gentle introduction to the pipe operator
- An even more gentle introduction to the dplyr package
- Installing the necessary packages and loading your data into R
- Installing and loading the necessary packages
- Importing your data into R
- Defining the monthly and daily sum of expenses
- Visualizing your data with ggplot2
- Basic data visualization principles
- Less but better
- Not every chart is good for your message
- Scatter plot
- Line chart
- Bar plot
- Other advanced charts
- Colors have to be chosen carefully
- A bit of theory - chromatic circle, hue, and luminosity
- Visualizing your data with ggplot
- One more gentle introduction - the grammar of graphics
- A layered grammar of graphics - ggplot2
- Visualizing your banking movements with ggplot2
- Visualizing the number of movements per day of the week
- Further references
- Summary
- Chapter 3: The Data Mining Process - CRISP-DM Methodology
- The CRISP-DM methodology data mining cycle
- Business understanding
- Data understanding
- Data collection
- How to perform data collection with R
- Data import from TXT and CSV files
- Data import from different types of format already structured as tables
- Data import from unstructured sources
- Data description
- How to perform data description with R
- Data exploration
- What to use in R to perform this task
- The summary() function
- Box plot
- Histograms
- Data preparation
- Modelling
- Defining a data modeling strategy
- How similar problems were solved in the past
- Emerging techniques
- Classification of modeling problems
- How to perform data modeling with R
- Evaluation
- Clustering evaluation
- Classification evaluation
- Regression evaluation
- How to judge the adequacy of a model's performance
- What to use in R to perform this task
- Deployment
- Deployment plan development
- Maintenance plan development
- Summary
- Chapter 4: Keeping the House Clean - The Data Mining Architecture
- A general overview
- Data sources
- Types of data sources
- Unstructured data sources
- Structured data sources
- Key issues of data sources
- Databases and data warehouses
- The third wheel - the data mart
- One-level database
- Two-level database
- Three-level database
- Technologies
- SQL
- MongoDB
- Hadoop
- The data mining engine
- The interpreter
- The interface between the engine and the data warehouse
- The data mining algorithms
- User interface
- Clarity
- Clarity and mystery
- Clarity and simplicity
- Efficiency
- Consistency
- Syntax highlight
- Auto-completion
- How to build a data mining architecture in R
- Data sources
- The data warehouse
- The data mining engine
- The interface between the engine and the data warehouse
- The data mining algorithms
- The user interface
- Further references
- Summary
- Chapter 5: How to Address a Data Mining Problem - Data Cleaning and Validation
- On a quiet day
- Data cleaning
- Tidy data
- Analysing the structure of our data
- The str function
- The describe function
- head, tail, and View functions
- Evaluating your data tidiness
- Every row is a record
- Every column shows an attribute
- Every table represents an observational unit
- Tidying our data
- The tidyr package
- Long versus wide data
- The spread function
- The gather function
- The separate function
- Applying tidyr to our dataset
- Validating our data
- Fitness for use
- Conformance to standards
- Data quality controls
- Consistency checks
- Data type checks
- Logical checks
- Domain checks
- Uniqueness checks
- Performing data validation on our data
- Data type checks with str()
- Domain checks
- The final touch - data merging
- left_join function
- Moving beyond left_join
- Further references
- Summary
- Chapter 6: Looking into Your Data Eyes - Exploratory Data Analysis
- Introducing summary EDA
- Describing the population distribution
- Quartiles and Median
- Mean
- The mean and phenomenon going on within sub populations
- The mean being biased by outlier values
- Computing the mean of our population
- Variance
- Standard deviation
- Skewness
- Measuring the relationship between variables
- Correlation
- The Pearson correlation coefficient
- Distance correlation
- Weaknesses of summary EDA - the Anscombe quartet
- Graphical EDA
- Visualizing a variable distribution
- Histogram
- Reporting date histogram
- Geographical area histogram
- Cash flow histogram
- Boxplot
- Checking for outliers
- Visualizing relationships between variables
- Scatterplots
- Adding title, subtitle, and caption to the plot
- Setting axis and legend
- Adding explicative text to the plot
- Final touches on colors
- Further references
- Summary
- Chapter 7: Our First Guess - a Linear Regression
- Defining a data modelling strategy
- Data modelling notions
- Supervised learning
- Unsupervised learning
- The modeling strategy
- Applying linear regression to our data
- The intuition behind linear regression
- The math behind the linear regression
- Ordinary least squares technique
- Model requirements - what to look for before applying the model
- Residuals' uncorrelation
- Residuals' homoscedasticity
- How to apply linear regression in R
- Fitting the linear regression model
- Validating model assumption
- Visualizing fitted values
- Preparing the data for visualization
- Developing the data visualization
- Further references
- Summary
- Chapter 8: A Gentle Introduction to Model Performance Evaluation
- Defining model performance
- Fitting versus interpretability
- Making predictions with models
- Measuring performance in regression models
- Mean squared error
- R-squared
- R-squared meaning and interpretation
- R-squared computation in R
- Adjusted R-squared
- R-squared misconceptions
- The R-squared doesn't measure the goodness of fit
- A low R-squared doesn't mean your model is not statistically significant
- Measuring the performance in classification problems
- The confusion matrix
- Confusion matrix in R
- Accuracy
- How to compute accuracy in R
- Sensitivity
- How to compute sensitivity in R
- Specificity
- How to compute specificity in R
- How to choose the right performance statistics
- A final general warning - training versus test datasets
- Further references
- Summary
- Chapter 9: Don't Give up - Power up Your Regression Including Multiple Variables
- Moving from simple to multiple linear regression
- Notation
- Assumptions
- Variables' collinearity
- Tolerance
- Variance inflation factors
- Addressing collinearity
- Dimensionality reduction
- Stepwise regression
- Backward stepwise regression
- From the full model to the n-1 model
- Forward stepwise regression
- Double direction stepwise regression
- Principal component regression
- Fitting a multiple linear model with R
- Model fitting
- Variable assumptions validation
- Residual assumptions validation
- Dimensionality reduction
- Principal component regression
- Stepwise regression
- Linear model cheat sheet
- Further references
- Summary
- Chapter 10: A Different Outlook to Problems with Classification Models
- What is classification and why do we need it?
- Linear regression limitations for categorical variables
- Common classification algorithms and models
- Logistic regression
- The intuition behind logistic regression
- The logistic function estimates a response variable enclosed within an upper and lower bound
- The logistic function estimates the probability of an observation pertaining to one of the two available categories
- The math behind logistic regression
- Maximum likelihood estimator
- Model assumptions
- Absence of multicollinearity between variables
- Linear relationship between explanatory variables and log odds
- Large enough sample size
- How to apply logistic regression in R
- Fitting the model
- Reading the glm() estimation output
- The level of statistical significance of the association between the explanatory variable and the response variable