Data Mining and Predictive Analysis

(SNHU-DM-PA.AE1)
Lessons

1

Preface

  • What is Data Mining? What is Predictive Analytics?
  • Why is this Course Needed?
  • Who Will Benefit from this Course?
  • Danger! Data Mining is Easy to do Badly
  • “White-Box” Approach
  • Algorithm Walk-Throughs
  • Exciting New Topics
  • The R Zone
  • Appendix: Data Summarization and Visualization
  • The Case Study: Bringing it all Together
  • How the Course is Structured
2

An Introduction to Data Mining and Predictive Analytics

  • What is Data Mining? What Is Predictive Analytics?
  • Wanted: Data Miners
  • The Need For Human Direction of Data Mining
  • The Cross-Industry Standard Process for Data Mining: CRISP-DM
  • Fallacies of Data Mining
  • What Tasks Can Data Mining Accomplish?
  • The R Zone
  • R References
  • Exercises
3

Data Preprocessing

  • Why do We Need to Preprocess the Data?
  • Data Cleaning
  • Handling Missing Data
  • Identifying Misclassifications
  • Graphical Methods for Identifying Outliers
  • Measures of Center and Spread
  • Data Transformation
  • Min–Max Normalization
  • Z-Score Standardization
  • Decimal Scaling
  • Transformations to Achieve Normality
  • Numerical Methods for Identifying Outliers
  • Flag Variables
  • Transforming Categorical Variables into Numerical Variables
  • Binning Numerical Variables
  • Reclassifying Categorical Variables
  • Adding an Index Field
  • Removing Variables that are not Useful
  • Variables that Should Probably not be Removed
  • Removal of Duplicate Records
  • A Word About ID Fields
  • The R Zone
  • R Reference
  • Exercises
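
A minimal R sketch of three of the transformation methods listed above (min–max normalization, z-score standardization, and decimal scaling); the vector x is a hypothetical stand-in for a numeric predictor column, not data from the course.

  # Hypothetical numeric predictor column
  x <- c(12, 7, 30, 25, 18, 41)

  # Min-max normalization: rescale x onto the [0, 1] range
  x_mm <- (x - min(x)) / (max(x) - min(x))

  # Z-score standardization: center at 0 with standard deviation 1
  x_z <- (x - mean(x)) / sd(x)

  # Decimal scaling: divide by 10^d so the largest absolute value falls below 1
  d <- ceiling(log10(max(abs(x))))
  x_ds <- x / 10^d
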
4

Exploratory Data Analysis

  • Hypothesis Testing Versus Exploratory Data Analysis
  • Getting to Know The Data Set
  • Exploring Categorical Variables
  • Exploring Numeric Variables
  • Exploring Multivariate Relationships
  • Selecting Interesting Subsets of the Data for Further Investigation
  • Using EDA to Uncover Anomalous Fields
  • Binning Based on Predictive Value
  • Deriving New Variables: Flag Variables
  • Deriving New Variables: Numerical Variables
  • Using EDA to Investigate Correlated Predictor Variables
  • Summary of Our EDA
  • The R Zone
  • R References
  • Exercises
5

Dimension-Reduction Methods

  • Need for Dimension-Reduction in Data Mining
  • Principal Components Analysis
  • Applying PCA to the Houses Data Set
  • How Many Components Should We Extract?
  • Profiling the Principal Components
  • Communalities
  • Validation of the Principal Components
  • Factor Analysis
  • Applying Factor Analysis to the Adult Data Set
  • Factor Rotation
  • User-Defined Composites
  • An Example of a User-Defined Composite
  • The R Zone
  • R References
  • Exercises
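
A minimal R sketch of principal components analysis using base R's prcomp(); the built-in iris data set is an illustrative stand-in for the Houses and Adult data sets referenced above.

  data(iris)
  num_vars <- iris[, 1:4]                       # numeric predictors only
  pca <- prcomp(num_vars, center = TRUE, scale. = TRUE)

  summary(pca)                    # proportion of variance explained by each component
  screeplot(pca, type = "lines")  # scree plot, to help decide how many components to extract
  head(pca$x)                     # component scores for the first few records
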
6

Univariate Statistical Analysis

  • Data Mining Tasks in Discovering Knowledge in Data
  • Statistical Approaches to Estimation and Prediction
  • Statistical Inference
  • How Confident are We in Our Estimates?
  • Confidence Interval Estimation of the Mean
  • How to Reduce the Margin of Error
  • Confidence Interval Estimation of the Proportion
  • Hypothesis Testing for the Mean
  • Assessing The Strength of Evidence Against The Null Hypothesis
  • Using Confidence Intervals to Perform Hypothesis Tests
  • Hypothesis Testing for The Proportion
  • Reference
  • The R Zone
  • R Reference
  • Exercises
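
A minimal R sketch of confidence interval estimation and hypothesis testing for a mean and a proportion; the sample x and the success counts are simulated, hypothetical values.

  set.seed(1)
  x <- rnorm(100, mean = 50, sd = 10)   # hypothetical sample of size 100

  # 95% confidence interval for the mean, plus a t-test of H0: mu = 48
  t.test(x, mu = 48, conf.level = 0.95)

  # Confidence interval and test for a proportion: 273 successes in 1000 trials
  prop.test(273, 1000, conf.level = 0.95)
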
7

Multivariate Statistics

  • Two-Sample t-Test for Difference in Means
  • Two-Sample Z-Test for Difference in Proportions
  • Test for the Homogeneity of Proportions
  • Chi-Square Test for Goodness of Fit of Multinomial Data
  • Analysis of Variance
  • Reference
  • The R Zone
  • R Reference
  • Exercises
8

Preparing to Model the Data

  • Supervised Versus Unsupervised Methods
  • Statistical Methodology and Data Mining Methodology
  • Cross-Validation
  • Overfitting
  • Bias–Variance Trade-Off
  • Balancing The Training Data Set
  • Establishing Baseline Performance
  • The R Zone
  • R Reference
  • Exercises
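
A minimal R sketch of partitioning a data set into training and test sets and establishing a baseline; the built-in iris data and the 75/25 split are illustrative choices, not the course's.

  data(iris)
  set.seed(7)
  n <- nrow(iris)
  train_idx <- sample(n, size = round(0.75 * n))   # 75% training partition
  train <- iris[train_idx, ]
  test  <- iris[-train_idx, ]

  # Baseline performance: always predict the most common class in the training set
  baseline_class <- names(which.max(table(train$Species)))
  mean(test$Species == baseline_class)             # baseline accuracy on the test set
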
9

Simple Linear Regression

  • An Example of Simple Linear Regression
  • Dangers of Extrapolation
  • How Useful is the Regression? The Coefficient of Determination, r²
  • Standard Error of the Estimate, s
  • Correlation Coefficient r
  • ANOVA Table for Simple Linear Regression
  • Outliers, High Leverage Points, and Influential Observations
  • Population Regression Equation
  • Verifying The Regression Assumptions
  • Inference in Regression
  • t-Test for the Relationship Between x and y
  • Confidence Interval for the Slope of the Regression Line
  • Confidence Interval for the Correlation Coefficient ρ
  • Confidence Interval for the Mean Value of y Given x
  • Prediction Interval for a Randomly Chosen Value of y Given x
  • Transformations to Achieve Linearity
  • Box–Cox Transformations
  • The R Zone
  • R References
  • Exercises
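
A minimal R sketch of simple linear regression with lm(), using the built-in cars data set (stopping distance regressed on speed) as a stand-in example.

  data(cars)
  fit <- lm(dist ~ speed, data = cars)

  summary(fit)              # slope, intercept, r-squared, standard error of the estimate, t-test
  confint(fit)              # confidence intervals for the intercept and the slope
  plot(cars$speed, cars$dist)
  abline(fit)               # fitted regression line
  plot(fit, which = 1)      # residuals vs fitted values, for verifying the regression assumptions
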
10

Logistic Regression

  • Simple Example of Logistic Regression
  • Maximum Likelihood Estimation
  • Interpreting Logistic Regression Output
  • Inference: Are the Predictors Significant?
  • Odds Ratio and Relative Risk
  • Interpreting Logistic Regression for a Dichotomous Predictor
  • Interpreting Logistic Regression for a Polychotomous Predictor
  • Interpreting Logistic Regression for a Continuous Predictor
  • Assumption of Linearity
  • Zero-Cell Problem
  • Multiple Logistic Regression
  • Introducing Higher Order Terms to Handle Nonlinearity
  • Validating the Logistic Regression Model
  • WEKA: Hands-On Analysis Using Logistic Regression
  • The R Zone
  • R References
  • Exercises
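
A minimal R sketch of logistic regression with glm(); the built-in mtcars data set, with transmission type (am) as the binary response, is an illustrative stand-in.

  data(mtcars)
  fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

  summary(fit)          # coefficients on the logit scale, with Wald z tests
  exp(coef(fit))        # odds ratios for the predictors
  p_hat <- predict(fit, type = "response")   # predicted probabilities
  head(p_hat)
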
11

Multiple Regression and Model Building

  • An Example of Multiple Regression
  • The Population Multiple Regression Equation
  • Inference in Multiple Regression
  • Regression With Categorical Predictors, Using Indicator Variables
  • Adjusting R²: Penalizing Models For Including Predictors That Are Not Useful
  • Sequential Sums of Squares
  • Multicollinearity
  • Variable Selection Methods
  • Gas Mileage Data Set
  • An Application of Variable Selection Methods
  • Using the Principal Components as Predictors in Multiple Regression
  • The R Zone
  • R References
  • Exercises
12

k-Nearest Neighbor Algorithm

  • Classification Task
  • k-Nearest Neighbor Algorithm
  • Distance Function
  • Combination Function
  • Quantifying Attribute Relevance: Stretching the Axes
  • Database Considerations
  • k-Nearest Neighbor Algorithm for Estimation and Prediction
  • Choosing k
  • Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler
  • The R Zone
  • R References
  • Exercises
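
A minimal R sketch of k-nearest neighbor classification, assuming the class package's knn() (one common choice; the R Zone may use a different package). The predictors are standardized first, in the spirit of "stretching the axes" above.

  library(class)   # provides knn()

  data(iris)
  set.seed(7)
  idx   <- sample(nrow(iris), 100)
  train <- scale(iris[idx, 1:4])                         # standardize the training predictors
  test  <- scale(iris[-idx, 1:4],
                 center = attr(train, "scaled:center"),  # reuse the training parameters
                 scale  = attr(train, "scaled:scale"))

  pred <- knn(train, test, cl = iris$Species[idx], k = 5)
  table(pred, iris$Species[-idx])   # confusion matrix on the test records
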
13

Decision Trees

  • What is a Decision Tree?
  • Requirements for Using Decision Trees
  • Classification and Regression Trees
  • C4.5 Algorithm
  • Decision Rules
  • Comparison of the C5.0 and CART Algorithms Applied to Real Data
  • The R Zone
  • R References
  • Exercises
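
A minimal R sketch of a CART-style classification tree, assuming the rpart package (the C50 package offers a C5.0 implementation); iris is an illustrative data set.

  library(rpart)   # CART-style recursive partitioning

  data(iris)
  tree <- rpart(Species ~ ., data = iris, method = "class")

  printcp(tree)                          # complexity parameter table
  plot(tree); text(tree, use.n = TRUE)   # simple tree diagram with node labels
  pred <- predict(tree, iris, type = "class")
  table(pred, iris$Species)
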
14

Neural Networks

  • Input and Output Encoding
  • Neural Networks for Estimation and Prediction
  • Simple Example of a Neural Network
  • Sigmoid Activation Function
  • Back-Propagation
  • Gradient-Descent Method
  • Back-Propagation Rules
  • Example of Back-Propagation
  • Termination Criteria
  • Learning Rate
  • Momentum Term
  • Sensitivity Analysis
  • Application of Neural Network Modeling
  • The R Zone
  • R References
  • Exercises
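
A minimal R sketch of a single-hidden-layer feed-forward network, assuming the nnet package (which fits by BFGS optimization rather than classic back-propagation); the size, decay, and data set are illustrative choices.

  library(nnet)   # single-hidden-layer neural networks

  data(iris)
  set.seed(7)
  # Predictors should generally be rescaled before training; the iris values are already small
  net <- nnet(Species ~ ., data = iris, size = 4, decay = 0.01, maxit = 500, trace = FALSE)

  pred <- predict(net, iris, type = "class")
  table(pred, iris$Species)
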
15

Naïve Bayes and Bayesian Networks

  • Bayesian Approach
  • Maximum A Posteriori (MAP) Classification
  • Posterior Odds Ratio
  • Balancing The Data
  • Naïve Bayes Classification
  • Interpreting The Log Posterior Odds Ratio
  • Zero-Cell Problem
  • Numeric Predictors for Naïve Bayes Classification
  • WEKA: Hands-on Analysis Using Naïve Bayes
  • Bayesian Belief Networks
  • Clothing Purchase Example
  • Using The Bayesian Network to Find Probabilities
  • The R Zone
  • R References
  • Exercises
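
A minimal R sketch of naïve Bayes classification, assuming the e1071 package's naiveBayes(); iris again stands in for the course data.

  library(e1071)   # provides naiveBayes()

  data(iris)
  nb <- naiveBayes(Species ~ ., data = iris)

  nb$apriori                              # prior (a priori) class distribution
  pred <- predict(nb, iris)
  table(pred, iris$Species)
  predict(nb, head(iris), type = "raw")   # posterior probabilities for the first few records
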
16

Model Evaluation Techniques

  • Model Evaluation Techniques for the Description Task
  • Model Evaluation Techniques for the Estimation and Prediction Tasks
  • Model Evaluation Measures for the Classification Task
  • Accuracy and Overall Error Rate
  • Sensitivity and Specificity
  • False-Positive Rate and False-Negative Rate
  • Proportions of True Positives, True Negatives, False Positives, and False Negatives
  • Misclassification Cost Adjustment to Reflect Real-World Concerns
  • Decision Cost/Benefit Analysis
  • Lift Charts and Gains Charts
  • Interweaving Model Evaluation with Model Building
  • Confluence of Results: Applying a Suite of Models
  • The R Zone
  • R References
  • Exercises
  • Hands-On Analysis
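
A minimal R sketch of the classification evaluation measures listed above, computed from a hypothetical confusion matrix (1 = positive class, 0 = negative class).

  # Hypothetical actual and predicted class labels
  actual    <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
  predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)

  cm <- table(Predicted = predicted, Actual = actual)
  TP <- cm["1", "1"]; TN <- cm["0", "0"]
  FP <- cm["1", "0"]; FN <- cm["0", "1"]

  c(accuracy    = (TP + TN) / sum(cm),
    error_rate  = (FP + FN) / sum(cm),
    sensitivity = TP / (TP + FN),   # true positive rate
    specificity = TN / (TN + FP))   # true negative rate
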
17

Cost-Benefit Analysis Using Data-Driven Costs

  • Decision Invariance Under Row Adjustment
  • Positive Classification Criterion
  • Demonstration Of The Positive Classification Criterion
  • Constructing The Cost Matrix
  • Decision Invariance Under Scaling
  • Direct Costs and Opportunity Costs
  • Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs
  • Rebalancing as a Surrogate for Misclassification Costs
  • The R Zone
  • R References
  • Exercises
18

Cost-Benefit Analysis for Trinary and k-Nary Classification Models

  • Classification Evaluation Measures for a Generic Trinary Target
  • Application of Evaluation Measures for Trinary Classification to the Loan Approval Problem
  • Data-Driven Cost-Benefit Analysis for Trinary Loan Classification Problem
  • Comparing CART Models With and Without Data-Driven Misclassification Costs
  • Classification Evaluation Measures for a Generic k-Nary Target
  • Example of Evaluation Measures and Data-Driven Misclassification Costs for k-Nary Classification
  • The R Zone
  • R References
  • Exercises
19

Graphical Evaluation of Classification Models

  • Review of Lift Charts and Gains Charts
  • Lift Charts and Gains Charts Using Misclassification Costs
  • Response Charts
  • Profits Charts
  • Return on Investment (ROI) Charts
  • The R Zone
  • R References
  • Exercises
  • Hands-On Exercises
20

Hierarchical and k-Means Clustering

  • The Clustering Task
  • Hierarchical Clustering Methods
  • Single-Linkage Clustering
  • Complete-Linkage Clustering
  • k-Means Clustering
  • Example of k-Means Clustering at Work
  • Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
  • Application of k-Means Clustering Using SAS Enterprise Miner
  • Using Cluster Membership to Predict Churn
  • The R Zone
  • R References
  • Exercises
  • Hands-On Analysis
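
A minimal R sketch of hierarchical clustering and k-means using base R; the standardized iris data and k = 3 are illustrative choices, not the course's churn data.

  data(iris)
  x <- scale(iris[, 1:4])          # standardize before computing distances

  # Hierarchical clustering (complete linkage; use method = "single" for single linkage)
  hc <- hclust(dist(x), method = "complete")
  plot(hc)                         # dendrogram
  cutree(hc, k = 3)                # cut the tree into three clusters

  # k-means clustering with k = 3
  set.seed(7)
  km <- kmeans(x, centers = 3, nstart = 25)
  km$centers                       # cluster centroids
  table(km$cluster, iris$Species)  # compare clusters with the known species labels
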
21

Kohonen Networks

  • Self-Organizing Maps
  • Kohonen Networks
  • Example of a Kohonen Network Study
  • Cluster Validity
  • Application of Clustering Using Kohonen Networks
  • Interpreting The Clusters
  • Using Cluster Membership as Input to Downstream Data Mining Models
  • The R Zone
  • R References
  • Exercises
22

BIRCH Clustering

  • Rationale for BIRCH Clustering
  • Cluster Features
  • Cluster Feature Tree
  • Phase 1: Building The CF Tree
  • Phase 2: Clustering The Sub-Clusters
  • Example of BIRCH Clustering, Phase 1: Building The CF Tree
  • Example of BIRCH Clustering, Phase 2: Clustering The Sub-Clusters
  • Evaluating The Candidate Cluster Solutions
  • Case Study: Applying BIRCH Clustering to The Bank Loans Data Set
  • The R Zone
  • R References
  • Exercises
23

Measuring Cluster Goodness

  • Rationale for Measuring Cluster Goodness
  • The Silhouette Method
  • Silhouette Example
  • Silhouette Analysis of the IRIS Data Set
  • The Pseudo-F Statistic
  • Example of the Pseudo-F Statistic
  • Pseudo-F Statistic Applied to the IRIS Data Set
  • Cluster Validation
  • Cluster Validation Applied to the Loans Data Set
  • The R Zone
  • R References
  • Exercises
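
A minimal R sketch of the silhouette method, assuming the cluster package; a k-means solution on the standardized iris data mirrors the silhouette analysis of the IRIS data set listed above.

  library(cluster)   # provides silhouette()

  data(iris)
  x <- scale(iris[, 1:4])
  set.seed(7)
  km  <- kmeans(x, centers = 3, nstart = 25)

  sil <- silhouette(km$cluster, dist(x))
  summary(sil)       # average silhouette width per cluster and overall
  plot(sil)          # silhouette plot
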
24

Association Rules

  • Affinity Analysis and Market Basket Analysis
  • Support, Confidence, Frequent Itemsets, and the A Priori Property
  • How Does The A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
  • How Does The A Priori Algorithm Work (Part 2)? Generating Association Rules
  • Extension From Flag Data to General Categorical Data
  • Information-Theoretic Approach: Generalized Rule Induction Method
  • Association Rules are Easy to do Badly
  • How Can We Measure the Usefulness of Association Rules?
  • Do Association Rules Represent Supervised or Unsupervised Learning?
  • Local Patterns Versus Global Models
  • The R Zone
  • R References
  • Exercises
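
A minimal R sketch of the a priori algorithm, assuming the arules package and its bundled Groceries transactions; the support and confidence thresholds are illustrative.

  library(arules)    # provides apriori() and the Groceries data set

  data(Groceries)
  rules <- apriori(Groceries,
                   parameter = list(supp = 0.01, conf = 0.4, minlen = 2))

  rules <- sort(rules, by = "lift")
  inspect(head(rules, 5))   # support, confidence, and lift for the strongest rules
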
25

Segmentation Models

  • The Segmentation Modeling Process
  • Segmentation Modeling Using EDA to Identify the Segments
  • Segmentation Modeling using Clustering to Identify the Segments
  • The R Zone
  • R References
  • Exercises
26

Ensemble Methods: Bagging and Boosting

  • Rationale for Using an Ensemble of Classification Models
  • Bias, Variance, and Noise
  • When to Apply, and Not to Apply, Bagging
  • Bagging
  • Boosting
  • Application of Bagging and Boosting Using IBM/SPSS Modeler
  • References
  • The R Zone
  • R Reference
  • Exercises
27

Model Voting and Propensity Averaging

  • Simple Model Voting
  • Alternative Voting Methods
  • Model Voting Process
  • An Application of Model Voting
  • What is Propensity Averaging?
  • Propensity Averaging Process
  • An Application of Propensity Averaging
  • The R Zone
  • R References
  • Exercises
  • Hands-On Analysis
28

Genetic Algorithms

  • Introduction To Genetic Algorithms
  • Basic Framework of a Genetic Algorithm
  • Simple Example of a Genetic Algorithm at Work
  • Modifications and Enhancements: Selection
  • Modifications and Enhancements: Crossover
  • Genetic Algorithms for Real-Valued Variables
  • Using Genetic Algorithms to Train a Neural Network
  • WEKA: Hands-On Analysis Using Genetic Algorithms
  • The R Zone
  • R References
29

Imputation of Missing Data

  • Need for Imputation of Missing Data
  • Imputation of Missing Data: Continuous Variables
  • Standard Error of the Imputation
  • Imputation of Missing Data: Categorical Variables
  • Handling Patterns in Missingness
  • Reference
  • The R Zone
  • R References
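
A minimal R sketch of imputing missing values, assuming the mice package (one common approach, not necessarily the one used in the R Zone); the built-in airquality data set has missing Ozone and Solar.R values.

  library(mice)    # multiple imputation by chained equations

  data(airquality)
  md.pattern(airquality)                    # summarize the pattern of missingness
  imp <- mice(airquality, m = 5, seed = 7, printFlag = FALSE)

  completed <- complete(imp, 1)             # one completed (imputed) copy of the data
  summary(completed$Ozone)                  # no NAs remain in the imputed column
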
30

Case Study, Part 1: Business Understanding, Data Preparation, and EDA

  • Cross-Industry Standard Process for Data Mining
  • Business Understanding Phase
  • Data Understanding Phase, Part 1: Getting a Feel for the Data Set
  • Data Preparation Phase
  • Data Understanding Phase, Part 2: Exploratory Data Analysis
31

Case Study, Part 2: Clustering and Principal Components Analysis

  • Partitioning the Data
  • Developing the Principal Components
  • Validating the Principal Components
  • Profiling the Principal Components
  • Choosing the Optimal Number of Clusters Using BIRCH Clustering
  • Choosing the Optimal Number of Clusters Using k-Means Clustering
  • Application of k-Means Clustering
  • Validating the Clusters
  • Profiling the Clusters
32

Case Study, Part 3: Modeling And Evaluation For Performance And Interpretability

  • Do You Prefer The Best Model Performance, Or A Combination Of Performance And Interpretability?
  • Modeling And Evaluation Overview
  • Cost-Benefit Analysis Using Data-Driven Costs
  • Variables to be Input To The Models
  • Establishing The Baseline Model Performance
  • Models That Use Misclassification Costs
  • Models That Need Rebalancing as a Surrogate for Misclassification Costs
  • Combining Models Using Voting and Propensity Averaging
  • Interpreting The Most Profitable Model
33

Case Study, Part 4: Modeling and Evaluation for High Performance Only

  • Variables to be Input to the Models
  • Models that use Misclassification Costs
  • Models that Need Rebalancing as a Surrogate for Misclassification Costs
  • Combining Models using Voting and Propensity Averaging
  • Lessons Learned
  • Conclusions
A

Appendix A

  • Data Summarization and Visualization
  • Part 1: Summarization 1: Building Blocks Of Data Analysis
  • Part 2: Visualization: Graphs and Tables For Summarizing And Organizing Data
  • Part 3: Summarization 2: Measures Of Center, Variability, and Position
  • Part 4: Summarization And Visualization Of Bivariate Relationships

Lab

1

An Introduction to Data Mining and Predictive Analytics

  • RStudio and Power BI Workstation
2

Data Preprocessing

  • Plotting Graphs By Performing Expectation-Maximization
  • Plotting the Density Values
  • Using eclat to Find Similarities in Adult Behavior
  • Finding Frequent Items in a Dataset
  • Determining and Visualizing Sequences
  • Computing LCP, LCS, and OMD
  • Plotting Points on a Map
  • Displaying a Histogram of Scatter Plots
  • Creating an Enhanced Scatter Plot
  • Finding a Dataset
  • Making a Prediction
3

Exploratory Data Analysis

  • Computing the Outliers for a Set
  • Calculating Anomalies
  • Constructing a Bar Plot
  • Producing a Word Cloud
4

Multivariate Statistics

  • Performing Multivariate Regression Analysis
  • Grouping and Organizing Bivariate Data
5

Simple Linear Regression

  • Performing Simple Regression
6

Logistic Regression

  • Using Holt Exponential Smoothing
7

Multiple Regression and Model Building

  • Performing Multiple Regression
  • Performing Tetrachoric Correlation
  • Generating a 3D Graphic
  • Producing a 3D Scatterplot
8

Decision Trees

  • Developing a Decision Tree
  • Performing Cluster Analysis
  • Constructing a Multitude of Decision Trees
9

Hierarchical and k-Means Clustering

  • Displaying the Hierarchical Cluster
  • Plotting a Graph by Performing k-means Clustering
  • Calculating k-Medoids Clustering
  • Estimating the Number of Clusters Using Medoids
  • Performing Affinity Propagation Clustering
10

Association Rules

  • Using the apriori Rules Library
  • Evaluating Associations in a Shopping Basket
  • Producing a Regression Model
  • Understanding Instance-Based Learning
11

Case Study, Part 4: Modeling and Evaluation for High Performance Only

  • Importing the dataset
  • Displaying the number of rows and columns
  • Displaying the column names
  • Displaying mean, count, min, and max
  • Displaying the variable type
  • Displaying unique values
  • Displaying the unique values and the number of times each value appears
  • Importing the dataset
  • Displaying the number of values missing
  • Identifying standard deviation
  • Displaying duplicate rows
  • Dropping all duplicates
  • Defining an outlier
  • Defining an outlier
  • Dropping a column
  • Fixing column values
  • Imputing the missing values
  • Creating a separate dataset
  • Importing the dataset
  • Duplicate values
  • Missing observations
  • Displaying the mean value
  • Importing a dataset
  • Using summary() method
  • Using str() method
  • Changing the data type
  • Ensuring the column is in date format
  • Importing the dataset
  • Displaying the values
  • Displaying the length
  • Ensuring all the values are in a consistent format
  • Creating a new column
  • Importing the dataset
  • Plotting a boxplot
  • Importing the dataset
  • Identifying missing values
  • Removing rows
  • Imputing the rows
  • Importing the dataset
  • Isolating values
  • Importing the dataset
  • Determining the values
  • Changing the areatype
  • Importing the dataset
  • Creating histograms
  • Plotting a bar chart
  • Plotting a bar chart
  • Displaying the maximum value
  • Plotting a boxplot
  • Creating a subset
  • Normalizing data
  • Creating the PCA
  • Creating a scree plot
  • Selecting the fewest components
  • Creating the rotation
  • Creating components
  • Developing the Estimated Simple Linear Regression Equation
  • Developing the Estimated Regression Equation
  • Developing the Estimated Multiple Linear Regression Equation
  • Using the Estimated Regression
  • Developing the Simple Linear Regression Equation
  • Testing the Hypotheses of No Relationship Between Repair Time and the Number of Months
  • Constructing a Scatterplot of Months
  • Calculating the Predicted Repair Time and Residual
  • Constructing a Scatterplot of Months
  • Developing the Multiple Regression Equation
  • Testing the Hypotheses of No Relationship Between Repair Time and the Independent Variables
  • Testing the Hypotheses of No Relationship Between Repair Time and the Independent Variables
  • Creating a New Dummy Variable
  • Testing the Hypotheses of No Relationship Between Repair Time and the Independent Variables
  • Testing the Hypotheses of No Relationship Between Repair Time and the Independent Variables
  • Developing the Multiple Regression Equation
  • Testing the Hypotheses of No Relationship Between Repair Time and the Independent Variables
  • Testing the Hypotheses of No Relationship Between Repair Time and the Independent Variables
  • Testing the Hypotheses of No Relationship Between Repair Time and the Independent Variables
  • Evaluating the Candidate Logistic Regression Models
  • Creating a ROC Curve
  • Evaluating a Full Model with All Predictors
  • Constructing the ROC Curve and Computing the AUC
  • Using Descriptive Statistics and Charts
  • Using Descriptive Statistics and Charts to Enroll in the Seminars
  • Evaluating the Model on the Oscars.xlsx Data
  • Performing the Logistic Regression Model on the Reduced Model
  • Refitting the Logistic Regression Model to the Oscars.xlsx Data
  • Using a Default Cut-off Value
  • Using a Cutoff Value to Classify a Movie as a Winner or Not
  • Using the Model to Predict the Annual Winner
  • Removing the Least Significant Independent Variable and Rerunning the Model