Data Mining and Predictive Analysis

(SNHU-DM-PA.AE1)
Lessons

1

Preface

  • What is Data Mining? What is Predictive Analytics?
  • Why is this Course Needed?
  • Who Will Benefit from this Course?
  • Danger! Data Mining is Easy to do Badly
  • “White-Box” Approach
  • Algorithm Walk-Throughs
  • Exciting New Topics
  • The R Zone
  • Appendix: Data Summarization and Visualization
  • The Case Study: Bringing it all Together
  • How the Course is Structured
2

An Introduction to Data Mining and Predictive Analytics

  • What is Data Mining? What Is Predictive Analytics?
  • Wanted: Data Miners
  • The Need For Human Direction of Data Mining
  • The Cross-Industry Standard Process for Data Mining: CRISP-DM
  • Fallacies of Data Mining
  • What Tasks Can Data Mining Accomplish?
  • The R Zone
  • R References
  • Exercises
3

Data Preprocessing

  • Why do We Need to Preprocess the Data?
  • Data Cleaning
  • Handling Missing Data
  • Identifying Misclassifications
  • Graphical Methods for Identifying Outliers
  • Measures of Center and Spread
  • Data Transformation
  • Min–Max Normalization
  • Z-Score Standardization
  • Decimal Scaling
  • Transformations to Achieve Normality
  • Numerical Methods for Identifying Outliers
  • Flag Variables
  • Transforming Categorical Variables into Numerical Variables
  • Binning Numerical Variables
  • Reclassifying Categorical Variables
  • Adding an Index Field
  • Removing Variables that are not Useful
  • Variables that Should Probably not be Removed
  • Removal of Duplicate Records
  • A Word About ID Fields
  • The R Zone
  • R Reference
  • Exercises
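
A minimal R sketch of three of the transformation methods listed above (min–max normalization, z-score standardization, and decimal scaling); the vector x is a hypothetical stand-in for a numeric predictor column, not data from the course.

  # Hypothetical numeric predictor column
  x <- c(12, 7, 30, 25, 18, 41)

  # Min-max normalization: rescale x onto the [0, 1] range
  x_mm <- (x - min(x)) / (max(x) - min(x))

  # Z-score standardization: center at 0 with standard deviation 1
  x_z <- (x - mean(x)) / sd(x)

  # Decimal scaling: divide by 10^d so the largest absolute value falls below 1
  d <- ceiling(log10(max(abs(x))))
  x_ds <- x / 10^d
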
4

Exploratory Data Analysis

  • Hypothesis Testing Versus Exploratory Data Analysis
  • Getting to Know The Data Set
  • Exploring Categorical Variables
  • Exploring Numeric Variables
  • Exploring Multivariate Relationships
  • Selecting Interesting Subsets of the Data for Further Investigation
  • Using EDA to Uncover Anomalous Fields
  • Binning Based on Predictive Value
  • Deriving New Variables: Flag Variables
  • Deriving New Variables: Numerical Variables
  • Using EDA to Investigate Correlated Predictor Variables
  • Summary of Our EDA
  • The R Zone
  • R References
  • Exercises
5

Dimension-Reduction Methods

  • Need for Dimension-Reduction in Data Mining
  • Principal Components Analysis
  • Applying PCA to the Houses Data Set
  • How Many Components Should We Extract?
  • Profiling the Principal Components
  • Communalities
  • Validation of the Principal Components
  • Factor Analysis
  • Applying Factor Analysis to the Adult Data Set
  • Factor Rotation
  • User-Defined Composites
  • An Example of a User-Defined Composite
  • The R Zone
  • R References
  • Exercises
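
A minimal R sketch of principal components analysis using base R's prcomp(); the built-in iris data set is an illustrative stand-in for the Houses and Adult data sets referenced above.

  data(iris)
  num_vars <- iris[, 1:4]                       # numeric predictors only
  pca <- prcomp(num_vars, center = TRUE, scale. = TRUE)

  summary(pca)                    # proportion of variance explained by each component
  screeplot(pca, type = "lines")  # scree plot, to help decide how many components to extract
  head(pca$x)                     # component scores for the first few records
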
6

Univariate Statistical Analysis

  • Data Mining Tasks in Discovering Knowledge in Data
  • Statistical Approaches to Estimation and Prediction
  • Statistical Inference
  • How Confident are We in Our Estimates?
  • Confidence Interval Estimation of the Mean
  • How to Reduce the Margin of Error
  • Confidence Interval Estimation of the Proportion
  • Hypothesis Testing for the Mean
  • Assessing The Strength of Evidence Against The Null Hypothesis
  • Using Confidence Intervals to Perform Hypothesis Tests
  • Hypothesis Testing for The Proportion
  • Reference
  • The R Zone
  • R Reference
  • Exercises
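
A minimal R sketch of confidence interval estimation and hypothesis testing for a mean and a proportion; the sample x and the success counts are simulated, hypothetical values.

  set.seed(1)
  x <- rnorm(100, mean = 50, sd = 10)   # hypothetical sample of size 100

  # 95% confidence interval for the mean, plus a t-test of H0: mu = 48
  t.test(x, mu = 48, conf.level = 0.95)

  # Confidence interval and test for a proportion: 273 successes in 1000 trials
  prop.test(273, 1000, conf.level = 0.95)
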
7

Multivariate Statistics

  • Two-Sample t-Test for Difference in Means
  • Two-Sample Z-Test for Difference in Proportions
  • Test for the Homogeneity of Proportions
  • Chi-Square Test for Goodness of Fit of Multinomial Data
  • Analysis of Variance
  • Reference
  • The R Zone
  • R Reference
  • Exercises
8

Preparing to Model the Data

  • Supervised Versus Unsupervised Methods
  • Statistical Methodology and Data Mining Methodology
  • Cross-Validation
  • Overfitting
  • Bias–Variance Trade-Off
  • Balancing The Training Data Set
  • Establishing Baseline Performance
  • The R Zone
  • R Reference
  • Exercises
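
A minimal R sketch of partitioning a data set into training and test sets and establishing a baseline; the built-in iris data and the 75/25 split are illustrative choices, not the course's.

  data(iris)
  set.seed(7)
  n <- nrow(iris)
  train_idx <- sample(n, size = round(0.75 * n))   # 75% training partition
  train <- iris[train_idx, ]
  test  <- iris[-train_idx, ]

  # Baseline performance: always predict the most common class in the training set
  baseline_class <- names(which.max(table(train$Species)))
  mean(test$Species == baseline_class)             # baseline accuracy on the test set
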
9

Simple Linear Regression

  • An Example of Simple Linear Regression
  • Dangers of Extrapolation
  • How Useful is the Regression? The Coefficient of Determination, r²
  • Standard Error of the Estimate, s
  • Correlation Coefficient r
  • ANOVA Table for Simple Linear Regression
  • Outliers, High Leverage Points, and Influential Observations
  • Population Regression Equation
  • Verifying The Regression Assumptions
  • Inference in Regression
  • t-Test for the Relationship Between x and y
  • Confidence Interval for the Slope of the Regression Line
  • Confidence Interval for the Correlation Coefficient ρ
  • Confidence Interval for the Mean Value of y Given x
  • Prediction Interval for a Randomly Chosen Value of y Given x
  • Transformations to Achieve Linearity
  • Box–Cox Transformations
  • The R Zone
  • R References
  • Exercises
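
A minimal R sketch of simple linear regression with lm(), using the built-in cars data set (stopping distance regressed on speed) as a stand-in example.

  data(cars)
  fit <- lm(dist ~ speed, data = cars)

  summary(fit)              # slope, intercept, r-squared, standard error of the estimate, t-test
  confint(fit)              # confidence intervals for the intercept and the slope
  plot(cars$speed, cars$dist)
  abline(fit)               # fitted regression line
  plot(fit, which = 1)      # residuals vs fitted values, for verifying the regression assumptions
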
10

Logistic Regression

  • Simple Example of Logistic Regression
  • Maximum Likelihood Estimation
  • Interpreting Logistic Regression Output
  • Inference: Are the Predictors Significant?
  • Odds Ratio and Relative Risk
  • Interpreting Logistic Regression for a Dichotomous Predictor
  • Interpreting Logistic Regression for a Polychotomous Predictor
  • Interpreting Logistic Regression for a Continuous Predictor
  • Assumption of Linearity
  • Zero-Cell Problem
  • Multiple Logistic Regression
  • Introducing Higher Order Terms to Handle Nonlinearity
  • Validating the Logistic Regression Model
  • WEKA: Hands-On Analysis Using Logistic Regression
  • The R Zone
  • R References
  • Exercises
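
A minimal R sketch of logistic regression with glm(); the built-in mtcars data set, with transmission type (am) as the binary response, is an illustrative stand-in.

  data(mtcars)
  fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

  summary(fit)          # coefficients on the logit scale, with Wald z tests
  exp(coef(fit))        # odds ratios for the predictors
  p_hat <- predict(fit, type = "response")   # predicted probabilities
  head(p_hat)
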
11

Multiple Regression and Model Building

  • An Example of Multiple Regression
  • The Population Multiple Regression Equation
  • Inference in Multiple Regression
  • Regression With Categorical Predictors, Using Indicator Variables
  • Adjusting R²: Penalizing Models For Including Predictors That Are Not Useful
  • Sequential Sums of Squares
  • Multicollinearity
  • Variable Selection Methods
  • Gas Mileage Data Set
  • An Application of Variable Selection Methods
  • Using the Principal Components as Predictors in Multiple Regression
  • The R Zone
  • R References
  • Exercises
12

k-Nearest Neighbor Algorithm

  • Classification Task
  • k-Nearest Neighbor Algorithm
  • Distance Function
  • Combination Function
  • Quantifying Attribute Relevance: Stretching the Axes
  • Database Considerations
  • k-Nearest Neighbor Algorithm for Estimation and Prediction
  • Choosing k
  • Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler
  • The R Zone
  • R References
  • Exercises
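
A minimal R sketch of k-nearest neighbor classification, assuming the class package's knn() (one common choice; the R Zone may use a different package). The predictors are standardized first, in the spirit of "stretching the axes" above.

  library(class)   # provides knn()

  data(iris)
  set.seed(7)
  idx   <- sample(nrow(iris), 100)
  train <- scale(iris[idx, 1:4])                         # standardize the training predictors
  test  <- scale(iris[-idx, 1:4],
                 center = attr(train, "scaled:center"),  # reuse the training parameters
                 scale  = attr(train, "scaled:scale"))

  pred <- knn(train, test, cl = iris$Species[idx], k = 5)
  table(pred, iris$Species[-idx])   # confusion matrix on the test records
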
13

Decision Trees

  • What is a Decision Tree?
  • Requirements for Using Decision Trees
  • Classification and Regression Trees
  • C4.5 Algorithm
  • Decision Rules
  • Comparison of the C5.0 and CART Algorithms Applied to Real Data
  • The R Zone
  • R References
  • Exercises
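
A minimal R sketch of a CART-style classification tree, assuming the rpart package (the C50 package offers a C5.0 implementation); iris is an illustrative data set.

  library(rpart)   # CART-style recursive partitioning

  data(iris)
  tree <- rpart(Species ~ ., data = iris, method = "class")

  printcp(tree)                          # complexity parameter table
  plot(tree); text(tree, use.n = TRUE)   # simple tree diagram with node labels
  pred <- predict(tree, iris, type = "class")
  table(pred, iris$Species)
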
14

Neural Networks

  • Input and Output Encoding
  • Neural Networks for Estimation and Prediction
  • Simple Example of a Neural Network
  • Sigmoid Activation Function
  • Back-Propagation
  • Gradient-Descent Method
  • Back-Propagation Rules
  • Example of Back-Propagation
  • Termination Criteria
  • Learning Rate
  • Momentum Term
  • Sensitivity Analysis
  • Application of Neural Network Modeling
  • The R Zone
  • R References
  • Exercises
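
A minimal R sketch of a single-hidden-layer feed-forward network, assuming the nnet package (which fits by BFGS optimization rather than classic back-propagation); the size, decay, and data set are illustrative choices.

  library(nnet)   # single-hidden-layer neural networks

  data(iris)
  set.seed(7)
  # Predictors should generally be rescaled before training; the iris values are already small
  net <- nnet(Species ~ ., data = iris, size = 4, decay = 0.01, maxit = 500, trace = FALSE)

  pred <- predict(net, iris, type = "class")
  table(pred, iris$Species)
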
15

Naïve Bayes and Bayesian Networks

  • Bayesian Approach
  • Maximum A Posteriori (MAP) Classification
  • Posterior Odds Ratio
  • Balancing The Data
  • Naïve Bayes Classification
  • Interpreting The Log Posterior Odds Ratio
  • Zero-Cell Problem
  • Numeric Predictors for Naïve Bayes Classification
  • WEKA: Hands-on Analysis Using Naïve Bayes
  • Bayesian Belief Networks
  • Clothing Purchase Example
  • Using The Bayesian Network to Find Probabilities
  • The R Zone
  • R References
  • Exercises
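
A minimal R sketch of naïve Bayes classification, assuming the e1071 package's naiveBayes(); iris again stands in for the course data.

  library(e1071)   # provides naiveBayes()

  data(iris)
  nb <- naiveBayes(Species ~ ., data = iris)

  nb$apriori                              # prior (a priori) class distribution
  pred <- predict(nb, iris)
  table(pred, iris$Species)
  predict(nb, head(iris), type = "raw")   # posterior probabilities for the first few records
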
16

Model Evaluation Techniques

  • Model Evaluation Techniques for the Description Task
  • Model Evaluation Techniques for the Estimation and Prediction Tasks
  • Model Evaluation Measures for the Classification Task
  • Accuracy and Overall Error Rate
  • Sensitivity and Specificity
  • False-Positive Rate and False-Negative Rate
  • Proportions of True Positives, True Negatives, False Positives, and False Negatives
  • Misclassification Cost Adjustment to Reflect Real-World Concerns
  • Decision Cost/Benefit Analysis
  • Lift Charts and Gains Charts
  • Interweaving Model Evaluation with Model Building
  • Confluence of Results: Applying a Suite of Models
  • The R Zone
  • R References
  • Exercises
  • Hands-On Analysis
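
A minimal R sketch of the classification evaluation measures listed above, computed from a hypothetical confusion matrix (1 = positive class, 0 = negative class).

  # Hypothetical actual and predicted class labels
  actual    <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
  predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)

  cm <- table(Predicted = predicted, Actual = actual)
  TP <- cm["1", "1"]; TN <- cm["0", "0"]
  FP <- cm["1", "0"]; FN <- cm["0", "1"]

  c(accuracy    = (TP + TN) / sum(cm),
    error_rate  = (FP + FN) / sum(cm),
    sensitivity = TP / (TP + FN),   # true positive rate
    specificity = TN / (TN + FP))   # true negative rate
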
17

Cost-Benefit Analysis Using Data-Driven Costs

  • Decision Invariance Under Row Adjustment
  • Positive Classification Criterion
  • Demonstration Of The Positive Classification Criterion
  • Constructing The Cost Matrix
  • Decision Invariance Under Scaling
  • Direct Costs and Opportunity Costs
  • Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs
  • Rebalancing as a Surrogate for Misclassification Costs
  • The R Zone
  • R References
  • Exercises
18

Cost-Benefit Analysis for Trinary and k-Nary Classification Models

  • Classification Evaluation Measures for a Generic Trinary Target
  • Application of Evaluation Measures for Trinary Classification to the Loan Approval Problem
  • Data-Driven Cost-Benefit Analysis for Trinary Loan Classification Problem
  • Comparing CART Models With and Without Data-Driven Misclassification Costs
  • Classification Evaluation Measures for a Generic k-Nary Target
  • Example of Evaluation Measures and Data-Driven Misclassification Costs for k-Nary Classification
  • The R Zone
  • R References
  • Exercises
19

Graphical Evaluation of Classification Models

  • Review of Lift Charts and Gains Charts
  • Lift Charts and Gains Charts Using Misclassification Costs
  • Response Charts
  • Profits Charts
  • Return on Investment (ROI) Charts
  • The R Zone
  • R References
  • Exercises
  • Hands-On Exercises
20

Hierarchical and k-Means Clustering

  • The Clustering Task
  • Hierarchical Clustering Methods
  • Single-Linkage Clustering
  • Complete-Linkage Clustering
  • k-Means Clustering
  • Example of k-Means Clustering at Work
  • Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
  • Application of k-Means Clustering Using SAS Enterprise Miner
  • Using Cluster Membership to Predict Churn
  • The R Zone
  • R References
  • Exercises
  • Hands-On Analysis
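
A minimal R sketch of hierarchical clustering and k-means using base R; the standardized iris data and k = 3 are illustrative choices, not the course's churn data.

  data(iris)
  x <- scale(iris[, 1:4])          # standardize before computing distances

  # Hierarchical clustering (complete linkage; use method = "single" for single linkage)
  hc <- hclust(dist(x), method = "complete")
  plot(hc)                         # dendrogram
  cutree(hc, k = 3)                # cut the tree into three clusters

  # k-means clustering with k = 3
  set.seed(7)
  km <- kmeans(x, centers = 3, nstart = 25)
  km$centers                       # cluster centroids
  table(km$cluster, iris$Species)  # compare clusters with the known species labels
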
21

Kohonen Networks

  • Self-Organizing Maps
  • Kohonen Networks
  • Example of a Kohonen Network Study
  • Cluster Validity
  • Application of Clustering Using Kohonen Networks
  • Interpreting The Clusters
  • Using Cluster Membership as Input to Downstream Data Mining Models
  • The R Zone
  • R References
  • Exercises
22

BIRCH Clustering

  • Rationale for BIRCH Clustering
  • Cluster Features
  • Cluster Feature Tree
  • Phase 1: Building The CF Tree
  • Phase 2: Clustering The Sub-Clusters
  • Example of BIRCH Clustering, Phase 1: Building The CF Tree
  • Example of BIRCH Clustering, Phase 2: Clustering The Sub-Clusters
  • Evaluating The Candidate Cluster Solutions
  • Case Study: Applying BIRCH Clustering to The Bank Loans Data Set
  • The R Zone
  • R References
  • Exercises
23

Measuring Cluster Goodness

  • Rationale for Measuring Cluster Goodness
  • The Silhouette Method
  • Silhouette Example
  • Silhouette Analysis of the IRIS Data Set
  • The Pseudo-F Statistic
  • Example of the Pseudo-F Statistic
  • Pseudo-F Statistic Applied to the IRIS Data Set
  • Cluster Validation
  • Cluster Validation Applied to the Loans Data Set
  • The R Zone
  • R References
  • Exercises
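
A minimal R sketch of the silhouette method, assuming the cluster package; a k-means solution on the standardized iris data mirrors the silhouette analysis of the IRIS data set listed above.

  library(cluster)   # provides silhouette()

  data(iris)
  x <- scale(iris[, 1:4])
  set.seed(7)
  km  <- kmeans(x, centers = 3, nstart = 25)

  sil <- silhouette(km$cluster, dist(x))
  summary(sil)       # average silhouette width per cluster and overall
  plot(sil)          # silhouette plot
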
24

Association Rules

  • Affinity Analysis and Market Basket Analysis
  • Support, Confidence, Frequent Itemsets, and the A Priori Property
  • How Does The A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
  • How Does The A Priori Algorithm Work (Part 2)? Generating Association Rules
  • Extension From Flag Data to General Categorical Data
  • Information-Theoretic Approach: Generalized Rule Induction Method
  • Association Rules are Easy to do Badly
  • How Can We Measure the Usefulness of Association Rules?
  • Do Association Rules Represent Supervised or Unsupervised Learning?
  • Local Patterns Versus Global Models
  • The R Zone
  • R References
  • Exercises
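
A minimal R sketch of the a priori algorithm, assuming the arules package and its bundled Groceries transactions; the support and confidence thresholds are illustrative.

  library(arules)    # provides apriori() and the Groceries data set

  data(Groceries)
  rules <- apriori(Groceries,
                   parameter = list(supp = 0.01, conf = 0.4, minlen = 2))

  rules <- sort(rules, by = "lift")
  inspect(head(rules, 5))   # support, confidence, and lift for the strongest rules
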
25

Segmentation Models

  • The Segmentation Modeling Process
  • Segmentation Modeling Using EDA to Identify the Segments
  • Segmentation Modeling using Clustering to Identify the Segments
  • The R Zone
  • R References
  • Exercises
26

Ensemble Methods: Bagging and Boosting

  • Rationale for Using an Ensemble of Classification Models
  • Bias, Variance, and Noise
  • When to Apply, and Not to Apply, Bagging
  • Bagging
  • Boosting
  • Application of Bagging and Boosting Using IBM/SPSS Modeler
  • References
  • The R Zone
  • R Reference
  • Exercises
27

Model Voting and Propensity Averaging

  • Simple Model Voting
  • Alternative Voting Methods
  • Model Voting Process
  • An Application of Model Voting
  • What is Propensity Averaging?
  • Propensity Averaging Process
  • An Application of Propensity Averaging
  • The R Zone
  • R References
  • Exercises
  • Hands-On Analysis
28

Genetic Algorithms

  • Introduction To Genetic Algorithms
  • Basic Framework of a Genetic Algorithm
  • Simple Example of a Genetic Algorithm at Work
  • Modifications and Enhancements: Selection
  • Modifications and Enhancements: Crossover
  • Genetic Algorithms for Real-Valued Variables
  • Using Genetic Algorithms to Train a Neural Network
  • WEKA: Hands-On Analysis Using Genetic Algorithms
  • The R Zone
  • R References
29

Imputation of Missing Data

  • Need for Imputation of Missing Data
  • Imputation of Missing Data: Continuous Variables
  • Standard Error of the Imputation
  • Imputation of Missing Data: Categorical Variables
  • Handling Patterns in Missingness
  • Reference
  • The R Zone
  • R References
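
A minimal R sketch of imputing missing values, assuming the mice package (one common approach, not necessarily the one used in the R Zone); the built-in airquality data set has missing Ozone and Solar.R values.

  library(mice)    # multiple imputation by chained equations

  data(airquality)
  md.pattern(airquality)                    # summarize the pattern of missingness
  imp <- mice(airquality, m = 5, seed = 7, printFlag = FALSE)

  completed <- complete(imp, 1)             # one completed (imputed) copy of the data
  summary(completed$Ozone)                  # no NAs remain in the imputed column
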
30

Case Study, Part 1: Business Understanding, Data Preparation, and EDA

  • Cross-Industry Standard Process for Data Mining
  • Business Understanding Phase
  • Data Understanding Phase, Part 1: Getting a Feel for the Data Set
  • Data Preparation Phase
  • Data Understanding Phase, Part 2: Exploratory Data Analysis
31

Case Study, Part 2: Clustering and Principal Components Analysis

  • Partitioning the Data
  • Developing the Principal Components
  • Validating the Principal Components
  • Profiling the Principal Components
  • Choosing the Optimal Number of Clusters Using BIRCH Clustering
  • Choosing the Optimal Number of Clusters Using k-Means Clustering
  • Application of k-Means Clustering
  • Validating the Clusters
  • Profiling the Clusters
32

Case Study, Part 3: Modeling And Evaluation For Performance And Interpretability

  • Do You Prefer The Best Model Performance, Or A Combination Of Performance And Interpretability?
  • Modeling And Evaluation Overview
  • Cost-Benefit Analysis Using Data-Driven Costs
  • Variables to be Input To The Models
  • Establishing The Baseline Model Performance
  • Models That Use Misclassification Costs
  • Models That Need Rebalancing as a Surrogate for Misclassification Costs
  • Combining Models Using Voting and Propensity Averaging
  • Interpreting The Most Profitable Model
33

Case Study, Part 4: Modeling and Evaluation for High Performance Only

  • Variables to be Input to the Models
  • Models that use Misclassification Costs
  • Models that Need Rebalancing as a Surrogate for Misclassification Costs
  • Combining Models using Voting and Propensity Averaging
  • Lessons Learned
  • Conclusions
A

Appendix A

  • Data Summarization and Visualization
  • Part 1: Summarization 1: Building Blocks Of Data Analysis
  • Part 2: Visualization: Graphs and Tables For Summarizing And Organizing Data
  • Part 3: Summarization 2: Measures Of Center, Variability, and Position
  • Part 4: Summarization And Visualization Of Bivariate Relationships

Lab

1

An Introduction to Data Mining and Predictive Analytics

  • RStudio and Power BI Workstation
2

Data Preprocessing

  • Plotting Graphs By Performing Expectation-Maximization
  • Plotting the Density Values
  • Using eclat to Find Similarities in Adult Behavior
  • Finding Frequent Items in a Dataset
  • Determining and Visualizing Sequences
  • Computing LCP, LCS, and OMD
  • Plotting Points on a Map
  • Displaying a Histogram of Scatter Plots
  • Creating an Enhanced Scatter Plot
  • Finding a Dataset
  • Making a Prediction
3

Exploratory Data Analysis

  • Computing the Outliers for a Set
  • Calculating Anomalies
  • Constructing a Bar Plot
  • Producing a Word Cloud
4

Multivariate Statistics

  • Performing Multivariate Regression Analysis
  • Grouping and Organizing Bivariate Data
5

Simple Linear Regression

  • Performing Simple Regression
6

Logistic Regression

  • Using Holt Exponential Smoothing
7

Multiple Regression and Model Building

  • Performing Multiple Regression
  • Performing Tetrachoric Correlation
  • Generating a 3D Graphic
  • Producing a 3D Scatterplot
8

Decision Trees

  • Developing a Decision Tree
  • Performing Cluster Analysis
  • Constructing a Multitude of Decision Trees
9

Hierarchical and k-Means Clustering

  • Displaying the Hierarchical Cluster
  • Plotting a Graph by Performing k-means Clustering
  • Calculating k-Medoids Clustering
  • Estimating the Number of Clusters Using Medoids
  • Performing Affinity Propagation Clustering
10

Association Rules

  • Using the apriori Rules Library
  • Evaluating Associations in a Shopping Basket
  • Producing a Regression Model
  • Understanding Instance-Based Learning
11

Case Study, Part 4: Modeling and Evaluation for High Performance Only

  • Importing the dataset
  • Displaying the number of rows and columns
  • Displaying the column names
  • Displaying mean, count, min, and max
  • Displaying the variable type
  • Displaying unique values
  • Displaying the unique values and the number of times each value appears
  • Importing the dataset
  • Displaying the number of values missing
  • Identifying standard deviation
  • Displaying duplicate rows
  • Dropping all duplicates
  • Defining an outlier
  • Defining an outlier
  • Dropping a column
  • Fixing column values
  • Imputing the missing values
  • Creating a separate dataset
  • Importing the dataset
  • Duplicate values
  • Missing observations
  • Displaying the mean value
  • Importing a dataset
  • Using summary() method
  • Using str() method
  • Changing the data type
  • Ensuring the column is in date format
  • Importing the dataset
  • Displaying the values
  • Displaying the length
  • Ensuring all the values are in a consistent format
  • Creating a new column
  • Importing the dataset
  • Plotting a boxplot
  • Importing the dataset
  • Identifying missing values
  • Removing rows
  • Imputing the rows
  • Importing the dataset
  • Isolating values
  • Importing the dataset
  • Determining the values
  • Changing the areatype
  • Importing the dataset
  • Creating histograms
  • Plotting a bar chart
  • Plotting a bar chart
  • Displaying the maximum value
  • Plotting a boxplot
  • Creating a subset
  • Normalizing data
  • Creating the PCA
  • Creating a scree plot
  • Selecting the fewest components
  • Creating the rotation
  • Creating components
  • Developing the Estimated Simple Linear Regression Equation
  • Developing the Estimated Regression Equation
  • Developing the Estimated Multiple Linear Regression Equation
  • Using the Estimated Regression
  • Developing the Simple Linear Regression Equation
  • Testing the Hypotheses of No Relationship Between Repair Time and the Number of Months
  • Constructing a Scatterplot of Months
  • Calculating the Predicted Repair Time and Residual
  • Constructing a Scatterplot of Months
  • Developing the Multiple Regression Equation
  • Testing the Hypotheses of No Relationship Between Repair Time and the Independent Variables
  • Testing the Hypotheses of No Relationship Between Repair Time and the Independent Variables
  • Creating a New Dummy Variable
  • Testing the Hypotheses of No Relationship Between Repair Time and the Independent Variables
  • Testing the Hypotheses of No Relationship Between Repair Time and the Independent Variables
  • Developing the Multiple Regression Equation
  • Testing the Hypotheses of No Relationship Between Repair Time and the Independent Variables
  • Testing the Hypotheses of No Relationship Between Repair Time and the Independent Variables
  • Testing the Hypotheses of No Relationship Between Repair Time and the Independent Variables
  • Evaluating the Candidate Logistic Regression Models
  • Creating a ROC Curve
  • Evaluating a Full Model with All Predictors
  • Constructing the ROC Curve and Computing the AUC
  • Using Descriptive Statistics and Charts
  • Using Descriptive Statistics and Charts to Enroll in the Seminars
  • Evaluating the Model on the Oscars.xlsx Data
  • Performing the Logistic Regression Model on the Reduced Model
  • Refitting the Logistic Regression Model to the Oscars.xlsx Data
  • Using a Default Cut-off Value
  • Using a Cutoff Value to Classify a Movie as a Winner or Not
  • Using the Model to Predict the Annual Winner
  • Removing the Least Significant Independent Variable and Rerunning the Model