DAT-640: Predictive Analytics

(SNHU-DAT640.AE1) / ISBN : 978-1-61691-079-2

This course includes:

Free pre-assessment and first 2 lessons

13+ Interactive Lessons

Accessible on mobile and tablet too

Certificate of completion

Are you an instructor?

Access detailed information about the course content, learning objectives, activities, and assessments before adding it to your curriculum.

Lesson Plan

An Introduction to Data Mining and Predictive Analytics

What is Data Mining? What Is Predictive Analytics?
Wanted: Data Miners
The Need For Human Direction of Data Mining
The Cross-Industry Standard Process for Data Mining: CRISP-DM
Fallacies of Data Mining
What Tasks Can Data Mining Accomplish?
The R Zone
R References
Exercises

Data Preprocessing

Why do We Need to Preprocess the Data?
Data Cleaning
Handling Missing Data
Identifying Misclassifications
Graphical Methods for Identifying Outliers
Measures of Center and Spread
Data Transformation
Min–Max Normalization
Z-Score Standardization
Decimal Scaling
Transformations to Achieve Normality
Numerical Methods for Identifying Outliers
Flag Variables
Transforming Categorical Variables into Numerical Variables
Binning Numerical Variables
Reclassifying Categorical Variables
Adding an Index Field
Removing Variables that are not Useful
Variables that Should Probably not be Removed
Removal of Duplicate Records
A Word About ID Fields
The R Zone
R Reference
Exercises

Exploratory Data Analysis

Hypothesis Testing Versus Exploratory Data Analysis
Getting to Know The Data Set
Exploring Categorical Variables
Exploring Numeric Variables
Exploring Multivariate Relationships
Selecting Interesting Subsets of the Data for Further Investigation
Using EDA to Uncover Anomalous Fields
Binning Based on Predictive Value
Deriving New Variables: Flag Variables
Deriving New Variables: Numerical Variables
Using EDA to Investigate Correlated Predictor Variables
Summary of Our EDA
The R Zone
R References
Exercises

Preparing to Model the Data

Supervised Versus Unsupervised Methods
Statistical Methodology and Data Mining Methodology
Cross-Validation
Overfitting
Bias–Variance Trade-Off
Balancing The Training Data Set
Establishing Baseline Performance
The R Zone
R Reference
Exercises

Simple Linear Regression

An Example of Simple Linear Regression
Dangers of Extrapolation
How Useful is the Regression? The Coefficient of Determination, r²
Standard Error of the Estimate, s
Correlation Coefficient r
ANOVA Table for Simple Linear Regression
Outliers, High Leverage Points, and Influential Observations
Population Regression Equation
Verifying The Regression Assumptions
Inference in Regression
t-Test for the Relationship Between x and y
Confidence Interval for the Slope of the Regression Line
Confidence Interval for the Correlation Coefficient ρ
Confidence Interval for the Mean Value of y Given x
Prediction Interval for a Randomly Chosen Value of y Given x
Transformations to Achieve Linearity
Box–Cox Transformations
The R Zone
R References
Exercises

Appendix A

Data Summarization and Visualization
Part 1: Summarization 1: Building Blocks Of Data Analysis
Part 2: Visualization: Graphs and Tables For Summarizing And Organizing Data
Part 3: Summarization 2: Measures Of Center, Variability, and Position
Part 4: Summarization And Visualization Of Bivariate Relationships

Hierarchical and k-Means Clustering

The Clustering Task
Hierarchical Clustering Methods
Single-Linkage Clustering
Complete-Linkage Clustering
k-Means Clustering
Example of k-Means Clustering at Work
Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
Application of k-Means Clustering Using SAS Enterprise Miner
Using Cluster Membership to Predict Churn
The R Zone
R References
Exercises
Hands-On Analysis

Measuring Cluster Goodness

Rationale for Measuring Cluster Goodness
The Silhouette Method
Silhouette Example
Silhouette Analysis of the IRIS Data Set
The Pseudo-F Statistic
Example of the Pseudo-F Statistic
Pseudo-F Statistic Applied to the IRIS Data Set
Cluster Validation
Cluster Validation Applied to the Loans Data Set
The R Zone
R References
Exercises

Decision Trees

What is a Decision Tree?
Requirements for Using Decision Trees
Classification and Regression Trees
C4.5 Algorithm
Decision Rules
Comparison of the C5.0 and CART Algorithms Applied to Real Data
The R Zone
R References
Exercises

Ensemble Methods: Bagging and Boosting

Rationale for Using an Ensemble of Classification Models
Bias, Variance, and Noise
When to Apply, and Not to Apply, Bagging
Bagging
Boosting
Application of Bagging and Boosting Using IBM/SPSS Modeler
References
The R Zone
R Reference
Exercises

Logistic Regression

Simple Example of Logistic Regression
Maximum Likelihood Estimation
Interpreting Logistic Regression Output
Inference: Are the Predictors Significant?
Odds Ratio and Relative Risk
Interpreting Logistic Regression for a Dichotomous Predictor
Interpreting Logistic Regression for a Polychotomous Predictor
Interpreting Logistic Regression for a Continuous Predictor
Assumption of Linearity
Zero-Cell Problem
Multiple Logistic Regression
Introducing Higher Order Terms to Handle Nonlinearity
Validating the Logistic Regression Model
WEKA: Hands-On Analysis Using Logistic Regression
The R Zone
R References
Exercises

Model Evaluation Techniques

Model Evaluation Techniques for the Description Task
Model Evaluation Techniques for the Estimation and Prediction Tasks
Model Evaluation Measures for the Classification Task
Accuracy and Overall Error Rate
Sensitivity and Specificity
False-Positive Rate and False-Negative Rate
Proportions of True Positives, True Negatives, False Positives, and False Negatives
Misclassification Cost Adjustment to Reflect Real-World Concerns
Decision Cost/Benefit Analysis
Lift Charts and Gains Charts
Interweaving Model Evaluation with Model Building
Confluence of Results: Applying a Suite of Models
The R Zone
R References
Exercises
Hands-On Analysis

Cost-Benefit Analysis Using Data-Driven Costs

Decision Invariance Under Row Adjustment
Positive Classification Criterion
Demonstration Of The Positive Classification Criterion
Constructing The Cost Matrix
Decision Invariance Under Scaling
Direct Costs and Opportunity Costs
Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs
Rebalancing as a Surrogate for Misclassification Costs
The R Zone
R References
Exercises

Hands-On Lab Activities

An Introduction to Data Mining and Predictive Analytics

Analyzing a Dataset

Data Preprocessing

Creating a Histogram
Creating a Scatterplot

Simple Linear Regression

Plotting Data with a Regression Line
Measuring the Goodness of Fit of the Regression
Verifying the Regression Assumptions

Hierarchical and k-Means Clustering

Finding Clusters in Data

Measuring Cluster Goodness

Plotting Silhouette Values of a Dataset
Applying Cluster Validation to a Dataset
Milestone 1

Decision Trees

Plotting a Classification Tree
Viewing the Output Sorted by Support

Ensemble Methods: Bagging and Boosting

Practical R Activities

Logistic Regression

Creating a Plot for Logistic Regression
Interpreting Logistic Regression and Odds Ratio for a Dichotomous Predictor
Milestone 2
Practical R Activities

Model Evaluation Techniques

Analyzing Cost-Benefit Using Data-Driven Misclassification Costs
Practical R Activities

Cost-Benefit Analysis Using Data-Driven Costs

Final Project

Related Courses

CCNA 200-301 Pearson uCertify Network Simulator

Accounting Course 101

Accounting All-in-One

ACCUPLACER For Beginners

ACT Prep 2024

Mastering Active Directory

Advanced Programming in the UNIX Environment