DAT-640: Predictive Analytics

(SNHU-DAT640.AE1) / ISBN: 978-1-61691-079-2

Skills You’ll Get

Lessons

1

An Introduction to Data Mining and Predictive Analytics

  • What Is Data Mining? What Is Predictive Analytics?
  • Wanted: Data Miners
  • The Need for Human Direction of Data Mining
  • The Cross-Industry Standard Process for Data Mining: CRISP-DM
  • Fallacies of Data Mining
  • What Tasks Can Data Mining Accomplish?
  • The R Zone
  • R References
  • Exercises
2

Data Preprocessing

  • Why Do We Need to Preprocess the Data?
  • Data Cleaning
  • Handling Missing Data
  • Identifying Misclassifications
  • Graphical Methods for Identifying Outliers
  • Measures of Center and Spread
  • Data Transformation
  • Min–Max Normalization
  • Z-Score Standardization
  • Decimal Scaling
  • Transformations to Achieve Normality
  • Numerical Methods for Identifying Outliers
  • Flag Variables
  • Transforming Categorical Variables into Numerical Variables
  • Binning Numerical Variables
  • Reclassifying Categorical Variables
  • Adding an Index Field
  • Removing Variables That Are Not Useful
  • Variables That Should Probably Not Be Removed
  • Removal of Duplicate Records
  • A Word About ID Fields
  • The R Zone
  • R Reference
  • Exercises
3

Exploratory Data Analysis

  • Hypothesis Testing Versus Exploratory Data Analysis
  • Getting to Know the Data Set
  • Exploring Categorical Variables
  • Exploring Numeric Variables
  • Exploring Multivariate Relationships
  • Selecting Interesting Subsets of the Data for Further Investigation
  • Using EDA to Uncover Anomalous Fields
  • Binning Based on Predictive Value
  • Deriving New Variables: Flag Variables
  • Deriving New Variables: Numerical Variables
  • Using EDA to Investigate Correlated Predictor Variables
  • Summary of Our EDA
  • The R Zone
  • R References
  • Exercises
4

Preparing to Model the Data

  • Supervised Versus Unsupervised Methods
  • Statistical Methodology and Data Mining Methodology
  • Cross-Validation
  • Overfitting
  • Bias–Variance Trade-Off
  • Balancing the Training Data Set
  • Establishing Baseline Performance
  • The R Zone
  • R Reference
  • Exercises
5

Simple Linear Regression

  • An Example of Simple Linear Regression
  • Dangers of Extrapolation
  • How Useful Is the Regression? The Coefficient of Determination, r²
  • Standard Error of the Estimate, s
  • Correlation Coefficient r
  • ANOVA Table for Simple Linear Regression
  • Outliers, High Leverage Points, and Influential Observations
  • Population Regression Equation
  • Verifying the Regression Assumptions
  • Inference in Regression
  • t-Test for the Relationship Between x and y
  • Confidence Interval for the Slope of the Regression Line
  • Confidence Interval for the Correlation Coefficient ρ
  • Confidence Interval for the Mean Value of y Given x
  • Prediction Interval for a Randomly Chosen Value of y Given x
  • Transformations to Achieve Linearity
  • Box–Cox Transformations
  • The R Zone
  • R References
  • Exercises
A

Appendix A

  • Data Summarization and Visualization
  • Part 1: Summarization 1: Building Blocks of Data Analysis
  • Part 2: Visualization: Graphs and Tables for Summarizing and Organizing Data
  • Part 3: Summarization 2: Measures of Center, Variability, and Position
  • Part 4: Summarization and Visualization of Bivariate Relationships
7

Hierarchical and k-Means Clustering

  • The Clustering Task
  • Hierarchical Clustering Methods
  • Single-Linkage Clustering
  • Complete-Linkage Clustering
  • k-Means Clustering
  • Example of k-Means Clustering at Work
  • Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
  • Application of k-Means Clustering Using SAS Enterprise Miner
  • Using Cluster Membership to Predict Churn
  • The R Zone
  • R References
  • Exercises
  • Hands-On Analysis
8

Measuring Cluster Goodness

  • Rationale for Measuring Cluster Goodness
  • The Silhouette Method
  • Silhouette Example
  • Silhouette Analysis of the IRIS Data Set
  • The Pseudo-F Statistic
  • Example of the Pseudo-F Statistic
  • Pseudo-F Statistic Applied to the IRIS Data Set
  • Cluster Validation
  • Cluster Validation Applied to the Loans Data Set
  • The R Zone
  • R References
  • Exercises
9

Decision Trees

  • What Is a Decision Tree?
  • Requirements for Using Decision Trees
  • Classification and Regression Trees
  • C4.5 Algorithm
  • Decision Rules
  • Comparison of the C5.0 and CART Algorithms Applied to Real Data
  • The R Zone
  • R References
  • Exercises
10

Ensemble Methods: Bagging and Boosting

  • Rationale for Using an Ensemble of Classification Models
  • Bias, Variance, and Noise
  • When to Apply, and Not to Apply, Bagging
  • Bagging
  • Boosting
  • Application of Bagging and Boosting Using IBM/SPSS Modeler
  • References
  • The R Zone
  • R Reference
  • Exercises
11

Logistic Regression

  • Simple Example of Logistic Regression
  • Maximum Likelihood Estimation
  • Interpreting Logistic Regression Output
  • Inference: Are the Predictors Significant?
  • Odds Ratio and Relative Risk
  • Interpreting Logistic Regression for a Dichotomous Predictor
  • Interpreting Logistic Regression for a Polychotomous Predictor
  • Interpreting Logistic Regression for a Continuous Predictor
  • Assumption of Linearity
  • Zero-Cell Problem
  • Multiple Logistic Regression
  • Introducing Higher Order Terms to Handle Nonlinearity
  • Validating the Logistic Regression Model
  • WEKA: Hands-On Analysis Using Logistic Regression
  • The R Zone
  • R References
  • Exercises
12

Model Evaluation Techniques

  • Model Evaluation Techniques for the Description Task
  • Model Evaluation Techniques for the Estimation and Prediction Tasks
  • Model Evaluation Measures for the Classification Task
  • Accuracy and Overall Error Rate
  • Sensitivity and Specificity
  • False-Positive Rate and False-Negative Rate
  • Proportions of True Positives, True Negatives, False Positives, and False Negatives
  • Misclassification Cost Adjustment to Reflect Real-World Concerns
  • Decision Cost/Benefit Analysis
  • Lift Charts and Gains Charts
  • Interweaving Model Evaluation with Model Building
  • Confluence of Results: Applying a Suite of Models
  • The R Zone
  • R References
  • Exercises
  • Hands-On Analysis
13

Cost-Benefit Analysis Using Data-Driven Costs

  • Decision Invariance Under Row Adjustment
  • Positive Classification Criterion
  • Demonstration of the Positive Classification Criterion
  • Constructing the Cost Matrix
  • Decision Invariance Under Scaling
  • Direct Costs and Opportunity Costs
  • Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs
  • Rebalancing as a Surrogate for Misclassification Costs
  • The R Zone
  • R References
  • Exercises

Lab

1

An Introduction to Data Mining and Predictive Analytics

  • Analyzing a Dataset
2

Data Preprocessing

  • Creating a Histogram
  • Creating a Scatterplot
3

Simple Linear Regression

  • Plotting Data with a Regression Line
  • Measuring the Goodness of Fit of the Regression
  • Verifying the Regression Assumptions
4

Hierarchical and k-Means Clustering

  • Finding Clusters in Data
5

Measuring Cluster Goodness

  • Plotting Silhouette Values of a Dataset
  • Applying Cluster Validation to a Dataset
  • Milestone 1
6

Decision Trees

  • Plotting a Classification Tree
  • Viewing the Output Sorted by Support
7

Ensemble Methods: Bagging and Boosting

  • Practical R Activities
8

Logistic Regression

  • Creating a Plot for Logistic Regression
  • Interpreting Logistic Regression and Odds Ratio for a Dichotomous Predictor
  • Milestone 2
  • Practical R Activities
9

Model Evaluation Techniques

  • Analyzing Cost-Benefit Using Data-Driven Misclassification Costs
  • Practical R Activities
10

Cost-Benefit Analysis Using Data-Driven Costs

  • Final Project
