DAT-650: Advanced Data Analytics

(SNHU-DAT650.AE1) / ISBN : 978-1-61691-081-5

Skills You’ll Get

Lessons

1

Case Study, Part 1: Business Understanding, Data Preparation, and EDA

  • Cross-Industry Standard Practice for Data Mining
  • Business Understanding Phase
  • Data Understanding Phase, Part 1: Getting a Feel for the Data Set
  • Data Preparation Phase
  • Data Understanding Phase, Part 2: Exploratory Data Analysis
2

Milestone

3

Multivariate Statistics

  • Two-Sample t-Test for Difference in Means
  • Two-Sample Z-Test for Difference in Proportions
  • Test for the Homogeneity of Proportions
  • Chi-Square Test for Goodness of Fit of Multinomial Data
  • Analysis of Variance
  • Reference
  • The R Zone
  • R Reference
  • Exercises
4

Case Study, Part 2: Clustering and Principal Components Analysis

  • Partitioning the Data
  • Developing the Principal Components
  • Validating the Principal Components
  • Profiling the Principal Components
  • Choosing the Optimal Number of Clusters Using Birch Clustering
  • Choosing the Optimal Number of Clusters Using k-Means Clustering
  • Application of k-Means Clustering
  • Validating the Clusters
  • Profiling the Clusters
5

k-Nearest Neighbor Algorithm

  • Classification Task
  • k-Nearest Neighbor Algorithm
  • Distance Function
  • Combination Function
  • Quantifying Attribute Relevance: Stretching the Axes
  • Database Considerations
  • k-Nearest Neighbor Algorithm for Estimation and Prediction
  • Choosing k
  • Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler
  • The R Zone
  • R References
  • Exercises
6

Association Rules

  • Affinity Analysis and Market Basket Analysis
  • Support, Confidence, Frequent Itemsets, and the A Priori Property
  • How Does The A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
  • How Does The A Priori Algorithm Work (Part 2)? Generating Association Rules
  • Extension From Flag Data to General Categorical Data
  • Information-Theoretic Approach: Generalized Rule Induction Method
  • Association Rules are Easy to do Badly
  • How Can We Measure the Usefulness of Association Rules?
  • Do Association Rules Represent Supervised or Unsupervised Learning?
  • Local Patterns Versus Global Models
  • The R Zone
  • R References
  • Exercises
7

Multiple Regression and Model Building

  • An Example of Multiple Regression
  • The Population Multiple Regression Equation
  • Inference in Multiple Regression
  • Regression With Categorical Predictors, Using Indicator Variables
  • Adjusting R²: Penalizing Models For Including Predictors That Are Not Useful
  • Sequential Sums of Squares
  • Multicollinearity
  • Variable Selection Methods
  • Gas Mileage Data Set
  • An Application of Variable Selection Methods
  • Using the Principal Components as Predictors in Multiple Regression
  • The R Zone
  • R References
  • Exercises
8

Variable Selection Methods

9

Naïve Bayes and Bayesian Networks

  • Bayesian Approach
  • Maximum A Posteriori (MAP) Classification
  • Posterior Odds Ratio
  • Balancing The Data
  • Naïve Bayes Classification
  • Interpreting The Log Posterior Odds Ratio
  • Zero-Cell Problem
  • Numeric Predictors for Naïve Bayes Classification
  • WEKA: Hands-on Analysis Using Naïve Bayes
  • Bayesian Belief Networks
  • Clothing Purchase Example
  • Using The Bayesian Network to Find Probabilities
  • The R Zone
  • R References
  • Exercises
10

Imputation of Missing Data

  • Need for Imputation of Missing Data
  • Imputation of Missing Data: Continuous Variables
  • Standard Error of the Imputation
  • Imputation of Missing Data: Categorical Variables
  • Handling Patterns in Missingness
  • Reference
  • The R Zone
  • R References
11

Case Study, Part 3: Modeling and Evaluation for Performance and Interpretability

  • Do You Prefer The Best Model Performance, Or A Combination Of Performance And Interpretability?
  • Modeling And Evaluation Overview
  • Cost-Benefit Analysis Using Data-Driven Costs
  • Variables to be Input To The Models
  • Establishing The Baseline Model Performance
  • Models That Use Misclassification Costs
  • Models That Need Rebalancing as a Surrogate for Misclassification Costs
  • Combining Models Using Voting and Propensity Averaging
  • Interpreting The Most Profitable Model
12

Case Study, Part 4: Modeling and Evaluation for High Performance Only

  • Variables to be Input to the Models
  • Models that use Misclassification Costs
  • Models that Need Rebalancing as a Surrogate for Misclassification Costs
  • Combining Models using Voting and Propensity Averaging
  • Lessons Learned
  • Conclusions

Lab

1

Milestone

  • Milestone I
2

k-Nearest Neighbor Algorithm

  • Running KNN
  • Calculating the Euclidean Distance
3

Association Rules

  • Milestone 2
4

Multiple Regression and Model Building

  • Approximating the Relationship between the Variables in a Scatterplot
  • Identifying Confidence Intervals
  • Creating a Dot Plot
  • Determining the Sequential Sums of Squares
  • Analyzing Multicollinearity
5

Variable Selection Methods

  • Applying the Best Subsets Procedure in a Regression Model
  • Applying Forward Selection Procedure
  • Applying the Backward Elimination Procedure
  • Applying the Stepwise Selection Procedure in a Regression Model
  • Using the Principal Components as Predictors in Multiple Regression
6

Naïve Bayes and Bayesian Networks

  • Calculating Posterior Odds Ratio
  • Calculating the Log Posterior Odds Ratio
  • Calculating the Numeric Predictors for Naïve Bayes Classification
  • Milestone 3
7

Imputation of Missing Data

8

Case Study, Part 3: Modeling and Evaluation for Performance and Interpretability

  • Final Project
