Data pre-processing

Discretization

Binarization

Convert continuous or numerical data into binary values (0, 1) based on some thresholds.

[!Example] We have the listen counts of users and want to determine each user's taste. Raw listen counts are not a robust measure of taste (users have different listening habits).

With binarization, we can transform listen counts into binary values: for example, if the listen count >= 10, map it to 1; otherwise map it to 0.
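
A minimal sketch of this thresholding with numpy (the listen counts below are made up; 10 is the threshold from the example):

```python
import numpy as np

# Hypothetical listen counts for a few (user, song) pairs.
listen_counts = np.array([0, 3, 12, 55, 7, 10, 1])

# Threshold from the example: >= 10 listens counts as "likes this song".
binary_taste = (listen_counts >= 10).astype(int)
print(binary_taste)  # [0 0 1 1 0 1 0]
```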

Binning (Quantization)

Group continuous data into bins (map continuous data to discrete or categorical values).

  • Need to decide the bin size (width of each bin)

Here are some interesting methods of binning

Fixed-width binning

Each bin covers a specific numeric range; the ranges can be on any scale (custom designed, automatically segmented, linear, exponential).

  • Simple
  • It works well when the data is uniformly distributed (a fairly even spread)

Example

  • Custom designed: binning people's ages by stage of life

0-12 years old, 12-17 years old, 18-24 years old, …

  • Exponential-width binning (related to the log transform): useful when the data spans a wide range of values

0-9, 10-99, 100-999, …
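
A small sketch of both variants with pandas and numpy; the bin edges and labels are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Custom-designed bins: ages mapped to (assumed) stages of life.
ages = pd.Series([3, 15, 19, 35, 62, 8, 24])
stages = pd.cut(ages, bins=[0, 12, 17, 24, 120],
                labels=["child", "teen", "young adult", "adult"])

# Exponential-width bins (0-9, 10-99, 100-999, ...) via a log10 transform.
counts = np.array([3, 8, 42, 180, 5000])
exp_bin = np.floor(np.log10(counts.clip(min=1))).astype(int)

print(stages.tolist())
print(exp_bin)  # [0 0 1 2 3]
```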

Fixed-frequency binning

Each interval contains approximately the same number of data points

  • When data is not uniformly distributed, we can ensure that each bin contains approximately the same number of data points

Example

Data = [0, 4, 12, 16, 16, 18, 24, 26, 28], Ranges = (-∞, 14], [14, 21], [21, +∞)

Bin 1 : [0, 4, 12] Bin 2 : [16, 16, 18] Bin 3 : [24, 26, 28]

Quantile Binning

Binning based on quantiles (values that divide the data into equal portions).

4-Quantiles = Quartiles (divide the data into quarters)
10-Quantiles = Deciles (divide the data into tenths)

Steps: in a K-quantile split, to find the position of the boundary value of the $i$-th quantile, compute $\text{position} = \frac{i \times (n + 1)}{K}$, where

  • K = the number of quantiles
  • n = the number of data points in the dataset
  • i = the index of the quantile boundary

Example

Fixed-width binning would produce a lot of empty bins on this data, so quantile binning comes to the rescue.

Data = [1, 1, 1, 2, 2, 3, 4, 4, 10, 10, 10, 500, 1000]. Let's divide the data into quartiles (4-quantiles), so n = 13 and K = 4.

  • Quantile-1: [1, 1, 1, 2, 2, 3]
  • Quantile-2: [4, 4]
  • Quantile-3: [10, 10, 10]
  • Quantile-4: [500, 1000]
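
A quick sketch with pandas' qcut (note that pandas picks slightly different quartile edges than the hand grouping above, since quantile conventions differ):

```python
import pandas as pd

data = pd.Series([1, 1, 1, 2, 2, 3, 4, 4, 10, 10, 10, 500, 1000])

# Cut into 4 quantile-based bins (quartiles); duplicates="drop" guards against
# repeated values collapsing two bin edges into one.
quartile_bins = pd.qcut(data, q=4, duplicates="drop")
print(quartile_bins.value_counts().sort_index())
```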

Imbalanced Techniques - Sampling and SMOTE

Imbalance occurs when one class in the dataset has many more samples than another. It can be caused by

  • Bias while sampling
  • Error from measurement
  • The nature of the data itself

Majority Class = the class with many samples
Minority Class = the class with few samples

It is harder for a model to learn the characteristics of the minority class, and some models assume an equal class distribution, so imbalance can cause problems.

Note

We focus on the class distribution of the target variable.

Sampling

Random Under-sampling

![[Pasted image 20240503013221.png | center | 300]]

Randomly eliminate samples from the majority class until the class distribution is balanced.

When the training data is not that small, we can afford to under-sample.

Advantages

  • Reduce model training time: fewer data points

Disadvantages

  • The reduced dataset may not represent the population or true distribution
  • May lose useful data

Random Over-sampling

![[Pasted image 20240503013247.png | center | 300]]

Randomly duplicate samples from the minority class to get a more balanced distribution.

When the training data is small, over-sampling is usually a better choice than under-sampling.

Advantages

  • No information loss

Disadvantages

  • Can cause overfitting, since minority samples are duplicated exactly (see the sketch below)
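
A rough sketch of both random resampling strategies with sklearn.utils.resample (the 90-vs-10 toy class sizes are an assumption); if the imbalanced-learn package is available, its RandomUnderSampler and RandomOverSampler classes do the same directly on (X, y):

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced data: 90 majority-class rows vs 10 minority-class rows.
rng = np.random.default_rng(0)
X_majority = rng.normal(0.0, 1.0, size=(90, 2))
X_minority = rng.normal(3.0, 1.0, size=(10, 2))

# Random under-sampling: shrink the majority class down to the minority size.
X_majority_down = resample(X_majority, replace=False,
                           n_samples=len(X_minority), random_state=0)

# Random over-sampling: duplicate minority rows (with replacement) up to the majority size.
X_minority_up = resample(X_minority, replace=True,
                         n_samples=len(X_majority), random_state=0)

print(len(X_majority_down), len(X_minority_up))  # 10 90
```
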
Synthetic over-sampling: SMOTE

Take subsets of the minority class, then generate synthetic data points from those subsets and add the new points to the dataset.

It mitigates the overfitting problem of random over-sampling.

Advantages

  • Help with overfitting problem
  • No data loss

Disadvantages

  • Does not take the majority class into account
  • Adds noise to the dataset

![[77417image1 1.png | center | 500]]

How does SMOTE generate new data points?

  1. Finding nearest neighbors: for each minority-class point, find its k nearest neighbors among the other minority-class points.
  2. Creating synthetic data points: do interpolation
  • For each feature dimension, SMOTE calculates the difference between the selected data point and one of its neighbors.
  • It then multiplies this difference by a random value between 0 and 1.
  • Finally, it adds this scaled difference to the feature value of the selected data point to create a new synthetic data point.
  3. Repeat until enough synthetic points have been generated (see the sketch below).
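
A hand-rolled sketch of that interpolation step using numpy and scikit-learn's NearestNeighbors (the toy minority samples and k = 3 are assumptions); in practice the SMOTE class from the imbalanced-learn package does this for you:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_minority = rng.normal(3.0, 1.0, size=(10, 2))   # minority-class samples only

k = 3
# +1 neighbors because each point's nearest neighbor is itself.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
_, neighbor_idx = nn.kneighbors(X_minority)

synthetic = []
for i, x in enumerate(X_minority):
    neighbor = X_minority[rng.choice(neighbor_idx[i][1:])]  # one of the k nearest minority neighbors
    u = rng.random()                                        # random value in [0, 1)
    synthetic.append(x + u * (neighbor - x))                # interpolate between x and the neighbor

X_synthetic = np.vstack(synthetic)
print(X_synthetic.shape)  # (10, 2) new synthetic minority points
```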

Dimensionality Reduction - PCA

PCA reduces a large number of variables into a smaller set of Principal Component axes.

Why do we need dimensionality reduction?

Most of the time we have datasets with high dimensionality (many columns or features). We can't visualize such datasets directly, so dimensionality reduction lends a hand with this problem.

A dimensionality reduction algorithm tries to preserve as much of the information in the high-dimensional data as possible while reducing it to 2-3 dimensions, where we can visualize and understand it (in other words, it extracts features out of the high-dimensional data).

It can also be applied before models that are strongly affected by the Curse of Dimensionality.

How to perform PCA

Ref: https://www.youtube.com/watch?v=MLaJbA82nzk

| Feature | | | | |
|---|---|---|---|---|
| x | 4 | 8 | 13 | 7 |
| y | 11 | 4 | 5 | 14 |

Number of features = n = 2
Number of samples = N = 4

Features mean calculation: $\bar{x} = \frac{4 + 8 + 13 + 7}{4} = 8$, $\bar{y} = \frac{11 + 4 + 5 + 14}{4} = 8.5$

Covariance Matrix: let the ordered pairs be (x, x), (x, y), (y, x), (y, y)

  • Find the covariance of each ordered pair, $\operatorname{cov}(a, b) = \frac{\sum_{i=1}^{N}(a_i - \bar{a})(b_i - \bar{b})}{N - 1}$, which gives $C = \begin{pmatrix} 14 & -11 \\ -11 & 23 \end{pmatrix}$

Calculate the eigenvalues and eigenvectors, then normalize the eigenvectors

  • Eigenvalues: solve $\det(C - \lambda I) = 0$, where $C$ is the covariance matrix (here $\lambda_1 \approx 30.39$ and $\lambda_2 \approx 6.62$)
  • Eigenvector of each $\lambda$: solve $(C - \lambda I)v = 0$
  • Normalize the eigenvector to get the unit eigenvector $\hat{v} = v / \lVert v \rVert$

Derive the new dataset: project each mean-centered sample onto the unit eigenvector of the largest eigenvalue, $\text{PC1}_i = \hat{v}^{\top}(p_i - \bar{p})$

| | P11 | P12 | P13 | P14 |
|---|---|---|---|---|
| PC1 | -4.3052 | 3.7561 | 5.6928 | -5.1238 |
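
A numpy sketch of the manual steps above on the same x/y table (the sign of the component may come out flipped relative to the hand calculation, which is normal for eigenvectors):

```python
import numpy as np

# The 4 samples from the table above, one row per sample: (x, y).
X = np.array([[4.0, 11.0],
              [8.0,  4.0],
              [13.0, 5.0],
              [7.0, 14.0]])

# 1) Mean-center each feature (means: x = 8, y = 8.5).
X_centered = X - X.mean(axis=0)

# 2) Covariance matrix of the features (sample covariance, N - 1 in the denominator).
C = np.cov(X_centered, rowvar=False)          # [[14, -11], [-11, 23]]

# 3) Eigen-decomposition; eigh returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(C)
pc1 = eigenvectors[:, -1]                     # unit eigenvector of the largest eigenvalue

# 4) Project the centered samples onto PC1 to get the new 1-D dataset.
scores = X_centered @ pc1
print(eigenvalues, scores)                    # scores ~ +/-[-4.31, 3.74, 5.69, -5.12]
```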

Tip

Before applying PCA, standardizing the features is an important step!

  • PCA seeks to maximize the variance along each component; without standardization, features measured on larger scales dominate the principal components

Model Evaluation

Cross-Validation

Technique to evaluate predictive models by partitioning the original sample into a training set to train the model and a test set to evaluate it.

There are several interesting methods that use the cross-validation technique.

We can also use the validation set to tune model hyperparameters, and test on the held-out test set afterwards.

Holdout method

Separate the dataset into two sets: a training set and a testing set

  • Use the training set only for training, then predict on the testing set and validate the model on it

K-fold cross validation

Randomly divide the dataset into K equal-sized parts; use K-1 parts as the training set and the remaining part as the validation set.

![[LyHIm.png | center | 600]]

Advantage

  • Good to use when we don't have a large enough dataset (when a single random split of the data would not be a representative sample of the underlying distribution)

Disadvantages

  • K matters: a larger K means larger training sets that overlap more, leading to a stronger dependence between the results of the K folds
  • Computationally expensive

Leave-one-out cross validation

An extreme case of K-fold cross-validation: only one sample is used as the validation set while the rest are used to train the model.

![[loso.webp | center | 500]]
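
A compact sketch of the three schemes with scikit-learn (the synthetic dataset and the decision tree model are just placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (KFold, LeaveOneOut, cross_val_score,
                                     train_test_split)
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=150, n_features=6, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# Holdout: one train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_acc = model.fit(X_tr, y_tr).score(X_te, y_te)

# K-fold: each of the K=5 folds serves as the validation set once.
kfold_acc = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Leave-one-out: every single sample serves as the validation set once.
loo_acc = cross_val_score(model, X, y, cv=LeaveOneOut())

print(holdout_acc, kfold_acc.mean(), loo_acc.mean())
```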

Partitioning Clustering

Data clustering concept

  • Grouping similar data points together based on certain criteria
  • Unsupervised Learning
  • Use Distance/Similarity Metric

The goal of clustering is to discover natural groupings or clusters in the data without any prior knowledge of the groupings.

Pre-processing for clustering

  • Normalize the data
  • Remove noises or outliers
  • Reduce to a small number of dimensions (dimensionality reduction)

K-means

How to perform K-means

Let the data points A1-A9 be as plotted below:

\begin{document}
\begin{tikzpicture}[scale=0.7]

    \coordinate (A1) at (3,8);
    \coordinate (A2) at (9,4);
    \coordinate (A3) at (4,9);
    \coordinate (A4) at (2,2);
    \coordinate (A5) at (10,5);
    \coordinate (A6) at (2,4);
    \coordinate (A7) at (6,8);
    \coordinate (A8) at (4,3);
    \coordinate (A9) at (4,7);
    
    \foreach \point in {A1, A2, A3, A4, A5, A6, A7, A8, A9}
        \fill (\point) circle (3pt);
    
    \foreach \point/\label in {A1/A1, A2/A2, A3/A3, A4/A4, A5/A5, A6/A6, A7/A7, A8/A8, A9/A9}
        \node[above right] at (\point) {\label};
    
    \draw[help lines,gray!30] (0,0) grid (12,10);
    \foreach \x in {1,2,...,12}
        \node[below] at (\x,0) {\x};
    \foreach \y in {1,2,...,10}
        \node[left] at (0,\y) {\y};
    
    \draw[->] (0,0) -- (12,0) node[right] {$x$};
    \draw[->] (0,0) -- (0,10) node[above] {$y$};
\end{tikzpicture}
\end{document}

In this problem we will use K = 3.

Goal: Find member and centroid of each cluster

Step

  1. Choose K centroids randomly (they can be any points)
  2. Find the nearest centroid for each point and assign each point to that centroid

In this problem we use the Euclidean distance, $d(p, q) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2}$

Let's assume we have assigned each point to its nearest centroid, using Euclidean distance as the distance metric, and got this result:

  • j1 has [A1, A3, A7, A9]
  • j2 has [A4, A6, A8]
  • j3 has [A2, A5]
  3. Find the new centroids and check whether the cluster members change
  • If the cluster members remain the same: finished!
  • If the cluster members change: reassign the points and find new centroids again!

The new centroid of cluster $j$ is computed, for $j = 1$ to $K$, as $c_j = \frac{1}{|S_j|} \sum_{p \in S_j} p$

where $S_j$ is the set of points currently assigned to cluster $j$ and $|S_j|$ is the number of points in it.

Example

| j | centroid |
|---|---|
| 1 | (4.25, 8) |
| 2 | (2.667, 3) |
| 3 | (9.5, 4.5) |

  • j1 has [A1, A3, A7, A9]
  • j2 has [A4, A6, A8]
  • j3 has [A2, A5]

The cluster members did not change, so we are FINISHED!

Result

\begin{document}
\begin{tikzpicture}[scale=0.7]
    \coordinate (A1) at (3,8);
    \coordinate (A2) at (9,4);
    \coordinate (A3) at (4,9);
    \coordinate (A4) at (2,2);
    \coordinate (A5) at (10,5);
    \coordinate (A6) at (2,4);
    \coordinate (A7) at (6,8);
    \coordinate (A8) at (4,3);
    \coordinate (A9) at (4,7);
    
    \foreach \point/\label/\color in {A1/A1/blue, A2/A2/yellow, A3/A3/blue, A4/A4/green, A5/A5/yellow, A6/A6/green, A7/A7/blue, A8/A8/green, A9/A9/blue}
        \fill[\color] (\point) circle (3pt) node[above right, \color] {\label};
    
    \coordinate (C1) at (4.25,8);
    \coordinate (C2) at (2.667,3);
    \coordinate (C3) at (9.5,4.5);
    
    \foreach \centroid/\label in {C1/c1, C2/c2, C3/c3}
        \fill[red] (\centroid) circle (5pt) node[below right] {\label};
    
    \foreach \point/\centroid/\color in {A1/C1/blue, A3/C1/blue, A7/C1/blue, A9/C1/blue, A4/C2/green, A6/C2/green, A8/C2/green, A2/C3/yellow, A5/C3/yellow}
        \fill[\color] (\point) circle (3pt);
    
    \draw[help lines,gray!30] (0,0) grid (12,10);
    \foreach \x in {1,2,...,12}
        \node[below] at (\x,0) {\x};
    \foreach \y in {1,2,...,10}
        \node[left] at (0,\y) {\y};
    
    \draw[->] (0,0) -- (12,0) node[right] {$x$};
    \draw[->] (0,0) -- (0,10) node[above] {$y$};
\end{tikzpicture}
\end{document}
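
A bare-bones numpy version of the steps above on the nine points from the figure (the initial centroids are an arbitrary choice, here A1, A4 and A2):

```python
import numpy as np

# A1..A9 from the figure, in order.
points = np.array([[3, 8], [9, 4], [4, 9], [2, 2], [10, 5],
                   [2, 4], [6, 8], [4, 3], [4, 7]], dtype=float)

K = 3
centroids = points[[0, 3, 1]].copy()   # arbitrary initial centroids: A1, A4, A2

for _ in range(100):
    # Step 2: assign every point to its nearest centroid (Euclidean distance).
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 3: recompute each centroid as the mean of its assigned points.
    new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(K)])
    if np.allclose(new_centroids, centroids):   # members (and centroids) stopped changing
        break
    centroids = new_centroids

print(labels)      # cluster membership of A1..A9
print(centroids)   # ~ [[4.25, 8.0], [2.667, 3.0], [9.5, 4.5]]
```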

Clustering Evaluation

Elbow Method

Helps choose the right number of clusters for your clustering algorithm by plotting the number of clusters against the sum of squared distances and looking for the "elbow" where the curve starts to flatten.

Within-Cluster Sum of Squares (WCSS):

$$\text{WCSS} = \sum_{j=1}^{K} \sum_{p \in S_j} \lVert p - c_j \rVert^2$$

Then plot it on a graph where y is the WCSS and x is the number of clusters.
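
A short sketch with scikit-learn's KMeans on the same nine points; its inertia_ attribute is exactly the WCSS above:

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[3, 8], [9, 4], [4, 9], [2, 2], [10, 5],
                   [2, 4], [6, 8], [4, 3], [4, 7]], dtype=float)

# WCSS for k = 1..6; plot these and look for the "elbow" where the curve flattens.
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    print(k, round(km.inertia_, 2))
```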

Association Rule

Apriori Algorithm

Ref: https://www.youtube.com/watch?v=rgN5eSEYbnY

Goal: find frequent itemsets, i.e., sets of items that tend to occur together.

Step

| Customer ID | Transaction ID | Items Bought |
|---|---|---|
| 1 | 0001 | {a, d, e, f} |
| 1 | 0024 | {a, b, c, e} |
| 2 | 0012 | {a, b, d, e} |
| 2 | 0031 | {a, c, d, e} |
| 3 | 0015 | {b, c, e} |
| 3 | 0022 | {b, d, e} |
| 4 | 0029 | {c, d} |
| 4 | 0040 | {a, b, c} |
| 5 | 0033 | {a, d, e} |
| 5 | 0038 | {a, b, e} |

Let the minimum support be 40%.
  1. Filter items by minimum support

Filter out any item whose support (the fraction of transactions that contain it) is below the minimum support.

| Transaction ID | a | b | c | d | e | f |
|---|---|---|---|---|---|---|
| 0001 | 1 | | | 1 | 1 | 1 |
| 0024 | 1 | 1 | 1 | | 1 | |
| 0012 | 1 | 1 | | 1 | 1 | |
| 0031 | 1 | | 1 | 1 | 1 | |
| 0015 | | 1 | 1 | | 1 | |
| 0022 | | 1 | | 1 | 1 | |
| 0029 | | | 1 | 1 | | |
| 0040 | 1 | 1 | 1 | | | |
| 0033 | 1 | | | 1 | 1 | |
| 0038 | 1 | 1 | | | 1 | |
| Count | 7 | 6 | 5 | 6 | 8 | 1 |
| Out of | 10 | 10 | 10 | 10 | 10 | 10 |
| Support | 70% | 60% | 50% | 60% | 80% | 10% |
| Pass? | Yes | Yes | Yes | Yes | Yes | No |
  2. Create larger itemsets

To create k-itemsets, combine the items/itemsets that passed the minimum support in the previous step into candidate sets, then compute the support of each candidate.

2-Itemset

| Product | Support | Pass? |
|---|---|---|
| {a, b} | 4/10 = 40% | Yes |
| {a, c} | 3/10 = 30% | No |
| {a, d} | 4/10 = 40% | Yes |
| {a, e} | 6/10 = 60% | Yes |
| {b, c} | 3/10 = 30% | No |
| {b, d} | 2/10 = 20% | No |
| {b, e} | 5/10 = 50% | Yes |
| {c, d} | 2/10 = 20% | No |
| {c, e} | 3/10 = 30% | No |
| {d, e} | 5/10 = 50% | Yes |

3-Itemset

| Product | Support | Pass? |
|---|---|---|
| {a, b, c} | 2/10 = 20% | No |
| {a, c, d} | 1/10 = 10% | No |
| {a, d, e} | 4/10 = 40% | Yes |
| {b, c, e} | 2/10 = 20% | No |

4-Itemset

| Product | Support | Pass? |
|---|---|---|
| {a, b, c, e} | 1/10 = 10% | No |
| {a, b, d, e} | 1/10 = 10% | No |
| {a, c, d, e} | 1/10 = 10% | No |
  3. Get association rules and calculate Confidence and Lift

An association rule is written as $A \Rightarrow B$:

If A exists in a transaction, then B tends to exist too. For example, if Customer 1 buys A, then he tends to buy B too.

We usually call A the antecedent and B the consequent.

| Antecedent | Consequent | Association Rule |
|---|---|---|
| {b} | {a, d, e} | {b} ⇒ {a, d, e} |
| {a, d, e} | {b} | {a, d, e} ⇒ {b} |

Confidence

Measures how often the rule is true: $\text{Confidence}(A \Rightarrow B) = \dfrac{\text{Support}(A, B)}{\text{Support}(A)}$

  • If A occurs, B is likely to occur

Minimum Confidence sets a lower bound that a rule's confidence must reach.

Let's say the minimum confidence = 0.2

| Association Rule | Support(A, B) | Support(A) | Confidence(A ⇒ B) | Pass Minimum Confidence? |
|---|---|---|---|---|
| {b} ⇒ {a, d, e} | 1/10 | 6/10 | 0.167 | No |
| {a, d, e} ⇒ {b} | 1/10 | 4/10 | 0.25 | Yes |

Lift

A measure of the strength of the association between two items: $\text{Lift}(A \Rightarrow B) = \dfrac{\text{Confidence}(A \Rightarrow B)}{\text{Support}(B)}$

  • Lift greater than 1 indicates that the presence of item A has a positive effect on the presence of item B
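
A plain-Python sketch that reproduces the support, confidence, and lift numbers from the walkthrough above (it only does the first Apriori filtering pass, no full candidate pruning):

```python
from itertools import combinations

# The ten transactions from the example, keyed by transaction ID.
transactions = {
    "0001": {"a", "d", "e", "f"}, "0024": {"a", "b", "c", "e"},
    "0012": {"a", "b", "d", "e"}, "0031": {"a", "c", "d", "e"},
    "0015": {"b", "c", "e"},      "0022": {"b", "d", "e"},
    "0029": {"c", "d"},           "0040": {"a", "b", "c"},
    "0033": {"a", "d", "e"},      "0038": {"a", "b", "e"},
}
N = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions.values()) / N

# Pass 1: single items that meet the 40% minimum support (this drops f at 10%).
frequent_items = [i for i in "abcdef" if support({i}) >= 0.4]

# Pass 2: candidate 2-itemsets built from the surviving items.
for pair in combinations(frequent_items, 2):
    print(set(pair), support(set(pair)))

# Confidence and lift for the rule {a, d, e} => {b}.
A, B = {"a", "d", "e"}, {"b"}
confidence = support(A | B) / support(A)   # 0.1 / 0.4 = 0.25
lift = confidence / support(B)             # 0.25 / 0.6 ~ 0.42
print(confidence, lift)
```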

Decision Trees and Random Forest

Terminology

Node Impurity

A node is like a dataset or a subset of the data in the decision tree. A Pure Node is a subset where all samples belong to the same class.

| A | B | Class |
|---|---|---|
| 1 | 22 | + |
| 223 | 213 | + |
| 1 | 232 | + |

An Impure Node is a subset whose samples belong to more than one class; it is maximally impure when the classes are in equal proportion (#class + = #class -).

Entropy

Entropy of a random variable is the average level of “information”, “surprise”, or “uncertainty” inherent to the variable’s possible outcomes

When we measure the entropy of something, it is like measuring how much information we need to describe it, or measuring the randomness in the system.

In the decision tree algorithm, entropy measures the impurity of a node, $H(S) = -\sum_i p_i \log_2 p_i$, where $p_i$ is the proportion of samples of class $i$ in node $S$

  • Pure node: entropy = 0 (no impurity; only a little information is needed to describe the node) and low randomness

Example

Find the entropy of this node S, where the class labels are + and -.

| Height | Race | Sex | Class |
|---|---|---|---|
| 160 | Asian | F | + |
| 170 | Hispanic | F | + |
| 180 | White | M | - |
| 170 | White | M | + |
| 180 | Asian | M | - |

The probability that a person is + is $p_+ = 3/5 = 0.6$
The probability that a person is - is $p_- = 2/5 = 0.4$

$H(S) = -0.6 \log_2 0.6 - 0.4 \log_2 0.4 \approx 0.970951$

NOTE: when $p = 1$ (all samples in one class), entropy $= 0$, since $\log_2 1 = 0$ and terms with $p = 0$ are taken as 0.

Information Gain

In a decision tree, information gain measures how much the entropy is reduced by a split.

  • Compute the difference between the entropy before the split and the weighted average entropy after the split: $\text{GAIN}(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} H(S_v)$
  • More reduction of entropy means more information gain

Information gain helps us select the best feature to split the data on at each internal node of the decision tree

  • The feature with the highest information gain is chosen as the split feature

Example

Let's use the previous example, where:

| Height | Race | Sex | Class |
|---|---|---|---|
| 160 | Asian | F | + |
| 170 | Hispanic | F | + |
| 180 | White | M | - |
| 170 | White | M | + |
| 180 | Asian | M | - |

Find the information gain of Sex, i.e. GAIN(S, Sex). Sex takes the values {F, M}:

  • $S_F$: 2 samples {2+, 0-}, so $H(S_F) = 0$
  • $S_M$: 3 samples {1+, 2-}, so $H(S_M) = -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} \approx 0.9183$
  • $S$: 5 samples {3+, 2-}, with $H(S) = 0.970951$

$\text{GAIN}(S, \text{Sex}) = 0.970951 - \left(\tfrac{2}{5}\cdot 0 + \tfrac{3}{5}\cdot 0.9183\right) \approx 0.420$

GINI Index

The GINI index also measures impurity; it is like information gain but computed differently (GINI is faster to compute): $\text{GINI}(S) = 1 - \sum_i p_i^2$, where $p_i$ is the proportion of class $i$ in node $S$.

| Height | Race | Sex | Class |
|---|---|---|---|
| 160 | Asian | F | + |
| 170 | Hispanic | F | + |
| 180 | White | M | - |
| 170 | White | M | + |
| 180 | Asian | M | - |

Example

Find the GINI index of a split on Sex:

  • $S_F$: 2 samples {2+, 0-}, so $\text{GINI}(S_F) = 1 - (1^2 + 0^2) = 0$
  • $S_M$: 3 samples {1+, 2-}, so $\text{GINI}(S_M) = 1 - \left(\left(\tfrac{1}{3}\right)^2 + \left(\tfrac{2}{3}\right)^2\right) = \tfrac{4}{9} \approx 0.444$
  • $S$: 5 samples {3+, 2-}; weighted GINI of the split $= \tfrac{2}{5}\cdot 0 + \tfrac{3}{5}\cdot 0.444 \approx 0.267$

Note

For continuous values, we need to apply discretization first.
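
A tiny sketch that reproduces the entropy, information gain, and GINI numbers for the Sex split above:

```python
import math
from collections import Counter

# The five samples from the example, as (Sex, Class) pairs.
samples = [("F", "+"), ("F", "+"), ("M", "-"), ("M", "+"), ("M", "-")]

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the class proportions."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gini(labels):
    """GINI(S) = 1 - sum(p_i^2) over the class proportions."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

classes = [cls for _, cls in samples]
h_before = entropy(classes)                       # ~ 0.970951

# Weighted impurity after splitting on Sex, and the resulting information gain.
h_after = sum(len(sub) / len(samples) * entropy(sub)
              for sub in ([cls for sex, cls in samples if sex == v] for v in ("F", "M")))
gain = h_before - h_after                         # ~ 0.420

print(h_before, gain, gini(classes))
```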

Decision Trees

Decision Trees construction

Too lazy to write this out; watch this one instead: https://www.youtube.com/watch?v=_L39rN6gz7Y

![[Structure-of-the-decision-tree-DT-algorithm.png | center | 500]]

Steps

Parameters

  • Max Depth: maximum depth of the tree
    • Higher depth can lead to overfitting
  • Min Samples Split: minimum number of samples required to split an internal node
    • Prevents splitting on too few samples
  • Min Samples Leaf: minimum number of samples required at a leaf node
    • Prevents overfitting
  • Max Features: maximum number of features to consider when splitting a node
  • Criterion: the impurity measure, GINI or Entropy (see the sketch below)
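
A sketch of how the parameters above map onto scikit-learn's DecisionTreeClassifier (the toy dataset and the specific values are placeholders, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",       # impurity measure: "gini" or "entropy"
    max_depth=4,            # Max Depth: cap the tree depth to limit overfitting
    min_samples_split=10,   # Min Samples Split: don't split nodes with fewer than 10 samples
    min_samples_leaf=5,     # Min Samples Leaf: every leaf keeps at least 5 samples
    max_features="sqrt",    # Max Features: features considered at each split
    random_state=0,
)
tree.fit(X, y)
print(tree.get_depth(), tree.score(X, y))
```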

Random Forest

Pick random subsets $S_i$ of the training samples

  • For each subset, grow a full tree
  • Given a new data point X
  • Classify X using each of the trees
  • Combine the predictions, for example with a majority vote: the class predicted most often wins
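
The same idea with scikit-learn's RandomForestClassifier, which handles the bootstrap subsets and the majority vote internally (the dataset is again a placeholder):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of bootstrapped trees
    max_features="sqrt",  # random feature subset considered at each split
    bootstrap=True,       # each tree sees a random sample (with replacement) of the rows
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:5]))   # majority vote over the 100 trees
```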

Evaluation

A decision tree is mainly used as a classifier, so we can evaluate it with Precision/Recall, F-Score, and a Confusion Matrix: $\text{Precision} = \frac{TP}{TP + FP}$, $\text{Recall} = \frac{TP}{TP + FN}$, $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

![[0 Lj8eoN2JB6vmNYT5.jpg | center | 600]]

Recommender Systems

An information filtering system that implicitly or explicitly captures a user's preferences and generates a ranked list of items that might be of interest to the user.

Types of recommender systems

Collaborative

Content-Based

Knowledge-Based

Hybrid

Different types of Recommender Systems

Evaluation of Recommender Systems

Business Applications of Analytics