Anomaly Detection
goal: determine if X(test) is anomalous
model the features as a probability distribution; if \(P(x) < \delta\) for some threshold \(\delta\), flag \(x\) as anomalous
P(x) is the product of the individual feature densities: \(P(x) = \prod_{j=1}^{n} P(x_j; \mu_j, \sigma_j^2)\), where \(x_j\) is the jth feature of \(x\)
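A minimal numpy sketch of this density estimate; the data and the threshold value here are hypothetical, chosen only to illustrate the \(P(x) < \delta\) test:

```python
import numpy as np

def estimate_gaussian(X):
    """Fit a univariate Gaussian to each feature column of X (m samples, n features)."""
    return X.mean(axis=0), X.var(axis=0)

def p(x, mu, var):
    """Density P(x) as the product of the per-feature Gaussian densities."""
    coeff = 1.0 / np.sqrt(2 * np.pi * var)
    exponent = -((x - mu) ** 2) / (2 * var)
    return np.prod(coeff * np.exp(exponent))

# Hypothetical training data: 4 good examples, 2 features
X = np.array([[1.0, 10.0], [1.2, 9.5], [0.9, 10.3], [1.1, 9.8]])
mu, var = estimate_gaussian(X)

delta = 1e-3  # placeholder threshold; in practice picked on the CV set
is_anomaly = p(np.array([5.0, 2.0]), mu, var) < delta
```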
sample splitting: the training set contains 60% of the good examples only. Cross-validation and test sets each get 20% of the good examples and 50% of the anomalous examples
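The split can be sketched as follows; the sample counts and the synthetic data are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
good = rng.normal(size=(10000, 2))         # hypothetical good (y=0) examples
anom = rng.normal(5.0, 1.0, size=(20, 2))  # the few known anomalies (y=1)

rng.shuffle(good)
n = len(good)
train = good[: int(0.6 * n)]                                      # 60% good only
cv_good = good[int(0.6 * n): int(0.8 * n)]                        # 20% good
test_good = good[int(0.8 * n):]                                   # 20% good
cv_anom, test_anom = anom[:len(anom) // 2], anom[len(anom) // 2:] # anomalies split 50/50
```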
choose anomaly detection over supervised learning when negative (y=0) examples vastly outnumber positive (y=1) examples
because future anomalies may look nothing like the ones seen so far, and are therefore hard for a supervised model to learn
choose anomaly for: fraud detection, manufacturing defects, monitoring failures
choose supervised learning for: email spam, cancer, weather
feature selection: the histogram of each feature should look roughly Gaussian
if not gaussian, try transformations such as log or ^(1/n)
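A quick sketch of such transformations; the skewed (lognormal) data here is made up to show the effect:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # hypothetical right-skewed feature

x_log = np.log(x)       # log transform; use np.log(x + c) if x can be <= 0
x_root = x ** (1 / 3)   # alternative: n-th root, here n = 3
```

After `np.log`, this particular data is exactly standard normal, so its histogram looks Gaussian.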
error analysis: for anomalies in the CV set that the model misses, try adding a new feature that takes on unusually large or small values for them
e.g. to detect a program stuck in an infinite loop, invent a new feature CPU/IO (ratio), instead of or in addition to CPU and IO as two independent features
use multi-variate gaussian to automatically detect correlation between features instead of inventing a new feature
Multi-Variate Gaussian
for features that are correlated (positively or negatively), try using multi-variate gaussian
vs. treating each feature as an independent Gaussian
\(\mu \in \mathbb{R}^n\) and \(\Sigma \in \mathbb{R}^{n \times n}\); \(\Sigma\) is the covariance matrix (same as in Principal Component Analysis)
P(x) for the independent-feature Gaussian model is just a special case of the multivariate Gaussian where \(\Sigma\) has the variances \(\sigma_j^2\) along the diagonal and all off-diagonal elements are zero
such models are also called axis-aligned because the ellipses in their contour plots have axes aligned with the coordinate axes (not at an angle)
can only be used when m (number of samples) > n (number of features)
because the formula requires \(\Sigma^{-1}\), and \(\Sigma\) is singular (non-invertible) when \(m \le n\)
in practice consider only if m >= 10 * n
computationally expensive compared to normal gaussian
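The multivariate density can be sketched directly from its formula; the toy data below (two strongly correlated features) is hypothetical:

```python
import numpy as np

def multivariate_gaussian(x, mu, Sigma):
    """Multivariate Gaussian density; requires Sigma to be invertible (hence m > n)."""
    n = len(mu)
    diff = x - mu
    norm = 1.0 / (np.power(2 * np.pi, n / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# Fitting: mu is the sample mean, Sigma the sample covariance (captures correlation)
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.9]])
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False, bias=True)
```

Because the two features move together, the off-diagonal entries of `Sigma` are large, and the contours of this density are tilted ellipses rather than axis-aligned ones.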
Recommender Systems
goal: predict \(y^{(i,j)}\) when \(r(i,j) = 0\), where
i: product (e.g. movie), j: user; \(r(i,j) = 1\) if user j has rated product i, i.e. \(y^{(i,j)}\) is defined
problem formulation: find \(\theta^{(j)}\), the parameter vector for user j, to make the prediction for product i as \((\theta^{(j)})^T x^{(i)}\)
this becomes a linear regression problem: learn \(\theta^{(j)}\) for user j from the ratings s/he has provided, i.e. the products i with \(r(i,j) = 1\)
content-based learning: when features are known, e.g. a movie's genre vector [romantic, action, drama, ...] forms its feature vector
Collaborative Filtering is used to learn feature values for a product i given theta values for user j
when both are unknown, theta is assigned random values to guess feature values for X(i), which is in turn used to get better theta values for the user
this process is repeated until it converges to get optimal values for both theta and X(i)
thus users collaborate to come up with feature values for product to make better predictions for everyone
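The back-and-forth above can be sketched as simultaneous gradient steps on both X and Theta; the toy ratings, learning rate, and regularization strength are all made up for this example:

```python
import numpy as np

# Hypothetical data: 4 products x 3 users. Y holds ratings, R marks r(i,j)=1.
Y = np.array([[5, 5, 0],
              [5, 0, 0],
              [0, 0, 5],
              [0, 5, 4]], dtype=float)
R = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1],
              [0, 1, 1]])

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))        # product feature vectors, random init
Theta = rng.normal(size=(3, 2))    # user parameter vectors, random init

alpha, lam = 0.05, 0.01            # learning rate / regularization, chosen for this toy
for _ in range(1000):
    E = (X @ Theta.T - Y) * R      # prediction error, only where a rating exists
    # tuple assignment evaluates both gradients before updating, so the step is simultaneous
    X, Theta = X - alpha * (E @ Theta + lam * X), Theta - alpha * (E.T @ X + lam * Theta)

pred = X @ Theta.T                 # reconstructed ratings after convergence
```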
Low Rank Matrix Factorization: the full matrix of predicted ratings is the vectorized product X * Theta', where
X is matrix of all products in rows with their feature values in columns
Theta is parameter values in columns for all users in rows
related products: products i and j are related if the distance \(\|x^{(i)} - x^{(j)}\|\) between their feature vectors is small
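A minimal sketch of this lookup; the learned feature vectors below are hypothetical:

```python
import numpy as np

# Hypothetical learned feature vectors for 4 products (one row each)
X = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.2, 0.7]])

def most_related(i, X):
    """Index of the product whose feature vector is closest to product i's."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf  # exclude the product itself
    return int(np.argmin(d))
```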
Mean Normalization: normalize ratings so that the mean rating of each product (across the users who rated it) is zero
then to derive the predicted rating for a user, add the mean back to the zero-based rating
this helps new users: instead of their predicted ratings being 0 for all products, they are assigned each product's mean rating
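Mean normalization in a few lines; the ratings matrix is hypothetical, with the third user having rated nothing:

```python
import numpy as np

Y = np.array([[5.0, 4.0, 0.0],
              [1.0, 2.0, 0.0]])   # ratings: 2 products x 3 users
R = np.array([[1, 1, 0],
              [1, 1, 0]])         # third user is new: no ratings at all

# Per-product mean over rated entries only
mu = (Y * R).sum(axis=1) / R.sum(axis=1)
Y_norm = (Y - mu[:, None]) * R    # zero-mean where rated, still 0 where unrated

# A prediction of 0 on the normalized scale maps back to the product's mean,
# so the brand-new user is predicted the average rating of each product.
pred_new_user = 0 + mu
```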