Anomaly Detection

  • goal: determine if X(test) is anomalous
  • convert features into probability and if it's less than some threshold P(X) < delta then it's anomalous
    • P(X) is the product of the individual feature probabilities: \(P(X) = \prod_{j=1}^{n} P(X_j)\), where \(X_j\) is the jth feature of \(X\)
  • sample splitting: the training set uses 60% of the examples, all good; the cross-validation and test sets each get 20% of the good examples plus 50% of the anomalous examples
  • choose anomaly detection over supervised learning when negative examples (y=0) vastly outnumber positive (y=1) examples
    • because with so few positive examples, future anomalies may look nothing like any seen so far
    • choose anomaly for: fraud detection, manufacturing defects, monitoring failures
    • choose supervised learning for: email spam, cancer, weather
  • feature selection: the histogram of each feature should look roughly Gaussian
    • if not Gaussian, try transformations such as log(x) or x^(1/n)
    • try adding a new feature that takes on unusually large or small values for the anomalies (found by error analysis on the CV set)
      • e.g. to detect a program in infinite loop, invent a new feature CPU/IO, instead of/in addition to, CPU and IO as two independent features
    • use multi-variate gaussian to automatically detect correlation between features instead of inventing a new feature
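The per-feature density estimation above can be sketched as follows (a minimal sketch: the toy data, the outlier point, and the threshold delta are made-up values for illustration):

```python
import numpy as np

def fit_gaussian(X):
    """Estimate per-feature mean and variance from the (all-good) training set."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)
    return mu, sigma2

def p(X, mu, sigma2):
    """P(x) = product over features j of the univariate Gaussian density P(x_j)."""
    densities = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod(axis=1)

# toy data: two features, training set contains only "good" examples
X_train = np.array([[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.0]])
mu, sigma2 = fit_gaussian(X_train)

# score a normal-looking point and an obvious outlier
scores = p(np.array([[1.0, 2.0], [5.0, -3.0]]), mu, sigma2)
delta = 1e-3                    # threshold would be tuned on the CV set
print(scores < delta)           # only the outlier falls below the threshold
```

In practice delta is picked by maximizing F1 score on the cross-validation set rather than chosen by hand.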

Multi-Variate Gaussian

  • for features that are correlated (positively or negatively), try using multi-variate gaussian
    • i.e. instead of treating each feature as an independent Gaussian
    • mu and Sigma are then R^n and R^(n×n); Sigma is the covariance matrix (same as in Principal Component Analysis)
  • P(x) under the independent per-feature model is just a special case of the multivariate Gaussian where Sigma has the variances (sigma_j^2) along the diagonal and all off-diagonal elements zero
    • such models are also called axis-aligned because the ellipses of their contour plots have axes aligned with the coordinate axes (not at an angle)
  • can only be used when m (number of samples) > n (number of features)
    • because the formula requires calculating inverse of Sigma, which is non-invertible if m <= n
    • in practice consider only if m >= 10 * n
  • computationally expensive compared to normal gaussian
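A minimal sketch of the multivariate Gaussian density, verifying that a diagonal Sigma reduces to the per-feature product from the previous section (mu, Sigma, and the test point are made-up values):

```python
import numpy as np

def multivariate_gaussian(X, mu, Sigma):
    """P(x) for the multivariate Gaussian; requires Sigma to be invertible (m > n)."""
    n = mu.shape[0]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / (np.power(2 * np.pi, n / 2) * np.sqrt(np.linalg.det(Sigma)))
    # quadratic form diff^T Sigma^{-1} diff, computed row-wise
    exponent = -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)
    return norm * np.exp(exponent)

mu = np.array([0.0, 0.0])
Sigma = np.diag([1.0, 4.0])      # axis-aligned case: off-diagonals are zero
x = np.array([[1.0, 2.0]])

p_multi = multivariate_gaussian(x, mu, Sigma)
# product of independent univariate Gaussians over the two features
var = np.diag(Sigma)
p_indep = np.prod(np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var), axis=1)
print(np.allclose(p_multi, p_indep))   # True: diagonal Sigma == independent model
```

With non-zero off-diagonal entries in Sigma the contours tilt, which is how the correlations between features get captured automatically.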

Recommender Systems

  • goal: predict \(y_{(i,j)}\) if \(r_{(i,j)} = 0\), where
    • i: product (e.g. movie), j: user; \(r_{(i,j)} = 1\) if user j has rated product i, so that \(y_{(i,j)}\) is defined
    • problem formulation: find \(\theta^{(j)}\), the parameter vector for user j, to predict the rating of product i as \((\theta^{(j)})^T x^{(i)}\)
  • this becomes a linear regression problem: learn \(\theta^{(j)}\) for user j from the ratings s/he has provided, i.e. the products i with \(r_{(i,j)} = 1\)
  • content-based recommendation: when product features are known, e.g. a movie's genre vector [romantic, action, drama, ...] forms its feature vector
  • Collaborative Filtering is used to learn feature values for a product i given theta values for user j
    • when both are unknown, theta is initialized to random values to estimate the feature values X(i), which are in turn used to get better theta values for the user
    • this process is repeated until it converges to get optimal values for both theta and X(i)
    • thus users collaborate to come up with feature values for product to make better predictions for everyone
  • Low Rank Matrix Factorization is a vectorized matrix multiplication of X * Theta' where
    • X is matrix of all products in rows with their feature values in columns
    • Theta is parameter values in columns for all users in rows
  • related products: products i and j are related if the distance \(\|x^{(i)} - x^{(j)}\|\) between their feature vectors is small
  • Mean Normalization: normalize ratings across all users so that mean rating of each product is zero
    • then, to derive the predicted rating for a user, add the mean back to the zero-centered prediction
    • this helps new users: instead of predicting a rating of 0 for every product, the model assigns them each product's mean rating
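Mean normalization and the new-user fallback can be sketched as follows (the ratings matrix, the 2-feature latent dimension, and the stand-in X/Theta matrices are all made up; in practice X and Theta come from minimizing the collaborative filtering cost):

```python
import numpy as np

# toy ratings matrix: rows = products, cols = users; NaN marks "not rated" (r(i,j) = 0)
Y = np.array([[5.0, 4.0, np.nan],
              [1.0, np.nan, 2.0],
              [np.nan, 5.0, 4.0]])

# mean normalization: subtract each product's mean over its observed ratings
mu = np.nanmean(Y, axis=1, keepdims=True)
Y_norm = Y - mu

# stand-ins for learned matrices (normally fit on Y_norm by collaborative filtering)
rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(3, 2))       # product features, one row per product
Theta = rng.normal(scale=0.1, size=(3, 2))   # user parameters, one row per user

# low rank matrix factorization prediction: add the per-product mean back
pred = X @ Theta.T + mu

# a brand-new user with no ratings learns theta ~ 0, so their predictions
# fall back to each product's mean rating rather than 0
theta_new = np.zeros(2)
pred_new = X @ theta_new + mu.ravel()
print(pred_new)   # equals mu for every product
```

Without mean normalization, minimizing the regularized cost for a user with no ratings drives their theta to zero and every prediction to 0, which is why the mean-based fallback matters.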