Anomaly Detection

  • goal: determine if X(test) is anomalous
  • convert features into probability and if it's less than some threshold P(X) < delta then it's anomalous
    • P(X) is the product of the individual feature probabilities: \(P(X) = \prod_{j=1}^{n} P(X_j)\), where \(X_j\) is the jth feature of \(X\)
  • sample splitting: the training set uses 60% of the examples, all good; the cross-validation and test sets each get 20% of the good examples plus 50% of the anomalous examples
  • choose anomaly detection over supervised learning when negative examples (y=0) vastly outnumber positive (y=1) examples
    • because with so few positive examples, future anomalies may look nothing like any seen so far
    • choose anomaly for: fraud detection, manufacturing defects, monitoring failures
    • choose supervised learning for: email spam, cancer, weather
  • feature selection: the histogram of each feature should look roughly Gaussian
    • if not Gaussian, try transformations such as log(x) or x^(1/n)
    • try adding a new feature that takes on unusually large or small values for the anomalies (found by error analysis on the CV set)
      • e.g. to detect a program in infinite loop, invent a new feature CPU/IO, instead of/in addition to, CPU and IO as two independent features
    • use multi-variate gaussian to automatically detect correlation between features instead of inventing a new feature
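The per-feature density estimation above can be sketched as follows (a minimal sketch: the toy data, the outlier point, and the threshold delta are made-up values for illustration):

```python
import numpy as np

def fit_gaussian(X):
    """Estimate per-feature mean and variance from the (all-good) training set."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)
    return mu, sigma2

def p(X, mu, sigma2):
    """P(x) = product over features j of the univariate Gaussian density P(x_j)."""
    densities = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod(axis=1)

# toy data: two features, training set contains only "good" examples
X_train = np.array([[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.0]])
mu, sigma2 = fit_gaussian(X_train)

# score a normal-looking point and an obvious outlier
scores = p(np.array([[1.0, 2.0], [5.0, -3.0]]), mu, sigma2)
delta = 1e-3                    # threshold would be tuned on the CV set
print(scores < delta)           # only the outlier falls below the threshold
```

In practice delta is picked by maximizing F1 score on the cross-validation set rather than chosen by hand.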

Multi-Variate Gaussian

  • for features that are correlated (positively or negatively), try using multi-variate gaussian
    • i.e. instead of treating each feature as an independent Gaussian
    • mu and Sigma are then R^n and R^(n×n); Sigma is the covariance matrix (same as in Principal Component Analysis)
  • P(x) under the independent per-feature model is just a special case of the multivariate Gaussian where Sigma has the variances (sigma_j^2) along the diagonal and all off-diagonal elements zero
    • such models are also called axis-aligned because the ellipses of their contour plots have axes aligned with the coordinate axes (not at an angle)
  • can only be used when m (number of samples) > n (number of features)
    • because the formula requires calculating inverse of Sigma, which is non-invertible if m <= n
    • in practice consider only if m >= 10 * n
  • computationally expensive compared to normal gaussian
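A minimal sketch of the multivariate Gaussian density, verifying that a diagonal Sigma reduces to the per-feature product from the previous section (mu, Sigma, and the test point are made-up values):

```python
import numpy as np

def multivariate_gaussian(X, mu, Sigma):
    """P(x) for the multivariate Gaussian; requires Sigma to be invertible (m > n)."""
    n = mu.shape[0]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / (np.power(2 * np.pi, n / 2) * np.sqrt(np.linalg.det(Sigma)))
    # quadratic form diff^T Sigma^{-1} diff, computed row-wise
    exponent = -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)
    return norm * np.exp(exponent)

mu = np.array([0.0, 0.0])
Sigma = np.diag([1.0, 4.0])      # axis-aligned case: off-diagonals are zero
x = np.array([[1.0, 2.0]])

p_multi = multivariate_gaussian(x, mu, Sigma)
# product of independent univariate Gaussians over the two features
var = np.diag(Sigma)
p_indep = np.prod(np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var), axis=1)
print(np.allclose(p_multi, p_indep))   # True: diagonal Sigma == independent model
```

With non-zero off-diagonal entries in Sigma the contours tilt, which is how the correlations between features get captured automatically.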

Recommender Systems

  • goal: predict \(y_{(i,j)}\) if \(r_{(i,j)} = 0\), where
    • i: product (e.g. movie), j: user; \(r_{(i,j)} = 1\) if user j has rated product i, so that \(y_{(i,j)}\) is defined
    • problem formulation: find \(\theta^{(j)}\), the parameter vector for user j, to predict the rating of product i as \((\theta^{(j)})^T x^{(i)}\)
  • this becomes a linear regression problem: learn \(\theta^{(j)}\) for user j from the ratings s/he has provided, i.e. the products i with \(r_{(i,j)} = 1\)
  • content-based recommendation: when product features are known, e.g. a movie's genre vector [romantic, action, drama, ...] forms its feature vector
  • Collaborative Filtering is used to learn feature values for a product i given theta values for user j
    • when both are unknown, theta is initialized to random values to estimate the feature values X(i), which are in turn used to get better theta values for the user
    • this process is repeated until it converges to get optimal values for both theta and X(i)
    • thus users collaborate to come up with feature values for product to make better predictions for everyone
  • Low Rank Matrix Factorization is a vectorized matrix multiplication of X * Theta' where
    • X is matrix of all products in rows with their feature values in columns
    • Theta is parameter values in columns for all users in rows
  • related products: products i and j are related if the distance \(\|x^{(i)} - x^{(j)}\|\) between their feature vectors is small
  • Mean Normalization: normalize ratings across all users so that mean rating of each product is zero
    • then, to derive the predicted rating for a user, add the mean back to the zero-centered prediction
    • this helps new users: instead of predicting a rating of 0 for every product, the model assigns them each product's mean rating
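Mean normalization and the new-user fallback can be sketched as follows (the ratings matrix, the 2-feature latent dimension, and the stand-in X/Theta matrices are all made up; in practice X and Theta come from minimizing the collaborative filtering cost):

```python
import numpy as np

# toy ratings matrix: rows = products, cols = users; NaN marks "not rated" (r(i,j) = 0)
Y = np.array([[5.0, 4.0, np.nan],
              [1.0, np.nan, 2.0],
              [np.nan, 5.0, 4.0]])

# mean normalization: subtract each product's mean over its observed ratings
mu = np.nanmean(Y, axis=1, keepdims=True)
Y_norm = Y - mu

# stand-ins for learned matrices (normally fit on Y_norm by collaborative filtering)
rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(3, 2))       # product features, one row per product
Theta = rng.normal(scale=0.1, size=(3, 2))   # user parameters, one row per user

# low rank matrix factorization prediction: add the per-product mean back
pred = X @ Theta.T + mu

# a brand-new user with no ratings learns theta ~ 0, so their predictions
# fall back to each product's mean rating rather than 0
theta_new = np.zeros(2)
pred_new = X @ theta_new + mu.ravel()
print(pred_new)   # equals mu for every product
```

Without mean normalization, minimizing the regularized cost for a user with no ratings drives their theta to zero and every prediction to 0, which is why the mean-based fallback matters.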