
Feature Engineering

  • Data preparation and cleaning
    • Using domain knowledge to create/curate features for ML algorithms
    • Data Cleaning: handle the following. Generally, if only a few rows are affected, remove those rows; if a column is largely affected, remove the entire column
      • outliers: use visualization (box-and-whisker plots) or a log transform to find outliers
      • missing values: some algorithms can't work with missing values; in that case impute with the mean/median/mode
      • skewness: many techniques assume a normal distribution; use transformations such as square-root or log to reduce skew
    • Scaling:
      • Features may have greater/lesser effect on model due to scale.
      • Often numerical (not categorical) features are scaled to the range 0-1
      • Scalers: MinMax, Standard, Normalizer
    • Encoding: Convert non-numeric types to numeric.
  • Feature Selection/Extraction
    • Selection keeps a subset of the original features, while Extraction creates new ones
    • used when the number of features is very large
  • Techniques
    1. Filter: Correlation, Variance threshold, ANOVA, Information Gain
    2. Wrapper: e.g. recursive feature elimination, forward/backward selection
    3. Embedded: e.g. L1 (Lasso) regularization, tree-based feature importances
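
The scaling and encoding steps above can be sketched with scikit-learn; the feature values and categories below are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder

# Hypothetical numeric feature: house sizes in sq. feet
sizes = np.array([[1200.0], [1850.0], [900.0]])

# MinMax: rescale the feature to the 0-1 range
minmax = MinMaxScaler().fit_transform(sizes)

# Standard: zero mean, unit variance
standard = StandardScaler().fit_transform(sizes)

# Encoding: convert a non-numeric (categorical) feature into numeric columns
colors = np.array([["red"], ["blue"], ["red"]])
onehot = OneHotEncoder().fit_transform(colors).toarray()

print(minmax.ravel())   # smallest size maps to 0, largest to 1
print(onehot)           # one 0/1 column per category
```

Note the scalers only touch numerical features; categorical ones go through an encoder instead.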

Feature/Dimension Reduction

  • reduce the number of dimensions; most useful when features are correlated
  • helps speed up the algorithm
  • helps with data visualization (reduce the data down to 2 or 3 dimensions)
  • techniques
    • Pearson correlation: select features with moderate/strong correlation with the target column
    • variance threshold: set a threshold; e.g. zero variance means a feature has the same value in every sample => not very useful
    • LDA
    • PCA
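
As a sketch, the variance-threshold filter from the list above can be written in plain NumPy (the data matrix is made up for illustration):

```python
import numpy as np

# Hypothetical data: 5 samples, 3 features; the last feature is constant
X = np.array([[1.0, 10.0, 7.0],
              [2.0,  8.0, 7.0],
              [3.0, 11.0, 7.0],
              [4.0,  7.0, 7.0],
              [5.0, 12.0, 7.0]])

# Variance threshold: keep only features whose variance exceeds the cutoff
threshold = 0.0
keep = X.var(axis=0) > threshold
X_reduced = X[:, keep]

print(keep)             # the constant (zero-variance) column is dropped
print(X_reduced.shape)  # (5, 2)
```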

Linear Discriminant Analysis LDA

  • aims to find the projection that best separates the classes in the data
  • supervised: requires class labels
  • works by first calculating the mean and covariance matrix for each class, then the between-class and within-class scatter matrices; the goal is a projection that maximizes the ratio of between-class scatter to within-class scatter
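
A minimal sketch of LDA as a supervised reducer, using scikit-learn's implementation on its bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features, 3 classes

# With 3 classes, LDA can produce at most (3 - 1) = 2 discriminant axes
lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit_transform(X, y)    # labels y are required: LDA is supervised

print(X_proj.shape)  # (150, 2)
```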

Principal Component Analysis PCA

  • PCA aims to find the directions of maximum variance in the data
  • unsupervised: does not use labels
  • works by first centering the data around its mean and then finding the eigenvectors and eigenvalues of the covariance matrix
    • eigenvector represents direction of maximum variance; eigenvalue represents the amount.
    • eigenvectors are then used to project data onto a lower dimensional space
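
The steps above (center, eigendecompose the covariance matrix, project) can be sketched directly in NumPy; the data here is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # hypothetical data: 100 samples, 3 features

# 1. Center the data around its mean
Xc = X - X.mean(axis=0)

# 2. Eigenvectors/eigenvalues of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices

# 3. Sort by eigenvalue (amount of variance), keep the top-k eigenvectors
order = np.argsort(eigvals)[::-1]
k = 2
W = eigvecs[:, order[:k]]

# 4. Project the data onto the k-dimensional subspace
Z = Xc @ W
print(Z.shape)  # (100, 2)
```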

Coursera

  • Reduce n dimension to k dimension
  • involves finding k vectors onto which to project the data and minimize the squared projection error
    • squared-projection-error = (1/m) * sum_i ||x(i) - xapprox(i)||^2
  • projection is the position on the k dimension surface which is orthogonal
    • e.g. a 2-D point when projected on a 1-D line is the point on the line which is at the shortest distance from the original point
  • Algorithm
  • Pre-processing
    • mean normalization: ensure every feature has zero mean
    • optionally feature scaling, i.e. bring all features to comparable magnitude
      • e.g. size of a house in sq. feet and number of bedrooms are on different scales
      • usually done as x(j) = (x(j) - Mu(j))/s(j), where x(j) is feature j of x, Mu(j) is its average and s(j) is some measure of spread (max - min, or usually the standard deviation)
  • Compute the covariance matrix Sigma
  • Compute U the eigenvectors of Sigma as: [U, S, V] = svd(Sigma)
  • Compute Ureduced as first k columns U(:, 1:k)
  • Compute z as Ureduced' * x
  • Reconstruction goes from the compressed k dimensions back to the original n dimensions via xapprox = Ureduced * z
  • Choosing right value of k for n dimensions: find smallest k that retains 99% of variance
    • i.e. after reducing to k dimensions, less than 1% of the variance is lost
    • the svd function returns a diagonal matrix S as one of its return values, which can be used to efficiently calculate the variance retained for each k: sum(i=1..k) S(i,i) / sum(i=1..n) S(i,i)
  • the mapping x(i) -> z(i), where x is n-dimensional and z is the reduced k-dimensional vector, is defined by Ureduced and should be derived from the training set only
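
The whole SVD-based pipeline above (mean normalize, compute Sigma, svd, pick the smallest k retaining 99% of variance, project, reconstruct) can be sketched in NumPy; the data is random and for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))   # m = 50 samples, n = 4 features (hypothetical)
m, n = X.shape

# Pre-processing: mean normalization (feature scaling omitted here)
mu = X.mean(axis=0)
Xn = X - mu

# Covariance matrix Sigma and its SVD
Sigma = (Xn.T @ Xn) / m
U, S, Vt = np.linalg.svd(Sigma)

# Smallest k retaining >= 99% of variance, from the diagonal values in S
retained = np.cumsum(S) / S.sum()
k = int(np.searchsorted(retained, 0.99)) + 1

# Project: z = Ureduced' * x, and reconstruct: xapprox = Ureduced * z
Ureduced = U[:, :k]
Z = Xn @ Ureduced
X_approx = Z @ Ureduced.T + mu

print(k, Z.shape)
```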

Artificial Data synthesis

  • helps an ML algorithm by generating more training data
  • rule of thumb: ensure the algorithm is low bias/high variance (by plotting learning curves) before adding more samples
    • increase the number of features/hidden layers first to get to a low-bias algorithm
  • general ways to get more data
    • artificial data synthesis
    • collect/label yourself
    • crowd source
  • ways to build artificial data
    • Random samples: e.g. for OCR, use a variety of fonts to generate training characters
    • Amplify: e.g. add distortions, like adding background noise to speech or distorting characters slightly
      • doesn't help if the distortions are purely random/meaningless, like changing the brightness of individual pixels or adding random pixels
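
A toy sketch of the "amplify" idea for speech data: synthesize extra samples by mixing background noise into a clean clip (the signal, sample rate and noise level are all made up):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical clean clip: a 440 Hz tone, 1 second at 8 kHz
clean = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 8000))

def amplify(clip, noise_level=0.05, n_copies=3):
    """Synthesize extra training samples by mixing noise into a clean clip."""
    return [clip + noise_level * rng.normal(size=clip.shape)
            for _ in range(n_copies)]

augmented = amplify(clean)
print(len(augmented), augmented[0].shape)  # 3 new clips, same length as original
```

In practice the added noise should resemble distortions seen at test time (cafe chatter, phone-line hiss), per the note above that meaningless random perturbations don't help.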

Image processing

  • Convolution: used in image processing to reduce image noise and enhance important image features
    • works similarly to applying an image filter
  • Pooling: used for compressing images; works well with convolution
    • max pooling works by keeping the pixel with the highest numerical value among a pool of neighboring pixels
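
Max pooling as described above can be sketched in a few lines of NumPy (the pool size and image are illustrative):

```python
import numpy as np

def max_pool(img, size=2):
    """Non-overlapping max pooling: keep the largest value in each block."""
    h, w = img.shape
    h, w = h - h % size, w - w % size   # trim to a multiple of the pool size
    blocks = img[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))      # max over each size x size block

img = np.array([[1, 3, 2, 1],
                [4, 2, 0, 5],
                [7, 8, 1, 2],
                [6, 9, 3, 0]])
print(max_pool(img))
# [[4 5]
#  [9 3]]
```

Each 2x2 neighborhood collapses to its maximum, compressing the 4x4 image to 2x2.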