
Feature Engineering

  • Data preparation and cleaning
    • Using domain knowledge to create/curate features for ML algorithms
    • Data Cleaning: handle the following. Generally, if only a few rows are affected, remove those rows; if a column is largely affected, remove the entire column
      • outliers: use visualization (box-and-whisker plots) or a log transform to find outliers
      • missing values: some algorithms can't work with missing values; in that case impute with the mean/median/mode
      • skewness: many techniques assume a normal distribution; use transformations such as square-root or log to reduce skew
    • Scaling:
      • Features may have greater/lesser effect on model due to scale.
      • Often numerical (not categorical) features are scaled to the range 0-1
      • Scalers: MinMax, Standard, Normalizer
    • Encoding: Convert non-numeric types to numeric.
  • Feature Selection/Extraction
    • Selection keeps a subset of the original features, while Extraction creates new ones
    • used when the number of features is very large
  • Techniques
    1. Filter: Correlation, Variance threshold, ANOVA, Information Gain
    2. Wrapper: e.g. recursive feature elimination, forward/backward selection
    3. Embedded: e.g. L1 (Lasso) regularization, tree-based feature importances
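
The scaling and encoding steps above can be sketched with scikit-learn; the feature values and categories below are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder

# Hypothetical numeric feature: house sizes in sq. feet
sizes = np.array([[1200.0], [1850.0], [900.0]])

# MinMax: rescale the feature to the 0-1 range
minmax = MinMaxScaler().fit_transform(sizes)

# Standard: zero mean, unit variance
standard = StandardScaler().fit_transform(sizes)

# Encoding: convert a non-numeric (categorical) feature into numeric columns
colors = np.array([["red"], ["blue"], ["red"]])
onehot = OneHotEncoder().fit_transform(colors).toarray()

print(minmax.ravel())   # smallest size maps to 0, largest to 1
print(onehot)           # one 0/1 column per category
```

Note the scalers only touch numerical features; categorical ones go through an encoder instead.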

Feature/Dimension Reduction

  • reduce the number of dimensions; most useful when features are correlated
  • helps speed up the algorithm
  • helps with data visualization (reduce the data down to 2 or 3 dimensions)
  • techniques
    • Pearson correlation: select features with moderate/strong correlation with the target column
    • variance threshold: set a threshold; e.g. zero variance means a feature has the same value in every sample => not very useful
    • LDA
    • PCA
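
As a sketch, the variance-threshold filter from the list above can be written in plain NumPy (the data matrix is made up for illustration):

```python
import numpy as np

# Hypothetical data: 5 samples, 3 features; the last feature is constant
X = np.array([[1.0, 10.0, 7.0],
              [2.0,  8.0, 7.0],
              [3.0, 11.0, 7.0],
              [4.0,  7.0, 7.0],
              [5.0, 12.0, 7.0]])

# Variance threshold: keep only features whose variance exceeds the cutoff
threshold = 0.0
keep = X.var(axis=0) > threshold
X_reduced = X[:, keep]

print(keep)             # the constant (zero-variance) column is dropped
print(X_reduced.shape)  # (5, 2)
```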

Linear Discriminant Analysis LDA

  • aims to find the projection that best separates the classes in the data
  • supervised: requires class labels
  • works by first calculating the mean and covariance matrix for each class, then the between-class and within-class scatter matrices; the goal is a projection that maximizes the ratio of between-class scatter to within-class scatter
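
A minimal sketch of LDA as a supervised reducer, using scikit-learn's implementation on its bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features, 3 classes

# With 3 classes, LDA can produce at most (3 - 1) = 2 discriminant axes
lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit_transform(X, y)    # labels y are required: LDA is supervised

print(X_proj.shape)  # (150, 2)
```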

Principal Component Analysis PCA

  • PCA aims to find the directions of maximum variance in the data
  • unsupervised: does not use labels
  • works by first centering the data around its mean and then finding the eigenvectors and eigenvalues of the covariance matrix
    • eigenvector represents direction of maximum variance; eigenvalue represents the amount.
    • eigenvectors are then used to project data onto a lower dimensional space
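
The steps above (center, eigendecompose the covariance matrix, project) can be sketched directly in NumPy; the data here is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # hypothetical data: 100 samples, 3 features

# 1. Center the data around its mean
Xc = X - X.mean(axis=0)

# 2. Eigenvectors/eigenvalues of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices

# 3. Sort by eigenvalue (amount of variance), keep the top-k eigenvectors
order = np.argsort(eigvals)[::-1]
k = 2
W = eigvecs[:, order[:k]]

# 4. Project the data onto the k-dimensional subspace
Z = Xc @ W
print(Z.shape)  # (100, 2)
```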

Coursera

  • Reduce n dimension to k dimension
  • involves finding k vectors onto which to project the data and minimize the squared projection error
    • squared-projection-error = (1/m) * sum_i ||x(i) - xapprox(i)||^2
  • projection is the position on the k dimension surface which is orthogonal
    • e.g. a 2-D point when projected on a 1-D line is the point on the line which is at the shortest distance from the original point
  • Algorithm
  • Pre-processing
    • mean normalization: ensure every feature has zero mean
    • optionally feature scaling, i.e. bring all features to comparable magnitude
      • e.g. size of a house in sq. feet and number of bedrooms are on different scales
      • usually done as x(j) = (x(j) - Mu(j))/s(j), where x(j) is feature j of x, Mu(j) is its average and s(j) is some measure of spread (max - min, or usually the standard deviation)
  • Compute the covariance matrix Sigma
  • Compute U the eigenvectors of Sigma as: [U, S, V] = svd(Sigma)
  • Compute Ureduced as first k columns U(:, 1:k)
  • Compute z as Ureduced' * x
  • Reconstruction goes from the compressed k dimensions back to the original n dimensions via xapprox = Ureduced * z
  • Choosing right value of k for n dimensions: find smallest k that retains 99% of variance
    • i.e. after reducing to k dimensions, less than 1% of the variance is lost
    • the svd function returns a diagonal matrix S as one of its return values, which can be used to efficiently calculate the variance retained for each k: sum(i=1..k) S(i,i) / sum(i=1..n) S(i,i)
  • the mapping x(i) -> z(i), where x is n-dimensional and z is the reduced k-dimensional vector, is defined by Ureduced and should be derived from the training set only
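
The whole SVD-based pipeline above (mean normalize, compute Sigma, svd, pick the smallest k retaining 99% of variance, project, reconstruct) can be sketched in NumPy; the data is random and for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))   # m = 50 samples, n = 4 features (hypothetical)
m, n = X.shape

# Pre-processing: mean normalization (feature scaling omitted here)
mu = X.mean(axis=0)
Xn = X - mu

# Covariance matrix Sigma and its SVD
Sigma = (Xn.T @ Xn) / m
U, S, Vt = np.linalg.svd(Sigma)

# Smallest k retaining >= 99% of variance, from the diagonal values in S
retained = np.cumsum(S) / S.sum()
k = int(np.searchsorted(retained, 0.99)) + 1

# Project: z = Ureduced' * x, and reconstruct: xapprox = Ureduced * z
Ureduced = U[:, :k]
Z = Xn @ Ureduced
X_approx = Z @ Ureduced.T + mu

print(k, Z.shape)
```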

Artificial Data synthesis

  • helps an ML algorithm by generating more training data
  • rule of thumb: ensure the algorithm is low bias/high variance (by plotting learning curves) before adding more samples
    • increase the number of features/hidden layers first to get to a low-bias algorithm
  • general ways to get more data
    • artificial data synthesis
    • collect/label yourself
    • crowd source
  • ways to build artificial data
    • Random samples: e.g. for OCR, use a variety of fonts to generate training characters
    • Amplify: e.g. add distortions, like adding background noise to speech or distorting characters slightly
      • doesn't help if the distortions are purely random/meaningless, like changing the brightness of individual pixels or adding random pixels
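
A toy sketch of the "amplify" idea for speech data: synthesize extra samples by mixing background noise into a clean clip (the signal, sample rate and noise level are all made up):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical clean clip: a 440 Hz tone, 1 second at 8 kHz
clean = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 8000))

def amplify(clip, noise_level=0.05, n_copies=3):
    """Synthesize extra training samples by mixing noise into a clean clip."""
    return [clip + noise_level * rng.normal(size=clip.shape)
            for _ in range(n_copies)]

augmented = amplify(clean)
print(len(augmented), augmented[0].shape)  # 3 new clips, same length as original
```

In practice the added noise should resemble distortions seen at test time (cafe chatter, phone-line hiss), per the note above that meaningless random perturbations don't help.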

Image processing

  • Convolution: used in image processing to reduce image noise and enhance important image features
    • works similarly to applying an image filter
  • Pooling: used for compressing images; works well with convolution
    • max pooling works by keeping the pixel with the highest numerical value among a pool of neighboring pixels
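
Max pooling as described above can be sketched in a few lines of NumPy (the pool size and image are illustrative):

```python
import numpy as np

def max_pool(img, size=2):
    """Non-overlapping max pooling: keep the largest value in each block."""
    h, w = img.shape
    h, w = h - h % size, w - w % size   # trim to a multiple of the pool size
    blocks = img[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))      # max over each size x size block

img = np.array([[1, 3, 2, 1],
                [4, 2, 0, 5],
                [7, 8, 1, 2],
                [6, 9, 3, 0]])
print(max_pool(img))
# [[4 5]
#  [9 3]]
```

Each 2x2 neighborhood collapses to its maximum, compressing the 4x4 image to 2x2.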