noise: An unknown value, other than the known features (\(X\)), that contributes to the target (\(y\)). For example, in school grading, the teacher's mood may affect a student's grade, but it's not usually captured as a feature.
polynomial model: a polynomial equation of some degree. A one-degree polynomial is a straight line, while higher-degree polynomials can represent complex shapes.
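A minimal sketch of the idea, in pure Python (the coefficient values are illustrative): a degree-1 polynomial is a line, while a degree-3 polynomial can bend to follow curved data.

```python
def poly(coeffs, x):
    """Evaluate c0 + c1*x + c2*x^2 + ... for coefficients in ascending order."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

line = [1.0, 2.0]               # degree 1: y = 1 + 2x, a straight line
cubic = [1.0, 0.0, -3.0, 0.5]   # degree 3: can represent more complex shapes

print(poly(line, 2.0))   # → 5.0
print(poly(cubic, 2.0))  # 1 + 0 - 12 + 4 = -7.0
```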
decision-tree models: a decision tree predicts a constant value for a range of \(X\) values. Each split introduces another range that predicts a different \(y\) value. A complex decision-tree model has many splits and can fit complex data.
regularization: favoring simpler models over complex ones, i.e. smoother, simpler functions, to reduce overfitting to noise.
inductive bias: the built-in assumptions of a model family; different families favor different kinds of fits. For example, decision trees favor step-like functions while polynomials favor smooth curves.
Bayes error rate: the irreducible error rate that cannot be reduced to zero because of noise not captured by the features.
Model Complexity: the degree in polynomial models, or the number of splits in decision-tree models. Models with higher complexity can fit training data better than simpler models. If the training set is small, complex models tend to overfit.
Overfitting: when a (usually complex) model fits the training data too well, including the noise, and thus fails to capture the true structure of the data. Overfitting occurs with complex models and a small amount of training data. Testing error is much higher than training error.
underfitting: when a model is too simple, such as a first-degree polynomial (a straight line with a slope) that fails to capture a complex trend in the data. Both training and testing errors are large. However, it's hard to tell whether a large error is due to underfitting or noise. With different training datasets, the resulting models tend to be very similar.
high variance v/s high bias: with different sets of training data, a model that overfits will vary greatly depending on the training data, hence the statistical name high variance. Underfitting models, by contrast, make the same kind of error regardless of training data (for example, a one-degree polynomial line will have only a slightly different slope) but will never capture the true shape of the data, hence high bias.
cost function: the function minimized during training, e.g. the mean-squared-error function.
mean squared error: the average of the squared differences between actual and predicted values: \(\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2\)
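The formula above translates directly into a few lines of pure Python (a sketch; the function name is illustrative):

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

# Predictions off by 1 and by 3 give MSE of (1 + 9) / 2 = 5.0
print(mean_squared_error([2.0, 4.0], [3.0, 1.0]))  # → 5.0
```

Squaring makes all errors positive (so they cannot cancel) and penalizes large errors more heavily than small ones.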
learning rate, alpha: the multiplier that scales each gradient step applied to the model parameters, controlling how quickly the cost function converges.
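A minimal gradient-descent sketch showing alpha's role (the function being minimized, \((w - 3)^2\), and all values are illustrative):

```python
def gradient(w):
    """Derivative of the cost (w - 3)^2 with respect to w."""
    return 2.0 * (w - 3.0)

alpha = 0.1  # learning rate: scales each update step
w = 0.0
for _ in range(100):
    w -= alpha * gradient(w)  # step toward lower cost

print(round(w, 4))  # converges toward the minimum at w = 3
```

Too small an alpha converges slowly; too large an alpha can overshoot and diverge.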
regularization, lambda: the multiplier on the penalty term added to the cost function; a higher lambda penalizes large weights more strongly, shrinking the influence of individual features.
hyperparameter: a parameter that controls the learning process (e.g. alpha), as opposed to the parameters learned by training the model itself.
Loss function: measures how far off predicted values are from actual values, for example \(\sum_{i} (y_i - f(x_i))^2\) over all data points (squaring prevents positive and negative errors from cancelling).
gradient boosting: gradually builds a stronger model by combining several weak learners, each new learner correcting the errors of the ensemble so far.
cross-validation: the practice of keeping training and validation data separate, i.e. never evaluating the model on the same data it was trained on; typically the data is split into k folds, and each fold takes a turn as the validation set.
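A minimal sketch of k-fold splitting in pure Python (the data size and k are illustrative; real libraries also shuffle and handle uneven folds):

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    fold_size = n // k
    indices = list(range(n))
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test = indices[start:stop]               # this fold validates
        train = indices[:start] + indices[stop:]  # the rest trains
        yield train, test

for train, test in k_fold_indices(6, 3):
    print(train, test)
```

Every data point is used for validation exactly once, giving a more reliable error estimate than a single train/test split.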
Sigmoid Function: used in logistic regression to map any real value into the range (0, 1): \(\frac{1}{1 + e^{-x}}\)
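The formula in a few lines of pure Python (a sketch; the function name is illustrative):

```python
import math

def sigmoid(x):
    """Squash any real number into (0, 1): 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))    # → 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```

Because the output lives in (0, 1), it can be read as a probability of the positive class.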
Manhattan Distance: in contrast to Euclidean (straight-line) distance, it is the sum of absolute coordinate differences: \(|X_2 - X_1| + |Y_2 - Y_1|\)
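A sketch contrasting the two distances in pure Python (function names are illustrative):

```python
import math

def manhattan(p, q):
    """Sum of absolute coordinate differences (grid-walking distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(manhattan((0, 0), (3, 4)))  # → 7
print(euclidean((0, 0), (3, 4)))  # → 5.0
```

Manhattan distance is never shorter than Euclidean, since walking along grid axes cannot beat the straight line.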
Back propagation: using the loss function's result to compute how much each weight contributed to the error, then adjusting the weights to narrow the loss in the next cycle.
one-hot encoding: converting categorical data into vectors of 1s and 0s, so most modeling algorithms can work with it.
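A minimal sketch in pure Python (the category names are illustrative):

```python
def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the value's position."""
    return [1 if value == c else 0 for c in categories]

colors = ["red", "green", "blue"]
print(one_hot("green", colors))  # → [0, 1, 0]
print(one_hot("blue", colors))   # → [0, 0, 1]
```

Unlike assigning integers (red=0, green=1, blue=2), one-hot vectors do not impose a false ordering on the categories.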
Reinforcement learning: algorithms that learn an optimal policy to maximize return while interacting with an environment.