AI/ML

Training virtually any machine learning model, whether it predicts house prices or the next word, comes down to the same task: optimization.

The Core Algorithm: Most techniques for learning weights, and nearly all neural network training, rely on Gradient Descent or one of its variants. The idea is to measure how "wrong" the model is (the Loss Function) and then use calculus (derivatives) to find the direction in which each weight should change to make the model less wrong. A separate setting, the learning rate, controls how large each step is.
The Principle: Minimizing loss is the central mathematical principle connecting a simple linear regression model, a technique more than two centuries old, to a massive LLM today.
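To make the loop of "measure the loss, follow the derivative" concrete, here is a minimal sketch of gradient descent fitting a one-variable linear regression. The function names (`mse_loss`, `gradient_step`, `fit`), the toy data, the learning rate, and the step count are all assumptions chosen for illustration, and mean squared error is just one common choice of loss function.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on the data."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


def gradient_step(w, b, xs, ys, learning_rate):
    """Apply one gradient descent update to w and b."""
    n = len(xs)
    # Partial derivatives of the MSE loss with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Move against the gradient; the learning rate scales the step size.
    return w - learning_rate * grad_w, b - learning_rate * grad_b


def fit(xs, ys, learning_rate=0.05, steps=2000):
    """Start from zero weights and repeat the update a fixed number of times."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        w, b = gradient_step(w, b, xs, ys, learning_rate)
    return w, b


# Toy data generated from y = 2x + 1 (made-up values for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit(xs, ys)
print(f"w={w:.4f}, b={b:.4f}, loss={mse_loss(w, b, xs, ys):.6f}")
```

An LLM follows the same pattern in principle, only with billions of weights, a different loss function, and gradients computed automatically by a framework rather than written out by hand.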

Traditional ML often relied on simpler models, such as decision trees or small feed-forward networks. LLMs, by contrast, are built almost entirely on the Transformer architecture.

The Key Component: The Attention Mechanism. This mechanism is a large part of what makes LLMs so capable. It lets the model weigh the relationship between every pair of tokens (words or word pieces) in the input context at once, regardless of how far apart they are.
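How it changes training: Because attention compares all positions in parallel instead of stepping through the sequence one token at a time, as recurrent models do, training scales efficiently to very large datasets. The result is a context-dependent representation of each token that captures long-range relationships and semantics that earlier, non-Transformer models struggled to learn.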
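The weighing described above can be sketched as scaled dot-product attention, the form of attention used in Transformers. This is a simplified illustration in plain Python: the helper names (`softmax`, `dot`, `attention`) and the toy vectors are assumptions, and the learned projection matrices that a real Transformer uses to produce separate queries, keys, and values are omitted, so the token vectors are reused for all three roles.

```python
import math


def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, however far apart they are.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy 2-dimensional token vectors (made-up numbers for illustration).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)

for i, row in enumerate(weights):
    print(f"token {i} attends with weights {[round(w, 3) for w in row]}")
```

Each row of `weights` shows how strongly one token attends to every token in the input, including itself. The loop over queries is written out for readability; in practice these comparisons are done as matrix multiplications, which is what allows all positions to be processed in parallel.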