LLMs¶

Types of LLM Models¶

Standard/General LLMs: Examples include GPT-4o, Claude 3.5 Sonnet, and Llama 3. These excel at creative writing, summarization, and general conversation, predicting the next token based on patterns rather than deep reasoning.
Reasoning/Thinking Models (Test-Time Compute Models): These models spend significant computational resources "thinking" before providing an answer, generating a "scratchpad" or "inner monologue" to plan before producing the final output. Examples: OpenAI o1/o3, Gemini 2.0 Flash Thinking.
Agentic LLMs: These models operate as "agents" that can set goals, decompose problems into sub-goals, and use external tools (browsers, code interpreters) to solve tasks, so the "thinking" happens across many steps and tool calls rather than in a single response.
Specialized Domain Models: These are trained on specialized data to enhance reasoning in particular fields. Examples: Minerva and Galactica, focused on scientific and mathematical reasoning; Eurus, optimized for mathematical and code generation tasks.

Transformer¶

Most LLMs today use the Transformer architecture. Before the Transformer, language models (such as RNNs) processed text one word at a time, whereas a Transformer processes an entire sequence at once. The Transformer architecture relies on three key ingredients:

Self-attention (Context mechanic): allows the model to look at a word and weigh which other words are relevant to it. Example: in "The bank was closed because the river overflowed", attention to "river" lets the model infer that "bank" refers to the land beside the river and not the financial institution.
Positional encoding (Order mechanic): since the Transformer processes all words simultaneously, positional encoding adds a position marker (a kind of "time stamp") to each word, so the model can differentiate, for example, between "Dog bites man" and "Man bites dog".
Parallelization (Speed mechanic): unlike older models (such as RNNs or LSTMs) that had to finish processing the previous word before starting the next, Transformers can process massive amounts of data in parallel on GPUs.
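
The first two ingredients above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any production model is implemented: the toy `embeddings` table and the helper names are assumptions made for the example, the positional encoding is the sinusoidal variant (one of several schemes in use), and the attention uses Q = K = V = x without the learned projection matrices a real Transformer would have.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors: even dimensions use sin, odd dimensions use cos."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned projections)."""
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # Score the current token against every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted mix of all token vectors.
        outputs.append([sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)])
    return outputs, weights

# Hypothetical toy embeddings, made up for this example.
embeddings = {
    "dog":   [1.0, 0.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0, 0.0],
    "man":   [0.0, 0.0, 1.0, 0.0],
}

def encode(sentence):
    """Add a position vector to each token embedding."""
    tokens = sentence.split()
    pe = positional_encoding(len(tokens), 4)
    return [[e + p for e, p in zip(embeddings[t], pe[i])] for i, t in enumerate(tokens)]

a = encode("dog bites man")
b = encode("man bites dog")
out_a, weights_a = self_attention(a)
```

Here `a[0]` and `b[2]` both represent "dog" but differ because of the position vectors, which is what lets the model tell the two sentences apart. Each row of `weights_a` sums to 1 and says how much one token attends to every other token.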

Generalization and Zero-Shot Capability: Because the Transformer is trained on such vast amounts of general data, it can perform tasks it has never been explicitly trained on (zero-shot learning), such as summarizing a type of document it has never seen before, simply because it learned the general rules of summarization from other data.

Transformers are used in both training and inference; however, they behave differently in each:

Feature	In Training	In Inference
Data Input	Batches of full sequences (all tokens at once)	One prompt + tokens generated so far
Processing	Highly Parallel	Prompt in parallel, then sequential generation (one token at a time)
Goal	Update weights (learn)	Predict next token (answer)
Memory	Huge (storing gradients)	Large (storing KV Cache)
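
To make the "Large (storing KV Cache)" entry concrete, here is a rough back-of-the-envelope estimate. The formula counts one key and one value vector per layer, per attention head, per token. The configuration used below (32 layers, 32 KV heads, head dimension 128, 16-bit values) is only an illustrative assumption for a 7B-class model; real models vary, and techniques such as grouped-query attention reduce the number of KV heads.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size: keys + values for every layer, head, and token."""
    # The leading 2 accounts for storing both a key and a value vector.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Illustrative configuration (an assumption, not taken from any specific model card).
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.1f} GiB")  # prints "2.0 GiB"
```

The cache grows linearly with both sequence length and batch size, which is why long contexts and many concurrent requests are memory-bound at inference time.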

Evolution of Neural Networks¶

Simple Neural Networks (Perceptrons): Learned basic patterns.
RNNs / LSTMs: Learned to handle sequences (like sentences) but were slow and had short memories.
Transformers: Refined the core idea of a neural network into a structure that uses Attention to handle massive data and complex context simultaneously.

Transformers are still a poor fit for some niche areas, for example:

Extremely long inputs: a Transformer has a fixed context window, so text longer than that window cannot be processed in a single pass.
Edge or resource-constrained devices: large Transformers are often impractical there because they require significant RAM and GPU power; a small LSTM can run better.

LLM Training¶

Modern LLM development doesn't just use one training method; it uses a multi-stage, sophisticated pipeline.

Models are available as either base (pre-trained only) or fine-tuned (post-trained).
A base model can be further trained for any special or custom use.

Phase	Learning Goal	Technique	Data Type	Data Scale
Pre-training	Predict the next word	Self-Supervised	Raw text/diverse	Trillions of tokens
Post-training	Follow instructions	SFT, RLHF/DPO	Curated Q&A/rankings	Thousands to millions of examples
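
The "predict the next word" goal is usually trained with a cross-entropy loss: the model is penalized by the negative log of the probability it assigned to the token that actually came next. The sketch below uses a made-up three-word vocabulary and hand-written probabilities purely for illustration; `next_token_loss` is a hypothetical helper name.

```python
import math

def next_token_loss(probs_per_step, target_ids):
    """Average negative log-likelihood of the correct next token at each step."""
    nll = [-math.log(p[t]) for p, t in zip(probs_per_step, target_ids)]
    return sum(nll) / len(nll)

# Toy vocabulary: 0 = "the", 1 = "cat", 2 = "sat" (assumed for this example).
# After "the" the model gives "cat" probability 0.8; after "the cat" it gives "sat" 0.6.
probs = [
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
]
targets = [1, 2]

loss = next_token_loss(probs, targets)
perplexity = math.exp(loss)  # lower is better; 1.0 would be a perfect model
```

The data is "self-supervised" in the sense that the targets come from the text itself: every token is the label for the tokens before it, so no human annotation is needed.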

A fine-tuned model goes through the following stages of training:

Pre-training: the Library phase
SFT (Supervised Fine Tuning): the Classroom phase, makes the model chat capable
RLHF or DPO: the Feedback phase, refines the vibe
- RLHF: Reinforcement Learning from Human Feedback
- DPO: Direct Preference Optimization (learning from dataset of preferred and rejected responses)

Technical differences:

Phase	Technical Objective	Mathematical Algorithm	Neural Network Role
Pre-training	Next-Token Prediction	Gradient Descent	Learns world knowledge
SFT	Minimize Prediction Error	Supervised Learning	Learns specific tasks/formats
RLHF	Maximize Reward Signal	PPO (Reinforcement Learning)	Learns human judgment/vibe
DPO	Contrastive Preference	DPO Loss (Supervised-like)	Learns preferences directly
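
As a rough sketch of the DPO row, the loss for a single preference pair can be written as the negative log-sigmoid of a scaled margin: how much more the trained policy prefers the chosen response over the rejected one, relative to a frozen reference model. The function name, argument names, and the sequence log-probabilities below are assumptions made for illustration; `beta` is the usual temperature-like hyperparameter.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities."""
    # How far the policy has moved toward the chosen response, relative to the reference.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) rewritten as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: no preference learned yet, loss is log(2).
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy now rates the chosen answer higher and the rejected answer lower: loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Unlike RLHF with PPO, nothing here requires a separate reward model or sampling loop, which is why the table describes DPO as "supervised-like".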

Various model architectures¶

With the Transformer as the blueprint, the most popular architectures are:

Decoder-only: the most popular (GPT, Llama, Claude, Gemini). It predicts the next token in a sequence by looking only at previous tokens, using unidirectional attention (causal masking). Unlike an encoder, which can look at the whole sentence at once (left and right), a decoder can look only at the tokens to the left of the current position.
Mixture of Experts (MoE): a sparse version of the decoder-only architecture. Instead of one giant neural network, it contains multiple experts. A router decides which experts should process each incoming token, activating only a few of them at a time (often 1 or 2). Examples: Mixtral, DeepSeek, Grok
Encoder-Decoder: uses an encoder to understand the input and a decoder to generate the output. Cross-attention links the two, allowing the model to look at the entire input sentence while generating the response. Best for tasks that map one complete input to an output, such as language translation and summarization. Examples: T5, BART
Encoder-only: focuses strictly on understanding relationships between words. Bidirectional. Excellent for classification, sentiment analysis, and search, but poor at generating long-form text. Examples: BERT, RoBERTa
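
Two of the ideas in this list are easy to show in code: the attention mask that separates decoder-style (causal) from encoder-style (bidirectional) attention, and the router in an MoE layer. This is a minimal sketch with hypothetical helper names; renormalizing the softmax over only the selected experts is one common routing choice, not the only one.

```python
import math

def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and normalize their weights to sum to 1."""
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    m = max(router_logits[e] for e in top)
    exps = {e: math.exp(router_logits[e] - m) for e in top}
    total = sum(exps.values())
    return {e: exps[e] / total for e in top}

mask = causal_mask(3)
# mask == [[True, False, False],
#          [True, True,  False],
#          [True, True,  True ]]

# Router scores for one token over 4 experts (made-up numbers).
routing = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
# Only experts 1 and 3 run for this token; the other two are skipped entirely.
```

Skipping the unselected experts is what makes MoE "sparse": the model can hold many more parameters than it actually uses for any single token.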

Upcoming/experimental: these try to solve the quadratic scaling problem of Transformers, where the cost of attention grows with the square of the sequence length, so they slow down as text gets longer.

State-Space Models (SSMs): scale linearly with sequence length, which makes them very fast on long inputs. Example: Mamba
Hybrid: combine Transformer with SSM or MoE layers
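
The difference between quadratic and linear scaling can be illustrated with a small sketch. Full self-attention compares every token with every other token, while a state-space model carries a running state forward and touches each token once. The scan below is a deliberately simplified scalar recurrence with fixed, made-up coefficients `a`, `b`, and `c`; it is not Mamba, whose parameters depend on the input and whose state is a vector.

```python
def attention_score_count(seq_len):
    """Number of query-key scores full self-attention computes for one head."""
    return seq_len * seq_len

def linear_scan(xs, a=0.5, b=1.0, c=1.0):
    """Toy state-space recurrence: h_t = a * h_(t-1) + b * x_t, y_t = c * h_t."""
    h = 0.0
    ys = []
    for x in xs:  # one constant-cost update per token
        h = a * h + b * x
        ys.append(c * h)
    return ys

# Doubling the sequence length quadruples the attention work...
short = attention_score_count(1000)   # 1,000,000 scores
long = attention_score_count(2000)    # 4,000,000 scores

# ...while the scan does one step per token, so its cost only doubles.
ys = linear_scan([1.0, 0.0, 0.0])     # [1.0, 0.5, 0.25]: the first input fades over time
```

The fading state also hints at the trade-off: the recurrence compresses history into a fixed-size state, whereas attention keeps every past token directly accessible.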