LLMs

Types of LLM Models

  • Standard/General LLMs: Examples include GPT-4o, Claude 3.5 Sonnet, and Llama 3. These excel at creative writing, summarization, and general conversation, predicting the next token based on patterns rather than deep reasoning.
  • Reasoning/Thinking Models (Test-Time Compute Models): These models spend significant computational resources "thinking" before answering, generating a "scratchpad" or "inner monologue" to plan their response. Examples: OpenAI o1/o3 and Gemini 2.0 Flash Thinking.
  • Agentic LLMs: These models operate as "agents" that can set goals, decompose problems into sub-goals, and use external tools (browsers, code interpreters) to solve tasks, embodying a highly active "thinking" process.
  • Specialized Domain Models: These are trained on specialized data to enhance reasoning in particular fields. Examples: Minerva and Galactica focus on scientific and mathematical reasoning; Eurus is optimized for mathematical and code generation tasks.

Transformer

Most LLMs today use the Transformer architecture. Before Transformers, language models processed text one word at a time, whereas a Transformer processes the entire sequence at once. The Transformer architecture relies on 3 secret ingredients:

  1. Self-attention (Context mechanic): allows the model to look at a word and determine which other words are relevant to it. Example: in "The bank was closed because the river overflowed", the model can use "river" to work out that "bank" refers to the land beside the river and not the financial building.
  2. Positional encoding (Order mechanic): since a Transformer processes all words simultaneously, positional encoding adds a "time stamp" to each word, so the model can tell the difference between, for example, "Dog bites man" and "Man bites dog".
  3. Parallelization (Speed mechanic): unlike older models (such as RNNs or LSTMs) that had to finish processing the previous word before moving on to the next, Transformers can crunch massive amounts of data in parallel on GPUs.
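
To make the self-attention idea concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions: queries, keys, and values are all the raw embeddings (real models first apply learned projection matrices), and the 2-d embeddings are hypothetical, hand-picked values so that "bank" sits close to "river".

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values.

    Assumption: the learned query/key/value projections of a real
    Transformer are omitted to keep the sketch short.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this word to every word in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # New representation: weighted mix of all word vectors.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical embeddings, chosen only for illustration.
tokens = ["bank", "river", "closed"]
embeddings = [[1.0, 0.2], [0.9, 0.3], [-0.5, 1.0]]

outputs, weights = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how much one word attends to every word in the sequence. With these made-up vectors, "bank" puts more weight on "river" than on "closed", which is the kind of signal a trained model uses to disambiguate the word.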

Generalization and Zero-Shot Capability: Because the Transformer is trained on such vast amounts of general data, it can perform tasks it has never been explicitly trained on (zero-shot learning), such as summarizing a type of document it has never seen before, simply because it learned the general rules of summarization from other data.

Transformers are used in both training and inference, but they behave differently in each:

Feature    | In Training                         | In Inference
-----------|-------------------------------------|---------------------------------
Data Input | Entire dataset blocks (all at once) | One prompt + generated words
Processing | Highly parallel                     | Sequential (one token at a time)
Goal       | Update weights (learn)              | Predict next token (answer)
Memory     | Huge (storing gradients)            | Large (storing KV cache)
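
The difference in processing can be sketched with a toy example. Everything here is hypothetical: `BIGRAMS` and `toy_model` stand in for a trained network, and the stop condition is invented for the demo. The point is only the shape of the two workflows: training targets are all known up front, while generation has to loop.

```python
def training_pairs(tokens):
    """Training: every (context, next token) target is known in advance,
    so all positions can be scored together in one parallel pass."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def generate(next_token, prompt, max_new_tokens):
    """Inference: each new token depends on the tokens generated before it,
    so decoding proceeds one step at a time."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token is None:  # toy stop condition (assumption for this sketch)
            break
        tokens.append(token)
    return tokens

# Hypothetical stand-in for a trained model: a fixed bigram lookup table.
BIGRAMS = {"the": "dog", "dog": "bites", "bites": "man"}

def toy_model(tokens):
    """Predict the next token from the last token only."""
    return BIGRAMS.get(tokens[-1])

pairs = training_pairs(["the", "dog", "bites", "man"])
generated = generate(toy_model, ["the"], max_new_tokens=5)
print(pairs)
print(generated)
```

A real Transformer also recomputes less than this loop suggests during inference, because the keys and values of earlier tokens are kept in the KV cache instead of being recalculated at every step.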

Evolution of Neural Networks

  1. Simple Neural Networks (Perceptrons): Learned basic patterns.
  2. RNNs / LSTMs: Learned to handle sequences (like sentences) but were slow and had short memories.
  3. Transformers: Refined the core idea of a neural network into a structure that uses Attention to handle massive data and complex context simultaneously.

Transformers still have limitations in some niche areas, for example:

  • Extremely long inputs: a Transformer can only process text that fits within its context window.
  • Edge or resource-constrained devices: large Transformers are often impractical there because they require significant RAM and GPU power; a lighter model such as an LSTM can be a better fit.

LLM Training

Modern LLM development doesn't just use one training method; it uses a multi-stage, sophisticated pipeline.

  • Models are available either as base models (pre-training only) or as fine-tuned models (pre-training followed by post-training).
  • A base model can be further trained for any special or custom use.
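
Phase         | Learning Goal         | Technique       | Data Type            | Data Scale
--------------|-----------------------|-----------------|----------------------|----------------------------------
Pre-training  | Predict the next word | Self-supervised | Raw, diverse text    | Trillions of tokens
Post-training | Follow instructions   | SFT, RLHF/DPO   | Curated Q&A/rankings | Thousands to millions of examples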
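
As a rough illustration of the pre-training goal, the sketch below computes the average cross-entropy loss for next-token prediction. The probabilities are hypothetical, made-up model outputs; a real model produces a distribution over its whole vocabulary at every position.

```python
import math

def next_token_loss(predicted_probs, target_tokens):
    """Average cross-entropy: -log of the probability the model
    assigned to each true next token."""
    losses = [
        -math.log(probs[target])
        for probs, target in zip(predicted_probs, target_tokens)
    ]
    return sum(losses) / len(losses)

# Hypothetical model outputs while reading the text "dog bites man".
predicted = [
    {"bites": 0.7, "barks": 0.3},  # distribution after "dog"
    {"man": 0.5, "bone": 0.5},     # distribution after "dog bites"
]
targets = ["bites", "man"]

loss = next_token_loss(predicted, targets)
print(round(loss, 4))
```

The loss shrinks toward zero as the model puts more probability on the token that actually comes next. No human labels are needed, because the text itself supplies the targets, which is why this stage is called self-supervised.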

A fine-tuned model goes through the following stages of training:

  1. Pre-training: the Library phase
  2. SFT (Supervised Fine-Tuning): the Classroom phase, makes the model chat-capable
  3. RLHF or DPO: the Feedback phase, refines the vibe
    • RLHF: Reinforcement Learning from Human Feedback
    • DPO: Direct Preference Optimization (learning from a dataset of preferred and rejected responses)

Technical differences:

Phase        | Technical Objective       | Mathematical Algorithm       | Neural Network Role
-------------|---------------------------|------------------------------|------------------------------
Pre-training | Next-Token Prediction     | Gradient Descent             | Learns world knowledge
SFT          | Minimize Prediction Error | Supervised Learning          | Learns specific tasks/formats
RLHF         | Maximize Reward Signal    | PPO (Reinforcement Learning) | Learns human judgment/vibe
DPO          | Contrastive Preference    | DPO Loss (Supervised-like)   | Learns preferences directly
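
To show why the DPO loss is described as "supervised-like", here is a minimal sketch of the loss for a single preference pair. The function and argument names are illustrative, and the default `beta=0.1` is just an assumed value for the demo. The inputs are sequence log-probabilities from the model being trained (the policy) and from a frozen reference copy.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    All four inputs are log-probabilities of a full response.
    The loss is -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    """
    chosen_gain = policy_chosen - ref_chosen        # how much more the policy likes the chosen answer
    rejected_gain = policy_rejected - ref_rejected  # same for the rejected answer
    margin = beta * (chosen_gain - rejected_gain)
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: no preference learned yet.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(round(baseline, 4), round(improved, 4))
```

The loss needs only a dataset of preferred and rejected responses and ordinary gradient descent; there is no separate reward model or reinforcement learning loop as in PPO-based RLHF.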

Various model architectures

With the Transformer as the blueprint, the most popular architectures are:

  • Decoder-only: the most popular (GPT, Llama, Claude, Gemini). It predicts the next token in a sequence by looking only at previous tokens, using unidirectional attention (causal masking). Unlike an encoder, which can look at the whole sentence at once (left and right), a decoder can look only at the words to the left of the current position.
  • Mixture of Experts (MoE): a sparse version of the decoder-only architecture. Instead of one giant neural network, it contains multiple experts. A router decides which experts should process each incoming token, typically activating only 1 or 2 experts at a time. Examples: Mixtral, DeepSeek, Grok
  • Encoder-Decoder: uses an encoder to understand the input and a decoder to generate the output. Cross-attention links the two, allowing the model to look at the entire input sentence while generating the response. Best for tasks that turn one sequence into another, such as language translation and summarization. Examples: T5, BART
  • Encoder-only: focuses strictly on understanding relationships between words. Bidirectional. Excellent for classification, sentiment analysis, and search, but poor at generating long-form text. Examples: BERT, RoBERTa
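
Two of the mechanisms in the list above can be sketched in a few lines: the attention mask that separates decoder-style (causal) from encoder-style (bidirectional) attention, and the top-k routing step of an MoE layer. The function names and the router scores are hypothetical; a real router is a small learned network whose scores come from the token's hidden state.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]

def route_top_k(router_scores, k=2):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(
        range(len(router_scores)),
        key=lambda i: router_scores[i],
        reverse=True,
    )
    return ranked[:k]

causal = attention_mask(3, causal=True)          # decoder-only: look left only
bidirectional = attention_mask(3, causal=False)  # encoder-only: look both ways

# Hypothetical router scores for one token across four experts.
selected = route_top_k([0.10, 0.70, 0.05, 0.15], k=2)

for row in causal:
    print(["x" if allowed else "." for allowed in row])
print(selected)
```

In the causal mask each row only allows positions up to and including its own, which is what lets a decoder be trained on every position at once without seeing the answer. In the routing step only the selected experts run for that token, so most of the model's parameters stay idle on any given token.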

Upcoming/experimental: these try to solve the quadratic scaling problem of Transformers, where the cost of attention grows with the square of the text length, so they slow down as the text gets longer.

  • State-Space Models (SSMs): their cost scales linearly with sequence length, which makes them very fast on long inputs. Example: Mamba
  • Hybrid: combine Transformer layers with SSM or MoE layers
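
The scaling difference can be illustrated with simple counting. This is a back-of-the-envelope sketch, not a benchmark: it counts token-pair comparisons for full self-attention and one state update per token for a linear-time model, and ignores constant factors and hardware effects.

```python
def attention_pairs(seq_len):
    """Full self-attention compares every token with every token."""
    return seq_len * seq_len

def linear_steps(seq_len):
    """A linear-time model does a fixed amount of work per token."""
    return seq_len

for n in (1_000, 2_000, 4_000):
    print(n, attention_pairs(n), linear_steps(n))
```

Doubling the sequence length quadruples the attention count but only doubles the linear one, which is the gap these experimental architectures aim to close.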