LLMs¶
Types of LLM Models¶
- Standard/General LLMs: Examples include GPT-4o, Claude 3.5 Sonnet, and Llama 3. These excel at creative writing, summarization, and general conversation, predicting the next token based on patterns rather than deep reasoning.
- Reasoning/Thinking Models (Test-Time Compute Models): These models spend significant computational resources "thinking" before providing an answer, generating a "scratchpad" or "inner monologue" to plan before outputting. Examples: OpenAI o1/o3, Gemini 2.0 Flash Thinking
- Agentic LLMs: These models operate as "agents" that can set goals, decompose problems into sub-goals, and use external tools (browsers, code interpreters) to solve tasks, embodying a highly active "thinking" process.
- Specialized Domain Models: These are trained on specialized data to enhance reasoning in particular fields. Examples: Minerva/Galactica: Focused on scientific and mathematical reasoning. Eurus: Specifically optimized for mathematical and code generation tasks.
Transformer¶
Most LLMs today use the Transformer architecture. Before the Transformer, language models processed text one word at a time, whereas a Transformer processes the entire sequence at once. The Transformer architecture relies on 3 secret ingredients:
- Self-attention (Context mechanic): allows the model to look at a word and determine which other words in the sequence are relevant to it. Example: in "The bank was closed because the river overflowed", the model can determine that "bank" refers to the riverbank, not the financial institution
- Positional encoding (Order mechanic): since the Transformer processes all words simultaneously, positional encoding adds a "time stamp" to each word so the model can differentiate, for example, between "Dog bites man" and "Man bites dog"
- Parallelization (Speed mechanic): unlike older models (RNNs or LSTMs) that had to finish processing one word before starting the next, Transformers can crunch massive amounts of data in parallel on GPUs.
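The first two ingredients can be sketched in a few lines: scaled dot-product self-attention over tiny hand-made vectors, plus sinusoidal positional encoding. All numbers here are illustrative toys, not weights from a real model, and real models learn separate query/key/value projections rather than reusing the embeddings directly:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors, causal=False):
    """Toy self-attention: each position mixes in the other positions.

    `vectors` serve as queries, keys, and values at once (a simplification).
    With causal=True each position may only look left, as in decoder-only
    models.
    """
    d = len(vectors[0])
    out = []
    for i, q in enumerate(vectors):
        scores = []
        for j, k in enumerate(vectors):
            if causal and j > i:
                scores.append(float("-inf"))  # mask out future tokens
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d))  # scaled dot product
        weights = softmax(scores)  # relevance of every word to word i
        # Weighted sum of value vectors = context-aware representation.
        out.append([sum(w * v[t] for w, v in zip(weights, vectors))
                    for t in range(d)])
    return out

def positional_encoding(pos, d):
    # Sinusoidal "time stamp" added to each token embedding.
    return [math.sin(pos / 10000 ** (t / d)) if t % 2 == 0
            else math.cos(pos / 10000 ** ((t - 1) / d))
            for t in range(d)]

# Three toy 4-dimensional "token embeddings", stamped with their position.
tokens = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
embedded = [[e + p for e, p in zip(tok, positional_encoding(i, 4))]
            for i, tok in enumerate(tokens)]
contextual = self_attention(embedded, causal=True)
print(len(contextual), len(contextual[0]))  # prints "3 4"
```

Note that the inner loops over all positions run independently per position, which is exactly what makes the computation easy to parallelize on a GPU.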
Various model architectures¶
With the Transformer as the blueprint, the most popular model architectures are:
- Decoder-only: the most popular (GPT, Llama, Claude, Gemini). It predicts the next token in the sequence by looking only at previous tokens, using unidirectional attention (causal masking). Unlike an encoder, which can look at the whole sentence at once (left and right), a decoder can only look at the words to its left.
- Mixture of Experts (MoE): a sparse version of the decoder-only architecture. Instead of one giant neural network, it contains multiple "experts". A router decides which experts should process each incoming token, typically activating only 1 or 2 at a time. Examples: Mixtral, DeepSeek, Grok
- Encoder-Decoder: uses an encoder to understand the input and a decoder to generate the output, with cross-attention linking the two so the decoder can look at the entire input sentence while generating the response. Best for sequence-to-sequence tasks with a distinct input and output, such as language translation and summarization. Examples: T5, BART
- Encoder-only: focuses strictly on understanding relationships between words, using bidirectional attention. Excellent for classification, sentiment analysis, and search, but poor at generating long-form text. Examples: BERT, RoBERTa
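The MoE routing idea above can be sketched with toy numbers (a hand-rolled router with made-up weights; in real models the router weights and the experts are learned):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Four toy "experts": each is just a different linear scaling here.
experts = [lambda x, s=s: [s * v for v in x] for s in (0.5, 1.0, 2.0, 3.0)]

# Toy router weights: one score vector per expert (learned in practice).
router_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]

def moe_layer(token, top_k=2):
    # The router scores every expert for this token, keeps only the top-k.
    scores = [sum(w * t for w, t in zip(ws, token)) for ws in router_w]
    top = sorted(range(len(experts)),
                 key=lambda i: scores[i], reverse=True)[:top_k]
    gates = softmax([scores[i] for i in top])
    # Only the chosen experts actually run: this is the "sparse" saving.
    out = [0.0] * len(token)
    for g, i in zip(gates, top):
        for t, v in enumerate(experts[i](token)):
            out[t] += g * v
    return out, top

output, chosen = moe_layer([2.0, 1.0])
print(chosen)  # indices of the 2 experts activated for this token
```

The compute saving comes from the loop at the end: only `top_k` of the experts are evaluated per token, so a model can have a huge total parameter count while each token touches only a small fraction of it.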
Upcoming/experimental: architectures trying to solve the quadratic scaling problem of Transformers, where attention cost grows with the square of the sequence length, so they slow down as text gets longer.
- State-Space Models (SSMs): process sequences in linear time, making them very fast on long inputs. Example: Mamba
- Hybrid: combine Transformer with SSM or MoE layers
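The scaling gap is easy to see by counting operations (a back-of-the-envelope sketch that ignores constant factors):

```python
# Self-attention compares every token with every other: ~n^2 score
# computations. An SSM-style scan touches each token once: ~n steps.
for n in (1_000, 10_000, 100_000):
    pairs, steps = n * n, n
    print(f"{n:>7} tokens: {pairs:>15,} attention pairs vs {steps:>7,} scan steps")
```

At 100,000 tokens the quadratic term is 100,000 times larger than the linear one, which is why long-context efficiency drives this line of research.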