• We’ve talked about how transformers generate predictions, but there’s a crucial step at the end of the process that often gets glossed over: the softmax function. This mathematical function is what lets a model turn raw scores into something meaningful – probabilities. Let’s break down what softmax is, why it’s important, and how it fits into the bigger picture.

    What is Softmax?

    After a transformer has finished processing an input, it spits out a vector of numbers, one for each possible word in its vocabulary. These numbers (often called logits) are not probabilities yet – they’re just raw, unbounded scores. We need a way to turn these scores into probabilities that sum to 1, so the model can “decide” what word to pick next.

    The softmax function takes this vector of scores and squashes them into a probability distribution. The higher the score, the higher the resulting probability – but crucially, all probabilities will add up to 1.

    Moreover, the softmax function uses exponentials. If any score is significantly higher than the rest, that gap gets magnified, so the corresponding token ends up with a correspondingly high probability of being selected.

    How does softmax work?

    The softmax function is quite straightforward! This is how it works:

    1) For each score in the input vector to the function, exponentiate it. In other words, if the score is x, we raise e (the mathematical constant) to the power of x.

    2) For each of these exponentiated values, we divide it by the sum of all exponentiated values to obtain the probability for the corresponding score/token!

    The second step ensures that all probabilities sum up to one, and the function also ensures that all probabilities are between 0 and 1.
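
    To make the two steps concrete, here’s a minimal sketch in Python using NumPy (subtracting the maximum score first is a common numerical-stability trick, not part of the mathematical definition):

    ```python
    import numpy as np

    def softmax(logits):
        shifted = logits - np.max(logits)   # stability trick: doesn't change the result
        exps = np.exp(shifted)              # step 1: exponentiate each score
        return exps / np.sum(exps)          # step 2: divide by the sum of all exponentiated values

    scores = np.array([2.0, 1.0, 0.1])
    probs = softmax(scores)
    print(probs)          # roughly [0.66, 0.24, 0.10]
    print(probs.sum())    # 1.0
    ```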

    Where is softmax used?

    Aside from the end of the decoder step in transformers, softmax is used in various other applications of AI, such as classification and reinforcement learning. The ability to convert a list of arbitrary scores into a list of probabilities is what makes the softmax function so useful and ubiquitous!

  • Continuing from the previous post, let’s now dive into the second half of the transformer – the decoder. If you recall, the decoder takes the contextualized embeddings produced by the encoder from the input sequence. It then generates the output sequence, one token at a time. Let’s break down how this process unfolds, step by step.

    Shifted Output Embeddings

    The decoder starts with the output tokens generated so far. These are embedded into vectors, just like in the encoder. However, there’s a twist: the output sequence is shifted right. This means, for each position, the decoder only “sees” the tokens that have already been generated, never the future ones. So our decoder’s initial “start position” is just a special <start> token embedding.

    Positional Encoding

    Just as in the encoder, these embeddings are combined with positional encoding. This ensures that the decoder is aware of the position of each token in the output sequence, which is crucial for generating coherent and grammatically correct sentences.

    Masked Multi-Head Attention

    The first major block in the decoder is masked multi-head attention. Here, each output token can attend to all previous tokens in the sequence, but not to any future tokens. This is enforced by a “mask” that blocks attention to subsequent positions. The result: when generating the next word, the model can only use information from what it has already generated, never “peeking” ahead. This is essential for tasks like text generation, where each word must be predicted step by step.
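
    As a rough sketch of how such a mask can be applied (assuming we already have a square matrix of raw attention scores, one row per output position):

    ```python
    import numpy as np

    def apply_causal_mask(scores):
        """Block attention to future positions in a (seq_len, seq_len) score matrix."""
        seq_len = scores.shape[0]
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # above the diagonal = future tokens
        masked = scores.copy()
        masked[future] = -np.inf    # softmax turns -inf scores into probability 0
        return masked
    ```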

    Multi-Head Cross Attention

    Also called multi-head encoder-decoder attention, this is where the decoder looks back at the encoder’s output. Each token in the decoder can attend to any position in the encoder’s final output, allowing it to align generated words with relevant parts of the input sequence. This mechanism is what enables the transformer to generate translations that are faithful to the source sentence, or summaries that actually reflect the input.
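
    A sketch of the idea (a single head, ignoring the multi-head machinery): the queries come from the decoder’s token representations, while the keys and values come from the encoder’s output, so each decoder position can weigh up the input tokens. The projection matrices here stand in for learned weights.

    ```python
    import numpy as np

    def cross_attention(decoder_states, encoder_states, W_q, W_k, W_v):
        """Decoder queries attend over the encoder's output (keys and values)."""
        Q = decoder_states @ W_q                        # queries from the decoder
        K = encoder_states @ W_k                        # keys from the encoder output
        V = encoder_states @ W_v                        # values from the encoder output
        scores = Q @ K.T / np.sqrt(K.shape[-1])         # how relevant each input position is
        scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over encoder positions
        return weights @ V
    ```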

    Feedforward Layer

    Next there’s another feedforward layer, or FFN. It performs a similar function to the one in the encoder, letting the model learn more complex and abstract relations and concepts.

    Linear and Softmax

    After the decoder’s Nx repeated blocks, the resulting embeddings are trained to contain the information needed for the model’s prediction of the next token.

    The linear layer simply projects the decoder output onto the model’s vocabulary size.

    The softmax function (we’ll explain this in a later post) converts that vocabulary-sized vector into a same-sized vector containing the probability of each possible next token.
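
    A rough sketch of these final two steps (the sizes and the weight matrix here are illustrative stand-ins for learned values):

    ```python
    import numpy as np

    hidden_size, vocab_size = 512, 50_000            # illustrative sizes
    W = np.random.randn(hidden_size, vocab_size)     # linear layer weights (learned in practice)

    decoder_output = np.random.randn(hidden_size)    # final embedding for the current position
    logits = decoder_output @ W                      # project onto the vocabulary size
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                             # softmax: a probability per vocabulary token
    next_token_id = int(np.argmax(probs))            # e.g. greedy choice of the next token
    ```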

  • Continuing from the previous post, let’s dive into the first section of the transformer – **the encoder**. As we discussed, the encoder embeds the input tokens, uses positional encoding and attention to imbue the token embeddings with relevant meaning, and passes the modified embeddings to the decoder. We’ve already covered how token embeddings work, so let’s jump to…

    Positional Encoding

    Since each token is converted to its own embedding vector, the embeddings don’t carry any information about surrounding tokens or the order in which they occur. Naturally, this ordering is important to how language is structured, so we need a way to add this information to the embeddings!

    Positional encoding does this by adding a vector to each token embedding. This vector is derived from sinusoidal functions and is dependent on the position of the token. The positional meaning of this added vector is “baked” into the latent space of the encoder – the internal abstract space of representations that the model learns.
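
    For reference, here’s a sketch of the sinusoidal encoding from “Attention Is All You Need” (d_model is the embedding size, assumed even here; the encoding is simply added to the token embeddings):

    ```python
    import numpy as np

    def positional_encoding(seq_len, d_model):
        positions = np.arange(seq_len)[:, None]             # (seq_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]            # even dimension indices
        angles = positions / np.power(10000.0, dims / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                        # sine on even dimensions
        pe[:, 1::2] = np.cos(angles)                        # cosine on odd dimensions
        return pe

    # embeddings = token_embeddings + positional_encoding(seq_len, d_model)
    ```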

    Multi-Head Attention

    As we discussed in the attention post, this section lets each embedding attend to every other one, allowing the token embeddings to incorporate relevant meaning from other tokens.

    The **multi-head** prefix refers to the fact that attention here is carried out by multiple units, or “heads”. Each head conducts attention slightly differently. For instance, one head could deal with attending to nearby words, while another might deal with subject-object relations. Put together, the heads provide comprehensive attention coverage.

    Feedforward Layer

    The feedforward layer is a neural network that is independently applied to each encoding after the attention step. This neural network transforms the embedding, enhancing its representation. While the attention layer lets each embedding gain data from surrounding ones, the feedforward layer lets the model “think more deeply” about the embedding. Since it’s a non-linear layer, the training process lets the feedforward layers in the transformer learn deeply about complex relations.
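
    A minimal sketch of such a position-wise feedforward layer (the hidden size being larger than the embedding size, typically 4x, follows the original paper; the weights and biases are learned):

    ```python
    import numpy as np

    def feedforward(x, W1, b1, W2, b2):
        """Applied independently to each token embedding: expand, non-linearity, project back."""
        hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU non-linearity
        return hidden @ W2 + b2                 # back down to the embedding size
    ```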

    Add & Norm

    Standing for “Add and normalize”, this unit adds the input of the previous unit to its output and then normalizes the result. Normalization stabilizes training, and adding the input to the output improves gradient flow, which mitigates errors when training. Combined, this smoothly enables the next step…
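
    As a quick aside, here’s a sketch of what “add & norm” computes (layer normalization over each embedding’s features; the learned scale and shift parameters are omitted for brevity):

    ```python
    import numpy as np

    def add_and_norm(sublayer_input, sublayer_output, eps=1e-6):
        x = sublayer_input + sublayer_output        # "add": the residual connection
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)             # "norm": layer normalization
    ```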

    Nx

    You’ll notice a little “Nx” next to a box encapsulating the attention and feedforward layers. This simply means that this mini-sequence is repeated some N times within the architecture, similar to how a neural network has multiple layers. This enables the transformer to learn more abstract and complex concepts.

  • Introduced in the seminal paper “Attention Is All You Need,” the transformer revolutionized the world of natural language processing (NLP) and supercharged the progress of LLMs today. Let’s take a look at how it works.

    Let’s consider transformers used for causal language modelling. Causal refers to the property of depending only on prior and current inputs, not those in the future. So, causal language modelling in this context refers to the goal of predicting the text that comes next, after a given input.

    For example, with a given input of “I am 30 years-”, we would like to predict the next word, likely to be the word “old.” Since we’re talking about LLMs, our goal is rather to predict the next token.

    This is how the transformer-based LLM works: Given a prompt (input), we try to predict the most suitable token to come next. Then we append it to the initial input and keep repeating the process.
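
    In rough Python, that loop looks something like this (model.predict_next_token is a hypothetical stand-in for a full transformer forward pass plus token selection):

    ```python
    def generate(model, prompt_tokens, max_new_tokens=50, eos_token_id=0):
        """Autoregressive generation: predict a token, append it, repeat."""
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            next_token = model.predict_next_token(tokens)  # hypothetical: one forward pass
            tokens.append(next_token)                      # feed the result back in as input
            if next_token == eos_token_id:                 # stop at the end-of-sequence token
                break
        return tokens
    ```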

    Encoders and Decoders

    The transformer, as it is described in “Attention Is All You Need,” works on an encoder-decoder system.

    The encoder takes the input passed to the model — if we’re considering an LLM, that would be the prompt. It then transforms this input into context-rich embeddings. These aren’t just word embeddings; they also contain positional encoding, indicating where each token sits within the input sequence, and each token attends to every other one as well.

    The decoder generates the output tokens. Each time an output token is generated, the decoder takes in the tokens that have been generated so far as input. It uses masked self-attention to make sure that each token can only attend to tokens that come before it. This is necessary for training, and we’ll discuss it later. The decoder also uses cross-attention to attend to the encoder’s output, so it can refer to the input that way.

    Finally, the decoder returns a list of possible next tokens and their corresponding confidence probabilities. From this, the next token is selected, and the decoder step repeats.

  • Continuing from the first part, let’s look at some more of OpenAI’s model naming conventions.

    Pro

    While mini models sacrifice performance for improved speed and decreased costs, pro-class models do the opposite. They are optimized for accuracy and better reasoning, and as such, are slower and more expensive. Thus, they are more suited to mission-critical use cases.

    To put the mini-regular-pro comparisons into perspective, mini models are generally roughly half the cost of regular models (comparing cost per million input and output tokens). On the other hand, o1-pro is 10-50x as expensive as regular models.

    Comparing performance, let’s look at AIME (American Invitational Mathematics Examination) benchmark results, popularly used as a maths benchmark.

    The mini-class of models sit at ~70-80% on the benchmark. Regular models score ~75-90%. o1-pro, however, scored 93% on it! A 3% increase might not seem like much, but look at it from the other side: going from ~90% to 93% cuts the error rate from roughly 10% to 7%, meaning o1-pro makes about 30% fewer mistakes than the best regular models – which could be a very useful improvement.

    .5

    GPT 3.5 is an improvement on GPT-3, but it was built upon GPT-3, and was not revolutionary enough to warrant a new number. Hence, the “.5”. This shouldn’t be unfamiliar if you’ve interacted with software versions before. 

    Turbo

    Turbo models are optimized for speed, and to a lesser extent, cost. Pricing sits between regular and mini models, and performance is lower, but close to regular. While model size is reduced for mini models, turbo models maintain a similar size to regular.

    Others

    Those were the main naming conventions, but let’s take a look at a couple more.

    Moderation

    Moderation models are designed to screen outputs for policy-violating content.

    Realtime

    Realtime models are designed to deliver low-latency responses, suitable for streaming input and output to and from the model. You might use them in TTS/STT applications, or other applications where low latency is critical.

  • As OpenAI models have progressed over the years, you might have heard of the new models being released via headlines or in passing. But beyond the base GPT versions, the naming conventions probably seem rather confusing. 4o, .5, turbo? What does it even mean? Let’s take a look, starting with the basics.

    Base GPT – major versions

    The major versions of the base GPT models – that is, GPT-2, GPT-3, and GPT-4 – are named as such since each version represents a major leap in capabilities. To get it out of the way, “GPT” stands for generative pre-trained transformer – in other words, OpenAI’s transformer-based LLMs.

    To put the leaps in progress into perspective, let’s look at the models’ parameter counts.

    GPT-2: 1.5 billion

    GPT-3: 175 billion

    GPT-4: estimated 1.8 trillion

    Other features of the models, such as context window length, multimodality, and training data quantity and quality, improved as well.

    “o”-models

    It gets a little confusing here. When the “o” comes after the number, like in GPT-4o, it stands for “omni”, signifying the model’s capability to handle multimodal input and output – text, vision, and audio.

    When the “o” comes before the number, like in the o1 model for example, it denotes a class of models that specialize in advanced reasoning, such as in math, science, and programming.

    mini

    These models, as the name suggests, are distilled versions of their “full” counterparts. They have fewer parameters and thus sacrifice some accuracy and reasoning ability. In turn, however, they are faster to run and cheaper to use – ideal for use cases where deep reasoning doesn’t matter as much, like chatbots and some other RAG applications.

  • Continuing from the last post, here are some more prompt engineering techniques.

    Tree of Thought

    Rather than being a way to enhance a single prompt, Tree of Thought is more of a prompting framework. You break your task down into intermediate steps and generate multiple responses for each step. If a response seems to be heading in the right direction – or at least possibly is – you continue with it as a new base and move on to the next step, once again generating multiple responses. This allows for continuously validated reasoning.

    Prompt chaining

    Prompt chaining is simply breaking a task down into steps and prompting the generative model with those one by one. For example, if you wanted to make a presentation:

    Prompt 1: “I want to make a presentation on XYZ. I’d like to cover these points: … . Please create an outline for a 10-slide presentation.”

    Prompt 2: “This outline looks good. Give me a good title for my presentation.”

    Prompt 3: “Now plan out an effective introduction for a general audience.”

    And so on.

    Self-consistency

    A single LLM response may be incorrect due to various factors – hallucinating details, miscalculations, etc. To improve on this, self-consistency is a prompt engineering technique that essentially produces multiple independent responses to a question, often using randomization techniques to encourage diverse reasoning.

    Then, a majority vote is used to obtain the final answer.
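
    A rough sketch of the idea (generate_answer is a hypothetical function that samples one response from an LLM with some randomness, e.g. a non-zero temperature):

    ```python
    from collections import Counter

    def self_consistency(prompt, generate_answer, n_samples=5):
        """Sample several independent answers, then take a majority vote."""
        answers = [generate_answer(prompt) for _ in range(n_samples)]   # diverse reasoning paths
        return Counter(answers).most_common(1)[0][0]                    # the most frequent answer wins
    ```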

  • Sometimes you may feel like you can’t get an LLM chatbot to do quite what you want. Sometimes, this can be resolved by improving your prompt. Let’s take a look at prompt engineering, the art of constructing an effective prompt!

    Prompts and prompt engineering

    A prompt is simply the textual input you give generative AI models (like LLMs and image generation models), instructing them to perform your desired task. A simple example – whatever you type into ChatGPT is a prompt!

    Prompt engineering is the art of designing prompts to be more relevant and effective, and to impose constraints where needed. It’s generally done entirely in natural language, and you can pick and choose which methods you’d like to use, so don’t be intimidated!

    Some prompt engineering techniques

    Here are some commonly used prompt engineering techniques:

    One/Few-shot prompting

    In this technique, “shots” refers to examples you’re providing in your prompt. A regular prompt you would use, like “Make a sentence with the word ‘weta’ in it”, is an example of a zero-shot prompt, since you’re not providing any examples. (PS: Google it if you dare)

    An example of a one-shot prompt would be:

    “A weta is an insect endemic to New Zealand. An example of a sentence using the word ‘weta’ is:

    Mark was terrified of the weta crawling on his bedroom floor

    Create a sentence using the word ‘weta’ in it”

    As for few-shot prompting, just add a few more examples.

    Role prompting

    In this technique, you assign the LLM a particular role or persona. This encourages the model to embody a similar expertise level, tone, and perspective as the role you assigned. An example:

    “You are a technical support specialist with expertise in handling network issues. You are also great at explaining solutions to non-technical teammates.

    I’m facing this issue, please help me resolve it: … “

    Chain of thought

    In this technique, you encourage the model to work through its solution step-by-step. This encourages the model to generate its ‘thinking process’, which can be useful as added context for complex questions. An example:

    “Find the solutions to this equation, and explain your reasoning step-by-step:

    x^2 + 2x + 1 = 0”

  • When we previously discussed embeddings, we talked about being able to imbue an embedding with relevant semantic meaning from the surrounding context. For example, in the sentence “I spoke to Mark but he …”, an LLM would like to know what the embedding “he” refers to. The method that makes this possible is called attention. Let’s take a high-level overview of what it’s about.

    How attention works (at a high level)

    At the core of each transformer block within a transformer (the revolutionary architecture that made modern LLMs possible) lies the attention layer. An attention head is a component that carries out the attention mechanism, and several of them run in parallel within an attention layer. Essentially, an attention head assigns each preceding token embedding in the context window a relevancy score with respect to the current token. In lingo, we say that the current token is attending to the other tokens.

    These relevancy scores are then combined into the current token embedding to enrich it with relevant semantic meaning from the surrounding text. This method of attention is called self-attention and lies at the heart of GPT models. I’ll dive deeper into how attention works in a later post.
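
    For the curious, here’s a very rough sketch of what a single self-attention head computes (scaled dot-product attention; the projection matrices W_q, W_k, W_v are learned weights and are what make each head behave differently – the causal mask that restricts attention to preceding tokens is omitted here):

    ```python
    import numpy as np

    def attention_head(embeddings, W_q, W_k, W_v):
        """One self-attention head over a (seq_len, d_model) matrix of token embeddings."""
        Q, K, V = embeddings @ W_q, embeddings @ W_k, embeddings @ W_v
        scores = Q @ K.T / np.sqrt(K.shape[-1])          # relevancy scores between tokens
        scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
        return weights @ V                               # embeddings enriched with context
    ```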

  • When you read any kind of text, you’re able to quite naturally understand what’s written, without giving it much active thought. Take a look at someone learning a new language, however, and you’ll see that when they try to read a sentence, they do so by breaking it down – usually word-by-word, and sometimes breaking down larger words further.

    Similarly, LLMs break down their text inputs into smaller parseable units called tokens. Your first thought might be to break down texts into individual words, and that’s valid! Termed “word tokenization”, that’s a well-known tokenization strategy. However, consider the words “running”, “runner”, and “runners”. When you think about these words, you probably don’t consider them separately. You identify the root of the word – “run”, and that it’s conjoined with suffixes that slightly modify the word’s context.

    Likewise, subword tokenization is a dominant tokenization method for LLMs. As the name suggests, tokens obtained via this method can be smaller than an entire word, often word roots, prefixes, and suffixes as described above.

    How tokenization plays into LLMs

    LLMs are designed with a certain vocabulary size in mind. This determines the number of tokens the model can register in its vocabulary. Before an LLM reads data, an algorithm called a tokenizer breaks the text down into tokens. The tokenizer is trained to generate a token vocabulary of the specified size that is a good fit for the data it is expected to read. In case the LLM encounters unexpected text that might not fit word or even subword tokens (such as misspelled words), byte tokens are often added to the vocabulary as well, so that the data can still be tokenized. These tokens represent a single byte of data – quite the granular division!

    Since an LLM’s vocabulary is fixed after the tokenizer is trained, each token is assigned a unique numerical token ID. When text is broken down into tokens, those tokens are represented by their IDs. Next, LLMs maintain an embedding matrix that maps each token ID to its corresponding token embedding – an embedding that solely represents that token. This way, tokens can be quickly converted into token embeddings for the LLM to use.
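
    To see this in practice, here’s a small example using the Hugging Face transformers library and the GPT-2 tokenizer (assuming the library is installed; the exact tokens and IDs you get depend on the tokenizer):

    ```python
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")   # a byte-pair (subword) tokenizer

    ids = tokenizer.encode("running runners")           # text -> token IDs
    print(ids)                                          # list of integer token IDs
    print(tokenizer.convert_ids_to_tokens(ids))         # the subword tokens those IDs map to
    print(tokenizer.vocab_size)                         # the fixed vocabulary size
    ```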