Evaluating large language models (LLMs) is crucial to ensure they deliver accurate, safe, and useful outputs. After all, without rigorous assessment, models may generate incorrect, biased, or harmful content that undermines trust and viability. Evaluation helps developers understand a model’s strengths and weaknesses, guide improvements, and ensure alignment with practical needs.
How do we evaluate LLMs?
There are two primary methods for evaluating LLMs: automatic metrics and human evaluation. Automatic metrics use algorithms to score outputs objectively, enabling rapid assessment at large scale. Human evaluation, on the other hand, involves subjective judgment by people who assess fluency, relevance, and appropriateness more holistically, capturing nuances that machines may miss.
When to Use Which Approach
The choice between automatic and human evaluation depends on factors like scale, cost, and task type. Automatic metrics are ideal for frequent, large-scale testing where efficiency is critical, but their insights are limited. Human evaluation is best for nuanced tasks, high-stakes applications, or final validation, despite being slower and more costly. Often, a blend of both provides the most reliable and actionable insights.
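To make the automatic side concrete, here’s a minimal sketch of one very simple automatic metric, exact-match accuracy, with made-up predictions and references purely for illustration:

```python
# A minimal sketch of one automatic metric: exact-match accuracy.
# The predictions and references below are illustrative placeholders.

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference answer."""
    matches = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return matches / len(references)

preds = ["Paris", "4", "blue whale"]
refs  = ["Paris", "5", "Blue whale"]
print(exact_match_accuracy(preds, refs))  # 0.666... (2 of 3 correct)
```

Real evaluation suites use richer metrics (overlap scores, model-based graders, task-specific checks), but they all follow this same pattern: compare outputs against references and aggregate a score.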
When it comes to LLMs, temperature is a key parameter that controls the randomness or creativity of the text the model generates. It acts like a dial that influences how adventurous or predictable the model’s word choices are when producing language, essentially shaping the style and variety of the output.
How Temperature Works
Temperature works by adjusting the probability distribution over possible next words before the model selects its output. Internally, the model computes logits—raw, unnormalized scores for each potential token. Temperature rescales these logits, effectively sharpening or flattening the probability distribution. A low temperature value (close to zero) makes the distribution peak sharply around the highest-probability words, making the output highly predictable and deterministic. Conversely, a higher temperature flattens the distribution, increasing the chances of selecting less likely tokens and thus boosting output diversity and creativity.
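To make this concrete, here’s a toy sketch (the logits are made-up scores for four candidate tokens, not from any real model): dividing the logits by the temperature before applying softmax sharpens or flattens the resulting distribution.

```python
import numpy as np

# Made-up raw scores (logits) for four candidate next tokens.
logits = np.array([2.0, 1.0, 0.5, 0.1])

def softmax_with_temperature(logits, temperature):
    scaled = logits / temperature          # rescale the raw scores
    exp = np.exp(scaled - scaled.max())    # subtract the max for numerical stability
    return exp / exp.sum()

for t in (0.2, 0.7, 1.2):
    print(t, softmax_with_temperature(logits, t).round(3))
# Low temperature: probability mass concentrates on the top token.
# Higher temperature: the distribution flattens, so less likely tokens get picked more often.
```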
Why Temperature Matters
This parameter is crucial because it allows users to balance coherence and inventiveness based on their needs. For example, when factual accuracy and reliability are paramount—such as in technical writing or data-driven responses—a low temperature is preferred to generate consistent and precise text. On the other hand, creative tasks like storytelling, brainstorming, or poetry benefit from a higher temperature, which encourages novel and varied outputs. Adjusting temperature helps tailor the model’s behavior to different applications, making it a versatile tool in managing language generation.
Practical Implications and Examples
Typically, temperature values range from 0 to 1, though values above 1 are sometimes used for even more randomness. A common low setting might be around 0.2 to 0.5, yielding safe and focused text. Mid-range values around 0.7 strike a balance, offering natural yet slightly inventive language. At high values near 1 or beyond, outputs become more unpredictable and sometimes quirky, which can be exciting but may also introduce nonsensical or irrelevant content.
A context window is essentially the span or range of tokens—units of text like words, subwords, or punctuation—that an AI language model can consider or “remember” at one time. Think of it as the model’s working memory, the active portion of text it analyzes when processing input and generating outputs. This window frames what the model can directly reference or draw upon during a single interaction, allowing it to maintain continuity and relevance within that scope.
How Context Windows Work
Language models process language through tokens, which break down text into manageable pieces, ranging from full words to smaller segments depending on the model’s design. The context window defines how many of these tokens the model can use simultaneously. As the model receives input, it incorporates all tokens within this window to understand meaning, context, and relationships before predicting the next token or generating a response. If the input exceeds the window size, the model must truncate or forget earlier tokens, limiting how much information it can actively consider at once.
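As a rough sketch (using whitespace splitting in place of a real tokenizer, and a made-up window size), truncating to a context window simply means keeping the most recent tokens:

```python
# A simplified sketch of how input gets limited to a context window.
# Real tokenizers split text into subword tokens; whitespace splitting here
# is just a stand-in for illustration.
CONTEXT_WINDOW = 8  # hypothetical window size, in tokens

def fit_to_window(tokens, window=CONTEXT_WINDOW):
    """Keep only the most recent `window` tokens; older ones are dropped."""
    return tokens[-window:]

conversation = "the quick brown fox jumps over the lazy dog near the river".split()
print(fit_to_window(conversation))
# ['jumps', 'over', 'the', 'lazy', 'dog', 'near', 'the', 'river']
```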
Why Context Windows Matter in LLMs
The size of the context window directly influences a large language model’s performance and usefulness. A larger window allows the model to consider more text—whether it’s a long conversation, a detailed document, or complex instructions—leading to better understanding, more coherent outputs, and fewer hallucinations or errors caused by missing context. It also enhances the model’s ability to handle nuanced or extended interactions without losing track of earlier details. Consequently, innovations that expand context window sizes are crucial for improving how LLMs interact naturally and effectively across various applications.
We’ve talked about how transformers generate predictions, but there’s a crucial step at the end of the process that often gets glossed over: the softmax function. This mathematical function is what lets a model turn raw scores into something meaningful – probabilities. Let’s break down what softmax is, why it’s important, and how it fits into the bigger picture.
What is Softmax?
After a transformer has finished processing an input, it spits out a vector of numbers, one for each possible word in its vocabulary. These numbers (often called logits) are not probabilities yet – they’re just raw, unbounded scores. We need a way to turn these scores into probabilities that sum to 1, so the model can “decide” what word to pick next.
The softmax function takes this vector of scores and squashes them into a probability distribution. The higher the score, the higher the resulting probability – but crucially, all probabilities will add up to 1.
Moreover, the softmax function uses exponentials. This makes it so that if any scores are significantly higher than the rest, that disparity gets magnified, and the probability of that token being selected is high.
How does softmax work?
The softmax function is quite straightforward! This is how it works:
1) For each score in the input vector to the function, exponentiate it. In other words, if the score is x, we raise e (the mathematical constant) to the power of x.
2) For each of these exponentiated values, we divide it by the sum of all exponentiated values to obtain the probability for the corresponding score/token!
The second step ensures that all probabilities sum up to one, and the function also ensures that all probabilities are between 0 and 1.
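In Python, those two steps amount to just a few lines. The logits here are made-up scores for a three-token vocabulary, and subtracting the maximum is a standard numerical-stability trick rather than part of the definition:

```python
import numpy as np

def softmax(scores):
    """Convert a vector of raw scores (logits) into probabilities."""
    exp_scores = np.exp(scores - np.max(scores))  # step 1: exponentiate (shifted for stability)
    return exp_scores / exp_scores.sum()          # step 2: divide by the sum

logits = np.array([3.0, 1.0, 0.2])   # made-up scores for three tokens
probs = softmax(logits)
print(probs)          # approximately [0.836 0.113 0.051]
print(probs.sum())    # 1.0
```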
Where is softmax used?
Aside from the end of the decoder step in transformers, softmax is used in various other areas of AI, such as classification and reinforcement learning. The act of converting a list of arbitrary scores into a list of probabilities is what makes the softmax function so useful and ubiquitous!
Continuing from the previous post, let’s now dive into the second half of the transformer – the decoder. If you recall, the decoder takes the contextualized embeddings produced by the encoder from the input sequence. It then generates the output sequence, one token at a time. Let’s break down how this process unfolds, step by step.
Shifted Output Embeddings
The decoder starts with the output tokens generated so far. These are embedded into vectors, just like in the encoder. However, there’s a twist: the output sequence is shifted right. This means, for each position, the decoder only “sees” the tokens that have already been generated, never the future ones. So our decoder’s initial “start position” is just a special <start> token embedding.
Positional Encoding
Just as in the encoder, these embeddings are combined with positional encoding. This ensures that the decoder is aware of the position of each token in the output sequence, which is crucial for generating coherent and grammatically correct sentences.
Masked Multi-Head Attention
The first major block in the decoder is masked multi-head attention. Here, each output token can attend to all previous tokens in the sequence, but not to any future tokens. This is enforced by a “mask” that blocks attention to subsequent positions. The result: when generating the next word, the model can only use information from what it has already generated, never “peeking” ahead. This is essential for tasks like text generation, where each word must be predicted step by step.
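Here’s a small sketch of what that mask looks like (the sequence length is arbitrary, chosen just for illustration):

```python
import numpy as np

# A sketch of the causal mask used in masked self-attention.
# For a sequence of 4 generated tokens, position i may only attend to
# positions 0..i; future positions are masked out with -inf so that
# softmax assigns them zero probability.
seq_len = 4
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
print(mask)
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]
# The mask is added to the attention scores before softmax,
# so each row's attention weights over future tokens become 0.
```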
Multi-Head Cross Attention
Also called multi-head encoder-decoder attention, this is where the decoder looks back at the encoder’s output. Each token in the decoder can attend to any position in the encoder’s final output, allowing it to align generated words with relevant parts of the input sequence. This mechanism is what enables the transformer to generate translations that are faithful to the source sentence, or summaries that actually reflect the input.
Feedforward Layer
There’s then another feedforward layer, or FFN. It performs a function similar to that in the encoder, letting the model learn about more complex and abstract relations and concepts.
Linear and Softmax
After the Nx repeating blocks in the decoder, the final embeddings encode the information the model needs to predict the next token.
The linear layer simply projects the decoder output onto the model’s vocabulary size.
The softmax function (we’ll explain this later) converts that vocabulary-sized vector of scores into a same-sized vector containing the probabilities of each possible next output token.
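Put together, this prediction head is essentially a matrix multiplication followed by softmax. Here’s a rough sketch with made-up dimensions and random weights standing in for the trained ones:

```python
import numpy as np

d_model, vocab_size = 8, 10                       # hypothetical embedding and vocabulary sizes
rng = np.random.default_rng(0)

h = rng.normal(size=d_model)                      # decoder output for the last position
W = rng.normal(size=(vocab_size, d_model))        # linear projection to vocabulary size
logits = W @ h                                    # one raw score per vocabulary token

probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # softmax: scores -> probabilities
print(probs.argmax(), probs.max())                # index and probability of the top token
```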
Continuing from the previous post, let’s dive into the first section of the transformer – **the encoder**. As we discussed, the encoder embeds the input tokens, uses positional encoding and attention to imbue the token embeddings with relevant meaning, and passes the modified embeddings to the decoder. We’ve already covered how token embeddings work, so let’s jump to…
Positional Encoding
Since each token is converted to its own embedding vector, the embeddings don’t carry any information about surrounding tokens or the order in which they occur. Naturally, this order is important to how language is structured, so we need a way to add this information to the embeddings!
Positional encoding does this by adding a vector to each token embedding. This vector is derived from sinusoidal functions and is dependent on the position of the token. The positional meaning of this added vector is “baked” into the latent space of the encoder – the internal abstract space of representations that the model learns.
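Here’s a compact sketch of the sinusoidal encoding described in the paper, assuming an even embedding size (the dimensions are arbitrary, for illustration only):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as described in 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even embedding dimensions
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)    # cosine on odd dimensions
    return pe

# Each row is simply added to the corresponding token embedding.
print(positional_encoding(seq_len=4, d_model=8).shape)  # (4, 8)
```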
Multi-Head Attention
As we discussed in the attention post, this section allows for each embedding to attend to every other, allowing the token embeddings to incorporate relevant meaning from other tokens.
The **multi-head** prefix refers to the fact that attention here is carried out by multiple units, or “heads”. Each head conducts attention slightly differently. For instance, one head could deal with attending to nearby words, while another might deal with subject-object relations. Put together, these heads provide comprehensive attention coverage.
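Here’s a rough sketch of the idea with made-up sizes; the learned query/key/value projection matrices are omitted for brevity, so each head simply gets an equal slice of the embedding:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))    # random stand-ins for token embeddings

# (n_heads, seq_len, d_head): each head sees its own slice of every embedding
heads = x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

# Scaled dot-product attention per head (the slices play the role of Q, K, and V)
scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq_len, seq_len)
weights = softmax(scores)                                      # attention weights per head
out = weights @ heads                                          # (n_heads, seq_len, d_head)

# Concatenate the heads back into full-size embeddings
combined = out.transpose(1, 0, 2).reshape(seq_len, d_model)
print(combined.shape)  # (4, 8)
```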
Feedforward Layer
The feedforward layer is a neural network that is independently applied to each embedding after the attention step. This neural network transforms the embedding, enhancing its representation. While the attention layer lets each embedding gain information from surrounding ones, the feedforward layer lets the model “think more deeply” about the embedding. Since it’s a non-linear layer, training lets the feedforward layers in the transformer capture complex relationships.
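In essence, it’s a small two-layer network applied to every position separately. Here’s a sketch with made-up sizes and random weights standing in for trained ones:

```python
import numpy as np

d_model, d_ff = 8, 32                      # hypothetical embedding and hidden sizes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feedforward(x):
    hidden = np.maximum(0, x @ W1 + b1)    # linear layer, then ReLU non-linearity
    return hidden @ W2 + b2                # project back down to the embedding size

x = rng.normal(size=(4, d_model))          # 4 token embeddings
print(feedforward(x).shape)                # (4, 8): same shape, transformed content
```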
Add & Norm
Standing for “Add and normalize”, this unit adds the input of the previous unit to its output and then normalizes the result. Normalization stabilizes training, and adding the input to the output (a residual connection) improves gradient flow, which mitigates issues like vanishing gradients during training. Combined, this smoothly enables the next step…
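Here’s a bare-bones sketch of the idea (the learned scale and shift parameters of layer normalization are omitted for brevity):

```python
import numpy as np

def add_and_norm(x, sublayer_output, eps=1e-6):
    """Residual connection followed by layer normalization (per embedding)."""
    y = x + sublayer_output                       # "Add": the residual connection
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)               # "Norm": normalize each embedding
```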
Nx
You’ll notice a little “Nx” next to a box encapsulating the attention and feedforward layers. This simply means that this mini-sequence is repeated some N times within the architecture, similar to how a neural network has multiple layers. This enables the transformer to learn more abstract and complex concepts.
Introduced in the seminal paper “Attention Is All You Need,” the transformer revolutionized the world of natural language processing (NLP) and supercharged the progress of LLMs today. Let’s take a look at how it works.
Let’s consider transformers used for causal language modelling. Causal refers to the property of depending only on prior and current inputs, not those in the future. So, causal language modelling in this context refers to the goal of predicting the text that comes next, after a given input.
For example, with a given input of “I am 30 years-”, we would like to predict the next word, likely to be the word “old.” Since we’re talking about LLMs, our goal is rather to predict the next token.
This is how the transformer-based LLM works: Given a prompt (input), we try to predict the most suitable token to come next. Then we append it to the initial input and keep repeating the process.
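In pseudocode-style Python (every name here is a hypothetical placeholder rather than a real API), the loop looks like this:

```python
# A high-level sketch of the autoregressive loop. `model`, `tokenize`, and
# END_OF_SEQUENCE are hypothetical placeholders, not any real library's API.

def generate_text(model, prompt, max_new_tokens=20):
    tokens = tokenize(prompt)                # prompt -> list of token ids
    for _ in range(max_new_tokens):
        probs = model(tokens)                # probabilities over the whole vocabulary
        next_token = probs.argmax()          # greedy choice (sampling is also common)
        if next_token == END_OF_SEQUENCE:    # hypothetical stop token
            break
        tokens.append(next_token)            # append the prediction and repeat
    return tokens
```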
Encoders and Decoders
The transformer, as it is described in “Attention Is All You Need,” works on an encoder-decoder system.
The encoder takes the input passed to the model — if we’re considering an LLM, that would be the prompt. It then transforms this input into context-rich embeddings. These aren’t just word embeddings; they also contain positional encoding, indicating where each token is located within the input sequence, and each token attends to every other one as well.
The decoder generates the output tokens. Each time an output token is generated, the decoder takes in the output tokens generated so far as input. It uses masked self-attention to make sure that each token can only attend to tokens that come before it. This is necessary for training, and we’ll discuss it later. The decoder also uses cross-attention to attend to the encoder’s output, so it can refer to the input that way.
Finally, the decoder returns a list of possible next tokens and their corresponding confidence probabilities. From this, the next token is selected, and the decoder step repeats.
Continuing from the first part, let’s look at some more of OpenAI’s model naming conventions.
Pro
While mini models sacrifice performance for improved speed and decreased costs, pro-class models do the opposite. They are optimized for accuracy and better reasoning, and as such, are slower and more expensive. Thus, they are more suited to mission-critical use cases.
To put the mini-regular-pro comparisons into perspective, mini models are roughly half the cost of regular models (comparing cost per million input and output tokens). On the other hand, o1-pro is 10-50x as expensive as regular models.
Comparing performance, let’s look at AIME (American Invitational Mathematics Examination) benchmark results, popularly used as a maths benchmark.
The mini class of models sits at ~70-80% on the benchmark. Regular models score ~75-90%. o1-pro, however, scored 93%! A 3-point increase might not seem like much, but look at it from the error side: dropping from a 10% error rate to 7% means o1-pro makes roughly 30% fewer mistakes than a regular model, which could be a very useful improvement.
.5
GPT-3.5 is an improvement on GPT-3, but since it was built upon GPT-3 and was not revolutionary enough to warrant a new major number, it got the “.5”. This shouldn’t be unfamiliar if you’ve interacted with software versioning before.
Turbo
Turbo models are optimized for speed, and to a lesser extent, cost. Pricing sits between regular and mini models, and performance is lower, but close to regular. While model size is reduced for mini models, turbo models maintain a similar size to regular.
Others
Those were the main naming conventions, but let’s take a look at a couple more.
Moderation
Moderation models are designed to screen outputs for policy-violating content.
Realtime
Realtime models are designed to deliver low-latency responses, suitable for streaming input and output to and from the model. You might use them in TTS/STT applications, or other applications where low latency is critical.
As OpenAI models have progressed over the years, you might have heard of the new models being released via headlines or in passing. But beyond the base GPT versions, the naming conventions probably seem rather confusing. 4o, .5, turbo? What does it even mean? Let’s take a look, starting with the basics.
Base GPT – major versions
The major versions of the base GPT models – that is, GPT-2, GPT-3, and GPT-4 – are named as such since each version represents a major leap in capabilities. To get it out of the way, “GPT” stands for generative pre-trained transformer, in other words, OpenAI’s family of transformer-based LLMs.
To put the leaps in progress into perspective, let’s look at the models’ parameter counts.
GPT-2: 1.5 billion
GPT-3: 175 billion
GPT-4: estimated 1.8 trillion
Other features of the models, such as context window length, multimodality, and training data quantity and quality, improved as well.
“o”-models
It gets a little confusing here. When the “o” comes after the number, like in GPT-4o, it stands for “omni”, signifying the model’s capability to handle multimodal input and output – text, vision, and audio.
When the “o” comes before the number, like in the o1 model for example, it denotes a class of models that specialize in advanced reasoning, such as in math, science, and programming.
mini
These models, as the name suggests, are distilled versions of their “full” counterparts. They have fewer parameters and thus sacrifice some accuracy and reasoning ability. However, they, in turn, are faster to run and cheaper to use, ideal for use cases where deep reasoning doesn’t matter as much, like in chatbots and some other RAG applications.
Continuing from the last post, here are some more prompt engineering techniques.
Tree of Thought
Rather than being a way to enhance a single prompt, Tree of Thought is more of a prompting framework. You break your task down into intermediate steps and generate multiple responses for each step. If a response seems to be heading in the right direction, or at least plausibly so, you continue with it as a new base and move on to the next step, once again generating multiple responses. This allows for continuously validated reasoning.
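Here’s a rough sketch of the flow, where `generate_step` and `looks_promising` are hypothetical placeholders for an LLM call and an evaluation step (which may itself be another LLM call):

```python
def tree_of_thought(task, steps, branching_factor=3):
    base = task
    for step in steps:
        # Generate several candidate continuations for this intermediate step
        candidates = [generate_step(base, step) for _ in range(branching_factor)]
        promising = [c for c in candidates if looks_promising(c)]
        if not promising:
            break                      # dead end: stop here (fuller variants backtrack)
        base = promising[0]            # continue from a validated intermediate result
    return base
```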
Prompt chaining
Prompt chaining is simply breaking a task down into steps and prompting the generative model with those one by one. For example, if you wanted to make a presentation:
Prompt 1: “I want to make a presentation on XYZ. I’d like to cover these points: … . Please create an outline for a 10-slide presentation.”
Prompt 2: “This outline looks good. Give me a good title for my presentation.”
Prompt 3: “Now plan out an effective introduction for a general audience.”
And so on.
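In code form, with `generate` as a hypothetical placeholder for whatever LLM call you’re using, the chain simply feeds each response into the next prompt:

```python
# A minimal sketch of prompt chaining. `generate` is a hypothetical
# placeholder for a call to your LLM of choice.

outline = generate("I want to make a presentation on XYZ. I'd like to cover these "
                   "points: ... . Please create an outline for a 10-slide presentation.")
title = generate(f"Here is my outline:\n{outline}\n\nGive me a good title for my presentation.")
intro = generate(f"Outline:\n{outline}\nTitle: {title}\n\n"
                 "Now plan out an effective introduction for a general audience.")
```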
Self-consistency
A single LLM response may be incorrect due to various factors – hallucinating details, miscalculations, etc. To improve on this, self-consistency is a prompt engineering technique that essentially produces multiple independent responses to a question, often using randomization techniques to encourage diverse reasoning.
Then, a majority vote is used to obtain the final answer.
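Here’s a minimal sketch, with `generate` as a hypothetical placeholder for an LLM call that returns a final answer string:

```python
from collections import Counter

# Self-consistency: sample several independent answers (typically at a higher
# temperature to encourage diverse reasoning paths) and take the majority vote.

def self_consistent_answer(question, n_samples=5):
    answers = [generate(question, temperature=0.8) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]   # the most frequent answer wins
```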