• If you’re unfamiliar with how Large Language Models (LLMs) – such as those behind chatbots like ChatGPT – work, you may wonder how they can seemingly understand what you’re saying. More impressively, even if you don’t convey your intentions very well, they can often pick up on what you want to say!

    Behind these feats lie embeddings, a method by which LLMs can capture the meaning behind text.

    Embeddings?

    An embedding is just a vector – an ordered list of numbers. Each embedding is quite large, on the scale of hundreds or even thousands of numbers. Each embedding vector stores a specific semantic meaning – in other words, what the text actually represents. For instance, the word “log” can have multiple meanings depending on the context – a fallen tree, a record of events, or the mathematical function. If we use an embedding to store the meaning of “log”, it will have different values depending on the surrounding context (I’ll explain how identifying the surrounding context works later).

    How do embeddings work?

    For now, let’s imagine that an embedding captures the meaning of a single word.

    Since each embedding is a vector of some length N, we can think of it as a point in N-dimensional space. As an analogy, a vector containing 2 numbers can identify a point on a piece of paper, and a vector containing 3 numbers can pinpoint a location in 3-D space.

    Embeddings group words with similar semantic meaning close to each other, and the more dissimilar two words are, the farther apart they sit. Remarkably, embeddings have been shown to capture more abstract concepts via specific directions. Here’s a famous example. In the embedding model Word2Vec, the vector difference “king” − “man” is roughly equal to “queen” − “woman”. This shows that the idea of “monarchy” is captured by a particular direction in N-dimensional space!
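    To make the geometry concrete, here’s a toy sketch of that vector arithmetic. The four vectors below are invented for illustration (real Word2Vec embeddings have hundreds of learned dimensions), but the idea is the same: the two difference vectors point in nearly the same direction.

```python
import numpy as np

# Toy 4-dimensional vectors, invented for illustration only --
# real embeddings have hundreds of dimensions with learned values.
king  = np.array([0.9, 0.8, 0.1, 0.3])
man   = np.array([0.1, 0.8, 0.1, 0.2])
queen = np.array([0.9, 0.1, 0.8, 0.3])
woman = np.array([0.1, 0.1, 0.8, 0.2])

def cosine_similarity(a, b):
    """How closely two vectors point in the same direction (1.0 = identical)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Both differences should isolate the same "royalty" direction.
royalty_1 = king - man
royalty_2 = queen - woman
print(cosine_similarity(royalty_1, royalty_2))  # close to 1.0
```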

    How are embeddings made?

    It may still feel eerily magical that such a system can be created for something as complex as human language. Embeddings are made via an embedding model/layer, generally a specialized neural network that’s been trained to produce embeddings that accurately capture semantic meaning. This is often done via contrastive learning, a method that teaches the model the similarity and dissimilarity between words.
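    As a rough sketch of the contrastive idea (a simplified margin-based loss, not any particular model’s actual training objective): similar pairs are penalized for being far apart, while dissimilar pairs are penalized only when they sit closer than a chosen margin.

```python
import numpy as np

def contrastive_loss(a, b, similar, margin=1.0):
    """Simplified margin-based contrastive loss.
    Similar pairs: penalize any distance between them.
    Dissimilar pairs: penalize only if closer than `margin`."""
    d = np.linalg.norm(a - b)
    if similar:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Hypothetical 2-D embeddings: two related senses vs. an unrelated word.
log_vec    = np.array([0.2, 0.9])
record_vec = np.array([0.3, 0.8])
tree_vec   = np.array([0.9, 0.1])

print(contrastive_loss(log_vec, record_vec, similar=True))   # small: already close
print(contrastive_loss(log_vec, tree_vec, similar=False))    # zero: already far apart
```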

    More on embeddings

    While we’ve been discussing embeddings in the context of individual words, they can capture the meaning of sentences, paragraphs, and even whole documents! This ties into the idea of capturing the meaning of a word while taking into account the surrounding context. This concept lies at the heart of LLMs, done via a mechanism called attention. Attention allows an LLM to imbue the meaning of relevant portions of text into an embedding, allowing it to capture richer semantic meaning. It’s a rather complex topic, so I won’t get into it in this post.

  • Training VAEs involves feeding batches of images through the encoder-decoder pipeline, minimizing the combined loss. The model learns unsupervised – no labels needed beyond the images themselves. Start with simple architectures (CNNs for encoders/decoders) and scale up.

    Watch for posterior collapse, where the KL term dominates, driving σ → 0 and making the latent space useless (everything encodes to the prior mean). Mitigate by annealing β or using the free-bits technique. Blurriness in outputs stems from averaging over probabilistic samples, but that’s a feature for stability.
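    One common mitigation is a β annealing schedule: keep the KL term switched off early so the reconstruction term dominates, then ramp it up. A minimal linear warmup might look like this (the step counts are placeholder values; real schedules are often cyclical):

```python
def beta_schedule(step, warmup_steps=10_000, beta_max=1.0):
    """Linear KL annealing: beta rises from 0 to beta_max over warmup_steps,
    giving the reconstruction term a head start before the KL term kicks in."""
    return beta_max * min(1.0, step / warmup_steps)

print(beta_schedule(500))     # early training: KL nearly switched off
print(beta_schedule(20_000))  # after warmup: full strength
```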

    VAEs as Generation Engines

    Generation is straightforward: sample z ~ N(0, 1), decode to an image. Think of it as throwing a dart at a dartboard and feeding that outcome to the decoder. Results are often blurry compared to GANs, but VAEs excel at controlled synthesis. Want a new image between two known ones? Interpolate in latent space. Edit attributes? Traverse specific dimensions.
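    In code, generation and interpolation are just a few lines. The decode function below is only a stand-in for a trained decoder network, and the latent size is an assumed value:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 128  # assumed size; depends on the trained model

def decode(z):
    """Stand-in for a trained decoder network (hypothetical).
    A real decoder would map z to image pixels."""
    return z

# Generation: "throw a dart" by sampling from the standard normal prior.
z = rng.standard_normal(latent_dim)
image = decode(z)

# Controlled synthesis: interpolate between two known latent vectors.
z_a = rng.standard_normal(latent_dim)
z_b = rng.standard_normal(latent_dim)
halfway = decode(0.5 * z_a + 0.5 * z_b)  # an image "between" the two
```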

    Real-World Impact in Image Generation Pipelines

    VAEs power modern stacks. In Stable Diffusion, a VAE compresses images to a compact latent space, letting the diffusion model operate there (far cheaper than pixel space). DALL-E uses similar priors for coherent outputs. Beyond generation, VAEs enable anomaly detection (flag poor reconstructions), compression (smaller file sizes), and editing (latent tweaks propagate globally).

  • Loss and Reparameterization

    VAEs optimize a cleverly designed loss function that balances two goals. First, reconstruction loss measures how well the decoder rebuilds the original input, often using mean squared error for images:

    L_recon = ‖x − x̂‖²

    where x is the input and x̂ the reconstruction.

    Second, KL divergence regularizes the latent distribution to stay close to a standard normal prior N(0, 1):

    L_KL = D_KL(q(z|x) ‖ p(z))

    This prevents the model from memorizing training data and encourages a structured latent space. The full VAE loss is L = L_recon + β·L_KL, where β tunes the tradeoff.
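    Assuming a diagonal Gaussian posterior (the usual choice), the KL term against N(0, 1) has a closed form, so the full loss can be sketched directly. The helper below is illustrative, not taken from any particular library:

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction (MSE) + beta * KL(N(mu, sigma^2) || N(0, 1)),
    using the closed-form KL for a diagonal Gaussian posterior."""
    recon = np.sum((x - x_hat) ** 2)
    kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)
    return recon + beta * kl

x     = np.array([0.5, 0.2])
x_hat = np.array([0.4, 0.3])
mu      = np.zeros(2)
log_var = np.zeros(2)  # sigma^2 = 1, so the KL term is exactly 0
print(vae_loss(x, x_hat, mu, log_var))  # ~0.02: pure reconstruction error
```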

    Training requires the reparameterization trick to make sampling differentiable. Instead of sampling z ~ q(z|x) directly (which breaks gradients), we compute z = μ + σ·ε, where ε ~ N(0, 1). This moves the randomness into the external noise term ε, allowing backpropagation through μ and σ.
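    A minimal sketch of the trick, using NumPy in place of an autodiff framework (so the gradient flow through μ and σ is implied rather than shown):

```python
import numpy as np

rng = np.random.default_rng(42)

def reparameterize(mu, sigma):
    """z = mu + sigma * eps, with eps ~ N(0, 1).
    All randomness lives in eps, so an autodiff framework can
    backpropagate through mu and sigma as ordinary inputs."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

mu    = np.array([1.0, -2.0])
sigma = np.array([0.5, 0.1])
z = reparameterize(mu, sigma)  # one sample from N(mu, sigma^2)
```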

    The Magic of Smooth Latent Spaces

    VAEs shine in their latent space properties. Because encodings form distributions pulled toward a Gaussian prior, the space becomes continuous and interpolable. Encode two cat images, then linearly interpolate their latent vectors – the decoder spits out convincing morphs between them. This smoothness arises directly from the KL regularization, which pulls encodings toward a shared prior so their distributions overlap rather than forming isolated clusters.

    Latent dimensions also capture semantic features. Some dimensions might encode broad categories like “animal” or “background,” while others handle details like fur texture. This structure makes VAEs ideal for tasks beyond pure reconstruction.

  • While language models have dominated AI discussions lately, the image generation revolution happening in parallel deserves equal attention. At the heart of many breakthrough image synthesis systems, from Stable Diffusion to DALL-E, lies a powerful architecture called the Variational Autoencoder (VAE). VAEs serve as the compression engine that makes modern image generation practical, transforming high-dimensional pixel data into compact latent representations that models can efficiently manipulate. Understanding VAEs isn’t just about grasping one more neural network variant; it’s about understanding the mathematical foundation that enabled AI to paint, edit, and imagine visual content at scale. Before diving into the flashier diffusion models and GANs that generate your favorite AI artwork, we need to understand how VAEs learned to compress and reconstruct the visual world.

    What Are Autoencoders, and Why Variational?

    Autoencoders represent a natural starting point for VAEs. Picture a neural network that learns to copy its input to its output, but with a twist: it must squeeze through a narrow “bottleneck” in the middle. The encoder compresses the input (say, a 64×64 image with thousands of pixels) into a low-dimensional latent vector – maybe just 128 numbers. The decoder then reconstructs the image from this vector.
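    To make the shapes concrete, here’s a minimal sketch with untrained linear maps standing in for the encoder and decoder (a real autoencoder uses learned, typically convolutional, layers):

```python
import numpy as np

rng = np.random.default_rng(0)
pixels, latent_dim = 64 * 64, 128  # 4096 pixels squeezed into 128 numbers

# Random, untrained linear maps as stand-ins for real networks.
W_enc = rng.standard_normal((latent_dim, pixels)) * 0.01
W_dec = rng.standard_normal((pixels, latent_dim)) * 0.01

image = rng.random(pixels)             # flattened 64x64 image
latent = W_enc @ image                 # bottleneck: 4096 -> 128
reconstruction = W_dec @ latent        # decode: 128 -> 4096
print(latent.shape, reconstruction.shape)  # (128,) (4096,)
```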

    Standard autoencoders work great for compression but suffer from brittle latent spaces. Small changes in input lead to wildly different encodings, making the space unusable for generation or interpolation. VAEs fix this by making the encoding probabilistic. Instead of outputting a single point, the encoder produces parameters for a distribution (typically a multivariate Gaussian, defined by mean μ and variance σ²). Sampling from this distribution gives the latent vector, ensuring nearby inputs map to nearby distributions in a smooth, continuous space.