When you read any kind of text, you’re able to quite naturally understand what’s written, without giving it much active thought. Watch someone learning a new language, however, and you’ll see that they read a sentence by breaking it down – usually word by word, sometimes splitting larger words even further.
Similarly, LLMs break down their text inputs into smaller parseable units called tokens. Your first thought might be to break texts into individual words, and that’s valid! Termed “word tokenization”, it’s a well-known tokenization strategy. However, consider the words “running”, “runner”, and “runners”. When you think about these words, you probably don’t treat them as entirely separate. You identify the root of the word – “run” – and recognize that it’s combined with suffixes that slightly modify its meaning.
For this reason, subword tokenization is the dominant tokenization method for LLMs. As the name suggests, tokens obtained via this method can be smaller than an entire word – often word roots, prefixes, and suffixes, as described above.
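To make this concrete, here’s a small sketch using the open-source tiktoken library (any trained subword tokenizer would behave similarly). The exact splits are illustrative only – they depend entirely on the vocabulary that a particular tokenizer was trained with.

```python
# pip install tiktoken
import tiktoken

# A tokenizer vocabulary used by several OpenAI models; any subword tokenizer works here.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["running", "runner", "runners"]:
    ids = enc.encode(word)                    # token IDs
    pieces = [enc.decode([i]) for i in ids]   # the text each token covers
    print(word, "->", pieces)

# Depending on the learned vocabulary, common words may stay whole,
# while rarer forms get broken into pieces like "run" + "ners".
```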
How tokenization plays into LLMs
LLMs are designed with a certain vocabulary size in mind. This determines the number of distinct tokens the model can recognize. Before an LLM reads data, an algorithm called a tokenizer breaks the text down into tokens. The tokenizer is trained to generate a token vocabulary of a specified size that fits the data it’s expected to handle. In case the LLM encounters unexpected text that doesn’t fit word or even subword tokens (such as misspelled words), byte tokens are often added to the vocabulary as well, so that any input can still be tokenized. These tokens represent a single byte of data – quite the granular division!
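Here’s a toy sketch of that fallback idea, using a hypothetical hand-picked vocabulary and greedy longest-match lookup. Production tokenizers learn their vocabularies from data with algorithms like BPE, but the byte-fallback behaviour is analogous.

```python
# Hypothetical vocabulary: a few subword strings plus one token per possible byte.
subword_vocab = ["run", "ning", "ner", "s", " "]
byte_vocab = [f"<0x{b:02X}>" for b in range(256)]   # 256 byte tokens
vocab = {tok: i for i, tok in enumerate(subword_vocab + byte_vocab)}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization, falling back to byte tokens."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest subword that matches at position i.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # No subword matched: emit one byte token per UTF-8 byte.
            for b in text[i].encode("utf-8"):
                ids.append(vocab[f"<0x{b:02X}>"])
            i += 1
    return ids

print(tokenize("running runners"))  # covered entirely by subword tokens
print(tokenize("runnning"))         # the typo's extra 'n' falls back to a byte token
```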
Since an LLM’s vocabulary is fixed after the tokenizer is trained, each token is assigned a unique numerical token ID. When text is broken down into tokens, those tokens are represented by their IDs. Next, LLMs maintain an embedding matrix that maps each token ID to its corresponding token embedding – an embedding that solely represents that token. This way, tokens can be quickly converted into token embeddings for the LLM to use.
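As a rough illustration of that lookup, here’s a NumPy sketch with made-up sizes: the embedding matrix has one row per token ID, so converting IDs to embeddings is just row indexing.

```python
import numpy as np

# Made-up sizes: a 10-token vocabulary and 4-dimensional embeddings.
vocab_size, d_model = 10, 4
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, d_model))  # one row per token ID

token_ids = [3, 7, 7, 1]                        # IDs produced by the tokenizer
token_embeddings = embedding_matrix[token_ids]  # rows selected by ID
print(token_embeddings.shape)                   # (4, 4): one embedding per token
```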