quotient space

Originally, tokenizers had a fixed vocabulary. If a word was not in the vocabulary list, it would be represented by something like [UNK]. Many modern tokenizers now operate on representation of text as bytes. If combinations of bytes is very common, those bytes (two or more) would be combined into a single token.

The Enter key, Spacebar, and other forms of whitespace also correspond to different combinations of bytes. The tokenizer can choose to preserve or combine these.

You might have seen strange text like this before. Regular characters can be combined with modifiers to add accents for example, or, whatever this is. Since these combinations are highly unusual, the Tokenizer will generally default to splitting up the characters and modifiers into their own tokens.

Many emojis are composed of multiple codepoints joined by a Zero-Width Joiner (ZWJ). For example, a Black Cat is often represented as a sequence of a standard Cat emoji and a Black Square. Depending on the tokenizer's vocabulary, these may be seen as one unit or distinct fragments.

Adding hidden characters is one of an increasing number of attack vectors for LLM based applications. They may bypass filtering systems by splitting up a word. Alternatively, they may correspond to letters that take up zero space, inserting hidden text. Input sanitization is therefore essential for applications using LLMs.

By adding 'Tools' to an LLM, we are frontloading the prompt with a bunch of data. The "agent" detects when the model has output a command. Runs the action, and reruns the conversation with the output of the action.

Multi-Modal inputs can look quite different to regular text. In this case, the engine takes the image, splits it up into "patches" and runs the patches through an encoder, creating an embedding. The word tokens are also turned into embeddings, and the combination of language and image is processed together. This approach is often called "early fusion"

Tokenization Sandbox

Overview

Prompt Template (Jinja)

Context (JSON Payload)

Raw Engine AST Output

Processed Transformer sequence