Theo Green explains tokens, limits and how to make AI context windows work for you when writing queries and other prompts for AI systems
An AI model’s context window is essentially its short-term memory. It is the block of text, measured in tokens, that the model can “see” and condition on while writing an answer to the prompt it has been given.
Context windows determine whether an AI can work with an entire legal brief, a multi-file codebase, or only the last few lines of the ongoing conversation with the current user. Bigger windows allow models to tackle larger, more complex tasks, so the size of the window must be considered whenever users plan to feed large quantities of data into an AI model.
Tokens: the unit that matters
Tokens are the basic blocks that AI models use to represent text. A token might be a full word, part of a word, punctuation, or even a single character, depending on which tokeniser (the engine used to convert text into tokens) is used.
Different tokeniser designs will make different choices about how to divide text into tokens. Typically, larger vocabularies will keep common words intact while splitting rare or compound words. This directly impacts the number of tokens that a piece of text will be turned into.
In English, a general rule of thumb is that one token corresponds to about four characters, or roughly three-quarters of an average word. This means that 1,000 tokens are equivalent to about 750 words. So if someone wants to know how big their prompt is, they should count tokens, not merely the words that make up the prompt. Counting tokens means measuring how many individual text units (whole words, sub-word pieces such as prefixes or plural endings, and punctuation marks) a language model has to process in a given prompt.
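For anyone who wants an exact figure rather than an estimate, a token counter makes this concrete. The sketch below assumes the open-source tiktoken library and one of its publicly documented encodings; the tokeniser behind your particular model may count slightly differently.

```python
# A minimal token-counting sketch, assuming tiktoken is installed
# (pip install tiktoken); cl100k_base is one widely used encoding.
import tiktoken

text = "Context windows limit how much text a model can see at once."

encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode(text)

print(f"Words:  {len(text.split())}")
print(f"Tokens: {len(tokens)}")
print(f"Rule-of-thumb estimate (~4 chars/token): {len(text) // 4}")
```

Counting this way during development avoids surprises when a prompt that looks short in words turns out to be much longer in tokens.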
Token counts matter because they have a direct impact on the outputs the AI will produce: context windows limit the amount of information the model will process. Research and engineering work, for instance, may require context lengths of hundreds of thousands or even millions of tokens. Not all AI models can handle this effectively, meaning that in some cases parts of the data fed into the context window will simply be truncated or ignored.
Different engines, different window sizes
AI model vendors publish context window sizes to promote the power of their models. Vendors such as Anthropic and OpenAI, for example, now advertise models whose peak context lengths reach into the hundreds of thousands of tokens, and in some cases configurations supporting contexts of over 1,000,000 tokens have been announced. However, these are generally tied to specific model versions, tiers or preview programs rather than being universally available endpoints.
Context window sizes vary widely between different models due to a variety of factors, including the model’s architecture, tokeniser differences, and deployment choices.
Architecture and training: Some models are built around mechanisms that support longer text sequences, such as “sparse attention” (where the model focuses on only the relevant parts of the input rather than every single token), “positional encoding” schemes (which record where each token sits in the sequence so the model can keep track of order across long inputs), or chunked attention (where large documents are broken down into smaller pieces for analysis).
Tokeniser variation: Two prompts with the same word count may represent different numbers of tokens because the tokenisers used have split the text differently. Different tokenisers use different algorithms, such as Byte-Pair Encoding or WordPiece.
Byte-Pair Encoding (BPE) is a sub-word tokenisation method, derived from a text compression technique, that iteratively merges the most frequent pairs of characters or character sequences to build a vocabulary, while WordPiece is a related sub-word tokenisation technique that splits words into smaller units based on frequency, with the aim of balancing vocabulary size and coverage.
As well as using different algorithms, different tokenisers are trained on different text corpora and vocabularies. This means they may choose different places to cut words, punctuation and whitespace into the sub-word pieces that make up the tokens, as the sketch after this list illustrates.
Product parameters: Vendors may offer multiple model variants, including faster, lower cost and larger models. Context window sizes will vary between these.
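To see tokeniser variation in practice, the short sketch below runs the same sentence through two publicly documented encodings via tiktoken; the encodings named here are convenient illustrations, not a statement about any particular product.

```python
# A sketch comparing how two encodings split the same text, assuming
# tiktoken is installed; different vocabularies produce different counts.
import tiktoken

text = "Tokenisation of uncommonly hyphenated, domain-specific terminology."

for name in ("gpt2", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{name}: {len(ids)} tokens -> {pieces}")
```

The same prompt can therefore fit comfortably within one model’s window and push against another’s.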
Is the advertised size realistic in practice?
The advertised context window sizes are real, at least in theory. But there are certain caveats that are important to consider:
Theoretical vs practical: A vendor might demonstrate a model running well with 200k tokens or even 1 million tokens in controlled tests or for select customers. These demonstrations are often staged in controlled environments, with the scenario planned so that the model performs at its best. Whilst it is possible for the model to achieve this in the right circumstances, in practice it is far from guaranteed that the model will reliably support that size of context window in every region or product tier.
Product limits, bugs and routing: Community reports and provider documentation show many cases where calls hit limits smaller than those advertised, falling well short of the larger limit that has been claimed.
Performance and quality: Even if a model accepts a million tokens, performance (coherence, latency, hallucination rate and so on) may degrade with scale: as the number of tokens grows, responses can become less coherent, slower and less accurate unless the model was explicitly trained and tuned for that regime.
Large windows are not a magic bullet: thoughtful engineering and prompt design still matter, and are arguably more important in some respects, especially for the quality of the end user’s experience.
Careful prompt engineering gives the model clarity, context and specificity, guiding it towards accurate, relevant and coherent responses. Thoughtful prompt design involves understanding the model’s behaviour and tailoring prompts to provide clear instructions, context and constraints. It also means experimenting with different formulations, adjusting for token efficiency and iterating based on feedback to achieve the most accurate and relevant outputs.
Why context window size matters to users
Context window size matters to users in many ways. It affects how much information can be analysed in one pass, as well as the cost of buying and operating the model, the speed with which outputs are delivered and the ease with which end users can interact with the system.
Completeness: If your goal is to analyse an entire book, a long contract, or a multi-file repository in one pass, you need a large enough window to keep all relevant content in view. Smaller windows force document chunking, which adds complexity and can reduce the coherence of the overall result.
Cost and latency: Processing larger contexts uses more memory and compute, so models typically cost more to buy and operate, outputs take longer to arrive and the environmental impact is far greater. Even when available, ultra-long contexts should be justified by their value and shouldn’t be wasted on smaller projects that don’t need the capability.
Tooling and user experience: Chat histories, system messages, plugin/tool descriptions and function definitions all consume the space available for tokens. If you don’t manage them, you’ll hit limits unexpectedly, which can reduce quality or cut the response off entirely.
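A rough budget check can catch this before it becomes a problem. The sketch below uses the approximately four characters per token rule of thumb from earlier in the article; the window size, messages and tool description are illustrative assumptions, not real product limits.

```python
# A rough context-budget check using the ~4 characters per token rule of
# thumb; every figure and message here is illustrative only.
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

CONTEXT_LIMIT = 128_000  # assumed window size for this sketch

system_message = "You are a contract-review assistant. Answer concisely."
tool_description = "search_clauses(query): returns matching contract clauses."
chat_history = ["User: Summarise clause 4.", "Assistant: Clause 4 covers..."]
new_prompt = "Now compare clause 4 with clause 9."

used = sum(approx_tokens(t) for t in
           [system_message, tool_description, new_prompt, *chat_history])

print(f"Approximate tokens used: {used} of {CONTEXT_LIMIT}")
if used > CONTEXT_LIMIT * 0.8:
    print("Warning: close to the limit; summarise or trim the history.")
```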
Practical ways to optimise usage
Most people don’t need a million-token window to be effective or to achieve their aims. There are a few tactics that can be used to get the most out of the token limits that you are working with.
Count and compress: Use token counters during development. Reduce useless or unhelpful text, compress the text you have by summarising it, and try to use concise system and instruction messages. (OpenAI and other vendors publish token-counting guidance that can help.)
Summarise and roll up: Keep a concise summary of earlier conversation or document sections and include that summary in prompts instead of the full text, updating it as you go. This keeps prompts far shorter and helps you track how the AI’s outputs are adapting and changing, and how you can tailor them to your needs.
Retrieval-augmented generation (RAG): Store long documents as embeddings and retrieve only the most relevant passages for the current query (see the first sketch after this list). This gives effectively unlimited memory without blowing the context window, and it can also help keep you organised and make key passages easier to locate when you specifically need them. In basic terms, this means letting the model look things up rather than relying just on the information it was trained on.
Chunk intelligently and stitch: When chunking is necessary, chunk by semantic boundaries such as sections, functions or chapters, and include overlap or “stitching” prompts at the start and end of each chunk to ensure continuity (see the second sketch after this list). This helps the AI connect sections together, producing output with more seamless transitions that reads and is understood more easily.
Use model-specialised endpoints: If your vendor provides a long-context variant, reserve it for tasks that require it (such as large-codebase analysis or legal document review) and use cheaper/shorter models for routine tasks like chat. This not only saves money; it also makes better use of each model, keeping output accuracy and quality high.
Avoid massive tool payloads: Don’t send huge function or tool descriptions with every request; keep them short and include only the fields the model actually needs. Very large payloads can cause errors, slow responses or unexpected failures, even on models that advertise long context windows.
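The first sketch below illustrates the retrieval idea behind the RAG tactic above. The embed() function is a hypothetical placeholder for whichever embedding model you choose; the point is that only the highest-scoring passages, not the whole document, end up in the prompt.

```python
# A minimal retrieval sketch: embed passages, score them against the query
# by cosine similarity, and keep only the top few for the prompt.
# embed() is a hypothetical placeholder for a real embedding model.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: call your embedding model of choice here."""
    raise NotImplementedError

def top_passages(query: str, passages: list[str], k: int = 3) -> list[str]:
    query_vec = embed(query)
    scored = []
    for passage in passages:
        vec = embed(passage)
        similarity = float(np.dot(query_vec, vec)
                           / (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        scored.append((similarity, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:k]]

# The prompt then contains only the retrieved extracts, for example:
# prompt = "Answer using these extracts:\n" + "\n---\n".join(
#     top_passages(question, document_chunks))
```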
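The second sketch covers the chunking tactic: splitting on paragraph boundaries and carrying a small overlap from one chunk into the next so the model can stitch sections back together. Sizes are measured in characters purely to keep the example simple.

```python
# A sketch of chunking by paragraph boundaries with overlap: split on blank
# lines and carry the last paragraph of each chunk into the next for continuity.
def chunk_text(text: str, max_chars: int = 4000,
               overlap_paragraphs: int = 1) -> list[str]:
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for para in paragraphs:
        if current and length + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap_paragraphs:]  # overlap with next chunk
            length = sum(len(p) for p in current)
        current.append(para)
        length += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```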
Don’t chase size, optimise usage
Vendors keep pushing the upper bounds of context windows, so remember that advertised maxima are engineering milestones, not unconditional guarantees that will be available to the average user.
For most users, smart prompt engineering (summaries, retrieval, compression, and selective long-model usage) yields far more practical value than simply using the largest token counts available online. When you do need truly enormous context lengths, verify availability on your chosen endpoint, test for quality and failure modes, and be ready to balance cost, latency and accuracy.
Theo Green is a freelance writer