Understanding AI Tokens and How Pricing Really Works

Every time you call an LLM API, you're charged for tokens. But what is a token, how are tokens counted, and how do you keep your bill under control?

What Is a Token?

A token is the basic unit an LLM processes. It's roughly 4 characters of English text, or about ¾ of a word. Punctuation, spaces, and special characters are often their own tokens.

Examples (approximate; exact counts depend on the tokeniser):
- "Hello" → 1 token
- "Hello, world!" → 4 tokens
- "supercalifragilistic" → 6 tokens
- A typical 1,000-word article → ~1,300 tokens
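
When no tokeniser is to hand, the rules of thumb above can be turned into a quick estimate. This is only a rough sketch for English text; the function names are made up for illustration, and real counts will differ from model to model.

```python
import math

def estimate_tokens(text: str) -> int:
    """Rough estimate for English text: about 4 characters per token."""
    return math.ceil(len(text) / 4)

def estimate_tokens_from_words(word_count: int) -> int:
    """Rough estimate from a word count: about 3/4 of a word per token."""
    return math.ceil(word_count / 0.75)

print(estimate_tokens("Hello, world!"))   # 13 characters -> 4
print(estimate_tokens_from_words(1000))   # -> 1334
```

Use an estimate like this for budgeting only; for billing-grade numbers, count with the provider's own tokeniser.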

Non-English text often tokenises less efficiently. Depending on the tokeniser, a sentence in Arabic or Chinese may use 2-3× as many tokens as the same idea in English, and therefore cost 2-3× as much.

Input vs Output Tokens

Most providers charge separately for input (what you send) and output (what the model generates). Prices are in USD and change often, so check each provider's pricing page before budgeting:

Model             Input per 1M tokens   Output per 1M tokens
GPT-4o            $5.00                 $15.00
Claude 3 Sonnet   $3.00                 $15.00
Gemini 1.5 Pro    $1.25                 $5.00

Output tokens typically cost more than input tokens (3-5× in the table above) because the model generates them one at a time, whereas the whole input can be processed in parallel.
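
To see what a single request costs, multiply each token count by its per-million price. The sketch below uses the figures from the table above as placeholder data; the dictionary keys are illustrative labels, not necessarily the exact model IDs an API expects.

```python
# Illustrative prices in USD per 1M tokens (input, output), taken from the
# table above. Treat them as placeholders: providers change prices often.
PRICES = {
    "gpt-4o": (5.00, 15.00),
    "claude-3-sonnet": (3.00, 15.00),
    "gemini-1.5-pro": (1.25, 5.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, billing input and output separately."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

cost = request_cost("gpt-4o", input_tokens=2_000, output_tokens=500)
print(f"${cost:.4f}")  # $0.0175
```

Note that the 500 output tokens account for $0.0075 of that total, almost as much as the 2,000 input tokens.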

Context Window

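The context window is the maximum number of tokens the model can "see" at once: your prompt, the conversation history, and the response being generated all count towards it. Most APIs reject a request that exceeds the window with an error; some chat applications instead silently drop or summarise the oldest content.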

- GPT-4o: 128k tokens (~100k words)
- Claude 3.5 Sonnet: 200k tokens
- Gemini 1.5 Pro: 1M tokens
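
One way an application can keep a long conversation inside the window is to drop the oldest turns itself before sending. The sketch below is one possible approach, not a provider feature; `trim_history` and `rough_count` are hypothetical helpers, and the 4-characters-per-token counter is a stand-in you would replace with a real tokeniser.

```python
def trim_history(messages, max_tokens, count_tokens):
    """Keep the newest messages whose combined size fits in max_tokens.

    messages: list of strings, oldest first.
    count_tokens: any function mapping a string to a token count.
    """
    kept = []
    total = 0
    # Walk from newest to oldest so the most recent turns survive.
    for message in reversed(messages):
        cost = count_tokens(message)
        if total + cost > max_tokens:
            break
        kept.append(message)
        total += cost
    kept.reverse()  # restore oldest-first order
    return kept

def rough_count(text):
    """Stand-in counter (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)

history = ["a" * 400, "b" * 200, "c" * 100]          # roughly 100, 50, 25 tokens
print(len(trim_history(history, 80, rough_count)))   # 2
```

Remember to leave headroom in `max_tokens` for the response, since generated tokens share the same window.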

Counting Tokens Before You Pay

import tiktoken   # OpenAI's tokeniser; other providers' counts will differ

# cl100k_base is the GPT-4 / GPT-3.5 encoding; GPT-4o uses "o200k_base"
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Your prompt text here")
print(f"Token count: {len(tokens)}")

Cost-Reduction Strategies

1. Shorter system prompts. Every API call includes the system prompt. Trim it ruthlessly.

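2. Cache expensive prompts. Anthropic and OpenAI offer prompt caching: a repeated prompt prefix is cached and billed at a reduced input rate. The discount varies by provider and model. Anthropic, for example, bills cache reads at 10% of the normal input price but adds a surcharge for writing the cache.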
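
A little arithmetic shows how much caching can save on a long, repeated prefix. The sketch below assumes cache reads are billed at 10% of the normal input price and the first write at 125%, which matches figures Anthropic has published; other providers use different rates, and the calculation also assumes the cache never expires between calls. The function names are illustrative.

```python
def uncached_input_cost(prefix_tokens, calls, price_per_m):
    """Input cost in USD of sending the same prefix `calls` times, no caching."""
    return prefix_tokens * price_per_m * calls / 1_000_000

def cached_input_cost(prefix_tokens, calls, price_per_m,
                      write_multiplier=1.25, read_multiplier=0.10):
    """Same workload with caching: one cache write, then cache reads.

    The multipliers are assumptions; check your provider's pricing.
    """
    if calls <= 0:
        return 0.0
    write = prefix_tokens * price_per_m * write_multiplier
    reads = prefix_tokens * price_per_m * read_multiplier * (calls - 1)
    return (write + reads) / 1_000_000

# A 10,000-token system prompt sent 100 times at $3.00 per 1M input tokens.
print(f"{uncached_input_cost(10_000, 100, 3.00):.4f}")  # 3.0000
print(f"{cached_input_cost(10_000, 100, 3.00):.4f}")    # 0.3345
```

Under these assumptions caching pays for itself from the second call onwards, because one write plus one read (1.35× the prefix price) is already cheaper than two uncached sends (2×).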

3. Use a smaller model for simple tasks.
- GPT-4o-mini: $0.15/1M input tokens, well suited to classification and extraction
- Claude 3 Haiku: $0.25/1M input tokens, fast and cheap
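
In practice this often becomes a small routing function in front of your API calls. The sketch below is deliberately simple: the task labels and the `pick_model` helper are hypothetical, and a real router might look at prompt length or a confidence score instead.

```python
# Hypothetical task labels, for illustration only.
SIMPLE_TASKS = {"classification", "extraction"}

def pick_model(task_type: str,
               small: str = "gpt-4o-mini",
               large: str = "gpt-4o") -> str:
    """Send simple, well-defined tasks to the cheaper model."""
    return small if task_type in SIMPLE_TASKS else large

print(pick_model("classification"))  # gpt-4o-mini
print(pick_model("legal-analysis"))  # gpt-4o
```

Defaulting unknown task types to the larger model trades some cost for safety; flip the default if most of your traffic is simple.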

4. Stream responses and stop early.

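import anthropic
client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
text = ""
with client.messages.stream(...) as stream:   # pass model, max_tokens, messages
    for delta in stream.text_stream:          # each delta is only the newest fragment
        text += delta
        if len(text) > 500:
            break   # leaving the with-block closes the stream and stops generation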

5. Compress your context. Instead of sending entire documents, send only the relevant sections after retrieval (RAG).
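
The selection step can be sketched without any external services. The example below ranks sections by how many words they share with the question, which is a crude stand-in for the embedding search a real RAG pipeline would use; `words` and `top_sections` are hypothetical helpers and the sample sections are invented.

```python
import re

def words(text):
    """Lowercase word set; a crude stand-in for real embeddings."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def top_sections(question, sections, k=2):
    """Return the k sections sharing the most words with the question."""
    q = words(question)
    ranked = sorted(sections, key=lambda s: len(q & words(s)), reverse=True)
    return ranked[:k]

sections = [
    "Refunds are processed within 14 days of the return request.",
    "Our office is closed on public holidays.",
    "To request a refund, open the order page and choose Return.",
]

# Only the selected sections go into the prompt, not the whole document.
context = "\n\n".join(top_sections("How do I request a refund?", sections))
print(context)
```

Sending two short sections instead of the full document cuts input tokens on every call, and the saving grows with document size.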

Understanding tokens is the foundation of building cost-efficient AI apps.