Understanding AI Tokens and How Pricing Really Works
Every time you call an LLM API you're charged for tokens. But what's a token, how are they counted, and how do you keep your bill under control?
What Is a Token?
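A token is the basic unit an LLM processes. It's roughly 4 characters of English text, or about ¾ of a word. Punctuation and special characters are often their own tokens, while a leading space is usually folded into the word that follows it.
Examples (approximate; exact counts depend on the tokeniser):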
- "Hello" → 1 token
- "Hello, world!" → 4 tokens
- "supercalifragilistic" → 6 tokens
- A typical 1,000-word article → ~1,300 tokens
Non-English text tokenises less efficiently. A sentence in Arabic or Chinese may use 2-3× as many tokens as the same idea in English.
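The rules of thumb above can be turned into a quick estimator for English text. This is only a rough sketch: `estimate_tokens` is a hypothetical helper, not part of any library, and it takes the larger of the two heuristics (4 characters per token, ¾ of a word per token) to err on the high side. A real tokeniser will give different numbers, especially for non-English text.

```python
import math

def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text (heuristic, not a real tokeniser)."""
    by_chars = math.ceil(len(text) / 4)             # ~4 characters per token
    by_words = math.ceil(len(text.split()) / 0.75)  # ~0.75 words per token
    return max(by_chars, by_words)                  # take the more pessimistic guess

article = "word " * 1000          # stand-in for a 1,000-word article
print(estimate_tokens(article))   # 1334
```

The result lands close to the ~1,300 tokens quoted for a typical 1,000-word article, which is about as much accuracy as a heuristic like this can offer.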
Input vs Output Tokens
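Most providers charge separately for input (what you send) and output (what the model generates). The list prices below are in USD per 1M tokens; providers revise them often, so check the current pricing pages before budgeting:
| Model | Input per 1M tokens | Output per 1M tokens |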
|---|---|---|
| GPT-4o | $5.00 | $15.00 |
| Claude 3 Sonnet | $3.00 | $15.00 |
| Gemini 1.5 Pro | $1.25 | $5.00 |
Output tokens usually cost more than input tokens because the model generates them one at a time, with a separate forward pass for each, whereas the whole prompt can be processed in parallel.
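To make the arithmetic concrete, here is a minimal sketch that prices a single request from its input and output token counts. The `PRICES` dictionary and `request_cost` function are illustrative names, and the figures are simply copied from the table above, so they may be out of date.

```python
# USD per 1M tokens as (input, output), copied from the table above.
# These are assumptions for illustration; check current pricing before relying on them.
PRICES = {
    "gpt-4o": (5.00, 15.00),
    "claude-3-sonnet": (3.00, 15.00),
    "gemini-1.5-pro": (1.25, 5.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one request."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

cost = request_cost("gpt-4o", input_tokens=2_000, output_tokens=500)
print(f"${cost:.4f}")  # $0.0175
```

Note that the 500 output tokens account for $0.0075 of the total, almost as much as the 2,000 input tokens, which is why long responses dominate many bills.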
Context Window
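The context window is the maximum number of tokens the model can "see" at once. It covers your prompt, any earlier conversation turns you send, and the response being generated. What happens when you exceed it depends on the interface: APIs generally reject the request with an error, while chat applications often drop or summarise the oldest turns for you.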
- GPT-4o: 128k tokens (~100k words)
- Claude 3.5 Sonnet: 200k tokens
- Gemini 1.5 Pro: 1M tokens
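One common way to stay inside the window is to trim the conversation history yourself before each call, keeping the newest messages and leaving room for the reply. The sketch below is one possible approach, not a provider API: `trim_history` and `rough_count` are hypothetical names, and the counter is the 4-characters-per-token heuristic rather than a real tokeniser.

```python
from typing import Callable, List

def trim_history(messages: List[str], count_tokens: Callable[[str], int],
                 window: int, reserve_for_output: int) -> List[str]:
    """Keep the newest messages that fit in (window - reserve_for_output) tokens."""
    budget = window - reserve_for_output
    kept: List[str] = []
    used = 0
    for message in reversed(messages):      # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget:
            break                           # this and all older messages are dropped
        kept.append(message)
        used += cost
    kept.reverse()                          # restore chronological order
    return kept

# Hypothetical counter: the ~4 characters per token rule of thumb.
rough_count = lambda text: len(text) // 4

history = ["a" * 40, "b" * 40, "c" * 40]    # about 10 tokens each
print(len(trim_history(history, rough_count, window=30, reserve_for_output=10)))  # 2
```

In practice you would pass a real token counter and the model's actual window size, and you would usually keep the system prompt outside the list being trimmed.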
Counting Tokens Before You Pay
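```python
import tiktoken  # OpenAI's tokeniser; other providers use their own tokenisers

# cl100k_base is the GPT-4 / GPT-3.5 encoding; GPT-4o uses "o200k_base".
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Your prompt text here")
print(f"Token count: {len(tokens)}")
```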
Cost-Reduction Strategies
1. Shorter system prompts. Every API call includes the system prompt. Trim it ruthlessly.
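2. Cache expensive prompts. Anthropic and OpenAI offer prompt caching: a repeated prompt prefix is cached and billed at a discount on later calls. The discount varies by provider and model. At the time of writing, Anthropic bills cache reads at about 10% of the normal input price but adds a surcharge when the cache is written, and OpenAI's discount on cached input for GPT-4o has been 50%.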
3. Use a smaller model for simple tasks.
- GPT-4o-mini: $0.15 per 1M input tokens, well suited to classification and extraction
- Claude 3 Haiku: $0.25 per 1M input tokens, fast and cheap
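4. Stream responses and stop early. Accumulate the streamed text and close the stream once you have enough. The snippet below assumes the Anthropic Python SDK. Whether an early close reduces the billed output tokens depends on the provider, so setting a max_tokens limit is the more dependable cap.
```python
# Assumes `client = anthropic.Anthropic()`; fill in model, max_tokens and messages.
collected = ""
with client.messages.stream(...) as stream:
    for text in stream.text_stream:
        collected += text  # accumulate: each chunk is only a few characters
        if len(collected) > 500:
            break  # leaving the with-block closes the stream
```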
5. Compress your context. Instead of sending entire documents, send only the relevant sections after retrieval (RAG).
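The effect of strategies 2 and 3 can be estimated with simple arithmetic. The sketch below is illustrative only: `monthly_input_cost` is a hypothetical helper, the traffic figures are made up, the prices are the ones quoted in this article, and the 10% cached rate is an assumption that ignores any cache-write surcharge.

```python
def monthly_input_cost(calls: int, prompt_tokens: int, price_per_million: float,
                       cached_tokens: int = 0, cached_rate: float = 0.10) -> float:
    """Input-side cost in USD for `calls` requests.

    `cached_tokens` of each prompt are billed at `cached_rate` times the normal
    price; the rest are billed in full. Cache-write surcharges are ignored.
    """
    full_price_tokens = prompt_tokens - cached_tokens
    billable = full_price_tokens + cached_tokens * cached_rate
    return calls * billable * price_per_million / 1_000_000

# Made-up workload: 100,000 calls a month, 3,000 prompt tokens each.
baseline = monthly_input_cost(100_000, 3_000, 5.00)
with_cache = monthly_input_cost(100_000, 3_000, 5.00, cached_tokens=2_000)
small_model = monthly_input_cost(100_000, 3_000, 0.15)

print(f"baseline:    ${baseline:,.2f}")
print(f"with cache:  ${with_cache:,.2f}")
print(f"small model: ${small_model:,.2f}")
```

Under these assumptions, caching a 2,000-token shared prefix cuts the input bill from $1,500 to $600, and moving the same traffic to the smaller model brings it down to $45. Output costs are left out here and would need to be added separately.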
Understanding tokens is the foundation of building cost-efficient AI apps.