This OpenAI Token Counter uses JTokkit v1.1.0, an open-source Java implementation of OpenAI's tiktoken.
Maven dependency: com.knuddels:jtokkit:1.1.0
JTokkit is developed and maintained by Knuddels GmbH and is available under the MIT license.
Implementation Details
This token counter is built using:
// Gradle dependency
implementation 'com.knuddels:jtokkit:1.1.0'
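<!-- Maven equivalent of the Gradle line above (same com.knuddels:jtokkit:1.1.0 coordinates) -->
<dependency>
    <groupId>com.knuddels</groupId>
    <artifactId>jtokkit</artifactId>
    <version>1.1.0</version>
</dependency>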
// Token counting implementation
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;
import com.knuddels.jtokkit.api.ModelType;
// Usage example
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding encoding = registry.getEncoding(EncodingType.CL100K_BASE);
String text = "Hello world";
int tokenCount = encoding.countTokens(text); // 2 tokens with cl100k_base
JTokkit provides the same tokenization as OpenAI's official tiktoken library, ensuring accurate token counts.
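Because the ModelType import is part of the setup above, the registry can also resolve the right encoding for a specific model instead of hard-coding an encoding type. A minimal sketch (the set of ModelType constants available depends on your JTokkit version):
// Resolve the encoding for a specific model (reuses the registry from the usage example above)
Encoding gpt4Encoding = registry.getEncodingForModel(ModelType.GPT_4); // resolves to cl100k_base
int tokens = gpt4Encoding.countTokens("How many tokens is this sentence?");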
OpenAI Token Counter Guide
What is tokenization?
Tokenization is the process of breaking text into smaller units called tokens. OpenAI models process text as tokens, not individual characters or words. Tokens can be as short as one character or as long as one word (e.g., "a" or "apple"). Common words are often a single token, while less common words might be split into multiple tokens.
Important: Different models use different tokenizers, and your token count directly affects:
- API costs (you pay per token for both input and output)
- Context window usage (all models have a maximum token limit)
- Processing speed (more tokens take longer to process)
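To make the idea concrete, here is a small sketch using the JTokkit setup from the implementation section; it encodes a sentence into token IDs and decodes them back. In JTokkit 1.x, encode() returns an IntArrayList of token IDs; the exact IDs and counts depend on the encoding.
// Turn a sentence into token IDs and back again (cl100k_base)
// requires: import com.knuddels.jtokkit.api.IntArrayList;
Encoding enc = Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);
IntArrayList ids = enc.encode("Tokenization breaks text into tokens.");
System.out.println(ids.size() + " tokens");   // the count the model is billed for
System.out.println(enc.decode(ids));          // decodes back to the original text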
Encoding Types
JTokkit provides several encoding types that match OpenAI's tokenizers:
Encoding | Description | Used By |
---|---|---|
cl100k_base | Newer tokenizer with improved handling of code and non-English languages | GPT-4, GPT-3.5-Turbo, text-embedding-ada-002 |
p50k_base | Improved vocabulary with better handling of code | GPT-3 davinci, curie, babbage, ada; Codex models |
p50k_edit | Specialized for edit models | text-davinci-edit-001, code-davinci-edit-001 |
r50k_base | Original GPT-3 tokenizer with basic vocabulary | Older davinci, curie, babbage, ada models |
o200k_base | Latest tokenizer, optimized for GPT-4o models | GPT-4o, GPT-4o-mini |
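Because each encoding has its own vocabulary, the same text can yield different token counts. A small sketch that compares the encodings from the table on one sample string (the printed counts are illustrative and depend on the input):
// Compare token counts across all encodings for the same text
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
String sample = "def add(a, b):\n    return a + b";
for (EncodingType type : EncodingType.values()) {
    Encoding enc = registry.getEncoding(type);
    System.out.println(type + ": " + enc.countTokens(sample) + " tokens");
}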
Models and Context Windows
Each OpenAI model has a maximum context window (number of tokens it can process in one request):
Model | Max Tokens | Encoding Type |
---|---|---|
gpt-4 | 8,192 | cl100k_base |
gpt-4o | 128,000 | o200k_base |
gpt-4o-mini | 128,000 | o200k_base |
gpt-4-32k | 32,768 | cl100k_base |
gpt-4-turbo | 128,000 | cl100k_base |
gpt-3.5-turbo | 16,385 | cl100k_base |
gpt-3.5-turbo-16k | 16,385 | cl100k_base |
text-davinci-003 | 4,097 | p50k_base |
text-davinci-002 | 4,097 | p50k_base |
text-davinci-001 | 2,049 | r50k_base |
text-curie-001 | 2,049 | r50k_base |
text-babbage-001 | 2,049 | r50k_base |
text-ada-001 | 2,049 | r50k_base |
davinci | 2,049 | r50k_base |
curie | 2,049 | r50k_base |
babbage | 2,049 | r50k_base |
ada | 2,049 | r50k_base |
code-davinci-002 | 8,001 | p50k_base |
code-davinci-001 | 8,001 | p50k_base |
code-cushman-002 | 2,048 | p50k_base |
code-cushman-001 | 2,048 | p50k_base |
davinci-codex | 4,096 | p50k_base |
cushman-codex | 2,048 | p50k_base |
text-davinci-edit-001 | 3,000 | p50k_edit |
code-davinci-edit-001 | 3,000 | p50k_edit |
text-embedding-ada-002 | 8,191 | cl100k_base |
text-embedding-3-small | 8,191 | cl100k_base |
text-embedding-3-large | 8,191 | cl100k_base |
text-similarity-davinci-001 | 2,046 | r50k_base |
text-similarity-curie-001 | 2,046 | r50k_base |
text-similarity-babbage-001 | 2,046 | r50k_base |
text-similarity-ada-001 | 2,046 | r50k_base |
text-search-davinci-doc-001 | 2,046 | r50k_base |
text-search-curie-doc-001 | 2,046 | r50k_base |
text-search-babbage-doc-001 | 2,046 | r50k_base |
text-search-ada-doc-001 | 2,046 | r50k_base |
code-search-babbage-code-001 | 2,046 | r50k_base |
code-search-ada-code-001 | 2,046 | r50k_base |
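JTokkit's ModelType enum also exposes each model's maximum context length, which makes it easy to check whether a prompt will fit before sending a request. A sketch, assuming your JTokkit version includes the model constant used here:
// Check whether a prompt fits a model's context window
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
ModelType model = ModelType.GPT_4O;                  // 128,000-token context window
Encoding encoding = registry.getEncodingForModel(model);
String prompt = "Summarize the following meeting notes: ...";
int promptTokens = encoding.countTokens(prompt);
boolean fits = promptTokens <= model.getMaxContextLength();
System.out.println(promptTokens + " of " + model.getMaxContextLength() + " tokens used, fits: " + fits);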
Token Counting Methods
This calculator exposes JTokkit's two counting methods, each in a standard and an ordinary variant:
Method | Standard | Ordinary |
---|---|---|
encode() | Full tokenization that returns token IDs, with special handling for OpenAI special tokens (like the end-of-text marker) | encodeOrdinary() treats special tokens as plain text |
countTokens() | Optimized method that only counts tokens, with the same special-token handling | countTokensOrdinary() treats special tokens as plain text |
Special-token handling: When your text contains OpenAI special tokens such as <|endoftext|>, the standard methods treat them specially (JTokkit rejects unexpected special tokens rather than silently encoding them), while the ordinary methods tokenize them as regular text. This can produce different results for the same input. Use the ordinary methods if you want such markers counted exactly as written.
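A minimal sketch of the difference, assuming (as JTokkit's encode() documents) that the standard methods reject unexpected special tokens such as <|endoftext|> while the ordinary variants count them as plain text:
// Standard vs. ordinary counting for text containing a special token
Encoding encoding = Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);
String text = "hello <|endoftext|> world";
System.out.println("Ordinary count: " + encoding.countTokensOrdinary(text)); // special token counted as plain text
try {
    System.out.println("Standard count: " + encoding.countTokens(text));
} catch (UnsupportedOperationException e) {
    // the standard methods refuse to encode unexpected special tokens
    System.out.println("Standard methods rejected the special token");
}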
Token Counting Tips
- Repetitive text often uses fewer tokens than unique text of the same length
- Whitespace (spaces, tabs, newlines) counts as tokens
- Special characters may be tokenized individually, increasing token count
- Common words typically use fewer tokens than rare or technical terms
- Non-English text often uses more tokens than equivalent English text
- Code can be tokenized efficiently with the newer tokenizers (cl100k_base, p50k_base)
Common Token Count Examples
Text | Tokens | Note |
---|---|---|
"Hello world" | 2 | Common words often count as single tokens |
"antidisestablishmentarianism" | 5 | Long words split into multiple tokens |
"こんにちは" | 3 | Non-Latin characters often need more tokens |
"1234567890" | 3 | Numbers are often chunked into multiple tokens |
"<div>Hello</div>" | 4-6 | HTML tags may count differently between standard and ordinary methods |