OpenAI Token Counter

This OpenAI Token Counter uses JTokkit v1.1.0, an open-source Java implementation of OpenAI's tiktoken.

Maven dependency: com.knuddels:jtokkit:1.1.0

JTokkit is developed and maintained by Knuddels GmbH and is available under the MIT license.

Implementation Details

This token counter is built using:

// Gradle dependency (Maven coordinates: com.knuddels:jtokkit:1.1.0)
implementation 'com.knuddels:jtokkit:1.1.0'

// Token counting implementation
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;
import com.knuddels.jtokkit.api.ModelType;

// Usage example
String text = "Your text to count";
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding encoding = registry.getEncoding(EncodingType.CL100K_BASE);
int tokenCount = encoding.countTokens(text);

JTokkit provides the same tokenization as OpenAI's official tiktoken library, ensuring accurate token counts.

OpenAI Token Counter Guide

What is tokenization?

Tokenization is the process of breaking text into smaller units called tokens. OpenAI models process text as tokens, not individual characters or words. Tokens can be as short as one character or as long as one word (e.g., "a" or "apple"). Common words are often a single token, while less common words might be split into multiple tokens.
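This splitting can be observed directly with JTokkit. A minimal sketch, assuming the cl100k_base encoding, that counts tokens for a common two-word phrase versus a single rare word:

```java
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingType;

public class TokenizationDemo {
    static final Encoding ENCODING =
            Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);

    static int count(String text) {
        return ENCODING.countTokens(text);
    }

    public static void main(String[] args) {
        // Two common words -> typically one token each.
        System.out.println("\"Hello world\" -> " + count("Hello world") + " tokens");
        // One rare word -> split into several sub-word tokens.
        System.out.println("\"antidisestablishmentarianism\" -> "
                + count("antidisestablishmentarianism") + " tokens");
    }
}
```

Here countTokens() avoids materializing the list of token IDs; use encode() when you need the IDs themselves.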

Important: Different models use different tokenizers, and your token count directly affects:

  • API costs (you pay per token for both input and output)
  • Context window usage (all models have a maximum token limit)
  • Processing speed (more tokens take longer to process)
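The cost point is simple arithmetic once you have the counts. A sketch with illustrative numbers only: INPUT_PRICE_PER_TOKEN and OUTPUT_PRICE_PER_TOKEN are placeholders, not real OpenAI rates, so substitute the current prices for your model.

```java
public class CostEstimate {
    // Placeholder prices in USD per token, for illustration only;
    // look up the actual rates on OpenAI's pricing page.
    static final double INPUT_PRICE_PER_TOKEN = 0.005 / 1000.0;
    static final double OUTPUT_PRICE_PER_TOKEN = 0.015 / 1000.0;

    static double estimateCostUsd(int inputTokens, int outputTokens) {
        // You pay for both the prompt (input) and the completion (output).
        return inputTokens * INPUT_PRICE_PER_TOKEN
                + outputTokens * OUTPUT_PRICE_PER_TOKEN;
    }

    public static void main(String[] args) {
        // A 1,000-token prompt with a 500-token completion:
        // 1000 * $0.000005 + 500 * $0.000015 = $0.0125
        System.out.printf("Estimated cost: $%.4f%n", estimateCostUsd(1000, 500));
    }
}
```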

Encoding Types

JTokkit provides several encoding types that match OpenAI's tokenizers:

Encoding      Description                                           Used By
cl100k_base   Improved handling of code and non-English languages   GPT-4, GPT-3.5-Turbo, text-embedding-ada-002
p50k_base     Improved vocabulary with better handling of code      GPT-3 davinci, curie, babbage, ada; Codex models
p50k_edit     Specialized for edit models                           text-davinci-edit-001, code-davinci-edit-001
r50k_base     Original GPT-3 tokenizer with basic vocabulary        Older davinci, curie, babbage, ada models
o200k_base    Latest tokenizer, optimized for GPT-4o models         GPT-4o, GPT-4o-mini
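You rarely need to pick the encoding by hand: JTokkit's registry can resolve it from the model. A sketch using getEncodingForModel, which selects cl100k_base for GPT-4 as in the table above:

```java
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.ModelType;

public class ModelEncodingDemo {
    static int countForGpt4(String text) {
        EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
        // The registry maps each ModelType to its encoding (GPT_4 -> cl100k_base).
        Encoding encoding = registry.getEncodingForModel(ModelType.GPT_4);
        return encoding.countTokens(text);
    }

    public static void main(String[] args) {
        System.out.println(countForGpt4("Hello world") + " tokens");
    }
}
```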

Models and Context Windows

Each OpenAI model has a maximum context window (number of tokens it can process in one request):

Model                          Max Tokens   Encoding Type
gpt-4                          8,192        cl100k_base
gpt-4o                         128,000      o200k_base
gpt-4o-mini                    128,000      o200k_base
gpt-4-32k                      32,768       cl100k_base
gpt-4-turbo                    128,000      cl100k_base
gpt-3.5-turbo                  16,385       cl100k_base
gpt-3.5-turbo-16k              16,385       cl100k_base
text-davinci-003               4,097        p50k_base
text-davinci-002               4,097        p50k_base
text-davinci-001               2,049        r50k_base
text-curie-001                 2,049        r50k_base
text-babbage-001               2,049        r50k_base
text-ada-001                   2,049        r50k_base
davinci                        2,049        r50k_base
curie                          2,049        r50k_base
babbage                        2,049        r50k_base
ada                            2,049        r50k_base
code-davinci-002               8,001        p50k_base
code-davinci-001               8,001        p50k_base
code-cushman-002               2,048        p50k_base
code-cushman-001               2,048        p50k_base
davinci-codex                  4,096        p50k_base
cushman-codex                  2,048        p50k_base
text-davinci-edit-001          3,000        p50k_edit
code-davinci-edit-001          3,000        p50k_edit
text-embedding-ada-002         8,191        cl100k_base
text-embedding-3-small         8,191        cl100k_base
text-embedding-3-large         8,191        cl100k_base
text-similarity-davinci-001    2,046        r50k_base
text-similarity-curie-001      2,046        r50k_base
text-similarity-babbage-001    2,046        r50k_base
text-similarity-ada-001        2,046        r50k_base
text-search-davinci-doc-001    2,046        r50k_base
text-search-curie-doc-001      2,046        r50k_base
text-search-babbage-doc-001    2,046        r50k_base
text-search-ada-doc-001        2,046        r50k_base
code-search-babbage-code-001   2,046        r50k_base
code-search-ada-code-001       2,046        r50k_base
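These limits are also available in code: JTokkit's ModelType exposes each model's maximum context length, so the usage percentage shown by this tool can be reproduced directly. A sketch, assuming ModelType#getMaxContextLength reports the limits in the table:

```java
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.ModelType;

public class ContextWindowCheck {
    static final EncodingRegistry REGISTRY = Encodings.newDefaultEncodingRegistry();

    static double usagePercent(ModelType model, String prompt) {
        Encoding encoding = REGISTRY.getEncodingForModel(model);
        // getMaxContextLength() returns the model's context window in tokens.
        return 100.0 * encoding.countTokens(prompt) / model.getMaxContextLength();
    }

    // Convenience wrapper for the gpt-4 (8,192-token) window.
    static double gpt4UsagePercent(String prompt) {
        return usagePercent(ModelType.GPT_4, prompt);
    }

    public static void main(String[] args) {
        System.out.printf("%.2f%% of the gpt-4 context window%n",
                gpt4UsagePercent("Hello world"));
    }
}
```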

Token Counting Methods

This calculator offers two counting methods:

Method          Standard                                                                    Ordinary
encode()        Full tokenization that returns token IDs, with special handling for HTML   encodeOrdinary() treats HTML as plain text
countTokens()   Optimized method that only counts tokens, with special handling for HTML   countTokensOrdinary() treats HTML as plain text

HTML handling: When your text contains HTML, standard methods recognize and process tags specially, while ordinary methods treat HTML tags as regular text. This can result in different token counts. Use ordinary methods if you want your HTML to be preserved exactly as written.
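Both variants are a single call each in JTokkit. A sketch comparing them on a small HTML snippet (exact counts depend on the input and encoding, so the output is illustrative):

```java
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingType;

public class OrdinaryVsStandard {
    static final Encoding ENCODING =
            Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);

    static int standardCount(String text) {
        return ENCODING.countTokens(text);          // standard counting
    }

    static int ordinaryCount(String text) {
        return ENCODING.countTokensOrdinary(text);  // markup treated as literal text
    }

    public static void main(String[] args) {
        String html = "<div>Hello</div>";
        System.out.println("countTokens:         " + standardCount(html));
        System.out.println("countTokensOrdinary: " + ordinaryCount(html));
    }
}
```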

Token Counting Tips

  • Repetitive text often uses fewer tokens than unique text of the same length
  • Whitespace (spaces, tabs, newlines) counts as tokens
  • Special characters may be tokenized individually, increasing token count
  • Common words typically use fewer tokens than rare or technical terms
  • Non-English text often uses more tokens than equivalent English text
  • Code can be tokenized efficiently with the newer tokenizers (cl100k_base, p50k_base)

Common Token Count Examples

Text                              Tokens   Note
"Hello world"                     2        Common words often count as single tokens
"antidisestablishmentarianism"    5        Long words are split into multiple tokens
"こんにちは"                      3        Non-Latin characters often need more tokens
"1234567890"                      3        Numbers are often chunked into multiple tokens
"<div>Hello</div>"                4-6      HTML tags may count differently between standard and ordinary methods