OpenAI Token Counter

This OpenAI Token Counter uses JTokkit v1.1.0, an open-source Java implementation of OpenAI's tiktoken.

Maven dependency: com.knuddels:jtokkit:1.1.0

JTokkit is developed and maintained by Knuddels GmbH and is available under the MIT license.

Implementation Details

This token counter is built using:

// Gradle dependency (Maven coordinates: com.knuddels:jtokkit:1.1.0)
implementation 'com.knuddels:jtokkit:1.1.0'

// Token counting implementation
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;
import com.knuddels.jtokkit.api.ModelType;

// Usage example
String text = "Your text to count";
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding encoding = registry.getEncoding(EncodingType.CL100K_BASE);
int tokenCount = encoding.countTokens(text);

JTokkit provides the same tokenization as OpenAI's official tiktoken library, ensuring accurate token counts.

OpenAI Token Counter Guide

What is tokenization?

Tokenization is the process of breaking text into smaller units called tokens. OpenAI models process text as tokens, not individual characters or words. Tokens can be as short as one character or as long as one word (e.g., "a" or "apple"). Common words are often a single token, while less common words might be split into multiple tokens.
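This splitting can be observed directly with JTokkit. A minimal sketch, assuming the cl100k_base encoding, that counts tokens for a common two-word phrase versus a single rare word:

```java
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingType;

public class TokenizationDemo {
    static final Encoding ENCODING =
            Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);

    static int count(String text) {
        return ENCODING.countTokens(text);
    }

    public static void main(String[] args) {
        // Two common words -> typically one token each.
        System.out.println("\"Hello world\" -> " + count("Hello world") + " tokens");
        // One rare word -> split into several sub-word tokens.
        System.out.println("\"antidisestablishmentarianism\" -> "
                + count("antidisestablishmentarianism") + " tokens");
    }
}
```

Here countTokens() avoids materializing the list of token IDs; use encode() when you need the IDs themselves.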

Important: Different models use different tokenizers, and your token count directly affects:

  • API costs (you pay per token for both input and output)
  • Context window usage (all models have a maximum token limit)
  • Processing speed (more tokens take longer to process)
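The cost point is simple arithmetic once you have the counts. A sketch with illustrative numbers only: INPUT_PRICE_PER_TOKEN and OUTPUT_PRICE_PER_TOKEN are placeholders, not real OpenAI rates, so substitute the current prices for your model.

```java
public class CostEstimate {
    // Placeholder prices in USD per token, for illustration only;
    // look up the actual rates on OpenAI's pricing page.
    static final double INPUT_PRICE_PER_TOKEN = 0.005 / 1000.0;
    static final double OUTPUT_PRICE_PER_TOKEN = 0.015 / 1000.0;

    static double estimateCostUsd(int inputTokens, int outputTokens) {
        // You pay for both the prompt (input) and the completion (output).
        return inputTokens * INPUT_PRICE_PER_TOKEN
                + outputTokens * OUTPUT_PRICE_PER_TOKEN;
    }

    public static void main(String[] args) {
        // A 1,000-token prompt with a 500-token completion:
        // 1000 * $0.000005 + 500 * $0.000015 = $0.0125
        System.out.printf("Estimated cost: $%.4f%n", estimateCostUsd(1000, 500));
    }
}
```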

Encoding Types

JTokkit provides several encoding types that match OpenAI's tokenizers:

Encoding      Description                                           Used By
cl100k_base   Improved handling of code and non-English languages   GPT-4, GPT-3.5-Turbo, text-embedding-ada-002
p50k_base     Improved vocabulary with better handling of code      GPT-3 davinci, curie, babbage, ada; Codex models
p50k_edit     Specialized for edit models                           text-davinci-edit-001, code-davinci-edit-001
r50k_base     Original GPT-3 tokenizer with basic vocabulary        Older davinci, curie, babbage, ada models
o200k_base    Latest tokenizer, optimized for GPT-4o models         GPT-4o, GPT-4o-mini
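You rarely need to pick the encoding by hand: JTokkit's registry can resolve it from the model. A sketch using getEncodingForModel, which selects cl100k_base for GPT-4 as in the table above:

```java
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.ModelType;

public class ModelEncodingDemo {
    static int countForGpt4(String text) {
        EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
        // The registry maps each ModelType to its encoding (GPT_4 -> cl100k_base).
        Encoding encoding = registry.getEncodingForModel(ModelType.GPT_4);
        return encoding.countTokens(text);
    }

    public static void main(String[] args) {
        System.out.println(countForGpt4("Hello world") + " tokens");
    }
}
```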

Models and Context Windows

Each OpenAI model has a maximum context window (number of tokens it can process in one request):

Model                          Max Tokens   Encoding Type
gpt-4                          8,192        cl100k_base
gpt-4o                         128,000      o200k_base
gpt-4o-mini                    128,000      o200k_base
gpt-4-32k                      32,768       cl100k_base
gpt-4-turbo                    128,000      cl100k_base
gpt-3.5-turbo                  16,385       cl100k_base
gpt-3.5-turbo-16k              16,385       cl100k_base
text-davinci-003               4,097        p50k_base
text-davinci-002               4,097        p50k_base
text-davinci-001               2,049        r50k_base
text-curie-001                 2,049        r50k_base
text-babbage-001               2,049        r50k_base
text-ada-001                   2,049        r50k_base
davinci                        2,049        r50k_base
curie                          2,049        r50k_base
babbage                        2,049        r50k_base
ada                            2,049        r50k_base
code-davinci-002               8,001        p50k_base
code-davinci-001               8,001        p50k_base
code-cushman-002               2,048        p50k_base
code-cushman-001               2,048        p50k_base
davinci-codex                  4,096        p50k_base
cushman-codex                  2,048        p50k_base
text-davinci-edit-001          3,000        p50k_edit
code-davinci-edit-001          3,000        p50k_edit
text-embedding-ada-002         8,191        cl100k_base
text-embedding-3-small         8,191        cl100k_base
text-embedding-3-large         8,191        cl100k_base
text-similarity-davinci-001    2,046        r50k_base
text-similarity-curie-001      2,046        r50k_base
text-similarity-babbage-001    2,046        r50k_base
text-similarity-ada-001        2,046        r50k_base
text-search-davinci-doc-001    2,046        r50k_base
text-search-curie-doc-001      2,046        r50k_base
text-search-babbage-doc-001    2,046        r50k_base
text-search-ada-doc-001        2,046        r50k_base
code-search-babbage-code-001   2,046        r50k_base
code-search-ada-code-001       2,046        r50k_base
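These limits are also available in code: JTokkit's ModelType exposes each model's maximum context length, so the usage percentage shown by this tool can be reproduced directly. A sketch, assuming ModelType#getMaxContextLength reports the limits in the table:

```java
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.ModelType;

public class ContextWindowCheck {
    static final EncodingRegistry REGISTRY = Encodings.newDefaultEncodingRegistry();

    static double usagePercent(ModelType model, String prompt) {
        Encoding encoding = REGISTRY.getEncodingForModel(model);
        // getMaxContextLength() returns the model's context window in tokens.
        return 100.0 * encoding.countTokens(prompt) / model.getMaxContextLength();
    }

    // Convenience wrapper for the gpt-4 (8,192-token) window.
    static double gpt4UsagePercent(String prompt) {
        return usagePercent(ModelType.GPT_4, prompt);
    }

    public static void main(String[] args) {
        System.out.printf("%.2f%% of the gpt-4 context window%n",
                gpt4UsagePercent("Hello world"));
    }
}
```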

Token Counting Methods

This calculator offers two counting methods:

Method          Standard                                                                    Ordinary
encode()        Full tokenization that returns token IDs, with special handling for HTML   encodeOrdinary() treats HTML as plain text
countTokens()   Optimized method that only counts tokens, with special handling for HTML   countTokensOrdinary() treats HTML as plain text

HTML handling: When your text contains HTML, standard methods recognize and process tags specially, while ordinary methods treat HTML tags as regular text. This can result in different token counts. Use ordinary methods if you want your HTML to be preserved exactly as written.
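Both variants are a single call each in JTokkit. A sketch comparing them on a small HTML snippet (exact counts depend on the input and encoding, so the output is illustrative):

```java
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingType;

public class OrdinaryVsStandard {
    static final Encoding ENCODING =
            Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);

    static int standardCount(String text) {
        return ENCODING.countTokens(text);          // standard counting
    }

    static int ordinaryCount(String text) {
        return ENCODING.countTokensOrdinary(text);  // markup treated as literal text
    }

    public static void main(String[] args) {
        String html = "<div>Hello</div>";
        System.out.println("countTokens:         " + standardCount(html));
        System.out.println("countTokensOrdinary: " + ordinaryCount(html));
    }
}
```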

Token Counting Tips

  • Repetitive text often uses fewer tokens than unique text of the same length
  • Whitespace (spaces, tabs, newlines) counts as tokens
  • Special characters may be tokenized individually, increasing token count
  • Common words typically use fewer tokens than rare or technical terms
  • Non-English text often uses more tokens than equivalent English text
  • Code can be tokenized efficiently with the newer tokenizers (cl100k_base, p50k_base)

Common Token Count Examples

Text                              Tokens   Note
"Hello world"                     2        Common words often count as single tokens
"antidisestablishmentarianism"    5        Long words are split into multiple tokens
"こんにちは"                      3        Non-Latin characters often need more tokens
"1234567890"                      3        Numbers are often chunked into multiple tokens
"<div>Hello</div>"                4-6      HTML tags may count differently between standard and ordinary methods