OpenAI Token Counter

Results

Token Count: 9
Using cl100k_base encoding with encode()
0.1% of 8,192 token limit

Token Statistics
Standard Token Count: 9 (handles HTML tags specially)
Ordinary Token Count: 9 (treats HTML as plain text)
Character Count: 37
Word Count: 7
Tokens per Word: 1.29

Encoding Information
Encoding Type: cl100k_base
Counting Method: encode()
Selected Via: Encoding (cl100k_base)
Context Window: 8,192 tokens
Context Usage: 0.1%
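The figures in the statistics panel above follow from simple arithmetic once the token count is known. A minimal sketch in plain Java (the sample values are the ones shown above, not live JTokkit output):

```java
import java.util.Locale;

public class TokenStats {
    // Illustrative arithmetic behind the statistics panel;
    // token and word counts here are the sample values shown above.
    static double tokensPerWord(int tokens, int words) {
        return (double) tokens / words;
    }

    static double contextUsagePercent(int tokens, int contextWindow) {
        return 100.0 * tokens / contextWindow;
    }

    public static void main(String[] args) {
        int tokenCount = 9, wordCount = 7, contextWindow = 8_192;
        System.out.printf(Locale.ROOT, "Tokens per word: %.2f%n",
                tokensPerWord(tokenCount, wordCount));           // 1.29
        System.out.printf(Locale.ROOT, "Context usage: %.1f%%%n",
                contextUsagePercent(tokenCount, contextWindow)); // 0.1%
    }
}
```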

This OpenAI Token Counter uses JTokkit v1.1.0, an open-source Java implementation of OpenAI's tiktoken.

Maven dependency: com.knuddels:jtokkit:1.1.0

JTokkit is developed and maintained by Knuddels GmbH and is available under the MIT license.

Implementation Details

This token counter is built using:

// Gradle dependency (equivalent Maven coordinates: com.knuddels:jtokkit:1.1.0)
implementation 'com.knuddels:jtokkit:1.1.0'

// Token counting implementation
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;
import com.knuddels.jtokkit.api.ModelType;

// Usage example
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding encoding = registry.getEncoding(EncodingType.CL100K_BASE);
String text = "Hello, world!";               // text to measure
int tokenCount = encoding.countTokens(text); // number of tokens in text

JTokkit provides the same tokenization as OpenAI's official tiktoken library, ensuring accurate token counts.

OpenAI Token Counter Guide

What is tokenization?

Tokenization is the process of breaking text into smaller units called tokens. OpenAI models process text as tokens, not individual characters or words. Tokens can be as short as one character or as long as one word (e.g., "a" or "apple"). Common words are often a single token, while less common words might be split into multiple tokens.

Important: Different models use different tokenizers, and your token count directly affects:

  • API costs (you pay per token for both input and output)
  • Context window usage (all models have a maximum token limit)
  • Processing speed (more tokens take longer to process)
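Because billing is per token, a rough cost estimate is just multiplication. A sketch, assuming hypothetical per-1,000-token prices (the rates below are placeholders for illustration; check OpenAI's pricing page for real numbers):

```java
public class CostEstimate {
    // Hypothetical prices in USD per 1,000 tokens -- placeholders, not real pricing.
    static final double INPUT_PRICE_PER_1K = 0.01;
    static final double OUTPUT_PRICE_PER_1K = 0.03;

    // Input and output tokens are billed separately, then summed.
    static double estimateCost(int inputTokens, int outputTokens) {
        return inputTokens / 1000.0 * INPUT_PRICE_PER_1K
             + outputTokens / 1000.0 * OUTPUT_PRICE_PER_1K;
    }

    public static void main(String[] args) {
        // e.g. a 1,500-token prompt and a 500-token completion
        System.out.println(estimateCost(1500, 500));
    }
}
```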

Encoding Types

JTokkit provides several encoding types that match OpenAI's tokenizers:

cl100k_base: Newer tokenizer with improved handling of code and non-English languages. Used by GPT-4, GPT-3.5-Turbo, and text-embedding-ada-002.
p50k_base: Improved vocabulary with better handling of code. Used by GPT-3 davinci, curie, babbage, and ada, plus the Codex models.
p50k_edit: Specialized for edit models. Used by text-davinci-edit-001 and code-davinci-edit-001.
r50k_base: Original GPT-3 tokenizer with a basic vocabulary. Used by older davinci, curie, babbage, and ada models.
o200k_base: Latest tokenizer, optimized for the GPT-4o family. Used by GPT-4o and GPT-4o-mini.

Models and Context Windows

Each OpenAI model has a maximum context window (number of tokens it can process in one request):

Model                          Max Tokens   Encoding Type
gpt-4                          8,192        cl100k_base
gpt-4o                         128,000      o200k_base
gpt-4o-mini                    128,000      o200k_base
gpt-4-32k                      32,768       cl100k_base
gpt-4-turbo                    128,000      cl100k_base
gpt-3.5-turbo                  16,385       cl100k_base
gpt-3.5-turbo-16k              16,385       cl100k_base
text-davinci-003               4,097        p50k_base
text-davinci-002               4,097        p50k_base
text-davinci-001               2,049        r50k_base
text-curie-001                 2,049        r50k_base
text-babbage-001               2,049        r50k_base
text-ada-001                   2,049        r50k_base
davinci                        2,049        r50k_base
curie                          2,049        r50k_base
babbage                        2,049        r50k_base
ada                            2,049        r50k_base
code-davinci-002               8,001        p50k_base
code-davinci-001               8,001        p50k_base
code-cushman-002               2,048        p50k_base
code-cushman-001               2,048        p50k_base
davinci-codex                  4,096        p50k_base
cushman-codex                  2,048        p50k_base
text-davinci-edit-001          3,000        p50k_edit
code-davinci-edit-001          3,000        p50k_edit
text-embedding-ada-002         8,191        cl100k_base
text-embedding-3-small         8,191        cl100k_base
text-embedding-3-large         8,191        cl100k_base
text-similarity-davinci-001    2,046        r50k_base
text-similarity-curie-001      2,046        r50k_base
text-similarity-babbage-001    2,046        r50k_base
text-similarity-ada-001        2,046        r50k_base
text-search-davinci-doc-001    2,046        r50k_base
text-search-curie-doc-001      2,046        r50k_base
text-search-babbage-doc-001    2,046        r50k_base
text-search-ada-doc-001        2,046        r50k_base
code-search-babbage-code-001   2,046        r50k_base
code-search-ada-code-001       2,046        r50k_base
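
The limits above can be used to validate a prompt before sending a request. A minimal sketch using a hand-built subset of the table (for illustration only; in real code you would keep the full table, or look the limit up via your client library):

```java
import java.util.Map;

public class ContextCheck {
    // Hand-built subset of the context-window table above, for illustration.
    static final Map<String, Integer> MAX_TOKENS = Map.of(
            "gpt-4", 8_192,
            "gpt-4o", 128_000,
            "gpt-3.5-turbo", 16_385,
            "text-embedding-ada-002", 8_191);

    // True when the prompt plus the reserved completion budget fits the window.
    static boolean fits(String model, int promptTokens, int completionBudget) {
        Integer limit = MAX_TOKENS.get(model);
        if (limit == null) throw new IllegalArgumentException("unknown model: " + model);
        return promptTokens + completionBudget <= limit;
    }

    public static void main(String[] args) {
        System.out.println(fits("gpt-4", 7_000, 1_000)); // true  (8,000 <= 8,192)
        System.out.println(fits("gpt-4", 8_000, 1_000)); // false (9,000 >  8,192)
    }
}
```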

Token Counting Methods

This calculator offers two counting methods:

encode(): Full tokenization that returns token IDs, with special handling for HTML. The ordinary variant, encodeOrdinary(), treats HTML as plain text.
countTokens(): Optimized method that only counts tokens (no token IDs returned), with special handling for HTML. The ordinary variant, countTokensOrdinary(), treats HTML as plain text.

HTML handling: When your text contains HTML, standard methods recognize and process tags specially, while ordinary methods treat HTML tags as regular text. This can result in different token counts. Use ordinary methods if you want your HTML to be preserved exactly as written.

Token Counting Tips

  • Repetitive text often uses fewer tokens than unique text of the same length
  • Whitespace (spaces, tabs, newlines) counts as tokens
  • Special characters may be tokenized individually, increasing token count
  • Common words typically use fewer tokens than rare or technical terms
  • Non-English text often uses more tokens than equivalent English text
  • Code can be tokenized efficiently with the newer tokenizers (cl100k_base, p50k_base)
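When a tokenizer library is unavailable, a common rule of thumb for English text is roughly 4 characters per token. A sketch of that heuristic; it is approximate only, and a real tokenizer such as JTokkit should be used for anything billing-related:

```java
public class TokenEstimate {
    // Rough heuristic: English text averages about 4 characters per token.
    // Approximation only -- use a real tokenizer for accurate counts.
    static int estimateTokens(String text) {
        return (int) Math.ceil(text.length() / 4.0);
    }

    public static void main(String[] args) {
        // 11 characters -> estimate 3 (the actual cl100k_base count is 2)
        System.out.println(estimateTokens("Hello world"));
    }
}
```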

Common Token Count Examples

Text                             Tokens   Note
"Hello world"                    2        Common words often count as single tokens
"antidisestablishmentarianism"   5        Long words are split into multiple tokens
"こんにちは"                      3        Non-Latin characters often need more tokens
"1234567890"                     3        Numbers are often chunked into multiple tokens
"<div>Hello</div>"               4-6      HTML tags may count differently between standard and ordinary methods