Overview
Direct Answer
Tokenisation is the foundational preprocessing step that converts raw text into discrete units (tokens) that language models can process numerically. These units may represent individual words, subword fragments, or characters, depending on the tokenisation strategy employed.
How It Works
The process segments input text according to defined rules: either at whitespace boundaries for word-level tokenisation, or through vocabulary-based algorithms such as Byte-Pair Encoding or WordPiece for subword splitting. Each token is then mapped to an integer identifier through a vocabulary built or learned from a training corpus, so that downstream models can operate on text numerically.
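The two steps described above (segmenting text, then mapping tokens to identifiers) can be sketched in a few lines of Python. This is a minimal illustration and not a production tokeniser: the function names, the `<unk>` placeholder, and the toy corpus are all assumptions made for the example, and the splitting rule is a simple word-and-punctuation pattern instead of a learned subword algorithm.

```python
import re

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary tokens


def tokenise(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(corpus):
    """Assign an integer ID to each distinct token, reserving 0 for UNK."""
    vocab = {UNK: 0}
    for sentence in corpus:
        for token in tokenise(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Map each token to its ID, falling back to the UNK ID."""
    return [vocab.get(token, vocab[UNK]) for token in tokenise(text)]


# Toy corpus, invented for illustration only.
corpus = ["The cat sat.", "The dog sat."]
vocab = build_vocab(corpus)
ids = encode("The bird sat.", vocab)
```

Here "bird" never appeared in the toy corpus, so it is encoded with the ID reserved for `<unk>`; subword methods exist largely to avoid this fallback.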
Why It Matters
Tokenisation quality directly affects model efficiency, accuracy, and cost. A tokeniser that fragments text into many small pieces increases sequence length, which consumes more compute and memory during training and inference. Language coverage and the handling of out-of-vocabulary terms strongly influence model robustness in multilingual and domain-specific applications.
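To make the effect on sequence length and out-of-vocabulary handling concrete, the sketch below compares two extreme strategies on one short string. The helper names, the sample text, and the small `known_words` set are assumptions chosen for illustration; real systems would measure these quantities over a large corpus.

```python
def word_tokens(text):
    """Split on whitespace: short sequences, but a large open vocabulary."""
    return text.split()


def char_tokens(text):
    """One token per character: tiny vocabulary, but long sequences."""
    return list(text)


def oov_rate(tokens, vocab):
    """Fraction of tokens that are missing from a known vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocab) / len(tokens)


text = "tokenisation affects cost"
n_word = len(word_tokens(text))  # number of whitespace-separated tokens
n_char = len(char_tokens(text))  # number of characters, spaces included

known_words = {"tokenisation", "affects"}  # illustrative vocabulary
rate = oov_rate(word_tokens(text), known_words)
```

The character-level sequence is several times longer than the word-level one for the same text, while the word-level vocabulary fails on any word it has not seen. Subword tokenisation sits between these two extremes.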
Common Applications
Tokenisation is essential across machine translation systems, sentiment analysis pipelines, document classification, and conversational AI platforms. It enables named entity recognition systems to identify boundaries of entities and supports question-answering models in retrieving and ranking relevant text spans.
Key Considerations
Trade-offs exist between vocabulary size, sequence length, and computational overhead: a larger vocabulary shortens sequences but enlarges the table of token identifiers the model must learn, while a smaller vocabulary does the reverse. Language-specific requirements, the handling of punctuation and special characters, and the preservation of semantic boundaries present ongoing challenges, particularly for morphologically rich languages and source code.
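The trade-off between vocabulary size and sequence length can be seen in a toy version of Byte-Pair Encoding, which starts from characters and repeatedly merges the most frequent adjacent pair of symbols. The sketch below is a simplified illustration under stated assumptions: the function names and the four-word corpus are invented for the example, words are treated independently, and details that practical implementations need (word frequencies, end-of-word markers, byte-level fallback) are left out.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the top pair."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in one word with a merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(corpus, num_merges):
    """Start from characters and apply up to `num_merges` merges."""
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = [merge_pair(w, pair) for w in words]
    return words, merges


# Toy corpus, invented for illustration only.
corpus = ["low", "lower", "lowest", "low"]
base_alphabet = {ch for w in corpus for ch in w}

results = []
for n in (0, 1, 2):
    words, merges = train_bpe(corpus, n)
    vocab_size = len(base_alphabet) + len(merges)  # characters plus merges
    seq_len = sum(len(w) for w in words)           # total tokens in corpus
    results.append((n, vocab_size, seq_len))
```

Each merge adds one entry to the vocabulary and shortens the tokenised corpus, so `results` shows vocabulary size rising as total sequence length falls. Choosing the number of merges is therefore a direct way of choosing a point on that trade-off.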
More in Natural Language Processing
Natural Language Processing
Core NLP: The field of AI focused on enabling computers to understand, interpret, and generate human language.
Text Embedding Model
Core NLP: A neural network trained to convert text passages into fixed-dimensional vectors that capture semantic meaning, enabling similarity search, clustering, and retrieval applications.
Dialogue Management
Generation & Translation: The component of conversational systems that tracks conversation state, determines the next system action, and maintains coherent multi-turn interactions with users.
Text Summarisation
Text Analysis: The process of creating a concise and coherent summary of a longer text document while preserving key information.
Intent Detection
Generation & Translation: The classification of user utterances into predefined categories representing the user's goal or purpose, a fundamental component of conversational AI and chatbot systems.
Byte-Pair Encoding
Parsing & Structure: A subword tokenisation algorithm that iteratively merges the most frequent character pairs to build a vocabulary.
Context Window
Semantics & Representation: The maximum amount of text a language model can consider at once when generating a response.
Text-to-SQL
Generation & Translation: The task of automatically converting natural language questions into executable SQL queries, enabling non-technical users to interrogate databases through conversational interfaces.