Overview
A subword tokenisation algorithm that builds its vocabulary by starting from individual characters and repeatedly merging the most frequent adjacent pair of symbols into a new token, until a target vocabulary size is reached.
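The merge loop described above can be sketched in a few lines. This is a toy illustration, not any particular library's implementation: function names and the word-frequency corpus format (words pre-split into space-separated symbols) are assumptions made for the example.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(corpus, pair):
    """Rewrite every word, fusing the chosen pair into one symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in corpus.items()}

def learn_bpe(corpus, num_merges):
    """Greedily learn merge operations from a word-frequency corpus."""
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(corpus)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges, corpus

# Toy corpus: each word is pre-split into characters, mapped to its frequency.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges, final = learn_bpe(corpus, 4)
```

On this corpus the first merge is `('e', 's')`, since "es" occurs 9 times across "newest" and "widest"; after four merges the frequent word "low" has become a single token.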
More in Natural Language Processing
GloVe
Semantics & Representation: Global Vectors for Word Representation — an unsupervised learning algorithm for obtaining word vector representations from aggregated word co-occurrence statistics.
Natural Language Understanding
Core NLP: The subfield of NLP focused on machine reading comprehension and extracting meaning from text.
Top-K Sampling
Generation & Translation: A text generation strategy that restricts the model to sampling from the K most probable next tokens.
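The restriction to the K most probable tokens can be shown with a small sketch over raw logits. This is an illustrative standalone function, not taken from any framework:

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token index from only the k highest-scoring logits."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over just those k entries (shifted by the max for stability).
    m = max(logits)
    weights = [math.exp(logits[i] - m) for i in top]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(top, weights=probs, k=1)[0]
```

With `k=1` this reduces to greedy decoding; larger `k` trades determinism for diversity while still excluding the low-probability tail.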
Speech Synthesis
Speech & Audio: The artificial production of human speech from text, also known as text-to-speech.
Large Language Model
Semantics & Representation: A neural network trained on massive text corpora that can generate, understand, and reason about natural language.
Text Generation
Generation & Translation: The process of producing coherent and contextually relevant text using AI language models.
Vector Database
Core NLP: A database optimised for storing and querying high-dimensional vector embeddings for similarity search.
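The core query a vector database answers can be illustrated with a brute-force cosine-similarity search. Production systems use approximate-nearest-neighbour indexes rather than a full scan; the names and toy three-dimensional embeddings below are assumptions for the example:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query, index, top_n=1):
    """Brute-force similarity search: rank all stored vectors against the query."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [key for key, _ in ranked[:top_n]]

# Toy index mapping labels to embedding vectors.
index = {"cat": [1.0, 0.9, 0.0], "dog": [0.9, 1.0, 0.1], "car": [0.0, 0.1, 1.0]}
```

A query vector close to the animal embeddings retrieves "cat" and "dog" ahead of "car".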
Context Window
Semantics & Representation: The maximum amount of text a language model can consider at once when generating a response.
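One common consequence of a fixed context window is that older input must be dropped. A minimal sketch of sliding-window truncation, with an illustrative function name and a token list standing in for real tokeniser output:

```python
def fit_to_context(tokens, max_tokens):
    """Keep only the most recent tokens that fit in the model's window."""
    if len(tokens) <= max_tokens:
        return tokens
    # Discard the oldest tokens; the tail of the conversation survives.
    return tokens[-max_tokens:]
```

Real chat systems often truncate at message boundaries or summarise the dropped history instead of cutting mid-stream, but the window limit itself works as shown.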