Overview
A database optimised for storing and querying high-dimensional vector embeddings for similarity search.
More in Natural Language Processing
Tokenisation
Semantics & RepresentationThe process of breaking text into smaller units (tokens) such as words, subwords, or characters for processing by language models.
BERT
Semantics & RepresentationBidirectional Encoder Representations from Transformers — a language model that understands context by reading text in both directions.
Named Entity Recognition
Parsing & StructureAn NLP task that identifies and classifies named entities in text into categories like person, organisation, and location.
Instruction Tuning
Semantics & RepresentationTraining a language model to follow natural language instructions by fine-tuning on instruction-response pairs.
Abstractive Summarisation
Text AnalysisA text summarisation approach that generates novel sentences to capture the essential meaning of a document, rather than simply extracting and rearranging existing sentences.
Context Window
Semantics & RepresentationThe maximum amount of text a language model can consider at once when generating a response.
GloVe
Semantics & RepresentationGlobal Vectors for Word Representation — an unsupervised learning algorithm for obtaining word vector representations from aggregated word co-occurrence statistics.
Chunking Strategy
Core NLPThe method of dividing long documents into smaller segments for embedding and retrieval, balancing context preservation with optimal chunk sizes for vector search accuracy.