Chunking Strategy

Overview

The method of dividing long documents into smaller segments for embedding and retrieval, balancing context preservation with optimal chunk sizes for vector search accuracy.

Cross-References(1)

Deep Learning

Embedding

Related in Core NLP

Natural Language Processing

The field of AI focused on enabling computers to understand, interpret, and generate human language.

Seq2Seq Model

A neural network architecture that maps an input sequence to an output sequence, used in translation and summarisation.

Latent Dirichlet Allocation

A generative probabilistic model for discovering topics in a collection of documents.

Text Embedding

Dense vector representations of text passages that capture semantic meaning for similarity comparison and retrieval.

Semantic Search

Search technology that understands the meaning and intent behind queries rather than just matching keywords.

Vector Database

A database optimised for storing and querying high-dimensional vector embeddings for similarity search.

Constitutional AI

An approach to AI alignment where models are trained to follow a set of principles or constitution.

Natural Language Understanding

The subfield of NLP focused on machine reading comprehension and extracting meaning from text.

Natural Language Generation

The subfield of NLP concerned with producing natural language text from structured data or representations.

Document Understanding

AI systems that extract structured information from unstructured documents by combining optical character recognition, layout analysis, and natural language comprehension.

Slot Filling

The task of extracting specific parameter values from user utterances to fulfil a detected intent, such as identifying dates, locations, and names in booking requests.

Cross-Lingual Transfer

The application of models trained in one language to perform tasks in another language, leveraging shared multilingual representations learned during pre-training.

More in Natural Language Processing

Contextual Embedding

Semantics & Representation

Word representations that change based on surrounding context, capturing polysemy and contextual meaning.

Code Generation

Semantics & Representation

The automated production of source code from natural language specifications or partial code context, powered by large language models trained on programming repositories.

Multilingual Model

Semantics & Representation

A language model trained on text from dozens or hundreds of languages simultaneously, enabling cross-lingual understanding and generation without language-specific fine-tuning.

Speech Synthesis

Speech & Audio

The artificial production of human speech from text, also known as text-to-speech.

BERT

Semantics & Representation

Bidirectional Encoder Representations from Transformers — a language model that understands context by reading text in both directions.

Structured Output

Semantics & Representation

The generation of machine-readable formatted responses such as JSON, XML, or code from language models, enabling reliable integration with downstream software systems.

Token Limit

Semantics & Representation

The maximum number of tokens a language model can process in a single input-output interaction.

GPT

Semantics & Representation

Generative Pre-trained Transformer — a family of autoregressive language models that generate text by predicting the next token.

Overview

Cross-References(1)

Related in Core NLP

Natural Language Processing

Seq2Seq Model

Latent Dirichlet Allocation

Text Embedding

Semantic Search

Vector Database

Constitutional AI

Natural Language Understanding

Natural Language Generation

Document Understanding

Slot Filling

Cross-Lingual Transfer

More in Natural Language Processing

Contextual Embedding

Code Generation

Multilingual Model

Speech Synthesis

BERT

Structured Output

Token Limit

GPT

See Also

Embedding