Activation Function — Technology Wiki

Overview

Direct Answer

An activation function is a mathematical operation applied to the weighted sum of inputs at each neuron, introducing non-linearity to enable neural networks to learn complex, non-linear relationships in data. Without it, stacked layers would collapse into a single linear transformation, severely limiting representational capacity.
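
The collapse of stacked linear layers can be checked numerically. The following is a minimal sketch, assuming two hypothetical 2x2 weight matrices (`W1`, `W2`) and no bias terms; the helper names are illustrative and do not come from any particular library. With biases included the composition would still be a single affine map.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(v):
    """Apply ReLU element-wise."""
    return [max(0.0, t) for t in v]


# Hypothetical weights and input, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 1.0]

# Two linear layers with no activation in between ...
stacked_linear = matvec(W2, matvec(W1, x))
# ... give the same result as one layer whose weight matrix is W2 @ W1.
collapsed = matvec(matmul(W2, W1), x)
# Inserting ReLU between the layers breaks that equivalence.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(stacked_linear, collapsed, with_relu)
```

For this input the two purely linear computations agree, while the ReLU version differs because the negative hidden value is clipped to zero before the second layer.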

How It Works

During forward propagation, each neuron computes a weighted sum of its inputs plus a bias term, then passes this value (the pre-activation) through the chosen activation function, such as ReLU, sigmoid, or tanh, before passing the result to the next layer. With a non-linear activation and enough hidden units, a network can approximate any continuous function on a bounded input domain to arbitrary accuracy. During backpropagation, the derivative of the activation function, evaluated at the pre-activation, scales the gradient that flows back to the neuron's weights and bias.
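
The forward and backward steps for a single neuron can be sketched as follows. This is an illustrative example rather than a reference implementation: it assumes a sigmoid activation, and the function names (`neuron_forward`, `neuron_backward`) and the `upstream_grad` parameter are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid with respect to its input z."""
    s = sigmoid(z)
    return s * (1.0 - s)


def neuron_forward(weights, bias, inputs):
    """Return the pre-activation z and the activation a = sigmoid(z)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return z, sigmoid(z)


def neuron_backward(inputs, z, upstream_grad):
    """Chain rule: scale the upstream gradient by the activation's derivative."""
    dz = upstream_grad * sigmoid_grad(z)
    dw = [dz * x for x in inputs]  # gradient for each weight
    db = dz                        # gradient for the bias
    return dw, db


# Hypothetical values chosen so the pre-activation is exactly zero.
weights = [0.5, -0.25]
bias = 0.0
inputs = [2.0, 4.0]

z, a = neuron_forward(weights, bias, inputs)
grad_w, grad_b = neuron_backward(inputs, z, upstream_grad=1.0)
print(z, a, grad_w, grad_b)
```

Here `upstream_grad` stands in for the gradient of the loss with respect to the neuron's output, which in a full network would come from the layers after it.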

Why It Matters

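The choice of activation function directly affects training speed, convergence behaviour, and final model accuracy. Poor choices can cause vanishing or exploding gradients, slowing training significantly or preventing learning altogether; saturating functions such as sigmoid and tanh are especially prone to vanishing gradients because their derivatives approach zero for large-magnitude inputs. Cheap functions such as ReLU, which needs only a comparison with zero, reduce computational overhead and lower inference costs in production systems.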
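
To make the vanishing-gradient effect concrete, the sketch below multiplies the local derivative of an activation across several layers. It is a deliberate simplification: it ignores the weight matrices and assumes every layer sees the same pre-activation, so it shows only the contribution of the activation function itself. The helper `chained_grad` is hypothetical.

```python
import math


def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum value is 0.25, at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def chained_grad(local_grad, z, depth):
    """Product of `depth` identical local derivatives (weights ignored)."""
    return local_grad(z) ** depth


depth = 10
# Best case for the sigmoid: every layer sits at z = 0, where the derivative peaks.
sigmoid_chain = chained_grad(sigmoid_grad, 0.0, depth)
# ReLU with positive pre-activations passes the gradient through unchanged.
relu_chain = chained_grad(relu_grad, 1.0, depth)

print(sigmoid_chain, relu_chain)
```

Even in the sigmoid's most favourable case the factor shrinks to 0.25 to the power of the depth, whereas active ReLU units contribute a factor of exactly one.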

Common Applications

ReLU is standard in convolutional neural networks for image recognition tasks. Sigmoid and tanh remain prevalent in recurrent networks for time-series forecasting, where they act as the gate and state non-linearities in LSTM and GRU cells. Softmax is the usual output activation for multi-class classification layers across natural language processing and computer vision applications.
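
As an illustration of the softmax output layer mentioned above, here is a minimal sketch in plain Python. Subtracting the maximum logit before exponentiating is a common numerical-stability trick rather than something stated in this entry; the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of real-valued scores into a probability distribution."""
    m = max(logits)  # shift by the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical class scores for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])
print(probs)

# Very large logits are handled because of the shift.
print(softmax([1000.0, 1000.0]))
```

The outputs are non-negative, sum to one, and preserve the ordering of the input scores, which is what makes them usable as class probabilities.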

Key Considerations

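ReLU units can suffer from the 'dying ReLU' problem: a neuron whose pre-activation is negative for every input outputs zero, receives zero gradient, and therefore stops updating. The choice must also align with the output layer's requirements: sigmoid for binary classification, softmax for multi-class classification, and a linear (identity) output for regression tasks.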
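
The 'dying ReLU' problem can be illustrated with a single neuron. The sketch below assumes a hypothetical neuron whose bias has drifted to a large negative value, and compares ReLU with leaky ReLU, a widely used variant that is not discussed in this entry and is shown here only as one possible mitigation. The slope `alpha=0.01` is an assumed default.

```python
def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def leaky_relu_grad(z, alpha=0.01):
    """Derivative of leaky ReLU: a small slope alpha for negative inputs."""
    return 1.0 if z > 0 else alpha


# Hypothetical neuron with a large negative bias and a small dataset.
weights = [0.1, 0.2]
bias = -10.0
dataset = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]

pre_activations = [
    sum(w * x for w, x in zip(weights, xs)) + bias for xs in dataset
]

# Every pre-activation is negative, so ReLU passes back no gradient at all
# and the weights can never move the neuron back into its active region.
relu_grads = [relu_grad(z) for z in pre_activations]
# Leaky ReLU keeps a small non-zero gradient, so updates remain possible.
leaky_grads = [leaky_relu_grad(z) for z in pre_activations]

print(pre_activations, relu_grads, leaky_grads)
```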

Cross-References

Deep Learning

Neural Network

Referenced By

3 terms mention Activation Function

Other entries in the wiki whose definition references Activation Function — useful for understanding how this concept connects across Deep Learning and adjacent domains.

ReLU · Deep Learning

Sigmoid Function · Deep Learning

Softmax Function · Deep Learning

Related in Training & Optimisation

Self-Attention

An attention mechanism where each element in a sequence attends to all other elements to compute its representation.

Multi-Head Attention

An attention mechanism that runs multiple attention operations in parallel, capturing different types of relationships.

Residual Network

A deep neural network architecture using skip connections that allow gradients to flow directly through layers, enabling very deep networks.

Layer Normalisation

A normalisation technique that normalises across the features of each individual sample rather than across the batch.

Dropout

A regularisation technique that randomly deactivates neurons during training to prevent co-adaptation and reduce overfitting.

ReLU

Rectified Linear Unit — an activation function that outputs the input directly if positive, otherwise outputs zero.

Sigmoid Function

An activation function that maps input values to a range between 0 and 1, useful for binary classification outputs.

Softmax Function

An activation function that converts a vector of numbers into a probability distribution, commonly used in multi-class classification.

Positional Encoding

A technique that injects information about the position of tokens in a sequence into transformer architectures.

Gradient Clipping

A technique that caps gradient values during training to prevent the exploding gradient problem.

Mixed Precision Training

Training neural networks using both 16-bit and 32-bit floating-point arithmetic to speed up computation while maintaining accuracy.

Rotary Positional Encoding

A position encoding method that encodes absolute position with a rotation matrix and naturally incorporates relative position information into attention computations.

More in Deep Learning

Vanishing Gradient

Architectures

A problem in deep networks where gradients become extremely small during backpropagation, preventing earlier layers from learning.

Capsule Network

Architectures

A neural network architecture that groups neurons into capsules to better capture spatial hierarchies and part-whole relationships.

Word Embedding

Language Models

Dense vector representations of words where semantically similar words are mapped to nearby points in vector space.

Model Parallelism

Architectures

A distributed training approach that partitions a model across multiple devices, enabling training of models too large to fit in a single accelerator's memory.

Fine-Tuning

Language Models

The process of adapting a pre-trained model to a specific task by continuing training on a smaller task-specific dataset, transferring learned representations to new domains.

Fully Connected Layer

Architectures

A neural network layer where every neuron is connected to every neuron in the adjacent layers.

Adapter Layers

Language Models

Small trainable modules inserted between frozen transformer layers that enable task-specific adaptation without modifying the original model weights.

Attention Mechanism

Architectures

A neural network component that learns to focus on relevant parts of the input when producing each element of the output.