Residual Network — Technology Wiki

Overview

Direct Answer

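A deep neural network architecture that uses skip connections (also called residual or shortcut connections) to let the input of a block bypass one or more layers and be added to that block's output. This makes networks with 100 or more layers trainable by easing the optimisation difficulties, including vanishing gradients, that affect very deep plain networks.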

How It Works

A skip connection adds the input of a block of layers directly to that block's output, so the layers only need to learn a residual mapping (the difference between the desired output and the input) rather than the full transformation. If x is the block input and F(x) is what the layers compute, the block outputs F(x) + x. During backpropagation the identity path contributes a direct term to the gradient, so error signals can reach the early layers of a very deep network without being attenuated at every layer.
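The mechanism described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (linear, relu, residual_block, chain_gradient) are hypothetical, the residual branch is assumed to be two small dense layers, and the activation that many architectures apply after the addition is left out for clarity. The second part treats each layer as a scalar with a fixed local slope, which is a deliberate simplification, to show how the identity path changes the product of per-layer derivatives.

```python
def linear(weights, bias, v):
    """Dense layer: one output per weight row."""
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def residual_block(x, w1, b1, w2, b2):
    """Return F(x) + x, where F is linear -> ReLU -> linear (an assumed branch)."""
    f = linear(w2, b2, relu(linear(w1, b1, x)))
    # Skip connection: add the block input to the branch output.
    return [f_i + x_i for f_i, x_i in zip(f, x)]


def chain_gradient(depth, local_slope, use_skip):
    """Product of per-layer derivatives in a scalar toy model.

    Plain layer:    d/dh F(h)       = local_slope
    Residual layer: d/dh (h + F(h)) = 1 + local_slope
    """
    grad = 1.0
    for _ in range(depth):
        grad *= (1.0 + local_slope) if use_skip else local_slope
    return grad


# With all-zero weights the branch outputs zero, so the block is the identity.
x = [1.0, -2.0, 3.0]
zeros_w = [[0.0] * 3 for _ in range(3)]
zeros_b = [0.0] * 3
identity_out = residual_block(x, zeros_w, zeros_b, zeros_w, zeros_b)

# Small local slopes make the plain product collapse; the skip path keeps it near 1.
plain_grad = chain_gradient(depth=20, local_slope=0.01, use_skip=False)
skip_grad = chain_gradient(depth=20, local_slope=0.01, use_skip=True)
print(identity_out, plain_grad, skip_grad)
```

In this toy model the zero-weight block passes its input through unchanged, which is why extra residual blocks need not make a network worse at initialisation. The scalar gradient comparison only illustrates the additive identity term; real networks have matrix-valued derivatives, and the same term can also let gradients grow if the branch slopes are large.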

Why It Matters

Residual networks dramatically improved accuracy in large-scale image recognition tasks and became foundational for modern computer vision systems. Because substantially deeper models became trainable and converged more readily than plain networks of comparable depth, performance improved on complex visual and sequential tasks, driving adoption across industries that require high-accuracy perception systems.

Common Applications

Residual architectures are widely used in medical image analysis for diagnostic detection, object recognition in autonomous vehicle systems, and large-scale image classification on e-commerce platforms. Natural language processing models and speech recognition systems also use residual connections to process sequential data more effectively.

Key Considerations

Deeper networks do not automatically produce better results. Residual connections ease optimisation, but deep residual networks still need careful hyperparameter tuning and substantial computational resources. Practitioners must balance network depth against overfitting risk and deployment constraints such as latency and memory.

Cross-References

Deep Learning

Neural Network

Related in Training & Optimisation

Self-Attention

An attention mechanism where each element in a sequence attends to all other elements to compute its representation.

Multi-Head Attention

An attention mechanism that runs multiple attention operations in parallel, capturing different types of relationships.

Layer Normalisation

A normalisation technique that normalises across the features of each individual sample rather than across the batch.

Dropout

A regularisation technique that randomly deactivates neurons during training to prevent co-adaptation and reduce overfitting.

Activation Function

A mathematical function applied to neural network outputs to introduce non-linearity, enabling the learning of complex patterns.

ReLU

Rectified Linear Unit — an activation function that outputs the input directly if positive, otherwise outputs zero.

Sigmoid Function

An activation function that maps input values to a range between 0 and 1, useful for binary classification outputs.

Softmax Function

An activation function that converts a vector of numbers into a probability distribution, commonly used in multi-class classification.

Positional Encoding

A technique that injects information about the position of tokens in a sequence into transformer architectures.

Gradient Clipping

A technique that caps gradient values during training to prevent the exploding gradient problem.

Mixed Precision Training

Training neural networks using both 16-bit and 32-bit floating-point arithmetic to speed up computation while maintaining accuracy.

Rotary Positional Encoding

A position encoding method that encodes absolute position with a rotation matrix and naturally incorporates relative position information into attention computations.

More in Deep Learning

Word Embedding (Language Models)

Dense vector representations of words where semantically similar words are mapped to nearby points in vector space.

Fine-Tuning (Architectures)

The process of taking a pretrained model and further training it on a smaller, task-specific dataset.

Generative Adversarial Network (Generative Models)

A framework where two neural networks compete: a generator creates synthetic data while a discriminator evaluates its authenticity.

Weight Decay (Architectures)

A regularisation technique that penalises large model weights during training, typically by adding a term proportional to the squared weight magnitude to the loss function, to reduce overfitting.

Batch Normalisation (Architectures)

A technique that normalises layer inputs during training to stabilise and accelerate deep neural network learning.

Vanishing Gradient (Architectures)

A problem in deep networks where gradients become extremely small during backpropagation, preventing earlier layers from learning.

Knowledge Distillation (Architectures)

A model compression technique where a smaller student model learns to mimic the behaviour of a larger teacher model.

Capsule Network (Architectures)

A neural network architecture that groups neurons into capsules to better capture spatial hierarchies and part-whole relationships.