Overview
Direct Answer
A fully connected layer is a neural network layer in which every neuron receives input from every neuron in the preceding layer and passes its output to every neuron in the following layer, when that layer is also fully connected. Also called a dense layer, it forms a complete bipartite graph of connections between adjacent layers.
How It Works
Each neuron in the layer computes a weighted sum of all inputs from the prior layer, adds a bias term, and applies an activation function to produce its output. The weight matrix holds one weight per input-output pair, so its size is the product of the input and output neuron counts. Parameter count and computation cost grow with that product, which means they grow quadratically when both layer widths are scaled together. Stacked with non-linear activation functions, dense layers let the network approximate a wide range of non-linear transformations, with the weights adjusted by gradient-based training using backpropagation.
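The computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the names `dense_forward` and `relu` and all the numbers are made up for the example, and real frameworks express the same step as a single matrix multiplication.

```python
def relu(z):
    """Rectified Linear Unit: pass positive values, clamp the rest to zero."""
    return max(0.0, z)


def dense_forward(x, weights, biases, activation):
    """Forward pass of one dense layer for a single input vector.

    weights has one row per output neuron and one column per input,
    so every output neuron sees every input value.
    """
    outputs = []
    for row, bias in zip(weights, biases):
        weighted_sum = sum(w * xi for w, xi in zip(row, x)) + bias
        outputs.append(activation(weighted_sum))
    return outputs


# Illustrative values only: 3 inputs feeding 2 output neurons.
x = [1.0, 2.0, 3.0]
weights = [
    [0.5, -1.0, 0.5],   # weights into output neuron 0
    [1.0, 1.0, -0.5],   # weights into output neuron 1
]
biases = [0.25, -2.0]

y = dense_forward(x, weights, biases, relu)
print(y)  # [0.25, 0.0]

# One weight per input-output pair, plus one bias per output neuron.
n_params = len(weights) * len(x) + len(biases)
print(n_params)  # 8
```

The second neuron's weighted sum is negative, so ReLU clamps it to zero. The parameter count of 8 is the 3 × 2 weight matrix plus the 2 biases.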
Why It Matters
Dense layers are a primary mechanism for combining features into learned representations and decision boundaries in neural networks. Each layer reduces to a matrix multiplication, which modern hardware executes efficiently, but the width of the layer directly affects both model accuracy and inference latency. Both are critical in production systems that handle real-time predictions and large-scale data processing.
Common Applications
Fully connected layers appear in image classification networks (following convolutional feature extraction), natural language processing models for text classification, recommendation systems, and time-series forecasting. They form the output layer in most classification and regression networks.
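As a rough sketch of the image classification case, the snippet below flattens a small stack of feature maps and passes them through a dense output layer followed by softmax. The function names, shapes, and weights are hypothetical, chosen only to keep the example readable. In practice the feature maps would come from trained convolutional layers and the weights from training.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_head(feature_maps, weights, biases):
    """Flatten feature maps, apply a dense output layer, then softmax."""
    flat = [v for channel in feature_maps for row in channel for v in row]
    logits = [
        sum(w * f for w, f in zip(row, flat)) + bias
        for row, bias in zip(weights, biases)
    ]
    return softmax(logits)


# Hypothetical input: 2 channels of 2x2 features, flattened to 8 values.
feature_maps = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.0, 0.0]],
]

# Hypothetical weights: 3 classes, each connected to all 8 features.
weights = [
    [0.1] * 8,
    [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [-0.1] * 8,
]
biases = [0.0, 0.0, 0.0]

probs = classification_head(feature_maps, weights, biases)
predicted_class = probs.index(max(probs))
print(predicted_class)  # 1
```

Flattening discards the grid layout of the feature maps, which is why the dense layer usually sits at the end of the network, after the convolutional layers have done the spatial work.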
Key Considerations
Fully connected layers introduce significant parameter overhead compared to convolutional or recurrent alternatives, increasing memory consumption and training time. They assume no spatial or temporal structure in data, making them less efficient than specialised layers for structured inputs such as images or sequences.
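The parameter overhead can be made concrete with some arithmetic. The sketch below compares a dense layer and a convolutional layer applied to the same input. The 224 × 224 RGB image, the 64 output units, and the 3 × 3 kernel are assumed values picked for illustration, and the helper names are hypothetical.

```python
def dense_params(n_inputs, n_outputs):
    """One weight per input-output pair, plus one bias per output neuron."""
    return n_inputs * n_outputs + n_outputs


def conv2d_params(in_channels, out_channels, kernel_h, kernel_w):
    """One small kernel per (input channel, output channel) pair, plus biases.

    The count does not depend on image size because the same kernel
    is reused at every spatial position.
    """
    return in_channels * out_channels * kernel_h * kernel_w + out_channels


# Assumed input: a 224x224 RGB image, flattened for the dense layer.
height, width, channels = 224, 224, 3
n_inputs = height * width * channels

dense = dense_params(n_inputs, 64)
conv = conv2d_params(channels, 64, 3, 3)

print(n_inputs)       # 150528
print(dense)          # 9633856
print(conv)           # 1792
print(dense // conv)  # 5376
```

Under these assumptions the dense layer needs several thousand times more parameters than the convolutional layer, because it learns a separate weight for every pixel-to-neuron connection instead of sharing a small kernel across positions.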
Cross-References
More in Deep Learning
Generative Adversarial Network (Generative Models)
A framework where two neural networks compete: a generator creates synthetic data while a discriminator evaluates its authenticity.
Pre-Training (Language Models)
The initial phase of training a deep learning model on a large unlabelled corpus using self-supervised objectives, establishing general-purpose representations for downstream adaptation.
Activation Function (Training & Optimisation)
A mathematical function applied to neural network outputs to introduce non-linearity, enabling the learning of complex patterns.
Model Parallelism (Architectures)
A distributed training approach that partitions a model across multiple devices, enabling training of models too large to fit in a single accelerator's memory.
Key-Value Cache (Architectures)
An optimisation in autoregressive transformer inference that stores previously computed key and value tensors to avoid redundant computation during sequential token generation.
Graph Neural Network (Architectures)
A neural network designed to operate on graph-structured data, learning representations of nodes, edges, and entire graphs.
Vanishing Gradient (Architectures)
A problem in deep networks where gradients become extremely small during backpropagation, preventing earlier layers from learning.
ReLU (Training & Optimisation)
Rectified Linear Unit: an activation function that outputs the input directly if positive, otherwise outputs zero.