Nowadays, Convolutional Neural Networks (also called CNNs or ConvNets) dominate fields like computer vision. They excel at everything from image classification to face recognition, and even have applications in areas like music and medicine. So how did this come to be? Why can’t we just use standard fully-connected neural networks for everything?
Let’s say we want to train an image classifier on 1000 x 1000 RGB images with a standard neural network. That means we’ll have 3 million input features (1000 x 1000 pixels x 3 color channels). If we want 1000 units in the first hidden layer, we end up with a weight matrix of 3 billion parameters. With that many parameters, we’d be hard pressed to find enough training data to prevent overfitting. Plus, the computational and memory requirements of training such a model would be impractical.
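To make the arithmetic concrete, here is a quick back-of-the-envelope calculation. The memory estimate assumes each weight is stored as a 32-bit float, which is an assumption on our part and not something fixed by the example above.

```python
# Parameter count for the first fully connected layer in the example above.
height, width, channels = 1000, 1000, 3
hidden_units = 1000

input_features = height * width * channels    # one feature per pixel per channel
weight_count = input_features * hidden_units  # one weight per (input, hidden unit) pair

# Assumption: 32-bit floats, i.e. 4 bytes per weight.
bytes_per_weight = 4
weight_memory_gb = weight_count * bytes_per_weight / 1e9

print(f"{input_features:,} input features")
print(f"{weight_count:,} weights in the first layer")
print(f"{weight_memory_gb} GB just to store those weights")
```

Under that assumption, the first layer alone needs about 12 GB before we even count gradients or the remaining layers.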
This is where Convolutional Neural Networks come in. Instead of having separate weights for each input pixel, we train a fixed number of “filters” that we convolve with our input image to extract features which can then be passed into ReLU, pooling, and fully connected layers.
Consider a 5 x 5 input image whose pixel values are only 0 and 1:
Then, consider one of our filters, a 3 x 3 matrix:
We compute the convolution of our image and filter as shown in the animation below:
Basically, we slide our filter over our input image 1 pixel at a time (called a “stride” of 1). At each position, we multiply the filter element-wise with the patch of the image it covers and add up the products. Each sum becomes one entry of our output matrix, called a feature map.
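The sliding-window computation can be sketched in a few lines of NumPy. The 5 x 5 image and 3 x 3 filter below are illustrative values we picked for this sketch, and `convolve2d` is our own helper name, not a library function. Like most deep learning libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    output = np.zeros((out_h, out_w), dtype=np.result_type(image, kernel))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = image[top:top + kh, left:left + kw]
            output[i, j] = np.sum(window * kernel)
    return output

# Illustrative 5 x 5 binary image and 3 x 3 filter (assumed values).
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

feature_map = convolve2d(image, kernel, stride=1)
print(feature_map)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```

With a 5 x 5 input, a 3 x 3 filter, and a stride of 1, the filter fits in 3 positions along each axis, so the feature map is 3 x 3.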
A great real-world example is below:
Our first filter, with the red outline, is convolved with our input image to form a feature map. The convolution of another filter, with the green outline, forms a different feature map.
In practice, CNNs will learn the values of the best filters to use through the optimization of a loss function, as with most learning models. What we need to specify are “hyperparameters” like the number of filters to use and filter size.
After our convolutional layer, we need to introduce non-linearity into our model. ReLU is an element-wise operation that replaces all negative pixel values in the feature map with zero, as shown in the example below:
Other non-linear functions, such as tanh or sigmoid, can be used, but ReLU is the most common choice and has been found to work well in most situations.
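In code, ReLU is a one-liner. The feature map values below are made up for illustration.

```python
import numpy as np

def relu(feature_map):
    """Replace every negative value with zero, leaving the rest unchanged."""
    return np.maximum(feature_map, 0)

# Hypothetical feature map containing some negative activations.
feature_map = np.array([[ 3, -2,  1],
                        [-5,  0,  4],
                        [ 2, -1, -7]])

activated = relu(feature_map)
print(activated)
# [[3 0 1]
#  [0 0 4]
#  [2 0 0]]
```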
Pooling layers, also known as downsampling layers, reduce the dimensionality of our features, both making our model more computationally efficient and combating overfitting. Two types of pooling are commonly used: Max Pooling and Average Pooling. Max Pooling returns the maximum value from each region of the feature map covered by the kernel, while Average Pooling returns the average value of that region:
Generally, we use Max Pooling, as it is able to discard noisy activations. Average Pooling, on the other hand, simply performs dimensionality reduction.
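Here is a minimal sketch of both pooling types. The helper name `pool2d`, the 2 x 2 window, the stride of 2, and the input values are all assumptions chosen for illustration.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample `feature_map` by taking the max or mean of each window."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    reduce_fn = np.max if mode == "max" else np.mean
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = feature_map[top:top + size, left:left + size]
            output[i, j] = reduce_fn(window)
    return output

# Hypothetical 4 x 4 feature map.
feature_map = np.array([[1, 1, 2, 4],
                        [5, 6, 7, 8],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4]])

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)
# [[6. 8.]
#  [3. 4.]]
print(avg_pooled)
# [[3.25 5.25]
#  [2.   2.  ]]
```

Either way, the 4 x 4 feature map shrinks to 2 x 2, so the layers that follow have a quarter as many values to process.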
Fully Connected Layer
With the above layers in place, our model can extract the relevant features from our image. All we have left to do is flatten the output and feed it into a standard fully-connected neural network.
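As a rough sketch of this last step, the snippet below flattens a stack of pooled feature maps and passes it through a single fully connected layer. The shapes, the random (untrained) weights, and the softmax used to turn scores into class probabilities are our own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pooled output: 2 feature maps, each 2 x 2.
pooled = rng.standard_normal((2, 2, 2))

# Flatten everything into a single feature vector of length 8.
flat = pooled.reshape(-1)

# One fully connected layer mapping the 8 features to 3 classes.
# The weights are random here; in a real model they would be learned.
num_classes = 3
weights = rng.standard_normal((num_classes, flat.size)) * 0.1
bias = np.zeros(num_classes)
logits = weights @ flat + bias

# Softmax (an assumption here) turns the scores into probabilities.
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

print(flat.shape, probs.shape)
```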
Putting it all together, our model should look something like this:
In summary, here’s how we build a Convolutional Neural Network:
- Feed the input image into a convolutional layer.
- Choose hyperparameters, apply filters with strides, and perform convolution on the image.
- Apply ReLU activation to the resulting feature maps.
- Perform pooling to reduce dimensionality and combat overfitting.
- Flatten the output and feed into a fully connected layer to output the final classifications.
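The steps above can be sketched as a single forward pass in NumPy. This is a toy illustration, not a trained model: the function name `forward`, the shapes, the random weights, and the final softmax are all assumptions, and the convolution uses a stride of 1 with no padding.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def forward(image, filters, fc_weights, fc_bias, pool_size=2):
    """One forward pass: convolution -> ReLU -> max pooling -> flatten -> fully connected."""
    # Convolution (stride 1): windows has shape (out_h, out_w, kh, kw).
    windows = sliding_window_view(image, filters.shape[1:])
    feature_maps = np.einsum("ijkl,fkl->fij", windows, filters)

    # ReLU: zero out negative activations.
    activated = np.maximum(feature_maps, 0)

    # Max pooling over non-overlapping pool_size x pool_size blocks.
    f, h, w = activated.shape
    h_trim, w_trim = h - h % pool_size, w - w % pool_size
    blocks = activated[:, :h_trim, :w_trim].reshape(
        f, h_trim // pool_size, pool_size, w_trim // pool_size, pool_size)
    pooled = blocks.max(axis=(2, 4))

    # Flatten and apply the fully connected layer, then softmax.
    flat = pooled.reshape(-1)
    logits = fc_weights @ flat + fc_bias
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
image = rng.random((6, 6))                       # hypothetical grayscale image
filters = rng.standard_normal((2, 3, 3))         # 2 filters of size 3 x 3
fc_weights = rng.standard_normal((3, 8)) * 0.1   # 2 maps x 2 x 2 = 8 features, 3 classes
fc_bias = np.zeros(3)

probs = forward(image, filters, fc_weights, fc_bias)
print(probs)
```

In a real CNN, the filter values and fully connected weights would be learned by minimizing a loss function; here they are random, so the output probabilities are meaningless beyond showing the shapes flowing through each step.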