Building an image detector using a Convolutional Neural Network
When you look at the picture above, do you see a man facing you, or a man facing to the right? Both? That's because the brain switches between interpretations based on the features it picks up. The way computers process an image is quite similar to the way our brain does.
A convolutional neural network (CNN) takes an input image and produces an output indicating the probability of what the image might be. This capability is used in a variety of applications, from auto-tagging features in social media to self-driving cars.
Let's say you have two images of 2x2 pixels each, one black & white and the other in color. To the computer each is an array of values between 0 and 255 (in 8-bit representation), where 0 represents the darkest pixel and 255 the brightest. While a black-and-white image is a single two-dimensional array, a color image consists of three layers (red, green and blue), with each color having its own intensity values.
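This representation can be sketched with a couple of hypothetical NumPy arrays: a 2D array for a grayscale image and a three-channel array for a color image.

```python
import numpy as np

# A hypothetical 2x2 grayscale image: one 2D array of intensities
# (0 = darkest, 255 = brightest)
gray = np.array([[0, 255],
                 [128, 64]], dtype=np.uint8)
print(gray.shape)  # (2, 2)

# The same size image in color: three stacked channels (red, green, blue),
# each channel holding its own intensity values
color = np.zeros((2, 2, 3), dtype=np.uint8)
color[0, 0] = [255, 0, 0]  # a pure red pixel
print(color.shape)  # (2, 2, 3)
```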
Now let’s proceed to understand the steps involved in building a CNN for image detection.
Convolution
In simple terms, convolution describes how one function modifies the shape of another.
When an input image is convolved with a filter (also called a feature detector), it produces a corresponding feature map. Think of the highlighted part of the input image as a sliding window: at each position it is compared with the feature detector, and the count of overlapping pixels is recorded in the feature map. The whole idea is to make the image smaller so that it is easier to process, while preserving the important features (like the eyes of a cheetah, the shape of a bullet train, etc.). We create multiple feature maps to obtain our first convolution layer.
Depending on the feature detector, we can also apply filters to the images, for example to sharpen, blur, or detect edges in an image.
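The sliding-window process above can be sketched as a small, illustrative function (the `convolve2d` name and the edge-detection kernel are my own choices, not from the original text):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the
    element-wise products at each position to build the feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A toy 5x5 image: a bright square on a dark background
image = np.array([[0, 0, 0, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 1, 1, 1, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 0, 0, 0]], dtype=float)

# A common edge-detection kernel: responds strongly where intensity changes
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)

feature_map = convolve2d(image, edge_kernel)
print(feature_map.shape)  # (3, 3) -- smaller than the 5x5 input
```

Note that the feature map is smaller than the input, which is exactly the size reduction described above.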
Rectification Layer
The rectification layer in a CNN is used to increase non-linearity. Linearity in this context refers to the gradual transitions from white to grey to black in the images; the rectification layer suppresses these gradual transitions.
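A common rectification function is ReLU, which simply zeroes out every negative value in a feature map. A minimal sketch (the sample values are hypothetical):

```python
import numpy as np

def relu(feature_map):
    """ReLU: replace every negative value with zero, breaking up
    the smooth linear transitions in the feature map."""
    return np.maximum(feature_map, 0)

fm = np.array([[-3.0, 1.5],
               [ 0.0, -0.5]])
rectified = relu(fm)
print(rectified)  # [[0.  1.5]
                  #  [0.  0. ]]
```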
Max Pooling
Pooling is performed to ensure 'spatial invariance', meaning that the neural network shouldn't depend on the exact location of a particular feature in a given image. (For example, in a set of images containing cheetahs, the position or orientation of the eye markings shouldn't matter to the system.)
The max pooling process takes the maximum value within each window and maps it to the pooled feature map. Notice also that with a 2x2 window we reduce the number of values going into the final layers of the neural network by about 75% compared to the feature map. Other types of pooling include min pooling (taking the minimum value), mean pooling, etc.
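Max pooling with a 2x2 window and stride 2 can be sketched as follows (the `max_pool` helper and sample values are hypothetical):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Take the maximum value inside each size x size window."""
    h, w = feature_map.shape
    out_h, out_w = h // stride, w // stride
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*stride:i*stride+size,
                                 j*stride:j*stride+size]
            pooled[i, j] = window.max()
    return pooled

fm = np.array([[1, 3, 2, 1],
               [4, 6, 5, 2],
               [7, 2, 9, 1],
               [3, 1, 4, 8]], dtype=float)

pooled = max_pool(fm)
print(pooled)                  # [[6. 5.]
                               #  [7. 9.]]
print(pooled.size / fm.size)   # 0.25 -- a 75% reduction in values
```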
Flattening
The flattening process converts a pooled feature map into a single column. This forms the input of the artificial neural network.
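In code, flattening is a single reshape of the pooled map into a vector (the sample values are hypothetical):

```python
import numpy as np

# A hypothetical 2x2 pooled feature map
pooled = np.array([[6.0, 5.0],
                   [7.0, 9.0]])

# Flattening unrolls the 2D map into a single column of values,
# which becomes the input to the fully connected network
flat = pooled.flatten()
print(flat)        # [6. 5. 7. 9.]
print(flat.shape)  # (4,)
```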
Full connection
A full connection is obtained when a whole artificial neural network is attached to the convolutional layers. The output value refers to the detected class (e.g. classification of an image as a cat or a dog). During training, the error is calculated along the direction of the data flow (whether an image is detected correctly or not by a particular neuron), and parameters such as the weights are adjusted in the artificial neural network. If neurons in the fully connected layer contribute to a wrong classification, the nodes in the output layer learn to ignore them; over time each output node learns which neurons consistently help with correct classification (e.g. one neuron reliably detects eyelashes, another some other feature, and so on).
In the example above, the dog output value considers the weighted average of only the neurons highlighted in blue, and likewise the cat output considers the weighted average of the neurons highlighted in green. The resulting probabilities from the output layer clearly indicate that the input image is that of a dog. The weights feeding the final output layer are learned through an iterative process.
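The output stage can be sketched as a weighted sum followed by a softmax that turns scores into class probabilities. The feature vector, weights, and class names below are entirely hypothetical; in a real network the weights are learned iteratively via backpropagation rather than written by hand:

```python
import numpy as np

def softmax(z):
    """Convert raw scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical flattened feature vector from the pooling stage
x = np.array([6.0, 5.0, 7.0, 9.0])

# Hypothetical weights and biases for a 2-class output layer (dog, cat).
# Each output node takes a weighted sum over the flattened features.
W = np.array([[ 0.4, 0.3, 0.5, 0.6],
              [-0.2, 0.1, -0.3, -0.1]])
b = np.array([0.1, 0.2])

probs = softmax(W @ x + b)
print(dict(zip(["dog", "cat"], probs)))  # "dog" gets nearly all the probability
```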