This article discusses the shortcomings of conventional machine learning techniques for image classification, which brought Convolutional Neural Networks (ConvNets) into the limelight. ConvNets can be shallow or deep. We describe the general architecture of a ConvNet, which is a feather in the cap of deep learning. We then explain LeNet (LeNet-5), a shallow ConvNet, and analyze AlexNet, a deep ConvNet designed by Alex Krizhevsky. Finally, we describe how AlexNet can be used for indoor scene classification, a common problem in AI.
Image classification is the task of assigning an image to one of a given set of categories based on its visual content. The conventional methods used for image classification (studied under artificial intelligence or machine learning) consisted of two separate modules, namely:
- The feature extraction module and
- The classification module.
The feature extraction module was completely separate from, and independent of, the classification module. The disadvantage of this separation was that the extractor could extract only a fixed set of predetermined features from each individual image; it could not learn to extract discriminating features from a set of images belonging to different categories. Widely used feature descriptors included Gist, Local Binary Patterns (LBP), Histogram of Oriented Gradients (HOG), and Scale Invariant Feature Transform (SIFT). Users could either pick one of these or build their own feature extractor. However, building a custom feature extractor is a separate task in its own right, and the result may not work effectively on all types of problems.
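The two-module pipeline described above can be sketched as follows. This is a minimal illustration, not any of the descriptors named in the text: the fixed extractor here is a plain intensity histogram, the classifier is a hypothetical nearest-centroid rule, and the images are synthetic. The point is only that `extract_features` never changes in response to how well the classifier performs.

```python
import numpy as np

def extract_features(image, bins=8):
    """Fixed, hand-designed descriptor: a normalized intensity histogram.
    Nothing in this function is learned from the training data."""
    hist, _ = np.histogram(image, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()

class NearestCentroidClassifier:
    """Separate classification module: assigns the class of the closest mean feature vector."""

    def fit(self, features, labels):
        self.classes_ = np.unique(labels)
        self.centroids_ = np.stack(
            [features[labels == c].mean(axis=0) for c in self.classes_]
        )
        return self

    def predict(self, features):
        # Distance from every sample to every class centroid.
        distances = np.linalg.norm(
            features[:, None, :] - self.centroids_[None, :, :], axis=2
        )
        return self.classes_[np.argmin(distances, axis=1)]

# Synthetic data (an assumption for illustration): dark images vs. bright images.
rng = np.random.default_rng(0)
dark = rng.uniform(0.0, 0.5, size=(20, 16, 16))
bright = rng.uniform(0.5, 1.0, size=(20, 16, 16))
images = np.concatenate([dark, bright])
labels = np.array([0] * 20 + [1] * 20)

# Module 1: feature extraction. Module 2: classification.
features = np.stack([extract_features(img) for img in images])
classifier = NearestCentroidClassifier().fit(features, labels)
predictions = classifier.predict(features)
print("training accuracy:", (predictions == labels).mean())
```

The histogram separates these two synthetic classes only because they differ in brightness. If the classes differed in shape or texture instead, the same fixed descriptor would carry no useful information, and nothing in the pipeline would adapt it.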
This called for an integrated feature extraction and classification model that overcomes the disadvantage of the conventional methods. The extractor module of such a model must be able to learn to extract discriminating features from the given training samples, guided by the accuracy achieved by the classification module. LeCun et al. suggest that a Multi-Layer Feed Forward Neural Network (MLFFNN) can be considered for this purpose. However, when an MLFFNN is applied to images, the curse of dimensionality comes into effect: the large number of input features requires an excessive number of trainable parameters. This excess (especially for high-resolution images) makes an MLFFNN infeasible for the desired classification task. The infeasibility arises primarily from the lack of a sufficiently large labelled training data set. Bishop recommends that, for effective training, the number of training samples should be at least ten times the number of parameters to be tuned. Even if enough training data is available, the next requirement for an image classification model is that it be invariant to translation. LeCun et al. mention that an MLFFNN can satisfy this property, but such a network would be so complex architecturally that implementing it in practice would be infeasible.
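To get a feel for the numbers, the parameter count of a fully connected network can be worked out directly. The sketch below assumes a hypothetical 224x224 RGB input, a single hidden layer of 1000 units, and 10 output classes; these sizes are illustrative and not taken from the text.

```python
def mlp_parameter_count(layer_sizes):
    """Count the weights and biases of a fully connected feed-forward network."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total

# Hypothetical sizes: every pixel value of the image is one input feature.
input_size = 224 * 224 * 3  # 150528 input features
params = mlp_parameter_count([input_size, 1000, 10])

print("trainable parameters:", params)               # 150539010
print("samples under the 10x rule:", 10 * params)    # 1505390100
```

Under these assumptions a single hidden layer already needs about 150 million parameters, and the ten-times rule of thumb quoted above would call for roughly 1.5 billion labelled images.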
The task of building a feasible neural network for image classification, with automatic feature extraction and translation invariance, gave rise to a new category of neural networks called Convolutional Neural Networks (ConvNets). These networks have special types of processing layers in the extractor module, which learn to extract discriminating features from the given input image. The extracted features are then passed on to the classification module, which resembles an MLFFNN in architecture. The advantage of a ConvNet is that it requires far fewer trainable parameters than an MLFFNN, because its architecture supports weight sharing and partially connected layers. Along with the reduced number of trainable parameters, a ConvNet's design also makes the model invariant to translation, which makes it state-of-the-art for classifying images.
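Both effects can be seen in a small sketch. The layer sizes (a 32x32 single-channel input and six 5x5 filters) and the toy 8x8 image are assumptions chosen for illustration. The first part compares a weight-sharing convolutional layer with a fully connected layer producing the same number of outputs. The second part shows that shifting the input shifts the feature map by the same amount, so that a global maximum over the map is unchanged.

```python
import numpy as np

def conv_layer_parameter_count(kernel_h, kernel_w, in_channels, out_channels):
    """One shared kernel (plus one bias) per output feature map."""
    return (kernel_h * kernel_w * in_channels + 1) * out_channels

def dense_layer_parameter_count(n_inputs, n_outputs):
    """Every output unit has its own weight for every input, plus a bias."""
    return (n_inputs + 1) * n_outputs

# Hypothetical layer: 32x32 input, six 5x5 filters -> six 28x28 feature maps.
conv_params = conv_layer_parameter_count(5, 5, 1, 6)
dense_params = dense_layer_parameter_count(32 * 32, 6 * 28 * 28)
print(conv_params, dense_params)  # 156 4821600

def correlate2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.array([[1.0, 0.0], [0.0, 1.0]])  # responds to a small diagonal pattern
image = np.zeros((8, 8))
image[2:4, 2:4] = np.eye(2)                  # pattern placed at row 2, column 2
# Shift by 3 rows and 2 columns; the pattern stays inside the image.
shifted = np.roll(image, shift=(3, 2), axis=(0, 1))

fmap = correlate2d_valid(image, kernel)
fmap_shifted = correlate2d_valid(shifted, kernel)

peak = np.unravel_index(np.argmax(fmap), fmap.shape)
peak_shifted = np.unravel_index(np.argmax(fmap_shifted), fmap_shifted.shape)
print(peak, peak_shifted)              # the peak location moves with the pattern
print(fmap.max(), fmap_shifted.max())  # the peak response itself is unchanged
```

The same kernel is reused at every position, which is why the parameter count does not depend on the image size and why the response follows the pattern wherever it appears. Taking a maximum over positions (as pooling does locally) then discards the location.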
This article describes ConvNet architectures such as LeNet and AlexNet, along with an approach to Indoor Scene Recognition (ISR) using the AlexNet architecture. It is organized as follows: Page 2 explains the general architecture of a convolutional neural network, along with the various types of processing layers that make up the network. Page 3 describes the architecture of LeNet, a shallow ConvNet, and the need for building deep networks. Page 4 describes the architecture of AlexNet, a deep ConvNet. Page 5 discusses Dropout, which is used in AlexNet. Page 6 describes AlexNet's application to Indoor Scene Recognition.