This article discusses the shortcomings of conventional machine learning techniques for image classification that brought Convolutional Neural Networks (ConvNets) into the limelight. Convolutional Neural Networks can be shallow or deep. We describe the general architecture of a ConvNet, which is a feather in the cap of deep learning. We then explain LeNet (LeNet5), one of the shallow convolutional neural networks, and analyze AlexNet, a deep convolutional neural network designed by Alex Krizhevsky. Finally, we describe how AlexNet can be used for indoor scene classification, a common problem in AI.
Image classification is the task of assigning an image to one of a given set of categories based on its visual content. The conventional methods used for image classification (studied under artificial intelligence or machine learning) consisted of two separate modules, namely
- The feature extraction module and
- The classification module.
The feature extraction module was completely separate from, and independent of, the classification module. The disadvantage of this separation was that the extractor could only extract a fixed set of predetermined features from each individual image; it could not learn to extract discriminating features from a set of images belonging to different categories. Some of the widely used descriptors for extracting features from images were Gist, Local Binary Patterns (LBP), Histogram of Oriented Gradients (HOG) and Scale Invariant Feature Transform (SIFT). Users could either use one of these or build their own feature extractor; however, building a custom feature extractor is a separate task in itself, and the result may not work effectively on all types of problems.
This necessitated an integrated feature extraction and classification model capable of overcoming the disadvantage of the conventional methods. The extractor module of such a model must learn to extract discriminating features from the given training samples, guided by the accuracy achieved by the classification module. LeCun et al. suggest that a Multi-Layer Feed Forward Neural Network (MLFFNN) can be considered for this purpose. However, when an MLFFNN is used on images, the curse of dimensionality comes into effect: the high number of input features requires an excessive number of trainable parameters. This excess (especially for high-resolution images) makes an MLFFNN infeasible for classification, primarily because a sufficiently large labelled training data set is rarely available. Bishop recommends that, for effective training, the number of training samples should be at least ten times the number of parameters to be tuned. Even if enough training data is available, an image classification model must also be invariant to translation. LeCun et al. mention that an MLFFNN can satisfy this property, but such a network would be architecturally so complex that implementing it in practice would be infeasible.
The task of building a feasible neural network for image classification, with automatic feature extraction and translation invariance, gave rise to a new category of neural networks called the Convolutional Neural Network (ConvNet). These networks have special types of processing layers in the extractor module that learn to extract discriminating features from the given input image. The extracted features are then passed on to the classification module, which resembles an MLFFNN in architecture. The advantage of a ConvNet is that it requires far fewer trainable parameters than an MLFFNN, because its architecture supports weight sharing and partially connected layers. Along with the reduced number of trainable parameters, a ConvNet's design also makes the model invariant to translation, which has made it the state of the art for classifying images.
This article describes ConvNet architectures such as LeNet and AlexNet, and an approach to the task of Indoor Scene Recognition (ISR) that uses the AlexNet architecture. It is organized as follows. The section General Architecture of a Convolutional Neural Network explains the general architecture of a ConvNet along with the types of processing layers that build the network. The section on LeNet describes the architecture of this shallow ConvNet and the need for building deeper networks. The section on AlexNet describes the architecture of this deep ConvNet. The section on Dropout describes the dropout technique used in AlexNet. The final section describes AlexNet's application in Indoor Scene Recognition.
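To make the parameter saving concrete, the sketch below compares the number of trainable parameters of one fully connected layer and one convolutional layer applied to the same input. The 32 x 32 grayscale input, the 100 hidden units and the six 5 x 5 masks are illustrative assumptions, not figures taken from the text.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit in a fully connected layer."""
    return n_inputs * n_units + n_units


def conv_params(mask_h, mask_w, in_maps, out_maps):
    """Weights plus one bias per output feature map.

    The same mask is reused at every position of the feature map,
    so the count does not depend on the image size.
    """
    return mask_h * mask_w * in_maps * out_maps + out_maps


# Illustrative sizes (assumptions, not taken from the article).
n_pixels = 32 * 32
fc_count = dense_params(n_pixels, 100)   # 1024 * 100 + 100
conv_count = conv_params(5, 5, 1, 6)     # 5 * 5 * 1 * 6 + 6
print(fc_count, conv_count)
```

Under these assumptions the fully connected layer needs several hundred times more parameters than the convolutional layer, and the gap widens as the image resolution grows.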
General Architecture of a Convolutional Neural Network
ConvNets are built from three types of processing layers: Convolutional, Subsampling (SS) and Fully Connected (FC). The configuration and arrangement of these layers are decided by the designer of the network, depending on the type and complexity of the problem to be solved and on the output expected from the network. The output of a ConvNet can be either the posterior probabilities of a sample belonging to the various classes, or discriminative features that can be used with any other classifier. Fig. 1 shows the general architecture of a convolutional neural network, in which an image is given as input to the first convolutional layer, is then processed by a subsampling layer, and is finally processed by a fully connected layer. The initial convolutional and subsampling layers form the feature extraction module, whereas the last fully connected layers form the classification module. Generally, no hard boundary exists between these modules: some fully connected layers may be regarded as part of the feature extraction module, but the classification module is built up of one or more fully connected layers only. The last processing layer of a typical ConvNet is a fully connected layer, also called the output layer, which computes the posterior probabilities of the given input belonging to each of the classes. Removing the output layer, and optionally the last few fully connected layers, transforms the ConvNet from a classifier into a feature extractor. The functioning of the three types of processing layers is described in the sections Convolutional layer, Subsampling layer and Fully Connected layer.
Convolutional layer
Convolution involves shift, multiply and sum operations. The main processing component of this layer is a filter, or mask, which is a matrix of weights. The mask is applied to a specified region of the feature map; the window that marks this region is called the receptive field. The layer computes the weighted sum of the values covered by the receptive field using the weights of the mask, adds a bias, and passes the result through an activation function of choice to introduce non-linearity. Once this value is computed, the receptive field moves across the feature map by a number of steps given by a user-specified parameter called the stride, covering a new area from which the next value is computed with the same mask. The purpose of the mask is to learn and identify the basic patterns of which the objects in the image are made up. As these basic shapes mostly form the edges of the object, the mask typically learns to detect the edges of the discriminating object and filters out other features. Fig. 5b and Fig. 5c show what happens to an image (shown in Fig. 5a) when it passes through convolutional layers 1 and 2. The output of the convolution operation on an image is a new, filtered image. The weights of a mask are shared across a feature map. Shared weights and a shifting receptive field not only make a ConvNet computationally faster, by giving the processing layer fewer trainable parameters, but also induce translation invariance.
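The shift, multiply and sum procedure described above can be sketched in NumPy as follows. This is a minimal single-channel illustration without padding; the function name `conv2d_single`, the 6 x 6 test image and the hand-written edge mask are assumptions made for the example (in a real ConvNet the mask weights are learned).

```python
import numpy as np


def conv2d_single(feature_map, mask, bias=0.0, stride=1, activation=np.tanh):
    """Slide `mask` over `feature_map` (no padding) and return the new map."""
    fh, fw = feature_map.shape
    mh, mw = mask.shape
    out_h = (fh - mh) // stride + 1
    out_w = (fw - mw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            field = feature_map[r:r + mh, c:c + mw]   # receptive field
            out[i, j] = np.sum(field * mask) + bias   # weighted sum + bias
    return activation(out)


# Hypothetical input whose values grow by 1 from left to right.
image = np.arange(36, dtype=float).reshape(6, 6)
# Hypothetical hand-written mask that responds to horizontal intensity change.
mask = np.array([[-1.0, 0.0, 1.0]] * 3)

relu = lambda v: np.maximum(v, 0.0)
result = conv2d_single(image, mask, stride=1, activation=relu)
print(result.shape)
```

The same nine mask weights are reused at every position, which is the weight sharing the text refers to. As is conventional in ConvNets, the mask is not flipped, so strictly speaking this is a cross-correlation.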
Subsampling layer
The subsampling layer subsamples the input provided to it, reducing the resolution of the feature map and hence the number of features. Subsampling also lowers the effect of local distortions. It is most commonly used in its mean-pool or max-pool variants, which is why the subsampling layer is also referred to as the pooling layer. It is applied after one or more convolutional layers. There are also less popular variants of subsampling. LeNet (described in the section LeNet: A shallow Convolutional Neural Network) uses a completely different type of subsampling, defined as φ(A · X + B), where X is the sum of the values under the receptive field on the feature map given as input to this layer, A and B are the trainable weight and bias of the layer, and φ is the activation function, which is the logistic sigmoid in the case of LeNet.
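The pooling variants described above can be sketched as follows. The sketch assumes non-overlapping square windows and a feature map whose sides are multiples of the window size; the function names and the small test map are illustrative.

```python
import numpy as np


def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling; height and width must be multiples of `size`."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "mean":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'mean'")


def lenet_subsample(feature_map, a, b, size=2):
    """LeNet-style subsampling: sigmoid(a * X + b), X = sum over each window."""
    h, w = feature_map.shape
    x = feature_map.reshape(h // size, size, w // size, size).sum(axis=(1, 3))
    return 1.0 / (1.0 + np.exp(-(a * x + b)))


fmap = np.array([[1.0, 2.0, 5.0, 6.0],
                 [3.0, 4.0, 7.0, 8.0],
                 [0.0, 0.0, 1.0, 1.0],
                 [0.0, 0.0, 1.0, 1.0]])
max_pooled = pool2d(fmap, mode="max")
mean_pooled = pool2d(fmap, mode="mean")
lenet_pooled = lenet_subsample(fmap, a=0.0, b=0.0)
```

Each 2 x 2 window collapses to a single value, so the 4 x 4 map becomes a 2 x 2 map and the number of features drops by a factor of four.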
Fully Connected layer
Fully connected (FC) layers are optional layers used to generate new features from the existing ones. An FC layer takes as input the output generated by the final pooling layer and transforms it non-linearly into another space. Hyperbolic tangent is the preferred choice of activation function in FC layers. This layer generally has the largest number of weights, with no sharing among them, and can therefore take longer to train than other layers. Recent ConvNet designs have shown that a network can also be constructed without FC layers.
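A fully connected layer of this kind can be sketched as a matrix product followed by the activation. The sizes used here (16 pooled maps of 5 x 5 values feeding 120 units) and the random weights are assumptions for illustration only.

```python
import numpy as np


def fully_connected(features, weights, bias, activation=np.tanh):
    """Flatten the pooled feature maps and apply one dense, non-linear layer."""
    flat = features.reshape(-1)
    return activation(weights @ flat + bias)


rng = np.random.default_rng(0)
pooled = rng.standard_normal((16, 5, 5))       # hypothetical pooled maps, 400 values
W = rng.standard_normal((120, 400)) * 0.05     # one weight per (unit, input) pair
b = np.zeros(120)
hidden = fully_connected(pooled, W, b)
print(hidden.shape, W.size)
```

Every unit has its own weight for every input, so this single layer already holds 48000 unshared weights under the assumed sizes.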
LeNet: A shallow Convolutional Neural Network
LeNet was designed by LeCun et al. It is one of the first shallow convolutional neural networks designed specifically to classify handwritten digits. It is trained and tested on the MNIST data set to classify the input into one of ten classes representing the digits 0-9. It has 2 convolutional layers, 2 subsampling layers, 2 hidden layers and 1 output layer. The arrangement and configuration of the layers are shown in Fig. 2. Usually the non-linearity is added after the convolutional layer, but in LeNet it is added in the subsampling layer, as described earlier in the discussion of the subsampling layer. Convolutional layer 1 works on all the input feature maps, whereas convolutional layer 2 takes inputs from a subset of the feature maps generated by subsampling layer 1 (refer to Table 1 for the configuration) to generate 16 feature maps. Partial connections are established instead of full connections mainly for i) reducing the number of weights and ii) gaining some assurance that different features are extracted, as a different set of input feature maps is used to generate each output feature map. It can also be observed in Fig. 2 that the number of features in the input feature map is 32 x 32 (1024), which is gradually reduced to 400 by the convolutional and subsampling layers before reaching the FC layers. The fully connected layers perform non-linear transformations of these input features, and finally the output layer generates 10 outputs to decide the class of the input.
LeNet was sufficiently good for the image classification problems of 1998, but the complexity of the problem increased with time. Complexity here refers to i) the size of the data set, which increased from thousands to millions of images, ii) the number of classes, which increased from a few tens to thousands, and iii) the resolution of individual images, which also increased from a few tens to thousands. Overall, the task changed from classifying simple handwritten digits to classifying complex scene images. LeNet lacked the number of processing units required to deal with such complex classification problems, and the need was to increase the number of processing layers (the depth).
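The reduction from 1024 input values to 400 features can be checked with a little arithmetic. The 32 x 32 input, the 16 feature maps and the final count of 400 come from the text; the 5 x 5 masks with stride 1 and the 2 x 2 subsampling windows are assumptions based on the usual description of LeNet-5.

```python
def conv_out(size, mask, stride=1):
    """Side length of a feature map after an unpadded convolution."""
    return (size - mask) // stride + 1


def pool_out(size, window):
    """Side length after non-overlapping subsampling."""
    return size // window


size = 32                   # input is 32 x 32
size = conv_out(size, 5)    # convolutional layer 1
size = pool_out(size, 2)    # subsampling layer 1
size = conv_out(size, 5)    # convolutional layer 2
size = pool_out(size, 2)    # subsampling layer 2
features = 16 * size * size
print(size, features)
```

With these assumed mask and window sizes, the 16 final maps are 5 x 5 each, which matches the 400 features quoted above.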
AlexNet: A Deep Convolutional neural Network
AlexNet (designed by Krizhevsky et al.) is one of the deep ConvNets designed to deal with the complex scene classification task on ImageNet data. The task is to classify the given input into one of 1000 classes. The main differences between LeNet and AlexNet are: i) Number of processing layers and number of trainable parameters: AlexNet has 5 convolutional layers, 3 subsampling layers and 3 fully connected layers, and its trainable parameters number in the tens of millions, whereas LeNet has 2 convolutional layers, 2 subsampling layers and 3 fully connected layers, and its trainable parameters number in the thousands. ii) The non-linearity used in the feature extractor module of AlexNet is ReLU, whereas LeNet uses the logistic sigmoid. iii) AlexNet uses dropout, whereas no such concept is used in LeNet. An AlexNet fully trained on the ImageNet data set can be used not only to classify ImageNet images; without its output layer, it can also be used to extract features from samples of any other data set.
The arrangement and configuration of all the layers of AlexNet are shown in Fig. 3. As can be seen in the figure, it uses a different activation function called the Rectified Linear Unit (ReLU), described below, after every convolutional layer. It also has a new type of processing called dropout after fully connected layers 1 and 2, which is described in the section Dropout: Counter measure for Overfitting in AlexNet. The arrangement of convolutional and pooling layers reduces the number of input features from 154587 to 1024 before sending them to the fully connected layers. Like LeNet, the convolutional layers of AlexNet also take inputs from a subset of the feature maps generated by the previous layer. The detailed mapping is described in the work of Krizhevsky et al. However, the purpose of such a mapping remains the same as described for LeNet in the section LeNet: A shallow Convolutional Neural Network.
ReLU is a function first introduced by Hahnloser et al. Nair et al. later stated that ReLU is an effective activation function for use in neural networks as well. The ReLU function is given by:
f(x) = max(0,x)
The plot of ReLU, along with the plot of its derivative, is shown in Fig. 4c and the corresponding expressions are given in Table II. ReLU is applied after the convolutional layer to induce sparsity in the features and to address the problem of the vanishing gradient. ReLU is not differentiable at 0, which creates a problem during backpropagation, as it requires the derivative to be defined at all points. SoftPlus, shown in Fig. 4d, is a smooth approximation to ReLU that was proposed by Dugas et al. and is given by:
f(x) = log(1 + e^x)
SoftPlus is differentiable at 0, but it is still not widely used because of the computation involved in evaluating the exponential and logarithm functions. Instead, ReLU is used with its derivative at 0 taken to be 0 or some small value ε.
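The two functions and their derivatives can be written down directly. In this sketch the `at_zero` argument is an illustrative name for the value chosen as the derivative of ReLU at 0, and SoftPlus is evaluated with `np.logaddexp` to avoid overflow for large inputs.

```python
import numpy as np


def relu(x):
    """f(x) = max(0, x)."""
    return np.maximum(0.0, x)


def relu_grad(x, at_zero=0.0):
    """Derivative of ReLU, with a chosen value at the kink x = 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 1.0, np.where(x == 0, at_zero, 0.0))


def softplus(x):
    """f(x) = log(1 + e^x), computed as log(e^0 + e^x)."""
    return np.logaddexp(0.0, x)


def softplus_grad(x):
    """The derivative of SoftPlus is the logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))


xs = np.array([-2.0, 0.0, 3.0])
print(relu(xs), relu_grad(xs))
print(softplus(xs), softplus_grad(xs))
```

ReLU needs only a comparison per value, whereas SoftPlus needs an exponential and a logarithm, which is the cost difference the text mentions.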
ReLU has the disadvantage that a network using it can suffer from the dying ReLU problem. The problem arises when the input to a node's ReLU is negative. In such cases ReLU has a derivative of 0 during the backward pass, and because of this the weights feeding that node are not trained. As they are not trained, the value of the node changes little in the following forward pass and the node does not get a chance to recover from the negative value; the node thus becomes potentially dead. The more nodes of a layer that generate negative values, the less the weights behind that layer are trained. In the extreme case, if all the outputs of a layer become negative, no further training is done for the weights behind that layer in subsequent training iterations.
It can be observed in Fig. 3 that ReLU is used as the activation function only with the convolutional layers, and hyperbolic tangent only with the fully connected layers. The two main reasons for using ReLU in convolutional layers are:
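The dying ReLU effect can be reproduced with a single node. The inputs, weights and the upstream gradient of 1 below are all made-up values chosen so that the node is negative for every sample; this is a sketch of the mechanism, not of any particular network.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(100, 3))   # non-negative inputs (assumption)
w = np.array([-1.0, -2.0, -0.5])           # weights that push the node negative
b = -0.1

z = x @ w + b                  # value entering ReLU, negative for every sample
a = np.maximum(0.0, z)         # ReLU output is 0 everywhere
upstream = np.ones_like(a)     # pretend dLoss/da = 1 for every sample
grad_z = upstream * (z > 0)    # ReLU derivative is 0 wherever z <= 0
grad_w = x.T @ grad_z          # gradient that reaches the weights
print(grad_w)
```

Because the gradient reaching `w` is exactly zero, a gradient-descent update leaves `w` unchanged, and the node stays negative on the next forward pass.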
- Faster convergence due to the absence of the vanishing gradient problem. The gradient contains a product of the derivatives of all the activation functions that lie on the path from the node whose weights are to be updated to the output layer. With hyperbolic tangent, these derivatives become very small when the outputs of the nodes enter the saturating zone of the function. As a result, the product of such small derivatives becomes extremely small, causing the gradient to vanish; this phenomenon is called the vanishing gradient problem. Convolutional layers suffer more from it, as they are many layers away from the output layer. The same does not happen with the ReLU activation function (used in convolutional layers), as it has no saturating zone for positive inputs.
- Inducing sparsity in features. Convolutional layers extract features, unlike fully connected layers. Feature extraction benefits from sparsity in the feature maps: as many features as possible should be set to 0. The features preserved with non-zero values are those that are discriminative in nature. Fig. 5a, 5b and 5c show the effect of convolutional layers with ReLU on the image of a rectangle. It can be observed that all the features required to identify the rectangle are preserved and all others are set to 0 (shown in black). Sparsity does not arise with other activation functions, as they can generate small values instead of zeros. Sparsity in the features helps speed up computation by removing the undesired features.
The focus of FC layers is to generate new features rather than to extract features, which is why hyperbolic tangent is used in fully connected (FC) layers. Moreover, as the FC layers are close to the output layer, they are less affected by the vanishing gradient problem. For these reasons, ReLU is considered less appropriate for them. LeCun et al. have shown that hyperbolic tangent achieves faster convergence than the logistic sigmoid. They later suggested that hyperbolic tangent converges faster because it is zero centered (shown in Fig. 4b), unlike the logistic sigmoid, which is centered at 0.5 (shown in Fig. 4a). This means that, on average, the values generated by hyperbolic tangent are close to 0 and those generated by the logistic sigmoid are close to 0.5. Hyperbolic tangent has a strong (high) derivative when its output is close to zero, whereas the logistic sigmoid has a weaker (low) derivative when its output is close to its mean of 0.5; this is why hyperbolic tangent is the preferred choice of activation function in fully connected layers.
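Both points can be illustrated numerically. The sketch below multiplies the activation derivatives along a hypothetical chain of 8 layers whose inputs all equal 3.0 (inside the saturating zone of tanh), ignoring the weight terms that also appear in a real gradient, and then counts exact zeros produced by ReLU and tanh on random values.

```python
import numpy as np


def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2


def relu_grad(z):
    return (z > 0).astype(float)


depth = 8
z = np.full(depth, 3.0)     # assumed inputs in tanh's saturating zone

tanh_product = np.prod(tanh_grad(z))   # shrinks with every extra layer
relu_product = np.prod(relu_grad(z))   # stays at 1 for positive inputs

# Sparsity: fraction of outputs that are exactly zero.
rng = np.random.default_rng(0)
pre = rng.standard_normal(1000)
relu_zero_fraction = np.mean(np.maximum(0.0, pre) == 0.0)
tanh_zero_fraction = np.mean(np.tanh(pre) == 0.0)
print(tanh_product, relu_product, relu_zero_fraction, tanh_zero_fraction)
```

Under these assumptions the tanh product is vanishingly small while the ReLU product is 1, and roughly half of the ReLU outputs are exactly zero while tanh produces small non-zero values instead.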
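The comparison between the two activation functions at their centers can be verified directly; the variable names are illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


sig_center = sigmoid(0.0)                        # output at input 0
tanh_center = np.tanh(0.0)

sig_slope = sigmoid(0.0) * (1.0 - sigmoid(0.0))  # derivative of the sigmoid at 0
tanh_slope = 1.0 - np.tanh(0.0) ** 2             # derivative of tanh at 0
print(sig_center, tanh_center, sig_slope, tanh_slope)
```

At its center the hyperbolic tangent has four times the slope of the logistic sigmoid, which is the stronger derivative the paragraph refers to.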
Dropout: Counter measure for Overfitting in AlexNet
Overfitting occurs when a ConvNet model with a high number of weights is trained on a training data set with few samples, so that the model learns to identify the intrinsic noise of the training data. Fig. 5d shows that when overfitting occurs, the model learns to identify the entire shape instead of the minimum discriminating features. This reduces the model's capability to correctly classify samples outside the training data set. AlexNet has a complex design with a high number of weights and, as per Bishop, the number of training samples should be at least 10 times the number of parameters to be tuned. When this much data is not available, AlexNet has a higher risk of overfitting. The general techniques to reduce overfitting in a neural network are: i) To train the same architecture on different training data sets and finally build a model with the average of the tuned parameters. However, this is not possible due to the lack of labelled training samples. ii) To train multiple architectures on the same data set and take the average of their predictions. However, this is also not practical, as it is computation intensive and slows down result generation during testing. iii) To train different architectures with weight sharing on the same data set (all the architectures being subsets of a large parent network). Dropout takes the third approach: it generates 2^n sub-network architectures when n nodes are associated with dropout in the parent architecture. Each of the 2^n architectures is trained rarely, if at all.
As per Hinton et al. and Srivastava et al., dropout helps in removing complex co-adaptations. Removing complex co-adaptations means training a node in a neural network with a randomly selected sample of other nodes. This makes the node more robust and drives it towards creating useful features without relying much on other nodes. The regularizing effect of dropout arises because it approximates the geometric mean of the predictions of 2^n different architectures for a particular sample. Another way of looking at dropout is that it adds noise to the network, and a network trained with dropout is more robust to noise.
Fig. 6 shows the connectivity of a dropout layer with a fully connected layer in a neural network. As can be seen, the nodes of the dropout layer associate themselves with the nodes of the layer just before them and take the output of those nodes as input. During each forward pass of the training phase, dropout sets each value it receives to 0 with probability (1-p) and passes the result on to the next layer. Here p is a user-specified parameter that represents the retain probability of the layer; p is usually 0.5 when dropout is attached to fully connected layers and 0.8 or 0.9 when it is attached to convolutional layers. When the output of dropout is 0, it is virtually as if the node to which dropout is attached did not exist. If the node is retained in a forward pass, its output is multiplied by 1/p, which keeps the expected output of the node unchanged. During the test phase, all nodes are retained and, because the scaling by 1/p has already been applied during training, the values are passed on to the next layer unchanged.
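A dropout node with retain probability p can be sketched as below. The sketch assumes the common "inverted dropout" convention, in which the scaling by 1/p is applied only during training and the test-time pass leaves the values untouched; the function name and the array of ones are illustrative.

```python
import numpy as np


def dropout(values, p_retain, training, rng):
    """Inverted dropout: zero each value with probability 1 - p_retain
    during training and scale the survivors by 1 / p_retain."""
    if not training:
        return values                       # test phase: keep every node as is
    mask = rng.random(values.shape) < p_retain
    return values * mask / p_retain


rng = np.random.default_rng(0)
acts = np.ones(10000)                       # hypothetical node outputs
train_out = dropout(acts, 0.5, True, rng)
test_out = dropout(acts, 0.5, False, rng)
print(train_out.mean(), test_out.mean())
```

With p = 0.5 each training-time output is either 0 or 2, so the average stays close to the original value of 1, which is the "same expected output" property described above.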
Indoor Scene Recognition using AlexNet
Indoor Scene Recognition (ISR) is the task of classifying images of indoor scenes into various classes. ISR suffers from high intra-class variability and high inter-class similarity. This makes it difficult for a convolutional neural network to classify such images, as the network preserves the global spatial structure. Fig. 7 shows a case in which images have high intra-class variability and high inter-class similarity. This section describes the approach proposed by Khan et al., which uses AlexNet as a feature extractor and an SVM as the classifier for ISR.
The approach begins with Data Augmentation, which augments the training data by generating multiple images from each image. Dense Patch Extraction then extracts dense mid-level features from the images. The extracted feature vectors are encoded into a new and more discriminative space with the help of Scene Representative Patches (SRPs). Finally, Training and Testing describes how the encoded features are used to train a C-SVM for classification.
1. Data Augmentation
16 images are generated from each training image. 5 crops, each 2/3 the size of the image, are extracted from the four corners and the center, and 2 additional images are generated by rotating the original image by 30° clockwise and counter-clockwise. These 7 images, together with the original image and the horizontal flips of all 8 images, replace the original image in the training data set.
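The sixteen views described above could be generated roughly as follows. This is a sketch under stated assumptions: "size 2/3" is read as two thirds of both height and width, and the rotation is a simple nearest-neighbour rotation about the image center written for this example (a real pipeline would more likely use an image library); `rotate_nearest` and `augment` are hypothetical helper names.

```python
import numpy as np


def rotate_nearest(img, degrees):
    """Rotate about the center with nearest-neighbour sampling; pixels that
    fall outside the source image are left at 0."""
    h, w = img.shape[:2]
    theta = np.deg2rad(degrees)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, locate its source pixel.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    xi = np.rint(src_x).astype(int)
    yi = np.rint(src_y).astype(int)
    valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    out = np.zeros_like(img)
    out[valid] = img[yi[valid], xi[valid]]
    return out


def augment(img):
    """Return 16 views: 5 crops, 2 rotations, the original, and their flips."""
    h, w = img.shape[:2]
    ch, cw = (2 * h) // 3, (2 * w) // 3
    corners = [(0, 0), (0, w - cw), (h - ch, 0), (h - ch, w - cw),
               ((h - ch) // 2, (w - cw) // 2)]          # 4 corners + center
    views = [img[t:t + ch, l:l + cw] for t, l in corners]
    views += [rotate_nearest(img, 30), rotate_nearest(img, -30)]
    views.append(img)
    views += [v[:, ::-1] for v in views]                # horizontal flips
    return views


image = np.random.default_rng(0).random((90, 120, 3))   # stand-in image
views = augment(image)
print(len(views))
```

The two opposite angles give the two rotation directions; which sign corresponds to clockwise depends on the image coordinate convention.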
2. Dense Patch Extraction
The overall structure of the image is not helpful for recognition; recognition therefore depends on the objects in the image. Dense patch extraction is the process of extracting mid-level (object-level) features from the images. It is performed by extracting patches of size 227x227 from each image of the training set with a stride of 32. A size of 227 was found to be appropriate, as it captures mid-level or object-level relationships in the image, and patches of this size can be given to AlexNet without any further scaling. A stride of 32 was found to be dense enough to capture all the mid-level information available in the image. A trained AlexNet without its output layer is used to extract a 4096-dimensional feature vector from each generated patch.
3. Scene Representative Patches
Visually, Scene Representative Patches (SRPs) are images of objects mostly found indoors. Mathematically, an SRP is the 4096-dimensional feature vector that AlexNet extracts from such an image. The purpose of SRPs is to encode the given data for ISR into a space that is more discriminative for classification. SRPs are generated in 2 ways: supervised and unsupervised.
Supervised generation of SRPs: Khan et al. created a new database called Object Categories in Indoor Scenes (OCIS) for the supervised generation of SRPs. The database consists of approximately 15000 images of various sizes from 1300 categories of objects found indoors. After the smaller dimension of each image is scaled to 256px, the images undergo Dense Patch Extraction. The feature vectors of all the patches of all the images of a category are max pooled to generate one 4096-dimensional representative vector for that category. This step creates a 4096x1300 matrix, which is called the supervised SRP codebook.
Unsupervised generation of SRPs: The supervised codebook is built from a database of images of the items most commonly found indoors. However, the list of such items is not exhaustive, so the supervised codebook is complemented by an unsupervised codebook built from the images of the training data. Clustering is applied to the dense patches extracted from the training data, and the cluster centers form the unsupervised SRP codebook. Khan et al. recommended multiple codebooks of size 3000 each instead of a single big codebook. The choice of clustering algorithm remains open.
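Dense patch extraction with the stated patch size and stride can be sketched as a sliding window. The 256 x 355 image is an arbitrary example size, and the feature extraction by AlexNet itself is not shown.

```python
import numpy as np


def dense_patches(img, patch=227, stride=32):
    """Return every patch x patch window of `img`, stepping by `stride`."""
    h, w = img.shape[:2]
    return [img[r:r + patch, c:c + patch]
            for r in range(0, h - patch + 1, stride)
            for c in range(0, w - patch + 1, stride)]


img = np.zeros((256, 355, 3))      # hypothetical image, smaller side 256px
patches = dense_patches(img)
print(len(patches), patches[0].shape)
```

For this example size, one window position fits vertically and five fit horizontally, giving five patches, each of which would then be passed through the trained AlexNet to obtain a 4096-dimensional vector.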
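The max pooling that builds the supervised codebook can be sketched as an element-wise maximum over all patch vectors of a category. Random numbers stand in for the AlexNet features, and only three categories are used instead of 1300 to keep the example small.

```python
import numpy as np


def category_vector(patch_features):
    """Element-wise max over all patch feature vectors of one category."""
    return np.max(patch_features, axis=0)


def supervised_codebook(features_by_category):
    """Stack one representative vector per category as the columns of a matrix."""
    return np.stack([category_vector(f) for f in features_by_category], axis=1)


rng = np.random.default_rng(0)
dim = 4096
# Stand-ins for AlexNet features: one (num_patches, 4096) array per category.
fake_features = [rng.random((n, dim)) for n in (7, 12, 5)]
codebook = supervised_codebook(fake_features)
print(codebook.shape)
```

With all 1300 categories the same procedure would give the 4096 x 1300 matrix described above.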
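Since the choice of clustering algorithm is left open, the sketch below uses plain k-means purely as an example of how cluster centers could form an unsupervised codebook. The 8-dimensional synthetic points stand in for 4096-dimensional patch features, and k = 2 stands in for a codebook size of 3000.

```python
import numpy as np


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means (Lloyd's algorithm); returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iterations):
        # Squared distance from every point to every center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):            # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers


rng = np.random.default_rng(1)
blob_a = rng.normal(0.0, 0.1, size=(50, 8))     # synthetic "patch features"
blob_b = rng.normal(10.0, 0.1, size=(50, 8))
points = np.vstack([blob_a, blob_b])
unsup_codebook = kmeans(points, k=2)
print(unsup_codebook.shape)
```

Each row of the result is one cluster center, that is, one entry of the unsupervised codebook.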
4. Encoding the Training Data
Encoding transforms the feature vectors of the training data into a different space with the help of the codebooks. Similarity Metric Coding (SMC), proposed by Khan et al., encodes the feature vectors using C-SVMs with a linear kernel. For a codebook i of size m, m linear one-vs-all C-SVMs are trained. The trained SVMs yield a 4096 x m matrix of weights W_i for each codebook i. The feature vectors of the training data are encoded using the formula:
f_i^{(n)} = W_i^T x^{(n)}
Here f_i^{(n)} is the partially encoded feature vector corresponding to image patch x^{(n)}, encoded using the weight matrix W_i. The row vector F_j^{(n)} = [f_1^{(n)} ... f_I^{(n)}] is the final feature vector for patch n of image j, where I is the number of codebooks.
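The encoding formula, together with the per-image max pooling described in the following paragraph, can be sketched in a few lines. Random matrices stand in for the trained SVM weights, and the sizes (3 codebooks of 5 entries, 4 patches) are made up to keep the example small; training the SVMs themselves is not shown.

```python
import numpy as np


def encode_patch(x, weight_matrices):
    """Concatenate f_i = W_i^T x over all codebooks i."""
    return np.concatenate([W.T @ x for W in weight_matrices])


def encode_image(patch_features, weight_matrices):
    """Encode every patch of one image, then max pool over its patches."""
    encoded = np.stack([encode_patch(x, weight_matrices) for x in patch_features])
    return encoded.max(axis=0)


rng = np.random.default_rng(0)
dim, m = 4096, 5
# Random stand-ins for the 4096 x m SVM weight matrices of 3 codebooks.
Ws = [rng.standard_normal((dim, m)) for _ in range(3)]
patch_vectors = rng.standard_normal((4, dim))    # 4 patches of one image
image_vector = encode_image(patch_vectors, Ws)
print(image_vector.shape)
```

Each patch is mapped from 4096 dimensions to one score per codebook entry, and the image is represented by the largest score each entry obtains over all of its patches.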
5. Training and Testing
The vectors F_j^{(n)} of all patches n belonging to an image j are max pooled to generate one feature vector for each training image. The encoded training data set is used to train a one-vs-one C-SVM. The test data undergoes all the steps described for the training data, except the augmentation, and is then classified using the trained SVM.
Conclusion
This article described how ConvNets are designed to learn and extract discriminating features from images. The architecture of LeNet was used to explain the functioning of a simple convolutional neural network for handwritten digit classification. As LeNet is too simple for the complex classification problems dealt with today, its successor AlexNet, a deep convolutional neural network, was explained. The use of ReLU as the activation function in convolutional layers was also described. Having a very high number of weights, AlexNet is prone to overfitting, and dropout is therefore used to deal with it. Finally, the problem of Indoor Scene Recognition, and how AlexNet can be used to solve it, was elaborated. Overall, ConvNets are the state of the art for classifying images. The architecture of a ConvNet depends upon the complexity of the problem. ConvNets can also be used as feature extractors, and the extracted features can be used with any type of classifier.
About the Author
Author: Vikram Singh
About me: I am a PhD scholar at Indian Institute of Technology-Madras. I like to write about different topics which are related to Academics, Computer Science and Technology. I would love to resolve your queries regarding this article. Please leave your questions / suggestions in the comments section below. I will reply to them as soon as possible.