AlexNet: A Deep Convolutional Neural Network
AlexNet (designed by Krizhevsky et al.) is a deep ConvNet designed for the complex scene classification task on the ImageNet data set. The task is to classify a given input into one of 1000 classes. The main differences between LeNet and AlexNet are: i) Number of processing layers and number of trainable parameters: AlexNet has 5 convolutional layers, 3 sub-sampling layers and 3 fully connected layers, and its trainable parameters number in the tens of millions, whereas LeNet has 2 convolutional layers, 2 sub-sampling layers and 3 fully connected layers, and its trainable parameters number in the thousands. ii) The non-linearity used in the feature extractor module of AlexNet is ReLU, whereas LeNet uses the logistic sigmoid. iii) AlexNet uses dropout, whereas LeNet has no such mechanism. A fully trained AlexNet on the ImageNet data set can be used not only to classify ImageNet data but also, with its output layer removed, to extract features from samples of any other data set.
The arrangement and configuration of all the layers of AlexNet is shown in Fig. 3. As can be seen in the figure, it uses a different activation function, the Rectified Linear Unit (ReLU) (described in Section 4a), after every convolutional layer, and it also has a new type of processing called dropout after fully connected layers 1 and 2, which is described on Page 5. The arrangement of convolutional and pooling layers reduces the number of input features from 154587 (227 × 227 × 3) to 1024 before sending them to the fully connected layers. Like those of LeNet, the convolutional layers of AlexNet take inputs from a subset of the feature maps generated by the previous layer. The detailed mapping is described in the work of Krizhevsky et al. However, the purpose of such a mapping remains the same as described for LeNet on Page 3.
ReLU is a function first introduced by Hahnloser et al. Nair et al. later stated that ReLU is an effective activation function for use in neural networks as well. The ReLU function is given by:
f(x) = max(0,x)
The plot of ReLU along with the plot of its derivative is shown in Fig. 4c, and the corresponding expressions are given in Table II. ReLU is applied after the convolutional layers to induce sparsity in the features and to mitigate the vanishing gradient problem. ReLU is not differentiable at 0, which creates a problem during back propagation, since back propagation requires the derivative to be defined at all points. SoftPlus, shown in Fig. 4d, is a smooth approximation to ReLU proposed by Dugas et al. and is given by:
f(x) = log(1 + e^x)
SoftPlus is differentiable at 0, but it is still not widely used because of the computation involved in evaluating the exponential and logarithm functions. Instead, ReLU is used with its derivative at 0 taken to be 0 or some small value ε.
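The two functions and their derivatives can be sketched in a few lines of plain Python. This is only an illustration: the function names are chosen here, and the derivative of ReLU at 0 is assumed to be 0 by default (the `at_zero` parameter stands in for the small value ε mentioned above).

```python
import math


def relu(x: float) -> float:
    """ReLU: f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x: float, at_zero: float = 0.0) -> float:
    """Derivative of ReLU; the value at x == 0 is a convention (assumed 0 here)."""
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return at_zero


def softplus(x: float) -> float:
    """SoftPlus: f(x) = log(1 + e^x), rearranged to avoid overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_grad(x: float) -> float:
    """Derivative of SoftPlus, which is the logistic sigmoid 1 / (1 + e^-x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)
```

Away from the origin SoftPlus approaches ReLU (at x = 10 the two differ by less than 1e-4), while at x = 0 SoftPlus gives log 2 with a well-defined derivative of 0.5. The price is one exponential and one logarithm per evaluation, against a single comparison for ReLU.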
ReLU has the disadvantage that a network using it can suffer from the dying ReLU problem. The problem arises when a node generates a negative output (that is, the input to ReLU is negative). In such cases ReLU has a derivative of 0 during the backward pass, so the weights attached to the node behind the ReLU are not updated. Because they are not updated, the value of the node changes little in subsequent forward passes and the node does not get a chance to recover from the negative value; the node thus becomes potentially dead. The more nodes of a layer generate negative values, the less the weights behind that layer are trained. In the extreme case, if all the outputs of a layer become negative, the weights behind that layer receive no further training in subsequent iterations.
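The effect can be reproduced on a single node. The sketch below is a toy setup and not taken from the original work: one node y = relu(w*x + b) is trained with squared error and plain gradient descent, and the inputs, target, learning rate and starting weights are all hypothetical values chosen for illustration.

```python
def train_single_relu_node(w: float, b: float, inputs: list[float],
                           target: float, lr: float = 0.1, steps: int = 5):
    """Gradient descent on one node y = relu(w*x + b) with squared error."""
    history = []
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x in inputs:
            z = w * x + b                    # value fed into ReLU
            y = max(0.0, z)                  # ReLU output
            dy_dz = 1.0 if z > 0 else 0.0    # ReLU derivative (taken as 0 for z <= 0)
            dloss_dy = 2.0 * (y - target)    # derivative of (y - target)^2
            grad_w += dloss_dy * dy_dz * x
            grad_b += dloss_dy * dy_dz
        w -= lr * grad_w
        b -= lr * grad_b
        history.append((w, b))
    return history


# Hypothetical starting point: the input to ReLU is negative for every sample.
dead = train_single_relu_node(w=-1.0, b=-0.5, inputs=[0.5, 1.0, 2.0], target=1.0)
# A starting point with positive inputs to ReLU, for contrast.
alive = train_single_relu_node(w=1.0, b=0.5, inputs=[0.5, 1.0, 2.0], target=1.0)
```

In the first run every entry of `dead` equals the starting pair (-1.0, -0.5): the gradient is zero for all samples, so the node never leaves the negative region. In the second run the weights change from the first step onward.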
One fact that can be observed in Fig. 3 is that ReLU is used only with the convolutional layers and the hyperbolic tangent only with the fully connected layers. The two main reasons for using ReLU in the convolutional layers are:
- Faster convergence due to the absence of the vanishing gradient problem: The gradient is made up of a product of the derivatives of the many activation functions that lie on the path from the node whose weights are to be updated to the output layer. With the hyperbolic tangent, these derivatives become very small when the outputs of the nodes fall in the saturating zone of the function. As a result, the product of such small derivatives, which forms part of the gradient, becomes extremely small and the gradient vanishes; this phenomenon is called the vanishing gradient problem. Convolutional layers suffer more from it because they are many layers away from the output layer. The same does not happen with the ReLU activation function (used in the convolutional layers), as it has no saturating zone for positive inputs.
- Inducing sparsity in features: Convolutional layers extract features, unlike fully connected layers. Feature extraction requires sparsity in the feature maps, so as many features as possible should be set to 0. The features preserved with non-zero values are those that are discriminative in nature. Fig. 5a, 5b and 5c show the effect of convolutional layers with ReLU on the image of a rectangle. It can be observed that all the features required to identify the rectangle are preserved and all others are set to 0 (shown in black). Sparsity does not arise with the other activation functions, as they generate small values instead of exact zeros. Sparsity in the features helps speed up computation by removing the undesired features.
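Both reasons can be illustrated numerically. The sketch below is a simplification under stated assumptions: it multiplies only the activation derivatives along one path and ignores the weight factors that also enter the real gradient, and the depth of 8 layers, the value 2.5 fed into each activation, and the sample feature-map responses are hypothetical.

```python
import math


def tanh_grad(z: float) -> float:
    """Derivative of tanh: 1 - tanh(z)^2, close to 0 in the saturating zone."""
    return 1.0 - math.tanh(z) ** 2


def relu_grad(z: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def chained_derivative(grad_fn, pre_activations: list[float]) -> float:
    """Product of activation derivatives along a path to the output layer."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product


def sparsity(values: list[float]) -> float:
    """Fraction of values that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)


# Hypothetical inputs to 8 stacked activations, all in tanh's saturating zone.
path = [2.5] * 8
tanh_product = chained_derivative(tanh_grad, path)   # shrinks geometrically
relu_product = chained_derivative(relu_grad, path)   # stays at 1.0

# Hypothetical responses of one feature map before the non-linearity.
responses = [-1.2, 0.4, -0.3, 2.0, -0.7, -0.1]
relu_sparsity = sparsity([max(0.0, r) for r in responses])
tanh_sparsity = sparsity([math.tanh(r) for r in responses])
```

With these values the tanh product falls below 1e-10 after only 8 layers, while the ReLU product remains exactly 1.0. Four of the six responses become exact zeros under ReLU, whereas the hyperbolic tangent maps them to small non-zero values and produces no sparsity.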
The focus of the fully connected (FC) layers is to generate new features rather than to extract features, which is why the hyperbolic tangent is used in them. Moreover, as the FC layers are close to the output layer, they are less affected by the vanishing gradient problem. For these reasons, ReLU is not appropriate for them. LeCun et al. have shown that the hyperbolic tangent achieves faster convergence than the logistic sigmoid. They later suggested that the hyperbolic tangent converges faster because it is zero-centered (shown in Fig. 4b), unlike the logistic sigmoid, which is centered at 0.5 (shown in Fig. 4a). This means that, on average, the values generated by the hyperbolic tangent are close to 0 and those generated by the logistic sigmoid are close to 0.5. The hyperbolic tangent has a strong (high) derivative when its output is close to zero, whereas the logistic sigmoid has a weaker (low) derivative when its output is close to its mean of 0.5, and this is why the hyperbolic tangent is the preferred activation function in fully connected layers.
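The difference in derivative strength at the center of each function can be checked directly. The following is a minimal sketch in plain Python; the function names are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    """Derivative of the logistic sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_grad(z: float) -> float:
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2


# Both functions reach their center value at z = 0.
center_outputs = (math.tanh(0.0), sigmoid(0.0))       # (0.0, 0.5)
center_slopes = (tanh_grad(0.0), sigmoid_grad(0.0))   # (1.0, 0.25)
```

At its center the hyperbolic tangent has a derivative of 1, four times the 0.25 of the logistic sigmoid, which is consistent with the stronger gradient signal described above.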