What an artificial neuron does is calculate a weighted sum of its inputs, add a bias, and then decide whether it should be “fired” or not. Consider the following function:
\[ Y=\sum(\text {weight} * \text {input})+\text {bias} \]
The value of Y can be anything, ranging from -inf to +inf. The neuron really doesn’t know the bounds of the value. How do we decide whether the neuron should fire or not? We use activation functions for this purpose.
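As a rough sketch of this computation in Python (assuming NumPy; the inputs, weights, and bias below are made-up numbers, purely for illustration):

```python
import numpy as np

# Pre-activation of a single artificial neuron: Y = sum(weight * input) + bias.
# The values here are arbitrary, just to show the computation.
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.8, 0.1, -0.4])
bias = 0.5

Y = np.dot(weights, inputs) + bias  # can land anywhere in (-inf, +inf)
print(Y)
```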
Sigmoid Function We can consider the sigmoid function, which maps values to the range 0 to 1 and mimics the biological neuron. An important aspect of using the sigmoid function is that it makes the neural network nonlinear, so we cannot reduce the NN to a simple linear equation. In modern NNs the sigmoid has some problems and is no longer used as often, except in some specific cases.
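As a minimal sketch (assuming NumPy; the test inputs are arbitrary), the sigmoid squashes any real number into the range (0, 1):

```python
import numpy as np

def sigmoid(y):
    # Squash any real number into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-y))

# Very negative inputs map close to 0, very positive ones close to 1.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[0.000045, 0.5, 0.999955]
```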
Standardization It is very important to standardize, because we don’t want one input ranging in the millions and another in the thousandths; the sigmoid is problematic in this regard because its output lies between 0 and 1 with its center at 0.5. As a consequence, the output of a sigmoid can never be centered around zero. This recalls the concept of Uniformity of the NN. The solution for this is the Hyperbolic Tangent Function.
Hyperbolic Tangent Function It has the same shape as the sigmoid but is centered around zero.
This function is also called tanh, which is short for hyperbolic tangent. There are hyperbolic versions of all the trigonometric functions: sin, cos, tan. So, we are using tanh(x) = sinh(x)/cosh(x), the hyperbolic sine over the hyperbolic cosine, which results in the following equation:
\[ \tanh (a)=\frac{\exp (2 a)-1}{\exp (2 a)+1} \]
which is similar to the sigmoid function. The major difference between the sigmoid and the tanh is that the sigmoid goes between 0 and 1 while the tanh goes between -1 and +1.
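A quick sketch comparing the two (assuming NumPy; the sample points are arbitrary), which also checks that the formula above agrees with the built-in tanh:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-3, 3, 7)

# tanh written exactly as in the formula above: (exp(2a) - 1) / (exp(2a) + 1)
tanh_manual = (np.exp(2 * a) - 1) / (np.exp(2 * a) + 1)
print(np.allclose(tanh_manual, np.tanh(a)))  # True

# The sigmoid output lies in (0, 1); tanh lies in (-1, +1) and is zero-centered.
print(sigmoid(a).min(), sigmoid(a).max())
print(np.tanh(a).min(), np.tanh(a).max())
```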
Although the tanh is better than the sigmoid, there are many other aspects to take into consideration. The major problem is the Vanishing Gradient. We have to find the gradient of the cost with respect to the parameters, but when we have a very deep NN the gradient has to propagate backwards through the whole network. The deeper a NN is, the more terms have to be multiplied in the gradient due to the chain rule of calculus. So, the composition of functions becomes a multiplication of terms in the derivative, as shown below:
\[ \frac{\partial J}{\partial W^{(1)}}=\frac{\partial J}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial z^{(L-1)}} \ldots \frac{\partial z^{(2)}}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial W^{(1)}} \]
The problem with the sigmoid here is that its derivative is very nearly zero almost everywhere; only near the center is it non-zero, as shown in the figure below, and its maximum value is only 0.25. As a consequence, when we multiply numbers that are very small we obtain an even smaller one; the further we go back in the NN, the smaller the gradient becomes. We call this the Vanishing Gradient Problem.
Looking at the graph on the right of the figure above, we can see the magnitude of the gradient at each layer of the NN as it is trained. We notice that the further we go back, the smaller the gradient gets. The end result is that the weights close to the input are almost not trained at all. The solution for training a very deep NN is to use the Rectified Linear Unit Function - ReLU.
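Before moving on to the ReLU, here is a rough numerical sketch of the effect (the depths are arbitrary, and a real gradient also depends on the weights, so 0.25 per layer is only the best case allowed by the sigmoid derivative):

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def sigmoid_grad(y):
    # Derivative of the sigmoid: sigma(y) * (1 - sigma(y)), which peaks at 0.25.
    s = sigmoid(y)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))         # 0.25, the maximum possible value
for depth in (1, 5, 10, 20):
    # Best-case scale of the gradient after backpropagating through `depth` sigmoid layers.
    print(depth, 0.25 ** depth)
```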
Rectified Linear Unit Function - ReLU The ReLU function has a corner at zero, where the derivative is technically not defined.
Values greater than zero never have a zero gradient, and this makes the training of the NN more efficient. The left half of the function has a derivative that is exactly zero, so there the gradient has already vanished. This is the so-called phenomenon of the Dead Neuron.
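A minimal sketch of the ReLU and its gradient (assuming NumPy; the sample inputs are arbitrary), showing the zero gradient on the negative side:

```python
import numpy as np

def relu(y):
    return np.maximum(0.0, y)

def relu_grad(y):
    # Derivative is 1 for positive inputs and 0 for negative ones;
    # at exactly 0 it is undefined, and implementations simply pick 0 or 1.
    return (y > 0).astype(float)

y = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(y))       # [0.   0.   0.   0.5  2. ]
print(relu_grad(y))  # [0. 0. 0. 1. 1.]  <- zero gradient for negative inputs (the "dead" side)
```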
Fixing Dead Neuron Some researchers have tried to make modifications to solve the dead neuron phenomenon. One alternative is the Leaky ReLU Function - LReLU. The LReLU has a small positive slope for negative inputs (for example 0.1). Using the LReLU, our derivatives will always be positive.
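A minimal sketch of the LReLU (assuming NumPy; the 0.1 slope follows the example above):

```python
import numpy as np

def leaky_relu(y, alpha=0.1):
    # Small positive slope `alpha` on the negative side instead of a flat zero.
    return np.where(y > 0, y, alpha * y)

def leaky_relu_grad(y, alpha=0.1):
    return np.where(y > 0, 1.0, alpha)  # the gradient is never exactly zero

y = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(y))       # [-0.2  -0.05  0.5   2.  ]
print(leaky_relu_grad(y))  # [0.1  0.1  1.   1. ]
```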
There are other optional functions to solve the Dead Neuron, such as the Exponential Linear Unit Function - ELU, which has a more steadily decreasing value on the left side of the function. This activation function speeds up learning and leads to higher accuracy than the LReLU. One interesting aspect of the ELU is that it allows its output to be negative, which goes back to the idea of having the mean of the values close to zero. Another option, very similar to the ELU, is the Softplus Function. As we can see from the formula below, it takes the log of the exponential, and for this reason the function looks very linear where the inputs are very large, as depicted in the figure below.
\[ f(x)=\log \left(1+e^{x}\right) \]
For both the Softplus and the ELU there is a vanishing gradient on the left side, but it is not so much of a problem because the function still works well in practice. Moreover, these functions are not centered around zero; the Softplus in particular lies in the range 0 to infinity.
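A minimal sketch of the ELU and the Softplus (assuming NumPy; the ELU's alpha parameter and the sample inputs are my own choices for illustration):

```python
import numpy as np

def elu(y, alpha=1.0):
    # Smoothly decreasing negative branch; the output can be negative,
    # which pushes the mean of the activations closer to zero.
    return np.where(y > 0, y, alpha * (np.exp(y) - 1.0))

def softplus(y):
    # log(1 + e^y): close to 0 for very negative inputs,
    # close to y (almost linear) for very large inputs.
    return np.log1p(np.exp(y))

y = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(y))
print(softplus(y))
```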
Despite all these alternatives to the ReLU function, most people still use the ReLU as a reasonable default choice. The suggestion is that NN work is experimentation, so we should also try the Softplus and the ELU.
Biological Plausibility of the ReLU Interestingly, some researchers have talked about the biological plausibility of the ReLU. In fact, the ReLU is more plausible than the sigmoid.
When we hear a very quiet sound, our neurons are only activated a little bit and we get some action potentials, but with a very low frequency, as we can see from the spikes in the figure above. On the contrary, when we hear a very loud sound, the neurons get lots of stimulation and the action potentials are very frequent. In neuroscience this process is called Frequency Coding. In terms of the ReLU, its output can be thought of as encoding the action potential frequency. That is why the ReLU has zero as its minimum value: the smallest possible number of neuron spikes is zero. Conversely, when the input into the neuron gets larger, the receiving neuron also increases its firing frequency. From neuroscience we know that the relationship between action potential frequency and stimulus intensity is nonlinear (e.g. a log function or a square root function); sound is a good example.
Multiclass Classification & Softmax Function Summarizing, we replaced the sigmoid with the ReLU in the hidden layers, but for the output the sigmoid is still the right choice for binary classification, such as disease vs. no disease or fraud vs. not fraud. With multiple categorical outputs, however, we cannot use the sigmoid. Examples of multiclass classification are handwriting recognition, speech recognition, and image classification. We need the probability of each class k, and the requirements are: first, the probabilities must be non-negative (>= 0); second, the probabilities over all outputs must sum to 1, as described below, where K is the number of outputs/outcomes.
\[ \begin{aligned} &p(y=k | x) \geq 0\\ &\sum_{k=1}^{K} p(y=k | x)=1 \end{aligned} \]
There is a function that satisfies both these requirements, called the Softmax Function.
\[ \begin{aligned} p(y=k | x) &= \frac{\exp \left(a_{k}\right)}{\sum_{j=1}^{K} \exp \left(a_{j}\right)} \\ \sum_{k=1}^{K} p(y=k | x) &= \sum_{k=1}^{K} \frac{\exp \left(a_{k}\right)}{\sum_{j=1}^{K} \exp \left(a_{j}\right)} = \frac{\sum_{k=1}^{K} \exp \left(a_{k}\right)}{\sum_{j=1}^{K} \exp \left(a_{j}\right)} = 1 \end{aligned} \]
As we can see from the formulation above, the numerator exp(a_k) is always positive, and therefore each probability is non-negative. The denominator is the sum of all possible values of the numerator, so when we sum the probabilities over k we get the same sum on the top and on the bottom, resulting in one.
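A minimal Softmax sketch (assuming NumPy; the activations a are arbitrary), which checks both requirements:

```python
import numpy as np

def softmax(a):
    # Subtracting the max is a common numerical-stability trick;
    # it does not change the result because the common factor cancels out.
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])  # arbitrary activations for K = 3 classes
p = softmax(a)
print(p)        # every entry is non-negative
print(p.sum())  # 1.0
```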