The commonly used activation functions are:
- Linear : g(x) = x. This is the simplest activation function, but it cannot model complex decision boundaries. A composition of linear maps is itself linear, so a deep network with only linear activations collapses to a single linear layer and cannot represent non-linear decision boundaries.
- Sigmoid : g(x) = 1 / (1 + e^(-x)). This is a common activation function in the last layer of a neural network, particularly for binary classification with cross-entropy loss, since its output lies in (0, 1). Its drawback is that the gradient approaches 0 for inputs of large magnitude (positive or negative), which slows learning and contributes to the vanishing gradient problem.
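- Tanh : g(x) = tanh(x) = 2 * sigmoid(2x) - 1, a rescaled and shifted version of sigmoid with zero-centred output in (-1, 1). It is a common choice for intermediate layers. Its gradient is stronger than sigmoid's (at most 1 at x = 0, versus 0.25 for sigmoid), although it still saturates for inputs of large magnitude.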
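- Relu : g(x) = max(0, x)
Relu tends to give a sparse output, since every negative input is mapped to 0. For positive inputs the gradient is exactly 1, so relu does not saturate there and is much less prone to vanishing gradients than sigmoid or tanh. For negative inputs the gradient is 0, so a unit that only ever receives negative inputs stops learning. A further drawback is that relu is unbounded above, so activation values can grow very large.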
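
The saturation behaviour described above can be checked numerically. The following is a minimal sketch in plain Python (standard library only). The function names (`sigmoid`, `d_sigmoid`, and so on) and the sample inputs are illustrative choices, not taken from any particular framework, and the relu derivative at exactly 0 is set to 0 by convention.

```python
import math


def linear(x):
    """Linear (identity) activation: g(x) = x."""
    return x


def sigmoid(x):
    """Sigmoid activation, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x):
    """Tanh activation; equal to 2 * sigmoid(2x) - 1."""
    return math.tanh(x)


def relu(x):
    """Relu activation: g(x) = max(0, x)."""
    return max(0.0, x)


def d_linear(x):
    """Derivative of the linear activation is 1 everywhere."""
    return 1.0


def d_sigmoid(x):
    """Derivative of sigmoid: s * (1 - s), where s = sigmoid(x)."""
    s = sigmoid(x)
    return s * (1.0 - s)


def d_tanh(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


def d_relu(x):
    """Derivative of relu: 1 for x > 0, else 0 (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0


if __name__ == "__main__":
    # Compare gradients in the saturated regions and at the origin.
    for x in (-10.0, 0.0, 10.0):
        print(
            f"x={x:6.1f}  "
            f"d_sigmoid={d_sigmoid(x):.2e}  "
            f"d_tanh={d_tanh(x):.2e}  "
            f"d_relu={d_relu(x):.1f}"
        )
```

At x = 0 the sigmoid gradient is 0.25 and the tanh gradient is 1, while at |x| = 10 both are close to 0. The relu gradient stays at 1 for any positive input and is 0 for negative inputs.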