Why bother?
- To provide non-linearity to the model, allowing it to learn more complex patterns in the data.
- Hence, an activation function should ideally be monotonic, differentiable (at least almost everywhere), and cheap to compute, so that gradient-based training of the weights converges quickly.
Types
- Sigmoid: input: any real number (-inf, +inf); output: (0, 1).
- Therefore, its output can be interpreted as a class probability.
- Its derivative can be expressed in terms of the function itself: σ'(z) = σ(z)(1 − σ(z)).
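A minimal NumPy sketch of the sigmoid and its self-referential derivative (function names here are illustrative, not from the notes):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative expressed via the function itself: s'(z) = s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)
```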
- Tanh (Hyperbolic Tangent): output: (-1, 1).
- Its derivative can be expressed in terms of the function itself: tanh'(z) = 1 − tanh²(z).
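The same idea for Tanh, as a small sketch (NumPy already ships tanh itself):

```python
import numpy as np

def tanh_grad(z):
    """Derivative expressed via the function itself: tanh'(z) = 1 - tanh(z)**2."""
    t = np.tanh(z)
    return 1.0 - t ** 2
```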
- ReLU (Rectified Linear Unit):
- If the input is negative, output 0; otherwise pass the input through unchanged.
- Its formula, max(0, z), is deceptively simple.
- Provides non-linearity like Sigmoid, but is much cheaper to compute and does not saturate for positive inputs, so networks typically train faster.
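A minimal sketch of ReLU and its gradient, assuming the usual convention of a zero gradient at the kink:

```python
import numpy as np

def relu(z):
    """ReLU: max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Gradient: 1 for positive inputs, 0 otherwise (the kink at z = 0 is assigned 0)."""
    return (z > 0).astype(float)
```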
- Leaky ReLU:
- Its purpose is to fix the “dying ReLU” problem, where a neuron that only ever sees negative inputs gets zero gradient and stops learning.
- When z < 0, instead of outputting 0, it outputs alpha·z, which gives a small, non-zero constant gradient alpha (typically alpha = 0.01). Whether it consistently outperforms plain ReLU across tasks is still unclear.
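A sketch of Leaky ReLU with the common default alpha = 0.01:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: alpha * z for z < 0, z otherwise, so the gradient never becomes exactly 0."""
    return np.where(z < 0, alpha * z, z)
```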
- Parametric ReLU:
- Lets each neuron learn the slope that works best in the negative region: alpha becomes a trainable parameter rather than a fixed constant.
- PReLU reduces to ReLU when alpha = 0 and to Leaky ReLU when alpha is held fixed at a small constant.
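A sketch of the PReLU forward pass plus the gradient with respect to alpha, which is the signal a framework would use to train the slope (in practice alpha is just another learnable parameter handled by autograd):

```python
import numpy as np

def prelu(z, alpha):
    """PReLU forward pass: identical to Leaky ReLU, except alpha is learned."""
    return np.where(z < 0, alpha * z, z)

def prelu_grad_alpha(z):
    """Gradient of the output w.r.t. alpha: z in the negative region, 0 elsewhere.
    This is what gradient descent uses to choose the negative-region slope."""
    return np.where(z < 0, z, 0.0)
```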
- Maxout:
- A generalization of ReLU and Leaky ReLU: a piecewise-linear function that returns the maximum of several affine transformations of the input, designed to be used in conjunction with the dropout regularization technique. Both ReLU and Leaky ReLU are special cases of Maxout.
- The Maxout neuron therefore enjoys all the benefits of a ReLU unit without drawbacks such as dying ReLU. However, with two pieces it doubles the number of parameters per neuron, so a larger total number of parameters must be trained.
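A sketch of a single Maxout unit, showing how ReLU falls out as a special case (shapes and values are illustrative):

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: the maximum over k affine pieces of the input.
    x has shape (d,), W has shape (k, d), b has shape (k,)."""
    return np.max(W @ x + b)

x = np.array([1.0, -2.0])
W = np.array([[0.5, 1.0],       # first piece: w1 @ x + b1
              [0.0, 0.0]])      # second piece fixed at 0 -> ReLU special case
b = np.zeros(2)
print(maxout(x, W, b))          # max(0.5*1 + 1*(-2), 0) = max(-1.5, 0) = 0.0
```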
- ELU (Exponential Linear Unit):
- Often converges faster and can produce slightly more accurate results than ReLU.
- Unlike the other activation functions above, ELU has an extra constant alpha, which is a positive number.
- Similar to ReLU, except for negative inputs:
- ELU curves smoothly and saturates toward -alpha as the input becomes more negative, whereas ReLU cuts off sharply at 0.
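A sketch of ELU showing the smooth saturation toward -alpha for negative inputs:

```python
import numpy as np

def elu(z, alpha=1.0):
    """ELU: z for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    neg = alpha * np.expm1(np.minimum(z, 0.0))   # np.minimum avoids overflow for large positive z
    return np.where(z >= 0, z, neg)

print(elu(np.array([-10.0, -1.0, 0.0, 2.0])))    # approaches -1.0 (= -alpha) for very negative z
```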
General Tips on what to choose
- Depends on the problem type, and the range of expected output.
- If you want to predict values > 1 🡪 use ReLU (Sigmoid and Tanh cannot output values above 1).
- If you want to predict values in (0,1) or (-1,1) 🡪 don’t use ReLU; use Sigmoid or Tanh respectively.
- For classification that predicts a probability distribution over mutually exclusive class labels 🡪 use Softmax in the last layer. For binary classification 🡪 use Sigmoid in the last layer (see the sketch after this list).
- In hidden layers, avoid Sigmoid and Tanh: they saturate, which leads to vanishing gradients.
- In hidden layers, as a rule of thumb, use ReLU; Leaky ReLU is often an even better default.
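A minimal sketch of the output-layer rule of thumb above (the logit values are illustrative):

```python
import numpy as np

def softmax(logits):
    """Softmax: a probability distribution over mutually exclusive classes."""
    z = logits - np.max(logits)              # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])          # multi-class case: three exclusive classes
print(softmax(logits))                       # outputs sum to 1

binary_logit = 0.7                           # binary case: a single logit
print(1.0 / (1.0 + np.exp(-binary_logit)))   # sigmoid gives P(class = 1)
```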
New discoveries
- Sigmoid Self-Attention is Better than Softmax Self-Attention: A Mixture-of-Experts Perspective.
- Theoretically explains the advantages of using sigmoid over softmax in the attention mechanism of the Transformer.
- Inspired by Apple’s paper on the practical advantages of using sigmoid for the self-attention mechanism.
- Summary: using sigmoid makes training and inference faster and more stable than softmax.
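A minimal NumPy sketch contrasting the two score normalizations in self-attention. The -log(n) bias on the sigmoid scores follows my reading of Apple's sigmoid-attention recipe and should be treated as an assumption, not the papers' exact implementation:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: each row of the score matrix is normalized with softmax."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def sigmoid_attention(Q, K, V):
    """Sigmoid variant: each score is squashed independently, so there is no
    row-wise normalization coupling the tokens. The -log(n) bias (assumed here)
    keeps the total attention weight per row roughly comparable to softmax."""
    n, d = Q.shape[0], Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) - np.log(n)
    weights = 1.0 / (1.0 + np.exp(-scores))
    return weights @ V
```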