Activation functions
Why use activation functions?
Activation functions let the neural network fit non-linear (more complex) patterns; without them, a stack of layers collapses into a single linear transformation.
Different activation functions
- sigmoid (logistic function)
$$\frac{1}{1+e^{-x}}$$
The output is constrained between 0 and 1. The unit is only strongly sensitive to its input when x is near 0; for large |x| the output saturates and the gradient approaches zero.
- tanh (hyperbolic tangent)
$$\frac{e^x-e^{-x}}{e^x+e^{-x}}$$
The output is constrained between -1 and 1.
tanh typically performs better than sigmoid because its output is zero-centered, keeping “the average of each input variable over the training set close to zero”, which speeds up backpropagation.
- ReLU
$$\max(0, x)$$
Sigmoid and tanh are hard to train in deep neural networks because their gradients vanish when the units saturate. ReLU works well with gradient descent because it is linear for positive values, so the gradient of active units does not shrink (see the sketch after this list).
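As an illustration of the saturation behaviour, here is a minimal NumPy sketch (the function names are my own, not from any library) comparing the gradients of the three activations:

```python
# Minimal sketch: compare activation gradients to see why sigmoid/tanh
# saturate (vanishing gradient) while ReLU keeps a constant gradient of 1
# for positive inputs.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25 when x = 0, ~0 for large |x|

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2  # peaks at 1.0 when x = 0, ~0 for large |x|

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)  # constant 1 for all positive inputs

xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print("sigmoid grad:", sigmoid_grad(xs))  # tiny at |x| = 5 -> vanishing gradient
print("tanh grad:   ", tanh_grad(xs))
print("relu grad:   ", relu_grad(xs))     # stays 1 for all positive x
```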
Reference
- https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/
- https://stats.stackexchange.com/questions/330559/why-is-tanh-almost-always-better-than-sigmoid-as-an-activation-function
Mixture of Experts
Mixture of Experts models need a lot of data.
Q: Why not just average?
A: MoE trains a “manager” (gating network) to choose a specialized expert for each kind of case (data), so the weighting depends on the input; this is different from boosting, where the models’ weights are fixed across all cases.
We need to design an error function to encourage specialization.
$$E=\sum_i p_i(t-y_i)^2$$
Here \(p_i\) is the probability of the “manager” (gating network) picking expert \(i\) for this case.
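To see why this error encourages specialization, differentiate with respect to one expert’s output:
$$\frac{\partial E}{\partial y_i} = -2\,p_i\,(t-y_i)$$
Experts the manager rarely picks (small \(p_i\)) receive almost no gradient for that case, so each expert only learns from the cases routed to it. Below is a minimal NumPy sketch of this error and its gradient (names like `moe_error` are illustrative, not from any library):

```python
# Minimal sketch of the gated error E = sum_i p_i * (t - y_i)^2 and its gradient.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def moe_error(expert_outputs, gate_logits, target):
    """expert_outputs: (n_experts,) predictions y_i for one case.
    gate_logits: (n_experts,) raw 'manager' scores before softmax."""
    p = softmax(gate_logits)                      # p_i: probability of picking expert i
    err = np.sum(p * (target - expert_outputs) ** 2)
    # dE/dy_i = -2 * p_i * (t - y_i): experts the gate ignores (small p_i)
    # get almost no gradient, so each expert specializes on the cases
    # the manager routes to it.
    grad_y = -2.0 * p * (target - expert_outputs)
    return err, grad_y

err, grad_y = moe_error(np.array([0.9, 0.2, 0.5]),
                        np.array([2.0, -1.0, 0.1]),
                        target=1.0)
print(err, grad_y)
```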
Reference
- Neural Networks for Machine Learning by University of Toronto: https://www.youtube.com/watch?v=d_GVvIBlWtI
DeepSpeed runtime study
The runtime includes these parts:
The entry point for the runtime is https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py
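For context, here is a minimal sketch of how the DeepSpeedEngine defined in engine.py is typically driven from user code; the toy model, random data, and the "ds_config.json" path are placeholders for this illustration:

```python
# Sketch of a training loop around the DeepSpeedEngine (runtime/engine.py).
import torch
import deepspeed

model = torch.nn.Linear(10, 1)  # stand-in model

# deepspeed.initialize wraps the model in a DeepSpeedEngine.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",      # placeholder DeepSpeed config file
)

for step in range(10):
    x = torch.randn(4, 10).to(model_engine.device)
    y = torch.randn(4, 1).to(model_engine.device)
    loss = torch.nn.functional.mse_loss(model_engine(x), y)
    model_engine.backward(loss)   # engine handles loss scaling / ZeRO partitioning
    model_engine.step()           # engine decides when a gradient-accumulation boundary
                                  # is reached and runs the optimizer step
```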
Elasticity? It is not supported together with model parallelism.
DeepSpeed uses gradient accumulation to extract pipeline parallelism: a training batch is split into micro-batches that flow through the pipeline stages concurrently, and their gradients are accumulated before the optimizer step (see the sketch below).
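The idea is sketched below in plain PyTorch (this is not DeepSpeed’s actual pipeline engine): one global batch is split into micro-batches whose gradients are accumulated before a single optimizer step; the pipeline engine then overlaps these micro-batches across stages instead of running them sequentially on one device.

```python
# Plain-PyTorch sketch of gradient accumulation over micro-batches.
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
micro_batches = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(8)]

opt.zero_grad()
for x, y in micro_batches:                      # micro-batches of one global batch
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / len(micro_batches)).backward()      # accumulate scaled gradients
opt.step()                                      # one optimizer step per global batch
```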
DeepSpeed source code study
DeepSpeed architecture (from https://www.youtube.com/watch?v=zqsOEzKZX2Y&t):

There are two major parts: the runtime and the ops.
DeepSpeed runtime study: http://aheader.org/2021/10/02/deepspeed-runtime-study/
DeepSpeed ops study: