Categories: Uncategorized

# Activation functions

## Why use activation functions?

Activation functions let the neural network fit non-linear (more complex) patterns.

## Different activation functions

• sigmoid (logistic function)

$$\frac{1}{1+e^{-x}}$$

The output is constrained between 0 and 1. Sigmoid units are only strongly sensitive to their input when x is near 0.

• tanh (hyperbolic tangent)

$$\frac{e^x-e^{-x}}{e^x+e^{-x}}$$

The output is constrained between -1 and 1.

tanh typically performs better than sigmoid because its output is zero-centered, so "the average of each input variable over the training set is close to zero", which speeds up backpropagation.

• ReLU

$$\max(0, x)$$

Sigmoid and tanh are hard to train in deep neural networks (vanishing gradients). ReLU is good for gradient descent because it is linear for positive values, so its gradient does not saturate there.
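The three functions above can be sketched in plain Python using only the standard `math` module:

```python
import math

def sigmoid(x):
    # logistic function: squashes input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # hyperbolic tangent: squashes input into (-1, 1), zero-centered
    return math.tanh(x)

def relu(x):
    # rectified linear unit: identity for positive input, 0 otherwise
    return max(0.0, x)

print(sigmoid(0.0))   # 0.5, the midpoint of its range
print(tanh(0.0))      # 0.0, zero-centered
print(relu(-3.0))     # 0.0, negative inputs are clipped
```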

## Reference

1. https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/
2. https://stats.stackexchange.com/questions/330559/why-is-tanh-almost-always-better-than-sigmoid-as-an-activation-function
Categories: Uncategorized

# Mixture of Experts

Mixture of Experts (MoE) models need a lot of data.

Q: Why not just average?

A: MoE routes each case to an expert specialized in that kind of data, which is different from the boosting algorithm.

We need to design an error function that encourages specialization.

$$E=\sum_i p_i(t-y_i)^2$$

Here $p_i$ is the probability of the "manager" (gating network) picking expert $i$ for this case, $y_i$ is expert $i$'s prediction, and $t$ is the target.
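A toy illustration of why this error encourages specialization (the probabilities, predictions, and target below are made-up values): the gradient with respect to expert $i$ is $-2\,p_i(t-y_i)$, so experts the manager rarely picks receive almost no gradient for this case and are free to specialize elsewhere.

```python
def moe_error(p, y, t):
    # E = sum_i p_i * (t - y_i)^2
    return sum(pi * (t - yi) ** 2 for pi, yi in zip(p, y))

def moe_error_grad(p, y, t):
    # dE/dy_i = -2 * p_i * (t - y_i): the gradient reaching each expert
    # is scaled by how strongly the manager picked it for this case
    return [-2.0 * pi * (t - yi) for pi, yi in zip(p, y)]

p = [0.8, 0.1, 0.1]   # manager's picking probabilities (hypothetical)
y = [1.2, 0.3, 2.0]   # expert predictions (hypothetical)
t = 1.0               # target
print(moe_error(p, y, t))
print(moe_error_grad(p, y, t))
```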

## Reference

1. Neural Networks for Machine Learning by University of Toronto: https://www.youtube.com/watch?v=d_GVvIBlWtI

Categories: Uncategorized

# DeepSpeed runtime study

The runtime consists of several parts.

The entry point of the runtime is https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py

Elasticity? It is not supported together with model parallelism.

DeepSpeed uses gradient accumulation to extract pipeline parallelism.
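A minimal sketch of gradient accumulation itself, not DeepSpeed's actual implementation (the function names and the toy linear model are my own): gradients from several micro-batches are summed before a single optimizer step, and splitting a batch into micro-batches like this is what lets a pipeline keep its stages busy.

```python
def grad(w, x, t):
    # gradient of the squared error 0.5 * (w*x - t)^2 w.r.t. w
    return (w * x - t) * x

def accumulated_step(w, micro_batches, lr=0.1):
    # run forward/backward once per micro-batch, summing the gradients,
    # then apply one optimizer step for the whole accumulated batch
    acc = 0.0
    for x, t in micro_batches:
        acc += grad(w, x, t)          # accumulate instead of stepping
    return w - lr * acc / len(micro_batches)

w = accumulated_step(0.0, [(1.0, 1.0), (2.0, 2.0)])
print(w)
```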

Categories: Uncategorized