A mixture of experts (MoE) needs a lot of data, because each expert is trained on only part of the cases.
Q: Why not just average?
A: Instead of averaging all models, an MoE chooses a specialized model for each case, with each expert focusing on a particular subset of the data. This is different from boosting algorithms.
We need to design an error function to encourage specialization.
\(p_i\) here is the probability of the “manager” picking expert \(i\) for this case.
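The notes do not write the error function out. A minimal sketch, assuming the form \(E = \sum_i p_i (t - y_i)^2\), where \(t\) is the target and \(y_i\) is the output of expert \(i\), and assuming the manager produces \(p_i\) with a softmax over its logits. The function names below are hypothetical and only for illustration. The averaging error is included for contrast with the question above.

```python
import math

def softmax(logits):
    # Turn the manager's raw scores into probabilities p_i that sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def specialization_error(gate_logits, expert_outputs, target):
    # Assumed form: E = sum_i p_i * (t - y_i)^2.
    # Each expert is compared to the target on its own.
    p = softmax(gate_logits)
    return sum(p_i * (target - y_i) ** 2 for p_i, y_i in zip(p, expert_outputs))

def averaging_error(expert_outputs, target):
    # Plain averaging: only the mean prediction is compared to the target,
    # so experts can compensate for each other's errors.
    mean = sum(expert_outputs) / len(expert_outputs)
    return (target - mean) ** 2

def expert_gradients(gate_logits, expert_outputs, target):
    # dE/dy_i = -2 * p_i * (t - y_i): an expert the manager rarely picks
    # for this case receives almost no gradient from it.
    p = softmax(gate_logits)
    return [-2.0 * p_i * (target - y_i) for p_i, y_i in zip(p, expert_outputs)]

if __name__ == "__main__":
    outputs, target = [1.0, 3.0], 2.0
    print(averaging_error(outputs, target))
    print(specialization_error([0.0, 0.0], outputs, target))
    print(expert_gradients([0.0, 0.0], outputs, target))
```

With two experts predicting 1.0 and 3.0 for a target of 2.0, the averaged prediction is exact, so averaging gives no pressure to improve either expert. The assumed specialization error still penalizes each expert individually, weighted by how likely the manager is to pick it.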
- Neural Networks for Machine Learning by University of Toronto: https://www.youtube.com/watch?v=d_GVvIBlWtI