# Mixture of Experts

Mixture of Experts (MoE) models need a lot of data, since each expert is effectively trained only on the subset of cases routed to it.

Q: Why not just average?

A: MoE chooses a specialized model (expert) for each particular case (region of the data), which is different from the boosting algorithm, where every model contributes to every case.

We need to design an error function to encourage specialization.

$$E=\sum_i p_i\,(t-y_i)^2$$

Here $p_i$ is the probability of the “manager” (the gating network) picking expert $i$ for this case, $y_i$ is that expert's output, and $t$ is the target.
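
Differentiating this error with respect to the output of expert $i$ shows why it encourages specialization: each expert's gradient is scaled by the probability the manager assigned to it, so an expert that is rarely picked for a case receives almost no learning signal from that case.

$$\frac{\partial E}{\partial y_i}=-2\,p_i\,(t-y_i)$$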

## Reference

1. Neural Networks for Machine Learning by University of Toronto: https://www.youtube.com/watch?v=d_GVvIBlWtI

Categories: Uncategorized

# DeepSpeed runtime study

The runtime consists of several parts; the entry point is in https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py

Elasticity? Not supported together with model parallelism.

DeepSpeed uses gradient accumulation (splitting each batch into micro-batches) to extract pipeline parallelism: different micro-batches can occupy different pipeline stages at the same time.

Categories: Uncategorized

# Modifying shared data

Invariant: a statement about a data structure that should always hold true; the definition can be found here.

Threads modifying data may temporarily break invariants (see the example of deleting a node from a doubly linked list, where the neighbouring pointers are updated one at a time).

Problematic race conditions typically occur where completing an operation requires modifying two or more distinct pieces of data.

Data race (will be introduced later).

Solutions:

• Ensure that only the thread performing a modification can see the intermediate states where the invariants are broken (use a mutex).
• Change the data structure and its invariants so that each modification is an indivisible change (lock-free programming).
• Handle the update as a transaction, as a database does.

## Protecting shared data with mutexes

Make access to the data structure mutually exclusive by using a mutex.

Mutexes have their own problems: deadlock, and protecting too much or too little data.

Besides, stray pointers and references can defeat the protection. Programmers should follow this rule: don't pass pointers or references to protected data outside the scope of the lock, whether by returning them from a function, storing them in externally visible memory, or passing them as arguments to user-supplied functions.
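
A minimal sketch of the pattern (the class and member names here are made up for illustration); the commented-out accessor shows the kind of interface that would leak a reference past the lock:

```cpp
#include <mutex>
#include <string>

class protected_name {
    std::string name;
    std::mutex m;
public:
    void set(const std::string& new_name) {
        std::lock_guard<std::mutex> lock(m);  // all access goes through the lock
        name = new_name;
    }
    std::string get() {
        std::lock_guard<std::mutex> lock(m);
        return name;                          // returns a copy taken under the lock
    }
    // std::string& unsafe_get() { return name; }  // BAD: hands out a reference
    //                                              // that bypasses the mutex
};
```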

Example: a stack shared by multiple threads; there is an inherent race between empty(), top(), and pop() at the interface level.
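
For instance, even if every member function of the shared stack were internally protected by its own lock, the following check-then-act sequence (sketched with a plain std::stack<int>) would still race at the interface level:

```cpp
#include <stack>

void risky_pop(std::stack<int>& s) {   // s is shared with other threads
    if (!s.empty()) {                  // check ...
        int value = s.top();           // ... but another thread may pop the last
        s.pop();                       //     element in between, so top()/pop()
        (void)value;                   //     end up running on an empty stack
    }
}
```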

TODO: some options to avoid race conditions

Thread-safe stack: see listing 3.5 (a minimal sketch follows the list below), but watch out:

1. delete problematic operators/functions (e.g. the assignment operator)
2. make the mutex a mutable member so that const member functions can still lock it
3. take a lock_guard in every member function that touches the data
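
A minimal sketch along those lines (modelled on listing 3.5; the error type and details are simplified):

```cpp
#include <exception>
#include <memory>
#include <mutex>
#include <stack>

struct empty_stack : std::exception {
    const char* what() const noexcept override { return "empty stack"; }
};

template <typename T>
class threadsafe_stack {
    std::stack<T> data;
    mutable std::mutex m;   // mutable so that const members (empty) can lock it
public:
    threadsafe_stack() = default;
    threadsafe_stack(const threadsafe_stack& other) {
        std::lock_guard<std::mutex> lock(other.m);   // copy under the source's lock
        data = other.data;
    }
    threadsafe_stack& operator=(const threadsafe_stack&) = delete;  // deleted operator

    void push(T new_value) {
        std::lock_guard<std::mutex> lock(m);
        data.push(std::move(new_value));
    }
    std::shared_ptr<T> pop() {   // top() and pop() combined, so there is no gap to race in
        std::lock_guard<std::mutex> lock(m);
        if (data.empty()) throw empty_stack();
        auto res = std::make_shared<T>(std::move(data.top()));
        data.pop();
        return res;
    }
    bool empty() const {
        std::lock_guard<std::mutex> lock(m);
        return data.empty();
    }
};
```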

std::lock—a function that can lock two or more mutexes at once without risk of deadlock
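
For example (account and transfer are made-up names for illustration), std::lock acquires both mutexes without risking deadlock, and the std::adopt_lock guards simply take over ownership so the mutexes are released on scope exit:

```cpp
#include <mutex>

struct account {             // hypothetical type, just to have two lockable objects
    std::mutex m;
    double balance = 0.0;
};

void transfer(account& from, account& to, double amount) {
    if (&from == &to) return;                 // locking the same mutex twice is an error
    std::lock(from.m, to.m);                  // locks both mutexes, or neither
    std::lock_guard<std::mutex> lock_from(from.m, std::adopt_lock);
    std::lock_guard<std::mutex> lock_to(to.m, std::adopt_lock);
    from.balance -= amount;                   // both accounts are now safely locked
    to.balance += amount;
}
```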

Hierarchical lock: give each mutex a level in a hierarchy and only allow a thread to lock mutexes in decreasing order of level, so that lock-order inversions (and hence this kind of deadlock) become impossible.
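
A sketch of one way such a hierarchical mutex can be built (a hypothetical implementation, details simplified):

```cpp
#include <climits>
#include <mutex>
#include <stdexcept>

class hierarchical_mutex {
    std::mutex internal_mutex;
    const unsigned long hierarchy_value;
    unsigned long previous_hierarchy_value = 0;
    static thread_local unsigned long this_thread_hierarchy_value;

    void check_for_hierarchy_violation() const {
        if (this_thread_hierarchy_value <= hierarchy_value)
            throw std::logic_error("mutex hierarchy violated");
    }
    void update_hierarchy_value() {
        previous_hierarchy_value = this_thread_hierarchy_value;
        this_thread_hierarchy_value = hierarchy_value;
    }
public:
    explicit hierarchical_mutex(unsigned long value) : hierarchy_value(value) {}
    void lock() {
        check_for_hierarchy_violation();   // refuse to lock "upwards"
        internal_mutex.lock();
        update_hierarchy_value();
    }
    void unlock() {
        this_thread_hierarchy_value = previous_hierarchy_value;  // restore the level
        internal_mutex.unlock();
    }
};

// Each thread starts at the highest possible level, so its first lock always succeeds.
thread_local unsigned long hierarchical_mutex::this_thread_hierarchy_value = ULONG_MAX;
```

Because it provides lock() and unlock(), it can be used with std::lock_guard<hierarchical_mutex> like an ordinary mutex.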

std::unique_lock: more flexible than std::lock_guard (e.g. std::try_to_lock, std::defer_lock); it still unlocks automatically on destruction, but only if it actually owns the mutex at that point.

std::unique_lock stores a flag indicating whether it currently owns the mutex, which makes it slightly larger and slower than std::lock_guard.
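
A small sketch of that flexibility (the mutexes and counters here are just placeholders): the locks are constructed deferred, locked together with std::lock, and one of them is released early:

```cpp
#include <mutex>

std::mutex m1, m2;
int shared_a = 0, shared_b = 0;

void update_both() {
    std::unique_lock<std::mutex> lk1(m1, std::defer_lock);  // not locked yet
    std::unique_lock<std::mutex> lk2(m2, std::defer_lock);
    std::lock(lk1, lk2);          // lock both at once, without deadlock
    ++shared_a;
    ++shared_b;
    lk2.unlock();                 // unlike lock_guard, we may unlock early
}                                 // lk1 unlocks here because it still owns m1
```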

Use std::call_once with a std::once_flag to make sure initialization is done exactly once, even when several threads race to perform it.
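
A minimal sketch (the connection type and getter are hypothetical):

```cpp
#include <memory>
#include <mutex>

struct connection { /* expensive-to-create resource */ };

std::shared_ptr<connection> conn;
std::once_flag conn_flag;

std::shared_ptr<connection> get_connection() {
    std::call_once(conn_flag, [] {
        conn = std::make_shared<connection>();  // runs exactly once, no matter
    });                                         // how many threads call this
    return conn;
}
```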

std::shared_timed_mutex (C++14) and std::shared_mutex (C++17): reader-writer mutexes that allow many concurrent readers but only one exclusive writer.
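
A common use is data that is read often but updated rarely, sketched here with a hypothetical DNS-style cache:

```cpp
#include <map>
#include <shared_mutex>
#include <string>

class dns_cache {
    std::map<std::string, std::string> entries;
    mutable std::shared_mutex m;                      // C++17; shared_timed_mutex for C++14
public:
    std::string find(const std::string& domain) const {
        std::shared_lock<std::shared_mutex> lk(m);    // shared: many readers at once
        auto it = entries.find(domain);
        return it == entries.end() ? std::string() : it->second;
    }
    void update(const std::string& domain, const std::string& address) {
        std::unique_lock<std::shared_mutex> lk(m);    // exclusive: writer locks out everyone
        entries[domain] = address;
    }
};
```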

Categories: Uncategorized

# Some concepts in Machine Learning

Other terms and their explanations: https://developers.google.com/machine-learning/glossary/

Categories: Uncategorized