Multi-class neural networks handle classification problems with more than two possible outcomes. So instead of (for example) “spam vs. not-spam”, a multi-class problem may involve three, four, five, or even thousands of possible categories.
The Basics of Softmax
In situations with multiple outcomes, often only one can be the case. E.g.: if we’re trying to identify the make of a car, it cannot be a Ford AND a Toyota; it can only be one. This is called a “single-label” multi-class classification problem. In these situations, we want our probabilities to add up to one. (That is: if there’s an 80% chance it’s a Ford, the remaining possibilities should add up to 20%.) This idea, that each possible label gets a probability and those probabilities add up to 1, is what softmax implements.
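As a concrete sketch (plain NumPy, not tied to any particular framework, with illustrative scores): softmax exponentiates each raw score (logit) and divides by the sum of the exponentials, so the outputs form a probability distribution.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the max for numerical stability; it doesn't change the result.
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])  # e.g. logits for Ford, Toyota, Honda (made-up values)
probs = softmax(scores)
print(probs)        # ~[0.66, 0.24, 0.10]
print(probs.sum())  # 1.0
```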
Softmax is implemented as a neural-network layer just before the output layer. The softmax layer must have the same number of nodes as the output layer.
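A minimal sketch of what this might look like in Keras (the 3 classes, 10 input features, and layer sizes here are all illustrative assumptions, not from the text):

```python
import tensorflow as tf

num_classes = 3  # e.g. Ford, Toyota, Honda (illustrative)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),                        # 10 illustrative input features
    tf.keras.layers.Dense(16, activation="relu"),               # hidden layer
    tf.keras.layers.Dense(num_classes, activation="softmax"),   # softmax output: one node per class
])

# Categorical cross-entropy pairs naturally with a softmax output.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```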
Full Softmax vs Candidate Sampling
If we have a small number of possible labels, we can calculate the probability of every class. This is called Full Softmax.
But if we have many labels, we may want to consider candidate sampling. Candidate sampling calculates probabilities for all the positive labels, but only for a random sample of the negative labels. E.g.: if we’re trying to identify the make of a car, we don’t need to compute probabilities for every class that isn’t a car at all. We use candidate sampling when there are so many labels that computing a probability for every one of them is too expensive.
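To illustrate the idea, here is a toy NumPy sketch (not how production libraries implement it; the class counts and scoring function are made up): instead of normalizing over all classes, we score only the true class plus a small random sample of negative classes.

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes = 100_000   # full softmax over this many classes would be expensive
num_sampled = 20        # number of negative classes to sample per example

def sampled_softmax_loss(logit_fn, true_class):
    """Approximate cross-entropy using the true class plus sampled negatives."""
    negatives = rng.choice(num_classes, size=num_sampled, replace=False)
    negatives = negatives[negatives != true_class]          # don't double-count the true class
    candidates = np.concatenate(([true_class], negatives))
    logits = logit_fn(candidates)                            # scores for the candidates only
    logits = logits - logits.max()                           # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                                 # true class sits at index 0

# Toy scoring function: pretend class 42 is the right answer and scores highest.
loss = sampled_softmax_loss(lambda classes: np.where(classes == 42, 5.0, 0.0), true_class=42)
print(loss)
```

During training with TensorFlow, tf.nn.sampled_softmax_loss provides a built-in version of this approximation.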
Note: softmax assumes that every example belongs to exactly one class. If an example can have multiple labels at once, you’ll need to rely on multiple logistic regressions (one per class) instead.
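A sketch of the multi-label alternative (again in Keras, with illustrative label counts and layer sizes): each output node gets its own sigmoid, so each label is an independent yes/no probability and the outputs no longer need to sum to 1.

```python
import tensorflow as tf

num_labels = 5  # e.g. tags that can apply to the same example simultaneously (illustrative)

multi_label_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    # One independent sigmoid per label instead of a single softmax.
    tf.keras.layers.Dense(num_labels, activation="sigmoid"),
])

# Binary cross-entropy treats each label as its own logistic regression.
multi_label_model.compile(optimizer="adam", loss="binary_crossentropy")
```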