$\color{black}\rule{365px}{3px}$
When we have two classes, with the label $y$ being either $0$ or $1$ and $p$ the predicted probability of class $1$, the binary cross-entropy loss for a single sample can be written as:
$$ \text{Loss} = -\bigl(y \cdot \log(p) + (1-y) \cdot \log(1-p)\bigr) $$
Summed over all $N$ samples:
$$ \text{Loss} = -\sum_{i=1}^{N} \bigl(y_i \cdot \log(p_i) + (1-y_i) \cdot \log(1-p_i)\bigr) $$
where $y_i$ is the true label of sample $i$, $p_i$ is the predicted probability that sample $i$ belongs to class $1$, and $N$ is the number of samples.
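The summed loss above is easy to check numerically; here is a minimal NumPy sketch (the function name and the `eps` clipping constant are assumptions of this sketch, not part of the derivation):

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Summed binary cross-entropy, matching the formula above."""
    y = np.asarray(y_true, dtype=float)
    # Clip predictions away from exactly 0 or 1 so log() stays finite.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Two samples: a positive predicted at 0.6 and a negative predicted at 0.4.
print(binary_cross_entropy([1, 0], [0.6, 0.4]))  # -(log 0.6 + log 0.6) ≈ 1.022
```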
<aside> <img src="/icons/bookmark_blue.svg" alt="/icons/bookmark_blue.svg" width="40px" />
Why a Separate Formula from the Multi-Class Case?
Because in binary classification the model outputs only one probability, the one assigned to the positive class (i.e. not both $P(pos) = 0.6$ and $P(neg) = 0.4$, just $P(pos) = 0.6$).
Thus, the likelihood of the negative class has to be derived as $1-p$, and the loss adds up both terms so that every sample contributes its log-likelihood whether its true class is positive or negative.
Why $1-p$?
Because we need the likelihood from the standpoint of the negative class: $p$ is the probability of the “positive” class, so the probability of being “negative” is $1-p$.
(e.g. a probability of $0.6$ of being positive is the same as a probability of $0.4$ of being negative)
</aside>
$\color{black}\rule{365px}{3px}$
For multi-class classification with more than two classes, the formula generalizes to: