Perplexity is a metric used to evaluate the quality/performance of a language model. It measures how well a model predicts a sequence of words.
Perplexity tells us, on average, how "perplexed" or uncertain the model is when predicting each token in the sequence. It can be interpreted as the "average branching factor": the average number of equally likely options the model is effectively choosing between at each step. Formally:
$$ PP(W) = {P(w_1,w_2,\ldots,w_N)}^{-\frac{1}{N}} = \left( \prod_{t=1}^{N}{P(w_t \mid w_1, \ldots, w_{t-1})} \right)^{-\frac{1}{N}} $$
$$ \begin{align*}PP(W) = {P(w_1,w_2,...w_N)}^{-\frac{1}{N}} &= \text{P(Getting The Correct Sequence)}^{-\frac{1}{N}} \\ &= \left(\frac{1}{\text{P(Getting The Correct Sequence)}}\right)^{\frac{1}{N}} \end{align*} $$
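To make this concrete, here is a minimal Python sketch of the computation. It assumes a hypothetical list `token_probs` holding the probabilities the model assigned to each correct token, and it sums log-probabilities to avoid numerical underflow on long sequences:

```python
import math

def perplexity(token_probs):
    """Perplexity of one sequence, given the model's conditional
    probabilities P(w_t | w_1, ..., w_{t-1}) for each correct token."""
    n = len(token_probs)
    # log P(w_1, ..., w_N) via the chain rule: sum of per-token log-probs
    log_prob = sum(math.log(p) for p in token_probs)
    # PP(W) = P(W)^(-1/N) = exp(-log P(W) / N)
    return math.exp(-log_prob / n)

# Hypothetical per-token probabilities for a 4-token sequence
print(perplexity([0.2, 0.5, 0.1, 0.4]))  # ≈ 3.98
```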
<aside> 💡
Why take the reciprocal and the geometric mean?
Connection to Equally Likely Options:
Suppose the model assigns a probability $P=0.2$ to the correct token.
The reciprocal, $\frac{1}{P}=5$, can be interpreted as:
"The model's uncertainty is equivalent to having 5 equally likely options for the next token."
This makes sense because if each option were equally likely, the probability of any one token would be $1/5=0.2$.
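Running the `perplexity` sketch from above on this single-token case reproduces the interpretation directly:

```python
# One token with probability 0.2: perplexity = 1/0.2 = 5
print(perplexity([0.2]))  # ≈ 5.0 — as uncertain as 5 equally likely options
```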
Scaling to Sequences:
<aside> 🔖
Geometric Mean (Power of $\frac{1}{N}$)
$N= 2 :$

$$ \text{Geometric Mean} = (2\times18)^{\frac{1}{2}} = 6 $$
$N=3:$

$$ \text{Geometric Mean} = (10\times51.2 \times 8)^{\frac{1}{3}} = 16 $$
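The same arithmetic in code (a small sketch using only Python's standard library; the inputs are the numbers from the two examples above):

```python
import math

def geometric_mean(values):
    # n-th root of the product of n values
    return math.prod(values) ** (1 / len(values))

print(geometric_mean([2, 18]))        # 6.0
print(geometric_mean([10, 51.2, 8]))  # ≈ 16.0
```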
<aside> 💡
So, the geometric mean in perplexity is basically a way to find the average (length-normalized) probability at each time-step of the sequence the model is predicting.
</aside>
</aside>