$\color{black}\rule{365px}{3px}$

“Attention Is All You Need” - 2017

Link to Paper: arxiv.org


Introduction

$\color{black}\rule{365px}{3px}$

Regularization

$\color{black}\rule{365px}{3px}$

The Transformer model employs three regularization techniques:

  1. Residual Dropout: Dropout is applied to the output of each sub-layer before it is added to the sub-layer input and normalized, with $P_{drop} = 0.1$ for the base model.
  2. Embedding Dropout: Dropout is also applied to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks; both dropout placements are sketched after this list.
  3. Label Smoothing: A smoothing value of $\epsilon_{ls} = 0.1$ is used during training. This reduces the model's confidence, which hurts perplexity but improves accuracy and BLEU score.
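
Below is a minimal PyTorch sketch of the first two dropout placements. The class names (`SublayerConnection`, `TransformerInput`) are illustrative assumptions, and it uses learned positional embeddings for brevity rather than the paper's sinusoidal encodings:

```python
import torch
import torch.nn as nn


class SublayerConnection(nn.Module):
    """Residual dropout: drop the sub-layer output, then add & normalize."""

    def __init__(self, d_model: int, p_drop: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x, sublayer):
        # dropout hits sublayer(x) *before* the residual add and layer norm
        return self.norm(x + self.dropout(sublayer(x)))


class TransformerInput(nn.Module):
    """Embedding dropout: drop the sum of token and position embeddings."""

    def __init__(self, vocab_size: int, d_model: int,
                 max_len: int = 512, p_drop: float = 0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # learned positions for brevity; the paper uses sinusoidal encodings
        self.pos = nn.Embedding(max_len, d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, tokens):  # tokens: (batch, seq_len) of token ids
        positions = torch.arange(tokens.size(1), device=tokens.device)
        return self.dropout(self.embed(tokens) + self.pos(positions))
```

Note that `nn.Dropout` is active only in training mode; after `model.eval()` it is a no-op.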

<aside>

Recall Label Smoothing

How Label Smoothing Modifies the Targets
$$ \text{Modified Target} = (1 - \epsilon_{ls}) \cdot \text{one-hot vector} + \frac{\epsilon_{ls}}{K} $$

where $K$ is the total number of classes.
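
As a concrete check of the formula, here is a small sketch (the helper `smooth_labels` is hypothetical) with $\epsilon_{ls} = 0.1$ and $K = 4$:

```python
import torch
import torch.nn.functional as F


def smooth_labels(targets: torch.Tensor, num_classes: int,
                  eps: float = 0.1) -> torch.Tensor:
    """Blend one-hot targets with a uniform distribution over K classes."""
    one_hot = F.one_hot(targets, num_classes).float()
    return (1.0 - eps) * one_hot + eps / num_classes


# correct class is index 2 out of K = 4:
print(smooth_labels(torch.tensor([2]), num_classes=4))
# tensor([[0.0250, 0.0250, 0.9250, 0.0250]])
```

Recent PyTorch versions also expose this directly via `nn.CrossEntropyLoss(label_smoothing=0.1)`.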