$\color{black}\rule{365px}{3px}$
Term Frequency - Inverse Document Frequency
TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus).
<aside> <img src="/icons/light-bulb_red.svg" alt="/icons/light-bulb_red.svg" width="40px" /> The main idea behind TF-IDF is to quantify the importance of a word based on how frequently it appears in a document (term frequency) and how unique or rare it is across all documents (inverse document frequency).
</aside>
Components of TF-IDF
<aside> <img src="/icons/triangle-one-third_red.svg" alt="/icons/triangle-one-third_red.svg" width="40px" /> Term Frequency (TF):
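$$ TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$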
</aside>
<aside> <img src="/icons/triangle-two-thirds_red.svg" alt="/icons/triangle-two-thirds_red.svg" width="40px" /> Inverse Document Frequency (IDF):
$$ IDF(t, D) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right) $$
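Here, $D$ is the corpus (the full collection of documents), so the numerator is its size, $|D|$. A term that appears in every document gets an IDF of $\log(1) = 0$.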
</aside>
<aside> <img src="/icons/triangle-alternate_red.svg" alt="/icons/triangle-alternate_red.svg" width="40px" /> TF-IDF Calculation:
$$ \text{TF-IDF}(t, d, D) = TF(t, d) \times IDF(t, D) $$
</aside>
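The three components can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes that TF is the raw count divided by the document length, that the logarithm is the natural log (the formulas here do not fix a base), and that documents are already tokenized into lists of words. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def term_frequency(term, document_tokens):
    """TF(t, d): occurrences of the term divided by the document's token count."""
    if not document_tokens:
        return 0.0
    return Counter(document_tokens)[term] / len(document_tokens)


def inverse_document_frequency(term, corpus_tokens):
    """IDF(t, D): log(number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in corpus_tokens if term in doc)
    if containing == 0:
        # The unsmoothed formula is undefined for terms absent from the corpus.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    # Natural log is an assumption; another base only rescales every score.
    return math.log(len(corpus_tokens) / containing)


def tf_idf(term, document_tokens, corpus_tokens):
    """TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)."""
    return term_frequency(term, document_tokens) * inverse_document_frequency(
        term, corpus_tokens
    )


# Hypothetical toy corpus, already tokenized.
toy_corpus = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
]

apple_score = tf_idf("apple", toy_corpus[0], toy_corpus)
banana_score = tf_idf("banana", toy_corpus[0], toy_corpus)
```

In this toy corpus, "apple" occurs in only one of the two documents, so it gets a positive score in the first document, while "banana" occurs in both and scores zero. Libraries often use smoothed variants of IDF, so their numbers may differ from this unsmoothed sketch.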
Example Calculation
Let’s walk through an example using a small corpus.
Example Corpus:
• Document 1: “the cat sat on the mat”
• Document 2: “the dog sat on the log”
• Document 3: “the cat chased the dog”
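The raw counts that the calculation relies on can be pulled out of this corpus with a short Python sketch. It assumes simple whitespace tokenization, which is enough here because the example text is already lowercase and has no punctuation. The variable names are illustrative.

```python
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Whitespace tokenization is an assumption that suits this clean example text.
tokenized = [document.split() for document in corpus]

# Number of terms in each document (the denominator of TF).
document_lengths = [len(tokens) for tokens in tokenized]

# Document frequency: each term is counted at most once per document
# (this count is the denominator inside the IDF logarithm).
document_frequency = Counter(
    term for tokens in tokenized for term in set(tokens)
)
```

With these inputs, the documents contain 6, 6, and 5 terms respectively. "the" appears in all 3 documents, "cat", "sat", "on", and "dog" each appear in 2, and "mat", "log", and "chased" each appear in 1.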
Step-by-Step Calculation: