What is the formula for tf.idf? Why do we use ‘log’ in the idf formula?

TF-IDF (tf.idf) stands for term frequency-inverse document frequency. Its formula is given by:

    • tf.idf = tf * idf
    • tf captures the number of times the term occurs in a document. Using the raw count alone would favour long documents, since they contain more occurrences of almost every term. Hence we normalise by dividing by the length of the document, i.e. the total number of terms in it.

          \[tf\,of\,T = \frac{number\,of\,occurrences\,of\,T\,in\,the\,document}{total\,number\,of\,terms\,in\,the\,document}\]

    • idf measures how informative the term is. Stop-words like “the”, “is”, etc. occur in almost every document, so they do not help differentiate documents or give context, and idf assigns them a low weight.

          \[idf\,of\,T = \log\left(\frac{total\,number\,of\,documents\,in\,D}{number\,of\,documents\,in\,D\,with\,term\,T}\right)\]

          where D is the collection of documents (the corpus).

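    • log dampens the effect of large idf values. If T is a rare word, such as a spelling mistake, the ratio inside log(…) is very large, and taking the log compresses it. For example, a term that appears in 1 of 1,000,000 documents has a ratio of 1,000,000, but its base-10 log is only 6.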
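
The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the toy corpus, the use of the natural log, and the choice to return 0 for a term that appears in no document are all assumptions made here. Documents are assumed to be non-empty lists of tokens.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Occurrences of `term` divided by the total number of terms in the document."""
    # Assumes doc_tokens is non-empty; an empty document would divide by zero.
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """log(total documents / documents containing `term`), using the natural log."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        # Assumption of this sketch: a term seen in no document gets weight 0
        # instead of causing a division by zero.
        return 0.0
    return math.log(len(corpus) / containing)


def tf_idf(term, doc_tokens, corpus):
    """tf.idf = tf * idf"""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

for term in ("the", "cat", "mat"):
    print(term, tf_idf(term, corpus[0], corpus))
```

In this toy corpus, “the” occurs in every document, so its idf is log(1) = 0 and its tf.idf is 0 even though it is the most frequent term in the first document. “mat” occurs in only one document and therefore gets the highest score of the three.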