The most frequent words are usually stop words such as “in”, “so”, “are”, “this”, “the”, “that”, “a”, and “is”.
Rare words may be the result of spelling mistakes, or they may simply be used sparsely in the data set.
Usually, neither the most frequent nor the rarest words are useful for providing contextual information. Stop words occur in almost every sentence, so they do not help to distinguish the content of one sentence from another. Words that occur rarely could be very informative, but they are often so sparse that it is hard to draw insights from them.
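The observation above can be checked with a simple frequency count. The following is a minimal sketch using only the Python standard library; the three-sentence corpus and the names `docs`, `counts`, `most_frequent`, and `rare` are made up for illustration, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog is in the garden",
    "this is the cat that the dog chased",
]

# Count how often each word occurs across the whole corpus.
# Whitespace splitting is an assumption; real text needs proper tokenization.
counts = Counter(word for doc in docs for word in doc.split())

# The single most frequent word is a stop word.
most_frequent = counts.most_common(1)

# Words that occur exactly once are candidates for "rare" words.
rare = sorted(word for word, n in counts.items() if n == 1)

print(most_frequent)  # [('the', 6)]
print(rare)
```

In this toy corpus the stop word “the” dominates the counts, while the words that appear only once are a mix of stop words and content words, which is why a frequency threshold alone is a blunt tool.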
Both the most frequent and the rarest words can be handled by using tf-idf instead of raw frequency counts. tf-idf is used to construct the feature vector in text processing. The tf-idf value of a word increases proportionally to the number of times the word appears in a document, but it is offset by the number of documents in the corpus that contain the word. As a result, words that appear in almost every document, such as stop words, receive a low weight.
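To make the weighting concrete, here is a minimal sketch of one common tf-idf variant: raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the word. This particular formula is an assumption, since the text does not fix one, and libraries often use smoothed or normalized versions. The corpus and the helper `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog is in the garden",
    "this is the cat that the dog chased",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word.
df = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(word, tokens):
    """Unsmoothed tf-idf; assumes `word` occurs somewhere in the corpus."""
    tf = tokens.count(word)            # times the word appears in this document
    idf = math.log(n_docs / df[word])  # 0 when the word is in every document
    return tf * idf

print(tf_idf("the", tokenized[0]))  # 0.0, because "the" is in all documents
print(tf_idf("mat", tokenized[0]))  # log(3), because "mat" is in one document
```

Even though “the” appears twice in the first document, its weight is zero because it occurs in every document, whereas “mat” appears once but gets a positive weight.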