Can text generation be modelled with regression? Why do we need a language model?

To restate the question: given the sentence “I am about to complete this”, can regression be used to predict the next word?

No, it cannot be modelled as a regression task. There are multiple reasons:

  1. Sequential data such as text has dependencies between consecutive and even non-consecutive words. If each word is treated as one feature of a data sample, the features are strongly correlated and their order matters. It is hard to capture such word dependencies in the fixed-length feature vectors that regression expects.
  2. Predicting the next word after the sequence “I am about to complete this” with regression would require many occurrences of this exact sequence in the training data, whereas a language model can give good results even when only sub-sequences like “… complete this sentence” are present in the training set. So the amount of data required to train a regression model would be huge.
  3. Ordinal regression, a variant of regression, is used when the labels have a natural ordering (like rankings). Even though the words in a text are correlated, the vocabulary itself has no inherent ordering, so any numeric label assigned to a word would be arbitrary. Hence even ordinal regression cannot be used here.
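
To make point 2 concrete, here is a minimal sketch of a count-based bigram language model, one of the simplest kinds of language model. The toy corpus and the function names (`train_bigram`, `next_word_distribution`) are made up for illustration. The full prompt never appears in the corpus, yet the model can still propose a next word because sub-sequences such as “this sentence” do appear.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_distribution(counts, context):
    """Estimate P(next word | last word of context) from bigram counts."""
    last = context.lower().split()[-1]
    followers = counts.get(last, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # the last word was never seen with a follower
    return {word: c / total for word, c in followers.items()}


# Hypothetical toy corpus; none of these sentences contains the full prompt.
corpus = [
    "please complete this sentence",
    "read this sentence aloud",
    "this task is easy",
]

counts = train_bigram(corpus)
dist = next_word_distribution(counts, "I am about to complete this")
best = max(dist, key=dist.get)
print(best, dist)
```

Note that the output is a probability distribution over discrete words, not a single real number as in regression. A bigram model only looks at the last word of the context; longer n-grams or neural language models condition on more of it, but the idea of sharing statistics across sub-sequences is the same.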

Therefore, regression does not make sense for tasks such as next-word prediction, where a language model is more suitable.

Follow-up question: What is a language model, and how can text be generated using an HMM?

