## How do you handle missing data in an ML algorithm?

There is no fixed rule for dealing with missing data, but the heuristics mentioned above can be used. The most common approach is to remove all rows with missing data, provided there are not too many such rows. If more than 50-60% of the values in a specific column are missing…
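As a minimal sketch of the row-dropping heuristic, assuming a toy dataset where `None` marks a missing value (the data and column names here are hypothetical):

```python
# Hypothetical toy dataset; None marks a missing value.
rows = [
    {"age": 25, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 31, "income": None},
    {"age": 40, "income": 48000},
]

def missing_fraction(rows, column):
    """Fraction of rows where `column` is missing."""
    missing = sum(1 for r in rows if r[column] is None)
    return missing / len(rows)

def drop_incomplete(rows):
    """Remove every row that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

complete = drop_incomplete(rows)  # keeps only fully populated rows
```

In practice one would first check `missing_fraction` per column before deciding whether to drop rows, drop the column entirely, or impute.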

## What are evaluation metrics for a multi-class classification problem?

Multi-class classification can be evaluated in any of the following ways:

- Average precision of each class, treating the classifier for each class as a one-vs-all classifier
- Average recall of each class, treating the classifier for each class as a one-vs-all classifier
- Average accuracy of each class, treating the classifier for each class as a one-vs-all classifier
- Sum of all true positive entries…
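The one-vs-all averaging described above can be sketched in a few lines of plain Python (a macro average over classes; the example labels are made up):

```python
def one_vs_all_counts(y_true, y_pred, cls):
    """True positives, false positives, false negatives for one class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def macro_average(y_true, y_pred, metric):
    """Average a per-class metric ('precision' or 'recall') over all classes."""
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        tp, fp, fn = one_vs_all_counts(y_true, y_pred, c)
        denom = tp + (fp if metric == "precision" else fn)
        scores.append(tp / denom if denom else 0.0)
    return sum(scores) / len(classes)

y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
macro_p = macro_average(y_true, y_pred, "precision")
```

Libraries such as scikit-learn expose the same idea through an `average="macro"` option on their precision and recall metrics.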

## What is the PageRank algorithm?

PageRank is a link-analysis algorithm that determines the relative importance of pages on the World Wide Web by measuring how likely a random web surfer is to land on a particular page. Quote from Wikipedia: "A PageRank results from a mathematical algorithm based on the webgraph, created by all World Wide Web…
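The random-surfer idea can be sketched with a plain power iteration over a tiny made-up link graph (the graph, damping factor, and iteration count below are illustrative choices, not part of any particular implementation):

```python
def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each node to its list of outgoing links."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}          # start uniform
    for _ in range(iters):
        # With probability (1 - damping) the surfer jumps to a random page.
        new = {u: (1.0 - damping) / n for u in nodes}
        for u, outs in links.items():
            if not outs:                        # dangling node: spread uniformly
                for v in nodes:
                    new[v] += damping * rank[u] / n
            else:                               # follow an outgoing link
                for v in outs:
                    new[v] += damping * rank[u] / len(outs)
        rank = new
    return rank

graph = {"a": ["b"], "b": ["a"], "c": ["a", "b"]}
ranks = pagerank(graph)  # "c" has no inlinks, so it ends up ranked lowest
```

Note how page "c", which nothing links to, receives only the random-jump mass.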

## How do you deal with out-of-vocabulary words at runtime when you build a language model?

Out-of-vocabulary (OOV) words are words that do not appear in the training set but appear in the test set or in real data. The main problem is that the model assigns a probability of zero to OOV words, resulting in a zero likelihood. This is a common problem, especially when you have not trained on a…
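One standard remedy is to reserve an `<unk>` token: rare training words are mapped to it so the model learns a probability for "unknown", and unseen words at runtime are mapped the same way. A minimal sketch (the `min_count` threshold and token names are illustrative):

```python
from collections import Counter

def build_vocab(tokens, min_count=2):
    """Keep words seen at least min_count times; reserve an <unk> token."""
    counts = Counter(tokens)
    vocab = {w for w, c in counts.items() if c >= min_count}
    vocab.add("<unk>")
    return vocab

def map_oov(tokens, vocab):
    """Replace any token outside the vocabulary with <unk>."""
    return [t if t in vocab else "<unk>" for t in tokens]

train_tokens = "the cat sat on the mat the cat".split()
vocab = build_vocab(train_tokens)           # {"the", "cat", "<unk>"}
runtime = map_oov("the dog sat".split(), vocab)
```

Because `<unk>` itself occurs in the (preprocessed) training data, it receives a nonzero probability, so unseen words no longer zero out the likelihood.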

## What are the advantages and disadvantages of using a bag-of-words feature vector?

The bag-of-words (BoW) model is a common way of representing text data as an input feature vector to an ML model. Each document is encoded as a V-dimensional feature vector, where V is the vocabulary size. Each dimension in the feature vector contains the number of times the word corresponding to that dimension occurs…
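The encoding described above can be sketched directly (the three-word vocabulary here is just for illustration; real vocabularies are much larger):

```python
def bow_vector(doc_tokens, vocabulary):
    """V-dimensional count vector; vocabulary is an ordered list of words."""
    index = {w: i for i, w in enumerate(vocabulary)}
    vec = [0] * len(vocabulary)
    for tok in doc_tokens:
        if tok in index:          # tokens outside the vocabulary are dropped
            vec[index[tok]] += 1
    return vec

vocabulary = ["cat", "sat", "mat"]
vec = bow_vector("the cat sat on the cat".split(), vocabulary)  # [2, 1, 0]
```

The sketch also makes two classic BoW drawbacks visible: word order is discarded, and the vector length grows with the vocabulary.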

## Why don’t we tune hyper-parameters using the test set, and why do we need a separate validation set?

We use the test set to check that the model we build generalizes well to unseen data. If we use the test set to tune hyper-parameters when selecting the model, we are indirectly using the test set in training; or rather, our model has seen the test set. Hence, it is no longer an unseen dataset but already…
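A minimal sketch of the correct workflow, assuming some hypothetical validation losses for three candidate regularization strengths (the numbers are made up): the search consults only the validation split, and the test set would be evaluated exactly once at the end.

```python
def validation_search(candidates, val_loss):
    """Return the candidate hyper-parameter with the lowest validation loss."""
    return min(candidates, key=val_loss)

# Assumed validation losses, e.g. from training a model per candidate
# on the training split and scoring it on the validation split.
val_losses = {0.01: 0.40, 0.10: 0.25, 1.00: 0.33}
best = validation_search(list(val_losses), val_losses.get)  # picks 0.10
# Only now would the chosen model be scored once on the held-out test set.
```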

## Why do you need a training set, test set and validation set?

Before any model is built for the problem at hand, the dataset exists as a single entity. One could learn from the entire dataset and use the resulting models to make predictions on new unseen data. The latter part is called generalization in machine learning terminology. However, training on the entire available dataset would lead to…
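A common way to carve the single dataset into the three splits is a seeded shuffle followed by slicing; the 70/15/15 proportions below are a conventional choice, not a rule:

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once with a fixed seed, then slice into three disjoint splits."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))  # 70 / 15 / 15 items
```

Fixing the seed keeps the split reproducible across runs, so the test set stays the same held-out data throughout a project.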

## How do you eliminate underfitting?

- (a) Make the model simpler
- (b) Collect more data
- (c) Collect more features
- (d) Increase the regularization parameter

Answer – (c). Underfitting is the opposite of overfitting and occurs when the model is too simple to learn from the given dataset. This could happen if the right features were not selected or extracted, or if the regularization was done with a higher…
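XOR is a textbook illustration of curing underfitting by collecting more features: no linear model on (x1, x2) alone can fit it, but adding the interaction feature x1*x2 makes a linear model fit perfectly. The hand-picked weights below are just one working choice:

```python
# XOR: no single line separates the two classes in (x1, x2) space.
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def predict(x1, x2):
    # Linear in the augmented features (x1, x2, x1*x2); weights chosen by hand.
    score = x1 + x2 - 2 * (x1 * x2)
    return 1 if score >= 0.5 else 0

all_correct = all(predict(*x) == y for x, y in XOR)  # the underfit is gone
```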

## What are common tools for speech recognition? What are the advantages and disadvantages of each?

There are several ready-made tools for speech recognition that one can use to train custom models given an appropriate dataset:

- CMU Sphinx – used more in an academic setting; one of the oldest libraries.
- Kaldi – hard to set up, very flexible to use. Typically used by academics.
- Deep Speech – easy to set up,…

## Overfitting is a result of which of the following causes:

- (1) Too little data
- (2) A simple model, like a linear classifier
- (3) A complex model, like a high-degree polynomial classifier
- (4) All of the above

Answer – (1), (3). Overfitting generally happens when the model tries to fit everything because it is too complex or there is too little data. When your model performs well on…
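An extreme caricature of cause (3) is a lookup table that memorizes every training pair (the tiny dataset and default label here are made up): training accuracy is perfect, yet the model has no principled answer for inputs it has never seen.

```python
# A lookup table as an extreme "complex model": it memorizes the training set.
train_pairs = [(1, "a"), (2, "b"), (3, "a")]
table = dict(train_pairs)

def memorizer(x, default="a"):
    # Perfect recall on seen inputs; arbitrary fallback on unseen ones.
    return table.get(x, default)

train_acc = sum(memorizer(x) == y for x, y in train_pairs) / len(train_pairs)
# train_acc is 1.0, but predictions on unseen inputs are just the default.
```

The gap between perfect training performance and arbitrary behavior on new inputs is exactly the train/test gap that signals overfitting.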