Top Data Science Interview Questions

While many websites offer machine learning interview questions, this is the one and only place that covers the depth and breadth of data science interview preparation. We provide a platform with self-preparation and self-evaluation quizzes to prepare for Data Science and Machine Learning interviews. These interview questions cover all the popular areas of Machine Learning, including Deep Learning and NLP.
The interview questions have been curated by expert interviewers who have interviewed over a hundred candidates at top companies with large data science teams. Our team understands what it takes to crack a data science interview. Now you just need to focus on preparation and can forget about googling every other machine learning question. Here is a FREE DEMO of the Self Preparation Tool!
Here are some top data science interview questions to get started!
What is the bias-variance trade-off in Machine Learning?

Training any supervised learning model involves balancing two quantities: the complexity of the model and its prediction error on the training examples, also called the training error. Optimizing the training error more (relative to model complexity) results in increased model complexity. This leads to overfitting and hence more prediction error on unseen examples (bad generalization). This error is due to high variance in the model and is called variance error.

Optimizing model complexity more (relative to training error) results in a less complex model but more training error. This in turn leads to underfitting and hence bad generalization again. This error is due to high bias in the model and is called bias error.

Minimizing variance error leads to higher bias in the model, and minimizing bias error leads to higher variance error. This is called the bias-variance trade-off: pushing too hard on either the variance error or the bias error results in bad generalization. There needs to be a right balance between the two, i.e. optimal model complexity and optimal training error. To know more about how bias and variance get introduced, read this detailed intuition behind bias and variance error.
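
For squared-error loss, the trade-off is often summarized by the standard decomposition of the expected error at a point x. The notation is introduced here as an assumption and is not taken from the text above: the data is assumed to follow y = f(x) + ε with zero-mean noise of variance σ², the learned predictor is written as f̂, and the expectation is over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

A more complex model typically lowers the first term and raises the second, which is the balance described above.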

How do you deal with missing data in Machine Learning?

There is no fixed rule for dealing with missing data, but the heuristics below are commonly used.

  1. The most common way of dealing with missing data is to remove all rows with missing data, provided there are not too many such rows.
  2. If more than 50-60% of the rows are missing a value for a specific column, it is common to remove the column. The main problem with removing data in this way is that it could introduce substantial bias.
  3. Imputation is also a common technique, where each missing value is substituted with a best guess.
    1. Imputation with mean: missing data is replaced by the mean of the column.
    2. Imputation with median: missing data is replaced by the median of the column.
    3. Imputation with mode: missing data is replaced by the mode of the column.
    4. Imputation with linear regression: this is a common technique for real-valued data. The missing value is replaced by the prediction of a linear regression fitted on the other feature values.
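
The mean, median and mode imputation heuristics can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `impute`, the use of `None` to mark a missing value, and the sample column `ages` are all assumptions made for the example.

```python
from statistics import mean, median, mode

def impute(column, strategy="mean"):
    """Return a copy of `column` with None entries replaced by a summary statistic.

    `impute` and the None-as-missing convention are hypothetical choices
    made for this sketch.
    """
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    # Pick the summary statistic computed over the observed values only.
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in column]

ages = [22, None, 30, 30, None, 58]  # hypothetical column with two missing values
print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(ages, "mode"))
```

The statistic is computed from the observed values only, so every missing entry in a column receives the same fill value. That is simple, but it shrinks the variance of the column.
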
Why is logistic regression a linear classifier?

There are two parts to this question:

  1. Why is it a classifier when it does regression?
    • Logistic regression is used to estimate the probability that an instance belongs to a particular class. If the estimated probability for an instance is greater than 0.5, the model predicts that the instance belongs to that class (usually called the positive class, labelled 1); otherwise it predicts that it belongs to the negative class (labelled 0). This makes it a binary classifier.
  2. Why is it a linear classifier when it uses a nonlinear sigmoid function?
    • The direct output of the logistic regression function is not a probability (a number between 0 and 1). Instead it is a linear combination of the input features and can take any real value, like a linear regression function. The sigmoid function converts this real value to a number between 0 and 1, and hence to a valid probability. The sigmoid is nonlinear, but it is monotonic and is applied only to the output of the linear function, so the predicted probability exceeds 0.5 exactly when the linear score is positive. The decision boundary is therefore a hyperplane in feature space. In other words, logistic regression is linear in its parameters like any other linear classifier, and the sigmoid only makes it easier to define the threshold (decision boundary) and hence to distinguish between the two classes.
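
The claim that the sigmoid does not change the decision boundary can be checked numerically. The sketch below assumes a two-feature model with made-up weights and bias (`weights`, `bias` and `points` are hypothetical values, not fitted parameters) and confirms that thresholding the probability at 0.5 gives the same label as checking the sign of the linear score.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(weights, bias, x):
    """The linear part of logistic regression: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict(weights, bias, x, threshold=0.5):
    """Predict class 1 if the estimated probability exceeds the threshold."""
    return 1 if sigmoid(linear_score(weights, bias, x)) > threshold else 0

# Hypothetical parameters and inputs, chosen only for illustration.
weights, bias = [2.0, -1.0], 0.5
points = [[1.0, 1.0], [0.0, 2.0], [-1.0, -3.0], [0.2, 0.5]]

for x in points:
    z = linear_score(weights, bias, x)
    # Thresholding the probability at 0.5 agrees with the sign of the linear score.
    assert predict(weights, bias, x) == (1 if z > 0 else 0)
    print(x, round(sigmoid(z), 3), predict(weights, bias, x))
```

Because sigmoid(0) = 0.5 and the sigmoid is increasing, the set of points labelled 1 is exactly the half-space where w . x + b > 0.
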
How do you deal with class imbalance in Machine Learning?
  1. Designing an asymmetric cost function, where the cost of misclassifying a minority-class example is higher than the cost of misclassifying a majority-class example. Typically a scaling factor is assigned to the loss function terms belonging to the minority class, and it can be adjusted during hyperparameter tuning. Note that the evaluation metric used for hyperparameter tuning needs to be aligned as well: for instance, F1 score or AUC is a better measure than plain accuracy.
  2. Under-sampling the majority class:
    1. Remove randomly sampled data points.
    2. Cluster the data points and remove points from large clusters with random sampling.
  3. Over-sampling the minority class:
    1. SMOTE (Synthetic Minority Over-sampling Technique) is a popular tool for oversampling the minority class. For a given minority sample s, one of its k nearest minority-class neighbours is chosen at random, and the vector v from s to that neighbour is computed. The vector v is multiplied by a random number between 0 and 1 and added to s to give the new synthetic data point, which lies on the line segment between s and the neighbour. In a way, it makes the k-nearest-neighbour region of the minority samples denser by oversampling.
    2. Randomly resampling data points. Remember that resampling does not produce enough independent data points to learn complex functions; it only has the effect of assigning higher weight to some minority-class data points.
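
The SMOTE interpolation step can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference implementation: the function name `smote_sample`, the brute-force neighbour search and the toy `minority` points are all hypothetical.

```python
import math
import random

def smote_sample(minority, k=2, rng=random):
    """Create one synthetic point from a list of minority-class feature vectors.

    Hypothetical helper for illustration; neighbours are found by brute force.
    """
    s = rng.choice(minority)
    # k nearest neighbours of s among the other minority points (Euclidean distance).
    others = [p for p in minority if p is not s]
    neighbours = sorted(others, key=lambda p: math.dist(s, p))[:k]
    n = rng.choice(neighbours)
    gap = rng.random()  # random number in [0, 1)
    # s + gap * (n - s): a point on the segment between s and the chosen neighbour.
    return [si + gap * (ni - si) for si, ni in zip(s, n)]

# Toy minority-class points, made up for the example.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
synthetic = [smote_sample(minority, k=2, rng=random.Random(i)) for i in range(3)]
print(synthetic)
```

Each synthetic point is a convex combination of two existing minority points, so it always stays inside the region already occupied by the minority class.
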
How do you evaluate a Machine Learning algorithm?

A Machine Learning algorithm could be for supervised tasks like regression or classification, for unsupervised tasks like clustering, or for more complex tasks like sequence-to-sequence modeling. The evaluation metric depends very much on the algorithm, even within supervised learning.

For supervised learning tasks, a portion of the dataset, called the test set, is held out and not used for training. The evaluation metric is then calculated on this held-out test set. For regression, the most commonly used metrics are root mean square error (RMSE) and mean absolute error (MAE).

For classification, there are a number of ways to evaluate model performance. The metrics used (on the test set) are Accuracy, Precision, Recall, F1 Score, Confusion Matrix, Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC) of the ROC curve. With so many metrics for supervised learning models, one needs a good strategy for selecting the evaluation metric.
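
As a small illustration of how several of the classification metrics relate, the sketch below derives accuracy, precision, recall and F1 from the four confusion-matrix counts. The function name `classification_metrics` and the label lists are assumptions made for the example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute basic binary-classification metrics from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical test-set labels: 3 positives, 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(classification_metrics(y_true, y_pred))
```

In this made-up example the accuracy is 0.7 even though only one of the three positives is found, which is why accuracy alone can be misleading on imbalanced data.
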

For unsupervised learning too, the metric more often than not depends on the algorithm being used. For clustering, one can also use the classification metrics, but only if some true labels are known. For K-means clustering, the evaluation metrics used are inertia and silhouette score. Inertia prefers clusters that minimize the distance between instances and their cluster centres. Silhouette score, on the other hand, prefers clusters that are compact and well separated, i.e. a small intra-cluster distance and a large distance to the nearest other cluster.

Please read this article to understand each of these metrics in detail.

Why do you need a training set, a test set and a validation set?

Before any model is built for the problem at hand, the dataset exists as a single entity. One can start learning from the entire dataset and use the built models to make predictions on new unseen data. The latter part is called generalization in Machine Learning terminology. However, training on the entire available dataset could lead to an overfitted model, highly adapted to the data used in training, with no data left over to detect the problem. The standard solution is to divide the dataset into two non-overlapping sets called the training set and the test set.

The test set acts as unseen data: the model is trained on the training set and tested on the test set. If performance is bad on the test set but good on the training set, the model is overfitting, and we try to resolve it using the different remedies for overfitting mentioned here. Hence the purpose of the test set is to estimate generalization on unseen samples.

Validation set: Most Machine Learning models have hyperparameters to be tuned. For example, the right value of K in K-Means or K-NN has to be discovered by tuning. While training and testing, we may figure out that we need to change our model or its hyperparameters. We could do this with the test set, by checking the test error for each combination of model and hyperparameters. But doing so adapts the model to both the training set and the test set, and it may not generalize well to new unseen data. Hence we need another set, called the validation set, for finding the right model and the right hyperparameters. The purpose of the validation set is therefore to tune the hyperparameters used in the Machine Learning model. Another name for the validation set is development set or dev set.
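
A three-way split can be sketched with the standard library alone. The function name `train_val_test_split`, the 60/20/20 proportions and the fixed seed are assumptions chosen for illustration; the right proportions depend on how much data is available.

```python
import random

def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle `rows` and cut them into three non-overlapping lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical dataset of 100 row ids.
train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Hyperparameters are then compared on `val` only, and `test` is touched once at the end to report the final generalization estimate.
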

What is overfitting and underfitting? Why do they occur?

Overfitting is the result of overtraining the model, while underfitting is the result of keeping the model too simple; both lead to high generalization error. Overtraining leads to a more complex model, while underfitting results in a simpler model.

Overfitting could be due to

  1. Noise in the data, which gets prioritized while training.
  2. Too little data compared to the amount required for a generalizable model.

Underfitting, being the opposite of overfitting, occurs due to

  1. A model that is too simple or has too few parameters.
  2. Excessive regularization, applied to control overfitting.
  3. Too few features or bad features used in training.

For further reading, visit here.

What is the difference between Word2Vec and GloVe?

Word2Vec is a feed-forward neural network based model for finding word embeddings. The Skip-gram model, modelled as predicting the context given a specific word, takes each word in the corpus as input, sends it to a hidden layer (the embedding layer) and from there predicts the context words. Once trained, the embedding for a particular word is obtained by feeding the word as input and taking the hidden layer value as the final embedding vector.

GloVe (Global Vectors) is based on matrix factorization techniques applied to the word-context matrix. It first constructs a large matrix of (words x contexts) co-occurrence information, i.e. for each “word” (the rows), it counts how frequently that word appears in some “context” (the columns) in a large corpus. The number of “contexts” is of course large, since it is essentially combinatorial in size. This matrix is then factorized to yield a lower-dimensional (words x features) matrix, where each row is the vector representation of a word. In general, this is done by minimizing a “reconstruction loss”, which tries to find the lower-dimensional representations that explain most of the variance in the high-dimensional data. Here are more details.

What is the difference between parametric and nonparametric models?

One obvious answer to this question is that parametric models have parameters while nonparametric models do not. But a more precise explanation is needed. Parametric models have a predetermined number of parameters, fixed before training starts. This limits the degrees of freedom of such models, which in turn reduces the risk of overfitting. Examples of parametric models are Logistic Regression, Naive Bayes etc.

On the other hand, it is not true that nonparametric models have no parameters. In fact, they can have a lot of parameters. The catch is that in nonparametric models the number of parameters is not determined prior to training and can grow with the training data. This can often make the model more adaptable to the training data, which leads to overfitting. The solution is to restrict the degrees of freedom of nonparametric models with regularization techniques. Examples of nonparametric models are SVMs (with nonlinear kernels), K-NN etc.

How do you detect outliers in data? How do you deal with them?
There are two types of outliers: univariate and multivariate. Univariate outliers exist when the value of a single feature deviates from the values that other data points take on that feature. With low-dimensional data, one can find univariate outliers by plotting the data and looking for points that lie far from most of the data. One such visualization is the box plot, where outliers appear as individual dots or points and the majority of the data lies inside the box. Multivariate outliers are found by looking at the n-dimensional feature set, which is difficult for humans, though bivariate outliers can be detected using scatter plots. Automated methods to detect outliers include Z-score, probabilistic modeling, clustering, linear regression models etc. The simplest method is the Z-score, which indicates how many standard deviations a data point lies from the mean, assuming a Gaussian distribution. Z-score is useful for parametric distributions in a low-dimensional space.
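
The Z-score rule can be sketched in a few lines. The function name `zscore_outliers`, the cut-off value and the sample `readings` are assumptions for illustration; the threshold is a tunable choice, and on very small samples a cut-off of 3 may never be reached.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical univariate feature with one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 95]
print(zscore_outliers(readings, threshold=2.5))  # → [95]
```

Note that the outlier itself inflates both the mean and the standard deviation, which is one reason the Z-score works best when outliers are few and the rest of the data is roughly Gaussian.
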
DBSCAN is a density-based clustering method useful for outlier detection. Points which do not get assigned to any cluster, or which form their own clusters, are labelled outliers. Isolation forest is designed for outlier detection and is based on decision trees, or more precisely on an ensemble of random trees similar to a random forest. Each tree splits the dataset on a random feature, and the split is repeated with other random features at every level. The number of splits needed to isolate a given data point is its path length. Outliers are expected to have shorter path lengths and stay closer to the root. One-Class SVM, a variant of SVM, is another outlier detection method. SVM is sensitive to outliers, which One-Class SVM uses to its advantage.
What is negative sampling when training the skip-gram model?

Skip-Gram recap: the model tries to represent each word in a large text as a lower-dimensional vector in a space of K dimensions, such that similar words are close to each other. This is achieved by training a feed-forward network that tries to predict the context words given a specific word.

Why is it slow: In this architecture, a soft-max is used to predict each context word. In practice, the soft-max function is very slow to compute, especially for a large vocabulary, because its normalization term sums over every word in the vocabulary.

Resolution:

  • The objective function is reformulated to treat the problem as a classification problem over pairs of words: a given word with a corresponding context word is a positive example, and a given word with a non-context word is a negative example.
  • While there is a limited number of positive examples, there are many negative examples. Hence only a randomly sampled set of negative examples is taken for each word when crafting the objective function.

This algorithm/model is called Skip-Gram Negative Sampling (SGNS).
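
The per-pair objective can be sketched as a binary classification loss: the positive pair should score high and each sampled negative pair should score low. Everything named below is an assumption for illustration (`sgns_loss`, the tiny vocabulary, the random vectors). Negatives are drawn uniformly here for simplicity, whereas the original word2vec implementation samples them from a smoothed unigram distribution.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(center_vec, context_vec, negative_vecs):
    """Negative-sampling loss for one (word, context) pair.

    The true context word is a positive example; each sampled word is a
    negative example. No sum over the whole vocabulary is needed.
    """
    loss = -math.log(sigmoid(dot(center_vec, context_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, neg)))
    return loss

# Hypothetical toy vocabulary with random 4-dimensional vectors.
rng = random.Random(0)
vocab = ["king", "queen", "apple", "car", "river"]
word_vecs = {w: [rng.uniform(-0.5, 0.5) for _ in range(4)] for w in vocab}
context_vecs = {w: [rng.uniform(-0.5, 0.5) for _ in range(4)] for w in vocab}

center, context = "king", "queen"
candidates = [w for w in vocab if w not in (center, context)]
negatives = rng.sample(candidates, 2)  # uniform sampling: a simplification
loss = sgns_loss(word_vecs[center], context_vecs[context],
                 [context_vecs[n] for n in negatives])
print(negatives, round(loss, 4))
```

The cost of one update now grows with the number of sampled negatives rather than with the vocabulary size, which is the source of the speed-up.
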

What are the different ways of preventing over-fitting in a deep neural network? Explain the intuition behind each.
  1. L2 norm regularization: Pushes the weights closer to zero, which limits model complexity and prevents overfitting.
  2. L1 norm regularization: Pushes the weights closer to zero and also induces sparsity in the weights. A less common form of regularization.
  3. Dropout regularization: Drops some of the hidden units at random during training, so that the network does not overfit by becoming too reliant on any single neuron.
  4. Early stopping: Stops the training before the weights are adjusted to overfit the training data.
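
The four techniques can each be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how a deep learning framework implements them: all function names, the `patience` rule for early stopping, the "inverted" rescaling in dropout and the sample numbers are hypothetical choices.

```python
import random

def l2_penalty(weights, lam):
    """Term added to the loss: grows with the squared size of the weights."""
    return lam * sum(w * w for w in weights)

def l1_penalty(weights, lam):
    """Term added to the loss: grows with the absolute size of the weights."""
    return lam * sum(abs(w) for w in weights)

def dropout(activations, drop_prob, rng):
    """Zero each unit with probability `drop_prob` (requires drop_prob < 1).

    Surviving units are rescaled so the expected activation is unchanged.
    """
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch whose weights should be kept.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

weights = [0.5, -1.0, 2.0]  # hypothetical weights
print(l2_penalty(weights, lam=0.01), l1_penalty(weights, lam=0.01))
print(dropout([1.0, 2.0, 3.0, 4.0], drop_prob=0.5, rng=random.Random(0)))
print(early_stopping_epoch([0.90, 0.70, 0.60, 0.62, 0.65, 0.50]))  # → 2
```

In the early stopping example, training halts at epoch 4, so the later drop to 0.50 is never observed. A larger `patience` trades extra training time for a lower chance of stopping too soon.
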
