Suppose you build word vectors (embeddings) with each word vector having dimensions as the vocabulary size(V) and feature values as pPMI between corresponding words: What are the problems with this approach and how can you resolve them ?

Please read here to understand what is PMI and pPMI. Problems As the vocabulary size (V) is large, these vectors will be large in size. They will be sparse as a word may not have co-occurred with all possible words. Resolution Dimensionality Reduction using approaches like Singular Value Decomposition (SVD) of the term document matrix…

What is speaker segmentation in speech recognition ? How do you use it ?

Speaker diarization or speaker segmentation is the process of automatically assigning a speaker identity to each segment of the audio file. Segmenting by speaker is very useful in several applications  to understand who said what in a conversation. Typically speaker information is crucial for applications such as emotion detection, behavioural analysis or topic analysis of…

What are some common tools available for NER ? Named Entity Recognition ?

Notable NER platforms include: GATE supports NER across many languages and domains out of the box, usable via a graphical interface and a Java API. OpenNLP includes rule-based and statistical named-entity recognition. SpaCy features fast statistical NER as well as an open-source named-entity visualizer.  

What is the difference between word2Vec and Glove ?

Word2Vec is a Feed forward neural network based model to find word embeddings. The Skip-gram model, modelled as predicting the context given a specific word, takes the input as each word in the corpus, sends them to a hidden layer (embedding layer) and from there it predicts the context words. Once trained, the embedding for a particular…

What is the difference between paraphrasing and textual entailment ?

Textual entailment is the process of determining if a source T implies the hypothesis text H. Example :It is a unidirectional relationship : text: If you help the needy, God will reward you. hypothesis: Giving money to a poor man has good consequences. Some techniques for textual entailment include lexical similarity based techniques to identify…

What are the state of the art techniques for Machine Translation ?

Machine translation can be done by either of the following techniques : Rule based machine translation (Older techniques) : Uses dictionary between words of the two languages along with syntactic, semantic morphological analysis of the source sentence to define context. Linguistic Rules are defined to translate a specific word in a given context into target…

How do you design a system that reads a natural language question and retrieves the closest FAQ answer?

There are multiple approaches for FAQ based question answering Keyword based search (Information retrieval approach): Tag each question with keywords. Extract keywords from query and retrieve all relevant questions answers. Easy to scale with appropriate indexes reverse indexing. Lexical matching approach : word level overlap between query and question. These approaches might be harder to…

How do you deal with dataset imbalance in a problem like spam filtering ?

Class imbalance is a very common problem when applying ML algorithms. Spam filtering is one such application where class imbalance is apparent. There are many more non-spam emails in a typical inbox than spam emails. The following approaches can be used to address the class imbalance problem. Designing an Asymmetric cost function where the cost…

You have come up with a Spam classifier. How do you measure accuracy ?

Spam filtering is a classification problem. In a classification problem, the following are the common metrics used to measure efficacy : True positives : Those data points where the outcome is spam and the document is actually spam. True Negatives: Those data points where the outcome is not spam and the document is actually not…