Please read here to understand what PMI and PPMI are. Problems: As the vocabulary size (V) is large, these vectors will be large. They will also be sparse, since a word may not have co-occurred with every other word. Resolution: Dimensionality reduction using approaches like Singular Value Decomposition (SVD) of the term-document matrix…
Tag: NLP
What is speaker segmentation in speech recognition? How do you use it?
Speaker diarization or speaker segmentation is the process of automatically assigning a speaker identity to each segment of the audio file. Segmenting by speaker is very useful in several applications to understand who said what in a conversation. Typically speaker information is crucial for applications such as emotion detection, behavioural analysis or topic analysis of…
What is a language model? Why do you need a language model?
A language model is a probability distribution over sequences of words, given by P(w_1, w_2, ..., w_n). It enables us to measure the relative likelihood of different phrases. Measuring the likelihood of a sequence of words is useful in many NLP tasks such as speech recognition, machine translation, POS tagging, parsing, and so on. Example: In…
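As a rough illustration of measuring the relative likelihood of phrases, here is a minimal sketch of a bigram language model with add-one smoothing. The toy corpus, the `<s>`/`</s>` boundary markers, and the function names are assumptions made for this example; the excerpt itself does not prescribe a particular model.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        # Boundary markers let the model score sentence starts and ends.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def sentence_log_prob(sentence, unigrams, bigrams):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        numerator = bigrams[(prev, word)] + 1
        denominator = unigrams[prev] + vocab_size
        log_prob += math.log(numerator / denominator)
    return log_prob


# Toy corpus (hypothetical data, only for illustration).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat chased the dog",
]
unigrams, bigrams = train_bigram_model(corpus)

fluent = sentence_log_prob("the cat sat on the rug", unigrams, bigrams)
scrambled = sentence_log_prob("rug the on sat cat the", unigrams, bigrams)
print(fluent > scrambled)
```

The two test sentences contain the same words, so the difference in score comes only from word order, which is the kind of signal a speech recognizer or translation system uses to prefer one candidate output over another.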
What are some common tools available for Named Entity Recognition (NER)?
Notable Named Entity Recognition platforms include: GATE supports NER across many languages and domains out of the box, usable via a graphical interface and a Java API. OpenNLP includes rule-based and statistical named-entity recognition. spaCy features fast statistical NER as well as an open-source named-entity visualizer.
What is the difference between Word2Vec and GloVe?
Word2Vec is a feed-forward neural network based model for learning word embeddings. The Skip-gram model, framed as predicting the context given a specific word, takes each word in the corpus as input, sends it to a hidden layer (the embedding layer), and from there predicts the context words. Once trained, the embedding for a particular…
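To make the "predict the context given a word" framing concrete, the sketch below only generates the (input word, context word) training pairs that a Skip-gram model would be trained on. The window size and the function name `skipgram_pairs` are assumptions for illustration; the network and its training loop are not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center_word, context_word) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair becomes one training example: the center word is the input, and the model is asked to assign high probability to the context word.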
What is the difference between paraphrasing and textual entailment?
Textual entailment is the process of determining whether a source text T implies a hypothesis text H. It is a unidirectional relationship. Example: text: If you help the needy, God will reward you. hypothesis: Giving money to a poor man has good consequences. Some techniques for textual entailment include lexical similarity based techniques to identify…
What are the state-of-the-art techniques for Machine Translation?
Machine translation can be done using one of the following techniques: Rule-based machine translation (older techniques): Uses a dictionary between words of the two languages, along with syntactic, semantic, and morphological analysis of the source sentence, to define context. Linguistic rules are defined to translate a specific word in a given context into target…
How do you design a system that reads a natural language question and retrieves the closest FAQ answer?
There are multiple approaches to FAQ-based question answering. Keyword-based search (information retrieval approach): Tag each question with keywords. Extract keywords from the query and retrieve all relevant question-answer pairs. Easy to scale with appropriate indexes, such as a reverse (inverted) index. Lexical matching approach: word-level overlap between the query and the question. These approaches might be harder to…
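The lexical matching idea can be sketched as word-level overlap (Jaccard similarity) between the query and each stored question. The stopword list, the FAQ entries, and the function names here are all hypothetical and exist only for this example; a production system would more likely sit on top of a proper search index.

```python
import re

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"a", "an", "the", "is", "do", "i", "how", "what", "my", "to", "can", "of"}


def keywords(text):
    """Lowercase, tokenize on alphanumerics, and drop stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def closest_faq(query, faqs):
    """Return the (question, answer) pair whose question best overlaps the query."""
    query_words = keywords(query)

    def jaccard(question):
        question_words = keywords(question)
        union = query_words | question_words
        return len(query_words & question_words) / len(union) if union else 0.0

    best_question = max(faqs, key=jaccard)
    return best_question, faqs[best_question]


# Hypothetical FAQ data.
faqs = {
    "How do I reset my password?": "Use the 'Forgot password' link on the login page.",
    "How can I change my email address?": "Update it under account settings.",
    "What is the refund policy?": "Refunds are available within 30 days.",
}

question, answer = closest_faq("I forgot my password, how to reset it?", faqs)
print(question)
print(answer)
```

Pure word overlap misses paraphrases that share no vocabulary with the stored question, so it is best seen as a baseline.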
How do you deal with dataset imbalance in a problem like spam filtering?
Class imbalance is a very common problem when applying ML algorithms. Spam filtering is one such application where class imbalance is apparent: there are many more non-spam emails in a typical inbox than spam emails. The following approaches can be used to address the class imbalance problem. Designing an asymmetric cost function where the cost…
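One way an asymmetric cost function might look is a class-weighted log loss, sketched below. The excerpt is cut off before it says which error is made more expensive, so the weights used here (`spam_weight=5.0`, `ham_weight=1.0`) and the choice to up-weight the minority spam class are assumptions for illustration only.

```python
import math


def weighted_log_loss(y_true, y_prob, spam_weight=5.0, ham_weight=1.0):
    """Mean log loss where errors on each class are scaled by a class weight.

    y_true: 1 for spam, 0 for non-spam (ham).
    y_prob: predicted probability that the email is spam.
    The weights are hypothetical and would normally be tuned or derived
    from class frequencies.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for label, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        if label == 1:
            total += -spam_weight * math.log(p)
        else:
            total += -ham_weight * math.log(1 - p)
    return total / len(y_true)


# The same degree of error costs more on the up-weighted class.
missed_spam = weighted_log_loss([1], [0.1])
flagged_ham = weighted_log_loss([0], [0.9])
print(missed_spam, flagged_ham)
```

Swapping the two weights gives the opposite policy, where flagging a legitimate email is treated as the more costly mistake.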
You have come up with a spam classifier. How do you measure accuracy?
Spam filtering is a classification problem. In a classification problem, the following metrics are commonly used to measure efficacy: True positives: data points where the predicted outcome is spam and the document is actually spam. True negatives: data points where the predicted outcome is not spam and the document is actually not…
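The counts described above can be tallied directly from paired labels and predictions. This is a minimal sketch with made-up labels; the function names are illustrative, and false positives and false negatives are included using their standard definitions.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Tally true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted spam: correct if it really is spam.
            counts["tp" if actual == positive else "fp"] += 1
        else:
            # Predicted not spam: correct if it really is not spam.
            counts["tn" if actual != positive else "fn"] += 1
    return counts


def accuracy(counts):
    """Fraction of all predictions that were correct."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


# Hypothetical labels and predictions.
y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "ham", "spam"]

counts = confusion_counts(y_true, y_pred)
print(counts)
print(accuracy(counts))
```

With heavily imbalanced classes, accuracy alone can look high for a classifier that rarely predicts spam, which is why the individual counts are worth inspecting as well.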