BLEU (bilingual evaluation understudy) is one of the most widely used metrics for evaluating machine translation. Typically, it measures a candidate translation against a set of reference translations that serve as ground truth.
BLEU is based on precision: how many of the words in the candidate sentence also appear in the reference sentence. Plain precision is easy to game, however. For instance, consider the candidate sentence “the the the the the” and the reference sentence “the cat ate the rat”. The candidate would get a precision of 1, since every one of its words appears in the reference. To prevent this, BLEU caps the count of each unique word in the candidate at the maximum number of times that word occurs in any single reference. In this simple word-level form, the score is computed as
Score = (sum of capped occurrences of all unique words in candidate) / (length of candidate).
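For the example above, this is ⅖: “the” occurs five times in the candidate but only twice in the reference, so its count is capped at 2 and divided by the candidate length of 5. BLEU is known to correlate well with human judgement. Please visit here for further reading on machine translation evaluation.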
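
The capped word count described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name clipped_unigram_precision is hypothetical, tokens are obtained by lowercasing and splitting on whitespace, and only single words are counted. Library implementations of BLEU typically also combine longer n-grams and a brevity penalty, which this sketch leaves out.

```python
from collections import Counter


def clipped_unigram_precision(candidate, references):
    """Capped (clipped) word precision of a candidate against reference sentences.

    Hypothetical helper for illustration; assumes whitespace tokenization.
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0  # assumption: an empty candidate scores 0

    cand_counts = Counter(cand_tokens)

    # For each word, record the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref.lower().split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)

    # Cap each candidate word count at its maximum reference count.
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in cand_counts.items())
    return clipped / len(cand_tokens)


print(clipped_unigram_precision("the the the the the", ["the cat ate the rat"]))
```

For the example sentences, “the” is capped at 2 out of 5 candidate words, so the call prints 0.4, matching the ⅖ worked out by hand.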