The Bayes classifier is a machine learning technique that can be used to classify objects such as text documents into two or more classes. The classifier is trained by analysing a set of training data, for which the correct classes are given.
The naive Bayes classifier can be used to determine the probabilities of the classes given a number of different observations. The assumption in the model is that the feature variables are conditionally independent given the class. (We will not discuss the meaning of conditional independence in this course. For our purposes, it is enough to be able to exploit conditional independence in building the classifier.)
We will use a spam email filter as a running example for illustrating the idea of the naive Bayes classifier. Thus, the class variable indicates whether a message is spam (or “junk email”) or whether it is a legitimate message (also called “ham”). The words in the message correspond to the feature variables, so that the number of feature variables in the model is determined by the length of the message.
Using spam filters as an example, the idea is to think of the words as being produced by choosing one word after the other so that the choice of the word depends only on whether the message is spam or ham. This is a crude simplification of the process because it means that there is no dependency between adjacent words, and the order of the words has no significance. This is in fact why the method is called naive.
The above idea is usually depicted using the following illustration where the class of the message (spam or ham) is the only factor that has an effect on the words.
Despite it’s naivete, the naive Bayes method tends to work very well in practice. This is a good example of what the common saying in statistics, “all models are wrong, but some are useful” means. (The aphorism is generally attributed to statistician George E.P. Box.)
To get started, we need to specify the prior odds for spam (against ham). For simplicity assume this to be 1:1 which means that on the average half of the incoming messages are spam. (In reality, the amount of spam is probably much higher.)
To get our likelihood ratios, we need two different probabilities for any word occurring: one in spam messages and another one in ham messages.
The word distributions for the two classes are best estimated from actual training data that contains some spam messages as well as legitimate messages. The simplest way is to count how many times each word, abacus, acacia, ..., zurg, appears in the data and divide the number by the total word count.
To illustrate the idea, let’s assume that we have at our disposal some spam and some ham. You can easily obtain such data by saving a batch of your emails in two files.
Assume that we have calculated the number of occurrences of the following words (along with all other words) in the two classes of messages:
We can now estimate that the probability that a word in a spam message is million, for example, is about 156 out of 95791, which is roughly the same as 1 in 614. Likewise, we get the estimate that 98 out of 306438 words, which is about the same as 1 in 3127, in a ham message are million. Both of these probability estimates are small, less than 1 in 500, but more importantly, the former is higher than the latter: 1 in 614 is higher than 1 in 3127. This means that the likelihood ratio, which is the first ratio divided by the second ratio, is more than one. To be more precise, the ratio is (1/614) / (1/3127) = 3127/614 = 5.1 (rounded to one decimal digit).
Recall that if you have any trouble at all with following the math in this section, you should refresh the arithmetics with fractions using the pointers we gave earlier (see the part about Odds in section Odds and Probability).
One problem with estimating the probabilities directly from the counts is that the zero counts lead to zero estimates. This can be quite harmful for the performance of the classifier - it easily leads to situations where the posterior odds is 0/0, which is nonsense. The simplest solution is to use a small lower bound for all probability estimates. The value 1/100000, for instance, does the job.
Using the above logic, we can determine the likelihood ratio for all possible words without having to use zero, giving us the following likelihood ratios:
We are now ready to apply the method to classify new messages.
Once we have the prior odds and the likelihood ratios calculated, we are ready to apply the Bayes rule, which we already practiced in the medical diagnosis case as our example. The reasoning goes just like it did before: we update the odds of spam by multiplying it by the likelihood ratio. To remind ourselves of the procedure, let's try a message with a single message to begin with. For the prior odds, as agreed above, you should use odds 1:1.
To handle the rest of the words in a message, we can use exactly the same procedure. The posterior odds after one word, which you calculated in the previous exercise, will become the prior odds for the next word, and so on.
Hooray! You have now mastered a powerful technique used every day in a wide range of real-world AI applications, the naive Bayes classifier. Even if you had to skip some of the technicalities, you should try to make sure you understood the basic principles of applying probabilities to update beliefs. As we discussed in the beginning of this Chapter, the main advantage of probabilistic reasoning is the ability to handle uncertain and conflicting evidence. Using examples in medical diagnosis and spam filtering, we demonstrated how this work is practice.
Please join the Elements of AI community at Spectrum to discuss and ask questions about this chapter.