Bayesian self-learning in Kerio Connect
Detecting spam reliably for the final recipient of an email involves several problems. Understanding these problems helps explain what Bayesian self-learning is and how it fits into Kerio's solution for spam protection.
Terminology
- Spam is a message the recipient considers an unsolicited junk email.
- Ham is a message the recipient considers to be not spam.
- False Positive is a message that is incorrectly marked as spam.
- False Negative is a message that is incorrectly marked as ham.
SpamAssassin
SpamAssassin uses static rule sets to determine if a message is spam.
A fixed set of rules cannot accurately define spam for everybody. SpamAssassin may capture most spam, but it will always produce some false positives and false negatives.
Also, the content of spam changes over time as spam mutates. Unless the rules in SpamAssassin change too, more and more spam gets through. Therefore, constant updates are necessary to maximize the spam-blocking capabilities.
Bayesian filtering
Recipients can train the Bayes database to recognize messages as spam or ham. The filter breaks messages into small pieces called tokens and determines which tokens occur mostly in spam messages, and which tokens occur mostly in ham messages.
The Bayes database must learn from many emails before it can function effectively. In general, the Bayes database begins to work after it has learned at least 200 spam messages and 200 ham messages. End users must train the Bayes database enough to effectively fight mutating spam.
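The token counting described above can be illustrated with a minimal sketch. It is not Kerio Connect's or SpamAssassin's actual implementation: the `TokenStats` class, the word-based tokenizer, and the simple per-token ratio are all assumptions made for illustration. Only the 200 spam / 200 ham minimum comes from the text.

```python
import re
from collections import Counter

MIN_SPAM = 200  # minimum learned spam messages, per the text above
MIN_HAM = 200   # minimum learned ham messages, per the text above


class TokenStats:
    """Hypothetical token store; not the real Bayes database format."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam messages containing it
        self.ham_tokens = Counter()   # token -> number of ham messages containing it
        self.spam_count = 0
        self.ham_count = 0

    @staticmethod
    def tokenize(text):
        # Assumption: a token is a lowercase word; real filters use richer tokens.
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def learn(self, text, is_spam):
        """Record the tokens of one message the recipient marked as spam or ham."""
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_count += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_count += 1

    def is_ready(self):
        """True once enough spam and ham messages have been learned."""
        return self.spam_count >= MIN_SPAM and self.ham_count >= MIN_HAM

    def spam_ratio(self, token):
        """Share of learned messages containing the token that were spam, or None if unseen."""
        s = self.spam_tokens[token]
        h = self.ham_tokens[token]
        if s + h == 0:
            return None
        return s / (s + h)
```

A token with a ratio near 1 occurs mostly in spam, and a token with a ratio near 0 occurs mostly in ham. A real Bayesian filter would combine the evidence from many tokens and account for the different numbers of learned spam and ham messages, which this sketch leaves out.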
Bayesian self-learning
SpamAssassin and additional Kerio Connect antispam features can assist Bayesian self-learning:
- The higher the SpamAssassin score, the more probable it is that the message is spam.
- The lower the SpamAssassin score, the more probable it is that the message is ham.
SpamAssassin trains the Bayes database as follows:
- If the total SpamAssassin score is more than 12, and both the header score and body score are more than 3, consider the message as spam.
- If the total SpamAssassin score is less than 0.1, consider the message as not spam.
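The two SpamAssassin rules above can be written as a small decision function. This is a sketch under stated assumptions: the function name and the label strings are hypothetical, and returning `None` (do not train) for messages that match neither rule is an assumption, since the text does not say what happens to them.

```python
def spamassassin_training_label(total, header, body):
    """Return "spam", "ham", or None for one message's SpamAssassin scores.

    total, header, body: the total, header, and body scores of the message.
    """
    # Train as spam: total above 12 and both partial scores above 3.
    if total > 12 and header > 3 and body > 3:
        return "spam"
    # Train as ham: total below 0.1.
    if total < 0.1:
        return "ham"
    # Assumption: anything in between is not used for training.
    return None
```

For example, a message scoring 15 in total with a header score of 2 would not be learned as spam in this sketch, because the header score does not exceed 3.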
Additional antispam tests in Kerio Connect, such as blacklists, SPF (Sender Policy Framework), and header tests, train the Bayes database as follows:
- If the total score from tests other than SpamAssassin is more than the required tag score, and the SpamAssassin score is less than 0.1, consider the message as spam.
- If the total score including SpamAssassin is more than (block score-tag score/1.8)+tag score, and the SpamAssassin score is less than 12, consider the message as spam.
- If the total score from tests other than SpamAssassin is less than 0, and SpamAssassin trains the Bayes database with spam, consider the message as ham.
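The three rules above can be sketched as a function that adjusts SpamAssassin's training decision. Several details are assumptions rather than documented behavior: the function and parameter names are hypothetical, the total score is taken to be the plain sum of the two partial scores, and the threshold is evaluated literally as written with standard operator precedence (the intended grouping may instead be ((block score - tag score) / 1.8) + tag score). Keeping SpamAssassin's own decision when no rule matches is also an assumption.

```python
def kerio_training_label(other_score, sa_score, sa_label, tag_score, block_score):
    """Return "spam", "ham", or None after applying the additional-test rules.

    other_score: total score from tests other than SpamAssassin
    sa_score:    total SpamAssassin score
    sa_label:    how SpamAssassin alone would train ("spam", "ham", or None)
    tag_score, block_score: the configured tag and block thresholds
    """
    # Assumption: the total including SpamAssassin is a simple sum.
    total = other_score + sa_score
    # Literal reading of "(block score-tag score/1.8)+tag score".
    threshold = (block_score - tag_score / 1.8) + tag_score

    # Other tests vouch for a message SpamAssassin would learn as spam.
    if other_score < 0 and sa_label == "spam":
        return "ham"
    # Other tests flag a message SpamAssassin scores as clean.
    if other_score > tag_score and sa_score < 0.1:
        return "spam"
    # Combined score is high although SpamAssassin alone is below 12.
    if total > threshold and sa_score < 12:
        return "spam"
    # Assumption: otherwise SpamAssassin's own decision stands.
    return sa_label
```

With a non-negative tag score the three conditions cannot contradict each other, so the order in which they are checked does not affect the result in this sketch.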