Bayesian self-learning in Kerio Connect

Detecting spam on behalf of the final recipient of an email involves several problems. Understanding these problems helps explain what Bayesian self-learning is and how it fits into Kerio's solution for spam protection.

Terminology

  • Spam is a message the recipient considers an unsolicited junk email.
  • Ham is a message the recipient considers to be not spam.
  • False Positive is a message that is incorrectly marked as spam.
  • False Negative is a message that is incorrectly marked as ham.

SpamAssassin

SpamAssassin uses static rule sets to determine if a message is spam.

A fixed set of rules cannot accurately define spam for everybody. SpamAssassin may capture most spam, but it will always produce some false positives and false negatives.

Also, the content of spam changes over time as spammers adapt their messages. Unless the rules in SpamAssassin change as well, more and more spam gets through. Therefore, constant updates are necessary to maximize the spam blocking capabilities.

Bayesian filtering

Recipients can train the Bayes database to recognize messages as spam or ham. The filter breaks messages into small pieces called tokens and determines which tokens occur mostly in spam messages, and which tokens occur mostly in ham messages.
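The token idea above can be illustrated with a minimal sketch. It is not Kerio Connect's or SpamAssassin's actual implementation: the tokenizer, the function names (tokenize, train, spamminess) and the sample messages are all hypothetical, and real filters use more elaborate tokenization and probability combining.

```python
import re
from collections import Counter

def tokenize(text):
    """Split a message into lowercase word tokens (a deliberate simplification)."""
    return re.findall(r"[a-z0-9$']+", text.lower())

def train(messages):
    """Count in how many of the given messages each token appears."""
    counts = Counter()
    for message in messages:
        counts.update(set(tokenize(message)))
    return counts

def spamminess(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate how strongly a token indicates spam: 0.0 (ham) to 1.0 (spam)."""
    spam_rate = spam_counts[token] / n_spam
    ham_rate = ham_counts[token] / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)

# Hypothetical training messages, for illustration only.
spam = ["Cheap pills, buy now", "Buy cheap watches now"]
ham = ["Meeting notes attached", "Can we buy lunch before the meeting"]
spam_counts, ham_counts = train(spam), train(ham)

for word in ("cheap", "buy", "meeting"):
    print(word, spamminess(word, spam_counts, ham_counts, len(spam), len(ham)))
```

In this toy data set, "cheap" appears only in spam and "meeting" only in ham, while "buy" appears in both and therefore carries weaker evidence.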

The Bayes database must learn from a large number of emails before it can function effectively. In general, the Bayes database begins to work after it has learned at least 200 spam messages and 200 ham messages. End users must train the Bayes database sufficiently for it to fight mutating spam effectively.

Bayesian self-learning

SpamAssassin and additional Kerio Connect antispam features can help Bayesian self-learning:

  • The higher the SpamAssassin score, the more probable it is that the message is spam.
  • The lower the SpamAssassin score, the more probable it is that the message is ham.

SpamAssassin trains the Bayes database as follows:

  • If the total SpamAssassin score is more than 12, and both the header score and the body score are more than 3, consider the message as spam.
  • If the total SpamAssassin score is less than 0.1, consider the message as ham.
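The two SpamAssassin training rules above can be sketched as a small function. This is only an illustration of the thresholds stated in the text. The function name spamassassin_label and the constant names are hypothetical and do not correspond to any Kerio Connect or SpamAssassin API.

```python
from typing import Optional

# Thresholds taken from the rules described in the text.
SPAM_TOTAL_MIN = 12.0   # total score must be more than this to learn as spam
SPAM_PART_MIN = 3.0     # header and body scores must each be more than this
HAM_TOTAL_MAX = 0.1     # total score must be less than this to learn as ham

def spamassassin_label(total: float, header: float, body: float) -> Optional[str]:
    """Return "spam", "ham", or None when the message is not used for training."""
    if total > SPAM_TOTAL_MIN and header > SPAM_PART_MIN and body > SPAM_PART_MIN:
        return "spam"
    if total < HAM_TOTAL_MAX:
        return "ham"
    return None  # score is inconclusive, so the Bayes database is not trained

print(spamassassin_label(15.0, 4.0, 5.0))   # high total, header and body both above 3
print(spamassassin_label(15.0, 2.0, 13.0))  # high total, but header score too low
print(spamassassin_label(0.0, 0.0, 0.0))    # very low total
```

Messages that fall between the two thresholds return None here, reflecting that only confidently classified messages are used for self-learning.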

Additional antispam tests in Kerio Connect, such as blacklists, SPF (Sender Policy Framework), and header tests, train the Bayes database as follows:

  • If the total score from tests other than SpamAssassin is more than the required tag score, and the SpamAssassin score is less than 0.1, consider the message as spam.
  • If the total score including SpamAssassin is more than (block score - tag score/1.8) + tag score, and the SpamAssassin score is less than 12, consider the message as spam.
  • If the total score from tests other than SpamAssassin is less than 0, and SpamAssassin trains the Bayes database with the message as spam, consider the message as ham.
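The three rules above can be sketched as follows. This is a hedged illustration, not Kerio Connect's code: the function name extra_tests_label and all parameter names are hypothetical. The threshold from the second rule is passed in as a precomputed value (combined_threshold) rather than calculated here, because the sketch does not assume how the formula in the text is grouped.

```python
from typing import Optional

def extra_tests_label(
    other_score: float,         # total score from tests other than SpamAssassin
    sa_score: float,            # SpamAssassin score
    total_score: float,         # total score including SpamAssassin
    tag_score: float,           # configured tag score
    combined_threshold: float,  # precomputed threshold from the second rule
    sa_trained_as_spam: bool,   # True if SpamAssassin trained this message as spam
) -> Optional[str]:
    """Return "spam", "ham", or None when none of the rules applies."""
    # Rule 1: other tests flag the message although SpamAssassin sees nothing.
    if other_score > tag_score and sa_score < 0.1:
        return "spam"
    # Rule 2: the combined score is high although SpamAssassin alone is below 12.
    if total_score > combined_threshold and sa_score < 12:
        return "spam"
    # Rule 3: other tests vouch for the message, overriding SpamAssassin's training.
    if other_score < 0 and sa_trained_as_spam:
        return "ham"
    return None

# Hypothetical scores, for illustration only.
print(extra_tests_label(6.0, 0.0, 6.0, 5.0, 8.0, False))
print(extra_tests_label(-1.0, 13.0, 12.0, 5.0, 20.0, True))
```

The first call matches the first rule and the second call matches the third rule. Because the third rule requires SpamAssassin to have trained the message as spam (a score above 12), it cannot overlap with the first two rules, which require lower SpamAssassin scores.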