How Bayesian Analysis works

The Bayesian filter is an anti-spam technology used within GFI MailEssentials. It is an adaptive technique based on artificial intelligence algorithms, hardened to withstand the widest range of spamming techniques available today.

NOTE

1. The Bayesian anti-spam filter is disabled by default. It is highly recommended that you train the Bayesian filter before enabling it.

2. GFI MailEssentials must operate for at least one week for the Bayesian filter to achieve its optimal performance. This is required because the Bayesian filter reaches its highest detection rate only after it has adapted to your email patterns.

How does the Bayesian spam filter work?

Bayesian filtering is based on the principle that most events are dependent and that the probability of an event occurring in the future can be inferred from the previous occurrences of that event.

NOTE

Refer to the link below for more information on the mathematical basis of Bayesian filtering:

http://go.gfi.com/?pageid=ME_BayesianParameterEstimation

This same technique has been adapted by GFI MailEssentials to identify and classify spam. If a snippet of text frequently occurs in spam emails but not in legitimate emails, it is reasonable to assume that an email containing that snippet is probably spam.

Creating a tailor-made Bayesian word database

Before Bayesian filtering is used, a database of words and tokens (for example, the $ sign, IP addresses and domains) must be created. This database is built from a sample of spam email and a sample of valid email (referred to as ‘ham’).
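As an illustration only, the following minimal sketch shows how such a database could be collected from sample emails. The token pattern and the function names (tokenize, count_emails_containing) are assumptions made for this example; they do not describe the tokenizer used internally by GFI MailEssentials.

```python
import re
from collections import Counter

# Hypothetical token pattern: IPv4 addresses, domain-like strings,
# plain words and the $ sign. The real product may tokenize differently.
TOKEN_RE = re.compile(
    r"\d{1,3}(?:\.\d{1,3}){3}"        # IPv4 address, e.g. 10.0.0.1
    r"|[a-z0-9-]+(?:\.[a-z0-9-]+)+"   # domain, e.g. example.com
    r"|[a-z0-9']+"                    # ordinary word
    r"|\$"                            # dollar sign
)

def tokenize(text):
    """Split an email body into lower-case words and tokens."""
    return TOKEN_RE.findall(text.lower())

def count_emails_containing(emails):
    """For each token, count how many emails in the sample contain it."""
    counts = Counter()
    for body in emails:
        # set() ensures a token is counted once per email, not once per use
        counts.update(set(tokenize(body)))
    return counts

spam_counts = count_emails_containing(["Win $ now at 10.0.0.1", "Cheap mortgage $"])
ham_counts = count_emails_containing(["Mortgage meeting moved to Monday"])
```

Running the same function over the spam sample and the ham sample yields the two sets of counts from which per-word probabilities can later be derived.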

A probability value is then assigned to each word or token, based on how often that word occurs in spam as opposed to ham. This is done by analyzing the users' outbound email and known spam: all the words and tokens in both pools of email are analyzed to generate the probability that a particular word points to the email being spam.

This probability is calculated as in the following example:

If the word ‘mortgage’ occurs in 400 out of 3,000 spam emails and in 5 out of 300 legitimate emails then its spam probability would be 0.8889 (i.e. [400/3000] / [5/300 + 400/3000]).
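The calculation above can be reproduced with a few lines of Python. This is a minimal sketch of the formula as stated in the example; the function name spam_probability and the neutral value returned for a word that was never seen are assumptions made for illustration.

```python
def spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Probability that a word points to spam, from its frequency in each pool."""
    spam_rate = spam_hits / spam_total   # share of spam emails containing the word
    ham_rate = ham_hits / ham_total      # share of ham emails containing the word
    if spam_rate + ham_rate == 0:
        return 0.5                       # assumption: an unseen word is neutral
    return spam_rate / (spam_rate + ham_rate)

# 'mortgage' occurs in 400 of 3,000 spam emails and 5 of 300 legitimate emails
p = spam_probability(400, 3000, 5, 300)
print(round(p, 4))  # 0.8889
```

Because each count is divided by the size of its own pool, the result is not skewed when the spam sample and the ham sample contain different numbers of emails.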

Creating a custom ham email database

The analysis of ham email is performed on the company's email and therefore is tailored to that particular company.

  • Example: A financial institution might use the word ‘mortgage’ many times and would get many false positives if using a general anti-spam rule set. A Bayesian filter that has been tailored to your company through an initial training period, on the other hand, takes note of the company's valid outbound email and recognizes ‘mortgage’ as being frequently used in legitimate messages. As a result, it has a much better spam detection rate and a far lower false positive rate.

Creating the Bayesian spam database

Besides ham email, the Bayesian filter also relies on a spam data file. This spam data file must include a large sample of known spam, and the anti-spam software must constantly update it with the latest spam. This ensures that the Bayesian filter is aware of the latest spam trends, resulting in a high spam detection rate.

How is Bayesian filtering done?

Once the ham and spam databases have been created, the word probabilities can be calculated and the filter is ready for use.

On arrival, the new email is broken down into words and the most relevant words (those that are most significant in identifying whether the email is spam or not) are identified. Using these words, the Bayesian filter calculates the probability of the new message being spam. If the probability is greater than a threshold, the message is classified as spam.
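The steps described above can be sketched as follows. This is not the GFI MailEssentials implementation: it uses a common naive Bayes combination of per-word probabilities, and the function name classify, the number of words kept (top_n), the threshold of 0.9 and the value given to unknown words are all assumptions chosen for illustration.

```python
import math

def classify(tokens, word_probs, top_n=15, threshold=0.9, unknown=0.4):
    """Return (score, is_spam) for a tokenized email.

    word_probs maps each known word to its spam probability.
    """
    # Look up each distinct word; words not in the database get a
    # near-neutral value (assumption).
    probs = [word_probs.get(t, unknown) for t in set(tokens)]
    # The most relevant words are those farthest from the neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    probs = probs[:top_n]
    # Clamp so that no single word forces the score to exactly 0 or 1.
    probs = [min(max(p, 0.01), 0.99) for p in probs]
    # Combine in log space: score = prod(p) / (prod(p) + prod(1 - p))
    log_spam = sum(math.log(p) for p in probs)
    log_ham = sum(math.log(1 - p) for p in probs)
    score = 1 / (1 + math.exp(log_ham - log_spam))
    return score, score > threshold

# Hypothetical word database for the example
word_probs = {"mortgage": 0.8889, "free": 0.95, "meeting": 0.05}
score, is_spam = classify(["free", "mortgage", "offer"], word_probs)
```

Working with logarithms avoids numeric underflow when many small probabilities are multiplied together. A higher threshold lowers the false positive rate at the cost of letting more spam through.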

NOTE

For more information on Bayesian Filtering and its advantages refer to:

http://go.gfi.com/?pageid=ME_Bayesian

Training the Bayesian Analysis filter

NOTE

The Bayesian Analysis filter can also be trained using Public folders. For more information refer to Configuring the Bayesian filter.

It is recommended that the Bayesian Analysis filter is trained through the organization’s mail flow over a period of time. Bayesian Analysis can also be trained from emails that were sent or received before GFI MailEssentials was installed, by using the Bayesian Analysis wizard. This allows Bayesian Analysis to be enabled immediately.

This wizard analyzes sources of:

  • legitimate mail - for example a mailbox’s Sent Items folder
  • spam mail - for example a mailbox folder dedicated to spam emails.

Step 1: Install the Bayesian Analysis wizard

The Bayesian Analysis wizard can be installed on:

  • A machine that communicates with Microsoft® Exchange - to analyze emails in a mailbox
  • A machine with Microsoft Outlook installed - to analyze emails in Microsoft Outlook

To install the Bayesian Analysis wizard:

  1. Copy the setup file Bayesian Analysis Wizard.exe to the chosen machine. This is located in: GFI MailEssentials installation path\AntiSpam\BSW\
  2. Launch Bayesian Analysis Wizard.exe.
  3. In the initial screen, choose the language and review the End-User License Agreement. Click Next.
  4. Select the installation folder and click Next.
  5. Click Install to start installation.
  6. Click Finish when installation is complete.

Step 2: Analyze legitimate and spam emails

To start analyzing emails using the Bayesian Analysis wizard:

1. Load the Bayesian Analysis wizard from Start > Programs > GFI MailEssentials > GFI MailEssentials Bayesian Analysis Wizard.

2. Click Next in the welcome screen.

3. Choose whether to:

  • Create a new Bayesian Spam Profile (.bsp) file or update an existing one. Specify the filename and the path where the file is stored.
  • Update the Bayesian Spam profile used by the Bayesian Analysis filter directly when installing on the same machine as GFI MailEssentials.

Click Next to proceed.

4. Select how the wizard will access legitimate emails. Select:

  • Use Microsoft Outlook profile configured on this machine - Retrieves emails from a Microsoft Outlook mail folder. Microsoft Outlook must be running to use this option.
  • Connect to a Microsoft® Exchange Server mailbox store - Retrieves emails from a Microsoft® Exchange mailbox. Specify the logon credentials in the next screen.
  • Do not update legitimate mail (ham) in the Bayesian Spam profile - skip retrieval of legitimate emails. Skip to step 6.

Click Next to continue.

5. After the wizard connects to the source, select the folder containing the list of legitimate emails (e.g. the Sent items folder) and click Next.

6. Select how the wizard will access the source of spam emails. Select:

  • Download latest Spam profile from GFI website - Downloads a spam profile file that is regularly updated by collecting mail from leading spam archive sites. An Internet connection is required.
  • Use Microsoft Outlook profile configured on this machine - Retrieves spam from a Microsoft Outlook mail folder. Microsoft Outlook must be running to use this option.
  • Connect to a Microsoft® Exchange Server mailbox store - Retrieves spam from a Microsoft® Exchange mailbox. Specify the logon credentials in the next screen.
  • Do not update Spam in the Bayesian Spam profile - skip retrieval of spam emails. Skip to step 8.

Click Next to continue.

7. After the wizard connects to the source, select the folder containing the list of spam emails and click Next.

8. Click Next to start retrieving the sources specified. This process may take several minutes to complete.

9. Click Finish to close the wizard.

Step 3: Import the Bayesian Spam profile

When the wizard is not run on the GFI MailEssentials server, import the Bayesian Spam Profile (.bsp) file to GFI MailEssentials.

1. Move the file to the Data folder in the GFI MailEssentials installation path.

2. Restart the GFI MailEssentials AS Scan Engine and the GFI MailEssentials Legacy Attendant services.
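Where this import is repeated regularly, the two steps above could be scripted. The following is a minimal sketch only: the function name import_profile is hypothetical, the installation path must be supplied by the administrator, and it is assumed that the two service names given above are the Windows service display names accepted by the net stop and net start commands. By default the script only returns the restart commands without running them.

```python
import shutil
import subprocess
from pathlib import Path

# Assumption: these are the service display names as shown in the Services console.
SERVICES = [
    "GFI MailEssentials AS Scan Engine",
    "GFI MailEssentials Legacy Attendant",
]

def import_profile(bsp_file, install_path, restart=False):
    """Move a .bsp file into the Data folder and optionally restart the services."""
    data_dir = Path(install_path) / "Data"
    target = data_dir / Path(bsp_file).name
    shutil.move(str(bsp_file), str(target))

    # Build one stop and one start command per service.
    commands = []
    for name in SERVICES:
        commands.append(["net", "stop", name])
        commands.append(["net", "start", name])

    if restart:
        # Requires an elevated (administrator) prompt on the GFI MailEssentials server.
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return target, commands
```

Call import_profile with restart=True only on the GFI MailEssentials server itself; on any other machine, leave the default and review the returned commands first.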