Modular Declude Log Processor (under development)

Spam Test Statistics
Report Legend
Utility Overview & Features
Theory Of Operations And Architecture
Auto Tune & Artificial Intelligence

QUICK! - LOOKING FOR SPAM TEST STATISTICS? - See Report 1 and Report 2

The data presented in these reports is updated every 12 hours or so based on live traffic flowing through our systems. This data is useful to anyone who is evaluating spam tests and blocking lists for use in their anti-spam efforts. The list of tests we evaluate is large, but not exhaustive by any means. There are many good (and bad) tests not represented here. The tests we have selected suit our needs or may be there for curiosity's sake. Any tests that are represented (or left out) were selected for no other purpose.

Once we release the MDLP utility we hope that other systems will publish their results so that a broader analysis will be available to everyone from a wide range of systems - each with a different combination of tests and a different mix of messages.

Report 1 (12 hours) and Report 2 (approx. one month) were created from our IMail/Declude test bed, which receives primarily spam plus a few domains of normal email traffic. The results shown are produced by the -html option on our MDLP utility, which is currently in late beta.

The analysis is very useful for evaluating the relative accuracy and capture rates of many popular spam blocking lists and spam tests - including, of course, Message Sniffer (see SNIFFER). When evaluating a particular spam test pay close attention to the SQ (Spam Test Quality) number which is a good indicator of the relative accuracy of each test and the %OfSpam number which is a good indicator of how much of the spam will be captured by each test.

Use caution when looking at the SA (Spam Test Accuracy) numbers on our reports. Since our system receives much more spam than most production systems, it is relatively easy for spam tests to score high accuracy points. This is because there are relatively few opportunities for tests to show false positives. It is much better to look at the SQ (Spam Test Quality) number to compare tests. This number is more sensitive to errors.
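To see why a spam-heavy traffic mix flatters SA, here is a minimal Python sketch. The firing rates (60% of spam, 1% of ham) and the message counts are hypothetical values chosen for illustration, not measurements from our reports. The SA formula is the one given in the Report Legend.

```python
def spam_accuracy(ss, sh):
    """SA = (SS - SH) / (SS + SH), as defined in the Report Legend."""
    return (ss - sh) / (ss + sh)


def simulate_sa(spam_msgs, ham_msgs, spam_hit_rate=0.60, ham_hit_rate=0.01):
    """Expected SA for a hypothetical test under a given traffic mix.

    spam_hit_rate and ham_hit_rate are illustrative assumptions: the
    fraction of spam and of ham messages on which the test fires.
    """
    ss = spam_msgs * spam_hit_rate   # test says spam, result is spam
    sh = ham_msgs * ham_hit_rate     # test says spam, result is ham
    return spam_accuracy(ss, sh)


# The same hypothetical test under two traffic mixes of 10,000 messages.
spam_heavy = simulate_sa(spam_msgs=9900, ham_msgs=100)
balanced = simulate_sa(spam_msgs=5000, ham_msgs=5000)

print(f"spam-heavy mix: SA = {spam_heavy:.4f}")
print(f"balanced mix:   SA = {balanced:.4f}")
```

The identical test scores a higher SA on the spam-heavy mix simply because there is less ham for it to misfire on.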

By far, the most useful way to use our reports is to compare tests for the amount of spam that they capture, to spot any gross differences in accuracy (quality), and to see clearly any large scale deviations from our expectations. For example, some folks may be surprised by the measured results from SPF, SpamCop, and Bonded-Sender.

SQ and %OfSpam can help you understand the Risk/Benefit ratio for any given test. For example, a test that doesn't capture very much spam had better be extremely accurate as compared to a test that captures very large amounts of spam. All spam tests, no matter how good, will produce false positives on some system somewhere at some time. If there are going to be errors - you want the errors to be as few in number as possible and you want to balance that risk against the benefit of the test - namely - how much spam does it catch!

You also want to use a weighting system of some kind if at all possible. By combining many spam tests in a well balanced weighting system you can virtually eliminate false positives and dramatically improve your spam capture rate.

The weighting values shown in the analysis are developed using the artificial intelligence module of the MDLP utility. This module is capable of interpreting the relative strengths of each test and establishing optimized weights for balancing these tests.

While it might be tempting to use the weights shown in our analysis as a starting point on your system, you may not want to do that! - Remember that the AI has tuned these weights based on all of the shown tests being present and on the particular mix of data passing through our system. While this works for us, unless you duplicate that environment you may be surprised by the results - so use caution!

Wondering About SPF?

As it turns out, SPF doesn't work the way most people expect it to - at least not in practice. However, SPF is a useful spam test! The best way to use SPF is as a spam test - not a test for good messages. According to the data we see, if a message fails SPF then it is very likely (SPFFAIL 89% Accurate) to be spam. However, if you try to use it to indicate good messages you will be disappointed. SPFPASS is actually 66% accurate at predicting spam, not ham!

Wondering About SpamCop?

As much as SpamCop gets a bad reputation (on occasion) for false positives, it turns out to be a very accurate test (99.3%) and a fairly strong one as well, capturing about 60% of the spam on our system. HOWEVER, these numbers lie just a bit (see above re: SA & SQ) on our reports because we get MUCH more spam than we get legitimate messages. As a result, it's very easy for a test to score high accuracy values on our system.

Wondering about Bonded Sender?

According to the measured results, it turns out that Bonded Sender is not so good at predicting ham. In fact, it's not a very good predictor either way - apparently showing as many false predictions as true. To be fair, it is the nature of any whitelisting test that spammers will find a way to abuse and discredit it.

What about Message Sniffer?

If you look closely you will notice that the SNIFFER test only shows about 91% accuracy. Can this be right? Well, yes and no. The measurement is accurate enough - meaning there is no problem with the math and the way the messages have been counted. HOWEVER, the meaning of this result is not what it appears. In fact, the false positive rate of Message Sniffer is extremely low.

The comparatively low accuracy number (SA) of SNIFFER in our MDLP reports is an artifact of how the results are determined. Each individual test is evaluated against the result of the weights of all tests. This creates a "Hyper-Accuracy Penalty" that introduces phantom false positive results whenever any single test is better at predicting spam than the rest of the tests in the group.

The Hyper-Accuracy Penalty Works Like This.

In a well balanced weighting system no single test is able to cause a message to be evaluated as spam. This is how the weighting system is able to mitigate the errors in each test. Since it is less likely for a number of tests to exhibit the same error at the same time, the weighting system requires a number of tests to fire before a message will be considered spam.

The down side to this is that some extra spam may also get through! Consider the case where there are five tests of about equal weight in a system where at least 3 of these tests must fire for a message to be considered spam. Any time a spam comes through where only 2 of the 5 tests fire, the weighting system will classify the message as ham!! This makes the two tests that fired appear as if they have indicated a false positive. Only a manual inspection of the email will tell for sure.
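The five-test scenario above can be sketched in Python. The test names A through E, the equal-vote rule, and the tallying helper are hypothetical stand-ins for illustration; this is not MDLP's actual code.

```python
def classify(fired, min_votes=3):
    """Equal-weight system: call it spam only if enough tests fire."""
    return "spam" if len(fired) >= min_votes else "ham"


def score_event(all_tests, fired, counts, min_votes=3):
    """Tally SS/SH/HS/HH for each test against the weighted final result."""
    final = classify(fired, min_votes)
    for name in all_tests:
        said = "S" if name in fired else "H"          # what the test said
        key = said + ("S" if final == "spam" else "H")  # vs. final result
        counts[name][key] += 1
    return final


tests = ["A", "B", "C", "D", "E"]  # five hypothetical tests of equal weight
counts = {t: {"SS": 0, "SH": 0, "HS": 0, "HH": 0} for t in tests}

# A real spam message on which only tests A and B fire.
final = score_event(tests, fired={"A", "B"}, counts=counts)

print(final)        # the weighting system's verdict for this message
print(counts["A"])  # how test A was scored for this event
```

Because only two of five tests fired, the final result is ham, and tests A and B are each charged with an SH event even though the message really was spam.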

( Where Message Sniffer is concerned, we have thoroughly researched the hyper-accuracy problem. Each time we review the messages that fall into this category we consistently find that they are, in fact, spam that Message Sniffer tagged where other tests simply did not fire. Based on this finding, and comments from our users, it seems that it is a good practice to hold messages that are tagged by Message Sniffer alone and to delete messages that fail Message Sniffer and also reach the "spam threshold" for a given system. Of course, this is predicated on a policy that deletes messages at all - and many systems do not. Suffice it to say, however, it is safe to treat any message that fails Message Sniffer with extreme prejudice. )

To the extent any single test is able to detect spam while other tests are not, the "hyper-accurate" test will appear to create false positives!

The Hyper-Accuracy Penalty is particularly strong on our system because the system processes primarily spam - and to a greater extent than normal, NEW spam for which we have created new anti-spam rules. As a result, our data correctly shows SNIFFER indicating more spam messages than the other tests can agree with.

Report Legend

Name - The test name used in our Declude configuration. Usually this is a shortened version of the blocking list name or the test name in Declude. Most of the time these names are self explanatory.

SS - Test says spam, final result was spam. This is the count of events where this test indicated spam and the final result agreed. For the purposes of MDLP, this is an Accurate Spam Result.

HH - Test says ham, final result was ham. This is the count of events where this test indicated the message was ham and the final result agreed. For the purposes of MDLP, this is an Accurate Ham Result.

HS - Test says ham, final result was spam. This is the count of events where this test indicated the message was ham and the final result did not agree. For the purposes of MDLP, this is an Inaccurate Ham Result.

SH - Test says spam, final result was ham. This is the count of events where this test indicated the message was spam and the final result did not agree. For the purposes of MDLP, this is an Inaccurate Spam Result, or Apparent False Positive. (See notes above regarding the Hyper-Accuracy Penalty.)

SA - Spam Test Accuracy. This is a calculated value: (SS-SH) / (SS+SH). This value indicates how reliably this test predicts that the final result will be spam. When the value is 1.0 the test is a perfect predictor. When the value is 0.0 the test falsely predicts the final result as often as it correctly predicts the final result. If this value is negative then the test is actually working against its stated purpose.
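As a worked illustration of the accuracy formula, here is a short Python sketch. The counts are hypothetical, and returning None when a test never fired is our own choice for the sketch, since the formula is undefined in that case. The same agree/disagree form also fits HA, with HH and HS in place of SS and SH.

```python
def accuracy(agree, disagree):
    """(agree - disagree) / (agree + disagree); None if there are no events."""
    total = agree + disagree
    if total == 0:
        return None  # no events to judge; the formula is undefined here
    return (agree - disagree) / total


# Hypothetical SS and SH counts, for illustration only.
print(accuracy(950, 50))    # mostly right: 0.9
print(accuracy(500, 500))   # right as often as wrong: 0.0
print(accuracy(100, 300))   # wrong more often than right: -0.5
```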

SQ - Spam Test Quality. This is a calculated value: SS^2. The shape of this function causes accurate tests to approach 1.0 on a steep curve while causing less accurate tests to approach 0.0 asymptotically. Test accuracy values tend to be very close on the high end. This formula helps to separate close values and severely punish mediocre performers, making the tests easier to compare.

HA - Ham Test Accuracy. This is a calculated value: (HH-HS) / (HH+HS). This value indicates how reliably this test predicts that the final result will be ham. When the value is 1.0 the test is a perfect predictor. When the value is 0.0 the test falsely predicts the final result as often as it correctly predicts the final result. If this value is negative then the test is actually working against its stated purpose.

HQ - Ham Test Quality. This is a calculated value: HH^2. The shape of this function causes accurate tests to approach 1.0 on a steep curve while causing less accurate tests to approach 0.0 asymptotically. Test accuracy values tend to be very close on the high end. This formula helps to separate close values and severely punish mediocre performers, making the tests easier to compare.

SI - Spam Test Important Count. This value indicates the number of test events where the indicated test was important.

What it means to be "important": Each time a message is tested (each test event) some collection of spam tests will fire. Each of these tests will add its own weight to the total and that total will be compared with the ham/spam threshold. If a test has a weight high enough so that if the test were removed the ham/spam result would be changed then that test is said to be "important" for that event. If removing the test would not change the final result then the test is not "important" for that event. For example, if the total weight is 150, and the threshold is 100, then any test with a weight greater than 50 would be "important" because removing that test would cause the total weight to fall below the threshold of 100.
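The "important" check can be sketched in Python as follows. The test names and weights are hypothetical, and the sketch assumes a message is spam when its total weight is at or above the threshold, which is consistent with the 150/100 example above. It is an illustration, not the MDLP implementation.

```python
def important_tests(fired_weights, threshold):
    """Return the fired tests whose removal would flip the ham/spam result.

    fired_weights maps test name -> weight contributed in this test event.
    Assumption: a message is spam when total weight >= threshold.
    """
    total = sum(fired_weights.values())
    is_spam = total >= threshold
    flipped = []
    for name, weight in fired_weights.items():
        without = total - weight  # total weight with this test removed
        if (without >= threshold) != is_spam:
            flipped.append(name)
    return flipped


# Hypothetical event matching the example: total 150, threshold 100.
event = {"TEST_A": 60, "TEST_B": 50, "TEST_C": 40}
print(important_tests(event, threshold=100))
```

Only TEST_A is important here: removing it leaves 90, which is below the threshold, while removing TEST_B leaves exactly 100 and removing TEST_C leaves 110.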

HI - Ham Test Important Count. This value indicates the number of test events where the indicated test was important.

avgSD - Average Spam Dominance. This value indicates the tendency for a given spam test to dominate any given test event. For each test event a "dominance figure" can be calculated for each important test by counting the number of tests that were important and taking the reciprocal. For example: If a given test event has 4 important tests then the dominance figure for the important tests can be calculated as 1 / 4 or 0.25. Similarly, if a test event has only one important test then that test was also dominant and so the dominance figure for that event would be 1 / 1 or 1.0 meaning that the test was absolutely dominant. The average dominance for a given test is the average of the dominance figures collected by that test for all of the measured test events.

What it means to be "dominant": For most test events where the message is considered spam, some group of spam tests will fire. Typically some subgroup of those tests will be important. If only one of these tests is important then it can be said that test was "dominant" because no other individual test was able to swing the final result. This condition indicates that the system is relatively insensitive to the other tests during the test event. It is undesirable for any single test to consistently dominate test events because this condition tends to defeat the purpose of a weighting system.
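The dominance figure and its average can be sketched in Python. The event data is hypothetical, and the sketch assumes the average is taken over only those events where the test was important, as the avgHD description states.

```python
def dominance_figure(important):
    """1 / (number of important tests) for one test event."""
    return 1 / len(important)


def average_dominance(events, test_name):
    """Average dominance figure over events where test_name was important.

    events is a list of sets, each holding the important tests for one
    test event. Returns None if the test was never important.
    """
    figures = [dominance_figure(imp) for imp in events if test_name in imp]
    return sum(figures) / len(figures) if figures else None


# Hypothetical important-test sets for three test events.
events = [
    {"TEST_A", "TEST_B", "TEST_C", "TEST_D"},  # 4 important: 0.25 each
    {"TEST_A"},                                 # TEST_A alone: 1.0
    {"TEST_B", "TEST_C"},                       # TEST_A not important here
]
print(average_dominance(events, "TEST_A"))
```

A value close to 1.0 would suggest the test is regularly the only one able to swing the result.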

Given a collection of reasonably good tests, a well tuned weighting system can achieve extremely accurate results by mitigating the error rates of the individual tests. This is because it is unlikely for the errors to be identical between all of the tests and so it is unlikely that the errors will "agree" during any test event. As a result the errors tend to be suppressed and the combined accuracy tends to far exceed the accuracy of any individual test.
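A rough way to see the error suppression is to assume, purely for illustration, that each test errs independently at the same rate. Real tests are not fully independent, so treat this Python sketch as an idealized bound rather than a measurement. The 5% error rate and the 3-of-5 rule are hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(at least k of n independent tests err), each with error rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


single = 0.05                           # hypothetical per-test error rate
combined = prob_at_least(3, 5, single)  # errors must "agree" on 3 of 5 tests

print(f"single test: {single:.4f}, 3-of-5 agreement: {combined:.6f}")
```

Under these assumptions the chance of three tests erring on the same message is far smaller than the error rate of any one test.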

If any single test is weighted so that it tends to dominate test events then the overall accuracy of the system tends to be reduced as it begins to approach the error rate of the dominant test.

avgHD - Average Ham Dominance. This value indicates the tendency for a given ham test to dominate any given test event. It is the average dominance figure for all of the test events where this test was important.

avgSB - Average Spam Balance. This value indicates the remaining weight (on average) for any test event where the final result is spam with the indicated test removed. For example, if a given test event has a total weight of 350 and the test in question has a weight of 50 then the balance would be 300.

avgHB - Average Ham Balance. This value indicates the remaining weight (on average) for any test event where the final result is ham with the indicated test removed. For example, if a given test event has a total weight of -100 and the test in question has a weight of -125 then the balance would be 25.
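The two balance examples above can be checked with a small Python sketch. The helper names are hypothetical. The averaging simply takes the mean over the supplied events, which is our reading of "on average".

```python
def balance(total_weight, test_weight):
    """Weight remaining in a test event once the given test is removed."""
    return total_weight - test_weight


def average_balance(events):
    """Mean balance over (total_weight, test_weight) pairs; None if empty."""
    if not events:
        return None
    return sum(balance(total, weight) for total, weight in events) / len(events)


print(balance(350, 50))      # spam example from the text: 300
print(balance(-100, -125))   # ham example from the text: 25
print(average_balance([(350, 50), (250, 50)]))  # hypothetical pair of events
```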

%OfSpam - % Of Spam Captured. This value indicates how much of the spam results were covered by the indicated spam test. This is a good indicator of the test's contribution to the overall effort.

%OfHam - % Of Ham Captured. This value indicates how much of the ham results were covered by the indicated ham test. This is a good indicator of the test's contribution to the overall effort.

avgResult - Average Test Result. This value indicates the average result for the given test. Usually this value should match the weight assigned to the test during the period covered by the log file. However, if the weight changes during that period then this value provides the average. When the MDLP AI tunes a test it uses this value as an indicator of how the test is currently weighted rather than looking at the value in the GLOBAL.CFG file. This is important because the results being measured are based on the weight in place during the time of each test event and not the value present in the GLOBAL.CFG file at the time the MDLP utility is run.

RSW - Recommended Spam Test Weight. This value indicates the recommended adjusted weight for a given spam test based on the current analysis. If the MDLP AI is allowed to adjust the weights automatically then the current spam test weight (first value) in the GLOBAL.CFG file will be replaced with this number. It is important to note that the MDLP AI works by migrating weights over time. The RSW value at any point in time is only one step in a particular direction and should not be read to mean that the AI has recommended this as the "best" weight for the test. The RSW should be thought of as a snapshot of the weight as it is approaching the best value or perhaps an indicator of the direction and speed of the weight migration.

RHW - Recommended Ham Test Weight. This value indicates the recommended adjusted weight for a given ham test based on the current analysis. If the MDLP AI is allowed to adjust the weights automatically then the current ham test weight (second value) in the GLOBAL.CFG file will be replaced with this number. It is important to note that the MDLP AI works by migrating weights over time. The RHW value at any point in time is only one step in a particular direction and should not be read to mean that the AI has recommended this as the "best" weight for the test. The RHW should be thought of as a snapshot of the weight as it is approaching the best value or perhaps an indicator of the direction and speed of the weight migration.

Utility Overview & Features

This section is under construction... Stay Tuned.

Theory Of Operations And Architecture

This section is under construction... Stay Tuned.

Auto Tune & Artificial Intelligence

This section is under construction... Stay Tuned.

©Copyright 2002-2004 MicroNeil Research Corporation