Modular Declude Log Processor (under development)
Spam Test Statistics
Report Legend
Utility Overview & Features
Theory Of Operations And Architecture
Auto Tune & Artificial Intelligence
QUICK! - LOOKING FOR SPAM TEST STATISTICS? - See Report 1 and Report 2
The data presented
in these reports is updated every 12 hours or so based on live
traffic flowing through our systems. This
data is useful to anyone who is evaluating spam tests and blocking
lists for use in their anti-spam efforts. The list
of tests we evaluate is large, but not exhaustive by any means.
There are many good (and bad) tests not represented here. The
tests we have selected suit our needs or may be there for curiosity's
sake. Any tests that are represented (or left out) were selected
for no other purpose.
Once we release
the MDLP utility we hope that other systems will publish their
results so that a broader analysis will be available to everyone
from a wide range of systems - each with a different combination
of tests and a different mix of messages.
Report 1 (12
hours) and Report 2 (approximately one month) were created from our IMail/Declude
test bed, which receives primarily spam plus a few domains of normal
email traffic. The results shown are produced by the -html option
of our MDLP utility, which is currently in late beta.
The analysis
is very useful for evaluating the relative accuracy and capture
rates of many popular spam blocking lists and spam tests - including,
of course, Message Sniffer (see SNIFFER). When evaluating a particular
spam test pay close attention to the SQ (Spam Test Quality) number
which is a good indicator of the relative accuracy of each test
and the %OfSpam number which is a good indicator of how much of
the spam will be captured by each test.
Use
caution when looking at the SA (Spam Test Accuracy) numbers on
our reports. Since our system receives much more spam than most
production systems, it is relatively easy for spam tests to score
high accuracy points. This is because there are relatively few
opportunities for tests to show false positives. It is much better
to look at the SQ (Spam Test Quality) number to compare tests.
This number is more sensitive to errors.
By
far, the most useful way to use our reports is to compare tests
for the amount of spam that they capture, to spot any gross differences
in accuracy (quality), and to see clearly any large scale deviations
from our expectations. For example, some folks may be surprised
by the measured results from SPF, SpamCop, and Bonded-Sender.
SQ and %OfSpam
can help you understand the Risk/Benefit ratio for any given test.
For example, a test that doesn't capture very much spam had better
be extremely accurate as compared to a test that captures very
large amounts of spam. All spam tests,
no matter how good, will produce false
positives on some system somewhere at some time. If
there are going to be errors, you want them to be as few
in number as possible, and you want to balance that risk against
the benefit of the test: how much spam it catches.
You also want
to use a weighting system of some kind if at all possible. By
combining many spam tests in a well balanced weighting system
you can virtually eliminate false positives and dramatically improve
your spam capture rate.
The weighting
values shown in the analysis are developed using the artificial
intelligence module of the MDLP utility. This module is capable
of interpreting the relative strengths of each test and establishing
optimized weights for balancing these tests.
While it might
be tempting to use the weights shown in our analysis as a starting
point on your system, you may not want to do that! - Remember
that the AI has tuned these weights based on all of the shown
tests being present and on the particular mix of data passing
through our system. While this works for us, unless you duplicate
that environment you may be surprised by the results - so use
caution!
Wondering About SPF?
As it turns
out, SPF doesn't work the way most people expect it to - at least
not in practice. However, SPF is a useful
spam test! The best way to use SPF is as a test for spam,
not as a test for good messages. According to the data we see,
if a message fails SPF then it is very likely (SPFFAIL 89% Accurate)
to be spam. However, if you try to use it to indicate good messages
you will be disappointed. SPFPASS is actually 66% accurate at
predicting spam, not ham!
Wondering About SpamCop?
As much as
SpamCop gets a bad reputation (on occasion) for false positives,
it turns out to be a very accurate test (99.3%) and a fairly strong
one, capturing about 60% of the spam on our system. HOWEVER,
these numbers lie just a bit on our reports (see above re: SA & SQ)
because we get MUCH more spam than we get legitimate messages.
As a result, it's very easy for a test to score high accuracy
values on our system.
Wondering About Bonded Sender?
According
to the measured results, it turns out that Bonded Sender is not
so good at predicting ham. In fact, it's not a very good predictor
either way - apparently showing as many false predictions as true.
To be fair, it is the nature of any whitelisting test that spammers
will find a way to abuse and discredit it.
What About Message Sniffer?
If you look
closely you will notice that the SNIFFER test only shows about
91% accuracy. Can this be right? Well, yes and no. The measurement
is accurate enough - meaning there is no problem with the math
and the way the messages have been counted. HOWEVER, the meaning
of this result is not what it appears. In fact, the false positive
rate of Message Sniffer is extremely low.
The comparatively
low accuracy number (SA) of SNIFFER in our MDLP reports is an
artifact of how the results are determined. Each individual test
is evaluated against the result of the weights of all tests. This
creates a "Hyper-Accuracy Penalty" that introduces
phantom false positive results whenever any single test is better
at predicting spam than the rest of the tests in the group.
The Hyper-Accuracy Penalty Works Like This.
In a well
balanced weighting system no single test is able to cause a message
to be evaluated as spam. This is how the weighting system is able
to mitigate the errors in each test. Since it is less likely for
a number of tests to exhibit the same error at the same time,
the weighting system requires a number of tests to fire before
a message will be considered spam.
The downside
to this is that some extra spam may also get through! Consider
the case where there are five tests of about equal weight in a
system where at least 3 of these tests must fire for a message
to be considered spam. Any time a spam comes through where only
2 of the 5 tests fire, the weighting system will classify the
message as ham! This makes the two tests that fired appear as
if they have indicated a false positive. Only a manual inspection
of the email will tell for sure.
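The five-test case above can be sketched in a few lines. This is only an illustration under stated assumptions: the per-test weight of 10, the threshold of 30, and the test names A through E are invented for the example and are not taken from our configuration. The two-letter codes follow the Report Legend (first letter is what the test said, second letter is the final result).

```python
# Hypothetical setup: five tests of equal weight, three must fire for "spam".
WEIGHT = 10      # assumed weight per test (illustrative only)
THRESHOLD = 30   # equivalent to "at least 3 of 5 tests must fire"

def classify(fired):
    """Return 'spam' if the combined weight of the fired tests reaches the threshold."""
    total = WEIGHT * len(fired)
    return "spam" if total >= THRESHOLD else "ham"

def tally(fired, all_tests):
    """Score every test against the final weighted result using legend codes."""
    final = classify(fired)
    codes = {}
    for name in all_tests:
        said = "S" if name in fired else "H"      # what this test indicated
        result = "S" if final == "spam" else "H"  # what the weighting system decided
        codes[name] = said + result
    return final, codes

tests = ["A", "B", "C", "D", "E"]
# A real spam arrives, but only tests A and B fire.
final, codes = tally({"A", "B"}, tests)
print(final)   # ham
print(codes)   # A and B are scored SH (apparent false positives)
```

Tests A and B were right about the message, yet both are counted as SH because the weighted result disagreed with them.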
(Where
Message Sniffer is concerned, we have thoroughly researched the
hyper-accuracy problem. Each time we review the messages that
fall into this category we consistently find that they are, in
fact, spam that Message Sniffer tagged where other tests simply
did not fire. Based on this finding, and comments from our users,
it seems that it is a good practice to hold messages that are
tagged by Message Sniffer alone and to delete messages that fail
Message Sniffer and also reach the "spam threshold"
for a given system. Of course, this is predicated on a policy
that deletes messages at all - and many systems do not. Suffice
to say, however, it is safe to treat any message that fails Message
Sniffer with extreme prejudice.)
To the extent
any single test is able to detect spam while other tests are not,
the "hyper-accurate" test will appear to create false
positives!
The
Hyper-Accuracy Penalty is particularly strong on our system
because the system processes primarily spam - and to a greater
extent than normal, NEW spam for which we have created new anti-spam
rules. As a result, our data correctly shows SNIFFER indicating
more spam messages than the other tests can agree with.
Report Legend
Name - The
test name used in our Declude configuration. Usually this is a
shortened version of the blocking list name or the test name in
Declude. Most of the time these names are self-explanatory.
SS - Test
says spam, final result was spam. This is the count of events
where this test indicated spam and the final result agreed. For
the purposes of MDLP, this is an Accurate Spam Result.
HH - Test
says ham, final result was ham. This is the count of events where
this test indicated the message was ham and the final result agreed.
For the purposes of MDLP, this is an Accurate Ham Result.
HS - Test
says ham, final result was spam. This is the count of events where
this test indicated the message was ham and the final result did
not agree. For the purposes of MDLP, this is an Inaccurate Ham
Result.
SH -
Test says spam, final result was ham. This is the count of events
where this test indicated the message was spam and the final result
did not agree. For the purposes of MDLP, this is an Inaccurate
Spam Result, or Apparent False Positive. (See notes above regarding
the Hyper-Accuracy Penalty.)
SA - Spam
Test Accuracy. This is a calculated value: (SS-SH) / (SS+SH).
This value indicates how reliably this test predicts that the final
result will be spam. When the value is 1.0 the test is a perfect
predictor. When the value is 0.0 the test falsely predicts the
final result as often as it correctly predicts the final result.
If this value is negative then the test is actually working against
its stated purpose.
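A small worked example may help with reading the accuracy value. The function below is a sketch of the formula exactly as the legend states it. The counts are invented for illustration, and the handling of a test that never fired (returning None) is our assumption, not documented MDLP behavior.

```python
def accuracy(agree, disagree):
    """Legend formula: (agree - disagree) / (agree + disagree).

    For SA pass (SS, SH). Returns None when the test never fired,
    which is an assumption made for this sketch.
    """
    total = agree + disagree
    if total == 0:
        return None
    return (agree - disagree) / total

# Invented counts, for illustration only.
print(accuracy(950, 50))   # 0.9  -> mostly agrees with the final result
print(accuracy(50, 50))    # 0.0  -> wrong as often as right
print(accuracy(10, 90))    # -0.8 -> working against its stated purpose
```

The same function gives the ham-side accuracy when it is called with the HH and HS counts.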
SQ - Spam
Test Quality. This is a calculated value: SS². The
shape of this function causes accurate tests to approach 1.0 on
a steep curve while causing less accurate tests to approach 0.0
asymptotically. Test accuracy values tend to be very close on
the high end. This formula helps to separate close values and
severely punish mediocre performers making the tests easier to
compare.
HA - Ham
Test Accuracy. This is a calculated value: (HH-HS) / (HH+HS).
This value indicates how reliably this test predicts that the
final result will be ham. When the value is 1.0 the test is a
perfect predictor. When the value is 0.0 the test falsely predicts
the final result as often as it correctly predicts the final result.
If this value is negative then the test is actually working against
its stated purpose.
HQ - Ham
Test Quality. This is a calculated value: HH². The
shape of this function causes accurate tests to approach 1.0 on
a steep curve while causing less accurate tests to approach 0.0
asymptotically. Test accuracy values tend to be very close on
the high end. This formula helps to separate close values and
severely punish mediocre performers making the tests easier to
compare.
SI - Spam
Test Important Count. This value indicates the number of test
events where the indicated test was important.
What
it means to be "important": Each time a message
is tested (each test event) some collection of spam tests will
fire. Each of these tests will add its own weight to the total
and that total will be compared with the ham/spam threshold. If
a test has a weight high enough so that if the test were removed
the ham/spam result would be changed then that test is said to
be "important" for that event. If removing the test
would not change the final result then the test is not "important"
for that event. For example, if the total weight is 150, and the
threshold is 100, then any test with a weight greater than 50
would be "important" because removing that test would
cause the total weight to fall below the threshold of 100.
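The "important" check can be sketched as follows. This is not the MDLP source. The function name, the test names, and the individual weights are hypothetical, and the sketch assumes a message is spam when the total weight is at or above the threshold, which matches the "greater than 50" wording in the example above.

```python
def important_tests(fired_weights, threshold):
    """Return the fired tests whose removal would flip the ham/spam result.

    fired_weights maps test name -> weight for the tests that fired on one
    message. Assumes 'spam' means total >= threshold (an assumption here).
    """
    total = sum(fired_weights.values())
    is_spam = total >= threshold
    flipped = []
    for name, weight in fired_weights.items():
        without = total - weight            # total if this test had not fired
        if (without >= threshold) != is_spam:
            flipped.append(name)
    return flipped

# Total weight is 150 against a threshold of 100, as in the example above.
fired = {"T1": 60, "T2": 50, "T3": 40}
print(important_tests(fired, 100))   # ['T1']
```

Only T1 is important here: removing it leaves 90, which is below the threshold, while removing T2 or T3 leaves 100 or 110 and the message is still spam.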
HI - Ham
Test Important Count. This value indicates the number of test
events where the indicated test was important.
avgSD -
Average
Spam Dominance. This value indicates the tendency for a given
spam test to dominate any given test event. For each test event
a "dominance figure" can be calculated for each important
test by counting the number of tests that were important and taking
the reciprocal. For example: If a given test event has 4 important
tests then the dominance figure for the important tests can be
calculated as 1 / 4 or 0.25. Similarly, if a test event has only
one important test then that test was also dominant and so the
dominance figure for that event would be 1 / 1 or 1.0 meaning
that the test was absolutely dominant. The average dominance for
a given test is the average of the dominance figures collected
by that test for all of the measured test events.
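As a sketch of the arithmetic, the dominance figure and its average can be computed like this. The event data and test names are invented. The sketch also assumes the average is taken only over the events where the test was important, which follows the wording of the avgHD entry.

```python
def dominance_figure(important_count):
    """Reciprocal of the number of important tests in one test event."""
    return 1.0 / important_count

def average_dominance(events, test):
    """Average the dominance figures over events where `test` was important.

    events is a list of sets, each holding the names of the important tests
    for one test event (hypothetical data layout).
    """
    figures = [dominance_figure(len(imp)) for imp in events if test in imp]
    if not figures:
        return 0.0
    return sum(figures) / len(figures)

# Three invented test events: 4 important tests, 1 important test, 2 important tests.
events = [{"A", "B", "C", "D"}, {"A"}, {"B", "C"}]
print(average_dominance(events, "A"))   # 0.625  (average of 0.25 and 1.0)
print(average_dominance(events, "B"))   # 0.375  (average of 0.25 and 0.5)
```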
What
it means to be "dominant": For most test
events where the message is considered spam, some group of spam
tests will fire. Typically some subgroup of those tests will be
important. If only one of these tests is important then it can
be said that test was "dominant" because no other individual
test was able to swing the final result. This condition indicates
that the system is relatively insensitive to the other tests during
the test event. It is undesirable for any single test to consistently
dominate test events because this condition tends to defeat the
purpose of a weighting system.
Given a collection
of reasonably good tests, a well tuned weighting system can achieve
extremely accurate results by mitigating the error rates of the
individual tests. This is because it is unlikely for the errors
to be identical between all of the tests and so it is unlikely
that the errors will "agree" during any test event.
As a result the errors tend to be suppressed and the combined
accuracy tends to far exceed the accuracy of any individual test.
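A rough calculation shows why agreement between tests suppresses errors. The sketch below assumes five tests that each produce a false positive on 5% of legitimate messages, that their errors are statistically independent, and that three must agree. All of these figures are assumptions made for illustration. Real tests are not fully independent, so the real improvement will be smaller.

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability that at least k of n independent tests err, each with rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

single = 0.05                            # assumed false positive rate of one test
combined = prob_at_least(3, 5, single)   # three of five must agree on the error
print(f"single test: {single:.4f}")
print(f"three of five agreeing: {combined:.6f}")
```

Under these assumptions the combined false positive rate comes out at roughly a tenth of a percent, far below the 5% of any single test.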
If any single
test is weighted so that it tends to dominate test events then
the overall accuracy of the system tends to be reduced as it begins
to approach the error rate of the dominant test.
avgHD -
Average
Ham Dominance. This value indicates the tendency for a given ham
test to dominate any given test event. It is the average dominance
figure for all of the test events where this test was important.
avgSB -
Average
Spam Balance. This value indicates the remaining weight (on average)
for any test event where the final result is spam with the indicated
test removed. For example, if a given test event has a total weight
of 350 and the test in question has a weight of 50 then the balance
would be 300.
avgHB -
Average
Ham Balance. This value indicates the remaining weight (on average)
for any test event where the final result is ham with the indicated
test removed. For example, if a given test event has a total weight
of -100 and the test in question has a weight of -125 then the
balance would be 25.
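The balance figures in the two examples above are simple subtractions. A minimal sketch, with hypothetical function names and an invented list of events for the average:

```python
def balance(total_weight, test_weight):
    """Weight remaining for one test event once the indicated test is removed."""
    return total_weight - test_weight

def average_balance(events):
    """Average balance over (total_weight, test_weight) pairs for one test."""
    balances = [balance(total, weight) for total, weight in events]
    return sum(balances) / len(balances)

print(balance(350, 50))      # 300  (spam example from the legend)
print(balance(-100, -125))   # 25   (ham example from the legend)

# Invented spam events for a test weighted 50: totals of 350, 150 and 250.
print(average_balance([(350, 50), (150, 50), (250, 50)]))   # 200.0
```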
%OfSpam - %
Of Spam Captured. This value indicates how much of the spam results
were covered by the indicated spam test. This is a good indicator
of the test's contribution to the overall effort.
%OfHam - %
Of Ham Captured. This value indicates how much of the ham results
were covered by the indicated ham test. This is a good indicator
of the test's contribution to the overall effort.
avgResult - Average
Test Result. This value indicates the average result for the given
test. Usually this value should match the weight assigned to the
test during the period covered by the log file. However, if the
weight changes during that period then this value provides the
average. When the MDLP AI tunes a test it uses this value as an
indicator of how the test is currently weighted rather than looking
at the value in the GLOBAL.CFG file. This is important because
the results being measured are based on the weight in place during
the time of each test event and not the value present in the GLOBAL.CFG
file at the time the MDLP utility is run.
RSW - Recommended
Spam Test Weight. This value indicates the recommended adjusted
weight for a given spam test based on the current analysis. If
the MDLP AI is allowed to adjust the weights automatically then
the current spam test weight (first value) in the GLOBAL.CFG file
will be replaced with this number. It is important to note that
the MDLP AI works by migrating weights over time. The RSW value
at any point in time is only one step in a particular direction
and should not be read to mean that the AI has recommended this
as the "best" weight for the test. The RSW should be
thought of as a snapshot of the weight as it is approaching the
best value or perhaps an indicator of the direction and speed
of the weight migration.
RHW - Recommended
Ham Test Weight. This value indicates the recommended adjusted
weight for a given ham test based on the current analysis. If
the MDLP AI is allowed to adjust the weights automatically then
the current ham test weight (second value) in the GLOBAL.CFG file
will be replaced with this number. It is important to note that
the MDLP AI works by migrating weights over time. The RHW value
at any point in time is only one step in a particular direction
and should not be read to mean that the AI has recommended this
as the "best" weight for the test. The RHW should be
thought of as a snapshot of the weight as it is approaching the
best value or perhaps an indicator of the direction and speed
of the weight migration.
Utility Overview & Features
This section
is under construction... Stay Tuned.
Theory Of Operations And Architecture
This section
is under construction... Stay Tuned.
Auto Tune & Artificial Intelligence
This section
is under construction... Stay Tuned.