SpamAssassin on mail.pa.msu.edu
Part 3: SpamAssassin's "learning" feature
- SpamAssassin "learning" overview
The SpamAssassin
software analyzes the patterns within each incoming E-mail
message, enters the pertinent information into a database, and compares
the message to the patterns of your previously received messages.
This "learning" behavior (also referred to as Bayesian analysis) is
based on the particular messages received by each individual user. The
analysis yields a final probability number based on how similar the
message is to messages in the database previously identified as spam, and how
similar it is to
messages in the database previously identified as "not spam" (or
"ham"). If the
database of previously identified messages is large enough (at least 200
each of spam and non-spam), this probability is converted into a score
that is added to or subtracted from the message's total score, influencing
whether the message is ultimately labelled spam or non-spam.
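As a rough illustration of the kind of probability described above, the
sketch below estimates how "spammy" a single word is from how often it
appeared in previously learned spam and non-spam messages. This is a
deliberately simplified textbook estimate, not SpamAssassin's actual
calculation, and the function name token_spam_prob is hypothetical.

```shell
# Simplified illustration only: NOT SpamAssassin's real algorithm.
# token_spam_prob SPAM_HITS HAM_HITS NSPAM NHAM
#   SPAM_HITS = learned spam messages containing the word
#   HAM_HITS  = learned non-spam messages containing the word
#   NSPAM     = total learned spam messages (must be > 0)
#   NHAM      = total learned non-spam messages (must be > 0)
token_spam_prob() {
    awk -v s="$1" -v h="$2" -v ns="$3" -v nh="$4" 'BEGIN {
        fs = s / ns                 # fraction of spam containing the word
        fh = h / nh                 # fraction of non-spam containing the word
        if (fs + fh == 0) { print "0.50"; exit }   # never seen: neutral
        printf "%.2f\n", fs / (fs + fh)
    }'
}

# A word seen in 40 of 200 spam messages but only 2 of 200 non-spam ones:
token_spam_prob 40 2 200 200    # prints 0.95
```

A value near 1 suggests spam, a value near 0 suggests non-spam, and 0.50
means the word tells us nothing either way.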
- Auto-learn feature
If a message scores high enough in each of several
categories of SpamAssassin tests, then in addition to receiving its score
(which generally will classify it as "spam"), it qualifies for the "autolearn=spam"
classification (and it will end up in the user's IN.spam
folder). Its entry in
the database will be associated with spam.
Likewise, a message which
scores low enough will qualify for the "autolearn=ham"
classification and its entry in the database will be associated with
non-spam.
Messages scoring in between will not be entered into the
database automatically. This is an example of SpamAssassin's
conservatism. If it is not quite sure whether a message is spam or
non-spam, then regardless of the score (which may actually be high enough
for the message to be flagged as spam), it will not use that message as a basis of
comparison during the Bayesian analysis. In general, the contents of a
user's IN.probable-spam
folder fit into this "probably spam but not sure enough to
automatically learn from it" category.
Unfortunately, this limits the
usefulness of the "learning" behavior, in that only messages which are
quite clearly spam or are very unlikely to be spam are entered into the
database. This means that the auto-learn process does not automatically
help to
distinguish spam from non-spam in the "grey area" where the messages
have high enough scores to be candidates for spam classification, but
not so high that they're virtually certain to be spam. The auto-learn
feature thus only reinforces the classification of those messages which
would be called spam for other reasons anyway.
Fortunately,
SpamAssassin offers a way to clarify scoring in the "grey area": the
intervention of human intelligence, as described in the next
section.
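The three auto-learn outcomes described in this section can be sketched
as a small decision function. The thresholds used here (12.0 for spam,
0.1 for non-spam, and at least 3 points each from header tests and body
tests) are SpamAssassin's stock defaults; whether this server uses the
same values is an assumption, and the function name autolearn_class is
hypothetical.

```shell
# Assumed values: SpamAssassin's stock defaults, which may differ on this server.
SPAM_THRESHOLD=12.0       # at or above this score: candidate for autolearn=spam
HAM_THRESHOLD=0.1         # at or below this score: autolearn=ham
MIN_CATEGORY_POINTS=3     # minimum points required from header AND body tests

# autolearn_class SCORE HEADER_POINTS BODY_POINTS
# Prints "spam", "ham", or "no" (message is not learned automatically).
autolearn_class() {
    awk -v score="$1" -v hdr="$2" -v body="$3" \
        -v hi="$SPAM_THRESHOLD" -v lo="$HAM_THRESHOLD" \
        -v min="$MIN_CATEGORY_POINTS" 'BEGIN {
        if (score >= hi && hdr >= min && body >= min) print "spam"
        else if (score <= lo) print "ham"
        else print "no"
    }'
}

autolearn_class 15.2 4.1 5    # prints spam
autolearn_class 6.5 3 3       # prints no (the "grey area")
autolearn_class -1.3 0 0      # prints ham
```

The middle case is the "grey area": the message may still be flagged as
probable spam, but nothing is learned from it unless a person steps in.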
- Manual "learning" procedure
To assist SpamAssassin in its quest to distinguish good
messages from bad, human intervention is allowed. SpamAssassin is
willing to be told that certain messages are definitely spam, and that
other messages are definitely not spam. The procedure below can be
run whenever it is convenient. When first starting with SpamAssassin,
running it about once a week
for the first month is a good idea. Later, once the learning process has
been primed with the user's
own mix of spam and non-spam messages, it can be run less often.
- From your E-mail program:
- Make sure that the IN.probable-spam
folder contains only spam messages.
- Put any definitely-spam messages which ended up in
your regular INBOX into the IN.probable-spam folder.
- Now, log into
kepler.pa.msu.edu using ssh
(preferred) or telnet (discouraged for security
reasons).
- Type the command
thisisspam mail/IN.probable-spam
at the command prompt.
The command reports on each
message as it processes it.
It typically takes 6 to 12
seconds per message, so if you have hundreds of messages, it will take
some time. It should be safe to minimize this login window and work on
something else while the
learning progresses.
- Log out with the command
exit.
The thisisspam command may be run on any
mail folder you like. It takes a
file name as an argument. In most cases, mail/IN.probable-spam
(i.e., the IN.probable-spam file in the mail
sub-directory) is the one you want.
Generally, the messages in the IN.spam folder
have scored
high enough that they
have already been learned by the auto-learn feature, so running the thisisspam
command on the IN.spam folder is unnecessary.
There is also an equivalent "thisisnotspam filename"
command, but most
users will not need to use it under ordinary circumstances.
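Since thisisspam takes a single file name, users who keep spam in more
than one folder may want a small wrapper that runs it on each folder in
turn. The following is a minimal sketch; the function name
learn_spam_folders and the DRY_RUN switch are hypothetical and are not
part of the commands installed on the server.

```shell
# Hypothetical helper: run thisisspam on each mail folder named on the
# command line. Set DRY_RUN=1 to print the commands without running them.
learn_spam_folders() {
    for folder in "$@"; do
        # Each mail folder is a single file; skip anything that is not.
        if [ ! -f "$folder" ]; then
            echo "skipping $folder: not a file" >&2
            continue
        fi
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "thisisspam $folder"
        else
            thisisspam "$folder"
        fi
    done
}

# Example (commented out so nothing runs by accident):
# DRY_RUN=1 learn_spam_folders mail/IN.probable-spam
```

Running it once with DRY_RUN=1 first is a cheap way to confirm that only
folders containing nothing but spam are about to be learned.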
Another command, myspamstats, typed at the
command prompt with no arguments,
shows how many entries in the learning database are associated with
"spam" and how many
with "non-spam" (or "ham"). Recall that each of these counts must be at least
200 before the comparison of new messages against the learned messages
contributes any points
towards a message's total score.
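The "at least 200 of each" rule can be written as a one-line check. The
sketch below assumes you read the two counts from the myspamstats report
and type them in by hand; it does not parse the report, and the function
name bayes_is_active is hypothetical.

```shell
MIN_EACH=200    # minimum learned messages of each kind, as stated above

# bayes_is_active SPAM_COUNT HAM_COUNT
# Succeeds only when both counts have reached MIN_EACH.
bayes_is_active() {
    [ "$1" -ge "$MIN_EACH" ] && [ "$2" -ge "$MIN_EACH" ]
}

# Example: 350 learned spam messages but only 120 learned non-spam ones.
if bayes_is_active 350 120; then
    echo "learned messages now count towards the score"
else
    echo "not yet: both counts must reach $MIN_EACH"
fi
```

In this example the spam count is ample, but the learning results still
do not count because too few non-spam messages have been learned.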
Questions not covered in this FAQ? Make sure to send them in!
Last Updated: Tuesday, 21 March 2005 by G J Perkins