
SpamAssassin on mail.pa.msu.edu

Part 3: SpamAssassin's "learning" feature


  1. SpamAssassin "learning" overview

    SpamAssassin analyzes the patterns within each incoming E-mail message, enters the pertinent information into a database, and compares the message's patterns with those of messages you have previously received.

    This "learning" behavior (also referred to as Bayesian analysis) is based on the particular messages received by each individual user. The analysis yields a probability based on how similar the message is to messages in the database previously identified as spam, and how similar it is to messages previously identified as "not spam" (or "ham"). If the database of previously identified messages is large enough (at least 200 each of spam and non-spam), this probability is converted into a score that is added to or subtracted from the message's total score, influencing whether the message is ultimately labelled spam or non-spam.
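
    As a rough illustration of the idea, the Python sketch below combines per-word evidence from learned spam and ham messages into a single probability, and declines to answer until at least 200 messages of each kind have been learned. It uses a plain naive Bayes combination with add-one smoothing, which is a simplification and not SpamAssassin's actual formula; the function names and the word counts are hypothetical.

```python
import math
from collections import Counter

MIN_LEARNED = 200  # minimum learned spam and ham messages, per the text above


def token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate how strongly one token points towards spam (0..1)."""
    # Add-one smoothing keeps both fractions strictly between 0 and 1.
    p_in_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_in_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return p_in_spam / (p_in_spam + p_in_ham)


def message_spam_prob(tokens, spam_counts, ham_counts, n_spam, n_ham):
    """Combine token evidence; return None if too little has been learned."""
    if n_spam < MIN_LEARNED or n_ham < MIN_LEARNED:
        return None  # database too small: no Bayesian score is applied
    log_odds = 0.0
    for token in set(tokens):
        p = token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham)
        log_odds += math.log(p) - math.log(1.0 - p)
    # Clamp so math.exp cannot overflow on very long messages.
    log_odds = max(-700.0, min(700.0, log_odds))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical counts: number of learned messages containing each word.
spam_counts = Counter({"viagra": 180})
ham_counts = Counter({"meeting": 150})
print(message_spam_prob(["viagra"], spam_counts, ham_counts, 200, 200))
print(message_spam_prob(["meeting"], spam_counts, ham_counts, 200, 200))
```

    A word seen mostly in learned spam pushes the result towards 1, and a word seen mostly in learned ham pushes it towards 0.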

  2. Auto-learn feature

    If a message scores high enough in each of several categories of SpamAssassin tests, then in addition to receiving its score (which will generally classify it as "spam"), it qualifies for the "autolearn=spam" classification (and ends up in the user's IN.spam folder). Its entry in the database is then associated with spam.

    Likewise, a message which scores low enough will qualify for the "autolearn=ham" classification and its entry in the database will be associated with non-spam.

    Messages scoring in between will not be entered into the database automatically. This is an example of SpamAssassin's conservatism: if it is not sure enough that a message is spam or non-spam, then regardless of the score (which may actually be high enough for the message to be flagged as spam), it will not use the message as a basis of comparison during the Bayesian analysis. In general, the contents of a user's IN.probable-spam folder fit into this "probably spam but not sure enough to automatically learn from it" category.

    Unfortunately, this limits the usefulness of the "learning" behavior, in that only messages which are quite clearly spam or are very unlikely to be spam are entered into the database. This means that the auto-learn process does not automatically help to distinguish spam from non-spam in the "grey area" where the messages have high enough scores to be candidates for spam classification, but not so high that they're virtually certain to be spam. The auto-learn feature thus only reinforces the classification of those messages which would be called spam for other reasons anyway.

    Fortunately, SpamAssassin offers a way to clarify scoring in the "grey area": the intervention of human intelligence, as described in the next section.
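
    The auto-learn decision described in this section can be sketched in Python as follows. The threshold values are assumptions: 12.0 and 0.1 are the stock SpamAssassin defaults for the bayes_auto_learn_threshold_spam and bayes_auto_learn_threshold_nonspam settings, and the minimum of 3 points each from header and body tests follows the upstream auto-learn documentation. The settings actually used on mail.pa.msu.edu may differ, and the function name is hypothetical.

```python
# Assumed values (stock SpamAssassin defaults); the local settings may differ.
SPAM_THRESHOLD = 12.0     # at or above this total, a message may be learned as spam
HAM_THRESHOLD = 0.1       # at or below this total, a message is learned as ham
MIN_HEADER_POINTS = 3.0   # spam must also score in the header test category...
MIN_BODY_POINTS = 3.0     # ...and in the body test category


def autolearn_decision(total, header_points, body_points):
    """Return "spam", "ham", or "no" for one scored message."""
    if (total >= SPAM_THRESHOLD
            and header_points >= MIN_HEADER_POINTS
            and body_points >= MIN_BODY_POINTS):
        return "spam"
    if total <= HAM_THRESHOLD:
        return "ham"
    # Grey area: the message may still be flagged as spam,
    # but it is not entered into the learning database.
    return "no"


print(autolearn_decision(15.0, 4.0, 5.0))   # high in both categories
print(autolearn_decision(6.0, 3.0, 3.0))    # grey area
print(autolearn_decision(-2.0, 0.0, 0.0))   # clearly not spam
```

    Note that a message with a high total but nearly all of its points in one category still lands in the grey area, which is why manual learning is useful.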

  3. Manual "learning" procedure

    To assist SpamAssassin in its quest to distinguish good messages from bad, human intervention is allowed. SpamAssassin is willing to be told that certain messages are definitely spam, and that other messages are definitely not spam. Here is the procedure, which most people can carry out whenever they feel like it. When first starting with SpamAssassin, doing this about once a week for the first month is a good idea. Later, once the learning process has been primed with the user's own mix of spam and non-spam messages, it can be done less often.

    • From your E-mail program:
      1. Make sure that the IN.probable-spam folder contains only spam messages.
      2. Put any definitely-spam messages that ended up in your regular INBOX into the IN.probable-spam folder.

    • Now, log in to kepler.pa.msu.edu using ssh (preferred) or telnet (discouraged for security reasons).
      1. Type the command
        thisisspam  mail/IN.probable-spam
        at the command prompt. The command reports on each message as it processes it.
        It typically takes 6 to 12 seconds per message, so if you have hundreds of messages, it will take some time. It should be safe to minimize this login window and work on something else while the learning progresses.
      2. Log out with the command exit.

    The thisisspam command may be run on any mail folder you like. It takes a file name as an argument. In most cases, mail/IN.probable-spam (i.e., the IN.probable-spam file in the mail sub-directory) is the one you want.
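
    To get a feel for how long a learning run will take, the 6 to 12 seconds per message quoted above can be turned into a rough estimate. The Python sketch below assumes the folder is a single file in the traditional Unix mbox format, which is not confirmed here; the function names are hypothetical.

```python
import mailbox

SECONDS_PER_MESSAGE = (6, 12)  # typical range quoted in the procedure above


def count_messages(mbox_path):
    """Count the messages in a folder file, assuming it is in mbox format."""
    # create=False: raise an error instead of creating a missing file.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def estimate_learning_time(n_messages):
    """Return (shortest, longest) expected run time in seconds."""
    low, high = SECONDS_PER_MESSAGE
    return n_messages * low, n_messages * high


# Example: a folder holding 300 messages.
shortest, longest = estimate_learning_time(300)
print(f"roughly {shortest // 60} to {longest // 60} minutes")
```

    On the mail server, count_messages("mail/IN.probable-spam") would supply the message count for the estimate, provided the mbox assumption holds.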

    Generally, the messages in the IN.spam folder have scored high enough that they have already been learned by the auto-learn feature, so running the thisisspam command on the IN.spam folder is unnecessary.

    There is also an equivalent "thisisnotspam  filename" command, but most users will not need to use it under ordinary circumstances.

    Another command, myspamstats, typed at the command prompt with no arguments, shows how many entries in the learning database are associated with "spam" and how many with "non-spam" (or "ham"). Recall that each of these counts must be at least 200 before the comparison of new messages with the learned messages contributes any points to a message's total score.
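
    Given the two counts reported by myspamstats, a small Python helper can show how far the database is from the 200-message minimum. This is only an illustration: the counts are typed in by hand, and the function name is hypothetical.

```python
MIN_LEARNED = 200  # minimum entries of each kind, per the text above


def bayes_status(n_spam, n_ham):
    """Report whether Bayesian scoring can count, and what is still missing."""
    more_spam = max(0, MIN_LEARNED - n_spam)
    more_ham = max(0, MIN_LEARNED - n_ham)
    return {
        "active": more_spam == 0 and more_ham == 0,
        "more_spam_needed": more_spam,
        "more_ham_needed": more_ham,
    }


# Example: 250 spam entries but only 120 non-spam entries learned so far.
print(bayes_status(250, 120))
```

    In this example the spam side is already sufficient, but 80 more non-spam messages must be learned before the Bayesian comparison affects scores.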

  4. Links



Questions not covered in this FAQ? Make sure to send them in!

Last Updated: Tuesday, 21 March 2005 by G J Perkins