How to Filter Spam

Introduction

The CSE email-processing servers run ClamAV, SpamAssassin, and Distributed Checksum Clearinghouse (DCC) processes to mark messages as virus-infected or spam. The rules below constitute the default .procmailrc file. Remember that procmail processes the rules in .procmailrc from top to bottom.

In this example, we filter bulk messages into a folder called IN.BULK and virus-laden messages into a folder called IN.VIRUS. To delete such messages automatically instead, replace the target locations IN.BULK and IN.VIRUS with /dev/null.

Rules


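# Messages the virus scanner marked as infected go to IN.VIRUS.
:0:
*X-Virus-Status: Infected.*
IN.VIRUS

# Messages SpamAssassin flagged as spam go to IN.BULK.
:0:
*X-Spam-Flag: YES
IN.BULK

# Messages not from buffalo.edu with a DCC count of 50 or more go to IN.BULK.
:0:
*!^From:.*buffalo\.edu
*X-DCC.*(Body|Fuz[1234])=([0-9]*[0-9][0-9][0-9]|[5-9][0-9])
IN.BULK

# Messages not from buffalo.edu with a DCC count of "many" go to IN.BULK.
:0:
*!^From:.*buffalo\.edu
*X-DCC.*(Body|Fuz[1234])=many
IN.BULK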
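
The top-to-bottom behavior of these rules can be sketched in Python. This is only an approximation for checking a set of headers by hand, not how procmail itself is implemented: the folder name INBOX for unmatched mail, the RULES table, and the helper names are assumptions made for illustration, and Python's regular expressions stand in for procmail's.

```python
import re

# (folder, conditions) pairs in the same order as the recipes above.
# A leading "!" negates a condition, as it does in procmail.
RULES = [
    ("IN.VIRUS", [r"X-Virus-Status: Infected.*"]),
    ("IN.BULK", [r"X-Spam-Flag: YES"]),
    ("IN.BULK", [r"!^From:.*buffalo\.edu",
                 r"X-DCC.*(Body|Fuz[1234])=([0-9]*[0-9][0-9][0-9]|[5-9][0-9])"]),
    ("IN.BULK", [r"!^From:.*buffalo\.edu",
                 r"X-DCC.*(Body|Fuz[1234])=many"]),
]


def condition_holds(condition: str, headers: str) -> bool:
    """Return True if one condition line is satisfied by the headers."""
    negate = condition.startswith("!")
    pattern = condition[1:] if negate else condition
    # procmail ignores case by default; MULTILINE lets "^" match at the
    # start of every header line.
    found = re.search(pattern, headers, re.IGNORECASE | re.MULTILINE) is not None
    return found != negate


def route(headers: str, default: str = "INBOX") -> str:
    """Return the folder of the first rule whose conditions all hold."""
    for folder, conditions in RULES:
        if all(condition_holds(c, headers) for c in conditions):
            return folder
    return default  # hypothetical name for the normal mail spool
```

Because the first matching rule wins, a message that is flagged as both infected and spam lands in IN.VIRUS, not IN.BULK.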

Notes

  1. Every few days, or at least once a week, look through the IN.BULK folder. Sort its contents by sender or by subject (I find subject works best because so much of *my* spam arrives with similar subjects, but some people find sorting by sender more effective), then glance at the senders and subjects. The goal is to spot messages from people you know and care about: there is a CHANCE that valid email you would have wanted to see was flagged as spam. Sorted this way, you can usually delete nearly everything in bulk without opening the messages themselves. Doing this periodically also keeps the IN.BULK folder from growing forever and wasting space.

    If you read email on a Windows PC, we suggest you bulk-delete everything in the IN.VIRUS folder; it is best not to read any of those messages at all. If you read email on something other than a Windows PC, it should be safe to treat the IN.VIRUS folder like the IN.BULK folder: check it periodically to see whether any valid messages were flagged as infected by mistake. The virus scanner is more conservative than the spam scanner, but there is still a very small chance that something valid winds up here.

Detail

Bulk Mail


:0:
*!^From:.*buffalo\.edu
*X-DCC.*(Body|Fuz[1234])=([0-9]*[0-9][0-9][0-9]|[5-9][0-9])
IN.BULK

:0:
*!^From:.*buffalo\.edu
*X-DCC.*(Body|Fuz[1234])=many
IN.BULK

procmail can be set up to look at the headers of email that arrives for you and sort messages based on what it sees. Use it to check for the X-DCC headers and route messages based on the checksum counts.

Together, these two recipes save to IN.BULK every message not sent from buffalo.edu that has a checksum count of 50 or more (or "many") in any of the pieces the server checks.
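
To see why the numeric pattern amounts to "50 or more", the condition can be tried against a range of counts. The sketch below uses Python's re module as a stand-in for procmail's matcher, and builds a sample header modeled on the one shown later in this section; the function name is made up for illustration.

```python
import re

# The X-DCC condition from the first recipe. procmail ignores case by
# default, so IGNORECASE is used to approximate that.
CONDITION = re.compile(
    r"X-DCC.*(Body|Fuz[1234])=([0-9]*[0-9][0-9][0-9]|[5-9][0-9])",
    re.IGNORECASE,
)


def recipe_matches(count) -> bool:
    """Check the condition against a sample header with the given Body count."""
    header = (
        "X-DCC-Buffalo.EDU-Metrics: electra.cse.Buffalo.EDU 1029; "
        f"Body={count} Fuz1=1 Fuz2=1"
    )
    return CONDITION.search(header) is not None


for count in (1, 9, 49, 50, 99, 100, 124, 5000):
    print(count, recipe_matches(count))
```

The alternation has two branches: any value with three or more digits, or a two-digit value whose first digit is 5 through 9. One- and two-digit values below 50 match neither branch, and the word "many" is left to the second recipe.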

Example: The second X-DCC recipe (above) will match a message if (based on the first regular expression) it does NOT come from a user in "buffalo.edu" and one of the X-DCC checksums was "many". If a message matches those regular expressions, it will be placed in your IN.BULK folder instead of your normal email spool file.

The IN.BULK rules file bulk mail that has been received by 50 or more people on campus into a file called IN.BULK. You can change the filenames to whatever you want. If you're brave, you can just delete these files every so often, but we encourage you to at least browse the subjects occasionally to make sure you didn't forget a mailing list or two when setting it up.

When email arrives, the server performs a mathematical checksum on various pieces of the message (the header, body, attachments, etc.). The server then asks a central "clearinghouse" how many times those checksums have been seen. After that, it adds a header line of this form to the message:

X-DCC-Buffalo.EDU-Metrics: electra.cse.Buffalo.EDU 1029; Body=1 Fuz1=1 Fuz2=124

This line says that the clearinghouse has seen the piece of the message labeled Fuz2 124 times. Above a certain threshold the clearinghouse stops counting and reports "many" instead of a number.
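
If you want to inspect these counts yourself, a header line of this form is easy to take apart. The following is a minimal sketch that assumes the counts always follow the semicolon as name=value pairs, as in the example above; parse_dcc_counts is a made-up helper name, not part of DCC or procmail.

```python
import re


def parse_dcc_counts(header_line: str) -> dict:
    """Return the checksum counts from an X-DCC header line.

    Numeric counts become integers; the word "many" is kept as a string.
    """
    # Everything after the ";" is a space-separated list of name=value pairs.
    _, _, metrics = header_line.partition(";")
    counts = {}
    for name, value in re.findall(r"(\w+)=(\d+|many)", metrics):
        counts[name] = value if value == "many" else int(value)
    return counts


line = "X-DCC-Buffalo.EDU-Metrics: electra.cse.Buffalo.EDU 1029; Body=1 Fuz1=1 Fuz2=124"
print(parse_dcc_counts(line))
```

For the example line this gives Body and Fuz1 counts of 1 and a Fuz2 count of 124.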

  • Bulk email is identified using simple mathematical checksums, NOT word recognition; procmail only acts on the resulting counts.
  • The central servers do not drop any email based on procmail rules. They simply add the X-DCC email header. You control what happens to your mail based on the results of the checksums.
  • It is possible that a unique email message sent to you by someone you trust ends up with the same mathematical checksum as a different spam message. In practice this is extremely rare, but it can happen.
  • By definition, mailing lists (Listservs) send their messages to many recipients, so their checksums will reach the point where they are considered "bulk".

Keep these things in mind while setting up what you do with bulk email. We strongly recommend that you set things up so that messages considered "bulk" go to a special mail folder, and that you examine that folder periodically. When looking through this folder, it is usually simple enough to look at who the message is From: and its Subject: - those alone are almost always enough to decide if the message is junk and you can simply mark it for deletion, unread. However, you may notice a message or two that you feel you should check while doing this.

References

  1. ClamAV Official Page
  2. SpamAssassin Official Page
  3. Distributed Checksum Clearinghouse (DCC) Official Page