FuzzyOcr Plugin Install Howto

Here are the dependencies and installation steps for FuzzyOcr. Note that you do not have to install the required libraries and everything else from the sources described here: if your Linux distribution ships them as packages, installing them from there is enough. On Ubuntu 6.10 I found every dependency in the repositories, just not the program itself. So let’s see what is needed to filter out image-based spam, and how to get it running.

Netpbm

1. Install the Netpbm tools and libraries:

The first thing you’ll need is a set of image manipulation tools, provided by the popular Netpbm library. If you’re not building from the full source code, you’ll need at least the binaries themselves, as well as the libraries and header files. Depending on your distribution, these packages might be called netpbm-progs and netpbm-devel, or libnetpbm and netpbm, or something similar.

GIFLIB

2. Install the GIFLIB tools and libraries

Next you’ll want to install the GIFLIB library and its associated tools. Specifically, it’s the giffix utility you want, in order to be able to “fix” prematurely truncated GIF images, since spammers aren’t known for providing well-formed images. Your distribution repositories probably have the GIFLIB libraries and tools as libungif and libungif-progs, giflib and giflib-progs, or something similar.

Image::ExifTool

3. Install the Image::ExifTool perl module:

You’ll also need the Image::ExifTool module, which you may be able to get from your favourite distribution’s repository, or directly from CPAN. This will allow you to read the EXIF text encoded into certain image formats.

4. Patch the Image::ExifTool module:

Unfortunately Image::ExifTool needs a small patch to correct the way it handles GIFs with empty colour tables. This is not normally a big concern with most image-viewing software, but when a malformed image could crash your mail filter you want to make sure your image-scanning tools are robust enough to deal with such things gracefully.

Go to the directory where the Image/ExifTool module was installed, most likely somewhere under your /usr/lib/perl5 tree, and apply the patch to the GIF.pm file.

cd /usr/lib/perl5/site_perl/5.8.6/Image/ExifTool/
patch -p3 < patch-GIF-Colortable

String::Approx

5. Install the String::Approx perl module:

The String::Approx perl module provides “fuzzy” matching for text strings, which is helpful for detecting misspelled words, words that an OCR engine misreads, and words that spammers intentionally obfuscate. For instance, '1' and 'l' look very similar to an OCR engine, so a word like “email” could be seen as “emai1” by mistake, but with fuzzy matching the two words would be seen as equivalent. You can get this module from your favourite distribution’s repository, or directly from CPAN.

GOCR

6. Download the GOCR source code:

The OCR process is handled by GOCR, but while there are binary packages available for a number of distributions, you’ll want the source package in this case, because there’s a small patch you need to apply.

7. Patch the GOCR source code:

Some grey images have been known to trigger segmentation faults in GOCR 0.40, so a small patch has been devised to fix this vulnerability. Once again, this is not much of an issue in most normal OCR environments, since choking on an input image doesn’t usually have serious consequences, but in a spam-filtering environment we need to be more graceful in how we handle such situations.

Once you’ve downloaded the GOCR 0.40 source code package and unpacked it, go to the src subdirectory and apply the patch to the pgm2asc.c file:

cd src
patch -p1 < patch-gocr-segfault

8. Compile and install GOCR:

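Return to the top-level source directory (the configure script lives there, not in src). From there, building GOCR is straightforward, though the install step normally needs root privileges when the prefix is /usr: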

./configure --prefix=/usr
make
make install

9. Test your OCR setup:

To make sure that everything you’ve installed so far works, test it with a sample image copied from some spam you’ve received, preferably an image that contains some text. If it’s a GIF image, for instance, run it through giftopnm:

giftopnm image001.gif > image001.pnm
giftopnm: too much input data, ignoring extra...
giftopnm: bogus character 0x00, ignoring

Don’t be distressed by the informational messages you receive as a result; remember that spammers aren’t always going to supply you with a standards-compliant image to work with. In fact, they often hope that a deliberately malformed image will break your scanner, or at least trigger an error that leaves it unsure how to classify the mail, so that the message gets the benefit of the doubt.

That’s where giffix comes in. Try repairing the same image, and then run it through giftopnm again:

giffix image001.gif > image001-fixed.gif
giftopnm image001-fixed.gif > image001-fixed.pnm

This time the warning messages should be gone.

Now run the output through GOCR:

gocr image001-fixed.pnm

A second or two later, you should see a bunch of text as read from the image (presuming it had any to begin with, of course). There’s likely to be some other garbage too, and not all of the words will be read properly. In particular, the OCR software has trouble distinguishing 'r' from 'n', and 'I' from 'l'. For the most part, though, you should be able to recognize the words, and that’s good enough for our purposes, especially with our “fuzzy” matching tools.
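The manual test above can be wrapped in a small shell function, which is handy when you want to try several sample images in a row. This is only a sketch: the function name ocr_gif and the temporary-file handling are assumptions made for illustration, and it relies on nothing beyond the giffix, giftopnm and gocr invocations already shown.

```shell
# ocr_gif: repair a GIF, convert it to PNM and print the text GOCR finds.
# The function name and temporary-file layout are illustrative assumptions.
ocr_gif() {
    if [ $# -ne 1 ] || [ ! -r "$1" ]; then
        echo "usage: ocr_gif image.gif" >&2
        return 2
    fi

    # Work in a private temporary directory.
    tmpdir=$(mktemp -d) || return 1

    # Same three commands as in the manual test. giffix may complain about
    # a damaged input, so only the presence of its output is checked.
    giffix "$1" > "$tmpdir/fixed.gif" 2>/dev/null
    if [ -s "$tmpdir/fixed.gif" ] &&
       giftopnm "$tmpdir/fixed.gif" > "$tmpdir/fixed.pnm"; then
        gocr "$tmpdir/fixed.pnm"
        status=$?
    else
        status=1
    fi

    rm -rf "$tmpdir"
    return "$status"
}
```

Calling ocr_gif image001.gif should then print the recognized text directly, without leaving intermediate files behind.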

FuzzyOCR Plugin for SpamAssassin

10. Download and install the FuzzyOCR plugin for SpamAssassin:

Now that you’ve got the underlying tools installed and working, you can download the FuzzyOCR Plugin for SpamAssassin. To install it, unpack the tarball and copy FuzzyOcr.pm (the plugin itself) and FuzzyOcr.cf (its configuration file) to your SpamAssassin directory, wherever your local.cf file is located (e.g. /etc/mail/spamassassin).

Note: If there’s a loadplugin line at the top of FuzzyOcr.cf, delete it; that line belongs elsewhere, as the next step explains.

11. Tell SpamAssassin to load the FuzzyOCR plugin

Add the following lines to your v310.pre file, so that the plugin gets loaded at startup:

# FuzzyOCR - performs fuzzy Optical Character Recognition on spam images
#
loadplugin FuzzyOcr /etc/mail/spamassassin/FuzzyOcr.pm
loadplugin Mail::SpamAssassin::Timeout

Note that some binary packages of SpamAssassin don’t seem to include the Timeout module, so if you don’t have a Timeout.pm file in your SpamAssassin perl library you may need to download the full SpamAssassin source package for your version and copy the Timeout.pm file from it. If you have to do so, be sure to place Timeout.pm in the same directory as the rest of your SpamAssassin perl modules, usually something like /usr/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin.
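One quick way to find out whether Timeout.pm is present is to ask Perl to load the module. The helper below is a sketch; the name have_perl_module is made up for this example, and it only checks that the module can be found and compiled by the perl on your PATH.

```shell
# have_perl_module NAME: succeeds if Perl can load the named module.
# (Hypothetical helper name; it simply wraps "perl -MNAME -e 1".)
have_perl_module() {
    perl -M"$1" -e 1 2>/dev/null
}
```

For example, have_perl_module Mail::SpamAssassin::Timeout || echo "Timeout.pm not found" prints the message only when the module is missing. The same check works for Image::ExifTool and String::Approx from the earlier steps.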

12. Edit the plugin configuration

Edit the FuzzyOcr.cf file, adjusting any of the default settings you like, and adding target words to the default list (or removing some):

# Here we define the words to scan for

focr_word stock
focr_word investor
focr_word international
focr_word company
focr_word money
focr_word million
focr_word thousand
focr_word buy
focr_word price
focr_word trade
focr_word banking
focr_word service
focr_word kunde
focr_word volksbank
focr_word sparkasse
focr_word software
focr_word viagra
focr_word cialis
focr_word levitra
focr_word medicine
focr_word legal
focr_word medication
focr_word click here
focr_word penis
focr_word growth
focr_word drugs
focr_word pharmacy

With respect to the other configuration parameters, some of these might interest you:

# Detection threshold (see manual)
#focr_threshold 0.3
#
# This is the score for a hit after focr_counts_required matches
#focr_base_score 4
#
# This is the additional score for every additional match after focr_counts_required matches
#focr_add_score 1
#
# This is the score to give for a wrong content-type (e.g. JPEG image but content type says GIF)
#focr_wrongctype_score 1.5
#
# This is the score to give for a corrupted image (This currently affects only GIF images)
#focr_corrupt_score 3.5
#
# This is used to disable the OCR engine if the message has already more points than this value
#focr_autodisable_score 50
#
# Number of minimum matches before the rule scores
#focr_counts_required 2
#
# Verbosity level (see manual) Attention: Don't set to 0, but to 0.0 (will hopefully be fixed soon)
#focr_verbose 1
#
# Path for temporary files
focr_tmp_path /tmp

In particular, it’s useful to understand how the plugin assigns its score to the FUZZY_OCR rule. The rule is only triggered if the image contains at least focr_counts_required word matches (default: 2). Once it triggers, the rule’s score is focr_base_score plus focr_add_score for each match beyond focr_counts_required (default: 4, plus 1 for every match after the second). At default values, then, two matching words score a total of 4 points, three score 5 points, four score 6 points, and so on. Feel free to adjust these values to your tastes, but don’t forget to uncomment any setting you change!
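To make the arithmetic concrete, here is a small shell sketch of the scoring rule as described above. It is an illustration of the description, not the plugin’s own code; the function name focr_score is invented for the example, and because shell arithmetic is integer-only it handles whole-number scores only.

```shell
# focr_score MATCHES [BASE] [ADD] [REQUIRED]
# Prints the FUZZY_OCR score for a given number of word matches.
# Hypothetical helper; integer scores only (shell arithmetic has no floats).
focr_score() {
    matches=$1
    base=${2:-4}       # focr_base_score default
    add=${3:-1}        # focr_add_score default
    required=${4:-2}   # focr_counts_required default

    if [ "$matches" -lt "$required" ]; then
        echo 0         # too few matches: the rule does not fire
    else
        echo $(( base + add * (matches - required) ))
    fi
}

focr_score 1   # prints 0
focr_score 2   # prints 4
focr_score 3   # prints 5
focr_score 4   # prints 6
```

Passing other values as the optional arguments lets you preview the effect of a configuration change, for example focr_score 5 6 2 3 for a base score of 6, 2 points per extra match and 3 required matches.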

The focr_wrongctype_score setting lets you penalize mail that contains images that claim to be one type but are actually another, such as a GIF that’s advertised as a JPEG in the MIME content-type header. focr_corrupt_score similarly penalizes malformed GIF images. Eventually perhaps this will penalize malformed images of other types.

The focr_autodisable_score setting is more controversial. In principle it’s a way to save some processing cycles by skipping the OCR scan if enough other rules have already triggered on the mail to reach this minimum score (default: 50). The downside is that this mucks with efforts to statistically measure the performance of the OCR-based rules, since there’s no longer any guarantee that these rules will be called every time they should be. Upcoming Maia features such as Dynamic Score Balancing will not work properly if this setting is used, so unless you’re truly strapped for processor cycles it’s advisable to set it to an unrealistically high value (e.g. 999) to effectively disable it.

Note: If you do use the focr_autodisable_score feature, you’ll also need to make sure that the OCR plugin doesn’t get run until after most other SpamAssassin rules have had a chance to trigger. Do that by adding the following to the FuzzyOcr.cf file:

priority FUZZY_OCR   600

Finally, there are some application paths you may wish to set:

# Location of helper applications (path + binary)
#focr_bin_giffix /usr/bin/giffix
#focr_bin_giftopnm /usr/bin/giftopnm
#focr_bin_jpegtopnm /usr/bin/jpegtopnm
#focr_bin_pngtopnm /usr/bin/pngtopnm
#focr_bin_gocr /usr/bin/gocr
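Before moving on, it can save some debugging to confirm that each helper really exists at the path you configured. The following is a minimal sketch; the function name check_helpers is an assumption for illustration, and the paths you pass should be whatever you set in FuzzyOcr.cf.

```shell
# check_helpers PATH...
# Reports every path that is not an executable regular file and
# returns 1 if at least one is missing. (Hypothetical helper name.)
check_helpers() {
    missing=0
    for bin in "$@"; do
        if [ ! -f "$bin" ] || [ ! -x "$bin" ]; then
            echo "missing or not executable: $bin" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

With the default locations listed above, the call would be check_helpers /usr/bin/giffix /usr/bin/giftopnm /usr/bin/jpegtopnm /usr/bin/pngtopnm /usr/bin/gocr; no output means all five helpers were found.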

13. Test the installation

Now you can verify that you’ve got all the paths set properly and that you have all of the necessary pieces in place. As your amavis/maia user, run:

spamassassin -D --lint

If everything is working properly, this shouldn’t produce any errors, and in particular you should see something like:

...
plugin: loading FuzzyOcr from /etc/mail/spamassassin/FuzzyOcr.pm
plugin: registered FuzzyOcr=HASH(0xb9fde84)
plugin: loading Mail::SpamAssassin::Timeout from @INC
plugin: registered Mail::SpamAssassin::Timeout=HASH(0xb18501c)
...
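The debug output is long, so it can help to filter it for the line that matters. The function below is a sketch that looks for the "loading FuzzyOcr" message shown in the sample output; the name fuzzyocr_loaded is invented for this example.

```shell
# fuzzyocr_loaded: reads "spamassassin -D --lint" output on standard input
# and succeeds if it contains the plugin-loading line for FuzzyOcr.
# (Hypothetical helper name.)
fuzzyocr_loaded() {
    grep -q 'plugin: loading FuzzyOcr'
}
```

A typical call would be spamassassin -D --lint 2>&1 | fuzzyocr_loaded && echo "FuzzyOcr plugin loaded". The 2>&1 assumes the debug messages are written to standard error, which is worth verifying on your installation.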

If for some reason you don’t see the FuzzyOCR module being loaded, it may be because of some security-related settings in your operating system that may require Perl modules to have their execute bits set. Usually this is unnecessary (and inadvisable), but one Maia user has reported that this was necessary to get the plugin to load properly:

chmod 744 FuzzyOcr.pm

14. Tell Maia about the new rules

If everything is working properly, you’ll want to run the load-sa-rules script to make sure that Maia discovers the new rules you just added (in the FuzzyOcr.cf file). There should be a handful of new rules:

[load-sa-rules] Adding new rule: FUZZY_OCR (Mail contains an image with common spam text inside)
[load-sa-rules] Adding new rule: FUZZY_OCR_WRONG_CTYPE (Mail contains an image with wrong content-type set)
[load-sa-rules] Adding new rule: FUZZY_OCR_CORRUPT_IMG (Mail contains a corrupted image)
[load-sa-rules] 3 new rules added (3213 rules total), all scores updated.

15. Restart amavisd-maia

Now you can restart amavisd-maia and start looking for these rules in your log files, and in Maia’s mail viewer, once you begin receiving mail items that contain images with text in them. The processing time on such items will be a few seconds longer than usual, but mail items without images in them won’t be affected, since the FuzzyOCR plugin won’t be called in those cases.