PHP Spam detection project

While I was searching for how LibTextCat works internally, I found a paper that changed my life: N-Gram-Based Text Categorization. The paper discusses n-grams (an n-gram is a sub-sequence of n items from a given sequence) and how they can help build language-independent algorithms for categorizing text.
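To make the definition concrete, here is a minimal sketch of character n-gram extraction. It is written in Python for brevity and is only an illustration; the function name char_ngrams is made up for this example and comes from neither the paper nor LibTextCat.

```python
def char_ngrams(text, n):
    """Return every contiguous run of n characters in text, in order.

    Hypothetical helper for illustration only. If the text is shorter
    than n, the result is an empty list.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("spam", 2))  # ['sp', 'pa', 'am']
print(char_ngrams("spam", 3))  # ['spa', 'pam']
```

Because the items are raw characters rather than words, nothing here depends on knowing where words start or end, which is what makes the approach language independent.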

Some time ago, to put this into practice, I wrote the Bayesian Spam Filter class to test a hypothesis of mine: “traditional spam-detection techniques can be combined with n-grams to build a generic, language-independent spam detection system”.
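As a rough illustration of that idea (not the actual code of the Bayesian Spam Filter class, which is PHP), the sketch below trains a naive Bayes classifier on character trigrams instead of words. The class name, method names, the trigram size, and the add-one smoothing are all assumptions chosen for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Lowercase the text and split it into overlapping character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramBayes:
    """Toy naive Bayes spam filter over character n-grams (hypothetical)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Add one example; label must be 'spam' or 'ham'."""
        self.counts[label].update(char_ngrams(text, self.n))
        self.docs[label] += 1

    def spam_log_odds(self, text):
        """Return log(P(spam | text) / P(ham | text)) under naive Bayes."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        totals = {k: sum(c.values()) for k, c in self.counts.items()}
        # Smoothed prior from the number of training examples per class.
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for gram in char_ngrams(text, self.n):
            # Add-one smoothing; the extra +1 in the denominator reserves
            # mass for unseen n-grams and avoids division by zero.
            p_spam = (self.counts["spam"][gram] + 1) / (totals["spam"] + vocab + 1)
            p_ham = (self.counts["ham"][gram] + 1) / (totals["ham"] + vocab + 1)
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, text):
        return self.spam_log_odds(text) > 0


classifier = NgramBayes()
classifier.train("buy cheap pills now", "spam")
classifier.train("cheap pills online buy now", "spam")
classifier.train("see you at the meeting tomorrow", "ham")
classifier.train("thanks for the meeting notes", "ham")
print(classifier.is_spam("cheap pills"))       # True
print(classifier.is_spam("meeting tomorrow"))  # False
```

Since the features are character sequences, the same code can be trained on examples in any language without a tokenizer or a stop-word list.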

The first versions worked with reasonably acceptable performance.

The new version of the Bayesian Spam Filter class has new features that reduce the size of the knowledge database, and it adds a new algorithm that gets better results and improves computation time and memory requirements at comparison time.

All those benefits gave me a great idea: to build an Akismet-like site. Of course, the class is still missing too much to be as good as or better than Akismet. The main feature is that it will be free and open source. My idea is to give some work to my new server at Linode by building a website where people will be able to test it. Unlike with Akismet, though, you’ll need to register to download a copy of the database (and to synchronize, sending spam and ham examples every day), so it will help you have your own mechanism for detecting spam inside your server. Basically it will be PHP, SQLite, and some method for copying databases efficiently (something like rsync).

It is only an idea, so your feedback will be very useful.

6 Comments

  1. Comment by Sander on August 12, 2008 5:13 am

    Should this idea become real, feel free to contact me! I’m the host of a large forum community (running our own software) and we’re battling spam! The main problem is that services like Akismet are too slow to check every forum post/reply for spam, so we need fast in-house software.

  2. Comment by buggedcom on August 15, 2008 3:02 pm

    An additional way to check for spam is to check URL links against a blacklist. Add this to the n-gram-based detection and it could eventually get as good as Akismet.

  3. Comment by Andries Louw Wolthuizen on August 15, 2008 8:15 pm

    Excellent idea! I was thinking of developing the same concept. I manage over 100 sites, and (almost) every one of them has a guestbook, comments, or some other user-input form that needs spam checking.

    My idea was to let users report spam that passed the detection and send it to a central interface so I could double-check it, and then to edit the spam detection rules in a file that every website downloads periodically.

    The power of Akismet is the central server: spammers don’t know the algorithm, and spam definitions can be updated frequently and fast. At the same time it is a big disadvantage, because making a connection to Akismet (or any other server outside your LAN) is slow (Sander already mentioned that).

    You could start simple by providing a central, downloadable, knowledge-database via a big-free-host like Sourceforge, Google Code, or something else, so that we can check on updates, download them, and use them in combination with your script.

    Later you can add an option to report something as spam/ham, but you’ll have to find a method that keeps transfers small and not too frequent.

    Maybe it can be achieved by doing the system in reverse: we provide URLs to our ratings on messages, stored in an HTTP-accessible file, and your server can download them whenever you want and/or have time to evaluate them.

    A big advantage of fetching ratings from your users on demand is that your server can decide when it has the resources and time to download (and parse) them.

    If you let users push the ratings, your server will get enormous numbers of requests to handle at the internet’s peak times.

  4. Comment by Andries Louw Wolthuizen on August 15, 2008 8:31 pm

    Oh, and please, please, don’t make this system a “Wordpress-only-plugin”, but make it a function that can be implemented in every system.

    It would also help to make the knowledge-database file universal, so that anyone could write a script in their own programming language that uses this file, because it is not easy to parse a PHP-serialized array with (for example) ASP.

    I would personally use the function by storing all messages (like Akismet does) in a queue, and checking 5-10 messages per minute for spam with a cron job to keep the load low.

  5. Comment by David on December 27, 2008 9:35 pm

    Thanks for the PHP libraries - I need to implement something like this and your scripts will help in looking into text algorithms.

  6. Comment by Irvin on January 7, 2009 5:06 pm

    8s9NDGxNCfL0K
