PHP Spam detection project
While I was googling to learn how LibTextCat works internally, I found the paper that changed my life: N-Gram-Based Text Categorization. The paper is about n-grams (an n-gram is a sub-sequence of n items from a given sequence) and how they can help to build language-independent algorithms for categorizing texts.
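To make the definition concrete, here is a minimal Python sketch of character n-grams. It is not code from LibTextCat or from the paper; the function names `char_ngrams` and `ngram_profile` are made up for this illustration, and ranking n-grams by frequency is only a simplified version of the profile idea.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every contiguous run of n characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_profile(text, n=3, top=5):
    """Return the `top` most frequent character n-grams, most frequent first."""
    counts = Counter(char_ngrams(text.lower(), n))
    return [gram for gram, _ in counts.most_common(top)]


print(char_ngrams("spam", 2))  # ['sp', 'pa', 'am']
```

Because the n-grams are built from raw characters, nothing here depends on word boundaries or on a dictionary, which is what makes the approach language independent.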
Some time ago, to put the theory into practice, I wrote the Bayesian Spam Filter class to test a hypothesis of mine: “traditional spam detection techniques can be mixed with n-grams to get a generic, language-independent spam detection system”.
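As a rough illustration of that hypothesis, the sketch below feeds character n-grams instead of words into a naive Bayes classifier. It is a toy written in Python, not the Bayesian Spam Filter class itself: the class name `NgramBayes`, its methods, and the add-one smoothing are all assumptions made for the example, and it expects at least one training message per label.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Lowercase the text and return its contiguous character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramBayes:
    """Toy naive Bayes classifier over character n-grams (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Add one example message; label must be 'spam' or 'ham'."""
        self.counts[label].update(char_ngrams(text, self.n))
        self.docs[label] += 1

    def log_score(self, text, label):
        """Log prior plus the summed log likelihood of each n-gram in text."""
        counts = self.counts[label]
        total = sum(counts.values())
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        # Prior: share of training messages that carry this label.
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for gram in char_ngrams(text, self.n):
            # Add-one smoothing, with one extra slot for n-grams never
            # seen in training, so an unseen n-gram cannot zero the score.
            score += math.log((counts[gram] + 1) / (total + vocab + 1))
        return score

    def is_spam(self, text):
        return self.log_score(text, "spam") > self.log_score(text, "ham")


spam_filter = NgramBayes()
spam_filter.train("buy cheap pills now", "spam")
spam_filter.train("cheap pills cheap offer", "spam")
spam_filter.train("see you at the meeting tomorrow", "ham")
spam_filter.train("lunch meeting moved to noon", "ham")
print(spam_filter.is_spam("cheap pills"))       # True
print(spam_filter.is_spam("meeting tomorrow"))  # False
```

The only language-specific step in a word-based filter is the tokenizer; swapping it for `char_ngrams` is what removes that dependency in this sketch.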
The first versions worked with reasonably acceptable performance.
The new version of the Bayesian Spam Filter class has a new feature that decreases the size of the knowledge database, and it adds a new algorithm that gets better results and improves the computation time and memory requirements at comparison time.
All those benefits gave me a great idea: building an Akismet-like site. Of course, the class is still missing too much to be as good as or better than Akismet. The main feature is that it will be free and open source. My idea is to give some work to my new server at Linode by building a website where people will be able to test it. Unlike Akismet, though, you'll need to register to download a copy of the database (and to synchronize it, sending spam and ham examples every day), so it will help you to have your own mechanism for detecting spam inside your server. Basically it will be PHP, SQLite and some method for copying databases efficiently (something like rsync).
It is only an idea, so your feedback will be very useful.






