GSoC - WP - Category Suggester - [Proposal]
This project is an idea I got while reading the paper “N-Gram-Based Text Categorization” by William B. Cavnar and John M. Trenkle. The idea is to suggest categories¹ for a new post while it is being written, based on its similarity to previous posts and the categories assigned to them.
Using n-grams (an n-gram is a sequence of n letters) instead of words has several advantages:
- Language independent: there is no need for a stemming algorithm to reduce words to their root (e.g. working, worked = work). It also works for languages other than English, since the algorithm learns from the blog’s previous posts.
- Easy tokenization: there is a single way to parse the text and extract features. Since it does not work with words, it does not depend on the language.
- Perfect for blogs: most authors do not write their posts in a program that checks spelling, so many posts contain spelling errors. N-grams are tolerant of such errors.
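To make the last point concrete, here is a minimal sketch in Python (chosen only for illustration; the function names `char_ngrams` and `overlap` are my own and not part of any existing plug-in). It splits two spellings of the same word into 3-grams and measures how many they share. A word-based comparison would treat the two spellings as completely different tokens.

```python
def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text, in order."""
    # Lowercase, collapse whitespace, and pad with one space on each side
    # so that the start and end of the text also produce n-grams.
    normalized = " " + " ".join(text.lower().split()) + " "
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def overlap(a, b):
    """Jaccard overlap between two n-gram lists (1.0 means identical sets)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


correct = char_ngrams("categorization")
misspelled = char_ngrams("categorisation")

# Only the 3-grams that touch the changed letter differ; the rest are shared.
shared = set(correct) & set(misspelled)
print(len(shared), "shared 3-grams, overlap:", round(overlap(correct, misspelled), 2))
```

The same function works unchanged on text in any language, which is the point of the first two items in the list.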
The project will be based on Cavnar’s work and will extend it with a Naive Bayes classifier and other methods used today to classify texts.
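As a rough sketch of the Cavnar and Trenkle approach, each category gets a profile, which is a list of its most frequent n-grams ranked by frequency, and a new post is compared to every profile with the “out-of-place” measure. The Python code below is only an illustration under my own assumptions: the function names, the choice of n = 1 to 3, and the profile size of 300 are placeholders, not the final design.

```python
from collections import Counter


def ngram_profile(text, n_values=(1, 2, 3), top_k=300):
    """Return the top_k n-grams of text, most frequent first."""
    normalized = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in n_values:
        for i in range(len(normalized) - n + 1):
            counts[normalized[i:i + n]] += 1
    # Sort by descending frequency; ties are broken alphabetically
    # so that the ranking is deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:top_k]


def out_of_place(doc_profile, category_profile):
    """Sum of rank differences; n-grams missing from the category get a maximum penalty."""
    category_rank = {gram: rank for rank, gram in enumerate(category_profile)}
    max_penalty = len(category_profile)
    distance = 0
    for rank, gram in enumerate(doc_profile):
        if gram in category_rank:
            distance += abs(rank - category_rank[gram])
        else:
            distance += max_penalty
    return distance


def suggest(text, category_profiles, limit=3):
    """Return up to `limit` category names, closest profile first."""
    doc_profile = ngram_profile(text)
    ordered = sorted(
        category_profiles,
        key=lambda name: out_of_place(doc_profile, category_profiles[name]),
    )
    return ordered[:limit]


# Hypothetical training data: in the plug-in this would come from previous posts.
profiles = {
    "english": ngram_profile("the quick brown fox jumps over the lazy dog"),
    "spanish": ngram_profile("el veloz zorro marron salta sobre el perro perezoso"),
}
print(suggest("the quick brown fox jumps over the lazy dog", profiles))
```

A Naive Bayes extension could reuse the same n-gram counts as features instead of the rank lists.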
The project will learn from previous posts. The “knowledge” will be stored on disk for better performance (in a flat file or a database, or people will be able to write a new storage backend by defining a PHP class), or it can be learned on the fly. Another good idea is to build public (and free) knowledge databases for common things, such as the languages of texts, that people will be able to download and install on their WP.
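The pluggable storage idea could look roughly like the sketch below. It is written in Python to stay consistent with the other examples here, while the real plug-in would define the equivalent PHP classes. All class and method names (`KnowledgeStore`, `MemoryStore`, `JsonFileStore`, `load`, `save`) are hypothetical.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod


class KnowledgeStore(ABC):
    """Interface that every storage backend would implement (hypothetical)."""

    @abstractmethod
    def load(self):
        """Return a dict mapping category name to its n-gram profile."""

    @abstractmethod
    def save(self, profiles):
        """Persist a dict mapping category name to its n-gram profile."""


class MemoryStore(KnowledgeStore):
    """Keeps profiles in memory only, for the learn-on-the-fly case."""

    def __init__(self):
        self._profiles = {}

    def load(self):
        return dict(self._profiles)

    def save(self, profiles):
        self._profiles = dict(profiles)


class JsonFileStore(KnowledgeStore):
    """Flat-file backend: one JSON file holding all category profiles."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {}  # nothing learned yet
        with open(self.path, encoding="utf-8") as handle:
            return json.load(handle)

    def save(self, profiles):
        with open(self.path, "w", encoding="utf-8") as handle:
            json.dump(profiles, handle, ensure_ascii=False)


# Usage: save the learned profiles once, then reload them on later requests.
path = os.path.join(tempfile.mkdtemp(), "knowledge.json")
store = JsonFileStore(path)
store.save({"english": ["th", "he", "e "], "spanish": ["el", "os", "a "]})
print(sorted(store.load()))
```

A downloadable public knowledge database would then simply be a file in whatever format the flat-file backend reads.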
Based on my previous investigation of n-grams and text categorization (my hobby is reading papers about text categorization; you can take a look here at some implementations), it can have acceptable performance with up to 50 possible categories. The number of documents does not have a negative impact; quite the opposite, the more posts there are, the better the results will be.
The project will be independent of the WordPress interface, which means it can be a plug-in or part of the WP core. That depends on what the WP mentors or the community decide.
¹ In this document “categories” is only a generic name that means categories, tags, or language (English, Spanish or any other).
Please feel free to suggest things!