This project grew out of an idea I had while reading the paper N-Gram-Based Text Categorization, written by William B. Cavnar and John M. Trenkle. The idea is to suggest categories¹ for a new post while it is being written, based on its similarity to the categories of previous posts.
Using n-grams (an n-gram is a sequence of n letters) instead of words has several advantages:
- Language independent: there is no need for a stemming algorithm to reduce words to their root (e.g. working, worked → work). It also does not work only for English, since the algorithm learns from previous posts.
- Easy tokenization: there is a single way to parse the text and extract features; since it does not work with words, it does not care about the language.
- Perfect for blogs: most authors do not write their posts in a program that checks spelling, so most posts contain spelling errors. N-grams are tolerant of misspellings.
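As a rough illustration of the points above, here is a minimal sketch of how character n-grams could be extracted from a post. The function name `ngrams`, the `_` padding around each word and the default of n = 3 are assumptions made for this example, not part of any existing plug-in. It assumes PHP 7+ with the mbstring extension and valid UTF-8 input.

```php
<?php
// Hypothetical helper: split a text into character n-grams.
// Each word is padded with "_" so that word boundaries become features too.
function ngrams(string $text, int $n = 3): array
{
    // Lowercase, then turn every run of non-letters into a single space.
    $clean = trim(preg_replace('/[^\p{L}]+/u', ' ', mb_strtolower($text, 'UTF-8')));
    $grams = [];
    foreach (explode(' ', $clean) as $word) {
        if ($word === '') {
            continue; // empty input produces one empty "word"
        }
        $padded = '_' . $word . '_';
        $last = mb_strlen($padded, 'UTF-8') - $n;
        // Slide a window of $n characters over the padded word.
        for ($i = 0; $i <= $last; $i++) {
            $grams[] = mb_substr($padded, $i, $n, 'UTF-8');
        }
    }
    return $grams;
}

// Usage: ngrams('text') gives ['_te', 'tex', 'ext', 'xt_'].
// A misspelled word still shares most of its n-grams with the correct one:
// "working" and "workng" have 4 trigrams in common (_wo, wor, ork, ng_).
$shared = array_intersect(ngrams('working'), ngrams('workng'));
```

No stemming and no dictionary are involved, which is why the same code works for any language that can be written as letters.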
The project will be based on Cavnar’s work and will extend it with a Naive Bayes classifier and other methods used today to classify text.
The project will learn from previous posts. The “knowledge” will be stored on disk (in a flat file or a database, or people will be able to write a new storage backend by defining a PHP class) for better performance, or it can be learned on the fly. Another good idea is to build public (and free) knowledge databases of common things, such as language texts, that people will be able to download and install on their WP.
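To make the starting point more concrete, below is a minimal sketch of the rank-order (“out-of-place”) comparison described in the Cavnar and Trenkle paper: each text becomes a list of its most frequent n-grams, and two lists are compared by how far each n-gram's rank has moved. The function names `profile`, `out_of_place` and `suggest_category`, the use of 2- and 3-grams only, and the profile size of 300 are assumptions for this example; the Naive Bayes extension is not shown. It assumes PHP 7+ with mbstring.

```php
<?php
// Hypothetical helper: build a ranked n-gram profile (gram => rank, 0 = most frequent).
function profile(string $text, int $size = 300): array
{
    $clean = trim(preg_replace('/[^\p{L}]+/u', ' ', mb_strtolower($text, 'UTF-8')));
    $counts = [];
    foreach (explode(' ', $clean) as $word) {
        if ($word === '') {
            continue;
        }
        $padded = '_' . $word . '_';
        $len = mb_strlen($padded, 'UTF-8');
        for ($n = 2; $n <= 3; $n++) {
            for ($i = 0; $i + $n <= $len; $i++) {
                $gram = mb_substr($padded, $i, $n, 'UTF-8');
                $counts[$gram] = ($counts[$gram] ?? 0) + 1;
            }
        }
    }
    // Sort by frequency (descending); break ties alphabetically so results are stable.
    $grams = array_keys($counts);
    usort($grams, function ($a, $b) use ($counts) {
        return $counts[$b] <=> $counts[$a] ?: strcmp($a, $b);
    });
    return array_flip(array_slice($grams, 0, $size));
}

// Sum of rank differences; n-grams missing from the category get the maximum penalty.
function out_of_place(array $doc, array $category): int
{
    $max = count($category);
    $distance = 0;
    foreach ($doc as $gram => $rank) {
        $distance += isset($category[$gram]) ? abs($rank - $category[$gram]) : $max;
    }
    return $distance;
}

// Pick the category whose training text has the closest profile.
function suggest_category(string $post, array $trainingTexts): string
{
    $doc = profile($post);
    $best = '';
    $bestDistance = PHP_INT_MAX;
    foreach ($trainingTexts as $name => $text) {
        $distance = out_of_place($doc, profile($text));
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = (string) $name;
        }
    }
    return $best;
}
```

In a real plug-in the category profiles would be computed once from previous posts and stored, rather than rebuilt on every call as this sketch does.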
Based on my previous experiments with n-grams and text categorization (my hobby is reading papers about text categorization; you can take a look here at some implementations), it can achieve acceptable performance with up to 50 possible categories. The number of documents does not have a negative impact; quite the opposite, the more posts there are, the better the results will be.
The project will be independent of its WordPress interface; this means that it can be a plug-in or part of the WP core. That depends on what the WP mentors or the community decide.
¹ In this document, “categories” is only a generic name that means categories, tags or language (English, Spanish or any other).
Please feel free to suggest things!
tags: google summer of code, wordpress author: Cesar D. Rodas comments: 3 Comments
Hello dude! Welcome back to ThyPHP, your PHP blog! Today I will write about how to compile a PHP extension on Unix and about an interesting tip found on the blog of PHP's creator.
Yesterday I saw a post on Rasmus Lerdorf's “toys page”. It was a nice post about the MVC model… but the part that surprised me… Read more…
tags: php4, php5 author: Cesar D. Rodas comments: 8 Comments
While I was googling to find out how LibTextCat works internally, I found the paper that changed my life, N-Gram-Based Text Categorization. This paper talks about n-grams (an n-gram is a sub-sequence of n items from a given sequence) and how they can help to build language-independent algorithms to categorize texts. Read more…
tags: artificial intelligence, php classes, spam author: Cesar D. Rodas comments: 6 Comments
After a while away from ThyPHP, I am back and continuing the SEO (search engine optimization) saga, focusing on friendly URLs.
In this post I will continue “Friendly URL! is it really needed?”, this time focusing on performance with Apache.
The most common way to get friendly URLs is mod_rewrite, an Apache web server module. This module accepts regular expressions and maps matching URLs to a local file (normally a PHP script). Read more…
tags: performance, php4, php5, seo author: Cesar D. Rodas comments: 27 Comments
Writing Ajax applications is a common thing these days, the “day to day” of the web developer. Sometimes, especially for complex things, writing Ajax is not so easy, because you need to know both the client side (JavaScript) and the server side. The dream of every PHP developer (at least my dream) is to write Ajax without having to care about the JavaScript code.
Even though there are a lot of Ajax frameworks that minimize what the PHP developer needs to know about JavaScript, sometimes they are very hard to use (a lot of PHP code). Read more…
tags: ajax, php classes, php4, php5, phpajax author: Cesar D. Rodas comments: 19 Comments
The trend on the web right now is to avoid, as much as possible, passing GET variables to a web site; “friendly URLs” are used instead.
This helps to bring more visitors to your web site, because the URL describes the content of your page and web search engines show your page among the first results. This is an important point in SEO, because “ajax-ajax-and-more-ajax.html” is more descriptive than “foo.php?id=6”.
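As a small illustration, a friendly URL like the one above could be derived from a post title with something along these lines. The function name `slugify` and the `.html` suffix are assumptions for this example, and the sketch only handles ASCII titles; mapping the resulting URL back to the right post is a separate step that is not shown here.

```php
<?php
// Hypothetical helper: turn a post title into a URL-friendly slug (ASCII only).
function slugify(string $title): string
{
    // Lowercase, then replace every run of non-alphanumeric characters with one dash.
    $slug = preg_replace('/[^a-z0-9]+/', '-', strtolower($title));
    // Drop dashes left over at the start or end.
    return trim($slug, '-');
}

// Usage: builds "ajax-ajax-and-more-ajax.html" from the title.
$url = slugify('Ajax, ajax and more ajax!') . '.html';
```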
Read more…
tags: php4, php5, seo author: Cesar D. Rodas comments: 10 Comments
Ajax, ajax, ajax… you may have heard so much about Ajax, and you may be wondering: what is Ajax? Is it a new programming language? Well, it is not a programming language; it is the new trend of downloading a web site, or a portion of it, only when you need it, avoiding a reload of the whole page.
The traditional model is quite obsolete these days, because web sites usually have a single view, and when you follow a link only a piece of text or a fraction of the page changes, yet with plain HTTP you reload the whole page.
This is because HTTP is a widely used protocol that was written a long time ago (1991), when the Internet was not as popular as it is now. To avoid this problem, there is a new methodology for developing websites called AJAX (Asynchronous JavaScript And XML), a mixture of client side (JavaScript) and server side (usually PHP, but it could be written in any language, or even be static files).
Here is where the problem begins: you must decide whether your site will use AJAX or not. At this point there is still a considerable number of browsers that do not support it, because their JavaScript interpreter is quite old or they simply do not have one.
Another very important thing is that not all web crawlers support Ajax, so if you have a site that is 100% Ajax, web search engines won’t be able to navigate it and gather information about it; and if search engines cannot crawl your site, visitors from all around the world won’t be able to discover it by searching.
To avoid this problem, we must have two versions of the website, Ajax and “traditional”, because Ajax is much nicer to surf for users who have a new browser, and we cannot ignore web robots because they bring benefits. Read more…
tags: ajax, php classes, php4, php5 author: Cesar D. Rodas comments: 7 Comments
<?="Hello world";?>, the typical first program of every programmer; who has never written that?
Well, the goal of this blog is to give information and news, and to talk about projects and anything related to PHP. This blog is a fork of my personal blog; I decided that on a personal blog other users are not able to write, for obvious reasons (the name of the blog).
If you are a PHP fan and you want to write here, feel free to write something and I’ll contact you; I don’t want to write alone :).
Also, I want to listen to (or read) you; please fill in the form below with the subjects you’d like to read about on this blog.
tags: sites news author: Cesar D. Rodas comments: No Comments