Building a very easy text classifier in python

Category : python

Some of the developers at match2blue are creating a text-interest-matcher. Leaving buzzword bingo aside, that means the software calculates whether a text is interesting based on a user’s interests. So basically you, as a user, enter some interests and are presented with pieces of data in order of their relevance. You can also think of it as classifying text as either good or bad.

This software has become quite complicated, because it needs some kind of semantic knowledge about the interests. But there are different methods of text classification. I was curious how hard it would be to write the simplest text classifier that still gives decent results. Well, it turns out to be remarkably simple. Creating the classifier took less than two hours. And here is the source code:

How does it work? It takes two very simple steps. First, the classifier has to be trained with existing text files. The result is a dictionary that maps each word to an inner dictionary counting the words that followed it. Let’s feed it some text to see what happens.
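To make the training step concrete, here is a minimal sketch. It is not the original listing: the function name `train`, the lower-casing and whitespace splitting, and the sample sentence are all assumptions made for illustration. It counts, for every word, which words directly follow it.

```python
from collections import defaultdict


def train(text):
    """Count, for every word, how often each other word directly follows it."""
    followers = defaultdict(lambda: defaultdict(int))
    words = text.lower().split()
    # Walk over all adjacent word pairs (current word, following word).
    for current, following in zip(words, words[1:]):
        followers[current][following] += 1
    # Convert to plain dicts so the result prints cleanly.
    return {word: dict(counts) for word, counts in followers.items()}


# Hypothetical training text, chosen so that 'a' is followed by
# 'good' twice and by 'test' once.
model = train("this is a good day and a good text and this is a test")
print(model["a"])  # {'good': 2, 'test': 1}
```

Each key of the outer dictionary is a word, and each inner dictionary holds the counts of its follow-ups.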

In the result we can see that ‘a’ was followed by another word three times: twice by ‘good’ and once by ‘test’. This is all we need for classifying. Now we can apply the classify function. It goes through the text to classify and looks for known word follow-ups. Whenever it finds a known one, the probability of that follow-up is added to the score. So, in our example the probability of ‘a good’ is 2/3 and for ‘a test’ it is 1/3.
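The classification step can be sketched in the same spirit. Again, this is an illustration under assumptions rather than the original code: `train` and `classify` are hypothetical names, the training sentence is made up, and dividing the summed probabilities by the number of word pairs is a choice made here to keep the score between 0 and 1.

```python
from collections import Counter, defaultdict


def train(text):
    """Map each word to a Counter of the words that directly follow it."""
    followers = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        followers[current][following] += 1
    return followers


def classify(text, model):
    """Average follow-up probability over all adjacent word pairs in text."""
    words = text.lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    score = 0.0
    for current, following in pairs:
        counts = model.get(current)
        if counts and following in counts:
            # Relative frequency of this follow-up, e.g. 2/3 for 'a good'.
            score += counts[following] / sum(counts.values())
    # Assumption: normalise by the number of pairs so the score stays in [0, 1].
    return score / len(pairs)


# Hypothetical training text: 'a' is followed by 'good' twice and 'test' once.
model = train("this is a good day and a good text and this is a test")

# Similar words in a similar order: three of the four pairs are known.
print(classify("this is a good test", model))

# The same words in reverse order: no known follow-ups, so this prints 0.0.
print(classify("test good a is this", model))
```

Because only directly adjacent word pairs are counted, word order matters as much as vocabulary.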

The first example uses similar words in a similar order to the training text. The second one also contains some of the exact same words, but in a different order. Therefore, the probability of this text belonging to the trained category is 0.0.

If you want to try it with some longer text, you can download the classifier from Google Code. It is easy to create your own categories by adding a file in the appropriate folder. Just make sure you have a decent amount of data. Three sentences are not enough for a good classification.
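To illustrate the idea of one file per category, here is a rough sketch. The actual folder layout of the download is not described here, so everything about it is assumed: a folder containing one `.txt` file per category, with the file name used as the category name. `load_categories` and `best_category` are hypothetical helpers, and `train` and `classify` follow the two steps described above, repeated so the block runs on its own.

```python
import tempfile
from collections import Counter, defaultdict
from pathlib import Path


def train(text):
    """Map each word to a Counter of the words that directly follow it."""
    followers = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        followers[current][following] += 1
    return followers


def classify(text, model):
    """Average probability of the known follow-ups; 0.0 if nothing is known."""
    words = text.lower().split()
    pairs = list(zip(words, words[1:]))
    total = sum(
        model[cur][nxt] / sum(model[cur].values())
        for cur, nxt in pairs
        if cur in model and nxt in model[cur]
    )
    return total / len(pairs) if pairs else 0.0


def load_categories(folder):
    """Train one model per *.txt file; the file stem is the category name."""
    return {
        path.stem: train(path.read_text(encoding="utf-8"))
        for path in sorted(Path(folder).glob("*.txt"))
    }


def best_category(text, models):
    """Return the (category, score) pair with the highest score."""
    scores = {name: classify(text, model) for name, model in models.items()}
    winner = max(scores, key=scores.get)
    return winner, scores[winner]


# Demo with two tiny made-up categories in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "weather.txt").write_text(
        "the sun is shining and the sky is blue", encoding="utf-8"
    )
    Path(folder, "cooking.txt").write_text(
        "stir the soup and add the salt", encoding="utf-8"
    )
    models = load_categories(folder)
    category, score = best_category("the sky is blue", models)
    print(category, score)
```

With training files this small the scores mean very little; the demo only shows the mechanics of adding a category by adding a file.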

In my tests I got quite good results, even when telling apart authors writing about the same topic.

