fastText - First Steps

Today I tried fastText, a new tool from Facebook for text classification.

I am using a famous Twitter dataset for this tutorial. You can download it right here: http://help.sentiment140.com/for-students. Installing fastText is quite easy (you can find instructions in the official GitHub repository).

First you have to convert the dataset into the format fastText expects: each line holds one example, and its label (class) is prefixed with __label__.

unzip trainingandtestdata.zip 
cut -d, -f1,6- training.1600000.processed.noemoticon.csv > trainingsset.csv # keep only the polarity (column 1) and the tweet text (column 6 onward)

sed -i -e 's/^"0"/__label__neg /g' trainingsset.csv
sed -i -e 's/^"2"/__label__neutral /g' trainingsset.csv
sed -i -e 's/^"4"/__label__pos /g' trainingsset.csv
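
To see what these steps do to a single row, here is a small sketch that runs the same cut and sed commands through a pipe. The sample row is made up for illustration (it only mimics the six-column layout of the dataset) and is not taken from the real file:

```shell
# Hypothetical sample row in the six-column layout of the dataset (invented, not real data).
sample='"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","someuser","what a bad day, really"'

# Same steps as above: keep column 1 and everything from column 6 on,
# then turn the leading "0" into the fastText label.
converted=$(printf '%s\n' "$sample" | cut -d, -f1,6- | sed -e 's/^"0"/__label__neg /')

printf '%s\n' "$converted"
```

Note that the comma after the label and the quotes around the tweet stay in the line. fastText splits the text on whitespace, so they simply end up as part of the first and last token.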

Now you can train a new model like this:

./fasttext supervised -input trainingsset.csv -output model

The package also provides a test mode that calculates precision and recall. But first you have to reformat the test data in the same way as the training set:

cut -d, -f1,6- testdata.manual.2009.06.14.csv > testdata.csv

sed -i -e 's/^"0"/__label__neg /g' testdata.csv
sed -i -e 's/^"2"/__label__neutral /g' testdata.csv
sed -i -e 's/^"4"/__label__pos /g' testdata.csv
./fasttext test model.bin testdata.csv

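With this dataset I get a precision and recall (P@1 and R@1 in the fastText output) of 0.783, which is not bad but not great either. I'll take a deeper look at the tuning options, but I am optimistic :) It's also strange that the Twitter training set does not contain any neutral tweets, although the documentation says it does. Since the model never sees the __label__neutral class during training, it cannot predict it for any neutral tweets in the test data, which may explain part of the gap.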
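
The missing neutral class can be checked by counting the labels in the converted files. This is a minimal sketch, assuming the files were produced with the commands above; count_labels is a helper name made up for this example:

```shell
# Count how often each label occurs in a fastText-formatted file.
# The label is the first space-separated token on each line.
count_labels() {
    cut -d' ' -f1 "$1" | sort | uniq -c
}

# Print the label counts for the converted files, skipping any that do not exist yet.
for f in trainingsset.csv testdata.csv; do
    if [ -f "$f" ]; then
        echo "$f:"
        count_labels "$f"
    fi
done
```

If __label__neutral shows up only in the counts for testdata.csv, the training data indeed has no neutral examples.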