A walkthrough of Twitter Text Mining

So that no detail gets lost, this is the transcript of a real session that starts with analyzing tweets and ends with classifying sentences from the Bible by language.

Step 1: Create a Problem File

martin$ bzcat sample.bz2 | jq -r  '"__label__"+(if .lang == "en" then .lang else "different" end)+" " + (.text | gsub("[\\n\\t]"; "") |  gsub("[^A-z ]";"") | gsub("[\\s]+";" ")) ' > fasttext-problem.txt

The file now looks like this:

__label__different Kawawa naman mga taga val
__label__different P rz
__label__different mmm nu blir jag poppis JA det r bra och viktigt med organiserat sjlvfrsvar mot nazister ja man fr sl 
__label__en _BarbbMarley happy birthday beautiful 
__label__different No dejes 

This already looks pretty good: every tweet is on one line with exactly one label, and all labels share the common __label__ prefix, so we can feed it straight to fastText. (Note that the character class [^A-z ] in the jq filter also keeps a few ASCII punctuation characters that sit between Z and a, which is why the underscore in _BarbbMarley survived; [^A-Za-z ] would strip those too.)
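Before training, a quick sanity check never hurts. A minimal sketch, using its own two-line sample file (the file name sanity-sample.txt is made up for this check), that counts lines which lack a proper label prefix:

```shell
# Recreate two example lines, then count lines that lack a proper label prefix.
printf '%s\n' \
  '__label__different Kawawa naman mga taga val' \
  '__label__en _BarbbMarley happy birthday beautiful' > sanity-sample.txt
bad=$(grep -Evc '^__label__(en|different) ' sanity-sample.txt || true)
echo "malformed lines: $bad"   # → malformed lines: 0
```

Anything other than zero here would point at a bug in the jq cleanup above.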

Step 2: Train/Test Split

But wait: strictly speaking we need three sets (train, test, validation). For this illustration we stick with two; a train/test split is the bare minimum. Note that -n l/2 splits the file at a line break near the middle of the file, so the two halves may end up with different numbers of lines but hold a similar amount of data.

martin$ split -n l/2 fasttext-problem.txt split-
martin$ mv split-aa train.txt
martin$ mv split-ab test.txt
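To see what -n l/2 actually does, here is a toy demonstration on a made-up five-line file (GNU split; file names are arbitrary):

```shell
# Cut a small file near its byte midpoint, always at a line break.
printf 'a\nbb\nccc\ndddd\neeeee\n' > toy.txt
split -n l/2 toy.txt toy-
wc -c toy-aa toy-ab               # similar byte counts ...
wc -l toy-aa toy-ab               # ... but possibly different line counts
cat toy-aa toy-ab > rejoined.txt  # concatenating the halves restores the original
```

No line is ever cut in half, which is exactly what we need for a line-per-example file.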

Step 3: Train a Model

Now we can start training, and if we had a validation set, we could actually search for good parameters. For this page, we stick to a reasonably simple set of parameters:

martin$ fasttext supervised -input train.txt -epoch 10 -dim 20 -lr 0.05 -output model

Read 5M words
Number of words:  780277
Number of labels: 2
Progress: 100.0% words/sec/thread: 1882036 lr:  0.000000 loss:  0.056944 ETA:   0h 0m
martin$ 
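With a validation set we could now tune these parameters. A sketch of a simple grid over learning rate and dimension; the ranges are arbitrary choices of mine, and as a dry run it only prints the nine candidate invocations instead of training nine models:

```shell
# Print one fasttext invocation per (lr, dim) combination; nothing is trained.
for lr in 0.05 0.25 0.5; do
  for dim in 10 20 50; do
    echo "fasttext supervised -input train.txt -epoch 10 -lr $lr -dim $dim -output model-lr$lr-dim$dim"
  done
done | tee grid.txt
wc -l < grid.txt   # 9 candidate configurations
```

One would then evaluate each model on the validation set and keep the best.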

Step 4: Evaluate the Model

Let us now look at how the model performs on the test set:

martin$ fasttext test model.bin test.txt

N	503480
P@1	0.957
R@1	0.957
martin$

This tells us that precision and recall are both about 96%, a good performance for a model this simple.
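Why are P@1 and R@1 identical here? With exactly one gold label per line and the model emitting exactly one prediction, both reduce to plain accuracy. A toy check with made-up files:

```shell
# Two of three predictions agree with the gold labels, so P@1 = R@1 = 2/3.
printf '__label__en\n__label__different\n__label__en\n' > gold.txt
printf '__label__en\n__label__en\n__label__en\n'        > pred.txt
paste gold.txt pred.txt | awk '$1 == $2 { c++ } END { printf "P@1 %.3f\n", c/NR }'
# → P@1 0.667
```

The two metrics only diverge once examples carry multiple labels or the model predicts more than one.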

Step 5: Apply the Model

Okay, now we might be faced with some text from which we want to select only the English sentences. Let's have a look at the beginning of the Bible (probably the most-translated body of text ever):

martin$ cat <<EOF > bible.txt
>Am Anfang schuf Gott Himmel und Erde.
>In het begin heeft God de hemelen en de aarde gemaakt.
>In the beginning, God created the heavens and the earth. 
>Au commencement, Dieu créa le ciel et la terre.
>Dios, en el principio, creó los cielos y la tierra.
>Quando Deus começou criando o firmamento e a Terra.
>In principio creavit Deus cælum et terram.
>Başlangıçta Tanrı göğü ve yeri yarattı.
>Na samém počátku historie naší země stojí Bůh. Ano, byl to On, kdo stvořil nebe a zemi. 
EOF

martin$

You can see that the third sentence is English while the others are in different languages (German, Dutch, French, Spanish, Portuguese, Latin, Turkish, and Czech); all of them render roughly the same text, the opening words of the Bible.

Let's predict these:

martin$ cat bible.txt | fasttext predict-prob model.bin - | tee bible.prediction

__label__different 0.999278
__label__different 0.999987
__label__en 0.975278
__label__different 0.999992
__label__different 0.999662
__label__different 1.00001
__label__different 0.999977
__label__different 0.987726
__label__different 0.949688
martin$

Well done: the model detects English only for the third sentence of the file. (The value slightly above 1 for the Portuguese sentence is presumably a floating-point rounding artifact in fastText's probability output.) But can we combine these results to see which sentence belongs to which prediction? We can; the tool for that is paste:

martin$ paste bible.prediction bible.txt
__label__different 0.999278	Am Anfang schuf Gott Himmel und Erde.
__label__different 0.999987	In het begin heeft God de hemelen en de aarde gemaakt.
__label__en 0.975278	In the beginning, God created the heavens and the earth. 
__label__different 0.999992	Au commencement, Dieu créa le ciel et la terre.
__label__different 0.999662	Dios, en el principio, creó los cielos y la tierra.
__label__different 1.00001	Quando Deus começou criando o firmamento e a Terra.
__label__different 0.999977	In principio creavit Deus cælum et terram.
__label__different 0.987726	Başlangıçta Tanrı göğü ve yeri yarattı.
__label__different 0.949688	Na samém počátku historie naší země stojí Bůh. Ano, byl to On, kdo stvořil nebe a zemi. 
martin$
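From here, extracting only the English sentences is one more pipe. A sketch on a hypothetical two-line excerpt of the files above (file names demo.prediction and demo.txt are made up; probabilities copied from the output):

```shell
# Keep only sentences whose prediction starts with __label__en.
printf '__label__different 0.999278\n__label__en 0.975278\n' > demo.prediction
printf '%s\n' \
  'Am Anfang schuf Gott Himmel und Erde.' \
  'In the beginning, God created the heavens and the earth.' > demo.txt
paste demo.prediction demo.txt | awk -F'\t' '$1 ~ /^__label__en /' | cut -f2
# → In the beginning, God created the heavens and the earth.
```

The trailing space in the pattern matters: without it, a label like __label__english would match as well.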