LP peaks much earlier. The most obvious male is author with a resounding Looking at his texts, we indeed see a prototypical young male Twitter user: For each system, we provided the first N principal components for various N.

But it might also mean that the gender just influences all feature types to a similar degree. In scores, too, we see far more variation. Normalized 4-gram: About K features. This is in accordance with the hypothesis just suggested for the token n-grams, as normalization too brings the character n-grams closer to token unigrams.

However, our starting point will always be SVR with token unigrams, this being the best performing combination. Then we outline how we evaluated the various strategies (Section 3). We used the most frequent, as measured on our tweet collection, of which the example tweet contains the words ik, dat, heeft, op, een, voor, and het.
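Selecting the most frequent tokens in a collection and counting them per text can be sketched as follows. This is only a minimal illustration, not the procedure used in the experiments: the whitespace tokenization, the function names `top_k_tokens` and `count_vector`, the cutoff `k`, and the toy texts are all assumptions made for the example.

```python
from collections import Counter


def top_k_tokens(texts, k):
    """Return the k most frequent lowercased tokens in the collection.

    Assumption: tokens are split on whitespace; a real tokenizer for
    tweets would also handle emoticons, hashtags, and so on.
    """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    return [tok for tok, _ in counts.most_common(k)]


def count_vector(text, vocabulary):
    """Count how often each vocabulary token occurs in one text."""
    counts = Counter(text.lower().split())
    return [counts[tok] for tok in vocabulary]


# Toy collection, invented for illustration only.
collection = ["ik dat ik", "ik dat het"]
vocabulary = top_k_tokens(collection, k=2)
vector = count_vector("dat is dat ik", vocabulary)
```

The resulting count vectors (possibly normalized for text length) could then be passed to any of the learning systems.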

We achieved the best results, As we approached the task from a machine learning viewpoint, we needed to select text features to be provided as input to the machine learning systems, as well as the machine learning systems that are to use this input for classification.

If, in any application, unbalanced collections are expected, the effects of biases, and corrections for them, will have to be investigated. The exception also leads to more varied classification by the different systems, yielding a wide range of scores. However, we used two types of character n-grams.
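The difference between the two types of character n-grams, original and normalized, can be illustrated with a short sketch. The normalization shown here (lowercasing, stripping diacritics, and mapping every remaining non-letter to an underscore) is an assumption chosen for the example; the exact normalization used in the experiments may differ.

```python
import unicodedata


def normalize(text):
    """Hypothetical normalization: lowercase, strip diacritics,
    and replace anything that is not a-z with an underscore."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return "".join(ch if "a" <= ch <= "z" else "_" for ch in without_marks)


def char_ngrams(text, n):
    """Return all overlapping character n-grams of the text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


original_bigrams = char_ngrams("Hé!", 2)
normalized_bigrams = char_ngrams(normalize("Hé!"), 2)
```

Because normalization collapses case, diacritics, and punctuation, many distinct original n-grams map onto the same normalized n-gram, which reduces the number of features.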

The age is reconfirmed by the endearingly high presence of mama and papa. We start with the accuracy of the various features and systems (Section 5).

As for style, the real factor is echt (really). The creators themselves used it for various classification tasks, including gender recognition Koppel et al.


With one exception (author is recognized as male when using trigrams), all feature types agree on the misclassification. Normalized 5-gram: About K features. After this, we examine the classification of individual authors (Section 5).

The male who is attributed the most female score is author In the example tweet, e. Although LP performs worse than it could on fixed numbers of principal components, its more detailed confidence score allows a better hyperparameter selection, on average selecting around 9 principal components, whereas TiMBL chooses a wide range of numbers, generally far lower than is optimal.

Juola and Koppel et al. When looking at his tweets, we Apart from normal tokens like words, numbers and dates, it is also able to recognize a wide variety of emoticons. The use of syntax or even higher level features is for now impossible as the language use on Twitter deviates too much from standard Dutch, and we have no tools to provide reliable analyses.

For the other feature types, we see some variation, but most scores are found near the top of the lists. This apparently colours not only the discussion topics, which might be expected, but also the general language use.

When using all user tweets, they reached an accuracy of Starting with the systems, we see that SVR using original vectors consistently outperforms the other two. As a result, the systems' accuracy was partly dependent on the quality of the hyperparameter selection mechanism.

Gender Recognition

Gender recognition is a subtask in the general field of authorship recognition and profiling, which has reached maturity in the last decades for an overview, see e.

For whom we already know that they are an individual person rather than, say, a husband and wife couple or a board of editors for an official Twitter feed. Their highest score when using just text features was And also some more negative emotions, such as haat (hate) and pijn (pain).

Trigrams: Three adjacent tokens.
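As a small illustration of this feature type, token trigrams can be extracted with a sliding window over the token sequence. The function name `token_ngrams` and the example tokens are assumptions made for this sketch; the tokens are taken to come from an earlier tokenization step.

```python
def token_ngrams(tokens, n=3):
    """Return all runs of n adjacent tokens as tuples, in order.

    With the default n=3 this yields token trigrams; a sequence
    shorter than n yields an empty list.
    """
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy token sequence, invented for illustration only.
trigrams = token_ngrams(["ik", "heb", "dat", "gezien"])
```

Setting `n=1` or `n=2` in the same function gives unigrams and bigrams.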