‘twazn me!!! ;(’ Automatic Authorship Analysis of Micro-Blogging Messages

Rui Sousa Silva 1,3, Gustavo Laboreiro 2,4, Luís Sarmento 2,4, Tim Grant 1, Eugénio Oliveira 2, and Belinda Maia 3

1 Centre for Forensic Linguistics at Aston University
2 Faculdade de Engenharia da Universidade do Porto - DEI - LIACC
3 CLUP - Centro de Linguística da Universidade do Porto
4 SAPO Labs Porto

Abstract. In this paper we propose a set of stylistic markers for automatically attributing authorship to micro-blogging messages. The proposed markers include highly personal and idiosyncratic editing options, such as 'emoticons', interjections, punctuation, abbreviations and other low-level features. We evaluate the ability of these features to help discriminate the authorship of Twitter messages among three authors. For that purpose, we train SVM classifiers to learn stylometric models for each author based on different combinations of the groups of stylistic features that we propose. Results show a relatively good performance in attributing authorship of micro-blogging messages (F = 0.63) using this set of features, even when training the classifiers with as few as 60 examples from each author (F = 0.54). Additionally, we conclude that emoticons are the most discriminating features in these groups.

1 Introduction

In January 2010 the New York Daily News reported that a series of Twitter messages exchanged between two childhood friends led to one murdering the other. The set of Twitter messages exchanged between the victim and the accused was considered potential key evidence at trial, but such evidence can be challenged if and when the alleged author refutes its authorship. Authorship analysis can, in this context, contribute to confirming or excluding the hypothesis that a given person is the true author of a queried message, among several candidates. However, the micro-blogging environment raises new, significant challenges, as the messages are extremely short and fragmentary. For example, Twitter messages are limited to 140 characters, but very frequently have only 10 or even fewer words. Standard stylistic markers such as lexical richness, frequency of function words, or syntactic measures — which are known to perform well with longer, 'standard' language texts — perform worse with such short texts, whose language is 'fragmentary' [1]. Traditional authorship analysis methods are considered unreliable for text excerpts smaller than 250-500 words, as the accuracy tends to drop significantly as text length decreases [9].

In this paper we use a text classification approach to investigate whether some 'non-traditional' stylistic markers, such as the type of emoticons, provide enough stylistic information to be used in authorship attribution. We focus specifically on Twitter for its popularity, and address Portuguese in particular, which is one of the most widely used languages in this medium 1.

1 http://semiocast.com/downloads/Semiocast_Half_of_messages_on_Twitter_are_not_in_English_20100224.pdf

2 Related Work

In recent years, there has been considerable research on authorship attribution of some user-generated contents — such as e-mail (e.g. [2]) and, more recently, weblogs (e.g. [3,4,5]) and 'opinion spam' (e.g. [6]). However, research on authorship attribution of Twitter messages has been scarce, and has raised robustness problems. To tackle the problem of robustness in computational stylometric analysis, research (e.g. the 'Writeprints technique' [10]) was applied to four different text genres to discriminate authorship and detect similarity of online texts among 100 authors. The performance obtained was good, but (a) the procedure did not prove to be content-agnostic, and (b) it did not analyse Twitter messages. Also, using structural features that are possibly due to editing, and counting usage anomalies such as misspellings and grammar mistakes as 'idiosyncratic features', is bound to compromise the results.

More recently, it has been demonstrated that the authorship of Twitter messages can be attributed with a certain degree of certainty [11]. Surprisingly, the authors concluded that authorship could be identified with 120 tweets per user, and that more messages would not improve accuracy significantly. However, their method compromises the authorship identification task for most unknown messages, as they reported a loss of 27% accuracy when information about the interlocutor's user data was removed.

It has also been demonstrated that authorship could be attributed using 'probabilistic context-free grammars' [12] by building complete syntactic models of each author (3 to 6 authors). Nevertheless, the authors used both syntactic and lexical information to determine each author's writing style.

Conversely, we propose a content-agnostic method, based on low-level features, to identify the authorship of unknown messages. This method is independent of user information, so not knowing the communication participants is irrelevant to the identification task. Moreover, although some of the features used have been studied independently, this method is innovative in that the specific combination of the different stylistic features has never been used before and has not been applied to such short texts.
3 Method Description and Stylistic Features

Authorship attribution can be seen as a typical text classification task: given examples of messages written by a set of authors (classes), we aim to attribute the authorship of messages whose author is unknown. In a forensic scenario, the task consists of discriminating the authorship of messages among a small number of potential authors (e.g. 2 to 5), or determining whether a message can be attributed to a certain ('suspect') author.

The key to framing authorship attribution as a text classification problem is the selection of the feature sets that best describe the style of the authors. We propose four groups of stylistic features for automatic authorship analysis, each dealing with a particular aspect of tweets. All features are content-agnostic; to ensure robust authorship attribution and prevent the analysis from relying on topic-related clues, they do not contain lexical information.

Group 1: Quantitative Markers. These features attempt to grasp simple quantitative style markers from the message as a whole. The set includes message statistics, e.g. length (in characters) and number of tokens, as well as token-related statistics (e.g. average length, number of 1-character tokens, 2-consonant tokens, numeral tokens, choice of case, etc.). We also consider other markers, e.g. the use of dates, and words not found in the dictionary 2, indicating possible spelling mistakes or potential use of specialised language.

As Twitter-specific features, we compute the number of user references (e.g. @user123), the number and position of hashtags (e.g. #music), in-message URLs and the URL shortening service used. We also take note of messages starting with a username (a reply), as the author may alter their writing style when addressing another person.

2 We use the GNU Aspell dictionary for European Portuguese.
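As a concrete illustration, the Group 1 markers could be computed along the following lines. This is a minimal sketch, not the authors' extraction code: the whitespace tokenisation, feature names and regular expressions are our own illustrative assumptions, and the Aspell dictionary lookup is omitted.

```python
import re

def quantitative_features(message: str) -> dict:
    """Illustrative Group 1 (quantitative) markers for a single tweet."""
    tokens = message.split()  # assumption: simple whitespace tokenisation
    return {
        "n_chars": len(message),
        "n_tokens": len(tokens),
        "avg_token_len": sum(len(t) for t in tokens) / max(len(tokens), 1),
        "n_1char_tokens": sum(1 for t in tokens if len(t) == 1),
        "n_numeral_tokens": sum(1 for t in tokens if t.isdigit()),
        "n_upper_tokens": sum(1 for t in tokens if t.isupper()),
        # Twitter-specific markers described above
        "n_user_refs": len(re.findall(r"@\w+", message)),
        "n_hashtags": len(re.findall(r"#\w+", message)),
        "n_urls": len(re.findall(r"https?://\S+", message)),
        "is_reply": message.startswith("@"),
    }
```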
Group 2: Marks of Emotion. Another highly personal, and hence idiosyncratic, stylistic marker is the device used to convey emotion. There are mainly three non-verbal ways of expressing emotion in user-generated contents: (i) smileys; (ii) 'LOLs'; and (iii) interjections.

Smileys (':-)') are used creatively to reflect human emotions by changing the combination of eyes, nose and mouth. This work explores three axes of idiosyncratic variation: range (e.g. number of happy smileys per message), structure (e.g. whether the smiley has a nose) and direction of the smiley.

Another form of expression is the prevalent 'LOL', which usually stands for Laughing Out Loud. Users frequently manipulate the basic 'LOL' and 'maximise' it in various other forms, e.g. by repeating its letters (e.g. 'LLOOOLLL') or creating a loop (e.g. 'LOLOL'). This subgroup describes several instances of length, case and the ratio between 'L' and 'O', so as to distinguish between 'LOL' and the exaggerated multiplication of the 'O', as in 'LOOOOL'.

We identify interjections as tokens consisting of only two alternating letters that are not a 'LOL', such as 'haaahahahah'. Other popular and characteristic examples are the typical Brazilian laughing 'rsrsrs' and the Spanish laughing 'jejeje' — both of which are now commonly found in European Portuguese Twitter. We count the number of interjections used in a message, their average length and number of characters.

Group 3: Punctuation. The choice of punctuation is a case of writing style [13], mostly in languages whose syntax and morphology are highly flexible (such as Portuguese and Spanish). Some authors occasionally make use of expressive and non-standard punctuation, either by repeating ('!!!') or combining it ('?!?'). Others simply skip punctuation, assuming the meaning of the message will not be affected. Ellipses in particular can be constructed in less usual ways (e.g. '..' or '......'). We count the frequency of these and other peculiar cases, such as the use of punctuation after a 'LOL' and at the end of a message (while ignoring URLs and hashtags).

Group 4: Abbreviations. Some abbreviations are highly idiolectal, thus depending on personal choice. We monitor the use of three types of abbreviations: 2-consonant tokens (e.g. 'bk' for 'back'), 1- or 2-letter tokens followed by '.' or '/' (e.g. 'p/') and 3-letter tokens ending in two consonants, with (possibly) a dot at the end (e.g. 'etc.').
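The Group 2-4 markers lend themselves to pattern matching. The sketch below gives rough approximations of the detectors described above; the regular expressions are our own guesses at the described patterns (left-to-right smileys only, 'LOL' variants, alternating-letter interjections, repeated punctuation, short abbreviations), not the authors' actual definitions.

```python
import re

# Illustrative approximations of the Group 2-4 patterns described in the text.
SMILEY = re.compile(r"[:;=8][-o*']?[)(\]\[DPpd/\\]")       # eyes, optional nose, mouth
LOL = re.compile(r"(?i)l+[ol]*o+l+[ol]*")                  # 'LOL', 'LLOOOLLL', 'LOLOL', ...
INTERJECTION = re.compile(r"(?i)([a-z])([a-z])(?:\1|\2){2,}")  # two alternating letters

def emotion_features(message: str) -> dict:
    tokens = message.split()
    lols = [t for t in tokens if LOL.fullmatch(t)]
    interjections = [t for t in tokens
                     if INTERJECTION.fullmatch(t) and not LOL.fullmatch(t)]
    return {
        "n_smileys": len(SMILEY.findall(message)),          # range only; structure and
                                                            # direction omitted in this sketch
        "n_lols": len(lols),
        "avg_lol_len": sum(map(len, lols)) / max(len(lols), 1),
        "n_interjections": len(interjections),
        "n_repeated_punct": len(re.findall(r"[!?.]{2,}", message)),  # '!!!', '?!?', '..'
        "n_short_abbrev": len(re.findall(r"\b\w{1,2}[./]", message)),  # e.g. 'p/' (subset
                                                                       # of Group 4)
    }
```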
4 Experimental Setup

This study focuses on identifying the author of a message among three candidate authors. We consider only three possible authors because forensic linguistic scenarios usually involve a limited number of suspect authors; this setting is hence more realistic. We chose Support Vector Machines (SVM) [14] as the classification algorithm for their proven effectiveness in text classification tasks and robustness in handling a large number of features. The SVM-Light implementation [14] was used, parametrised with a linear kernel. We employ a 1-vs-all classification strategy: for each author, we use an SVM to learn the corresponding stylistic model, capable of discriminating that author's messages. Given a message of unknown ('suspect') authorship, we use each SVM to predict the degree of likelihood that the corresponding author is the true author. The message's authorship is attributed to the author of the highest-scoring SVM. We also consider a threshold on the minimum value of the SVM score, so as to introduce a confidence parameter (the minimum score of the SVM classifier considered valid) in the authorship attribution process. When none of the SVM scores achieves the minimum value, authorship is left undefined.

Our data set consists of Twitter messages from authors in Portugal, collected in 2010 (January 12 to October 1). We counted over 200,000 users and over 4 million messages during this period (excluding messages posted automatically, such as news feeds). From these, we selected the 120 most prolific Twitter authors in the set, each responsible for at least 2,000 distinct and original messages (i.e. excluding retweets), to extract the sets of messages for our experiments. We divide the 120 authors into 40 groups of 3 users at random, and maintain these groups throughout our experiments. The group of 3 authors forms the basic testing unit of our experiment.

We perform two sets of experiments. In Experimental Set 1, the classification procedure uses all possible groups of features to describe the messages. We use data sets of sizes 75, 250, 1,250 and 2,000 messages/author. In Experimental Set 2, we run the training and classification procedure using only one group of features at a time. We use the largest data set from the previous experiment (2,000 messages/author) for this analysis. We measure Precision (P), Recall (R) and the F-measure, $F = 2PR/(P + R)$, where

$$P = \frac{\#\text{ messages correctly attributed}}{\#\text{ messages attributed}} \qquad R = \frac{\#\text{ messages correctly attributed}}{\#\text{ messages in the set}}$$

We run the training and classification procedures in each set of experiments and use the confidence parameter to draw Precision vs. Recall graphs. As these experiments consider three different authors, the baseline is F = 0.33 (P = 0.33 at R = 0.33). All experiments were conducted using 5-fold cross-validation, and run for all 40 groups of 3 authors. For varying levels of Recall (in increments of 0.01) we calculate the maximum, minimum and average Precision obtained over all 40 groups. All F values are calculated using the average Precision.
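To make the classification scheme concrete, here is a minimal 1-vs-all sketch. It substitutes scikit-learn's LinearSVC (a linear-kernel SVM) for the SVM-Light implementation the authors actually used; the function names, label encoding and default confidence of 0.0 are our own assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_models(X, y, authors):
    """One linear SVM per candidate author (1-vs-all).

    X: array of stylistic feature vectors, y: array of author labels.
    """
    models = []
    for a in authors:
        m = LinearSVC()                     # linear kernel, as with SVM-Light
        m.fit(X, (y == a).astype(int))      # author a vs. everyone else
        models.append(m)
    return models

def attribute(models, x, confidence=0.0):
    """Attribute one message to the highest-scoring author's model,
    or return None (authorship undefined) if no score reaches the
    confidence threshold."""
    scores = [m.decision_function([x])[0] for m in models]
    best = int(np.argmax(scores))
    return best if scores[best] >= confidence else None
```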
5 Results and Analysis

Figure 1 shows the Precision vs. Recall graphs for Experimental Set 1. Increasing the data set size (from 75 to 2,000 messages/author) yields improvements in the resulting Precision vs. Recall curves.

[Figure 1: four Precision vs. Recall plots, one per data set size: 1a) 75, 1b) 250, 1c) 1,250 and 1d) 2,000 messages/author.]

Fig. 1. Performance of each data set size. Each graph plots maximum, average and minimum Precision at varying levels of Recall (40 groups of 3 authors).
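Curves like those in Figure 1 can be traced by sweeping the confidence threshold over the attributed messages. The sketch below is a hypothetical reconstruction consistent with the Section 4 definitions, under the assumption that messages left unattributed count against Recall but not Precision; the helper name and array layout are ours.

```python
import numpy as np

def pr_curve(scores, predicted, true, thresholds):
    """Sweep the confidence threshold to produce (Recall, Precision) points.

    scores[i]    -- winning SVM score for message i
    predicted[i] -- author chosen by the highest-scoring SVM
    true[i]      -- actual author of message i
    """
    points = []
    n = len(true)
    for t in thresholds:
        kept = scores >= t                          # messages actually attributed
        attributed = kept.sum()
        correct = (predicted[kept] == true[kept]).sum()
        p = correct / attributed if attributed else 0.0
        r = correct / n                             # undefined attributions hurt Recall only
        points.append((r, p))
    return points

# Example sweep: thresholds from lenient to strict.
# pr_curve(scores, predicted, true, np.linspace(scores.min(), scores.max(), 100))
```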