Twenglish: A New Variety of English? A quantitative analysis of a Twitter based corpus


The most striking lexical feature of the messages in the corpus is the presence of acronyms. An acronym can be defined as an abbreviation consisting of the first letters of each word in the name of something, and which is pronounced as a word (Cambridge Advanced Learner's Dictionary). Acronyms are therefore usually employed to abbreviate the names of organizations (NATO, UN), illnesses (AIDS, HIV), technologies (DVD, VHS) and so on. In other words, they are simple abbreviations of noun phrases.

As far as computer-mediated communication (CMC) is concerned, though, the situation changes: Internet acronyms allow speakers to convey not only noun phrases but even complete sentences with a single acronym. In our case, we may distinguish between two types: lexical acronyms and non-lexical acronyms. The first type conveys a meaningful phrase and can actually form part of a sentence; the second does not convey a lexical meaning, but rather an emotional element, such as surprise or happiness. Furthermore, unlike standard acronyms, Internet acronyms are frequently written in lowercase, because this is quicker to type.

Moreover, as far as their meaning is concerned, most of these acronyms have a negative connotation, owing to the presence of swearwords. Words of this kind, however, do not carry a meaning of their own here; rather, they are used to reinforce the idea the speaker is expressing. […]

Let us now analyse two examples of acronyms, in order to better understand the difference between the lexical and the non-lexical ones:

#ID2: Stfu, you can't even remember her name douchebag

#ID127: omg lol it's snowing inside my bedroom lololololol

As can be seen, #ID2 contains a lexical acronym: stfu is an actual part of the sentence and completes its meaning. #ID127, instead, presents three non-lexical acronyms in the same sentence. Unlike the previous example, these acronyms do not carry any lexical meaning at all, but rather indicate surprise and happiness. Furthermore, the repetition of the acronym lol in the last part of the sentence supports our theory that non-lexical acronyms tend to lose their original status as acronyms and acquire the meaning of the emotion they convey. In the case of lol, therefore, the speaker has laughter in mind (eheheh), and this influences the way he/she writes. This theory is also supported by Baron (2008), especially with reference to IM (Instant Messaging) platforms, from which Twitter draws some features.
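The repeated form of lol seen in #ID127 could be located automatically in a corpus. The following is only a minimal sketch, not the procedure used in this study: the pattern and the function name find_reduplicated_lol are hypothetical, and the sketch assumes that a repeated form is "lol" followed by one or more extra "ol" syllables.

```python
import re

# Hypothetical pattern (an assumption, not the study's method): "lol" followed
# by one or more extra "ol" syllables, e.g. "lolol" or "lololololol".
# The plain acronym "lol" on its own does not match.
REDUPLICATED_LOL = re.compile(r"\blol(?:ol)+\b", re.IGNORECASE)


def find_reduplicated_lol(message: str) -> list[str]:
    """Return every repeated form of "lol" found in the message."""
    return REDUPLICATED_LOL.findall(message)


# Message #ID127 from the corpus examples above.
example = "omg lol it's snowing inside my bedroom lololololol"
print(find_reduplicated_lol(example))  # ['lololololol']
```

Applied to #ID127, the sketch returns only the final token and skips the plain lol earlier in the message, which keeps the ordinary acronym separate from its laughter-like repetition.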

Let us now turn to another typical lexical feature of the corpus: the presence of words which refer to Twitter and to the Internet. We decided to analyse this semantic field in order to see whether the corpus showed a strong connection with words strictly related to Twitter. We found that Twitter-related words form the larger of the two groups, being present in 7.78% of the messages of the corpus, whereas the lexical field of the Internet is almost completely absent. The word frequency of this category, however, is higher than that of the previous group of words: the frequency of Twitter-related words in the corpus is v=0.80, while the most frequent word in this lexical field is RT, whose frequency is v=0.73.
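A figure such as "present in 7.78% of the messages" can be obtained by counting the messages that contain at least one word from a given group. The sketch below illustrates this under stated assumptions: the function share_of_messages and the tokenization rule are hypothetical, the toy messages are invented for illustration and are not taken from the corpus, and the word list contains only Twitter-related words named in this section. It does not compute the frequency v, whose definition is not given here.

```python
import re


def share_of_messages(messages, word_group):
    """Percentage of messages containing at least one word from word_group."""
    group = {w.lower() for w in word_group}
    hits = 0
    for message in messages:
        # Simplistic tokenization (an assumption): runs of letters and apostrophes.
        tokens = set(re.findall(r"[a-z']+", message.lower()))
        if tokens & group:
            hits += 1
    return 100.0 * hits / len(messages) if messages else 0.0


# Invented toy messages, for illustration only (not corpus data).
toy_messages = [
    "RT this if you agree",
    "please follow me back",
    "it's snowing today",
    "new tweet coming soon",
]
twitter_words = {"rt", "follow", "twitter", "tweet"}
print(share_of_messages(toy_messages, twitter_words))  # 75.0
```

Because each message is counted once regardless of how many group words it contains, the result is a share of messages and not a count of word occurrences.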

Other frequent Twitter-related words are, unsurprisingly, follow, Twitter, tweet, and other items referring to the functionalities of the social platform, such as the above-mentioned acronym RT (retweet), which is used to quote the message of another user. The fact that this semantic field consists mostly of words referring to the medium itself should not be surprising, as there are no other words belonging to the semantic field of Twitter: it is therefore a very restricted field, with roughly ten words, which can affect the texts only to a very limited extent.

A third and final feature which characterizes the lexis of our corpus is the presence of slang and swear words. These words are quite frequent in our corpus: if we add up the actual slang and swear words, we find that words of this kind are present in 6.29% of the tweets; the word frequency is therefore v=0.83. The most frequent word in this group is fuck, which occurs with a frequency of v=0.66.

As we have already said when we analysed the acronyms, offensive words often serve to reinforce an idea, but they can also be used to identify oneself with a certain group. In fact, as Crystal (1986) points out, swearwords can function as a marker of identity within a social group and of differentiation from another one: the more swearwords you use, the stronger your affirmation of solidarity with the group. […]

