Computational Linguistics and Lexicography: English Monolingual General Purpose Dictionaries Freely Available on the Internet

Monolingual general purpose dictionaries have been chosen as the subject of this thesis because of their special position in all societies, and in the English-speaking community in particular. Owing to the numerous influences on the language in different parts of the world, English probably has the richest vocabulary of all languages, but it has never been possible to establish a standard. Italy and France have academies to provide guidance and authority in linguistic questions; English speakers have turned to dictionaries instead. Dictionaries have been important for keeping track of the development of the language, especially since English gained the status of a global language. Cyberspace is a conceptual place where people from all around the world meet to collaborate; to communicate they need a common language, and that language is English. Many people use the Internet to learn English: they need valuable reference works, and free availability is essential, especially for those who live in the less wealthy parts of the world.

In order to evaluate the dictionaries found during meticulous Internet research carried out over an extended period of time, we have first laid the foundations for the methodology to be adopted. After a short explanation of the terms and a description of the main steps in the development of English and American lexicography in chapters 1 and 2, in chapter 3 we have studied the main principles of lexicography and analyzed in detail the structure of the printed monolingual dictionary. Particular attention was paid to the microstructure, especially to the composition of information within the entries, because online dictionaries should provide at least the same information as their printed counterparts.

In chapter 4 we have looked at the all-embracing effect that the advent of computers has had on dictionary making. Large computer corpora have enabled lexicographers to take a completely different view of the language: instead of relying only on their fine linguistic sensitivity, they now have quantitative evidence of how the language is actually used. Creating an adequate corpus is a very long and expensive process; consequently, encoding standards have been created that allow data to be exchanged between different systems so that several resources can be combined. In this chapter we have described the main markup languages: SGML, XML, HTML and XHTML.

The macrostructure of an online dictionary, described in chapter 5, is completely different from that of a printed one: it does not refer to an artifact, but rather to a virtual dictionary which exists only at a particular moment on the screen of a particular user, and which may be accessed in different ways. The main characteristic of hypertext is that it is not permanent: each user creates their own route to information and is active in acquiring knowledge. The information retrieval facility is a distinguishing feature of electronic dictionaries, and the possibility of adding multimedia information is another special feature of the electronic medium.

After outlining a detailed typology of online dictionaries in chapter 6, in chapter 7 we have examined how English monolingual general purpose dictionaries make use of the electronic medium in different ways and to different extents. We have identified two kinds of monolingual general purpose dictionaries online: digitized printed dictionaries and original electronic dictionaries. We have created a table of parameters for the uniform validation of both the form and the content of the dictionaries, and listed their main features in order to draw a conclusion about the usefulness of the English monolingual general purpose dictionaries freely available on the Internet.

The future of the publishing industry has frequently been questioned since the advent of the Internet. The financial side of the enterprise plays an important role in the development of original dictionaries, and has crucial implications for the future of dictionary making. The ideal situation would be close collaboration between Natural Language Processing researchers and lexicographers to produce electronic dictionaries with both high lexicographic quality and sophisticated computing functions.

Introduction

Since the 1940s the Information Technology [1] revolution has slowly affected the whole spectrum of everyday life worldwide. Hardware prices have continued to drop, hardware efficiency has been increasing exponentially, and more and more sophisticated software has been created. Gordon Moore, the co-founder of Intel [2], predicted in 1965 that the number of transistors the industry would be able to place on a computer chip would double every couple of years (Moore, 1965): this was called ‘Moore’s Law’, and it has operated relentlessly since then. But “the speed with which things are developing on the Internet is unlike anything we have experienced before” (Rundell, 1996).

The Internet [3] was born in the United States, originally for defence reasons: communication between military commands in case of a nuclear war was the priority. In the 1980s the Internet entered academic life to ease communication between US universities. Major research into the Information Technology sector was, again, carried out in the United States. As a consequence, English, especially in its American variety, became the universal language of Information Technology: in fact, the majority of current programming languages are based on the English language.

Since 1989 the World Wide Web has revolutionized communication between people all over the planet. Its creator Tim Berners-Lee had a dream of “a common information space in which we commu-

[1] For the terms written in italic, see the Glossary of terms on page 161. “Double quotation marks” indicate an exact shorter quotation from a work cited in the bibliography, while longer quotations will constitute a separate paragraph. ‘Single quotation marks’ indicate examples. An asterisk indicates a wrong form of a word or *fraze, apart from being the wildcard that substitutes any string of characters.
[2] Gordon Moore’s Home Page at Intel kits/bios/moore.htm
[3] A short introduction to the Internet can be found on uni/papers/www/www.htm. For more detailed information see htm

Degree thesis

Faculty: Foreign Languages and Literatures

Author: Jitka Horcickova

Length: 221 pages.


This thesis has received 1124 clicks since 21/04/2006.



Available in PDF; the thesis can be consulted in digital format only.