Working Paper Series
ISSN 1177-777X

MINING MEANING FROM WIKIPEDIA

Olena Medelyan, Catherine Legg, David Milne and Ian H. Witten

Working Paper: 11/2008
September 2008

© 2008 Olena Medelyan, Catherine Legg, David Milne and Ian H. Witten
Department of Computer Science
The University of Waikato
Private Bag 3105
Hamilton, New Zealand

____________________________________

Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval; using it for information extraction; and using it as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced. We also discuss the implications of this work for the long-awaited semantic web.

____________________________________

1. INTRODUCTION

Wikipedia requires little introduction or explanation. As everyone knows, it was launched in 2001 with the goal of building free encyclopedias in all languages. Today it is easily the largest and most widely used encyclopedia in existence.
Wikipedia has become something of a phenomenon among computer scientists as well as the general public. It represents a vast investment of freely given manual effort and judgment, and the last few years have seen a multitude of papers that apply it to a host of different problems. This paper provides the first comprehensive summary of this research (up to mid-2008), which we collect under the deliberately vague umbrella of mining meaning from Wikipedia. By meaning, we encompass everything from concepts, topics, and descriptions to facts, semantic relations, and ways of organizing information. Mining involves both gathering meaning into machine-readable structures (such as ontologies), and using it in areas like information retrieval and natural language processing.

Traditional approaches to mining meaning fall into two broad camps. On one side are carefully hand-crafted resources, such as thesauri and ontologies. These resources are generally of high quality, but by necessity are restricted in size and coverage. They rely on the input of experts, who cannot hope to keep abreast of the incalculable tide of new discoveries and topics that arise constantly. Even the most extensive manually created resource, the Cyc ontology, whose hundreds of contributors have toiled for 20 years, has limited size and patchy coverage [Sowa 2004]. The other option is to sacrifice quality for quantity and obtain knowledge by performing large-scale analysis of unstructured text. However, human language is rife with inconsistency, and our intuitive understanding of it cannot be entirely replicated in rules or trends, no matter how much data they are based upon. Approaches based on statistical inference might emulate human intelligence for specific tasks and in specific situations, but cracks appear when generalizing or moving into new domains and tasks.

Wikipedia provides a middle ground between these two camps (quality and quantity) by offering a rare mix of scale and structure.
With two million articles and thousands of contributors, Wikipedia dwarfs any other manually created resource by an order of magnitude in the number of concepts covered, has far greater potential for growth, and offers a wealth of further useful structural features. It contains around 18 Gb of text, and its extensive network of links, categories and infoboxes provides a variety of explicitly defined semantics that other corpora lack.

One must, however, keep Wikipedia in perspective. It does not always engender the same level of trust or expectations of quality as traditional resources, because its contributors are largely unknown and unqualified. It is also much smaller and less representative of all human language use than the web as a whole. Nevertheless, Wikipedia has received enthusiastic attention as a promising natural language and informational resource of unexpected quality and utility. Here we focus on research that makes use of Wikipedia, and as far as possible leave aside its controversial nature.

This paper is structured as follows. In the next section we describe Wikipedia's creation process and structure, and how it is viewed by computer scientists as anything from a corpus, taxonomy, thesaurus, or hierarchy of knowledge topics to a full-blown ontology. The next four sections describe different research applications. Section 3 explains how it is being drawn upon for natural language processing: understanding written text. In Section 4 we describe its applications for information retrieval: searching through documents, organizing them and answering questions. Section 5 focuses on information extraction: mining text for topics, relations and facts. Section 6 describes uses of Wikipedia for ontology building, and asks whether this adds up to Tim Berners-Lee's long-delayed vision of the semantic web. Section 7 documents the people and research groups involved, while Section 8 lists the resources they have produced, with URLs. The final section gives a brief overall summary.
2. WIKIPEDIA: A RESOURCE FOR MINING MEANING

Wikipedia, one of the most visited sites on the web, outstrips all other encyclopedias in size and coverage. Its English language articles alone are 10 times the size of the Encyclopedia Britannica, its nearest rival. But material in English constitutes only a quarter of Wikipedia; it has articles in 250 other languages as well. Co-founder Jimmy Wales is on record as saying that he aspires to distribute a free encyclopedia to every person on the planet, in their own language.

This section provides a general overview of Wikipedia, as background to our discussions in Sections 3 to 6. We begin with an insight into its unique editing methods, their benefits and challenges (Section 2.1), and then outline its key structural features, such as articles, hyperlinks and categories (Section 2.2). In Section 2.3 we identify some different roles that Wikipedia as a whole may usefully be regarded as playing; for instance, as well as an encyclopedia it can be viewed as a linguistic corpus. We conclude in Section 2.4 with some practical information on how to work with Wikipedia data.

2.1. The Encyclopedic Wisdom of Crowds

From its inception the Wikipedia project offered a unique, entirely open, collaborative editing process, scaffolded by then-new wiki software for group website building, and it is fascinating to see how the resource has flourished under this system. It has effectively enabled the entire world to become a panel of experts, authors and reviewers, contributing under their own name or, if they wish, anonymously.

In its early days the project attracted widespread skepticism. It was thought that its editing system was so anarchic that it would surely fill up with misconceptions, outright lies, vanity pieces and other worse-than-useless human output. A piece in The Onion satirical newspaper, "Wikipedia Celebrates 750 Years Of American Independence: Founding Fathers, Patriots, Mr. T. Honored,"¹ nicely captures this point of view.
Moreover, it was argued, surely the ability for anyone to make any change, on any page, entirely anonymously, would leave the resource ludicrously vulnerable to vandalism, particularly in articles that cover sensitive topics. What if the hard work of 200 people were erased by one eccentric? And indeed, "edit wars" did erupt, though it turned out that some of the most vicious raged over such apparently trivial topics as the ancestry of Freddie Mercury and the true spelling of yoghurt. Yet this turbulent experience was channeled into developing a set of ever more sophisticated Wikipedia policies and guidelines,² as well as a more subtle code of recommended good manners referred to as Wikiquette.³ A self-selecting set of administrators emerged, who performed regulatory functions such as blocking individuals from editing for periods of time; for instance edit warriors, identified by the fact that they "revert" an article more than three times in 24 hours. Interestingly, the development of these rules was guided by the goal of reaching consensus, just as the encyclopedia's content is.

¹ http://www.theonion.com/content/node/50902
² http://en.wikipedia.org/wiki/Wikipedia:Policies_and_guidelines
³ http://en.wikipedia.org/wiki/Wikipedia:WQT

Somehow these processes worked sufficiently to shepherd the resource through its growing pains, and today Wikipedia is wildly popular and growing all the time. Section 2.3.1 discusses its accuracy and trustworthiness as an encyclopedia.

There is still skepticism. For example, Magnus [2006], a philosopher, argues that Wikipedia does not enable him to use the methods he usually uses to "assess claims," such as relying on the reputation of the source, or assessing whether the claims are written in an appropriate style or have content that sounds plausible to him. However, these observations can be placed in the context of larger philosophical discussions about the nature of knowledge and truth, potentially challenging contemporary philosophical wisdom itself.
In many ways the history of the so-called "modern" period in Western culture (the 300 years or so since the Scientific Revolution) may be seen as the struggle to escape a medieval conception of knowledge as defined by some kind of stamp of approval conferred on human beliefs by a recognized authority. The key medieval authorities were the Bible and Aristotle, and although humanity now avails itself of many more sources of information, including scientific experiments, arguably universities still claim the same kind of authoritative role as validators of knowledge, in particular through the peer review process, which underpins what is published. The received wisdom is that surely some external source or body has to validate knowledge claims, or where would we be? Yet Wikipedia threatens to tear this function from the academy. Many scholars have noticed this, and some fight back, for instance by banning students from using it [Baker 2008].

Other models of knowledge have been offered, however, that cast Wikipedia's success in a new light. In the late 19th century the pragmatist Peirce proposed that beliefs be understood as knowledge due not to their prior justification, but to their usefulness, public character and future development. His account of knowledge was based on a unique account of truth, which claimed that true beliefs are those that all sincere participants in a "community of inquiry" would converge on, given enough time. Influential 20th century philosophers [e.g. Quine 1960] scoffed at this notion as being insufficiently objective. Yet Peirce claimed that there is a kind of person whose greatest passion is to render the Universe intelligible and who will freely give time to do so, and that over the long run, within a sufficiently broad community, the use of signs is intrinsically self-correcting [Peirce 1868]. Wikipedia can be seen as a fascinating and unanticipated concrete realization of these apparently wildly idealistic claims.
In this context it is interesting to note that Larry Sanger, Wikipedia co-founder and editor-in-chief, had his initial training as a philosopher, with a specialization in theory of knowledge. In public accounts of his work he has tried to bypass vexed philosophical discussions of truth by claiming that Wikipedians are not seeking it but rather a neutral point of view.⁴ But as the purpose of this is to support every reader being able to build their own opinion, it can be argued that somewhat paradoxically this is the fastest route to genuine consensus. Interestingly, however, he and the other co-founder Jimmy Wales eventually clashed over the issue of expert opinion's role in Wikipedia. Thus, in 2007 Sanger diverged to found a new public online encyclopedia, Citizendium,⁵ in an attempt to "do better" than Wikipedia, apparently reasserting validation by external authority, e.g. academics. Interestingly, although it is early days, Citizendium seems to lack Wikipedia's popularity and momentum.

Wikipedia's unique editing methods, and the issues that surround them, have complex implications for mining. First, unlike a traditional corpus, it is constantly growing and changing, so results obtained at any given time can become stale. Some research strives to measure the degree of difference between Wikipedia versions over time (though this is only useful insofar as Wikipedia's rate of change is itself constant), and assess the impact on common research tasks [e.g. Ponzetto and Strube 2007a]. Second, how are projects that incorporate Wikipedia data to be evaluated? If Wikipedia editors are the only people in the world who have been enthusiastic enough to write up certain topics (for instance, details of TV program plots), how is one to determine "ground truth" for evaluating applications that utilize this information? The third factor is more of an opportunity than a challenge.
The awe-inspiring abundance of manual labor given freely to Wikipedia raises the possibility of a new kind of research project, which would consist in encouraging Wikipedians themselves to perform certain tasks on the researchers' behalf (possibly tasks of a scale the researchers themselves could not hope to achieve). As we will see (for instance in Section 6), some have begun to glimpse this possibility, while others continue to view Wikipedia in more traditional "product" rather than "process" terms. At any rate, this research area sits on a fascinating interface between software and social engineering.

2.2. Wikipedia's structure

Traditional paper encyclopedias consist of articles arranged alphabetically, with internal cross-references to other relevant places in the encyclopedia, external references to the academic literature, and some kind of general index of topics. These structural features have been adapted by Wikipedia for the online environment, and some new features arising from the wiki editing process have been added. The statistics presented in this section were obtained from a version of English Wikipedia released in July 2008.

⁴ http://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view
⁵ http://en.citizendium.org

2.2.1. Articles: The basic unit of information in Wikipedia is the article. Internationally, Wikipedia contains 10M articles in its 250 different languages.⁶ The English version contains 2.4M articles (not counting redirects and disambiguation pages, which are discussed below). About 1.8M of these are bona fide articles with more than 30 words of descriptive text and at least one incoming link from elsewhere in Wikipedia. Articles are written in a form of free text that follows a comprehensive set of editorial and structural guidelines in order to promote consistency and cohesion. These are laid down in the Manual of Style,⁷ and include the following:

1. Each article describes a single concept, and there is a single article for each concept.
2. Article titles are succinct phrases that resemble terms in a conventional thesaurus.
3. Equivalent terms are linked to an article using redirects (Section 2.2.2).
4. Disambiguation pages present various possible meanings from which users can select an intended article (Section 2.2.3).
5. Articles begin with a brief overview of the topic, and the first sentence defines the entity and its type.
6. Articles contain hyperlinks that express relationships to other articles (Section 2.2.6).

⁶ http://en.wikipedia.org/wiki/Wikipedia
⁷ http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style

Figure 1. Wikipedia article on Library.

Figure 1 shows a typical article, entitled Library. The first sentence describes the concept:

    A library is a collection of information, sources, resources, and services: it is organized for use and maintained by a public body, an institution, or a private individual.

Here the article's title is the single word Library, but titles are often qualified by appending parenthetical expressions. For example, there are other articles entitled Library (computing), Library (electronics), and Library (biology). Wikipedia distinguishes capitalization when it is relevant: the article Optic nerve (the nerve) is distinguished from Optic Nerve (the comic book).

2.2.2. Redirects: A redirect page is one with no text other than a directive in the form of a redirect link. There are about a dozen for Library and just under three million in the entire English Wikipedia; they encode plural forms (libraries), technical terms (bibliotheca), common misspellings (libary), and other variants (reading room, book stack). The aim is to have a single article for each concept and define redirects to link equivalent terms to the article's preferred title. As we will see, this helps with mining because no external thesaurus is needed to resolve synonymy.

2.2.3. Disambiguation pages: Instead of taking readers to an article named by the term, as Library does, the Wikipedia search engine sometimes takes them directly to a special disambiguation page where they can click on the meaning they want. These pages are identified by invoking certain templates (discussed in Section 2.2.8) or assigning them to certain categories (Section 2.2.6), and often contain (disambiguation) in their title. The English Wikipedia contains 110,000 disambiguation pages. The first line of the Library article in Figure 1 ("For other uses ...") links to a disambiguation page that lists Library (computing), Library (electronics), Library (biology), and other senses of the term. Brief scope notes accompany each sense, to help users identify the correct one. For instance Library (computer science) is "a collection of subprograms used to develop software." The articles themselves serve as detailed scope notes. Disambiguation pages are helpful sources of information concerning homonyms.

2.2.5. Hyperlinks: Articles are peppered with hyperlinks to other articles: on average, about 25 of them. The English Wikipedia contains 60 million in total. They provide explanations of the topics being discussed and support an environment where serendipitous encounters with information are commonplace. Anyone who has browsed Wikipedia has likely experienced the feeling of being happily lost, browsing from one interesting topic to the next and encountering information that they would never have searched for.

Wikipedia's hyperlinks are also useful from a linguistic standpoint. They are an additional source of synonyms that are not captured by redirects, because the terms used as anchors are often couched in different words. Library, for example, is referenced by 20 different anchors including library, libraries, and biblioteca. They also complement disambiguation pages by encoding polysemy; library links to different articles depending on the context in which it is found.
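The way redirects and link anchors stand in for a thesaurus can be illustrated with a minimal sketch. It assumes the redirect table and the (anchor, target) pairs have already been extracted into memory; the toy data and the function names resolve, synonyms and senses are hypothetical and serve only to illustrate the idea, not to describe any particular system surveyed here.

```python
# Toy data (hypothetical); a real system would read these from a Wikipedia dump.
REDIRECTS = {
    "Libraries": "Library",
    "Bibliotheca": "Library",
    "Libary": "Library",
    "Reading room": "Library",
}

# One (anchor text, target title) pair per hyperlink occurrence.
LINKS = [
    ("library", "Library"),
    ("libraries", "Library"),
    ("biblioteca", "Library"),
    ("library", "Library (computing)"),
]


def resolve(title, redirects):
    """Follow redirects until a non-redirect title is reached (guards against loops)."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title


def synonyms(article, redirects, links):
    """Alternative names for an article: redirect titles plus link anchors."""
    names = {t for t in redirects if resolve(t, redirects) == article}
    names |= {a for a, target in links if resolve(target, redirects) == article}
    return names


def senses(anchor, redirects, links):
    """Articles that a given anchor text links to, i.e. its candidate senses."""
    return {resolve(target, redirects) for a, target in links if a == anchor}


if __name__ == "__main__":
    print(sorted(synonyms("Library", REDIRECTS, LINKS)))
    print(sorted(senses("library", REDIRECTS, LINKS)))
```

Redirect titles and anchors are pooled into one synonym set per article, while a single anchor may map to several articles, which is exactly the polysemy that disambiguation pages also record.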
Hyperlinks also give a sense of how well known each sense is: 84% of library links go to the article shown in Figure 1, while only 13% go to Library (computing). Furthermore, since hyperlinks in Wikipedia indicate that one article relates to another in some respect, this fundamental structure can be mined for meaning in many interesting ways; capturing the associative relations included in standard thesauri (Section 5.2) is just one example.

2.2.6. Category structure: Authors are encouraged to assign categories to their articles. For example, the article Library falls in the category Book Promotion. Authors are also encouraged to assign the categories themselves to other more general categories; Book Promotion belongs to Books, which in turn belongs to Written Communication. These categorizations, like the articles themselves, can be modified by anyone. There are almost 400,000 categories in the English Wikipedia, with an average of 19 articles and two subcategories each.

Categories are not themselves articles. They are merely nodes for organizing the articles they contain, with a minimum of explanatory text. Often (in about a third of cases), categories correspond to a concept that requires further description. In these cases they are paired with an article of the same name: the category Libraries is paired with the article Library, and Billionaires with Billionaire. Other categories, such as Libraries by country, have no corresponding articles and serve only to organize the content. For clarity, in this paper we indicate categories in the form Category:Books unless it is obvious that we are not talking about an article.

The goal of the category structure is to represent a hierarchy of information. It is not a simple tree-structured taxonomy, but a graph in which multiple organization schemes coexist. Thus both articles and categories can belong to more than one category.
The category Libraries belongs to four: Buildings and structures, Civil services, Culture and Library and information science. The overall structure approximates an acyclic directed graph; all relations are directional, and although cycles sometimes occur, they are uncommon. According to Wikipedia's own guidelines, cycles are generally discouraged but may be acceptable in rare cases. For example, Education is a field within Social Sciences, which is an Academic discipline, which belongs under Education. In other words, you can educate people about how to educate.

A relatively recent addition to the encyclopedia, and less visible than articles, the category structure is haphazard, redundant, incomplete, and inconsistent [Chernov et al. 2006; Muchnik et al. 2007]. Links represent a wide variety of types and strengths of relationships. Although there has been much cleanup and the greatest proportion of links now represent class membership (isa), there are still many representing physical parthood, geographical location and many other merely thematic associations between entities, as well as meta-categories used for editorial purposes, such as Disambiguation. Thus Category:Pork currently contains, among others, the categories Domestic Pig, Bacon Bits, Religious Restrictions on the Consumption of Pork, and Full Breakfast. We will see in Section 6 that there are opportunities for recruiting users to help with data cleaning. We will also see in Section 5 that the issues mentioned above have not prevented researchers from innovatively and fruitfully mining the category structure for a range of different purposes.

2.2.8. Templates and infoboxes: Templates are pages that are not used in isolation, but are instead invoked to add information to other pages in a reusable fashion. Wikipedia contains 174,000 different templates, which have been invoked 23 million times. They are commonly used to identify articles that require attention, e.g. if they are biased, poorly written, or lacking citations.
Templates can also define pages of different types, such as disambiguation pages or featured (high quality) articles. A common application is to provide navigational links, such as the for other uses link shown in Figure 1.

An infobox is a special type of template that displays factual information in a structured uniform format. Figure 2 shows one from the article on the Library of Congress. It was created by invoking the Infobox Library template and populating its fields, such as location and collection size. There are 8,000 different infobox templates that are used for anything from animal species to strategies for starting a game of chess, and the number is growing rapidly.

There are several simple ways in which the infobox structure could be improved. Standard representations for units would allow quantities to be extracted reliably. Different attribute names are often used for the same kind of content, and these could be unified. More far-reaching would be to associate data types with attribute values, and allow language and unit tags when information can be expressed in different ways (e.g. Euro and USD). Many Wikipedia articles use tables for structured information that would be better represented as templates [Auer and Lehmann 2007]. Despite these problems, it is surprising how much meaningful and machine-interpretable information can be extracted from Wikipedia templates. This is discussed further in Sections 5.3 and 6.6.

2.2.4. Discussion Pages: A discussion tab at the top of each article takes readers to its Talk page, representing a forum for discussions (often longer than the article itself) as to how it might be criticized, improved or extended in the future. For example, the talk page of the Library article, Talk:Library, contains the following observations, among many others:

    location?
    Libraries can also be found in churches, prisons, hotels etc. Should there be any mention of this? -- Daniel C. Boyer 20:38, 10 Nov 2003
    Libraries can be found in many places, and articles should be written and linked.
    A wiki article on libraries can never be more of a summary, and will always be expandable -- DG 04:18, 1 September 2006

There are talk pages for other aspects of Wikipedia's structure, such as templates and categories, as well as user talk pages that editors use to communicate with each other. These pages are a unique and interesting feature of Wikipedia not replicated in traditional encyclopedias. They have been mined for determining quality metrics of Wikipedia edits [Emigh and Herring 2005; Viégas et al. 2007] but have not yet been employed for any of the tasks discussed in this paper, perhaps because of their unstructured nature.

2.2.5. Edit histories: To the right of the discussion tab is a history tab that takes readers to each article's editing history. This contains the name or pseudonym of every editor, with the changes they made. From the revision history of Library we can see that this article was created on 9 November 2001 in the form of a short note (which, in fact, bears little relationship to the current version) and has been edited about 150 times since. Recent edits add new links and new entries to lists; indicate possible vandalism and its reversal; correct spelling mistakes; and so on.

Figure 2. Infobox for the Library of Congress.

Analyzing editing history is an interesting research area in its own right. For example, Viégas [2004] describes how history pages can be mined to discover collaboration patterns. Nelken and Yamangil [2008] discuss several ways of utilizing the unique properties of history pages as a corpus for extracting lexical errors called eggcorns, as well as phrases that can be dropped to compress sentences, a useful component of automatic text summarization.

It is natural to ask whether the content of individual articles converges in some semantic sense, staying stable despite continuing edits. Thomas and Amit [2007] call the information in a Wikipedia article "justified" if, after going through the community process of discussion, repeated editing, and so on, it has reached a stable state. They found that articles do, in general, become stable, but that it is difficult to predict where in its journey towards maturity a given article is at any point in time. They also point out that although information about an article's edit history might indicate its likely quality, mining systems invariably ignore it.

Table 1 breaks down the number of different pages and connections in the English version at the time of writing. There are almost 5.5 million pages in the section dedicated to articles. Most are redirects. Many others are disambiguation pages, lists (which group related articles but do not provide explanatory text themselves) and stubs (incomplete articles with fewer than 30 words or without any incoming link from elsewhere in Wikipedia). Removing all these leaves about 1.8 million bona-fide articles, each with an edit history and most with some content on their discussion page. The articles are organized into 400,000 different categories and augmented with 170,000 different templates. They are densely interlinked, with 62 million connections: an average of 25 incoming and 25 outgoing links for each article.

    Articles and related pages                 5,460,000
        redirects                              2,970,000
        disambiguation pages                     110,000
        lists and stubs                          620,000
        bona-fide articles                     1,760,000
    Categories                                   390,000
    Templates                                    174,000
        infoboxes                                  9,000
        other                                    165,000
    Links
        between articles                      62,000,000
        between category and subcategory         740,000
        between category and article           7,270,000

Table 1. Content of English Wikipedia.

2.3. Perspectives on Wikipedia

Wikipedia is a rich resource with several different broad functionalities. We will see in subsequent sections that researchers have developed sophisticated mining techniques with which they can identify, isolate and utilize these different perspectives. Here we introduce the most important examples.
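Before turning to these perspectives, the category structure of Section 2.2.6 (a directed graph in which nodes may have several parents and cycles occasionally occur) can be made concrete with a small sketch. The parent table below contains only the examples mentioned in the text; the names PARENTS, ancestors and in_cycle are hypothetical, and a real category graph would of course be loaded from a dump rather than typed in.

```python
# Hypothetical child -> parent-category table built from the examples in the text.
PARENTS = {
    "Libraries": [
        "Buildings and structures",
        "Civil services",
        "Culture",
        "Library and information science",
    ],
    "Book promotion": ["Books"],
    "Books": ["Written communication"],
    # The cycle described in the text.
    "Education": ["Social sciences"],
    "Social sciences": ["Academic disciplines"],
    "Academic disciplines": ["Education"],
}


def ancestors(category, parents):
    """All categories reachable by repeatedly following parent links."""
    found = set()
    stack = list(parents.get(category, []))
    while stack:
        current = stack.pop()
        if current not in found:  # visiting each node once keeps cycles from looping forever
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def in_cycle(category, parents):
    """A category lies on a cycle if it is one of its own ancestors."""
    return category in ancestors(category, parents)


if __name__ == "__main__":
    print(sorted(ancestors("Book promotion", PARENTS)))
    print(in_cycle("Education", PARENTS), in_cycle("Libraries", PARENTS))
```

Tracking visited nodes is the only concession needed to cope with the occasional cycle; otherwise the traversal is the same as for a strict hierarchy.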
2.3.1. Wikipedia as an encyclopedia: The first and most obvious usage for Wikipedia is exactly what it was intended as: an encyclopedia. Ironically, this is the very application that has generated most doubt and cynicism. As noted above, the open editing policy has led many to doubt its authority. Denning et al. [2005] provide a good review of early concerns. They conclude that, while Wikipedia is an interesting example of large-scale collaboration, its use as an information source is risky. Their core argument is the lack of formal expert review procedures, which gives rise to two key issues: accuracy within articles, and bias of coverage across them.

Accuracy within articles is investigated by Giles [2005], who compares randomly selected scientific Wikipedia articles with their equivalent entries in Encyclopedia Britannica. Both sources were equally prone to significant errors, such as misinterpretation of important concepts. More subtle errors, however, such as omissions or misleading statements, were more common in Wikipedia. In the 41 articles reviewed there were 162 mistakes in Wikipedia versus 123 for Britannica. Britannica Inc. attacked Giles' study as "fatally flawed"⁸ and demanded a retraction; Nature defended itself and declined to retract.⁹ Ironically, while Britannica's part in the debate has been polemical and plainly biased, Wikipedia provides objective coverage of the controversy in its article on Encyclopedia Britannica.

Several authors have developed metrics that evaluate the quality of Wikipedia articles based on such features as number of authors, number of edits, internal and external linking, and article size, e.g. Lih [2004] and Wilkinson and Huberman [2007]; article stability, e.g. Dondio et al. [2006]; and the amount of conflict an article generates, e.g. Kittur [2007]. Emigh and Herring [2005] perform a genre analysis on Wikipedia using corpus linguistic methods to determine "features of formality and informality," and claim that its degree of post-production editorial control produces entries as standardized as those in traditional print encyclopedias. Viégas et al. [2007] claim that overall coordination and organization, one of the fastest growing areas of Wikipedia, ensures great resilience to malicious editing despite high traffic; they highlight in particular the role played by discussion pages.

⁸ http://www.corporate.britannica.com/britannica_nature_response.pdf
⁹ http://www.nature.com/press_releases/Britannica_response.pdf

So much for accuracy. A second issue is bias of coverage. Wikipedia is edited by volunteers, who naturally apply more effort to describing topics that pique their interest. For example, there are 60 different articles dedicated to The Simpsons cartoon. In contrast, there are half as many pages about the namesake of the cartoon's main character, the Greek poet Homer, and all the literary works he created and inspired. Lih [2004] shows that Wikipedia's content, and therefore bias, is also driven to a large extent by the press. Milne et al. [2006] identify a bias towards concepts that are general or introductory, and therefore more relevant to "everyman."

2.3.2. Wikipedia as corpus: Large text collections are useful for creating language models that capture particular characteristics of language use. For example, the language in which a text is written can be determined by analyzing the statistical distribution of the letter n-grams it contains [Cavnar and Trenkle 1994], whereas word co-occurrence statistics are helpful in tasks like spelling correction [Mays et al. 1991]. Aligned text corpora in different languages are extremely useful in machine translation [Brown et al. 1993]. Extensive coverage and high quality of the corpus are crucial criteria in the success of such applications. While the web has enabled rapid acquisition of large text corpora, their quality leaves much to be desired, due to spamming and the varying format of websites.
In particular, manually annotated corpora and aligned multilingual corpora are still rare and in high demand.

Wikipedia provides a plethora of well-written and well-formulated articles (several gigabytes in the English version alone) that can easily be separated from other parts of the website. The Simple Wikipedia is significantly smaller, but its articles are written for non-native speakers of English and do not contain complex sentences. This makes automatic linguistic processing easier, and some researchers focus on Simple Wikipedia for their experiments [Ruiz-Casado et al. 2005; Toral and Muñoz 2006]. Many researchers take advantage of the large number of definitions in Wikipedia for question answering (Section 4.3) and automatic extraction of semantic relations (Section 5.1). Section 2.2.5 mentions how Wikipedia history pages can be used as a corpus for training text summarization algorithms, as well as for determining the quality of the articles themselves.

Wikipedia also contains annotations in the form of targeted hyperlinks. Consider the following two sentences from the article about the Formula One team named McLaren.

1. The [Kiwi (people)|Kiwi] made the team's Grand Prix debut at the 1966 Monaco race.
2. Original McLaren [Kiwi|kiwi] logo; a New Zealand icon.

In the first case the word kiwi links to Kiwi (people); in the second, to Kiwi, the article describing the bird. This mark-up is nothing more or less than word sense annotation. Mihalcea [2007] shows that Wikipedia is a full-fledged alternative to manually sense-tagged corpora. Section 3.2 discusses research that makes use of these annotations for word sense disambiguation and computing the semantic similarity between words.

Although the exploration of Wikipedia as a source of multilingual aligned corpora has only just begun, its links between descriptions of concepts in different languages have been exploited for cross-language question answering [Ferrández et al. 2007] and automatic generation of bilingual dictionaries [Erdmann et al. 2008].
This is further discussed in Section 3.4, while Section 4.3 investigates Wikipedia's potential for multilingual information retrieval.

2.3.3 Wikipedia as a thesaurus: There are many similarities between the structure of traditional thesauri and the ways in which Wikipedia organizes its content. As noted, each article describes a single concept, and its title is a succinct, well-formed phrase that resembles a term in a conventional thesaurus. If article names correspond to manually defined terms, links between them correspond to relations between terms, the building blocks of thesauri. The international standard for thesauri (ISO 2788) specifies four kinds of relation:

• Equivalence: USE, with inverse form USE FOR
• Hierarchical: broader term (BT), with inverse form narrower term (NT)
• Any other kind of semantic relation (RT, for related term).

Wikipedia redirects provide precisely the information expressed in the equivalence relation. As noted, they are a powerful way of dealing with word variations such as abbreviations, equivalent expressions and synonyms. The hierarchical relations (broader and narrower terms) are reflected in Wikipedia's category structure. Hyperlinks between articles capture other kinds of semantic relation. (Restricting consideration to mutual cross-links eliminates many of the more tenuous associations.)

As we will see, researchers compare Wikipedia with manually created domain-specific thesauri and augment them with knowledge from it (Section 3.2.3). Redirects turn out to be very accurate and can safely be added to existing thesauri without further checking. Wikipedia also has the potential to contribute new topics and concepts, and can be used as a source of suggestions for thesaurus maintenance. Manual creation of scope notes is a labor-intensive aspect of traditional thesauri. Instead, the first paragraph of a Wikipedia article can be extracted as a description of the topic, backed up by the full article should more explanation be required.
Finally, Wikipedia's multilingual nature allows thesauri to be translated into other languages.

2.3.4. Wikipedia as a database: Wikipedia contains a massive amount of highly structured information. Several projects (notably DBpedia, discussed in Sections 5.2 and 6.6) extract this and store it in formats accessible to database applications. The aim is two-fold: to allow users to pose database-style queries against datasets derived from Wikipedia, and to facilitate linkage with other datasets on the web. Some projects even aim to extract database-style facts directly from the text of Wikipedia articles, rather than from infoboxes. Furthermore, disambiguation and redirect pages can be turned into a relational database that contains tables for terms, concepts, term-concept relationships and concept relationships [Gregorowicz and Kramer 2006].

Another idea is to bootstrap fact extraction from articles by using the content of infoboxes as training data and applying machine learning techniques to extract even more infobox-style information from the text of other articles. This allows infoboxes to be generated for articles that do not yet have them [Wu and Weld 2007]. Related techniques can be used to clean up the underlying infobox data structure, with its proliferation of individual templates.

2.3.5 Wikipedia as an ontology: Articles can be viewed as ontology elements, for which the URIs of Wikipedia entries serve as surprisingly reliable identifiers [Hepp et al. 2006]. Of course, true ontologies also require concept nodes to be connected by informative relations, and in Section 6 we will see researchers mine such relations in a host of innovative ways from Wikipedia's structure, including redirects, hyperlinks (both incoming and outgoing, as well as the anchor text), category links, category names and infoboxes, and even raw text, as well as experimenting with adding relations to and from other resources such as WordNet and Cyc.
From this viewpoint Wikipedia is arguably by far the largest living ontological structure available today, with its distinctive Wiki technology serving as a large-scale collaborative ontology development environment. Some researchers are beginning to mix traditional mining techniques with possibly more far-sighted attempts to encourage Wikipedia editors themselves in directions that might bear ontological fruit.

2.3.6 Wikipedia as a network structure: Wikipedia can be viewed as a hyperlinked structure of web pages, a microcosm of the web. Standard methods of analyzing the network structure can then be applied [Bellomi and Bonato 2005]. The two most prominent techniques used for web analysis are PageRank, which underpins Google's success [Brin and Page 1998], and the HITS algorithm [Kleinberg 1998]. Bellomi and Bonato [2005] applied both of these to Wikipedia and discerned some interesting underlying cultural biases (as of April 2005). These authors conclude that PageRank and HITS seem to identify different kinds of information. They report that according to the HITS authority metric, space (in the form of political geography) and time (in the form of both time spans and landmark events) are the primary organizing categories for Wikipedia articles. Within these, information tends to be organized around famous people, common words, animals, ethnic groups, political and social institutions, and abstract concepts such as music, philosophy, and religion.

In contrast, the most important articles according to PageRank include an overwhelming number of concepts tightly related to religion. For example, Pope, God and Priest were the highest-ranking nouns, as compared to Television, Scientific classification, and Animal for HITS. They found that PageRank seemed to transcend recent political events to give a wider historical and cultural perspective in weighting geographic entities.
It also tends to bring out a global rather than a Western perspective, both for countries and cities and for historical events. HITS reveals a strong bias towards recent political leaders, whereas people with high PageRank scores tend to be ones with an impact on religion, philosophy and society. It would be interesting to see how these trends have evolved in the three years since the publication of this work.

An alternative to PageRank and HITS is the Green method [Duffy 2001], which Ollivier and Senellart [2007] applied to Wikipedia's hyperlink network structure in order to find related articles. This method, which is based on Markov Chain theory, is related to the topic-sensitive version of PageRank introduced by Haveliwala [2003]. Given a target article, one way of finding related articles is to look at nodes with high PageRank in its immediate neighborhood. For this a topic-sensitive measure like Green's is more appropriate than the global PageRank.

The Wikipedia category graph also forms a network structure. Zesch and Gurevych [2007] showed that it is a scale-free, small-world graph, like other semantic networks such as WordNet. They adapted WordNet-based measures of semantic relatedness to use the Wikipedia category graph instead, and found that they work well, at least for nouns. They suggest that this, coupled with Wikipedia's multilingual nature, may enable natural language processing algorithms to be transferred to languages that lack well-developed semantic WordNets.

2.4. Obtaining Wikipedia data

Wikipedia is based on the MediaWiki software. As an open source project, its entire content is easily obtainable. It is available in the form of large XML files and database dumps that are released sporadically, from several days to several weeks apart.¹⁰ The full content (without revision history or images) of the English version of Wikipedia occupies 18 Gb of uncompressed data at the time of writing.
There are several tools for extracting information from these files, which are discussed in Section 7.

¹⁰ http://download.wikimedia.org/wikipedia

Instead of obtaining the database directly, specialized web crawlers have been developed to download the entire content of Wikipedia. Bellomi and Bonato [2005] scanned the All pages index section, which contains a complete list of the pages exposed on the website. Pages that do not contain a regular article were identified by testing for specific patterns in the URL, and discarded. Wikipedia's administrators prefer the use of the database dumps, however, to minimize the strain placed on their services.

3 SOLVING NATURAL LANGUAGE PROCESSING TASKS

Natural language processing applications fall into two major groups: (i) those relying on symbolic methods, where the system utilizes a manually encoded repository of human language, and (ii) statistical methods, which infer properties of language by processing large text corpora. The problem with the former is a dearth of high-quality knowledge bases. Even the lexical database WordNet, which, as the largest of its kind, receives substantial attention [Fellbaum 1998], has been criticized for low coverage (particularly of proper names) and high sense proliferation [Mihalcea and Moldovan 2001; Ponzetto and Strube 2007a]. Initial enthusiasm for statistical methods faded somewhat once they hit an upper performance bound that is hard to improve upon unless they are combined with symbolic elements [Klavans and Resnik 1996].

Several research groups simultaneously discovered Wikipedia as an alternative to WordNet. Direct comparison of their performance on the same task has shown that Wikipedia can be employed in a similar way and significantly outperforms WordNet on various tasks [Strube and Ponzetto 2006].
This section describes research in the four language processing tasks to which Wikipedia has been successfully applied: semantic relatedness (Section 3.1), word sense disambiguation (Section 3.2), co-reference resolution (Section 3.3) and multilingual alignment (Section 3.4).

3.1 Semantic relatedness

Semantic relatedness quantifies the similarity between two concepts, e.g. doctor and hospital. Budanitsky and Hirst [2001] differentiate between semantic similarity, where only predefined taxonomic relations are used to compute similarity, and semantic relatedness, where other relations like has-part and is-made-of are used as well. Semantic relatedness can also be quantified by statistical methods without requiring a manually encoded taxonomy, for example by analyzing term co-occurrence in a large corpus [Resnik 1995; Jiang and Conrath 1997].

To evaluate automatic methods for estimating semantic relatedness, the correlation coefficient between machine-assigned scores and those assigned by human judges is computed. Three standard datasets are available for evaluation:

• Miller and Charles' [1991] list of 30 noun pairs, which we denote by M&C;
• Rubenstein and Goodenough's [1965] 65 synonymous word pairs, R&G;
• Finkelstein et al.'s [2002] collection of 353 word pairs (WordSimilarity-353), WS-353.

The best pre-Wikipedia result for the first set was a correlation of 0.86, achieved by Jiang and Conrath [1997] using a combination of statistical measures and taxonomic analysis derived from WordNet. For the third, Finkelstein et al. [2002] achieved 0.56 correlation using Latent Semantic Analysis.

The discovery of Wikipedia began a new era of competition. Strube and Ponzetto [2006] and Ponzetto and Strube [2007a] re-calculated several measures developed for WordNet using Wikipedia's category structure. The best performing metric on most datasets was Leacock and Chodorow's [1998] normalized path measure:

$$ lch(c_1, c_2) = -\log \frac{length(c_1, c_2)}{2D}, $$

where length is the number of nodes on the shortest path between nodes $c_1$ and $c_2$, and $D$ is the maximum depth of the taxonomy. WordNet-based measures outperform Wikipedia-based ones on the small datasets M&C and R&G, but on WS-353 Wikipedia wins by a large margin. Combining similarity evidence from Wikipedia and WordNet, using an SVM to learn relatedness from the training data, yielded the highest correlation score of 0.62 on a designated "testing" subset of WS-353.

Strube and Ponzetto remark that WordNet's sense proliferation was responsible for its poor performance on WS-353. For example, when computing the relatedness of jaguar and stock, the latter is interpreted in the sense of animals kept for use or profit rather than in the sense of market, which people find more intuitive. WordNet's fine sense granularity has also been criticized in word sense disambiguation (Section 3.2.1). The overall conclusion is that Wikipedia can serve AI applications in the same way as hand-crafted knowledge resources.

Zesch et al. [2007] perform similar experiments with the German Wikipedia, which they compare to GermaNet on three datasets including the translated M&C. The performance of Wikipedia-based measures was inconsistent, and, like Strube and Ponzetto [2006], they obtained best results by combining evidence from GermaNet and Wikipedia.

Ponzetto and Strube [2007a] investigate whether performance on Wikipedia-based relatedness measures changes as Wikipedia grows. After comparing February 2006, September 2006 and May 2007 versions they conclude that the relatedness measure is robust. There was no improvement, probably because new articles were unrelated to all words in the evaluation datasets. A Java API is available for those wishing to experiment with these techniques [Ponzetto and Strube 2007c].¹¹

Gabrilovich and Markovitch [2007] develop Explicit Semantic Analysis (ESA) as an alternative to the well-known Latent Semantic Analysis.
They use a centroid-based classifier to map input text to a vector of weighted Wikipedia articles. For example, for Bank of Amazon the vector contains Amazon River, Amazon Basin, Amazon Rainforest, Amazon.com, Rainforest, Atlantic Ocean, Brazil, etc. To compute semantic relatedness between two terms, they compute the cosine similarity of their vectors. This significantly outperforms Latent Semantic Analysis on WS-353, with an average correlation of 0.75. With the same technique, the Open Directory Project¹² achieves a 0.65 correlation, indicating that Wikipedia's quality is greater. The mapping developed in this work has been successfully utilized for text categorization (Section 4.4).

While Gabrilovich and Markovitch [2007] use the full text of Wikipedia articles to establish relatedness between terms, Milne [2007] analyses just the internal hyperlinks that appear, arguing that Wikipedia's link structure bears much significant information about concepts. To compute the relatedness between two terms, they are first mapped to corresponding Wikipedia articles, and then vectors are created containing the links to other Wikipedia articles that occur in these articles. For example, a sentence like Bank of America is the largest commercial [bank] in the [United States] by both [deposits] and [market capitalization] contributes four links to the vector. Each link is weighted by the inverse number of times it is linked from other Wikipedia articles: the less common the link, the higher its weight. For example, market capitalization receives higher weight than United States and thus contributes more to the semantic relatedness.

Disambiguation is a serious challenge for this technique. Strube and Ponzetto [2006] choose the most likely meaning from the order in which entries occur in Wikipedia's disambiguation pages; Gabrilovich and Markovitch [2007] avoid disambiguation entirely by simultaneously associating a term with several Wikipedia articles. However, Milne's [2007] approach hinges upon correct mapping of terms to Wikipedia articles.
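The link-vector comparison attributed to Milne [2007] above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the text says that each link is weighted by the inverse of how often its target is linked to, but gives no formula, so the logarithmic weight, the use of cosine similarity to compare the vectors, the function names and all of the counts below are hypothetical.

```python
import math
from collections import Counter


def link_vector(out_links, inlink_counts, total_articles):
    """Build a weighted vector from the links found in one article.

    Assumed weighting: occurrences * log(total_articles / inlinks(target)),
    so rarely linked targets weigh more than ubiquitous ones.
    """
    counts = Counter(out_links)
    return {
        target: n * math.log(total_articles / inlink_counts[target])
        for target, n in counts.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical counts, invented purely for illustration.
TOTAL_ARTICLES = 2_000_000
INLINK_COUNTS = {
    "United States": 400_000,
    "Market capitalization": 1_500,
    "Bank": 6_000,
    "Stock exchange": 3_000,
}

bank_of_america = link_vector(
    ["Bank", "United States", "Market capitalization"],
    INLINK_COUNTS, TOTAL_ARTICLES)
nasdaq = link_vector(
    ["Stock exchange", "United States", "Market capitalization"],
    INLINK_COUNTS, TOTAL_ARTICLES)

relatedness = cosine(bank_of_america, nasdaq)
```

With these invented counts the rarely linked Market capitalization dominates both vectors, while the ubiquitous United States contributes little, which mirrors the behaviour described in the text.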
When terms are manually disambiguated, a correlation of 0.72 is achieved for WS-353. Automatic disambiguation that simply selects whatever meaning produces the greatest similarity score achieves only 0.45, showing that unlikely senses often produce greater similarity than common ones.

Milne and Witten [2008a] disambiguate term mappings automatically using three features. One is the conditional probability of the sense given the term, according to the Wikipedia corpus (discussed further in Section 3.2.1). For example, the term leopard most often links to the animal description rather than the eponymous Mac operating system. They also analyze how commonly two terms appear in Wikipedia as a collocation. Finally, they replace the vector-based similarity metric described above by a measure inspired by Cilibrasi and Vitanyi's [2002] Normalized Google Distance, which is based on term occurrences in web pages, but using Wikipedia's links rather than Google's search results. The semantic similarity of two terms is determined by the sum of these three values: conditional probability, collocation and similarity. This technique achieves 0.69 correlation with human judgments on WS-353, not far off Gabrilovich and Markovitch's [2007] figure for ESA. However, it is far less computationally intensive because only links are analyzed, not the entire Wikipedia text. Further analysis of the results shows that performance is even higher on terms that are well defined in Wikipedia.

¹¹ http://www.eml-r.org/english/research/nlp/download/jwordnetsimilarity.php

Table 2 summarizes the results of the similarity metrics that we have described, using the same datasets and evaluation technique. ESA is best, with WLM not far behind and WikiRelate the lowest. The astonishingly high correlation with human performance that these techniques obtain was well out of reach in pre-Wikipedia days.
This is an important advance because, as we will see when discussing information retrieval and extraction, automatic computation of semantic similarity helps with many natural language processing tasks.

3.2 Word sense disambiguation

Techniques for word sense disambiguation (i.e., resolving polysemy) use a dictionary or thesaurus that defines the inventory of possible senses [Ide and Veronis 1998]. Wikipedia provides an alternative resource. Each article describes a concept that is a possible sense for words and phrases that denote it, whether by redirection, via a disambiguation page, or as anchor text that links to the article.

The terms to be disambiguated may appear either in plain text or in an existing knowledge base (thesaurus or ontology). The former situation is more complex because the context is less clearly defined. Consider the example in Figure 3. Even human readers cannot be sure of the intended meaning of wood from the sentence alone, but a diagram showing semantically related words in WordNet puts it into context and makes it clear that the meaning is the trees and other plants in a large densely wooded area, rather than the hard fibrous lignified substance under the bark of trees. This highlights the main idea behind disambiguation: identify the context and analyze which of the possible senses fits it best.

¹² http://www.dmoz.org

We first cover techniques for disambiguating phrases in text to Wikipedia articles, then examine the important special case of named entities, and finally show how disambiguation is used to map manually created knowledge structures to Wikipedia.

3.2.1. Disambiguating phrases in running text: Discovering the intended senses of words and phrases is an essential stage in every natural language application, otherwise full "understanding" cannot be claimed. WordNet is a popular resource for word sense disambiguation, but success has been mixed [Voorhees 1998].
One reason is that the task is demanding because "linguistic [disambiguation] techniques must be essentially perfect to help" [Voorhees 1998]; another is that WordNet defines word senses with such fine granularity that even human annotators struggle to differentiate them [Edmonds and Kilgarriff 1998]. The two are related, because fine sense granularity makes disambiguation more difficult. In contrast, Wikipedia defines only those senses on which its contributors reach consensus, and includes an extensive description of each one rather than WordNet's brief gloss. Substantial advances have been made since it was discovered as a resource for disambiguation.

Mihalcea [2007] uses Wikipedia articles as a source of sense-tagged text to form a training corpus for supervised disambiguation, following the evaluation methodology developed by SIGLEX, the Association for Computational Linguistics' Special Interest Group on the Lexicon.¹³ For each example word, its occurrences as link anchors in Wikipedia are collected. For example, the term bar is linked to bar (establishment) and bar (music), each of which corresponds to a WordNet synset, that is, a set of synonymous terms representing a particular meaning of bar. The results show that a machine learning approach trained on Wikipedia sentences in which both meanings of bar occur clearly outperforms two simple baselines.

Method                                  M&C    R&G    WS-353
WordNet [Strube and Ponzetto, 2006]     0.82   0.86   full: 0.36  test: 0.38
WikiRelate! [Ponzetto and Strube, 2007] 0.49   0.5    full: 0.49  test: 0.62
ESA [Gabrilovich and Markovitch, 2007]  0.73   0.82   0.75
WLVM [Milne, 2007]                      n/a    n/a    man: 0.72   auto: 0.45
WLM [Milne and Witten, 2008]            0.70   0.64   0.69

Table 2. Overview of semantic relatedness methods.

This work uses Wikipedia solely as a resource to disambiguate words or phrases into WordNet synsets. Mihalcea and Csomai [2007] go further, using Wikipedia's content as a sense inventory in its own right.
They disambiguate terms (words or phrases) that appear in plain text to Wikipedia articles, concentrating exclusively on "important" concepts. They call this process wikification because it simulates how Wikipedia authors manually insert hyperlinks when writing articles. There are two stages: extraction and disambiguation. In the first, terms that are judged important enough to be highlighted as links are identified in the text. Only terms occurring at least five times in Wikipedia are considered, and the likelihood of a term being a hyperlink is estimated by expressing the number of articles in which a given word or phrase appears as anchor text as a proportion of the total number of articles in which it appears. All terms whose likelihood exceeds a predefined threshold are chosen, which yields an F-measure of 55% on a subset of manually annotated Wikipedia articles.

In the second stage these terms are disambiguated to Wikipedia articles that capture the intended sense. For example, in the sentence Jenga is a popular beer in the bars of Thailand the term bar corresponds to the bar (establishment) article. Given a term, those articles for which it is used as anchor text in Wikipedia are candidate senses. Best results are achieved by a machine learning approach in which Wikipedia's already-annotated articles serve as training data. Features (such as the part-of-speech tag, the local context of three words to the left and right, and their part-of-speech tags) are computed for each ambiguous term that appears as anchor text of a hyperlink. A Naïve Bayes classifier is then applied to disambiguate unseen terms. Csomai and Mihalcea [2007] report an F-measure of 87.7% on 6,500 examples, and go on to demonstrate that linking educational material to Wikipedia articles in this manner improves the quality of knowledge that people acquire when reading the material, and decreases the time taken.

¹³ http://www.senseval.org

Figure 3. What is the meaning of wood in both examples? (Example sentence shown in the figure: "He could see wood around the house.")
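The extraction stage of wikification described above reduces to a ratio and a filter, which the following Python sketch illustrates. It is not Mihalcea and Csomai's implementation: the function names, the layout of the statistics dictionary, the threshold value of 0.05 and all of the counts are hypothetical, since the text specifies only "a predefined threshold" and a minimum of five occurrences.

```python
def link_probability(articles_as_anchor, articles_containing):
    """Fraction of the articles containing a term in which it is anchor text."""
    if articles_containing == 0:
        return 0.0
    return articles_as_anchor / articles_containing


def select_link_candidates(term_stats, threshold, min_occurrences=5):
    """Return (term, probability) pairs worth linking, most likely first.

    term_stats maps a term to a dict with three hypothetical fields:
    'occurrences' (total occurrences in Wikipedia), 'as_anchor' and
    'containing' (both article counts).
    """
    selected = []
    for term, stats in term_stats.items():
        if stats["occurrences"] < min_occurrences:
            continue  # too rare to estimate a reliable probability
        probability = link_probability(stats["as_anchor"], stats["containing"])
        if probability > threshold:
            selected.append((term, probability))
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Hypothetical statistics, invented purely for illustration.
TERM_STATS = {
    "Jenga": {"occurrences": 40, "as_anchor": 12, "containing": 30},
    "Thailand": {"occurrences": 50_000, "as_anchor": 6_000, "containing": 30_000},
    "popular": {"occurrences": 90_000, "as_anchor": 50, "containing": 60_000},
    "zyzzyva": {"occurrences": 3, "as_anchor": 3, "containing": 3},
}

candidates = select_link_candidates(TERM_STATS, threshold=0.05)
```

In this toy data the common word popular is rejected because it is almost never used as anchor text, and zyzzyva is rejected because it occurs fewer than five times, even though every one of its occurrences is a link.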
In a parallel development, Wang et al. [2007] use a fixed-length window to identify terms in a document that match the titles of Wikipedia articles, eliminating matches subsumed by longer ones. They disambiguate the matches using two methods. One works on a document basis, seeking those articles that are most similar to the original document according to the standard cosine metric between TF×IDF-weighted word frequency vectors. The second works on a sentence basis, computing the shortest distance between the candidate articles for a given ambiguous term and articles corresponding to any non-ambiguous terms that appear in the same sentence. The distance metric is 1 if the two articles link to each other; otherwise it is the number of nodes along the shortest path between two Wikipedia categories to which they belong, normalized by the maximum depth of the category taxonomy. The result is the average of the two techniques (if no unambiguous articles are available, the similarity technique is applied by itself). Wang et al. do not compare this method to other disambiguation techniques directly. They do, however, report the performance of text categorization before and after synonyms and hyponyms of matching Wikipedia articles, and their related terms, were added to the documents. The findings were mixed, and somewhat negative.

Medelyan et al. [2008] use Mihalcea and Csomai's [2007] wikification strategy with a different disambiguation technique. Document terms with just one match are unambiguous, and their corresponding articles are collected and used as "context articles" to disambiguate the remaining terms. This is done by determining the average semantic similarity of each candidate article to all context articles identified for the document. The semantic similarity of a pair of articles is obtained from their incoming links as described by Milne and Witten [2008a] (see Section 3.1).
Account is also taken of the conditional probability of a sense given the term, according to the Wikipedia corpus (proposed by Mihalcea and Csomai [2007] for a baseline). For example, the term jaguar links to the article Jaguar cars in 466 out of 927 cases, thus its conditional probability is 0.5. The resulting mapping is the one with the largest product of semantic similarity and conditional probability. This achieves an F-measure of 93% on 17,500 mappings in manually annotated Wikipedia articles.

Milne and Witten [2008b] extend this approach using machine learning. Rather than extracting terms and then disambiguating them, they allow a term's possible mappings to influence whether it should be adjudged an important concept for the document. The conditional probability of a mapping, its semantic similarity to other context articles, and other features are combined in a machine learning classifier, bagged decision trees, which determines a probability figure for each mapping. More than one Wikipedia article can be chosen for a given document term, which improves recall at the expense of a slight decrease in precision, raising the F-measure from 93% to 97% on the same data.

3.2.2. Disambiguating named entities: Phrases referring to named entities, which are proper nouns such as geographical and personal names, and titles of books, songs and movies, constitute the largest part of our vocabulary. Wikipedia is recognized as the largest available resource of such entities. It has become a platform for discussing current news, and contributors put issues into encyclopedic context by relating them to historical events, geographic locations and significant personages, thereby increasing the coverage of named entities. Here we describe three approaches that focus specifically on linking named entities appearing in text or in search queries to corresponding Wikipedia articles. Techniques for recognizing named entities in Wikipedia itself are summarized in Section 5.3.
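Before turning to the individual named-entity systems, the context-based rule of Medelyan et al. [2008] from Section 3.2.1 can be made concrete with a small Python sketch. It is a sketch under stated assumptions rather than the authors' code: the relatedness function applies the published Normalized Google Distance formula to sets of incoming links and converts it to a similarity by subtracting from one, which is our reading of "inspired by"; the fallback to the prior when there are no context articles is our own addition; and the function names, link sets and candidate probabilities are hypothetical.

```python
import math


def link_relatedness(inlinks_a, inlinks_b, total_articles):
    """Similarity of two articles from their sets of incoming links.

    Assumed form: 1 - NGD, with page-hit counts replaced by link-set sizes.
    """
    common = len(inlinks_a & inlinks_b)
    if common == 0:
        return 0.0
    larger = max(len(inlinks_a), len(inlinks_b))
    smaller = min(len(inlinks_a), len(inlinks_b))
    denominator = math.log(total_articles) - math.log(smaller)
    if denominator <= 0.0:
        return 0.0
    distance = (math.log(larger) - math.log(common)) / denominator
    return max(0.0, 1.0 - distance)


def disambiguate(candidates, context_articles, inlinks, total_articles):
    """Pick the candidate maximizing prior * average relatedness to the context.

    candidates maps each candidate article to P(sense | term).
    """
    if not context_articles:
        # Assumption: with no unambiguous context, fall back to the prior.
        return max(candidates, key=candidates.get)

    def score(article):
        similarities = [
            link_relatedness(inlinks[article], inlinks[ctx], total_articles)
            for ctx in context_articles
        ]
        return candidates[article] * sum(similarities) / len(similarities)

    return max(candidates, key=score)


# Hypothetical link sets (article ids) and priors, for illustration only.
INLINKS = {
    "Jaguar": {1, 2, 3, 4, 5, 6},
    "Jaguar Cars": {7, 8, 9, 10, 11},
    "Rainforest": {1, 2, 3, 20, 21},
    "Predator": {2, 3, 4, 5, 30},
}
PRIORS = {"Jaguar Cars": 0.5, "Jaguar": 0.4}

sense = disambiguate(PRIORS, ["Rainforest", "Predator"], INLINKS, 1000)
```

Here the car maker has the higher prior, but it shares no incoming links with the context articles, so the product favours the animal; without any context the same call would return the car maker.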
Bunescu and Paşca [2006] disambiguate named entities in search queries in order to group search results by the corresponding senses. They first create a dictionary of 500,000 entities that appear in Wikipedia, and add redirects and disambiguated names to each one. If a query contains a term that corresponds to two or more entries, they choose the one whose Wikipedia article has the greatest cosine similarity with the query. If the similarity scores are too low they use the category to which the article belongs instead of the article itself. If even this falls below a predefined threshold they assume that no mapping is available. The reported accuracies are between 55% and 85% for members of Wikipedia's People by occupation category, depending on the model and experimental data employed.

Cucerzan [2007] identifies and disambiguates named entities in text. Like Bunescu and Paşca [2006], he first extracts a vocabulary from Wikipedia. It is divided into two parts, the first containing surface forms and the second the associated entities, along with contextual information about them. The surface forms are titles of articles, redirects, and disambiguation pages, and anchor text used in links. This yields 1.4 million entities, with an average of 2.4 surface forms each. Further pairs are extracted from Wikipedia list pages (e.g., Texas (band) receives a tag LIST_band name etymologies, because it appears in the list with this title), yielding a further 540,000 entries. Categories assigned to Wikipedia articles describing named entities serve as tags too, yielding 2.65 million entries. Finally a context for each named entity is collected (e.g., parenthetical expressions in its title, phrases that appear as link anchors in the first paragraph of the article, etc.), yielding 38 million pairs.

To identify named entities in text, capitalization rules indicate which phrases are surface forms of named entities.
Co-occurrence statistics generated from the web by a search engine help to identify boundaries between them (e.g. Whitney Museum of American Art is a single entity, whereas Whitney Museum in New York contains two). Lexical analysis is used to collate identical entities (e.g., Mr. Brown and Brown), and entities are tagged with their type (e.g., location, person) based on statistics collected from manually annotated data. Disambiguation is performed by comparing the similarity of the document in which the surface form appears with Wikipedia articles that represent all named entities that have been identified in it, and their context terms, and choosing the best match. Cucerzan [2007] achieves 88% accuracy on 5,000 entities appearing in Wikipedia articles, and 91% on 750 entities appearing in news stories.

Kazama and Torisawa [2007] recognize and classify entities but do not disambiguate them. Their work resembles the methods described above. Given a sentence, their goal is to extract all n-grams representing Wikipedia articles that correspond to a named entity and assign a type to each. For example, in the sentence Rare Jimi Hendrix song draft sells for almost $17,000 they identify Jimi Hendrix as an entity of type musician. To determine the type they extract the first noun phrase following the verb to be from the Wikipedia article's first sentence, excluding phrases like kind of and type of, e.g., guitarist in Jimi Hendrix was a guitarist. Recognition is a supervised tagging process based on standard features such as surface form and part-of-speech tag, augmented with category labels extracted from Wikipedia and a gazetteer. An F-measure of 88% was achieved on a standard set of 100 training and 20 development and testing documents.

Cucerzan [2007] and Kazama and Torisawa [2007] report similar performance, while Bunescu and Paşca's [2006] results seem slightly worse. However, comparison is unreliable because different datasets are used. Accuracy also depends on the type of the named entity.

3.2.3.
Disambiguating thesaurus and ontology terms: Wikipedia's category and link structure contains the same kind of information as a domain-specific thesaurus, as illustrated by Figure 4, which compares it to the agricultural thesaurus Agrovoc [1995]. Whereas in Section 3.1.2 Wikipedia is used as an independent knowledge base, it can also be used to extend and improve existing resources. For example, if it were known that cardiovascular system and circulatory system in Figure 4 refer to the same concept, the synonym blood circulation could be added to Agrovoc. The major problem is to establish a mapping between Wikipedia and other resources, disambiguating situations that support multiple mappings.

Ruiz-Casado et al. [2005] map Wikipedia articles to WordNet. They work with the Simple Wikipedia (footnote 14), a reduced version that contains easier words and shorter sentences, intended for people learning English. WordNet synsets cluster word senses so that homonyms can be identified. If a Wikipedia article matches several WordNet synsets, the appropriate one is chosen by computing the similarity between the Wikipedia entry word-bag and the WordNet synset gloss. This technique achieves 84% accuracy when dot product similarity of stemmed word vectors is applied. The problem is that as Wikipedia grows, so does ambiguity. For instance even the Simple Wikipedia contains the article Cats (musical), which is absent from WordNet. The mapping technique must be able to deal with absent items as well as polysemy in both resources.

Overell and Rüger [2006] disambiguate place names mentioned in Wikipedia to locations in gazetteers. Instead of semantic similarity they develop geographically-based disambiguation methods. One seeks a minimum bounding box enclosing the location being disambiguated and other place names that are mentioned in the same context, using geographical coordinates from the gazetteer.
Another analyzes the place name's referent; for example, if the surface form Ontario is mapped to Ontario, Canada, then London, Ontario can be mapped to London, Canada. Best results were achieved by combining the minimum bounding box method with "importance," measured by population size. An F-measure of 80% was achieved on a test set with 1,70 locations and 12,275 non-locations.

14 http://simple.wikipedia.org

Figure 4. Comparison of organization structure in Agrovoc and Wikipedia.

Overell and Rüger [2007] extend this approach by creating a co-occurrence model for each place name. They map place names to Wikipedia articles, collect their redirects as synonyms, and gather the anchor text of links to these articles. This yields different ways of referring to the same place, e.g., {Londinium → London} and {London, UK → London}. Next they collect evidence from Wikipedia articles: geographical coordinates, and location names in subordinate categories. They also mine Placeopedia, a mash-up website that connects Wikipedia with Google Maps. Together, these techniques recognize 75% of place names and map them to geographical locations with an accuracy of between 78 and 90%.

Milne et al. [2007] investigate whether domain-specific thesauri can be obtained from Wikipedia for use in natural language applications within restricted domains, comparing it with Agrovoc, a manually built agricultural thesaurus. On the positive side, Wikipedia article titles cover the majority of Agrovoc terms that were chosen by professional indexers as index terms for an agricultural corpus, and its redirects correspond closely with Agrovoc's synonymy relation. However, neither category relations nor (mutual) hyperlinks between articles correspond well with Agrovoc's taxonomic relations. Instead of extracting new domain-specific thesauri from Wikipedia they examine how existing ones can be improved, using Agrovoc as a case study [Medelyan and Milne 2008].
Given an Agrovoc descriptor, they collect semantically related terms from the Agrovoc hierarchy as context terms and map each one to the Wikipedia article whose conditional probability (as explained in Section 3.2.1) is greatest. Then they compute the semantic similarity of each candidate mapping to this set of context articles. Manual evaluation of a subset with 40 mappings shows an average accuracy of 92%. The results are slightly better if there are fewer than four possible mappings and remain stable at 8% if there are ten or more.

Medelyan and Legg [2008] map terms from the Cyc ontology to Wikipedia articles using the disambiguation approach proposed by Medelyan and Milne [2008]. However, since they draw on the Cyc ontology as part of their disambiguation, and the project can be viewed as a large-scale "ontology alignment", discussion of it will be postponed to Section 6.5.

There is still far less research on word sense disambiguation using Wikipedia than for WordNet. However, significant advances have been made, and over the last two years the accuracy of mapping documents to relevant Wikipedia articles has improved by one third [Milne and Witten 2008]. Other researchers (such as Wang et al. [2007]) use word sense disambiguation as a part of an application but do not provide an intrinsic evaluation. Furthermore, for fair comparison the same version of Wikipedia and the same training and test set should be used, as has been done for WordNet by SIGLEX (Senseval, cited earlier). Evaluation of named entity extraction is even more complex, with each research group concentrating on different types of entity, e.g. persons or places. Here, extrinsic evaluations may be helpful (e.g., performance on a particular task such as question answering, before and after integration with Wikipedia). The next section describes an extrinsic evaluation of Wikipedia for co-reference resolution and compares the results with WordNet.
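Several methods in Section 3.2 reduce to the same core step: compare the context of an ambiguous term with the text attached to each candidate and keep the best match. As a rough illustration, the Python sketch below mimics the query disambiguation of Bunescu and Paşca [2006] summarized in Section 3.2.2: cosine similarity against article text, a fallback to category text, and no mapping when both fall below a threshold. It is not their implementation. The function names, the whitespace tokenizer, the threshold value and the toy candidate texts are all assumptions made for this example.

```python
import math
from collections import Counter


def bag(text):
    """Lower-cased bag of words (assumption: no stemming or stop-word removal)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_match(query_bag, texts):
    """Return (title, score) for the text most similar to the query."""
    scored = [(title, cosine(query_bag, bag(text))) for title, text in texts.items()]
    return max(scored, key=lambda pair: pair[1])


def disambiguate(query, candidates, threshold=0.1):
    """candidates maps an entity title to a dict with 'article' and 'category' text.

    The threshold of 0.1 is a hypothetical value chosen for the toy data below.
    """
    q = bag(query)
    for field in ("article", "category"):  # article text first, category as fallback
        title, score = best_match(q, {t: c[field] for t, c in candidates.items()})
        if score >= threshold:
            return title
    return None  # below threshold on both: assume no mapping is available


# Invented candidate texts, for illustration only.
CANDIDATES = {
    "Jaguar (animal)": {"article": "large cat native to the americas",
                        "category": "big cats"},
    "Jaguar (car)": {"article": "british manufacturer of luxury cars",
                     "category": "car manufacturers"},
}

print(disambiguate("jaguar luxury cars", CANDIDATES))
print(disambiguate("jaguar big cats habitat", CANDIDATES))
print(disambiguate("jaguar", CANDIDATES))
```

The second query shares no word with either article text, so the sketch falls back to the category text; the third matches nothing and yields no mapping.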
3.3 Co-reference resolution

Natural language understanding tasks such as textual entailment and question answering involve co-reference resolution: identifying which text entities refer to the same concept. Unlike word sense disambiguation, it is not necessary to determine the actual meaning of these entities, but merely to identify their connection. Consider the following example from Wikipedia's article on New Zealand:

Elizabeth II, as the Queen of New Zealand, is the Head of State and, in her absence, is represented by a non-partisan Governor-General. The Queen "reigns but does not rule." She has no real political influence, and her position is essentially symbolic. [emphasis added]

Without knowing that Elizabeth II and the Queen refer to the same entity, which can be referred to by the pronouns she and her, the information that can be inferred from this paragraph is limited. To resolve the highlighted co-referent expressions requires linguistic knowledge and world knowledge: that Elizabeth II is the Queen, and female.

Current methods often derive semantic relations from WordNet or mine large corpora using lexical patterns such as X is a Y and Y such as X. The task can be modeled as a binary classification problem (to determine, for each pair of entities, whether they co-refer or not) and addressed using machine learning techniques, with features such as whether they are semantically related, the distance between them, and agreement in number and gender.

The use of Wikipedia for these tasks has been explored in two ways. Ponzetto and Strube [2006a, 2007] analyze its hyperlink structure and text to extract semantic features, whereas Yang and Su [2007] use it as a large semi-structured corpus for mining lexical patterns. They are easy to compare because both use test data from the Message Understanding Conference organized by NIST.
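To make the pairwise formulation concrete, the following Python sketch builds one classification instance per pair of mentions, using only the surface features named above (distance and agreement in number and gender). It stops short of training a classifier. The Mention fields, the feature names and the token offsets are hypothetical choices for this illustration and are not taken from any of the systems discussed.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Mention:
    text: str
    position: int  # token offset in the document (hypothetical values below)
    number: str    # "sg" or "pl"
    gender: str    # "f", "m" or "n"


def pair_features(a, b):
    """Features for one candidate pair; a classifier would consume these."""
    return {
        "distance": abs(b.position - a.position),
        "number_agree": a.number == b.number,
        "gender_agree": a.gender == b.gender,
    }


def candidate_pairs(mentions):
    """One instance per unordered pair of mentions, in document order."""
    ordered = sorted(mentions, key=lambda m: m.position)
    return [((a.text, b.text), pair_features(a, b)) for a, b in combinations(ordered, 2)]


mentions = [
    Mention("Elizabeth II", 0, "sg", "f"),
    Mention("the Queen", 6, "sg", "f"),
    Mention("She", 35, "sg", "f"),
]

for pair, features in candidate_pairs(mentions):
    print(pair, features)
```

Semantic features, such as the WordNet and Wikipedia scores discussed in this section, would be added to the same feature dictionary before the instances are passed to a learner.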
Ponzetto and Strube's [2006, 2007a] main goal is to show that Wikipedia can be used as a fully-fledged lexical and encyclopedic resource, comparable to WordNet but far more extensive. While their work on semantic relatedness (Section 3.1) evaluates Wikipedia intrinsically, co-reference is evaluated extrinsically to demonstrate Wikipedia's utility. As a baseline they re-implement Soon et al.'s [2001] method with a set of standard features, such as whether the two entities share the same grammatical feature, or belong to the same WordNet class. Additional features mined from WordNet and Wikipedia are evaluated separately. The WordNet features for two given terms A, e.g. Elizabeth II, and B, e.g. Queen, are:

• The highest similarity score from all synset pairs to which A and B belong
• The average similarity score.

The Wikipedia analogue to these two features,

• The highest similarity score from all Wikipedia categories to which A and B belong
• The average similarity score,

is augmented by further features:

• Does the first paragraph of the Wikipedia article describing A mention B?
• Does any hyperlink in A's article target B?
• Does the list of categories for A's article contain B?
• What is the overlap between the first paragraphs of the articles for A and B?

The similarity and relatedness scores are computed using various metrics. Feature selection is applied during training to remove irrelevant features for each scenario. The results are included in Table 3, which we will discuss shortly.

Yang and Su [2007] utilize Wikipedia in a different way, assessing semantic relatedness between two entities by analyzing their co-occurrence patterns in Wikipedia. (Pattern matching using the Wikipedia corpus is practiced extensively in information extraction, as described in Section 5.) The patterns are evaluated based on positive instances in the training data that serve as seeds.
For example, given the pair of co-referents Bill Clinton and president, and Wikipedia sentences like Bill Clinton is elected President of the United States and The US president, Mr Bill Clinton, the patterns [X is elected Y] and [Y, Mr X] are extracted. Sometimes patterns occur in structured parts of Wikipedia like lists and infoboxes; for example, in United States | Washington, D.C., the bar symbol is the pattern. An accuracy measure is used to eliminate patterns that are frequently associated with both negative and positive pairs. Yang and Su [2007] found that using the 10 most accurate patterns as features did not improve performance over the baseline. However, adding a single feature representing semantic relatedness between the two entities did improve results. Yang and Su use mined patterns to assess relatedness by multiplying together two measures of reliability: the strength of association between each positive seed pair and the pointwise mutual information between the entities occurring with the pattern and by themselves.

                                          NWIRE                BNEWS
                                      R     P     F        R     P     F
Ponzetto and Strube [2006, 2007a]
  baseline                          56.3  86.7  68.3     50.5  82.0  62.5
  +WordNet                          62.4  81.4  70.7     59.1  82.4  68.8
  +Wikipedia                        60.7  81.8  69.7     58.3  81.9  68.1
Yang and Su [2007]
  baseline                          54.5  80.3  64.9     52.7  75.3  62.0
  +sem. related.                    57.4  80.8  67.1     54.0  74.7  62.7

Table 3. Performance comparison of two independent techniques on the same datasets.

Table 3 shows the results that both sets of authors report for co-reference resolution. They use the same baseline, but the implementation was evidently slightly different, for Ponzetto and Strube's yielded a slightly improved F-measure. Ponzetto and Strube's results when features were added from WordNet and Wikipedia are remarkably similar, with no statistical difference between them. These features decrease precision over the baseline on NWIRE by 5 points but increase recall on both datasets, yielding a significant overall gain (1.5 to 2 points on NWIRE and 6 points on BNEWS).
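The F column of Table 3 can be cross-checked against the R and P columns. The short Python sketch below assumes that F is the balanced F-measure, the harmonic mean of precision and recall, which the text does not state explicitly. It recomputes F from the rounded R and P values reported in the table. The row labels are shorthand introduced here.

```python
def f_measure(recall, precision):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (label, R, P, reported F) copied from Table 3; labels are shorthand.
TABLE_3 = [
    ("P&S baseline   NWIRE", 56.3, 86.7, 68.3),
    ("P&S +WordNet   NWIRE", 62.4, 81.4, 70.7),
    ("P&S +Wikipedia NWIRE", 60.7, 81.8, 69.7),
    ("P&S baseline   BNEWS", 50.5, 82.0, 62.5),
    ("P&S +WordNet   BNEWS", 59.1, 82.4, 68.8),
    ("P&S +Wikipedia BNEWS", 58.3, 81.9, 68.1),
    ("Y&S baseline   NWIRE", 54.5, 80.3, 64.9),
    ("Y&S +sem.rel.  NWIRE", 57.4, 80.8, 67.1),
    ("Y&S baseline   BNEWS", 52.7, 75.3, 62.0),
    ("Y&S +sem.rel.  BNEWS", 54.0, 74.7, 62.7),
]

for label, r, p, reported in TABLE_3:
    print(f"{label}  recomputed F = {f_measure(r, p):.2f}  reported F = {reported}")
```

Every recomputed value agrees with the reported one to within 0.1, which is what one would expect when F is calculated from unrounded precision and recall.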
Yang and Su improve the F-measure on NWIRE and recall on BNEWS by 2 points. Overall, it seems that Ponzetto and Strube's technique performs slightly better. These co-reference resolution systems are quite complex, which may explain why no other methods have been described in the literature. We expect further developments in this area.

3.4 Multilingual alignment

In 2006, five years after its inception, Wikipedia contained 10,00 articles for eight different languages. The closest precedent to this unique multilingual resource is the commercial EuroWordNet that unifies seven different languages but covers a far smaller set of concepts (8,00 to 4,00, depending on the language) [Vossen et al. 1997]. Of course, multilingual vocabularies and aligned corpora benefit any application that involves machine translation.

Adafre and de Rijke [2006] began by generating parallel corpora in order to identify similar sentences (those whose information overlaps significantly) in English and Dutch. First they used a machine translation tool to translate Wikipedia articles and compared the result with the corresponding manually written articles in that language. Next they generated a bilingual lexicon from links between articles on the same topic in different languages, and determined sentence similarity by the number of shared lexicon entries. They evaluated these two techniques manually on 30 randomly chosen Dutch and English Wikipedia articles. Both identified rather a small number of correct sentence alignments: the machine translation had lower accuracy but higher coverage than the lexicon approach. The authors ascribed the poor performance to the small size of the Dutch version but were optimistic about Wikipedia's potential.

Ferrández et al. [2007] use Wikipedia for cross-language question answering (see Section 4.3 for research on monolingual question answering). They identify named entities in the query, link them to Wikipedia article titles, and derive equivalent translations in the target language.
Wikipedia's exceptional coverage of named entities (Section 3.2.2) counters the main problem of cross-language question answering: low coverage of the vocabulary that links questions to documents in other languages. For example, the question In which town in Zeeland did Jan Toorop spend several weeks every year between 1903 and 1924? mentions the entities Zeeland and Jan Toorop, neither of which is covered by EuroWordNet. In an initial version of the system using that resource, Zeeland remains unchanged and the phrase Jan Toorop is translated to Enero Toorop because Jan is erroneously interpreted as January. With Wikipedia as a reference, the translation is correct: ¿En qué ciudad de Zelanda pasaba varias semanas al año Jan Toorop entre 1903 y 1924? With Wikipedia's help, Ferrández et al. increase the percentage of correctly answered questions by 20%.

Erdmann et al. [2008] show that simply following language links in Wikipedia is insufficient for a high-coverage bilingual dictionary. They develop heuristics based on Wikipedia's link structure that extract significantly more translation pairs, and evaluate them on a manually created test set containing terms of different frequency. Given a Wikipedia article that has been translated into another language (the target article) they augment the translated article name with redirects and also anchor text used to refer to the article. Redirects are weighted by the proportion of links to the target article (including all redirects) that use this particular redirect. Anchors are weighted similarly, by expressing the number of links that use this particular anchor text as a proportion of the total number of incoming links to the article. If a term appears as both redirect and anchor text, the two weights are combined. The resulting dictionary contains all translation pairs whose weight exceeds a certain threshold. This achieves significantly better results than a standard dictionary creation approach using parallel corpora.
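The redirect and anchor weighting just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Erdmann et al.'s implementation. The text does not say how the two weights are combined, so the sketch simply adds them; the threshold value, the function name and all link counts are invented for the example.

```python
def translation_candidates(links_via_title, redirects, anchor_counts, threshold=0.1):
    """Weight redirects and anchor texts of a target article; keep those above threshold.

    links_via_title: incoming links per page title (the article itself and its redirects)
    redirects:       titles that redirect to the target article
    anchor_counts:   incoming links per anchor text
    Assumes both count tables are non-empty.
    """
    total_links = sum(links_via_title.values())
    total_anchors = sum(anchor_counts.values())
    weights = {}
    for title in redirects:
        # share of all links to the target (including redirects) that use this redirect
        weights[title] = links_via_title.get(title, 0) / total_links
    for text, count in anchor_counts.items():
        # share of incoming links that use this anchor text; summing with any
        # redirect weight is an assumption, the combination rule is not given
        weights[text] = weights.get(text, 0.0) + count / total_anchors
    return {term: w for term, w in weights.items() if w > threshold}


# Hypothetical counts for a German target article on plants.
links_via_title = {"Pflanze": 80, "Pflanzen": 15, "Pflanzenreich": 5}
redirects = ["Pflanzen", "Pflanzenreich"]
anchor_counts = {"Pflanze": 70, "Pflanzen": 25, "Flora": 5}

print(translation_candidates(links_via_title, redirects, anchor_counts))
```

With these invented counts only Pflanze and Pflanzen survive the threshold; the rarely used redirect and anchor are dropped.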
Figure 5 shows the system in action.

This section has demonstrated Wikipedia's immense potential as a repository of linguistic knowledge for natural language processing. Impressive results have been achieved, particularly on well-defined tasks such as determining semantic relatedness and word sense disambiguation.

4. INFORMATION RETRIEVAL

Given its utility for natural language processing, it is not surprising that Wikipedia has also been used to organize documents and locate them. This section describes applications of Wikipedia to information retrieval. These split roughly into searching and browsing.

For searching, Wikipedia has been leveraged to gain a deeper understanding of both queries and documents, and improve how they are matched to each other. Section 4.1 describes how it has been used to expand queries to allow them to return more relevant documents, while Section 4.2 describes experiments in cross-language retrieval. Wikipedia has also been used to retrieve specific portions of documents, such as answers to questions (Section 4.3) or important topics (Section 4.4).

For browsing, the same Wikipedia-derived understanding has been used to automatically organize documents into helpful groups. Section 4.5 shows how Wikipedia has been applied to document classification, where documents are categorized under broad headings like Sport and Technology. To a lesser extent it has also been used to determine the main topics that documents discuss, so that they can be organized under more specific tags (Section 4.6).

4.1 Query expansion

Query expansion aims to improve queries by adding terms and phrases, such as synonyms, alternative spellings, and closely related concepts. Such query reformulations can be performed automatically (without the user's input) or interactively (where the system suggests modifications that could be made). Milne et al. [2007] use Wikipedia to provide both forms of expansion in their knowledge-based search engine Koru.
(footnote 15) They first obtain a subset of Wikipedia articles that are relevant for a particular document collection, and use the links between these to build a corpus-specific thesaurus. Given a query they map its phrases onto topics in this thesaurus. Figure 6 demonstrates how a query president bush controversy is mapped to potentially relevant thesaurus topics (or Wikipedia articles) George H.W. Bush, George W. Bush and Controversy. President Bush is initially disambiguated to the younger of the two, because he occurs most often in the document set. This can be corrected manually. The redirects from his article and that of Controversy are then mined for synonyms and alternative spellings, such as Dubya and disagreement, and quotes are added around multi-word phrases (such as Bush administration). This results in a complex Boolean query such as an expert librarian might issue. The knowledge base derived from Wikipedia was capable of recognizing and lending assistance to 95% of the queries issued to it. Evaluation over the TREC HARD Track [Allan 2005] shows that the expanded queries are significantly better than the original ones in terms of overall F-measure.

Milne et al. also provided interactive query expansion by using the detected query topics as starting points for browsing the Wikipedia-derived thesaurus. For example, George Bush provides a starting point for locating related topics such as Dick Cheney, Terrorism, and President of the United States. The evaluation of such exploratory search provided little evidence that it assisted users. Despite this, the authors argue that Wikipedia should be an effective base for this task, due to its extensive coverage and inter-linking. This is yet to be proven, however: to our knowledge there are no other examples of exploratory searching with Wikipedia.

15 Demo at http://www.nzdl.org/koru

Figure 5. Screen shot of automatically created translations for plant.

Li et al.
[2007] also use Wikipedia to expand queries, but focus on the most problematic ones: those that traditional approaches fail to improve. The standard method for improving queries, pseudo-relevance feedback, works by feeding terms from the highest ranked documents back into the query [Ruthven and Lalmas 2003]. This works well in general, so most of the state-of-the-art approaches are variants of this idea. Unfortunately it makes bad queries even worse, because it relies on at least the top few documents being relevant. Li et al. avoid this by using Wikipedia as an external corpus to obtain additional query terms. They issue the query on Wikipedia to retrieve relevant articles. They then use these articles' categories to group them, and rank articles so that those in the largest groups appear more prominently. Forty terms are then picked from the top 20 articles (it is unclear how they are selected) and added to the original query. When tested on queries from TREC's 2005 Robust track [Allan 2005], this improved those queries on which traditional pseudo-relevance feedback performs most poorly. It did not perform as well as the state of the art in general, however. The authors attribute this to differences in language and context between Wikipedia and the dated news articles used for evaluation, which render many added terms irrelevant.

Where the previous two systems departed from traditional bag-of-words relevance feedback, Egozi et al. [2008] instead aim to augment it. Their system, MORAG, uses Explicit Semantic Analysis (described in Section 3.1) to represent documents and queries as vectors of their most relevant Wikipedia articles. Comparison of document vectors to the query vector results in concept-based relevance scores, which are combined with those given by state-of-the-art retrieval systems, such as Xapian and Okapi.
Additionally, both concept-based and bag-of-words scores are computed by segmenting documents into overlapping 50 word subsections (a common strategy), so that the total score of a document is the sum of the score obtained from its best section and its overall content. One complication that this approach must overcome is ESA's tendency to provide features (Wikipedia articles) that are only peripherally related to queries. The query law enforcement, dogs, for example, results not just in police dog and cruelty to animals, but also contract and Louisiana. To address this, MORAG first ranks documents according to their bag-of-words (BOW) scores, and then uses the highest and lowest ranking documents to provide positive and negative examples for selecting features. When used to augment the four top performing systems from the TREC-8 competition [Voorhees and Harman 2000] MORAG achieved improvements of between 4% and 15% to Mean Average Precision, depending on the system being augmented.

Figure 6. Using Wikipedia to recognize and expand query topics.

We were surprised to find only these three papers on using Wikipedia to expand queries, despite the fact that it seems well suited to this task. Bag-of-words based approaches stand to benefit from Wikipedia's understanding of what the words mean and how they relate to each other. Concept based approaches that draw on traditional knowledge bases could profit just as much from Wikipedia's unmatched breadth. We expect widespread usage of Wikipedia in the future, both for automatic query expansion and exploratory searching, and for both improving existing techniques and supporting entirely new ones.
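The automatic expansion described for Koru earlier in this section (synonyms joined with OR, multi-word phrases quoted, topics joined with AND) can be illustrated with a small Python sketch. It is a simplified reading of the president bush controversy example and not Koru's code. The function names are hypothetical, and the synonym lists are abbreviated selections of terms that would come from redirects.

```python
def quote(term):
    """Quote multi-word phrases so they are matched as phrases."""
    return f'"{term}"' if " " in term else term


def expand_topic(title, synonyms):
    """OR together a topic title and its synonyms, dropping duplicates in order."""
    terms = []
    for term in [title, *synonyms]:
        if term not in terms:
            terms.append(term)
    return "(" + " OR ".join(quote(t) for t in terms) + ")"


def expand_query(topics):
    """AND together the expanded form of every detected topic."""
    return " AND ".join(expand_topic(title, syns) for title, syns in topics.items())


# Abbreviated, illustrative synonym lists.
topics = {
    "George W. Bush": ["George Bush", "Dubya", "Bush administration"],
    "Controversy": ["controversial", "disagreement", "dispute"],
}

print(expand_query(topics))
```

A production system would also need to escape query-language metacharacters and to let the user swap a wrongly chosen topic, as Koru does through manual correction.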
4.2 Multilingual Retrieval

Multilingual or cross-language information retrieval involves searching for relevant documents that were not written in the same language as the query, which serves the large number of bilingual or multilingual users. Wikipedia has clear application to this task. Although its language versions grow at different rates and cover different topics, they are carefully interwoven. For example, the English article on Search engines is linked to the German Suchmaschine, the French Moteur de recherche, and more than 40 other translations. These links constitute a comprehensive cross-lingual dictionary of topics and terms, which is growing rapidly. This makes Wikipedia ideal for translating emerging named entities and topics, such as people and technologies: exactly the items that more traditional multilingual resources (dictionaries) struggle with.

Surprisingly, we failed to locate any papers that use Wikipedia's cross-language links directly to translate query topics. Instead Potthast et al. [2008] jump directly to a more sophisticated solution that uses Wikipedia to generate a multilingual retrieval model. This is a generalization of traditional monolingual retrieval models (like the vector space model or latent semantic analysis) which assess similarities between documents and fragments of text. Multilingual and cross-language models are capable of identifying similar documents even when they are written in different languages. Potthast et al. take Explicit Semantic Analysis (which, as described in Section 3.1, represents documents by their most relevant Wikipedia concepts) as the starting point for a new model called Cross-language Explicit Semantic Analysis or CL-ESA. This approach depends on the hypothesis that the relevant concepts identified by ESA are essentially language independent, so long as the concepts are sufficiently described in different languages.
If there were sufficient overlap between the English and German Wikipedias, for example, then one would get roughly the same list of concepts (and in the same order) from ESA regardless of whether the document being represented, or the concept space it was projected onto, was in English or German. This means that the languages of documents and concept spaces are largely irrelevant, and documents in different languages can be compared without explicit translation.

To evaluate this idea, Potthast et al. conducted several experiments with a bilingual (German/English) set of 3,00 documents. One test was to use articles in one language as queries, to retrieve their direct translations in the other language. When CL-ESA was used to rank all English documents by their similarity to German ones, the explicit translation of the document was consistently ranked highly: it was first 91% of the time, and in the top 10 more than 99% of the time. Another test was to use an English document as a query for the English document set, and its translation as a query for the German one. The two result sets had an average correlation of 72%. These results were obtained with a dimensionality of 10^5; that is, 100,000 bilingual concepts were used to generate the concept spaces. Today, only German and English Wikipedias have this degree of overlap. Results degrade as fewer concepts are used; Potthast et al. found that between 1,00 and 10,00 concepts are sufficient for reasonable retrieval performance. At the time, this made CL-ESA capable of pairing English with German, French, Polish, Japanese, and Dutch. In time, improvements to the algorithm and continued growth of Wikipedia will allow these techniques to be applied to other languages as well.

4.3 Question answering

Question answering is a more complex form of information retrieval, which aims to return specific answers to questions, rather than entire documents.
This ranges in sophistication from merely obtaining the most relevant sentences or sections from documents, to ensuring that they are in the correct form to constitute an answer, to constructing answers on the fly. Wikipedia provides an extremely broad corpus filled with numerous facts, which makes it a promising source of answers. A simple but well-known example of this is how Google queries prefixed with define, and Ask.com queries starting with What is... or Who is..., often return the first sentences from relevant Wikipedia articles.

Kaiser's [2008] QuALiM system, illustrated in Figure 7, provides a more sophisticated example of question answering with Wikipedia (footnote 16). When asked a question (such as Who is Tom Cruise married to?) it mines Wikipedia not only for relevant articles, but also for the sentences and paragraphs in which the answer is given. It also provides the exact entity that answers the question, e.g. Katie Holmes. Interestingly, this entity is not mined from Wikipedia but obtained by analyzing results from various web search engines. It parses questions to identify the expected class of the answer (in this case, a person), and construct valid queries (e.g. Tom Cruise is married to or Tom Cruise's wife). Responses to these queries are then parsed to identify entities of the correct type to satisfy the answer. Wikipedia is then only used to provide the supporting sentences and paragraphs.

16 Demo at http://demos.inf.ed.ac.uk:8080/qualim/

The TREC series of conferences hosts a prominent forum for investigating question answering (footnote 17). The question-answering track provides ground truth for experiments with a corpus from which answers to questions have been manually extracted. The 2004 track saw two of the first uses of Wikipedia for question answering, from Lita et al. [2004] and Ahn et al. [2005].
The former does not perform question answering per se; instead it investigates whether different resources provide answers to questions, without attempting to extract the answers automatically. Wikipedia's coverage of answers was 10 percentage points higher than WordNet, and about 30 points higher than the other resources they compared it to, including Google define queries and gazetteers such as the CIA World Fact Book.

Ahn et al. [2005] seem to be the first to provide explicit answers from Wikipedia. They first identify the topic of the question (Tom Cruise in our example) and locate the relevant article. They then identify the expected type of the answer (in this case another person, his wife) and scan the article for matching entities. These are ranked by both prior answer confidence (probability that they answer any question at all) and posterior confidence (probability that they answer the question at hand). Prior confidence is given by the position of the entity in the article, since articles cover the most important facts first. Posterior confidence is given by the Jaccard similarity of the original question and the sentence surrounding the entity. Wikipedia is used as one stream among many from which to extract answers, and unfortunately the experiments do not tease out its specific contribution. Consequently it is difficult to measure the effectiveness of their approach. Overall, however, they describe the results as "disappointing" because it did not improve upon their previous work.

17 http://trec.nist.gov/

Figure 7. The QuALiM system, using Wikipedia to answer Who is Tom Cruise married to?

The CLEF series of conferences and competitions is another popular forum for investigating question answering (footnote 18). Monolingual and cross-language QA are addressed by providing corpora and tasks in many different languages. One source of documents is a cross-language crawl of Wikipedia.
Most entries for this competition extract answers from Wikipedia but are not covered here because they do not take advantage of its unique properties.

Buscaldi and Rosso [2007a] use Wikipedia to augment their question answering system QUASAR. The way in which this system extracts answers was left unchanged, except for an additional step where Wikipedia is consulted to verify the results. They index four different views of Wikipedia: titles, full text, first sections (definitions), and the categories that articles belong to. These are searched differently depending on the question type. Answers to definition questions (e.g., Who is Nelson Mandela?) are verified by seeking articles whose title contains the corresponding entity and whose first section contains the proposed answer. If the question requires a name (e.g., Who is the President of the United States?) the process is reversed: candidate answers (Bill Clinton, George Bush) are sought in the title field and query constraints (President, United States) in the definition. In either case, if at least one relevant article is returned the answer is verified. This yielded an improvement of 4.5% over the original system, across all question types.

Ferrández et al. [2007] also make use of Wikipedia's structure to answer questions, but focus on cross-lingual tasks, where questions are formulated in a language different from that of the documents from which answers are extracted. Their work is described in Section 3.4.

18 The homepage for the CLEF series of conferences is at http://www.clef-campaign.org/

As well as using Wikipedia as a corpus for standard question answering tasks, CLEF has a track (WiQA) specifically designed to assist Wikipedia's contributors. Its aim, given a source article, is to extract new snippets of information from related articles that should be incorporated into it [Jijkoun and de Rijke 2006]. The authors conclude that the task is difficult but possible, as long as the results are used in a supervised fashion.
The best out of seven participating teams added an average of 3.4 perfect (important and novel) snippets to each English article, with a precision of 36%. Buscaldi and Rosso [2007b], one of the contributing entries (see footnote 19), search Wikipedia for articles containing the text of the target article's title. They extract snippets from them, rank them according to their similarity to the original article using the standard bag-of-words model, and discard those that are redundant (too similar) or irrelevant (not similar enough). On English data this yields 2.7 perfect snippets per topic, with a precision of 29%. On Spanish data it obtains 1.8 snippets with 23% precision.

Higashinaka et al. [2007] extract questions, answers and even hints from Wikipedia to automatically generate "Who am I?" quizzes. The first two tasks are simple because the question is always the same and the answer is always a person. The challenging part is extracting hints (which are essentially facts about the person) and ranking them so that they progress from vague to specific. They used machine learning for this, based on biographical Wikipedia articles whose facts have been manually ranked.

Overall, research on question answering tends to treat Wikipedia as just another plain-text corpus from which to extract answers. Few researchers take advantage of Wikipedia's unique structural properties (e.g., categories and links) or the explicit semantics it provides. Instead they apply standard word-based similarity measures, even when Wikipedia concept-based measures such as ESA have been shown to be more effective. We were surprised to find little overlap between this work and research on information extraction from Wikipedia (Section 5), and no use of Wikipedia-derived ontologies or its infoboxes (Section 6). Perhaps this reflects an overall goal of crawling the entire web for answers, requiring techniques that are generalizable to any textual resource.
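As a concrete illustration of the snippet selection step that Buscaldi and Rosso [2007b] describe above, the sketch below ranks candidate snippets by bag-of-words cosine similarity to the target article and discards those outside a similarity band. Cosine similarity, the thresholds `low` and `high`, and all function names are assumptions chosen for illustration; the source says only that a standard bag-of-words model is used.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Term-frequency vector of lowercase word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_snippets(article, snippets, low=0.1, high=0.8):
    """Keep snippets that are neither redundant nor irrelevant.

    Snippets scoring above `high` are treated as redundant (too similar
    to the article); those below `low` as irrelevant. Both thresholds
    are hypothetical values. Survivors are returned most similar first.
    """
    target = bag_of_words(article)
    scored = [(cosine(target, bag_of_words(s)), s) for s in snippets]
    kept = [(score, s) for score, s in scored if low <= score <= high]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in kept]


# Invented toy data.
article_text = "the river flows north through the city"
candidate_snippets = [
    "the river flows north through the city",   # identical, so redundant
    "stock markets fell sharply",               # no shared words, so irrelevant
    "the river supplies water to the city",     # related and novel
]
selected = select_snippets(article_text, candidate_snippets)
```

The two thresholds trade off novelty against relevance; in practice they would need tuning on judged data such as the WiQA assessments.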
4.4 Entity ranking

It is often expedient to return entities in response to a query rather than full documents as in classical retrieval. This resembles question answering and often fulfils the same purpose: for example, the query "countries where I can pay in euros" could be answered by a list of relevant countries. For other queries, however, entity ranking does not provide answers but instead generates a list of pertinent topics. For example, as well as Google, Yahoo, and Microsoft Live, the query "search engines" would also return PageRank and World Wide Web. The literature seems to use the terms entity and named entity interchangeably, so it is unclear whether concepts such as information retrieval and full text search would also be valid results. Section 5.3 demonstrates that Wikipedia offers an exceptionally large pool of manually defined entities, which can be typed (as people, places, events, etc.) fairly accurately.

19 We were unable to locate papers describing the others.

The entity ranking track of the Initiative for Evaluation of XML Retrieval (INEX) compares different methods for entity ranking by how well they are able to return relevant Wikipedia entities in response to queries [de Vries et al. 2007]. Zaragoza et al. [2007] also use Wikipedia as a dataset for comparing two main approaches to entity ranking: entity containment graphs and web search based methods. Their results are of little interest here because they do not relate directly to Wikipedia. More relevant is that they have developed a version of Wikipedia that has been automatically annotated with named entities, and are sharing it so that others can investigate different approaches to named entity ranking (see footnote 20).

As well as being a source of entities, Wikipedia provides a wealth of information about them, which can improve ranking. Vercoustre et al. [2008] combine traditional search with Wikipedia-specific features.
They rank articles (which they assume are synonymous with entities) by combining the score provided by a search engine (Zettair) with features mined from categories and inter-article links. The article links provide a simplified PageRank for entities and the categories provide a similarity score for how they relate to each other. The resulting precision is almost double that of the search engine alone.

Vercoustre et al. were the only competitors for the INEX entity-ranking track we were able to locate (see footnote 21), and it seems that Wikipedia's ability to improve entity ranking has yet to be evaluated against more sophisticated baselines. Moreover, the features that Vercoustre et al. derive from Wikipedia are only used to rank entities in general, not by their significance for the query. Regardless, entity ranking will no doubt receive more attention as the INEX competition grows and others use Zaragoza et al.'s dataset.

The knowledge that Wikipedia provides about entities can also be used to organize them. This has not yet been thoroughly investigated, the only example being Yang et al.'s [2007] use of Wikipedia articles and WikiBooks to organize entities into hierarchical topic maps. They search for the most relevant article and book for a query and simply strip away the text to leave lists of links (which again they assume to be entities) under the headings in which they were found. This is both a simplistic entity ranking method and a tool for generating domain-specific taxonomies, but has not been evaluated as either.

20 The annotated version of Wikipedia is at http://www.yr-bcn.es/semanticWikipedia
21 It began in 2007 and the Proceedings are yet to be published.

4.5 Text categorization

Text categorization (or classification) organizes documents into meaningful homogeneous groups. Documents are labeled from a pool of categories in the same way that articles in a newspaper are assigned to sections like business, sport, or entertainment.
The traditional approach to this task is to represent documents with the words they contain, and use training documents to identify the words and phrases that are most indicative of each category label. Wikipedia allows categorization techniques to draw on background knowledge about the concepts these words represent. As Gabrilovich and Markovitch [2006] note, traditional approaches are brittle. They break down when documents discuss similar topics in different terms, as when one talks of Wal-Mart and the other of department stores. They cannot make the necessary connections because they lack background knowledge about what the words mean. Wikipedia can fill the gap.

As a quick indication of Wikipedia's application to text categorization, Table 4 compares Wikipedia-based approaches with state-of-the-art categorization that only uses information obtained from the documents themselves. The figures were obtained on the Reuters-21578 collection, a set of news stories that have been manually assigned to categories. Results are presented as the break-even point (BEP) where recall and precision are equal. The micro and macro columns correspond to how these are averaged: the former averages across documents, so that smaller categories are largely ignored, while the latter averages by category. The first entry is a baseline provided by Gabrilovich and Markovitch, which is in line with state-of-the-art document-based methods such as [Dumais et al. 1998]. The remaining three entries use additional information gleaned from Wikipedia and are described below. The gains may seem slight, but they represent the first improvements upon a performance plateau reached by previous state-of-the-art techniques, which are now a decade old.

Gabrilovich and Markovitch [2006] observed that documents can be augmented with Wikipedia concepts without complex natural language processing. Both are in the same form (plain text), so standard similarity algorithms can be used to compare documents with potentially relevant articles.
Thus documents can be represented as weighted lists of relevant concepts, rather than bags of words. This should sound familiar; it is the predecessor of Explicit Semantic Analysis, an influential technique that we have seen several times before (Sections 3.1, 4.1 and 4.2). For each document, Gabrilovich and Markovitch generate a large set of features (articles) not just from the document as a whole, but also by considering each word, sentence, and paragraph independently. Training documents are then used to filter out the best of these features, to augment the original bags of words. Additionally, the number of links made to each article is used to identify and emphasize those that are most well known. This results in consistent improvements over the previous classification techniques, particularly over short documents (which otherwise have few features) and small categories (which provide fewer training examples).

                                                      Micro BEP   Macro BEP
Baseline (from Gabrilovich and Markovitch [2006])       87.7        60.2
Gabrilovich and Markovitch [2006]                       88.0        61.4
Wang et al. [2007]                                      91.2        63.1
Minier et al. [2007]                                    86.1        64.1

Table 4. Performance of text categorization over the Reuters-21578 collection.

The ability of Wikipedia to improve classification of short documents is confirmed by Banerjee et al. [2007], who focus on clustering news articles under feed items such as those provided by Google News. They took a simple approach to obtaining relevant articles for each news story, issuing its title and short description (Google snippet) as separate queries to a Lucene index of Wikipedia. They were able to cluster the documents under their original headings (each feed item organizes many similar stories) with 90% accuracy using only the titles and descriptions as input. This work is somewhat suspect, however, in that it treats Google's automatically clustered news stories as ground truth, and only compares the Wikipedia-based approach to a baseline of the authors' own design.

Wang et al.
[2007] also use Wikipedia to improve document classification, but focus on mining Wikipedia for terms and phrases to add to the bag of words that represents each document. For each document, they locate relevant Wikipedia articles by matching n-grams to article titles. They then augment the document by crawling these articles for synonyms (redirects), hyponyms (parent categories) and associative concepts (inter-article links). In the latter case they acknowledge that many links exist between articles that are only tenuously related at best. They overcome this by only selecting linked articles that are closely related according to textual content or parent categories. As shown in Table 4, this results in the best overall performance.

As well as a source of background knowledge for improving classification techniques, Wikipedia can be used as a corpus for training and evaluating them. Almost all classification approaches are machine-learned, and thus require training examples. Wikipedia provides millions of them. Each association between an article and the categories to which it belongs can be considered as manually defined ground truth for how that article should be classified. Gleim et al. [2007], for example, use it to evaluate their techniques for categorizing web pages solely on their structure rather than textual content. Admittedly, this is a well-established research area with well-known datasets, so it is unclear why another one is required. Table 4, for example, would be more informative if all of the researchers using Wikipedia for document classification had used standard datasets instead of creating their own.

Two interesting approaches that do not compete with the traditional bag-of-words approaches (and will therefore be discussed only briefly) are Janik and Kochut [2007] and Minier et al. [2007]. The former is one of the few techniques that does not use machine learning for classification.
Instead Janik and Kochut mine miniature "ontologies" (rough networks of relevant concepts) from Wikipedia for each document and category, and algorithmically identify the most relevant category ontology for each document ontology. The latter approach transforms the document-term matrix used by traditional techniques by mapping it onto a gigantic term-concept matrix obtained from Wikipedia. PageRank is run over Wikipedia's inter-article links in order to weight the derived Wikipedia concepts, and dimensionality reduction techniques (latent semantic analysis, kernel principal component analysis and kernel canonical correlation analysis) are used to reduce the representation to a manageable size. Minier et al. attribute the disappointing results (shown in Table 4) to differences in language usage between Wikipedia and the Reuters corpus used for evaluation. It should be noted that their Macro BEP (the highest in the table) may be misleading; their baseline achieves an even higher result, indicating that their experiment should not be compared to the other three.

Banerjee [2007] observed that document categorization is a problem where the goalposts shift regularly. The typical application is organizing news stories or emails, which arrive in a constant stream where the topics being discussed constantly evolve. A categorization method trained today may not be particularly helpful next week. Instead of throwing away old classifiers, the paper shows that inductive transfer allows old classifiers to influence new ones. This improves results and reduces the need for fresh training data. It finds that classifiers which derive additional knowledge from Wikipedia are more effective at transferring this knowledge, which is attributed to Wikipedia's ability to provide background knowledge about the content of articles, making their representations more stable.

Dakka and Cucerzan [2008] and Bhole et al. [2007] perform the reverse of the above techniques.
Instead of using Wikipedia to augment document categorization, they apply categorization techniques to Wikipedia. Their aim is to classify articles to detect the types (people, places, events, etc.) of the named entities they represent. Since this has more to do with named entity recognition than document classification, discussion of it is deferred to Section 5.3. Also discussed elsewhere is Schönhofen [2006], who developed a topic indexing system but evaluated it as a document classifier. His work is left for the next section.

Overall, the use of Wikipedia for text categorization is a flourishing research area. Many recent efforts have improved upon the previous state of the art, a plateau that had stood for almost a decade. Some of this success may be due to the amount of attention the problem has generated (at least 10 papers in just three years), but more fundamentally it can be attributed to the way in which researchers are approaching the task. Just as we saw in Section 4.1, the greatest gains have come from drawing closely on and augmenting existing research, while thoroughly exploring the unique features that Wikipedia offers.

4.6 Topic Indexing

Topic indexing is subtly different from text categorization. Both label documents so that they can be grouped sensibly and browsed efficiently, but in topic indexing labels are chosen from the topics the documents discuss rather than from a predetermined pool of categories. Topic labels are typically obtained from a domain-specific thesaurus, such as MESH [Lipscomb 2000] for the medical domain, because general thesauri like WordNet and Roget are too small to provide sufficient detail. An alternative is to obtain labels from the documents themselves, but this is inconsistent and error-prone because topics are difficult to recognize and appear in different surface forms. Using Wikipedia as a source of labels sidesteps the onerous requirement for developing or obtaining relevant thesauri, since it is large and general enough to apply to all domains.
It might not achieve the same depth as domain-specific thesauri, but tends to cover the topics that are used for indexing most often [Milne et al. 2006]. It is also more consistent than extracting terms from the documents themselves, since each concept in Wikipedia is represented by a single succinct manually chosen title. In addition to the labels themselves, Wikipedia provides many additional features about the concepts, such as how important and well known they are, and how they relate to each other.

Medelyan et al. [2008] propose topic indexing that uses Wikipedia as a controlled vocabulary and applies wikification (defined in Section 3.2.1) to identify the topics mentioned within documents. For each candidate topic they identify several features, including classical ones, such as how often topics are mentioned, and two Wikipedia-specific ones. One is node degree: the extent to which each candidate topic (article) is linked to the other topics detected in the document. The other is keyphraseness: the extent to which the topics are used as links in Wikipedia. They use a supervised approach that learns the typical distributions of these features from a manually tagged corpus [Frank et al. 1999]. For training and evaluation they had 30 people, working in pairs, index 20 documents. Figure 8 shows key topics for one document and demonstrates the inherent subjectivity of the task: the indexers did not all choose the same topics, and achieved only 30% agreement with each other. Medelyan et al.'s automatic system, whose choices are shown as filled circles in the figure, obtained the same level of agreement and requires little training.

Although it has not been evaluated as such, Gabrilovich and Markovitch's [2007] Explicit Semantic Analysis, described in Section 3.1, essentially performs topic indexing. For each document or text fragment it generates a weighted list of relevant Wikipedia concepts, the strongest of which should be suitable topic labels.
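The two Wikipedia-specific features that Medelyan et al. [2008] use, as described above, can be sketched as follows. The exact definitions here are assumptions: node degree is taken as the number of other candidate topics connected by a link in either direction, and keyphraseness as the fraction of a phrase's occurrences in Wikipedia that are link anchors. The link graph and counts are invented toy data.

```python
def node_degree(topic, candidates, links):
    """Count the other candidate topics linked to or from `topic`.

    `links` maps an article title to the set of titles it links to.
    Treating links as undirected is an assumption of this sketch.
    """
    others = set(candidates) - {topic}
    outgoing = links.get(topic, set())
    return sum(
        1 for other in others
        if other in outgoing or topic in links.get(other, set())
    )


def keyphraseness(phrase, link_count, occurrence_count):
    """Fraction of a phrase's occurrences that are used as link anchors."""
    total = occurrence_count.get(phrase, 0)
    return link_count.get(phrase, 0) / total if total else 0.0


# Hypothetical link graph among topics detected in one document.
toy_links = {
    "Regression testing": {"Software testing", "Algorithm"},
    "Software testing": {"Regression testing"},
    "Algorithm": set(),
    "Safety": set(),
}
toy_candidates = list(toy_links)

degree_central = node_degree("Regression testing", toy_candidates, toy_links)
degree_isolated = node_degree("Safety", toy_candidates, toy_links)

# Hypothetical counts: the phrase occurs 40 times, 30 of them as a link.
kp = keyphraseness(
    "regression testing",
    {"regression testing": 30},
    {"regression testing": 40},
)
```

In a supervised setting these values would be combined with classical features, such as term frequency, as inputs to the learned model.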
Another approach that has not been compared to manually indexed documents is Schönhofen [2006], who uses Wikipedia categories as the vocabulary from which key topics are selected. Documents are scanned to identify the article titles and redirects they mention, and documents are represented by the categories that contain these articles, weighted by how often the document mentions the category title, its child article titles, and the individual words in them. Schönhofen did not compare the resulting categories with index topics, but instead used them to perform document categorization. Roughly the same results are achieved whether documents are represented by these categories or by their content in the standard way. Combining the two yields a significant improvement.

Like document categorization, research in topic indexing builds solidly on related work, but has been augmented to make interesting use of Wikipedia. Although not a great deal of research has been done, significant gains have been achieved over the previous state of the art. The results have not yet been evaluated as rigorously as in categorization, however. Medelyan et al. [2008] have directly compared their results against manually defined ground truth, but this was restricted to a relatively small dataset. To advance further, larger datasets need to be developed for evaluation and training.

Figure 8. Topics assigned to a document entitled "A Safe, Efficient Regression Test Selection Technique" by human teams (outlined circles) and the new algorithm (filled circles).

5. INFORMATION EXTRACTION

Where information retrieval is driven largely by the goal of answering specific questions, information extraction seeks to deduce meaningful structures from unstructured data such as natural language text, though in practice the dividing line between the fields is not sharp. These structures are usually represented as relations.
For example, from the sentence "Apple Inc.'s world corporate headquarters are located in the middle of Silicon Valley, at 1 Infinite Loop, Cupertino, California." a relation hasHeadquarters(Apple Inc., 1 Infinite Loop-Cupertino-California) might be extracted. The challenge is to extract this relation from sentences expressing the same information about Apple Inc., regardless of the actual wording. Moreover, given a similar sentence about other companies, the same relation should be determined with different arguments, e.g., hasHeadquarters(Google Inc., Google Campus-Mountain View-California).

Methods for extracting relations from Wikipedia can be grouped into those that use its raw text (Section 2.3.2) and those that use its semi-structured parts and internal hyperlink structure (Section 2.3.3, 2.3.4 and 2.3.5). The former, described in Section 5.1, apply methods developed before Wikipedia was recognized as a linguistic resource; for them, any text represents a source of relations. The extraction process benefits from the encyclopedic nature of Wikipedia articles and their uniform writing style. The latter, described in Section 5.2, exploit unique Wikipedia properties such as infoboxes and the category structure. Finally, in Section 5.3 the determination of named entities and their type is treated as a task of its own. As noted earlier, Wikipedia's coverage of named entities is uniquely comprehensive and up-to-date (Section 3.2.3). Such work extracts named entity information such as isA(Portugal, Location) and isA(Bob Marley, Person). Again, although the task is similar to that in Sections 5.1 and 5.2, different techniques are applied, like analysis of geographical coordinates.

5.1 Semantic relations in Wikipedia's raw text

Extracting semantic relations from raw text begins by taking known relations that serve as seeds and extracting patterns from their text: X's * headquarters are located in * at Y in the above example.
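To make the seed pattern above concrete, the following sketch compiles a surface pattern with X, Y and * placeholders into a regular expression and applies it to a sentence. The helper names `compile_pattern` and `extract`, the treatment of * as zero or more whole words, and the second (invented) test sentence are assumptions for illustration, not the method of any particular system surveyed here.

```python
import re

# A token such as "X", "Y" or "X's": a slot letter plus optional suffix.
PLACEHOLDER = re.compile(r"([XY])(\W.*)?")


def compile_pattern(pattern):
    """Compile a surface pattern with X, Y and * into a regex.

    X and Y become named capture groups; * matches zero or more
    whitespace-separated words; other tokens match literally.
    """
    parts = []
    for token in pattern.split():
        slot = PLACEHOLDER.fullmatch(token)
        if token == "*":
            parts.append(r"(?:\S+\s+)*?")
        elif slot:
            suffix = re.escape(slot.group(2) or "")
            parts.append(rf"(?P<{slot.group(1)}>.+?){suffix}\s+")
        else:
            parts.append(re.escape(token) + r"\s+")
    body = "".join(parts)
    if body.endswith(r"\s+"):
        # The last token need not be followed by whitespace.
        body = body[: -len(r"\s+")]
    return re.compile(body + r"\.?")  # tolerate a final full stop


def extract(pattern, relation, sentence):
    """Return (relation, X, Y) if the whole sentence matches, else None."""
    match = compile_pattern(pattern).fullmatch(sentence.strip())
    if match is None:
        return None
    return (relation, match.group("X"), match.group("Y"))


SEED = "X's * headquarters are located in * at Y"

apple = extract(
    SEED,
    "hasHeadquarters",
    "Apple Inc.'s world corporate headquarters are located in the middle "
    "of Silicon Valley, at 1 Infinite Loop, Cupertino, California.",
)
# Invented sentence about a hypothetical company.
acme = extract(
    SEED,
    "hasHeadquarters",
    "Acme Corp.'s headquarters are located in Springfield at 12 Main Street.",
)
```

Matching on surface strings like this is brittle: a reworded sentence with the same meaning yields no match, which motivates the generalization and parsing techniques discussed in this section.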
These patterns are applied to a large text corpus to identify new relations. For this, a phrase chunker or named entity recognizer is applied to identify entities that appear in a sentence, intervening patterns are compared to the seed patterns, and when they match, new semantic relations are discovered. Culotta et al. [2006] summarize difficulties in this process:
• Enumerating over all pairs of entities yields a low density of correct relations even when restricted to a single sentence
• Errors in the entity recognition stage create inaccuracies in relation classification.

Wikipedia's structure helps combat these difficulties. Each article represents a particular concept that serves as a clearly recognizable principal entity for relation extraction from that article. Its description contains links to other, secondary, entities. All that remains is to determine the semantic relation between these entities. For example, the description of the Waikato River, shown in Figure 9, links to entities like river, New Zealand, Lake Taupo and many others. Appropriate syntactic and lexical patterns can extract a host of semantic relations between these items.

Ruiz-Casado et al. [2005] mine relations from Simple Wikipedia using WordNet as a source of positive examples (Ruiz-Casado et al. [2007] explain the technique in greater detail). Given two co-occurring semantically related WordNet nouns in a Wikipedia article, the intervening text is used to find relations that are absent from WordNet. But first the text is generalized. If the edit distance falls below a predefined threshold (i.e., the two strings nearly match) those parts that do not match are replaced by a wildcard (*). For example, a generalized pattern "X directed the * famous|known film Y" is obtained from two strings: "X directed the famous film Y" and "X directed the well known film Y". Using this technique Ruiz-Casado et al. identify 120 new semantic relations with a precision of 61 to 69% depending on the relation type.

Ruiz-Casado et al.
[2006] generalize this technique to extract relations between automatically identified entities without using WordNet as a reference. The English Wikipedia is used as a corpus, but now the authors concentrate only on those parts that are likely to contain relations of interest. They crawl Wikipedia's list pages to access prime ministers, authors, actors, football players, and capitals; and infer the same kind of predefined patterns as above. They manually evaluate precision on at least 50 examples for each relation type. If the pages are combined into a single corpus results vary wildly, from 8% precision on the player-team relation to 90% for death-year. The reason is heterogeneity in style and mark-up of articles. When the player-team patterns are applied just to articles about football players, precision increases to 93%.

Figure 9. Wikipedia's description of the Waikato River.

Herbelot and Copestake [2006] extract hyponymy relations from sentences containing the verb to be (including is, was, will be, etc.). Instead of performing simple pattern matching of the form X is a Y with some wildcards, they analyze the sentences to identify the subject, object and their relationship, regardless of word order. These authors use their own dependency analyzer, called Robust Minimal Recursion Semantics, which can handle partially parsed sentences. This analyzer re-organizes a parsed sentence into a series of minimal semantic trees whose root elements correspond to lemmas in the sentence. The same tree is obtained for similar sentences like "Xanthidae is a family of crabs" and "Xanthidae is one of the families of crabs" (Figure 10). The results are evaluated manually on a subset of 10 articles and automatically using a thesaurus, restricted to Wikipedia articles describing animal species. Because only 3 patterns were used, recall was low: 14% at precision 92%. To improve recall they suggest extracting patterns automatically.
The same dependency analyzer is used, which yields patterns that are more general than regular expressions, although no explicit performance comparison is provided. Initial experiments increase recall to 37%; however, precision drops to 65%.

Suchanek et al. [2006] also employ linguistic techniques to achieve better results than regular expressions. They parse each sentence with a context-free grammar. A pattern is defined by a set of syntactic links between two given concepts, called a bridge. For example, the bridge in Figure 11 matches sentences like "Chopin was great among the composers of his time" where Chopin=X and composers=Y. Machine learning techniques are applied to determine and generalize patterns that describe relations of interest from manually supplied positive and negative examples. The approach is evaluated on article sets with different degrees of heterogeneity: articles about composers, geography, and random articles. As expected, the more heterogeneous the corpus the worse the results, with best results achieved on composers for the relations birthDate (F-measure 75%) and instanceOf (F-measure 79%). Unlike Herbelot and Copestake [2006], Suchanek et al. show that their approach outperforms other systems, including the shallow pattern matching resource TextToOnto (see footnote 22) and the more sophisticated scheme of Chimiano and Volker [1995].

Nguyen et al. [2007a, 2007b] augment these ways of combining lexical and syntactic patterns with techniques such as anaphora resolution (to increase coverage), full dependency parsing and subtree mining. Sentences are analyzed with OpenNLP (see footnote 23), and anaphora and co-referents are resolved using a simple heuristic developed specially for the purpose. Thus, in an article about the software company 3PAR, phrases like 3PAR, manufacturer, it and company are tagged as the same principal entity. Next, all link anchors in the article are tagged as secondary entities, that is, ones relating to the principal entity.
Sentences with at least one principal and one secondary entity are analyzed by the Minipar dependency parser. The dependency tree of Figure 12a is extracted from the sentence "David Scott joined 3PAR as CEO in January 2001" and is then generalized to match similar sentences (Figure 12b). The subtrees are extracted from a set of training sentences containing positive examples and then applied as patterns to find new semantic relations. The scheme was evaluated using 3,30 manually annotated entities, 20 of which were reserved for testing. 6,000 Wikipedia articles, including 45 test articles, were used as the corpus. The new approach achieved an F-measure of 38%, with precision significantly higher than recall, significantly outperforming two simple baselines.

22 http://sourceforge.net/projects/texttoonto
23 http://opennlp.sourceforge.net/

Figure 10. Output of the Robust Minimal Recursion Semantics analyzer for the sentence "Xanthidae is one of the families of crabs" [Herbelot and Copestake, 2006].

Wang et al. [2007a] use selectional constraints in order to increase the precision of regular expressions without reducing coverage. They also automatically extract positive seeds from infoboxes. For example, the infobox field Directed by describes the relation hasDirector(FILM, DIRECTOR) and supplies positive examples of it. They collect patterns that intervene between these entities in Wikipedia's text and generalize them into regular expressions like X (is|was) (a|an) * (film|movie) directed by Y. Selectional constraints restrict the types of subject and object that can co-occur within such patterns. For example, Y in the pattern above must be a director, or at least a person. The labels that specify entity types, which are implemented as features, are derived from words that commonly occur in Wikipedia articles describing these entities. For example, instances of ARTIST extracted from a relation hasArtist(ALBUM, ARTIST) often co-occur with terms like singer, musician, guitarist, rapper, etc.
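The interplay between a generalized regular expression and a selectional constraint, as described above for Wang et al. [2007a], can be sketched as follows. The regex is a direct transcription of the pattern quoted in the text; the `entity_types` lookup, the `allowed` type labels and the function name are hypothetical stand-ins for the learned type features, and the second test sentence is invented.

```python
import re

# Transcription of: X (is|was) (a|an) * (film|movie) directed by Y
PATTERN = re.compile(
    r"(?P<X>.+?) (?:is|was) (?:a|an) (?:\S+ )*?(?:film|movie) "
    r"directed by (?P<Y>.+?)\.?"
)


def has_director(sentence, entity_types, allowed=frozenset({"director", "person"})):
    """Extract hasDirector(FILM, DIRECTOR) if the pattern and constraint hold.

    `entity_types` maps an entity name to a set of type labels. In this
    sketch it is a hand-made dictionary; in the work described it would
    come from features learned from Wikipedia text.
    """
    match = PATTERN.fullmatch(sentence)
    if match is None:
        return None
    film, director = match.group("X"), match.group("Y")
    # Selectional constraint: Y must be a director, or at least a person.
    if not (entity_types.get(director, set()) & allowed):
        return None
    return ("hasDirector", film, director)


# Hypothetical type labels.
toy_types = {
    "Woody Allen": {"person", "director"},
    "Acme Studios": {"organization"},
}

accepted = has_director(
    "Annie Hall is a romantic comedy film directed by Woody Allen.", toy_types
)
rejected = has_director(
    "Some Film was a movie directed by Acme Studios.", toy_types
)
```

The constraint is what allows looser patterns to be used safely: a match is kept only when the types of its arguments fit the relation.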
To ensure better coverage, Wang et al. cluster such terms hierarchically. The advantage of selectional constraints is that they allow patterns such as "X's Y" and "X of Y" to be applied. The relations hasDirector and hasArtist are evaluated independently on a sample of 10 relations extracted automatically from the entire Wikipedia and manually assessed by three human subjects. An unsupervised learning algorithm was applied, and the features were tested individually and together. The authors report precision and accuracy values close to 100%.

Figure 11. Example bridge pattern used in Suchanek et al. [2006].

Figure 12 (a) and (b). Example dependency parse in Nguyen et al. [2007].

The same authors investigate a different technique that does not rely on patterns at all [Wang et al. 2007b]. Instead, features are extracted from two articles before determining their relation:
• The first noun phrase and its lexical head that follows the verb to be in the article's first sentence (e.g., comedy film and film in "Annie Hall is a romantic comedy film")
• Noun phrases that appear in the corresponding category titles and the lexical heads.
• Infobox predicates, e.g. Directed by and Produced by in Annie Hall.
• Terms that appear between the articles in sentences that contain them both as a link.

For each pair of articles the distribution of values of these features is compared with that of positive examples. Unlike in [Wang et al. 2007a], no negative instances are used. A special learning algorithm (B-POL) designed for situations where only positive examples are available is applied. First, negative examples are identified from unlabeled data using a weak classifier, and then a strong classifier (e.g., SVM) is used to iteratively classify negative examples until none remain. Four relations were used for evaluation, hasArtist(ALBUM, ARTIST), hasDirector(FILM, DIRECTOR), isLocatedIn(UNIVERSITY, CITY), isMemberOf(ARTIST, BAND), along with 1,000 named entity pairs classified by three human subjects.
Best results were an F-measure of 80% on the hasArtist relation, which had the largest training set; the worst was 50% on isMemberOf.

Wu and Weld [2007] view the extraction problem as a task of improving infoboxes in Wikipedia. Like Wang et al. [2007a, 2007b] they use infobox content as training data. Their system, called Kylin, first maps infobox attribute-value pairs to sentences in the corresponding Wikipedia article using some simple heuristics. Next, for each attribute it creates a sentence classifier that uses the sentence's tokens and their part-of-speech tags as features. Given an unseen Wikipedia article, a document classifier analyzes its categories and assigns an infobox class, e.g. "U.S. counties". Next, the sentence classifier is applied to assign relevant infobox attributes. Extracting values from the sentences is treated as a sequential data-labeling problem, to which Conditional Random Fields are applied. Precision and recall of Kylin are measured by its ability to generate correct infoboxes for Wikipedia articles for which infobox information is known. The authors manually judged the attributes produced by their system and by Wikipedia authors. Kylin's precision ranged from 74 to 97%, at recall levels of 60 to 96% respectively, depending on the infobox class. The Wikipedia authors' precision was around 95% on average and more stable across the classes; their recall was significantly better on most classes but worse or the same on others.

In a later work, Wu et al. [2008] address problems in their approach in the following way. To generate complete infobox schemata for articles of rare classes, they refer to WordNet's ontology and aggregate attributes from parent classes to their children. For example, knowing that isA(Performer, Person), the infobox for Performers receives the previously missing field BirthPlace. To provide additional positive examples, they apply TextRunner [Banko et al. 2007] to the web, in order to retrieve additional sentences describing the same attribute-value pairs.
Given a new entity for which an infobox needs to be generated, they use Google search to retrieve additional sentences describing this entity. The combination of these techniques improves recall by 2 to 9 percentage points while maintaining or increasing precision. Kylin's results are the most complete and impressive in this group of approaches.

The majority of the approaches presented take advantage of Wikipedia's encyclopedic nature, using it as a corpus for extracting semantic relations. Simple pattern matching techniques are outperformed by those that use parsing [Suchanek et al. 2006], selectional constraints [Wang et al. 2007a] and lexical features [Wang et al. 2007b]. Wang et al. [2007a] and Wu and Weld [2007] show that Wikipedia infoboxes contain positive examples that can improve the extraction if machine learning is applied. Wu et al. [2008] show that retrieving additional content from the web boosts extraction performance.

It would be helpful to compare the approaches directly on the same data set. For this, of course, the researchers would need to reach a consensus on which relations to extract. At this point, while some relations overlap (isMemberOf, InstanceOf, hasDirector), each research group's choice of relation set seems to be arbitrary. Furthermore, none of these techniques takes advantage of Wikipedia's structural information, such as the hyperlinks between articles and their categorization. As the next section shows, such information contains a wealth of semantic relations, outnumbering those that appear in Wikipedia's actual text.

Figure 13. Fragment of Wikipedia's category structure [Ponzetto, 2007].

5.2 Semantic relations in structured parts of Wikipedia

Here we describe research that addresses the limitations just identified by seeking semantic relations in (semi-)structured parts of Wikipedia, with the goal of building an alternative to manually created knowledge bases such as WordNet and Cyc.
Some approaches label existing links between categories and articles, a process sometimes referred to as link-typing. As noted in Section 2.2.6, Wikipedia's category structure is made up of what are in fact rather different kinds of relations. For example, in Figure 13 Category:Mathematical logic belongs to both Category:Logic and Category:Mathematics; the former relation should arguably be isA and the latter partOf. Further differentiation between category relations in Wikipedia is required to transform it into a lexical knowledge base like those created by humans. Some approaches use Wikipedia's infoboxes (Figure 14) as a further source of relational information.

Chernov et al. [2006] were among the first to analyze links between Wikipedia categories. Their goal was to determine semantically strong links, as opposed to "irregular and navigational links." They develop two measures. One correlates semantic strength with the number of hyperlinks between articles assigned to the two categories in question; the other is the connectivity ratio: the number of links from articles in one category to articles in the other, expressed as a proportion of the total number of links in the first category. Evaluation uses a sample of 10 category pairs, each assessed by human subjects as strongly, averagely or weakly related. Chernov et al. observe that both measures correlate with human judgments, but a more thorough study is required.

Several projects extract relations from Wikipedia of a quantity or organization that might properly be called "ontological". Discussion of these projects impinges on the territory of Section 6. Here we discuss the projects' methods and relationship to other IE research, while in Section 6 we discuss their end-products considered as ontologies in their own right.

One such project is YAGO, Yet Another Great Ontology [Suchanek et al. 2007].
Here Wikipedia's leaf categories are mapped onto the WordNet taxonomy of synsets, and the articles belonging to those categories are added to the taxonomy as new elements. To perform the mapping, each category's lexical head is extracted (people in Category:American people in Japan) and, if necessary, expressed in singular form (person) before being sought in WordNet. If there is a match, it is chosen as the class for this category. This scheme extracts 143,000 isA relations; in this case, isA(American people in Japan, person/human). If more than one match is possible, word sense disambiguation is required (cf. Section 3.2.3). The authors experimented with mapping a category's subcategories to WordNet and choosing the sense that corresponds to the smallest resulting taxonomic graph. However, they claim that this semantically enhanced technique does not perform as well as choosing the most frequent WordNet synset for a given term (the frequency values are provided by WordNet), an observation that seems inconsistent with findings by other authors [e.g. Medelyan and Milne 2008], who show that the most frequent sense is not necessarily the intended one (Section 3.2.3).

Having established a large core taxonomy, the authors define a mixed suite of heuristics for extracting further relations to add to it. For instance, a name parser is applied to all personal names to identify given and family names, adding 40,00 relations like familyNameOf(Albert Einstein, "Einstein"). Many heuristics make use of Wikipedia category names, allowing extraction of relations like bornInYear, establishedIn, locatedIn and others. For example, subcategories of categories ending with birth (e.g., 1879 birth) and establishments correspond to the first two relations. A category like Cities in Germany indicates the locatedIn relation. This yields 370,000 non-hierarchical, non-synonymous relations. Manual evaluation of sample facts by human judges shows 91 to 99% accuracy, depending on the relation.
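The head-matching step described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the YAGO implementation: the preposition list, the singularization rules, the three-entry TOY_WORDNET dictionary (including the label city/metropolis) and all function names are hypothetical stand-ins for the noun-phrase parsing and WordNet lookup that a real system would need.

```python
# Hypothetical stand-ins: a real system would use a noun-phrase parser,
# a proper lemmatizer, and WordNet itself.
PREPOSITIONS = {"in", "of", "from", "by", "at", "on", "for"}
IRREGULAR_SINGULARS = {"people": "person"}
TOY_WORDNET = {
    "person": "person/human",
    "city": "city/metropolis",
    "film": "film/movie",
}


def lexical_head(category_title):
    """Approximate the head noun as the last word before the first preposition."""
    words = category_title.split()
    for i, word in enumerate(words):
        if i > 0 and word.lower() in PREPOSITIONS:
            return words[i - 1].lower()
    return words[-1].lower()


def singularize(noun):
    """Very rough singular form; only good enough for this illustration."""
    if noun in IRREGULAR_SINGULARS:
        return IRREGULAR_SINGULARS[noun]
    if noun.endswith("ies"):
        return noun[:-3] + "y"
    if noun.endswith("s") and not noun.endswith("ss"):
        return noun[:-1]
    return noun


def map_category(category_title):
    """Return an isA triple if the singular head is found, else None."""
    head = singularize(lexical_head(category_title))
    synset = TOY_WORDNET.get(head)
    if synset is None:
        return None
    return ("isA", category_title, synset)


print(map_category("American people in Japan"))
print(map_category("Cities in Germany"))
```

Categories whose head is missing from the lookup table simply yield no relation, and a head with several matching senses would still need the disambiguation step discussed above.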
Also added are 2M synonymy relations generated from redirects, 40M context relations generated from cross-links between articles, and 2M type relations between categories considered as classes and their articles considered as entities. Section 6.6 discusses the number and kinds of facts in YAGO in more detail, as well as further specifically ontological features, such as its purpose-built ontology language.²⁴

Another extremely large-scale relation-extraction project is DBPedia [Auer and Lehmann 2007]. This project analyses Wikipedia's infoboxes and transforms their content into RDF triples. Figure 14 shows part of the infobox from the New Zealand article; on the right is the Wiki mark-up used to create it. Extracting information from infoboxes is by no means trivial. The information they contain is expressed in an attribute-value notation, which is rendered inside a wiki page by means of an associated template. There are many different templates, with a great deal of redundancy between them; for example, Auer and Lehmann report separate templates for Infobox_film, Infobox Film, and Infobox film. Recursive regular expressions are used to parse relational triples from all templates that are commonly used in Wikipedia and contain at least several predicates. For example, the country template encodes relations like hasCapital(New Zealand, Wellington) or hasPrimeMinister(New Zealand, Helen Clark). The templates are taken at face value; no heuristics are applied to verify their accuracy. The URL of each entity linked to from an article is recorded as a unique identifier. Wikipedia categories are treated as classes and articles as individuals. However, Auer and Lehmann do not say what happens to articles that have corresponding categories, like New Zealand; presumably article and category receive different identifiers. Unlike YAGO, there is no attempt to place facts in the framework of an overall taxonomic structure of concepts.
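To make the attribute-value idea concrete, the following Python sketch turns a small infobox fragment into (subject, predicate, object) triples. It is a deliberately simplified, line-based illustration and not Auer and Lehmann's extractor: it assumes one attribute per line, ignores nested templates, uses the raw attribute name as the predicate, and emits plain tuples rather than RDF. The function name infobox_triples and the SAMPLE string are invented for the example.

```python
import re

# One "| attribute = value" pair per line (an assumption of this sketch).
FIELD = re.compile(r"^\|\s*(\w+)\s*=\s*(.+?)\s*$")
# [[Target]] or [[Target|label]]; captures the link target only.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def infobox_triples(subject, markup):
    """Return (subject, predicate, object) tuples for each infobox field."""
    triples = []
    for line in markup.splitlines():
        match = FIELD.match(line.strip())
        if match is None:
            continue  # template header, closing braces, or elided lines
        predicate, value = match.groups()
        targets = LINK.findall(value)
        # Linked entities become objects; otherwise keep the literal value.
        for obj in targets or [value]:
            triples.append((subject, predicate, obj))
    return triples


SAMPLE = """{{Infobox Country or territory
| native_name = New Zealand
| capital = [[Wellington]]
| largest_city = [[Auckland]]
| government_type = [[Parliamentary democracy]] and [[Constitutional monarchy]]
}}"""

for triple in infobox_triples("New Zealand", SAMPLE):
    print(triple)
```

Because the values are taken at face value, any error or inconsistency in the template is copied straight into the output, which is the limitation noted in the text.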
Apart from the infobox relations, links between categories are merely extracted and labeled with the relation isRelatedTo. The resulting DBPedia dataset contains 15,00 classes and 650,000 individuals sharing 8,00 types of semantic relations. A total of 103M triples are extracted, far surpassing any other scheme.²⁵ However, 60% of these are internal links derived from Wikipedia's link structure; only 15% are taken directly from infoboxes. Also, since there is no evaluation, it is difficult to judge how accurate the triples are. Unlike other approaches, DBPedia relies on the accuracy of Wikipedia's contributors, and Auer and Lehmann suggest guidelines for authors in order to improve the quality of infoboxes over time. Section 6.6 further discusses DBPedia in the context of YAGO and other ontologies.

²⁴ YAGO can be queried online or downloaded from http://www.mpi-inf.mpg.de/~suchanek/downloads/yago/

    {{ Infobox Country or territory
    | native_name = New Zealand
    | …
    | capital = [[Wellington]]
    | latd = 41 | latm = 17 | latNS = S | longd = 174 | longm = 27 | longEW = E
    | largest_city = [[Auckland]]
    | official_languages = [[New Zealand English|English]] (98%) [[Māori language|Māori]] (4.2%) [[New Zealand Sign Language|NZ Sign Language]] (0.6%)
    | demonym = [[New Zealand People|New Zealander]],[[Kiwi (people)|Kiwi]]
    | government_type = [[Parliamentary democracy]] and [[Constitutional monarchy]]
    …}}

Figure 14. Wikipedia infobox on New Zealand.

Work at the European Media Lab Research (EMLR) takes up the challenge of further differentiating category links independently of the DBpedia project. Ponzetto and Strube [2007] observe that the first task is to construct a knowledge taxonomy, or subsumption hierarchy, and that the quickest way to do this is to identify and isolate isA relations from amongst already-existing category links. Here isA is thought of as subsuming relations between two classes, as in isSubclassOf(Apples, Fruit), and between an instance and its class, as in isInstanceOf(New Zealand, Country).
They analyze category titles and their connectivity to distinguish between isA and what they call "notIsA" relations. Several steps are applied in order of accuracy. One of the most accurate matches the lexical head and modifier of two phrases. Sharing the same lexical head indicates an isA relation, e.g., isA(British computer scientist, Computer scientist). Modifier matching indicates notIsA, e.g., notIsA(Islamic mysticism, Islam). Another method uses co-occurrence statistics of two categories within patterns to indicate hierarchical and non-hierarchical relations: for example, NP2,? (such as|like|, especially) NP* NP1 indicates isA, and NP1 are? used in NP2 indicates notIsA. This technique induces 10,00 isA relations from Wikipedia. Comparing the derived labels with relations assigned (by knowledge engineers) to concepts with the same lexical heads in ResearchCyc shows that their labeling is highly accurate, depending on the method used, and yields an overall F-measure of 88%. Ponzetto [2007] describes how they plan to apply the induced knowledge base to natural language processing tasks such as co-reference resolution.

²⁵ Further information, and the extracted data, can be downloaded from http://www.dbpedia.org

Figure 15. Relations inferred from BY categories [Nastase and Strube 2008].

Since then the same research group has further refined semantic relations between Wikipedia categories. Zirn et al. [2008] divide the derived isA relations into those expressing isSubclassOf and isInstanceOf. For example, Category:American scientist generalizes Category:American physicists, whereas Category:Albert Einstein is an instance of Category:American physicists. Two methods assume that all named entities are instances and thus related to their categories by isInstanceOf. One uses a named entity recognizer, the other a heuristic based on capitalization in the category title. Further methods include heuristics like: If a category has at least one hyponym that has at least two hyponyms, it is a class.
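Two of these heuristics are simple enough to sketch in Python. The structural rule follows the wording above directly. The capitalization rule is only a guess at what such a heuristic might look like, since its details are not given here: it assumes a multi-word title is a named entity when every word is capitalized. The function names and the toy hyponym graph are hypothetical.

```python
def is_class_by_structure(category, hyponyms):
    """A category is a class if at least one of its hyponyms
    itself has at least two hyponyms."""
    return any(
        len(hyponyms.get(child, ())) >= 2
        for child in hyponyms.get(category, ())
    )


def looks_like_instance(title):
    """Assumed capitalization rule: a multi-word title whose words are all
    capitalized is treated as a named entity, i.e. an instance."""
    words = title.split()
    return len(words) > 1 and all(word[0].isupper() for word in words)


# Toy category graph: category -> list of direct subcategories (hyponyms).
HYPONYMS = {
    "Scientists": ["American scientists"],
    "American scientists": ["American physicists", "American chemists"],
    "American physicists": [],
    "American chemists": [],
    "Albert Einstein": [],
}

print(is_class_by_structure("Scientists", HYPONYMS))
print(looks_like_instance("Albert Einstein"))
print(looks_like_instance("American physicists"))
```

Neither rule is reliable on its own (a single-word title, for example, is left undecided by the capitalization guess), which is consistent with the authors combining several methods.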
Evaluation against 8,00 categories listed in ResearchCyc as individuals (instances) and collections (classes) shows that the capitalization method is best, achieving 83% accuracy; however, combining all methods into a single voting scheme improves this to 86%. The taxonomy derived from this work is available in RDF Schema format.²⁶

Nastase and Strube [2008] extract non-taxonomical relations from Wikipedia by parsing category titles. They no longer work only with the category network, but also derive entirely new relations between categories, articles and terms extracted from category titles. Explicit unitary relations are extracted: for example, analysis of the category title Queen (band) members results in the memberOf relation being inferred from the articles in that category to the article for the band, e.g. memberOf(Brian May, Queen (band)). Explicit binary relations are also extracted: for example, if a category title matches the pattern X [VBN IN] Y, for instance Movies directed by Woody Allen, the verb phrase is used to "type" a relation between all articles assigned to the category and the entity Y, e.g. directedBy(Annie Hall, Woody Allen), while the class X is used to further type the articles in the category, e.g. isA(Annie Hall, Movie). Particularly sophisticated is their derivation of entirely implicit relations from the very common X by Y pattern in Wikipedia category names, which facets a great deal of the category structure (e.g. Writers By Nationality, Writers by Genre, Writers by Language). For instance, given the category title Albums By Artist, they not only label all the articles in the category isA(X, Album), but also find subcategories pertaining to particular artists (e.g. MilesDavis, Albums), locate the article corresponding to the artist, label the entity as an artist, e.g. isA(MilesDavis, Artist), and label all members of the subcategory as being produced by him, e.g. artist(KindOfBlue, MilesDavis). Figure 15 illustrates this.
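The explicit binary case, X [VBN IN] Y, can be illustrated with a short Python sketch. It is not Nastase and Strube's parser: in place of a part-of-speech tagger it assumes that a past participle is any word ending in "ed" or "en", it recognizes only a handful of prepositions, and it forms the singular class name by dropping a final "s". The regular expression and the function name relations_from_category are invented for the example.

```python
import re

# Crude stand-in for the X [VBN IN] Y pattern: the "verb" group approximates
# a past participle by its suffix, which a real system would get from a tagger.
TITLE = re.compile(
    r"^(?P<cls>.+?)\s+(?P<verb>\w+(?:ed|en))\s+(?P<prep>by|in|at|for)\s+(?P<entity>.+)$"
)


def relations_from_category(title, articles):
    """Derive typed relations for every article in a category."""
    match = TITLE.match(title)
    if match is None:
        return []  # e.g. plain "X by Y" titles need the implicit analysis
    relation = match.group("verb") + match.group("prep").capitalize()
    cls = match.group("cls")
    if cls.endswith("s"):
        cls = cls[:-1]  # naive singular form
    entity = match.group("entity")
    relations = []
    for article in articles:
        relations.append((relation, article, entity))
        relations.append(("isA", article, cls))
    return relations


print(relations_from_category("Movies directed by Woody Allen", ["Annie Hall"]))
```

The suffix test is easily fooled (a noun such as "Women" also ends in "en"), which is why the tag-based pattern in the text is the better formulation; the sketch only shows how a matched title yields both a typed relation and an isA relation.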
Nastase and Strube identify a total of 3.4 million isA and 3.2 million spatial relations, along with 43,00 memberOf relations and 4,00 other relations such as causedBy and writtenBy. Evaluation with ResearchCyc was not meaningful because of little overlap in extracted concepts, particularly named entities. Instead, human annotators analyzed four samples of 250 relations from the above sets; precision ranged from 84 to 98% depending on relation type. Once again, the implications of this work for ontology building will be discussed in Section 6.6.

Although the three approaches presented in this section (YAGO, DBPedia and EMLR's taxonomy) have the same goal, to create an extensive, accurate knowledge base of human language, the techniques differ significantly. The first combines Wikipedia's leaf categories (and their instances) with WordNet's hypernym hierarchy, embellishing this structure with further relations; the second basically dumps the contents of Wikipedia's infoboxes with little further analysis; and the third performs a differentiation or "typing" of category links, followed by an analysis of category titles and the articles contained by those categories to derive further relations. As a result, the information extracted varies. For instance, whereas Suchanek et al. [2007] extract the relation writtenInYear, Nastase and Strube [2008] detect writtenBy, and Auer and Lehmann [2007] generate written, writtenBy, writer, writers, writerName, coWriters, as well as their case variants. There has so far been little comparison of these approaches, testing of them against each other, or attempts to integrate them. We look forward to further research in this area.

²⁶ http://www.eml-r.org/english/research/nlp/download/wikitaxonomy.php

5.3 Typing Wikipedia's named entities

One main disadvantage of Wikipedia is its lack of semantic annotation.
Infoboxes for entities of the same kind share similar characteristics (for example, Apple Inc, Microsoft and Google share the fields Founded, Headquarters, Key People and Products), but Wikipedia does not state that they belong to the same type of named entity, namely company. Knowing the type of entity, e.g. location or person, would supply information that is important for tasks such as information retrieval and question answering (Section 4). This section covers research that classifies articles into predefined classes representing entity types. The results are semantic relations of a particular kind, e.g. isA(London, Location).

Toral and Muñoz [2006] extract named entities from the Simple Wikipedia using WordNet's noun hierarchy. Given an entry (Portugal), they extract the first sentence of its definition (Portugal is a country in the south-west of Europe) and tag each word with its part of speech. They assign nouns their first (i.e. most common) sense from WordNet and move up in the hierarchy to determine its class, e.g., country → location. The majority class appearing in the sentence determines the class of the article itself (i.e. entity). The authors achieve 78% F-measure on 404 locations and 68% on 236 persons. They do not use Wikipedia's special features but mention this as future work.

Buscaldi and Rosso [2007] pursue the same task, but concentrate on locations. Unlike Toral and Muñoz [2006], they analyze not merely the first sentence but the entire description of each article. In order to determine whether it describes a geographical location, they compare its content with a set of keywords extracted from glosses of locations in WordNet using the Dice metric and cosine coefficient; they also use a multinomial Naïve Bayes classifier trained on the Wikipedia XML corpus [Denoyer and Gallinari 2006].
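For readers unfamiliar with the two similarity measures just mentioned, the Python sketch below gives their standard definitions over bags of words. It is generic textbook code, not Buscaldi and Rosso's system: the tokenization by whitespace, the keyword list and the example sentence are assumptions made purely for illustration.

```python
import math
from collections import Counter


def dice(a, b):
    """Dice coefficient between two term sets: 2|A∩B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[term] * vb[term] for term in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical location keywords and article text (not data from the paper).
location_keywords = "city country river region capital".split()
article = "portugal is a country in the south-west of europe".split()

print(round(dice(location_keywords, article), 3))
print(round(cosine(location_keywords, article), 3))
```

In a classifier built on either measure, an article would be labeled a location when its score against the keyword set exceeds some threshold, which would have to be tuned on held-out data.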
When evaluated on data provided by Overell and Rüger [2007] (described in Section 3.2.2), they find that cosine similarity outperforms both the WordNet-based Dice metric and Naïve Bayes, achieving an F-measure of 53% on full articles and 65% on the first sentence. However, the authors fail to achieve Overell and Rüger's [2006] results, and conclude that the content of articles describing locations is less discriminative than other features like geographical coordinates.

Section 3.2.2 discussed how Overell and Rüger [2006, 2007] analyze named entities representing geographic locations, thereby mapping articles to place names listed in a gazetteer. It also described another group of approaches that recognize named entities appearing in raw text and map them to articles. Apart from these, little research has been done on determining the semantic types of named entities. It is surprising that both techniques described in the present section use WordNet as a reference for the entities' semantic class instead of referring to Wikipedia's categories. For example, the three companies mentioned above belong to subcategories of Category:Companies and Portugal is listed under Category:Countries. Moreover, neither technique utilizes the shared infobox fields mentioned above. Annotating Wikipedia with entity labels seems to be low-hanging fruit and we expect to see more advances in the near future.

Approaches to information extraction are less well defined than for natural language processing and most information retrieval tasks, and vary in their scope and depth depending on the research group. There is a dearth of commonly used ground truth data, each technique being evaluated in a different way. It seems that a unified, comprehensive, general-purpose ontology would be the ideal extension of the research discussed above.
For instance, it could unify the specific relations concerning football players and their birth dates extracted from article text with the wealth of taxonomic relations in Wikipedia's category structure and any available named entity information. Thus the next section reviews some of the projects described above, and others, from the perspective of classical, large-scale ontology building.

6. ONTOLOGY BUILDING AND THE SEMANTIC WEB

We now turn to the use of Wikipedia for creating ontologies: comprehensive, large-scale information resources. Section 5 also covers aggregation of knowledge into forms structured for automated reasoning. Nevertheless it is worth treating the topics separately, because ontology building aims for a resource with a level of internal organization and consistency not always found in information extraction. Hence while Section 5 describes the many different methods used for the task, here we consider research projects from the perspective of the comprehensiveness and sophistication of their results, and also the extent to which they contribute to the broad-ranging and ambitious research project known as the semantic web.

6.1 Background: What is Ontology?

A formal ontology is a machine-readable theory of the meanings of some set of concepts or "categories." Building such a resource involves naming the concepts, representing and often categorizing the links between them, and usually encoding some key facts about them. Thus it is generally thought that an ontology which includes the concept tree should i) name it as a first-class object (to which synonyms such as the French arbre may be attached), ii) link it to closely-related concepts such as leaf, preferably with some indication that a leaf is part of a tree, rather than for instance a type of tree, and iii) represent facts such as "There are no trees in the Antarctic," which would be at least helpful. Having said that, there is a large spectrum of complexity and ambition amongst ontology projects.
One measure of complexity is the logical expressivity of the relevant ontology language [McGuinness 2003], which has a direct trade-off with inferential tractability, due to the vastly increased computation required to prove statements true in more expressive languages. Expressivity ranges from thesaurus-style representations of synonyms and homonyms, through frame-systems in which individuals are placed in classes in a subsumption hierarchy, through description logics that constitute large decidable fragments of first-order logic [Baader et al. 2007], to full first-order and even higher-order logic, for instance the Cyc project, with its purpose-built inference engine [Lenat 1995].

Ontology work began in earnest in the 1980s as a branch of AI research. After an initial rush of enthusiasm, the trade-off between logical expressivity and inferential tractability emerged and became a major obstacle, because much of the human knowledge that arguably should be represented in an ontology can only be stated in languages of great logical expressivity; for instance, negations and disjunctions require full first-order logic, while statements about statements require higher-order logic. Nevertheless, the goals of formal ontology have reawakened with the semantic web [Berners-Lee et al. 2001; Berners-Lee 2003]. Since Berners-Lee's vision is to index the web via meanings, not just character-strings, it is widely accepted that it will have to draw on some kind of shared, machine-readable, conceptual scheme. But the big stumbling block has been obtaining the world's involvement. At least two major problems need to be solved: first to define "semantic metadata" and then to mark up the web with it. The World-Wide Web Consortium recently defined a web ontology language, OWL [McGuinness and van Harmelen 2004]. It has three versions of different levels of expressivity: OWL Lite (thesaurus level), OWL DL (description logic-level) and OWL Full (full first-order logic).
But attempts to set up repositories for large-scale sharing and re-use of OWL ontologies have failed to gain traction.

It is worth emphasizing that the manual creation of ontologies is enormously difficult. It requires detailed knowledge of formal logic, and for the creation of upper and middle ontologies some understanding of metaphysics (whether explicitly formulated or "quick and dirty"). Moreover, as size increases, so do the interconnections amongst an ontology's categories, rendering the potential ramifications of local changes exponentially more significant. Cyc, the most ambitious ontology project, has employed specialist ontological engineers with PhDs in philosophy over a period of 20 years without reaching any natural end-point to the development process. Its nearest competitor, SUMO,²⁷ is an order of magnitude smaller. Large ontologies have been created for specific, well-funded research areas such as biomedical science, e.g. the Gene Ontology²⁸ and SNOMED,²⁹ but again with a huge investment of labor. They are not without their problems [Smith et al. 2003], and have to be continually updated. Projects in "ontology learning" have been tried but have so far achieved rather poor performance [Buitelaar 2005].

Could Wikipedia, with its abundance of free, up-to-the-minute contributions, high visibility and remarkable consensus, be used to bypass these laborious ontology-creation methods? Section 2.3.5 mentioned ways in which it may already be seen in this light: its articles are basic concepts, both general concepts and named entities, arranged in some kind of hierarchy via the category structure, and further organisable via a wealth of other relations that may be mined from Wikipedia's structure. There is a vast quantity of "domain-ontology" facts in structured and semi-structured form.
On the downside, however, as noted in Section 2.2.6, Wikipedia's category system seems currently incapable of supporting principled knowledge inheritance, on pain of, for instance, inferring isA(Domestic Pig, Pork). Finally, Wikipedia provides no means to perform inferences over its various structures.

This section, like Section 5, is organized around the different kinds of features that researchers seek to mine from Wikipedia. However, because the task is now ontology-building, we consider a somewhat different list, namely: knowledge organization, named entities, synonymy relations and other thesaurus-type information, ontology alignments and finally full-blown facts. This research area may alternatively be broken down into projects that seek to augment already existing ontologies or knowledge bases, including Wikipedia itself, and those that build brand new resources, and we will see both kinds.

²⁷ http://www.ontologyportal.org
²⁸ http://www.geneontology.org
²⁹ http://www.snowmed.org

6.2 Knowledge Organization

Halavais and Lackaff [2008] assess the overall breadth and comprehensiveness of Wikipedia's coverage of all knowledge. They ask whether the particular enthusiasms of volunteer editors produce excessive coverage of certain topics by comparing topic distribution in Wikipedia with that in Books In Print, and with a range of printed scholarly encyclopedias. They measure this using a Library of Congress categorization of 300 randomly-chosen articles and find Wikipedia's coverage remarkably representative, except for law and medicine.

Muchnik et al. [2007] recommend automatic generation of knowledge hierarchies. They develop five algorithms for organizing Wikipedia articles into a hierarchy, which they evaluate against Wikipedia's category hierarchy. They note that although the matches are not exact, the category hierarchy itself leaves much to be desired; it would be fruitful to evaluate both against human benchmarks.
6.3 Named Entities

Turning now to named entities, Section 3.2.2 described detailed methods for disambiguating named entity terms by linking them to Wikipedia articles; Section 4.4 covered named entity ranking for question answering; and Section 5.3 looked at ways of recognizing named entities in Wikipedia itself. Here it is worth highlighting Wikipedia's natural and straightforward role as indexer of named entities. Regarding Wikipedia article URLs as URIs solves one of the most significant problems facing the semantic web: it is easy to create an XML/RDF namespace that names an entity, but difficult to publicize this URI, get anyone else to use it, or coordinate with other possible definitions of namespaces to represent the same things [Legg 2007]. Many authors have noted that Wikipedia, by contrast, enjoys all the broad acceptance and availability that semantic web proponents originally hoped for (e.g. Hepp et al. [2006], Bhole et al. [2007], McCool [2006]). However, using named entity URIs for semantic web purposes arguably awaits the arrival of URIs for further crucial features of human language, such as general terms (e.g. tree) and predicates (e.g. cut down).

6.4 Thesaurus Information

Section 3 discussed mining Wikipedia for "thesaurus-style information", namely semantic relatedness measures (Section 3.1) and word sense disambiguation (Section 3.2). Here we specifically discuss the use of Wikipedia to generate large-scale, independent, general and systematic thesauri. There is a natural bridge from this task to full-blown ontology-building, for once a system of terms is interconnected via links representing general semantic relatedness, these links may then be upgraded, or "typed", to more specific ontological relations.

Gregorowicz and Kramer [2006] seek to construct a comprehensive term-concept map that will solve "the problem of variable terminology" and facilitate concept-based information retrieval by resolving synonyms in a systematic way.
They use all Wikipedia articles as concepts, and establish synonyms via redirects and homonyms via disambiguation pages. The result is 2M concepts linked to 3M terms: a vast and impressive resource compared to WordNet's 15,00 synsets created from 150,000 words. Likewise, Nakayama et al. [2007, 2007, 2008] describe a project to build a large general-purpose thesaurus solely from Wikipedia's hyperlink structure, obtaining a thesaurus of 1.3M concepts with a measured strength of relatedness between each pair. They then suggest upgrading the thesaurus to a full-blown ontology by typing the generic relatedness measures between concepts into more traditional ontological relations such as isA and partOf. Details of how this will be done are sketchy.

The idea of link typing is developed in greater detail in [Krötzsch et al. 2005, 2007] and [Völkel et al. 2006]. Unlike Nakayama et al., however, they plan to apply it to Wikipedia's own hyperlink structure. They note the profusion of links between articles, all indicating some form of semantic relatedness, and claim that categorizing them would be a simple, unintrusive way of rendering large parts of Wikipedia machine-readable. For instance, the existing hyperlink from Leaf to Plant would be labeled partOf, that from Leaf to Organ labeled kindOf, and so on. Categorizing all hyperlinks would be a significant task, and they recommend introducing a system of link types and encouraging Wikipedia editors to start using them and to suggest further types.

This raises interesting usability issues. Given that ontology is specialist knowledge (at least as traditionally practiced by ontological engineers), it might be argued that disaster could result if every Wikipedian were allowed to apply it in accord with Wikipedia's uniquely democratic editing model. On the other hand, one might ask why this is any different from other specialist additions to Wikipedia (e.g.
cell biology, diesel locomotive engineering, Scottish jazz musicians), whose contributors show a remarkable ability to self-select, yielding surprising and impressive quality control. Perhaps the trickiest characteristic of ontology is that, unlike specialist topics such as cell biology, people think they are experts in it when in fact they are not. At any rate, this research is essentially a proposal for Wikipedia's developers to add further functionality, and its results cannot yet be evaluated.

Like Krötzsch et al., Wu and Weld [2007, 2008] seek to augment Wikipedia itself. Their aim is to help kick-start the semantic web by marking up Wikipedia semantically in order to create enough structured data to make it worthwhile for developers to produce applications for it. To do this they propose a combination of automated and human processes. They investigate the use of machine learning techniques for completing infoboxes by extracting data from article text, constructing new infoboxes from templates where appropriate, rationalizing tags, merging replicated data using microformats, disambiguating links, adding additional links, and flagging items for verification, correction, or the addition of missing information. As with Krötzsch et al., it will be interesting to see whether Wikipedia editors will be eager to work on the collaborative side of this project, and also how effective they are. Furthermore, it is worth asking: even if these projects' aims were achieved and Wikipedia became a complete machine-readable knowledge base, would this bring about the semantic web? How exactly would its existence render the rest of the web machine-readable?

Publications from EMLR that were discussed in detail in Section 5.2 may also be viewed under this heading of link-typing for ontology-building. We saw that these authors focused initially on Wikipedia's category network, aiming to discriminate between isA and notIsA links [Ponzetto and Strube 2007].
They then further discriminated between two kinds of isA: class-instance and subclass relationships [Zirn et al. 2008]. Unlike Krötzsch et al., and Wu and Weld, they seek to accomplish this task entirely automatically by deducing such relations from an analysis of the titles of interlinked categories.

How do their results measure up as an ontology? They claim to derive 105,000 isA links, roughly one for each Wikipedia category. Evaluation of Zirn et al.'s results against the entirely manually created ResearchCyc yielded an accuracy of around 83%, which is impressive. However, though large and comparable with Cyc, this is still much smaller than the 2M concepts in Wikipedia's articles. Also, as a mere isA taxonomy it constitutes a relatively inexpressive frame-system-level ontology, lacking in any further relations that might define the concepts in the hierarchy. Finally, though it has been released as a giant set of RDF triples, no ready means to perform inferencing over it seems yet available.

Section 5.2 also described how the same research group turned in later work to parsing category titles and using them to derive new (typed) relations between Wikipedia articles [Nastase and Strube 2008]. Because this work qualifies as mining "facts" for ontology-building purposes, it is discussed in Section 6.6.

6.5 Ontology Alignment

Finding categories in different ontologies that in some sense "mean the same" can be a useful exercise in itself. If the resources are in the same language, string-matching on category titles goes a long way but is insufficient: homonyms in the mappings must be detected and eliminated. This task thus overlaps greatly with the word sense disambiguation problem discussed in Section 3.2. The problem cuts both ways: there may be one-to-many string matches from a concept in either of the mapped ontologies to concepts in the other. WordNet is a popular choice of ontology for alignment projects because it is simple and fairly large (frame-system level).
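To make the one-to-many matching problem of this section concrete, the following is a minimal Python sketch of title-based alignment between two resources. The toy data, the function names `normalize` and `align_by_title`, and the convention of stripping a trailing parenthetical qualifier are illustrative assumptions, not taken from any of the systems discussed here. The sketch only shows why exact string matching produces ambiguous candidates that a later disambiguation step has to resolve.

```python
import re
from collections import defaultdict


def normalize(title):
    """Lower-case a title and drop a trailing parenthetical qualifier,
    e.g. "Python (programming language)" -> "python"."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", title).strip().lower()


def align_by_title(source_titles, target_titles):
    """Map each source concept to every target concept whose normalized
    title is identical. More than one hit signals a homonym."""
    index = defaultdict(list)
    for target in target_titles:
        index[normalize(target)].append(target)
    return {source: sorted(index.get(normalize(source), []))
            for source in source_titles}


# Hypothetical toy data: terms from one ontology and some article titles.
ontology_terms = ["Python", "Leaf", "Diesel locomotive"]
article_titles = [
    "Python (programming language)",
    "Python (genus)",
    "Leaf",
]

candidates = align_by_title(ontology_terms, article_titles)
# Terms with several candidates need disambiguation; terms with none
# cannot be aligned by string matching at all.
ambiguous = {term for term, hits in candidates.items() if len(hits) > 1}
unmatched = {term for term, hits in candidates.items() if not hits}
```

In this toy run, one term maps cleanly, one has two candidate articles, and one has no match, which mirrors the three situations an alignment project has to handle.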
Thus, as was described in Section 3.2.3, Ruiz-Casado et al. [2005] align Wikipedia articles with WordNet synsets, building a large general resource that marks up synsets with article URIs and bags of words from article text. However, other than the mapping itself this project adds no ontological value to WordNet, particularly since Wikipedia entries whose title string does not already appear in a synset were discarded. The authors' later work (described in Section 5.1) has shifted to extracting semantic relationships.

Suchanek et al. [2007, forthcoming] also align WordNet and Wikipedia. However, discussion is deferred to Section 6.6 because they add many other relations as well.

Medelyan and Legg [2008] map 50,000 Wikipedia articles to equivalent categories in ResearchCyc. Their ultimate aim is to create a resource combining Cyc's principled ontological structure with Wikipedia's messier but much more abundant information. Instead of selecting one resource as a base, they merely produce a list of pairs of equivalent concepts in both resources. They use methods described in Section 3.2.3 to determine genuine semantic similarity, following earlier work aligning a domain-specific thesaurus (Agrovoc) with Wikipedia [Medelyan and Milne 2008]. For each Cyc term, its surrounding ontology is used to gather a context for disambiguation, using the taxonomic relations #$genls, #$isa and some specific relations like #$countryOfCity and #$conceptuallyRelated. Then the most common Wikipedia article for each context term is identified and compared with all candidates for a mapping. A further test, reverse disambiguation, is applied when several Cyc terms map to the same Wikipedia article. First, mappings that score less than 30% of the highest score are eliminated. Then a common-sense test is applied to the remainder based on Cyc's ontological knowledge regarding disjointness between classes.
If the best scoring Cyc term does not intersect with the second best one (that is, it represents "a different kind of thing"), the latter is eliminated; otherwise both mappings are accepted. An evaluation on 10,000 manually mapped terms provided by the Cyc Foundation, as well as a study with six human subjects, shows that performance of the mapping algorithm compares with the efforts of humans.

6.6 Facts

Now we turn to mining Wikipedia for what might be called full-blown facts, for the purpose of ontology building. This category is blurred by the difficulty of defining what exactly constitutes a fact; for example, the typing of links in Section 6.4 in some sense already qualifies. However, here we focus on projects that find and store entirely new literals, RDF triples and similar propositionally-structured entities. Sections 4 and 5 have covered much of this work; here we consider to what extent it has resulted in large-scale re-usable knowledge resources.

First we consider those who use Wikipedia to add facts to existing ontologies. We saw in Section 5.2 that Suchanek et al. [2007; forthcoming] use information extraction methods to create an ontology named YAGO (see footnote 30) that unifies WordNet and Wikipedia. This contains 1M concepts and 5M facts about them, an impressive quantity. Table 5 breaks down the number of different types of fact. The concepts are all WordNet synsets, Wikipedia leaf categories and all Wikipedia articles whose titles are not listed as common names in WordNet. This neatly bypasses the poor ontological quality of Wikipedia's category structure, WordNet's taxonomy being manually generated and far cleaner. It also avoids Ruiz-Casado et al.'s problem of omitting Wikipedia concepts whose titles do not appear in WordNet, although it still misses all proper names whose titles coincide with a WordNet entry, e.g. the programming language Python and the movie The Birds.
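The concept-selection rule just described, and the omission it causes, can be illustrated with a toy Python sketch. Everything in it is a hypothetical stand-in: the sets are invented miniature data, articles are identified by bare title strings, and `select_concepts` is an illustrative name rather than part of YAGO.

```python
def select_concepts(wordnet_synsets, wordnet_common_names,
                    leaf_categories, article_titles):
    """Build the concept set as described in the text: every WordNet
    synset, every Wikipedia leaf category, and every article whose
    title is not already listed as a WordNet common name."""
    kept_articles = {title for title in article_titles
                     if title.lower() not in wordnet_common_names}
    return set(wordnet_synsets) | set(leaf_categories) | kept_articles


# Invented miniature data (an assumption, for illustration only).
wordnet_synsets = {"python.n.01", "leaf.n.01"}    # synset identifiers
wordnet_common_names = {"python", "leaf"}         # words WordNet lists
leaf_categories = {"Category:Scottish jazz musicians"}
article_titles = {"Python", "Helen Clark"}        # "Python": the language

concepts = select_concepts(wordnet_synsets, wordnet_common_names,
                           leaf_categories, article_titles)
```

Under these assumptions the article on the programming language is dropped because its title collides with a WordNet common noun, while the named entity with no WordNet counterpart is kept. This is the gap noted above.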
In this way a graph-structured hierarchy of concepts is established, then embellished with facts harvested by a sophisticated suite of heuristics, many obtained by hand-picking popular patterns in the titles of Wikipedia categories and assigning relevant facts to all the instances of those categories.

Footnote 30: http://www.mpi-inf.mpg.de/~suchanek/downloads/yago/

Relation        Domain         Range     Number of facts
subClassOf      class          class     143,210
type            entity         class     1,901,130
context         entity         entity    40,000,000
describes       word           entity    986,628
bornInYear      person         year      188,128
diedInYear      person         year      92,607
establishedIn   entity         year      13,619
locatedIn       object         region    59,716
writtenInYear   book           year      9,670
politicianOf    organization   person    3,599
hasWonPrize     person         prize     1,016
means           word           entity    1,598,684
familyNameOf    word           person    223,194
givenNameOf     word           person    217,132

Table 5. Size of YAGO (facts).

From an ontology-building perspective, these sophisticated automated methods are a real step forward, though only a tiny subset of category names has been parsed. For instance they do not address widespread patterns such as "X by Y" (e.g. Persons by continent, Persons by company, Persons by nationality and so on), which was analyzed by the EMLR group (Section 5.2).

YAGO has many features one seeks in a formal ontology. Its authors have defined a logic-based representation language and a basic data model of entities and binary relations, with a small extension to represent relations between facts (such as transitivity). This gives it formal rigor (the authors even provide a model-theoretic semantics) and the expressive power of a rich version of Description Logic. In terms of inferential tractability it compares favorably with the hand-crafted Cyc. A SPARQL interface (available online) allows queries of traditional knowledge-base logical complexity. For instance, when asked for billionaires born in the USA it came up with two (though it missed Bill Gates: the project's methods do not capture all of Wikipedia's structured data).
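The kind of conjunctive query mentioned above can be sketched over a small in-memory set of triples. This is a hedged Python illustration rather than YAGO's interface. The relation names `bornIn`, `type` and `locatedIn` as used here, the entities, and both helper functions are invented for the example, and a real system would answer such a query through its SPARQL endpoint.

```python
# Hypothetical facts as (subject, relation, object) triples. The relation
# names and entities are invented for illustration, not YAGO's vocabulary.
facts = {
    ("Alice Example", "bornIn", "Springfield"),
    ("Alice Example", "type", "billionaire"),
    ("Bob Example", "bornIn", "Springfield"),
    ("Carol Example", "type", "billionaire"),
    ("Carol Example", "bornIn", "Hamilton"),
    ("Springfield", "locatedIn", "USA"),
    ("Hamilton", "locatedIn", "New Zealand"),
}


def subjects(relation, obj):
    """All subjects s such that (s, relation, obj) is a known fact."""
    return {s for (s, r, o) in facts if r == relation and o == obj}


def billionaires_born_in(country):
    """Conjunctive query with three patterns:
    ?x type billionaire, ?x bornIn ?place, ?place locatedIn country."""
    places = subjects("locatedIn", country)
    born_there = {s for (s, r, o) in facts
                  if r == "bornIn" and o in places}
    return subjects("type", "billionaire") & born_there


usa_result = billionaires_born_in("USA")
```

A person whose birthplace fact was never extracted simply does not appear in the result, which is how an incomplete fact base can miss an expected answer.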
The authors plan to integrate their project with the latest version of OWL (released in 2007). They claim to have already noticed a positive feedback loop whereby as more facts are added, word senses can be disambiguated more effectively in order to correctly identify and enter further facts. Such a feedback loop was a long-standing ambition of AI researchers (e.g. Lenat [1995]), though claims that it was about to be achieved often turned out to be premature.

Dataset              Triples   Description
Page links           62 M      Internal links between DBpedia instances, derived from the internal page links between Wikipedia articles
Infoboxes            15.5 M    Data attributes for concepts that have been extracted from Wikipedia infoboxes
Articles             7.6 M     Descriptions of all 1.95 million concepts within the English Wikipedia. Includes titles, short abstracts, thumbnails and links to the corresponding articles
Languages            5.7 M     Additional titles, short abstracts and Wikipedia article links in 13 other languages
Article categories   5.2 M     Links from concepts to categories using SKOS
Extended abstracts   2.1 M     Additional, extended English abstracts
Language abstracts   1.9 M     Extended abstracts in 13 languages
Type information     1.9 M     Inferred from category structure and redirects by the YAGO ("yet another great ontology") project [Suchanek et al. 2007]
External links       1.6 M     Links to external web pages about a concept
Categories           1 M       Information on which concepts are categories and how categories are related
Persons              0.5 M     Information about 80,000 persons (date and place of birth etc.) represented using the FOAF vocabulary
External links       180 K     Links between DBpedia and Geonames, US Census, Musicbrainz, Project Gutenberg, the DBLP bibliography and the RDF Book Mashup

Table 6. Content of DBpedia [Auer et al. 2007].

By contrast, the flourishing and ambitious DBpedia project [Auer et al. 2007; Auer and Lehmann 2007] attempts to create an entirely new ontology by harvesting facts from Wikipedia. The facts are stored as a vast set of RDF triples.
As noted in Section 5.2, this project strives to make all Wikipedia's structured information freely available in database form. Of all projects, it takes the most purely automated approach and gathers the largest quantity of structured data. The focus is on formatting patterns in the text of Wikipedia articles, notably infoboxes, though categorization and other links are also harvested. A staggering 103M "facts" (triples) are obtained. Like YAGO, the dataset can be queried via SPARQL and Linked Data, and connects with other open datasets on the web. Table 6 summarizes its content. The project has already been influential: for instance, to test their document classification algorithm Janik and Kochut [2007] use slightly modified methods from DBpedia to create an RDF ontology from Wikipedia (Section 4.5).

From a general ontology-building perspective, however, it has some weaknesses. First, there is little or no connection between the facts, and the knowledge is not organized into a hierarchy that enables inheritance (although, of course, as a giant database, state-of-the-art processing techniques can be brought to bear). Unlike YAGO it has no formally defined ontology language, and thus it would seem that many semantic relations amongst its triples will go unrecognized (e.g., that the first argument of the predicate artistOf might bear a relationship to the collection Artists). Second, although a formal evaluation of the resource's quality is not provided, a quick manual inspection reveals that large sections of the data have limited ontological value. For instance, 60% of the RDF triples are internal links derived from Wikipedia's link structure; only 15% are taken directly from infoboxes, and of those, the most common relation (over 10%) is the formatting relation wikiPageUsesTemplate. Amongst the properly ontological relations are many obvious redundancies not identified as such, e.g. placeOfBirth and birthPlace, dateOfBirth and birthDate.
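The proportions quoted above can be checked against the triple counts in Table 6. The short Python sketch below does the arithmetic. The counts are copied from Table 6 in millions (assuming millions for the rows whose unit is missing in this copy), the 103M total is the figure stated in the text, and the dictionary keys and variable names are illustrative.

```python
# Triple counts from Table 6, in millions. The keys are illustrative
# labels; the total is the figure stated in the text.
table6_millions = {
    "page links": 62.0,
    "infoboxes": 15.5,
    "articles": 7.6,
    "languages": 5.7,
    "article categories": 5.2,
    "extended abstracts": 2.1,
    "language abstracts": 1.9,
    "type information": 1.9,
    "external links": 1.6,
    "categories": 1.0,
    "persons": 0.5,
    "links to external datasets": 0.18,
}
stated_total = 103.0

# Share of each dataset relative to the stated total.
shares = {name: count / stated_total
          for name, count in table6_millions.items()}

page_link_pct = round(100 * shares["page links"])
infobox_pct = round(100 * shares["infoboxes"])
row_sum = sum(table6_millions.values())
```

Page links come to about 60% and infobox attributes to about 15% of the stated total, in line with the figures in the text. The rows themselves sum to a little over 105M rather than 103M; the text does not explain this small discrepancy.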
Finally, some individual relations contain poor-quality infobox data, for instance keyPeople assertions of the form "CEO" or "Bob".

We come at last to the final phase of EMLR's project [Nastase and Strube 2008]. We saw in Section 5.2 that this work consisted in parsing category titles, analyzing patterns in them and using that information to derive new relations between articles. They manage a deeper analysis of category titles than YAGO; in particular, they manage to crack open the extensive X by Y pattern and derive entirely implicit relations, as we saw above. In this way they add a wealth of new ontological information to their existing taxonomy of 105,000 categories: 9M new facts, about twice the size of YAGO. The facts include 3.4 million isA and 3.2 million spatial relations, along with 43,000 memberOf relations and 44,000 other specific relations such as causedBy and writtenBy. The authors promise to release a new ontology containing these facts soon. It will be interesting to see whether they define a formally specified ontology language, as with YAGO (and if so how expressive it is), or merely dump out the data as with DBpedia (in which case the tools available for inferencing, and the complexity of supported queries, become paramount). Table 7 shows the size of the larger ontologies.

How much nearer does this work bring us to the semantic web? Great progress has been made on named entities (such as "Helen Clark"), for all that is needed to establish shared meaning for a named entity is a shared URI. General concepts (such as "tree") are more tricky. There is certainly a wealth of semantic information regarding such concepts in Wikipedia, but an almost total lack of consensus on how to extract and analyze it, let alone inference over it. Yet for the semantic web, this was the whole point.

7. PEOPLE, PLACES AND RESOURCES

The research described here is scattered across the globe; Figure 16 shows prominent countries and institutions.
The US and Germany are the largest contributors. The US research spreads across many institutions. The University of North Texas, who work with entity recognition and disambiguation, produced the Wikify system. In the Pacific Northwest, Microsoft Research focuses on named entity recognition, while the University of Washington extracts semantic relations from Wikipedia's infoboxes. German research is more localized geographically. EML Research Institute works on relation extraction, semantic relatedness, and co-reference resolution; Darmstadt University of Technology on semantic relatedness and analyzing Wikipedia's structure. The Max-Planck Institute produced the YAGO ontology; they collaborate with the University of Leipzig, who produced DBpedia. The University of Karlsruhe have focused on providing users with tools to add formal semantics to Wikipedia.

Ontology                  Entities   Facts
Manually created
  SUMO                    20,000     60,000
  WordNet                 117,597    207,016
  OpenCyc                 47,000     306,000
  ResearchCyc             250,000    2,200,000
Automatically derived
  YAGO                    1M         5M
  DBpedia                 N/A        103M
  EMLR [2008]             105,000    9M

Table 7. Size of ontologies (adapted from Suchanek et al. [2007]).

Spain is Europe's next largest contributor. Universidad Autonoma de Madrid extract semantic relations from Wikipedia; Universidad Politecnica de Valencia and Universidad de Alicante both use it to answer questions and recognize named entities. The Netherlands, France, and UK are each represented by a single institution. The University of Amsterdam focuses on question answering; INRIA works primarily on entity ranking, and Imperial College on recognizing and disambiguating geographical locations.

The Israel Institute of Technology have produced widely cited work on semantic relatedness, document representation and categorization. They developed the popular technique of Explicit Semantic Analysis. Hewlett Packard's branch in Bangalore puts India on the map with document categorization research.
In China, Shanghai Jiaotong University works on relation extraction and category recommendation. In Japan, the University of Osaka has produced several open source resources, including a thesaurus and a bilingual (Japanese-English) dictionary. The University of Tokyo, in conjunction with the National Institute of Advanced Industrial Science and Technology, have focused on relation extraction.

Australia (5): RMIT University
New Zealand (8): Waikato University
Japan (10): Osaka University; U. of Tokyo & AIST
Austria (2): U. of Innsbruck
China (5): Shanghai Jiaotong U.
Germany (20): EML, Heidelberg; Darmstadt U. of Technology; Max-Planck I., Saarbrücken; University of Leipzig; University of Karlsruhe
India (3): H.P. Bangalore
Israel (5): Israel I. of Tech.
Italy (2)
Spain (9): U. Autonoma de Madrid; U. Politecnica de Valencia; U. of Alicante
Netherlands (5): U. of Amsterdam
United States (21): U. of North Texas; U. of Washington; Microsoft Research
United Kingdom (4): Imp. College, London
France (5): INRIA, Rocquencourt

Figure 16. Countries and institutions with significant research on mining meaning from Wikipedia.

New Zealand and Australia are each represented by a single institution. Research at the University of Waikato covers entity recognition, query expansion, topic indexing, semantic relatedness and augmenting existing knowledge bases. RMIT in Melbourne have collaborated with INRIA's work on entity ranking.

Table 8 summarizes tools and resources, along with brief descriptions and URLs. The first part shows tools for accessing and processing Wikipedia. The second shows demos of Wikipedia mining applications. The third lists datasets that have been generated from Wikipedia.

Processing tools

JWPL (Java Wikipedia Library)
  API for structural access to parts of Wikipedia such as redirects, categories, articles and link structure. [Zesch et al. 2008]
  http://www.ukp.tu-darmstadt.de/software/jwpl/

WikiRelate!
  API for computing semantic relatedness using Wikipedia. [Strube and Ponzetto 2006; Ponzetto and Strube 2006]
  http://www.eml-research.de/english/research/nlp/download/wikipediasimilarity.php

Wikipedia Miner
  API that provides simplified access to Wikipedia and models its structure semantically. [Milne et al. 2008]
  http://sourceforge.net/projects/wikipedia-miner/

WikiPrep
  A Perl tool for preprocessing Wikipedia XML dumps. [Gabrilovich and Markovitch 2007]
  http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/

W.H.A.T. (Wikipedia Hybrid Analysis Tool)
  An analytic tool for Wikipedia with two main functionalities: an article network and extensive statistics. It contains a visualization of the article networks and a powerful interface to analyze the behavior of authors.
  http://sourceforge.net/projects/w-h-a-t/

Wikipedia mining demos

DBpedia Online Access
  Online access to DBpedia data (103M facts extracted from Wikipedia) via a SPARQL query endpoint and as Linked Data. [Auer et al. 2007]
  http://wiki.dbpedia.org/OnlineAccess

YAGO
  Demo of the YAGO ontology, containing 1.7M entities and 14M facts. [Suchanek et al. 2007]
  http://www.mpii.mpg.de/~suchanek/yago

QuALiM
  A question answering system. Given a question in natural language, it returns relevant passages from Wikipedia. [Kaiser 2008]
  http://demos.inf.ed.ac.uk:8080/qualim/

Koru
  A demo of a search interface that maps topics involved in both queries and documents to Wikipedia articles. Supports automatic and interactive query expansion. [Milne et al. 2007]
  http://www.nzdl.org/koru

Wikipedia Thesaurus
  A large-scale association thesaurus containing 78 million associations. [Nakayama et al. 2007 and 2008]
  http://wikipedia-lab.org:8080/WikipediaThesaurusV2/

Wikipedia English-Japanese dictionary
  A dictionary returning translations from English into Japanese and vice versa, enriched with probabilities of these translations. [Erdmann et al. 2007]
  http://wikipedia-lab.org:8080/WikipediaBilingualDictionary/

Wikify
  Automatically annotates any text with links to Wikipedia articles. [Mihalcea and Csomai 2007]
  http://wikifyer.com/

Wikifier
  Automatically annotates any text with links to Wikipedia articles describing named entities.
  http://wikifier.labs.exalead.com/

Location query server
  Location data accessible via REST requests returning data in a SOAP envelope. Two requests are supported: a bounding box or a Wikipedia article. The reply is the number of references made to locations within that bounding box, and a list of Wikipedia articles describing those locations; or none, if the request is not a location. [Overell and Rüger 2006 and 2007]
  http://www.doc.ic.ac.uk/~seo01/wiki/demos

Datasets

DBpedia
  Facts extracted from Wikipedia infoboxes and link structure in RDF format. [Auer et al. 2007]
  http://wiki.dbpedia.org

Wikipedia Taxonomy
  Taxonomy automatically generated from the network of categories in Wikipedia (RDF Schema format). [Ponzetto and Strube 2007; Zirn et al. 2008]
  http://www.eml-research.de/english/research/nlp/download/wikitaxonomy.php

Semantic Wikipedia
  A snapshot of Wikipedia automatically annotated with named entity tags. [Zaragoza et al. 2007]
  http://www.yr-bcn.es/semanticWikipedia

Cyc to Wikipedia mappings
  50,000 automatically created mappings from Cyc terms to Wikipedia articles. [Medelyan and Legg 2008]
  http://www.cs.waikato.ac.nz/~olena/cyc.html

Topic indexed documents
  A set of 20 Computer Science technical reports indexed with Wikipedia articles as topics. 15 teams of 2 senior CS undergraduates have independently assigned topics from Wikipedia to each article. [Medelyan et al. 2008]
  http://www.cs.waikato.ac.nz/~olena/wikipedia.html

Locations in Wikipedia, ground truth
  A manually annotated sample of 100 Wikipedia articles. Each link in each article is annotated as to whether it is a location or not. If it is, the annotation contains the corresponding unique id from the TGN gazetteer. [Overell and Rüger 2006 and 2007]
  http://www.doc.ic.ac.uk/~seo01/wiki/data_release

Table 8. Wikipedia tools and resources.

8. SUMMARY

A whole host of researchers have been quick to grasp the potential of Wikipedia as a resource for mining meaning: the literature is large and growing rapidly. We began this article by describing Wikipedia's creation process and structure (Section 2). The unique open editing philosophy, which accounts for its success, is subversive. Although regarded as suspect by the academic establishment, it is a remarkable concrete realization of the American pragmatist philosopher Peirce's proposal that knowledge be defined through its public character and future usefulness rather than any prior justification. Wikipedia is not just an encyclopedia but can be viewed as anything from a corpus, taxonomy, thesaurus, or hierarchy of knowledge topics to a full-blown ontology. It includes explicit information about synonyms (redirects) and word senses (disambiguation pages), database-style information (infoboxes), semantic network information (hyperlinks), category information (category structure), discussion pages, and the full edit history of every article. Each of these sources of information can be mined in various ways.

Section 3 explains how Wikipedia is being drawn upon for natural language processing. Unlike WordNet, it was not created as a lexical resource that reflects the intricacies of human language. Instead, its primary goal is to provide encyclopedic knowledge across subjects and languages. However, the research described here demonstrates that it has, unexpectedly, immense potential as a repository of linguistic knowledge for natural language applications. In particular, its unique features allow well-defined tasks such as word sense disambiguation and word similarity to be addressed automatically, and the resulting level of performance is remarkably high.
Researchers on co-reference resolution and mining of multilingual information have only recently discovered Wikipedia; significant improvements in these areas can be expected shortly. To our knowledge, its use as a resource for other tasks such as natural language generation, machine translation and discourse analysis has not yet been explored. These areas are ripe for exploitation, and exciting discoveries can be expected.

Section 4 describes applications to information retrieval. Query expansion, document classification and topic indexing provide the best examples of applying Wikipedia for searching and organizing document collections. These areas can take advantage of its unique properties while grounding themselves in, and building upon, existing research. In particular, document classification has gathered momentum and significant advances are obtained over the state of the art. Question answering and entity ranking are less well addressed, because they do not seem to take full advantage of Wikipedia: with a few exceptions they simply treat it as just another corpus and thus differ little from previous work. We found little evidence of cross-pollination between this work and the information extraction efforts described in Section 5. Given how closely question answering and entity ranking depend on the extraction of facts and entities, we expect this to become a fruitful line of enquiry.

In Section 5 we turn to information extraction: mining text for topics, relations and facts. Unlike the tasks in Sections 3 and 4, information extraction is not easy to define. Different researchers focus on different kinds of information: we have reviewed research on extracting information about movie directors and soccer players, composers, corporate descriptions and hierarchical and ontological relations. Techniques range from those developed for standard text corpora to ones that utilize properties such as hyperlinks and category structure.
The extracted resources range in size from several hundred to several million relations, but the lack of a common basis for evaluation prevents us from drawing any conclusion as to which approach performs best.

Section 6 discusses the use of Wikipedia for ontology-building. Wikipedia's vast quantity of structured information provides low-hanging fruit for automating this process. Article names can serve as URIs for named entities; hyperlinks and redirects can be mined for large-scale thesauri; the category structure can be treated as encoding taxonomic information (though not always very well); and infoboxes are a rich source of domain knowledge. From the perspective of large-scale general ontology building, the two most impressive projects are YAGO and DBpedia. Which will turn out to be more useful, the large but messy and low-quality DBpedia, or the smaller but more rigorous and accurate YAGO? Meanwhile, EMLR's latest efforts (not yet released) promise to combine some of the rigor of YAGO with the greater size of DBpedia. We believe that an extrinsic evaluation would be most meaningful, and hope to see these systems compete on a well-defined task in an independent evaluation. It will also be interesting to see to what extent these resources are exploited by other research communities in the future.

Some authors have suggested using Wikipedia editors themselves to perform ontology-building, an enterprise that might be thought of as mining Wikipedia's people rather than its data. Perhaps they grasp the implications of the underlying driving force behind this massively successful resource better than the rest of us! Only time will tell whether the community is amenable to following such suggestions. The idea of moving to a more structured and ontologically principled Wikipedia raises an interesting question: how will it interact with the public, amateur-editor model? Does this signal the long-awaited emergence of the semantic web?
We suspect that, like the success of Wikipedia itself, the result will be something new, something that experts have not foreseen and may not condone. That is the glory of Wikipedia.

ACKNOWLEDGEMENTS

We warmly thank Evgeniy Gabrilovich, Rada Mihalcea, Dan Weld, Fabian Suchanek and the YAGO team for their valuable comments on a draft of this paper. Medelyan is supported by a scholarship from Google, Milne by a New Zealand Tertiary Education Commission Top Achiever Scholarship.

References

ADAFRE, S.F., JIJKOUN, V., AND M. DE RIJKE. [2007] Fact Discovery in Wikipedia. In Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence.
ADAFRE, S.F., AND M. DE RIJKE. [2006] Finding Similar Sentences across Multiple Languages in Wikipedia. In Proceedings of the EACL 2006 Workshop on New Text - Wikis and Blogs and Other Dynamic Text Sources.
ADAFRE, S.F., AND M. DE RIJKE. [2005] Discovering Missing Links in Wikipedia. In Proceedings of the LinkKDD 2005, August 21, 2005, Chicago, IL.
AGROVOC [1995] Multilingual agricultural thesaurus. Food and Agricultural Organization of the United Nations. http://www.fao.org/agrovoc/
AHN, D., JIJKOUN, V., MISHNE, G., MÜLLER, K., DE RIJKE, M., AND S. SCHLOBACH. [2004] Using Wikipedia at the TREC QA Track. In Proceedings of the 13th Text Retrieval Conference (TREC 2004).
ALLAN, J. [2005] HARD track overview in TREC 2005: High accuracy retrieval from documents. In Proceedings of the 14th Text Retrieval Conference (TREC 2005).
AUER, S., BIZER, C., LEHMANN, J., KOBILAROV, G., CYGANIAK, R., AND Z. IVES. [2007] DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC/ASWC2007), Busan, South Korea, 4825: 715-728, 2007.
AUER, S. AND J. LEHMANN. [2007] What have Innsbruck and Leipzig in common? Extracting Semantics from Wiki Content. In Franconi et al. (eds), Proceedings of European Semantic Web Conference (ESWC'07), LNCS 4519, p. 503-517, Springer, 2007.
BADER, F., CALVANESE, D., MCGUINES, D. AND D. NARDI. [207] The Description Logic Handbok: Theory, Implementation and Aplications. Cambridge: Cambridge University Pres. BAKER, L. [208] Profesor Bans Gogle & Wikipedia: Encourages Critical Thinking & Research. Search Engine Journal, January 14th, 208. BANERJE, S. [207] Boosting Inductive Transfer for Text Clasification Using Wikipedia. In Procedings of the 6th International Conference on Machine Learning and Aplications (ICMLA), p. 148?153. BANERJE, S., RAMANATHAN, K. AND A. GUPTA. [207] Clustering Short Texts using Wikipedia. In Procedings of the 30th Anual International ACM SIGIR conference on Research and Development in Information Retrieval. Amsterdam, Netherlands. p. 787?78. BANKO, M., CAFARELA, M. J., SODERLAND, S., BROADHEAD, M. AND O. ETZIONI. [207] Open information extraction from the Web. In Procedings of the 20th International Joint Conference on Artificial Inteligence IJCAI?07, p. 2670?2676, January 207. BHOLE, A., FORTUNA, B., GROBELNIK, B. AND . MLADENI?. [207] Extracting Named Entities and Relating Them over Time Based on Wikipedia. Informatica. BELOMI, F. AND R. BONATO. [205] Network Analysis for Wikipedia. In Procedings of the 1st International Wikimedia Conference, Wikimania 205. Wikimedia Foundation. BERNERS-LEE, T., HENDLER, J, AND O. LASSILA. [201]. The Semantic Web. Scientific American 284 (5), 34?43. BERNERS-LEE, T. [203]. Foreword. In D. Fensel, J. Hendler, H. Lieberman, and W. Wahlster (Eds.) Spining the Semantic Web: Bringing the World Wide Web to its Ful Potential. Cambridge, MA: MIT Pres. BRIN, S. AND L. PAGE. [198] The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, Vol. 3, p. 107?117. BROWN, P., DELA PIETRA, S., DELA PIETRA, V., AND R. MERCER. [193] The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2), 263?311. BUDANITSKY, A. AND HIRST, G. 
[201] Semantic distance in WordNet: An experimental, aplication-oriented evaluation of five measures. Workshop on WordNet and Other Lexical Resources, Second meting of the North American Chapter of the Asociation for Computational Linguistics, Pitsburgh, PA. BUITELAR, P., CIMIANO, P., MAGNII, B. (eds). [205] Ontology Learning from Text: Methods, Evaluation and Aplications. Amsterdam, The Netherlands: IOS Pres. BUNESCU, B. AND PA?CA, M. [206] Using Encyclopedic Knowledge for Named Entity Disambiguation. In Procedings of the1th Conference of the European Chapter of the Asociation for Computational Linguistics, p. 9?16. BUSCALDI, D. AND P. A. ROSO. [207] Comparison of Methods for the Automatic Identification of Locations in Wikipedia. In Procedings of the 4th ACM workshop on Geographical information retrieval, GIR?07. Lisbon, Portugal, p. 89?92. BUSCALDI, D. AND P. A. ROSO. [207] A Bag-of-Words Based Ranking Method for the Wikipedia Question Answering. Task Evaluation of Multilingual and Multi-modal Information Retrieval, p. 50?53. CAVNAR, W. B. AND J. M. TRENKLE. [194] N-Gram-Based Text Categorization. In Procedings of 3rd Anual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, p. 161-175. CHERNOV, S., IOFCIU, T., NEJDL, W. AND X. ZHOU. [206] Extracting Semantic Relationships betwen Wikipedia Categories. In Procedings of the 1st International Workshop: SemWiki?06?From Wiki to Semantics. Co-located with the 3rd Anual European Semantic eb Conference ESWC?06 in Budva, Montenegro, June 12, 206. CIMIANO, P. AND J. VOLKER. [205] Towards large-scale, open-domain and ontology-based named entity clasification. In Procedings of the Internatioal Conference on Recent Advances in Natural Language Procesing, RANLP?05, p. 16?172. INCOMA Ltd., Borovets, Bulgaria, September 205. CSOMAI, A. AND R. MIHALCEA. [207] Linking Educational Materials to Encyclopedic Knowledge. 
Frontiers in Artificial Intelligence and Applications, v.158, p. 557–559. IOS Press, Netherlands.
CUCERZAN, S. [2007] Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, p. 708–716, Prague, Czech Republic, June 2007.
CULOTTA, A., MCCALLUM, A. AND J. BETZ. [2006] Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics. New York, NY, p. 296–303.
DAKKA, W. AND S. CUCERZAN. [2008] Augmenting Wikipedia with Named Entity Tags. In Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP 2008), Hyderabad.
DENNING, P., HORNING, J., PARNAS, D., AND WEINSTEIN, L. [2005] Wikipedia Risks. In Communications of the ACM 48(12), p. 152–152.
DENOYER, L. AND GALLINARI, P. [2006] The Wikipedia XML corpus. SIGIR Forum, 40(1), p. 64–69, ACM Press.
DONDIO, P., BARRETT, S., WEBER, S., AND SEIGNEUR, J. [2006] Extracting Trust from Domain Analysis: A Case Study on the Wikipedia Project. Autonomous and Trusted Computing, p. 362–373.
DUMAIS, S., PLATT, J., HECKERMAN, D. AND M. SAHAMI. [1998] Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management, p. 148–155.
EDMONDS, P. AND KILGARRIFF, A. [2002] Introduction to the special issue on evaluating word sense disambiguation systems. Journal of Natural Language Engineering, 8(4), p. 279–291. Cambridge University Press, New York, NY, USA.
EMIGH, W. AND HERRING, S. [2005] Collaborative Authoring on the Web: A Genre Analysis of Online Encyclopedias. In Proceedings of the 38th Hawaii International Conference on System Sciences, p. 99a.
ERDMANN, M., NAKAYAMA, K., HARA, T., AND S. NISHIO.
[2008] An Approach for Extracting Bilingual Terminology from Wikipedia. In Proceedings of the 13th International Conference on Database Systems for Advanced Applications (DASFAA, to appear).
FELLBAUM, C. (editor). [1998] WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
FERRÁNDEZ, F., TORAL, A., FERRÁNDEZ, Ó., FERRÁNDEZ, A., AND R. MUÑOZ. [2007] Applying Wikipedia's Multilingual Knowledge to Cross-Lingual Question Answering. In Proceedings of the 12th International Conference on Applications of Natural Language to Information Systems, Paris, France, p. 352–363. June 2007.
FINKELSTEIN, L., GABRILOVICH, E., MATIAS, Y., RIVLIN, E., SOLAN, Z., WOLFMAN, G., AND E. RUPPIN. [2002] Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1), p. 116–131.
FRANK, E., PAYNTER, G. W., WITTEN, I. H., GUTWIN, C. AND C. G. NEVILL-MANNING. [1999] Domain-Specific Keyphrase Extraction. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, IJCAI'99, Stockholm, Sweden, p. 668–673.
GABRILOVICH, G. AND S. MARKOVITCH. [2007] Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, Hyderabad, India, January 2007, p. 1606–1611.
GABRILOVICH, G. AND MARKOVITCH, S. [2006] Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), p. 1301–1306, Boston, July 2006.
GILES, J. [2005] Internet Encyclopaedias Go Head to Head. In Nature 138(15), 14 December 2005.
GLEIM, R., MEHLER, A. AND M. DEHMER. [2007] Web Corpus Mining by Instance of Wikipedia. In Kilgarriff, Adam; Baroni, Marco (editors) Proceedings of the EACL 2006 Workshop on Web as Corpus, Trento, Italy, April 3–7, 2006, p. 67–74.
GREGOROWICZ, A. AND M. A. KRAMER. [2006] Mining a Large-Scale Term-Concept Network from Wikipedia.
Mitre Technical Report 06-1028, October 2006.
HALAVAIS, A. AND LACKAFF, D. [2008] An Analysis of Topical Coverage of Wikipedia. Journal of Computer-Mediated Communication, 13(2), p. 429–440.
HALLER, H., KRÖTZSCH, M., VÖLKEL, M., AND D. VRANDECIC. [2006] Semantic Wikipedia (software demo). In Proceedings of the 2006 International Symposium on Wikis, p. 137–138. ACM Press, August 2006.
HATCHER, E. AND O. GOSPODNETIC. [2004] Lucene in Action. Manning Publications, Greenwich, CT.
HAVELIWALA, T. H. [2003] Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge and Data Engineering, 15(4), p. 784–796.
HERBELOT, A. AND A. COPESTAKE. [2006] Acquiring Ontological Relationships from Wikipedia Using RMRS. In Proceedings of the International Semantic Web Conference 2006 Workshop on Web Content Mining with Human Language Technologies, Athens, GA.
HEPP, M., BACHLECHNER, D., AND K. SIORPAES. [2006] Harvesting Wiki Consensus: Using Wikipedia Entries as Ontology Elements. In Proceedings of the 1st International Workshop: SemWiki'06 - From Wiki to Semantics. Co-located with the 3rd Annual European Semantic Web Conference ESWC'06 in Budva, Montenegro, June 12, 2006.
HIGASHINAKA, R., DOHSAKA, K., AND H. ISOZAKI. [2007] Learning to Rank Definitions to Generate Quizzes for Interactive Information Presentation. In Companion Volume to the Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, p. 117–120.
HUANG, W.C., TROTMAN, A., AND S. GEVA. [2007] Collaborative Knowledge Management: Evaluation of Automated Link Discovery in the Wikipedia. In Proceedings of the Workshop on Focused Retrieval at SIGIR 2007, July 27, 2007, Amsterdam.
IDE, N. AND J. VÉRONIS (editors). [1998] Word Sense Disambiguation. Special issue of Computational Linguistics, 24(1).
JIANG, J. J. AND D. W. CONRATH. [1997] Semantic similarity based on corpus statistics and lexical taxonomy.
In Proceedings of the 10th International Conference on Research in Computational Linguistics, ROCLING'97. Taiwan.
JIJKOUN, V. AND M. DE RIJKE. [2006] Overview of the WiQA task at CLEF 2006. In C. Peters et al. (editors) Evaluation of Multilingual and Multi-modal Information Retrieval. 7th Workshop of the Cross-Language Evaluation Forum, CLEF 2006, Alicante, Spain, September 20–22, 2006, Revised Selected Papers, LNCS 4730, p. 265–274, September 2007.
JANIK, M. AND K. KOCHUT. [2007] Wikipedia in Action: Ontological Knowledge in Text Categorization. University of Georgia, Computer Science Department Technical Report no. UGA-CS-TR-07-001.
KAISSER, M. [2008] The QuALiM Question Answering Demo: Supplementing Answers with Paragraphs drawn from Wikipedia. In Proceedings of the ACL-08 HLT Demo Session, Columbus, Ohio, p. 32–35.
KASNECI, G., SUCHANEK, F.M., IFRIM, G., RAMANATH, M. AND G. WEIKUM. [2007] NAGA: Searching and Ranking Knowledge. In Proceedings of the 24th IEEE International Conference on Data Engineering, ICDE'08, Cancun, Mexico, 7–12 April 2008, p. 953–962.
KASSNER, L., NASTASE, V., AND M. STRUBE. [2008] Acquiring a Taxonomy from the German Wikipedia. To appear in Proceedings of LREC 2008.
KAZAMA, J. AND K. TORISAWA. [2007] Exploiting Wikipedia as External Knowledge for Named Entity Recognition. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, p. 698–707.
KINZLER, D. [2005] WikiSense: Mining the Wiki, v 1.1. In Proceedings of the 1st International Wikimedia Conference, Wikimania 2005. Wikimedia Foundation.
KITTUR, A., SUH, B., PENDLETON, B.A. AND CHI, E.H. [2007] He says, she says: Conflict and Coordination in Wikipedia. In CHI, p. 453–462.
KLEINBERG, J. [1998] Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46, p. 604–632.
KLAVANS, J. L. AND P. RESNIK. [1996] The balancing act: combining symbolic and statistical approaches to language. Cambridge, MA: MIT Press.
KRIZHANOVSKY, A.
[2006] Synonym Search in Wikipedia: Synarcher. In Proceedings of the 11th International Conference "Speech and Computer" SPECOM'06. Russia, St. Petersburg, June 25–29, 2006, p. 474–477.
KRÖTZSCH, M., VRANDECIC, D., VÖLKEL, M., HALLER, H., AND R. STUDER. [2007] Semantic Wikipedia. Journal of Web Semantics, 5, p. 251–261.
KRÖTZSCH, M., VRANDECIC, D. AND M. VÖLKEL. [2005] Wikipedia and the Semantic Web - The Missing Links. In Proceedings of the 1st International Wikimedia Conference, Wikimania 2005. Wikimedia Foundation.
LEACOCK, C., AND M. CHODOROW. [1998] Combining local context and WordNet similarity for word sense identification. In Fellbaum, C. (editor), WordNet: An Electronic Lexical Database. Chapter 11, p. 265–283. Cambridge, MA: MIT Press.
LEHTONEN, M. AND A. DOUCET. [2007] EXTIRP: Baseline Retrieval from Wikipedia. Comparative Evaluation of XML Information Retrieval Systems, p. 115–120.
LEGG, C. [2007] Ontologies on the Semantic Web. Annual Review of Information Science and Technology 41, p. 407–452.
LENAT, D. B. [1995] Cyc: A Large-Scale Investment in Knowledge Infrastructure. Communications of the ACM 38(11).
LIPSCOMB, C.E. [2000] Medical Subject Headings (MeSH). In Bulletin of the Medical Library Association 88(3), p. 265.
LI, B., CHEN, Q., YEUNG, D.S., NG, W. W. Y., WANG, X. [2007] Exploring Wikipedia and Query Logs Ability for Text Feature Representation. In Proceedings of the International Conference on Machine Learning and Cybernetics, Hong Kong, 19–22 August 2007, v. 6, p. 343–348.
LI, Y., LUK, R. W. P., HO, E. K. S., CHUNG, K. F. [2007] Improving weak ad-hoc queries using Wikipedia as external corpus. In Kraaij et al. (editors) Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07, Amsterdam, The Netherlands, July 23–27, 2007, p. 797–798. ACM Press.
LIH, A. [2004] Wikipedia as Participatory Journalism: Reliable Sources? Metrics for Evaluating Collaborative Media as a News Source.
In Proceedings of the 5th International Symposium on Online Journalism.
MAGNUS, P. D. [2006] Epistemology and the Wikipedia. In Proceedings of the North American Computing and Philosophy Conference, Troy, New York, August 2006.
MAYS, E., DAMERAU, F. J. AND R. L. MERCER. [1991] Context-based spelling correction. Information Processing and Management 27(5), p. 517–522.
MCCOOL, R. [2006] Rethinking the Semantic Web, Part 2. IEEE Internet Computing 10(1), p. 93–96.
MCGUINNESS, D. [2003] Ontologies Come of Age. In D. Fensel, et al. (editors) Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. Cambridge, MA: MIT Press.
MCGUINNESS, D. AND F. VAN HARMELEN. [2004] OWL Web Ontology Language: Overview. http://www.w3.org/TR/owl-features/
MEDELYAN, O. AND D. MILNE. [2008] Augmenting domain-specific thesauri with knowledge from Wikipedia. In Proceedings of the NZ Computer Science Research Student Conference, Christchurch, NZ.
MEDELYAN, O., WITTEN, I. H., AND D. MILNE. [2008] Topic Indexing with Wikipedia. To appear in Proceedings of the WIKI-AI: Wikipedia and AI Workshop at the AAAI'08 Conference, Chicago, US.
MEDELYAN, O. AND C. LEGG. [2008] Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense. To appear in Proceedings of the WIKI-AI: Wikipedia and AI Workshop at the AAAI'08 Conference, Chicago, US.
MIHALCEA, R. [2007] Using Wikipedia for Automatic Word Sense Disambiguation. In Proceedings of the Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, Rochester, New York, April 2007.
MIHALCEA, R. AND D. MOLDOVAN. [2001] Automatic generation of a coarse grained WordNet. In Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources. Pittsburgh, PA.
MIHALCEA, R. AND A. CSOMAI. [2007] Wikify! Linking Documents to Encyclopedic Knowledge. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, CIKM'07, Lisbon, Portugal, November 6–8, 2007, p. 233–241.
MILLER, E.
[1998] An Introduction to the Resource Description Framework. Bulletin of the American Society for Information Science 25(1), p. 15–19.
MILLER, G. A., AND W. G. CHARLES. [1991] Contextual correlates of semantic similarity. Language and Cognitive Processes 6(1), p. 1–28.
MILNE, D., MEDELYAN, O. AND I. H. WITTEN. [2006] Mining domain-specific thesauri from Wikipedia: A case study. In Proceedings of the International Conference on Web Intelligence (IEEE/WIC/ACM WI'2006), Hong Kong.
MILNE, D., WITTEN, I. H. AND D. M. NICHOLS. [2007] A Knowledge-Based Search Engine Powered by Wikipedia. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, CIKM'07, Lisbon, Portugal, November 6–8, 2007, p. 445–454.
MILNE, D. [2007] Computing Semantic Relatedness using Wikipedia Link Structure. In Proceedings of the New Zealand Computer Science Research Student Conference, NZ CSRSC'07, Hamilton, New Zealand.
MILNE, D. AND I. H. WITTEN. [2008] Learning to link with Wikipedia. Forthcoming.
MINIER, Z., ZALAN, B. AND L. CSATO. [2007] Wikipedia-Based Kernels for Text Categorization. In Proceedings of the International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC'07, IEEE Computer Society, Washington, DC, USA. p. 157–164.
MUCHNIK, L., ITZHACK, R., SOLOMON, S. AND Y. LOUZOUN. [2007] Self-emergence of Knowledge Trees: Extraction of the Wikipedia Hierarchies. In Physical Review E 76(1).
NAKAYAMA, K., HARA, T., AND S. NISHIO. [2007] Wikipedia: A New Frontier for AI Researches. Journal of the Japanese Society for Artificial Intelligence 22(5), p. 693–701.
NAKAYAMA, K., HARA, T., AND S. NISHIO. [2008] A Search Engine for Browsing the Wikipedia Thesaurus. In Proceedings of the 13th International Conference on Database Systems for Advanced Applications, Demo session (DASFAA'08), p. 690–693.
NAKAYAMA, K., ITO, M., HARA, T. AND S. NISHIO. [2008] Wikipedia Mining for Huge Scale Japanese Association Thesaurus Construction.
In Workshop Proceedings of the 22nd International Conference on Advanced Information Networking and Applications, AINA'08, GinoWan, Okinawa, Japan, March 25–28, 2008, p. 1150–1155. IEEE Computer Society.
NAKAYAMA, K., HARA, T., AND S. NISHIO. [2007] A Thesaurus Construction Method from Large Scale Web Dictionaries. In Proceedings of the 21st IEEE International Conference on Advanced Information Networking and Applications, AINA'07, May 21–23, 2007, Niagara Falls, Canada, p. 932–939. IEEE Computer Society.
NAKAYAMA, K., HARA, T., AND S. NISHIO. [2007] Wikipedia Mining for an Association Web Thesaurus Construction. In Proceedings of the 8th International Conference on Web Information Systems Engineering, WISE'07, Nancy, France, December 3–7, 2007, p. 322–334. Lecture Notes in Computer Science 4831, Springer.
NASTASE, V. AND M. STRUBE. [2008] Decoding Wikipedia Categories for Knowledge Acquisition. To appear in Proceedings of the AAAI'08 Conference, Chicago, US.
NELKEN, R. AND E. YAMANGIL. [2008] Mining Wikipedia's Article Revision History for Training Computational Linguistic Algorithms. In Proceedings of the WIKI-AI: Wikipedia and AI Workshop at the AAAI'08 Conference, Chicago, US.
NGUYEN, D. P. T., MATSUO, Y., AND M. ISHIZUKA. [2007] Relation Extraction from Wikipedia Using Subtree Mining. In Proceedings of the AAAI'07 Conference, p. 1414–1420, Vancouver, Canada, July 2007.
NGUYEN, D. P. T., MATSUO, Y., AND M. ISHIZUKA. [2007] Subtree Mining for Relation Extraction from Wikipedia. In Proceedings of the HLT-NAACL 2007, p. 125–128.
NGUYEN, D. P. T., MATSUO, Y., AND M. ISHIZUKA. [2007] Exploiting Syntactic and Semantic Information for Relation Extraction from Wikipedia. In Proceedings of the IJCAI Workshop on Text-Mining and Link-Analysis, TextLink'07.
OLLIVIER, Y. AND P. SENELLART. [2007] Finding Related Pages Using Green Measures: An Illustration with Wikipedia. In Proceedings of the AAAI'07 Conference, p. 1427–1433, Vancouver, Canada, July 2007.
OVERELL, S. E. AND S. RÜGER.
[2007] Geographic co-occurrence as a tool for GIR. In Proceedings of the 4th ACM Workshop on Geographical Information Retrieval. Lisbon, Portugal.
OVERELL, S. E. AND S. RÜGER. [2006] Identifying and grounding descriptions of places. In Proceedings of the 3rd ACM Workshop on Geographical Information Retrieval at SIGIR.
PEIRCE, C.S. [1877] The Fixation of Belief. Popular Science Monthly 12 (Nov. 1877), p. 1–15.
PEI, M., NAKAYAMA, K., HARA, T. AND NISHIO, S. [2008] Constructing a Global Ontology by Concept Mapping using Wikipedia Thesaurus. In Proceedings of the 22nd International Conference on Advanced Information Networking and Applications, AINA'08, GinoWan, Okinawa, Japan, March 25–28, 2008, p. 1205–1210. IEEE Computer Society.
PONZETTO, S. P. AND M. STRUBE. [2006] Exploiting Semantic Role Labeling, WordNet and Wikipedia for Coreference Resolution. In Proceedings of HLT-NAACL '06, p. 192–199.
PONZETTO, S. P. AND M. STRUBE. [2007a] Knowledge Derived from Wikipedia for Computing Semantic Relatedness. Journal of Artificial Intelligence Research 30, p. 181–212.
PONZETTO, S. P. AND M. STRUBE. [2007b] Deriving a Large Scale Taxonomy from Wikipedia. In Proceedings of AAAI '07, p. 1440–1445.
PONZETTO, S. P. AND M. STRUBE. [2007c] An API for Measuring the Relatedness of Words in Wikipedia. In Companion Volume of the Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic, 23–30 June, 2007, p. 49–52.
PONZETTO, S. P. [2007] Creating a knowledge base from a collaboratively generated encyclopedia. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics Doctoral Consortium, Rochester, NY, 22–27 April, 2007, p. 9–12.
POTTHAST, M., STEIN, B., AND M. A. ANDERKA. [2008] Wikipedia-Based Multilingual Retrieval Model. In Proceedings of the 30th European Conference on IR Research, ECIR'08, Glasgow.
POTTHAST, M.
[2007] Wikipedia in the pocket: indexing technology for near-duplicate detection and high similarity search. In Proceedings of the 30th International ACM SIGIR Conference on Research and Development in Information Retrieval.
QUINE, W.V.O. [1960] Word and Object. Cambridge, MA: MIT Press.
RANSDELL, J. [2003] The Relevance of Peircean Semiotic to Computational Intelligence Augmentation. SEED Journal (Semiotics, Evolution, Energy, and Development).
RESNIK, P. [1999] Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11, p. 95–130.
RUBENSTEIN, H., AND J. GOODENOUGH. [1965] Contextual correlates of synonymy. Communications of the ACM 8(10), p. 627–633.
RUIZ-CASADO, M., ALFONSECA, E., AND P. CASTELLS. [2005] Automatic assignment of Wikipedia Encyclopedic Entries to WordNet synsets. In Proceedings of AWIC'05.
RUIZ-CASADO, M., ALFONSECA, E., AND P. CASTELLS. [2007] Automatising the learning of lexical patterns: An application to the enrichment of WordNet by extracting semantic relationships from Wikipedia. Data and Knowledge Engineering 61(3), p. 484–499.
RUIZ-CASADO, M., ALFONSECA, E., AND P. CASTELLS. [2006] From Wikipedia to Semantic Relationships: a Semi-automated Annotation Approach. In Proceedings of the 1st International Workshop: SemWiki'06 - From Wiki to Semantics. Co-located with the 3rd Annual European Semantic Web Conference ESWC'06 in Budva, Montenegro, June 12, 2006.
RUIZ-CASADO, M., ALFONSECA, E., AND P. CASTELLS. [2005] Automatic Extraction of Semantic Relationships for WordNet by Means of Pattern Learning from Wikipedia. In Proceedings of the 10th International Conference on Applications of Natural Language to Information Systems, NLDB'05, p. 67–79, Alicante, Spain, June 15–17, 2005.
RUTHVEN, I. AND M. LALMAS. [2003] A survey on the use of relevance feedback for information access systems. Knowledge Engineering Review 18(2), p. 95–145.
SCHOENHOFEN, P.
[2006] Identifying Document Topics Using the Wikipedia Category Network. In Proceedings of the International Conference on Web Intelligence (IEEE/WIC/ACM WI'2006), Hong Kong.
SMITH, B., WILLIAMS, J., AND S. SCHULZE-KREMER. [2003] The Ontology of the Gene Ontology. In Proceedings of AMIA Symposium, p. 609–613.
SOON, W. M., NG, H. T., AND D. C. Y. LIM. [2001] A machine learning approach to coreference resolution of noun phrases. Computational Linguistics 27(4), p. 521–544.
SOWA, J. [2004] The Challenge of Knowledge Soup. http://www.jfsowa.com/pubs/challenge.pdf.
STVILIA, B., TWIDALE, M. B., GASSER, L., AND L. SMITH. [2005] Information Quality Discussions in Wikipedia. Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign. Technical Report ISRN UIUCLIS-2005/2+CSCW.
STRUBE, M. AND PONZETTO, S.P. [2006] WikiRelate! Computing Semantic Relatedness Using Wikipedia. In AAAI '06, p. 1419–1424.
SUCHANEK, F. M., KASNECI, G., AND G. WEIKUM. [2007] Yago: a core of semantic knowledge. In Proceedings of the 16th World Wide Web Conference, WWW'07. New York, NY: ACM Press.
SUCHANEK, F. M., KASNECI, G., AND G. WEIKUM. [forthcoming] Yago: A Large Ontology from Wikipedia and WordNet. Elsevier Journal of Web Semantics.
SUCHANEK, F. M., IFRIM, G., AND G. WEIKUM. [2006] Combining Linguistic and Statistical Analysis to Extract Relations from Web Documents. In Proceedings of the Knowledge Discovery and Data Mining Conference, KDD'06.
SUH, S., HALPIN, H., AND E. KLEIN. [2006] Extracting Common Sense Knowledge from Wikipedia. In Proceedings of the ISWC'06 Workshop on Web Content Mining with Human Language Technology.
SYED, Z., FININ, T., AND A. JOSHI. [2008] Wikipedia as an Ontology for Describing Documents. In Proceedings of the 2nd International Conference on Weblogs and Social Media, AAAI, March 31, 2008.
THOM, A., PEHCEVSKI, J., AND A. M. VERCOUSTRE. [2007] Use of Wikipedia Categories in Entity Ranking. In Proceedings of the 12th Australasian Document Computing Symposium, Melbourne, Australia.
THOMAS, C.S., AND P. AMIT. [2006] Semantic Convergence of Wikipedia Articles. In Proceedings of the International Conference on Web Intelligence, IEEE/WIC/ACM WI'06, Hong Kong.
TORAL, A. AND R. MUÑOZ. [2007] Towards a Named Entity WordNet (NEWN). In Proceedings of the 6th International Conference on Recent Advances in Natural Language Processing, RANLP'07, Borovets, Bulgaria. p. 604–608.
TORAL, A. AND R. MUÑOZ. [2006] A proposal to automatically build and maintain gazetteers for Named Entity Recognition by using Wikipedia. In Proceedings of the Workshop on New Text at the 11th EACL'06. Trento, Italy.
TYERS, F. AND J. PIENAAR. [2008] Extracting bilingual word pairs from Wikipedia. In Proceedings of the SALTMIL Workshop at Language Resources and Evaluation Conference, LREC'08.
VERCOUSTRE, A. M., PEHCEVSKI, J., AND J. A. THOM. [2007] Using Wikipedia Categories and Links in Entity Ranking. In Pre-proceedings of the 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX'07, December 17, 2007.
VERCOUSTRE, A. M., THOM, J. A., AND J. PEHCEVSKI. [2008] Entity Ranking in Wikipedia. In Proceedings of SAC'08, March 16–20, 2008, Fortaleza, Ceara, Brazil.
VIÉGAS, F.B., WATTENBERG, M., AND D. KUSHAL. [2004] Studying cooperation and conflict between authors with history flow visualizations. In Proceedings of SIGCHI'04, Vienna, Austria, p. 575–582. New York, NY: ACM Press.
VIÉGAS, F., WATTENBERG, M., KRISS, J., AND F. VAN HAM. [2007] Talk before You Type: Coordination in Wikipedia. In Proceedings of the 40th Hawaii International Conference on System Sciences.
VÖLKEL, M., KRÖTZSCH, M., VRANDECIC, D., HALLER, H. AND R. STUDER. [2006] Semantic Wikipedia. In Proceedings of the 15th International Conference on World Wide Web, WWW'06, Edinburgh, Scotland, May 23–26, 2006.
VOORHEES, E. M. [1999] Natural Language Processing and Information Retrieval. In Pazienza, M. T. (editor) Information Extraction: Towards Scalable, Adaptable Systems, New York: Springer, p. 32–48.
VOORHEES, E. M., AND HARMAN, D.
[2000] Overview of the eighth text retrieval conference (TREC-8). In TREC, p. 1–24.
VOSSEN, P., DIEZ-ORZAS, P., AND W. PETERS. [1997] The Multilingual Design of EuroWordNet. In Vossen, P. et al. (editors) Proceedings of the ACL/EACL'97 Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, Madrid, July 12, 1997.
VRANDECIC, D., KRÖTZSCH, M., AND M. VÖLKEL. [2007] Wikipedia and the Semantic Web, Part II. In P. Ayers and N. Boalch (editors) Proceedings of the 2nd International Wikimedia Conference Wikimania'06. Wikimedia Foundation, Cambridge, MA, USA.
DE VRIES, A. P., THOM, J. A., VERCOUSTRE, A. M., CRASWELL, N., AND M. LALMAS. [2007] INEX 2007 Entity ranking track guidelines. In Workshop Pre-Proceedings of INEX 2007.
WANG, P., HU, J., ZENG H., CHEN, L., AND Z. CHEN. [2007] Improving Text Classification by Using Encyclopedia Knowledge. In Proceedings of the 7th IEEE International Conference on Data Mining, ICDM'07, 28–31 October 2007, p. 332–341.
WANG, G., ZHANG, H., WANG, H. AND Y. YU. [2007a] Enhancing Relation Extraction by Eliciting Selectional Constraint Features from Wikipedia. In Proceedings of the Natural Language Processing and Information Systems Conference, p. 329–340.
WANG, G., YU, Y., AND H. ZHU. [2007b] PORE: Positive-Only Relation Extraction from Wikipedia Text. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC/ASWC'07, Busan, South Korea.
WANG, Y., WANG, H., ZHU, H., AND Y. YU. [2007] Exploit Semantic Information for Category Annotation Recommendation in Wikipedia. In Proceedings of the Natural Language Processing and Information Systems Conference, p. 48–60.
WATANABE, Y., ASAHARA, M., AND Y. A. MATSUMOTO. [2007] Graph-based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL.
WILKINSON, D.M., AND HUBERMAN, B.A. [2007] Cooperation and Quality in Wikipedia. In Proceedings of the International Symposium on Wikis, p. 157–164.
WU, F. AND D. WELD. [2007] Autonomously Semantifying Wikipedia. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, CIKM'07, Lisbon, Portugal, November 6–8, 2007, p. 41–50.
WU, F. AND D. WELD. [2008] Automatically Refining the Wikipedia Infobox Ontology. In Proceedings of the 17th International World Wide Web Conference, WWW'08.
WU, F., HOFFMANN, R., AND D. WELD. [2008] Information Extraction from Wikipedia: Moving Down the Long Tail. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-08), Las Vegas, NV, August 24–27, 2008, p. 635–644.
YANG, X. F. AND J. SU. [2007] Coreference Resolution Using Semantic Relatedness Information from Automatically Discovered Patterns. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, ACL'07, Prague, Czech Republic, p. 528–535.
YANG, J., HAN, J., OH, I., AND M. KWAK. [2007] Using Wikipedia technology for topic maps design. In Proceedings of the ACM Southeast Regional Conference, p. 106–110.
YU, J., THOM, J. A., AND A. TAM. [2007] Ontology evaluation using Wikipedia categories for browsing. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, CIKM'07, Lisbon, Portugal, November 6–8, 2007, p. 223–232.
ZARAGOZA, H., RODE, H., MIKA, P., ATSERIAS, J., CIARAMITA, M., AND G. ATTARDI. [2007] Ranking Very Many Typed Entities on Wikipedia. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, CIKM'07, Lisbon, Portugal, November 6–8, 2007, p. 1015–1018.
ZESCH, T. AND I. GUREVYCH. [2007] Analysis of the Wikipedia Category Graph for NLP Applications. In Proceedings of the TextGraphs-2 Workshop at the NAACL-HLT'07, p. 1–8.
ZESCH, T., GUREVYCH, I., AND M. MÜHLHÄUSER.
[2007] Comparing Wikipedia and German WordNet by Evaluating Semantic Relatedness on Multiple Datasets. In Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT'07, p. 205–208.
ZESCH, T., GUREVYCH, I., AND M. MÜHLHÄUSER. [2008] Analyzing and Accessing Wikipedia as a Lexical Semantic Resource. In Proceedings of the Biannual Conference of the Society for Computational Linguistics and Language Technology, p. 213–221.
ZLATIC, V., BOZICEVIC, M., STEFANCIC, H., AND M. DOMAZET. [2006] Wikipedias: Collaborative Web-based Encyclopedias as Complex Networks. Physical Review E, 74:016115.
ZIRN, C., NASTASE, V., AND M. STRUBE. [2008] Distinguishing between Instances and Classes in the Wikipedia Taxonomy. To appear in Proceedings of the ESWC'08.