Working Paper Series 
ISSN 1177-777X 
 
 
 
 
 
 
 
 
 
 
MINING MEANING FROM WIKIPEDIA 
 
 
 
 
 
Olena Medelyan, Catherine Legg, 
David Milne and Ian H. Witten 
 
 
 
 
 
 
 
 
 
 
 
Working Paper: 11/2008 
September 2008 
 
 
 
 
 
 
 
© 2008 Olena Medelyan, Catherine Legg, 
David Milne and Ian H. Witten 
Department of Computer Science 
The University of Waikato 
Private Bag 3105 
Hamilton, New Zealand 
Mining meaning from Wikipedia 
OLENA MEDELYAN, CATHERINE LEGG, DAVID MILNE and IAN H. WITTEN 
University of Waikato, New Zealand 
 
____________________________________ 
 
Wikipedia is a goldmine of information, not just for its many readers, but also for the growing community of 
researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of 
manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being 
applied to a host of tasks. 
 
This article provides a comprehensive description of this work. It focuses on research that extracts and makes 
use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four 
broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval; 
using it for information extraction; and using it as a resource for ontology building. The article addresses how 
Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other 
structures to create entirely new resources. We identify the research groups and individuals involved, and how 
their work has developed in the last few years. We provide a comprehensive list of the open-source software 
they have produced. We also discuss the implications of this work for the long-awaited semantic web. 
____________________________________ 
 
1. INTRODUCTION 
Wikipedia requires little introduction or explanation. As everyone knows, it was 
launched in 2001 with the goal of building free encyclopedias in all languages. Today it 
is easily the largest and most widely used encyclopedia in existence. Wikipedia has 
become something of a phenomenon among computer scientists as well as the general 
public. It represents a vast investment of freely given manual effort and judgment, and the 
last few years have seen a multitude of papers that apply it to a host of different problems. 
This paper provides the first comprehensive summary of this research (up to mid-2008), 
which we collect under the deliberately vague umbrella of mining meaning from 
Wikipedia. By meaning, we encompass everything from concepts, topics, and 
descriptions to facts, semantic relations, and ways of organizing information. Mining 
involves both gathering meaning into machine-readable structures (such as ontologies), 
and using it in areas like information retrieval and natural language processing. 
Traditional approaches to mining meaning fall into two broad camps. On one side are 
carefully hand-crafted resources, such as thesauri and ontologies. These resources are 
generally of high quality, but by necessity are restricted in size and coverage. They rely 
on the input of experts, who cannot hope to keep abreast of the incalculable tide of new 
discoveries and topics that arise constantly. Even the most extensive manually created 
resource (the Cyc ontology, whose hundreds of contributors have toiled for 20 years) 
has limited size and patchy coverage [Sowa 2004]. The other option is to sacrifice quality 
for quantity and obtain knowledge by performing large-scale analysis of unstructured text. 
However, human language is rife with inconsistency, and our intuitive understanding of it 
cannot be entirely replicated in rules or trends, no matter how much data they are based 
upon. Approaches based on statistical inference might emulate human intelligence for 
specific tasks and in specific situations, but cracks appear when generalizing or moving 
into new domains and tasks. 
Wikipedia provides a middle ground between these two camps (quality and 
quantity) by offering a rare mix of scale and structure. With two million articles and 
thousands of contributors, it dwarfs any other manually created resource by an order of 
magnitude in the number of concepts covered, has far greater potential for growth, and 
offers a wealth of further useful structural features. It contains around 18 Gb of text, and its 
extensive network of links, categories and infoboxes provides a variety of explicitly defined 
semantics that other corpora lack. One must, however, keep Wikipedia in perspective. It 
does not always engender the same level of trust or expectations of quality as traditional 
resources, because its contributors are largely unknown and unqualified. It is also much 
smaller and less representative of all human language use than the web as a whole. 
Nevertheless, Wikipedia has received enthusiastic attention as a promising natural 
language and informational resource of unexpected quality and utility. Here we focus on 
research that makes use of Wikipedia, and as far as possible leave aside its controversial 
nature. 
This paper is structured as follows. In the next section we describe Wikipedia's 
creation process and structure, and how it is viewed by computer scientists as anything 
from a corpus, taxonomy, thesaurus, or hierarchy of knowledge topics to a full-blown 
ontology. The next four sections describe different research applications. Section 3 
explains how it is being drawn upon for natural language processing: understanding 
written text. In Section 4 we describe its applications for information retrieval: searching 
through documents, organizing them and answering questions. Section 5 focuses on 
information extraction: mining text for topics, relations and facts. Section 6 describes 
uses of Wikipedia for ontology building, and asks whether this adds up to Tim 
Berners-Lee's long-delayed vision of the semantic web. Section 7 documents the people and 
research groups involved, while Section 8 lists the resources they have produced, with 
URLs. The final section gives a brief overall summary. 
2. WIKIPEDIA: A RESOURCE FOR MINING MEANING 
Wikipedia, one of the most visited sites on the web, outstrips all other encyclopedias in 
size and coverage. Its English language articles alone are 10 times the size of the 
Encyclopedia Britannica, its nearest rival. But material in English constitutes only a 
quarter of Wikipedia: it has articles in 250 other languages as well. Co-founder Jimmy 
Wales is on record as saying that he aspires to distribute a free encyclopedia to every 
person on the planet, in their own language. 
This section provides a general overview of Wikipedia, as background to our 
discussions in Sections 3 to 6. We begin with an insight into its unique editing methods, 
their benefits and challenges (Section 2.1), and then outline its key structural features, 
such as articles, hyperlinks and categories (Section 2.2). In Section 2.3 we identify some 
different roles that Wikipedia as a whole may usefully be regarded as playing: for 
instance, as well as an encyclopedia it can be viewed as a linguistic corpus. We conclude 
in Section 2.4 with some practical information on how to work with Wikipedia data. 
2.1 The Encyclopedic Wisdom of Crowds 
From its inception the Wikipedia project offered a unique, entirely open, collaborative 
editing process, scaffolded by then-new wiki software for group website building, and it is 
fascinating to see how the resource has flourished under this system. It has effectively 
enabled the entire world to become a panel of experts, authors and reviewers, 
contributing under their own name, or, if they wish, anonymously. 
In its early days the project attracted widespread skepticism. It was thought that its 
editing system was so anarchic that it would surely fill up with misconceptions, outright 
lies, vanity pieces and other worse-than-useless human output. A piece in The Onion 
satirical newspaper, "Wikipedia Celebrates 750 Years Of American Independence: 
Founding Fathers, Patriots, Mr. T. Honored,"¹ nicely captures this point of view. 
Moreover, it was argued, surely the ability for anyone to make any change, on any page, 
entirely anonymously, would leave the resource ludicrously vulnerable to vandalism, 
particularly to articles that cover sensitive topics. What if the hard work of 200 people 
were erased by one eccentric? And indeed, "edit wars" did erupt, though it turned out 
that some of the most vicious raged over such apparently trivial topics as the ancestry of 
Freddie Mercury and the true spelling of yoghurt. Yet this turbulent experience was 
channeled into developing a set of ever-more sophisticated Wikipedia policies and 
guidelines,² as well as a more subtle code of recommended good manners referred to as 
Wikiquette.³ A self-selecting set of administrators emerged, who performed regulatory 
functions such as blocking individuals from editing for periods of time: for instance edit 
warriors, identified by the fact that they "revert" an article more than three times in 24 
hours. Interestingly, the development of these rules was guided by the goal of reaching 
consensus, just as the encyclopedia's content is. 
                                                             
¹ http://www.theonion.com/content/node/50902 
² http://en.wikipedia.org/wiki/Wikipedia:Policies_and_guidelines 
³ http://en.wikipedia.org/wiki/Wikipedia:WQT 
Somehow these processes worked sufficiently to shepherd the resource through its 
growing pains, and today Wikipedia is wildly popular and growing all the time. Section 
2.3.1 discusses its accuracy and trustworthiness as an encyclopedia. 
There is still skepticism. For example, Magnus [2006], a philosopher, argues that 
Wikipedia does not enable him to use the methods he usually uses to "assess claims," 
such as relying on the reputation of the source, or assessing whether the claims are written 
in an appropriate style or have content that sounds plausible to him. However, these 
observations can be placed in the context of larger philosophical discussions about the 
nature of knowledge and truth, potentially challenging contemporary philosophical 
wisdom itself. In many ways the history of the so-called "modern" period in Western 
culture (the 300 years or so since the Scientific Revolution) may be seen as the struggle 
to escape a medieval conception of knowledge as defined by some kind of stamp of 
approval conferred on human beliefs by a recognized authority. The key medieval 
authorities were the Bible and Aristotle, and although humanity now avails itself of many 
more sources of information, including scientific experiments, arguably universities still 
claim the same kind of authoritative role as validators of knowledge, in particular through 
the peer review process, which underpins what is published. The received wisdom is that 
surely some external source or body has to validate knowledge claims, or where would 
we be? Yet Wikipedia threatens to tear this function from the academy. Many scholars 
have noticed this, and some fight back, for instance by banning students from using it 
[Baker 2008]. 
Other models of knowledge have been offered, however, that cast Wikipedia's success 
in a new light. In the late 19th century the pragmatist Peirce proposed that beliefs be 
understood as knowledge due not to their prior justification, but to their usefulness, 
public character and future development. His account of knowledge was based on a unique 
account of truth, which claimed that true beliefs are those that all sincere participants in a 
"community of inquiry" would converge on, given enough time. Influential 20th century 
philosophers [e.g. Quine 1960] scoffed at this notion as being insufficiently objective. Yet 
Peirce claimed that there is a kind of person whose greatest passion is to render the 
Universe intelligible and who will freely give time to do so, and that over the long run, 
within a sufficiently broad community, the use of signs is intrinsically self-correcting 
[Peirce 1868]. Wikipedia can be seen as a fascinating and unanticipated concrete 
realization of these apparently wildly idealistic claims. 
In this context it is interesting to note that Larry Sanger, Wikipedia co-founder and 
editor-in-chief, had his initial training as a philosopher, with a specialization in theory of 
knowledge. In public accounts of his work he has tried to bypass vexed philosophical 
discussions of truth by claiming that Wikipedians are not seeking it but rather a neutral 
point of view.⁴ But as the purpose of this is to support every reader being able to build 
their own opinion, it can be argued that somewhat paradoxically this is the fastest route 
to genuine consensus. Interestingly, however, he and the other co-founder Jimmy Wales 
eventually clashed over the issue of expert opinion's role in Wikipedia. Thus, in 2007 
Sanger diverged to found a new public online encyclopedia, Citizendium,⁵ in an attempt to 
"do better" than Wikipedia, apparently reasserting validation by external authority, e.g. 
academics. Interestingly, although it is early days, Citizendium seems to lack 
Wikipedia's popularity and momentum. 
Wikipedia's unique editing methods, and the issues that surround them, have 
complex implications for mining. First, unlike a traditional corpus, it is constantly 
growing and changing, so results obtained at any given time can become stale. Some 
research strives to measure the degree of difference between Wikipedia versions over time 
(though this is only useful insofar as Wikipedia's rate of change is itself constant), and 
to assess the impact on common research tasks [e.g. Ponzetto and Strube 2007a]. Second, 
how are projects that incorporate Wikipedia data to be evaluated? If Wikipedia editors are 
the only people in the world who have been enthusiastic enough to write up certain 
topics (for instance, details of TV program plots), how is one to determine "ground truth" 
for evaluating applications that utilize this information? The third factor is more of an 
opportunity than a challenge. The awe-inspiring abundance of manual labor given freely 
to Wikipedia raises the possibility of a new kind of research project, which would consist 
in encouraging Wikipedians themselves to perform certain tasks on the researchers' behalf 
(possibly tasks of a scale the researchers themselves could not hope to achieve). As we 
will see (for instance in Section 6), some have begun to glimpse this possibility, while 
others continue to view Wikipedia in more traditional "product" rather than "process" 
terms. At any rate, this research area sits on a fascinating interface between software and 
social engineering. 
2.2. Wikipedia's structure 
Traditional paper encyclopedias consist of articles arranged alphabetically, with internal 
cross-references to other relevant places in the encyclopedia, external references to the 
academic literature, and some kind of general index of topics. These structural features 
have been adapted by Wikipedia for the online environment, and some new features 
arising from the wiki editing process have been added. The statistics presented in this 
section were obtained from a version of English Wikipedia released in July 2008. 
                                                             
⁴ http://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view 
⁵ http://en.citizendium.org 
2.2.1. Articles: The basic unit of information in Wikipedia is the article. 
Internationally, Wikipedia contains 10M articles in its 250 different languages.⁶ The 
English version contains 2.4M articles (not counting redirects and disambiguation pages, 
which are discussed below). About 1.8M of these are bona fide articles with more than 30 
words of descriptive text and at least one incoming link from elsewhere in Wikipedia. 
Articles are written in a form of free text that follows a comprehensive set of editorial and 
structural guidelines in order to promote consistency and cohesion. These are laid down 
in the Manual of Style,⁷ and include the following: 
1. Each article describes a single concept, and there is a single article for each 
concept. 
2. Article titles are succinct phrases that resemble terms in a conventional 
thesaurus. 
3. Equivalent terms are linked to an article using redirects (Section 2.2.2). 
4. Disambiguation pages present various possible meanings from which users can 
select an intended article (Section 2.2.3). 
5. Articles begin with a brief overview of the topic, and the first sentence defines 
the entity and its type. 
6. Articles contain hyperlinks that express relationships to other articles (Section 
2.2.6). 

⁶ http://en.wikipedia.org/wiki/Wikipedia 
⁷ http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style 

Figure 1. Wikipedia article on Library. 

Figure 1 shows a typical article, entitled Library. The first sentence describes the 
concept: 
A library is a collection of information, sources, resources, and services: it is 
organized for use and maintained by a public body, an institution, or a private 
individual. 
Here the article's title is the single word Library, but titles are often qualified by 
appending parenthetical expressions. For example, there are other articles entitled Library 
(computing), Library (electronics), and Library (biology). Wikipedia distinguishes 
capitalization when it is relevant: the article Optic nerve (the nerve) is distinguished from 
Optic Nerve (the comic book). 
2.2.2. Redirects: A redirect page is one with no text other than a directive in the form 
of a redirect link. There are about a dozen for Library and just under three million in the 
entire English Wikipedia; they encode plurals (libraries), technical terms 
(bibliotheca), common misspellings (libary), and other variants (reading room, book 
stack). The aim is to have a single article for each concept and define redirects to link 
equivalent terms to the article's preferred title. As we will see, this helps with mining 
because an external thesaurus is unnecessary for resolving synonymy. 
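As a minimal sketch of how redirects can stand in for a thesaurus, the following Python fragment maps variant terms to a preferred article title. The table entries simply mirror the Library examples above; the names REDIRECTS, ARTICLES and resolve are hypothetical, and a real table would be built from a Wikipedia dump. Lowercasing every term is a simplifying assumption that ignores the cases, noted in Section 2.2.1, where capitalization distinguishes articles.

```python
# Hypothetical redirect table: variant term -> preferred article title.
# The entries mirror the Library examples in the text.
REDIRECTS = {
    "libraries": "Library",
    "bibliotheca": "Library",
    "libary": "Library",
    "reading room": "Library",
    "book stack": "Library",
}

# Hypothetical set of article titles known to the system.
ARTICLES = {"Library", "Optic nerve"}


def resolve(term):
    """Return the preferred article title for a term, or None if unknown."""
    key = term.strip().lower()
    if key in REDIRECTS:
        return REDIRECTS[key]
    # Fall back to a case-insensitive match against article titles.
    for title in ARTICLES:
        if title.lower() == key:
            return title
    return None


print(resolve("Libraries"))   # Library
print(resolve("museum"))      # None
```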
2.2.3. Disambiguation pages: Instead of taking readers to an article named by the 
term, as Library does, the Wikipedia search engine sometimes takes them directly to a 
special disambiguation page where they can click on the meaning they want. These pages 
are identified by invoking certain templates (discussed in Section 2.2.8) or assigning 
them to certain categories (Section 2.2.6), and often contain (disambiguation) in their 
title.  
The English Wikipedia contains 110,000 disambiguation pages. The first line of the 
Library article in Figure 1 ("For other uses ...") links to a disambiguation page that lists 
Library (computing), Library (electronics), Library (biology), and other senses of the 
term. Brief scope notes accompany each sense, to help users identify the correct one. For 
instance Library (computer science) is "a collection of subprograms used to develop 
software." The articles themselves serve as detailed scope notes. Disambiguation pages 
are helpful sources of information concerning homonyms. 
2.2.5. Hyperlinks: Articles are peppered with hyperlinks to other articles: on average, 
about 25 of them. The English Wikipedia contains 60 million in total. They provide 
explanations of the topics being discussed and support an environment where 
serendipitous encounters with information are commonplace. Anyone who has browsed 
Wikipedia has likely experienced the feeling of being happily lost, browsing from one 
interesting topic to the next and encountering information that they would never have 
searched for. 
Wikipedia's hyperlinks are also useful from a linguistic standpoint. They are an 
additional source of synonyms that are not captured by redirects, because the terms used 
as anchors are often couched in different words. Library, for example, is referenced by 20 
different anchors including library, libraries, and biblioteca. They also complement 
disambiguation pages by encoding polysemy; library links to different articles depending 
on the context in which it is found. They also give a sense of how well known each sense 
is; 84% of library links go to the article shown in Figure 1, while only 13% go to 
Library (computing). Furthermore, since hyperlinks in Wikipedia indicate that one article 
relates to another in some respect, this fundamental structure can be mined for meaning in 
many interesting ways: capturing the associative relations included in standard thesauri 
(Section 5.2), to give just one example. 
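The proportions quoted above can be derived by counting, for each anchor text, how often it links to each target article. The sketch below illustrates this under stated assumptions: the function name sense_commonness is hypothetical, and the link sample is made up to reproduce the 84% and 13% figures (the remaining 3% is assigned to Library (biology) purely for illustration).

```python
from collections import Counter, defaultdict


def sense_commonness(links):
    """links: iterable of (anchor_text, target_title) pairs.

    Returns {anchor: {target: fraction of that anchor's links}}.
    """
    counts = defaultdict(Counter)
    for anchor, target in links:
        counts[anchor.lower()][target] += 1
    return {
        anchor: {t: n / sum(c.values()) for t, n in c.items()}
        for anchor, c in counts.items()
    }


# Made-up sample chosen to match the proportions quoted in the text.
sample = (
    [("library", "Library")] * 84
    + [("library", "Library (computing)")] * 13
    + [("library", "Library (biology)")] * 3   # illustrative remainder
    + [("libraries", "Library")] * 5
)
common = sense_commonness(sample)

# Anchors that point at the same article act as synonyms for it.
synonyms = sorted(a for a, targets in common.items() if "Library" in targets)
```

The same table serves both purposes described in the text: reading it by anchor gives the relative prominence of each sense, and reading it by target gives the set of alternative names for an article.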
2.2.6. Category structure: Authors are encouraged to assign categories to their 
articles. For example, the article Library falls in the category Book Promotion. Authors 
are also encouraged to assign the categories themselves to other more general categories; 
Book Promotion belongs to Books, which in turn belongs to Written Communication. 
These categorizations, like the articles themselves, can be modified by anyone. There are 
almost 400,000 categories in the English Wikipedia, with an average of 19 articles and 
two subcategories each. 
Categories are not themselves articles. They are merely nodes for organizing the 
articles they contain, with a minimum of explanatory text. Often (in about a third of 
cases), categories correspond to a concept that requires further description. In these cases 
they are paired with an article of the same name: the category Libraries is paired with the 
article Library, and Billionaires with Billionaire. Other categories, such as Libraries by 
country, have no corresponding articles and serve only to organize the content. For 
clarity, in this paper we indicate categories in the form Category:Books unless it is 
obvious that we are not talking about an article. 
The goal of the category structure is to represent information hierarchy. It is not a 
simple tree-structured taxonomy, but a graph in which multiple organization schemes 
coexist. Thus both articles and categories can belong to more than one category. The 
category Libraries belongs to four: Buildings and structures, Civil services, Culture, and 
Library and information science. The overall structure approximates an acyclic directed 
graph; all relations are directional, and although cycles sometimes occur, they are 
uncommon. According to Wikipedia's own guidelines, cycles are generally discouraged 
but may be acceptable in rare cases. For example, Education is a field within Social 
Sciences, which is an Academic discipline, which belongs under Education. In other 
words, you can educate people about how to educate. 
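Because the category graph is only approximately acyclic, code that walks up the hierarchy needs to guard against loops. The following sketch, which is an illustration rather than a method from the surveyed work, uses a depth-first search to report one cycle reachable from a starting category. The PARENTS mapping is a hypothetical fragment containing only the examples mentioned above.

```python
def find_cycle(parents, start):
    """Depth-first search upward from `start` through parent categories.

    Returns the list of categories forming a cycle, or None if none is found.
    """
    path = []        # categories on the current search path, in order
    on_path = set()  # the same categories, for fast membership tests
    done = set()     # categories fully explored without finding a cycle

    def visit(node):
        if node in on_path:
            return path[path.index(node):] + [node]
        if node in done:
            return None
        path.append(node)
        on_path.add(node)
        for parent in parents.get(node, ()):
            cycle = visit(parent)
            if cycle:
                return cycle
        path.pop()
        on_path.discard(node)
        done.add(node)
        return None

    return visit(start)


# Hypothetical fragment of the category graph: category -> parent categories.
PARENTS = {
    "Libraries": ["Buildings and structures", "Civil services", "Culture",
                  "Library and information science"],
    "Education": ["Social Sciences"],
    "Social Sciences": ["Academic discipline"],
    "Academic discipline": ["Education"],
}

print(find_cycle(PARENTS, "Education"))
# ['Education', 'Social Sciences', 'Academic discipline', 'Education']
print(find_cycle(PARENTS, "Libraries"))
# None
```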
A relatively recent addition to the encyclopedia, and less visible than articles, the 
category structure is haphazard, redundant, incomplete, and inconsistent [Chernov et al. 
2006; Muchnik et al. 2007]. Links represent a wide variety of types and strengths of 
relationships. Although there has been much cleanup and the greatest proportion of links 
now represent class membership (isa), there are still many representing physical 
parthood, geographical location and many other merely thematic associations between 
entities, as well as meta-categories used for editorial purposes, such as Disambiguation. 
Thus Category:Pork currently contains, among others, the categories Domestic Pig, 
Bacon Bits, Religious Restrictions on the Consumption of Pork, and Full Breakfast. We 
will see in Section 6 that there are opportunities for recruiting users to help with data 
cleaning. We will also see in Section 5 that the issues mentioned above have not 
prevented researchers from innovatively and fruitfully mining the category structure for a 
range of different purposes. 
2.2.8 Templates and infoboxes: Templates are pages that are not used in isolation, 
but are instead invoked to add information to other pages in a reusable fashion. 
Wikipedia contains 174,000 different templates, which have been invoked 23 million 
times. They are commonly used to identify articles that require attention; e.g. if they are 
biased, poorly written, or lacking citations. They can also define pages of different types, 
such as disambiguation pages or featured (high quality) articles. A common application is 
to provide navigational links, such as the for other uses link shown in Figure 1. 
An infobox is a special type of template that displays factual information in a 
structured uniform format. Figure 2 shows one from the article on the Library of 
Congress. It was created by invoking the Infobox Library template and populating its 
fields, such as location and collection size. There are 8,000 different infobox templates 
that are used for anything from animal species to strategies for starting a game of chess, 
and the number is growing rapidly. 
There are several simple ways in which the infobox structure could be improved. 
Standard representations for units would allow quantities to be extracted reliably. 
Different attribute names are often used for the same kind of content. More far-reaching 
would be to associate data types with attribute values, and allow language and unit tags 
when information can be expressed in different ways (e.g. Euro and USD). Many 
Wikipedia articles use tables for structured information that would be better represented as 
templates [Auer and Lehmann 2007]. Despite these problems, it is surprising how much 
meaningful and machine-interpretable information can be extracted from Wikipedia 
templates. This is discussed further in Sections 5.3 and 6.6. 
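To give a sense of why infoboxes are attractive for extraction, here is a rough sketch that turns a flat template invocation into attribute-value pairs. It assumes the common double-brace wiki markup with fields separated by vertical bars, and it deliberately ignores nested templates and links that contain a bar, which real infoboxes often include. The function name parse_infobox and the field names in SAMPLE are hypothetical and chosen for illustration only.

```python
def parse_infobox(wikitext):
    """Parse a flat {{Infobox ...}} invocation into (template_name, fields).

    Simplifying assumption: no nested templates and no '|' inside values.
    """
    body = wikitext.strip()
    if not (body.startswith("{{") and body.endswith("}}")):
        raise ValueError("not a template invocation")
    parts = [p.strip() for p in body[2:-2].split("|")]
    name, fields = parts[0], {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return name, fields


# Illustrative invocation; the field names are assumptions, not the
# actual parameters of the Infobox Library template.
SAMPLE = """{{Infobox Library
| name = Library of Congress
| location = Washington, D.C.
| established = 1800
}}"""

name, fields = parse_infobox(SAMPLE)
```

Every value comes back as an uninterpreted string, which is exactly the limitation noted above: without standard units or data types, a consumer must guess whether a value such as 1800 is a year, a count or something else.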
2.2.4. Discussion pages: A discussion tab at the top of each article takes readers to its 
Talk page, representing a forum for discussions (often longer than the article itself) as to 
how it might be criticized, improved or extended in the future. For example, the talk page 
of the Library article, Talk:Library, contains the following observations, among many 
others: 
location? 
Libraries can also be found in churches, prisons, 
hotels etc. Should there be any mention of this? 
-- Daniel C. Boyer 20:38, 10 Nov 2003 
 Libraries can be found in many places, and 
 articles should be written and linked. A wiki 
 article on libraries can never be more of a 
 summary, and will always be expandable -- 
 DG 04:18, 1 September 2006 
There are talk pages for other aspects of Wikipedia's structure, such as templates and 
categories, as well as user talk pages that editors use to communicate with each other. 
These pages are a unique and interesting feature of Wikipedia not replicated in 
traditional encyclopedias. They have been mined for determining quality metrics of 
Wikipedia edits [Emigh et al. 2005; Viégas et al. 2007] but have not yet been employed 
for any tasks discussed in this paper, perhaps because of their unstructured nature. 
2.2.5 Edit histories: To the right of the discussion tab is a history tab that takes 
readers to each article's editing history. This contains the name or pseudonym of every 
editor, with the changes they made. From the revision history of Library we can see that 
this article was created on 9 November 2001 in the form of a short note (which, in fact, 
bears little relationship to the current version) and has been edited about 150 times 
since. Recent edits add new links and new entries to lists; indicate possible vandalism 
and its reversal; correct spelling mistakes; and so on. 
 
 
Figure 2. Infobox for the Library of Congress 
 
Analyzing editing history is an interesting research area in its own right. For example, 
Viégas [2004] describes how history pages can be mined to discover collaboration 
patterns. Nelken and Yamangil [2008] discuss several ways of utilizing the unique 
properties of history pages as a corpus for extracting lexical errors called eggcorns, e.g. 
<rectify, ratify>, as well as phrases that can be dropped to compress sentences, a useful 
component of automatic text summarization. 
It is natural to ask whether the content of individual articles converges in some 
semantic sense, staying stable despite continuing edits. Thomas and Amit [2007] call the 
information in a Wikipedia article "justified" if, after going through the community 
process of discussion, repeated editing, and so on, it has reached a stable state. They 
found that articles do, in general, become stable, but that it is difficult to predict where in 
its journey towards maturity a given article is at any point in time. They also point out 
that although information about an article's edit history might indicate its likely quality, 
mining systems invariably ignore it. 
Table 1 breaks down the number of different pages and connections in the English 
version at the time of writing. There are almost 5.5 million pages in the section 
dedicated to articles. Most are redirects. Many others are disambiguation pages, lists 
(which group related articles but do not provide explanatory text themselves) and stubs 
(incomplete articles with fewer than 30 words or no incoming links from 
elsewhere in Wikipedia). Removing all these leaves about 1.8 million bona fide articles, 
each with an edit history and most with some content on their discussion page. The 
articles are organized into 400,000 different categories and augmented with 170,000 
different templates. They are densely interlinked, with 62 million connections: an 
average of 25 incoming and 25 outgoing links from each article. 
Articles and related pages     5,460,000      Categories      390,000 
  redirects                    2,970,000 
  disambiguation pages           110,000      Templates       174,000 
  lists and stubs                620,000        infoboxes       9,000 
  bona fide articles           1,760,000        other         165,000 

Links 
  between articles                    62,000,000 
  between category and subcategory       740,000 
  between category and article         7,270,000 
Table 1. Content of English Wikipedia. 
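The figures in Table 1 can be cross-checked against the averages quoted earlier in this section. The short calculation below is only a sanity check on the reported numbers. It assumes a disambiguation count of 110,000, which is the value that makes the page types sum to the stated total; all variable names are illustrative.

```python
# Counts as read from Table 1 (English Wikipedia, July 2008 version).
pages = {
    "redirects": 2_970_000,
    "disambiguation pages": 110_000,   # assumed reading, see lead-in
    "lists and stubs": 620_000,
    "bona fide articles": 1_760_000,
}
templates = {"infoboxes": 9_000, "other": 165_000}
categories = 390_000
category_article_links = 7_270_000
category_subcategory_links = 740_000

total_pages = sum(pages.values())          # 5,460,000
total_templates = sum(templates.values())  # 174,000

# Averages per category, matching the "19 articles and two
# subcategories each" quoted in the discussion of categories.
articles_per_category = category_article_links / categories            # about 18.6
subcategories_per_category = category_subcategory_links / categories   # about 1.9

# Share of article-space pages that are redirects ("Most are redirects").
redirect_share = pages["redirects"] / total_pages                       # about 0.54
```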
2.3. Perspectives on Wikipedia 
Wikipedia is a rich resource with several different broad functionalities. We will see 
in subsequent sections that researchers have developed sophisticated mining techniques 
with which they can identify, isolate and utilize these different perspectives. Here we 
introduce the most important examples. 
2.3.1 Wikipedia as an encyclopedia: The first and most obvious usage for Wikipedia 
is exactly what it was intended as: an encyclopedia. Ironically, this is the very 
application that has generated most doubt and cynicism. As noted above, the open 
editing policy has led many to doubt its authority. Denning et al. [2005] provide a good 
review of early concerns. They conclude that, while Wikipedia is an interesting example 
of large-scale collaboration, its use as an information source is risky. Their core argument 
is the lack of formal expert review procedures, which gives rise to two key issues: 
accuracy within articles, and bias of coverage across them. 
Accuracy within articles is investigated by Giles [2005], who compares randomly 
selected scientific Wikipedia articles with their equivalent entries in Encyclopedia 
Britannica. Both sources were equally prone to significant errors, such as 
misinterpretation of important concepts. More subtle errors, however, such as omissions 
or misleading statements, were more common in Wikipedia. In the 41 articles reviewed 
there were 162 mistakes in Wikipedia versus 123 for Britannica. Britannica Inc. attacked 
Giles' study as "fatally flawed"⁸ and demanded a retraction; Nature defended itself and 
declined to retract.⁹ Ironically, while Britannica's part in the debate has been polemical 
and plainly biased, Wikipedia provides objective coverage on the controversy in its 
article on Encyclopedia Britannica. 
Several authors have developed metrics that evaluate the quality of Wikipedia articles 
based on such features as number of authors, number of edits, internal and external 
linking, and article size, e.g. Lih [2004] and Wilkinson and Huberman [2007]; article 
stability, e.g. Dondio et al. [2006]; and the amount of conflict an article generates, e.g. 
Kittur [2007]. Emigh and Herring [2005] perform a genre analysis on Wikipedia using 
corpus linguistic methods to determine "features of formality and informality," and claim 
that its degree of post-production editorial control produces entries as standardized as 
those in traditional print encyclopedias. Viégas et al. [2007] claim that overall 
coordination and organization, one of the fastest growing areas of Wikipedia, ensures 
great resilience to malicious editing despite high traffic; they highlight in particular the 
role played by discussion pages. 
                                                             
⁸ http://www.corporate.britannica.com/britannica_nature_response.pdf
⁹ http://www.nature.com/press_releases/Britannica_response.pdf
So much for accuracy. A second issue is bias of coverage. Wikipedia is edited by
volunteers, who naturally apply more effort to describing topics that pique their interest.
For example, there are 60 different articles dedicated to The Simpsons cartoon. In
contrast, there are half as many pages about the namesake of the cartoon's main character,
the Greek poet Homer, and all the literary works he created and inspired. Lih [2004]
shows that Wikipedia's content, and therefore bias, is also driven to a large extent by the
press. Milne et al. [2006] identify a bias towards concepts that are general or
introductory, and therefore more relevant to "everyman."
2.3.2. Wikipedia as corpus: Large text collections are useful for creating language
models that capture particular characteristics of language use. For example, the language
in which a text is written can be determined by analyzing the statistical distribution of
the letter n-grams it contains [Cavnar and Trenkle 1994], whereas word co-occurrence
statistics are helpful in tasks like spelling correction [Mays et al. 1991]. Aligned text
corpora in different languages are extremely useful in machine translation [Brown et al.
1993]. Extensive coverage and high quality of the corpus are crucial criteria for the
success of such applications. While the web has enabled rapid acquisition of large text
corpora, their quality leaves much to be desired, due to spamming and the varying format
of websites. In particular, manually annotated corpora and aligned multilingual corpora
are still rare and in high demand.
Wikipedia provides a plethora of well-written and well-formulated articles (several
gigabytes in the English version alone) that can easily be separated from other parts of
the website. The Simple Wikipedia is significantly smaller, but its articles are written for
non-English speakers and do not contain complex sentences. This makes automatic
linguistic processing easier, and some researchers focus on Simple Wikipedia for their
experiments [Ruiz-Casado et al. 2005; Toral and Muñoz 2006]. Many researchers take
advantage of the large number of definitions in Wikipedia for question answering (Section
4.3) and automatic extraction of semantic relations (Section 5.1). Section 2.2.5 mentions
how Wikipedia history pages can be used as a corpus for training text summarization
algorithms, as well as for determining the quality of the articles themselves.
Wikipedia also contains annotations in the form of targeted hyperlinks. Consider the
following two sentences from the article about the Formula One team named McLaren.
1. The [Kiwi (people)|Kiwi] made the team's Grand Prix debut at the 1966
Monaco race.
2. Original McLaren [Kiwi|kiwi] logo; a New Zealand icon.
In the first case the word kiwi links to Kiwi (people); in the second, to Kiwi, the article
describing the bird. This mark-up is nothing more or less than word sense annotation.
Mihalcea [2007] shows that Wikipedia is a full-fledged alternative to manually
sense-tagged corpora. Section 3.2 discusses research that makes use of these annotations for
word sense disambiguation and computing the semantic similarity between words.
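The idea that link mark-up doubles as sense annotation can be sketched in a few lines. This is only an illustration: the single-bracket `[Target|anchor]` pattern mirrors the two sentences quoted above (real wiki mark-up doubles the brackets), and the function name `sense_annotations` is hypothetical.

```python
import re

# Matches "[Target|anchor]" as written in the two example sentences.
# Assumption: single brackets, exactly one "|" separating target and anchor.
LINK = re.compile(r"\[([^\[\]|]+)\|([^\[\]|]+)\]")

def sense_annotations(sentence):
    """Return (anchor text, target article) pairs found in a sentence."""
    return [(m.group(2), m.group(1)) for m in LINK.finditer(sentence)]

sentences = [
    "The [Kiwi (people)|Kiwi] made the team's Grand Prix debut at the 1966 Monaco race.",
    "Original McLaren [Kiwi|kiwi] logo; a New Zealand icon.",
]

# Each pair is one sense-tagged occurrence: the surface word and the
# article (sense) that the author chose for it.
annotations = [pair for s in sentences for pair in sense_annotations(s)]
```

Each resulting pair can be read as one training example for a word sense disambiguation system: the anchor is the ambiguous word and the target article is its sense label.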
Although the exploration of Wikipedia as a source of multilingual aligned corpora has
only just begun, its links between descriptions of concepts in different languages have been
exploited for cross-language question answering [Ferrández et al. 2007] and automatic
generation of bilingual dictionaries [Erdmann et al. 2008]. This is further discussed in
Section 3.4, while Section 4.3 investigates Wikipedia's potential for multilingual
information retrieval.
2.3.3 Wikipedia as a thesaurus: There are many similarities between the structure of
traditional thesauri and the ways in which Wikipedia organizes its content. As noted,
each article describes a single concept, and its title is a succinct, well-formed phrase that
resembles a term in a conventional thesaurus. If article names correspond to manually
defined terms, links between them correspond to relations between terms, the building
blocks of thesauri. The international standard for thesauri (ISO 2788) specifies four kinds
of relation:
• Equivalence: USE, with inverse form USE FOR
• Hierarchical: broader term (BT), with inverse form narrower term (NT)
• Any other kind of semantic relation (RT, for related term).
Wikipedia redirects provide precisely the information expressed in the equivalence
relation. As noted, they are a powerful way of dealing with word variations such as
abbreviations, equivalent expressions and synonyms. The hierarchical relations (broader
and narrower terms) are reflected in Wikipedia's category structure. Hyperlinks between
articles capture other kinds of semantic relation. (Restricting consideration to mutual
cross-links eliminates many of the more tenuous associations.)
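The correspondence between Wikipedia structures and thesaurus relations can be made concrete with a minimal sketch. The toy data and the function name `thesaurus_relations` are invented for illustration; only the mapping itself (redirects to USE, categories to BT, mutual links to RT) comes from the text.

```python
def thesaurus_relations(redirects, categories, links):
    """Derive thesaurus-style triples from toy Wikipedia structures.

    redirects:  {variant title: article}          -> equivalence (USE)
    categories: {article: [parent categories]}    -> hierarchy (BT)
    links:      {article: set of linked articles} -> related term (RT),
                kept only when the two articles link to each other
    """
    relations = []
    for variant, article in sorted(redirects.items()):
        relations.append((variant, "USE", article))
    for article, parents in sorted(categories.items()):
        for parent in parents:
            relations.append((article, "BT", parent))
    for source in sorted(links):
        for target in sorted(links[source]):
            # "source < target" reports each mutual pair once.
            if source < target and source in links.get(target, set()):
                relations.append((source, "RT", target))
    return relations

# Hypothetical toy data, not taken from a real dump.
redirects = {"NZ": "New Zealand"}
categories = {"Kiwi": ["Birds of New Zealand"]}
links = {
    "Kiwi": {"New Zealand", "Bird"},
    "New Zealand": {"Kiwi"},
    "Bird": set(),
}
relations = thesaurus_relations(redirects, categories, links)
```

In this toy graph the one-way link from Kiwi to Bird is dropped, which is the effect of restricting attention to mutual cross-links.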
As we will see, researchers compare Wikipedia with manually created domain-specific
thesauri and augment them with knowledge from it (Section 3.2.3). Redirects turn out to
be very accurate and can safely be added to existing thesauri without further checking.
Wikipedia also has the potential to contribute new topics and concepts, and can be used
as a source of suggestions for thesaurus maintenance. Manual creation of scope notes is a
labor-intensive aspect of traditional thesauri. Instead, the first paragraph of a Wikipedia
article can be extracted as a description of the topic, backed up by the full article should
more explanation be required. Finally, Wikipedia's multilingual nature allows thesauri
to be translated into other languages.
2.3.4. Wikipedia as a database: Wikipedia contains a massive amount of highly
structured information. Several projects (notably DBpedia, discussed in Sections 5.2 and
6.6) extract this and store it in formats accessible to database applications. The aim is
two-fold: to allow users to pose database-style queries against datasets derived from
Wikipedia, and to facilitate linkage with other datasets on the web. Some projects even
aim to extract database-style facts directly from the text of Wikipedia articles, rather than
from infoboxes. Furthermore, disambiguation and redirect pages can be turned into a
relational database that contains tables for terms, concepts, term-concept relationships
and concept relationships [Gregorowicz and Kramer 2006].
Another idea is to bootstrap fact extraction from articles by using the content of
infoboxes as training data and applying machine learning techniques to extract even more
infobox-style information from the text of other articles. This allows infoboxes to be
generated for articles that do not yet have them [Wu and Weld 2007]. Related techniques
can be used to clean up the underlying infobox data structure, with its proliferation of
individual templates.
2.3.5 Wikipedia as an ontology: Articles can be viewed as ontology elements, for
which the URIs of Wikipedia entries serve as surprisingly reliable identifiers [Hepp et al.
2006]. Of course, true ontologies also require concept nodes to be connected by
informative relations, and in Section 6 we will see researchers mine such relations in a
host of innovative ways from Wikipedia's structure, including redirects, hyperlinks
(both incoming and outgoing, as well as the anchor text), category links, category names
and infoboxes, and even raw text, as well as experimenting with adding relations to and
from other resources such as WordNet and Cyc.
From this viewpoint Wikipedia is arguably by far the largest living ontological
structure available today, with its distinctive Wiki technology serving as a large-scale
collaborative ontology development environment. Some researchers are beginning to mix
traditional mining techniques with possibly more far-sighted attempts to encourage
Wikipedia editors themselves in directions that might bear ontological fruit.
2.3.6 Wikipedia as a network structure: Wikipedia can be viewed as a hyperlinked
structure of web pages, a microcosm of the web. Standard methods of analyzing the
network structure can then be applied [Bellomi and Bonato 2005]. The two most
prominent techniques used for web analysis are PageRank, which underpins Google's
success [Brin and Page 1998], and the HITS algorithm [Kleinberg 1998]. Bellomi and
Bonato [2005] applied both of these to Wikipedia and discerned some interesting
underlying cultural biases (as of April 2005). These authors conclude that PageRank and
HITS seem to identify different kinds of information. They report that according to the
HITS authority metric, space (in the form of political geography) and time (in the form of
both time spans and landmark events) are the primary organizing categories for Wikipedia
articles. Within these, information tends to be organized around famous people, common
words, animals, ethnic groups, political and social institutions, and abstract concepts
such as music, philosophy, and religion.
In contrast, the most important articles according to PageRank include an
overwhelming number of concepts tightly related to religion. For example, Pope, God
and Priest were the highest-ranking nouns, as compared to Television, Scientific
classification, and Animal for HITS. They found that PageRank seemed to transcend
recent political events to give a wider historical and cultural perspective in weighting
geographic entities. It also tends to bring out a global rather than a Western perspective,
both for countries and cities and for historical events. HITS reveals a strong bias towards
recent political leaders, whereas people with high PageRank scores tend to be ones with
an impact on religion, philosophy and society. It would be interesting to see how these
trends have evolved in the three years since the publication of this work.
An alternative to PageRank and HITS is the Green method [Duffy 2001], which
Ollivier and Senellart [2007] applied to Wikipedia's hyperlink network structure in order
to find related articles. This method, which is based on Markov Chain theory, is related
to the topic-sensitive version of PageRank introduced by Haveliwala [2003]. Given a
target article, one way of finding related articles is to look at nodes with high PageRank
in its immediate neighborhood. For this a topic-sensitive measure like Green's is more
appropriate than the global PageRank.
The Wikipedia category graph also forms a network structure. Zesch and Gurevych
[2007] showed that it is a scale-free, small-world graph, like other semantic networks
such as WordNet. They adapted WordNet-based measures of semantic relatedness to use
the Wikipedia category graph instead, and found that they work well, at least for nouns.
They suggest that this, coupled with Wikipedia's multilingual nature, may enable
natural language processing algorithms to be transferred to languages that lack
well-developed semantic WordNets.
2.4. Obtaining Wikipedia data
Wikipedia is based on the MediaWiki software. As an open source project, its entire
content is easily obtainable. It is available in the form of large XML files and database
dumps that are released sporadically, from several days to several weeks apart.¹⁰ The full
content (without revision history or images) of the English version of Wikipedia occupies
18 Gb of uncompressed data at the time of writing. There are several tools for extracting
information from these files, which are discussed in Section 7.
                                                             
¹⁰ http://download.wikimedia.org/wikipedia
As an alternative to obtaining the database directly, specialized web crawlers have been
developed to download the entire content of Wikipedia. Bellomi and Bonato [2005]
scanned the All pages index section, which contains a complete list of the pages exposed
on the website. Pages that do not contain a regular article were identified by testing for
specific patterns in the URL, and discarded. Wikipedia's administrators prefer the use of
the database dumps, however, to minimize the strain placed on their services.
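Because the dumps are far too large to load whole, they are usually read as a stream. The sketch below shows the idea with Python's standard `xml.etree.ElementTree.iterparse` on a tiny inline sample. The sample is a simplified stand-in: real dumps declare an XML namespace and carry many more fields, and the colon test for non-article pages is a crude assumption that would also reject some genuine article titles.

```python
import io
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for a dump file (no XML namespace).
SAMPLE = b"""<mediawiki>
  <page><title>Kiwi</title><revision><text>A flightless bird.</text></revision></page>
  <page><title>Talk:Kiwi</title><revision><text>Discussion.</text></revision></page>
</mediawiki>"""

def iter_articles(stream):
    """Yield (title, text) for each regular article, one page at a time."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            text = elem.findtext("revision/text") or ""
            # Crude heuristic: treat "Prefix:Title" as a non-article page.
            if ":" not in title:
                yield title, text
            # Free the memory held by the page that was just processed.
            elem.clear()

articles = list(iter_articles(io.BytesIO(SAMPLE)))
```

Clearing each `page` element after use keeps memory roughly constant regardless of dump size, which is the main reason to prefer streaming over building the full tree.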
3 SOLVING NATURAL LANGUAGE PROCESSING TASKS
Natural language processing applications fall into two major groups: (i) those relying on
symbolic methods, where the system utilizes a manually encoded repository of human
language, and (ii) statistical methods, which infer properties of language by processing
large text corpora. The problem with the former is a dearth of high-quality knowledge
bases. Even the lexical database WordNet, which, as the largest of its kind, receives
substantial attention [Fellbaum 1998], has been criticized for low coverage (particularly
of proper names) and high sense proliferation [Mihalcea and Moldovan 2001; Ponzetto
and Strube 2007a]. Initial enthusiasm for statistical methods faded somewhat once they
hit an upper performance bound that is hard to improve upon unless they are combined
with symbolic elements [Klavans and Resnik 1996]. Several research groups
simultaneously discovered Wikipedia as an alternative to WordNet. Direct comparison of
their performance on the same task has shown that Wikipedia can be employed in a
similar way and significantly outperforms WordNet on various tasks [Strube and
Ponzetto 2006]. This section describes research in the four language processing tasks to
which Wikipedia has been successfully applied: semantic relatedness (Section 3.1), word
sense disambiguation (Section 3.2), co-reference resolution (Section 3.3) and multilingual
alignment (Section 3.4).
3.1 Semantic relatedness
Semantic relatedness quantifies the similarity between two concepts, e.g. doctor and
hospital. Budanitsky and Hirst [2001] differentiate between semantic similarity, where
only predefined taxonomic relations are used to compute similarity, and semantic
relatedness, where other relations like has-part, is-made-of are used as well. Semantic
relatedness can also be quantified by statistical methods without requiring a manually
encoded taxonomy, for example by analyzing term co-occurrence in a large corpus
[Resnik 1995; Jiang and Conrath 1997].
To evaluate automatic methods for estimating semantic relatedness, the correlation
coefficient between machine-assigned scores and those assigned by human judges is
computed. Three standard datasets are available for evaluation:
• Miller and Charles' [1991] list of 30 noun pairs, which we denote by M&C;
• Rubenstein and Goodenough's [1965] 65 synonymous word pairs, R&G;
• Finkelstein et al.'s [2002] collection of 353 word pairs (WordSimilarity-353),
WS-353.
The best pre-Wikipedia result for the first set was a correlation of 0.86, achieved by Jiang
and Conrath [1997] using a combination of statistical measures and taxonomic analysis
derived from WordNet. For the third, Finkelstein et al. [2002] achieved 0.56 correlation
using Latent Semantic Analysis. The discovery of Wikipedia began a new era of
competition.
Strube and Ponzetto [2006] and Ponzetto and Strube [2007a] re-calculated several
measures developed for WordNet using Wikipedia's category structure. The best
performing metric on most datasets was Leacock and Chodorow's [1998] normalized path
measure:
\[ \mathrm{lch}(c_1, c_2) = -\log \frac{\mathrm{length}(c_1, c_2)}{2D}, \]
where length is the number of nodes on the shortest path between nodes $c_1$ and $c_2$, and $D$
is the maximum depth of the taxonomy. WordNet-based measures outperform Wikipedia-
based ones on the small datasets M&C and R&G, but on WS-353 Wikipedia wins by a
large margin. Combining similarity evidence from Wikipedia and WordNet using an
SVM to learn relatedness from the training data yielded the highest correlation score of
0.62 on a designated "testing" subset of WS-353.
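A small worked example may help to make the path measure concrete. The sketch below counts the nodes on the shortest path with a breadth-first search over a toy category graph and then applies the formula. The graph, the depth value D = 4 and the use of the natural logarithm are assumptions for illustration only.

```python
import math
from collections import deque

def path_length_in_nodes(graph, start, goal):
    """Number of nodes on the shortest path between two categories (BFS)."""
    queue = deque([(start, 1)])
    seen = {start}
    while queue:
        node, count = queue.popleft()
        if node == goal:
            return count
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, count + 1))
    return None  # the two nodes are not connected

def lch(length, max_depth):
    """Normalized path measure: -log(length / (2 * D))."""
    return -math.log(length / (2.0 * max_depth))

# Hypothetical toy taxonomy, stored as an undirected adjacency list.
graph = {
    "Root": ["Science", "Arts"],
    "Science": ["Root", "Medicine"],
    "Medicine": ["Science", "Doctor", "Hospital"],
    "Doctor": ["Medicine"],
    "Hospital": ["Medicine"],
    "Arts": ["Root", "Music"],
    "Music": ["Arts"],
}
D = 4  # assumed maximum depth: Root, Science, Medicine, Doctor

near = lch(path_length_in_nodes(graph, "Doctor", "Hospital"), D)
far = lch(path_length_in_nodes(graph, "Doctor", "Music"), D)
```

Doctor and Hospital are three nodes apart and Doctor and Music six, so the first pair receives the higher score; a path as long as 2D would score zero.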
Strube and Ponzetto remark that WordNet's sense proliferation was responsible for its
poor performance on WS-353. For example, when computing the relatedness of jaguar
and stock, the latter is interpreted in the sense of animals kept for use or profit rather than
in the sense of market, which people find more intuitive. WordNet's fine sense
granularity has also been criticized in word sense disambiguation (Section 3.2.1). The
overall conclusion is that Wikipedia can serve AI applications in the same way as
hand-crafted knowledge resources.
Zesch et al. [2007] perform similar experiments with the German Wikipedia, which
they compare to GermaNet on three datasets including the translated M&C. The
performance of Wikipedia-based measures was inconsistent, and, like Strube and Ponzetto
[2006], they obtained best results by combining evidence from GermaNet and Wikipedia.
Ponzetto and Strube [2007a] investigate whether performance on Wikipedia-based
relatedness measures changes as Wikipedia grows. After comparing February 2006,
September 2006 and May 2007 versions they conclude that the relatedness measure is
robust. There was no improvement, probably because new articles were unrelated to the
words in the evaluation datasets. A Java API is available for those wishing to experiment
with these techniques [Ponzetto and Strube 2007c].¹¹
Gabrilovich and Markovitch [2007] develop Explicit Semantic Analysis (ESA) as an
alternative to the well-known Latent Semantic Analysis. They use a centroid-based
classifier to map input text to a vector of weighted Wikipedia articles. For example, for
Bank of Amazon the vector contains Amazon River, Amazon Basin, Amazon Rainforest,
Amazon.com, Rainforest, Atlantic Ocean, Brazil, etc. To compute semantic relatedness
between two terms, they compute the cosine similarity of their vectors. This significantly
outperforms Latent Semantic Analysis on WS-353, with an average correlation of 0.75.
With the same technique, the Open Directory Project¹² achieves a 0.65 correlation,
indicating that Wikipedia's quality is greater. The mapping developed in this work has
been successfully utilized for text categorization (Section 4.4).
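The final step, comparing two concept vectors by cosine similarity, can be sketched as follows. The vectors and their weights are invented for illustration; in ESA they would come from the classifier that maps text to Wikipedia articles, which is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {concept: weight}."""
    dot = sum(weight * v[concept] for concept, weight in u.items() if concept in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical concept vectors; the weights are made up.
bank_of_amazon = {"Amazon River": 0.9, "Amazon Rainforest": 0.8,
                  "Amazon.com": 0.5, "Brazil": 0.4}
rainforest_text = {"Amazon Rainforest": 0.7, "Rainforest": 0.6, "Brazil": 0.3}
stock_text = {"Stock market": 0.9, "Bank": 0.5}

related = cosine(bank_of_amazon, rainforest_text)
unrelated = cosine(bank_of_amazon, stock_text)
```

Texts that activate overlapping Wikipedia articles score above zero, while texts with no articles in common score exactly zero.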
While Gabrilovich and Markovitch [2007] use the full text of Wikipedia articles to
establish relatedness between terms, Milne [2007] analyses just the internal hyperlinks
that appear, arguing that Wikipedia's link structure bears much significant information
about concepts. To compute the relatedness between two terms they are first mapped to
corresponding Wikipedia articles, and then vectors are created containing the links to
other Wikipedia articles that occur in these articles. For example, a sentence like Bank of
America is the largest commercial <bank> in the <United States> by both <deposits>
and <market capitalization> contributes four links to the vector. Each link is weighted
by the inverse number of times it is linked from other Wikipedia articles: the less
common the link, the higher its weight. For example, market capitalization receives
higher weight than United States and thus contributes more to the semantic relatedness.
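The weighting scheme can be sketched directly from this description. The in-link counts below are invented, and the plain reciprocal is the simplest reading of "inverse number of times"; the original work may scale the weight differently.

```python
def link_vector(outgoing_links, inlink_counts):
    """Weight each outgoing link by the inverse of its target's in-link count."""
    return {target: 1.0 / inlink_counts[target] for target in outgoing_links}

# Hypothetical counts of how many articles link to each target.
inlink_counts = {
    "Bank": 2000,
    "United States": 400000,
    "Deposit account": 300,
    "Market capitalization": 1500,
}

# The four links contributed by the example sentence about Bank of America.
bank_of_america = link_vector(
    ["Bank", "United States", "Deposit account", "Market capitalization"],
    inlink_counts,
)
```

A link to a heavily cited article such as United States says little about the source article, so it receives a far smaller weight than the link to Market capitalization.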
Disambiguation is a serious challenge for this technique. Strube and Ponzetto [2006]
choose the most likely meaning from the order in which entries occur in Wikipedia's
disambiguation pages; Gabrilovich and Markovitch [2007] avoid disambiguation entirely
by simultaneously associating a term with several Wikipedia articles. However, Milne's
[2007] approach hinges upon correct mapping of terms to Wikipedia articles. When terms
are manually disambiguated, a correlation of 0.72 is achieved for WS-353. Automatic
disambiguation that simply selects whatever meaning produces the greatest similarity
score achieves only 0.45, showing that unlikely senses often produce greater similarity than
common ones.
¹¹ http://www.eml-r.org/english/research/nlp/download/jwordnetsimilarity.php
Milne and Witten [2008a] disambiguate term mappings automatically using three
features. One is the conditional probability of the sense given the term, according to the
Wikipedia corpus (discussed further in Section 3.2.1). For example, the term leopard
most often links to the animal description rather than the eponymous Mac operating
system. They also analyze how commonly two terms appear in Wikipedia as a
collocation. Finally, they replace the vector-based similarity metric described above by a
measure inspired by Cilibrasi and Vitanyi's [2002] Normalized Google Distance, which
is based on term occurrences in web pages, but using Wikipedia's links rather than
Google's search results. The semantic similarity of two terms is determined by the sum
of these three values: conditional probability, collocation and similarity.
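As a rough illustration of the third feature, the sketch below applies the Normalized Google Distance formula to sets of incoming links instead of search hit counts. The exact form used by the authors is not given in the text, so this adaptation, the function name `link_distance` and the toy numbers are all assumptions.

```python
import math

def link_distance(inlinks_a, inlinks_b, total_articles):
    """Normalized-Google-Distance-style measure over incoming link sets.

    inlinks_a, inlinks_b: sets of articles that link to each article.
    Returns 0.0 for identical link sets and infinity when no article
    links to both.
    """
    common = inlinks_a & inlinks_b
    if not common:
        return float("inf")
    larger = max(len(inlinks_a), len(inlinks_b))
    smaller = min(len(inlinks_a), len(inlinks_b))
    return ((math.log(larger) - math.log(len(common)))
            / (math.log(total_articles) - math.log(smaller)))

# Hypothetical in-link sets, identified by article number.
a = {1, 2, 3, 4}
b = {3, 4, 5}
c = {7, 8}

close = link_distance(a, b, total_articles=1000)
apart = link_distance(a, c, total_articles=1000)
```

Lower values mean more closely related articles. Only link sets are needed, which is why a link-based measure is so much cheaper than one that processes the full article text.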
This technique achieves 0.69 correlation with human judgments on WS-353, not far
off Gabrilovich and Markovitch's [2007] figure for ESA. However, it is far less
computationally intensive because only links are analyzed, not the entire Wikipedia text.
Further analysis of the results shows that performance is even higher on terms that are
well defined in Wikipedia.
Table 2 summarizes the results of the similarity metrics that we have described, using
the same datasets and evaluation technique. ESA is best, with WLM not far behind and
WikiRelate the lowest. The astonishingly high correlation with human performance that
these techniques obtain was well out of reach in pre-Wikipedia days. This is an
important advance because, as we will see when discussing information retrieval and
extraction, automatic computation of semantic similarity helps with many natural
language processing tasks.
3.2 Word sense disambiguation
Techniques for word sense disambiguation (i.e., resolving polysemy) use a dictionary
or thesaurus that defines the inventory of possible senses [Ide and Veronis 1998].
Wikipedia provides an alternative resource. Each article describes a concept that is a
possible sense for words and phrases that denote it, whether by redirection, via a
disambiguation page, or as anchor text that links to the article.
The terms to be disambiguated may either appear in plain text or in an existing
knowledge base (thesaurus or ontology). The former situation is more complex because
the context is less clearly defined. Consider the example in Figure 3. Even human readers
cannot be sure of the intended meaning of wood from the sentence alone, but a diagram
showing semantically related words in WordNet puts it into context and makes it clear
that the meaning is the trees and other plants in a large densely wooded area, rather
than the hard fibrous lignified substance under the bark of trees. This highlights the
main idea behind disambiguation: identify the context and analyze which of the possible
senses fits it best.
                                                                                                                                                       
¹² http://www.dmoz.org
We first cover techniques for disambiguating phrases in text to Wikipedia articles,
then examine the important special case of named entities, and finally show how
disambiguation is used to map manually created knowledge structures to Wikipedia.
3.2.1. Disambiguating phrases in running text: Discovering the intended senses of
words and phrases is an essential stage in every natural language application, otherwise
full "understanding" cannot be claimed. WordNet is a popular resource for word sense
disambiguation, but success has been mixed [Voorhees 1998]. One reason is that the task
is demanding because "linguistic [disambiguation] techniques must be essentially perfect
to help" [Voorhees 1998]; another is that WordNet defines word senses with such fine
granularity that even human annotators struggle to differentiate them [Edmonds and
Kilgarriff 1998]. The two are related, because fine sense granularity makes disambiguation
more difficult. In contrast, Wikipedia defines only those senses on which its contributors
reach consensus, and includes an extensive description of each one rather than WordNet's
brief gloss. Substantial advances have been made since it was discovered as a resource for
disambiguation.
Mihalcea [2007] uses Wikipedia articles as a source of sense-tagged text to form a
training corpus for supervised disambiguation. The evaluation follows the methodology
developed by SIGLEX, the Association for Computational Linguistics' Special Interest
Group on the Lexicon.¹³ For each example word, its occurrences as link anchors in
Wikipedia are collected. For example, the term bar is linked to bar (establishment) and
bar (music), each of which corresponds to a WordNet synset, that is, a set of synonymous
terms representing a particular meaning of bar. The results show that a machine learning
approach trained on Wikipedia sentences in which both meanings of bar occur clearly
outperforms two simple baselines.
Method                                    M&C    R&G    WS-353
WordNet [Strube and Ponzetto, 2006]       0.82   0.86   full: 0.36; test: 0.38
WikiRelate! [Ponzetto and Strube, 2007]   0.49   0.5    full: 0.49; test: 0.62
ESA [Gabrilovich and Markovitch, 2007]    0.73   0.82   0.75
WLVM [Milne, 2007]                        n/a    n/a    man: 0.72; auto: 0.45
WLM [Milne and Witten, 2008]              0.70   0.64   0.69
Table 2. Overview of semantic relatedness methods.
This work uses Wikipedia solely as a resource to disambiguate words or phrases into
WordNet synsets. Mihalcea and Csomai [2007] go further, using Wikipedia's content as
a sense inventory in its own right. They disambiguate terms (words or phrases) that
appear in plain text to Wikipedia articles, concentrating exclusively on "important"
concepts. They call this process wikification because it simulates how Wikipedia authors
manually insert hyperlinks when writing articles. There are two stages: extraction and
disambiguation. In the first, terms that are judged important enough to be highlighted as
links are identified in the text. Only terms occurring at least five times in Wikipedia are
considered, and the likelihood of a term being a hyperlink is estimated by expressing the
number of articles in which a given word or phrase appears as anchor text as a proportion
of the total number of articles in which it appears. All terms whose likelihood exceeds a
predefined threshold are chosen, which yields an F-measure of 5% on a subset of
manually annotated Wikipedia articles.
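The extraction stage reduces to a ratio and a threshold, as the following sketch shows. The counts and the threshold value are invented, and using the article count as a stand-in for the "at least five occurrences" filter is a simplifying assumption.

```python
def link_probability(articles_as_anchor, articles_containing):
    """Share of the articles mentioning a term in which it is a link anchor."""
    return articles_as_anchor / articles_containing

def extract_link_candidates(counts, threshold, min_occurrences=5):
    """Keep terms that are frequent enough and likely enough to be links.

    counts: {term: (articles where it is anchor text, articles where it appears)}
    """
    chosen = []
    for term, (as_anchor, total) in counts.items():
        if total >= min_occurrences and link_probability(as_anchor, total) > threshold:
            chosen.append(term)
    return sorted(chosen)

# Hypothetical counts for four terms found in a document.
counts = {
    "bar": (300, 4000),
    "Thailand": (5000, 9000),
    "popular": (2, 60000),
    "Jenga": (3, 4),
}
selected = extract_link_candidates(counts, threshold=0.05)
```

Common words such as popular appear in many articles but are almost never linked, so they fall below the threshold, while rare terms are excluded by the frequency filter before the ratio is even considered.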
In the second stage these terms are disambiguated to Wikipedia articles that capture
the intended sense. For example, in the sentence Jenga is a popular beer in the bars of
Thailand the term bar corresponds to the bar (establishment) article. Given a term, those
articles for which it is used as anchor text in Wikipedia are candidate senses. Best
results are achieved by a machine learning approach in which Wikipedia's
already-annotated articles serve as training data. Features (like part-of-speech tag, local context of
three words to the left and right, and their part-of-speech tags) are computed for each
ambiguous term that appears as anchor text of a hyperlink. A Naïve Bayes classifier is
then applied to disambiguate unseen terms. Csomai and Mihalcea [2007] report an
F-measure of 87.7% on 6,500 examples, and go on to demonstrate that linking educational
material to Wikipedia articles in this manner improves the quality of knowledge that
people acquire when reading the material, and decreases the time taken.
                                                                                                                                                       
¹³ http://www.senseval.org

He could see wood around the house.
Figure 3. What is the meaning of wood in both examples?
In a parallel development, Wang et al. [2007] use a fixed-length window to identify
terms in a document that match the titles of Wikipedia articles, eliminating matches
subsumed by longer ones. They disambiguate the matches using two methods. One
works on a document basis, seeking those articles that are most similar to the original
document according to the standard cosine metric between TF×IDF-weighted word
frequency vectors. The second works on a sentence basis, computing the shortest distance
between the candidate articles for a given ambiguous term and articles corresponding to
any non-ambiguous terms that appear in the same sentence. The distance metric is 1 if
the two articles link to each other; otherwise it is the number of nodes along the shortest
path between two Wikipedia categories to which they belong, normalized by the
maximum depth of the category taxonomy. The result is the average of the two
techniques (if no unambiguous articles are available, the similarity technique is applied
by itself). Wang et al. do not compare this method to other disambiguation techniques
directly. They do, however, report the performance of text categorization before and after
synonyms and hyponyms of matching Wikipedia articles, and their related terms, were
added to the documents. The findings were mixed, and somewhat negative.
Medelyan et al. [2008] use Mihalcea and Csomai's [2007] wikification strategy with
a different disambiguation technique. Document terms with just one match are
unambiguous, and their corresponding articles are collected and used as "context articles"
to disambiguate the remaining terms. This is done by determining the average semantic
similarity of each candidate article to all context articles identified for the document. The
semantic similarity of a pair of articles is obtained from their incoming links as described
by Milne and Witten [2008a] (see Section 3.1). Account is also taken of the conditional
probability of a sense given the term, according to the Wikipedia corpus (proposed by
Mihalcea and Csomai [2007] as a baseline). For example, the term jaguar links to the
article Jaguar cars in 466 out of 927 cases, thus its conditional probability is 0.5. The
resulting mapping is the one with the largest product of semantic similarity and
conditional probability. This achieves an F-measure of 93% on 17,500 mappings in
manually annotated Wikipedia articles.
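The selection rule, choosing the candidate with the largest product of conditional probability and average similarity to the context, can be sketched as follows. Only the 466 out of 927 figure for Jaguar cars comes from the text; assigning the remaining 461 links to a single animal sense and the similarity values are invented to show a context about wildlife overriding the more frequent sense.

```python
def disambiguate(sense_counts, avg_similarity):
    """Pick the sense maximizing P(sense | term) * average context similarity.

    sense_counts:   {candidate article: times the term links to it}
    avg_similarity: {candidate article: mean similarity to the context articles}
    """
    total = sum(sense_counts.values())
    scores = {
        sense: (count / total) * avg_similarity[sense]
        for sense, count in sense_counts.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# 466 of 927 links go to Jaguar cars (from the text); the split of the
# remaining 461 links is an assumption.
sense_counts = {"Jaguar cars": 466, "Jaguar": 461}

# Hypothetical similarities for a document about rainforest wildlife.
avg_similarity = {"Jaguar cars": 0.1, "Jaguar": 0.6}

best, scores = disambiguate(sense_counts, avg_similarity)
```

Although the car maker is marginally the more common sense overall, the context pulls the decision towards the animal, which is the behavior the product of the two factors is meant to achieve.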
Milne and Witten [2008b] extend this approach using machine learning. Rather than
extracting terms and then disambiguating them, they allow a term's possible mappings
to influence whether it should be adjudged an important concept for the document.
Conditional probability of a mapping, its semantic similarity to other context articles,
and other features are combined in a machine learning classifier, bagged decision trees,
which determines a probability figure for each mapping. More than one Wikipedia article
can be chosen for a given document term, which improves recall at the expense of a slight
decrease in precision, raising the F-measure from 93% to 97% on the same data.
3.2.2. Disambiguating named entities: Phrases referring to named entities, which are
proper nouns such as geographical and personal names, and titles of books, songs and
movies, contribute the largest part of our vocabulary. Wikipedia is recognized as the
largest available resource of such entities. It has become a platform for discussing current
news, and contributors put issues into encyclopedic context by relating them to historical
events, geographic locations and significant personages, thereby increasing the coverage of
named entities. Here we describe three approaches that focus specifically on linking
named entities appearing in text or in search queries to corresponding Wikipedia articles.
Techniques for recognizing named entities in Wikipedia itself are summarized in Section
5.3.
Bunescu and Paşca [2006] disambiguate named entities in search queries in order to
group search results by the corresponding senses. They first create a dictionary of 500,000
entities that appear in Wikipedia, and add redirects and disambiguated names to each
one. If a query contains a term that corresponds to two or more entries, they choose the
one whose Wikipedia article has the greatest cosine similarity with the query. If the
similarity scores are too low they use the category to which the article belongs instead of
the article itself. If even this falls below a predefined threshold they assume that no
mapping is available. The reported accuracies are between 55% and 85% for members of
Wikipedia's People by occupation category, depending on the model and experimental
data employed.
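The selection-with-fallback logic just described can be sketched as follows. This is a minimal illustration rather than Bunescu and Paşca's implementation: the function names, the plain term-frequency vectors, the two threshold values and the toy candidate texts are all assumptions made for the example.

```python
import math
from collections import Counter

def bag(text):
    """Lower-cased bag-of-words vector for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_match(query_bag, texts):
    """Return (entity, score) for the candidate text most similar to the query."""
    return max(((e, cosine(query_bag, bag(t))) for e, t in texts.items()), key=lambda p: p[1])

def disambiguate(query, candidates, article_min=0.2, category_min=0.1):
    """candidates maps entity -> (article_text, category_text).
    Both thresholds are illustrative values, not those of the original system."""
    q = bag(query)
    entity, score = best_match(q, {e: art for e, (art, _) in candidates.items()})
    if score >= article_min:
        return entity
    # Article evidence is too weak: fall back to the category text.
    entity, score = best_match(q, {e: cat for e, (_, cat) in candidates.items()})
    if score >= category_min:
        return entity
    return None  # no mapping is assumed to be available

# Toy candidates (invented texts) for the ambiguous name "John Williams".
candidates = {
    "John Williams (composer)": ("american composer of film scores and conductor",
                                 "film score composers"),
    "John Williams (guitarist)": ("australian classical guitarist",
                                  "classical guitarists"),
}
print(disambiguate("john williams film scores", candidates))
```

The second stage is only consulted when no article clears the first threshold, which mirrors the order of the three cases in the description above.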
Cucerzan [2007] identifies and disambiguates named entities in text. Like Bunescu
and Paşca [2006], he first extracts a vocabulary from Wikipedia. It is divided into two
parts, the first containing surface forms and the second the associated entities, along with
contextual information about them. The surface forms are titles of articles, redirects, and
disambiguation pages, and anchor text used in links. This yields 1.4 million entities,
with an average of 2.4 surface forms each. Further <named entity, tag> pairs are extracted
from Wikipedia list pages (e.g., Texas (band) receives a tag LIST_band name
etymologies, because it appears in the list with this title), yielding a further 540,000
entries. Categories assigned to Wikipedia articles describing named entities serve as tags
too, yielding 2.65 million entries. Finally a context for each named entity is collected
(e.g., parenthetical expressions in its title, phrases that appear as link anchors in the
article's first paragraph, etc.), yielding 38 million <named entity, context> pairs.
To identify named entities in text, capitalization rules indicate which phrases are
surface forms of named entities. Co-occurrence statistics generated from the web by a
search engine help to identify boundaries between them (e.g., Whitney Museum of
American Art is a single entity, whereas Whitney Museum in New York contains two).
Lexical analysis is used to collate identical entities (e.g., Mr. Brown and Brown), and
entities are tagged with their type (e.g., location, person) based on statistics collected
from manually annotated data. Disambiguation is performed by comparing the similarity
of the document in which the surface form appears with Wikipedia articles that represent
all named entities that have been identified in it, and their context terms, and choosing
the best match. Cucerzan [2007] achieves 88% accuracy on 5,000 entities appearing in
Wikipedia articles, and 91% on 750 entities appearing in news stories.
Kazama and Torisawa [2007] recognize and classify entities but do not disambiguate
them. Their work resembles the methods described above. Given a sentence, their goal is
to extract all n-grams representing Wikipedia articles that correspond to a named entity
and assign a type to each. For example, in the sentence Rare Jimi Hendrix song draft
sells for almost $17,000 they identify Jimi Hendrix as an entity of type musician. To
determine the type they extract the first noun phrase following the verb to be from the
Wikipedia article's first sentence, excluding phrases like kind of and type of (e.g., guitarist
in Jimi Hendrix was a guitarist). Recognition is a supervised tagging process based on
standard features such as surface form and part of speech tag, augmented with category
labels extracted from Wikipedia and a gazetteer. An F-measure of 88% was achieved on a
standard set of 100 training and 20 development and testing documents.
Cucerzan [2007] and Kazama and Torisawa [2007] report similar performance, while
Bunescu and Paşca's [2006] results seem slightly worse. However, comparison is
unreliable because different datasets are used. Accuracy also depends on the type of the
named entity.
3.2.3. Disambiguating thesaurus and ontology terms: Wikipedia's category and link
structure contains the same kind of information as a domain-specific thesaurus, as
illustrated by Figure 4, which compares it to the agricultural thesaurus Agrovoc [1995].
Whereas in Section 3.1.2 Wikipedia is used as an independent knowledge base, it can
also be used to extend and improve existing resources. For example, if it were known
that cardiovascular system and circulatory system in Figure 4 refer to the same concept,
the synonym blood circulation could be added to Agrovoc. The major problem is to
establish a mapping between Wikipedia and other resources, disambiguating situations
that support multiple mappings.
Ruiz-Casado et al. [2005] map Wikipedia articles to WordNet. They work with the
Simple Wikipedia,[14] a reduced version that contains easier words and shorter sentences,
intended for people learning English. WordNet synsets cluster word senses so that
homonyms can be identified. If a Wikipedia article matches several WordNet synsets, the
appropriate one is chosen by computing the similarity between the Wikipedia entry
word-bag and the WordNet synset gloss. This technique achieves 84% accuracy when
dot product similarity of stemmed word vectors is applied. The problem is that as
Wikipedia grows, so does ambiguity. For instance, even the Simple Wikipedia contains
the article Cats (musical), which is absent from WordNet. The mapping technique must
be able to deal with absent items as well as polysemy in both resources.
[14] http://simple.wikipedia.org

Figure 4. Comparison of organization structure in Agrovoc and Wikipedia.

Overell and Rüger [2006] disambiguate place names mentioned in Wikipedia to
locations in gazetteers. Instead of semantic similarity they develop geographically-based
disambiguation methods. One seeks a minimum bounding box enclosing the location
being disambiguated and other place names that are mentioned in the same context, using
geographical coordinates from the gazetteer. Another analyzes the place name's referent;
for example, if the surface form Ontario is mapped to Ontario, Canada, then London,
Ontario can be mapped to London, Canada. Best results were achieved by combining
the minimum bounding box method with "importance," measured by population size.
An F-measure of 80% was achieved on a test set with 1,70 locations and 12,275
non-locations.
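The minimum bounding box idea can be illustrated with a short sketch. It assumes that the context place names have already been resolved to coordinates, measures box size naively in squared degrees, and ignores longitude wrap-around; the function names and the approximate coordinates are chosen only for this example and are not taken from Overell and Rüger's system.

```python
def bounding_box_area(points):
    """Area (in squared degrees) of the smallest lat/lon box enclosing all points."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return (max(lats) - min(lats)) * (max(lons) - min(lons))

def resolve_place(candidates, context_points):
    """Pick the candidate whose box, together with the context locations, is smallest.
    candidates maps a gazetteer entry to (lat, lon); context_points is a list of (lat, lon)."""
    return min(candidates,
               key=lambda name: bounding_box_area([candidates[name]] + context_points))

# Approximate coordinates, for illustration only.
candidates = {"London, UK": (51.5, -0.1), "London, Ontario": (43.0, -81.2)}
context = [(43.7, -79.4), (45.4, -75.7)]  # Toronto and Ottawa mentioned nearby
print(resolve_place(candidates, context))
```

A fuller version would break ties, or re-weight candidates, by population size, which is the "importance" signal mentioned above.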
Overell and Rüger [2007] extend this approach by creating a co-occurrence model for
each place name. They map place names to Wikipedia articles, collect their redirects as
synonyms, and gather the anchor text of links to these articles. This yields different ways
of referring to the same place, e.g., {Londinium → London} and {London, UK →
London}. Next they collect evidence from Wikipedia articles: geographical coordinates,
and location names in subordinate categories. They also mine Placeopedia, a mash-up
website that connects Wikipedia with Google Maps. Together, these techniques
recognize 75% of place names and map them to geographical locations with an accuracy
of between 78% and 90%.
Milne et al. [2007] investigate whether domain-specific thesauri can be obtained from
Wikipedia for use in natural language applications within restricted domains, comparing
it with Agrovoc, a manually built agricultural thesaurus. On the positive side, Wikipedia
article titles cover the majority of Agrovoc terms that were chosen by professional
indexers as index terms for an agricultural corpus, and its redirects correspond closely
with Agrovoc's synonymy relation. However, neither category relations nor (mutual)
hyperlinks between articles correspond well with Agrovoc's taxonomic relations. Instead
of extracting new domain-specific thesauri from Wikipedia they examine how existing
ones can be improved, using Agrovoc as a case study [Medelyan and Milne 2008]. Given
an Agrovoc descriptor, they collect semantically related terms from the Agrovoc hierarchy
as context terms and map each one to the Wikipedia article whose conditional
probability (as explained in Section 3.2.1) is greatest. Then they compute the semantic
similarity of each candidate mapping to this set of context articles. Manual evaluation of a
subset with 40 mappings shows an average accuracy of 92%. The results are slightly
better if there are fewer than four possible mappings and remain stable at 88% if there are
ten or more.
Medelyan and Legg [2008] map terms from the Cyc ontology to Wikipedia articles
using the disambiguation approach proposed by Medelyan and Milne [2008]. However,
since they draw on the Cyc ontology as part of their disambiguation, and the project can
be viewed as a large-scale "ontology alignment", discussion of it is postponed to
Section 6.5.
There is still far less research on word sense disambiguation using Wikipedia than for
WordNet. However, significant advances have been made, and over the last two years the
accuracy of mapping documents to relevant Wikipedia articles has improved by one third
[Milne and Witten 2008]. Other researchers (such as Wang et al. [2007]) use word sense
disambiguation as a part of an application but do not provide an intrinsic evaluation.
Furthermore, for fair comparison the same version of Wikipedia and the same training and
test set should be used, as has been done for WordNet by SIGLEX (Senseval, cited
earlier). Evaluation of named entity extraction is even more complex, with each research
group concentrating on different types of entity, e.g., persons or places. Here, extrinsic
evaluations may be helpful, e.g., performance on a particular task such as question
answering, before and after integration with Wikipedia. The next section describes an
extrinsic evaluation of Wikipedia for co-reference resolution and compares the results with
WordNet.
3.3 Co-reference resolution 
Natural language understanding tasks such as textual entailment and question answering
involve co-reference resolution: identifying which text entities refer to the same concept.
Unlike word sense disambiguation, it is not necessary to determine the actual meaning of
these entities, but merely to identify their connection. Consider the following example from
Wikipedia's article on New Zealand:
    Elizabeth II, as the Queen of New Zealand, is the Head of State and, in
    her absence, is represented by a non-partisan Governor-General. The Queen
    "reigns but does not rule." She has no real political influence, and her position
    is essentially symbolic. [emphasis added]
Without knowing that Elizabeth II and the Queen refer to the same entity, which can be
referred to by the pronouns she and her, the information that can be inferred from this
paragraph is limited. To resolve the highlighted co-referent expressions requires linguistic
knowledge and world knowledge: that Elizabeth II is the Queen, and female. Current
methods often derive semantic relations from WordNet or mine large corpora using
lexical patterns such as X is a Y and Y such as X. The task can be modeled as a binary
classification problem (to determine, for each pair of entities, whether they co-refer or
not) and addressed using machine learning techniques, with features such as whether
they are semantically related, the distance between them, and agreement in number and
gender.
The use of Wikipedia for these tasks has been explored in two ways. Ponzetto and
Strube [2006a, 2007] analyze its hyperlink structure and text to extract semantic features,
whereas Yang and Su [2007] use it as a large semi-structured corpus for mining lexical
patterns. The two are easy to compare because both use test data from the Message
Understanding Conference organized by NIST.
Ponzetto and Strube's [2006, 2007a] main goal is to show that Wikipedia can be
used as a fully-fledged lexical and encyclopedic resource, comparable to WordNet but far
more extensive. While their work on semantic relatedness (Section 3.1) evaluates
Wikipedia intrinsically, co-reference is evaluated extrinsically to demonstrate
Wikipedia's utility. As a baseline they re-implement Soon et al.'s [2001] method with a
set of standard features, such as whether the two entities share the same grammatical
feature, or belong to the same WordNet class. Additional features mined from WordNet
and Wikipedia are evaluated separately. The WordNet features for two given terms A,
e.g., Elizabeth II, and B, e.g., Queen, are:
• The highest similarity score from all synset pairs to which A and B belong
• The average similarity score.
The Wikipedia analogue to these two features,
• The highest similarity score from all Wikipedia categories to which A and B
belong
• The average similarity score,
is augmented by further features:
• Does the first paragraph of the Wikipedia article describing A mention B?
• Does any hyperlink in A's article target B?
• Does the list of categories for A's article contain B?
• What is the overlap between the first paragraphs of the articles for A and B?
The similarity and relatedness scores are computed using various metrics. Feature
selection is applied during training to remove irrelevant features for each scenario. The
results are included in Table 3, which we will discuss shortly.
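The four additional Wikipedia features listed above are simple to compute once an article's first paragraph, outgoing links and categories are available. The sketch below is illustrative only: the Article record, the substring test used for the category feature, and the use of Jaccard overlap for the last feature are assumptions, since the description does not fix these details.

```python
from dataclasses import dataclass, field

@dataclass
class Article:
    """Hypothetical minimal view of a Wikipedia article."""
    title: str
    first_paragraph: str
    links: set = field(default_factory=set)       # titles of linked articles
    categories: set = field(default_factory=set)  # category names

def tokens(text):
    return set(text.lower().split())

def pair_features(a, b):
    """Wikipedia-derived features for deciding whether mentions A and B co-refer."""
    ta, tb = tokens(a.first_paragraph), tokens(b.first_paragraph)
    union = ta | tb
    return {
        "first_par_mentions": b.title.lower() in a.first_paragraph.lower(),
        "link_targets": b.title in a.links,
        # Assumption: "contains" is read as a case-insensitive substring match.
        "category_contains": any(b.title.lower() in c.lower() for c in a.categories),
        # Assumption: overlap is measured as Jaccard similarity of word sets.
        "first_par_overlap": len(ta & tb) / len(union) if union else 0.0,
    }

# Toy records with invented content.
a = Article("Elizabeth II", "Elizabeth II is the Queen of New Zealand",
            links={"Queen", "New Zealand"}, categories={"British monarchs"})
b = Article("Queen", "A queen is a female monarch")
print(pair_features(a, b))
```

Each feature would become one input to the binary co-reference classifier, alongside the baseline features and the category-based similarity scores.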
                                                NWIRE               BNEWS
                                            R     P     F       R     P     F
Ponzetto and Strube     baseline          56.3  86.7  68.3    50.5  82.0  62.5
[2006, 2007a]           +WordNet          62.4  81.4  70.7    59.1  82.4  68.8
                        +Wikipedia        60.7  81.8  69.7    58.3  81.9  68.1
Yang and Su [2007]      baseline          54.5  80.3  64.9    52.7  75.3  62.0
                        +sem. related.    57.4  80.8  67.1    54.0  74.7  62.7
Table 3. Performance comparison of two independent techniques on the same datasets.

Yang and Su [2007] utilize Wikipedia in a different way, assessing semantic
relatedness between two entities by analyzing their co-occurrence patterns in Wikipedia.
(Pattern matching using the Wikipedia corpus is practiced extensively in information
extraction, as described in Section 5.) The patterns are evaluated based on positive
instances in the training data that serve as seeds. For example, given the pair of
co-referents Bill Clinton and president, and Wikipedia sentences like Bill Clinton is elected
President of the United States and The US president, Mr Bill Clinton, the patterns [X is
elected Y] and [Y, Mr X] are extracted. Sometimes patterns occur in structured parts of
Wikipedia like lists and infoboxes; for example, in United States | Washington, D.C.,
the bar symbol is the pattern. An accuracy measure is used to eliminate patterns that are
frequently associated with both negative and positive pairs. Yang and Su [2007] found
that using the 10 most accurate patterns as features did not improve performance over the
baseline. However, adding a single feature representing semantic relatedness between the
two entities did improve results. Yang and Su use mined patterns to assess relatedness
by multiplying together two measures of reliability: the strength of association between
each positive seed pair and the pointwise mutual information between the entities
occurring with the pattern and by themselves.
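The pattern extraction step can be sketched for the Bill Clinton example. This is an illustrative simplification and not Yang and Su's code: it handles only the first occurrence of each mention, matches case-insensitively, and the accuracy formula (positive matches divided by all matches) is an assumed reading of the "accuracy measure" mentioned above.

```python
def extract_pattern(sentence, x, y):
    """Replace the two seed mentions with X and Y and keep the text between them.
    Returns None if either mention is missing from the sentence."""
    low = sentence.lower()
    i, j = low.find(x.lower()), low.find(y.lower())
    if i < 0 or j < 0:
        return None
    if i < j:
        return "X" + sentence[i + len(x):j] + "Y"
    return "Y" + sentence[j + len(y):i] + "X"

def pattern_accuracy(positive_matches, negative_matches):
    """Assumed accuracy measure: share of a pattern's matches that are co-referent pairs."""
    total = positive_matches + negative_matches
    return positive_matches / total if total else 0.0

print(extract_pattern("Bill Clinton is elected President of the United States",
                      "Bill Clinton", "president"))
print(extract_pattern("The US president, Mr Bill Clinton", "Bill Clinton", "president"))
```

Patterns whose accuracy is low, because they match co-referent and non-co-referent pairs about equally often, would be discarded before any relatedness score is computed.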
Table 3 shows the results that both sets of authors report for co-reference resolution.
They use the same baseline, but the implementation was evidently slightly different, for
Ponzetto and Strube's yielded a slightly improved F-measure. Ponzetto and Strube's
results when features were added from WordNet and Wikipedia are remarkably similar,
with no statistical difference between them. These features decrease precision over the
baseline on NWIRE by 5 points but increase recall on both datasets, yielding a
significant overall gain (1.5 to 2 points on NWIRE and 6 points on BNEWS). Yang and
Su improve the F-measure on NWIRE and recall on BNEWS by 2 points. Overall, it
seems that Ponzetto and Strube's technique performs slightly better.
These co-reference resolution systems are quite complex, which may explain why no
other methods have been described in the literature. We expect further developments in
this area.
3.4 Multilingual alignment 
In 2006, five years after its inception, Wikipedia contained 100,000 articles for eight
different languages. The closest precedent to this unique multilingual resource is the
commercial EuroWordNet, which unifies seven different languages but covers a far smaller
set of concepts: 8,000 to 44,000, depending on the language [Vossen et al. 1997]. Of
course, multilingual vocabularies and aligned corpora benefit any application that
involves machine translation.
Adafre and de Rijke [2006] began by generating parallel corpora in order to identify
similar sentences (those whose information overlaps significantly) in English and
Dutch. First they used a machine translation tool to translate Wikipedia articles and
compared the result with the corresponding manually written articles in that language.
Next they generated a bilingual lexicon from links between articles on the same topic in
different languages, and determined sentence similarity by the number of shared lexicon
entries. They evaluated these two techniques manually on 30 randomly chosen Dutch and
English Wikipedia articles. Both identified a rather small number of correct sentence
alignments: the machine translation had lower accuracy but higher coverage than the
lexicon approach. The authors ascribed the poor performance to the small size of the
Dutch version but were optimistic about Wikipedia's potential.
Ferrández et al. [2007] use Wikipedia for cross-language question answering (see
Section 4.3 for research on monolingual question answering). They identify named
entities in the query, link them to Wikipedia article titles, and derive equivalent
translations in the target language. Wikipedia's exceptional coverage of named entities
(Section 3.2.2) counters the main problem of cross-language question answering: low
coverage of the vocabulary that links questions to documents in other languages. For
example, the question In which town in Zeeland did Jan Toorop spend several weeks
every year between 1903 and 1924? mentions the entities Zeeland and Jan Toorop,
neither of which is covered by EuroWordNet. In an initial version of the system using
that resource, Zeeland remains unchanged and the phrase Jan Toorop is translated to
Enero Toorop because Jan is erroneously interpreted as January. With Wikipedia as a
reference, the translation is correct: ¿En qué ciudad de Zelanda pasaba varias semanas al
año Jan Toorop entre 1903 y 1924? With Wikipedia's help, Ferrández et al. increase the
percentage of correctly answered questions by 20%.
Erdmann et al. [2008] show that simply following language links in Wikipedia is
insufficient for a high-coverage bilingual dictionary. They develop heuristics based on
Wikipedia's link structure that extract significantly more translation pairs, and evaluate
them on a manually created test set containing terms of different frequency. Given a
Wikipedia article that has been translated into another language (the target article), they
augment the translated article name with redirects and also anchor text used to refer to the
article. Redirects are weighted by the proportion of links to the target article (including
all redirects) that use this particular redirect. Anchors are weighted similarly, by
expressing the number of links that use this particular anchor text as a proportion of the
total number of incoming links to the article. If a term appears as both redirect and anchor
text, the two weights are combined. The resulting dictionary contains all translation pairs
whose weight exceeds a certain threshold. This achieves significantly better results than a
standard dictionary creation approach using parallel corpora. Figure 5 shows the system
in action.
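The redirect and anchor weighting can be made concrete with a small sketch. The function name, the choice to combine the two weights by summing them, the weight of 1.0 given to the article title, the threshold value and the link counts are all assumptions for illustration; the description above does not specify how the weights are combined.

```python
from collections import Counter

def translation_candidates(title, redirect_links, anchor_links, total_links, threshold=0.1):
    """Weight alternative names for a target-language article.
    redirect_links / anchor_links map a term to the number of incoming links using it;
    total_links is the total number of incoming links, including those via redirects."""
    weights = Counter()
    weights[title] = 1.0  # assumption: the article title itself is always retained
    for term, n in redirect_links.items():
        weights[term] += n / total_links
    for term, n in anchor_links.items():
        weights[term] += n / total_links  # assumption: combine by summing
    return {term: w for term, w in weights.items() if w > threshold}

# Invented link counts for a German target article translating English "plant".
result = translation_candidates(
    "Pflanze",
    redirect_links={"Pflanzen": 10, "Kraut": 2},
    anchor_links={"Pflanzen": 20, "Flora": 4},
    total_links=100,
)
print(sorted(result))
```

Raising the threshold trades coverage for precision: rare redirects and one-off anchor texts drop out first.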
This section has demonstrated Wikipedia's immense potential as a repository of
linguistic knowledge for natural language processing. Impressive results have been
achieved, particularly on well-defined tasks such as determining semantic relatedness and
word sense disambiguation.
4. INFORMATION RETRIEVAL 
Given its utility for natural language processing, it is not surprising that Wikipedia has
also been used to organize documents and locate them. This section describes
applications of Wikipedia to information retrieval. These split roughly into searching
and browsing.
For searching, Wikipedia has been leveraged to gain a deeper understanding of both
queries and documents, and improve how they are matched to each other. Section 4.1
describes how it has been used to expand queries to allow them to return more relevant
documents, while Section 4.2 describes experiments in cross-language retrieval.
Wikipedia has also been used to retrieve specific portions of documents, such as answers
to questions (Section 4.3) or important topics (Section 4.4).
For browsing, the same Wikipedia-derived understanding has been used to
automatically organize documents into helpful groups. Section 4.5 shows how Wikipedia
has been applied to document classification, where documents are categorized under broad
headings like Sport and Technology. To a lesser extent it has also been used to determine
the main topics that documents discuss, so that they can be organized under more specific
tags (Section 4.6).
4.1 Query expansion 
Query expansion aims to improve queries by adding terms and phrases, such as
synonyms, alternative spellings, and closely related concepts. Such query reformulations
can be performed automatically, without the user's input, or interactively, where the
system suggests modifications that could be made.
Figure 5. Screen shot of automatically created translations for plant.

Milne et al. [2007] use Wikipedia to provide both forms of expansion in their
knowledge-based search engine Koru.[15] They first obtain a subset of Wikipedia articles
that are relevant for a particular document collection, and use the links between these to
build a corpus-specific thesaurus. Given a query they map its phrases onto topics in this
thesaurus. Figure 6 demonstrates how a query president bush controversy is mapped to
potentially relevant thesaurus topics (or Wikipedia articles) George H.W. Bush, George
W. Bush and Controversy. President Bush is initially disambiguated to the younger of
the two, because he occurs most often in the document set. This can be corrected
manually. The redirects from his article and that of Controversy are then mined for
synonyms and alternative spellings, such as Dubya and disagreement, and quotes are
added around multi-word phrases (such as Bush administration). This results in a
complex Boolean query such as an expert librarian might issue. The knowledge base
derived from Wikipedia was capable of recognizing and lending assistance to 95% of the
queries issued to it. Evaluation over the TREC HARD Track [Allan 2005] shows that
the expanded queries are significantly better than the original ones in terms of overall
F-measure.

[15] Demo at http://www.nzdl.org/koru
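The final query-construction step, in which each recognized topic contributes a group of synonyms joined by OR and the groups are joined by AND, can be sketched in a few lines. The function names and the shortened synonym lists are illustrative assumptions, not part of Koru itself.

```python
def quote(term):
    """Put quotes around multi-word phrases so they are matched as a unit."""
    return f'"{term}"' if " " in term else term

def boolean_query(topics):
    """topics: one list per recognized topic, holding its title and redirect-derived synonyms."""
    groups = ["(" + " OR ".join(quote(t) for t in synonyms) + ")" for synonyms in topics]
    return " AND ".join(groups)

print(boolean_query([
    ["George W. Bush", "Dubya", "Bush administration"],
    ["controversy", "disagreement"],
]))
```

Because every topic must match (AND) while any of its surface forms will do (OR), the expansion targets recall within a topic without loosening the query as a whole.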
Milne et al. also provided interactive query expansion by using the detected query
topics as starting points for browsing the Wikipedia-derived thesaurus. For example,
George Bush provides a starting point for locating related topics such as Dick Cheney,
Terrorism, and President of the United States. The evaluation of such exploratory search
provided little evidence that it assisted users. Despite this, the authors argue that
Wikipedia should be an effective base for this task, due to its extensive coverage and
inter-linking. This is yet to be proven, however: to our knowledge there are no other
examples of exploratory searching with Wikipedia.
Li et al. [2007] also use Wikipedia to expand queries, but focus on the most
problematic ones: those that traditional approaches fail to improve. The standard method
for improving queries, pseudo-relevance feedback, works by feeding terms from the
highest ranked documents back into the query [Ruthven and Lalmas 2003]. This works
well in general, so most of the state-of-the-art approaches are variants of this idea.
Unfortunately it makes bad queries even worse, because it relies on at least the top few
documents being relevant. Li et al. avoid this by using Wikipedia as an external corpus
to obtain additional query terms. They issue the query on Wikipedia to retrieve relevant
articles. They then use these articles' categories to group them, and rank articles so that
those in the largest groups appear more prominently. Forty terms are then picked from
the top 20 articles (it is unclear how they are selected) and added to the original query.
When tested on queries from TREC's 2005 Robust track [Allan 2005], this improved
those queries on which traditional pseudo-relevance feedback performs most poorly. It did
not perform as well as the state of the art in general, however. The authors attribute this
to differences in language and context between Wikipedia and the dated news articles used
for evaluation, which render many added terms irrelevant.
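One plausible reading of the category-based re-ranking step is sketched below. It is an assumption-laden illustration, not Li et al.'s algorithm: an article's group size is taken to be the size of its most populated category among the retrieved articles, and ties keep the original retrieval order.

```python
from collections import Counter

def rerank_by_category(articles):
    """articles: list of (title, categories) in the original retrieval order.
    Articles whose category is shared by many retrieved articles move up."""
    sizes = Counter(c for _, cats in articles for c in set(cats))

    def key(item):
        index, (_, cats) = item
        largest_group = max((sizes[c] for c in cats), default=0)
        return (-largest_group, index)  # bigger group first, then original rank

    return [title for _, (title, _) in sorted(enumerate(articles), key=key)]

# Invented retrieval result for an ambiguous query.
retrieved = [
    ("Jaguar (car)", ["Car makers"]),
    ("Jaguar", ["Felines"]),
    ("Leopard", ["Felines"]),
    ("Lion", ["Felines"]),
]
print(rerank_by_category(retrieved))
```

The intuition is that the dominant category among the retrieved articles signals the query's likely sense, so expansion terms are drawn from articles that agree with it.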
Where the previous two systems departed from traditional bag-of-words relevance
feedback, Egozi et al. [2008] instead aim to augment it. Their system, MORAG, uses
Explicit Semantic Analysis (described in Section 3.1) to represent documents and queries
as vectors of their most relevant Wikipedia articles. Comparison of document vectors to
the query vector results in concept-based relevance scores, which are combined with those
given by state-of-the-art retrieval systems, such as Xapian and Okapi. Additionally, both
concept-based and bag-of-words scores are computed by segmenting documents into
overlapping 50-word subsections (a common strategy), so that the total score of a
document is the sum of the score obtained from its best section and its overall content.
One complication that this approach must overcome is ESA's tendency to provide
features (Wikipedia articles) that are only peripherally related to queries. The query law
enforcement, dogs, for example, results not just in police dog and cruelty to animals, but
also contract and Louisiana. To address this, MORAG first ranks documents according
to their bag-of-words (BOW) scores, and then uses the highest and lowest ranking
documents to provide positive and negative examples for selecting features. When used
to augment the four top performing systems from the TREC-8 competition [Voorhees and
Harman 2000], MORAG achieved improvements of between 4% and 15% to Mean Average
Precision, depending on the system being augmented.
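The passage-based scoring described above (best 50-word section plus overall content) can be sketched independently of the relevance function used. The half-window step between sections and the toy relevance function are assumptions for illustration; MORAG itself would plug in its concept-based or bag-of-words scorer.

```python
def windows(words, size=50, step=25):
    """Overlapping word windows. The half-window step is an assumption."""
    if len(words) <= size:
        return [words]
    # The range bound guarantees that the last window reaches the end of the document.
    return [words[i:i + size] for i in range(0, len(words) - size + step, step)]

def document_score(words, score, size=50, step=25):
    """Total score = score of the best section + score of the whole document."""
    best = max(score(w) for w in windows(words, size, step))
    return best + score(words)

def term_fraction(query_terms):
    """Toy relevance function: share of words that are query terms."""
    return lambda words: sum(w in query_terms for w in words) / len(words)

# A long document whose only relevant material sits at the very end.
doc = ["filler"] * 100 + ["dog"] * 10
print(document_score(doc, term_fraction({"dog"})))
```

Adding the best-section score rewards long documents that contain one highly relevant passage, which a whole-document score alone would dilute.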
Figure 6. Using Wikipedia to recognize and expand query topics. The topics George W.
Bush (alternative sense: George H.W. Bush) and Controversy are expanded to:
("George W. Bush" OR "George Bush" OR "G.W. Bush" OR Bush OR "Bush Junior" OR
"Bush government" OR Dubya OR Dubyuh OR "Bush administration" OR ...)
AND
(controversy OR controversial OR controversies OR disagreement OR dispute OR squabble)

We were surprised to find only these three papers on using Wikipedia to expand
queries, despite the fact that it seems well suited to this task. Bag-of-words based
approaches stand to benefit from Wikipedia's understanding of what the words mean and
how they relate to each other. Concept-based approaches that draw on traditional
knowledge bases could profit just as much from Wikipedia's unmatched breadth. We
expect widespread usage of Wikipedia in the future, both for automatic query expansion
and exploratory searching, and for both improving existing techniques and supporting
entirely new ones.
4.2 Multilingual Retrieval 
Multilingual or cross-language information retrieval involves searching for relevant
documents that were not written in the same language as the query, which serves the
large number of bilingual or multilingual users. Wikipedia has clear application to this
task. Although its language versions grow at different rates and cover different topics, they
are carefully interwoven. For example, the English article on Search engines is linked to
the German Suchmaschine, the French Moteur de recherche, and more than 40 other
translations. These links constitute a comprehensive cross-lingual dictionary of topics
and terms, which is growing rapidly. This makes Wikipedia ideal for translating
emerging named entities and topics, such as people and technologies: exactly the items
that more traditional multilingual resources (dictionaries) struggle with. Surprisingly, we
failed to locate any papers that use Wikipedia's cross-language links directly to translate
query topics.
Instead Potthast et al. [2008] jump directly to a more sophisticated solution that uses
Wikipedia to generate a multilingual retrieval model. This is a generalization of
traditional monolingual retrieval models (like the vector space model or latent semantic
analysis), which assess similarities between documents and fragments of text.
Multilingual and cross-language models are capable of identifying similar documents
even when they are written in different languages. Potthast et al. take Explicit Semantic
Analysis (which, as described in Section 3.1, represents documents by their most
relevant Wikipedia concepts) as the starting point for a new model called Cross-language
Explicit Semantic Analysis or CL-ESA. This approach depends on the hypothesis that
the relevant concepts identified by ESA are essentially language independent, so long as
the concepts are sufficiently described in different languages. If there were sufficient overlap
between the English and German Wikipedias, for example, then one would get roughly
the same list of concepts (and in the same order) from ESA regardless of whether the
document being represented, or the concept space it was projected onto, was in English
or German. This means that the languages of documents and concept spaces are largely
irrelevant, and documents in different languages can be compared without explicit
translation.
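The core idea, projecting each document onto concept articles in its own language and then comparing the resulting concept vectors directly, can be sketched with a toy two-concept space. This is a simplified illustration and not Potthast et al.'s implementation: it uses raw term counts instead of weighted vectors, and the miniature "articles" and function names are invented for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two sparse vectors held as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def esa_vector(text, concept_articles):
    """Concept vector: similarity of the text to each concept article.
    concept_articles must be index-aligned across languages via language links."""
    doc = Counter(text.lower().split())
    return {i: cosine(doc, Counter(article.lower().split()))
            for i, article in enumerate(concept_articles)}

def cross_language_similarity(text_a, concepts_a, text_b, concepts_b):
    """Compare two documents in different languages through their concept vectors."""
    return cosine(esa_vector(text_a, concepts_a), esa_vector(text_b, concepts_b))

# Toy aligned concept spaces: index 0 is "dog", index 1 is "car" in both languages.
concepts_en = ["dog animal pet bark", "car engine road wheel"]
concepts_de = ["hund tier haustier bellen", "auto motor strasse rad"]

print(cross_language_similarity("the dog is a pet", concepts_en,
                                "der hund ist ein haustier", concepts_de))
print(cross_language_similarity("the dog is a pet", concepts_en,
                                "das auto hat einen motor", concepts_de))
```

No word is ever translated: the only cross-language information used is that position i in both concept lists refers to the same Wikipedia topic.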
To evaluate this idea, Potthast et al. conducted several experiments with a bilingual
(German/English) set of 3,000 documents. One test was to use articles in one language as
queries, to retrieve their direct translations in the other language. When CL-ESA was
used to rank all English documents by their similarity to German ones, the explicit
translation of the document was consistently ranked highly: it was first 91% of the time,
and in the top 10 more than 99% of the time. Another test was to use an English
document as a query for the English document set, and its translation as a query for the
German one. The two result sets had an average correlation of 72%. These results were
obtained with a dimensionality of 10^5; that is, 100,000 bilingual concepts were used to
generate the concept spaces. Today, only German and English Wikipedias have this
degree of overlap. Results degrade as fewer concepts are used; Potthast et al. found that
between 1,000 and 10,000 concepts are sufficient for reasonable retrieval performance. At the
time, this made CL-ESA capable of pairing English with German, French, Polish,
Japanese, and Dutch. In time, improvements to the algorithm and continued growth of
Wikipedia will allow these techniques to be applied to other languages as well.
4.3 Question answering 
Question answering is a more complex form of information retrieval, which aims to return
specific answers to questions, rather than entire documents. This ranges in sophistication
from merely obtaining the most relevant sentences or sections from documents, to
ensuring that they are in the correct form to constitute an answer, to constructing answers
on the fly. Wikipedia provides an extremely broad corpus filled with numerous facts,
which makes it a promising source of answers. A simple but well-known example of this
is how Google queries prefixed with define, and Ask.com queries starting with What is...
or Who is..., often return the first sentences from relevant Wikipedia articles.
Kaiser?s [208] QuALiM system, ilustrated in Figure 7, provides a more 
sophisticated example of question answering with Wikipedia.
16
 When asked a question 
(such as Who is Tom Cruise maried to?) it mines Wikipedia not only for relevant 
articles, but also for the sentences and paragraphs in which the answer is given. It also 
provides the exact entity that answers the question?e.g. Katie Holmes. Interestingly, 
this entity is not mined from Wikipedia but obtained by analyzing results from various 
web search engines. It parses questions to identify the expected clas of the answer (in 
this case, a person), and construct valid queries (e.g. Tom Cruise is maried to or Tom 
Cruise?s wife). Responses to these queries are then parsed to identify entities of the 
                                                             
16
 Demo at http://demos.inf.ed.ac.uk:8080/qualim/ 
correct type to satisfy the answer. Wikipedia is then only used to provide the supporting 
sentences and paragraphs. 
The TREC series of conferences hosts a prominent forum for investigating question 
answering.
17
 The question-answering track provides ground truth for experiments with a 
corpus from which answers to questions have been manually extracted. The 2004 track 
saw two of the first uses of Wikipedia for question answering, from Lita et al. [2004] and 
Ahn et al. [2005]. The former does not perform question answering per se; instead it 
investigates whether different resources provide answers to questions, without attempting 
to extract the answers automatically. Wikipedia's coverage of answers was 10 percentage 
points higher than WordNet, and about 30 points higher than the other resources they 
compared it to, including Google define queries and gazetteers such as the CIA World 
Fact Book. 
Ahn et al. [2005] seem to be the first to provide explicit answers from Wikipedia. 
They first identify the topic of the question (Tom Cruise in our example) and locate the 
relevant article. They then identify the expected type of the answer (in this case, another 
person: his wife) and scan the article for matching entities. These are ranked by both 
                                                             
17
 http://trec.nist.gov/ 
 
Figure 7. The QuALiM system, using Wikipedia to answer "Who is Tom Cruise married to?" 
 
prior answer confidence (probability that they answer any question at all) and posterior 
confidence (probability that they answer the question at hand). Prior confidence is given 
by the position of the entity in the article, since articles cover the most important facts 
first. Posterior confidence is given by the Jaccard similarity of the original question and 
the sentence surrounding the entity. Wikipedia is used as one stream among many from 
which to extract answers, and unfortunately the experiments do not tease out its specific 
contribution. Consequently it is difficult to measure the effectiveness of their approach. 
Overall, however, they describe the results as "disappointing" because it did not improve 
upon their previous work. 
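The ranking idea described above can be sketched in a few lines of Python. This is only an 
illustration, not Ahn et al.'s implementation: the linear position-based prior, the product 
used to combine the two confidences, and the names jaccard and rank_candidates are all 
assumptions made for the example, as are the toy candidate sentences. 

```python
import re


def tokens(text):
    """Lower-cased set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity: |A & B| / |A | B| over the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rank_candidates(question, candidates, n_sentences):
    """Rank candidate entities found in one article.

    candidates: list of (entity, sentence_index, sentence) tuples.
    Assumption: the prior decays linearly with position in the article,
    and prior and posterior are combined by multiplication.
    """
    scored = []
    for entity, index, sentence in candidates:
        prior = 1.0 - index / n_sentences          # earlier sentences score higher
        posterior = jaccard(question, sentence)    # overlap with the question
        scored.append((prior * posterior, entity))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entity for _, entity in scored]


if __name__ == "__main__":
    toy = [
        ("Nicole Kidman", 5, "He starred with Nicole Kidman in Days of Thunder."),
        ("Katie Holmes", 2, "Cruise is married to actress Katie Holmes."),
    ]
    print(rank_candidates("Who is Tom Cruise married to?", toy, 10))
```

Any monotonically decreasing prior would serve the same purpose; the linear one is used here 
only because it is the simplest to read. 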
The CLEF series of conferences and competitions is another popular forum for 
investigating question answering.
18
 Monolingual and cross-language QA are addressed by 
providing corpora and tasks in many different languages. One source of documents is a 
cross-language crawl of Wikipedia. Most entries for this competition extract answers from 
Wikipedia but are not covered here because they do not take advantage of its unique 
properties. 
Buscaldi and Rosso [2007a] use Wikipedia to augment their question answering 
system QUASAR. The way in which this system extracts answers was left unchanged, 
except for an additional step where Wikipedia is consulted to verify the results. They 
index four different views of Wikipedia: titles, full text, first sections (definitions), and 
the categories that articles belong to. These are searched differently depending on the 
question type. Answers to definition questions (e.g., "Who is Nelson Mandela?") are 
verified by seeking articles whose title contains the corresponding entity and whose first 
section contains the proposed answer. If the question requires a name (e.g., "Who is the 
President of the United States?") the process is reversed: candidate answers (Bill Clinton, 
George Bush) are sought in the title field and query constraints (President, United States) 
in the definition. In either case, if at least one relevant article is returned the answer is 
verified. This yielded an improvement of 4.5% over the original system, across all 
question types. Ferrández et al. [2007] also make use of Wikipedia's structure to answer 
questions, but focus on cross-lingual tasks, where questions are formulated in a language 
different from that of the documents from which answers are extracted. Their work is 
described in Section 3.4. 
As well as using Wikipedia as a corpus for standard question answering tasks, CLEF 
has a track (WiQA) specifically designed to assist Wikipedia's contributors. Its aim, 
given a source article, is to extract new snippets of information from related articles that 
should be incorporated into it [Jijkoun and de Rijke 2006]. The authors conclude that the 
                                                             
18
 The homepage for the CLEF series of conferences is at http://www.clef-campaign.org/ 
task is difficult but possible, as long as the results are used in a supervised fashion. The 
best out of seven participating teams added an average of 3.4 perfect (important and novel) 
snippets to each English article, with a precision of 36%. Buscaldi and Rosso [2007b], 
one of the contributing entries,
19
 search Wikipedia for articles containing the text of the 
target article's title. They extract snippets from them, rank them according to their 
similarity to the original article using the standard bag-of-words model, and discard those 
that are redundant (too similar) or irrelevant (not similar enough). On English data this 
yields 2.7 perfect snippets per topic, with a precision of 29%. On Spanish data it obtains 
1.8 snippets with 23% precision. 
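A minimal sketch of this kind of similarity band filter is given below. It is not Buscaldi 
and Rosso's code: cosine similarity over raw term counts is assumed as the bag-of-words 
measure, and the thresholds low and high, like the function names, are hypothetical values 
chosen for illustration. 

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag of words: term -> raw count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_snippets(article, snippets, low=0.1, high=0.8):
    """Keep snippets that are related to the article but not redundant.

    Snippets above `high` are treated as redundant, those below `low`
    as irrelevant; the remainder are returned most-similar first.
    Both thresholds are assumptions made for this sketch.
    """
    target = bow(article)
    scored = [(cosine(target, bow(s)), s) for s in snippets]
    kept = [(sim, s) for sim, s in scored if low <= sim <= high]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in kept]


if __name__ == "__main__":
    article = "the waikato river flows north"
    candidates = [
        "the waikato river flows north",       # redundant
        "the river flows through hamilton",    # related and novel
        "bananas are yellow",                  # irrelevant
    ]
    print(select_snippets(article, candidates))
```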
Higashinaka et al. [2007] extract questions, answers and even hints from Wikipedia 
to automatically generate "Who am I?" quizzes. The first two tasks are simple because 
the question is always the same and the answer is always a person. The challenging part 
is extracting hints (which are essentially facts about the person) and ranking them so that 
they progress from vague to specific. They used machine learning for this, based on 
biographical Wikipedia articles whose facts have been manually ranked. 
Overall, research on question answering tends to treat Wikipedia as just another 
plain-text corpus from which to extract answers. Few researchers take advantage of 
Wikipedia's unique structural properties (e.g. categories, links, etc.) or the explicit 
semantics it provides. Instead they apply standard word-based similarity measures, even 
when Wikipedia concept-based measures such as ESA have been proven to be more 
effective. We were surprised to find little overlap between this work and research on 
information extraction from Wikipedia (Section 5), and no use of Wikipedia-derived 
ontologies or its infoboxes (Section 6). Perhaps this reflects an overall goal of crawling 
the entire web for answers, requiring techniques that are generalizable to any textual 
resource. 
4.4 Entity ranking 
It is often expedient to return entities in response to a query rather than full documents as 
in classical retrieval. This resembles question answering and often fulfils the same 
purpose: for example, the query "countries where I can pay in euros" could be answered 
by a list of relevant countries. For other queries, however, entity ranking does not 
provide answers but instead generates a list of pertinent topics. For example, as well as 
Google, Yahoo, and Microsoft Live the query "search engines" would also return 
PageRank and World Wide Web. The literature seems to use the terms entity and named 
                                                             
19
 We were unable to locate papers describing the others. 
entity interchangeably, thus it is unclear whether concepts such as information retrieval 
and full text search would also be valid results. 
Section 5.3 demonstrates that Wikipedia offers an exceptionally large pool of 
manually-defined entities, which can be typed (as people, places, events, etc.) fairly 
accurately. The entity ranking track of the Initiative for Evaluation of XML Retrieval 
(INEX) compares different methods for entity ranking by how well they are able to return 
relevant Wikipedia entities in response to queries [de Vries et al. 2007]. Zaragoza et al. 
[2007] also use Wikipedia as a dataset for comparing two main approaches to entity 
ranking: entity containment graphs and web search based methods. Their results are of 
little interest here because they do not relate directly to Wikipedia. More relevant is that 
they have developed a version of Wikipedia that has been automatically annotated with 
named entities, and are sharing it so that others can investigate different approaches to 
named entity ranking.
20
 
As well as being a source of entities, Wikipedia provides a wealth of information 
about them, which can improve ranking. Vercoustre et al. [2008] combine traditional 
search with Wikipedia-specific features. They rank articles (which they assume are 
synonymous with entities) by combining the score provided by a search engine (Zettair) 
with features mined from categories and inter-article links. The article links provide a 
simplified PageRank for entities and the categories provide a similarity score for how they 
relate to each other. The resulting precision is almost double that of the search engine 
alone. Vercoustre et al. were the only competitors for the INEX entity-ranking track we 
were able to locate,
21
 and it seems that Wikipedia's ability to improve entity ranking has 
yet to be evaluated against more sophisticated baselines. Moreover, the features that 
Vercoustre et al. derive from Wikipedia are only used to rank entities in general, not by 
their significance for the query. Regardless, entity ranking will no doubt receive more 
attention as the INEX competition grows and others use Zaragoza et al.'s dataset. 
The knowledge that Wikipedia provides about entities can also be used to organize 
them. This has not yet been thoroughly investigated, the only example being Yang et 
al.'s [2007] use of Wikipedia articles and WikiBooks to organize entities into 
hierarchical topic maps. They search for the most relevant article and book for a query and 
simply strip away the text to leave lists of links (which again they assume to be 
entities) under the headings in which they were found. This is both a simplistic entity 
ranking method and a tool for generating domain-specific taxonomies, but has not been 
evaluated as either. 
                                                             
20
 The annotated version of Wikipedia is at http://www.yr-bcn.es/semanticWikipedia 
21
 It began in 2007 and the Proceedings are yet to be published. 
4.5 Text categorization 
Text categorization (or classification) organizes documents into meaningful homogeneous 
groups. Documents are labeled from a pool of categories in the same way that articles in a 
newspaper are assigned to sections like business, sport, or entertainment. The traditional 
approach to this task is to represent documents with the words they contain, and use 
training documents to identify the words and phrases that are most indicative of each 
category label. Wikipedia allows categorization techniques to draw on background 
knowledge about the concepts these words represent. As Gabrilovich and Markovitch 
[2006] note, traditional approaches are brittle. They break down when documents discuss 
similar topics in different terms, as when one talks of Wal-Mart and the other of 
department stores. They cannot make the necessary connections because they lack 
background knowledge about what the words mean. Wikipedia can fill the gap. 
As a quick indication of Wikipedia's application to text categorization, Table 4 
compares Wikipedia-based approaches with state-of-the-art categorization that only uses 
information obtained from the documents themselves. The figures were obtained on the 
Reuters-21578 collection, a set of news stories that have been manually assigned to 
categories. Results are presented as the break-even point (BEP) where recall and precision 
are equal. The micro and macro columns correspond to how these are averaged: the 
former averages across documents, so that smaller categories are largely ignored; while 
the latter averages by category. The first entry is a baseline provided by Gabrilovich and 
Markovitch, which is in line with state-of-the-art document-based methods such as 
[Dumais et al. 1998]. The remaining three entries use additional information gleaned 
from Wikipedia and are described below. The gains may seem slight, but they represent 
the first improvements upon a performance plateau reached by previous state-of-the-art 
techniques, which are now a decade old. 
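The difference between micro and macro averaging is easiest to see on a toy example. The 
sketch below averages precision rather than a full break-even point, to keep it short. The 
category names are taken from Reuters-21578, but the counts are invented for illustration 
and the function name is hypothetical. 

```python
def micro_macro_precision(counts):
    """Return (micro, macro) averaged precision.

    counts maps each category to (correct_assignments, total_assignments).
    Micro averaging pools all assignments, so large categories dominate;
    macro averaging gives every category the same weight.
    """
    pooled_correct = sum(correct for correct, _ in counts.values())
    pooled_total = sum(total for _, total in counts.values())
    micro = pooled_correct / pooled_total
    macro = sum(correct / total for correct, total in counts.values()) / len(counts)
    return micro, macro


if __name__ == "__main__":
    # Invented counts: one large, well-classified category and one small,
    # poorly classified one.
    toy = {"earn": (900, 1000), "rye": (1, 10)}
    micro, macro = micro_macro_precision(toy)
    print(f"micro={micro:.3f} macro={macro:.3f}")
```

With these counts the small category barely affects the micro average but halves the macro 
average, which is why the two columns of Table 4 can tell different stories. 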
Gabrilovich and Markovitch [2006] observed that documents can be augmented with 
Wikipedia concepts without complex natural language processing. Both are in the same 
form (plain text), so standard similarity algorithms can be used to compare documents 
with potentially relevant articles. Thus documents can be represented as weighted lists of 
                                                     Micro BEP   Macro BEP 
Baseline (from Gabrilovich and Markovitch [2006])      87.7        60.2 
Gabrilovich and Markovitch [2006]                      88.0        61.4 
Wang et al. [2007]                                     91.2        63.1 
Minier et al. [2007]                                   86.1        64.1 
 
Table 4. Performance of text categorization over the Reuters-21578 collection. 
relevant concepts, rather than bags of words. This should sound familiar; it is the 
predecessor of Explicit Semantic Analysis, an influential technique that we have seen 
several times before (Sections 3.1, 4.1, 4.2). For each document, Gabrilovich and 
Markovitch generate a large set of features (articles) not just from the document as a 
whole, but also by considering each word, sentence, and paragraph independently. 
Training documents are then used to select the best of these features, to augment the 
original bags of words. Additionally the number of links made to each article is used to 
identify and emphasize those that are most well known. This results in consistent 
improvements over the previous classification techniques, particularly over short 
documents (which otherwise have few features) and small categories (which provide fewer 
training examples). 
The ability of Wikipedia to improve classification of short documents is confirmed by 
Banerjee et al. [2007], who focus on clustering news articles under feed items such as 
those provided by Google News. They took a simple approach for obtaining relevant 
articles for each news story, by issuing its title and short description (Google snippet) as 
separate queries to a Lucene index of Wikipedia. They were able to cluster the documents 
under their original headings (each feed item organizes many similar stories) with 90% 
accuracy using only the titles and descriptions as input. This work is somewhat suspect, 
however, in that it treats Google's automatically clustered news stories as ground truth, 
and only compares their Wikipedia-based approach to a baseline of their own design. 
Wang et al. [2007] also use Wikipedia to improve document classification, but focus 
on mining Wikipedia for terms and phrases to add to the bag of words that represent each 
document. For each document, they locate relevant Wikipedia articles by matching 
n-grams to article titles. They then augment the document by crawling these articles for 
synonyms (redirects), hyponyms (parent categories) and associative concepts (inter-article 
links). In the latter case they acknowledge that many links exist between articles that are 
only tenuously related at best. They overcome this by only selecting linked articles that 
are closely related according to textual content or parent categories. As shown in Table 4, 
this results in the best overall performance. 
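The first two steps of this process (title matching and augmentation with redirects) can be 
sketched as follows. This is a simplified illustration rather than Wang et al.'s system: the 
title list and redirect table are tiny hand-made stand-ins for data mined from Wikipedia, 
the function names are hypothetical, and tokenization is plain whitespace splitting with no 
punctuation handling. 

```python
def match_titles(text, titles, max_n=3):
    """Return article titles that occur in the text as word n-grams.

    Longer n-grams are tried first; matching is case-insensitive.
    """
    words = text.lower().split()
    lookup = {title.lower(): title for title in titles}
    found = []
    for n in range(max_n, 0, -1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            if gram in lookup and lookup[gram] not in found:
                found.append(lookup[gram])
    return found


def augment(text, titles, redirects, max_n=3):
    """Extend the document's bag of words with redirect terms (synonyms)."""
    extra = []
    for title in match_titles(text, titles, max_n):
        extra.extend(redirects.get(title, []))
    return text.split() + extra


if __name__ == "__main__":
    # Hypothetical miniature of Wikipedia's titles and redirects.
    titles = ["Wal-Mart", "New Zealand", "Stock market"]
    redirects = {"Wal-Mart": ["Walmart"]}
    doc = "Wal-Mart opens first store in New Zealand"
    print(match_titles(doc, titles))
    print(augment(doc, titles, redirects))
```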
As well as a source of background knowledge for improving classification techniques, 
Wikipedia can be used as a corpus for training and evaluating them. Almost all 
classification approaches are machine-learned, and thus require training examples. 
Wikipedia provides millions of them. Each association between an article and the 
categories to which it belongs can be considered as manually defined ground truth for 
how that article should be classified. Gleim et al. [2007], for example, use it to evaluate 
their techniques for categorizing web pages solely on their structure rather than textual 
content. Admittedly, this is a well-established research area with well-known datasets, so 
it is unclear why another one is required. Table 4, for example, would be more 
informative if all of the researchers using Wikipedia for document classification had used 
standard datasets instead of creating their own. 
Two interesting approaches that do not compete with the traditional bag-of-words 
approaches (and will therefore be discussed only briefly) are Janik and Kochut [2007] and 
Minier et al. [2007]. The former is one of the few techniques that does not use machine 
learning for classification. Instead Janik and Kochut mine miniature "ontologies" (rough 
networks of relevant concepts) from Wikipedia for each document and category, and 
algorithmically identify the most relevant category ontology for each document ontology. 
The latter approach transforms the document-term matrix used by traditional techniques 
by mapping it onto a gigantic term-concept matrix obtained from Wikipedia. PageRank 
is run over Wikipedia's inter-article links in order to weight the derived Wikipedia 
concepts, and dimensionality reduction techniques (latent semantic analysis, kernel 
principal component analysis and kernel canonical correlation analysis) are used to reduce 
the representation to a manageable size. Minier et al. attribute the disappointing results 
(shown in Table 4) to differences in language usage between Wikipedia and the Reuters 
corpus used for evaluation. It should be noted that their Macro BEP (the highest in the 
table) may be misleading; their baseline achieves an even higher result, indicating that 
their experiment should not be compared to the other three. 
Banerjee [2007] observed that document categorization is a problem where the 
goalposts shift regularly. The typical application is organizing news stories or emails, 
which arrive in a constant stream where the topics being discussed constantly evolve. A 
categorization method trained today may not be particularly helpful next week. Instead of 
throwing away old classifiers, they show that inductive transfer allows old classifiers to 
influence new ones. This improves results and reduces the need for fresh training data. 
They find that classifiers which derive additional knowledge from Wikipedia are more 
effective at transferring this knowledge, which they attribute to Wikipedia's ability to 
provide background knowledge about the content of articles, making their representations 
more stable. 
Dakka and Cucerzan [2008] and Bhole et al. [2007] perform the reverse of the above 
techniques. Instead of using Wikipedia to augment document categorization, they apply 
categorization techniques to Wikipedia. Their aim is to classify articles to detect the 
types (people, places, events, etc.) of the named entities they represent. Since this has 
more to do with named entity recognition than document classification, discussion of it 
is deferred to Section 5.3. Also discussed elsewhere is Schönhofen [2006], who developed 
a topic indexing system but evaluated it as a document classifier. His work is left for the 
next section. 
Overall, the use of Wikipedia for text categorization is a flourishing research area. 
Many recent efforts have improved upon the previous state of the art, a plateau that had 
stood for almost a decade. Some of this success may be due to the amount of attention 
the problem has generated (at least 10 papers in just three years), but more fundamentally 
it can be attributed to the way in which researchers are approaching the task. Just as we 
saw in Section 4.1, the greatest gains have come from drawing closely on and 
augmenting existing research, while thoroughly exploring the unique features that 
Wikipedia offers. 
4.6 Topic indexing 
Topic indexing is subtly different from text categorization. Both label documents so that 
they can be grouped sensibly and browsed efficiently, but in topic indexing labels are 
chosen from the topics the documents discuss rather than from a predetermined pool of 
categories. Topic labels are typically obtained from a domain-specific thesaurus (such as 
MESH [Lipscomb 2000] for the medical domain) because general thesauri like WordNet 
and Roget are too small to provide sufficient detail. An alternative is to obtain labels 
from the documents themselves, but this is inconsistent and error-prone because topics 
are difficult to recognize and appear in different surface forms. Using Wikipedia as a source 
of labels sidesteps the onerous requirement for developing or obtaining relevant thesauri, 
since it is large and general enough to apply to all domains. It might not achieve the 
same depth as domain-specific thesauri, but tends to cover the topics that are used for 
indexing most often [Milne et al. 2006]. It is also more consistent than extracting terms 
from the documents themselves, since each concept in Wikipedia is represented by a 
single succinct manually chosen title. In addition to the labels themselves, Wikipedia 
provides many additional features about the concepts, such as how important and well 
known they are, and how they relate to each other. 
Medelyan et al. [2008] propose topic indexing that uses Wikipedia as a controlled 
vocabulary and applies wikification (defined in Section 3.2.1) to identify the topics 
mentioned within documents. For each candidate topic they identify several features, 
including classical ones, such as how often topics are mentioned, and two Wikipedia-specific 
ones. One is node degree: the extent to which each candidate topic (article) is linked to 
the other topics detected in the document. The other is keyphraseness: the extent to 
which the topics are used as links in Wikipedia. They use a supervised approach that 
learns the typical distributions of these features from a manually tagged corpus [Frank et al. 
1999]. For training and evaluation they had 30 people, working in pairs, index 20 
documents. Figure 8 shows key topics for one document and demonstrates the inherent 
subjectivity of the task: the indexers did not all choose the same topics, and achieved 
only 30% agreement with each other. Medelyan et al.'s automatic system, whose choices 
are shown as filled circles in the figure, obtained the same level of agreement and requires 
little training. 
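The two Wikipedia-specific features can be illustrated with a short sketch. The exact 
formulas used by Medelyan et al. are not given here, so the definitions below are 
assumptions: keyphraseness is taken as the fraction of a phrase's occurrences in Wikipedia 
that are link anchors, and node degree as the number of other detected topics that are 
linked to or from the candidate. The link table is a hypothetical miniature. 

```python
def keyphraseness(link_count, occurrence_count):
    """Fraction of a phrase's occurrences in Wikipedia that are links.

    Assumed definition; both counts would come from a Wikipedia dump.
    """
    return link_count / occurrence_count if occurrence_count else 0.0


def node_degree(topic, doc_topics, links):
    """Number of other topics in the document connected to `topic`.

    links maps an article to the set of articles it links to; a
    connection in either direction counts once.
    """
    outgoing = links.get(topic, set())
    degree = 0
    for other in set(doc_topics) - {topic}:
        if other in outgoing or topic in links.get(other, set()):
            degree += 1
    return degree


if __name__ == "__main__":
    # Hypothetical link structure between candidate topics.
    links = {
        "Regression testing": {"Software testing"},
        "Software testing": {"Software maintenance"},
        "Algorithm": set(),
    }
    topics = ["Regression testing", "Software testing",
              "Software maintenance", "Algorithm"]
    for t in topics:
        print(t, node_degree(t, topics, links))
    print(keyphraseness(25, 100))
```

In a supervised setting such as the one described above, these values would simply be 
added to the feature vector of each candidate topic alongside the classical features. 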
Although it has not been evaluated as such, Gabrilovich and Markovitch's [2007] 
Explicit Semantic Analysis, described in Section 3.1, essentially performs topic 
indexing. For each document or text fragment it generates a weighted list of relevant 
Wikipedia concepts, the strongest of which should be suitable topic labels. Another 
approach that has not been compared to manually indexed documents is Schönhofen 
[2006], who uses Wikipedia categories as the vocabulary from which key topics are 
selected. Documents are scanned to identify the article titles and redirects they mention, 
and documents are represented by the categories that contain these articles, weighted by 
how often the document mentions the category title, its child article titles, and the 
individual words in them. Schönhofen did not compare the resulting categories with 
index topics, but instead used them to perform document categorization. Roughly the 
same results are achieved whether documents are represented by these categories or by 
their content in the standard way. Combining the two yields a significant improvement. 
Like document categorization, research in topic indexing builds solidly on related 
work, but has been augmented to make interesting use of Wikipedia. Although not a 
great deal of research has been done, significant gains have been achieved over the 
previous state of the art. The results have not yet been evaluated as rigorously as in 
categorization, however. Medelyan et al. [2008] have directly compared their results 
 
Figure 8. Topics assigned to a document entitled "A Safe, Efficient Regression Test Selection 
Technique" by human teams (outlined circles) and the new algorithm (filled circles). 
against manually defined ground truth, but this was restricted to a relatively small 
dataset. To advance further, larger datasets need to be developed for evaluation and 
training. 
5. INFORMATION EXTRACTION 
Where information retrieval is driven largely by the goal of answering specific questions, 
information extraction seeks to deduce meaningful structures from unstructured data such 
as natural language text, though in practice the dividing line between the fields is not 
sharp. These structures are usually represented as relations. For example, from this: 
    Apple Inc.'s world corporate headquarters are located in the middle of Silicon 
    Valley, at 1 Infinite Loop, Cupertino, California. 
a relation hasHeadquarters(Apple Inc., 1 Infinite Loop-Cupertino-California) might be 
extracted. The challenge is to extract this relation from sentences expressing the same 
information about Apple Inc., regardless of the actual wording. Moreover, given a similar 
sentence about other companies, the same relation should be determined with different 
arguments, e.g., hasHeadquarters(Google Inc., Google Campus-Mountain View-California). 
Methods for extracting relations from Wikipedia can be grouped into those that use 
its raw text (Section 2.3.2) and those that use its semi-structured parts and internal 
hyperlink structure (Sections 2.3.3, 2.3.4 and 2.3.5). The former, described in Section 
5.1, apply methods developed before Wikipedia was recognized as a linguistic resource; 
for them, any text represents a source of relations. The extraction process benefits from the 
encyclopedic nature of Wikipedia articles and their uniform writing style. The latter, 
described in Section 5.2, exploit unique Wikipedia properties such as infoboxes and the 
category structure. Finally, in Section 5.3 the determination of named entities and their 
type is treated as a task of its own. As noted earlier, Wikipedia's coverage of named 
entities is uniquely comprehensive and up-to-date (Section 3.2.3). Such work extracts 
named entity information such as isA(Portugal, Location) and isA(Bob Marley, Person). 
Again, although the task is similar to that in Sections 5.1 and 5.2, different techniques 
are applied, like analysis of geographical coordinates. 
5.1 Semantic relations in Wikipedia's raw text 
Extracting semantic relations from raw text begins by taking known relations that 
serve as seeds and extracting patterns from their text: "X's * headquarters are located in 
* at Y" in the above example. These patterns are applied to a large text corpus to identify 
new relations. For this, a phrase chunker or named entity recognizer is applied to identify 
entities that appear in a sentence, intervening patterns are compared to the seed patterns, 
and when they match, new semantic relations are discovered. Culotta et al. [2006] 
summarize difficulties in this process: 
• Enumerating over all pairs of entities yields a low density of correct relations 
  even when restricted to a single sentence. 
• Errors in the entity recognition stage create inaccuracies in relation classification. 
Wikipedia's structure helps combat these difficulties. Each article represents a particular 
concept that serves as a clearly recognizable principal entity for relation extraction from 
that article. Its description contains links to other, secondary, entities. All that remains is 
to determine the semantic relation between these entities. For example, the description of 
the Waikato River, shown in Figure 9, links to entities like river, New Zealand, Lake 
Taupo and many others. Appropriate syntactic and lexical patterns can extract a host of 
semantic relations between these items. 
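To make the idea of applying a seed pattern concrete, the sketch below compiles the 
headquarters pattern from the example above into a regular expression and applies it to a 
sentence. It is a deliberately naive illustration, not any of the surveyed systems: real 
approaches identify entities with a chunker or via article links rather than with greedy 
string matching, and the function names are hypothetical. X and Y are assumed to be the only 
placeholders, and each * is assumed to stand for one or more words. 

```python
import re


def compile_pattern(pattern):
    """Turn a surface pattern such as "X's * headquarters ... at Y" into a regex.

    X and Y become named groups, '*' matches any intervening text,
    and every other token must match literally.
    """
    parts = []
    for tok in pattern.split():
        if tok == "*":
            parts.append(r".*?")
        elif tok == "X" or tok.startswith("X'"):
            parts.append(r"(?P<x>.+?)" + re.escape(tok[1:]))
        elif tok == "Y":
            parts.append(r"(?P<y>.+?)")
        else:
            parts.append(re.escape(tok))
    # Tokens are separated by whitespace; a final full stop is optional.
    return re.compile(r"\s+".join(parts) + r"\.?$")


def extract(pattern, sentence):
    """Return the (X, Y) pair if the sentence matches the pattern, else None."""
    match = compile_pattern(pattern).match(sentence)
    return (match.group("x"), match.group("y")) if match else None


if __name__ == "__main__":
    seed = "X's * headquarters are located in * at Y"
    sentence = ("Apple Inc.'s world corporate headquarters are located in "
                "the middle of Silicon Valley, at 1 Infinite Loop, "
                "Cupertino, California.")
    print(extract(seed, sentence))
```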
Ruiz-Casado et al. [2005] mine relations from Simple Wikipedia using WordNet as a 
source of positive examples (Ruiz-Casado et al. [2007] explain the technique in greater 
detail). Given two co-occurring semantically related WordNet nouns in a Wikipedia 
article, the intervening text is used to find relations that are absent from WordNet. But 
first the text is generalized. If the edit distance between two such strings falls below a 
predefined threshold (i.e., the two strings nearly match), those parts that do not match are 
replaced by a wildcard (*). For example, a generalized pattern: X directed the * famous|known 
film Y is obtained from two strings: X directed the famous film Y and X directed the well 
known film Y. Using this technique Ruiz-Casado et al. identify 120 new semantic relations 
with a precision of 61 to 69% depending on the relation type. 
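The generalization step can be sketched with a word-level edit distance and a backtrace 
over the alignment. This is our reading of the description above rather than Ruiz-Casado et 
al.'s algorithm: inserted or deleted words are assumed to become wildcards, substituted 
words to become alternatives, and the threshold max_distance is a hypothetical value. 

```python
def generalize(a, b, max_distance=2):
    """Merge two word sequences into one generalized pattern.

    Returns None when the word-level edit distance exceeds max_distance.
    Otherwise matched words are kept, substituted words become "w1|w2",
    and inserted or deleted words become the wildcard "*".
    """
    x, y = a.split(), b.split()
    n, m = len(x), len(y)
    # dp[i][j] = edit distance between the first i words of x and first j of y.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,
                           dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1)
    if dp[n][m] > max_distance:
        return None
    # Backtrace from the end, preferring match/substitution over gaps.
    out = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and x[i - 1] == y[j - 1]
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if same else 1):
            out.append(x[i - 1] if same else x[i - 1] + "|" + y[j - 1])
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            out.append("*")   # word present only in the second string
            j -= 1
        else:
            out.append("*")   # word present only in the first string
            i -= 1
    return " ".join(reversed(out))


if __name__ == "__main__":
    print(generalize("X directed the famous film Y",
                     "X directed the well known film Y"))
```

On the two example strings this yields the generalized pattern quoted above. Other 
tie-breaking choices in the backtrace would give equally valid but different alignments. 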
Ruiz-Casado et al. [2006] generalize this technique to extract relations between 
automatically identified entities without using WordNet as a reference. The English 
 
 
Figure 9. Wikipedia's description of the Waikato River. 
 
Wikipedia is used as a corpus, but now the authors concentrate only on those parts that 
are likely to contain relations of interest. They crawl Wikipedia's list pages to access 
prime ministers, authors, actors, football players, and capitals; and infer the same kind 
of predefined patterns as above. They manually evaluate precision on at least 50 examples 
for each relation type. If the pages are combined into a single corpus results vary wildly, 
from 8% precision on the player-team relation to 90% for death-year. The reason is 
heterogeneity in style and mark-up of articles. When the player-team patterns are applied 
just to articles about football players, precision increases to 93%. 
Herbelot and Copestake [2006] extract hyponymy relations from sentences containing 
the verb to be (including is, was, will be etc.). Instead of performing simple pattern 
matching of the form X is a Y with some wildcards, they analyze the sentences to identify 
the subject, object and their relationship, regardless of word order. These authors use 
their own dependency analyzer, called Robust Minimal Recursion Semantics, which can 
handle partially parsed sentences. This analyzer re-organizes a parsed sentence into a 
series of minimal semantic trees whose root elements correspond to lemmas in the 
sentence. The same tree is obtained for similar sentences like Xanthidae is a family of 
crabs and Xanthidae is one of the families of crabs (Figure 10). 
The results are evaluated manually on a subset of 10 articles and automatically using 
a thesaurus, restricted to Wikipedia articles describing animal species. Because only 3 
patterns were used, recall was low: 14% at precision 92%. To improve recall they 
suggest extracting patterns automatically. The same dependency analyzer is used, which 
yields patterns that are more general than regular expressions, although no explicit 
performance comparison is provided. Initial experiments increase recall to 37%; however, 
precision drops to 65%. 
Suchanek et al. [2006] also employ linguistic techniques to achieve better results 
than regular expressions. They parse each sentence with a context-free grammar. A pattern 
is defined by a set of syntactic links between two given concepts, called a bridge. For 
example, the bridge in Figure 11 matches sentences like Chopin was great among the 
composers of his time where Chopin=X and composers=Y. Machine learning techniques 
are applied to determine and generalize patterns that describe relations of interest from 
manually supplied positive and negative examples. The approach is evaluated on article 
sets with different degrees of heterogeneity: articles about composers, geography, and 
random articles. As expected, the more heterogeneous the corpus the worse the results, 
with best results achieved on composers for the relations birthDate (F-measure 75%) and 
instanceOf (F-measure 79%). Unlike Herbelot and Copestake [2006], Suchanek et al. 
show that their approach outperforms other systems, including a shallow pattern 
matching resource TextToOnto
22
 and the more sophisticated scheme of Cimiano and 
Volker [195]. 
Nguyen et al. [2007a, 2007b] augment these ways of combining lexical and syntactic 
patterns with techniques such as anaphora resolution (to increase coverage), full 
dependency parsing and subtree mining. Sentences are analyzed with OpenNLP
23
 and 
anaphora and co-referents resolved using a simple heuristic developed specially for the 
purpose. Thus, in an article about the software company 3PAR, phrases like 3PAR, 
manufacturer, it and company are tagged as the same principal entity. Next, all link 
anchors in the article are tagged as secondary entities: ones relating to the principal 
entity. Sentences with at least one principal and one secondary entity are analyzed by the 
Minipar dependency parser. The dependency tree of Figure 12a is extracted from the 
sentence David Scott joined 3PAR as CEO in January 2001 and is then generalized to 
match similar sentences (Figure 12b). The subtrees are extracted from a set of training 
sentences containing positive examples and then applied as patterns to find new semantic 
relations. The scheme was evaluated using 3,30 manualy anotated entities, 20 of 
which were reserved for testing. 6,00 Wikipedia articles, including 45 test articles, were 
used as the corpus. The new aproach achieved an F-measure of 38%, with precision 
significantly higher than recal, significantly outperforming two simple baselines. 
²² http://sourceforge.net/projects/textonto
²³ http://opennlp.sourceforge.net/

Figure 10. Output of the Robust Minimal Recursion Semantics analyzer for the sentence
Xanthidae is one of the families of crabs [Herbelot and Copestake, 2006].

Wang et al. [2007a] use selectional constraints in order to increase the precision of
regular expressions without reducing coverage. They also automatically extract positive
seeds from infoboxes. For example, the infobox field Directed by describes the relation
hasDirector(FILM, DIRECTOR) with positive examples <Titanic, James Cameron> and
<King Kong (2005), Peter Jackson>. They collect patterns that intervene between these
entities in Wikipedia's text and generalize them into regular expressions like
X (is|was) (a|an) * (film|movie) directed by Y.
Selectional constraints restrict the types of subject and object that can co-occur within
such patterns. For example, Y in the pattern above must be a director, or at least a
person. The labels specifying the types of entities, implemented as features, are derived
using words commonly occurring in Wikipedia articles describing these entities. For
example, instances of ARTIST extracted from a relation hasArtist(ALBUM, ARTIST) often
co-occur with terms like singer, musician, guitarist, rapper, etc. To ensure better
coverage, Wang et al. cluster such terms hierarchically. The advantage of selectional
constraints is that they allow patterns such as "X's Y" and "X of Y" to be applied.
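As a rough illustration of how a generalized surface pattern and a selectional constraint interact, the sketch below renders the pattern quoted above as a Python regular expression and accepts a match only if Y passes a type check. The wildcard is approximated as up to four intervening words, and the TYPE_CUES lexicon is a hypothetical stand-in for the type features that Wang et al. derive from Wikipedia text; neither is taken from their implementation.

```python
import re

# The pattern "X (is|was) (a|an) * (film|movie) directed by Y" as a regex.
# Assumption: the "*" wildcard is approximated by at most four arbitrary words.
PATTERN = re.compile(
    r"(?P<X>[A-Z][\w' ()]*?) (?:is|was) (?:a|an) (?:\w+ ){0,4}?"
    r"(?:film|movie) directed by (?P<Y>[A-Z]\w*(?: [A-Z]\w*)*)"
)

# Hypothetical lexicon of type cues per entity (illustrative only).
TYPE_CUES = {
    "James Cameron": {"director", "producer"},
    "Peter Jackson": {"director"},
}


def extract_has_director(sentence, allowed=frozenset({"director", "person"})):
    """Return a hasDirector triple, or None if the pattern or the constraint fails."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    film, director = match.group("X"), match.group("Y")
    # Selectional constraint: Y must carry at least one allowed type cue.
    if not TYPE_CUES.get(director, set()) & allowed:
        return None
    return ("hasDirector", film, director)
```

Calling extract_has_director on a sentence such as "Titanic is a 1997 American film directed by James Cameron." yields a triple, whereas a sentence whose Y has no known type cue is rejected even though the surface pattern matches. This is the sense in which the constraint trades nothing in pattern coverage for a gain in precision.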
The relations hasDirector and hasArtist are evaluated independently on a sample of
10 relations that were extracted automatically from the entire Wikipedia and manually
assessed by three human subjects. An unsupervised learning algorithm was applied, and
the features were tested individually and together. The authors report precision and
accuracy values close to 100%.
Figure 11. Example bridge pattern used in Suchanek et al. [2006].

Figure 12 (a, b). Example dependency parse in Nguyen et al. [2007].

The same authors investigate a different technique that does not rely on patterns at all
[Wang et al. 2007b]. Instead, features are extracted from two articles before determining
their relation:
• The first noun phrase that follows the verb to be in the article's first sentence,
  together with its lexical head (e.g., comedy film and film in Annie Hall is a romantic
  comedy film).
• Noun phrases that appear in the corresponding category titles, and their lexical
  heads.
• Infobox predicates, e.g. Directed by and Produced by in Annie Hall.
• Terms that appear between the articles in sentences that contain them both as a
  link.
For each pair of articles the distribution of values of these features is compared with
that of positive examples. Unlike in [Wang et al. 2007a], no negative instances are used.
A special learning algorithm (B-POL) designed for situations where only positive
examples are available is applied. First, negative examples are identified from unlabeled
data using a weak classifier, and then a strong classifier (e.g., SVM) is used to iteratively
classify negative examples until none remain. Four relations were used for evaluation,
hasArtist(ALBUM, ARTIST), hasDirector(FILM, DIRECTOR),
isLocatedIn(UNIVERSITY, CITY) and isMemberOf(ARTIST, BAND), along with 1,00
named entity pairs classified by three human subjects. The best result was an F-measure
of 80% on the hasArtist relation, which had the largest training set; the worst was 50% on
isMemberOf.
Wu and Weld [2007] view the extraction problem as a task of improving infoboxes in
Wikipedia. Like Wang et al. [2007a, 2007b] they use infobox content as training data.
Their system, called Kylin, first maps infobox attribute-value pairs to sentences in the
corresponding Wikipedia article using some simple heuristics. Next, for each attribute it
creates a sentence classifier that uses the sentence's tokens and their part-of-speech tags
as features. Given an unseen Wikipedia article, a document classifier analyzes its
categories and assigns an infobox class, e.g. "U.S. counties". Next, the sentence
classifiers are applied to assign relevant infobox attributes. Extracting values from the
sentences is treated as a sequential data-labeling problem, to which Conditional Random
Fields are applied. Precision and recall of Kylin are measured by its ability to generate
correct infoboxes for Wikipedia articles for which infobox information is known. The
authors manually judged the attributes produced by their system and by Wikipedia
authors. Kylin's precision ranged from 74 to 97%, at recall levels of 60 to 96%
respectively, depending on the infobox class. The Wikipedia authors' precision was
around 95% on average and more stable across the classes; their recall was significantly
better on most classes but worse or the same on others.
In a later work Wu et al. [2008] address problems in their approach in the following
way. To generate complete infobox schemata for articles of rare classes, they refer to
WordNet's ontology and aggregate attributes from parent classes to their children. E.g.
knowing that isA(Performer, Person), the infobox for Performers receives the previously
missing field BirthPlace. To provide additional positive examples, they apply TextRunner
[Banko et al. 2007] to the web, in order to retrieve additional sentences describing the
same attribute-value pairs. Given a new entity for which an infobox needs to be generated,
they use Google search to retrieve additional sentences describing this entity. The
combination of these techniques improves the recall by 2 to 9 percentage points while
maintaining or increasing precision. Kylin's results are the most complete and impressive
in this group of approaches.
The majority of the approaches presented take advantage of Wikipedia's encyclopedic
nature, using it as a corpus for extracting semantic relations. Simple pattern matching
techniques are outperformed by those that use parsing [Suchanek et al. 2006], selectional
constraints [Wang et al. 2007a] and lexical features [Wang et al. 2007b]. Wang et al.
[2007a] and Wu and Weld [2007] show that Wikipedia infoboxes contain positive
examples that can improve the extraction if machine learning is applied. Wu et al. [2008]
show that retrieving additional content from the web boosts the extraction performance.
It would be helpful to directly compare the approaches on the same data set. Of course
for this, the researchers would need to reach a consensus on what relations they will
extract. At this point, while there is an overlap in some relations (isMemberOf,
InstanceOf, hasDirector), the choice of a particular relation set by a research group seems
to be arbitrary. Furthermore, none of these techniques take advantage of Wikipedia's
structural information like hyperlinks between the articles and their categorization. As the
next section shows, such information contains a wealth of semantic relations
outnumbering the ones appearing in Wikipedia's actual text.
 
Figure 13. Fragment of Wikipedia's category structure [Ponzetto, 2007].
5.2 Semantic relations in structured parts of Wikipedia 
Here we describe research that addresses the limitations just identified by seeking
semantic relations in (semi-)structured parts of Wikipedia, with the goal of building an
alternative to manually created knowledge bases such as WordNet and Cyc. Some
approaches label existing links between categories and articles, a process sometimes
referred to as link-typing. As noted in Section 2.2.6, Wikipedia's category structure is
made up of what are in fact rather different kinds of relations. For example, in Figure 13
Category:Mathematical logic belongs to both Category:Logic and Category:Mathematics;
the former relation should arguably be isA and the latter partOf. Further differentiation
between category relations in Wikipedia is required to transform it into a lexical
knowledge base like those created by humans. Some approaches use Wikipedia's
infoboxes (Figure 14) as a further source of relational information.
Chernov et al. [2006] were among the first to analyze links between Wikipedia
categories. Their goal was to determine semantically strong links, as opposed to
"irregular and navigational links." They develop two measures. One correlates semantic
strength with the number of hyperlinks between articles assigned to the two categories in
question; the other is the connectivity ratio: the number of links from articles in one
category to articles in the other, expressed as a proportion of the total number of links in
the first category. Evaluation uses a sample of 10 category pairs, each assessed by
human subjects as strongly, averagely or weakly related. Chernov et al. observe that both
measures correlate with human judgments, but a more thorough study is required.
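The two measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: links are given as (source article, target article) pairs, the first measure counts links in either direction, and "the total number of links in the first category" is read as all links leaving articles of that category. The function names and the toy data are hypothetical, not taken from Chernov et al.

```python
def link_count(links, cat_a, cat_b):
    """Number of hyperlinks, in either direction, between articles of two categories."""
    return sum(
        1
        for src, dst in links
        if (src in cat_a and dst in cat_b) or (src in cat_b and dst in cat_a)
    )


def connectivity_ratio(links, cat_a, cat_b):
    """Links from cat_a articles to cat_b articles, as a share of all links leaving cat_a."""
    outgoing = [dst for src, dst in links if src in cat_a]
    if not outgoing:
        return 0.0
    return sum(1 for dst in outgoing if dst in cat_b) / len(outgoing)


# Toy link graph and category memberships (hypothetical data).
LINKS = [
    ("Guitar", "Rock music"), ("Guitar", "Wood"),
    ("Piano", "Jazz"), ("Piano", "Rock music"),
    ("Jazz", "Piano"),
]
INSTRUMENTS = {"Guitar", "Piano"}
GENRES = {"Rock music", "Jazz"}
```

On this toy graph three of the four links leaving the instrument articles point at genre articles, so the ratio from INSTRUMENTS to GENRES is 0.75, while the ratio in the opposite direction is 1.0. The ratio is therefore asymmetric, unlike the raw link count.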
Several projects extract relations from Wikipedia of a quantity or organization that
might properly be called "ontological". Discussion of these projects impinges on the
territory of Section 6. Here we discuss the projects' methods and relationship to other IE
research, while in Section 6 we discuss their end-products considered as ontologies in
their own right. One such project is YAGO, Yet Another Great Ontology [Suchanek et
al. 2007]. Here Wikipedia's leaf categories are mapped onto the WordNet taxonomy of
synsets, and the articles belonging to those categories are added to the taxonomy as new
elements. To perform the mapping, each category's lexical head is extracted (people in
Category:American people in Japan) and, if necessary, expressed in singular form
(person) before being sought in WordNet. If there is a match, it is chosen as the class for
this category. This scheme extracts 143,00 isA relations; in this case, isA(American
people in Japan, person/human). If more than one match is possible, word sense
disambiguation is required (cf. Section 3.2.3). The authors experimented with mapping a
category's subcategories to WordNet and choosing the sense that corresponds to the
smallest resulting taxonomic graph. However, they claim that this semantically enhanced
technique does not perform as well as choosing the most frequent WordNet synset for a
given term (the frequency values are provided by WordNet), an observation that seems
inconsistent with findings by other authors [e.g. Medelyan and Milne 2008] who show
that the most frequent sense is not necessarily the intended one (Section 3.2.3).
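The mapping step can be illustrated with a small self-contained sketch. It is not YAGO's code: the head finder simply takes the word before the first preposition, the singularizer handles only a few cases, and SENSES is a hypothetical miniature sense inventory with invented frequency counts standing in for WordNet.

```python
PREPOSITIONS = {"in", "of", "by", "from", "at", "for", "on"}
IRREGULAR_PLURALS = {"people": "person", "men": "man", "women": "woman"}

# Hypothetical sense inventory: lemma -> [(sense label, frequency count)].
SENSES = {
    "person": [("person/human", 10), ("person/grammatical category", 1)],
    "city": [("city/urban area", 8), ("city/administrative district", 2)],
}


def lexical_head(title):
    """Head noun of a category title: the word before the first preposition, else the last word."""
    tokens = title.lower().split()
    for i, token in enumerate(tokens):
        if i > 0 and token in PREPOSITIONS:
            return tokens[i - 1]
    return tokens[-1]


def singular(noun):
    """Very rough singularization (assumption: English plurals in -s or -ies, plus a few irregulars)."""
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith("ies"):
        return noun[:-3] + "y"
    if noun.endswith("s") and not noun.endswith("ss"):
        return noun[:-1]
    return noun


def map_category(title):
    """Map a category title to the most frequent sense of its singular head, if any."""
    senses = SENSES.get(singular(lexical_head(title)))
    if not senses:
        return None
    return max(senses, key=lambda sense: sense[1])[0]
```

With this inventory, map_category("American people in Japan") returns "person/human", mirroring the isA example in the text, and a category whose head is absent from the inventory is left unmapped. Choosing the highest-count sense corresponds to the most-frequent-synset strategy that the authors report as working best.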
Having established a large core taxonomy, the authors define a mixed suite of
heuristics for extracting further relations to add to it. For instance a name parser is applied
to all personal names to identify given and family names, adding 40,00 relations like
familyNameOf(Albert Einstein, "Einstein"). Many heuristics make use of the Wikipedia
category names, allowing extraction of relations like bornInYear, establishedIn, locatedIn
and others. For example, subcategories of categories ending with birth (e.g., 1879 birth)
and establishments correspond to the first two relations. A category like Cities in
Germany indicates the locatedIn relation. This yields 370,00 non-hierarchical, non-
synonymous relations. Manual evaluation of sample facts by human judges shows 91 to
99% accuracy, depending on the relation. Also added are 2M synonymy relations
generated from redirects, 40M context relations generated from cross-links between
articles, and 2M type relations between categories considered as classes and their articles
considered as entities. Section 6.6 discusses the number and kinds of facts in YAGO in
more detail, as well as further specifically ontological features, such as its purpose-built
ontology language.²⁴
 
Another extremely large-scale relation-extraction project is DBPedia [Auer and
Lehmann 2007]. This project analyses Wikipedia's infoboxes and transforms their
content into RDF triples. Figure 14 shows part of the infobox from the New Zealand
article; on the right is the Wiki mark-up used to create it. Extracting information from
infoboxes is by no means trivial. The information they contain is expressed in an
attribute-value notation, which is rendered inside a wiki page by means of an associated
template. There are many different templates, with a great deal of redundancy between
them; for example, Auer and Lehmann report separate templates for Infobox_film,
Infobox Film, and Infobox film. Recursive regular expressions are used to parse relational
triples from all templates that are commonly used in Wikipedia and contain at least
several predicates. For example, the country template encodes relations like
hasCapital(New Zealand, Wellington) or hasPrimeMinister(New Zealand, Helen Clark).
The templates are taken at face value; no heuristics are applied to verify their accuracy.
The URL of each entity linked to from an article is recorded as a unique identifier.
Wikipedia categories are treated as classes and articles as individuals. However, Auer
and Lehmann do not say what happens to articles that have corresponding categories, like
New Zealand; presumably article and category receive different identifiers. Unlike YAGO
there is no attempt to place facts in the framework of an overall taxonomic structure of
concepts. Apart from the infobox relations, links between categories are merely extracted
and labeled with the relation isRelatedTo.
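To make the attribute-value-to-triple step concrete, here is a minimal sketch in Python. It is a simplification rather than the DBPedia extractor: it uses two flat regular expressions instead of recursive ones, handles a single non-nested template, and emits plain tuples rather than RDF with URIs. SAMPLE is an abridged, hypothetical rendering of the New Zealand infobox using standard double-bracket wiki links, and all function and constant names are illustrative.

```python
import re

SAMPLE = """{{ Infobox Country or territory |
native_name = New Zealand |
capital = [[Wellington]] |
largest_city = [[Auckland]] |
official_languages = [[New Zealand English|English]] (98%) |
government_type = [[Parliamentary democracy]] and [[Constitutional monarchy]]
}}"""

# A "|" separates fields unless it sits inside a [[target|label]] link.
FIELD_SEPARATOR = re.compile(r"\|(?![^\[]*\]\])")
# Captures the link target, ignoring an optional display label.
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def infobox_triples(subject, markup):
    """Turn the attribute-value pairs of one infobox into (subject, predicate, object) triples."""
    body = markup.strip()[2:-2]  # drop the enclosing {{ and }}
    triples = []
    for field in FIELD_SEPARATOR.split(body):
        if "=" not in field:
            continue  # the template name, not an attribute
        key, value = (part.strip() for part in field.split("=", 1))
        targets = WIKI_LINK.findall(value)
        # Linked values give one triple per link target; unlinked values stay literals.
        for obj in targets or [value]:
            triples.append((subject, key, obj))
    return triples
```

Running infobox_triples("New Zealand", SAMPLE) produces six triples, including ("New Zealand", "capital", "Wellington") and two for government_type. As in the approach described above, values are taken at face value: nothing checks whether Wellington is in fact the capital.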
The resulting DBPedia dataset contains 15,00 classes and 650,00 individuals
sharing 8,00 types of semantic relations. A total of 103M triples are extracted, far
surpassing any other scheme.²⁵ However, 60% of these are internal links derived from
Wikipedia's link structure; only 15% are taken directly from infoboxes. Also, since there
is no evaluation it is difficult to judge how accurate the triples are. Unlike other
approaches, DBPedia relies on the accuracy of Wikipedia's contributors, and Auer and
Lehmann suggest guidelines for authors in order to improve the quality of infoboxes over
time. Section 6.6 further discusses DBPedia in the context of YAGO and other
ontologies.
²⁴ YAGO can be queried online or downloaded from http://www.mpi-inf.mpg.de/~suchanek/downloads/yago/

{{ Infobox Country or territory |
native_name = New Zealand |
…
capital = [[Wellington]] |
latd = 41 | latm = 17 | latNS = S |
longd = 174 | longm = 27 | longEW = E |
largest_city = [[Auckland]] |
official_languages =
[[New Zealand English|English]] (98%)
[[Māori language|Māori]] (4.2%)
[[New Zealand Sign Language|NZ Sign Language]] (0.6%) |
demonym = [[New Zealand People|New Zealander]], [[Kiwi (people)|Kiwi]] |
government_type =
[[Parliamentary democracy]] and
[[Constitutional monarchy]]
…}}

Figure 14. Wikipedia infobox on New Zealand.

Work at the European Media Lab Research (EMLR) takes up the challenge of further
differentiating category links independently of the DBpedia project. Ponzetto and Strube
[2007] observe that the first task is to construct a knowledge taxonomy, or subsumption
hierarchy, and that the quickest way to do this is to identify and isolate isA relations from
amongst already-existing category links. Here isA is thought of as subsuming relations
between two classes, isSubclassOf(Apples, Fruit), and between an instance and its
class, isInstanceOf(New Zealand, Country). They analyze category titles and their
connectivity to distinguish between isA and what they call "notIsA" relations. Several
steps are applied in order of accuracy. One of the most accurate matches the lexical head
and modifier of two phrases. Sharing the same lexical head indicates an isA relation, e.g.,
isA(British computer scientist, Computer scientist). Modifier matching indicates notIsA,
e.g., notIsA(Islamic mysticism, Islam). Another method uses co-occurrence statistics of
two categories within patterns to indicate hierarchical and non-hierarchical relations, e.g.,
NP₂,? (such as|like|, especially) NP* NP₁ indicates isA, and NP₁ are? used in
NP₂ indicates notIsA. This technique induces 10,00 isA relations from Wikipedia.
Comparing the derived labels with relations assigned (by knowledge engineers) to
concepts with the same lexical heads in ResearchCyc shows that their labeling is highly
accurate, depending on the method used, and yields an overall F-measure of 88%.
Ponzetto [2007] describes how they plan to apply the induced knowledge base to natural
language processing tasks such as co-reference resolution.
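The head and modifier matching step lends itself to a short sketch. The version below is an approximation under stated assumptions: the lexical head is taken to be the last word of a category title (so titles with prepositional phrases are out of scope), and stem is a deliberately crude, hypothetical stemmer that only strips "ic" and a plural "s". It is not the procedure used by Ponzetto and Strube, only an instance of the idea.

```python
def stem(word):
    """Crude stemmer (assumption): strip a few suffixes so that Islamic and Islam compare equal."""
    word = word.lower()
    for suffix in ("ic", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def label_link(subcategory, supercategory):
    """Label a category link as isA, notIsA, or None when neither test fires."""
    sub = [stem(token) for token in subcategory.split()]
    sup = [stem(token) for token in supercategory.split()]
    # Head matching: both titles share the same lexical head.
    if sub[-1] == sup[-1]:
        return "isA"
    # Modifier matching: the head of one title occurs as a modifier in the other.
    if sup[-1] in sub[:-1] or sub[-1] in sup[:-1]:
        return "notIsA"
    return None
```

For instance, label_link("British computer scientists", "Computer scientists") gives "isA" and label_link("Islamic mysticism", "Islam") gives "notIsA", matching the two examples in the text. Links that return None would be left to the later, less accurate steps such as the pattern-based co-occurrence statistics.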
²⁵ Further information, and the extracted data, can be downloaded from http://www.dbpedia.org

Figure 15. Relations inferred from BY categories [Nastase and Strube 2008].

Since then the same research group has further refined semantic relations between
Wikipedia categories. Zirn et al. [2008] divide the derived isA relations into those
expressing isSubclassOf and isInstanceOf. For example, Category:American scientist
generalizes Category:American physicists, whereas Category:Albert Einstein is an
instance of Category:American physicists. Two methods assume that all named entities
are instances and thus related to their categories by isInstanceOf. One uses a named entity
recognizer, the other a heuristic based on capitalization in the category title. Further
methods include heuristics like: If a category has at least one hyponym that has at least
two hyponyms, it is a class. Evaluation against 8,00 categories listed in ResearchCyc as
individuals (instances) and collections (classes) shows that the capitalization method is
best, achieving 83% accuracy; however, combining all methods into a single voting
scheme improves this to 86%. The taxonomy derived from this work is available in RDF
Schema format.²⁶
 
Nastase and Strube [2008] extract non-taxonomical relations from Wikipedia by
parsing category titles. They are no longer just working with the category network but
also deriving entirely new relations between categories, articles and terms extracted from
category titles. Explicit unitary relations are extracted: for example, analysis of the
category title Queen (band) members results in the memberOf relation being inferred from
the articles in that category to the article for the band, e.g. memberOf(Brian May, Queen
(band)). Explicit binary relations are also extracted: for example, if a category title
matches the pattern X [VBN IN] Y, for instance Movies directed by Woody Allen, the verb
phrase is used to "type" a relation between all articles assigned to the category and the
entity Y, e.g. directedBy(Annie Hall, Woody Allen), while the class X is used to further
type the articles in the category, e.g. isA(Annie Hall, Movie).
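The two explicit cases just described can be sketched as follows. The sketch makes several assumptions that are not in the source: the part-of-speech pattern [VBN IN] is approximated by a regular expression over a small set of participle shapes and prepositions instead of a tagger, the class name is singularized by dropping a final "s", and the function and pattern names are hypothetical.

```python
import re

# Approximation of "X [VBN IN] Y": a class word, a past participle, a preposition, an entity.
VBN_IN = re.compile(
    r"^(?P<cls>\w+) (?P<verb>[a-z]+ed|written|built) (?P<prep>by|in|at|for) (?P<ent>.+)$"
)
# Unitary pattern: "<entity> members".
MEMBERS = re.compile(r"^(?P<ent>.+) members$")


def relations_from_category(title, articles):
    """Derive typed relations for the articles of a category from its title alone."""
    match = MEMBERS.match(title)
    if match:
        return [("memberOf", article, match.group("ent")) for article in articles]
    match = VBN_IN.match(title)
    if match:
        relation = match.group("verb") + match.group("prep").capitalize()
        cls = match.group("cls")
        cls = cls[:-1] if cls.endswith("s") else cls  # crude singular form
        triples = []
        for article in articles:
            triples.append((relation, article, match.group("ent")))
            triples.append(("isA", article, cls))
        return triples
    return []
```

Given the title "Movies directed by Woody Allen" and the member article "Annie Hall", the function returns directedBy and isA triples corresponding to the examples above; titles that match neither pattern yield nothing and would need the implicit X by Y analysis described next.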
Particularly sophisticated is their derivation of entirely implicit relations from the
very common X by Y pattern in Wikipedia category names, which facets a great deal of
the category structure (e.g. Writers By Nationality, Writers by Genre, Writers by
Language). For instance, given the category title Albums By Artist, they not only label
all the articles in the category isA(X, Album), but also find subcategories pertaining to
particular artists (e.g. MilesDavis, Albums), locate the article corresponding to the artist,
label the entity as an artist, e.g. isA(MilesDavis, Artist), and label all members of the
subcategory as being produced by him, e.g. artist(KindOfBlue, MilesDavis). Figure 15
illustrates this.
Nastase and Strube identify a total of 3.4 million isA and 3.2 million spatial
relations, along with 43,00 memberOf relations and 4,00 other relations such as
causedBy and writtenBy. Evaluation with ResearchCyc was not meaningful because of
little overlap in extracted concepts, particularly named entities. Instead, human
annotators analyzed four samples of 250 relations from the above sets; precision ranged
from 84 to 98% depending on relation type. Once again the implications of this work for
ontology building will be discussed in Section 6.6.
²⁶ http://www.eml-r.org/english/research/nlp/download/wikitaxonomy.php

Although the three approaches presented in this section (YAGO, DBPedia and
EMLR's taxonomy) have the same goal, to create an extensive, accurate knowledge base
of human language, the techniques differ significantly. The first combines Wikipedia's
leaf categories (and their instances) with WordNet's hypernym hierarchy, embellishing
this structure with further relations; the second basically dumps the contents of
Wikipedia's infoboxes with little further analysis; and the third performs a differentiation
or "typing" of category links, followed by an analysis of category titles and the articles
contained by those categories to derive further relations. As a result, the information
extracted varies. For instance, whereas Suchanek et al. [2007] extract the relation
writtenInYear, Nastase and Strube [2008] detect writtenBy and Auer and Lehmann [2007]
generate written, writtenBy, writer, writers, writerName, coWriters, as well as their case
variants. There has so far been little comparison of these approaches, testing of them
against each other or attempts to integrate them. We look forward to further research in
this area.
5.3 Typing Wikipedia's named entities
One main disadvantage of Wikipedia is its lack of semantic annotation. Infoboxes for
entities of the same kind share similar characteristics (for example, Apple Inc, Microsoft
and Google share the fields Founded, Headquarters, Key People and Products) but
Wikipedia does not state that they belong to the same type of named entity, namely
company. Knowing the type of entity, e.g., location or person, would supply
information that is important for tasks such as information retrieval and question
answering (Section 4). This section covers research that classifies articles into predefined
classes representing entity types. The results are semantic relations of a particular kind,
e.g. isA(London, Location).
Toral and Muños [2006] extract named entities from the Simple Wikipedia using
WordNet's noun hierarchy. Given an entry (Portugal) they extract the first sentence of
its definition (Portugal is a country in the south-west of Europe) and tag each word
with its part of speech. They assign nouns their first (i.e. most common) sense from
WordNet and move up in the hierarchy to determine its class, e.g., country → location.
The majority class appearing in the sentence determines the class of the article itself (i.e.
entity). The authors achieve 78% F-measure on 404 locations and 68% on 236 persons.
They do not use Wikipedia's special features but mention this as future work.
Buscaldi and Rosso [2007] pursue the same task, but concentrate on locations.
Unlike Toral and Muños [2006], they analyze not merely the first sentence but the entire
description of each article. In order to determine whether it describes a geographical
location, they compare its content with a set of keywords extracted from glosses of
locations in WordNet using the Dice metric and cosine coefficient; they also use a
multinomial Naïve Bayes classifier trained on the Wikipedia XML corpus [Denoyer and
Gallinari 2006]. When evaluated on data provided by Overell and Rüger [2007]
(described in Section 3.2.2) they find that cosine similarity outperforms both the
WordNet-based Dice metric and Naïve Bayes, achieving an F-measure of 53% on full
articles and 65% on the first sentence. However, the authors fail to achieve Overell and
Rüger's [2006] results, and conclude that the content of articles describing locations is
less discriminative than other features like geographical coordinates.
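For readers unfamiliar with the two similarity measures, the following sketch shows their standard definitions applied to an article's terms and a keyword list. How Buscaldi and Rosso tokenize, weight and threshold the scores is not described here, so the set-based Dice variant, the raw term-frequency vectors and the toy keyword list are all assumptions.

```python
import math
from collections import Counter


def dice(terms_a, terms_b):
    """Dice coefficient over the sets of distinct terms."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def cosine(terms_a, terms_b):
    """Cosine similarity over raw term-frequency vectors."""
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(va[term] * vb[term] for term in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical keywords from location glosses, and a toy article.
LOCATION_KEYWORDS = ["city", "capital", "country", "region"]
ARTICLE_TERMS = ["city", "river", "capital", "city"]
```

Dice ignores how often a term occurs, while cosine rewards repeated keyword hits, which is one plausible reason the two measures can rank the same articles differently. An article would be labeled a location when its score exceeds some threshold, which would have to be tuned on held-out data.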
Section 3.2.2 discussed how Overell and Rüger [2006, 2007] analyze named entities
representing geographic locations, thereby mapping articles to place names listed in a
gazetteer. It also described another group of approaches that recognize named entities
appearing in raw text and map them to articles. Apart from these, little research has been
done on determining the semantic types of named entities. It is surprising that both
techniques described in the present section use WordNet as a reference for the entities'
semantic class instead of referring to Wikipedia's categories. For example, the three
companies mentioned above belong to subcategories of Category:Companies and
Portugal is listed under Category:Countries. Moreover, neither technique utilizes the
shared infobox fields mentioned above. Annotating Wikipedia with entity labels seems to
be low-hanging fruit and we expect to see more advances in the near future.
Approaches to information extraction are less well defined than those for natural
language processing and most information retrieval tasks, and vary in their scope and
depth depending on the research group. There is a dearth of commonly used ground truth
data, each technique being evaluated in a different way. It seems that a unified
comprehensive general-purpose ontology would be the ideal extension of the research
discussed above. For instance, it could unify the specific relations concerning football
players and their birth dates extracted from article text with the wealth of taxonomic
relations in Wikipedia's category structure and any available named entity information.
Thus the next section reviews some of the projects described above, and others, from the
perspective of classical, large-scale ontology building.
6. ONTOLOGY BUILDING AND THE SEMANTIC WEB 
We now turn to the use of Wikipedia for creating ontologies: comprehensive, large-scale
information resources. Section 5 also covers aggregation of knowledge into forms
structured for automated reasoning. Nevertheless it is worth treating the topics separately,
because ontology building aims for a resource with a level of internal organization and
consistency not always found in information extraction. Hence while Section 5 describes
the many different methods used for the task, here we consider research projects from the
perspective of the comprehensiveness and sophistication of their results, and also the
extent to which they contribute to the broad-ranging and ambitious research project
known as the semantic web.
6.1 Background: What is Ontology? 
A formal ontology is a machine-readable theory of the meanings of some set of concepts
or "categories." Building such a resource involves naming the concepts, representing and
often categorizing the links between them, and usually encoding some key facts about
them. Thus it is generally thought that an ontology which includes the concept tree
should i) name it as a first-class object (to which synonyms such as the French arbre
may be attached), ii) link it to closely-related concepts such as leaf, preferably with some
indication that a leaf is part of a tree, rather than for instance a type of tree, and iii)
ideally represent facts such as "There are no trees in the Antarctic."
Having said that, there is a large spectrum of complexity and ambition amongst
ontology projects. One measure of complexity is the logical expressivity of the relevant
ontology language [McGuinness 2003], which has a direct trade-off with inferential
tractability, due to the vastly increased computation required to prove statements true in
more expressive languages. Expressivity ranges from thesaurus-style representations of
synonyms and homonyms, through frame-systems in which individuals are placed in
classes in a subsumption hierarchy, through description logics that constitute large
decidable fragments of first-order logic [Baader et al. 2007], to full first-order and even
higher-order logic, for instance the Cyc project, with its purpose-built inference engine
[Lenat 1995]. Ontology work began in earnest in the 1980s as a branch of AI research.
After an initial rush of enthusiasm, the trade-off between logical expressivity and
inferential tractability emerged and became a major obstacle, because much of the human
knowledge that arguably should be represented in an ontology can only be stated in
languages of great logical expressivity; for instance, negations and disjunctions require
full first-order logic, while statements about statements require higher-order logic.
Nevertheless, the goals of formal ontology have reawakened with the semantic web
[Berners-Lee et al. 2001; Berners-Lee 2003]. Since Berners-Lee's vision is to index the
web via meanings, not just character-strings, it is widely accepted that it will have to
draw on some kind of shared, machine-readable, conceptual scheme. But the big
stumbling block has been obtaining the world's involvement. At least two major
problems need to be solved: first to define "semantic metadata" and then to mark up the
web with it.
The World-Wide Web Consortium recently defined a web ontology language, OWL
[McGuinness and van Harmelen 2004]. It has three versions of different levels of
expressivity: OWL Lite (thesaurus level), OWL DL (description logic level) and OWL
Full (full first-order logic). But attempts to set up repositories for large-scale sharing and
re-use of OWL ontologies have failed to gain traction. It is worth emphasizing that the
manual creation of ontologies is enormously difficult. It requires detailed knowledge of
formal logic, and for the creation of upper and middle ontologies some understanding of
metaphysics (whether explicitly formulated or "quick and dirty"). Moreover, as size
increases, so do the interconnections amongst an ontology's categories, rendering the
potential ramifications of local changes exponentially more significant. Cyc, the most
ambitious ontology project, has employed specialist ontological engineers with PhDs in
philosophy over a period of 20 years without reaching any natural end-point to the
development process. Its nearest competitor, SUMO,²⁷ is an order of magnitude smaller.
Large ontologies have been created for specific, well-funded research areas such as
biomedical science, e.g. the Gene Ontology²⁸ and SNOMED,²⁹ but again with a huge
investment of labor. They are not without their problems [Smith et al. 2003], and have
to be continually updated. Projects in "ontology learning" have been tried but so far
achieved rather poor performance [Buitelaar 2005].
Could Wikipedia, with its abundance of free, up-to-the-minute contributions, high
visibility and remarkable consensus, be used to bypass these laborious ontology-creation
methods? Section 2.3.5 mentioned ways in which it may already be seen in this light:
its articles are basic concepts, both general concepts and named entities, arranged in some
kind of hierarchy via the category structure, and further organisable via a wealth of other
relations that may be mined from Wikipedia's structure. There is a vast quantity of
"domain-ontology" facts in structured and semi-structured form. On the downside,
however, as noted in Section 2.2.6, Wikipedia's category system seems currently
incapable of supporting principled knowledge inheritance, on pain of, for instance,
inferring isA(Domestic Pig, Pork). Finally, Wikipedia provides no means to perform
inferences over its various structures.
This section, like Section 5, is organized around the different kinds of features that 
researchers seek to mine from Wikipedia. However, because the task is now 
ontology-building, we consider a somewhat different list, namely: knowledge organization, 
named entities, synonymy relations and other thesaurus-type information, ontology 
alignments and finally full-blown facts. This research area may alternatively be broken down 
into projects that seek to augment already existing ontologies or knowledge bases, including 
Wikipedia itself, and those that build brand new resources, and we will see both kinds. 
                                                             
27 http://www.ontologyportal.org 
28 http://www.geneontology.org 
29 http://www.snowmed.org 
6.2 Knowledge Organization 
Halavais and Lackaff [2008] assess the overall breadth and comprehensiveness of 
Wikipedia's coverage of all knowledge. They ask whether the particular enthusiasms of 
volunteer editors produce excessive coverage of certain topics by comparing topic 
distribution in Wikipedia with that in Books In Print, and with a range of printed 
scholarly encyclopedias. They measure this using a Library of Congress categorization of 
3,000 randomly-chosen articles and find Wikipedia's coverage remarkably representative, 
except for law and medicine. 
Muchnik et al. [2007] recommend automatic generation of knowledge hierarchies. 
They develop five algorithms for organizing Wikipedia articles into a hierarchy, which 
they evaluate against Wikipedia's category hierarchy. They note that although the 
matches are not exact, the category hierarchy itself leaves much to be desired, so it would 
be fruitful to evaluate both against human benchmarks. 
6.3 Named Entities 
Turning now to named entities, Section 3.2.2 described detailed methods for 
disambiguating named entity terms by linking them to Wikipedia articles; Section 4.4 
covered named entity ranking for question answering; and Section 5.3 looked at ways of 
recognizing named entities in Wikipedia itself. 
Here it is worth highlighting Wikipedia's natural and straightforward role as an indexer 
of named entities. Regarding Wikipedia article URLs as URIs solves one of the most 
significant problems facing the semantic web: it is easy to create an XML/RDF namespace 
that names an entity, but difficult to publicize this URI, get anyone else to use it, or 
coordinate with other possible definitions of namespaces to represent the same things 
[Legg 2007]. Many authors have noted that Wikipedia, by contrast, enjoys all the broad 
acceptance and availability that semantic web proponents originally hoped for (e.g. Hepp 
et al. [2006], Bhole et al. [2007], McCool [2006]). However, using named entity URIs 
for semantic web purposes arguably awaits the arrival of URIs for further crucial features of 
human language, such as general terms (e.g. tree), and predicates (e.g. cut down). 
6.4 Thesaurus Information 
Section 3 discussed mining Wikipedia for "thesaurus-style information", namely 
semantic relatedness measures (Section 3.1) and word sense disambiguation (Section 3.2). 
Here we specifically discuss the use of Wikipedia to generate large-scale, independent, 
general and systematic thesauri. There is a natural bridge from this task to full-blown 
ontology-building, for once a system of terms is interconnected via links representing 
general semantic relatedness, these links may then be upgraded, or "typed", to more specific 
ontological relations. 
Gregorowicz and Kramer [2006] seek to construct a comprehensive term-concept map 
that will solve "the problem of variable terminology" and facilitate concept-based 
information retrieval by resolving synonyms in a systematic way. They use all 
Wikipedia articles as concepts, and establish synonyms via redirects and homonyms via 
disambiguation pages. The result is 2M concepts linked to 3M terms, a vast and 
impressive resource compared to WordNet's 115,000 synsets created from 150,000 
words. Likewise Nakayama et al. [2007, 2007, 2008] describe a project to build a large 
general-purpose thesaurus solely from Wikipedia's hyperlink structure, obtaining a 
thesaurus of 1.3M concepts with a measured strength of relatedness between each pair. 
They then suggest upgrading the thesaurus to a full-blown ontology by typing the 
generic relatedness measures between concepts into more traditional ontological relations 
such as isA and partOf. Details of how this will be done are sketchy. 
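The redirect-and-disambiguation construction described for the term-concept map can be illustrated with a minimal sketch. It is not Gregorowicz and Kramer's implementation: the function name, the lower-casing of terms and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def build_term_concept_map(articles, redirects, disambiguations):
    """Map each surface term to the set of concepts (article titles) it may denote."""
    term_to_concepts = defaultdict(set)
    # Every article title is a term for its own concept.
    for title in articles:
        term_to_concepts[title.lower()].add(title)
    # Redirects contribute synonyms: the alias denotes the target concept.
    for alias, target in redirects.items():
        if target in articles:
            term_to_concepts[alias.lower()].add(target)
    # Disambiguation pages contribute homonyms: one term, several concepts.
    for term, senses in disambiguations.items():
        for sense in senses:
            if sense in articles:
                term_to_concepts[term.lower()].add(sense)
    return term_to_concepts


# Toy data invented for this sketch, not taken from a Wikipedia dump.
articles = {"Automobile", "Python (programming language)", "Pythonidae"}
redirects = {"Car": "Automobile", "Motorcar": "Automobile"}
disambiguations = {"Python": ["Python (programming language)", "Pythonidae"]}
tcm = build_term_concept_map(articles, redirects, disambiguations)
```

In this toy map the terms "car" and "motorcar" resolve to the single concept Automobile (synonymy), while "python" remains ambiguous between two concepts (homonymy) and would still need disambiguation in context.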
The idea of link typing is developed in greater detail in [Krötzsch et al. 2005, 2007] 
and [Völkel et al. 2006]. Unlike Nakayama et al., however, they plan to apply it to 
Wikipedia's own hyperlink structure. They note the profusion of links between articles, 
all indicating some form of semantic relatedness, and then claim that categorizing them 
would be a simple, unintrusive way of rendering large parts of Wikipedia 
machine-readable. For instance, the existing hyperlink from Leaf to Plant would be labeled 
partOf, that from Leaf to Organ labeled kindOf, and so on. Categorizing all hyperlinks 
would be a significant task, and they recommend introducing a system of link types and 
encouraging the Wikipedia editors to start using them, and to suggest further types. 
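To make the link-typing proposal concrete, the following sketch shows how typed hyperlinks would translate directly into machine-readable triples. The Link structure and function name are hypothetical, and the untyped link from Leaf to Photosynthesis is an invented example; only the Leaf/Plant and Leaf/Organ labels come from the text.

```python
from typing import NamedTuple, Optional


class Link(NamedTuple):
    source: str
    target: str
    link_type: Optional[str] = None  # None marks an ordinary, untyped hyperlink


links = [
    Link("Leaf", "Plant", "partOf"),
    Link("Leaf", "Organ", "kindOf"),
    Link("Leaf", "Photosynthesis"),  # invented untyped link, for contrast
]


def typed_triples(links):
    """Return (subject, relation, object) triples for every typed link."""
    return [(l.source, l.link_type, l.target) for l in links if l.link_type is not None]


triples = typed_triples(links)
```

Untyped links contribute nothing here, which reflects why the proposal depends on editors actually adopting the link types.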
This raises interesting usability issues. Given that ontology is specialist knowledge 
(at least as traditionally practiced by ontological engineers), it might be argued that 
disaster could result if every Wikipedian were allowed to apply it in accord with 
Wikipedia's uniquely democratic editing model. On the other hand, one might ask why 
this is any different from other specialist additions to Wikipedia (e.g. cell biology, diesel 
locomotive engineering, Scottish jazz musicians), whose contributors show a remarkable 
ability to self-select, yielding surprising and impressive quality control. Perhaps the 
trickiest characteristic of ontology is that, unlike specialist topics such as cell biology, 
people think they are experts in it when in fact they are not. At any rate, this research is 
essentially a proposal for Wikipedia's developers to add further functionality, and its 
results cannot yet be evaluated. 
Like Krötzsch et al., Wu and Weld [2007, 2008] seek to augment Wikipedia itself. 
Their aim is to help kick-start the semantic web by marking up Wikipedia semantically 
in order to create enough structured data to make it worthwhile for developers to produce 
applications for it. To do this they propose a combination of automated and human 
processes. They investigate the use of machine learning techniques for completing 
infoboxes by extracting data from article text, constructing new infoboxes from templates 
where appropriate, rationalizing tags, merging replicated data using microformats, 
disambiguating links, adding additional links, and flagging items for verification, 
correction, or the addition of missing information. As with Krötzsch et al., it will be 
interesting to see whether Wikipedia editors will be eager to work on the collaborative 
side of this project, and also how effective they are. Furthermore, it is worth asking: 
even if these projects' aims were achieved and Wikipedia became a complete 
machine-readable knowledge base, would this bring about the semantic web? How exactly 
would its existence render the rest of the web machine-readable? 
Publications from EMLR that were discussed in detail in Section 5.2 may also be 
viewed under this heading of link-typing for ontology-building. We saw that these 
authors focused initially on Wikipedia's category network, aiming to discriminate 
between isA and notIsA links [Ponzetto and Strube 2007]. They then further 
discriminated between two kinds of isA: class-instance and subclass relationships [Zirn et 
al. 2008]. Unlike Krötzsch et al., and Wu and Weld, they seek to accomplish this task 
entirely automatically by deducing such relations from an analysis of the titles of 
interlinked categories. How do their results measure up as an ontology? They claim to 
derive 105,000 isA links, roughly one for each Wikipedia category. Evaluation of Zirn et 
al.'s results against the entirely manually created ResearchCyc yielded an accuracy of 
around 83%, which is impressive. However, though large and comparable with Cyc, this 
is still much smaller than the 2M concepts in Wikipedia's articles. Also, as a mere isA 
taxonomy it constitutes a relatively inexpressive frame-system-level ontology, lacking 
any further relations that might define the concepts in the hierarchy. Finally, though it 
has been released as a giant set of RDF triples, no ready means of performing inference 
over it yet seems available. 
Section 5.2 also described how the same research group turned in later work to 
parsing category titles and using them to derive new (typed) relations between Wikipedia 
articles [Nastase and Strube 2008]. Because this work qualifies as mining "facts" for 
ontology-building purposes, it is discussed in Section 6.6. 
6.5 Ontology Alignment 
Finding categories in different ontologies that in some sense "mean the same" can be a 
useful exercise in itself. If the resources are in the same language, string-matching on 
category titles goes a long way but is insufficient: homonyms in the mappings must be 
detected and eliminated. This task thus overlaps greatly with the word sense 
disambiguation problem discussed in Section 3.2. The problem cuts both ways: there 
may be one-to-many string matches from a concept in either of the mapped ontologies to 
concepts in the other. 
WordNet is a popular choice of ontology for alignment projects because it is simple 
and fairly large (frame-system level). Thus, as was described in Section 3.2.3, 
Ruiz-Casado et al. [2005] align Wikipedia articles with WordNet synsets, building a large 
general resource that marks up synsets with article URIs and bags of words from article 
text. However, other than the mapping itself this project adds no ontological value to 
WordNet, particularly since Wikipedia entries whose title string does not already appear 
in a synset were discarded. The authors' later work (described in Section 5.1) has shifted 
to extracting semantic relationships. Suchanek et al. [2007, forthcoming] also align 
WordNet and Wikipedia. However, discussion is deferred to Section 6.6 because they 
add many other relations as well. 
Medelyan and Legg [2008] map 50,000 Wikipedia articles to equivalent categories in 
ResearchCyc. Their ultimate aim is to create a resource combining Cyc's principled 
ontological structure with Wikipedia's messier but much more abundant information. 
Instead of selecting one resource as a base, they merely produce a list of pairs of 
equivalent concepts in both resources. They use methods described in Section 3.2.3 to 
determine genuine semantic similarity, following earlier work aligning a domain-specific 
thesaurus (Agrovoc) with Wikipedia [Medelyan and Milne 2008]. For each Cyc term, its 
surrounding ontology is used to gather a context for disambiguation, using the 
taxonomic relations #$genls, #$isa and some specific relations like #$countryOfCity and 
#$conceptuallyRelated. Then the most common Wikipedia article for each context term 
is identified and compared with all candidates for a mapping. A further test, reverse 
disambiguation, is applied when several Cyc terms map to the same Wikipedia article. 
First, mappings that score less than 30% of the highest score are eliminated. Then a 
common-sense test is applied to the remainder based on Cyc's ontological knowledge 
regarding disjointness between classes. If the best scoring Cyc term does not intersect 
with the second best one (that is, it represents "a different kind of thing"), the latter is 
eliminated; otherwise both mappings are accepted. An evaluation on 10,000 manually 
mapped terms provided by the Cyc Foundation, as well as a study with six human 
subjects, shows that performance of the mapping algorithm compares with the efforts of 
humans. 
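The two-step reverse disambiguation test can be sketched as follows. This is an illustration under stated assumptions rather than Medelyan and Legg's code: the function name, the are_disjoint callback (standing in for Cyc's disjointness knowledge) and the example terms and scores are hypothetical. The text describes the disjointness test only for the best and second-best terms; applying it to every surviving term is a generalization made here.

```python
def reverse_disambiguate(candidates, are_disjoint, threshold=0.3):
    """Filter Cyc terms that all map to the same Wikipedia article.

    candidates   -- list of (cyc_term, score) pairs
    are_disjoint -- callable(term_a, term_b) -> bool (hypothetical interface)
    threshold    -- fraction of the best score below which mappings are dropped
    """
    if not candidates:
        return []
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    best_term, best_score = ranked[0]
    # Step 1: drop mappings scoring less than 30% of the highest score.
    survivors = [term for term, score in ranked if score >= threshold * best_score]
    # Step 2: keep a lower-ranked term only if it is not disjoint with the best
    # term (assumption: the same test is applied to every survivor).
    return [best_term] + [t for t in survivors[1:] if not are_disjoint(best_term, t)]


# Hypothetical terms and scores, for illustration only.
scores = [("TermA", 0.9), ("TermB", 0.5), ("TermC", 0.2)]
disjoint_pairs = {frozenset(("TermA", "TermB"))}
kept = reverse_disambiguate(scores, lambda a, b: frozenset((a, b)) in disjoint_pairs)
kept_all = reverse_disambiguate(scores[:2], lambda a, b: False)
```

In the first call TermC falls below the 30% cut-off and TermB is removed as disjoint from TermA; in the second call, with no disjointness recorded, both mappings are accepted.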
6.6 Facts 
Now we turn to mining Wikipedia for what might be called full-blown facts, for the 
purpose of ontology building. This category is blurred by the difficulty of defining what 
exactly constitutes a fact; for example, the typing of links in Section 6.4 in some sense 
already qualifies. However, here we focus on projects that find and store entirely new 
literals, RDF triples and similar propositionally-structured entities. Sections 4 and 5 have 
covered much of this work; here we consider to what extent it has resulted in large-scale 
re-usable knowledge resources. 
First we consider those who use Wikipedia to add facts to existing ontologies. We 
saw in Section 5.2 that Suchanek et al. [2007; forthcoming] use information extraction 
methods to create an ontology named YAGO30 that unifies WordNet and Wikipedia. 
This contains 1M concepts and 5M facts about them, an impressive quantity. Table 5 
breaks down the number of different types of fact. The concepts are all WordNet synsets, 
Wikipedia leaf categories and all Wikipedia articles whose titles are not listed as 
common names in WordNet. This neatly bypasses the poor ontological quality of 
Wikipedia's category structure, WordNet's taxonomy being manually generated and far 
cleaner. It also avoids Ruiz-Casado et al.'s problem of omitting Wikipedia concepts 
whose titles do not appear in WordNet, although it still misses all proper names with 
WordNet synonyms, e.g. the programming language Python and the movie The Birds. 
In this way a graph-structured hierarchy of concepts is established, then embellished with 
facts harvested by a sophisticated suite of heuristics, many obtained by hand-picking 
popular patterns in the titles of Wikipedia categories and assigning relevant facts to all 
the instances of those categories. From an ontology-building perspective, these 
sophisticated automated methods are a real step forward, though only a tiny subset of 
category names has been parsed. For instance they do not address widespread patterns 
such as "X by Y" (e.g. Persons by continent, Persons by company, Persons by 
nationality and so on), which were analyzed by the EMLR group (Section 5.2). 

30 http://www.mpi-inf.mpg.de/~suchanek/downloads/yago/ 

Relation         Domain         Range    Number of facts 
subClassOf       class          class    143,210 
type             entity         class    1,901,130 
context          entity         entity   40,000,000 
describes        word           entity   986,628 
bornInYear       person         year     18,128 
diedInYear       person         year     92,607 
establishedIn    entity         year     13,619 
locatedIn        object         region   59,716 
writtenInYear    book           year     9,670 
politicianOf     organization   person   3,59 
hasWonPrize      person         prize    1,016 
means            word           entity   1,598,684 
familyNameOf     word           person   23,194 
givenNameOf      word           person   217,132 
Table 5. Size of YAGO (facts). 
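The idea of hand-picking popular category-title patterns and assigning the resulting facts to every instance of a matching category can be illustrated with a short sketch. The three regular expressions below are assumptions chosen to match common Wikipedia category titles and the relation names in Table 5; they are not YAGO's actual rule set.

```python
import re

# Hypothetical hand-picked patterns: (category-title regex, relation name).
CATEGORY_PATTERNS = [
    (re.compile(r"^(\d{4}) births$"), "bornInYear"),
    (re.compile(r"^(\d{4}) deaths$"), "diedInYear"),
    (re.compile(r"^(\d{4}) establishments$"), "establishedIn"),
]


def facts_from_categories(article, categories):
    """Derive (article, relation, value) facts from the article's category titles."""
    facts = []
    for category in categories:
        for pattern, relation in CATEGORY_PATTERNS:
            match = pattern.match(category)
            if match:
                facts.append((article, relation, match.group(1)))
    return facts


facts = facts_from_categories(
    "Albert Einstein", ["1879 births", "1955 deaths", "German physicists"]
)
```

Categories that match none of the patterns, such as "German physicists" here, yield no facts, which illustrates why coverage depends on how many title patterns have been parsed.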
YAGO has many features one seeks in a formal ontology. Its authors have defined a 
logic-based representation language and a basic data model of entities and binary 
relations, with a small extension to represent relations between facts (such as transitivity). 
This gives it formal rigor (the authors even provide a model-theoretic semantics) and 
the expressive power of a rich version of Description Logic. In terms of inferential 
tractability it compares favorably with the hand-crafted Cyc. A SPARQL interface 
(available online) allows queries of traditional knowledge-base logical complexity; for 
instance, when asked for billionaires born in the USA it came up with two (though it 
missed Bill Gates, since the project's methods do not cover Wikipedia's structured data 
completely). The authors plan to integrate their project with the latest version of 
OWL (released in 2007). They claim to have already noticed a positive feedback loop 
whereby as more facts are added, word senses can be disambiguated more effectively in 
order to correctly identify and enter further facts. Such a feedback loop was a 
long-standing ambition of AI researchers (e.g. Lenat [1995]), though claims that it was 
about to be achieved often turned out to be premature. 
Dataset             Description                                                  Triples 
Page links          Internal links between DBpedia instances derived from the    62 M 
                    internal page links between Wikipedia articles 
Infoboxes           Data attributes for concepts that have been extracted from   15.5 M 
                    Wikipedia infoboxes 
Articles            Descriptions of all 1.95 million concepts within the         7.6 M 
                    English Wikipedia. Includes titles, short abstracts, 
                    thumbnails and links to the corresponding articles 
Languages           Additional titles, short abstracts and Wikipedia article     5.7 M 
                    links in 13 other languages 
Article categories  Links from concepts to categories using SKOS                 5.2 M 
Extended abstracts  Additional, extended English abstracts                       2.1 M 
Language abstracts  Extended abstracts in 13 languages                           1.9 M 
Type information    Inferred from category structure and redirects by the        1.9 M 
                    YAGO ("yet another great ontology") project 
                    [Suchanek et al. 2007] 
External links      Links to external web pages about a concept                  1.6 M 
Categories          Information on which concepts are categories and how         1 M 
                    categories are related 
Persons             Information about 80,000 persons (date and place of          0.5 M 
                    birth etc.) represented using the FOAF vocabulary 
External links      Links between DBpedia and Geonames, US Census,               180 K 
                    Musicbrainz, Project Gutenberg, the DBLP bibliography 
                    and the RDF Book Mashup 
Table 6. Content of DBpedia [Auer et al. 2007]. 
By contrast, the flourishing and ambitious DBpedia project [Auer et al. 2007; Auer 
and Lehmann 2007] attempts to create an entirely new ontology by harvesting facts from 
Wikipedia. The facts are stored as a vast set of RDF triples. As noted in Section 5.2, this 
project strives to make all Wikipedia's structured information freely available in database 
form. Of all projects, it takes the most purely automated approach and gathers the largest 
quantity of structured data. The focus is on formatting patterns in the text of Wikipedia 
articles, notably infoboxes, though categorization and other links are also harvested. A 
staggering 103M "facts" (triples) are obtained. Like YAGO, the dataset can be queried 
via SPARQL and Linked Data, and connects with other open datasets on the web. Table 
6 summarizes its content. 
The project has already been influential; for instance, to test their document 
classification algorithm Janik and Kochut [2007] use slightly modified methods from 
DBpedia to create an RDF ontology from Wikipedia (Section 4.5). From a general 
ontology-building perspective, however, it has some weaknesses. First, there is little or no 
connection between the facts, and the knowledge is not organized into a hierarchy that 
enables inheritance (although, of course, as a giant database, state-of-the-art processing 
techniques can be brought to bear). Unlike YAGO it has no formally defined ontology 
language, and thus it would seem that many semantic relations amongst its triples will 
go unrecognized (e.g., that the first argument of the predicate artistOf might bear a 
relationship to the collection Artists). Second, although a formal evaluation of the 
resource's quality is not provided, a quick manual inspection reveals that large sections 
of the data have limited ontological value. For instance, 60% of the RDF triples are 
internal links derived from Wikipedia's link structure; only 15% are taken directly from 
infoboxes, and of those, the most common relation (over 10%) is the formatting relation 
wikiPageUsesTemplate. Amongst the properly ontological relations are many obvious 
redundancies not identified as such, e.g. placeOfBirth and birthPlace, dateOfBirth and 
birthDate. Finally, some individual relations contain poor-quality infobox data, for 
instance keyPeople assertions of the form "CEO" or "Bob". 
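Redundancies of the placeOfBirth/birthPlace kind are mechanical enough that even a naive check would surface them. The sketch below is an illustrative heuristic of our own, not part of DBpedia: it splits camel-case property names into words, drops the word "of", and groups names that share the same set of words.

```python
import re
from collections import defaultdict


def name_key(prop):
    """Order-insensitive key for a camelCase property name, ignoring 'of'."""
    tokens = re.findall(r"[A-Z]?[a-z]+", prop)
    return frozenset(t.lower() for t in tokens if t.lower() != "of")


def redundant_groups(properties):
    """Group property names that consist of the same words in a different order."""
    groups = defaultdict(list)
    for prop in properties:
        groups[name_key(prop)].append(prop)
    return [sorted(group) for group in groups.values() if len(group) > 1]


groups = redundant_groups(
    ["placeOfBirth", "birthPlace", "dateOfBirth", "birthDate", "keyPeople"]
)
```

A name-based test like this only proposes candidates; confirming that two relations are truly equivalent would require comparing the facts they assert.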
We come finally to the last phase of EMLR's project [Nastase and Strube 
2008]. We saw in Section 5.2 that this work consisted in parsing category titles, 
analyzing patterns in them and using that information to derive new relations between 
articles. They manage a deeper analysis of category titles than YAGO; in particular, they 
manage to crack open the extensive X by Y pattern and derive entirely implicit 
relations, as we saw above. In this way they manage to add a wealth of new ontological 
information to their existing taxonomy of 105,000 categories: 9M new facts, about twice 
the size of YAGO. The facts include 3.4 million isA and 3.2 million spatial relations, 
along with 43,000 memberOf relations and 44,000 other specific relations such as 
causedBy and writtenBy. The authors promise to release a new ontology containing these 
facts soon. It will be interesting to see whether they define a formally specified ontology 
language, as with YAGO (and if so how expressive it is), or merely dump out the data as 
with DBpedia (in which case the tools available for inferencing, and the complexity of 
supported queries, become paramount). 
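To give a flavor of how an "X by Y" category title can yield implicit relations, consider the following sketch. It is an assumption-laden illustration, not Nastase and Strube's algorithm: it supposes that a subcategory of an "X by Y" category is titled "<value> <X>", and that each member of that subcategory then stands in relation Y to the value. The category and article names are examples chosen for this sketch.

```python
import re

BY_PATTERN = re.compile(r"^(?P<x>.+) by (?P<y>.+)$")


def parse_by_category(title):
    """Split an 'X by Y' category title into (X, Y), or return None."""
    match = BY_PATTERN.match(title)
    return (match.group("x"), match.group("y")) if match else None


def implicit_relations(parent, subcategory, members):
    """Derive (member, Y, value) triples from a subcategory of an 'X by Y' category."""
    parsed = parse_by_category(parent)
    if parsed is None:
        return []
    x, y = parsed
    suffix = " " + x.lower()
    # Assumed title shape: '<value> <x>', e.g. 'Miles Davis albums'.
    if not subcategory.lower().endswith(suffix):
        return []
    value = subcategory[: -len(suffix)]
    return [(member, y, value) for member in members]


rels = implicit_relations("Albums by artist", "Miles Davis albums", ["Kind of Blue"])
```

The relation name ("artist") never appears in the member article's own categories; it is recovered from the parent title, which is the sense in which such relations are implicit.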
Table 7 shows the size of the larger ontologies. How much nearer does this work 
bring us to the semantic web? Great progress has been made on named entities (such as 
"Helen Clark"), for all that is needed to establish shared meaning for a named entity is a 
shared URI. General concepts (such as "tree") are more tricky. There is certainly a wealth 
of semantic information regarding such concepts in Wikipedia, but an almost total lack of 
consensus on how to extract and analyze it, let alone perform inference over it. Yet for the 
semantic web, this was the whole point. 
7. PEOPLE, PLACES AND RESOURCES 
The research described here is scattered across the globe; Figure 16 shows prominent 
countries and institutions. 
The US and Germany are the largest contributors. The US research is spread across many 
institutions. The University of North Texas, which works on entity recognition and 
disambiguation, produced the Wikify system. In the Pacific Northwest, Microsoft 
Research focuses on named entity recognition, while the University of Washington 
extracts semantic relations from Wikipedia's infoboxes. German research is more 
localized geographically. EML Research Institute works on relation extraction, semantic 
relatedness, and co-reference resolution; Darmstadt University of Technology on semantic 
relatedness and analyzing Wikipedia's structure. The Max Planck Institute produced the 
YAGO ontology; they collaborate with the University of Leipzig, who produced 
DBpedia. The University of Karlsruhe has focused on providing users with tools to add 
formal semantics to Wikipedia. 
                        Ontology       Entities   Facts 
Manually created        SUMO           20,000     60,000 
                        WordNet        117,597    207,016 
                        OpenCyc        47,000     306,000 
                        ResearchCyc    250,000    2,200,000 
Automatically derived   YAGO           1M         5M 
                        DBpedia        N/A        103M 
                        EMLR [2008]    105,000    9M 
Table 7. Size of ontologies (adapted from Suchanek et al. [2007]). 
Spain is Europe's next largest contributor. Universidad Autonoma de Madrid extracts 
semantic relations from Wikipedia; Universidad Politecnica de Valencia and Universidad 
de Alicante both use it to answer questions and recognize named entities. The 
Netherlands, France, and the UK are each represented by a single institution. The University 
of Amsterdam focuses on question answering; INRIA works primarily on entity ranking, 
and Imperial College on recognizing and disambiguating geographical locations. 
The Israel Institute of Technology has produced widely cited work on semantic 
relatedness, document representation and categorization. They developed the popular 
technique of Explicit Semantic Analysis. 
Hewlett Packard's branch in Bangalore puts India on the map with document 
categorization research. In China, Shanghai Jiaotong University works on relation 
extraction and category recommendation. In Japan, the University of Osaka has produced 
several open source resources, including a thesaurus and a bilingual (Japanese-English) 
dictionary. The University of Tokyo, in conjunction with the National Institute of 
Advanced Industrial Science and Technology, has focused on relation extraction. 
 
Australia (5): RMIT University 
New Zealand (8): Waikato University 
Japan (10): Osaka University; U. of Tokyo & AIST 
Austria (2): U. of Innsbruck 
China (5): Shanghai Jiaotong U. 
Germany (20): EML, Heidelberg; Darmstadt U. of Technology; Max Planck I., Saarbrücken; 
    University of Leipzig; University of Karlsruhe 
India (3): H.P. Bangalore 
Israel (5): Israel I. of Tech. 
Italy (2) 
Spain (9): U. Autonoma de Madrid; U. Politecnica de Valencia; U. of Alicante 
Netherlands (5): U. of Amsterdam 
United States (21): U. of North Texas; U. of Washington; Microsoft Research 
United Kingdom (4): Imp. College, London 
France (5): INRIA, Rocquencourt 
Figure 16. Countries and institutions with significant research on mining meaning from Wikipedia. 
New Zealand and Australia are each represented by a single institution. Research at 
the University of Waikato covers entity recognition, query expansion, topic indexing, 
semantic relatedness and augmenting existing knowledge bases. RMIT in Melbourne 
has collaborated with INRIA on entity ranking. 
Table 8 summarizes tools and resources, along with brief descriptions and URLs. 
The first part shows tools for accessing and processing Wikipedia. The second shows 
demos of Wikipedia mining applications. The third lists datasets that have been 
generated from Wikipedia. 
 
Processing tools 

JWPL (Java Wikipedia Library) 
    API for structured access to parts of Wikipedia such as redirects, categories, 
    articles and link structure [Zesch et al. 2008] 
    http://www.ukp.tu-darmstadt.de/software/jwpl/ 
WikiRelate! 
    API for computing semantic relatedness using Wikipedia [Strube and Ponzetto 
    2006; Ponzetto and Strube 2006] 
    http://www.eml-research.de/english/research/nlp/download/wikipediasimilarity.php 
Wikipedia Miner 
    API that provides simplified access to Wikipedia and models its structure 
    semantically [Milne et al. 2008] 
    http://sourceforge.net/projects/wikipedia-miner/ 
WikiPrep 
    A Perl tool for preprocessing Wikipedia XML dumps [Gabrilovich and 
    Markovitch 2007] 
    http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/ 
W.H.A.T. (Wikipedia Hybrid Analysis Tool) 
    An analytic tool for Wikipedia with two main functionalities: an article 
    network and extensive statistics. It contains a visualization of the article 
    networks and a powerful interface to analyze the behavior of authors. 
    http://sourceforge.net/projects/w-h-a-t/ 
  
Wikipedia mining demos 

DBpedia Online Access 
    Online access to DBpedia data (103M facts extracted from Wikipedia) via a 
    SPARQL query endpoint and as Linked Data [Auer et al. 2007] 
    http://wiki.dbpedia.org/OnlineAccess 
YAGO 
    Demo of the "Yet Another Great Ontology" (YAGO), containing 1.7M entities and 
    14M facts [Suchanek et al. 2007] 
    http://www.mpi.mpg.de/~suchanek/yago 
QuALiM 
    A question answering system. Given a question in natural language, it returns 
    relevant passages from Wikipedia [Kaiser 2008] 
    http://demos.inf.ed.ac.uk:8080/qualim/ 
Koru 
    A demo of a search interface that maps topics involved in both queries and 
    documents to Wikipedia articles. Supports automatic and interactive query 
    expansion [Milne et al. 2007] 
    http://www.nzdl.org/koru 
Wikipedia Thesaurus 
    A large-scale association thesaurus containing 78 million associations 
    [Nakayama et al. 2007 and 2008] 
    http://wikipedia-lab.org:8080/WikipediaThesaurusV2/ 
Wikipedia English-Japanese dictionary 
    A dictionary returning translations from English into Japanese and vice versa, 
    enriched with probabilities of these translations [Erdmann et al. 2007] 
    http://wikipedia-lab.org:8080/WikipediaBilingualDictionary/ 
Wikify 
    Automatically annotates any text with links to Wikipedia articles [Mihalcea 
    and Csomai 2007] 
    http://wikifyer.com/ 
Wikifier 
    Automatically annotates any text with links to Wikipedia articles describing 
    named entities 
    http://wikifier.labs.exalead.com/ 
Location query server 
    Location data accessible via REST requests returning data in a SOAP 
    envelope. Two requests are supported: a bounding box or a Wikipedia article. 
    The reply is the number of references made to locations within that bounding 
    box, and a list of Wikipedia articles describing those locations, or none if the 
    request is not a location [Overell and Rüger 2006 and 2007] 
    http://www.doc.ic.ac.uk/~seo01/wiki/demos 
 
Datasets 

DBpedia 
    Facts extracted from Wikipedia infoboxes and link structure in RDF format 
    [Auer et al. 2007] 
    http://wiki.dbpedia.org 
Wikipedia Taxonomy 
    Taxonomy automatically generated from the network of categories in 
    Wikipedia (RDF Schema format) [Ponzetto and Strube 2007; Zirn et al. 2008] 
    http://www.eml-research.de/english/research/nlp/download/wikitaxonomy.php 
Semantic Wikipedia 
    A snapshot of Wikipedia automatically annotated with named entity tags 
    [Zaragoza et al. 2007] 
    http://www.yr-bcn.es/semanticWikipedia 
Cyc to Wikipedia mappings 
    50,000 automatically created mappings from Cyc terms to Wikipedia articles 
    [Medelyan and Legg 2008] 
    http://www.cs.waikato.ac.nz/~olena/cyc.html 
Topic indexed documents 
    A set of 20 Computer Science technical reports indexed with Wikipedia 
    articles as topics. 15 teams of 2 senior CS undergraduates have independently 
    assigned topics from Wikipedia to each article [Medelyan et al. 2008] 
    http://www.cs.waikato.ac.nz/~olena/wikipedia.html 
Locations in Wikipedia, ground truth 
    A manually annotated sample of 100 Wikipedia articles. Each link in each 
    article is annotated as to whether it is a location or not. If it is, the annotation 
    contains the corresponding unique id from the TGN gazetteer [Overell and 
    Rüger 2006 and 2007] 
    http://www.doc.ic.ac.uk/~seo01/wiki/data_release 
Table 8. Wikipedia tools and resources. 
 
8. SUMMARY 
A whole host of researchers have been quick to grasp the potential of Wikipedia as a 
resource for mining meaning: the literature is large and growing rapidly. 
We began this article by describing Wikipedia's creation process and structure 
(Section 2). The unique open editing philosophy, which accounts for its success, is 
subversive. Although regarded as suspect by the academic establishment, it is a 
remarkable concrete realization of the American pragmatist philosopher Peirce's proposal 
that knowledge be defined through its public character and future usefulness rather than 
any prior justification. Wikipedia is not just an encyclopedia but can be viewed as 
anything from a corpus, taxonomy, thesaurus or hierarchy of knowledge topics to a 
full-blown ontology. It includes explicit information about synonyms (redirects) and word 
senses (disambiguation pages), database-style information (infoboxes), semantic network 
information (hyperlinks), category information (category structure), discussion pages, and 
the full edit history of every article. Each of these sources of information can be mined in 
various ways. 
Section 3 explains how Wikipedia is being drawn upon for natural language 
processing. Unlike WordNet, it was not created as a lexical resource that reflects the 
intricacies of human language. Instead, its primary goal is to provide encyclopedic 
knowledge across subjects and languages. However, the research described here 
demonstrates that it has, unexpectedly, immense potential as a repository of linguistic 
knowledge for natural language applications. In particular, its unique features allow 
well-defined tasks such as word sense disambiguation and word similarity to be addressed 
automatically, and the resulting level of performance is remarkably high. Researchers on 
co-reference resolution and mining of multilingual information have only recently 
discovered Wikipedia; significant improvements in these areas can be expected shortly. 
To our knowledge, its use as a resource for other tasks such as natural language 
generation, machine translation and discourse analysis has not yet been explored. These 
areas are ripe for exploitation, and exciting discoveries can be expected. 
Section 4 describes applications to information retrieval. Query expansion, document 
classification and topic indexing provide the best examples of applying Wikipedia for 
searching and organizing document collections. These areas can take advantage of its 
unique properties while grounding themselves in, and building upon, existing research. 
In particular, document classification has gathered momentum and significant advances 
have been obtained over the state of the art. Question answering and entity ranking are 
less well addressed, because they do not seem to take full advantage of Wikipedia: with a 
few exceptions they simply treat it as just another corpus and thus differ little from previous 
work. We found little evidence of cross-pollination between this work and the 
information extraction efforts described in Section 5. Given how closely question 
answering and entity ranking depend on the extraction of facts and entities, we expect this 
to become a fruitful line of enquiry. 
In Section 5 we turn to information extraction: mining text for topics, relations and
facts. Unlike the tasks in Sections 3 and 4, information extraction is not easy to define.
Different researchers focus on different kinds of information: we have reviewed research on
extracting information about movie directors and soccer players, composers, corporate
descriptions, and hierarchical and ontological relations. Techniques range from those
developed for standard text corpora to ones that utilize properties such as hyperlinks and
category structure. The extracted resources range in size from several hundred to several
million relations, but the lack of a common basis for evaluation prevents us from drawing
any conclusion as to which approach performs best.
Section 6 discusses the use of Wikipedia for ontology-building. Wikipedia's vast
quantity of structured information provides low-hanging fruit for automating this process.
Article names can serve as URIs for named entities; hyperlinks and redirects can be mined
for large-scale thesauri; the category structure can be treated as encoding taxonomic
information (though not always very well); and infoboxes are a rich source of domain
knowledge. From the perspective of large-scale general ontology building, the two most
impressive projects are YAGO and DBPedia. Which will turn out to be more useful, the
large but messy and low-quality DBPedia, or the smaller but more rigorous and accurate
YAGO? Meanwhile, EMLR's latest efforts (not yet released) promise to combine some of
the greater rigor of YAGO with the greater size of DBPedia. We believe that an
extrinsic evaluation would be most meaningful, and hope to see these systems compete
on a well-defined task in an independent evaluation. It will also be interesting to see to
what extent these resources are exploited by other research communities in the future.
Some authors have suggested using Wikipedia editors themselves to perform
ontology-building, an enterprise that might be thought of as mining Wikipedia's people
rather than its data. Perhaps they grasp the implications of the underlying driving force
behind this massively successful resource better than the rest of us! Only time will tell
whether the community is amenable to following such suggestions. The idea of moving
to a more structured and ontologically principled Wikipedia raises an interesting
question: how will it interact with the public, amateur-editor model? Does this signal the
long-awaited emergence of the semantic web? We suspect that, like the success of
Wikipedia itself, the result will be something new, something that experts have not
foreseen and may not condone. That is the glory of Wikipedia.
ACKNOWLEDGEMENTS 
We warmly thank Evgeniy Gabrilovich, Rada Mihalcea, Dan Weld, Fabian Suchanek
and the YAGO team for their valuable comments on a draft of this paper. Medelyan is
supported by a scholarship from Google, Milne by a New Zealand Tertiary Education
Commission Top Achiever Scholarship.
References 
ADAFRE, S.F., JIJKOUN, V., AND M. DE RIJKE. [2007] Fact Discovery in Wikipedia. In Proceedings of the 2007
IEEE/WIC/ACM International Conference on Web Intelligence.
ADAFRE, S.F., AND M. DE RIJKE. [2006] Finding Similar Sentences across Multiple Languages in Wikipedia. In
Proceedings of the EACL 2006 Workshop on New Text – Wikis and Blogs and Other Dynamic Text
Sources.
ADAFRE, S.F., AND M. DE RIJKE. [2005] Discovering Missing Links in Wikipedia. In Proceedings of the
LinkKDD 2005, August 21, 2005, Chicago, IL.
AGROVOC [1995] Multilingual agricultural thesaurus. Food and Agricultural Organization of the United
Nations. http://www.fao.org/agrovoc/
AHN, D., JIJKOUN, V., MISHNE, G., MÜLLER, K., DE RIJKE, M., AND S. SCHLOBACH. [2004] Using Wikipedia at
the TREC QA Track. In Proceedings of the 13th Text Retrieval Conference (TREC 2004).
ALLAN, J. [2005] HARD track overview in TREC 2005: High accuracy retrieval from documents. In
Proceedings of the 14th Text Retrieval Conference (TREC 2005).
AUER, S., BIZER, C., LEHMANN, J., KOBILAROV, G., CYGANIAK, R., AND Z. IVES [2007] DBpedia: A Nucleus
for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference and 2nd
Asian Semantic Web Conference (ISWC/ASWC2007), Busan, South Korea, 4825: 715–728, 2007.
AUER, S. AND J. LEHMANN. [2007] What have Innsbruck and Leipzig in common? Extracting Semantics from
Wiki Content. In Franconi et al. (eds), Proceedings of European Semantic Web Conference (ESWC'07),
LNCS 4519, p. 503–517, Springer, 2007.
BAADER, F., CALVANESE, D., MCGUINNESS, D. AND D. NARDI. [2007] The Description Logic Handbook:
Theory, Implementation and Applications. Cambridge: Cambridge University Press.
BAKER, L. [2008] Professor Bans Google & Wikipedia: Encourages Critical Thinking & Research. Search
Engine Journal, January 14th, 2008.
BANERJEE, S. [2007] Boosting Inductive Transfer for Text Classification Using Wikipedia. In Proceedings of
the 6th International Conference on Machine Learning and Applications (ICMLA), p. 148–153.
BANERJEE, S., RAMANATHAN, K. AND A. GUPTA. [2007] Clustering Short Texts using Wikipedia. In
Proceedings of the 30th Annual International ACM SIGIR conference on Research and Development in
Information Retrieval. Amsterdam, Netherlands. p. 787–788.
BANKO, M., CAFARELLA, M. J., SODERLAND, S., BROADHEAD, M. AND O. ETZIONI. [2007] Open information
extraction from the Web. In Proceedings of the 20th International Joint Conference on Artificial
Intelligence IJCAI'07, p. 2670–2676, January 2007.
BHOLE, A., FORTUNA, B., GROBELNIK, B. AND D. MLADENIĆ. [2007] Extracting Named Entities and Relating
Them over Time Based on Wikipedia. Informatica.
BELLOMI, F. AND R. BONATO. [2005] Network Analysis for Wikipedia. In Proceedings of the 1st International
Wikimedia Conference, Wikimania 2005. Wikimedia Foundation.
BERNERS-LEE, T., HENDLER, J., AND O. LASSILA. [2001]. The Semantic Web. Scientific American 284 (5),
34–43.
BERNERS-LEE, T. [2003]. Foreword. In D. Fensel, J. Hendler, H. Lieberman, and W. Wahlster (Eds.)
Spinning the Semantic Web: Bringing the World Wide Web to its Full Potential. Cambridge, MA: MIT
Press.
BRIN, S. AND L. PAGE. [1998] The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer
Networks and ISDN Systems, Vol. 3, p. 107–117.
BROWN, P., DELLA PIETRA, S., DELLA PIETRA, V., AND R. MERCER. [1993] The mathematics of statistical
machine translation: parameter estimation. Computational Linguistics, 19(2), 263–311.
BUDANITSKY, A. AND HIRST, G. [2001] Semantic distance in WordNet: An experimental, application-oriented
evaluation of five measures. Workshop on WordNet and Other Lexical Resources, Second meeting of the
North American Chapter of the Association for Computational Linguistics, Pittsburgh, PA.
BUITELAAR, P., CIMIANO, P., MAGNINI, B. (eds). [2005] Ontology Learning from Text: Methods, Evaluation
and Applications. Amsterdam, The Netherlands: IOS Press.
BUNESCU, B. AND PAŞCA, M. [2006] Using Encyclopedic Knowledge for Named Entity Disambiguation. In
Proceedings of the 11th Conference of the European Chapter of the Association for Computational
Linguistics, p. 9–16.
BUSCALDI, D. AND P. A. ROSSO. [2007] Comparison of Methods for the Automatic Identification of Locations
in Wikipedia. In Proceedings of the 4th ACM workshop on Geographical information retrieval, GIR'07.
Lisbon, Portugal, p. 89–92.
BUSCALDI, D. AND P. A. ROSSO. [2007] A Bag-of-Words Based Ranking Method for the Wikipedia Question
Answering Task. Evaluation of Multilingual and Multi-modal Information Retrieval, p. 550–553.
CAVNAR, W. B. AND J. M. TRENKLE. [1994] N-Gram-Based Text Categorization. In Proceedings of 3rd
Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV
Publications/Reprographics, p. 161–175.
CHERNOV, S., IOFCIU, T., NEJDL, W. AND X. ZHOU. [2006] Extracting Semantic Relationships between
Wikipedia Categories. In Proceedings of the 1st International Workshop: SemWiki'06 – From Wiki to
Semantics. Co-located with the 3rd Annual European Semantic Web Conference ESWC'06 in Budva,
Montenegro, June 12, 2006.
CIMIANO, P. AND J. VOLKER. [2005] Towards large-scale, open-domain and ontology-based named entity
classification. In Proceedings of the International Conference on Recent Advances in Natural Language
Processing, RANLP'05, p. 166–172. INCOMA Ltd., Borovets, Bulgaria, September 2005.
CSOMAI, A. AND R. MIHALCEA. [2007] Linking Educational Materials to Encyclopedic Knowledge. Frontiers
in Artificial Intelligence and Applications, v.158, p. 557–559. IOS Press, Netherlands.
CUCERZAN, S. [2007] Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In Proceedings
of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational
Natural Language Learning, p. 708–716, Prague, Czech Republic, June 2007.
CULOTTA, A., MCCALLUM, A. AND J. BETZ. [2006]. Integrating Probabilistic Extraction Models and Data
Mining to Discover Relations and Patterns in Text. In Proceedings of the main conference on Human
Language Technology Conference of the North American Chapter of the Association of Computational
Linguistics. New York, NY, p. 296–303.
DAKKA, W. AND S. CUCERZAN. [2008]. Augmenting Wikipedia with Named Entity Tags. In Proceedings of the
3rd International Joint Conference on Natural Language Processing (IJCNLP 2008), Hyderabad.
DENNING, P., HORNING, J., PARNAS, D., AND WEINSTEIN, L. [2005]. Wikipedia Risks. In Communications of the
ACM 48(12), p. 152–152.
DENOYER, L. AND GALLINARI, P. [2006] The Wikipedia XML corpus. SIGIR Forum, 40(1), p. 64–69, ACM
Press.
DONDIO, P., BARRETT, S., WEBER, S., AND SEIGNEUR, J. [2006] Extracting Trust from Domain Analysis: A
Case Study on the Wikipedia Project. Autonomous and Trusted Computing, p. 362–373.
DUMAIS, S., PLATT, J., HECKERMAN, D. AND M. SAHAMI. [1998] Inductive learning algorithms and
representations for text categorization. In Proceedings of the 7th international conference on Information
and knowledge management, p. 148–155.
EDMONDS, P. AND KILGARRIFF, A. [2002] Introduction to the special issue on evaluating word sense
disambiguation systems. Journal of Natural Language Engineering, 8(4), p. 279–291. Cambridge
University Press, New York, NY, USA.
EMIGH, W. AND HERRING, S. [2005] Collaborative Authoring on the Web: A Genre Analysis of Online
Encyclopedias. In Proceedings of the 38th Hawaii International Conference on System Sciences, p. 99a.
ERDMANN, M., NAKAYAMA, K., HARA, T., AND S. NISHIO. [2008] An Approach for Extracting Bilingual
Terminology from Wikipedia. In Proceedings of the 13th International Conference on Database Systems
for Advanced Applications (DASFAA, To appear).
FELLBAUM, C. (editor). [1998] WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
FERRÁNDEZ, F., TORAL, A., FERRÁNDEZ, Ó., FERRÁNDEZ, A., AND R. MUÑOZ. [2007] Applying Wikipedia's
Multilingual Knowledge to Cross-Lingual Question Answering. In Proceedings of the 12th International
Conference on Applications of Natural Language to Information Systems, Paris, France, p. 352–363.
June 2007
FINKELSTEIN, L., GABRILOVICH, E., MATIAS, Y., RIVLIN, E., SOLAN, Z., WOLFMAN, G., AND E. RUPPIN. [2002]
Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1), p.
116–131.
FRANK, E., PAYNTER, G. W., WITTEN, I. H., GUTWIN, C. AND C. G. NEVILL-MANNING. [1999]
Domain-Specific Keyphrase Extraction. In Proceedings of the 16th International Joint Conference on Artificial
Intelligence, IJCAI'99, Stockholm, Sweden, p. 668–673.
GABRILOVICH, E. AND S. MARKOVITCH. [2007] Computing Semantic Relatedness using Wikipedia-based
Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial
Intelligence, IJCAI'07, Hyderabad, India, January 2007, p. 1606–1611.
GABRILOVICH, E. AND MARKOVITCH, S. [2006] Overcoming the Brittleness Bottleneck using Wikipedia:
Enhancing Text Categorization with Encyclopedic Knowledge, Proceedings of The 21st National
Conference on Artificial Intelligence (AAAI), p. 1301–1306, Boston, July 2006
GILES, J. [2005] Internet Encyclopaedias Go Head to Head. In Nature 138(15), 14 December 2005.
GLEIM, R., MEHLER, A. AND M. DEHMER. [2007] Web Corpus Mining by Instance of Wikipedia. In
Kilgarriff, Adam; Baroni, Marco (eds.) Proceedings of the EACL 2006 Workshop on Web as Corpus,
Trento, Italy, April 3–7, 2006, p. 67–74.
GREGOROWICZ, A. AND M. A. KRAMER. [2006] Mining a Large-Scale Term-Concept Network from
Wikipedia. Mitre Technical Report 06-1028, October 2006.
HALAVAIS, A. AND LACKAFF, D. [2008] An Analysis of Topical Coverage of Wikipedia. Journal of
Computer-Mediated Communication, 13(2), p. 429–440.
HALLER, H., KRÖTZSCH, M., VÖLKEL, M., AND D. VRANDECIC. [2006] Semantic Wikipedia (software demo).
In Proceedings of the 2006 International Symposium on Wikis, p. 137–138. ACM Press, August 2006.
HATCHER, E. AND O. GOSPODNETIC. [2004] Lucene in Action. Manning Publications, Greenwich, CT.
HAVELIWALA, T. H. [2003] Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search.
IEEE transactions on knowledge and data engineering, 15(4), p. 784–796.
HERBELOT, A. AND A. COPESTAKE. [2006] Acquiring Ontological Relationships from Wikipedia Using RMRS.
In Proc. International Semantic Web Conference 2006 Workshop on Web Content Mining with Human
Language Technologies, Athens, GA.
HEPP, M., BACHLECHNER, D., AND K. SIORPAES. [2006] Harvesting Wiki Consensus – Using Wikipedia Entries
as Ontology Elements. In Proceedings of the 1st International Workshop: SemWiki'06 – From Wiki to
Semantics. Co-located with the 3rd Annual European Semantic Web Conference ESWC'06 in Budva,
Montenegro, June 12, 2006.
HIGASHINAKA, R., DOHSAKA, K., AND H. ISOZAKI. [2007] Learning to Rank Definitions to Generate Quizzes
for Interactive Information Presentation, in Companion Volume to the Proceedings of the 45th Annual
Meeting of the Association for Computational Linguistics, p. 117–120
HUANG, W.C., TROTMAN, A., AND S. GEVA. [2007] Collaborative Knowledge Management: Evaluation of
Automated Link Discovery in the Wikipedia. In Proceedings of the Workshop on Focused Retrieval at
SIGIR 2007, July 27, 2007, Amsterdam.
IDE, N. AND J. VÉRONIS (editors). [1998] Word Sense Disambiguation. Special issue of Computational
Linguistics, 24(1).
JIANG, J. J. AND D. W. CONRATH. [1997] Semantic similarity based on corpus statistics and lexical
taxonomy. In Proceedings of the 10th International Conference on Research in Computational Linguistics,
ROCLING'97. Taiwan.
JIJKOUN, V. AND M. DE RIJKE. [2006] Overview of the WiQA task at CLEF 2006. In: C. Peters et al. (editors).
Evaluation of Multilingual and Multi-modal Information Retrieval. 7th Workshop of the Cross-Language
Evaluation Forum, CLEF 2006, Alicante, Spain, September 20–22, 2006, Revised Selected Papers, LNCS
4730, p. 265–274, September 2007
JANIK, M. AND K. KOCHUT. [2007] Wikipedia in Action: Ontological Knowledge in Text Categorization,
University of Georgia, Computer Science Department Technical Report no. UGA-CS-TR-07-001.
KAISSER, M. [2008] The QuALiM Question Answering Demo: Supplementing Answers with Paragraphs
drawn from Wikipedia. In Proceedings of the ACL-08 HLT Demo Session, Columbus, Ohio, p. 32–35.
KASNECI, G., SUCHANEK, F.M., IFRIM, G., RAMANATH, M. AND G. WEIKUM. [2007] NAGA: Searching and
Ranking Knowledge. In Proceedings of the 24th IEEE International Conference on Data Engineering,
ICDE'08, Cancun, Mexico, 7–12 April 2008, p. 953–962.
KASSNER, L., NASTASE, V., AND M. STRUBE. [2008] Acquiring a Taxonomy from the German Wikipedia. To
appear in Proceedings of LREC 2008.
KAZAMA, J. AND K. TORISAWA. [2007] Exploiting Wikipedia as External Knowledge for Named Entity
Recognition. In Proceedings of the Joint Conference on Empirical Methods in Natural Language
Processing and Computational Natural Language Learning, p. 698–707.
KINZLER, D. [2005] WikiSense: Mining the Wiki, v 1.1. In Proceedings of the 1st International Wikimedia
Conference, Wikimania 2005. Wikimedia Foundation.
KITTUR, A., SUH, B., PENDLETON, B.A. AND CHI, E.H. [2007] He says, she says: Conflict and Coordination in
Wikipedia. In CHI, p. 453–462.
KLEINBERG, J. [1998] Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46,
p. 604–632.
KLAVANS, J. L. AND P. RESNIK. [1996] The balancing act: combining symbolic and statistical approaches to
language. Cambridge, MA: MIT Press.
KRIZHANOVSKY, A. [2006] Synonym Search in Wikipedia: Synarcher. In Proceedings of the 11th International
Conference "Speech and Computer" SPECOM'06. Russia, St. Petersburg, June 25–29, 2006, p. 474–477.
KRÖTZSCH, M., VRANDECIC, D., VÖLKEL, M., HALLER, H., AND R. STUDER. [2007] Semantic Wikipedia.
Journal of Web Semantics, 5, p. 251–261.
KRÖTZSCH, M., VRANDECIC, D. AND M. VÖLKEL. [2005] Wikipedia and the Semantic Web – The Missing
Links. In Proceedings of the 1st International Wikimedia Conference, Wikimania 2005. Wikimedia
Foundation.
LEACOCK, C., AND M. CHODOROW. [1998] Combining local context and WordNet similarity for word sense
identification. In Fellbaum, C. (editor), WordNet: An Electronic Lexical Database. Chapter 11,
p. 265–283. Cambridge, MA: MIT Press.
LEHTONEN, M. AND A. DOUCET. [2007] EXTIRP: Baseline Retrieval from Wikipedia. Comparative
Evaluation of XML Information Retrieval Systems, p. 115–120.
LEGG, C. [2007] Ontologies on the Semantic Web. Annual Review of Information Science and Technology
41, p. 407–452.
LENAT, D. B. [1995] Cyc: A Large-Scale Investment in Knowledge Infrastructure. Communications of the
ACM 38(11).
LIPSCOMB, C.E. [2000] Medical Subject Headings (MeSH). In Bulletin of the Medical Library Association
88(3), p. 265.
LI, B., CHEN, Q., YEUNG, D.S., NG, W.W.Y., WANG, X. [2007] Exploring Wikipedia and Query Logs Ability
for Text Feature Representation. In Proceedings of the International Conference on Machine Learning
and Cybernetics, Hong Kong, 19–22 August 2007, v. 6, p. 3343–3348.
LI, Y., LUK, R. W. P., HO, E. K. S., CHUNG, K. F. [2007] Improving weak ad-hoc queries using Wikipedia as
external corpus. In Kraaij et al. (editors) Proceedings of the 30th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, SIGIR'07, Amsterdam, The
Netherlands, July 23–27, 2007, p. 797–798. ACM Press.
LIH, A. [2004] Wikipedia as Participatory Journalism: Reliable Sources? Metrics for Evaluating Collaborative
Media as a News Source. In Proceedings of the 5th International Symposium on Online Journalism.
MAGNUS, P. D. [2006] Epistemology and the Wikipedia. In Proceedings of the North American Computing
and Philosophy Conference, Troy, New York, August 2006.
MAYS, E., DAMERAU, F. J. AND R. L. MERCER. [1991] Context-based spelling correction. Information
Processing and Management 27(5), p. 517–522.
MCCOOL, R. [2006]. Rethinking the Semantic Web, Part 2. IEEE Internet Computing 10(1), p. 93–96.
MCGUINNESS, D. [2003]. Ontologies Come of Age. In D. Fensel, et al. (editors) Spinning the Semantic Web:
Bringing the World Wide Web to Its Full Potential. Cambridge, MA: MIT Press.
MCGUINNESS, D. AND F. VAN HARMELEN. [2004] OWL Web Ontology Language: Overview.
http://www.w3.org/TR/owl-features/
MEDELYAN, O. AND D. MILNE. [2008] Augmenting domain-specific thesauri with knowledge from Wikipedia.
In Proceedings of the NZ Computer Science Research Student Conference, Christchurch, NZ.
MEDELYAN, O., WITTEN, I. H., AND D. MILNE. [2008] Topic Indexing with Wikipedia.
To appear in Proceedings of the WIKI-AI: Wikipedia and AI Workshop at the AAAI'08 Conference,
Chicago, US.
MEDELYAN, O. AND C. LEGG. [2008] Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined
common-sense. To appear in Proceedings of the WIKI-AI: Wikipedia and AI Workshop at the AAAI'08
Conference, Chicago, US.
MIHALCEA, R. [2007] Using Wikipedia for Automatic Word Sense Disambiguation. In Proceedings of the
Human Language Technologies 2007: The Conference of the North American Chapter of the Association
for Computational Linguistics, Rochester, New York, April 2007
MIHALCEA, R. AND D. MOLDOVAN. [2001] Automatic generation of a coarse grained WordNet. In
Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources. Pittsburgh, PA.
MIHALCEA, R. AND A. CSOMAI. [2007] Wikify! Linking Documents to Encyclopedic Knowledge. In
Proceedings of the 16th ACM Conference on Information and Knowledge Management, CIKM'07,
Lisbon, Portugal, November 6–8, 2007, p. 233–241.
MILLER, E. [1998] An Introduction to the Resource Description Framework. Bulletin of the American Society
for Information Science 25(1), p. 15–19.
MILLER, G. A., AND W. G. CHARLES. [1991] Contextual correlates of semantic similarity. Language and
Cognitive Processes 6(1), p. 1–28.
MILNE, D., MEDELYAN, O. AND I. H. WITTEN. [2006] Mining domain-specific thesauri from Wikipedia: A
case study. In Proceedings of the International Conference on Web Intelligence (IEEE/WIC/ACM
WI'2006), Hong Kong.
MILNE, D., WITTEN, I. H. AND D. M. NICHOLS. [2007] A Knowledge-Based Search Engine Powered by
Wikipedia. In Proceedings of the 16th ACM Conference on Information and Knowledge Management,
CIKM'07, Lisbon, Portugal, November 6–8, 2007, p. 445–454.
MILNE, D. [2007] Computing Semantic Relatedness using Wikipedia Link Structure. In Proceedings of the
New Zealand Computer Science Research Student Conference, NZ CSRSC'07, Hamilton, New Zealand.
MILNE, D. AND I. H. WITTEN. [2008] Learning to link with Wikipedia. Forthcoming
MINIER, Z., ZALAN, B. AND L. CSATO. [2007] Wikipedia-Based Kernels for Text Categorization. In
Proceedings of the International Symposium on Symbolic and Numeric Algorithms for Scientific
Computing, SYNASC'07, IEEE Computer Society Washington, DC, USA. p. 157–164.
MUCHNIK, L., ITZHACK, R., SOLOMON, S. AND Y. LOUZOUN. [2007] Self-emergence of Knowledge Trees:
Extraction of the Wikipedia Hierarchies, in Physical Review E 76(1).
NAKAYAMA, K., HARA, T., AND S. NISHIO. [2007] Wikipedia: A New Frontier for AI Researches. Journal of
the Japanese Society for Artificial Intelligence 22(5), p. 693–701.
NAKAYAMA, K., HARA, T., AND S. NISHIO. [2008] A Search Engine for Browsing the Wikipedia Thesaurus. In
Proceedings of the 13th International Conference on Database Systems for Advanced Applications, Demo
session (DASFAA'08), p. 690–693.
NAKAYAMA, K., ITO, M., HARA, T. AND S. NISHIO. [2008] Wikipedia Mining for Huge Scale Japanese
Association Thesaurus Construction. In Workshop Proceedings of the 22nd International Conference on
Advanced Information Networking and Applications, AINA'08, GinoWan, Okinawa, Japan, March 25–28,
2008, p. 1150–1155. IEEE Computer Society.
NAKAYAMA, K., HARA, T., AND S. NISHIO. [2007] A Thesaurus Construction Method from Large Scale Web
Dictionaries. In Proceedings of the 21st IEEE International Conference on Advanced Information
Networking and Applications, AINA'07, May 21–23, 2007, Niagara Falls, Canada, p. 932–939. IEEE
Computer Society.
NAKAYAMA, K., HARA, T., AND S. NISHIO. [2007] Wikipedia Mining for an Association Web Thesaurus
Construction. In Proceedings of the 8th International Conference on Web Information Systems
Engineering, WISE'07, Nancy, France, December 3–7, 2007, p. 322–334. Lecture Notes in Computer
Science 4831 Springer.
NASTASE, V. AND M. STRUBE. [2008] Decoding Wikipedia Categories for Knowledge Acquisition. To appear
in Proceedings of the AAAI'08 Conference, Chicago, US.
NELKEN, R. AND E. YAMANGIL. [2008] Mining Wikipedia's Article Revision History for Training
Computational Linguistic Algorithms. In Proceedings of the WIKI-AI: Wikipedia and AI Workshop at the
AAAI'08 Conference, Chicago, US.
NGUYEN, D. P. T., MATSUO, Y., AND M. ISHIZUKA. [2007] Relation Extraction from Wikipedia Using Subtree
Mining. In Proceedings of the AAAI'07 Conference, p. 1414–1420, Vancouver, Canada, July 2007.
NGUYEN, D. P. T., MATSUO, Y., AND M. ISHIZUKA. [2007] Subtree Mining for Relation Extraction from
Wikipedia. In Proceedings of the HLT-NAACL 2007, p. 125–128.
NGUYEN, D. P. T., MATSUO, Y., AND M. ISHIZUKA. [2007] Exploiting Syntactic and Semantic Information for
Relation Extraction from Wikipedia. In Proceedings of the IJCAI Workshop on Text-Mining and
Link-Analysis, TextLink'07.
OLLIVIER, Y. AND P. SENELLART. [2007] Finding Related Pages Using Green Measures: An Illustration with
Wikipedia. In Proceedings of the AAAI'07 Conference, p. 1427–1433, Vancouver, Canada, July 2007.
OVERELL, S. E. AND S. RÜGER. [2007] Geographic co-occurrence as a tool for GIR. In Proceedings of the 4th
ACM Workshop on Geographical Information Retrieval. Lisbon, Portugal.
OVERELL, S. E. AND S. RÜGER. [2006] Identifying and grounding descriptions of places. In Proceedings of the
3rd ACM workshop on Geographical Information Retrieval at SIGIR.
PEIRCE, C.S. [1877] The Fixation of Belief. Popular Science Monthly 12 (Nov. 1877), p. 1–15.
PEI, M., NAKAYAMA, K., HARA, T. AND NISHIO, S. [2008] Constructing a Global Ontology by Concept Mapping
using Wikipedia Thesaurus. In Proceedings of the 22nd International Conference on Advanced
Information Networking and Applications, AINA'08, GinoWan, Okinawa, Japan, March 25–28, 2008, p.
1205–1210. IEEE Computer Society.
PONZETTO, S. P. AND M. STRUBE. [2006]. Exploiting Semantic Role Labeling, WordNet and Wikipedia for
Coreference Resolution. In Proceedings of HLT-NAACL '06, p. 192–199.
PONZETTO, S. P. AND M. STRUBE. [2007a]. Knowledge Derived from Wikipedia for Computing Semantic
Relatedness. Journal of Artificial Intelligence Research 30, p. 181–212
PONZETTO, S. P. AND M. STRUBE. [2007b]. Deriving a Large Scale Taxonomy from Wikipedia. In
Proceedings of AAAI '07, p. 1440–1445.
PONZETTO, S. P. AND M. STRUBE. [2007c]. An API for Measuring the Relatedness of Words in Wikipedia. In:
Companion Volume of the Proceedings of the 45th Annual Meeting of the Association for Computational
Linguistics, Prague, Czech Republic, 23–30 June, 2007, p. 49–52.
PONZETTO, S. P. [2007] Creating a knowledge base from a collaboratively generated encyclopedia. In:
Proceedings of the Human Language Technology Conference of the North American Chapter of the
Association for Computational Linguistics Doctoral Consortium, Rochester, NY, 22–27 April, 2007,
p. 9–12.
POTTHAST, M., STEIN, B., AND M. A. ANDERKA [2008] Wikipedia-Based Multilingual Retrieval Model. In
Proceedings of the 30th European Conference on IR Research, ECIR'08, Glasgow.
POTTHAST, M. [2007] Wikipedia in the pocket: indexing technology for near-duplicate detection and high
similarity search. In Proceedings of the 30th International ACM SIGIR Conference on Research and
Development in Information Retrieval.
QUINE, W.V.O. [1960] Word and Object. Cambridge, MA: MIT Press.
RANSDELL, J. [2003] The Relevance of Peircean Semiotic to Computational Intelligence augmentation. SEED
Journal (Semiotics, Evolution, Energy, and Development).
RESNIK, P. [1999] Semantic similarity in a taxonomy: An information-based measure and its application to
problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11, p. 95–130.
RUBENSTEIN, H., AND J. GOODENOUGH. [1965] Contextual correlates of synonymy. Communications of the
ACM 8(10), p. 627–633.
RUIZ-CASADO, M., ALFONSECA, E., AND P. CASTELLS. [2005] Automatic assignment of Wikipedia
Encyclopedic Entries to WordNet synsets. In Proceedings of AWIC'05.
RUIZ-CASADO, M., ALFONSECA, E., AND P. CASTELLS. [2007] Automatising the learning of lexical patterns: An
application to the enrichment of WordNet by extracting semantic relationships from Wikipedia. Data
Knowledge and Engineering 61(3), p. 484–499.
RUIZ-CASADO, M., ALFONSECA, E., AND P. CASTELLS. [2006] From Wikipedia to Semantic Relationships: a
Semi-automated Annotation Approach. In Proceedings of the 1st International Workshop: SemWiki'06 –
From Wiki to Semantics. Co-located with the 3rd Annual European Semantic Web Conference ESWC'06
in Budva, Montenegro, June 12, 2006.
RUIZ-CASADO, M., ALFONSECA, E., AND P. CASTELLS. [2005] Automatic Extraction of Semantic Relationships
for WordNet by Means of Pattern Learning from Wikipedia. In Proceedings of the 10th International
Conference on Applications of Natural Language to Information Systems, NLDB'05, p. 67–79, Alicante,
Spain, June 15–17, 2005.
RUTHVEN, I. AND M. LALMAS. [2003] A survey on the use of relevance feedback for information access
systems. Knowledge Engineering Review 18(2), p. 95–145.
SCHOENHOFEN, P. [2006] Identifying Document Topics Using the Wikipedia Category Network. In
Proceedings of the International Conference on Web Intelligence (IEEE/WIC/ACM WI'2006), Hong
Kong.
SMITH, B., WILLIAMS, J., AND S. SCHULZE-KREMER. [2003]. The Ontology of the Gene Ontology. In
Proceedings of AMIA Symposium, p. 609–613.
SOON, W. M., NG, H. T., AND D. C. Y. LIM [2001]. A machine learning approach to coreference resolution of
noun phrases. Computational Linguistics 27(4), p. 521–544.
SOWA, J. [2004]. The Challenge of Knowledge Soup. http://www.jfsowa.com/pubs/challenge.pdf.
STVILIA, B., TWIDALE, M. B., GASSER, L., AND L. SMITH. [2005]. Information Quality Discussions in
Wikipedia. Graduate School of Library and Information Science, University of Illinois at
Urbana-Champaign. Technical Report ISRN UIUCLIS–2005/2+CSCW
STRUBE, M. AND PONZETTO, S.P. [2006]. WikiRelate! Computing Semantic Relatedness Using Wikipedia. In:
AAAI '06, p. 1419–1424.
SUCHANEK, F. M., KASNECI, G., AND G. WEIKUM. [2007] Yago: a core of semantic knowledge. Proc 16th
World Wide Web Conference, WWW'07. New York, NY: ACM Press.
SUCHANEK, F. M., KASNECI, G., AND G. WEIKUM. [forthcoming] Yago: A Large Ontology from Wikipedia and
WordNet. Elsevier Journal of Web Semantics.
SUCHANEK, F. M., IFRIM, G., AND G. WEIKUM. [2006] Combining Linguistic and Statistical Analysis to Extract
Relations from Web Documents. In Proceedings of the Knowledge Discovery and Data Mining
Conference, KDD'06.
SUH, S., HALPIN, H., AND E. KLEIN. [2006] Extracting Common Sense Knowledge from Wikipedia. In
Proceedings of the ISWC'06 Workshop on Web Content Mining with Human Language technology.
SYED, Z., FININ, T., AND A. JOSHI. [2008] Wikipedia as an Ontology for Describing Documents. In
Proceedings of the 2nd International Conference on Weblogs and Social Media, AAAI, March 31, 2008
THOM, A., PEHCEVSKI, J., AND A. M. VERCOUSTRE. [2007] Use of Wikipedia Categories in Entity Ranking. In
Proceedings of the 12th Australasian Document Computing Symposium, Melbourne, Australia.
THOMAS, C.S., AND P. AMIT. [2006] Semantic Convergence of Wikipedia Articles. In Proceedings of the
International Conference on Web Intelligence, IEEE/WIC/ACM WI'06, Hong Kong.
TORAL, A. AND R. MUÑOZ. [2007] Towards a Named Entity WordNet (NEWN). In Proceedings of the 6th
International Conference on Recent Advances in Natural Language Processing, RANLP'07, Borovets,
Bulgaria. p. 604–608.
TORAL, A. AND R. MUÑOZ. [2006] A proposal to automatically build and maintain gazetteers for Named
Entity Recognition by using Wikipedia. In Proceedings of the Workshop on New Text at the
11th EACL'06. Trento, Italy.
TYERS, F. AND J. PIENAR. [208] Extracting bilingual word pairs from Wikipedia. In Procedings of the 
SALTMIL Workshop at Language Resources and Evaluation Conference, LREC?08. 
VERCOUSTRE, A. M., PEHCEVSKI, J., AND J. A. THOM [207]. Using Wikipedia Categories and Links in Entity 
Ranking. In Pre-procedings of the 6th International Workshop of the Initiative for the Evaluation of 
XML Retrieval, INEX?07, December 17, 207. 
VERCOUSTRE, A. M., THOM, J. A., AND J. PEHCEVSKI. [2008] Entity Ranking in Wikipedia. In Proceedings of 
SAC'08, March 16–20, 2008, Fortaleza, Ceara, Brazil. 
VIÉGAS, F.B., WATTENBERG, M., AND D. KUSHAL. [2004] Studying cooperation and conflict between authors 
with history flow visualizations. In Proceedings of SIGCHI'04, Vienna, Austria, p. 575–582. New York, 
NY: ACM Press. 
VIÉGAS, F., WATTENBERG, M., KRISS, J., AND F. VAN HAM. [2007] Talk before You Type: Coordination in 
Wikipedia. In Proceedings of the 40th Hawaii International Conference on System Sciences. 
VÖLKEL, M., KRÖTZSCH, M., VRANDECIC, D., HALLER, H. AND R. STUDER. [2006] Semantic Wikipedia. In 
Proceedings of the 15th International Conference on World Wide Web, WWW'06, Edinburgh, Scotland, 
May 23–26, 2006. 
VOORHEES, E. M. [1999] Natural Language Processing and Information Retrieval. In Pazienza, M. T. (editor) 
Information Extraction: Towards Scalable, Adaptable Systems, New York: Springer, p. 32–48. 
VOORHEES, E. M., AND D. HARMAN. [2000] Overview of the Eighth Text Retrieval Conference (TREC-8). In 
TREC, p. 1–24. 
VOSSEN, P., DIEZ-ORZAS, P., AND W. PETERS. [1997] The Multilingual Design of EuroWordNet. In Vossen, P. 
et al. (editors) Proceedings of the ACL/EACL'97 Workshop on Automatic Information Extraction and 
Building of Lexical Semantic Resources for NLP Applications, Madrid, July 12, 1997. 
VRANDECIC, D., KRÖTZSCH, M., AND M. VÖLKEL. [2007] Wikipedia and the Semantic Web, Part II. In P. 
Ayers and N. Boalch (editors) Proceedings of the 2nd International Wikimedia Conference 
Wikimania'06. Wikimedia Foundation, Cambridge, MA, USA. 
DE VRIES, A. P., THOM, J. A., VERCOUSTRE, A. M., CRASWELL, N., AND M. LALMAS. [2007] INEX 2007 
Entity ranking track guidelines. In Workshop Pre-Proceedings of INEX 2007. 
WANG, P., HU, J., ZENG, H., CHEN, L., AND Z. CHEN. [2007] Improving Text Classification by Using 
Encyclopedia Knowledge. In Proceedings of the 7th IEEE International Conference on Data Mining, 
ICDM'07, 28–31 October 2007, p. 332–341. 
WANG, G., ZHANG, H., WANG, H. AND Y. YU. [2007a] Enhancing Relation Extraction by Eliciting Selectional 
Constraint Features from Wikipedia. In Proceedings of the Natural Language Processing and Information 
Systems Conference, p. 329–340. 
WANG, G., YU, Y., AND H. ZHU. [2007b] PORE: Positive-Only Relation Extraction from Wikipedia Text. In 
Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web 
Conference, ISWC/ASWC'07, Busan, South Korea. 
WANG, Y., WANG, H., ZHU, H., AND Y. YU. [2007] Exploit Semantic Information for Category Annotation 
Recommendation in Wikipedia. In Proceedings of the Natural Language Processing and Information 
Systems Conference, p. 48–60. 
WATANABE, Y., ASAHARA, M., AND Y. A. MATSUMOTO. [2007] Graph-based Approach to Named Entity 
Categorization in Wikipedia Using Conditional Random Fields. In Proceedings of the Joint Conference on 
Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 
EMNLP-CoNLL. 
WILKINSON, D.M., AND B.A. HUBERMAN. [2007] Cooperation and Quality in Wikipedia. In Proceedings of 
the International Symposium on Wikis, p. 157–164. 
WU, F. AND D. WELD. [2007] Autonomously Semantifying Wikipedia. In Proceedings of the 16th ACM 
Conference on Information and Knowledge Management, CIKM'07, Lisbon, Portugal, November 6–8, 
2007, p. 41–50. 
WU, F. AND D. WELD. [2008] Automatically Refining the Wikipedia Infobox Ontology. In Proceedings of the 
17th International World Wide Web Conference, WWW'08. 
WU, F., HOFFMANN, R., AND D. WELD. [2008] Information Extraction from Wikipedia: Moving Down the 
Long Tail. In Proceedings of the 14th ACM SigKDD International Conference on Knowledge Discovery 
and Data Mining (KDD-08), Las Vegas, NV, August 24–27, 2008, p. 635–644. 
YANG, X. F. AND J. SU. [2007] Coreference Resolution Using Semantic Relatedness Information from 
Automatically Discovered Patterns. In Proceedings of the 45th Annual Meeting of the Association for 
Computational Linguistics, ACL'07, Prague, Czech Republic, p. 528–535. 
YANG, J., HAN, J., OH, I., AND M. KWAK. [2007] Using Wikipedia technology for topic maps design. In 
Proceedings of the ACM Southeast Regional Conference, p. 106–110. 
YU, J., THOM, J. A., AND A. TAM. [2007] Ontology evaluation using Wikipedia categories for browsing. In 
Proceedings of the 16th ACM Conference on Information and Knowledge Management, CIKM'07, 
Lisbon, Portugal, November 6–8, 2007, p. 223–232. 
ZARAGOZA, H., RODE, H., MIKA, P., ATSERIAS, J., CIARAMITA, M., AND G. ATTARDI. [2007] Ranking Very 
Many Typed Entities on Wikipedia. In Proceedings of the 16th ACM Conference on Information and 
Knowledge Management, CIKM'07, Lisbon, Portugal, November 6–8, 2007, p. 1015–1018. 
ZESCH, T. AND I. GUREVYCH. [2007] Analysis of the Wikipedia Category Graph for NLP Applications. In 
Proceedings of the TextGraphs-2 Workshop at the NAACL-HLT'07, p. 1–8. 
ZESCH, T., GUREVYCH, I., AND M. MÜHLHÄUSER. [2007] Comparing Wikipedia and German WordNet by 
Evaluating Semantic Relatedness on Multiple Datasets. In Proceedings of Human Language Technologies: 
The Annual Conference of the North American Chapter of the Association for Computational Linguistics, 
NAACL-HLT'07, p. 205–208. 
ZESCH, T., GUREVYCH, I., AND M. MÜHLHÄUSER. [2008] Analyzing and Accessing Wikipedia as a Lexical 
Semantic Resource. In Proceedings of the Biannual Conference of the Society for Computational 
Linguistics and Language Technology, p. 213–221. 
ZLATIC, V., BOZICEVIC, M., STEFANCIC, H., AND M. DOMAZET. [2006] Wikipedias: Collaborative Web-based 
Encyclopedias as Complex Networks. Physical Review E, 74:016115. 
ZIRN, C., NASTASE, V., AND M. STRUBE. [2008] Distinguishing between Instances and Classes in the 
Wikipedia Taxonomy. To appear in Proceedings of the ESWC'08.