wiki-links (10 files)
data-00008-of-00010.gz | 184.39MB |
data-00009-of-00010.gz | 183.68MB |
data-00005-of-00010.gz | 183.76MB |
data-00006-of-00010.gz | 183.77MB |
data-00007-of-00010.gz | 183.87MB |
data-00002-of-00010.gz | 183.72MB |
data-00003-of-00010.gz | 184.04MB |
data-00004-of-00010.gz | 183.46MB |
data-00000-of-00010.gz | 183.50MB |
data-00001-of-00010.gz | 183.74MB |
Type: Dataset
Tags:
Bibtex:
@article{,
title= {Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia (Original Dataset)},
author= {Sameer Singh and Amarnag Subramanya and Fernando Pereira and Andrew McCallum},
abstract= {Cross-document coreference resolution is the task of grouping the entity mentions in a collection of documents into sets that each represent a distinct entity. It is central to knowledge base construction and also useful for joint inference with other NLP components. Obtaining large, organic labeled datasets for training and testing cross-document coreference has previously been difficult. We use a method for automatically gathering massive amounts of naturally occurring cross-document reference data to create the Wikilinks dataset, comprising 40 million mentions over 3 million entities. Our method is based on finding hyperlinks to Wikipedia from a web crawl and using anchor text as mentions. In addition to providing large-scale labeled data without human effort, we are able to include many styles of text beyond newswire and many entity types beyond people.

### Introduction

The Wikipedia links (WikiLinks) data consists of web pages that satisfy the following two constraints:

a. they contain at least one hyperlink that points to Wikipedia, and
b. the anchor text of that hyperlink closely matches the title of the target Wikipedia page.

We treat each page on Wikipedia as representing an entity (or concept or idea), and the anchor text as a mention of the entity. The WikiLinks data set was obtained by iterating over Google's web index.

#### Content

This dataset is accompanied by the following tech report: https://web.cs.umass.edu/publication/docs/2012/UM-CS-2012-015.pdf

Please cite the above report if you use this data.

The dataset is divided over 10 gzipped text files, data-0000[0-9]-of-00010.gz. Each of these files can be viewed without uncompressing it by using zcat.
For example:

    zcat data-00001-of-00010.gz | head

gives:

    URL ftp://217.219.170.14/Computer%20Group/Faani/vaset%20fani/second/sattari/word/2007/source/s%20crt.docx
    MENTION vacuum tube 421 http://en.wikipedia.org/wiki/Vacuum_tube
    MENTION vacuum tubes 10838 http://en.wikipedia.org/wiki/Vacuum_tube
    MENTION electron gun 598 http://en.wikipedia.org/wiki/Electron_gun
    MENTION fluorescent 790 http://en.wikipedia.org/wiki/Fluorescent
    MENTION oscilloscope 1307 http://en.wikipedia.org/wiki/Oscilloscope
    MENTION computer monitor 1503 http://en.wikipedia.org/wiki/Computer_monitor
    MENTION computer monitors 3066 http://en.wikipedia.org/wiki/Computer_monitor
    MENTION radar 1657 http://en.wikipedia.org/wiki/Radar
    MENTION plasma screens 2162 http://en.wikipedia.org/wiki/Plasma_screen

Each file is in the following format:

    URL\t<url>\n
    MENTION\t<mention>\t<byte_offset>\t<target_url>\n
    MENTION\t<mention>\t<byte_offset>\t<target_url>\n
    MENTION\t<mention>\t<byte_offset>\t<target_url>\n
    ...
    TOKEN\t<token>\t<byte_offset>\n
    TOKEN\t<token>\t<byte_offset>\n
    TOKEN\t<token>\t<byte_offset>\n
    ...
    \n\n
    URL\t<url>\n
    ...

Each web page is identified by its URL (on the line starting with "URL"). For every mention (a line starting with "MENTION"), we provide the actual mention string, the byte offset of the mention from the start of the page, and the target URL, all separated by tabs.

It is possible (and in many cases very likely) that the contents of a web page will change over time. The dataset therefore also contains the 10 least frequent tokens on each page at the time it was crawled. These lines start with "TOKEN" and contain the token string and its byte offset from the start of the page. The token strings can be used as fingerprints to verify whether the page used to generate the data has changed. Finally, pages are separated from each other by two blank lines.

#### Basic Statistics

Number of documents: 11 million
Number of entities: 3 million
Number of mentions: 40 million

Finally, please note that this dataset was created automatically from the web and therefore contains some amount of noise.

Enjoy!

Amar Subramanya (asubram@google.com)
Sameer Singh (sameer@cs.umass.edu)
Fernando Pereira (pereira@google.com)
Andrew McCallum (mccallum@cs.umass.edu)
},
keywords= {},
terms= {},
license= {Attribution 3.0 Unported (CC BY 3.0) Human-Readable Summary}
}
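
The URL / MENTION / TOKEN record format described in the abstract can be read with a few lines of Python. The following is a minimal sketch, not code shipped with the dataset. The names `Page`, `parse_pages`, `read_shard`, and `fingerprint_matches` are hypothetical. It assumes that the shards are UTF-8 text, that malformed lines can simply be skipped, and that a TOKEN byte offset points at the first byte of the token in the raw page.

```python
import gzip
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Page:
    url: str
    # (mention string, byte offset, target Wikipedia URL)
    mentions: List[Tuple[str, int, str]] = field(default_factory=list)
    # (token string, byte offset)
    tokens: List[Tuple[str, int]] = field(default_factory=list)


def parse_pages(lines: Iterable[str]) -> Iterator[Page]:
    """Group tab-separated URL / MENTION / TOKEN lines into Page records."""
    page = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank separator lines between pages
        parts = line.split("\t")
        try:
            if parts[0] == "URL" and len(parts) >= 2:
                if page is not None:
                    yield page
                page = Page(url=parts[1])
            elif parts[0] == "MENTION" and page is not None and len(parts) == 4:
                page.mentions.append((parts[1], int(parts[2]), parts[3]))
            elif parts[0] == "TOKEN" and page is not None and len(parts) == 3:
                page.tokens.append((parts[1], int(parts[2])))
            # Any other line is treated as noise and ignored.
        except ValueError:
            continue  # non-numeric offset: skip the line (assumption)
    if page is not None:
        yield page


def read_shard(path: str) -> Iterator[Page]:
    """Stream pages from one gzipped shard without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        yield from parse_pages(fh)


def fingerprint_matches(page_bytes: bytes, page: Page) -> bool:
    """Check that every recorded token still sits at its recorded offset.

    Assumes UTF-8 tokens whose offset marks their first byte. Returns True
    for a page that has no TOKEN lines, since there is nothing to contradict.
    """
    for token, offset in page.tokens:
        needle = token.encode("utf-8")
        if page_bytes[offset:offset + len(needle)] != needle:
            return False
    return True
```

For instance, `for page in read_shard("data-00001-of-00010.gz"): ...` would iterate over one shard page by page. After re-downloading `page.url`, the raw response body can be passed to `fingerprint_matches` to estimate whether the mention offsets are still valid.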