Tags: nlp, english, natural language, corpus, news, text, newswire, Treebank, LDC, corpora, Penn Treebank, Penn, 2015, LDC2015T13, parsing, tagging, part of speech, WSJ, PTB
Abstract:
Penn Treebank Revised: English News Text Treebank - 2015
Metadata
- Item Name: English News Text Treebank: Penn Treebank Revised
- Author(s): Ann Bies, Justin Mott, Colin Warner
- LDC Catalog No.: LDC2015T13
- ISBN: 1-58563-724-6
- DOI: https://doi.org/10.35111/xpjy-at91
- Release Date: July 15, 2015
- Member Year(s): 2015
- DCMI Type(s): Text
- Data Source(s): newswire
- Application(s): parsing, tagging, part of speech tagging, natural language processing
- Language(s): English
- Language ID(s): eng
- License(s): LDC User Agreement for Non-Members
- Online Documentation: LDC2015T13 Documents
- Licensing Instructions: Subscription & Standard Members, and Non-Members
- Citation: Bies, Ann, Justin Mott, and Colin Warner. English News Text Treebank: Penn Treebank Revised LDC2015T13. Web Download. Philadelphia: Linguistic Data Consortium, 2015.
Introduction
English News Text Treebank: Penn Treebank Revised was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of a combination of automated and manual revisions of the Penn Treebank (LDC99T42) annotation of Wall Street Journal (WSJ) stories. The data comprises 1,203,648 word-level tokens in 49,191 sentences, covering all 2,312 of the original Penn Treebank WSJ files.
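For a sense of scale, the counts quoted above can be turned into simple averages. The short Python sketch below only does arithmetic on those three published figures; the function and variable names are illustrative and not part of the release.

```python
# Figures quoted in the catalog description.
WORD_TOKENS = 1_203_648
SENTENCES = 49_191
FILES = 2_312


def corpus_averages(tokens, sentences, files):
    """Return simple per-sentence and per-file averages for a corpus."""
    return {
        "tokens_per_sentence": tokens / sentences,
        "sentences_per_file": sentences / files,
        "tokens_per_file": tokens / files,
    }


averages = corpus_averages(WORD_TOKENS, SENTENCES, FILES)
for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```

This works out to roughly 24 tokens per sentence and about 21 sentences per file on average; the actual distribution across files is not stated in the description.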
Data
This release includes revised tokenization, part-of-speech, and syntactic treebank annotation intended to bring the full WSJ treebank section into compliance with the agreed-upon policies and updates implemented for current English treebank annotation specifications at LDC. Examples of treebanks annotated under those specifications include English Web Treebank (LDC2012T13), OntoNotes (LDC2013T19), and English translation treebanks such as English Translation Treebank: An-Nahar Newswire (LDC2012T02). English Treebank Supplemental Guidelines are included in this release.
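To illustrate how the three annotation layers (tokenization, part-of-speech, syntax) fit together, here is a minimal Python sketch that reads one bracketed tree and extracts its (token, tag) pairs. It assumes the conventional Penn Treebank bracketed notation; the exact file layout of this release is not described here, the sample sentence is invented for illustration, and `parse_tree` and `tagged_tokens` are hypothetical helper names.

```python
import re


def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves are plain strings. Assumes Penn Treebank style brackets, where
    literal parentheses in the text are escaped (e.g. -LRB-, -RRB-).
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with an opening bracket"
        pos += 1
        # The outermost bracket is conventionally unlabeled, so the label may be empty.
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read_node()


def tagged_tokens(node):
    """Collect (token, part-of-speech tag) pairs from the preterminal nodes."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_tokens(child))
    return pairs


# Invented example sentence, not taken from the corpus.
SAMPLE = "( (S (NP-SBJ (DT The) (NN market)) (VP (VBD rose)) (. .)) )"
tree = parse_tree(SAMPLE)
pairs = tagged_tokens(tree)
print(pairs)
```

Empty elements (conventionally tagged `-NONE-`) would be returned like ordinary tokens by this sketch, so a real pipeline would likely filter them out before tagging or tokenization experiments.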
Samples
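Samples of the treebank annotation (https://catalog.ldc.upenn.edu/desc/addenda/LDC2015T13.tree.txt) and of the tokenized text (https://catalog.ldc.upenn.edu/desc/addenda/LDC2015T13.txt) are available for viewing.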
Updates
None at this time.
Year: 2015
URL: https://doi.org/10.35111/xpjy-at91
License: No license specified; the work may be protected by copyright.
Bibtex:
@article{,
  title= {Penn Treebank Revised: English News Text Treebank LDC2015T13},
  journal= {},
  author= {Ann Bies and Justin Mott and Colin Warner},
  year= {2015},
  url= {https://doi.org/10.35111/xpjy-at91},
  doi= {10.35111/xpjy-at91},
  isbn= {1-58563-724-6},
  dcmi= {text},
  languages= {english},
  language= {english},
  ldc= {LDC2015T13},
  abstract= {# Penn Treebank Revised: English News Text Treebank - 2015 ## Metadata * Item Name: English News Text Treebank: Penn Treebank Revised * Author(s): Ann Bies, Justin Mott, Colin Warner * LDC Catalog No.: LDC2015T13 * ISBN: 1-58563-724-6 * DOI: https://doi.org/10.35111/xpjy-at91 * Release Date: July 15, 2015 * Member Year(s): 2015 * DCMI Type(s): Text * Data Source(s): newswire * Application(s): parsing, tagging, part of speech tagging, natural language processing * Language(s): English * Language ID(s): eng * License(s): LDC User Agreement for Non-Members * Online Documentation: LDC2015T13 Documents * Licensing Instructions: Subscription & Standard Members, and Non-Members * Citation: Bies, Ann, Justin Mott, and Colin Warner. English News Text Treebank: Penn Treebank Revised LDC2015T13. Web Download. Philadelphia: Linguistic Data Consortium, 2015. * Related Works: View ## Introduction English News Text Treebank: Penn Treebank Revised was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of a combination of automated and manual revisions of the [Penn Treebank](https://catalog.ldc.upenn.edu/LDC99T42) annotation of Wall Street Journal (WSJ) stories. The data is comprised of 1,203,648 word-level tokens in 49,191 sentence-level tokens -- in all 2,312 of the original Penn Treebank WSJ files. ## Data This release includes revised tokenization, part-of-speech, and syntactic treebank annotation intended to bring the full WSJ treebank section into compliance with the agreed-upon policies and updates implemented for current English treebank annotation specifications at LDC. Examples include English Web Treebank ([LDC2012T13](https://catalog.ldc.upenn.edu/LDC2012T13)), OntoNotes ([LDC2013T19](https://catalog.ldc.upenn.edu/LDC2013T19)), and English translation treebanks such as English Translation Treebank: An-Nahar Newswire ([LDC2012T02](https://catalog.ldc.upenn.edu/LDC2012T02)). English Treebank Supplemental Guidelines are included in this release. ## Samples Please view this [treebank](https://catalog.ldc.upenn.edu/desc/addenda/LDC2015T13.tree.txt) and [tokenized](https://catalog.ldc.upenn.edu/desc/addenda/LDC2015T13.txt) samples. ## Updates None at this time. },
  keywords= {nlp, english, natural language, corpus, news, text, newswire, Treebank, LDC, corpora, PTB, Penn Treebank, Penn, 2015, LDC2015T13, parsing, tagging, part of speech, WSJ},
  terms= {},
  license= {},
  superseded= {}
}