Penn Treebank Revised: English News Text Treebank LDC2015T13
Ann Bies and Justin Mott and Colin Warner

LDC2015T13_Penn_Treebank_revised.tar.zst (6.86MB)
Type: Dataset
Tags: nlp, english, natural language, corpus, news, text, newswire, Treebank, LDC, corpora, Penn Treebank, Penn, 2015, LDC2015T13, parsing, tagging, part of speech, WSJ, PTB

Bibtex:
@article{,
title= {Penn Treebank Revised: English News Text Treebank LDC2015T13},
journal= {},
author= {Ann Bies and Justin Mott and Colin Warner},
year= {2015},
url= {https://doi.org/10.35111/xpjy-at91},
doi= {10.35111/xpjy-at91},
isbn= {1-58563-724-6},
dcmi= {text},
languages= {english},
language= {english},
ldc= {LDC2015T13},
abstract= {# Penn Treebank Revised: English News Text Treebank - 2015

## Metadata

* Item Name:	English News Text Treebank: Penn Treebank Revised
* Author(s):	Ann Bies, Justin Mott, Colin Warner
* LDC Catalog No.:	LDC2015T13
* ISBN:	1-58563-724-6
* DOI:	https://doi.org/10.35111/xpjy-at91
* Release Date:	July 15, 2015
* Member Year(s):	2015
* DCMI Type(s):	Text
* Data Source(s):	newswire
* Application(s):	parsing, tagging, part of speech tagging, natural language processing
* Language(s):	English
* Language ID(s):	eng
* License(s):	LDC User Agreement for Non-Members
* Online Documentation:	LDC2015T13 Documents
* Licensing Instructions:	Subscription & Standard Members, and Non-Members
* Citation:	Bies, Ann, Justin Mott, and Colin Warner. English News Text Treebank: Penn Treebank Revised LDC2015T13. Web Download. Philadelphia: Linguistic Data Consortium, 2015.


## Introduction

English News Text Treebank: Penn Treebank Revised was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It combines automated and manual revisions of the [Penn Treebank](https://catalog.ldc.upenn.edu/LDC99T42) annotation of Wall Street Journal (WSJ) stories. The data comprises 1,203,648 word-level tokens in 49,191 sentence-level tokens, covering all 2,312 of the original Penn Treebank WSJ files.

## Data

This release includes revised tokenization, part-of-speech, and syntactic treebank annotation intended to bring the full WSJ treebank section into compliance with the agreed-upon policies and updates implemented in the current English treebank annotation specifications at LDC. Corpora annotated under those specifications include English Web Treebank ([LDC2012T13](https://catalog.ldc.upenn.edu/LDC2012T13)), OntoNotes ([LDC2013T19](https://catalog.ldc.upenn.edu/LDC2013T19)), and English translation treebanks such as English Translation Treebank: An-Nahar Newswire ([LDC2012T02](https://catalog.ldc.upenn.edu/LDC2012T02)). The English Treebank Supplemental Guidelines are included in this release.

## Samples

Please view the [treebank](https://catalog.ldc.upenn.edu/desc/addenda/LDC2015T13.tree.txt) and [tokenized](https://catalog.ldc.upenn.edu/desc/addenda/LDC2015T13.txt) samples.

## Updates

None at this time.
},
keywords= {nlp, english, natural language, corpus, news, text, newswire, Treebank, LDC, corpora, PTB, Penn Treebank, Penn, 2015, LDC2015T13, parsing, tagging, part of speech, WSJ},
terms= {},
license= {},
superseded= {}
}
