Penn Treebank Revised: English News Text Treebank LDC2015T13
Ann Bies and Justin Mott and Colin Warner

LDC2015T13_Penn_Treebank_revised.tar.zst (6.86MB)
Type: Dataset
Tags: nlp, english, natural language, corpus, news, text, newswire, Treebank, LDC, corpora, Penn Treebank, Penn, 2015, LDC2015T13, parsing, tagging, part of speech, WSJ, PTB

Bibtex:
@article{,
title= {Penn Treebank Revised: English News Text Treebank LDC2015T13},
journal= {},
author= {Ann Bies and Justin Mott and Colin Warner},
year= {2015},
url= {https://doi.org/10.35111/xpjy-at91},
doi= {10.35111/xpjy-at91},
isbn= {1-58563-724-6},
dcmi= {text},
languages= {english},
language= {english},
ldc= {LDC2015T13},
abstract= {# Penn Treebank Revised: English News Text Treebank - 2015

## Metadata

* Item Name:	English News Text Treebank: Penn Treebank Revised
* Author(s):	Ann Bies, Justin Mott, Colin Warner
* LDC Catalog No.:	LDC2015T13
* ISBN:	1-58563-724-6
* DOI:	https://doi.org/10.35111/xpjy-at91
* Release Date:	July 15, 2015
* Member Year(s):	2015
* DCMI Type(s):	Text
* Data Source(s):	newswire
* Application(s):	parsing, tagging, part of speech tagging, natural language processing
* Language(s):	English
* Language ID(s):	eng
* License(s):	LDC User Agreement for Non-Members
* Online Documentation:	LDC2015T13 Documents
* Licensing Instructions:	Subscription & Standard Members, and Non-Members
* Citation:	Bies, Ann, Justin Mott, and Colin Warner. English News Text Treebank: Penn Treebank Revised LDC2015T13. Web Download. Philadelphia: Linguistic Data Consortium, 2015.


## Introduction

English News Text Treebank: Penn Treebank Revised was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of a combination of automated and manual revisions of the [Penn Treebank](https://catalog.ldc.upenn.edu/LDC99T42) annotation of Wall Street Journal (WSJ) stories. The data comprises 1,203,648 word-level tokens in 49,191 sentence-level tokens, covering all 2,312 of the original Penn Treebank WSJ files.

## Data

This release includes revised tokenization, part-of-speech, and syntactic treebank annotation intended to bring the full WSJ treebank section into compliance with the agreed-upon policies and updates implemented for current English treebank annotation specifications at LDC. Examples of corpora annotated under these specifications include English Web Treebank ([LDC2012T13](https://catalog.ldc.upenn.edu/LDC2012T13)), OntoNotes ([LDC2013T19](https://catalog.ldc.upenn.edu/LDC2013T19)), and English translation treebanks such as English Translation Treebank: An-Nahar Newswire ([LDC2012T02](https://catalog.ldc.upenn.edu/LDC2012T02)). English Treebank Supplemental Guidelines are included in this release.

## Samples

Please view these [treebank](https://catalog.ldc.upenn.edu/desc/addenda/LDC2015T13.tree.txt) and [tokenized](https://catalog.ldc.upenn.edu/desc/addenda/LDC2015T13.txt) samples.
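
For readers who want to inspect the treebank sample programmatically, the following is a minimal Python sketch. It assumes the files use the conventional Penn Treebank bracketed notation, `(LABEL child ...)`; this description does not specify the file format, so check it against the sample first. The sentence in `sample` is invented for illustration and is not taken from the corpus, and the helper names `tokenize`, `parse`, and `tagged_words` are hypothetical. Trace and null elements get no special handling.

```python
# Minimal reader for Penn Treebank-style bracketed trees (illustrative only).
# Assumption: trees are written as "(LABEL child ...)" with words as leaves.

def tokenize(text):
    # Pad parentheses with spaces so split() separates them from labels and words.
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens, pos=0):
    # Parse one node starting at tokens[pos], which must be "(".
    # Returns ((label, children), next_pos); a child is a nested node or a word.
    assert tokens[pos] == "("
    pos += 1
    label = ""  # some files wrap each tree in an unlabeled outer bracket
    if tokens[pos] != "(" and tokens[pos] != ")":
        label = tokens[pos]
        pos += 1
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child, pos = tokens[pos], pos + 1
        children.append(child)
    return (label, children), pos + 1

def tagged_words(node):
    # Collect (word, tag) pairs from preterminal nodes such as (NN market).
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_words(child))
    return pairs

# Invented example sentence, not corpus data.
sample = "(S (NP (DT The) (NN market)) (VP (VBD rose)) (. .))"
tree, _ = parse(tokenize(sample))
print(tagged_words(tree))
```

For real work, an established reader such as the one in NLTK is likely a better choice than this hand-rolled parser.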

## Updates

None at this time.
},
keywords= {nlp, english, natural language, corpus, news, text, newswire, Treebank, LDC, corpora, PTB, Penn Treebank, Penn, 2015, LDC2015T13, parsing, tagging, part of speech, WSJ},
terms= {},
license= {},
superseded= {}
}
