Penn Treebank Revised: English News Text Treebank LDC2015T13
Ann Bies and Justin Mott and Colin Warner

LDC2015T13_Penn_Treebank_revised.tar.zst 6.86MB
Type: Dataset
Tags: nlp, english, natural language, corpus, news, text, newswire, Treebank, LDC, corpora, Penn Treebank, Penn, 2015, LDC2015T13, parsing, tagging, part of speech, WSJ, PTB
DOI: 10.35111/xpjy-at91
Abstract:

Penn Treebank Revised: English News Text Treebank - 2015

Metadata

  • Item Name: English News Text Treebank: Penn Treebank Revised
  • Author(s): Ann Bies, Justin Mott, Colin Warner
  • LDC Catalog No.: LDC2015T13
  • ISBN: 1-58563-724-6
  • DOI: https://doi.org/10.35111/xpjy-at91
  • Release Date: July 15, 2015
  • Member Year(s): 2015
  • DCMI Type(s): Text
  • Data Source(s): newswire
  • Application(s): parsing, tagging, part of speech tagging, natural language processing
  • Language(s): English
  • Language ID(s): eng
  • License(s): LDC User Agreement for Non-Members
  • Online Documentation: LDC2015T13 Documents
  • Licensing Instructions: Subscription & Standard Members, and Non-Members
  • Citation: Bies, Ann, Justin Mott, and Colin Warner. English News Text Treebank: Penn Treebank Revised LDC2015T13. Web Download. Philadelphia: Linguistic Data Consortium, 2015.
  • Related Works: View

Introduction

English News Text Treebank: Penn Treebank Revised was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It combines automated and manual revisions of the Penn Treebank annotation of Wall Street Journal (WSJ) stories. The data comprises 1,203,648 word-level tokens in 49,191 sentence-level tokens, covering all 2,312 of the original Penn Treebank WSJ files.

Data

This release includes revised tokenization, part-of-speech, and syntactic treebank annotation intended to bring the full WSJ treebank section into compliance with the agreed-upon policies and updates implemented for current English treebank annotation specifications at LDC. Examples of treebanks annotated under those specifications include English Web Treebank (LDC2012T13), OntoNotes (LDC2013T19), and English translation treebanks such as English Translation Treebank: An-Nahar Newswire (LDC2012T02). English Treebank Supplemental Guidelines are included in this release.
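
As a rough illustration of how the part-of-speech and syntactic layers fit together, the sketch below reads one tree in the bracketed notation conventionally used for Penn Treebank files and lists its (word, tag) pairs. The sample sentence is invented for illustration and is not taken from the corpus; the helper names (`tokenize`, `parse`, `tagged_leaves`) are hypothetical; and the assumption that trees in this release use the same bracketing, including the empty-label outer bracket, should be checked against the included guidelines and the treebank sample.

```python
from typing import List, Tuple

# Invented example in Penn Treebank style bracketing (NOT from the corpus).
SAMPLE = "( (S (NP-SBJ (DT The) (NN company)) (VP (VBD reported) (NP (JJ higher) (NNS earnings))) (. .)) )"


def tokenize(text: str) -> List[str]:
    """Split bracketed text into '(', ')' and label/word atoms.

    Assumes literal parentheses in the text are escaped (e.g. -LRB-/-RRB-),
    as is conventional in Penn Treebank style files.
    """
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str], pos: int = 0):
    """Parse the node that opens at tokens[pos]; return (node, next_pos).

    A node is a (label, children) tuple; a child is either a node or a word.
    """
    if tokens[pos] != "(":
        raise ValueError(f"expected '(' at token {pos}")
    pos += 1
    label = ""
    if tokens[pos] not in ("(", ")"):  # the outer wrapper bracket has no label
        label = tokens[pos]
        pos += 1
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child, pos = tokens[pos], pos + 1
        children.append(child)
    return (label, children), pos + 1


def tagged_leaves(node) -> List[Tuple[str, str]]:
    """Collect (word, part-of-speech tag) pairs in sentence order."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_leaves(child))
    return pairs


tree, _ = parse(tokenize(SAMPLE))
for word, tag in tagged_leaves(tree):
    print(f"{word}\t{tag}")
```

Empty elements, which Penn Treebank style annotation conventionally tags `-NONE-`, would be returned here like any other leaf. Whether they count toward the word-level token total quoted above is not stated in this description, so a token count produced this way may not match it.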

Samples

Please view the treebank and tokenized samples.

Updates

None at this time.


Year: 2015

URL: https://doi.org/10.35111/xpjy-at91
License: No license specified; the work may be protected by copyright.

Bibtex:
@article{,
title= {Penn Treebank Revised: English News Text Treebank LDC2015T13},
journal= {},
author= {Ann Bies and Justin Mott and Colin Warner},
year= {2015},
url= {https://doi.org/10.35111/xpjy-at91},
doi= {10.35111/xpjy-at91},
isbn= {1-58563-724-6},
dcmi= {text},
languages= {english},
language= {english},
ldc= {LDC2015T13},
abstract= {# Penn Treebank Revised: English News Text Treebank - 2015

## Metadata

* Item Name:	English News Text Treebank: Penn Treebank Revised
* Author(s):	Ann Bies, Justin Mott, Colin Warner
* LDC Catalog No.:	LDC2015T13
* ISBN:	1-58563-724-6
* DOI:	https://doi.org/10.35111/xpjy-at91
* Release Date:	July 15, 2015
* Member Year(s):	2015
* DCMI Type(s):	Text
* Data Source(s):	newswire
* Application(s):	parsing, tagging, part of speech tagging, natural language processing
* Language(s):	English
* Language ID(s):	eng
* License(s):	LDC User Agreement for Non-Members
* Online Documentation:	LDC2015T13 Documents
* Licensing Instructions:	Subscription & Standard Members, and Non-Members
* Citation:	Bies, Ann, Justin Mott, and Colin Warner. English News Text Treebank: Penn Treebank Revised LDC2015T13. Web Download. Philadelphia: Linguistic Data Consortium, 2015.
* Related Works:	View


## Introduction

English News Text Treebank: Penn Treebank Revised was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It combines automated and manual revisions of the [Penn Treebank](https://catalog.ldc.upenn.edu/LDC99T42) annotation of Wall Street Journal (WSJ) stories. The data comprises 1,203,648 word-level tokens in 49,191 sentence-level tokens, covering all 2,312 of the original Penn Treebank WSJ files.

## Data

This release includes revised tokenization, part-of-speech, and syntactic treebank annotation intended to bring the full WSJ treebank section into compliance with the agreed-upon policies and updates implemented for current English treebank annotation specifications at LDC. Examples of treebanks annotated under those specifications include English Web Treebank ([LDC2012T13](https://catalog.ldc.upenn.edu/LDC2012T13)), OntoNotes ([LDC2013T19](https://catalog.ldc.upenn.edu/LDC2013T19)), and English translation treebanks such as English Translation Treebank: An-Nahar Newswire ([LDC2012T02](https://catalog.ldc.upenn.edu/LDC2012T02)). English Treebank Supplemental Guidelines are included in this release.

## Samples

Please view the [treebank](https://catalog.ldc.upenn.edu/desc/addenda/LDC2015T13.tree.txt) and [tokenized](https://catalog.ldc.upenn.edu/desc/addenda/LDC2015T13.txt) samples.

## Updates

None at this time.
},
keywords= {nlp, english, natural language, corpus, news, text, newswire, Treebank, LDC, corpora, PTB, Penn Treebank, Penn, 2015, LDC2015T13, parsing, tagging, part of speech, WSJ},
terms= {},
license= {},
superseded= {}
}
