Penn Treebank III 3 LDC99T42
Mitchell P. Marcus and Beatrice Santorini and Mary Ann Marcinkiewicz and Ann Taylor

LDC99T42_Penn_Treebank_3.tar.zst 29.83MB
Type: Dataset
Tags: Dataset, nlp, natural language, corpus, text, linguistics, Treebank, corpora, Penn Treebank, PTB

Bibtex:
@article{,
title= {Penn Treebank III 3 LDC99T42},
journal= {},
author= {Mitchell P. Marcus and Beatrice Santorini and Mary Ann Marcinkiewicz and Ann Taylor},
year= {1999},
isbn= {1-58563-163-9},
islrn= {141-282-691-413-2},
dcmi= {text},
language= {english},
doi= {10.35111/gq1x-j780},
url= {https://doi.org/10.35111/gq1x-j780},
abstract= {# Penn Treebank III

## Metadata

- _Item Name:_ Treebank-3
- _Author(s):_ Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor
- _LDC Catalog No.:_ LDC99T42
- _ISBN:_ 1-58563-163-9
- _ISLRN:_ 141-282-691-413-2
- _DOI:_ [https://doi.org/10.35111/gq1x-j780](https://doi.org/10.35111/gq1x-j780)
- _Member Year(s):_ 1999
- _DCMI Type(s):_ Text
- _Data Source(s):_ telephone speech, newswire, microphone speech, transcribed speech, varied
- _Project(s):_ TIDES, GALE
- _Application(s):_ parsing, natural language processing, tagging
- _Language(s):_ English
- _Language ID(s):_ eng
- _License(s):_ [LDC User Agreement for Non-Members](https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf)
- _Online Documentation:_ [LDC99T42 Documents](https://catalog.ldc.upenn.edu/docs/LDC99T42/)
- _Citation:_ Marcus, Mitchell P., et al. Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium, 1999.

## Introduction

This release contains the following [Treebank-2](http://catalog.ldc.upenn.edu/LDC95T7) material:

-   One million words of 1989 Wall Street Journal material annotated in Treebank II style.
-   A small sample of ATIS-3 material annotated in Treebank II style.
-   A fully tagged version of the Brown Corpus.

and the following new material:

-   Switchboard tagged, dysfluency-annotated, and parsed text
-   Brown parsed text

The Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied.
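
As a rough illustration of how this bracketing can be read, the Python sketch below parses one bracketed tree into nested lists and lists its words with their part-of-speech tags. The sample sentence is made up for illustration and is not taken from the corpus. The helper names `parse_bracketed` and `leaves` are hypothetical, and the sketch assumes whitespace-separated tokens with no literal parentheses inside words.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested lists: [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with an opening bracket"
        pos += 1
        if tokens[pos] in ("(", ")"):
            node = [""]  # unlabeled wrapper bracket around a whole tree
        else:
            node = [tokens[pos]]  # constituent or part-of-speech label
            pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read_node())  # nested constituent
            else:
                node.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return node

    return read_node()


def leaves(node):
    """Return (word, tag) pairs in left-to-right order."""
    if len(node) == 2 and isinstance(node[1], str):
        return [(node[1], node[0])]
    pairs = []
    for child in node[1:]:
        pairs.extend(leaves(child))
    return pairs


# Hypothetical example sentence, not taken from the corpus.
sample = "(S (NP-SBJ (DT The) (NN board)) (VP (VBD met) (NP-TMP (NN yesterday))) (. .))"
tree = parse_bracketed(sample)
print(tree[0])
print(leaves(tree))
```

In this sketch, labels such as `NP-SBJ` keep their function tags, which is the kind of information a predicate/argument extraction step would presumably start from.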

## Data

The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 ([LDC95T7](https://catalog.ldc.upenn.edu/LDC95T7)) and Treebank-3 ([LDC99T42](https://catalog.ldc.upenn.edu/LDC99T42)) releases of PTB. Treebank-2 includes the raw text for each story. For users who have licensed Treebank-2, three "map" files are available as an additional download in a compressed file (pennTB\_tipster\_wsj\_map.tar.gz); they relate the 2,499 PTB filenames to the corresponding WSJ DOCNO strings in TIPSTER.

## Samples

Please view the following samples:

-   [Part-of-Speech Tags](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.pos.txt)
-   [Dysfluency Annotation](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.dff.txt)
-   [Dysfluency Annotation & Part-of-Speech Tags](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.mgd.txt)
-   [Dysfluency Annotation, Part-of-Speech Tags & Turns Joined](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.dps.txt)
-   [Syntactic Annotation](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.prd.txt)
-   [Syntactic Annotation & Part-of-Speech Tags](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.mrg.txt)

## Updates

After publication, it was discovered that not all of the PostScript (\*.ps) files had been converted to PDFs and that some of the converted PDFs contained errors. For PDF copies of the documentation files, see the [addenda](https://catalog.ldc.upenn.edu/desc/addenda/LDC1999T42) for a list of the available files.

As of October 5, 2016, 252 previously missing wsj files from [Treebank-2](http://catalog.ldc.upenn.edu/LDC95T7) were added.

As of February 2017, 2,499 "raw" wsj files were added from Treebank-2 ([LDC95T7](https://catalog.ldc.upenn.edu/LDC95T7)).

Corpus downloads after these dates will include these missing files.},
keywords= {Dataset, nlp, natural language, corpus, text, linguistics, Treebank, corpora, Penn Treebank, PTB},
terms= {},
license= {},
superseded= {}
}
