Name: Penn Treebank III 3 LDC99T42
Creator: Mitchell P. Marcus and Beatrice Santorini and Mary Ann Marcinkiewicz and Ann Taylor
Published: 2022-08-15 00:11:56
License: https://academictorrents.com/nolicensespecified

LDC99T42_Penn_Treebank_3.tar.zst	29.83MB

Type: Dataset

Tags: Dataset, nlp, natural language, corpus, text, linguistics, Treebank, corpora, Penn Treebank, PTB

Bibtex:

@article{,
title= {Penn Treebank III 3 LDC99T42},
journal= {},
author= {Mitchell P. Marcus and Beatrice Santorini and Mary Ann Marcinkiewicz and Ann Taylor},
year= {1999},
isbn= {1-58563-163-9},
islrn= {141-282-691-413-2},
dcmi= {text},
language= {english},
doi= {10.35111/gq1x-j780},
url= {https://doi.org/10.35111/gq1x-j780},
abstract= {# Penn Treebank III

## Metadata

- _Item Name:_ Treebank-3
- _Author(s):_ Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor
- _LDC Catalog No.:_ LDC99T42
- _ISBN:_ 1-58563-163-9
- _ISLRN:_ 141-282-691-413-2
- _DOI:_ [https://doi.org/10.35111/gq1x-j780](https://doi.org/10.35111/gq1x-j780)
- _Member Year(s):_ 1999
- _DCMI Type(s):_ Text
- _Data Source(s):_ telephone speech, newswire, microphone speech, transcribed speech, varied
- _Project(s):_ TIDES, GALE
- _Application(s):_ parsing, natural language processing, tagging
- _Language(s):_ English
- _Language ID(s):_ eng
- _License(s):_ [LDC User Agreement for Non-Members](https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf)
- _Online Documentation:_ [LDC99T42 Documents](https://catalog.ldc.upenn.edu/docs/LDC99T42/)
- _Citation:_ Marcus, Mitchell P., et al. Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium, 1999.

## Introduction

This release contains the following [Treebank-2](http://catalog.ldc.upenn.edu/LDC95T7) Material:

-   One million words of 1989 Wall Street Journal material annotated in Treebank II style.
-   A small sample of ATIS-3 material annotated in Treebank II style.
-   A fully tagged version of the Brown Corpus.

and the following new material:

-   Switchboard tagged, dysfluency-annotated, and parsed text
-   Brown parsed text

The Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied.
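
As a rough illustration of how a bracketed annotation might be read, the following minimal Python sketch parses one parenthesized tree into nested tuples and pulls out the words under a subject-marked noun phrase. The sentence is a made-up toy example, not taken from the corpus, and the helper names (`parse_brackets`, `leaves`, `find_first`) are hypothetical. The sketch assumes whitespace-separated tokens, an optional empty-label outer pair of parentheses, and a `-SBJ` function tag on the subject; the corpus documentation remains the authority on the actual file layout.

```python
def parse_brackets(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening parenthesis"
        pos += 1
        label = ""  # stays empty for an unlabeled wrapper node
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word at a leaf
                pos += 1
        pos += 1  # consume the closing parenthesis
        return (label, children)

    return read_node()


def leaves(node):
    """Yield (tag, word) pairs from the tree, left to right."""
    label, children = node
    for child in children:
        if isinstance(child, str):
            yield (label, child)
        else:
            yield from leaves(child)


def find_first(node, prefix):
    """Return the first subtree whose label starts with prefix, or None."""
    label, children = node
    if label.startswith(prefix):
        return node
    for child in children:
        if not isinstance(child, str):
            found = find_first(child, prefix)
            if found is not None:
                return found
    return None


# Toy sentence invented for this sketch (assumption: not corpus data).
toy = "(S (NP-SBJ (DT The) (NN dog)) (VP (VBD chased) (NP (DT a) (NN cat))))"
tree = parse_brackets(toy)
subject_words = [word for _, word in leaves(find_first(tree, "NP-SBJ"))]
print(subject_words)
```

A recursive-descent reader is used here only because it keeps the sketch dependency-free; an existing treebank reader from an NLP toolkit would likely be a better choice for real work.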

## Data

The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 ([LDC95T7](https://catalog.ldc.upenn.edu/LDC95T7)) and Treebank-3 ([LDC99T42](https://catalog.ldc.upenn.edu/LDC99T42)) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files, which relate the 2,499 PTB filenames to the corresponding WSJ DOCNO strings in TIPSTER, are available in a compressed file (pennTB\_tipster\_wsj\_map.tar.gz) as an additional download for users who have licensed Treebank-2.

## Samples

Please view the following samples:

-   [Part-of-Speech Tags](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.pos.txt)
-   [Dysfluency Annotation](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.dff.txt)
-   [Dysfluency Annotation & Part-of-Speech Tags](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.mgd.txt)
-   [Dysfluency Annotation, Part-of-Speech Tags & Turns Joined](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.dps.txt)
-   [Syntactic Annotation](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.prd.txt)
-   [Syntactic Annotation & Part-of-Speech Tags](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.mrg.txt)

## Updates

After publication, it was discovered that not all of the PostScript (\*.ps) files had been converted to PDFs and that some of the converted PDFs contained errors. For PDF copies of the documentation files, please go to [addenda](https://catalog.ldc.upenn.edu/desc/addenda/LDC1999T42) for a list of the files available.

As of October 5, 2016, 252 wsj files from [Treebank-2](http://catalog.ldc.upenn.edu/LDC95T7) that were previously missing were added.

As of February 2017, 2,499 "raw" wsj files were added from Treebank-2 ([LDC95T7](https://catalog.ldc.upenn.edu/LDC95T7)).

Corpus downloads after these dates will include these missing files.},
keywords= {Dataset, nlp, natural language, corpus, text, linguistics, Treebank, corpora, Penn Treebank, PTB},
terms= {},
license= {},
superseded= {}
}

Date	2025-06-10 09:12:44
Submitter Name	Denise DiPersio
Submitter Email	dipersio@ldc.upenn.edu
Provide a description of the content in question:	This distribution is not authorized by LDC. Please remove these, and any other datasets, from the site immediately. The data, annotations and related material in LDC Databases are protected under US copyright laws and under various legal agreements between LDC/University of Pennsylvania and source data providers. LDC/University of Pennsylvania owns the LDC Databases and/or has the right to distribute them. LDC’s membership agreements and license agreements provide that users cannot publish, retransmit, disclose, display, copy, reproduce or redistribute the LDC Databases to others outside their organizations.
How are you authorized to make the request?	LDC/University of Pennsylvania either owns the data or has the right to distribute it pursuant to legal agreements with data providers.
How is the content not covered under the Fair Use Act sections 107 or 108?	We are not qualified to say what would constitute a fair use of LDC Databases. However, in this instance, the user has either breached an existing license agreement or hacked our catalog to obtain these datasets and is offering them on the platform to the same community served by LDC.
Provide a statement that the complaining party has a good faith belief.	We have a good faith belief that use of the material in the manner complained of is not authorized by the copyright owner(s), its agent, or the law.
