abstract= {# Penn Treebank III

## Metadata

- _Item Name:_ Treebank-3
- _Author(s):_ Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor
- _LDC Catalog No.:_ LDC99T42
- _ISBN:_ 1-58563-163-9
- _ISLRN:_ 141-282-691-413-2
- _Member Year(s):_ 1999
- _DCMI Type(s):_ Text
- _Data Source(s):_ telephone speech, newswire, microphone speech, transcribed speech, varied
- _Project(s):_ TIDES, GALE
- _Application(s):_ parsing, natural language processing, tagging
- _Language(s):_ English
- _Language ID(s):_ eng
- _License(s):_ LDC User Agreement for Non-Members
- _Online Documentation:_ LDC99T42 Documents
- _Citation:_ Marcus, Mitchell P., et al. Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium, 1999.

## Introduction

This release contains the following Treebank-2 material:

-   One million words of 1989 Wall Street Journal material annotated in Treebank II style.
-   A small sample of ATIS-3 material annotated in Treebank II style.
-   A fully tagged version of the Brown Corpus.

and the following new material:

-   Switchboard tagged, dysfluency-annotated, and parsed text.
-   Brown parsed text.

The Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied.
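As a rough illustration of how such bracketing can be read, the following minimal sketch parses one bracketed sentence into nested lists and pulls out the subject and the verb. The sample sentence is invented for illustration and is not taken from the corpus, and the helper names (`tokenize`, `parse`, `words`, `find`) are hypothetical. It assumes a single well-formed tree with a labeled root; files in the release may wrap each tree in an additional unlabeled pair of parentheses, which this sketch does not handle.

```python
import re

# Invented example in Treebank II style; not drawn from the corpus itself.
SAMPLE = "(S (NP-SBJ (DT The) (NN board)) (VP (VBD approved) (NP (DT the) (NN plan))))"

def tokenize(text):
    # Split into "(", ")", and runs of characters that are neither space nor parenthesis.
    return re.findall(r"\(|\)|[^\s()]+", text)

def parse(tokens):
    # Build nested lists of the form [label, child, child, ...] using a stack.
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            done = stack.pop()
            stack[-1].append(done)
        else:
            stack[-1].append(tok)
    return stack[0][0]

def words(tree):
    # Collect the leaf words under a constituent, left to right.
    out = []
    for child in tree[1:]:
        if isinstance(child, list):
            out.extend(words(child))
        else:
            out.append(child)
    return out

def find(tree, label):
    # Depth-first search for the first constituent carrying the given label.
    if tree[0] == label:
        return tree
    for child in tree[1:]:
        if isinstance(child, list):
            hit = find(child, label)
            if hit is not None:
                return hit
    return None

tree = parse(tokenize(SAMPLE))
subject = " ".join(words(find(tree, "NP-SBJ")))
verb = " ".join(words(find(tree, "VBD")))
print(subject, "|", verb)
```

Here the `-SBJ` function tag on the noun phrase identifies the subject, so the argument of the predicate can be recovered from the bracketing without further analysis.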

## Data

The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files, which relate the 2,499 PTB filenames to the corresponding WSJ DOCNO strings in TIPSTER, are available in a compressed file (pennTB\_tipster\_wsj\_map.tar.gz) as an additional download for users who have licensed Treebank-2.

## Samples

Please view the following samples:

-   Part-of-Speech Tags
-   Dysfluency Annotation
-   Dysfluency Annotation & Part-of-Speech Tags
-   Dysfluency Annotation, Part-of-Speech Tags & Turns Joined
-   Syntactic Annotation
-   Syntactic Annotation & Part-of-Speech Tags

## Updates

After publication, it was discovered that not all of the PostScript (\*.ps) files had been converted to PDF and that some of the converted PDFs contained errors. For PDF copies of the documentation files, please see the addenda for a list of the files available.

As of October 5, 2016, 252 previously missing wsj files from Treebank-2 were added.

As of February 2017, 2,499 "raw" wsj files from Treebank-2 (LDC95T7) were added.

Corpus downloads after these dates will include these missing files.},
