Name: The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Creator: EleutherAI
Published: 2021-03-01 01:37:09
License: https://academictorrents.com/nolicensespecified

Info hash

0d366035664fdf51cfbe9f733953ba325776e667

Last mirror activity

1035d,10:38:23 ago

Size

772.89GB (772,891,257,239 bytes)

Added

2021-03-01 01:37:09

Views

1051

Hits

1112

4618

Type

multi

Downloaded

465 time(s)

Uploaded by

joecohen

Folder

EleutherAI_ThePile_v1

Num files

51 files

File list

README.txt	0.10kB
pile/SHA256SUMS.txt	2.78kB
pile/test.jsonl.zst	460.25MB
pile/train/00.jsonl.zst	15.24GB
pile/train/01.jsonl.zst	15.21GB
pile/train/02.jsonl.zst	15.21GB
pile/train/03.jsonl.zst	15.19GB
pile/train/04.jsonl.zst	15.19GB
pile/train/05.jsonl.zst	15.21GB
pile/train/06.jsonl.zst	15.26GB
pile/train/07.jsonl.zst	15.31GB
pile/train/08.jsonl.zst	15.23GB
pile/train/09.jsonl.zst	15.22GB
pile/train/10.jsonl.zst	15.23GB
pile/train/11.jsonl.zst	15.22GB
pile/train/12.jsonl.zst	15.26GB
pile/train/13.jsonl.zst	15.21GB
pile/train/14.jsonl.zst	15.22GB
pile/train/15.jsonl.zst	15.28GB
pile/train/16.jsonl.zst	15.27GB
pile/train/17.jsonl.zst	15.31GB
pile/train/18.jsonl.zst	15.31GB
pile/train/19.jsonl.zst	15.28GB
pile/train/20.jsonl.zst	15.21GB
pile/train/21.jsonl.zst	15.31GB
pile/train/22.jsonl.zst	15.30GB
pile/train/23.jsonl.zst	15.29GB
pile/train/24.jsonl.zst	15.19GB
pile/train/25.jsonl.zst	15.20GB
pile/train/26.jsonl.zst	15.20GB
pile/train/27.jsonl.zst	15.22GB
pile/train/28.jsonl.zst	15.22GB
pile/train/29.jsonl.zst	15.22GB
pile/val.jsonl.zst	470.91MB
pile_preliminary_components/2020-09-08-arxiv-extracts-nofallback-until-2007-068.tar.gz	17.48GB
pile_preliminary_components/EuroParliamentProceedings_1996_2011.jsonl.zst	1.48GB
pile_preliminary_components/FreeLaw_Opinions.jsonl.zst	17.01GB
pile_preliminary_components/Literotica.jsonl.zst	4.43GB
pile_preliminary_components/NIH_ExPORTER_awarded_grant_text.jsonl.zst	630.78MB
pile_preliminary_components/PMC_extracts.tar.gz	28.28GB
pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst	6.90GB
pile_preliminary_components/PhilArchive.jsonl.zst	797.71MB
pile_preliminary_components/books1.tar.gz	2.40GB
pile_preliminary_components/books3.tar.gz	39.52GB
pile_preliminary_components/github.tar	113.35GB
pile_preliminary_components/hn.tar.gz	706.52MB
pile_preliminary_components/openwebtext2.jsonl.zst.tar	29.34GB
pile_preliminary_components/pile_uspto.tar	11.79GB
pile_preliminary_components/stackexchange_dataset.tar	36.80GB
pile_preliminary_components/ubuntu_irc_until_2020_9_1.jsonl.zst	2.04GB
pile_preliminary_components/yt_subs.jsonl.zst	1.78GB
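
Because the listing ships a pile/SHA256SUMS.txt next to the shards, a downloaded copy can be checked against it. The following is a minimal sketch, not the uploader's own tooling. It assumes the file uses the common sha256sum layout (hex digest, whitespace, relative path) and that the listed paths resolve against a directory you pass in; neither is confirmed by this page. The names sha256_of, parse_sums and verify are hypothetical helpers.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so multi-GB shards never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sums(text):
    """Parse sha256sum-style lines: "<hex digest>  <relative path>" (assumed layout)."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        digest, name = line.split(None, 1)
        # sha256sum marks binary mode with a leading "*" on the file name.
        expected[name.lstrip("*")] = digest.lower()
    return expected


def verify(sums_file, root):
    """Return {relative path: "ok" | "mismatch" | "missing"} for every listed file."""
    results = {}
    for name, digest in parse_sums(Path(sums_file).read_text()).items():
        target = Path(root) / name
        if not target.is_file():
            results[name] = "missing"
        elif sha256_of(target) == digest:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

A call such as verify("pile/SHA256SUMS.txt", "pile") would then report each listed file; whether the paths inside the checksum file are relative to pile/ or to the torrent folder is an assumption to check against the actual file.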

Mirrors

0 complete, 0 downloading = 0 mirror(s) total

Date	2023-08-24 13:54:33
Submitter Name	Thomas Heldrup, Vesterbrogade 15, 1 floor, 1620 Copenhagen V, Denmark
Submitter Email	Thomas.heldrup@rettighedsalliancen.dk
Provide a description of the content in question:	The book "Afrikas Horn" by Wilbur Smith, published by Lindhardt og Ringhof A/S in Denmark. There are additional 108 works we represent that are infringed on the URL. On the following link you can see an official description of "Afrikas Horn" by Wilbur Smith: https://www.lindhardtogringhof.dk/afrikas-horn-3
How are you authorized to make the request?	Authorised agent
How is the content not covered under the Fair Use Act sections 107 or 108?	The work originates from an illegal filesharing site called bibliotik.me (this is explicit from the paper documenting "The Pile", found here: https://arxiv.org/abs/2101.00027). As the origin of the copy of the content is an illegal source, the content cannot be claimed to fall under the Fair Use doctrine.
Provide a statement that the complaining party has a good faith belief.	I have good faith that the use of the work in this notice is not authorised by the copyright owner, its agent, or the law.
