NNTP Discussion Archives
nyuuzyou

main (257 files)
data/articles_0a.parquet 797.63MB
data/articles_0b.parquet 801.09MB
data/articles_0c.parquet 795.18MB
data/articles_0d.parquet 797.39MB
data/articles_0e.parquet 795.71MB
data/articles_0f.parquet 794.47MB
data/articles_00.parquet 818.75MB
data/articles_01.parquet 797.43MB
data/articles_02.parquet 797.84MB
data/articles_03.parquet 794.41MB
data/articles_04.parquet 796.44MB
data/articles_05.parquet 793.85MB
data/articles_06.parquet 796.30MB
data/articles_07.parquet 796.68MB
data/articles_08.parquet 798.36MB
data/articles_09.parquet 799.20MB
data/articles_1a.parquet 796.58MB
data/articles_1b.parquet 799.90MB
data/articles_1c.parquet 797.23MB
data/articles_1d.parquet 796.84MB
data/articles_1e.parquet 797.02MB
data/articles_1f.parquet 793.08MB
data/articles_2a.parquet 795.72MB
data/articles_2b.parquet 796.88MB
data/articles_2c.parquet 794.51MB
data/articles_2d.parquet 799.53MB
data/articles_2e.parquet 803.85MB
data/articles_2f.parquet 800.11MB
data/articles_3a.parquet 794.45MB
data/articles_3b.parquet 796.33MB
data/articles_3c.parquet 796.64MB
data/articles_3d.parquet 796.02MB
data/articles_3e.parquet 794.07MB
data/articles_3f.parquet 796.40MB
data/articles_4a.parquet 795.82MB
data/articles_4b.parquet 798.94MB
data/articles_4c.parquet 796.15MB
data/articles_4d.parquet 795.83MB
data/articles_4e.parquet 797.13MB
data/articles_4f.parquet 795.45MB
data/articles_5a.parquet 797.98MB
data/articles_5b.parquet 799.98MB
data/articles_5c.parquet 798.02MB
data/articles_5d.parquet 796.24MB
data/articles_5e.parquet 795.11MB
data/articles_5f.parquet 797.80MB
data/articles_6a.parquet 798.62MB
data/articles_6b.parquet 795.84MB
data/articles_6c.parquet 796.27MB
(file listing truncated)
Type: Dataset

Metadata:
@article{,
title= {NNTP Discussion Archives},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/nntp-text-387m},
abstract= {
# NNTP Discussion Archives

A large-scale collection of text discussions from public NNTP (Network News Transfer Protocol) newsgroups spanning over two decades.

## Dataset Statistics

| Metric | Value |
|--------|-------|
| Total messages | 386,629,949 |
| Unique newsgroups | 159,345 |
| Date range | 2002 - 2026 |
| Total size | ~191 GB (compressed) |
| File format | Parquet (ZSTD) |
| Number of files | 256 |
| Average content length | ~1,400 characters |

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `message_id` | `string` | Original message identifier (unchanged) |
| `newsgroups` | `string` | Target newsgroup(s), comma-separated if cross-posted |
| `author` | `string` | Message author with email addresses redacted as `[email]` |
| `subject` | `string` | Subject line |
| `date` | `string` | RFC 2822 formatted date string |
| `content` | `string` | Message body with email addresses redacted as `[email]` |
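
A minimal sketch of how one row with this schema might be normalized after loading, assuming rows arrive as plain dictionaries (for example from a Parquet reader). The helper name `normalize_row` and the sample values are hypothetical; only the column names come from the table above.

```python
from email.utils import parsedate_to_datetime


def normalize_row(row):
    """Hypothetical helper: split cross-posts and parse the RFC 2822 date."""
    # `newsgroups` is comma-separated when a message was cross-posted.
    groups = [g.strip() for g in row["newsgroups"].split(",") if g.strip()]
    try:
        parsed = parsedate_to_datetime(row["date"])
    except (TypeError, ValueError):
        parsed = None  # malformed date strings are left unparsed
    return {**row, "newsgroups": groups, "date_parsed": parsed}


# Made-up sample row for illustration only.
example = {
    "message_id": "<abc123@example.invalid>",
    "newsgroups": "alt.atheism,talk.politics.misc",
    "author": "Jane Doe <[email]>",
    "subject": "Re: test",
    "date": "Tue, 01 Jul 2003 10:52:37 +0200",
    "content": "Hello",
}
row = normalize_row(example)
```

Since `date` is stored as a string, parsing is left to the consumer; the sketch keeps the original string and adds the parsed value alongside it.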

## Top Newsgroups by Volume

| Newsgroup | Messages |
|-----------|----------|
| alt.atheism | 5,658,023 |
| free.usenet | 4,691,561 |
| alt.fan.rush-limbaugh | 4,659,639 |
| alt.politics | 3,919,772 |
| fr.soc.politique | 3,554,434 |
| it.sport.calcio.milan | 2,961,804 |
| it.politica | 2,802,687 |
| alt.politics.bush | 2,786,316 |
| talk.politics.misc | 2,784,668 |
| Other (159,336 groups) | 475,430,274 |

*Cross-posted messages are counted once per newsgroup, so totals exceed the 386M unique messages.*
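
The counting rule in the note above can be sketched as follows. This is an illustration under the assumption that each message's `newsgroups` value is a comma-separated string; the function name `count_by_newsgroup` and the sample values are hypothetical.

```python
from collections import Counter


def count_by_newsgroup(newsgroup_fields):
    """Count each message once for every distinct group it was posted to."""
    counts = Counter()
    for field in newsgroup_fields:
        # A set guards against the same group being listed twice in one message.
        groups = {g.strip() for g in field.split(",") if g.strip()}
        counts.update(groups)
    return counts


# Three made-up messages, two of them cross-posted.
fields = [
    "alt.atheism",
    "alt.atheism,alt.politics",
    "alt.politics,talk.politics.misc",
]
counts = count_by_newsgroup(fields)
total_per_group = sum(counts.values())
```

Here three messages produce a per-group total of five, which is why the table's column sum is larger than the number of unique messages.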

## Data Processing

**Filtering:** Binary-focused groups (`*.binaries.*`, `*.pictures.*`, `*.multimedia.*`), binary posts with file-sharing indicators, messages exceeding 500KB, and messages with unrecoverable encoding errors are excluded. Spam is largely **not filtered**: the dataset includes advertisements, phishing, and low-quality posts present in raw newsgroups.
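
A rough sketch of the group-name and size checks, not the dataset's actual filter. It assumes that "500KB" means 500 KiB of UTF-8 bytes and that a cross-posted message is dropped when any of its groups matches a binary pattern; both are guesses, and the names `is_binary_group` and `keep_message` are hypothetical. Detection of file-sharing indicators is not covered.

```python
from fnmatch import fnmatchcase

BINARY_GROUP_PATTERNS = ("*.binaries.*", "*.pictures.*", "*.multimedia.*")
MAX_MESSAGE_BYTES = 500 * 1024  # assumption: "500KB" read as 500 KiB


def is_binary_group(name):
    """True if the group name matches one of the binary-focused patterns."""
    return any(fnmatchcase(name, pattern) for pattern in BINARY_GROUP_PATTERNS)


def keep_message(newsgroups, content):
    """Hypothetical filter: group-name check first, then the size limit."""
    groups = [g.strip() for g in newsgroups.split(",") if g.strip()]
    # Assumption: one matching group is enough to exclude the message.
    if any(is_binary_group(g) for g in groups):
        return False
    return len(content.encode("utf-8")) <= MAX_MESSAGE_BYTES
```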

**Encoding:** Messages are normalized to UTF-8 with the following decoding pipeline:
- Quoted-Printable: MIME-encoded content decoded to text
- Base64: Text base64 content decoded; binary base64 excluded
- Legacy encodings: Content that is not valid UTF-8 is decoded via legacy encoding detection (Windows-1252, ISO-8859-*, KOI8-R, Shift-JIS, GBK, and others) and re-encoded as UTF-8
- MIME encoded-word headers decoded to UTF-8
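
The steps above can be sketched with the Python standard library. This is a simplified illustration, not the pipeline used to build the dataset: trying a fixed list of codecs in order (the assumed `FALLBACK_ENCODINGS`) is a crude stand-in for real encoding detection, and the function names are hypothetical.

```python
import base64
import quopri
from email.header import decode_header, make_header

# Assumed fallback order; real detection would weigh the byte statistics.
FALLBACK_ENCODINGS = ("cp1252", "shift_jis", "gbk")


def decode_body(raw: bytes, transfer_encoding: str = "") -> str:
    """Undo the transfer encoding, then decode the bytes to text."""
    enc = transfer_encoding.lower()
    if enc == "quoted-printable":
        raw = quopri.decodestring(raw)
    elif enc == "base64":
        raw = base64.b64decode(raw)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    for name in FALLBACK_ENCODINGS:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Nothing fit: keep the text, marking undecodable bytes.
    return raw.decode("utf-8", errors="replace")


def decode_header_value(value: str) -> str:
    """Decode MIME encoded-word headers such as =?ISO-8859-1?Q?...?=."""
    return str(make_header(decode_header(value)))
```

The dataset excludes messages with unrecoverable encoding errors, whereas this sketch falls back to replacement characters so that it always returns a string.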

**Deduplication:** Exact content duplicates removed via xxHash64 hashing (first occurrence retained).
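
A minimal sketch of first-occurrence deduplication by content hash. It uses the third-party `xxhash` package when installed and otherwise substitutes a 64-bit BLAKE2b digest from the standard library, which is a different hash used only so the example runs anywhere. The function names, and hashing the UTF-8 bytes of `content` alone, are assumptions.

```python
import hashlib

try:
    import xxhash  # third-party package; optional for this sketch

    def content_hash(text: str) -> int:
        return xxhash.xxh64(text.encode("utf-8")).intdigest()

except ImportError:
    # Stand-in, not xxHash64: a 64-bit BLAKE2b digest.
    def content_hash(text: str) -> int:
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")


def dedupe(rows):
    """Yield rows whose content hash has not been seen before."""
    seen = set()
    for row in rows:
        h = content_hash(row["content"])
        if h in seen:
            continue  # later duplicate: dropped
        seen.add(h)
        yield row


# Made-up rows: the third repeats the first row's content.
rows = [
    {"message_id": "a", "content": "x"},
    {"message_id": "b", "content": "y"},
    {"message_id": "c", "content": "x"},
]
kept = [r["message_id"] for r in dedupe(rows)]
```

Keeping only a set of 64-bit hashes, rather than the message bodies, keeps memory use modest, at the cost of a small chance that two different messages collide.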

**Privacy:** Email addresses in `author` and `content` fields redacted as `[email]`; `message_id` unchanged.
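
The redaction step might look like the sketch below. The pattern is an assumption chosen for readability, not the expression used to build the dataset, and it does not cover the full RFC 5322 address grammar; `redact_emails` is a hypothetical name.

```python
import re

# Assumed pattern: deliberately simple, not a complete address grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with [email]."""
    return EMAIL_RE.sub("[email]", text)


redacted_author = redact_emails("Jane Doe <jane.doe@example.com>")
```

Message identifiers often have the same `local@host` shape as email addresses, so a function like this would be applied to `author` and `content` only, leaving `message_id` untouched.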

## Considerations

- Messages were posted to public newsgroups
- Content reflects unmoderated discussions and may contain controversial opinions},
keywords= {discussions, historical},
terms= {},
license= {},
superseded= {}
}

Citation:
nyuuzyou. (2026). NNTP Discussion Archives [Data set]. Academic Torrents. https://academictorrents.com/details/cac053d01e256ae3001bf40c5c98eefa86cdc870
