NNTP Discussion Archives
nyuuzyou

folder main (257 files)
filedata/articles_0a.parquet 797.63MB
filedata/articles_0b.parquet 801.09MB
filedata/articles_0c.parquet 795.18MB
filedata/articles_0d.parquet 797.39MB
filedata/articles_0e.parquet 795.71MB
filedata/articles_0f.parquet 794.47MB
filedata/articles_00.parquet 818.75MB
filedata/articles_01.parquet 797.43MB
filedata/articles_02.parquet 797.84MB
filedata/articles_03.parquet 794.41MB
filedata/articles_04.parquet 796.44MB
filedata/articles_05.parquet 793.85MB
filedata/articles_06.parquet 796.30MB
filedata/articles_07.parquet 796.68MB
filedata/articles_08.parquet 798.36MB
filedata/articles_09.parquet 799.20MB
filedata/articles_1a.parquet 796.58MB
filedata/articles_1b.parquet 799.90MB
filedata/articles_1c.parquet 797.23MB
filedata/articles_1d.parquet 796.84MB
filedata/articles_1e.parquet 797.02MB
filedata/articles_1f.parquet 793.08MB
filedata/articles_2a.parquet 795.72MB
filedata/articles_2b.parquet 796.88MB
filedata/articles_2c.parquet 794.51MB
filedata/articles_2d.parquet 799.53MB
filedata/articles_2e.parquet 803.85MB
filedata/articles_2f.parquet 800.11MB
filedata/articles_3a.parquet 794.45MB
filedata/articles_3b.parquet 796.33MB
filedata/articles_3c.parquet 796.64MB
filedata/articles_3d.parquet 796.02MB
filedata/articles_3e.parquet 794.07MB
filedata/articles_3f.parquet 796.40MB
filedata/articles_4a.parquet 795.82MB
filedata/articles_4b.parquet 798.94MB
filedata/articles_4c.parquet 796.15MB
filedata/articles_4d.parquet 795.83MB
filedata/articles_4e.parquet 797.13MB
filedata/articles_4f.parquet 795.45MB
filedata/articles_5a.parquet 797.98MB
filedata/articles_5b.parquet 799.98MB
filedata/articles_5c.parquet 798.02MB
filedata/articles_5d.parquet 796.24MB
filedata/articles_5e.parquet 795.11MB
filedata/articles_5f.parquet 797.80MB
filedata/articles_6a.parquet 798.62MB
filedata/articles_6b.parquet 795.84MB
filedata/articles_6c.parquet 796.27MB
Type: Dataset

Bibtex:
@article{,
title= {NNTP Discussion Archives},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/nntp-text-387m},
abstract= {
# NNTP Discussion Archives

A large-scale collection of text discussions from public NNTP (Network News Transfer Protocol) newsgroups, spanning more than two decades (2002 to 2026).

## Dataset Statistics

| Metric | Value |
|--------|-------|
| Total messages | 386,629,949 |
| Unique newsgroups | 159,345 |
| Date range | 2002 - 2026 |
| Total size | ~191 GB (compressed) |
| File format | Parquet (ZSTD) |
| Number of files | 256 |
| Average content length | ~1,400 characters |
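
The 256 files can be enumerated programmatically. The sketch below assumes each shard is named `articles_<two lowercase hex digits>.parquet`, which is what the visible file names suggest; the helper `shard_names` is hypothetical and the directory prefix is deliberately left out.

```python
def shard_names() -> list[str]:
    # Assumption: one shard per two-digit hex suffix, 00 through ff.
    return [f"articles_{i:02x}.parquet" for i in range(256)]


names = shard_names()
print(len(names), names[0], names[-1])
```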

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `message_id` | `string` | Original message identifier (unchanged) |
| `newsgroups` | `string` | Target newsgroup(s), comma-separated if cross-posted |
| `author` | `string` | Message author with email addresses redacted as `[email]` |
| `subject` | `string` | Subject line |
| `date` | `string` | RFC 2822 formatted date string |
| `content` | `string` | Message body with email addresses redacted as `[email]` |
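
Since every column is a plain string, `newsgroups` and `date` usually need parsing before analysis. A minimal sketch using only the Python standard library follows; the helper names and the sample row are hypothetical and not taken from the dataset.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def split_newsgroups(value: str) -> list[str]:
    """Split the comma-separated `newsgroups` field into a clean list."""
    return [g.strip() for g in value.split(",") if g.strip()]


def parse_date(value: str) -> Optional[datetime]:
    """Parse an RFC 2822 date string; return None if it is malformed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None


# Hypothetical row shaped like the schema above.
row = {
    "message_id": "<example-1@news.invalid>",
    "newsgroups": "alt.atheism, talk.politics.misc",
    "author": "Example Poster <[email]>",
    "subject": "Re: example",
    "date": "Tue, 05 Mar 2002 14:30:00 +0000",
    "content": "Contact me at [email]",
}
print(split_newsgroups(row["newsgroups"]))
print(parse_date(row["date"]))
```

Returning `None` for malformed dates is a design choice for this sketch; raw newsgroup headers are not guaranteed to be well-formed.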

## Top Newsgroups by Volume

| Newsgroup | Messages |
|-----------|----------|
| alt.atheism | 5,658,023 |
| free.usenet | 4,691,561 |
| alt.fan.rush-limbaugh | 4,659,639 |
| alt.politics | 3,919,772 |
| fr.soc.politique | 3,554,434 |
| it.sport.calcio.milan | 2,961,804 |
| it.politica | 2,802,687 |
| alt.politics.bush | 2,786,316 |
| talk.politics.misc | 2,784,668 |
| Other (159,336 groups) | 475,430,274 |

*Cross-posted messages are counted once per newsgroup, so totals exceed the 386M unique messages.*
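
The counting rule can be illustrated on toy data. This is a sketch with hypothetical `newsgroups` values, not the script used to build the table above.

```python
from collections import Counter

# Hypothetical `newsgroups` values for three messages.
rows = [
    "alt.atheism",
    "alt.atheism,alt.politics",
    "alt.politics, talk.politics.misc",
]

per_group = Counter()
for value in rows:
    # A set guards against the same group being listed twice in one header.
    groups = {g.strip() for g in value.split(",") if g.strip()}
    per_group.update(groups)

print(len(rows))                # number of unique messages
print(sum(per_group.values()))  # sum of per-group counts
```

Here three messages produce five per-group counts, which is why the per-group totals add up to more than the number of unique messages.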

## Data Processing

**Filtering:** Binary-focused groups (`*.binaries.*`, `*.pictures.*`, `*.multimedia.*`), binary posts with file-sharing indicators, messages exceeding 500KB, and messages with unrecoverable encoding errors are excluded. Spam is largely **not filtered**: the dataset includes advertisements, phishing, and low-quality posts present in raw newsgroups.
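
A rough sketch of the group-name and size rules is shown below. It makes several assumptions that the description does not confirm: "500KB" is read as 500 KiB of UTF-8 text, a cross-post is dropped if any target group matches a binary pattern, and the helper names are hypothetical. The check for file-sharing indicators is not modeled.

```python
from fnmatch import fnmatchcase

BINARY_GROUP_PATTERNS = ("*.binaries.*", "*.pictures.*", "*.multimedia.*")
MAX_BYTES = 500 * 1024  # assumption: "500KB" read as 500 KiB


def is_binary_group(name: str) -> bool:
    """True if the group name matches one of the binary-focused patterns."""
    return any(fnmatchcase(name, p) for p in BINARY_GROUP_PATTERNS)


def keep_message(newsgroups: str, content: str) -> bool:
    groups = [g.strip() for g in newsgroups.split(",") if g.strip()]
    # Assumption: one binary-focused target group is enough to drop a cross-post.
    if any(is_binary_group(g) for g in groups):
        return False
    return len(content.encode("utf-8")) <= MAX_BYTES
```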

**Encoding:** Messages are normalized to UTF-8 with the following decoding pipeline:
- Quoted-Printable: MIME-encoded content decoded to text
- Base64: Text base64 content decoded; binary base64 excluded
- Legacy encodings: Content that is not valid UTF-8 is decoded via legacy encoding detection (Windows-1252, ISO-8859-*, KOI8-R, Shift-JIS, GBK, and others) and re-encoded as UTF-8
- MIME encoded-word headers decoded to UTF-8
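
The steps above can be approximated with the Python standard library. This is a simplified sketch, not the pipeline actually used: a fixed fallback order (`FALLBACK_CODECS`, an assumption) stands in for real encoding detection, and the function names are hypothetical.

```python
import base64
import quopri
from email.header import decode_header, make_header

# Assumption: a fixed fallback order instead of statistical encoding detection.
FALLBACK_CODECS = ("utf-8", "cp1252", "latin-1")


def decode_body(raw: bytes, transfer_encoding: str = "") -> str:
    """Undo the transfer encoding, then decode the bytes to text."""
    enc = transfer_encoding.strip().lower()
    if enc == "quoted-printable":
        raw = quopri.decodestring(raw)
    elif enc == "base64":
        raw = base64.b64decode(raw)
    for codec in FALLBACK_CODECS:
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    # Not reached in practice, since latin-1 accepts every byte value.
    return raw.decode("utf-8", errors="replace")


def decode_header_value(value: str) -> str:
    """Decode MIME encoded-word headers such as =?ISO-8859-1?Q?...?=."""
    return str(make_header(decode_header(value)))
```

A fallback chain like this cannot distinguish KOI8-R, Shift-JIS, or GBK from Windows-1252, so a real pipeline would need a detection step before choosing the codec.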

**Deduplication:** Exact content duplicates removed via xxHash64 hashing (first occurrence retained).
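
First-occurrence deduplication on a content hash might look like the sketch below. To stay within the standard library it uses a 64-bit BLAKE2b digest as a stand-in, not xxHash64, and the function names are hypothetical.

```python
import hashlib


def content_key(content: str) -> bytes:
    # Stand-in: 64-bit BLAKE2b from the standard library, not xxHash64.
    return hashlib.blake2b(content.encode("utf-8"), digest_size=8).digest()


def dedupe(rows):
    """Yield rows whose `content` has not been seen before (first one wins)."""
    seen = set()
    for row in rows:
        key = content_key(row["content"])
        if key not in seen:
            seen.add(key)
            yield row
```

With the third-party `xxhash` package installed, `content_key` could instead return `xxhash.xxh64(data).intdigest()` to match the hash named above.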

**Privacy:** Email addresses in the `author` and `content` fields are redacted as `[email]`; `message_id` is left unchanged.
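
A redaction step of this kind can be sketched with a regular expression. The pattern below is a deliberately simple assumption; the rule actually applied to the dataset is not documented here.

```python
import re

# Assumption: a simple address pattern, not the dataset's actual rule.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with [email]."""
    return EMAIL_RE.sub("[email]", text)


print(redact_emails("Jane Doe <jane.doe@example.com>"))
```

Message identifiers have the same `local@host` shape as email addresses, so a step like this would be applied only to `author` and `content` to keep `message_id` intact.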

## Considerations

- Messages were posted to public newsgroups
- Content reflects unmoderated discussions and may contain controversial opinions},
keywords= {discussions, historical},
terms= {},
license= {},
superseded= {}
}

