Public MediaWiki Collection
nyuuzyou

main (2 files)
wikis.parquet 1.17GB
README.md 2.02kB
Type: Dataset

Bibtex:
@article{,
title= {Public MediaWiki Collection},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/wikis},
abstract= {# Dataset Card for Public MediaWiki Collection

### Dataset Summary
This dataset contains 1,662,448 articles harvested from 930 random public MediaWiki instances found across the Internet. The collection was created by extracting current page content from these wikis, preserving article text, metadata, and structural information. The dataset represents a diverse cross-section of public wiki content spanning multiple domains, topics, and languages.

### Languages
The dataset is multilingual, covering 35+ languages found across the collected wiki instances.

## Dataset Structure

### Data Fields
This dataset includes the following fields:
- `id`: Unique identifier for the article (string)
- `title`: Title of the article (string)
- `text`: Main content of the article (string)
- `metadata`: Dictionary containing:
  - `templates`: List of templates used in the article
  - `categories`: List of categories the article belongs to
  - `wikilinks`: List of internal wiki links and their text
  - `external_links`: List of external links
  - `sections`: List of section titles and their levels
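
The field layout above can be checked programmatically. The following is a minimal sketch, assuming each record is available as a plain Python dict whose `metadata` values are ordinary lists. The helper name `check_record` and the sample record are hypothetical and not part of the dataset. This card does not specify the inner shape of the `metadata` lists, so only the container type is checked.

```python
from typing import Any, Dict, List

# Top-level fields and their expected types, as listed in this card.
TOP_LEVEL_FIELDS = {"id": str, "title": str, "text": str, "metadata": dict}

# Keys of the `metadata` dictionary, as listed in this card.
METADATA_KEYS = ("templates", "categories", "wikilinks", "external_links", "sections")


def check_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty if none).

    Hypothetical helper: it only verifies that the documented fields
    exist and have the documented container types.
    """
    problems = []
    for name, expected in TOP_LEVEL_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type for field: {name}")
    metadata = record.get("metadata")
    if isinstance(metadata, dict):
        for key in METADATA_KEYS:
            if key not in metadata:
                problems.append(f"missing metadata key: {key}")
            elif not isinstance(metadata[key], (list, tuple)):
                problems.append(f"metadata key is not a list: {key}")
    return problems


# Made-up sample record for illustration only; not taken from the dataset.
sample = {
    "id": "1",
    "title": "Example",
    "text": "Body text",
    "metadata": {
        "templates": [],
        "categories": [],
        "wikilinks": [],
        "external_links": [],
        "sections": [],
    },
}
print(check_record(sample))  # prints []
```

If the records are read through a library that returns array types instead of lists for the nested fields, the `isinstance` check on the `metadata` values would need to be relaxed accordingly.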

### Data Splits
All examples are in a single split.

## Additional Information

### License
This dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC-BY-SA 4.0) license, consistent with the licensing of the source MediaWiki instances.

To learn more about CC-BY-SA 4.0, visit: https://creativecommons.org/licenses/by-sa/4.0/},
keywords= {},
terms= {},
license= {CC-BY-SA 4.0},
superseded= {}
}
