Public MediaWiki Collection
nyuuzyou

Folder main (2 files):
wikis.parquet 1.17GB
README.md 2.02kB
Type: Dataset

Bibtex:
@article{,
title= {Public MediaWiki Collection},
journal= {},
author= {nyuuzyou},
year= {},
url= {https://huggingface.co/datasets/nyuuzyou/wikis},
abstract= {# Dataset Card for Public MediaWiki Collection

### Dataset Summary
This dataset contains 1,662,448 articles harvested from 930 random public MediaWiki instances found across the Internet. The collection was created by extracting current page content from these wikis, preserving article text, metadata, and structural information. The dataset represents a diverse cross-section of public wiki content spanning multiple domains, topics, and languages.

### Languages
The dataset is multilingual, covering 35+ languages found across the collected wiki instances.

## Dataset Structure

### Data Fields
This dataset includes the following fields:
- `id`: Unique identifier for the article (string)
- `title`: Title of the article (string)
- `text`: Main content of the article (string)
- `metadata`: Dictionary containing:
  - `templates`: List of templates used in the article
  - `categories`: List of categories the article belongs to
  - `wikilinks`: List of internal wiki links and their text
  - `external_links`: List of external links
  - `sections`: List of section titles and their levels
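
The field layout above can be explored with a small sketch like the following. It assumes each record behaves like a Python dictionary with the listed keys, and that `categories` holds plain strings; the card does not specify the inner shape of the metadata lists, so treat that as an assumption. The helper names and the sample records are hypothetical and only illustrate the structure.

```python
from collections import Counter
from typing import Any, Iterable


def _as_list(value: Any) -> list:
    """Return value as a list, treating a missing field (None) as empty."""
    return [] if value is None else list(value)


def summarize_article(record: dict[str, Any]) -> dict[str, Any]:
    """Collect basic size statistics for one article record."""
    meta = record.get("metadata") or {}
    return {
        "id": record["id"],
        "title": record["title"],
        "n_chars": len(record["text"]),
        "n_categories": len(_as_list(meta.get("categories"))),
        "n_sections": len(_as_list(meta.get("sections"))),
    }


def top_categories(records: Iterable[dict[str, Any]], k: int = 5) -> list:
    """Count category occurrences across records (assumes string categories)."""
    counts: Counter = Counter()
    for record in records:
        meta = record.get("metadata") or {}
        counts.update(_as_list(meta.get("categories")))
    return counts.most_common(k)


# Hypothetical records that mirror the documented fields; not real dataset rows.
sample_a = {
    "id": "1",
    "title": "Main Page",
    "text": "Welcome to the wiki.",
    "metadata": {
        "templates": [],
        "categories": ["Help", "Meta"],
        "wikilinks": [],
        "external_links": [],
        "sections": [],
    },
}
sample_b = {
    "id": "2",
    "title": "Editing",
    "text": "How to edit pages.",
    "metadata": {
        "templates": [],
        "categories": ["Help"],
        "wikilinks": [],
        "external_links": [],
        "sections": [],
    },
}

print(summarize_article(sample_a))
print(top_categories([sample_a, sample_b]))
```

To run the same helpers on the real data, one option is to read `wikis.parquet` with `pandas.read_parquet` and iterate over `DataFrame.to_dict("records")`; whether the nested `metadata` column arrives as a dictionary in that case depends on the Parquet schema, which this card does not describe.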

### Data Splits
All examples are in a single split.

## Additional Information

### License
This dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC-BY-SA 4.0) license, consistent with the licensing of the source MediaWiki instances.

To learn more about CC-BY-SA 4.0, visit: https://creativecommons.org/licenses/by-sa/4.0/},
keywords= {},
terms= {},
license= {CC-BY-SA 4.0},
superseded= {}
}
