<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:academictorrents="https://academictorrents.com" version="2.0">
<channel>
<title>Academic Torrents</title>
<description>Recent Torrents</description>
<link>https://academictorrents.com/</link>
<item>
<title>Phishing &amp; Malware Website Snapshots </title>
<category>Dataset</category>
<infohash>8beb63ba7bb1ed7affb2fbe77ec18f4ab6a55d20</infohash>
<guid>https://academictorrents.com/details/8beb63ba7bb1ed7affb2fbe77ec18f4ab6a55d20</guid>
<link>https://academictorrents.com/details/8beb63ba7bb1ed7affb2fbe77ec18f4ab6a55d20</link>
<description># Phishing &amp; Malware Website Snapshots 136,414 phishing and malware website snapshots captured by a headless Chromium browser between July 24 and August 15, 2024. URLs were confirmed or high-confidence phishing/malware at the time of collection, though some hosts had already been blocked or taken down when the snapshot was taken. Each row contains the full HTML source, extracted visible text, complete network traffic from HAR recording, parsed page features, and resource fingerprints. Collected in mid-2024, published March 2026. All of the phishing domains and infrastructure captured here are long dead. The value is in page content, kit patterns, and network behavior rather than live IOCs. 55,339 unique domains, 23,588 of which appear more than once. ## Use Cases - Training phishing/malware classifiers (URL-level, page-level, or multimodal) - Static rule generation for phishing kit detection - Threat intelligence on phishing infrastructure and hosting - Feature engineering for browser-based detection extensions - Academic research on web-based social engineering ## Schema Single flat Parquet table, 42 columns per row. Nested columns use Parquet  list&lt;struct&gt;  types. ### Identification &amp; Metadata | Column | Type | Description | |&amp;mdash;-|&amp;mdash;-|&amp;mdash;-| |  id  | string | Unique archive ID ( hostname_rand5 ) | |  domain  | string | Target hostname | |  url  | string | Original URL visited | |  scan_timestamp  | string | ISO timestamp of capture | |  language  | string | Declared page language | ### Page Content | Column | Type | Description | |&amp;mdash;-|&amp;mdash;-|&amp;mdash;-| |  html  | large_string | Full page HTML source | |  html_length  | int64 | HTML byte length | |  extracted_text  | large_string | Visible text via trafilatura | |  text_length  | int64 | Extracted text byte length | |  title  | string | Page  &lt;title&gt;  content | |  meta_tags  | list&lt;struct&gt; | All  &lt;meta&gt;  tags (name + content) | |  html_comments  | list&lt;string&gt; | HTML comments (useful for kit signatures) | ### Scripts &amp; Styles | Column | Type | Description | |&amp;mdash;-|&amp;mdash;-|&amp;mdash;-| |  external_script_urls  | list&lt;string&gt; | External  &lt;script src&gt;  URLs | |  inline_scripts  | list&lt;large_string&gt; | Full inline JavaScript | |  inline_scripts_count  | int32 | Number of inline  &lt;script&gt;  blocks | |  inline_scripts_total_bytes  | int64 | Total inline JS size | |  external_css_urls  | list&lt;string&gt; | External stylesheet URLs | |  inline_style_count  | int32 | Number of inline  &lt;style&gt;  blocks | ### Forms &amp; Inputs | Column | Type | Description | |&amp;mdash;-|&amp;mdash;-|&amp;mdash;-| |  form_actions  | list&lt;string&gt; |  &lt;form action&gt;  URLs | |  form_count  | int32 | Number of  &lt;form&gt;  elements | |  input_fields  | list&lt;struct&gt; | Input fields with type, name, placeholder | |  input_count  | int32 | Number of  &lt;input&gt;  elements | |  has_password_field  | bool | Page contains a password input | |  has_file_upload  | bool | Page contains a file upload input | ### Linked Resources | Column | Type | Description | |&amp;mdash;-|&amp;mdash;-|&amp;mdash;-| |  favicon_urls  | list&lt;string&gt; | Favicon link hrefs | |  favicon_hashes  | list&lt;struct&gt; | MD5/SHA256 of favicon content | |  anchor_hrefs  | list&lt;string&gt; | All  &lt;a href&gt;  targets | |  image_srcs  | list&lt;string&gt; | All  &lt;img src&gt;  URLs | |  iframe_srcs  | list&lt;string&gt; | All  &lt;iframe 
src&gt;  URLs | |  external_domains  | list&lt;string&gt; | Unique external domains referenced | ### Network &amp; HTTP | Column | Type | Description | |&amp;mdash;-|&amp;mdash;-|&amp;mdash;-| |  final_url  | string | URL after redirects | |  redirect_chain  | list&lt;string&gt; | Full redirect path | |  server_header  | string |  Server  response header | |  x_powered_by  | string |  X-Powered-By  response header | |  content_security_policy  | string |  Content-Security-Policy  header | |  http_status  | int32 | Final HTTP status code | |  network_requests  | list&lt;struct&gt; | All HAR entries (see below) | |  network_request_count  | int32 | Total network requests | |  resource_hashes  | list&lt;struct&gt; | MD5/SHA256 of served resources (see below) | ### Availability Flags | Column | Type | Description | |&amp;mdash;-|&amp;mdash;-|&amp;mdash;-| |  has_html  | bool | Row has HTML content | |  has_har  | bool | Row has HAR data | |  has_text  | bool | Row has extracted text | ### Nested Struct:  network_requests  Each entry:  method ,  url ,  url_domain ,  status ,  mime_type ,  response_size ,  server_ip ,  is_redirect ,  redirect_url ,  response_headers  (list of key/value structs),  request_cookies ,  response_cookies  (with name, domain, httpOnly, secure). ### Nested Struct:  resource_hashes  Each entry:  url ,  mime_type ,  resource_type  (script / style / image / font / document / other),  body_size ,  body_md5 ,  body_sha256 ,  is_favicon . ### Nested Struct:  input_fields  Each entry:  type ,  name ,  placeholder ,  id ,  class_name ,  required ,  autocomplete . ### Nested Struct:  meta_tags  Each entry:  name ,  content . ## Statistics | Metric | Value | |&amp;mdash;-|&amp;mdash;-| | Rows | 136,414 | | Unique domains | 55,339 | | Multi-snapshot domains | 23,588 | | Rows with HTML | 135,407 (99.3%) | | Rows with HAR data | 126,747 (92.9%) | | Rows with extracted text | 134,114 (98.3%) | | Rows with forms | 52,089 (38.2%) | | Rows with password fields | 33,199 (24.3%) | | Total resource hashes | 2,349,290 | | Median network requests/row | 11 | | Mean network requests/row | 20.2 | | Median HTML size | 25 KB | | Mean HTML size | 124 KB | | Total inline JS | 6.1 GB (uncompressed) | | Shards | 28 | | Total size on disk | 2.0 GB (zstd level 19) | ### Language Distribution (top 10) | Language | Rows | |&amp;mdash;-|&amp;mdash;-| | en | 50,925 | | en-US | 9,037 | | ru | 7,926 | | fr | 3,618 | | en-us | 1,759 | | zh-CN | 1,483 | | ja | 1,378 | | (empty) | 1,290 | | de | 1,226 | | fr-FR | 777 | ## Collection Method A headless Chromium browser (Playwright) running inside Docker, controlled by a Flask API, visited each URL. Per visit: 1. Navigate to URL with HAR recording active 2. Save rendered HTML ( index.html ) 3. Capture all network requests and responses ( requests.har , HAR 1.2 format) 4. Extract visible text via trafilatura ( trafilatura.txt ) 5. Compress into a  .7z  archive During conversion to Parquet, HAR response bodies were hashed (MD5 + SHA256) rather than stored. This reduced the dataset from ~76 GB of raw archives to 2 GB of compressed Parquet. Full HTML and inline JavaScript are preserved. Archives containing blocked or error pages were excluded. Filtered titles: "403 Forbidden", "Not found", "Attention Required! | Cloudflare", "Suspected phishing site | Cloudflare". Empty and corrupted archives were also removed. 31,793 archives were filtered in total. ## Safety This dataset contains snapshots of malicious websites. 
The HTML and scripts include: - Credential harvesting forms that mimic legitimate services - Obfuscated JavaScript for redirects, fingerprinting, or exploit delivery - References to attacker-controlled infrastructure Do not execute JavaScript or render HTML from this dataset in a browser without sandboxing. This data is for defensive security research. ## License CC0-1.0</description>
<size>2097072302</size>
</item><item>
<title>PPT Online</title>
<category>Dataset</category>
<infohash>9a63af9f7305cbf9f060f1e4080ef5d703a3a4f5</infohash>
<guid>https://academictorrents.com/details/9a63af9f7305cbf9f060f1e4080ef5d703a3a4f5</guid>
<link>https://academictorrents.com/details/9a63af9f7305cbf9f060f1e4080ef5d703a3a4f5</link>
<description>### Dataset Summary This dataset contains metadata about 1,418,349 PowerPoint (.ppt) files hosted on the ppt-online.org platform. PPT Online is a service designed to display PowerPoint presentations. The dataset includes information such as presentation titles, categories, file sizes, and content snippets. The majority of the presentations are in Russian, Ukrainian, Belarusian, Kazakh, and English, but other languages are also present. ### Languages The dataset is multilingual, with the primary languages being Russian, Ukrainian, Belarusian, Kazakh, and English. However, presentations in other languages are also included. ## Dataset Structure ### Data Fields This dataset includes the following fields: -  id : Unique identifier for the presentation (integer) -  title : Title of the PowerPoint presentation (string) -  category : Category or topic of the presentation (string) -  file_size : Size of the PowerPoint file (string) -  body_content : Snippet or summary of the presentation content. Generated by a service, quite low quality (string) ### Data Splits All examples are in a single split.</description>
<size>3053250450</size>
</item><item>
<title>March 2026 Public Data File from Crossref</title>
<category>Dataset</category>
<infohash>b5ee0e102689b3e67023dd024694c0f5f124646f</infohash>
<guid>https://academictorrents.com/details/b5ee0e102689b3e67023dd024694c0f5f124646f</guid>
<link>https://academictorrents.com/details/b5ee0e102689b3e67023dd024694c0f5f124646f</link>
<description>Note that this Crossref metadata is always openly available. The difference here is that we’ve done the time-saving work of putting all of the records registered through March 2026 into one file for download. To keep this metadata current, you can access new records via our public API at: https://api.crossref.org And, if you do use our API, we encourage you to read the section of the documentation on "etiquette". That is, how to use the API without making it impossible for others to use.</description>
<size>223053729528</size>
</item><item>
<title>[Sample Dataset] March 2026 Public Data File from Crossref</title>
<category>Dataset</category>
<infohash>f692802461d1277d14551526ecb292a9637af254</infohash>
<guid>https://academictorrents.com/details/f692802461d1277d14551526ecb292a9637af254</guid>
<link>https://academictorrents.com/details/f692802461d1277d14551526ecb292a9637af254</link>
<description>[Sample Dataset] March 2026 Public Data File from Crossref. This dataset includes 100 random JSON records from the Crossref metadata corpus.</description>
<size>30841803</size>
</item><item>
<title>Reddit comments/submissions 2026-02</title>
<category>Dataset</category>
<infohash>c5ba00048236b60f819dbf010e9034d24fc291fb</infohash>
<guid>https://academictorrents.com/details/c5ba00048236b60f819dbf010e9034d24fc291fb</guid>
<link>https://academictorrents.com/details/c5ba00048236b60f819dbf010e9034d24fc291fb</link>
<description>Reddit comments and submissions from 2026-02 Documentation, json schemas and more can be found at https://github.com/ArthurHeitmann/arctic_shift Helper scripts for processing files can be found at https://github.com/Watchful1/PushshiftDumps</description>
<size>60377154786</size>
</item><item>
<title>IUGC Ultrasound Dataset (MICCAI 2025)</title>
<category>Dataset</category>
<infohash>71dee5920278325bb73eb735cade3b0f3550e9f9</infohash>
<guid>https://academictorrents.com/details/71dee5920278325bb73eb735cade3b0f3550e9f9</guid>
<link>https://academictorrents.com/details/71dee5920278325bb73eb735cade3b0f3550e9f9</link>
<description>In 2018, the World Health Organization (WHO) published 56 recommendations to improve the quality of intrapartum care and enhance women’s childbirth experiences. In response, the WHO developed the Labour Care Guide (LCG) in 2020, a next-generation tool designed to promote evidence-based, respectful, and woman-centered care during labor and delivery. The LCG was created through expert consultations, primary research with maternity healthcare providers, and usability studies across multiple countries. It serves as a practical tool for monitoring labor progress and maternal and fetal well-being by recording key clinical parameters. When deviations from normal labor progression are detected, the LCG highlights these issues, prompting timely interventions to ensure safe and effective care. Intrapartum ultrasound for labor progression analysis is a crucial examination in labor management. The core operation in this analysis is the identification of landmarks from intrapartum ultrasound images. These landmarks serve as the basis for subsequent qualitative evaluations of angles and distances, which offer valuable diagnostic information regarding labor arrest and influence decisions about the timing and type of intervention. However, obtaining reliable landmark annotations typically demands experienced physicians, and even for proficient obstetricians, manual landmark identification is a time-consuming and labor-intensive endeavor. Consequently, the development of fully automatic and precise landmark localization techniques has been an area of significant and persistent need. The Intrapartum Ultrasound Grand Challenge (IUGC) 2025 is a collaborative initiative involving the "Deep Learning in Intrapartum Ultrasound Image Analysis" cooperative group and prominent clinical societies such as the International Society of Ultrasound in Obstetrics &amp; Gynecology (ISUOG), the World Association of Perinatal Medicine (WAPM), the Perinatal Medicine Foundation (PMF), and the National Institute for Health and Care-Excellence (NICE). The objective of this partnership is to formulate and promote clinically relevant challenges, thereby maximizing the potential clinical impact of innovative algorithmic contributions from participating teams. Since its inception at MICCAI 2023, the IUGC has advanced the Pubic Symphysis - Fetal Head Segmentation (PSFHS) by facilitating and benchmarking algorithmic progress and providing high-quality annotated image datasets. In MICCAI 2024, the IUGC expanded to incorporate multiple benchmarks: (1) The analysis objects were extended from images to videos; (2) The tasks were augmented from image segmentation to classification, segmentation, and biometric parameter measurement; (3) The quantitative parameters were increased from one (i.e., Angle of Progression (AOP)) to two (i.e., AOP and head - symphysis distance (HSD)); and (4) The data sources were broadened from being solely from Asia to include Asia, Europe, and Africa. This novel and inventive design has established a benchmarking ecosystem for the systematic comparison of algorithms across diverse tasks and clinical challenges. 
The significance of the IUGC 2025 lies in its concentration on addressing the actual clinical assessment of labor progress, covering (1) end-to-end measurements (which are currently indirect measurements based on segmentation results); (2) all fetal descent stations during the childbirth process (comprising five “minus”, one “zero”, and three “plus” stations for reliable longitudinal assessment of labor progress); (3) computational tasks (such as regression, detection); and (4) learning methods (semi-supervised, weakly-supervised, and barely-supervised learning). In line with the IUGC’s goal of addressing clinical requirements, authoritative and leading clinical organizations have allied with the IUGC. We have extended the IUGC 2024 Challenge from an indirect ultrasound measurement based on segmentation results to an end-to-end measurement based on landmarks. Specifically, we provide 300 labeled cases and 31,421 unlabeled cases in the training set, 100 visible cases for validation, and 501 hidden cases for testing. The targets are the coordinates of three landmarks and the corresponding biometric parameter. In addition to the typical Mean Radial Error (MRE) and the absolute difference between predicted and manually measured parameters, our evaluation metrics also emphasize inference speed. In summary, the IUGC 2025 challenge exhibits four primary characteristics: (1) Task: Employing semi-supervised landmark detection. (2) Dataset: Curating a large-scale and diverse fetal ultrasound dataset that accounts for all fetal descent stations during the childbirth process. It comprises 28,919 ultrasound images from over 20 medical groups. (3) Evaluation measures: Focusing on detection accuracy. (4) Multiple raters independently annotate a subset of test cases to compare algorithmic performance against human expert inter-rater variability.</description>
<size>1204118461</size>
</item><item>
<title>Wikipedia European languages 2026-03-01</title>
<category>Dataset</category>
<infohash>357aed6775e72b4bac4688497590a262d87d2e2a</infohash>
<guid>https://academictorrents.com/details/357aed6775e72b4bac4688497590a262d87d2e2a</guid>
<link>https://academictorrents.com/details/357aed6775e72b4bac4688497590a262d87d2e2a</link>
<description>Wikipedia database dumps of European language wikis of 10k articles or more. enwiki excluded. Wikipedia Multistream 2026-03-01. These 68 languages are included: Albanian, Alemannic, Aragonese, Asturian, Basque, Bavarian, Belarusian, Bosnian, Breton, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, Emilian-Romagnol, Esperanto, Estonian, Faroese, Finnish, French, Galician, German, Greek, Hungarian, Icelandic, Irish, Italian, Ladin, Latin, Latvian, Ligurian, Limburgish, Lithuanian, Lombard, Low German, Luxembourgish, Macedonian, Maltese, Neapolitan, North Frisian, Norwegian, Nynorsk, Occitan, Piedmontese, Polish, Portuguese, Romanian, Romansh, Rusyn, Samogitian, Scots, Scottish Gaelic, Serbian, Serbo-Croatian, Sicilian, Silesian, Slovak, Slovenian, Spanish, Swedish, Ukrainian, Upper Sorbian, Venetian, Walloon, Welsh, West Frisian, Yiddish.</description>
<size>52792595523</size>
</item><item>
<title>Public MediaWiki Collection</title>
<category>Dataset</category>
<infohash>3ec30d9d8817f62d338ae76783d24ba207b6e9de</infohash>
<guid>https://academictorrents.com/details/3ec30d9d8817f62d338ae76783d24ba207b6e9de</guid>
<link>https://academictorrents.com/details/3ec30d9d8817f62d338ae76783d24ba207b6e9de</link>
<description># Dataset Card for Public MediaWiki Collection ### Dataset Summary This dataset contains 1,662,448 articles harvested from 930 random public MediaWiki instances found across the Internet. The collection was created by extracting current page content from these wikis, preserving article text, metadata, and structural information. The dataset represents a diverse cross-section of public wiki content spanning multiple domains, topics, and languages. ### Languages The dataset is multilingual, covering 35+ languages found across the collected wiki instances. ## Dataset Structure ### Data Fields This dataset includes the following fields: -  id : Unique identifier for the article (string) -  title : Title of the article (string) -  text : Main content of the article (string) -  metadata : Dictionary containing: -  templates : List of templates used in the article -  categories : List of categories the article belongs to -  wikilinks : List of internal wiki links and their text -  external_links : List of external links -  sections : List of section titles and their levels ### Data Splits All examples are in a single split. ## Additional Information ### License This dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC-BY-SA 4.0) license, consistent with the licensing of the source MediaWiki instances. To learn more about CC-BY-SA 4.0, visit: https://creativecommons.org/licenses/by-sa/4.0/</description>
<size>1167900670</size>
</item><item>
<title>9111.ru Questions Dataset</title>
<category>Dataset</category>
<infohash>3fa77d9c4028fd6aa8a6dbdad67a218fc1ad7a5d</infohash>
<guid>https://academictorrents.com/details/3fa77d9c4028fd6aa8a6dbdad67a218fc1ad7a5d</guid>
<link>https://academictorrents.com/details/3fa77d9c4028fd6aa8a6dbdad67a218fc1ad7a5d</link>
<description># Dataset Card for 9111.ru Questions ### Dataset Summary This dataset includes legal questions and answers from the Russian law forum [9111.ru](https://9111.ru). It contains inquiries from users and corresponding responses from lawyers. The dataset was created by processing around 21 million questions, providing a significant corpus of legal discussions. ### Languages The dataset is mostly in Russian, but there may be other languages present. ## Dataset Structure ### Data Fields This dataset includes the following fields: -  id : Identifier for the item (integer) -  title : Title of the question (string) -  description : Description of the question (string) -  answers : An array of answer objects (array) -  user_name : Name of the user who answered (string) -  status : Status of the user (string) -  rating : Rating of the user (integer) -  text : Text of the answer (string) ### Data Format The dataset is stored as Apache Parquet files with zstd compression (level 19), split into 3 shards: -  questions-00000-of-00003.parquet  -  questions-00001-of-00003.parquet  -  questions-00002-of-00003.parquet  ### Data Splits All examples are in the train split, there is no validation split.</description>
<size>2938274461</size>
</item><item>
<title>Fandom.com Community Database Dumps Dataset</title>
<category>Dataset</category>
<infohash>0a0ad3dd44e05af1725fd8d17f5aeba856078d5f</infohash>
<guid>https://academictorrents.com/details/0a0ad3dd44e05af1725fd8d17f5aeba856078d5f</guid>
<link>https://academictorrents.com/details/0a0ad3dd44e05af1725fd8d17f5aeba856078d5f</link>
<description># Dataset Card for Fandom.com Community Database Dumps ### Dataset Summary This dataset contains 7,040,984 current pages from all available [Fandom.com community wiki dumps](https://community.fandom.com/wiki/Help:Database_download) as of February 18, 2025. The dataset was created by processing the "Current pages" database dumps from all available Fandom.com wikis. These dumps contain only the current versions of pages without edit history and include article text, metadata, and structural information across multiple languages. ### Languages The dataset is multilingual, covering [40+ languages](https://community.fandom.com/wiki/Help:Language_codes). ## Dataset Structure ### Data Fields This dataset includes the following fields: -  id : Unique identifier for the article (string) -  title : Title of the article (string) -  text : Main content of the article (string) -  metadata : Dictionary containing: -  templates : List of templates used in the article -  categories : List of categories the article belongs to -  wikilinks : List of internal wiki links and their text -  external_links : List of external links -  sections : List of section titles and their levels ### Data Splits All examples are in a single split. ## Additional Information ### License This dataset inherits the licenses from the source Fandom communities, which use Creative Commons Attribution-ShareAlike 3.0 (CC-BY-SA 3.0). To learn more about CC-BY-SA 3.0, visit: https://creativecommons.org/licenses/by-sa/3.0/</description>
<size>6224651822</size>
</item><item>
<title>ca-on_province_of_ontario-2024A000235_drape_eastern_ontario_orthoimagery_2024_16cm_v0.1.0-beta.pmtiles</title>
<category>Dataset</category>
<infohash>adb5741cdcb9352848cc80c976629b44720a04c2</infohash>
<guid>https://academictorrents.com/details/adb5741cdcb9352848cc80c976629b44720a04c2</guid>
<link>https://academictorrents.com/details/adb5741cdcb9352848cc80c976629b44720a04c2</link>
<description>High‑resolution aerial imagery from Ontario’s DRAPE 2024, packaged for fast web maps and offline use. Smooth panning, crisp detail, open data. Want to preview the file? Go to https://source.coop/dataforcanada/d4c-datapkg-orthoimagery/processed/ca-on_province_of_ontario-2024A000235_drape_eastern_ontario_orthoimagery_2024_16cm_v0.1.0-beta.pmtiles We are aware that there is a nodata issue with the product and will fix it in the next release.</description>
<size>215750229448</size>
</item><item>
<title>Street-Level Imagery Dataset</title>
<category>Dataset</category>
<infohash>207ba45161f6ba12114cb6d97ad25d222d5125c9</infohash>
<guid>https://academictorrents.com/details/207ba45161f6ba12114cb6d97ad25d222d5125c9</guid>
<link>https://academictorrents.com/details/207ba45161f6ba12114cb6d97ad25d222d5125c9</link>
<description># Street-Level Imagery Dataset Metadata for street-level imagery across Eastern Europe and Northern Asia. Each record includes image URLs, coordinates, camera orientation, timestamps, and links to similar images captured at the same location over time. ## Summary | Statistic | Value | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;-| | Total Records | 934,191 | | Unique Images | 905,940 | | Time Span | 2016–2025 | | File Format | Parquet | ## Geographic Coverage | Boundary | Value | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;-| | Minimum Longitude | 20.49° E | | Maximum Longitude | 152.32° E | | Minimum Latitude | 38.55° N | | Maximum Latitude | 69.05° N | Coverage spans urban centers and rural routes. Density is higher in populated areas. ## Camera Specifications **Directions** | Direction | Count | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;-| | Front | 740,079 | | Right | 194,112 | **Resolutions** | Preview Size | Full Size | Count | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;-| | 284×160 | 1920×1080 | 932,171 | | 90×160 | 1080×1920 | 1,886 | | 284×160 | 1536×864 | 77 | | 213×160 | 2016×1512 | 41 | ## Data Structure ### Fields | Field | Type | Description | |&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-| |  id  | string | Unique image identifier | |  sourceId  | string | Source device or session identifier | |  heading  | float64 | Camera heading (0–360°) | |  cameraDirection  | string | Mount position ( front  or  right ) | |  timestamp  | string | ISO 8601 capture time | |  imagePreview  | struct | Thumbnail URL and dimensions | |  imageFull  | struct | Full resolution URL and dimensions | |  pos  | array | [longitude, latitude] | |  geometry  | struct | GeoJSON Point geometry | |  similar  | array | Related images at location | |  targetGeometry  | struct | Optional target reference | ### Image URL Schema Two resolution variants per entry:    json  "imagePreview":  "url": "https://...", "width": 284, "height": 160 , "imageFull":  "url": "https://...", "width": 1920, "height": 1080       ### Temporal Links Records reference similar images from other timestamps at the same coordinates. Average of 14.3 links per location. ## Limitations Image URLs may [rot](https://en.wikipedia.org/wiki/Link_rot). Coverage concentrates in urban areas, and historical density varies by location. ## License Research use permitted. Comply with source terms of service and local data regulations.</description>
<size>651707506</size>
</item><item>
<title>Subreddit comments/submissions 2005-06 to 2025-12</title>
<category>Dataset</category>
<infohash>3e3f64dee22dc304cdd2546254ca1f8e8ae542b4</infohash>
<guid>https://academictorrents.com/details/3e3f64dee22dc304cdd2546254ca1f8e8ae542b4</guid>
<link>https://academictorrents.com/details/3e3f64dee22dc304cdd2546254ca1f8e8ae542b4</link>
<description>This is the top 40,000 subreddits from reddit's history in separate files. You can use your torrent client to download only the subreddits you're interested in. These are from the pushshift dumps from 2005-06 to 2025-12, which can be found here https://academictorrents.com/details/3d426c47c767d40f82c7ef0f47c3acacedd2bf44 These are zstandard compressed ndjson files. Example python scripts for parsing the data can be found here https://github.com/Watchful1/PushshiftDumps If you have questions, please DM u/Watchful on reddit or respond to this post https://www.reddit.com/r/pushshift/comments/1r5z42j/comment/o5mjcvn/</description>
<size>3965777405390</size>
</item><item>
<title>Lung Ultrasound Dataset (LUS-Dataset-Katumba)</title>
<category>Dataset</category>
<infohash>e6e9a5594174aaffee53b8f086e3bf86c02c45ad</infohash>
<guid>https://academictorrents.com/details/e6e9a5594174aaffee53b8f086e3bf86c02c45ad</guid>
<link>https://academictorrents.com/details/e6e9a5594174aaffee53b8f086e3bf86c02c45ad</link>
<description>This dataset contains a curated benchmark collection of 1,062 labelled lung ultrasound (LUS) images collected from patients at Mulago National Referral Hospital and Kiruddu Referral Hospital in Kampala, Uganda. The images were acquired and annotated by senior radiologists to support the development and evaluation of artificial intelligence (AI) models for pulmonary disease diagnosis. Each image is categorized into one of three classes: Probably COVID-19 (COVID-19), Diseased Lung but Probably Not COVID-19 (Other Lung Disease), and Healthy Lung. The dataset addresses key challenges in LUS interpretation, including inter-operator variability, low signal-to-noise ratios, and reliance on expert sonographers. It is particularly suitable for training and testing convolutional neural network (CNN)-based models for medical image classification tasks in low-resource settings. The images are provided in standard formats such as PNG or JPEG, with corresponding labels stored in structured files like CSV or JSON to facilitate ease of use in machine learning workflows. In this second version of the dataset, we have extended the resource by including a folder containing the original unprocessed raw data, as well as the scripts used to process, clean, and sort the data into the final labelled set. These additions promote transparency and reproducibility, allowing researchers to understand the full data pipeline and adapt it for their own applications. This resource is intended to advance research in deep learning for lung ultrasound analysis and to contribute toward building more accessible and reliable diagnostic tools in global health.     Katumba, Andrew; Murindanyi, Sudi; Okila, Nixson; Nakatumba-Nabende, Joyce; Mwikirize, Cosmas; Serugunda, Jonathan; Bugeza, Samuel; Oriekot, Anthony; Bosa, Juliet; Nabawanuka, Eva (2025), “A Dataset of Lung Ultrasound Images for Automated AI-based Lung Disease Classification”, Mendeley Data, V2, doi: 10.17632/hb3p34ytvx.2</description>
<size>281447804</size>
</item><item>
<title>Wikipedia Asian languages 2026-02-01</title>
<category>Dataset</category>
<infohash>1bde58b51e4aad60f03ce1b688b691552fb3041e</infohash>
<guid>https://academictorrents.com/details/1bde58b51e4aad60f03ce1b688b691552fb3041e</guid>
<link>https://academictorrents.com/details/1bde58b51e4aad60f03ce1b688b691552fb3041e</link>
<description>Wikipedia database dumps of Asian language wikis of 10k articles or more. Wikipedia Multistream 2026-02-01. These 85 languages are included: Acehnese, Armenian, Assamese, Azerbaijani, Balinese, Bangla, Banjar, Banyumasan, Bashkir, Bishnupriya, Buginese, Burmese, Cantonese, Cebuano, Central Bikol, Central Kurdish, Chechen, Chinese, Chuvash, Classical Chinese, Dimli, Eastern Mari, Georgian, Gilaki, Gorontalo, Gujarati, Hakka, Hebrew, Hindi, Iloko, Indonesian, Japanese, Javanese, Kannada, Kara-Kalpak, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Maithili, Malay, Malayalam, Manipuri, Marathi, Mazanderani, Minangkabau, Mindong, Mingrelian, Minnan, Mongolian, Nepali, Newari, Odia, Ossetic, Pampangan, Pashto, Persian, Punjabi, Russian, Sanskrit, Santali, Saraiki, Shan, Sindhi, Sinhala, South Azerbaijani, Sundanese, Tagalog, Tajik, Talysh, Tamil, Tatar, Telugu, Thai, Turkish, Urdu, Uzbek, Vietnamese, Waray, Western Armenian, Western Mari, Western Punjabi, Wu, Yakut.</description>
<size>31208780863</size>
</item><item>
<title>Reddit comments/submissions 2026-01</title>
<category>Dataset</category>
<infohash>8412b89151101d88c915334c45d9c223169a1a60</infohash>
<guid>https://academictorrents.com/details/8412b89151101d88c915334c45d9c223169a1a60</guid>
<link>https://academictorrents.com/details/8412b89151101d88c915334c45d9c223169a1a60</link>
<description>Reddit comments and submissions from 2026-01 Documentation, json schemas and more can be found at https://github.com/ArthurHeitmann/arctic_shift Helper scripts for processing files can be found at https://github.com/Watchful1/PushshiftDumps</description>
<size>61629104259</size>
</item><item>
<title>Begemot.ai Dataset</title>
<category>Dataset</category>
<infohash>3ada9903be4621cf7e34cd5cf44f191b4124ccfe</infohash>
<guid>https://academictorrents.com/details/3ada9903be4621cf7e34cd5cf44f191b4124ccfe</guid>
<link>https://academictorrents.com/details/3ada9903be4621cf7e34cd5cf44f191b4124ccfe</link>
<description># Dataset Card for Begemot.ai ### Dataset Summary This dataset has 2,728,999 educational project descriptions in Russian. They were generated using AI on the Begemot.ai website. The content includes project titles, descriptions, chapters and chapter content on various educational topics. ### Languages The dataset is primarily in Russian (ru). ## Dataset Structure ### Data Fields This dataset includes the following fields: -  id : Unique identifier for the project (integer) -  url : URL of the project page (string) -  title : Title of the educational project (string) -  type : Type of project (string) -  description : Detailed description of the project (string) -  chapters : List of chapter titles (list of strings) -  chapter_content : JSON string mapping chapter titles to their content ### Data Splits All examples are in a single split.</description>
<size>1708074039</size>
</item><item>
<title>OpenPOCUS - Lung Ultrasound Image Database</title>
<category>Dataset</category>
<infohash>63ad0470f43e022cc73407be9c760449d947cb97</infohash>
<guid>https://academictorrents.com/details/63ad0470f43e022cc73407be9c760449d947cb97</guid>
<link>https://academictorrents.com/details/63ad0470f43e022cc73407be9c760449d947cb97</link>
<description>https://i.imgur.com/s0eFv64.png Background Lung ultrasound (LUS) offers advantages over traditional imaging for diagnosing pulmonary conditions, with superior accuracy compared to chest X-ray and similar performance to CT at lower cost. Despite these benefits, widespread adoption is limited by operator dependency, moderate interrater reliability, and training requirements. Deep learning (DL) could potentially address these challenges, but development of effective algorithms is hindered by the scarcity of comprehensive image repositories with proper metadata.</description>
<size>5256546243</size>
</item><item>
<title>Russian Educational Text Collection</title>
<category>Dataset</category>
<infohash>1f6b373346a0fa34de6b4d916984d698e0a623b3</infohash>
<guid>https://academictorrents.com/details/1f6b373346a0fa34de6b4d916984d698e0a623b3</guid>
<link>https://academictorrents.com/details/1f6b373346a0fa34de6b4d916984d698e0a623b3</link>
<description># Dataset Card for Russian Educational Text Collection ### Dataset Summary This dataset contains approximately 1.38M educational texts primarily in Russian with some content in Ukrainian and English. The content is extracted from presentations and documents, including educational presentations, essays, and various academic documents covering diverse topics from natural sciences to literature. ### Languages - Russian (ru) - primary language - Ukrainian (uk) - secondary language - English (en) - secondary language Russian is the predominant language in the dataset, while Ukrainian and English content appears less frequently. ## Dataset Structure ### Data Fields The dataset is split into two parquet files: - presentations (1,335,171 entries): -  title : Title of the presentation (string) -  slide_text : Array of slide contents (list of strings) - documents (47,474 entries): -  title : Title of the document (string) -  document_text : Full text content of the document (string) ## Additional Information ### License This dataset is dedicated to the public domain under the Creative Commons Zero (CC0) license. This means you can: * Use it for any purpose, including commercial projects * Modify it however you like * Distribute it without asking permission No attribution is required, but it's always appreciated!</description>
<size>304218686</size>
</item><item>
<title>Animations Dataset</title>
<category>Dataset</category>
<infohash>8799f1e66bf0a63a77e89a7917fbe281c13bcd9f</infohash>
<guid>https://academictorrents.com/details/8799f1e66bf0a63a77e89a7917fbe281c13bcd9f</guid>
<link>https://academictorrents.com/details/8799f1e66bf0a63a77e89a7917fbe281c13bcd9f</link>
<description># Dataset Card for Animations Dataset ### Dataset Summary This dataset contains 50,849 animations with their associated metadata and source images. Each animation consists of multiple frames composed of simple sketch-level drawings, text elements, and potentially embedded images. The dataset provides complete information about each animation, including frame components, source images, timing between frames, and canvas settings. This makes it suitable for various tasks such as animation analysis, generation, and modification. ### Languages The dataset is primarily monolingual: - English (en): Any text elements within animations are predominantly in English. ## Dataset Structure ### Data Files The dataset is stored as Parquet files with ZSTD compression: -  train-00000.parquet  through  train-00003.parquet  - Total: 4 shards, ~4.2 GB compressed ### Data Fields Each row in the Parquet files contains the following columns: | Column | Type | Description | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-| |  id  |  string  | Unique identifier (UUID) for the animation | |  settings  |  string  | JSON object containing canvas configuration | |  dtimes  |  list[int64]  | Time delays between frames in milliseconds | |  frames_data  |  string  | JSON array describing each frame s elements | |  images  |  list[binary]  | PNG images used in the animation (decoded bytes) | #### Settings Object The  settings  JSON contains: -  canvas_width ,  canvas_height : Dimensions of the animation canvas -  fillcolor : Background color of the canvas (if specified) -  default_font : Default font used for text elements -  default_font_size : Default font size #### Frames Data Structure The  frames_data  JSON is an array of arrays, where each inner array represents a frame s elements: -  type_for_loader : Element type (e.g., "text", "image") -  data : Object containing element properties: -  type : Element type -  centerx ,  centery : Position coordinates on the canvas -  text : Text content (for text elements) -  font ,  size : Font properties -  rotate_angle ,  angle : Rotation properties -  strokeColor ,  fillColor ,  textColor : Color properties -  src : Index into the  images  array (for image elements) -  children_data : Array of child elements (if any) ### Data Splits | Split | Number of Examples | |&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| |  train  | 50,849 |</description>
<size>4374676620</size>
</item><item>
<title>Sprite Compositing &amp; Animation Dataset</title>
<category>Dataset</category>
<infohash>df2a3742526f44dac4dbb80299333e84132c5b45</infohash>
<guid>https://academictorrents.com/details/df2a3742526f44dac4dbb80299333e84132c5b45</guid>
<link>https://academictorrents.com/details/df2a3742526f44dac4dbb80299333e84132c5b45</link>
<description># Sprite Compositing &amp; Animation Dataset A diverse dataset of image sequences with their source sprite assets. Contains animations, slideshows, and composited scenes created from transparent PNG sprites with additional effects and overlays. ## Dataset Statistics | Metric | Value | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;-| | Total animations | 50,849 | | Total frames | 1,191,969 | | Total source sprites | 312,926 | | Avg frames/animation | 23.4 | | Avg sprites/animation | 6.2 | | Frame resolution | 800 x 450-600 | | Sprite resolution | Variable | | Total size | ~27 GB | | Format | Parquet (ZSTD compressed) | ## Schema | Column | Type | Description | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-| |  id  |  string  | Unique identifier (UUID) | |  source_images  |  list[binary]  | PNG bytes of source sprite assets (RGBA), sorted by index. Can be empty. | |  frames  |  list[binary]  | PNG bytes of final rendered frames (RGB 800x450-600), sorted by sequence order | |  num_sources  |  int64  | Number of source sprites (0 for rows without source assets) | |  num_frames  |  int64  | Number of frames in the sequence | ## Data Structure Each sample contains: 1. **Source Images** ( source_images ): Transparent PNG sprites/assets (RGBA mode) used to compose the final frames. Variable sizes. May include characters, objects, etc. 2. **Frames** ( frames ): Final rendered image sequence (RGB mode, typically 800x450-600). These are the result of compositing source sprites with additional effects like: - Text overlays - Drawings and sketches - Backgrounds - Animations and transitions - Visual effects ### Content Variations - **Animations**: Smooth frame-by-frame animations of sprites (e.g., character movement) - **Slideshows**: Discrete scene transitions using source assets - **Composited Scenes**: Source sprites combined with text, drawings, and effects - **Sketches**: Hand-drawn or illustrated frames with optional sprite references **Note**: Not all frames are strict animations - many are slideshows or scene compositions where source assets are combined with additional elements.</description>
<size>27130846899</size>
</item><item>
<title>NNTP Discussion Archives</title>
<category>Dataset</category>
<infohash>cac053d01e256ae3001bf40c5c98eefa86cdc870</infohash>
<guid>https://academictorrents.com/details/cac053d01e256ae3001bf40c5c98eefa86cdc870</guid>
<link>https://academictorrents.com/details/cac053d01e256ae3001bf40c5c98eefa86cdc870</link>
<description># NNTP Discussion Archives A large-scale collection of text discussions from public NNTP (Network News Transfer Protocol) newsgroups spanning over two decades. ## Dataset Statistics | Metric | Value | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;-| | Total messages | 386,629,949 | | Unique newsgroups | 159,345 | | Date range | 2002 - 2026 | | Total size | ~191 GB (compressed) | | File format | Parquet (ZSTD) | | Number of files | 256 | | Average content length | ~1,400 characters | ## Schema | Column | Type | Description | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-| |  message_id  |  string  | Original message identifier (unchanged) | |  newsgroups  |  string  | Target newsgroup(s), comma-separated if cross-posted | |  author  |  string  | Message author with email addresses redacted as  [email]  | |  subject  |  string  | Subject line | |  date  |  string  | RFC 2822 formatted date string | |  content  |  string  | Message body with email addresses redacted as  [email]  | ## Top Newsgroups by Volume | Newsgroup | Messages | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | alt.atheism | 5,658,023 | | free.usenet | 4,691,561 | | alt.fan.rush-limbaugh | 4,659,639 | | alt.politics | 3,919,772 | | fr.soc.politique | 3,554,434 | | it.sport.calcio.milan | 2,961,804 | | it.politica | 2,802,687 | | alt.politics.bush | 2,786,316 | | talk.politics.misc | 2,784,668 | | Other (159,336 groups) | 475,430,274 | *Cross-posted messages are counted once per newsgroup, so totals exceed the 386M unique messages.* ## Data Processing **Filtering:** Binary-focused groups (*.binaries.*, *.pictures.*, *.multimedia.*), binary posts with file-sharing indicators, messages exceeding 500KB, and unrecoverable encoding errors are excluded. Spam is almost **not filtered** - the dataset includes advertisements, phishing, and low-quality posts present in raw newsgroups. **Encoding:** Messages are normalized to UTF-8 with the following decoding pipeline: - Quoted-Printable: MIME-encoded content decoded to text - Base64: Text base64 content decoded; binary base64 excluded - Legacy encodings: Invalid UTF-8 sequences re-encoded using Windows-1252, ISO-8859-*, KOI8-R, Shift-JIS, GBK, and other legacy encoding detection - MIME encoded-word headers decoded to UTF-8 **Deduplication:** Exact content duplicates removed via xxHash64 hashing (first occurrence retained). **Privacy:** Email addresses in  author  and  content  fields redacted as  [email] ;  message_id  unchanged. ## Considerations - Messages were posted to public newsgroups - Content reflects unmoderated discussions and may contain controversial opinions</description>
<size>204065504201</size>
</item><item>
<title>Reddit comments/submissions 2005-06 to 2025-12</title>
<category>Dataset</category>
<infohash>3d426c47c767d40f82c7ef0f47c3acacedd2bf44</infohash>
<guid>https://academictorrents.com/details/3d426c47c767d40f82c7ef0f47c3acacedd2bf44</guid>
<link>https://academictorrents.com/details/3d426c47c767d40f82c7ef0f47c3acacedd2bf44</link>
<description>Reddit comments and submissions from 2005-06 to 2025-12, collected by pushshift and u/RaiderBDev. These are zstandard compressed ndjson files. Example python scripts for parsing the data can be found here https://github.com/Watchful1/PushshiftDumps The more recent dumps were collected by u/RaiderBDev.</description>
<size>3804096351995</size>
</item><item>
<title>RU-OK? Uptime measurements of Russian/Belarusian DDoS targets of IT ARMY</title>
<category>Dataset</category>
<infohash>87b8ba53e3f7d58ac3845f1be81f49682c2c68f2</infohash>
<guid>https://academictorrents.com/details/87b8ba53e3f7d58ac3845f1be81f49682c2c68f2</guid>
<link>https://academictorrents.com/details/87b8ba53e3f7d58ac3845f1be81f49682c2c68f2</link>
<description>In 2022, Russia began a full-scale invasion of Ukraine in the escalating Russo-Ukrainian war. Ukrainian ingenuity quickly led to the creation of a volunteer cyberwarfare organization, [IT Army of Ukraine](https://en.wikipedia.org/wiki/IT_Army_of_Ukraine), which conducted both defensive and offensive operations. Notably, they invited anyone with an internet connection to DDoS an ever-growing list of Russian and Belarusian websites, with the goal of disrupting infrastructure and draining Russia’s own cyberwarfare capabilities. I made a very quick project to assess the status of Russian and Belarusian internet properties (via [RIPE Atlas](https://atlas.ripe.net/)) being targeted by hacktivists. Specifically, I evaluated almost every target listed by the IT ARMY Telegram group with many unique probes between 2022-02-27 (the day after IT ARMY was created) and 2022-05-30 to check for service availability. I wanted to check connectivity from within Russia’s borders because I saw many mixed reports across Twitter and Reddit, with international parties (Americans, Ukrainians, etc.) claiming many sites had been knocked offline, whereas Russians chimed in that many sites remained online for them. The truth is more complex - some sites were significantly disrupted and took time to recover globally, while others had existing mitigations in place, and others seemed to deprioritize or sinkhole international traffic, etc. This research was included in several news articles around the world: * Ukraine’s IT army is doing well, hitting Russia with ‘cost and chaos’ - [VentureBeat](https://venturebeat.com/2022/03/04/ukraines-it-army-is-doing-well-hitting-russia-with-cost-and-chaos/) * Ukraine deserves an IT army. We have to live with the fallout - [VentureBeat](https://venturebeat.com/2022/03/04/ukraine-deserves-an-it-army-we-have-to-live-with-the-fallout/) * Ukraine: We’ve repelled ‘nonstop’ DDoS attacks from Russia - [VentureBeat](https://venturebeat.com/2022/03/07/ukraine-weve-repelled-nonstop-ddos-attacks-from-russia/) * Guerre en Ukraine : les cyberattaques contre la Russie, le « cri de colère » d’une armée de volontaires - [Le Monde](https://www.lemonde.fr/pixels/article/2022/03/25/guerre-en-ukraine-face-a-la-russie-les-cyberattaques-en-forme-de-cri-de-colere-d-une-armee-de-volontaires_6119064_4408996.html) * Ukraine Demanded Cloudflare Stop Protecting Russians From Cyberattacks. Cloudflare Said No - [Forbes](https://www.forbes.com/sites/thomasbrewster/2022/03/07/cloudflare-rejects-ukraines-call-to-stop-protecting-russians-from-cyberattacks/) The data and methodology for RU-OK was originally published on my GitHub, where I hope it will remain. However, I’ve received the occasional nastygram about this research and recently received a takedown request from a Russian cybersecurity firm, claiming that sensitive information is being stored in my repository. There isn’t, of course, and all the data is public measurements against public endpoints. Still, I’m concerned that fraudulent reports could result in my repo getting deleted, so I’m creating a censorship-resistant copy and distributing it on my blog and on Academic Torrents. It’s long overdue anyway. I encourage anyone curious to take a dig through the data, as you can watch both the immediate impact of DDoS attacks as well as Russian government and company resilience change over several months as this attack became commonplace.</description>
<size>1609147173</size>
</item><item>
<title>Wikipedia Wikidata 2026-01-01</title>
<category>Dataset</category>
<infohash>91f29e60cc4a65747a346109ef49a48808c6a2cd</infohash>
<guid>https://academictorrents.com/details/91f29e60cc4a65747a346109ef49a48808c6a2cd</guid>
<link>https://academictorrents.com/details/91f29e60cc4a65747a346109ef49a48808c6a2cd</link>
<description>Database dump of the Wikidata wiki. Wikipedia Multistream 2026-01-01.</description>
<size>175980564292</size>
</item><item>
<title>Wikipedia Commons 2026-01-01</title>
<category>Dataset</category>
<infohash>83f2cfd35db16f696000bd3dee56e3837fe3e60c</infohash>
<guid>https://academictorrents.com/details/83f2cfd35db16f696000bd3dee56e3837fe3e60c</guid>
<link>https://academictorrents.com/details/83f2cfd35db16f696000bd3dee56e3837fe3e60c</link>
<description>Database dump of Wikipedia Commons. Wikipedia Multistream 2026-01-01.</description>
<size>106644000008</size>
</item><item>
<title>Reddit comments/submissions 2025-12</title>
<category>Dataset</category>
<infohash>481bf2eac43172ae724fd6c75dbcb8e27de77734</infohash>
<guid>https://academictorrents.com/details/481bf2eac43172ae724fd6c75dbcb8e27de77734</guid>
<link>https://academictorrents.com/details/481bf2eac43172ae724fd6c75dbcb8e27de77734</link>
<description>Reddit comments and submissions from 2025-12 Documentation, json schemas and more can be found at https://github.com/ArthurHeitmann/arctic_shift Helper scripts for processing files can be found at https://github.com/Watchful1/PushshiftDumps</description>
<size>56418645823</size>
</item><item>
<title>GitGud Code Dataset</title>
<category>Dataset</category>
<infohash>221571632238b826f0aa6ec4f370af633575cae4</infohash>
<guid>https://academictorrents.com/details/221571632238b826f0aa6ec4f370af633575cae4</guid>
<link>https://academictorrents.com/details/221571632238b826f0aa6ec4f370af633575cae4</link>
<description># GitGud Code Dataset ## Dataset Description This dataset was compiled from code repositories hosted on [GitGud.io](https://gitgud.io), a GitLab-based code hosting platform. GitGud.io serves as an alternative git hosting service used by various developer communities and open-source projects. ### Dataset Summary | Statistic | Value | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;-| | **Total Files** | 16,322,315 | | **Total Repositories** | 7,204 | | **Total Size** | 17.46 GB (compressed Parquet) | | **Programming Languages** | 2,185 | | **File Format** | Parquet with Zstd compression (17 files) | ### Key Features - **Diverse code corpus**: Contains code from over 7,000 repositories across various domains - **Wide language coverage**: Spans 2,185 programming languages and file types detected by file extension mapping - **Rich metadata**: Includes repository name, file path, detected language, license information, and file size - **Quality filtered**: Filtering applied to remove binary files, overly long lines, and license files ### Languages The dataset includes 2,185 programming languages and file types. The top 30 languages by file count: | Rank | Language | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | 1 | tw (Twine) | 3,301,366 | | 2 | XML | 3,281,566 | | 3 | svg | 1,744,500 | | 4 | C# | 1,367,799 | | 5 | JavaScript | 1,252,710 | | 6 | C++ | 731,619 | | 7 | erb | 710,279 | | 8 | JSON | 398,139 | | 9 | Text | 377,948 | | 10 | twee | 300,576 | | 11 | csv | 205,230 | | 12 | HTML | 170,711 | | 13 | Markdown | 160,735 | | 14 | TypeScript | 147,173 | | 15 | Lua | 117,079 | | 16 | PHP | 116,059 | | 17 | none | 111,791 | | 18 | pal | 110,626 | | 19 | CSS | 108,664 | | 20 | Python | 106,261 | | 21 | dm | 98,333 | | 22 | Ruby | 93,685 | | 23 | _comment | 91,730 | | 24 | Java | 81,190 | | 25 | YAML | 63,289 | | 26 | ActionScript | 62,210 | | 27 | Git | 43,748 | | 28 | mdwn | 42,654 | | 29 | mk | 41,789 | | 30 | INI | 39,760 | ### Licenses The dataset includes files from repositories with various licenses: | License | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | mit | 9,517,343 | | bsd-3-clause | 3,315,732 | | unknown | 2,935,736 | | mpl-2.0 | 338,040 | | gpl-2.0 | 79,415 | | lgpl-2.1 | 38,429 | | gpl-3.0 | 25,964 | | apache-2.0 | 20,562 | | cc-by-4.0 | 18,703 | | agpl-3.0 | 15,367 | | cc-by-nc-4.0 | 6,362 | | wtfpl | 6,163 | | bsd-2-clause | 3,749 | | zlib | 482 | | unlicense | 261 | | cc-by-sa-4.0 | 7 | ## Dataset Structure ### Data Fields | Field | Type | Description | |&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-| |  code  | string | Content of the source file (UTF-8 encoded) | |  repo_name  | string | Name of the GitGud repository (format:  username/repo ) | |  path  | string | Path of the file within the repository (relative to repo root) | |  language  | string | Programming language detected by file extension mapping | |  license  | string | License of the repository (SPDX identifier or "unknown") | |  size  | int64 | Size of the source file in bytes | ### Data Format - **Format**: Apache Parquet with Zstd compression (level 19) - **File Structure**: 17 files ( gitgud-00000.parquet  to  gitgud-00016.parquet ) - **Rows per shard**: 
~1,000,000 (except last shard: 322,315) ### Data Splits All examples are in the train split. There is no validation or test split. ### Example Data Point       code :  using System;nusing System.Collections.Generic;n... ,  repo_name :  username/game-mod ,  path :  src/GameMod/Player.cs ,  language :  C# ,  license :  mit ,  size : 2048      ## Dataset Creation ### Pipeline Overview The dataset was created through a multi-stage pipeline: 1. **Repository Discovery**: Scraping public repository URLs from GitGud.io s GitLab API v4 endpoint using multiple sort orderings ( id ,  name ,  path ,  updated_at ,  star_count ,  last_activity_at ,  similarity ) 2. **Branch Enumeration**: Fetching all branches for each repository via the GitLab API 3. **Archive Download**: Downloading  .tar.gz  archives for each repository/branch combination 4. **Content Extraction**: Extracting and filtering source code files from archives 5. **Parquet Generation**: Writing filtered records to Parquet shards with Zstd compression ### Language Detection Programming languages are detected using file extension mapping. The pipeline maps ~80 programming languages by their file extensions, including: - **Major languages**: Python, JavaScript, TypeScript, C, C++, C#, Java, Go, Rust, Ruby, PHP - **Configuration**: JSON, YAML, TOML, XML, INI - **Markup**: HTML, CSS, Markdown, LaTeX - **Game development**: GLSL, HLSL, GDScript - **And many more** Files with unrecognized extensions are labeled with the extension itself (without the dot prefix). Files without extensions are labeled as "none" or by special filename matching (e.g., "Dockerfile", "Makefile"). ### License Detection Licenses are detected by: 1. Scanning for license files ( LICENSE ,  LICENSE.txt ,  LICENSE.md ,  COPYING ,  COPYING.txt ,  COPYING.md ) 2. Matching license text against known patterns (MIT, Apache 2.0, GPL variants, BSD, Creative Commons, MPL, ISC, Unlicense, Artistic, WTFPL, Zlib, etc.) 3. Defaulting to "unknown" if no license can be detected ### File Filtering Filtering is applied to ensure data quality: #### Size Limits | Limit | Value | |&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;-| | Max repository archive size | 64 MB | | Max line length | 1,000 characters | #### Content Filtering - **Binary Detection**: Files with null bytes in the first 1KB are excluded - **UTF-8 Validation**: Files must be decodable as UTF-8 (with fallback to latin-1, cp1252, iso-8859-1) - **Long Lines**: Files with any line exceeding 1,000 characters are excluded - **License Files**: License files (LICENSE, COPYING, etc.) are excluded from the dataset (but used for license detection) ### Source Data All data originates from public repositories hosted on [GitGud.io](https://gitgud.io). ## Considerations for Using the Data ### Personal and Sensitive Information The dataset may contain: - Email addresses in code comments or configuration files - API keys or credentials that were accidentally committed - Personal information in comments or documentation Users should exercise caution and implement appropriate filtering when using this data. ### Licensing Information This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The license field in each data point indicates the license of the source repository.</description>
<size>18751337953</size>
</item><item>
<title>Mos.Hub Code Dataset</title>
<category>Dataset</category>
<infohash>991f0d7eaa11bfda7f08e9bd82466458982cd430</infohash>
<guid>https://academictorrents.com/details/991f0d7eaa11bfda7f08e9bd82466458982cd430</guid>
<link>https://academictorrents.com/details/991f0d7eaa11bfda7f08e9bd82466458982cd430</link>
<description># Mos.Hub Code Dataset ## Dataset Description This dataset was compiled from code repositories hosted on [Mos.Hub](https://hub.mos.ru) (hub.mos.ru), a code hosting platform operated by the Moscow Government. Mos.Hub is a service for storing and working with source code, based on the Git version control system, primarily used by Russian developers and government-related projects. ### Dataset Summary | Statistic | Value | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;-| | **Total Files** | 15,740,580 | | **Total Repositories** | 16,130 | | **Total Size** | 529 MB (compressed Parquet) | | **Uncompressed Size** | ~29 GB | | **Programming Languages** | 297 | | **File Format** | Parquet (single file) | ### Key Features - **Russian code corpus**: Contains code from repositories hosted on Moscow's official code platform, featuring Russian comments and documentation - **Diverse language coverage**: Spans 297 programming languages identified by [github-linguist](https://github.com/github-linguist/linguist) - **Quality filtered**: Binary files and low-quality content have been removed ### Languages The dataset includes 297 programming languages. The top 30 languages by file count: | Rank | Language | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | 1 | Ruby | 8,333,731 | | 2 | JavaScript | 1,786,730 | | 3 | YAML | 1,757,614 | | 4 | Vue | 699,171 | | 5 | Markdown | 639,585 | | 6 | Haml | 538,837 | | 7 | GraphQL | 269,485 | | 8 | JSON | 214,354 | | 9 | PHP | 191,150 | | 10 | SVG | 172,884 | | 11 | Shell | 172,451 | | 12 | Go | 88,089 | | 13 | Ignore List | 87,432 | | 14 | SCSS | 80,716 | | 15 | Python | 77,532 | | 16 | C++ | 63,177 | | 17 | HTML+ERB | 62,605 | | 18 | Text | 48,400 | | 19 | Jest Snapshot | 43,638 | | 20 | HTML | 42,489 | | 21 | C | 38,354 | | 22 | reStructuredText | 26,342 | | 23 | Rust | 24,818 | | 24 | E-mail | 23,993 | | 25 | XML | 22,715 | | 26 | Java | 14,807 | | 27 | Gettext Catalog | 14,429 | | 28 | C# | 13,405 | | 29 | CSS | 12,657 | | 30 | Protocol Buffer Text Format | 12,181 | ## Dataset Structure ### Data Fields | Field | Type | Description | |&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-| |  file_text  | string | Content of the source file (UTF-8 encoded) | |  language  | string | Programming language as identified by [github-linguist](https://github.com/github-linguist/linguist) | |  file_name  | string | Name of the source file | ### Data Format - **Format**: Apache Parquet - **File Structure**: Single file ( data.parquet ) ### Data Splits All examples are in the train split. There is no validation or test split. ### Example Data Point       file_text :  package main\n\nimport "fmt"\n\nfunc main() {\n    fmt.Println("Hello")\n}\n ,  language :  Go ,  file_name :  main.go       ## Dataset Creation ### Source Data All data originates from public repositories hosted on [Mos.Hub](https://hub.mos.ru). ### Language Detection Programming languages are detected using [github-linguist](https://github.com/github-linguist/linguist), GitHub's library for detecting programming languages. 
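As a usage illustration (not part of the original card), the detected language labels can be used to slice the single  data.parquet  file. The sketch below assumes pyarrow and a local copy of the file; column names come from the fields table above.

```python
# Illustrative sketch: inspect the language distribution, then pull only a
# small per-language slice, since the uncompressed data is roughly 29 GB.
import pyarrow.dataset as ds
import pyarrow.compute as pc

dataset = ds.dataset("data.parquet", format="parquet")

# Read just the  language  column first to keep memory low.
languages = dataset.to_table(columns=["language"])
print(pc.value_counts(languages["language"]))

# Then fetch only the rows of interest, e.g. the relatively small Go subset.
go_files = dataset.to_table(
    columns=["file_name", "file_text"],
    filter=ds.field("language") == "Go",
)
print(go_files.num_rows, "Go files")
```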
### Filtering - **Deduplication**: The dataset has been deduplicated to ensure unique code files - **Binary Files**: Binary files have been removed from the dataset - **UTF-8 Validation**: Files must be valid UTF-8 encoded text ## Considerations for Using the Data ### Personal and Sensitive Information The dataset may contain: - Email addresses in code comments or configuration files - API keys or credentials that were accidentally committed - Personal information in comments or documentation Users should exercise caution and implement appropriate filtering when using this data. ### Licensing Information This dataset has been compiled with an analysis of the licenses used in the repositories to ensure ethical collection and use of the data. Users of this dataset should respect the rights of the authors and use the data responsibly.</description>
<size>554021034</size>
</item><item>
<title>Google Code Archive Dataset</title>
<category>Dataset</category>
<infohash>a342da363792ac5fa018039d5a57c81be74e4b52</infohash>
<guid>https://academictorrents.com/details/a342da363792ac5fa018039d5a57c81be74e4b52</guid>
<link>https://academictorrents.com/details/a342da363792ac5fa018039d5a57c81be74e4b52</link>
<description>## Dataset Description This dataset was compiled from the [Google Code Archive](https://code.google.com/archive/), a preserved snapshot of projects hosted on Google Code, Google's open-source project hosting service that operated from 2006 to 2016. Google Code was one of the major code hosting platforms of its era, hosting hundreds of thousands of open-source projects before its shutdown. The archive provides a unique historical record of open-source development during a formative period of modern software engineering. ### Dataset Summary | Statistic | Value | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;-| | **Total Files** | 65,825,565 | | **Total Repositories** | 488,618 | | **Total Size** | 47 GB (compressed Parquet) | | **Programming Languages** | 454 | | **File Format** | Parquet with Zstd compression (71 files) | ### Key Features - **Historical open-source corpus**: Contains code from over 488K repositories hosted on Google Code during 2006-2016 - **Diverse language coverage**: Spans 454 programming languages identified by [go-enry](https://github.com/go-enry/go-enry) (based on GitHub Linguist rules) - **Rich metadata**: Includes repository name, file path, detected language, license information, and file size - **Quality filtered**: Extensive filtering to remove vendor code, build artifacts, generated files, and low-quality content - **Era-specific patterns**: Captures coding conventions and library usage from an earlier era of software development ### Languages The dataset includes 454 programming languages. The top 30 languages by file count: | Rank | Language | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | 1 | Java | 16,331,993 | | 2 | PHP | 12,764,574 | | 3 | HTML | 5,705,184 | | 4 | C++ | 5,090,685 | | 5 | JavaScript | 4,937,765 | | 6 | C | 4,179,202 | | 7 | C# | 3,872,245 | | 8 | Python | 2,207,240 | | 9 | CSS | 1,697,385 | | 10 | Objective-C | 1,186,050 | | 11 | Shell | 639,183 | | 12 | Java Server Pages | 541,498 | | 13 | ActionScript | 540,557 | | 14 | Makefile | 481,563 | | 15 | ASP.NET | 381,389 | | 16 | Smarty | 339,555 | | 17 | Ruby | 331,743 | | 18 | Go | 316,427 | | 19 | Perl | 307,960 | | 20 | Vim Script | 216,236 | | 21 | Lua | 215,226 | | 22 | HTML+PHP | 150,781 | | 23 | HTML+Razor | 149,131 | | 24 | MATLAB | 145,686 | | 25 | Batchfile | 138,523 | | 26 | Pascal | 135,992 | | 27 | Visual Basic .NET | 118,732 | | 28 | TeX | 110,379 | | 29 | Less | 98,221 | | 30 | Unix Assembly | 94,758 | ### Licenses The dataset includes files from repositories with various licenses as specified in the Google Code Archive: | License | File Count | |&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;| | Apache License 2.0 (asf20) | 21,568,143 | | GNU GPL v3 (gpl3) | 14,843,470 | | GNU GPL v2 (gpl2) | 6,824,185 | | Other Open Source (oos) | 5,433,436 | | MIT License (mit) | 4,754,567 | | GNU LGPL (lgpl) | 4,073,137 | | BSD License (bsd) | 3,787,348 | | Artistic License (art) | 1,910,047 | | Eclipse Public License (epl) | 1,587,289 | | Mozilla Public License 1.1 (mpl11) | 580,102 | | Multiple Licenses (multiple) | 372,457 | | Google Summer of Code (gsoc) | 63,292 | | Public Domain (publicdomain) | 28,092 | ## Dataset Structure ### Data Fields | Field | Type | Description | 
|&amp;mdash;&amp;mdash;&amp;mdash;-|&amp;mdash;&amp;mdash;&amp;mdash;|&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;&amp;mdash;-| |  code  | string | Content of the source file (UTF-8 encoded) | |  repo_name  | string | Name of the Google Code project | |  path  | string | Path of the file within the repository (relative to repo root) | |  language  | string | Programming language as identified by [go-enry](https://github.com/go-enry/go-enry) | |  license  | string | License of the repository (Google Code license identifier, e.g.  asf20 ,  mit ) | |  size  | int64 | Size of the source file in bytes | ### Data Format - **Format**: Apache Parquet with Zstd compression - **File Structure**: 71 files ( google_code_0000.parquet  to  google_code_0070.parquet ) ### Data Splits All examples are in the train split. There is no validation or test split. ### Example Data Point
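Each row follows the  code / repo_name / path / language / license / size  layout described above. As a minimal, illustrative sketch (pandas and a locally downloaded shard are assumptions, not part of the original card), one row can be printed directly from the first shard:

```python
# Illustrative sketch: print one row from the first shard as an example
# data point. The  code  column is skipped here to keep the read small;
# fetch it separately for the rows that are actually needed.
import pandas as pd

df = pd.read_parquet(
    "google_code_0000.parquet",
    columns=["repo_name", "path", "language", "license", "size"],
)
print(df.iloc[0].to_dict())
```
</description>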
<size>50126651493</size>
</item></channel>
</rss>
