catalog.archives.gov-lgbt

Folder: catalog.archives.gov-lgbt (11 files)
steps.txt                    0.80 kB
tifs.tar.zst                 984.02 GB
search-results.tar.zst       4.38 MB
search-result-urls.txt.zst   1.02 kB
README                       1.32 kB
pdfs.tar.zst                 114.94 GB
keywords.txt                 0.58 kB
others.tar.zst               46.67 MB
generate-urls.py             1.47 kB
jpgs.tar.zst                 328.99 GB
download-urls.txt.zst        418.40 kB
Type: Dataset

Bibtex:
@article{,
title= {catalog.archives.gov-lgbt},
journal= {},
author= {},
year= {},
url= {},
abstract= {A partial mirror of catalog.archives.gov, filtered for the LGBTQ-related
keywords found in keywords.txt. The list of keywords was derived from
catalog links that were in turn manually collected from
https://www.archives.gov/research/lgbt

For each keyword in keywords.txt, we fire off a search and attempt to download
all metadata (folder "search-results") and attachments (folders "tifs",
"jpgs", "pdfs", "others").

Folders are packed as Zstandard-compressed tarballs to save space and to
reduce overhead in torrent metadata. Unpacked, the data totals approximately
3 TB, of which tifs account for 2.6 TB.

Overview:

search-results.tar.zst contains all JSON metadata that would be available on
search results pages. This includes descriptions, authorship, year of each
record found, and a list of download URLs for PDFs etc. It's best to download
those files first to determine whether this dataset contains something
specific you need.

pdfs.tar.zst, tifs.tar.zst, jpgs.tar.zst and others.tar.zst contain the actual
downloads, segmented by file type for compression purposes.
download-urls.txt.zst contains the list of AWS S3 URLs that were downloaded
into those folders.

generate-urls.py was used to scrape the catalog for metadata. The detailed
procedure for scraping is outlined in steps.txt.

Data captured around 2025-02-23.},
keywords= {united states,usa,archives.gov,nara},
terms= {},
license= {},
superseded= {}
}
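
The abstract recommends checking search-results.tar.zst first to see whether the dataset holds something you need. The following is a minimal sketch of such a check, assuming the archive has already been extracted into a directory of .json files. The JSON schema is not described on this page, so the hypothetical helper find_matches does not rely on any field names and simply walks every string value in each file.

```python
import json
from pathlib import Path


def iter_strings(node):
    """Yield every string value found anywhere inside a parsed JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def find_matches(root, term):
    """Return sorted paths of .json files under root that mention term.

    The match is a case-insensitive substring test on string values only.
    The directory layout and file extension are assumptions.
    """
    needle = term.lower()
    hits = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            doc = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # skip unreadable or malformed files
        if any(needle in text.lower() for text in iter_strings(doc)):
            hits.append(path)
    return hits
```

Calling find_matches("search-results", "some term") would then list the metadata files worth opening before committing to the much larger attachment archives.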

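
The split of attachments into tifs, jpgs, pdfs and others can be illustrated by bucketing URLs on their file extension. This sketch is an assumption about how such a split might look, not the procedure documented in steps.txt: the extension-to-folder mapping and the helper names bucket_for and count_buckets are hypothetical. It expects download-urls.txt.zst to have been decompressed to plain text with one URL per line.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to archive folder; the rules
# actually used for this dataset may differ.
EXTENSION_TO_FOLDER = {
    ".tif": "tifs",
    ".tiff": "tifs",
    ".jpg": "jpgs",
    ".jpeg": "jpgs",
    ".pdf": "pdfs",
}


def bucket_for(url):
    """Return the folder a URL would fall into, based on its path extension."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return EXTENSION_TO_FOLDER.get(suffix, "others")


def count_buckets(lines):
    """Count URLs per folder, ignoring blank lines."""
    return Counter(bucket_for(line.strip()) for line in lines if line.strip())
```

Passing an open file handle for the decompressed URL list to count_buckets gives a rough per-folder file count, which can help in deciding which of the large archives to download.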

