Introduction
What the Internet Archive is, and the handful of ideas archive is built around.
The Internet Archive is a non-profit digital library: millions of books, movies, audio recordings, software, and web pages, all freely accessible. archive is a command-line client for it. This page is the mental model; the quick start is the hands-on version.
Items, identifiers, and files
Everything on archive.org is an item, and every item has a unique
identifier: a short slug like nasa or goody. An item is really a
directory of files plus a metadata record describing it.
archive item nasa # a friendly summary of the item
archive files nasa # the files inside it
archive metadata nasa # the raw metadata document
The identifier is the one thing you need to address an item. Most commands take it as their first argument.
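To make the "directory of files plus a metadata record" idea concrete, here is a minimal Python sketch that works without the archive client. It assumes the public archive.org Metadata API (https://archive.org/metadata/<identifier>) and the /download/<identifier>/<file> URL layout. The helper names and the trimmed sample document are illustrative, not real output.

```python
import json
from urllib.parse import quote


def metadata_url(identifier: str) -> str:
    """URL of an item's raw metadata document (assumes the public Metadata API)."""
    return f"https://archive.org/metadata/{quote(identifier, safe='')}"


def download_url(identifier: str, filename: str) -> str:
    """URL of one file inside an item; slashes in file names stay as path separators."""
    return f"https://archive.org/download/{quote(identifier, safe='')}/{quote(filename)}"


# Hypothetical, heavily trimmed sample shaped like a metadata document:
# one "metadata" record plus a "files" list.
SAMPLE = json.loads("""
{
  "metadata": {"identifier": "nasa", "title": "placeholder title"},
  "files": [{"name": "nasa_meta.xml"}, {"name": "logo 1.png"}]
}
""")


def file_urls(doc: dict) -> list:
    """Turn the files list of a metadata document into download URLs."""
    identifier = doc["metadata"]["identifier"]
    return [download_url(identifier, f["name"]) for f in doc.get("files", [])]


for url in file_urls(SAMPLE):
    print(url)
```

The identifier is the only input both helpers need, which mirrors how the commands above take it as their first argument.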
Mediatypes and collections
Each item has a mediatype (texts, movies, audio, image, software,
web, data, or collection) and belongs to one or more collections.
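Collections are themselves items with mediatype:collection, so each has an
identifier of its own. Member items carry that identifier in their collection
field, which is why you can search inside a collection with a Lucene query: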
archive search 'collection:nasa AND mediatype:image' -n 10
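If you build queries like this from a script, a small helper keeps the field-scoped clauses readable. The sketch below is an assumption-laden illustration: clause and all_of are hypothetical names, and the quoting covers only whitespace and double quotes, not the full set of Lucene special characters.

```python
def clause(field: str, value: str) -> str:
    """One field-scoped term; values containing whitespace are quoted as a phrase."""
    if any(ch.isspace() for ch in value):
        value = '"' + value.replace('"', r'\"') + '"'
    return f"{field}:{value}"


def all_of(*clauses: str) -> str:
    """Join clauses with AND, the operator used in the example above."""
    return " AND ".join(clauses)


query = all_of(clause("collection", "nasa"), clause("mediatype", "image"))
print(query)  # collection:nasa AND mediatype:image
```

The resulting string can be passed as the single quoted argument to archive search.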
Search is Lucene
archive search speaks the same query language as the website's Advanced Search:
field-scoped Lucene over a Solr index. mediatype:texts, subject:mathematics,
date:[2010-01-01 TO 2012-12-31], and free text all work. Results come back as
a stream of documents you can sort, project, and page through.
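For a sense of what sorting, projecting, and paging look like at the HTTP level, the sketch below builds a request URL for the website's Advanced Search endpoint. It assumes the advancedsearch.php parameters q, fl[], sort[], rows, page, and output. Whether archive itself calls this endpoint is not stated here, and search_url is a hypothetical helper.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"  # assumed endpoint


def search_url(query, fields=("identifier",), sort=None, rows=50, page=1):
    """Build an Advanced Search URL: query, projected fields, sort order, paging."""
    params = [("q", query)]
    params += [("fl[]", name) for name in fields]  # projection: one fl[] per field
    if sort:
        params.append(("sort[]", sort))            # e.g. "date asc"
    params += [("rows", str(rows)), ("page", str(page)), ("output", "json")]
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = search_url(
    "mediatype:texts AND date:[2010-01-01 TO 2012-12-31]",
    fields=("identifier", "title"),
    sort="date asc",
    rows=10,
)
print(url)
```

urlencode percent-encodes the brackets, colons, and spaces, so the Lucene query can be written exactly as it would be typed into the search box.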
The Wayback Machine is separate
Web captures live in the Wayback Machine, a different service with its own
APIs: an availability lookup, the CDX capture-history server, and Save Page Now.
archive folds them into the wayback command group. A capture is addressed by
the original URL plus a timestamp, not by an item identifier.
archive wayback available example.com
archive wayback get example.com -t 2010 --text
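The "original URL plus a timestamp" addressing can be seen directly in the Wayback URLs themselves. This Python sketch assumes the public availability endpoint (https://archive.org/wayback/available) and the web.archive.org/web/<timestamp>/<url> replay layout. The function names and the sample response are illustrative.

```python
from urllib.parse import urlencode


def availability_url(url, timestamp=None):
    """Availability lookup URL; timestamp is a YYYYMMDDhhmmss prefix such as '2010'."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


def replay_url(url, timestamp):
    """Address of one capture: timestamp first, then the original URL."""
    return f"https://web.archive.org/web/{timestamp}/{url}"


def closest_capture(response):
    """Return (timestamp, url) of the closest snapshot in a response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["timestamp"], closest["url"]
    return None


# Hypothetical sample shaped like an availability response.
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "timestamp": "20100101000000",
            "url": "http://web.archive.org/web/20100101000000/http://example.com/",
        }
    }
}
print(closest_capture(sample))
```

No item identifier appears anywhere in these URLs, which is the practical difference from the item-based commands earlier on this page.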
Anonymous by default
Reading public data needs no account. Credentials (an IAS3 access/secret pair from your archive.org account) are required only to upload, delete, or read your task queue. See configuration for how to store them.
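As a rough picture of what "anonymous by default" means for a request, the sketch below attaches an authorization header only when a key pair is supplied. It assumes the "LOW <access>:<secret>" header form used by archive.org's S3-like API. auth_headers is a hypothetical helper, and how archive itself loads and sends credentials is covered in configuration, not here.

```python
def auth_headers(access=None, secret=None):
    """Return request headers: empty for anonymous reads, signed when keys are given."""
    if access and secret:
        # Assumed header format for the S3-like API: "LOW access:secret".
        return {"Authorization": f"LOW {access}:{secret}"}
    return {}


print(auth_headers())                      # {}
print(sorted(auth_headers("AK", "SK")))    # ['Authorization']
```

Keeping the anonymous case as an empty header set means read-only code paths never need to know whether credentials exist.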
Output is yours to shape
Every command renders through one output layer, so the same data is a table for
reading, JSON or JSONL for piping, CSV/TSV for a spreadsheet, or a bare list of
URLs or identifiers for xargs. Pick with -o; project columns with
--fields. See output formats.
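The idea of a single output layer can be sketched in a few lines of Python: one list of records, projected to the chosen fields, then rendered in whichever shape the consumer wants. The format names, the render function, and the sample records below are all hypothetical and are not the values archive accepts for -o.

```python
import csv
import io
import json

# Hypothetical records standing in for search results.
RECORDS = [
    {"identifier": "item-a", "title": "First", "mediatype": "texts"},
    {"identifier": "item-b", "title": "Second", "mediatype": "audio"},
]


def project(records, fields):
    """Keep only the requested columns, in the requested order."""
    return [{name: rec.get(name, "") for name in fields} for rec in records]


def render(records, fmt, fields):
    """Render the same records as JSON, JSONL, CSV/TSV, or a bare identifier list."""
    rows = project(records, fields)
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt == "jsonl":
        return "\n".join(json.dumps(row) for row in rows)
    if fmt in ("csv", "tsv"):
        buf = io.StringIO()
        writer = csv.DictWriter(
            buf,
            fieldnames=list(fields),
            delimiter="," if fmt == "csv" else "\t",
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue().rstrip("\n")
    if fmt == "ids":
        return "\n".join(rec["identifier"] for rec in records)
    raise ValueError(f"unknown format: {fmt}")


print(render(RECORDS, "csv", ["identifier", "title"]))
```

Because projection happens before rendering, every format sees the same columns, and the bare identifier list is the one shape that feeds cleanly into xargs.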