What is Web Archiving?

Keeping copies of a web page or site so it can still be viewed after the original changes or disappears.

https://archive.org/web/ aka The Wayback Machine is the best-known web archive.

When is web archiving used?

https://specs.webrecorder.net/use-cases/0.1.0/#researcher-saves-an-article - use cases

My specific use cases are

  • human rights investigators who need to archive social media images / videos so they can later be used to prove what happened.
  • journalists who need to archive images / videos to make sure they can use them in their stories (i.e. the originals may disappear in a few days).


Web archive formats

  • WARC (Web ARChive): .warc, .warc.gz - supported - the standard to follow.


https://github.com/iipc/awesome-web-archiving - curated by the International Internet Preservation Consortium (https://netpreserve.org/). Sections include:

  • Acquisition
  • Replay
  • Search and discovery
  • Utilities
  • WARC I/O libraries


https://archiveweb.page/ - Chrome extension from the Webrecorder project. Captures can be downloaded as WARC.

https://conifer.rhizome.org/ - history

https://github.com/internetarchive/heritrix3/wiki - the Internet Archive's web crawler

Existing data crawls

https://commoncrawl.org/ - lots of data in WARC format. They adhere to robots.txt, so they won't crawl Facebook etc.


Open question: how to parse a WARC 1.1 file and get the images out of it? Could we use it for Facebook?
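A rough sketch of one way to pull images out of a WARC file, using only the Python standard library. It assumes each record is a `WARC/…` version line, header lines, a blank line, then exactly `Content-Length` bytes, and that `response` records hold a raw HTTP response. The function names (`open_warc`, `iter_warc_records`, `iter_images`) are made up for this note. It does not undo chunked transfer encoding or HTTP content compression, so for real crawls a dedicated library from the WARC I/O list above (e.g. warcio) is probably the safer choice.

```python
import gzip
import io


def open_warc(path):
    """Open a .warc or .warc.gz file as a binary stream (gzip detected by magic bytes)."""
    with open(path, "rb") as f:
        magic = f.read(2)
    return gzip.open(path, "rb") if magic == b"\x1f\x8b" else open(path, "rb")


def read_headers(stream):
    """Read 'Name: value' lines up to the first blank line; names are lowercased."""
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            return headers
        name, _, value = line.decode("utf-8", "replace").partition(":")
        headers[name.strip().lower()] = value.strip()


def iter_warc_records(stream):
    """Yield (warc_headers, content_block) for every record in the stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = read_headers(stream)
        block = stream.read(int(headers.get("content-length", 0)))
        yield headers, block


def iter_images(stream):
    """Yield (url, content_type, body) for response records whose HTTP type is image/*."""
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        http = io.BytesIO(block)
        if not http.readline().startswith(b"HTTP/"):
            continue  # not an HTTP response payload
        http_headers = read_headers(http)
        content_type = http_headers.get("content-type", "")
        if content_type.lower().startswith("image/"):
            # Assumption: body is stored as-is (no chunked / gzip encoding to undo).
            yield headers.get("warc-target-uri"), content_type, http.read()
```

Usage would look something like `for url, ctype, data in iter_images(open_warc("capture.warc.gz")): ...`, where `capture.warc.gz` is a placeholder file name, writing `data` to disk with an extension chosen from `ctype`. Whether this helps for Facebook depends on getting a capture in the first place (e.g. via a browser-based recorder), not on the parsing.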

https://replayweb.page/ - replay tool from the Webrecorder project.

Archive Websites

https://web.archive.org - The Wayback Machine is a digital archive of the World Wide Web founded by the Internet Archive, a nonprofit based in San Francisco, California. It was created in 1996 and launched to the public in 2001.

Archive-It is the domain-level version of the above.

https://archive.ph/ aka archive.today

https://commoncrawl.org/the-data/get-started/ - how to download Common Crawl WARC files