Web archiving
What is Web Archiving?
Keeping copies of a web page or a whole site so the content stays available after the original changes or disappears.
https://archive.org/web/ (the Wayback Machine) is the best-known web archive.
When is web archiving used?
https://specs.webrecorder.net/use-cases/0.1.0/#researcher-saves-an-article - use cases
My specific use cases are:
- Human rights investigators who need to archive social media images and videos so they can later be used to prove what happened.
- Journalists who need to archive images and video so they can still use them in their stories (i.e. the originals may disappear within a few days).
Formats
Web archive formats
- WARC (Web ARChive), file extensions .warc and .warc.gz - Supported - the standard to follow.
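To make the format concrete, here is a minimal sketch of what a single WARC/1.1 `response` record looks like on disk, built with the Python standard library only. The function names `build_warc_record` and `gzip_record` are my own, not part of any library, and the sketch assumes the caller already has the raw HTTP response bytes. Real tools write more headers (payload digests, IP address and so on), so treat this as an illustration of the layout rather than a complete writer.

```python
import gzip
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, http_response: bytes) -> bytes:
    """Return one WARC/1.1 'response' record as raw bytes (illustrative only)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.1",                                  # version line
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",     # size of the content block
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Layout: header lines, blank line, content block, then two CRLFs.
    return head + http_response + b"\r\n\r\n"


def gzip_record(record: bytes) -> bytes:
    """Compress one record as its own gzip member, as .warc.gz files do."""
    return gzip.compress(record)
```

A .warc.gz file is then just these per-record gzip members concatenated, which is what lets a reader jump to one record without decompressing the whole file.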
awesome-web-archiving
https://github.com/iipc/awesome-web-archiving - curated by the International Internet Preservation Consortium (IIPC), https://netpreserve.org/
- Acquisition
- Replay
- Search and discovery
- Utilities
- WARC I/O libraries
Acquisition
https://archiveweb.page/ - Chrome extension from the Webrecorder project. Captures can be downloaded as WARC.
https://conifer.rhizome.org/ - history
https://github.com/internetarchive/heritrix3/wiki - the Internet Archive's web crawler
Existing data crawls
https://commoncrawl.org/ - lots of data in WARC format. They adhere to robots.txt, so they won't crawl Facebook etc.
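Common Crawl also publishes a URL index, so a single capture can be pulled out of a large WARC file with an HTTP range request instead of downloading the whole file. The sketch below only builds that request. It assumes an index entry carrying `filename`, `offset` and `length` fields (as strings) and assumes `https://data.commoncrawl.org/` as the download host; both should be checked against the current Common Crawl documentation. `range_request` is a hypothetical helper name.

```python
from typing import Dict, Tuple

# Assumed public download host; verify against the Common Crawl docs.
DATA_HOST = "https://data.commoncrawl.org/"


def range_request(index_record: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) that fetch exactly one record from a WARC file."""
    offset = int(index_record["offset"])
    length = int(index_record["length"])
    # HTTP byte ranges are inclusive at both ends, hence the "- 1".
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return DATA_HOST + index_record["filename"], headers
```

Because each record in a .warc.gz file is its own gzip member, the bytes returned by such a request can be decompressed on their own.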
Replay
**HERE** How to parse and get images out of a WARC 1.1 file? Could we use it for Facebook?
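As a first pass at the parsing question, here is a minimal standard-library sketch that walks the records in a WARC file and collects every HTTP response whose Content-Type starts with `image/`. The names `iter_warc_records` and `extract_images` are my own. It assumes the whole file fits in memory and that image bodies are stored without chunked transfer encoding or HTTP-level compression, which is not guaranteed for real captures. A maintained WARC I/O library such as warcio would be the more robust choice; this is only meant to show the record structure.

```python
import gzip
from typing import Dict, Iterator, List, Tuple


def iter_warc_records(data: bytes) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (warc_headers, content_block) for each record in raw WARC bytes."""
    if data[:2] == b"\x1f\x8b":            # gzip magic number: a .warc.gz file
        data = gzip.decompress(data)       # handles concatenated gzip members
    pos = 0
    while pos < len(data):
        while data[pos:pos + 2] == b"\r\n":    # skip CRLFs between records
            pos += 2
        if pos >= len(data):
            break
        head_end = data.index(b"\r\n\r\n", pos)
        lines = data[pos:head_end].decode("utf-8", "replace").split("\r\n")
        headers: Dict[str, str] = {}
        for line in lines[1:]:             # lines[0] is the version line
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers["content-length"])
        block_start = head_end + 4
        yield headers, data[block_start:block_start + length]
        pos = block_start + length


def extract_images(data: bytes) -> List[Tuple[str, str, bytes]]:
    """Return (url, mime_type, image_bytes) for every image response record."""
    images = []
    for headers, block in iter_warc_records(data):
        if headers.get("warc-type") != "response":
            continue
        # The content block is a raw HTTP response: headers, blank line, body.
        http_head, sep, body = block.partition(b"\r\n\r\n")
        if not sep:
            continue
        mime = ""
        for line in http_head.split(b"\r\n")[1:]:   # skip the status line
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-type":
                mime = value.split(b";")[0].strip().decode("ascii", "replace").lower()
        if mime.startswith("image/"):
            images.append((headers.get("warc-target-uri", ""), mime, body))
    return images
```

Usage would be along the lines of `extract_images(open("capture.warc.gz", "rb").read())`, where `capture.warc.gz` is a placeholder file name, writing each returned body to disk under a name derived from its URL.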
https://replayweb.page/ - replay tool from the Webrecorder project.
Archive Websites
https://web.archive.org - The Wayback Machine is a digital archive of the World Wide Web founded by the Internet Archive, a nonprofit based in San Francisco, California. It was created in 1996 and launched to the public in 2001.
Archive-It is the domain-level version of the above.
https://archive.ph/ aka archive.today
https://commoncrawl.org/the-data/get-started/ - download WARC files
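For the Wayback Machine, a quick way to check whether a page already has a capture is its availability endpoint, which returns JSON describing the closest snapshot. The sketch below separates URL building and response parsing from the network call so the first two can be checked offline. The helper names are my own, and the response shape (`archived_snapshots` → `closest` → `url`) is what I understand the endpoint to return, so verify it against the Internet Archive's API documentation before relying on it.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url: str, timestamp: str = "") -> str:
    """Build the query URL; timestamp is optional, e.g. YYYYMMDD or YYYYMMDDhhmmss."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text: str) -> Optional[str]:
    """Return the closest snapshot URL from a JSON response, or None."""
    snapshots = json.loads(response_text).get("archived_snapshots", {})
    closest = snapshots.get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url: str, timestamp: str = "") -> Optional[str]:
    """Network call: ask the Wayback Machine for the closest snapshot."""
    with urlopen(availability_url(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))
```

Calling `lookup("example.com")` would return a snapshot URL or `None`. For the investigator and journalist use cases, a `None` result is the signal that the page still needs to be captured.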