What is Web Archiving
Keeping copies of a web page / site.
https://archive.org/web/ aka The Wayback Machine is the best-known web archive.
When is web archiving used?
My specific use cases are:
- Human rights investigators who need to archive social media images / videos so they can later be used to prove what happened.
- Journalists who need to archive images / video to make sure they can use them in their stories (i.e. content that may disappear in a few days).
Web archive formats
- WARC (Web ARChive): .warc, .warc.gz - Supported - the standard to follow.
- Search and discovery
- WARC I/O libraries
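To make the format more concrete, here is a minimal sketch of what one WARC 1.1 record looks like on disk, using only the Python standard library. The function names `build_warc_record` and `to_warc_gz` are hypothetical helpers made up for illustration, and the sketch writes only the mandatory header fields plus the target URI. A real crawler or a dedicated WARC library would add more fields and record types.

```python
import gzip
import uuid
from datetime import datetime, timezone
from typing import Iterable


def build_warc_record(target_uri: str, http_payload: bytes) -> bytes:
    """Return one WARC/1.1 'response' record as raw bytes (illustrative helper)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.1",                                   # version line
        "WARC-Type: response",                        # mandatory
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",  # mandatory
        f"WARC-Date: {now}",                          # mandatory
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",       # mandatory
    ]
    # A blank line separates the WARC headers from the content block.
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Every record ends with two CRLFs after the content block.
    return head + http_payload + b"\r\n\r\n"


def to_warc_gz(records: Iterable[bytes]) -> bytes:
    """Compress each record as its own gzip member, as .warc.gz files do."""
    return b"".join(gzip.compress(record) for record in records)
```

The content block of a `response` record is the raw HTTP response (status line, HTTP headers, body), which is why a WARC file can later be replayed as if the page were being served again.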
https://archiveweb.page/ - Chrome extension from Webrecorder Project. Can download WARC.
https://github.com/internetarchive/heritrix3/wiki - the Internet Archive's web crawler
Existing data crawls
https://commoncrawl.org/ - lots of data in WARC format. They adhere to robots.txt, so they won't crawl Facebook etc.
**HERE** how to parse and get images out of a WARC 1.1 file? Could we use it for Facebook?
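One possible starting point for the parsing question is sketched below, using only the Python standard library. It walks the records of a `.warc` or `.warc.gz` file and yields the body of every HTTP response whose `Content-Type` starts with `image/`. All function names (`open_warc`, `iter_warc_records`, `iter_images`) are hypothetical and invented for this sketch. It assumes well-formed records and does not undo chunked transfer encoding or HTTP content compression, so a dedicated WARC library (for example warcio from the Webrecorder Project) is likely the safer choice for real data.

```python
import gzip
from typing import BinaryIO, Dict, Iterator, Tuple


def open_warc(path: str) -> BinaryIO:
    """Open a .warc or .warc.gz file for reading bytes."""
    # gzip.open transparently reads the per-record gzip members in sequence.
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def _read_headers(stream: BinaryIO) -> Dict[str, str]:
    """Read 'Name: value' lines up to the first blank line (names lowercased)."""
    headers: Dict[str, str] = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            return headers
        name, _, value = line.decode("utf-8", "replace").partition(":")
        headers[name.strip().lower()] = value.strip()


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (warc_headers, content_block) for each record in the stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = _read_headers(stream)
        length = int(headers["content-length"])
        yield headers, stream.read(length)


def iter_images(stream: BinaryIO) -> Iterator[Tuple[str, str, bytes]]:
    """Yield (url, mime_type, body) for every image response in the stream."""
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        # The content block is a raw HTTP response: split headers from body.
        head, sep, body = block.partition(b"\r\n\r\n")
        if not sep:
            continue
        http_headers: Dict[str, str] = {}
        for raw in head.split(b"\r\n")[1:]:  # [1:] skips the HTTP status line
            name, _, value = raw.decode("latin-1").partition(":")
            http_headers[name.strip().lower()] = value.strip()
        mime = http_headers.get("content-type", "").split(";")[0].strip().lower()
        if mime.startswith("image/"):
            yield headers.get("warc-target-uri", ""), mime, body
```

With a capture saved as, say, `capture.warc.gz` (a placeholder file name), looping over `iter_images(open_warc("capture.warc.gz"))` gives each image's URL, MIME type and bytes, which can then be written to disk. Whether a given capture actually contains the images of interest depends on how it was made.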
https://replayweb.page/ - replays web archives. From the Webrecorder Project.
https://web.archive.org - The Wayback Machine is a digital archive of the World Wide Web founded by the Internet Archive, a nonprofit based in San Francisco, California. It was created in 1996 and launched to the public in 2001.
Archive-It is the domain-level version of the above.
https://archive.ph/ aka archive.today
https://commoncrawl.org/the-data/get-started/ - download WARC