Warc parse and save images
WARC - Web ARChive file format
- Is it useful to store a warc file for every url we are archiving?
- Can we use warc file acquisition tools to help in our archiving of difficult sites like Facebook images?
- Parse browsertrix output to get raw image?
What is the Web ARChive file format?
https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml
The ISO spec (ISO 28500), which you have to buy!
The WARC (Web ARChive) format specifies a method for combining multiple digital resources into an aggregate archival file together with related information
WARC/1.1 as of Jan 2017; WARC/1.0 as of Nov 2008
WARC File and Record
- A WARC file is the concatenation of one or more WARC records
Record
1. record header, whose WARC-Type field is one of:
  - warcinfo (used all the time)
  - response (used eg for favicon.ico)
  - resource
  - request
  - metadata
  - revisit
  - conversion
  - continuation
2. record content block which can contain any format eg binary image, html
3. 2 newlines (CRLF CRLF)
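To make the record layout above concrete, here is a minimal standard-library sketch that builds two tiny records, concatenates them into one "file", and reads them back. The helper names `build_record` and `iter_records` are made up for illustration, and the records are deliberately incomplete: a real WARC record also needs `WARC-Record-ID` and `WARC-Date` headers, and real files are usually gzipped. Use a proper library for anything serious.

```python
from io import BytesIO

CRLF = b"\r\n"


def build_record(rec_type: str, target_uri: str, block: bytes) -> bytes:
    """Serialise one simplified record: header, blank line, block, two CRLFs."""
    header_lines = [
        "WARC/1.1",
        f"WARC-Type: {rec_type}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Length: {len(block)}",
        # WARC-Record-ID and WARC-Date are omitted here for brevity.
    ]
    header = CRLF.join(line.encode("utf-8") for line in header_lines)
    return header + CRLF + CRLF + block + CRLF + CRLF


def iter_records(stream):
    """Yield (version, headers, block) for each record in an uncompressed stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of file
        headers = {}
        while True:
            line = stream.readline()
            if line in (CRLF, b""):
                break  # the blank line ends the record header
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The block is read by length, so it may safely contain newlines.
        block = stream.read(int(headers["Content-Length"]))
        stream.read(4)  # skip the two CRLFs that close the record
        yield version.decode("utf-8").strip(), headers, block


if __name__ == "__main__":
    uri = "http://brokenlinkcheckerchecker.com/img/flower3.jpg"
    # A WARC file is just records written one after another.
    warc = build_record("response", uri, b"\xff\xd8 fake jpeg bytes") + build_record(
        "request", uri, b"GET /img/flower3.jpg HTTP/1.1\r\n\r\n"
    )
    for version, headers, block in iter_records(BytesIO(warc)):
        print(version, headers["WARC-Type"], len(block))
```

Reading the block by `Content-Length` rather than scanning for the trailing newlines is what lets a record carry arbitrary binary content such as an image.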
Confusingly, the request and response records seem to be the wrong way around in my examples: the response is written before the request. The order I see is:
- warcinfo
- response
- request
- warcinfo
Example
I am using the archivewebpage Chrome extension to make a WARC file of http://brokenlinkcheckerchecker.com/pagea
Parsing a WARC
https://github.com/webrecorder/pywb and docs https://pywb.readthedocs.io/en/latest/index.html
Replay and recording on web archives - archivewebpage uses this under the hood.
https://github.com/internetarchive/warc Last updated in 2012! No support for WARC/1.1
Python Parser
https://github.com/lxucs/commoncrawl-warc-retrieval a very simple parser. And blog
WARCIO - parse and save jpegs
https://github.com/webrecorder/warcio 2020. Supports WARC/1.1. Part of the webrecorder project.
Makes it easy to read a WARC file eg
from warcio.archiveiterator import ArchiveIterator
# pip install pillow
from PIL import Image
from io import BytesIO
from urllib.parse import urlparse
import os
import uuid

# can handle gzipped warc files too
warc_path = '/mnt/c/warc-in/building18_0.warc.gz'
out_dir = '/mnt/c/warc-out'

with open(warc_path, 'rb') as stream:
    for record in ArchiveIterator(stream):
        # only HTTP responses carry the image bytes we want
        if record.rec_type == 'response' and record.http_headers is not None:
            # http://brokenlinkcheckerchecker.com/img/flower3.jpg
            uri = record.rec_headers.get_header('WARC-Target-URI')
            ct = record.http_headers.get_header('Content-Type')
            if ct == 'image/jpeg':
                status = record.http_headers.statusline
                if status == '200 OK':
                    o = urlparse(uri)
                    # /img/flower3.jpg
                    print(o.path)
                    # flower3.jpg
                    filename = os.path.basename(o.path)
                    print(filename)
                    content = record.content_stream().read()
                    # check if already saved this filename
                    if os.path.isfile(os.path.join(out_dir, filename)):
                        # keep a .jpg extension on the fallback name
                        filename = f'{uuid.uuid4()}.jpg'
                    # note: PIL decodes and re-encodes the JPEG on save
                    with Image.open(BytesIO(content)) as img:
                        img.save(os.path.join(out_dir, filename), format='JPEG')
I’m using archivewebpage to save a warc file, which works well.
groups.google.io - potentially an even easier way to write out the jpg: just pipe the bytes straight to a file.
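A rough sketch of that idea, skipping Pillow entirely and writing the record's bytes to disk unchanged, so the JPEG is not re-encoded. The helper names (`unique_path`, `write_bytes`, `save_jpegs`) are my own and purely illustrative. Checking the status with `startswith('200')` and the content type with `startswith('image/jpeg')` are assumptions that are a little looser than the exact matches in the code above. I have not run this against a large crawl.

```python
import os
import uuid
from urllib.parse import urlparse


def unique_path(out_dir: str, uri: str) -> str:
    """Pick an output path from the URI, falling back to a UUID name on a clash."""
    filename = os.path.basename(urlparse(uri).path) or f"{uuid.uuid4()}.jpg"
    path = os.path.join(out_dir, filename)
    if os.path.isfile(path):
        path = os.path.join(out_dir, f"{uuid.uuid4()}.jpg")
    return path


def write_bytes(out_dir: str, uri: str, content: bytes) -> str:
    """Write the raw bytes straight to a file and return the path used."""
    path = unique_path(out_dir, uri)
    with open(path, "wb") as f:
        f.write(content)
    return path


def save_jpegs(warc_path: str, out_dir: str) -> list:
    """Save every 200 image/jpeg response in a WARC file without decoding it."""
    # Imported here so the helpers above can be used without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    saved = []
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if not content_type.startswith("image/jpeg"):
                continue
            if not record.http_headers.statusline.startswith("200"):
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            saved.append(write_bytes(out_dir, uri, record.content_stream().read()))
    return saved
```

Usage would be something like `save_jpegs('/mnt/c/warc-in/building18_0.warc.gz', '/mnt/c/warc-out')`. The trade-off is that nothing checks the bytes are really a valid JPEG, which `Image.open` did implicitly.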