WARC parse and save images

WARC - Web ARChive file format

Is it useful to store a warc file for every url we are archiving?
Can we use WARC file acquisition tools to help in our archiving of difficult sites like Facebook images?
Parse browsertrix output to get raw image?

What is the Web ARChive file format?

https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml

It is an ISO spec, which you have to buy!

The WARC (Web ARChive) format specifies a method for combining multiple digital resources into an aggregate archival file together with related information

http://bibnum.bnf.fr/WARC/

WARC/1.1 as of Jan 2017, WARC/1.0 as of Nov 2008

WARC File and Record

A WARC file is the concatenation of one or more WARC records.

Record

1. record header, which declares one of these record types:

warcinfo (used all the time)
info
response (used eg for favicon.ico)
resource
request
metadata
revisit
conversion
continuation
2. record content block, which can contain any format, e.g. a binary image or HTML
3. two newlines
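To make the three parts concrete, here is a minimal sketch that builds one record by hand and splits it apart again, using only the standard library. The helper names `build_record` and `parse_record`, the example URI and the fake payload are my own assumptions for illustration; a real tool such as warcio adds more header fields (digests, Content-Type and so on) and should be preferred for real archives.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_record(rec_type: str, target_uri: str, block: bytes) -> bytes:
    """Serialise one WARC/1.1 record: header, blank line, content block, two CRLFs."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.1",                                      # version line
        f"WARC-Type: {rec_type}",                        # e.g. response, resource
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Length: {len(block)}",                 # size of the content block
    ]
    header = CRLF.join(line.encode("utf-8") for line in header_lines)
    return header + CRLF + CRLF + block + CRLF + CRLF


def parse_record(raw: bytes):
    """Split one record back into (version, header fields, content block)."""
    header, _, rest = raw.partition(CRLF + CRLF)         # header ends at the first blank line
    version, *field_lines = header.decode("utf-8").split("\r\n")
    fields = dict(line.split(": ", 1) for line in field_lines)
    block = rest[: int(fields["Content-Length"])]        # trust Content-Length, not delimiters
    return version, fields, block


# hypothetical URI and payload, for illustration only
raw = build_record("resource", "http://example.com/img/flower.jpg", b"\xff\xd8fake-jpeg-bytes")
version, fields, block = parse_record(raw)
print(version, fields["WARC-Type"], len(block))
```

The content block is cut out using Content-Length rather than by searching for the trailing newlines, because a binary block (such as an image) can itself contain CRLF sequences.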

spec

Confusingly, the request and response records seem to be the wrong way around in my examples, with the response coming first:

warcinfo
response
request
warcinfo

Example

I am using the archivewebpage Chrome extension to make a WARC file of http://brokenlinkcheckerchecker.com/pagea

Parsing a WARC

https://github.com/webrecorder/pywb and docs https://pywb.readthedocs.io/en/latest/index.html

Replay and recording on web archives - archivewebpage uses this under the hood.

https://github.com/internetarchive/warc Last updated in 2012! No support for WARC/1.1.

Python Parser

https://github.com/lxucs/commoncrawl-warc-retrieval A very simple parser, with an accompanying blog.

WARCIO - parse and save jpegs

https://github.com/webrecorder/warcio 2020. Supports WARC/1.1. Part of the webrecorder project.

Makes it easy to read a WARC file, e.g.:

from io import BytesIO
from urllib.parse import urlparse
import os
import uuid

# pip install warcio pillow
from PIL import Image
from warcio.archiveiterator import ArchiveIterator

# warcio can handle gzipped warc files too
warc_path = '/mnt/c/warc-in/building18_0.warc.gz'
out_dir = '/mnt/c/warc-out'

with open(warc_path, 'rb') as stream:
    for record in ArchiveIterator(stream):
        # only HTTP response records carry the image payload
        if record.rec_type != 'response' or record.http_headers is None:
            continue

        # e.g. http://brokenlinkcheckerchecker.com/img/flower3.jpg
        uri = record.rec_headers.get_header('WARC-Target-URI')

        # Content-Type may be missing or carry parameters, so compare the media type only
        ct = record.http_headers.get_header('Content-Type') or ''
        if ct.split(';')[0].strip().lower() != 'image/jpeg':
            continue

        # skip redirects, errors etc.
        if record.http_headers.get_statuscode() != '200':
            continue

        # /img/flower3.jpg -> flower3.jpg
        filename = os.path.basename(urlparse(uri).path)
        print(filename)

        # if the name is empty or already saved, use a unique name (keeping the extension)
        if not filename or os.path.isfile(os.path.join(out_dir, filename)):
            filename = f'{uuid.uuid4()}.jpg'

        content = record.content_stream().read()

        # Image.open verifies the bytes are an image; saving re-encodes the JPEG
        with Image.open(BytesIO(content)) as img:
            img.save(os.path.join(out_dir, filename), format='JPEG')

BytesIO code

I’m using archivewebpage to save a WARC file, which works well.

groups.google.io: potentially an even easier way to write out the JPEG. Just write the bytes straight to a file, with no need to go through PIL.
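A minimal sketch of that idea, assuming `content` is the bytes returned by `record.content_stream().read()` and `uri` is the record's WARC-Target-URI, as in the warcio script above. The helper name `save_payload` is hypothetical. Writing the bytes directly avoids re-encoding the JPEG, but it also skips the check that the payload really is an image.

```python
import os
import uuid
from urllib.parse import urlparse


def save_payload(content: bytes, uri: str, out_dir: str) -> str:
    """Write raw payload bytes into out_dir, named after the last path segment of the URI."""
    # fall back to a random name if the URI path has no filename (e.g. ends in "/")
    filename = os.path.basename(urlparse(uri).path) or f"{uuid.uuid4()}.jpg"
    target = os.path.join(out_dir, filename)

    if os.path.exists(target):
        # avoid overwriting an earlier file with the same name, keeping the extension
        stem, ext = os.path.splitext(filename)
        target = os.path.join(out_dir, f"{stem}-{uuid.uuid4()}{ext}")

    os.makedirs(out_dir, exist_ok=True)
    with open(target, "wb") as f:
        f.write(content)            # bytes go to disk unchanged
    return target
```

Inside the loop this would replace the BytesIO and PIL lines with something like `save_payload(content, uri, '/mnt/c/warc-out')`.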