WARC - Web ARChive file format

  • Is it useful to store a WARC file for every URL we are archiving?
  • Can we use WARC file acquisition tools to help in our archiving of difficult sites like Facebook images?
  • Parse browsertrix output to get raw image?

What is the Web ARChive file format?

https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml

ISO spec, which you have to buy!

The WARC (Web ARChive) format specifies a method for combining multiple digital resources into an aggregate archival file together with related information.

http://bibnum.bnf.fr/WARC/

WARC/1.1 as of Jan 2017; WARC/1.0 as of Nov 2008

WARC File and Record

  • A WARC file is the concatenation of one or more WARC records

Record

1. Record header, whose WARC-Type is one of:

  • warcinfo (used all the time)
  • info
  • response (used eg for favicon.ico)
  • resource
  • request
  • metadata
  • revisit
  • conversion
  • continuation

2. Record content block, which can contain any format, e.g. a binary image or HTML
3. Two newlines (CRLF CRLF)
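
The record layout above (header, content block, two newlines) can be sketched in a few lines of standard-library Python. This is only an illustration, assuming WARC/1.0 and just the four mandatory header fields; `build_record` is a hypothetical helper of mine, not part of any WARC library, and real tools add more fields such as Content-Type.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional

CRLF = b"\r\n"


def build_record(warc_type: str, payload: bytes,
                 target_uri: Optional[str] = None) -> bytes:
    """Serialise one record: version line, header fields, blank line,
    content block, then two CRLFs (hypothetical helper, sketch only)."""
    headers = [
        ("WARC-Type", warc_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("Content-Length", str(len(payload))),
    ]
    if target_uri is not None:
        headers.append(("WARC-Target-URI", target_uri))

    head = b"WARC/1.0" + CRLF
    for name, value in headers:
        head += f"{name}: {value}".encode("utf-8") + CRLF

    # blank line ends the header; two CRLFs end the record
    return head + CRLF + payload + CRLF + CRLF


# A WARC file is just records laid end to end
warc_bytes = (
    build_record("warcinfo", b"software: example\r\n")
    + build_record("resource", b"hello", "http://example.com/hello.txt")
)
```

Because each record carries its own Content-Length, a reader can skip from one record to the next without understanding the content block.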

spec

It is confusing that the request and response records seem to be the wrong way around in my examples:

  • warcinfo
  • response
  • request
  • warcinfo

Example

I am using the archivewebpage Chrome extension to make a WARC file of http://brokenlinkcheckerchecker.com/pagea

Parsing a WARC

https://github.com/webrecorder/pywb and docs https://pywb.readthedocs.io/en/latest/index.html

Replay and recording of web archives; archivewebpage uses this under the hood.

https://github.com/internetarchive/warc Last updated in 2012! No support for WARC/1.1.

Python Parser

https://github.com/lxucs/commoncrawl-warc-retrieval A very simple parser, with an accompanying blog post.

WARCIO - parse and save jpegs

https://github.com/webrecorder/warcio 2020. Supports WARC/1.1. Part of the webrecorder project.

Makes it easy to read a WARC file, e.g.:

from warcio.archiveiterator import ArchiveIterator

# pip install pillow
from PIL import Image

from io import BytesIO

from urllib.parse import urlparse
import os
import uuid

# can handle gzipped warc files too
# (named warc_path so the built-in input() is not shadowed)
warc_path = '/mnt/c/warc-in/building18_0.warc.gz'

with open(warc_path, 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            # http://brokenlinkcheckerchecker.com/img/flower3.jpg
            uri = record.rec_headers.get_header('WARC-Target-URI')
            ct = record.http_headers.get_header('Content-Type')

            if ct == 'image/jpeg':
                status = record.http_headers.statusline
                if status=='200 OK':
                    
                    o = urlparse(uri)
                    # /img/flower3.jpg
                    print(o.path)
                    # flower3.jpg
                    filename = os.path.basename(o.path)
                    print(filename)

                    # read the payload and wrap it in a file-like object for Pillow
                    content = record.content_stream().read()
                    img_bytes_io = BytesIO(content)

                    # if this filename was already saved, use a unique name (keeping the .jpg extension)
                    if os.path.isfile(f'/mnt/c/warc-out/{filename}'):
                        filename = f'{uuid.uuid4()}.jpg'

                    with Image.open(img_bytes_io) as img:
                        img.save(f'/mnt/c/warc-out/{filename}', format='JPEG')

BytesIO code

I’m using archivewebpage to save a WARC file, which works well.

groups.google.io: potentially an even easier way to write out the JPEG is to pipe the bytes straight to a file.
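
A minimal sketch of that simpler route, assuming `content` is the bytes returned by `record.content_stream().read()` in the loop above. `save_payload` is a hypothetical helper of mine; unlike the Pillow version, it does not validate or re-encode the image, it only writes what was archived.

```python
import os
import uuid


def save_payload(content: bytes, out_dir: str, filename: str) -> str:
    """Write raw bytes to out_dir/filename and return the path used.

    If the file already exists, a uuid-based name with the same
    extension is used instead, so nothing is overwritten.
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)

    if os.path.isfile(path):
        _, ext = os.path.splitext(filename)
        path = os.path.join(out_dir, f"{uuid.uuid4()}{ext}")

    # no decoding step: the archived bytes go straight to disk
    with open(path, "wb") as f:
        f.write(content)
    return path
```

Inside the loop this would replace the BytesIO and Image.open steps, e.g. `save_payload(content, '/mnt/c/warc-out', filename)`. The trade-off is that a truncated or mislabelled image gets saved as-is rather than raising an error.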