browsertrix-crawler (BC) is a browser-based crawling system built on Puppeteer, a Node.js API for controlling headless Chrome.

It is part of the https://webrecorder.net/ suite of tools.

Single Page from a static site

Let's explore what BC can do.

# --text enables automated text extraction for full-text search
# --collection names the output collection
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
--url https://davemateer.com/2022/09/22/mssql-php-local-on-wsl  \
--generateWACZ --text --collection mssql11

The output of the crawler is written to the crawls directory; the main output is WARC files.

The first WARC file, unzipped, shows the raw WARC records.
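To poke around that output from the shell, something like the following works (the paths are my assumption of the default layout, where each crawl lands under crawls/collections/<collection name>):

# list what the crawl produced
ls crawls/collections/mssql11/

# the WARCs are gzipped - peek at the first records of one
zcat crawls/collections/mssql11/archive/*.warc.gz | head -40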

ReplayWeb.page


https://replayweb.page/ is a browser-based viewer that loads the web archive files you give it (see the docs).

It can get confused; if so, delete the service worker by clicking the button in the F12 dev tools, and then it can render the saved web page.

Formats

Web archive formats supported by the webrecorder.net suite of tools include:

  • WARC (Web ARChive) - .warc, .warc.gz - supported; this is the established standard to follow.
  • WACZ (Web Archive Collection Zipped) - .wacz - a newly proposed format, still in progress; it is a ZIP container (see the example below).
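Because a WACZ is a ZIP container, its contents can be listed with standard tools (the path and filename are an assumption based on the mssql11 crawl above):

# a .wacz is just a zip - list what is inside it
unzip -l crawls/collections/mssql11/mssql11.wacz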

Simple site - 1 page only and combine warc

docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
--url http://brokenlinkcheckerchecker.com/pagec-single-image \
--scopeType page \
--combineWarc \
--collection blcc-single

--scopeType page crawls the single page only.

--combineWarc combines the WARCs into a single file, which is useful for parsing all the images out of.
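As a rough sketch of what that parsing could look like from the shell (the combined WARC filename is an assumption, and warcio's exact CLI flags are worth double-checking against its docs):

# list the captured URLs and pick out likely images
zgrep -a "WARC-Target-URI" crawls/collections/blcc-single/*.warc.gz | grep -Ei '\.(png|jpe?g|gif|webp)' | sort -u

# warcio (pip install warcio) can index the records, then extract a payload by its offset
warcio index crawls/collections/blcc-single/*.warc.gz -f offset,warc-type,warc-target-uri
warcio extract --payload crawls/collections/blcc-single/<combined-warc-name>.warc.gz <offset> > image.jpg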

Facebook - not logged in - lady simple URL

So I can save the Facebook lady page locally to a WARC file using Webrecorder tooling, and then parse the images out of it.

Can I do this with browsertrix?

# FB lady works without login
# notice URL is simplified
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
--url https://www.facebook.com/photo/?fbid=1329142910787472 \
--scopeType page \
--combineWarc \
--collection facebook-lady10

Yes, this works fine and gets the image I'm interested in, so I can parse the WARC and save the image.

It also works fine when logged in (see below).

Facebook - not logged in - lady full URL

If I want more safety when passing special characters to docker, the easiest thing to do is to read the URLs from a file:

# FB lady with full URL via url.txt
docker run -v $PWD/crawls:/crawls/ \
-v $PWD/url.txt:/app/url.txt \
-it webrecorder/browsertrix-crawler crawl \
--urlFile /app/url.txt \
--scopeType page \
--combineWarc \
--collection facebook-lady11

and

# url.txt
https://www.facebook.com/102940249227144/photos/pcb.109374295250406/109374168583752/?type=3&theater
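(For comparison, the same URL can be passed straight to --url as long as it is single-quoted so the shell doesn't interpret the & and ? characters; the collection name below is just illustrative.)

# alternative: single-quote the full URL on the command line
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
--url 'https://www.facebook.com/102940249227144/photos/pcb.109374295250406/109374168583752/?type=3&theater' \
--scopeType page --combineWarc --collection facebook-lady12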

Another post, showing building damage, goes into url.txt the same way:

https://www.facebook.com/permalink.php?story_fbid=pfbid02XiNQpuUtgmovNKF6E5N6xmoxn7LwBjY1T5E9SjyFH4BuRXKtrk8t5Gdxmi4ve5Gdl&id=102940249227144

docker run -v $PWD/crawls:/crawls/ -v $PWD/url.txt:/app/url.txt -it webrecorder/browsertrix-crawler crawl --urlFile /app/url.txt --scopeType page --combineWarc --collection building14

If I put this in while not logged in, all my CPU cores peg.

Restarting Docker and deleting containers doesn't work; I have to restart the bash session. Sometimes it does work. Trying to stop it using --rm below:

# remove the container after it stops
docker run --rm -v $PWD/crawls:/crawls/ -v $PWD/url.txt:/app/url.txt -it webrecorder/browsertrix-crawler crawl --urlFile /app/url.txt --scopeType page --combineWarc --collection building14


Load Error: Navigation timeout of 90000 ms exceeded

Page Load Failed: Reason: Error: Timeout hit: 180000

Essentially that is 2 times the --timeout default of 90 seconds.

And it doesn't get the images!
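When a crawl pegs the CPU like this and won't stop cleanly, one option is to stop the container directly from another terminal (a sketch; the container ID is whatever docker ps shows):

# find and stop the runaway crawler container
docker ps
docker stop <container-id>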

Facebook - building damage - logged in using profiles feature

With Browsertrix profiles, a file is made containing the browser state, including temporary session cookies.

#fb login create profile
docker run -p 9222:9222 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles/ -it webrecorder/browsertrix-crawler create-login-profile --interactive --url "https://facebook.com/"

It creates a snapshot of the browser state - a roughly 16MB profile stored in /crawls/profiles/profile.tar.gz.
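To sanity-check what went into the profile, the tarball can be listed (path as above):

# peek inside the saved browser profile
tar -tzf crawls/profiles/profile.tar.gz | head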

# run archive on page using logged in profile
# and 30 sec timeout
docker run -v $PWD/crawls:/crawls/ -v $PWD/url.txt:/app/url.txt --rm -it webrecorder/browsertrix-crawler crawl --urlFile /app/url.txt --scopeType page --combineWarc  --profile /crawls/profiles/profile.tar.gz --timeout 30 --collection lady20

Success - we have the images we want.

But the problem here is that we also get other images from other posts on the logged-in user's timeline.

https://www.facebook.com/khitthitnews/posts/pfbid0PTvT6iAccWqatvbDQNuqpFwL5WKzHuLK4QjP97Fwut637CV3XXQU53z1s2bJMAKwl

If we use the photos URL instead, it works well:

https://www.facebook.com/khitthitnews/photos/pcb.1646726145764725/1646726009098072/

Facebook - accept all cookies - works

However, a better way is to just accept the cookies; then we don't get the rest of the timeline posts. So use the profile feature to just accept the cookies:

# fb accept cookies 
docker run -p 9222:9222 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles/ -it webrecorder/browsertrix-crawler create-login-profile --interactive --url "https://facebook.com/"

# **THIS IS NEWER AND WORKS**
docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles/ -it webrecorder/browsertrix-crawler create-login-profile --url "https://example.com/"

then run the crawl with the profile:

docker run -v $PWD/crawls:/crawls/ -v $PWD/url.txt:/app/url.txt --rm -it webrecorder/browsertrix-crawler crawl --urlFile /app/url.txt --scopeType page --combineWarc  --profile /crawls/profiles/profile.tar.gz --timeout 30 --collection 40

OLD

Twitter

https://twitter.com/bellingcat/status/1572958778177515520 will not work; it stalls with the message: waitForNetworkIdle timed out, ignoring

# won't work
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
--url https://twitter.com/bellingcat/status/1572958778177515520 \
--generateWACZ --text --collection a

docker run -v $PWD/crawls/profiles:/crawls/profiles -it webrecorder/browsertrix-crawler create-login-profile --url "https://twitter.com/login"

Specs

https://github.com/webrecorder/specs - the Use Cases for Decentralised Web Archives document there is relevant.

https://specs.webrecorder.net/use-cases/0.1.0/#researcher-saves-an-article is a nicer view of the above.

Archiveweb.page (Chrome extension)

https://archiveweb.page/en/install/

This can save to the .wacz file format.
