In this article I’m going to discuss the technical problems of getting public information directly from a website, i.e. by scraping its pages rather than using an API.

My specific use case is that I work with a human rights organisation that documents human rights issues. Social media images and videos are often found that can be used as evidence in the future.

The popular networks differ by region, e.g. Ukraine or Myanmar. I’m more focussed on Myanmar, so in order of importance they are:

Twitter
Facebook
Telegram
YouTube

Investigators find the social media posts and then have to:

Create a new row in a Google Sheet
Enter the URL into the Link column in the spreadsheet
Manually download the image or video from the social media site
Create a new folder in Google Drive with a Case Number
Copy downloaded media to Google Drive
Make a hash of the image to prove it hasn’t been altered after that date
Tweet the hash (treating Twitter like a blockchain, as the tweet should never disappear)
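
The hashing step above can be sketched in Python. This is a minimal sketch, assuming SHA-256 as the hash algorithm (the manual workflow does not say which one is used); the function name `hash_file` is hypothetical.

```python
import hashlib
from pathlib import Path


def hash_file(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks.

    Assumption: SHA-256 is the algorithm; the workflow does not name one.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # Read in chunks so large videos do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `hash_file("photo.jpg")` returns a 64-character hex string, short enough to fit in a tweet. Changing even a single byte of the file produces a different digest, which is what makes the published hash useful as proof.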

Later on there are checks, e.g. whether a hash has been duplicated.
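
One way such a duplicate check could look is sketched below. It assumes the records are available as (case number, hash) pairs; that shape, the function name and the sample values are hypothetical and only for illustration.

```python
from collections import defaultdict


def find_duplicate_hashes(records):
    """Map each hash seen more than once to the case numbers that share it.

    `records` is an iterable of (case_number, file_hash) pairs; this shape
    is an assumption for illustration.
    """
    cases_by_hash = defaultdict(list)
    for case_number, file_hash in records:
        cases_by_hash[file_hash].append(case_number)
    # Keep only hashes that appear under more than one record.
    return {h: cases for h, cases in cases_by_hash.items() if len(cases) > 1}


records = [("C-001", "aa11"), ("C-002", "bb22"), ("C-003", "aa11")]
print(find_duplicate_hashes(records))  # {'aa11': ['C-001', 'C-003']}
```

A duplicated hash means the same file was archived under two cases, which may be fine but is worth a second look.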

With some automation, the investigator only has to manually:

Create a new row and enter the case number
Enter the URL into a special link column

The automation does:

Script reads the URL from the sheet
Downloads the image(s) or video(s) from the website, e.g. Facebook
Creates a new folder in Google Drive (todo)
Saves original media to cloud storage in case number folder (todo)
Enters the archive location URL of the Google cloud file in the spreadsheet
Takes a screenshot of the social media site
Creates thumbnails of videos
Records the duration of the video in seconds
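
The first step, deciding which rows the script should pick up, can be sketched as follows. This is only a sketch under assumptions: the rows are taken to be a list of dicts keyed by column header (the shape that, for example, gspread's `get_all_records()` returns), a row counts as new when its archive column is empty, and the column names `Link` and `Archive location` are hypothetical.

```python
def rows_to_archive(rows, link_col="Link", archive_col="Archive location"):
    """Yield (row_number, url) for rows that have a URL but no archive entry.

    `rows` is a list of dicts keyed by column header, one per sheet row.
    Row numbers start at 2 on the assumption that row 1 holds the headers.
    The column names are hypothetical.
    """
    for row_number, row in enumerate(rows, start=2):
        url = str(row.get(link_col, "")).strip()
        archived = str(row.get(archive_col, "")).strip()
        if url and not archived:
            yield row_number, url
```

Returning the row number alongside the URL lets the later steps write the archive location, thumbnails and duration back into the same row.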

Technologies used in the auto-archiver from Bellingcat are:

Google Drive and Docs APIs to read from and write to the DB (the Google Sheet)
Python 3.9 app running on a server as a cron job every minute to check for new rows (rows where the archive column is blank)
snscrape (https://github.com/JustAnotherArchivist/snscrape, 1.1k stars) for Twitter
FFmpeg for creating thumbnails, and yt-dlp for downloading videos
Firefox and Geckodriver for screenshots 5.8k stars
Digital Ocean Spaces - S3 compatible storage
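
To give a feel for the FFmpeg part, here is a rough sketch of how thumbnails and the video duration could be produced from Python. It is not necessarily how the auto-archiver itself invokes FFmpeg: the function names, the one-frame-every-10-seconds default and the output file pattern are all assumptions, and the duration helper needs FFmpeg (with `ffprobe`) installed.

```python
import subprocess


def thumbnail_command(video_path, out_pattern, every_seconds=10):
    """Build an ffmpeg command that saves one frame every `every_seconds` seconds."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps=1/{every_seconds}",  # one output frame per N seconds
        out_pattern,  # e.g. "thumb_%03d.jpg" (hypothetical naming)
    ]


def duration_command(video_path):
    """Build an ffprobe command that prints only the duration in seconds."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        video_path,
    ]


def video_duration_seconds(video_path):
    """Run ffprobe and return the duration as a float (requires FFmpeg)."""
    result = subprocess.run(
        duration_command(video_path), capture_output=True, text=True, check=True
    )
    return float(result.stdout.strip())
```

Building the commands as lists and passing them to `subprocess.run` avoids shell quoting problems with odd filenames, which are common in downloaded social media files.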

Bellingcat auto-archiver write up