Set up ArchiveBox on RHEL-compatible distros

published on 2023-02-15 by hyperreal

From their website: ArchiveBox “is a powerful, self-hosted internet archiving solution to collect, save, and view sites you want to preserve offline.” It offers a command-line tool, web service, and desktop apps for Linux, macOS, and Windows.

There are several ways to install ArchiveBox. The developers recommend installing it with docker-compose, but this gets a little cumbersome when you’re running a RHEL-compatible Linux distro that has strong opinions on container management and prefers Podman over the “old way” (aka Docker). I’ve personally found it easier to install it with Python’s Pip tool and have my web server reverse proxy the ArchiveBox server.

Prerequisites

  • Preferably a filesystem with compression and deduplication capabilities, such as BTRFS or ZFS, but any journaling filesystem will work fine if you have another way to back up the archives.
  • Minimum of 500 MB of RAM, but 2 GB or more is recommended for Chrome-based archiving methods.
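
To see where a host stands against these prerequisites, we can read the total RAM and the filesystem type from the shell. This is only a minimal sketch: it assumes a Linux host with /proc/meminfo and GNU df, and the function names total_ram_mb and fs_type are made up for illustration.

```shell
# Total RAM in MB, read from /proc/meminfo (Linux only).
total_ram_mb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' /proc/meminfo
}

# Filesystem type backing a directory (defaults to the current directory).
# Relies on GNU df's --output option.
fs_type() {
    df --output=fstype "${1:-.}" | tail -n 1
}

ram_mb=$(total_ram_mb)
echo "Total RAM: ${ram_mb} MB"
echo "Filesystem under ${HOME}: $(fs_type "$HOME")"

# 2048 MB is the recommended figure for Chrome-based archiving.
if [ "$ram_mb" -lt 2048 ]; then
    echo "Under 2 GB of RAM: Chrome-based archiving methods may struggle."
fi
```

A filesystem type of btrfs or zfs in the output means compression and deduplication are at least available, though whether they are enabled is a separate question.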

Installing dependencies

To get started, we’ll install Pip and the Python development package:

sudo dnf install python3-pip python3-devel

Next, we’ll install required dependencies, some of which may already be available on your system:

sudo dnf install wget curl git

Then, we’ll install optional dependencies. If you want to use Chrome-based archiving methods, such as fetching PDFs, screenshots, and the DOM of web pages, you’ll need to install the chromium package. If you want to archive YouTube videos, you’ll need the youtube-dl package. If you’re on a RHEL-compatible distro that is not Fedora, you’ll need to install the epel-release package first to get chromium and youtube-dl:

sudo dnf install epel-release
sudo dnf install chromium youtube-dl
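
Once the packages are installed, a quick loop over command -v shows which tools the shell can actually find. This is a sketch rather than anything ArchiveBox ships: check_deps is a hypothetical helper, and the binary names for the optional tools are assumptions (depending on the package, Chromium may be installed as chromium-browser).

```shell
# Report whether each named command is on PATH.
# Returns non-zero if at least one is missing.
check_deps() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found:   $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# Required tools from the dnf command above.
check_deps wget curl git || echo "Some required tools are missing."

# Optional tools. The binary names are assumptions: depending on the
# package, Chromium may be installed as chromium-browser instead.
check_deps chromium youtube-dl || echo "Some optional tools are missing."
```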

Rather than running Pip with sudo to install ArchiveBox system-wide, we’ll install it under a dedicated user account, as described in the next section.

Create the archivebox user

Create a user to run the archivebox server.

sudo useradd -c "ArchiveBox" -m -s /bin/bash -U archivebox

Log in to the archivebox user account and install ArchiveBox with pip:

sudo su - archivebox
pip install --user archivebox
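
With --user, pip places the archivebox executable in ~/.local/bin. If the shell can’t find the command afterwards, that directory is probably missing from PATH. Here is a small sketch to check; path_contains is a hypothetical helper, not part of pip or ArchiveBox.

```shell
# Return 0 if the given directory is one of the entries in PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if path_contains "$HOME/.local/bin"; then
    echo "$HOME/.local/bin is on PATH"
else
    echo "$HOME/.local/bin is not on PATH; add it in ~/.bashrc, for example:"
    echo '  export PATH="$HOME/.local/bin:$PATH"'
fi

# Confirm the executable resolves, without assuming it is installed.
command -v archivebox || echo "archivebox not found on PATH yet"
```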

Initializing the ArchiveBox database

Create a directory in the archivebox user’s home directory to store the archive data:

mkdir data
cd data

Run the initialization:

archivebox init --setup

The setup wizard will prompt you to enter a username, email address, and password. These credentials let you log in to your ArchiveBox web dashboard.
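
To confirm that the initialization actually populated the data directory, we can look for the files it leaves behind. This sketch assumes that init creates an index.sqlite3 database and an archive/ subdirectory, and that the data directory is ~/data as above; is_archivebox_dir is a hypothetical helper.

```shell
# Return 0 if the directory contains what ArchiveBox is assumed to
# create on init: the index.sqlite3 database and the archive/ folder.
is_archivebox_dir() {
    [ -f "$1/index.sqlite3" ] && [ -d "$1/archive" ]
}

data_dir="$HOME/data"
if is_archivebox_dir "$data_dir"; then
    echo "$data_dir looks like an initialized ArchiveBox collection"
else
    echo "$data_dir is missing index.sqlite3 or archive/; rerun archivebox init there"
fi
```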

Now we need to create a systemd service for the ArchiveBox server. Save the following as /etc/systemd/system/archivebox.service:

[Unit]
Description=Archivebox server
After=network.target network-online.target
Requires=network-online.target

[Service]
Type=simple
User=archivebox
Group=archivebox
Restart=always
ExecStart=/usr/bin/bash -c '/home/archivebox/.local/bin/archivebox server 0.0.0.0:8000'
WorkingDirectory=/home/archivebox/data

[Install]
WantedBy=multi-user.target

Reload systemd so it picks up the new unit file:

sudo systemctl daemon-reload

Enable and start the archivebox.service:

sudo systemctl enable --now archivebox.service

If you’re running a web server already, you can reverse proxy the archivebox server on port 8000. I use Caddy, so this is what I have in my Caddyfile:

archive.hyperreal.coffee {
    reverse_proxy 127.0.0.1:8000
}

If you’re not already running a web server, then you might need to open port 8000 in firewalld. The commands below assume the public zone:

sudo firewall-cmd --zone=public --permanent --add-port=8000/tcp
sudo firewall-cmd --reload

You should now be able to access your ArchiveBox instance through your web server’s domain, or locally by pointing your web browser at http://localhost:8000. Familiarize yourself with the ArchiveBox web dashboard.
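
If the page doesn’t load, it helps to first check whether anything is listening on port 8000 at all. The sketch below uses bash’s /dev/tcp redirection together with timeout from coreutils; port_open is a hypothetical helper, and a successful connection only shows that some process is listening, not that it is ArchiveBox.

```shell
# Return 0 if a TCP connection to host:port succeeds within 2 seconds.
# Uses bash's /dev/tcp redirection, so bash must be installed.
port_open() {
    timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

if port_open 127.0.0.1 8000; then
    echo "Something is listening on port 8000"
else
    echo "Nothing is answering on port 8000; check the service logs with:"
    echo "  sudo journalctl -u archivebox.service"
fi
```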

Here are a few examples of what you can do with the ArchiveBox command-line tool. Run them as the archivebox user from inside the data directory:

archivebox add "https://techne.hyperreal.coffee"
archivebox add < ~/Downloads/bookmarks.html
curl https://example.com/some/rss/feed.xml | archivebox add
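
Since archivebox add reads URLs from standard input, a plain text list can be piped in the same way. The sketch below assumes a hand-maintained list with one URL per line and # for comments; the clean_urls helper, the temporary file, and the URLs are all made up for illustration.

```shell
# Print the lines of a URL list, skipping blank lines and # comments.
clean_urls() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Build a small example list (hypothetical file and URLs).
list=$(mktemp)
cat > "$list" <<'EOF'
# sites to archive
https://example.com/

https://example.org/page
EOF

clean_urls "$list"
# To archive them, pipe the output from inside the data directory:
#   clean_urls "$list" | archivebox add
```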

You can set the depth to 1 if you also want to archive every URL linked from the given page or feed:

archivebox add --depth=1 https://example.com/some/feed.RSS

You can run archivebox on a cron schedule:

archivebox schedule --every=day --depth=1 http://techrights.org/feed/

’Dassit! Enjoy ArchiveBox :-)