Set up ArchiveBox on RHEL-compatible distros

2023-02-15

From their website: ArchiveBox “is a powerful, self-hosted internet archiving solution to collect, save, and view sites you want to preserve offline.” It offers a command-line tool, web service, and desktop apps for Linux, macOS, and Windows.

There are several ways to install ArchiveBox. The developers recommend installing it with docker-compose, but this gets a little cumbersome when you’re running a RHEL-compatible Linux distro that has strong opinions on container management and prefers Podman over the “old way” (aka Docker). I’ve personally found it easier to install it with Python’s pip tool and have my web server reverse proxy the ArchiveBox server.

Prerequisites

Preferably a filesystem with compression and deduplication capabilities, such as Btrfs or ZFS, but any journaling filesystem will work fine if you have another way to back up the archives.
A minimum of 500 MB of RAM, but 2 GB or more is recommended for Chrome-based archiving methods.
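
If you want to check both prerequisites before going further, something like the following sketch should do. It assumes a Linux host with /proc/meminfo and GNU df; ARCHIVE_DIR is a hypothetical variable name for wherever you plan to keep the archive, and it defaults to the current directory.

```shell
#!/usr/bin/env bash
# Sketch: report total RAM and the filesystem type backing the archive location.
# ARCHIVE_DIR is a hypothetical variable; it defaults to the current directory.
ARCHIVE_DIR="${ARCHIVE_DIR:-.}"

# MemTotal is reported in kB; convert it to MB
mem_mb=$(awk '/^MemTotal:/ {printf "%d", $2 / 1024}' /proc/meminfo)

# GNU df prints a header line first, then the filesystem type
fs_type=$(df --output=fstype "$ARCHIVE_DIR" | tail -n 1)

echo "Total RAM: ${mem_mb} MB"
echo "Filesystem type for ${ARCHIVE_DIR}: ${fs_type}"

if [ "$mem_mb" -lt 2048 ]; then
    echo "Less than 2 GB of RAM: Chrome-based archiving may struggle."
fi
```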

Installing dependencies

To get started, we’ll install Pip and the Python development package:

sudo dnf install python3-pip python3-devel

Next, we’ll install required dependencies, some of which may already be available on your system:

sudo dnf install wget curl git

Then, we’ll install optional dependencies. If you want to use Chrome-based archiving methods, such as fetching PDFs, screenshots, and the DOM of web pages, you’ll need to install the chromium package. If you want to archive YouTube videos, you’ll need the youtube-dl package. If you’re on a RHEL-compatible distro that is not Fedora, you’ll need to install the epel-release package first to get chromium and youtube-dl:

sudo dnf install epel-release
sudo dnf install chromium youtube-dl
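
To see which of these dependencies actually ended up on your PATH, a small loop like the one below can help. This is only a sketch: check_deps is a hypothetical helper name, and the binary names are assumptions (in particular, the chromium package may install its binary as chromium-browser rather than chromium).

```shell
#!/usr/bin/env bash
# Sketch: report which of the given binaries are available on PATH.
# check_deps is a hypothetical helper; it returns the number of missing binaries.
check_deps() {
    local bin missing=0
    for bin in "$@"; do
        if command -v "$bin" >/dev/null 2>&1; then
            echo "found:   $bin"
        else
            echo "missing: $bin"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# The binary names below are assumptions; adjust them to match your distro
check_deps wget curl git chromium-browser youtube-dl \
    || echo "Some dependencies are missing."
```

Only wget, curl, and git are required, so a missing chromium-browser or youtube-dl just means the corresponding archiving methods won’t be available.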

Create the archivebox user

Create a dedicated user to run the ArchiveBox server:

sudo useradd -c "ArchiveBox" -m -s /bin/bash -U archivebox

Now switch to that user and install ArchiveBox with pip. The --user flag installs it into /home/archivebox/.local/bin, so sudo isn’t needed for the install itself:

sudo su - archivebox
pip3 install --user archivebox

Initializing the ArchiveBox database

Create a directory in the archivebox user’s home directory to store the archive data:

mkdir data
cd data

Run the initialization:

archivebox init --setup

The setup wizard will prompt you to enter a username, email address, and password. These credentials let you log in to your ArchiveBox web dashboard.

Now we need to create a systemd service for the ArchiveBox server. Exit the archivebox user’s shell and, as a user with sudo privileges, save the following as /etc/systemd/system/archivebox.service:

[Unit]
Description=ArchiveBox server
After=network.target network-online.target
Requires=network-online.target

[Service]
Type=simple
User=archivebox
Group=archivebox
Restart=always
ExecStart=/home/archivebox/.local/bin/archivebox server 0.0.0.0:8000
WorkingDirectory=/home/archivebox/data

[Install]
WantedBy=multi-user.target

Reload the daemons:

sudo systemctl daemon-reload

Enable and start the archivebox.service:

sudo systemctl enable --now archivebox.service

If you’re running a web server already, you can reverse proxy the archivebox server on port 8000. I use Caddy, so this is what I have in my Caddyfile:

archive.hyperreal.coffee {
    reverse_proxy localhost:8000
}

If you’re not already running a web server, then you might need to open port 8000 in your firewalld’s default zone:

sudo firewall-cmd --zone=public --permanent --add-port=8000/tcp
sudo firewall-cmd --reload

You should now be able to access your ArchiveBox instance from your web server domain or from your localhost by pointing your web browser at http://localhost:8000. Familiarize yourself with the ArchiveBox web dashboard.
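
If the page doesn’t load, it helps to test from the server itself before blaming the firewall or the reverse proxy. Here’s a minimal sketch using curl; check_http is a hypothetical helper name, and the URL assumes the server is listening on port 8000 as configured above.

```shell
#!/usr/bin/env bash
# Sketch: check whether an HTTP endpoint answers at all.
# check_http is a hypothetical helper; it returns curl's exit status.
check_http() {
    local url="$1"
    # -f: fail on HTTP errors (4xx/5xx), -sS: quiet but still print errors,
    # --max-time: give up after 5 seconds instead of hanging
    curl -fsS --max-time 5 -o /dev/null "$url"
}

if check_http "http://localhost:8000"; then
    echo "ArchiveBox is answering on port 8000."
else
    echo "No response. Try: sudo systemctl status archivebox.service"
fi
```

If this works locally but the domain doesn’t, the problem is most likely in the reverse proxy or firewall rather than in ArchiveBox itself.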

Here are a few examples of what you can do with the ArchiveBox command-line tool:

archivebox add "https://techne.hyperreal.coffee"
archivebox add < ~/Downloads/bookmarks.html
curl https://example.com/some/rss/feed.xml | archivebox add

You can specify a depth of 1 if you also want to archive every URL linked from the page at the given URL:

archivebox add --depth=1 https://example.com/some/feed.RSS

You can run archivebox on a cron schedule:

archivebox schedule --every=day --depth=1 http://techrights.org/feed/
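
To confirm the job was registered, you can look for it in the crontab. The sketch below assumes the schedule command wrote its entry to the crontab of the user who ran it (the archivebox user, if you’ve followed along); list_archivebox_jobs is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: show crontab entries for the current user that mention archivebox.
# Assumes the scheduled job lives in this user's crontab.
list_archivebox_jobs() {
    crontab -l 2>/dev/null | grep -i "archivebox"
}

list_archivebox_jobs || echo "No archivebox cron entries found for $(id -un)."
```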

‘Dassit! Enjoy ArchiveBox :-)

