Set up ArchiveBox on RHEL-compatible distros
published on 2023-02-15 by hyperreal
From their website: ArchiveBox “is a powerful, self-hosted internet archiving solution to collect, save, and view sites you want to preserve offline.” It offers a command-line tool, web service, and desktop apps for Linux, macOS, and Windows.
There are several ways to install ArchiveBox. The developers recommend installing it with docker-compose, but this gets a little cumbersome when you’re running a RHEL-compatible Linux distro that has strong opinions on container management and prefers Podman over the “old way” (aka Docker). I’ve personally found it easier to install it with Python’s Pip tool and have my web server reverse proxy the ArchiveBox server.
Prerequisites
- Preferably a filesystem with compression and deduplication capabilities, such as BTRFS or ZFS, but any journaling filesystem will work fine if you have another way to back up the archives.
- Minimum of 500 MB of RAM, but 2 GB or more is recommended for Chrome-based archiving methods.
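A quick way to check a machine against these prerequisites is sketched below. It assumes a Linux host with /proc/meminfo and GNU coreutils' stat. The helper names (mem_total_mb, fs_type, ram_ok_for_chrome) are illustrative and not part of ArchiveBox.

```shell
# Print total RAM in MB, parsed from a meminfo-style file
# (defaults to /proc/meminfo, where MemTotal is reported in kB).
mem_total_mb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Print the filesystem type backing a path (GNU stat syntax).
fs_type() {
    stat -f -c %T "${1:-.}"
}

# Hypothetical helper: succeed when the given amount of RAM (in MB)
# meets the 2 GB recommended for Chrome-based archiving.
ram_ok_for_chrome() {
    [ "$1" -ge 2048 ]
}
```

For example, `ram_ok_for_chrome "$(mem_total_mb)" || echo "Less than 2 GB of RAM"` warns about low memory, and `fs_type /home` shows what kind of filesystem the archive would live on.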
Installing dependencies
To get started, we’ll install Pip and the Python development package:
sudo dnf install python3-pip python3-devel
Next, we’ll install required dependencies, some of which may already be available on your system:
sudo dnf install wget curl git
Then, we’ll install optional dependencies. If you want to use chrome-based archiving methods, such as fetching PDFs, screenshots, and the DOM of web pages, you’ll need to install the Chromium package. If you want to archive YouTube videos, you’ll need the youtube-dl package. If you’re on a RHEL-compatible distro that is not Fedora, you’ll need to install the epel-release package to get chromium and youtube-dl:
sudo dnf install epel-release
sudo dnf install chromium youtube-dl
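Since some of these tools may already be on your system, it can help to see which ones are actually missing before installing anything. The following is a minimal sketch. missing_tools is a hypothetical helper, and the binary names you pass to it are assumptions that can differ between distros (the Chromium binary in particular may be called chromium-browser).

```shell
# Print each named command that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running `missing_tools wget curl git youtube-dl chromium-browser` prints nothing when every tool is available.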
ArchiveBox itself will be installed with Pip under a dedicated user account, which we’ll create next.
Create the archivebox user
Create a user to run the archivebox server.
sudo useradd -c "ArchiveBox" -m -s /bin/bash -U archivebox
Log in to the archivebox user account (it has no password yet, so go through sudo) and install ArchiveBox with pip:
sudo su - archivebox
pip install --user archivebox
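With --user, pip places the archivebox executable in ~/.local/bin, so that directory needs to be on the archivebox user's PATH for the commands that follow. A small check is sketched here. path_has_dir is a hypothetical helper, not something pip or ArchiveBox provides.

```shell
# Succeed if a directory is one of the colon-separated entries of a
# PATH-style string (defaults to the current $PATH).
path_has_dir() {
    case ":${2:-$PATH}:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For instance, `path_has_dir "$HOME/.local/bin" || export PATH="$HOME/.local/bin:$PATH"` adds the directory for the current session only when it is missing.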
Initializing the ArchiveBox database
Create a directory in the archivebox user’s home directory to store the archive data:
mkdir data
cd data
Run the initialization:
archivebox init --setup
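The setup wizard will prompt you to enter a username, email address, and password. This will allow you to log in to your ArchiveBox web dashboard.
Now we need to create a systemd service for the ArchiveBox server. Log out of the archivebox account first (exit), since the remaining steps need sudo, then save the following as /etc/systemd/system/archivebox.service: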
[Unit]
Description=Archivebox server
After=network.target network-online.target
Requires=network-online.target

[Service]
Type=simple
User=archivebox
Group=archivebox
Restart=always
ExecStart=/usr/bin/bash -c '/home/archivebox/.local/bin/archivebox server 0.0.0.0:8000'
WorkingDirectory=/home/archivebox/data

[Install]
WantedBy=multi-user.target
Reload the daemons:
sudo systemctl daemon-reload
Enable and start the archivebox.service:
sudo systemctl enable --now archivebox.service
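The server can take a moment to come up, so a small polling loop is one way to confirm that it is answering before you move on. This sketch assumes curl is installed (it was among the required dependencies) and that the server listens on port 8000. wait_for_http is a hypothetical helper name.

```shell
# Poll a URL until it answers or the attempts run out.
# Usage: wait_for_http URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_http() {
    url=$1
    attempts=${2:-10}
    delay=${3:-2}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f: treat HTTP errors as failures, -s: silent,
        # -o /dev/null: discard the response body
        if curl -fs -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

Try `wait_for_http http://localhost:8000 && echo "ArchiveBox is up"`. If it never succeeds, `sudo journalctl -u archivebox.service` shows the service log.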
If you’re running a web server already, you can reverse proxy the archivebox server on port 8000. I use Caddy, so this is what I have in my Caddyfile:
archive.hyperreal.coffee {
    reverse_proxy 0.0.0.0:8000
}
If you’re not already running a web server, then you might need to open port 8000 in your firewalld’s default zone:
sudo firewall-cmd --zone=public --permanent --add-port=8000/tcp
sudo firewall-cmd --reload
You should now be able to access your ArchiveBox instance from your web server domain or from your localhost by pointing your web browser at http://localhost:8000. Familiarize yourself with the ArchiveBox web dashboard.
Here are a few examples of what you can do with the ArchiveBox command-line tool:
archivebox add "https://techne.hyperreal.coffee"
archivebox add < ~/Downloads/bookmarks.html
curl https://example.com/some/rss/feed.xml | archivebox add
You can set the depth to 1 if you also want to archive every URL linked from the page at the given URL:
archivebox add --depth=1 https://example.com/some/feed.RSS
You can run archivebox on a cron schedule:
archivebox schedule --every=day --depth=1 http://techrights.org/feed/
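Because archivebox add reads URLs from standard input, as in the examples above, a hand-maintained list can be tidied up and piped straight in. This is a minimal sketch, assuming a hypothetical file urls.txt with one URL per line and # for comments. clean_url_list is an illustrative helper, not an ArchiveBox command.

```shell
# Read a URL list on stdin: trim surrounding whitespace, drop blank
# lines and '#' comments, and remove duplicates while keeping order.
clean_url_list() {
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        -e '/^$/d' -e '/^#/d' | awk '!seen[$0]++'
}
```

Run it from the data directory as the archivebox user, for example `clean_url_list < urls.txt | archivebox add`.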
’Dassit! Enjoy ArchiveBox :-)