There are many good reasons for wanting to save a copy of a web page for later, offline use. There are also numerous tools that specialize in helping people do this.
One potential drawback of these web clippers is that they often push you towards using a specific piece of software tied to the web clipping tool. Also, getting your saved content out of the associated software may not be as easy as you would hope.
However, there are more modular, customizable solutions that rely only on foundational technologies of the Web and GNU/Linux.
Note: If you are not familiar with the GNU/Linux command line interface, or you intend to use a script obtained from this site, review the Conventions page before proceeding.
The Output Format
With regard to text processing, the Portable Document Format (PDF) is often a poor input format, but it excels as a flexible output format. Whatever operating system you use, it likely has built-in tools that can read and search PDF files.
Being able to easily convert a web page into a PDF file (preferably in a scriptable, command line-based manner) would get us closer to creating a web page saving pipeline. The Chromium web browser can be used in a headless manner to accomplish this.
Here is the description of the chromium Debian package:
$ apt show chromium
Package: chromium
Version: 145.0.7632.75-1~deb13u1
Priority: optional
Section: web
Maintainer: Debian Chromium Team <chromium@packages.debian.org>
Installed-Size: 294 MB
Provides: gnome-www-browser, www-browser
Depends: libasound2t64 (>= 1.0.17), libatk-bridge2.0-0t64 (>= 2.5.3), libatk1.0-0t64 (>= 2.32.0), libatspi2.0-0t64 (>= 2.9.90), libc6 (>= 2>
Recommends: chromium-sandbox
Suggests: chromium-l10n, chromium-shell, chromium-driver
Conflicts: libgl1-mesa-swx11, libnettle4, libsecret-1-0 (<< 0.18)
Breaks: chromium-lwn4chrome (<= 1.0-2), chromium-tt-rss-notifier (<= 0.5.2-2)
Homepage: http://www.chromium.org/Home
Download-Size: 81.8 MB
APT-Sources: http://security.debian.org/debian-security trixie-security/main amd64 Packages
Description: web browser
Web browser that aims to build a safer, faster, and more stable internet
browsing experience.
.
This package contains the web browser component.
Notice: There is 1 additional record. Please use the '-a' switch to see it
You can use chromium to convert a web page to a PDF file like so:
$ chromium \
--headless \
--incognito \
--print-to-pdf="${HOME}/Downloads/psf_mission.pdf" \
'https://www.python.org/psf/mission/'
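Because the command is scriptable, it can be wrapped in a small shell function. The following is a minimal sketch, not part of the original article: the function name page_to_pdf is hypothetical, and it assumes the output should land in ${HOME}/Downloads as in the command above.

```shell
# page_to_pdf NAME URL (hypothetical helper)
# Writes ${HOME}/Downloads/NAME.pdf using the same flags as the command above.
page_to_pdf() {
    local name="$1"
    local url="$2"

    # Both arguments are required.
    if [ -z "$name" ] || [ -z "$url" ]; then
        echo "usage: page_to_pdf NAME URL" >&2
        return 2
    fi

    chromium \
        --headless \
        --incognito \
        --print-to-pdf="${HOME}/Downloads/${name}.pdf" \
        "$url"
}
```

With the function defined in your shell, the earlier example becomes: page_to_pdf psf_mission 'https://www.python.org/psf/mission/'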
Command Input
Previously, we had to construct a chromium command to get our output file. Specifically, we needed to provide at least two arguments:
- The web page address
- The name of the output PDF file
If we plan on building a pipeline for saving web pages as PDF files, we should have an easier way of getting this information. A little bit of JavaScript can help, in the form of a bookmarklet:
javascript:(
  function () {
    const new_title = document.title.toLowerCase()
      .replace(/[^a-zA-Z0-9]/g, "_")
      .replace(/_{2,}/g, "_");
    alert(`${new_title} ${window.location.href}`);
  }
)();
The bookmarklet above:
- Grabs a web page's title
- Converts it to lowercase
- Cleans up the title to make it suitable as a filename
- Concatenates the file name with the web page's address
So, for the prior web page example, if you navigate to https://www.python.org/psf/mission/ and run the bookmarklet, you should see an alert box containing a string like this:
mission_python_software_foundation https://www.python.org/psf/mission/
Now, you can easily produce and copy the arguments required for the chromium command to convert a web page into a PDF file.
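To see how the title cleanup behaves, here is a shell approximation of the bookmarklet's two replace steps. It is only a sketch: the function name clean_title is hypothetical, the example title is an assumption chosen to match the output shown above, and unusual Unicode titles may be handled slightly differently than in JavaScript.

```shell
# clean_title TITLE (hypothetical helper)
# Lowercases the title, turns every non-alphanumeric byte into an underscore,
# then collapses runs of underscores into a single one.
clean_title() {
    printf '%s' "$1" \
        | LC_ALL=C tr 'A-Z' 'a-z' \
        | LC_ALL=C sed -E 's/[^a-z0-9]/_/g; s/_{2,}/_/g'
}

# Assumed example title; prints: mission_python_software_foundation
clean_title 'Mission | Python Software Foundation'
```

Like the bookmarklet, this sketch does not trim a leading or trailing underscore, so a title that starts or ends with punctuation keeps one.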
The Script
You may find that continually creating chromium commands is cumbersome. Being able to easily create the command's arguments helps, but a more streamlined solution would be preferable.
A Bash script can help automate the chromium part of the pipeline, as well as address the few cases where chromium is not successful in its task.
For example, you may occasionally come across a web page where chromium does not render the content at all, or the command takes too long to generate the output file (i.e., it hangs). For these rare cases, you may have to fall back to manually saving the page as a PDF file using your operating system's built-in PDF saving functionality.
The script accounts for this in two ways:
- If chromium can generate a PDF file within 30 seconds, it is opened in your PDF viewing application of choice for 5 seconds. This allows you to preview the output file to ensure that chromium successfully created satisfactory output.
- If chromium cannot generate a PDF file within 30 seconds, an error message is shown, and the web page's associated output filename (i.e., the cleaned up web page title) is placed on the system clipboard via wl-copy, which makes it a little easier to create the output file when you manually save the web page as a PDF file via your operating system.
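The two cases above can be sketched as a single shell function. This is a minimal sketch under stated assumptions, not the original script: the function name save_or_fallback and the PDF_VIEWER variable are hypothetical, evince is only a placeholder for your viewer of choice (it must stay in the foreground until closed), and the time limits rely on timeout from GNU coreutils.

```shell
# save_or_fallback NAME URL (hypothetical helper, not the original script)
save_or_fallback() {
    local name="$1"
    local url="$2"
    local out="${HOME}/Downloads/${name}.pdf"
    # PDF_VIEWER is an assumed variable; evince is only a placeholder default.
    local viewer="${PDF_VIEWER:-evince}"

    # Give chromium at most 30 seconds, then check that a non-empty file exists.
    if timeout 30 chromium --headless --incognito \
            --print-to-pdf="$out" "$url" \
        && [ -s "$out" ]; then
        # Show the result for 5 seconds; timeout then ends the viewer process.
        timeout 5 "$viewer" "$out" || true
        return 0
    fi

    echo "error: chromium did not produce ${out}" >&2
    # Put the cleaned-up title on the Wayland clipboard for a manual save.
    printf '%s' "$name" | wl-copy
    return 1
}
```

The check for a non-empty output file is an extra safeguard added in this sketch; the exit status of timeout alone already distinguishes a hang (status 124) from a normal exit.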
Now, you should be able to start the script, use the JavaScript bookmarklet to generate and copy the chromium arguments, and then paste them into the terminal emulator running the script.
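The paste-into-the-terminal step can be sketched as a read loop that splits each pasted line into the filename and the address. This is an illustration only: the function name process_lines is hypothetical, and the handler it calls stands in for whatever command performs the conversion.

```shell
# process_lines HANDLER (hypothetical helper)
# Reads "name url" lines from standard input and runs HANDLER NAME URL for each.
process_lines() {
    local handler="$1"
    local name url

    while read -r name url; do
        # Skip blank lines and lines that lack an address.
        if [ -z "$name" ] || [ -z "$url" ]; then
            echo "skipping malformed input" >&2
            continue
        fi
        # Redirect stdin so the handler cannot swallow the remaining input lines.
        "$handler" "$name" "$url" </dev/null
    done
}
```

Running process_lines echo and pasting the bookmarklet's output simply prints the two fields back, which is a quick way to check the parsing before plugging in a real handler. Press Ctrl+D to end the loop.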
FLOSS Tools Maintain Data Portability
Free/Libre Open Source Software (FLOSS) helps you create flexible, sustainable solutions. Creating your own solution from pre-existing FLOSS tools may not always be the fastest path, but it is a rewarding experience and, importantly, helps you maintain control over the data that your solution generates.