There are many good reasons for wanting to save a copy of a web page for later, offline use. There are also numerous tools that specialize in helping people do this.
One potential drawback of these web clippers is that they often push you towards using a specific piece of software tied to the web clipping tool. Also, getting your saved content out of the associated software may not be as easy as you would hope.
However, there are more modular, customizable solutions that only rely on foundational technologies of the Web and GNU/Linux.
Note: If you are not familiar with the GNU/Linux command line interface, review the Conventions page before proceeding.
The Output Format
With regard to text processing, the Portable Document Format (PDF) is often not a very good input format, but it does excel at being a flexible output format. No matter what operating system you use, it is likely to have built-in tools that can read and search PDF files.
Being able to easily convert a web page into a PDF file (preferably in a scriptable, command line-based manner) would get us closer to creating a web page saving pipeline. The wkhtmltopdf program is a tool that can enable this kind of workflow.
Here is the description of the wkhtmltopdf Debian package:
$ apt show wkhtmltopdf
Package: wkhtmltopdf
Version: 0.12.6-1
Priority: optional
Section: utils
Maintainer: Emmanuel Bouthenot <kolter@debian.org>
Installed-Size: 805 kB
Depends: libc6 (>= 2.14), libgcc-s1 (>= 3.0), libqt5core5a (>= 5.14.1), libqt5gui5 (>= 5.2.0) | libqt5gui5-gles (>= 5.2.0), libqt5network5 (>= 5.14.1), libqt5printsupport5 (>= 5.2.0), libqt5svg5 (>= 5.6.0~beta), libqt5webkit5 (>= 5.212.0~alpha3), libqt5widgets5 (>= 5.0.2), libstdc++6 (>= 5)
Recommends: xserver | xvfb
Homepage: https://wkhtmltopdf.org/
Tag: implemented-in::c++, interface::graphical, interface::x11,
 role::program, uitoolkit::qt, use::converting, works-with-format::html,
 works-with::text, x11::application
Download-Size: 171 kB
APT-Manual-Installed: yes
APT-Sources: http://deb.debian.org/debian bullseye/main amd64 Packages
Description: Command line utilities to convert html to pdf or image using WebKit
 wkhtmltopdf is a command line program which permits one to create a
 pdf or an image from an url, a local html file or stdin. It produces a pdf or
 an image like rendered with the WebKit engine.
 .
 This program requires an X11 server to run.
 .
 It is not built against a forked version of Qt hence some options are not
 supported.
You can use wkhtmltopdf to convert a web page to a PDF file like so:
$ wkhtmltopdf \
'https://www.python.org/psf/mission/' \
'psf_mission.pdf'
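The package description notes that this build requires an X11 server. On a machine without a running display, one possible workaround is to run the command under a virtual framebuffer. The sketch below assumes the xvfb package (which provides xvfb-run) is installed; headless_pdf is a hypothetical wrapper name, not part of wkhtmltopdf.

```shell
#!/usr/bin/env bash
# Sketch: run wkhtmltopdf with or without a virtual X server.
# headless_pdf is a hypothetical helper name used only for illustration.
headless_pdf() {
    if [ -n "${DISPLAY:-}" ]; then
        # An X11 display is available, so call the program directly.
        wkhtmltopdf "$@"
    else
        # No display: xvfb-run starts a temporary virtual X server;
        # -a picks a free server number automatically.
        xvfb-run -a wkhtmltopdf "$@"
    fi
}

# Example call (needs network access, wkhtmltopdf, and xvfb):
# headless_pdf 'https://www.python.org/psf/mission/' 'psf_mission.pdf'
```

The wrapper passes all of its arguments through unchanged, so it can be used anywhere a plain wkhtmltopdf command would be.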
The default settings for wkhtmltopdf usually produce satisfactory output (hyperlinks are preserved as well), but you can always tweak the command to adjust how the output PDF file looks. Run man 1 wkhtmltopdf for more information on how to customize wkhtmltopdf commands.
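As an illustration of such tweaks, here is a minimal sketch that sets the page size, orientation, and margins. The option names come from the wkhtmltopdf manual page; the specific values and the save_page_a4 helper name are assumptions chosen for the example, and option availability can vary between builds.

```shell
#!/usr/bin/env bash
# Sketch: wrap wkhtmltopdf with a few layout options.
# save_page_a4 is a hypothetical helper name, not part of wkhtmltopdf.
save_page_a4() {
    local url="$1" out="$2"
    # --quiet suppresses progress output; the remaining options
    # control the paper size, orientation, and vertical margins.
    wkhtmltopdf \
        --quiet \
        --page-size A4 \
        --orientation Portrait \
        --margin-top 15mm \
        --margin-bottom 15mm \
        "$url" "$out"
}

# Example call (needs network access and wkhtmltopdf installed):
# save_page_a4 'https://www.python.org/psf/mission/' 'psf_mission_a4.pdf'
```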
Command Input
Previously, we had to construct a wkhtmltopdf command by hand to get our output file. Specifically, we needed to provide at least two arguments:
- The web page address
- The name of the output PDF file
If we plan on building a pipeline for saving web pages as PDF files, we should have an easier way of getting this information. A little bit of JavaScript can help, in the form of a bookmarklet:
javascript: (() => {
  const new_title = document.title.toLowerCase()
    .replace(/[^a-zA-Z0-9]/g, "_")
    .replace(/_{2,}/g, "_");
  alert(new_title + " " + window.location.href);
})();
The bookmarklet above:
- Grabs a web page's title
- Converts it to lowercase
- Cleans up the title to make it suitable as a filename
- Concatenates the file name with the web page's address
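The cleanup steps listed above can also be reproduced outside the browser, which is handy for checking what filename a given title would yield. The following is a sketch of the same transformation in the shell, not part of the bookmarklet itself; clean_title is a hypothetical helper name, and the sample title is an assumed value for the page used in this article.

```shell
#!/usr/bin/env bash
# Sketch: mirror the bookmarklet's title cleanup in the shell.
# clean_title is a hypothetical helper name used only for illustration.
clean_title() {
    # 1. Lowercase the title.
    # 2. Replace every character that is not a-z or 0-9 with "_".
    # 3. Collapse runs of two or more underscores into one.
    printf '%s\n' "$1" \
        | LC_ALL=C tr '[:upper:]' '[:lower:]' \
        | LC_ALL=C sed -E 's/[^a-z0-9]/_/g; s/_{2,}/_/g'
}

# Assumed page title, for illustration only.
clean_title 'Mission | Python Software Foundation'
# prints: mission_python_software_foundation
```

Like the bookmarklet, this sketch does not trim a leading or trailing underscore, so a title that starts or ends with punctuation keeps one.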
So, for the prior web page example, if you navigate to https://www.python.org/psf/mission/ and run the bookmarklet, you should see a string like this in an alert box:
mission_python_software_foundation https://www.python.org/psf/mission/
Now, you can easily produce and copy the arguments required for the wkhtmltopdf command to convert a web page into a PDF file.
The Script
You may find that continually creating wkhtmltopdf commands is cumbersome. Being able to easily create the command's arguments helps, but a more streamlined solution would be preferable.
A Bash script can help automate the wkhtmltopdf part of the pipeline, as well as address the few cases where wkhtmltopdf is not successful in its task:
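What follows is a minimal sketch of such a script rather than a definitive implementation. It follows the behavior described in this section (a 30-second conversion limit, a 5-second preview, and a clipboard fallback via xclip). The variable and function names, the default viewer (evince), and the prompt text are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch of a page-saving loop. All names below (CONVERTER, PDF_VIEWER,
# CONVERT_TIMEOUT, PREVIEW_SECONDS, save_page, main) are illustrative.

CONVERTER="${CONVERTER:-wkhtmltopdf}"
PDF_VIEWER="${PDF_VIEWER:-evince}"  # assumption: any viewer that takes a file argument
CONVERT_TIMEOUT=30                  # seconds allowed for the conversion
PREVIEW_SECONDS=5                   # seconds the preview stays open

save_page() {
    local name="$1" url="$2"
    local out="${name}.pdf"

    if timeout "$CONVERT_TIMEOUT" "$CONVERTER" "$url" "$out" && [ -s "$out" ]; then
        # Show the result briefly; timeout closes the viewer afterwards.
        timeout "$PREVIEW_SECONDS" "$PDF_VIEWER" "$out" >/dev/null 2>&1
        printf 'Saved %s\n' "$out"
        return 0
    fi

    # Conversion failed or hung: put the filename on the clipboard so it
    # can be pasted into the browser's own "save as PDF" dialog.
    printf 'Conversion of %s failed or timed out\n' "$url" >&2
    printf '%s' "$name" | xclip -selection clipboard
    return 1
}

main() {
    local name url
    # Each pasted line has the bookmarklet's form: "<cleaned_title> <url>".
    while read -r -p 'title and URL> ' name url; do
        [ -n "$url" ] && save_page "$name" "$url"
    done
}

# Start the interactive loop only when input comes from a terminal.
if [ -t 0 ]; then
    main
fi
```

Treating any non-zero exit status as a failure is a deliberately cautious choice in this sketch: a page that converts only partially ends up on the manual path instead of being saved silently.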
For example, you may occasionally come across a web page where wkhtmltopdf does not render the content at all, or the command appears to take too long to generate the output file (i.e., it hangs). For these rare cases, you may have to fall back to manually saving the page as a PDF file using your operating system's built-in PDF saving functionality.
The above script accounts for this in two ways:
- If wkhtmltopdf can generate a PDF file within 30 seconds, it is opened in your PDF viewing application of choice for 5 seconds. This allows you to preview the output file to ensure that wkhtmltopdf successfully created satisfactory output.
- If wkhtmltopdf cannot generate a PDF file within 30 seconds, an error message is shown, and the web page's associated output filename (i.e., the cleaned up web page title) is placed on the system clipboard via xclip, which makes it a little easier to create the output file when you manually save the web page as a PDF file via your operating system.
Now, you should be able to start the script, use the JavaScript bookmarklet to generate and copy the wkhtmltopdf arguments, and then paste them into the terminal emulator running the script.
FLOSS Tools Maintain Data Portability
Free/Libre Open Source Software (FLOSS) helps you create flexible, sustainable solutions. Creating your own solution from pre-existing FLOSS tools may not always be the fastest path, but it is a rewarding experience and, importantly, helps you maintain control over the data that your solution generates.