There are many good reasons for wanting to save a copy of a web page for later, offline use. There are also numerous tools that specialize in helping people do this.
One potential drawback of these web clippers is that they often push you towards using a specific piece of software tied to the web clipping tool. Also, getting your saved content out of the associated software may not be as easy as you would hope.
However, there are more modular, customizable solutions that rely only on foundational technologies of the Web and GNU/Linux.
Note: If you are not familiar with the GNU/Linux command line interface, review the Conventions page before proceeding.
The Output Format
With regard to text processing, the Portable Document Format (PDF) is often not a very good input format, but it does excel at being a flexible output format. No matter what operating system you use, it is likely to have built-in tools that can read and search PDF files.
Being able to easily convert a web page into a PDF file (preferably in a scriptable, command line-based manner) would get us closer to creating a web page saving pipeline. The wkhtmltopdf program is a tool that can enable this kind of workflow.
Here is the description of the wkhtmltopdf Debian package:
$ apt show wkhtmltopdf
Package: wkhtmltopdf
Version: 0.12.6-1
Priority: optional
Section: utils
Maintainer: Emmanuel Bouthenot <firstname.lastname@example.org>
Installed-Size: 805 kB
Depends: libc6 (>= 2.14), libgcc-s1 (>= 3.0), libqt5core5a (>= 5.14.1), libqt5gui5 (>= 5.2.0) | libqt5gui5-gles (>= 5.2.0), libqt5network5 (>= 5.14.1), libqt5printsupport5 (>= 5.2.0), libqt5svg5 (>= 5.6.0~beta), libqt5webkit5 (>= 5.212.0~alpha3), libqt5widgets5 (>= 5.0.2), libstdc++6 (>= 5)
Recommends: xserver | xvfb
Homepage: https://wkhtmltopdf.org/
Tag: implemented-in::c++, interface::graphical, interface::x11, role::program, uitoolkit::qt, use::converting, works-with-format::html, works-with::text, x11::application
Download-Size: 171 kB
APT-Manual-Installed: yes
APT-Sources: http://deb.debian.org/debian bullseye/main amd64 Packages
Description: Command line utilities to convert html to pdf or image using WebKit
 wkhtmltopdf is a command line program which permits one to create a pdf or an
 image from an url, a local html file or stdin. It produces a pdf or an image
 like rendered with the WebKit engine.
 .
 This program requires an X11 server to run.
 .
 It is not built against a forked version of Qt hence some options are not
 supported.
You can use wkhtmltopdf to convert a web page to a PDF file like so:
$ wkhtmltopdf \
    'https://www.python.org/psf/mission/' \
    'psf_mission.pdf'
The default settings for wkhtmltopdf usually produce satisfactory output (hyperlinks are preserved as well), but you can always tweak the command to adjust how the output PDF file looks. Run man 1 wkhtmltopdf for more information on how to customize the output.
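As an illustration of the kind of tweaking the man page describes, here is a minimal sketch that sets the paper size, orientation, margins, and color mode. The function name save_tweaked and the specific values are only assumptions chosen for the example. Depending on how your copy of wkhtmltopdf was built, some options may not be supported, as the package description notes.

```shell
#!/usr/bin/env bash
# Sketch: convert a page with a few layout tweaks applied.
# save_tweaked is a hypothetical helper name, not part of wkhtmltopdf.

save_tweaked() {
    local url="$1" out="$2"

    # A4 paper, landscape, 15 mm top/bottom margins, grayscale output.
    wkhtmltopdf \
        --page-size A4 \
        --orientation Landscape \
        --margin-top 15mm \
        --margin-bottom 15mm \
        --grayscale \
        "$url" "$out"
}

# Run only when called with both arguments: URL, then output file.
if [ "$#" -eq 2 ]; then
    save_tweaked "$1" "$2"
fi
```

Saved as, say, tweaked.sh, this could be run as: bash tweaked.sh 'https://www.python.org/psf/mission/' 'psf_mission.pdf'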
Previously, we had to construct a wkhtmltopdf command to get our output file. Specifically, we needed to provide at least two arguments:
- The web page address
- The name of the output PDF file
The bookmarklet above:
- Grabs a web page's title
- Converts it to lowercase
- Cleans up the title to make it suitable as a filename
- Concatenates the file name with the web page's address
So, for the prior web page example, if you navigate to https://www.python.org/psf/mission/ and deploy the bookmarklet, you should get an output string in an alert box like this:
Now, you can easily produce and copy the arguments required for the wkhtmltopdf command to convert a web page into a PDF file.
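The bookmarklet runs in the browser, but the same four steps can also be sketched in the shell, which may help clarify what each step does. This is not the bookmarklet's code: the function names clean_title and make_args, the underscore separator, and the exact quoting of the output string are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: turn a page title and URL into wkhtmltopdf arguments.
# clean_title and make_args are hypothetical helper names.

clean_title() {
    # Lowercase the title, collapse every run of characters other than
    # a-z and 0-9 into one underscore, then trim underscores at the ends.
    printf '%s' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | sed -E 's/[^a-z0-9]+/_/g; s/^_+//; s/_+$//'
}

make_args() {
    local url="$1" title="$2"
    # Assumed output format: quoted URL first, then the quoted filename.
    printf "'%s' '%s.pdf'\n" "$url" "$(clean_title "$title")"
}

# Run only when called with both arguments: URL, then page title.
if [ "$#" -eq 2 ]; then
    make_args "$1" "$2"
fi
```

The printed string can then be pasted after the wkhtmltopdf command name in a terminal.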
You may find that continually creating wkhtmltopdf commands is cumbersome. Being able to easily create the command's arguments helps, but a more streamlined solution would be preferable.
A Bash script can help automate the wkhtmltopdf part of the pipeline, as well as address the few cases where wkhtmltopdf is not successful in its task:
For example, you may occasionally come across a web page where wkhtmltopdf does not render the content at all, or the command appears to take too long to generate the output file (i.e., it hangs). For these rare cases, you may have to revert to manually saving the page as a PDF file using your operating system's built-in PDF saving functionality.
The above script accounts for this in two ways:
- If wkhtmltopdf can generate a PDF file within 30 seconds, it is opened in your PDF viewing application of choice for 5 seconds. This allows you to preview the output file to ensure that wkhtmltopdf successfully created satisfactory output.
- If wkhtmltopdf cannot generate a PDF file within 30 seconds, an error message is shown, and the web page's associated output filename (i.e., the cleaned up web page title) is placed on the system clipboard via xclip, which makes it a little easier to create the output file when you manually save the web page as a PDF file via your operating system.
Copy the wkhtmltopdf arguments that the bookmarklet produces, and then paste them into the terminal emulator running the script.
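To make the two behaviors described above concrete, here is a minimal sketch of a script along those lines. It is not the original script, only an approximation under stated assumptions: the function name save_page is hypothetical, the PDF_VIEWER variable and its evince default are placeholders for your viewer of choice, and any non-zero exit from wkhtmltopdf is treated as a failure.

```shell
#!/usr/bin/env bash
# Sketch: convert a page with a time limit, preview on success,
# fall back to the clipboard on failure.

# Assumption: PDF_VIEWER names a viewer that stays in the foreground
# (evince is only a placeholder default).
PDF_VIEWER="${PDF_VIEWER:-evince}"

save_page() {
    local url="$1" out="$2"

    # Give wkhtmltopdf at most 30 seconds; timeout(1) exits non-zero
    # if the limit is hit or if wkhtmltopdf itself fails.
    if timeout 30 wkhtmltopdf "$url" "$out"; then
        # Show the result for 5 seconds, then close the viewer.
        # timeout exits with 124 when it stops the viewer, so the
        # status is ignored on purpose.
        timeout 5 "$PDF_VIEWER" "$out" || true
        return 0
    fi

    echo "wkhtmltopdf could not create $out; save the page manually." >&2
    # Put the intended filename on the clipboard for the manual save.
    printf '%s' "$out" | xclip -selection clipboard
    return 1
}

# Run only when called with both arguments: URL, then output file.
if [ "$#" -eq 2 ]; then
    save_page "$1" "$2"
fi
```

Wrapping both commands in timeout keeps a hanging conversion or a forgotten viewer window from blocking the next page you want to save.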
FLOSS Tools Maintain Data Portability
Free/Libre Open Source Software (FLOSS) helps you create flexible, sustainable solutions. Creating your own solution from pre-existing FLOSS tools may not always be the fastest path, but it is a rewarding experience and, importantly, helps you maintain control over the data that your solution generates.