Save Web Pages as PDF Files With JavaScript, Chromium, and Bash

There are many good reasons for wanting to save a copy of a web page for later, offline use. There are also numerous tools that specialize in helping people do this.

One potential drawback of these tools, often called web clippers, is that they tend to push you toward a specific piece of software tied to the clipping tool. Getting your saved content back out of that software may also be harder than you would hope.

However, there are more modular, customizable solutions that rely only on foundational technologies of the Web and GNU/Linux.

Note: If you are not familiar with the GNU/Linux command line interface, or you intend to use a script obtained from this site, review the Conventions page before proceeding.

The Output Format

With regard to text processing, the Portable Document Format (PDF) is often not a very good input format, but it does excel at being a flexible output format. No matter what operating system you use, it is likely to have built-in tools that can read and search PDF files.

Being able to easily convert a web page into a PDF file (preferably in a scriptable, command line-based manner) would get us closer to creating a web page saving pipeline. The Chromium web browser can be used in a headless manner to accomplish this.

Here is the description of the chromium Debian package:

$ apt show chromium
Package: chromium
Version: 145.0.7632.75-1~deb13u1
Priority: optional
Section: web
Maintainer: Debian Chromium Team <chromium@packages.debian.org>
Installed-Size: 294 MB
Provides: gnome-www-browser, www-browser
Depends: libasound2t64 (>= 1.0.17), libatk-bridge2.0-0t64 (>= 2.5.3), libatk1.0-0t64 (>= 2.32.0), libatspi2.0-0t64 (>= 2.9.90), libc6 (>= 2>
Recommends: chromium-sandbox
Suggests: chromium-l10n, chromium-shell, chromium-driver
Conflicts: libgl1-mesa-swx11, libnettle4, libsecret-1-0 (<< 0.18)
Breaks: chromium-lwn4chrome (<= 1.0-2), chromium-tt-rss-notifier (<= 0.5.2-2)
Homepage: http://www.chromium.org/Home
Download-Size: 81.8 MB
APT-Sources: http://security.debian.org/debian-security trixie-security/main amd64 Packages
Description: web browser
 Web browser that aims to build a safer, faster, and more stable internet
 browsing experience.
 .
 This package contains the web browser component.

Notice: There is 1 additional record. Please use the '-a' switch to see it

You can use chromium to convert a web page to a PDF file like so:

$ chromium \
    --headless \
    --incognito \
    --print-to-pdf="${HOME}/Downloads/psf_mission.pdf" \
    'https://www.python.org/psf/mission/'
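
If you run this command often, it can be wrapped in a small shell function. The sketch below is one possible shape rather than part of the workflow described here: the function name save_page_as_pdf and the CHROMIUM_BIN override variable are hypothetical, and only the flags already shown above are used.

```shell
# Hypothetical wrapper around the chromium command shown above.
# CHROMIUM_BIN is an assumed override for systems where the browser binary
# has a different name; it defaults to "chromium".
save_page_as_pdf() {
    local url="$1"          # web page address
    local output_file="$2"  # full path of the PDF file to create

    "${CHROMIUM_BIN:-chromium}" \
        --headless \
        --incognito \
        --print-to-pdf="${output_file}" \
        "${url}"
}

# Example call (same page and output file as above):
# save_page_as_pdf 'https://www.python.org/psf/mission/' "${HOME}/Downloads/psf_mission.pdf"
```

Keeping the address and the output path as the only two parameters mirrors the two pieces of information the command needs.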

Command Input

Previously, we had to construct a chromium command to get our output file. Specifically, we needed to provide at least two arguments:

  1. The web page address
  2. The name of the output PDF file

If we plan on building a pipeline for saving web pages as PDF files, we should have an easier way of getting this information. A little bit of JavaScript can help, in the form of a bookmarklet:

javascript:(
  function () {
    const new_title = document.title.toLowerCase()
                                    .replace(/[^a-zA-Z0-9]/g, "_")
                                    .replace(/_{2,}/g, "_");
    alert(`${new_title} ${window.location.href}`);
  }
)();

chromium Arguments

The bookmarklet above:

  1. Grabs a web page's title
  2. Converts it to lowercase
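  3. Replaces every character that is not a letter or digit with an underscore, then collapses runs of underscores into one, which makes the title suitable as a filename
  4. Joins the filename and the web page's address with a space and shows the result in an alert box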

So, for the prior web page example, if you navigate to https://www.python.org/psf/mission/ and run the bookmarklet, you should see a string like this in an alert box:

mission_python_software_foundation https://www.python.org/psf/mission/
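
The same clean-up can be reproduced on the command line, which may be handy when a filename has to be derived outside the browser. This is a sketch under assumptions: clean_title is a hypothetical helper, and the sample title in the comment is assumed for illustration rather than read from the page.

```shell
# Hypothetical shell counterpart to the bookmarklet's title clean-up.
clean_title() {
    # 1. Lowercase the title.
    # 2. Replace every character that is not a-z or 0-9 with an underscore.
    # 3. Collapse runs of underscores into a single underscore.
    printf '%s' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | sed -e 's/[^a-z0-9]/_/g' -e 's/__*/_/g'
}

# Example call with an assumed page title:
# clean_title 'Mission | Python Software Foundation'
# prints: mission_python_software_foundation
```

Like the bookmarklet, this sketch does not trim a leading or trailing underscore.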

Now, you can easily produce and copy the arguments required for the chromium command to convert a web page into a PDF file.

The Script

You may find that continually creating chromium commands is cumbersome. Being able to easily create the command's arguments helps, but a more streamlined solution would be preferable.

A Bash script can help automate the chromium part of the pipeline, as well as handle the few cases where chromium does not succeed at its task.

For example, you may occasionally come across a web page where chromium does not render the content at all, or where the command takes too long to generate the output file (i.e., it hangs). In these rare cases, you may have to fall back to manually saving the page as a PDF file using your operating system's built-in PDF saving functionality.

The script accounts for this in two ways:

  1. If chromium generates a PDF file within 30 seconds, the file is opened in your PDF viewing application of choice for 5 seconds, so you can check that the output is satisfactory.
  2. If chromium cannot generate a PDF file within 30 seconds, an error message is shown and the web page's output filename (i.e., the cleaned-up web page title) is placed on the system clipboard via wl-copy. This makes it a little easier to name the output file when you manually save the web page as a PDF file via your operating system.
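
A minimal sketch of these two behaviors, not necessarily how the script itself is written, could look like the following. It relies on timeout from GNU coreutils. The function name save_or_fallback and the variables CHROMIUM_BIN, PDF_VIEWER, CLIPBOARD_CMD, and OUT_DIR are hypothetical, and evince is only an assumed default viewer.

```shell
# Sketch only: every name and default below is an assumption.
CHROMIUM_BIN="${CHROMIUM_BIN:-chromium}"
PDF_VIEWER="${PDF_VIEWER:-evince}"         # assumed viewer; substitute your own
CLIPBOARD_CMD="${CLIPBOARD_CMD:-wl-copy}"  # wl-copy reads the text from stdin
OUT_DIR="${OUT_DIR:-${HOME}/Downloads}"

save_or_fallback() {
    local name="$1"   # cleaned-up page title, used as the filename
    local url="$2"    # web page address
    local out="${OUT_DIR}/${name}.pdf"

    # Give chromium at most 30 seconds, then require a non-empty file.
    if timeout 30 "${CHROMIUM_BIN}" --headless --incognito \
            --print-to-pdf="${out}" "${url}" && [ -s "${out}" ]; then
        # Preview for 5 seconds. timeout ends the viewer, so its non-zero
        # exit status is expected and ignored.
        timeout 5 "${PDF_VIEWER}" "${out}" || true
        printf 'Saved %s\n' "${out}"
    else
        printf 'Error: could not create %s\n' "${out}" >&2
        # Put the filename on the clipboard for a manual "print to PDF".
        printf '%s' "${name}" | "${CLIPBOARD_CMD}"
        return 1
    fi
}
```

Reading the command names from variables keeps the sketch easy to adapt, for example to a different viewer or clipboard tool.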

Now, you should be able to start the script, use the JavaScript bookmarklet to generate and copy the chromium arguments, and then paste them into the terminal emulator running the script.
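
One way to accept the pasted bookmarklet output is a loop that reads one line at a time and splits it into the filename and the address. The sketch below assumes that input format. The function process_pasted_lines and its handler argument are hypothetical names; the handler stands in for whatever command does the actual conversion.

```shell
# Hypothetical helper: read "<filename> <address>" lines from standard input
# and pass each pair to a handler command (defaults to echo for a dry run).
process_pasted_lines() {
    local handler="${1:-echo}"
    local name url

    # read puts the first word in name and the rest of the line in url.
    while read -r name url; do
        # Skip blank or incomplete lines instead of passing empty arguments.
        if [ -z "${name}" ] || [ -z "${url}" ]; then
            continue
        fi
        # Detach the handler from stdin so it cannot swallow pending lines.
        "${handler}" "${name}" "${url}" < /dev/null
    done
}

# Example dry run, typing or pasting lines and ending with Ctrl-D:
# process_pasted_lines echo
```

Because the cleaned-up title never contains a space, the first space on each line reliably separates the filename from the address.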

FLOSS Tools Maintain Data Portability

Free/Libre Open Source Software (FLOSS) helps you create flexible, sustainable solutions. Creating your own solution from pre-existing FLOSS tools may not always be the fastest path, but it is a rewarding experience and, importantly, helps you maintain control over the data that your solution generates.
