Following my pursuit of paragliding, I am studying the atmosphere. Recently, I discovered an impressive collection of Russian-language books in the digital public library of the Russian State Hydrometeorological University. I prefer reading electronic books on an electronic ink screen. My 9.7″ Kindle DX e-reader is more than a decade old but still works, and I continue to use it to read PDF files.
I crop page margins to make a PDF file comfortable to read. If it is a book scan, each page can contain a book opening, i.e. a pair of pages. In this case, I also need to split the image. I used to perform these operations with briss. Although it hasn’t been updated since 2013, briss does its job.
Before jumping to briss with my next PDF file, I searched for other free, open-source alternatives and discovered ScanTailor Advanced. An extremely detailed publication by Nate Craun covers multiple aspects of book scanning on Linux, including using ScanTailor. My post would have been redundant if Nate’s text were more recent: it dates from 2013, and some of its commands didn’t work for me, so I had to update them.
ScanTailor can’t process PDF files. Instead, its input should be TIFF or JPEG images. I use the command-line tool pdfimages to convert a PDF file to images. The following command extracts the page images from the meteorology.pdf file and saves them to the meteorology directory as individual files. pdfimages appends a dash, a three-digit image number, and the extension to the given prefix, so with this command the files are named like page_nnn-000.jpg. With the -j option, images stored in the PDF as JPEG are written out as-is, so each resulting file is identical to the JPEG data embedded in the PDF file.
pdfimages -j meteorology.pdf meteorology/page_nnn
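The extraction step can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name extract_pages is made up for illustration, the prefix page is my choice, and the output directory is created up front because pdfimages is not expected to create it on its own.

```shell
# Hypothetical helper: extract the embedded images of a PDF into a directory.
# Usage: extract_pages meteorology.pdf meteorology
extract_pages() {
    pdf=$1
    out_dir=$2
    # Create the target directory first (assumption: pdfimages will not).
    mkdir -p "$out_dir" || return 1
    # -j writes JPEG-encoded images as .jpg files without re-encoding them.
    # pdfimages appends "-NNN" and the extension to the prefix,
    # so the files are named page-000.jpg, page-001.jpg, ...
    pdfimages -j "$pdf" "$out_dir/page"
}
```

Using a short prefix such as page keeps the generated names tidy, and the zero-padded numbers sort in page order when the files are loaded into ScanTailor.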
Then, I proceed to ScanTailor: fix rotation, split pages, and crop margins. Nate advises setting DPI to 600 and saving pages as black and white. This recommendation didn’t work for my PDF files: the letters are tiring to read, and some formulas aren’t readable at all. I suspect the problem is that the original files were heavily compressed, so it is impossible to separate the letters from the background in black and white. Consequently, I saved the pages as greyscale at 300 DPI.
The last step is to export individual processed pages to a single PDF file:
magick -quality 5 -compress jpeg *.tif ../meteorology_cropped.pdf
The options -quality 5 -compress jpeg instruct ImageMagick to compress the page images using the JPEG algorithm with quality level 5 out of 100. Initially, I tried level 90 or 70, but the resulting file was several times larger than the original, e.g. 100+ MB vs 20 MB. To my surprise, level 5 didn’t noticeably reduce the perceived quality of the final PDF file.
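To pick a quality level less blindly, the same command can be run in a loop that prints the size of each result. This is a rough sketch, not a tested recipe: the function name compare_quality and the output naming scheme are invented for illustration, and it assumes the *.tif pages produced by ScanTailor are in the current directory.

```shell
# Hypothetical helper: build one PDF per JPEG quality level and print its size.
# Usage: compare_quality ../meteorology 5 30 70
compare_quality() {
    out_prefix=$1
    shift
    for q in "$@"; do
        out="${out_prefix}_q${q}.pdf"
        # Same options as in the command above, with a variable quality level.
        magick -quality "$q" -compress jpeg *.tif "$out" || return 1
        # wc -c reports the file size in bytes; tr strips padding spaces.
        size=$(wc -c < "$out" | tr -d ' ')
        printf '%s\t%s bytes\n' "$out" "$size"
    done
}
```

The sizes alone don’t say anything about readability, so each candidate file still has to be checked by eye on the e-reader.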
TODOs
- Add the table of contents to the PDF files. Nate says jpdfbookmarks should work.
- Currently, on the list of books on my Kindle, instead of the authors’ names, I see ImageMagick’s website address.