Process a book scan for e-reader – Alexander Matrunich is Here

Following my pursuit of paragliding, I am studying the atmosphere. Recently, I discovered an impressive collection of Russian-language books in the digital public library of the Russian State Hydrometeorological University. I prefer reading electronic books on an electronic-ink screen. My 9.7″ Kindle DX e-reader is more than a decade old but still works, and I continue to use it to read PDF files.

I crop page margins to make a PDF file comfortable to read. If it is a book scan, each page can contain a book opening, i.e. a pair of pages. In this case, I need to split the image. I used to perform these operations with briss¹. Although it hasn’t been updated since 2013, briss does its job.

Before jumping to briss with my next PDF file, I searched for other free, open-source alternatives and discovered ScanTailor Advanced. An extremely detailed publication by Nate Craun covers multiple aspects of book scanning on Linux, including using ScanTailor. This post would have been redundant if Nate’s text were more recent than 2013. Some of its commands didn’t work for me, so I had to update them.

ScanTailor can’t process PDF files. It needs TIFF or JPEG images instead. I use the command-line tool pdfimages to convert a PDF file to images. The following command extracts the page images from the meteorology.pdf file and saves them to the meteorology directory as individual files. pdfimages appends an image number and an extension to the given prefix, so the files are named like page_nnn-000.jpg. With the -j option, images stored in the PDF file as JPEG are saved in JPEG format, and each resulting file is identical to the JPEG data in the PDF file.

pdfimages -j meteorology.pdf meteorology/page_nnn
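The extraction step can be wrapped in a small function. This is only a sketch, assuming the pdfimages from poppler-utils: the function name extract_images and the page prefix are made up for illustration. It creates the output directory first, because pdfimages, as far as I know, does not create it, and it prints the image list so you can check how the scans are stored before extracting them.

```shell
# Sketch: extract embedded images from a PDF with pdfimages (poppler-utils).
# extract_images is a hypothetical helper name, not part of pdfimages.
extract_images() {
    local pdf="$1" outdir="$2"
    # Assumption: pdfimages does not create the output directory itself.
    mkdir -p "$outdir" || return 1
    # List the images; "jpeg" in the enc column means -j can copy the
    # data out of the PDF without re-encoding it.
    pdfimages -list "$pdf"
    # Files are written as <prefix>-NNN.jpg, e.g. page-000.jpg.
    pdfimages -j "$pdf" "$outdir/page"
}

# Usage: extract_images meteorology.pdf meteorology
```

Images that are not stored as JPEG inside the PDF are still written by -j, but in PPM or PBM format, so the image list is worth a glance before opening the directory in ScanTailor.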

Then, I proceed to ScanTailor: fix rotation, split pages, and crop margins. Nate advises setting DPI to 600 and saving pages as black and white. With my PDF files, this recommendation didn’t work: the letters are uncomfortable to read, and some formulas aren’t readable at all. I suspect the problem is that the original files were heavily compressed, so it is impossible to distinguish the background from the letters in black and white. Consequently, I saved the files as greyscale at 300 DPI.

The last step is to export individual processed pages to a single PDF file:

magick -quality 5 -compress jpeg *.tif ../meteorology_cropped.pdf

The options -quality 5 -compress jpeg tell ImageMagick to compress the page images with the JPEG algorithm at a quality setting of 5 out of 100. Initially, I tried quality 90 and 70, but the resulting file was several times larger than the original, e.g. 100+ MB vs 20 MB. To my surprise, quality 5 didn’t visibly degrade the final PDF file.
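To pick a quality setting without guessing, the same magick command can be run in a loop and the file sizes compared. A minimal sketch: compare_quality is a hypothetical helper name, and the list of quality values is only an example.

```shell
# Sketch: build the PDF at several JPEG quality settings and print the sizes.
# compare_quality is a hypothetical helper; the quality values are examples.
compare_quality() {
    local prefix="$1" q out
    shift
    for q in 5 30 70 90; do
        out="${prefix}_q${q}.pdf"
        # Same options as the command above, with only the quality changed.
        magick -quality "$q" -compress jpeg "$@" "$out" || return 1
        # wc -c reports the size of the result in bytes.
        printf '%s\t%s bytes\n' "$out" "$(wc -c < "$out")"
    done
}

# Usage, from the directory with ScanTailor's output:
#   compare_quality ../meteorology_test *.tif
```

Running it on a handful of pages rather than the whole book keeps the comparison quick. The sizes only tell half the story, so the results still need to be checked on the e-reader screen.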

TODOs

Add the table of contents to the PDF files. Nate says jpdfbookmarks should work.
Currently, on the list of books on my Kindle, instead of the authors’ names, I see ImageMagick’s website address.
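A possible approach to the second item, untested on the Kindle: write the Title and Author fields into the PDF with ExifTool. The sketch below assumes ExifTool is installed. The function name set_pdf_meta is hypothetical, and whether the Kindle takes its author column from this field is an assumption.

```shell
# Sketch: set Title and Author in a PDF with ExifTool.
# set_pdf_meta is a hypothetical helper name.
set_pdf_meta() {
    local pdf="$1" title="$2" author="$3"
    # -overwrite_original skips the *_original backup copy that
    # ExifTool keeps by default.
    exiftool -overwrite_original -Title="$title" -Author="$author" "$pdf"
}

# Usage: set_pdf_meta meteorology_cropped.pdf "Book title" "Author name"
```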

¹ While preparing this publication, I stumbled upon Briss 2.0, a descendant of briss. I haven’t tried it.
