Book digitization

During my hunt for literature about aircraft design I found several interesting old books scanned to PDF online. These raw scans are hard to read, especially when both pages of a spread are scanned onto a single PDF page. I've spent some time looking for the best way to clean up the scans and turn them into a readable, preferably printable form. That means removing all artefacts, the yellowish background, etc.

Basic PDF manipulation

There are many useful PDF manipulation tools in the Linux environment that could be used to prepare the PDF for further processing:

Rotate PDF pages

pdftk is a great tool for basic PDF manipulation. When processing book scans, we can fix the page rotation by running something like:

pdftk input.pdf cat 1-endeast output output.pdf

pdftk reads input.pdf, selects the pages from the first (1) to the last (end), rotates them clockwise (east), and writes the result to output.pdf.
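If the scan needs a different rotation, the same command can be wrapped in a small helper. This is only a sketch: rotate_pdf is a made-up name, not part of pdftk. According to the pdftk manual, north, east, south and west set an absolute page rotation (0, 90, 180 and 270 degrees), while left, right and down rotate relative to the rotation the page already has.

```shell
# rotate_pdf INPUT OUTPUT DIRECTION   (hypothetical helper, not a pdftk command)
# DIRECTION is one of pdftk's rotation keywords:
#   north, east, south, west  -> set the page rotation to 0, 90, 180, 270 degrees
#   left, right, down         -> rotate by -90, +90, +180 degrees relative to
#                                the page's current rotation
rotate_pdf() {
    pdftk "$1" cat "1-end$3" output "$2"
}

# Example: rotate_pdf input.pdf output.pdf left
```

For a scan whose pages carry no rotation yet, east and right give the same result.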

Change pages order

Sometimes the pages of a scan are in the wrong order, with the two pages of each pair swapped. With pdftk it's quite easy to fix the order by interleaving the even and odd pages:

pdftk input.pdf shuffle even odd output out.pdf

Split PDF page into two

If a book was scanned by laying it open on the scanner, so that the left and right pages ended up on a single PDF page, you can simply split each PDF page into halves using the MuPDF tools:

mutool poster -y 2 input.pdf output.pdf

This command tells mutool to cut each PDF page into two parts along the vertical axis (-y 2), each 50% of the page height. It's not perfect, but if the scans are symmetrical, it works well.
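The command above cuts along the page height, which suits scans where the two book pages sit one above the other. If they sit side by side instead, mutool poster also accepts -x for cutting along the width. A minimal sketch, where split_spread is a hypothetical helper name:

```shell
# split_spread INPUT OUTPUT [AXIS]   (hypothetical helper)
# AXIS=y (default) cuts every page into a top and a bottom half,
# AXIS=x cuts every page into a left and a right half.
split_spread() {
    mutool poster "-${3:-y}" 2 "$1" "$2"
}

# Example: split_spread input.pdf output.pdf x
```

It is worth checking the page order of the result in a viewer before processing it further.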

Book scanning tools

Unpaper

Unpaper is a great tool for cleaning scanned documents. It doesn't work on PDF files directly, so the pages have to be converted from PDF to JPEG or another supported image format first.
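One possible way to do the conversion, sketched under a few assumptions: pdftoppm from poppler-utils is installed, PGM (a member of the PNM family that unpaper reads) is used as the intermediate format, and 300 dpi is enough for the scan. The helper name pdf_to_clean_pnm and the raw-/clean- file prefixes are made up for this example.

```shell
# pdf_to_clean_pnm INPUT_PDF WORKDIR   (hypothetical helper)
pdf_to_clean_pnm() {
    mkdir -p "$2"
    # Render every PDF page as a grayscale image at 300 dpi (assumed value):
    # WORKDIR/raw-1.pgm, WORKDIR/raw-2.pgm, ...
    pdftoppm -gray -r 300 "$1" "$2/raw"
    for page in "$2"/raw-*.pgm; do
        # unpaper takes an input file and an output file;
        # raw-N.pgm becomes clean-N.pgm in the same directory
        unpaper "$page" "$2/clean-${page##*/raw-}"
    done
}

# Example: pdf_to_clean_pnm input.pdf pages
```

The defaults of unpaper are used here; its many options for borders, noise and deskewing are left out of the sketch.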

OCRmyPDF

OCRmyPDF is a great tool for adding the actual text data to your scanned PDFs, e.g. for searching. It can do much more, as it can make use of Unpaper. The image-to-text transformation is done by Tesseract. If your scan quality is reasonably high, you can run it directly like:

ocrmypdf -l ces input.pdf output.pdf

The -l argument selects the language the text is written in (ces is Czech). You'll need to install the Tesseract language pack for your language before running this command.
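The two helpers below sketch how the language check and a slightly more aggressive OCR run might look. Both function names are hypothetical. The --deskew and --clean options belong to OCRmyPDF; --clean hands the pages to Unpaper before OCR. The package name in the comment assumes Debian or Ubuntu and will differ elsewhere.

```shell
# has_tesseract_lang LANG   (hypothetical helper)
# Succeeds if Tesseract lists LANG among its installed languages.
# 2>&1 is there because some Tesseract versions print the list to stderr.
has_tesseract_lang() {
    tesseract --list-langs 2>&1 | grep -qx "$1"
}

# ocr_book INPUT OUTPUT LANG   (hypothetical helper)
# Straightens the pages and cleans them with unpaper before running OCR.
ocr_book() {
    ocrmypdf -l "$3" --deskew --clean "$1" "$2"
}

# Example (assumed Debian/Ubuntu package name for the Czech language pack):
#   has_tesseract_lang ces || sudo apt install tesseract-ocr-ces
#   ocr_book input.pdf output.pdf ces
```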

ScanTailor

All the tools above are nice, but they are far from what you can achieve with the mighty ScanTailor Advanced. It doesn't work with PDF files directly, so you need to convert the pages from PDF to images first, for example to PNG:

gs -sDEVICE=pngalpha -r400 -o %02d.png input.pdf

This command generates numerically named images that can be imported directly into a ScanTailor project. The tool itself is a bit unintuitive on the first try, but the process is simple:

  • Use Fix Orientation to rotate the page into the expected position.
  • Click Apply to -> All pages to propagate the change to all imported pages.
  • The thumbnail of the current page in the sidebar is fine, but the other pages show a question mark, meaning the change has not been applied to them yet. Either click on each page or run the batch processing by clicking the green play button next to the Fix Orientation tool.
  • Now you can move on to the next tool.

Tools

The tools are applied one after another in the order listed in the left sidebar; you can't change the order of the steps. The available tools are:

Fix orientation

= page rotation.

Split pages

= splits the image into two pages; the autodetection algorithm usually detects the edge between the pages quite nicely.

Deskew

= fixes text rotation; it can straighten the text on the page automatically, but imperfect scans usually need manual intervention.

Select Content

= selects the content of the page, excluding the book edges, etc. The border generated here should contain only the text, page number, and whatever else you want in the final book, and it should be as close to the text edges as possible.

Margins

= generates empty space around the content selected in the previous step. This tool also lets you align the selected content on a page of constant size, which produces consistent pages in the output.

Output

= output formatting; all the changes are now applied to the selected content: the background color is removed, some filtering can be done, warped scans can be straightened, etc.

Merging output to PDF

Once you are finished with all the changes, click the green play button next to the Output step and the final images are generated in the out folder. The only remaining step is to make a PDF from them again:

convert `ls -v1 out/*.tif | tr '\n' ' '` output.pdf
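The command above breaks if a file name contains spaces. A variant that avoids this might look like the sketch below. It assumes GNU sort (for -z and -V) and an xargs that supports -0; tifs_to_pdf is a hypothetical helper name. On ImageMagick 7 the convert command is superseded by magick.

```shell
# tifs_to_pdf DIR OUTPUT_PDF   (hypothetical helper)
tifs_to_pdf() {
    # Pass the file names NUL-separated so spaces survive, and sort them
    # in version order so that 2.tif comes before 10.tif.
    # Inside sh -c, "$0" is OUTPUT_PDF and "$@" are the sorted TIFF files.
    printf '%s\0' "$1"/*.tif | sort -zV | xargs -0 sh -c 'convert "$@" "$0"' "$2"
}

# Example: tifs_to_pdf out output.pdf
```

For a very large number of pages xargs may split the work into several convert calls, each overwriting the output, so this sketch is only suitable for books of ordinary size.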
