During my hunt for literature about aircraft design I found several interesting old books scanned to PDF online. These raw scans are hard to read, especially when two book pages are scanned onto a single PDF page. I've spent some time looking for the best way to clean up the scans and turn them into a readable form, preferably in printable quality, which means removing all artefacts, the yellowish background, etc.
There are many useful PDF manipulation tools in the Linux environment that could be used to prepare the PDF for further processing:
pdftk is a great tool for basic PDF manipulation. For processing book scans, we can fix the page rotation by running something like:
pdftk input.pdf cat 1-endeast output output.pdf
pdftk reads all pages of input.pdf, selects the pages from the first (1) to the last (end), sets their rotation to 90 degrees clockwise (east), and writes the result to output.pdf.
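Besides east, pdftk accepts north (0), south (180) and west (270) as absolute rotations. The following is only a small sketch of how the rotation step could be wrapped in a helper; the function names rotation_keyword and rotate_all are made up for illustration and are not part of pdftk.

```shell
# Map a clockwise angle in degrees to pdftk's absolute rotation keyword.
rotation_keyword() {
    case "$1" in
        0)   echo north ;;
        90)  echo east ;;
        180) echo south ;;
        270) echo west ;;
        *)   echo "unsupported angle: $1" >&2; return 1 ;;
    esac
}

# rotate_all INPUT OUTPUT ANGLE
# Set every page of INPUT to the given absolute rotation and write OUTPUT.
# (Hypothetical helper, not a pdftk feature.)
rotate_all() {
    kw=$(rotation_keyword "$3") || return 1
    pdftk "$1" cat "1-end$kw" output "$2"
}

# Example: rotate_all input.pdf output.pdf 90
```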
Sometimes a scan has its pages in the wrong order, for example with the two pages of each pair swapped. With pdftk it's quite easy to fix the page order by shuffling the even and odd pages:
pdftk input.pdf shuffle even odd output out.pdf
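To see what the shuffle does before running it on a whole book, the resulting page order can be simulated. This is just a sketch in plain shell, assuming that pdftk takes one page at a time from each range and carries on with the remaining range once the other runs out; shuffle_order is a made-up helper name.

```shell
# Print the page order produced by interleaving the even pages with the
# odd pages of an N-page document (hypothetical helper for illustration).
shuffle_order() {
    n=$1
    i=1
    order=""
    while [ "$i" -le "$n" ]; do
        j=$((i + 1))
        # The even page of the pair comes first, if it exists.
        if [ "$j" -le "$n" ]; then
            order="$order $j"
        fi
        order="$order $i"
        i=$((i + 2))
    done
    echo "${order# }"
}

shuffle_order 6   # prints: 2 1 4 3 6 5
shuffle_order 5   # prints: 2 1 4 3 5
```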
If the book was scanned by laying it open on the scanner, so that both the left and right pages ended up on a single PDF page, you can split each PDF page into halves using the MuPDF tooling:
mutool poster -y 2 input.pdf output.pdf
This command tells mutool to split each PDF page into 2 parts along the vertical axis, each 50% of the original height. It's not perfect, but if the scans are symmetrical, it works well.
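mutool poster also has an -x option that divides the page along its width, which is what you want when the two book pages sit side by side rather than one above the other. A minimal sketch of choosing between the two, where poster_flag, split_spreads and the layout names are my own invention:

```shell
# Pick the mutool poster flag for a given spread layout.
#   side-by-side: book pages next to each other   -> divide along x
#   stacked:      book pages one above the other  -> divide along y
poster_flag() {
    case "$1" in
        side-by-side) echo "-x 2" ;;
        stacked)      echo "-y 2" ;;
        *)            echo "unknown layout: $1" >&2; return 1 ;;
    esac
}

# split_spreads INPUT OUTPUT LAYOUT  (hypothetical helper)
split_spreads() {
    flag=$(poster_flag "$3") || return 1
    # $flag is intentionally unquoted so "-x 2" expands to two arguments.
    mutool poster $flag "$1" "$2"
}

# Example: split_spreads input.pdf output.pdf side-by-side
```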
Unpaper is a great tool for cleaning scanned documents. It doesn't work on PDF directly; you need to convert the pages from PDF to JPEG or another supported image format first.
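One possible way to do the conversion, sketched under the assumption that pdftoppm from poppler-utils is installed (it is not mentioned above and any other converter would do): render each page to PPM, then run unpaper with its default settings on every image. The helper name clean_pages and the file naming are illustrative only.

```shell
# clean_pages INPUT_PDF WORKDIR  (hypothetical helper)
# Render every page of the PDF to PPM and clean it with unpaper.
clean_pages() {
    pdf=$1
    dir=$2
    mkdir -p "$dir" || return 1
    # pdftoppm writes $dir/page-<number>.ppm, one file per page.
    pdftoppm -r 300 "$pdf" "$dir/page" || return 1
    for f in "$dir"/page-*.ppm; do
        [ -e "$f" ] || continue
        name=${f##*/}    # strip the directory, e.g. page-01.ppm
        unpaper "$f" "$dir/clean-${name#page-}" || return 1
    done
}

# Example: clean_pages input.pdf work
```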
OCRmyPDF is a great tool for annotating your scanned PDFs with the actual text data, e.g. for searching. It can do much more, as it can make use of Unpaper. The image-to-text conversion is done by Tesseract. If your scan quality is reasonably high, you can run it directly like:
ocrmypdf -l ces input.pdf output.pdf
The -l argument selects the language the text is written in (ces is Czech). You'll need to install the Tesseract language pack for your language before running this command.
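For lower quality scans, ocrmypdf can deskew the pages and clean them with Unpaper before OCR using the --deskew and --clean options. Here is a rough sketch that first checks whether the language pack is installed, using the output of tesseract --list-langs; has_lang and ocr_book are names I made up for this example.

```shell
# Succeed if the given Tesseract language pack is installed.
has_lang() {
    # Older Tesseract versions print the list on stderr, hence 2>&1.
    tesseract --list-langs 2>&1 | grep -qx -- "$1"
}

# ocr_book LANG INPUT OUTPUT  (hypothetical helper)
ocr_book() {
    lang=$1
    if ! has_lang "$lang"; then
        echo "tesseract language pack '$lang' is not installed" >&2
        return 1
    fi
    # --clean needs unpaper to be installed.
    ocrmypdf -l "$lang" --deskew --clean "$2" "$3"
}

# Example: ocr_book ces input.pdf output.pdf
```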
All the tools above are nice, but far from what you can achieve with the mighty ScanTailor Advanced. It doesn't work with PDF directly, so you need to convert the pages from PDF to PNG, for example:
gs -sDEVICE=pngalpha -r400 -o %02d.png input.pdf
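With %02d a book longer than 99 pages produces names like 100.png, which sort before 11.png in a plain alphabetical listing. A small sketch that derives a wide enough pattern from the page count; pad_pattern is a hypothetical helper and the page count is assumed to be known.

```shell
# Print a gs output pattern wide enough for the given number of pages,
# never narrower than two digits (hypothetical helper).
pad_pattern() {
    pages=$1
    width=${#pages}      # number of digits in the page count
    if [ "$width" -lt 2 ]; then
        width=2
    fi
    printf '%%0%sd.png\n' "$width"
}

# Example for a 250-page book:
#   gs -sDEVICE=pngalpha -r400 -o "$(pad_pattern 250)" input.pdf
```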
This command generates numerically named images that can be imported directly into a ScanTailor project. The tool itself is a bit unintuitive at first, but the process is simple:
The tools are applied in the order listed in the left sidebar; you can't change the order of the steps. The tools available are:
Fix Orientation = page rotation.
Split Pages = splits the image into two pages; the autodetect algorithm usually detects the edge between the pages quite nicely.
Deskew = fixes text rotation; it can automatically straighten the text on the page, but with imperfect scans manual intervention is usually necessary.
Select Content = selects the content of the page, excluding the book edges, etc. The border generated here should contain only the text, page number, etc. that you want in the final book, and it should be as close to the text edges as possible.
Margins = generates empty space around the content selected before. This tool also lets you align the selected content on a page of constant size, producing consistent pages in the output.
Output = output formatting; all the changes are now applied to the selected content: the background color is removed, some filtering can be done, warped scans can be straightened, etc.
Once you are finished with all the changes, click the green play button next to the Output step and the final images are generated into the out folder. The only remaining step is to make a PDF from them again:
convert `ls -v1 out/*.tif | tr '\n' ' '` output.pdf