June 23, 2007

Digitalizing an Old Book. II

Converting to DjVu

I scanned a Russian math book (310 pp.) and got a 23 MB PDF file (155 double pages). My experiments show that:

  • One should not try to reduce the size of this foo.pdf file any further, since doing so increases the size of the resulting DjVu file;
  • The optimal way is to run djvudigital on my Mac, which produced a 7.2 MB foo.djvu;
  • After that, I transferred the file to VPC and ran the LizardTech DjvuExpress Trial for OCR, which increased the file to 8.1 MB;
  • Despite the Russian text, the OCR was pretty good, which demonstrates the superiority of the built-in ABBYY FineReader OCR engine (any2djvu.djvuzone.org cannot handle non-English text well).
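
The djvudigital step above can be scripted. Here is a minimal sketch in Python, assuming djvudigital is installed and on the PATH; the helper names (djvudigital_command, size_mb, convert) are my own, made up for illustration, and only the plain "djvudigital input output" form is used.

```python
import shutil
import subprocess
from pathlib import Path


def djvudigital_command(pdf_path, djvu_path=None):
    """Build the argument list for: djvudigital input.pdf output.djvu

    If no output name is given, reuse the input name with a .djvu suffix.
    (Hypothetical helper, not part of DjVuLibre.)
    """
    pdf = Path(pdf_path)
    out = Path(djvu_path) if djvu_path else pdf.with_suffix(".djvu")
    return ["djvudigital", str(pdf), str(out)]


def size_mb(path):
    """File size in megabytes (10**6 bytes), for before/after comparison."""
    return Path(path).stat().st_size / 1e6


def convert(pdf_path):
    """Run djvudigital on one PDF and return the name of the DjVu file."""
    if shutil.which("djvudigital") is None:
        raise RuntimeError("djvudigital not found on PATH")
    cmd = djvudigital_command(pdf_path)
    subprocess.run(cmd, check=True)  # raises if the conversion fails
    return cmd[-1]


if __name__ == "__main__":
    # Assumes foo.pdf sits in the current directory.
    result = convert("foo.pdf")
    print(f"foo.pdf: {size_mb('foo.pdf'):.1f} MB, {result}: {size_mb(result):.1f} MB")
```

Printing both sizes makes it easy to check whether shrinking the PDF beforehand really makes the DjVu file bigger for a particular book.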

My later experience:

One needs to remember: for serious scanning you need serious scanning s/w; scanning directly to Acrobat is not appropriate for a serious job:

  • Acrobat's scanning preferences are not in Preferences; when you select File > Create PDF > From Scanner you can adjust Image Settings: Compression (Color/Grayscale, Monochrome, Size/Quality) and Filtering (Deskew, Background removal, Edge Shadow removal, Despeckle, Halo removal). So there is no way to select the type of material, the scanning area, or the type of scanning (color, gray, black/white); as a result, the scanner tries to preview each page before scanning, determine its type and geometry, and select an appropriate mode. This makes the process way longer, and the guess is often wrong. Scanning directly to Acrobat works well if you want to scan a few separate pages rather than a book, and these pages are black and white (not yellow due to old age).
  • Also, Acrobat itself can convert a color PDF to gray, but to make it b/w one needs third-party utilities (standalone applications or Acrobat plugins), which are much more expensive than good scanning s/w.
  • On the other hand, using VueScan (available for Mac/Linux/Windows) or other good scanning s/w, I can select the geometry, the type of scanning (b/w), and the white/black threshold manually (and I can still change it for each page), so there is no need for a preview and the process is way faster. I can also set the resolution precisely (300 dpi recommended; everything above 600 dpi is downsampled to 600 dpi for OCR) rather than use a slider on a quality/size scale.
  • To make things worse, Acrobat scans blindly: you do not see what you got until you finish the process. Further, it does not save as it goes: you need to finish scanning and then save. Sure, you can interrupt scanning to see/save the result, but that makes the process even longer and more cumbersome.
  • In contrast, VueScan shows you each page (so you can change the black/white threshold) and saves automatically when you move to the next page. It saves each page as a separate file (pdf/jpeg/tiff/raw) with the default names crop0001.pdf, crop0002.pdf, …, which one can easily and automatically combine using either Acrobat or Ghostscript (v. 8.5 is fine).
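
Combining the per-page files with Ghostscript can be automated along these lines. This is only a sketch, assuming the gs executable is on the PATH and the pages keep the default crop0001.pdf, crop0002.pdf, … names; the function names, the "scans" folder, and the output name book.pdf are my own placeholders.

```python
import subprocess
from pathlib import Path


def page_files(folder, pattern="crop*.pdf"):
    """Return the page PDFs in order.

    The default names are zero-padded, so a plain alphabetical sort
    is also the page order.
    """
    return sorted(Path(folder).glob(pattern))


def gs_merge_command(pages, output="book.pdf"):
    """Build the Ghostscript command that concatenates PDFs via pdfwrite."""
    return [
        "gs",
        "-dBATCH",              # exit after the last input file
        "-dNOPAUSE",            # do not stop between pages
        "-q",                   # suppress startup messages
        "-sDEVICE=pdfwrite",    # write a PDF
        f"-sOutputFile={output}",
        *[str(p) for p in pages],
    ]


def merge(folder, output="book.pdf"):
    """Merge every crop*.pdf in the folder into one PDF."""
    pages = page_files(folder)
    if not pages:
        raise FileNotFoundError(f"no page files found in {folder}")
    subprocess.run(gs_merge_command(pages, output), check=True)


if __name__ == "__main__":
    merge("scans")  # "scans" is a placeholder folder name
```

The merged book.pdf can then go through djvudigital in the same way as foo.pdf.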

Posted by Victor at June 23, 2007 03:37 AM