Tuesday, February 9, 2010

Assembling PDF from full-page scans using LaTeX

On Friday, I requested a journal article from the library, which doesn't have a subscription to an electronic copy of the journal prior to 1991, but has a hard copy in storage. Over the weekend, they fished it out and gave me a photocopy. Unfortunately, some text near the book binding was not legible in the photocopy because the margin was too narrow. They let me check out the book for a day to figure out what I wanted to do with it.

I wanted to have a PDF copy, so this is what I tried:
  • I tried using a flat-bed scanner with Adobe Acrobat, but the text near the binding showed up black.
  • I tried using a BookEye scanner, which is essentially a lamp and a digital camera. I could only photograph both pages in a single image, since the scanner's image-splitting function simply chopped off the text near the binding. The pages still showed up curved. Using a small piece of acrylic plastic, I was able to flatten one page at a time, so I photographed each two-page spread twice, once with the left page flattened and once with the right page flattened. I was able to recover the missing text close to the binding, but the PDF had duplicate pages, and half of each image had to be discarded.
  • I tried scanning with a Canon photocopier. This produced the highest quality scan, but I was not able to flatten the pages enough to recover the text near the book binding.
So I took the scanned PDF from the Canon photocopy machine and the PDF from the BookEye, extracted the images, and then used the BookEye images to patch the missing parts of the Canon scan. I also cleaned up the images. The result was several PNG files, which I then wanted to combine into a PDF.

It turns out it is easy with pdfLaTeX. I made a .tex file like this:


And ran pdfLaTeX on it to generate the PDF. The main idea is to set the page size to letter, set the page margins to 0, then include the image files while setting the width and height to those of the paper using LaTeX measurement macros. I think this approach still has some rough edges, because LaTeX complains about overfull hboxes, but the resulting PDF is usable for my needs.
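
A minimal sketch of such a file, under a few assumptions: the geometry and graphicx packages supply the zero margins and the image inclusion, \paperwidth and \paperheight are the measurement macros, and the file names page-1.png through page-3.png are hypothetical stand-ins for the cleaned-up scans. It is an illustration of the idea, not necessarily the exact file used here.

```latex
% Sketch only: page-1.png etc. are hypothetical file names.
\documentclass[letterpaper]{article}
\usepackage[letterpaper,margin=0pt]{geometry} % letter-size page, zero margins
\usepackage{graphicx}                         % provides \includegraphics
\pagestyle{empty}                             % no page numbers on top of the scans
\begin{document}
% One image per page, stretched to the full paper size.
% \noindent suppresses the paragraph indentation, which would otherwise
% shift a full-width image to the right.
\noindent\includegraphics[width=\paperwidth,height=\paperheight]{page-1.png}
\newpage
\noindent\includegraphics[width=\paperwidth,height=\paperheight]{page-2.png}
\newpage
\noindent\includegraphics[width=\paperwidth,height=\paperheight]{page-3.png}
\end{document}
```

Giving both width and height stretches each image to fill the page exactly. If the scans are not in letter proportions, adding the graphicx option keepaspectratio scales them to fit within the page instead of distorting them.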


I_resent_having_to_name_everything said...

Two small fixes:
remove the first \newpage to get rid of the extra blank page.

prefix each \includegraphics with a \noindent to remove the extra space on the left.

Likai Liu said...

\newpage only starts a new page if there is existing content (i.e. \newpage \newpage won't give you two blank pages). But you're right about \noindent.