[greenstone-devel] PagedImagePlugin and PDFs

From Anupama of Greenstone Team
Date Fri Jan 7 16:08:25 2011
Subject [greenstone-devel] PagedImagePlugin and PDFs
In-Reply-To (4D25FC12-9050005-gmx-com)
Hello Yitzchak,

Thank you very much for your kind and helpful contribution. When
Katherine comes on Monday, I hope to show her the code you have written.
(If you are willing, she may like it to be incorporated into the default
PagedImagePlugin.)

The recently released Release Candidate for Greenstone 2.84 uses
"PDFBox" (which is to be downloaded as an extension) to try to obtain
better conversions from PDF. I am curious whether Greenstone will
respond better now to your single-page PDFs and whether they will
preserve the images as well.

If you have the time and are interested, you may wish to try out the
GS2.84 release candidate. So far, binaries are available for Windows and
Linux. They have not been extensively tested yet, but I think the PDFBox
integration may have been exercised when Greenstone's interaction with it
was coded up.

The 2.84 release candidate is available from
http://www.greenstone.org/snapshots
and the PDFBox extension is there as well.

Thanks again for your regular enthusiastic contributions,
Anupama

Yitzchak Schaffer wrote:
> Hello all,
>
> We are starting work on a book collection based on single-page PDFs. We
> need to extract the text from the PDFs for indexing. We are planning to
> use the collection with our EmeraldView standalone frontend, with a new
> paged PDF presentation scheme.
>
> It looks to me like the PDFPlugin should be able to handle this with the
> convertto option, but Ghostscript was choking on our PDFs. Generating
> images is not important, as we plan on presenting the PDFs as-is only.
>
> In experimenting with the import process to get this to work, I have
> produced a modified PagedImagePlugin that appears to do what we need. I
> attach a patch here in case it might prove useful to anyone else. It
> assumes that one has the pdftotext executable installed; on my dev
> machine I just put it in GSDLHOME\bin\windows.
>
> The patch includes a few lines that are collection-specific, around line
> 40 of the patch.
>
> Cheers,
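
For readers who want a feel for the text-extraction step described in the quoted message, here is a minimal sketch. It is not the attached patch (which modifies the Perl PagedImagePlugin and is not reproduced here); it only illustrates calling pdftotext once per single-page PDF and collecting the text for indexing. It assumes pdftotext is on the PATH, and the function names and the one-directory-of-pages layout are hypothetical.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, executable="pdftotext"):
    """Return the argument list for extracting one PDF's text to stdout."""
    # "-enc UTF-8" selects the output encoding; the trailing "-" asks
    # pdftotext to write to stdout instead of creating a .txt file.
    return [executable, "-enc", "UTF-8", str(pdf_path), "-"]


def extract_page_texts(page_dir, executable="pdftotext"):
    """Map each single-page PDF in page_dir (sorted by name) to its text.

    Assumption: every *.pdf file in page_dir is one page of the book.
    """
    texts = {}
    for pdf_path in sorted(Path(page_dir).glob("*.pdf")):
        result = subprocess.run(
            build_pdftotext_command(pdf_path, executable),
            capture_output=True,
            check=True,  # raise CalledProcessError if pdftotext fails
        )
        texts[pdf_path.name] = result.stdout.decode("utf-8", errors="replace")
    return texts
```

Calling extract_page_texts on a directory of page PDFs returns a dictionary keyed by file name, which could then be passed to whatever indexing step the collection uses.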