Re: changing greenstone

From Stefan Boddie
Date Mon, 31 Mar 2003 21:50:30 +1200
Subject Re: changing greenstone
In-Reply-To (OF28964AFB-C586BE53-ON69256CFA-00163300-69256CFA-0016332A-ntu-edu-au)
>
> I just added a switch to PDFPlug. I think it works properly - but I was
> hoping someone could tell me the correct approval, testing and submission
> procedures?
>

Um, procedures for stuff like this are something we're a little short of, I'm
afraid. In the past (and this hasn't happened more than a few times) I've
asked people to send the files they've altered to me to check and add to
CVS. So, if you send me the altered files and, if possible, a pdf file for
which the new option is required, I'll have a play with it and put it in.

I'm hoping that contributions like this from outside the core development
team will become more common now that greenstone is becoming more widely
used. That said, we'll have to do some work on the way we manage things.

> The switch is '-get_hidden_text' and it allows greenstone to retrieve and
> index 'hidden' text in PDF files created from scanned documents.
>
> A common way to deliver searchable documents is as PDF files with the page
> image covering the invisible and uncorrected OCR'd text.
>
> The current version of Greenstone spits the dummy on me when I try to build
> collections of these documents --> I get the dreaded and uninformative
> 'broken pipe' error at the end of the build process.
>
> When I first encountered this problem I just hacked the pdftohtml.pl file
> to include the '-hidden' switch in the arguments for the pdftohtml
> converter.
> Because I felt this would cause problems if I tried to convert regular pdf
> files (especially if I used the '-complex' switch), I have changed a few
> files to give PDFPlug a switch to 'get hidden text'.
>

So you're saying that pdftohtml barfs on some pdf's if the -hidden switch is
set and barfs on others if it's not? That's kind of a problem if you're
building a collection containing both kinds of pdf. We'd really need to
detect these "image over text" pdf's and set the flag only for those that
need it, wouldn't we? Anyway, send me what you've done and I'll take a look
and discuss some more.
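
To make the "set the flag only for those that need it" idea concrete, here is
a rough sketch of one possible per-file fallback: convert without -hidden
first, and retry with -hidden only if that attempt fails or yields almost no
text. It is written in Python purely for illustration (the real plugin code is
Perl), and the function names, the run callback and the 20-character threshold
are all assumptions, not anything that exists in Greenstone. The only
pdftohtml switch used is the -hidden one mentioned above.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical callback type: runs a command line and returns the generated
# HTML as a string, or None if the conversion failed.
Runner = Callable[[List[str]], Optional[str]]


def build_args(pdf_path: str, out_base: str, hidden: bool) -> List[str]:
    """Build a pdftohtml command line, adding -hidden only when asked."""
    args = ["pdftohtml"]
    if hidden:
        args.append("-hidden")  # the switch named in the quoted message
    return args + [pdf_path, out_base]


def visible_text_length(html: str) -> int:
    """Count non-whitespace characters left after stripping HTML tags."""
    text = re.sub(r"<[^>]*>", " ", html)
    return len("".join(text.split()))


def convert(pdf_path: str, out_base: str, run: Runner,
            min_chars: int = 20) -> Tuple[Optional[str], bool]:
    """Try a normal conversion first; fall back to -hidden if it looks empty.

    Returns (html, used_hidden). The min_chars threshold is an arbitrary
    assumption and would need tuning against real documents.
    """
    html = run(build_args(pdf_path, out_base, hidden=False))
    if html is not None and visible_text_length(html) >= min_chars:
        return html, False
    # Either the converter failed or almost no text came out, which is what
    # we would expect from an "image over text" pdf: retry with -hidden.
    return run(build_args(pdf_path, out_base, hidden=True)), True
```

With something along these lines a mixed collection would not need a
collection-wide switch, at the cost of running the converter twice on the
scanned files. Whether "almost no text" is a reliable enough signal is an open
question.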

> The files I have changed are;
> PDFPlug.pm
> gsConvert.pl
> pdftohtml.pl
>
> I know it's a small thing but I feel it is important to contribute to open
> source software projects (since I use so many - but pay for so few).
>

It's much appreciated.

cheers,
Stefan.