Re: changing greenstone

From Stefan Boddie
Date Wed, 2 Apr 2003 16:50:13 +1200
Subject Re: changing greenstone
In-Reply-To (OF34B8F61A-93594464-ON69256CFA-00802A95-69256CFA-00802D3E-ntu-edu-au)
Hi Stephen,

I've been playing with the files you sent and have tested pdftohtml with
those and various other pdfs.

It seems that the -hidden flag causes no problems for pdfs that don't
contain hidden text. The only potential problem is if -hidden is set
AND -complex is set AND the pdf contains hidden text. In this case pdftohtml
produces a rather interesting looking html file with the text superimposed
over the extracted images.

My feeling is that we should simply alter pdftohtml.pl so it always uses
the -hidden switch, except when the -complex option is passed to PDFPlug.
This way we have a situation where:

1. PDFs without hidden text will continue to be processed correctly (with or
without -complex being set) since the -hidden flag has no effect on them.

2. PDFs with hidden text will be processed correctly if -complex is not set.

3. PDFs with hidden text will come out looking nice if -complex is set but
they won't have their text extracted.
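
As a rough illustration of the rule above, here is a minimal Python sketch. It is not the actual pdftohtml.pl code (which is Perl). The function name `pdftohtml_flags` and its `complex_layout` parameter are made up for this example, and the switch names are written as they appear in this thread, so the exact spelling passed to the pdftohtml binary may differ.

```python
def pdftohtml_flags(complex_layout):
    """Return the extra switches under the proposed rule.

    Hypothetical helper: -hidden is always passed, except when
    -complex was requested through PDFPlug.
    """
    if complex_layout:
        # -complex requested: leave -hidden off so the layout stays intact.
        return ["-complex"]
    # Default case: harmless for ordinary PDFs, needed for hidden-text PDFs.
    return ["-hidden"]
```

With this rule the two switches are never passed together, which is the one combination reported above to produce a messy html file.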

I don't really see an advantage in adding a new -get_hidden_text option to
PDFPlug to allow the -hidden flag to be toggled. The only added function it
gives over the above is the ability to set both -hidden and -complex. Since
all -complex does is make the result look nicer, and since -hidden
effectively trashes the way it looks, this doesn't seem so useful.

My only hesitation is that users may become a little confused if they set
the -complex option and find that some/all of their documents suddenly
contain no searchable text. For this reason I'm tempted to make it
so -hidden is always set, even if -complex is also set. This way text is
always extracted but pdfs containing hidden text come out looking rubbish
if -complex is set. As I write this I'm thinking maybe I prefer that
approach.
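
To make the trade-off between the two approaches concrete, here is a small Python model of the behaviour described in this message. It is a simplification under stated assumptions, not a description of pdftohtml internals: the function `outcome` and its parameters are hypothetical, and it only encodes what was observed with the test files.

```python
def outcome(has_hidden_text, hidden, complex_layout):
    """Model the behaviour reported in this message (not pdftohtml itself).

    Returns a tuple (text_extracted, layout_ok).
    """
    # -hidden has no effect on PDFs without hidden text; PDFs with
    # hidden text only yield their text when -hidden is set.
    text_extracted = (not has_hidden_text) or hidden
    # The only bad layout reported: hidden text + -hidden + -complex.
    layout_ok = not (has_hidden_text and hidden and complex_layout)
    return text_extracted, layout_ok

# A PDF with hidden text, built with -complex set, under each policy.
skip_hidden_with_complex = outcome(True, hidden=False, complex_layout=True)
always_hidden = outcome(True, hidden=True, complex_layout=True)
print(skip_hidden_with_complex)  # (False, True): looks nice, no searchable text
print(always_hidden)             # (True, False): searchable, but looks rubbish
```

For every other combination both policies give the same result, so the choice only matters for hidden-text PDFs in collections that set -complex.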

What do you think? Anyone else have any comments?

cheers,
Stefan.

----- Original Message -----
From: <Stephen.DeGabrielle@ntu.edu.au>
To: "Stefan Boddie" <sjboddie@cs.waikato.ac.nz>
Sent: Tuesday, April 01, 2003 11:20 AM
Subject: Re: changing greenstone


>
> Hi Stephan,
>
> > So you're saying that pdftohtml barfs on some pdf's if the -hidden switch is
>
> I don't really know to be honest - I don't think it will crash - though
> indexing no text will cause a 'broken pipe' as I mentioned before. I felt
> that this was better to implement as a switch to give users the option. (I
> have thought about a more general mechanism for using plugins with 3rd
> party converters that just passes the file and the arguments needed. -Or
> does this already exist?)
>
> I expect just setting '-hidden' will stuff up the layout of the html
> version that pdftohtml creates, and probably get worse if '-complex' is
> set.
>
> > set and barfs on others if it's not? That's kind of a problem if you're
> > building a collection containing both kinds of pdf. We'd really need to
> > detect these "image over text" pdf's and set the flag only for those that
> > need it, wouldn't we? Anyway, send me what you've done and I'll take a look
> > and discuss some more.
>
> Good point - but it is beyond me to do a 'detect hidden text' on a pdf -
> and I expect the creators will know when they have this type of file as
> they will have most likely created it as part of a scanning/digitisation
> project.
>
> Maybe the flag could be set in the metadata(.xml) for the file?
>
>
>
> Here are a couple of examples;
>
> hidden_text_pdf_files.zip
>
>
>
> Here are the three files;
>
> pdfplug_get_hidden_text.zip
>
>
>
> Regards,
>
> Stephen
>
>
>