I've been playing with the files you sent and have tested pdftohtml with
those and various other pdfs.
It seems that the -hidden flag causes no problems for pdfs that don't
contain hidden text. The only potential problem is if -hidden is set
AND -complex is set AND the pdf contains hidden text. In this case pdftohtml
produces a rather interesting looking html file with the text superimposed
over the extracted images.
My feeling is that we should simply alter pdftohtml.pl so it always uses
the -hidden switch, except when the -complex option is passed to PDFPlug.
This way we have a situation where:
1. PDFs without hidden text will continue to be processed correctly (with or
without -complex being set) since the -hidden flag has no effect on them.
2. PDFs with hidden text will be processed correctly if -complex is not set.
3. PDFs with hidden text will come out looking nice if -complex is set but
they won't have their text extracted.
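
The rule above can be sketched roughly as follows. This is only a small Python illustration, not the actual Perl in pdftohtml.pl: the function name build_pdftohtml_args is made up, and it assumes PDFPlug's -complex option ends up as pdftohtml's -c switch.

```python
def build_pdftohtml_args(complex_output):
    """Pick pdftohtml switches under the proposed rule (hypothetical helper)."""
    args = []
    if complex_output:
        # -complex requested: keep the nicer layout and skip hidden text.
        # "-c" is assumed to be the pdftohtml switch behind PDFPlug's -complex.
        args.append("-c")
    else:
        # Default: always ask for hidden text; harmless for ordinary PDFs.
        args.append("-hidden")
    return args
```

With this rule the two switches are never passed together, which is what avoids the superimposed-text output.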
I don't really see an advantage in adding a new -get_hidden_text option to
PDFPlug to allow the -hidden flag to be toggled. The only added function it
gives over the above is the ability to set both -hidden and -complex. Since
all -complex does is make the result look nicer, and since -hidden
effectively trashes the way it looks, this doesn't seem so useful.
My only hesitation is that users may become a little confused if they set
the -complex option and find that some/all of their documents suddenly
contain no searchable text. For this reason I'm tempted to make it
so -hidden is always set, even if -complex is also set. This way text is
always extracted but pdfs containing hidden text come out looking rubbish
if -complex is set. As I write this I'm thinking maybe I prefer that.
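
As a rough sketch of this "always extract" alternative (again illustrative Python rather than the real pdftohtml.pl, with a made-up function name and the assumption that -complex maps to pdftohtml's -c switch):

```python
def build_pdftohtml_args_always_hidden(complex_output):
    """Always pass -hidden; add the complex-layout switch on request."""
    # Text is always extracted, so every document stays searchable.
    args = ["-hidden"]
    if complex_output:
        # Assumed pdftohtml equivalent of PDFPlug's -complex; PDFs with
        # hidden text will look messy when both switches are present.
        args.append("-c")
    return args
```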
What do you think? Anyone else have any comments?
----- Original Message -----
To: "Stefan Boddie" <firstname.lastname@example.org>
Sent: Tuesday, April 01, 2003 11:20 AM
Subject: Re: changing greenstone
> Hi Stephan,
> > So you're saying that pdftohtml barfs on some pdf's if the -hidden flag is
> I don't really know to be honest - I don't think it will crash - though
> indexing no text will cause a 'broken pipe' as I mentioned before. I felt
> that this was better to implement as a switch to give users the option. (I
> have thought about a more general mechanism for using plugins with 3rd
> party converters that just passes the file and the arguments needed. Or
> does this already exist?)
> I expect just setting '-hidden' will stuff up the layout of the html
> version that pdftohtml creates, and probably get worse if '-complex' is set.
> > set and barfs on others if it's not? That's kind of a problem if you're
> > building a collection containing both kinds of pdf. We'd really need to
> > detect these "image over text" pdf's and set the flag only for those that
> > need it, wouldn't we? Anyway, send me what you've done and I'll take a
> > look and discuss some more.
> Good point - but it is beyond me to do a 'detect hidden text' on a pdf -
> and I expect the creators will know when they have this type of file as
> they will have most likely created it as part of a scanning/digitisation
> process.
> Maybe the flag could be set in the metadata(.xml) for the file?
> Here are a couple of examples;
> Here are the three files;