Re: changing greenstone

From: Duane Milne
Date: 02 Apr 2003 17:55:58 +1200
Subject: Re: changing greenstone
In-Reply-To: (00ae01c2f8d8$af68a390$0200a8c0-lomu)

Hi Stefan,

How about outputting a warning or error message if both are selected?
If you think the combination will be dangerous, then after outputting
the error/warning somewhere, have the script act as if -hidden was
selected and -complex was not...
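
A rough sketch of the fallback I mean, written in Python rather than the
Perl of pdftohtml.pl. The function name and arguments are made up for
illustration and are not taken from the Greenstone source:

```python
import sys


def resolve_flags(hidden, complex_, log=sys.stderr):
    """Return the (hidden, complex) pair the script should actually use.

    Hypothetical helper: if both options are selected, warn and behave
    as if only -hidden had been given.
    """
    if hidden and complex_:
        print("warning: -hidden and -complex were both selected; "
              "ignoring -complex", file=log)
        return True, False
    return hidden, complex_
```

Every other combination passes through unchanged, so only the one risky
case is affected.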

Cheers,
Duane

[This reply includes the list, which I missed yet again]

On Wed, 2003-04-02 at 16:50, Stefan Boddie wrote:
> Hi Stephen,
>
> I've been playing with the files you sent and have tested pdftohtml with
> those and various other pdfs.
>
> It seems that the -hidden flag causes no problems for pdfs that don't
> contain hidden text. The only potential problem is if -hidden is set
> AND -complex is set AND the pdf contains hidden text. In this case pdftohtml
> produces a rather interesting looking html file with the text superimposed
> over the extracted images.
>
> My feeling is that we should simply alter pdftohtml.pl so it always uses
> the -hidden switch, except when the -complex option is passed to PDFPlug.
> This way we have a situation where:
>
> 1. PDFs without hidden text will continue to be processed correctly (with or
> without -complex being set) since the -hidden flag has no effect on them.
>
> 2. PDFs with hidden text will be processed correctly if -complex is not set.
>
> 3. PDFs with hidden text will come out looking nice if -complex is set but
> they won't have their text extracted.
>
> I don't really see an advantage in adding a new -get_hidden_text option to
> PDFPlug to allow the -hidden flag to be toggled. The only added function it
> gives over the above is the ability to set both -hidden and -complex. Since
> all -complex does is make the result look nicer and since -hidden
> effectively trashes the way it looks this doesn't seem so useful.
>
> My only hesitation is that users may become a little confused if they set
> the -complex option and find that some/all of their documents suddenly
> contain no searchable text. For this reason I'm tempted to make it
> so -hidden is always set, even if -complex is also set. This way text is
> always extracted but pdfs containing hidden text come out looking rubbish
> if -complex is set. As I write this I'm thinking maybe I prefer that
> approach.
>
> What do you think? Anyone else have any comments?
>
> cheers,
> Stefan.
>
> ----- Original Message -----
> From: <Stephen.DeGabrielle@ntu.edu.au>
> To: "Stefan Boddie" <sjboddie@cs.waikato.ac.nz>
> Sent: Tuesday, April 01, 2003 11:20 AM
> Subject: Re: changing greenstone
>
>
> >
> > Hi Stefan,
> >
> > > So you're saying that pdftohtml barfs on some pdf's if the -hidden
> > > switch is
> >
> > I don't really know to be honest - I don't think it will crash - though
> > indexing no text will cause a 'broken pipe' as I mentioned before. I felt
> > that this was better to implement as a switch to give users the option. (I
> > have thought about a more general mechanism for using plugins with 3rd
> > party converters that just passes the file and the arguments needed. -Or
> > does this already exist?)
> >
> > I expect just setting '-hidden' will stuff up the layout of the html
> > version that pdftohtml creates, and probably get worse if '-complex' is
> > set.
> >
> > > set and barfs on others if it's not? That's kind of a problem if you're
> > > building a collection containing both kinds of pdf. We'd really need to
> > > detect these "image over text" pdf's and set the flag only for those
> > > that need it, wouldn't we? Anyway, send me what you've done and I'll
> > > take a look and discuss some more.
> >
> > Good point - but it is beyond me to do a 'detect hidden text' on a pdf -
> > and I expect the creators will know when they have this type of file as
> > they will have most likely created it as part of a scanning/digitisation
> > project.
> >
> > Maybe the flag could be set in the metadata(.xml) for the file?
> >
> >
> >
> > Here are a couple of examples;
> >
> > hidden_text_pdf_files.zip
> >
> >
> >
> > Here are the three files;
> >
> > pdfplug_get_hidden_text.zip
> >
> >
> >
> > Regards,
> >
> > Stephen
> >
> >
> >
>