Re: pdf to html errors

From Stefan Boddie
Date Tue, 19 Nov 2002 12:24:26 +1300
Subject Re: pdf to html errors
In-Reply-To (Pine-LNX-4-44-0211080855270-14243-100000-mail-rri-local-net)
Girija S wrote:
> Greetings from Bangalore!
>
> I have been trying to create a collection of pdf files. My system is
> windows XP and I use gsdl 2.38. I find that some of the pdf files cannot
> be included in the collection probably because they are image files as I
> learn from previous messages in the archives. (By the way, I have learned
> an enormous amount from the previous messages. Thanks to everyone)
>
> What I would like to know is - is there ANY way that we can include these
> files in the collection? If so I would like to have the details regarding
> this.
>
> Thanks a lot
>
> Girija Srinivasan
> -------------------------------------------------------------------------------
> Raman Research Institute Library, Tel: +91 80 361 0122
> C V Raman Avenue, Sadashivanagar, Fax: +91 80 361 0492
> Bangalore 560 080, India. Email: girija@rri.res.in
>

To include a file with no text (e.g. a PDF containing only images or for
which the conversion to HTML failed) you'll need to hack
gsdl/perllib/plugins/ConvertToPlug.pm. In the read() function of this
module you'll see a block of code like the following:

    &BasPlug::read_file($self, $conv_filename, $encoding, $language, $text);
    if (!length ($text)) {
        my $plugin_name = ref ($self);
        print $outhandle "$plugin_name: ERROR: $file contains no text "
            if $self->{'verbosity'};
        return 0;
    }

This forces all plugins derived from ConvertToPlug (e.g. PDFPlug,
WordPlug) to skip a file if no text could be extracted from it.
If you comment out or delete the line reading "return 0;", the plugin
should continue and include the file.

We should probably add a new plugin option to allow this as it's
preferable to include the file in many cases (even though text wasn't
extracted so it's not full-text searchable).
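
As a rough sketch of what such an option might look like, the empty-text
check could be made conditional on a flag. This is not Greenstone code:
keep_file() and the 'allow_empty' flag are made-up names for
illustration, and the blessed hashes only stand in for real plugin
objects.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the empty-text check in ConvertToPlug's read().
# Returns 0 to skip the file, 1 to carry on and include it.
sub keep_file {
    my ($self, $file, $text, $outhandle) = @_;
    if (!length($text)) {
        my $plugin_name = ref($self);
        print $outhandle "$plugin_name: WARNING: $file contains no text\n"
            if $self->{'verbosity'};
        # The existing code always returns 0 here, which drops the file.
        # 'allow_empty' is an invented flag, not a real Greenstone option.
        return 0 unless $self->{'allow_empty'};
    }
    return 1;
}

# Two fake plugin objects: one with the current behaviour, one lenient.
my $strict  = bless { verbosity => 1, allow_empty => 0 }, 'PDFPlug';
my $lenient = bless { verbosity => 1, allow_empty => 1 }, 'PDFPlug';

for my $plugin ($strict, $lenient) {
    my $kept = keep_file($plugin, 'scanned.pdf', '', \*STDERR);
    print "allow_empty=$plugin->{'allow_empty'}: ",
        ($kept ? "included" : "skipped"), "\n";
}
```

With the flag off, the behaviour matches the current code, so existing
collections would not change unless the option were switched on.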

regards,
Stefan.