Re: [greenstone-users] PDFPlug

From Katherine Don
Date Thu, 17 Jun 2004 09:38:31 +1200
Subject Re: [greenstone-users] PDFPlug
In-Reply-To (5-1-1-6-0-20040616011227-02ad5130-cwpanama-net)

This could be the same problem that Diego had - here is the answer I
sent him (which I should have sent to the list as well, sorry).
Try the following fix, and also run it in expert mode so that you can
see any error messages.


Sometimes pdftohtml puts the filename as the title in the html file. We
now check for that and remove it, because it makes more sense to assign
a new title based on the text of the document, instead of using the
filename. In some cases where the html documents have no text, this means that the
html files produced by pdftohtml are identical, and therefore all end up
with the same id (generated by hashing the contents of the html file).

To fix this, we will add a new meta tag to the html if we have removed
the filename title.
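
To illustrate why the ids collide and why an extra tag separates them, here is
a minimal sketch. It is not the Greenstone code: MD5 (via the core Digest::MD5
module) only stands in for whatever hash Greenstone actually uses, and the
doc_id and add_marker helpers, the "example-marker" tag name, and its values
are all hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Assumption: MD5 stands in for the hash Greenstone really uses for ids.
sub doc_id {
    my ($html) = @_;
    return md5_hex($html);
}

# Two conversions of text-free PDFs after the filename title was blanked:
# the html contents are byte-for-byte identical.
my $empty = '<html><head><title></title></head><body></body></html>';
my ($doc_a, $doc_b) = ($empty, $empty);

# Hypothetical marker tag; its name and values are illustrative only.
sub add_marker {
    my ($html, $value) = @_;
    $html =~ s@<title></title>@<title></title><meta name="example-marker" content="$value">@;
    return $html;
}

# Compare ids before and after each document gets its own marker.
my $same_before = doc_id($doc_a) eq doc_id($doc_b);
my $same_after  = doc_id(add_marker($doc_a, 'report1'))
               eq doc_id(add_marker($doc_b, 'report2'));

print "identical ids before: ", ($same_before ? "yes" : "no"), "\n";
print "identical ids after:  ", ($same_after  ? "yes" : "no"), "\n";
```

The point is only that any per-document difference in the html is enough to
change the hash, which is what the new meta tag provides.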

You will need to edit your gsdl/bin/script/ file.
At around line 173 there is a bit like

# is this title the name of a filename?
if (-r "$title.pdf" || -r "$title.html") {
# remove the title
$line =~ s@<title>.*?</title>@<title></title>@;

replace the $line =~ ... with the following
$line =~ s@<title>.*?</title>@<title></title><META NAME="Orig-title"

Now all your converted html documents will be different and should get
different ids.


Azael Barrera, Ph.D. wrote:
> Hi Katherine, Diego and others,
> I have some sort of related problems. I had two collections I attempted
> to build today.
> One has 153 pdf documents. I create a vlist of dls.Titles instead of the
> title extracted by the
> pdf plugin. Fine, after a while, it worked.
> Then I did the same process with another collection of about 24 pdf
> files. Only 3 docs survived,
> not necessarily in sequence in the metadata.xml file and not the last
> ones in the file.
> The rest, missing in action. The build.cfg file shows clearly numdocs
> 3. What happened?
> Is this the same as, or similar to, the problem Diego faced? Should I try
> a similar solution? In a third
> disappointing try, with a collection of 12 pdfs, the result was zero.
> I am using GLI from GSDL 2.50 under RH Linux 9. I am using it in the
> librarian mode, should I
> use it with advanced expert mode to trace the problem? Never had this
> problem with 2.40a
> and 2.41.
> Thanks in advance,
> Azael.-