Re: [greenstone-users] Some PDF files are imported but not shown in the listings

From John R. McPherson
Date Tue, 21 Sep 2004 12:39:25 +1200
Subject Re: [greenstone-users] Some PDF files are imported but not shown in the listings
In-Reply-To (414EBE4D-5000400-esperanto-org-uy)
On Mon, 2004-09-20 at 23:26, Eduardo Trápani wrote:
> Hi,
>
> I'm trying to set up my own library after having successfully followed most of the three-day course.
>
> I added three PDF files. I got no errors during creation and all three files are taken into account. But I only get two files when browsing by files or by titles.
>
> ex.Source is set alright, how can it be that the file does not show when listing by filename? (I created the library without any fancy options, just Dublin metadata set and nothing else).
>
> I then tried to "enrich" the file by adding a title, dc.Title. Then I added an index and the possibility to browse by dc.Title and the file still does not show!
>
> Is there any way to debug that? I'm checking it now and it happens with many PDF files.

The most common reason for a file to be imported but left out of the
built results is that the imported file contains badly encoded characters.
For PDF files, this often happens when the third-party pdftohtml program
we use can't extract the text successfully and extracts binary data
instead.

This should show up in a build log - it will say something about not
processing a file because it contains invalid characters. I don't know
how to see the build log when you use the GLI program, but you could try
building from the command line as documented in the user manual to get
more debugging information.
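
As a rough way to check a file yourself, the sketch below looks for the
first byte sequence in a file that is not valid UTF-8 and reports where
it is. It is written in Python purely for illustration (it is not part
of Greenstone), and the names find_invalid_utf8 and check_file are made
up for this example.

```python
from typing import Optional, Tuple


def find_invalid_utf8(data: bytes) -> Optional[Tuple[int, bytes]]:
    """Return (offset, offending bytes) for the first invalid UTF-8
    sequence in data, or None if the data decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start and err.end delimit the bytes the decoder rejected.
        return err.start, data[err.start:err.end]
    return None


def check_file(path: str) -> Optional[Tuple[int, bytes]]:
    """Hypothetical helper: run the same check on a file on disk."""
    with open(path, "rb") as handle:
        return find_invalid_utf8(handle.read())
```

Running check_file on an imported document's XML file would give the
byte offset of the first problem, which can then be inspected in an
editor.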

> In case you care to take a look at the PDF file, it is at: http://www.unesco.org.uy/phi/libros/analisisMaule.pdf

For this file, pdftohtml correctly extracted and encoded the text from
the file, but it extracted the name of the author in the wrong encoding,
which results in an invalid Greenstone Archive Format XML file:
<META name="author" content="Centro de Informática">
Part of the problem seems to be that the PDF standard doesn't appear to
have a way of saying which encoding the metadata uses.

We've recently added some code that tries to detect document metadata
that isn't properly encoded and fix it up, and this will be in the
next release of Greenstone. For now, probably the best way to deal with
this is to manually assign metadata for that file to override the badly
encoded metadata inside the file.
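
The detect-and-fix idea can be sketched roughly as follows. This is not
the actual Greenstone code; it is a Python illustration that simply
assumes any value that fails to decode as UTF-8 is really ISO-8859-1.
That fallback is a guess rather than a detection, and the function name
repair_metadata is hypothetical.

```python
def repair_metadata(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode a metadata value as UTF-8; if that fails, decode it with
    the fallback encoding instead (an assumption, not a detection)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this never
        # raises, but the result is only right if the guess is right.
        return raw.decode(fallback)


# A value that is already valid UTF-8 passes through unchanged; one
# stored as ISO-8859-1 bytes is converted before being written out.
fixed = repair_metadata(b"Centro de Inform\xe1tica")
utf8_bytes = fixed.encode("utf-8")
```

The weak point is the fallback: text in some other encoding would be
decoded without error but come out wrong, which is why overriding the
metadata by hand is the safer option for now.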

John McPherson