Re: [greenstone-users] Some PDF files are imported but not shown in the listings

From John R. McPherson
Date Thu, 23 Sep 2004 11:00:44 +1200
Subject Re: [greenstone-users] Some PDF files are imported but not shown in the listings
In-Reply-To (41502818-8080904-unesco-org-uy)
On Wed, 2004-09-22 at 01:09, Eduardo Trápani wrote:

> Yes, that seems to have been the problem: not that the characters were badly encoded, but that the PDF plugin tried to read them as utf8, which meant that the characters were no longer valid. I changed the input encoding of the PDF plugin from auto to iso_8859_1 and the files now show up in a new collection.
>
> But the old collection, even after rebuilding, still does not show them. Maybe once the metadata is wrong you have to get rid of it (or edit it) before the collection can be safely rebuilt ...

Yes, once a document has been imported, the marked-up version is kept in
the collection's archives/ directory. This means that you can update a
collection by just adding new documents to the import/ directory and
re-importing and re-building the collection.

Normally, there is no problem re-importing an existing file because the
document's ID is generated by hashing the file's contents, so it will
have the same document ID as previously and overwrite the saved copy in
archives/. However, for the PDF plugin, if you change the encoding
then the converter will generate a different file, which will have a
different hash and get a different doc ID. The copy saved under the old
ID is therefore not overwritten and stays in archives/. You can just
remove the archives/ directory, put all the originals into the import/
directory again, and then re-import and re-build the collection.
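
To illustrate why a changed encoding leaves a stale copy behind, here is a
rough sketch in Python. It is not Greenstone's code: the doc_id() function,
the use of SHA-1, and the dictionary standing in for archives/ are all
assumptions made for the example. It only shows the general behaviour of
content-derived IDs.

```python
import hashlib


def doc_id(content: bytes) -> str:
    """Hypothetical content-derived ID (SHA-1 is an assumption, not Greenstone's scheme)."""
    return hashlib.sha1(content).hexdigest()


# Bytes as they might sit in a converted PDF, encoded as ISO-8859-1.
raw = "Trápani".encode("iso-8859-1")

# Converter output when the input is wrongly read as UTF-8 (invalid byte replaced).
as_utf8 = raw.decode("utf-8", errors="replace").encode("utf-8")
# Converter output when the input is correctly read as ISO-8859-1.
as_latin1 = raw.decode("iso-8859-1").encode("utf-8")

# A dict standing in for the archives/ directory, keyed by document ID.
archives = {}
archives[doc_id(as_utf8)] = as_utf8      # first import, wrong encoding
archives[doc_id(as_latin1)] = as_latin1  # re-import after changing the encoding
archives[doc_id(as_latin1)] = as_latin1  # re-import with no change: same ID, overwrites

# Both the old and the new version are now present.
stale_and_fresh = len(archives)
```

Re-importing identical content maps to the same key and overwrites it, but
the differently converted text maps to a new key, so the old entry stays
until the stand-in archives is cleared.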

John