William Hursthouse wrote:
> I am trying to put together a collection (actually, a series of
> collections) of pdf files. (A technical library) Almost all of them
> have no metadata to be extracted (and some are very large), so I wish
> to restrict the build process to only extracting metadata which I have
> entered manually for each file - yet still recognise the file is a
> pdf and have it available to click on afterwards.
> I presume the build has to extract the "Source" from the original file
> - but can I restrict it to that? (At the moment the build process also
> extracts garbage and displays it. The build also sometimes screws with
> some of the files after getting frustrated at not being able to read
> them, so they don't open at all after it has finished).
> I am working with just a few files while I experiment, but the
> collection(s) will probably have several thousand such scanned pdf
> files, so I need to know if what I am aiming for is possible, and
> hopefully receive a little guidance.
> Thanks very much
I had a similar problem and posted it to the list a while ago. Someone
recommended using UnknownPlug before PDFPlug in collect.cfg. I hadn't
tried it then because I settled on not extracting PDF metadata at all
(none of the files I included in the collection turned out to have any),
so I just used UnknownPlug - this gives you a file icon to click on, but
of course no metadata at all.
If you have metadata for the scanned files stored in a database, it's
pretty easy to set up an external file metadata.xml, with records
containing the file name plus metadata items, drop that into the import
directory and Greenstone will automatically process it. (I can post an
example if you're interested.) In my case this was easy; I had a
database with bibliographical information and pdf files whose names were
derived from combinations of database fields, so I could generate the
metadata.xml with a simple perl script.
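
To give a rough idea of what such a generator does, here is a minimal sketch in Python (my own script was Perl; this is not that script). The element names (DirectoryMetadata, FileSet, FileName, Description, Metadata) are how I remember the Greenstone metadata.xml layout, so please check them against a metadata.xml produced by your own installation. The records, file names and dc.* field names are made up for illustration, and the sketch leaves out the XML declaration and DOCTYPE line that a real metadata.xml normally starts with.

```python
import re
import xml.etree.ElementTree as ET


def build_metadata_xml(records):
    """Build metadata.xml content from (file_name, {field: value}) pairs.

    Assumption: one FileSet per file, with the metadata items inside a
    Description element, as in Greenstone's metadata.xml files.
    """
    root = ET.Element("DirectoryMetadata")
    for file_name, items in records:
        file_set = ET.SubElement(root, "FileSet")
        # FileName is matched as a regular expression, so escape
        # characters such as "." in the literal file name.
        ET.SubElement(file_set, "FileName").text = re.escape(file_name)
        description = ET.SubElement(file_set, "Description")
        for field, value in items.items():
            meta = ET.SubElement(description, "Metadata", name=field)
            meta.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical rows, standing in for the result of a database query.
records = [
    ("smith_1998_bridges.pdf",
     {"dc.Title": "Bridge Design Notes", "dc.Creator": "Smith, J."}),
    ("jones_2001_soils.pdf",
     {"dc.Title": "Soil Survey Report", "dc.Creator": "Jones, A."}),
]

xml_text = build_metadata_xml(records)
```

The resulting string would then be written out as metadata.xml in the import directory next to the pdf files, with the declaration and DOCTYPE lines added at the top.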
I'm not sure whether metadata can be processed from a separate metadata
file and, for the potentially remaining files, extracted from the pdf