Thanks for the information ... I am fairly conversant with perl and you've given some choice little nuggets to chew on. Time to do some coding I think ...
From: John R. McPherson [mailto:email@example.com]
Sent: Mon 8/4/2003 1:29 AM
To: Gregory S. Williamson
Subject: Re: [greenstone-devel] Retrieving a page from greenstone
On Sun, Aug 03, 2003 at 11:42:04PM -0700, Gregory S. Williamson wrote:
> Since I have a few dozen directories (in my test case) and this is
> on a Windows machine, I am at a bit of a loss as to how to find the
> hash code for a given document ... I could accumulate all of the
> relevant lines from the .xml files (<Metadata
> name="Identifier">HASH010ddaf18d9e03d233379b41</Metadata>, etc.) and
> put them into a dbm or other easily searched format. Or code a tool
> that searches .xml files in directories (and sub-directories, and
> perhaps sub-sub-directories ?) for the proper .xml file and then
> extract the hash.
> I was wondering, I suppose, if there is a function call that,
> given a file name would calculate the hash. I've been poking through
> various perl modules without luck so far ... (slightly slow going,
> lots of neat aspects to the code).
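For what it's worth, the directory search described above could be sketched roughly as follows. This is only an illustration in Python, not Greenstone code: the function name collect_identifiers is made up for the example, and it assumes each .xml file carries the Identifier on a single Metadata element in the form quoted above.

```python
import os
import re

# Matches elements such as <Metadata name="Identifier">HASH010d...</Metadata>
# (assumption: the element looks like the example quoted in the message).
IDENTIFIER_RE = re.compile(r'<Metadata\s+name="Identifier">([^<]+)</Metadata>')


def collect_identifiers(root_dir):
    """Walk root_dir recursively and map each .xml file path to the first
    Identifier value found in it. Files without an Identifier are skipped."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.lower().endswith(".xml"):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going on oddly encoded files.
            with open(path, encoding="utf-8", errors="replace") as handle:
                match = IDENTIFIER_RE.search(handle.read())
            if match:
                found[path] = match.group(1).strip()
    return found
```

The resulting dictionary could then be written out to a dbm file or any other easily searched format, so the scan only has to run once per import.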
We use our own hash function, so that we can get consistency in
hashes across all platforms etc. (Actually, it seems possible that
it works differently on little-endian vs big-endian machines, I'm not
sure). Anyway, the program is in the gsdlhome\bin\(operating system)
directory, so on windows, look for %GSDLHOME%\bin\windows\hashfile.exe.
Another thing you could do if you want predictable Identifiers is to
modify the plugin and define your own set_OID() function. If the import
plugin doesn't define its own, the default set_OID() function in BasPlug
is used. (This assumes you know some perl).
One last thing that might be helpful is the OIDtype option that you
can put in the collect.cfg file. If you put
in the collect.cfg (instead of the default hash method), then the
identifiers get called D0, D1, D2, etc., in the order that they are
processed at import time. (The "D" means document, and is so that the
Identifier isn't purely numeric).
Hope this helps