Colleagues and I would like to
create Greenstone collections from nested directories that contain many
duplicates.(**) For obvious reasons, we want to represent each set of
duplicate files with a single representative and with a link from the
directory tree location of each redundant copy. I.e., the directory
tree structure represents information that we do not want to lose.

We are writing programs to scan the
input directories, putting something in place of each redundant file.
Currently the "something" is a relative link to a sole kept original
file. However, a test suggests to us that Greenstone would import the
resulting directory/file tree by (cleverly) replacing every pointer by
a copy of the file it points at, i.e., undoing what we have been at
some pains to do!
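
A minimal Python sketch of the kind of scan-and-link pass described above. It is an illustration under stated assumptions, not the actual programs: duplicates are taken to mean byte-identical files (detected by SHA-256), the "relative link" is assumed to be a POSIX symbolic link, and the names file_digest and link_duplicates are hypothetical.

```python
import hashlib
import os
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def link_duplicates(root: Path) -> dict[Path, Path]:
    """Replace each later byte-identical copy under root with a relative
    symlink to the first copy encountered. Returns {replaced: kept}."""
    kept: dict[str, Path] = {}       # digest -> the sole kept original
    replaced: dict[Path, Path] = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()              # deterministic traversal order
        for name in sorted(filenames):
            path = Path(dirpath) / name
            if path.is_symlink():    # skip links left by an earlier run
                continue
            original = kept.setdefault(file_digest(path), path)
            if original is path:     # first copy seen: keep it
                continue
            # Assumption: a relative symlink stands in for the duplicate.
            target = os.path.relpath(original, start=path.parent)
            path.unlink()
            path.symlink_to(target)
            replaced[path] = original
    return replaced
```

Calling link_duplicates(Path("collection_root")) rewrites the tree in place, so it should be run on a copy. Which copy is kept depends only on traversal order here; a real program would probably want a more deliberate rule.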

My search of the Greenstone
documentation did not reveal anything addressing our requirement. Am I
correct in this, or have I overlooked a documented solution?

A not-very-clever possible solution
is to replace each redundant file with a one-line text file that
Greenstone does not recognize, but that an end user readily
understands. E.g., it might read as an abbreviation of "This
collection element is a duplicate of the one at [relative address
including filename]." However, I can immediately think of some rough
edges in this approach. For example, for user convenience one might be
tempted to write further programs that end users would need to know
about, with all the administrative overhead and opportunities for error
that implies.
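
A minimal Python sketch of the stub-file idea, for concreteness. The function name write_stub, the ".dup" suffix, and the exact wording of the one-line message are all hypothetical, and whether Greenstone would in fact leave such a file alone on import is an untested assumption.

```python
import os
from pathlib import Path

# Hypothetical suffix, chosen on the assumption that Greenstone has no
# plugin that would claim files with this extension.
STUB_SUFFIX = ".dup"


def write_stub(duplicate: Path, original: Path) -> Path:
    """Replace a redundant file with a one-line text stub that names the
    kept original by its address relative to the stub's directory."""
    target = Path(os.path.relpath(original, start=duplicate.parent))
    stub = duplicate.with_name(duplicate.name + STUB_SUFFIX)
    # Write the stub first, so a failure leaves the duplicate untouched.
    stub.write_text(f"Duplicate of {target.as_posix()}\n", encoding="utf-8")
    duplicate.unlink()
    return stub
```

For a duplicate at b/y.txt whose kept original is a/x.txt, this leaves b/y.txt.dup containing the line "Duplicate of ../a/x.txt". Keeping the original file name inside the stub name preserves the directory-tree information, at the cost of an unfamiliar extension for end users.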

Can any listener suggest a clever
hack? Or should we be requesting a GSDL enhancement?

H.M. Gladney, Ph.D. http://home.pacbell.net/hgladney

(**) Our first case is a file
collection representing the history of the Snobol/ICON family of
programming languages. This collection consists of 75015 files in approximately 3.6 Gbyte organized
into a directory tree with about 2300 directories (internal nodes).
Among these files (the terminal nodes), there are 47094 duplicates in