[greenstone-users] Importing symbolic links (replacing duplicates)

From H.M. Gladney
Date Thu, 10 May 2007 18:33:17 -0700
Subject [greenstone-users] Importing symbolic links (replacing duplicates)

Colleagues and I would like to create Greenstone collections from nested directories that contain many duplicates.(**)  For obvious reasons, we want to represent each set of duplicate files with a single representative and with a link from the directory tree location of each redundant copy.  I.e., the directory tree structure represents information that we do not want to lose.

We are writing programs to scan the input directories, putting something in place of each redundant file.  Currently the "something" is a relative link to a sole kept original file.  However, a test suggests to us that Greenstone would import the resulting directory/file tree by (cleverly) replacing every pointer by a copy of the file it points at, i.e., undoing what we have been at some pains to do!
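For concreteness, the scan-and-replace step described above might look roughly like the following. This is only a minimal sketch, not our actual programs: the function names are made up for illustration, duplicates are assumed to mean byte-identical files (detected here by size plus SHA-256 digest), and the "sole kept original" is arbitrarily taken to be the first path in sorted order. It assumes a platform where creating symbolic links is permitted.

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicate_sets(root: Path) -> list[list[Path]]:
    """Group regular files under root by content; keep groups with more than one member."""
    by_content = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            p = Path(dirpath) / name
            # Skip existing links and anything that is not a regular file.
            if p.is_symlink() or not p.is_file():
                continue
            by_content[(p.stat().st_size, file_digest(p))].append(p)
    return [sorted(paths) for paths in by_content.values() if len(paths) > 1]


def replace_with_relative_links(dup_set: list[Path]) -> Path:
    """Keep the first path (an arbitrary choice); turn the rest into relative symlinks to it."""
    keeper, *redundant = dup_set
    for p in redundant:
        # Express the link target relative to the directory that holds the link.
        target = os.path.relpath(keeper, start=p.parent)
        p.unlink()
        p.symlink_to(target)
    return keeper
```

Calling `replace_with_relative_links` on each list returned by `find_duplicate_sets(Path("input_tree"))` (a hypothetical directory name) leaves the tree shape intact while storing each distinct file once. Relative rather than absolute targets are used so that the tree can be moved as a whole without breaking the links.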

My search of the Greenstone documentation did not reveal anything addressing our requirement.  Am I correct in this, or have I overlooked a documented solution to our requirement?

A not-very-clever possible solution is to replace each redundant file with a one-line text file that Greenstone does not recognize, but that an end user readily understands.  E.g., it might read as an abbreviation of "This collection element is a duplicate of the one at [relative address including filename]."  However, I can immediately think of some rough edges in this approach, including that (for user convenience) one might be tempted to create further programs that end users need to know about, with all the administrative overhead and opportunities for errors that this implies.
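A minimal sketch of that placeholder idea follows. The function name and the ".duplicate" suffix are hypothetical, chosen only for illustration; whether Greenstone would in fact leave files with such a suffix alone is an assumption I have not checked.

```python
import os
from pathlib import Path

# Hypothetical suffix; assumed (not verified) to be one Greenstone does not recognize.
STUB_SUFFIX = ".duplicate"


def replace_with_stub(redundant: Path, keeper: Path) -> Path:
    """Replace a redundant copy with a one-line text file that names the kept original."""
    # Relative address of the kept file, as seen from the stub's own directory.
    target = Path(os.path.relpath(keeper, start=redundant.parent)).as_posix()
    stub = redundant.with_name(redundant.name + STUB_SUFFIX)
    # Write the stub first, so a failure here never loses the redundant copy.
    stub.write_text(
        f"This collection element is a duplicate of the one at {target}.\n",
        encoding="utf-8",
    )
    redundant.unlink()
    return stub
```

The stub keeps the original filename as a prefix, so a reader browsing the directory can still see what used to be there.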

Can any listener suggest a clever hack?  Or should we be requesting a GSDL enhancement?

Cheerio, Henry
 
H.M. Gladney, Ph.D.  http://home.pacbell.net/hgladney

(**) Our first case is a file collection representing the history of the Snobol/ICON family of programming languages.  This collection consists of 75015 files in approximately 3.6 Gbyte organized into a directory tree with about 2300 directories (internal nodes).  Among these files (the terminal nodes), there are 47094 duplicates in 10215 sets.