RE: [greenstone-users] Importing symbolic links (replacing duplicates) Correction

From H.M. Gladney
Date Thu, 24 May 2007 17:47:36 -0700
Subject RE: [greenstone-users] Importing symbolic links (replacing duplicates) Correction
In-Reply-To (465602CF-50000-dlconsulting-co-nz)
Thank you.  These lines had seemed out of place.
Cheerio, Henry

From: Richard Managh
Sent: Thursday, May 24, 2007 2:26 PM
To: H.M. Gladney; 'greenstone user list'
Subject: Re: [greenstone-users] Importing symbolic links (replacing duplicates) Correction

Hi Henry,


Just below where I wrote "Solution 2" in my last email are two lines that I was going to use to illustrate a point but didn't end up using. Please disregard those two lines; they are probably just confusing.

<Metadata name="Identifier">HASH0102361bccb095d60673448c</Metadata>
<Metadata name="gsdlsourcefilename">import/apples/b17mie/b17mie.htm</Metadata>

DL Consulting
Greenstone Digital Library and Digitisation Specialists

H.M. Gladney wrote:
Thank you.  Will study your #2 and consider. 
And I do have an associate who has written AWK programs for detection of duplicates and is working on a tentative solution involving a replacement import directory.  Have passed your note on to him (in France!).
Cheerio, Henry
H.M. Gladney, Ph.D.

From: Richard Managh
Sent: Wednesday, May 23, 2007 3:21 PM
To: H.M. Gladney; greenstone user list
Subject: Re: [greenstone-users] Importing symbolic links (replacing duplicates)

Hi  Henry,

If I understand correctly, you want each document to have a unique filename, with the directories it appears in recorded as part of that file's metadata.

I'd say you probably need to know some Perl, or need a Perl programmer, for this problem.

So let's say you had a file a.html, and the directories it appears in are import/a, import/b, import/c/d/e/f and import/g.

You might have, as part of your built collection, a document display that looks like this:


Source Directories

Solution 1: (see below for Solution 2, which I recommend over this one)

I would suggest you:
 o try importing your files using UnknownPlug,
 o alter the identifier (OID) that Greenstone uses to identify documents in a collection so that it is your filename, i.e. a.html in the example,
 o alter UnknownPlug to add the directory that each file appears in as metadata to the file in the collection.

If you look at your currently imported data in the archives directory of your collection, you will notice doc.xml files for each of your imported files.

Inside these doc.xml files you will find something like:

<Metadata name="gsdlsourcefilename">import/apples/b17mie/b17mie.htm</Metadata>

Any metadata whose name begins with "gsdl" is used internally by Greenstone and isn't available for use in the built collection, unlike dc.Title, for example.
If it were available, you might be able to use it to accomplish the above, but as it isn't, you need to add your own metadata, called SourceFileDirectory or something similar.

When a plugin such as UnknownPlug processes a file to be imported, it has the file's path available; you could add that path as metadata to each unique filename.

Actually, after discussion with colleagues the above solution might be tricky, but I'll leave it there as it might be useful anyway.

Solution 2:

Here's another solution:

Write a Perl program which builds a new import directory from your existing import data.
This import directory will contain all of your unique files, and a metadata.xml file.
The metadata.xml file will contain an entry for each of the unique files, listing that file's original directory paths in a metadata item called something like SourceFilePath.
In this way, you will import all your unique files and all their original paths as metadata.

So the entry for a.html will look something like this:

<Metadata name="Identifier">HASH0102361bccb095d60673448c</Metadata>
<Metadata name="gsdlsourcefilename">import/apples/b17mie/b17mie.htm</Metadata>
<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE GreenstoneDirectoryMetadata SYSTEM "">
            <Metadata mode="accumulate" name="SourceFilePath">a</Metadata>
            <Metadata mode="accumulate" name="SourceFilePath">b</Metadata>
            <Metadata mode="accumulate" name="SourceFilePath">c/d/e/f</Metadata>
            <Metadata mode="accumulate" name="SourceFilePath">g</Metadata>
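A minimal sketch of the metadata.xml half of Solution 2, written in Python rather than Perl purely for illustration. The function name metadata_xml and its input mapping are hypothetical. The DirectoryMetadata, FileSet, FileName and Description wrapper elements do not appear in the message above; they are assumed here from the usual Greenstone metadata.xml layout, as is the treatment of FileName as a regular expression, so check both against your Greenstone version. The DOCTYPE line is left out because its system identifier is not given above.

```python
import re
from xml.sax.saxutils import escape


def metadata_xml(source_dirs):
    """Return metadata.xml text for a {filename: [original directories]} mapping.

    Hypothetical helper: the wrapper element names are an assumption about
    Greenstone's metadata.xml layout, not taken from the message above.
    """
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<DirectoryMetadata>"]
    for filename in sorted(source_dirs):
        lines.append("  <FileSet>")
        # Assumption: FileName is matched as a regular expression,
        # so literal characters such as "." are escaped.
        lines.append("    <FileName>%s</FileName>" % escape(re.escape(filename)))
        lines.append("    <Description>")
        for directory in source_dirs[filename]:
            # mode="accumulate" keeps every path instead of overwriting.
            lines.append(
                '      <Metadata mode="accumulate" name="SourceFilePath">%s</Metadata>'
                % escape(directory)
            )
        lines.append("    </Description>")
        lines.append("  </FileSet>")
    lines.append("</DirectoryMetadata>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The a.html example from the message above.
    print(metadata_xml({"a.html": ["a", "b", "c/d/e/f", "g"]}))
```

Called with the a.html example, this produces one FileSet entry for a.html carrying the four SourceFilePath values listed above. Copying the unique files into the new import directory is left out of the sketch.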

Good luck,

DL Consulting
Greenstone Digital Library and Digitisation Specialists

H.M. Gladney wrote:

Colleagues and I would like to create Greenstone collections from nested directories that contain many duplicates.(**)  For obvious reasons, we want to represent each set of duplicate files with a single representative and with a link from the directory tree location of each redundant copy.  I.e., the directory tree structure represents information that we do not want to lose.

We are writing programs to scan the input directories, putting something in place of each redundant file.  Currently the "something" is a relative link to a sole kept original file.  However, a test suggests to us that Greenstone would import the resulting directory/file tree by (cleverly) replacing every pointer by a copy of the file it points at, i.e., undoing what we have been at some pains to do!
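The scanning step described above might be sketched as follows, in Python purely for illustration; the message does not say how the authors' own programs work. The function name duplicate_sets is hypothetical, and comparing files by the SHA-1 digest of their contents is an assumption about what counts as a duplicate.

```python
import hashlib
import os
from collections import defaultdict


def duplicate_sets(root):
    """Return a list of duplicate sets found under root.

    Each set is a sorted list of paths (relative to root) whose contents
    have the same SHA-1 digest. Hypothetical helper for illustration only.
    """
    by_digest = defaultdict(list)
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # ignore links left by an earlier de-duplication pass
            digest = hashlib.sha1()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large files are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            by_digest[digest.hexdigest()].append(os.path.relpath(path, root))
    # Only groups with more than one member are duplicate sets.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Within each returned set, one path (for instance the first) could be kept as the representative and the others replaced or recorded as metadata.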

My search of the Greenstone documentation did not reveal anything addressing our requirement.  Am I correct in this, or have I overlooked a documented solution to our requirement?

A not-very-clever possible solution is to replace each redundant file with a one-line text file that Greenstone does not recognize, but that an end user readily understands.  E.g., it might read as an abbreviation of "This collection element is a duplicate of the one at [relative address including filename]."  However, I can immediately think of some rough edges of what the prior paragraph suggests, including that (for user convenience) one might be tempted to create further programs that end users need to know about, with all the administrative overhead and opportunities for errors implied by that.

Can any listener suggest a clever hack?  Or should we be requesting a GSDL enhancement?

Cheerio, Henry
H.M. Gladney, Ph.D.

(**) Our first case is a file collection representing the history of the Snobol/ICON family of programming languages.  This collection consists of 75015 files in approximately 3.6 Gbyte organized into a directory tree with about 2300 directories (internal nodes).  Among these files (the terminal nodes), there are 47094 duplicates in 10215 sets.