Thank you. These lines had seemed out of place.
Cheerio, Henry
Hi Henry,
Correction:
Under where I wrote "Solution 2" in that last email are two lines that I was
going to use to illustrate a point but didn't end up using. Please disregard
those two lines; they are probably just confusing.
<Metadata name="Identifier">HASH0102361bccb095d60673448c</Metadata>
<Metadata name="gsdlsourcefilename">import/apples/b17mie/b17mie.htm</Metadata>
Richard.
--
DL Consulting
Greenstone Digital Library and Digitisation Specialists
contact@dlconsulting.com
www.dlconsulting.com
H.M. Gladney wrote:
Thank you. Will study your #2 and consider.
And I do have an associate who has written AWK programs
for detection of duplicates and is working on a tentative solution involving a
replacement import directory. Have passed your note on to him (in
France!).
Hi Henry,
If I understand correctly, you want each unique filename to be a single
document, with the directories it appears in recorded as part of that
document's metadata.
I'd say you probably need to know some Perl, or have a Perl programmer
available, for this problem.
So let's say you had a file a.html, and the directories it appears in are
import/a, import/b, import/c/d/e/f, import/g.
You might have as part of your built collection a document display that looks
like this:
a.html
Source Directories a b c/d/e/f g
Solution 1: (see below for Solution 2, which I recommend more)
I would suggest you:
o try importing your files using UnknownPlug,
o alter the identifier or OID that Greenstone uses to identify documents in a
  collection to be your filenames, i.e. a.html in the example,
o alter UnknownPlug to add the directory that each file appears in as metadata
  to the file in the collection.
If you look at your currently imported data in the archives directory of your
collection, you will notice a doc.xml file for each of your imported files.
Inside these doc.xml files you will find something like
<Metadata name="gsdlsourcefilename">import/apples/b17mie/b17mie.htm</Metadata>
Any metadata whose name begins with "gsdl" is used internally by Greenstone
and isn't available for use in the built collection, unlike dc.Title, for
example. If it were available, you might be able to use it to accomplish the
above, but as it isn't, you need to add your own SourceFileDirectory metadata
or something similar.
When a plugin such as UnknownPlug deals with a file to be imported, it has the
file's path available, so you could add that path as metadata to each unique
filename.
Actually, after discussion with colleagues, the above solution might be
tricky, but I'll leave it there as it might be useful anyway.
Solution 2:
Here's another solution.
Write a Perl program that builds a new import directory from your existing
import data. This import directory will contain all of your unique files and a
metadata.xml file. The metadata.xml file will contain an entry for each of the
unique files, listing that file's original directory paths in a metadata item
called something like SourceFilePath. In this way, you will import all your
unique files and all their original paths as metadata.
So the entry for a.html will look something like this:
<Metadata name="Identifier">HASH0102361bccb095d60673448c</Metadata>
<Metadata name="gsdlsourcefilename">import/apples/b17mie/b17mie.htm</Metadata>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE GreenstoneDirectoryMetadata SYSTEM "http://greenstone.org/dtd/GreenstoneDirectoryMetadata/1.0/GreenstoneDirectoryMetadata.dtd">
<GreenstoneDirectoryMetadata>
  ....
  <FileSet>
    <FileName>a.html</FileName>
    <Description>
      ...
      <Metadata mode="accumulate" name="SourceFilePath">a</Metadata>
      <Metadata mode="accumulate" name="SourceFilePath">b</Metadata>
      <Metadata mode="accumulate" name="SourceFilePath">c/d/e/f</Metadata>
      <Metadata mode="accumulate" name="SourceFilePath">g</Metadata>
      ...
    </Description>
  </FileSet>
  ...
</GreenstoneDirectoryMetadata>
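For illustration only, here is a rough sketch of the kind of program Solution 2 describes. It is written in Python rather than Perl, and it rests on assumptions that are not from the thread: files with the same name are treated as copies of the same document (as with a.html in the example), the copy in the first directory found is the one kept, and the new directory lies outside the old import tree. The function names are made up for this sketch. If Greenstone interprets FileName as a pattern rather than a literal name, names containing special characters would need escaping, which is not handled here.

```python
import os
import shutil
from collections import defaultdict
from xml.sax.saxutils import escape

DOCTYPE = ('<!DOCTYPE GreenstoneDirectoryMetadata SYSTEM '
           '"http://greenstone.org/dtd/GreenstoneDirectoryMetadata/1.0/'
           'GreenstoneDirectoryMetadata.dtd">')


def collect_source_dirs(import_root):
    """Map each file name to the sorted directories (relative to import_root) it appears in."""
    dirs_by_name = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(import_root):
        rel = os.path.relpath(dirpath, import_root).replace(os.sep, "/")
        for name in filenames:
            dirs_by_name[name].append(rel)
    return {name: sorted(dirs) for name, dirs in dirs_by_name.items()}


def render_metadata_xml(dirs_by_name):
    """Render one FileSet per unique file, with one SourceFilePath entry per directory."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', DOCTYPE,
             "<GreenstoneDirectoryMetadata>"]
    for name in sorted(dirs_by_name):
        lines.append("  <FileSet>")
        lines.append(f"    <FileName>{escape(name)}</FileName>")
        lines.append("    <Description>")
        for directory in dirs_by_name[name]:
            lines.append('      <Metadata mode="accumulate" '
                         f'name="SourceFilePath">{escape(directory)}</Metadata>')
        lines.append("    </Description>")
        lines.append("  </FileSet>")
    lines.append("</GreenstoneDirectoryMetadata>")
    return "\n".join(lines) + "\n"


def build_import_dir(import_root, new_root):
    """Copy one representative of each file into new_root and write metadata.xml there."""
    dirs_by_name = collect_source_dirs(import_root)
    os.makedirs(new_root, exist_ok=True)
    for name, dirs in dirs_by_name.items():
        # Assumption: same name means same content, so any copy will do; keep the first.
        shutil.copy2(os.path.join(import_root, dirs[0], name),
                     os.path.join(new_root, name))
    with open(os.path.join(new_root, "metadata.xml"), "w", encoding="utf-8") as fh:
        fh.write(render_metadata_xml(dirs_by_name))
    return dirs_by_name
```

Calling build_import_dir("import", "import_unique") on the a.html example would leave a single a.html plus a metadata.xml listing a, b, c/d/e/f and g under SourceFilePath. If duplicates are identified by content rather than by name, the grouping step would need to key on a checksum instead.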
Good luck,
Richard.
--
DL Consulting
Greenstone Digital Library and Digitisation Specialists
contact@dlconsulting.com
www.dlconsulting.com
H.M. Gladney wrote:
Colleagues and I would like to create Greenstone
collections from nested directories that contain many duplicates.(**)
For obvious reasons, we want to represent each set of duplicate files with a
single representative and with a link from the directory tree location of
each redundant copy. I.e., the directory tree structure represents
information that we do not want to lose.
We are writing programs to scan the input
directories, putting something in place of each redundant file.
Currently the "something" is a relative link to a sole kept original file. However, a test suggests to us that Greenstone would import the
resulting directory/file tree by (cleverly) replacing every pointer by a
copy of the file it points at, i.e., undoing what we have been at some pains
to do!
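As a hedged illustration of the scanning step mentioned above (not the programs actually being written), duplicate sets could be found by grouping files on a content checksum. The sketch below is in Python, and it assumes that "duplicate" means byte-for-byte identical; the function names are invented for the example. Files are first grouped by size so that only candidates with a matching size get hashed.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_sets(root):
    """Return a list of duplicate sets; each set is a sorted list of paths with identical content."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        for path in paths:
            by_digest[file_digest(path)].append(path)

    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Each returned set could then have one member kept as the representative, with the remaining paths recorded however the collection design requires.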
My search of the Greenstone documentation did not
reveal anything addressing our requirement. Am I correct in this, or
have I overlooked a documented solution to our requirement?
A not-very-clever possible solution is to replace
each redundant file with a one-line text file that Greenstone does not
recognize, but that an end user readily understands. E.g., it might
read as an abbreviation of "This collection element is a duplicate of the
one at [relative address including filename]." However, I can
immediately think of some rough edges of what the prior paragraph suggests,
including that (for user convenience) one might be tempted to create further
programs that end users need to know about, with all the administrative
overhead and opportunities for errors implied by that.
Can any listener suggest a clever hack? Or
should we be requesting a GSDL enhancement?
Cheerio, Henry
H.M. Gladney, Ph.D.
http://home.pacbell.net/hgladney
(**) Our first case is a file collection
representing the history of the Snobol/ICON family of programming
languages. This collection consists of 75015 files in approximately 3.6 Gbyte organized into a directory
tree with about 2300 directories (internal nodes). Among these files
(the terminal nodes), there are 47094 duplicates in 10215 sets.