Re: [greenstone-users] Seeking plug-ins for program objects

From Richard Managh
Date Fri, 22 Jun 2007 14:00:05 +1200
Subject Re: [greenstone-users] Seeking plug-ins for program objects
In-Reply-To (5e9e02e30706120450y63e2fe8u507fd30ae150653f-mail-gmail-com)
Hi Henry,

Henry Gladney wrote:
Before writing any plug-ins to help ingest computer programs into a Greenstone library, I would like some information. 

The background is a Computer History Museum (CHM: see http://www.computerhistory.org/) collection of 75,000 files recording the development of the Snobol and ICON programming languages.  This collection is being used in a pilot service to test and demonstrate the suitability of GSDL as a component of CHM infrastructure.  Most of these files belong to datatypes not represented among the plug-ins delivered as part of the GSDL 2.72 distribution; many of these are computer programs with modest internal comments that could be used to construct search indices.

(1) Has anyone written plug-ins for computer program and related document types?

There currently is a plugin called SRCPlug which handles the following:

# Current languages:
#   text: READMEs/Makefiles
#   C/C++   (currently extracts #include statements and C++ class decls)
#   Perl    (currently only done as text)
#   Shell   (currently only done as text)

I'm not sure if it will do what you need it to, but it might provide a starting point.

(2) More generally, is there anywhere a listing of available plug-ins beyond those distributed with GSDL 2.72?
Not that I know of.

(3) Having ingested the Snobol collection into a "vanilla" GSDL 2.72 instance, the normal collection interface on the Web shows only about 6000 of the 75,000 collection members.  Am I correct in supposing that the shortfall occurs because the other 69,000 files are not represented by plug-ins?
Probably, yes. You can check the fail.log for details on what has not been processed.
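Alongside fail.log, a quick tally of file extensions in the source directory can show which file types make up most of the collection. The following is only a rough sketch in Python, not part of Greenstone; the function name count_extensions and the directory name "import" in the usage note are assumptions for illustration.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root):
    """Tally regular files under root by lower-cased extension.

    Files with no extension are counted under the empty string.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower()] += 1
    return counts
```

Calling count_extensions("import").most_common() from the collection directory would list the extensions from most to least frequent, which can then be compared against the file types your configured plugins accept.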


(4) Does some reader of this posting know a convenient way of showing the tree structure of the nested 3000 directory organization of the input collection?  Or of showing a browsable list of all files and directories in the same original directory as some file that has been selected by using existing GSDL search services?
You could use a Hierarchy classifier in Greenstone for this; however, it would not show any paths that do not have imported documents in them.

Greenstone may already add path metadata to your documents when it converts them to the Greenstone archive format in the archives directory of your collection.
Perhaps something like this, in each doc.xml:
<Metadata name="gsdlsourcefilename">Snobol/apples.txt</Metadata>

You can use classinfo.pl to display information about the different classifiers. According to its output, the Hierarchy classifier has an option, -suppresslastlevel:

-suppresslastlevel       Ignore the final part of the metadata value. This is
                                  useful for metadata where each value is unique, such
                                  as file paths.

So you could try in your etc/collect.cfg file for your collection something like:

classify Hierarchy -metadata gsdlsourcefilename -suppresslastlevel
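
To illustrate the idea behind ignoring the final part of a path value, here is a small Python sketch that groups file paths by their directory. It is not Greenstone's implementation; it assumes "/" separates the levels, and apart from Snobol/apples.txt the paths in the usage note are made up.

```python
from collections import defaultdict


def group_by_directory(paths):
    """Group file paths by everything before the final '/' component."""
    groups = defaultdict(list)
    for p in paths:
        # rpartition splits on the last '/', so parent is the directory part
        # and name is the final (unique) component that gets dropped.
        parent, _, name = p.rpartition("/")
        groups[parent].append(name)
    return dict(groups)
```

For example, group_by_directory(["Snobol/apples.txt", "Snobol/src/main.c"]) puts apples.txt under "Snobol" and main.c under "Snobol/src", so each directory becomes a node with its files beneath it rather than every file forming its own branch.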


Hopefully this will help you achieve what you want.


Richard.
-- 
DL Consulting
Greenstone Digital Library and Digitisation Specialists
contact@dlconsulting.com
www.dlconsulting.com