All about associated files

From: Gordon Paynter
Date: Thu, 7 Feb 2002 17:13:27 -0800
Subject: All about associated files
In-Reply-To: (77261E830267D411BD4D00902740AC2508D640A7-xvan01-vcd-hp-com)

Here's an explanation of the difference between Greenstone
documents "in the database" and Greenstone documents stored as
associated files. I hope it will not further confuse anyone...

On Thursday 07 February 2002, someone wrote:

> I'm dying to ask one quiet question about non-textual artifacts such
> as digital photos, Gordon's Quicktime movies and 'executables' like
> an Excel spreadsheet.

> Can a Greenstone digital library store these types of artifacts as
> 'chunks' or 'blobs' (binary, large objects) in its database? Or,
> are they stored in their native format alongside the database

The term you're looking for is "associated file". The text and
metadata of every document in a collection are stored in the GDBM
database. However, each document can also have several "associated
files" stored in the collection (in the index/assoc directory). A web
browser can load any such associated file if it knows the URL.

A good example is a web page that includes text and images. In this
case, HTMLPlug stores the text of the web page in the GDBM database
and each of the images as associated files.

Some document plugins (e.g. TextPlug, EmailPlug, ReferPlug) process a
document and store only text in the GDBM database. Some (e.g.
ImagePlug, UnknownPlug) process a document and store data only as
associated files. Some plugins (e.g. WordPlug, PDFPlug, HTMLPlug)
store data in both ways.
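
To keep the three groups straight, here is a rough Python sketch of
the breakdown above. The STORAGE table and the plugins_storing_in
helper are made up for illustration and are not part of Greenstone;
"database" and "assoc" only describe where the document data goes
(metadata may still end up in the database).

```python
# Where each plugin puts document data, as described in this message.
# Hypothetical lookup table for illustration only, not a Greenstone API.
STORAGE = {
    "TextPlug": {"database"},
    "EmailPlug": {"database"},
    "ReferPlug": {"database"},
    "ImagePlug": {"assoc"},
    "UnknownPlug": {"assoc"},
    "WordPlug": {"database", "assoc"},
    "PDFPlug": {"database", "assoc"},
    "HTMLPlug": {"database", "assoc"},
}


def plugins_storing_in(place):
    """Return the plugin names (sorted) that store data in `place`."""
    return sorted(name for name, places in STORAGE.items() if place in places)


print(plugins_storing_in("assoc"))
```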

> in a way so
> the database can provide access to them via a <srclink> tag?

The srclink "tag" (which is really a metadata element, I think) is
only set by PDFPlug and WordPlug (as far as I know). These plugins
are unique in that they take the original document, extract its text
and store it in the GDBM database, then store the original document in
the collection as an associated file. (The URL of this associated
file is stored in the srclink tag so you can retrieve it later.) In
effect, the document is in the collection twice, once in its original
form, and once as text.

This is a bit different from how HTMLPlug works: there the body of the
document is in the database, and supplementary data (i.e. images) are
in the associated files. And it is very different from ImagePlug and
UnknownPlug, where there is no "document text" in the GDBM database
(though there may be metadata), and the documents are stored only as
associated files.

(Note: I think there are options to WordPlug & PDFPlug that let you
throw away the associated file.)

> I thought I read somewhere (but can't find it now) that one could
> submit a URL and associated meta-data to the Collector as a means of
> allowing Greenstone to index an item without having to absorb the
> referred-to artifact into the database itself. The example I think
> I'm recalling had to do with a large photo library.

I don't know much about the Collector, sorry. Perhaps Stefan
can provide some insight?

Warning: what follows is a too-detailed explanation of how
you can get the URLs of associated files. If you aren't
really interested, you should probably stop here.

> I've got copies of the User's Guide, Developer's Guide and
> Installation Guide, but can only find the briefest mention of the
> <srclink> tag and how it works. Are there other documents I've
> missed that give a more complete explanation about how to index
> references to 'blobs' that live outside the Greenstone DB?

In order to "get to" an associated file, you need a URL for it. As
noted, srclink is just such a URL, and is generated for you by
PDFPlug and WordPlug. In HTMLPlug, the URLs of images in a document
are usually changed so that they point to the URLs of the new
associated files, not the original images (Note: the behaviour in The
Collector may be different).

For other plugins, you have to know what your plugin will name the
file, then somehow create a URL for the associated file in your
format strings. This is not hard, but it can be fiddly. The directory
that holds all the associated files for a document (in a collection
called pictures) can be reached by placing the following URL in a
format string in the collect.cfg file:
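
Going by the doc.doc link further down, a URL of that kind would
presumably take the shape produced by this small Python sketch. The
helper name assoc_dir_url is hypothetical; _httpprefix_ and
[assocfilepath] are deliberately left as literal placeholders for
Greenstone to fill in.

```python
def assoc_dir_url(collection):
    """Build a format-string fragment for a document's assoc directory.

    Assumption: the layout matches the doc.doc link in this message,
    i.e. <prefix>/collect/<collection>/index/assoc/<assocfilepath>.
    """
    # The placeholders are not expanded here; they stay in the string
    # exactly as they would be written in collect.cfg.
    return f"_httpprefix_/collect/{collection}/index/assoc/[assocfilepath]"


print(assoc_dir_url("pictures"))
```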


Here _httpprefix_ will be set according to your collect.cfg file, and
[assocfilepath] will be set on a document-by-document basis for every
document. If you know there is a thumbnail image in there, and its
filename is stored in a metadata element called Thumb (ImagePlug works
like this), then you can embed the following in your format string to
display that associated file:
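
As a sketch only, a fragment of that kind could be put together like
this in Python. It assumes that [Thumb] is substituted in format
strings the same way [assocfilepath] is, and that the directory layout
matches the doc.doc link below; thumb_img_tag is a hypothetical
helper, not a Greenstone function.

```python
def thumb_img_tag(collection, metadata_name="Thumb"):
    """Build an <img> fragment pointing at a thumbnail associated file.

    Assumption: the thumbnail's filename is held in the metadata
    element `metadata_name`, written as [Thumb] in a format string.
    """
    src = (
        f"_httpprefix_/collect/{collection}/index/assoc/"
        f"[assocfilepath]/[{metadata_name}]"
    )
    return f'<img src="{src}">'


print(thumb_img_tag("pictures"))
```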


Note that WordPlug always names associated Word document files
"doc.doc" (I think), so you can create a link to them like this (all
on one line):

<a href="_httpprefix_/collect/pictures/index/
assoc/[assocfilepath]/doc.doc">word document</a>

If you use UnknownPlug to import Excel files as in this example:

plugin UnknownPlug -process_exp '.xls' \
-assoc_field 'ExcelFileName' -file_type 'bin/excel'

Then the name of the associated file (in this case, an Excel
spreadsheet) will be stored in a metadata element called
"ExcelFileName". You can get to it with this URL:

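By analogy with the doc.doc link above, a link to the spreadsheet
would presumably look like the output of this Python sketch. The
helper assoc_file_link and the link text are hypothetical, and the
[ExcelFileName] substitution is assumed to work like [assocfilepath].

```python
def assoc_file_link(collection, filename_field, label):
    """Build an <a> fragment for an associated file.

    Assumption: the file's name is held in the metadata element
    `filename_field` (here the one given to -assoc_field).
    """
    href = (
        f"_httpprefix_/collect/{collection}/index/assoc/"
        f"[assocfilepath]/[{filename_field}]"
    )
    return f'<a href="{href}">{label}</a>'


print(assoc_file_link("pictures", "ExcelFileName", "spreadsheet"))
```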

Note that you can always look in the directories in
$GSDLHOME/collect/pictures/index/assoc/* to see what associated files
are stored for your collection.

I hope this helped. Someone.