Re: Protect URLs from rewriting, construct archive files directly?

From John R. McPherson
Date Tue, 05 Nov 2002 13:54:14 +1300
Subject Re: Protect URLs from rewriting, construct archive files directly?
In-Reply-To (20021104182451F-tao-lib-uchicago-edu)
Tod Olson wrote:
> I'm working with Greenstone to put up a collection of scanned piano
> scores. After working with Greenstone for a little bit, I have two
> questions:
> 1. I'm indexing HTML files that refer to images (via URL). I do not
> want to copy the images into Greenstone, but rather to have my archive
> documents refer to their original URLs (they're large, and I can't
> increase my storage right now). How can I prevent the URLs for the
> image files from being rewritten by the HTML plugin?

Looking at the source, it appears that you can't. There is
a "-nolinks" option, but that seems to apply only to "href" links and
not to image sources. You could, however, edit the plugin
file and comment out two lines so that Greenstone keeps the original
<img src... links. Assuming you are using Greenstone version 2.38,
these are lines 277 and 278 of $GSDLHOME/perllib/plugins/
and they start with
$$textref =~ s/(<img[^>]*?src ....
$self->replace_images ($1, $2, .....

If you add a hash ("#") symbol at the start of both of those lines,
then Greenstone should leave the image links pointing to the original URLs.
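
If you would rather not rely on the line numbers (they may differ in
other versions), here is a rough Python sketch of the same edit that
finds the two lines by how they start. The function name comment_out
and the sample text are made up for illustration; the sample is a
stand-in, not the real plugin source, and the elided parts of the two
lines are left elided.

```python
def comment_out(lines, prefixes):
    """Return a copy of `lines` with '#' prepended to every line whose
    text (ignoring leading whitespace) starts with one of `prefixes`."""
    result = []
    for line in lines:
        if line.lstrip().startswith(tuple(prefixes)):
            result.append("#" + line)
        else:
            result.append(line)
    return result


# Beginnings of the two lines quoted above; the rest of each is elided.
PREFIXES = ["$$textref =~ s/(<img", "$self->replace_images"]

# Stand-in text for illustration only, NOT the real plugin source.
sample = [
    "    # some earlier line\n",
    "    $$textref =~ s/(<img[^>]*?src ....\n",
    "    $self->replace_images ($1, $2, .....\n",
    "    # some later line\n",
]

patched = comment_out(sample, PREFIXES)
print("".join(patched), end="")
```

To apply this to the real file you would read its lines, pass them
through the function, and write them back, after making a backup copy.
The second prefix is fairly generic and could match other lines in the
plugin, so check each commented-out line by eye. Lines that are already
commented out are left alone, so running it twice does no harm.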

> 2. I would like to generate the Greenstone archive documents
> directly, rather than generate HTML. Is there a plugin that will let
> me import archive documents directly, or is there some other way to do
> this?

The "GAPlug" (Greenstone Archives) plugin does this, and is normally included
in any collect.cfg file. It reads the collect/archives/archives.inf
file, which lists the names of the .xml archive files that are then
processed at build time. The other plugins (such as HTMLPlug) create
these archive files at import time, so a custom plugin would need to
do the same thing for GAPlug to understand your generated files.

PS - my "real" work here is doing optical music recognition on images
of sheet music, so I might have more than a passing interest in your
collection. Will it be publicly available?