Re: [greenstone-users] Associated Files Storage

From Katherine Don
Date Thu, 27 Jul 2006 10:45:14 +1200
Subject Re: [greenstone-users] Associated Files Storage
In-Reply-To (000001c6ad45$ecc6ad70$6ca22dcb-Colin)
Hi Colin

This is a consequence of how our collection building process works, and
because we started off with collections being static rather than
continually growing.

The data used to build the collection goes into the import directory.
This is just so we know which data to use for the collection. Importing
creates intermediate files (the doc.xml files) in the archives directory,
which are then used as source data for indexing.
Strictly speaking, the archives are not essential. You can supposedly
build directly from the import directory. The trouble with that is that
the build makes several passes through the documents to compress the
text, index it and build up the database. If you haven't generated the
archives, then on each pass through the documents you are effectively
regenerating the doc.xml version, i.e. redoing all the conversions and
metadata extraction. So this is much slower.
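
To make the cost concrete, here is a minimal Python sketch (not Greenstone
code) of a multi-pass build. The function name build, the convert callback
and the in-memory archives dictionary are all hypothetical stand-ins; the
only point is to count how often the conversion step runs with and without
a stored intermediate version.

```python
def build(documents, passes, convert, use_archives):
    """Run every pass over every document; return how many conversions ran.

    documents: dict mapping a file name to its raw source text
    passes: list of callables, one per pass (e.g. compress, index, database)
    convert: callable standing in for conversion plus metadata extraction
    use_archives: keep the converted form after the first conversion
    """
    conversions = 0
    archives = {}  # hypothetical stand-in for the archives directory
    for build_pass in passes:
        for name, raw in documents.items():
            if use_archives and name in archives:
                converted = archives[name]  # reuse the stored version
            else:
                converted = convert(raw)  # the expensive step
                conversions += 1
                if use_archives:
                    archives[name] = converted
            build_pass(converted)
    return conversions
```

With three documents and three passes, this sketch converts nine times when
nothing is stored and three times when the converted form is kept.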

The final index directory is where the collection is served from. Both
import and archives can be deleted once the index is created.
Unfortunately, with our collection building process not being
incremental, deleting the import directory means that the collection
cannot be rebuilt.

In a true incremental system, once you have indexed some documents,
their import and archive documents can be deleted. And when you come to
add some more documents, they can be imported and built into the
collection without needing the source files for the others.

This is what we are working towards using Lucene. I am not sure how far
we have got towards that goal.
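
As a rough illustration of the incremental behaviour described above, here
is a toy Python index. It is not Lucene and not Greenstone's code; the class
IncrementalIndex and its methods are made up for this sketch. It only shows
that once a document's words are in the index, adding later documents does
not need the earlier sources.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that accepts documents one at a time."""

    def __init__(self):
        # word -> set of ids of the documents containing that word
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index one document; earlier documents are not touched."""
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def search(self, word):
        """Return the sorted ids of documents containing the word."""
        return sorted(self.postings.get(word.lower(), ()))
```

After add("d1", ...) the text of d1 can be thrown away, and a later
add("d2", ...) still leaves both documents searchable.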

For now, here are some ideas:

1. I am hoping that the incremental stuff will be tidied up a bit for
the next release, and there will be some documentation about it. You
could wait till then, and maybe you will be able to delete import and
archives and still add more documents to the collection. I am hoping
that will be the case, but whether we will have got there yet I am not sure.

2. Do command-line building, and each time you import, use -keepold
and only have the new documents in the import directory. The archives
directory will not be deleted each time, and will just get the new
documents added in each import. This way, you will still have two copies
of the associated files, in archives and index, but it's better than
three copies :-)

3. If you need to keep the archives files around for rebuilding, you
could perhaps modify the perl code that handles associated files, so
that they are not copied over into the index directory, but are linked
to from the archives directory. Now you only have one copy left.
This will involve programming though, and I am not sure how much of the
system it will affect.

Or maybe you could modify the code so that associated files are copied
to the index/building directory the first time a document is indexed,
and then they are deleted from the archives. Next time a build is done
you need to make sure that the associated files directory is not
deleted, and so existing associated files stay there. New associated
files will be transferred over, but ones that have already gone will not be.
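
The "transfer once, then leave alone" idea could look roughly like the
Python sketch below. The real change would be in Greenstone's Perl code,
which this does not reproduce: the function name, the directory arguments
and the assumption that everything except doc.xml counts as an associated
file are all hypothetical.

```python
import shutil
from pathlib import Path


def transfer_new_assoc_files(archives_dir, assoc_dir, skip_names=("doc.xml",)):
    """Move associated files from archives_dir into assoc_dir.

    Files whose name is in skip_names stay in the archives. Files already
    present under assoc_dir are skipped, so anything transferred by an
    earlier build is neither overwritten nor needed again.
    Returns the relative paths moved in this run.
    """
    archives_dir = Path(archives_dir)
    assoc_dir = Path(assoc_dir)
    moved = []
    # sorted() lists the tree up front, so moving files during the loop is safe
    for src in sorted(archives_dir.rglob("*")):
        if not src.is_file() or src.name in skip_names:
            continue
        rel = src.relative_to(archives_dir)
        dst = assoc_dir / rel
        if dst.exists():
            continue  # transferred by a previous build; keep it
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
        moved.append(rel.as_posix())
    return moved
```

Running it a second time over the same directories moves nothing, which is
the behaviour wanted for repeated builds. The build itself would still have
to be stopped from clearing the associated files directory.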

Well, that's all. I hope this rambling was useful.


Lost Samoan (Corporate) wrote:
> Greetings All,
> I have (as an example) a collection with 2 HTML pages. Both pages
> have associated files, both embedded in the HTML and as links (audio files).
> I place the 2 HTML files and ALL their associated files in a local
> directory, say c:\storage. If I "gather" just the HTML files then none
> of the associated files are available to the collection's pages. If I
> "gather" the complete directory c:\storage all the collection pages'
> embedded and linked files work fine, but: I then have the associated
> files located in the c:\~\{collection_name}\import\storage\ directory
> AND in the c:\~\{collection_name}\index\assoc\HASHxxxx\ directory,
> doubling the storage required. After any "create" the files are also
> archived, tripling the storage space.
> If I then remove the associated files from
> c:\~\{collection_name}\import\storage, the collection pages will not
> work after any complete or incremental "create".
> The real library is incrementally built six days per week with 30 new
> HTML pages and associated audio files (in three formats) for, say, 6 of
> those. This "triple" storage equates to 54MB for the audio files in
> lieu of the base 18MB. Over one year this is an extra 11GB.
> Question 1) Am I doing something wrong with the "gather", or the file
> associations within the HTML pages?
> Question 2) I know I can delete the "archive" HASHxxxx folders and all
> still works, but is there an associated risk in doing this?
> I have re-read all the manuals and checked the users archive, but cannot
> find an answer.
> Thank you for your consideration.
> Colin J. Murfett