[greenstone-users] Re: [greenstone-devel] Problem with hash and archives.inf

From Katherine Don
Date Sat Feb 23 09:12:02 2008
Subject [greenstone-users] Re: [greenstone-devel] Problem with hash and archives.inf
In-Reply-To (001701c8731e$244872c0$7c3401c8-diegos)
Hi Diego

There are two flags for import, -keepold, and -incremental. -incremental
should be used if you have the same document in the import directory and
don't want to reimport it. This will use timestamps to determine if a
document should be reimported or not.
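
A rough sketch of that timestamp check, in Python rather than the Perl that import.pl is written in. This is an illustration only, not the Greenstone code: the function name needs_reimport and the idea of keeping a single "last import time" per document are assumptions made for the example.

```python
def needs_reimport(source_mtime, last_import_time):
    """Decide whether a document should be imported again.

    source_mtime:     modification time of the file in the import directory
    last_import_time: time the document was last imported, or None if it
                      has never been imported (hypothetical bookkeeping)
    """
    if last_import_time is None:
        # Never imported before, so it must be imported now.
        return True
    # Reimport only if the source changed after the last import.
    return source_mtime > last_import_time
```

With -incremental, an unchanged document is skipped, which is why it is the safer flag when old and new documents sit in the import directory together.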

-keepold should be used if you have imported documents previously, and
now want to import some new ones. I guess it assumes that you know what
you are doing, and doesn't check whether you have the same document
again or not.

The way things are implemented, two documents with the same hash id will
only get one entry in the archives.inf file. This very rarely happens.
The filename is used when calculating the hash id, so two identical
documents in the same directory with different filenames will get
different hash ids. And I think it is very unlikely that different
documents will get the same hash id.
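
To illustrate why the filename matters, here is a toy hash id in Python. It is only a sketch: the use of MD5, the way the filename and content are combined, and the name toy_hash_id are all assumptions for the example, and Greenstone's real hash calculation may differ.

```python
import hashlib

def toy_hash_id(filename, content):
    """Build an illustrative id from the filename and the file content.

    Not Greenstone's algorithm; it only shows that mixing the filename
    into the hash gives identical content under different names
    different ids.
    """
    data = filename.encode("utf-8") + b"\0" + content
    return "HASH" + hashlib.md5(data).hexdigest()[:24]

same_content = b"identical bytes"
id_a = toy_hash_id("a.pdf", same_content)
id_b = toy_hash_id("b.pdf", same_content)
id_a_again = toy_hash_id("a.pdf", same_content)
```

Here id_a and id_b differ, while id_a and id_a_again are equal, which is the case Diego ran into: same filename, same content, same id.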

When you have -keepold on, the bit that saves the file in the archives
directory will assume that it is different to any there and make sure
it's saved in a separate directory. However, the archives.inf entry will
be overwritten (it's a hash structure, with the hash id as the key).
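
The overwrite can be pictured with a plain Python dict standing in for the archives.inf hash. This is a sketch under that assumption, not the actual Greenstone code; record_import is a hypothetical helper, and the id and paths are copied from Diego's message as they appear there.

```python
def record_import(archives_inf, hash_id, doc_path):
    """Store the document path under its hash id.

    A second call with the same hash id replaces the earlier entry,
    because a dict (like a Perl hash) holds one value per key.
    """
    archives_inf[hash_id] = doc_path

archives_inf = {}
hash_id = "HASH01d6d949f2fdf0194131046a"

# First import.
record_import(archives_inf, hash_id, "HASH01d6.dirdoc.xml")
# Second import with -keepold: saved in a new directory, same hash id.
record_import(archives_inf, hash_id, "HASH01d6.dir.dirdoc.xml")
```

After both calls the dict holds a single entry pointing at the second path, even though both copies still exist on disk.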

Does all this make sense?
The main recommendation is that you should use the -incremental flag,
not the -keepold flag, unless you do only have new documents. And maybe
we need to think about this behaviour, and see if we can make it better.

Cheers,
Katherine

Diego Spano wrote:
> Hi list,
>
> I can't understand how GS manages the import process when it finds the
> same input filename. Let me explain with an example:
>
> I have doc1.pdf in import folder. Then I run "perl -S import.pl demo"
> and after it the archives folder has a subfolder named HASH01d6.dir
> with 2 files: doc.xml and doc.pdf and the file archives.inf has the
> following line:
>
> HASH01d6d949f2fdf0194131046a HASH01d6.dirdoc.xml I
>
> If I run the import process again with the -keepold option and the same
> input file, the contents change as follows:
>
> the archives folder has a subfolder named HASH01d6.dir with 2 files:
> doc.xml and doc.pdf and inside it there is another folder named .dir
> with two files too, doc.xml and doc.pdf. The file archives.inf has the
> following line:
>
> HASH01d6d949f2fdf0194131046a HASH01d6.dir.dirdoc.xml I
>
>
> So, there is no reference to the first imported file. The archives
> folder has 2 doc.xml and 2 doc.pdf but only one is referenced in
> archives.inf. This behaviour makes me think that every time we have a
> file in the import folder that has the same filename as another imported
> file (no matter when it was imported), the original file will be lost.
> It is impossible to ensure that every input file will have a unique
> filename. GS should index both of them, so both files should be in
> archives.inf. Am I wrong? Is it a bug?
>
> TIA
>
> Diego Spano
>
>
> *Diego J. Spano*
> Dirección General de Gestión Informática
> Ministerio de Justicia, Seg. y DD. HH.
> Tel.: 4328.3015 (int.1404)
> 4322.6122 (directo)
>