[greenstone-users] RE: [greenstone-devel] Problem with hash and archives.inf

From Diego Spano
Date Wed Feb 27 05:08:23 2008
Subject [greenstone-users] RE: [greenstone-devel] Problem with hash and archives.inf
Hi Katherine,

What a beautiful baby. Congratulations!

You are right, the -keepold option processes the file again, even if it is
the same. But the problem is that archives.inf keeps only the last
entry, and it should have both files. Let me explain why: we have a customer
that generates a set of images every day, image001.jpg to image100.jpg.

Every day those 100 images are copied to the import directory, so every day
we have the same filenames. The only difference is that after running the
import process, we clean the import folder to get it ready to receive the
new photos. But if import -keepold is executed, the older 100 photos
don't appear in archives.inf. If I use -incremental, it says: 0 documents
were considered for processing.

So I think it would be better to have an extra option for import.pl,
something like -import_again, so that files with the same filename get 2
entries in archives.inf. Would that be possible?

Thanks a lot.


-----Original Message-----
From: Katherine Don [mailto:kjdon@cs.waikato.ac.nz]
Sent: Friday, 22 February 2008 06:12 p.m.
To: Diego Spano
CC: 'Greenstone WAIKATO'
Subject: Re: [greenstone-devel] Problem with hash and archives.inf

Hi Diego

There are two flags for import, -keepold, and -incremental. -incremental
should be used if you have the same document in the import directory and
don't want to reimport it. This will use timestamps to determine if a
document should be reimported or not.
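
As a rough illustration of the timestamp idea only (this is not Greenstone's actual code; the function name and the "last import time" parameter are hypothetical), a check of this kind might look like:

```python
import os

def files_needing_import(import_dir, last_import_time):
    """Return paths under import_dir modified after last_import_time.

    last_import_time is seconds since the epoch. Hypothetical helper:
    it only illustrates a timestamp comparison, not Greenstone's logic.
    """
    changed = []
    for root, _dirs, files in os.walk(import_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            # A file counts as new or changed if its modification time
            # is later than the previous import run.
            if os.path.getmtime(path) > last_import_time:
                changed.append(path)
    return changed
```

Under this assumption, a file copied into the import directory with an old modification time would be skipped even if its content is new.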

-keepold should be used if you have imported documents previously, and now
want to import some new ones. I guess it assumes that you know what you are
doing, and doesn't check whether you have the same document again or not.

The way things are implemented, two documents with the same hash id will
only get one entry in the archives.inf file. This very rarely happens.
The filename is used when calculating the hash id, so two identical
documents in the same directory with different filenames will get different
hash ids. And I think it is very unlikely that different documents will get
the same hash id.

When you have -keepold on, the bit that saves the file in the archives
directory will assume that it is different from any already there and make
sure it's saved in a separate directory. However, the archives.inf entry
will be overwritten (it's a hash structure, with the hash id as the key).
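
To make the overwriting concrete, here is a minimal sketch that models archives.inf as a plain dictionary keyed by hash id. The function name and the two paths are made up for illustration; only the hash id is taken from the example in this thread.

```python
def record_import(archives_inf, hash_id, doc_path):
    """Store doc_path under hash_id, replacing any earlier entry.

    Hypothetical stand-in for the archives.inf update: a mapping
    can hold only one value per key.
    """
    archives_inf[hash_id] = doc_path
    return archives_inf

archives_inf = {}
hash_id = "HASH01d6d949f2fdf0194131046a"

# Two imports of the same document produce the same hash id
# (the paths below are illustrative only).
record_import(archives_inf, hash_id, "first-import/doc.xml")
record_import(archives_inf, hash_id, "second-import/doc.xml")

print(len(archives_inf))      # 1
print(archives_inf[hash_id])  # second-import/doc.xml
```

Both copies exist on disk, but only the last one written is still referenced by the index.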

Does all this make sense?
The main recommendation is that you should use the -incremental flag, not
the -keepold flag, unless you only have new documents. And maybe we need
to think about this behaviour, and see if we can make it better.

Diego Spano wrote:
> Hi list,
> I can't understand how GS manages the import process when it finds the
> same input filename. Let me explain with an example:
> I have doc1.pdf in import folder. Then I run "perl -S import.pl demo"
> and after it the archives folder has a subfolder named HASH01d6.dir
> with 2 files: doc.xml and doc.pdf and the file archives.inf has the
> following line:
> HASH01d6d949f2fdf0194131046a HASH01d6.dirdoc.xml I
> If I run the import process again with the -keepold option and the same
> input file, the contents change as follows:
> the archives folder has a subfolder named HASH01d6.dir with 2 files:
> doc.xml and doc.pdf and inside it there is another folder named .dir
> with two files too, doc.xml and doc.pdf. The file archives.inf has the
> following line:
> HASH01d6d949f2fdf0194131046a HASH01d6.dir.dirdoc.xml I
> So, there is no reference to the first imported file. The archives
> folder has 2 doc.xml and 2 doc.pdf but only one is referenced in
> archives.inf. This behaviour makes me think that every time we have a
> file in the import folder that has the same filename as another imported
> file (no matter when it was imported), the original file will be lost.
> It is impossible to ensure that every input file will have a unique
> filename. GS should index both of them, so both files should be in
> archives.inf. Am I wrong? Is it a bug?
> Diego Spano
> *Diego J. Spano*
> Dirección General de Gestión Informática, Ministerio de Justicia, Seg.
> y DD. HH.
> Tel.: 4328.3015 (int.1404)
> 4322.6122 (directo)