[greenstone-users] RE: [greenstone-devel] Problem with hash and archives.inf

From graeme
Date Fri Feb 29 13:37:56 2008
Subject [greenstone-users] RE: [greenstone-devel] Problem with hash and archives.inf
In-Reply-To (001601c87892$8c84a240$7c3401c8-diegos)
Maybe you could run a script to append the date to the file names.
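
Something along these lines might do it. This is only a rough sketch in Python, not tested against Greenstone; the function name append_date_to_names and the name_YYYYMMDD.ext pattern are my own choices, not anything Greenstone requires. The idea follows from Katherine's note that the filename is used when calculating the hash id, so each day's image001.jpg should end up with a different name and, presumably, its own entry in archives.inf.

```python
import datetime
import pathlib


def append_date_to_names(import_dir, day=None):
    """Rename each file in import_dir from name.ext to name_YYYYMMDD.ext.

    Returns the list of new file names, in sorted order.
    """
    stamp = (day or datetime.date.today()).strftime("%Y%m%d")
    renamed = []
    # sorted() takes a snapshot of the listing before anything is renamed.
    for path in sorted(pathlib.Path(import_dir).iterdir()):
        # Skip subdirectories and files that already carry this stamp.
        if not path.is_file() or path.stem.endswith("_" + stamp):
            continue
        target = path.with_name(f"{path.stem}_{stamp}{path.suffix}")
        if target.exists():
            raise FileExistsError(f"refusing to overwrite {target}")
        path.rename(target)
        renamed.append(target.name)
    return renamed
```

You would call append_date_to_names() on the collection's import directory after copying the day's photos in and before running import.pl with -keepold. For example, on 29 Feb 2008 image001.jpg would become image001_20080229.jpg.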

Graeme.

On Wed, Feb 27, 2008 at 2:13 AM, Diego Spano <djspano@jus.gov.ar> wrote:
> Hi Katherine,
>
> What a beautiful baby. Congratulations!!!!
>
> You are right, the -keepold option processes the file again, no matter whether
> it is the same. But the problem is that archives.inf only keeps the last
> entry, and it should have both files. Let me explain why: we have a customer
> that generates a set of images every day, image001.jpg to image100.jpg.
>
> Every day those 100 images are copied to the import directory, so every day we
> will have the same filenames. The only difference is that after running
> the import process, we clean the import folder to get it ready to receive
> the new photos. But if import -keepold is executed, the older 100 photos
> don't appear in archives.inf. If I use -incremental, it says: 0 documents
> were considered for processing.
>
> So, I think it would be better to have an extra option for import.pl,
> something like -import_again, so that the same filenames get 2 entries in
> archives.inf. Would that be possible?
>
> Thanks a lot.
>
> Diego
>
>
> -----Original message-----
> From: Katherine Don [mailto:kjdon@cs.waikato.ac.nz] Sent: Friday, 22
> February 2008 06:12 p.m.
> To: Diego Spano
> CC: 'Greenstone WAIKATO'
> Subject: Re: [greenstone-devel] Problem with hash and archives.inf
>
>
>
> Hi Diego
>
> There are two flags for import, -keepold, and -incremental. -incremental
> should be used if you have the same document in the import directory and
> don't want to reimport it. This will use timestamps to determine if a
> document should be reimported or not.
>
> -keepold should be used if you have imported documents previously, and now
> want to import some new ones. I guess it assumes that you know what you are
> doing, and doesn't check whether you have the same document again or not.
>
> The way things are implemented, two documents with the same hash id will
> only get one entry in the archives.inf file. This very rarely happens.
> The filename is used when calculating the hash id, so two identical
> documents in the same directory with different filenames will get different
> hash ids. And I think it is very unlikely that different documents will get
> the same hash id.
>
> When you have -keepold on, the bit that saves the file in the archives
> directory will assume that it is different to any there and make sure it's
> saved in a separate directory. However, the archives.inf bit will be
> overwritten (it's a hash structure, with the hash id as the key).
>
> Does all this make sense?
> The main recommendation is that you should use the -incremental flag, not
> the -keepold flag, unless you do only have new documents. And maybe we need
> to think about this behaviour, and see if we can make it better.
>
> Cheers,
> Katherine
>
> Diego Spano wrote:
> > Hi list,
> >
> > I can't understand how GS manages the import process when it finds the
> > same input filename. Let me explain with an example:
> >
> > I have doc1.pdf in import folder. Then I run "perl -S import.pl demo"
> > and after it the archives folder has a subfolder named HASH01d6.dir
> > with 2 files: doc.xml and doc.pdf and the file archives.inf has the
> > following line:
> >
> > HASH01d6d949f2fdf0194131046a HASH01d6.dirdoc.xml I
> >
> > If I run the import process again with the -keepold option and the same
> > input file, the contents change as follows:
> >
> > the archives folder has a subfolder named HASH01d6.dir with 2 files:
> > doc.xml and doc.pdf and inside it there is another folder named .dir
> > with two files too, doc.xml and doc.pdf. The file archives.inf has the
> > following line:
> >
> > HASH01d6d949f2fdf0194131046a HASH01d6.dir.dirdoc.xml I
> >
> >
> > So, there is no reference to the first imported file. The archives
> > folder has 2 doc.xml and 2 doc.pdf but only one is referenced in
> > archives.inf. This behaviour makes me think that every time we have a
> > file in the import folder that has the same filename as another imported
> > file (no matter when it was imported), the original file will be lost.
> > It is impossible to ensure that every input file will have a unique
> > filename. GS should index both of them, so both files should be in
> > archives.inf. Am I wrong? Is it a bug?
> >
> > TIA
> >
> > Diego Spano
> >
> >
> > *Diego J. Spano*
> > Dirección General de Gestión Informática Ministerio de Justicia, Seg.
> > y DD. HH.
> > Tel.: 4328.3015 (int.1404)
> > 4322.6122 (directo)
> >
> > ------------------------------------------------------------------------
> >
> > _______________________________________________
> > greenstone-devel mailing list
> > greenstone-devel@list.scms.waikato.ac.nz
> > https://list.scms.waikato.ac.nz/mailman/listinfo/greenstone-devel
> >
>
>
> _______________________________________________
> greenstone-users mailing list
> greenstone-users@list.scms.waikato.ac.nz
> https://list.scms.waikato.ac.nz/mailman/listinfo/greenstone-users
>