Re: more [greenstone-users] Importing CDS/ISIS - metadata non ascii bugs]

From: Michael Dewsnip
Date: Wed, 01 Mar 2006 17:40:34 +1300
Subject: Re: more [greenstone-users] Importing CDS/ISIS - metadata non ascii bugs]
In-Reply-To: (44049853-9020608-inwind-it)
Hi Ruben,

We have already fixed the problem with "&" characters -- you can
download an updated version of explode_metadata_database.pl from
http://www.cs.waikato.ac.nz/~mdewsnip/greenstone/temp-2.63/explode_metadata_database.pl
(this goes in your Greenstone "bin/script" directory). Then re-explode
the database.
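
For anyone who hits the same parse error in a hand-edited metadata.xml, here is a minimal Python sketch of why a bare "&" breaks the parser and how escaping avoids it. It is only an illustration and not the code in the updated explode_metadata_database.pl; the helper names (metadata_element, is_well_formed) and the example URL are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def metadata_element(name, value):
    """Build a <Metadata> element string with the value XML-escaped."""
    # escape() turns &, < and > into &amp;, &lt; and &gt;
    # quoteattr() escapes the attribute value and adds the surrounding quotes
    return "<Metadata name=%s>%s</Metadata>" % (quoteattr(name), escape(value))


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical record value: a URL with a query string containing "&"
url = "http://example.org/search?a=1&b=2"

# Writing the value unescaped reproduces the "not well-formed" situation
raw = '<Metadata name="URL">%s</Metadata>' % url

# Escaping the value keeps the file well formed and the URL intact
fixed = metadata_element("URL", url)

print(is_well_formed(raw))
print(is_well_formed(fixed))
print(ET.fromstring(fixed).text == url)
```

Because the parser turns "&amp;" back into "&" when it reads the file, the URL survives unchanged, which deleting the character by hand does not achieve.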

Regards,

Michael

ruben pandolfi wrote:

> Hi,
>
> In addition to the 2 previous bugs reported in messages below:
>
> 1 - non-ASCII metadata cannot be read after mst explosion
> 2 - metadata containing a colon ":" causes problems
>
> I would like to add a further bug.
>
> 3 - the character "&" in record values prevents the parser from
> correctly reading metadata.xml:
>
>
> import.pl> RecPlug: ERROR
> /var/www/gsdl/collect/babeluni/import/BABEL3/metadata.xml is not a
> well formed metadata.xml file (
> import.pl> not well-formed (invalid token) at line 4979, column 146,
> byte 281720 at /usr/lib/perl5/XML/Parser.pm line 187
> import.pl> )
>
> The build process fails.
>
>
> This character is used in website URLs.
>
> ...................................
>
> Removing the "&" character manually solves the problem, and the
> collection builds OK, though the URL links will then not work.
>
> I hope this helps with fixing the next gsdl version.
>
> Thank you!
>
> Ruben
>
>
>
>
>
>
>
>
> -------- Original Message --------
> Subject: Re: [greenstone-users] Importing CDS/ISIS - metadata non
> ascii bugs
> Date: Wed, 22 Feb 2006 16:58:25 +0100
> From: ruben pandolfi <pandolfi.r@inwind.it>
> To: Katherine Don <kjdon@cs.waikato.ac.nz>,
> greenstone-users@list.scms.waikato.ac.nz
> References: <43EC3393.4040505@inwind.it>
> <Pine.WNT.4.61.0602100823230.3520@LIBSTFSYS11.LIBRARY.UIUC.EDU>
> <43EDEC87.1070809@inwind.it>
> <20060211214448.GM14862@matai.cs.waikato.ac.nz>
> <43F2E7DD.5000107@inwind.it> <43F39C3E.60602@cs.waikato.ac.nz>
> <43F42F6C.2090901@inwind.it> <43F5538B.3010508@cs.waikato.ac.nz>
>
> Hi ,
>
> To continue testing the isis/db import: I confirm that I cannot go any
> further after exploding the database, because of the XML parse
> error caused by the non-ASCII charset in the metadata.
>
> I then changed my metadata to ASCII only. Everything is OK, but there is
> still a minor bug:
>
> if the metadata name contains the character ":", e.g. Dc.Dcterms:issued
> (because I have remapped to Dublin Core),
>
> strange things happen if I choose to build an index on it:
>
> On the search page I have:
>
> "Search _is_ for"
>
> instead of
>
>
> the option does not display correctly and finally the search does not
> work.
>
> Removing this index fixes the error, and the search works again.
>
> Possibly this also causes a mismatch in the add/merge/ignore function
> when adding new metadata sets.
>
> Bye!
>
> ruben
>
>
>
>
> Katherine Don wrote:
>
>> Hi Ruben
>>
>> This appears to be a bug in Greenstone. The metadata.xml files end up
>> encoded in UTF-8, but when the metadata names get to the archive doc.xml
>> files, they are no longer in UTF-8, and hence the XML parse error.
>>
>> We'll try and have a look at this next week, but Michael and I may both
>> be away, so it may be the week after.
>>
>> Cheers,
>> Katherine
>>
>> ruben pandolfi wrote:
>>
>>> Thank you Michael, Thank you guys!
>>>
>>> Now I can import the DB correctly :-), and setting dos-850 does show
>>> the correct charset, great.
>>>
>>> I have imported correctly, and want to explode the .mst to be able to
>>> use gsdl to add/edit metadata, and associate full text docs when
>>> available to the relevant record.
>>>
>>> Unfortunately I have the same encoding error; perhaps there is another
>>> fix for this?
>>>
>>>
>>> import.pl> NULPlug processing
>>> "/var/www/gsdl/collect/babel/import/EM20/0019.nul"
>>> import.pl> NULPlug processing
>>> "/var/www/gsdl/collect/babel/import/EM20/0020.nul"
>>> import.pl> *********************************************
>>> import.pl> Import complete
>>> import.pl> *********************************************
>>> import.pl> * 20 documents were considered for processing
>>> import.pl> * 20 were processed and included in the collection
>>> import.pl> Command complete.
>>> import.pl> Extracting new metadata from archive files.
>>> import.pl> Archived metadata extraction complete.
>>> Command: /var/www/gsdl/bin/script/buildcol.pl -gli -language en
>>> -collectdir /var/www/gsdl/collect/ -removeold babel
>>> buildcol.pl> *** creating the compressed text
>>> buildcol.pl> collecting text statistics
>>> buildcol.pl> ArcPlug: processing
>>> /var/www/gsdl/collect/babel/archives/archives.inf
>>> buildcol.pl> GAPlug: processing HASHedda.dir/doc.xml
>>> buildcol.pl> **** Error is:
>>> buildcol.pl> not well-formed (invalid token) at line 12, column 23,
>>> byte 509 at /usr/lib/perl5/XML/Parser.pm line 187
>>> buildcol.pl> WARNING: No plugin could process HASHedda.dir/doc.xml
>>> buildcol.pl> GAPlug: processing HASH01c2.dir/doc.xml
>>> buildcol.pl> **** Error is:
>>>
>>>
>>>
>>>
>>> and finally
>>>
>>>
>>>
>>> buildcol.pl> WARNING: No plugin could process HASH73fe.dir/doc.xml
>>> buildcol.pl> *** creating auxiliary files
>>> buildcol.pl> arcinfo::save_info couldn't write
>>> /var/www/gsdl/collect/babel/archives/HASH73fe.dir/doc.xml/archives.inf
>>> buildcol.pl> Command failed.
>>>
>>>
>>>
>>>
>>> thank you again
>>>
>>> Ruben
>>>
>>>
>>>
>>>
>>> Michael Dewsnip wrote:
>>>
>>>
>>>> Hi Ruben,
>>>>
>>>> It turns out your problem is caused by a bug in ISISPlug -- obviously
>>>> you're the first person to try it on a database with non-ASCII
>>>> characters in the field names! (The .fdt file wasn't being read using
>>>> the encoding provided).
>>>>
>>>> I've fixed this; you can download a new version of ISISPlug.pm from
>>>> http://www.cs.waikato.ac.nz/~mdewsnip/greenstone/temp-2.63/ISISPlug.pm
>>>> (this should overwrite your existing ISISPlug.pm file in Greenstone's
>>>> "perllib/plugins" directory).
>>>>
>>>> Regards,
>>>>
>>>> Michael
>>>>
>>>> PS Your database seems to be a bit inconsistent: it contains data for
>>>> tags that are not defined in the .fdt file. For example, the .mst file
>>>> seems to have two Date tags: 45 and 50, but only 50 is defined in the
>>>> .fdt file.
>>>>
>>>>
>>>>
>>>> ruben pandolfi wrote:
>>>>
>>>>
>>>>
>>>>> Hi,
>>>>>
>>>>> John R. McPherson wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Normally, a "not well-formed" error in the XML Parser means that a
>>>>>> source file has badly encoded data, and the plugin has not detected this
>>>>>> and has made a non-utf8 archive .xml file. It might also mean that the
>>>>>> plugin has used or passed in an invalid xml tag.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Yes, I can see there is an encoding problem.
>>>>>
>>>>> Anyway, I have set GAPlug, ArcPlug, RecPlug and ISISPlug to dos 850
>>>>>
>>>>> (I'm 50% sure this is the correct code, although I thought it was
>>>>> called ibm 850).
>>>>>
>>>>> It contains Italian, French and Portuguese characters.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Most of the plugins are careful enough to convert any wrongly encoded
>>>>>> metadata/text into the correct encoding, so perhaps the ISIS plugin
>>>>>> doesn't. Are you able to make your input documents available for testing?
>>>>>> That might be the quickest way for a developer to work out where the
>>>>>> problem is.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> If someone has time and wants to check ;-), you can temporarily
>>>>> download the complete ISIS db files here:
>>>>>
>>>>> http://www.evk2cnr.org/ruben/Babel809.zip
>>>>>
>>>>>
>>>>> thank you for your help!
>>>>>
>>>>> ruben
>>>>>
>>>>> John R. McPherson wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On Sat, Feb 11, 2006 at 02:54:15PM +0100, ruben pandolfi wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Jonathan Gorman wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Check "How do I fix XML::Parser errors during import.pl?" in
>>>>>>>> the FAQ.
>>>>>>>>
>>>>>>>> Jon Gorman
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thank you Jon,
>>>>>>>
>>>>>>> I do not think the error is due to Perl.
>>>>>>>
>>>>>>> In fact I only get warnings from Perl:
>>>>>>>
>>>>>>> buildcol.pl> not well-formed (invalid token) at line 31, column 34,
>>>>>>> byte 1572 at /usr/lib/perl5/XML/Parser.pm line 187
>>>>>>> buildcol.pl> WARNING: No plugin could process
>>>>>>> HASH7bca/b456434f/1d719200/0bs809.dir/doc.xml
>>>>>>>
>>>>>>
>>>>
>>
>