Re: [greenstone-users] Importing CDS/ISIS - metadata non ascii bugs

From: Michael Dewsnip
Date: Wed, 01 Mar 2006 16:15:52 +1300
Subject: Re: [greenstone-users] Importing CDS/ISIS - metadata non ascii bugs
In-Reply-To: (43FC8A21-2040003-inwind-it)

Hi Ruben,

> if the metadata contains the char ":" eg: Dc.Dcterms:issued (because I
> have remapped to dublin core)
>
> strange things happen if I choose to build an index on it:
>
> On the search page I have:
>
> "Search _is_ for"
>
> instead of
>
>
> the option does not display correctly and finally the search does not
> work.
>
> Removing this index resolves the error and the search works again.

Yes, the ":" character is special and should not be included in metadata
names. Where did the ":" character come from? If Greenstone/GLI is
adding it (or not removing it) then this is a bug, and we'd like to fix it.

Regards,

Michael
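
The point above about ":" in metadata names can be checked before building an index. The following is a minimal sketch, not part of Greenstone: the helper name `find_reserved_chars` and the sample names are assumptions made for illustration, and the only reserved character it knows about is the ":" mentioned in this message.

```python
def find_reserved_chars(names, reserved=":"):
    """Return the metadata names that contain any reserved character.

    `reserved` defaults to ":" because that is the only character this
    thread identifies as special; extend it if others turn out to matter.
    """
    return [name for name in names if any(ch in name for ch in reserved)]


if __name__ == "__main__":
    # Hypothetical list of metadata names taken from a collection.
    names = ["dc.Title", "Dc.Dcterms:issued", "dc.Creator"]
    for name in find_reserved_chars(names):
        print("metadata name contains a reserved character:", name)
```

Any name this reports would need to be renamed (or remapped differently) before it is used in an index.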

> Katherine Don wrote:
>
>> Hi Ruben
>>
>> This appears to be a bug in Greenstone. The metadata.xml files end up
>> encoded in UTF-8, but when the metadata names get to the archive doc.xml
>> files, they are no longer in UTF-8, and hence the XML parse error.
>>
>> We'll try and have a look at this next week, but Michael and I may both
>> be away, so it may be the week after.
>>
>> Cheers,
>> Katherine
>>
>> ruben pandolfi wrote:
>>
>>> Thank you Michael, Thank you guys!
>>>
>>> now I can import the DB :-) correctly, and setting dos-850 does show
>>> the correct charset, great.
>>>
>>> I have imported correctly, and want to explode the .mst to be able to
>>> use gsdl to add/edit metadata, and associate full text docs when
>>> available to the relevant record.
>>>
>>> Unfortunately I have the same encoding error, perhaps there is another
>>> fix for this?
>>> fix for this?
>>>
>>>
>>> import.pl> NULPlug processing
>>> "/var/www/gsdl/collect/babel/import/EM20/0019.nul"
>>> import.pl> NULPlug processing
>>> "/var/www/gsdl/collect/babel/import/EM20/0020.nul"
>>> import.pl> *********************************************
>>> import.pl> Import complete
>>> import.pl> *********************************************
>>> import.pl> * 20 documents were considered for processing
>>> import.pl> * 20 were processed and included in the collection
>>> import.pl> Command complete.
>>> import.pl> Extracting new metadata from archive files.
>>> import.pl> Archived metadata extraction complete.
>>> Command: /var/www/gsdl/bin/script/buildcol.pl -gli -language en
>>> -collectdir /var/www/gsdl/collect/ -removeold babel
>>> buildcol.pl> *** creating the compressed text
>>> buildcol.pl> collecting text statistics
>>> buildcol.pl> ArcPlug: processing
>>> /var/www/gsdl/collect/babel/archives/archives.inf
>>> buildcol.pl> GAPlug: processing HASHedda.dir/doc.xml
>>> buildcol.pl> **** Error is:
>>> buildcol.pl> not well-formed (invalid token) at line 12, column 23,
>>> byte 509 at /usr/lib/perl5/XML/Parser.pm line 187
>>> buildcol.pl> WARNING: No plugin could process HASHedda.dir/doc.xml
>>> buildcol.pl> GAPlug: processing HASH01c2.dir/doc.xml
>>> buildcol.pl> **** Error is:
>>>
>>>
>>>
>>>
>>> and finally
>>>
>>>
>>>
>>> buildcol.pl> WARNING: No plugin could process HASH73fe.dir/doc.xml
>>> buildcol.pl> *** creating auxiliary files
>>> buildcol.pl> arcinfo::save_info couldn't write
>>> /var/www/gsdl/collect/babel/archives/HASH73fe.dir/doc.xml/archives.inf
>>> buildcol.pl> Command failed.
>>>
>>>
>>>
>>>
>>> thank you again
>>>
>>> Ruben
>>>
>>>
>>>
>>>
>>> Michael Dewsnip wrote:
>>>
>>>
>>>> Hi Ruben,
>>>>
>>>> It turns out your problem is caused by a bug in ISISPlug -- obviously
>>>> you're the first person to try it on a database with non-ASCII
>>>> characters in the field names! (The .fdt file wasn't being read using
>>>> the encoding provided).
>>>>
>>>> I've fixed this; you can download a new version of ISISPlug.pm from
>>>> http://www.cs.waikato.ac.nz/~mdewsnip/greenstone/temp-2.63/ISISPlug.pm
>>>> (this should overwrite your existing ISISPlug.pm file in Greenstone's
>>>> "perllib/plugins" directory).
>>>>
>>>> Regards,
>>>>
>>>> Michael
>>>>
>>>> PS Your database seems to be a bit inconsistent: it contains data for
>>>> tags that are not defined in the .fdt file. For example, the .mst file
>>>> seems to have two Date tags: 45 and 50, but only 50 is defined in the
>>>> .fdt file.
>>>>
>>>>
>>>>
>>>> ruben pandolfi wrote:
>>>>
>>>>
>>>>
>>>>> Hi,
>>>>>
>>>>> John R. McPherson wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Normally, a "not well-formed" error in the XML Parser means that a
>>>>>> source file has badly encoded data, and the plugin has not detected
>>>>>> this and has made a non-utf8 archive .xml file. It might also mean
>>>>>> that the plugin has used or passed in an invalid xml tag.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> yes, I can see there is an encoding problem.
>>>>>
>>>>> Anyway, I have set GAPlug, ArcPlug, RecPlug and ISISPlug to dos 850
>>>>>
>>>>> (I'm 50% sure this is the correct code, although I thought it was
>>>>> called ibm 850)
>>>>>
>>>>> It contains Italian, French and Portuguese characters.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Most of the plugins are careful enough to convert any wrongly
>>>>>> encoded metadata/text into the correct encoding, so perhaps the
>>>>>> ISIS plugin doesn't. Are you able to make your input documents
>>>>>> available for testing? That might be the quickest way for a
>>>>>> developer to work out where the problem is.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> if someone has time and wants to check ;-), you can temporarily
>>>>> download the complete db isis files here:
>>>>>
>>>>> http://www.evk2cnr.org/ruben/Babel809.zip
>>>>>
>>>>>
>>>>> thank you for your help!
>>>>>
>>>>> ruben
>>>>>
>>>>> John R. McPherson wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On Sat, Feb 11, 2006 at 02:54:15PM +0100, ruben pandolfi wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Jonathan Gorman wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Check "How do I fix XML::Parser errors during import.pl?" in
>>>>>>>> the FAQ.
>>>>>>>>
>>>>>>>> Jon Gorman
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thank you Jon,
>>>>>>>
>>>>>>> I do not think the error is due to perl.
>>>>>>>
>>>>>>> In fact I only have warnings from perl:
>>>>>>>
>>>>>>> buildcol.pl> not well-formed (invalid token) at line 31, column 34,
>>>>>>> byte 1572 at /usr/lib/perl5/XML/Parser.pm line 187
>>>>>>> buildcol.pl> WARNING: No plugin could process
>>>>>>> HASH7bca/b456434f/1d719200/0bs809.dir/doc.xml
>>>>>>>
>>>>>>
>>>>
>>
>
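
The encoding failure discussed earlier in this thread (DOS 850 bytes ending up in a doc.xml file that is supposed to be UTF-8) can be reproduced outside Greenstone. The sketch below is in Python rather than Greenstone's Perl, and it is only an illustration: the `<Metadata name="...">` element, the sample value and both helper names are assumptions, not the actual contents of the archive files or the ISISPlug fix. It shows that raw code page 850 bytes make the XML parser reject the document, while decoding them with the source encoding and re-encoding as UTF-8 does not.

```python
import xml.etree.ElementTree as ET

XML_HEADER = b'<?xml version="1.0" encoding="UTF-8"?>'


def is_well_formed(xml_bytes):
    """Return True if the XML parser accepts the document, else False."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


def transcode_to_utf8(raw, source_encoding="cp850"):
    """Decode bytes using the source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def make_doc(name_bytes):
    """Build a tiny document with the given bytes as a metadata name."""
    return XML_HEADER + b'<Metadata name="' + name_bytes + b'">x</Metadata>'


if __name__ == "__main__":
    # Hypothetical field name "Società" as stored in code page 850
    # (0x85 is "à" in that code page).
    raw_name = b"Societ\x85"

    # Raw CP850 bytes are not valid UTF-8, so the parser rejects them.
    print("raw bytes well-formed:", is_well_formed(make_doc(raw_name)))

    # After transcoding, the same name parses cleanly.
    fixed = make_doc(transcode_to_utf8(raw_name))
    print("transcoded well-formed:", is_well_formed(fixed))
    print("name read back:", ET.fromstring(fixed).get("name"))
```

The same reasoning applies to any file a plugin reads: every input, including the field definitions, has to be decoded with the declared source encoding before it is written to a UTF-8 archive file.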