Re: [greenstone-users] Importing CDS/ISIS db failure...arcinfo::save_info couldn't write

From Katherine Don
Date Fri, 17 Feb 2006 17:39:39 +1300
Subject Re: [greenstone-users] Importing CDS/ISIS db failure...arcinfo::save_info couldn't write
In-Reply-To (43F42F6C-2090901-inwind-it)
Hi Ruben

This appears to be a bug in Greenstone. The metadata.xml files end up
encoded in UTF-8, but when the metadata names get to the archive doc.xml
files, they are no longer in UTF-8, and hence the XML parse error.
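
To illustrate the failure mode, here is a minimal Python sketch (not
Greenstone code). It assumes a simplified, hypothetical doc.xml fragment
with a single Metadata element and a made-up metadata name "Città". When
the name is written as raw code page 850 bytes into a file declared as
UTF-8, the parser rejects it; decoding from cp850 and re-encoding as UTF-8
first gives a document that parses.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata name containing a non-ASCII character, such as an
# ISIS .fdt field name might contain. Not taken from the real database.
FIELD_NAME = "Città"


def make_doc(name_bytes: bytes) -> bytes:
    """Build a tiny XML document, declared as UTF-8, around raw name bytes.

    The element layout is a simplified stand-in for an archive doc.xml file.
    """
    return (
        b'<?xml version="1.0" encoding="UTF-8"?>\n'
        b'<Metadata name="' + name_bytes + b'">Roma</Metadata>'
    )


def parses(doc: bytes) -> bool:
    """Return True if the document is well-formed, False otherwise."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# The name as it would be stored in a DOS code page 850 source file.
raw_cp850 = FIELD_NAME.encode("cp850")

# Wrong: cp850 bytes copied unchanged into a document that claims UTF-8.
bad_doc = make_doc(raw_cp850)

# Right: decode with the source encoding, then write the name out as UTF-8.
good_doc = make_doc(raw_cp850.decode("cp850").encode("utf-8"))

if __name__ == "__main__":
    print("raw cp850 name parses:   ", parses(bad_doc))
    print("re-encoded name parses:  ", parses(good_doc))
```

The same check applies to any byte string on its way into an archive file:
it has to be decoded with the encoding of its source before being written
out as UTF-8.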

We'll try and have a look at this next week, but Michael and I may both
be away, so it may be the week after.

Cheers,
Katherine

ruben pandolfi wrote:
> Thank you Michael, Thank you guys!
>
> Now I can import the DB correctly :-), and setting dos-850 does show
> the correct charset, great.
>
> I have imported correctly, and want to explode the .mst to be able to
> use gsdl to add/edit metadata, and associate full text docs when
> available to the relevant record.
>
> Unfortunately I have the same encoding error; perhaps there is another
> fix for this?
>
>
> import.pl> NULPlug processing
> "/var/www/gsdl/collect/babel/import/EM20/0019.nul"
> import.pl> NULPlug processing
> "/var/www/gsdl/collect/babel/import/EM20/0020.nul"
> import.pl> *********************************************
> import.pl> Import complete
> import.pl> *********************************************
> import.pl> * 20 documents were considered for processing
> import.pl> * 20 were processed and included in the collection
> import.pl> Command complete.
> import.pl> Extracting new metadata from archive files.
> import.pl> Archived metadata extraction complete.
> Command: /var/www/gsdl/bin/script/buildcol.pl -gli -language en
> -collectdir /var/www/gsdl/collect/ -removeold babel
> buildcol.pl> *** creating the compressed text
> buildcol.pl> collecting text statistics
> buildcol.pl> ArcPlug: processing
> /var/www/gsdl/collect/babel/archives/archives.inf
> buildcol.pl> GAPlug: processing HASHedda.dir/doc.xml
> buildcol.pl> **** Error is:
> buildcol.pl> not well-formed (invalid token) at line 12, column 23, byte
> 509 at /usr/lib/perl5/XML/Parser.pm line 187
> buildcol.pl> WARNING: No plugin could process HASHedda.dir/doc.xml
> buildcol.pl> GAPlug: processing HASH01c2.dir/doc.xml
> buildcol.pl> **** Error is:
>
>
>
>
> and finally
>
>
>
> buildcol.pl> WARNING: No plugin could process HASH73fe.dir/doc.xml
> buildcol.pl> *** creating auxiliary files
> buildcol.pl> arcinfo::save_info couldn't write
> /var/www/gsdl/collect/babel/archives/HASH73fe.dir/doc.xml/archives.inf
> buildcol.pl> Command failed.
>
>
>
>
> thank you again
>
> Ruben
>
>
>
>
> Michael Dewsnip wrote:
>
>> Hi Ruben,
>>
>> It turns out your problem is caused by a bug in ISISPlug -- obviously
>> you're the first person to try it on a database with non-ASCII
>> characters in the field names! (The .fdt file wasn't being read using
>> the encoding provided).
>>
>> I've fixed this; you can download a new version of ISISPlug.pm from
>> http://www.cs.waikato.ac.nz/~mdewsnip/greenstone/temp-2.63/ISISPlug.pm
>> (this should overwrite your existing ISISPlug.pm file in Greenstone's
>> "perllib/plugins" directory).
>>
>> Regards,
>>
>> Michael
>>
>> PS Your database seems to be a bit inconsistent: it contains data for
>> tags that are not defined in the .fdt file. For example, the .mst file
>> seems to have two Date tags: 45 and 50, but only 50 is defined in the
>> .fdt file.
>>
>>
>>
>> ruben pandolfi wrote:
>>
>>
>>> Hi,
>>>
>>> John R. McPherson wrote:
>>>
>>>
>>>>
>>>> Normally, a "not well-formed" error in the XML Parser means that a
>>>> source file has badly encoded data, and the plugin has not detected this
>>>> and has made a non-utf8 archive .xml file. It might also mean that the
>>>> plugin has used or passed in an invalid xml tag.
>>>
>>>
>>>
>>> yes, I can see there is an encoding problem.
>>>
>>> Anyway, I have set GAPlug, ArcPlug, RecPlug and ISISPlug to dos 850
>>>
>>> (I'm 50% sure this is the correct code, although I thought it was
>>> called ibm 850)
>>>
>>> It contains Italian, French and Portuguese characters.
>>>
>>>
>>>
>>>> Most of the plugins are careful enough to convert any wrongly encoded
>>>> metadata/text into the correct encoding, so perhaps the ISIS plugin
>>>> doesn't. Are you able to make your input documents available for testing?
>>>> That might be the quickest way for a developer to work out where the
>>>> problem is.
>>>
>>>
>>>
>>> If someone has time and wants to check ;-), you can temporarily
>>> download the complete ISIS db files here:
>>>
>>> http://www.evk2cnr.org/ruben/Babel809.zip
>>>
>>>
>>> thank you for your help!
>>>
>>> ruben
>>>
>>> John R. McPherson wrote:
>>>
>>>
>>>> On Sat, Feb 11, 2006 at 02:54:15PM +0100, ruben pandolfi wrote:
>>>>
>>>>
>>>>> Jonathan Gorman wrote:
>>>>>
>>>>>
>>>>>> Check "How do I fix XML::Parser errors during import.pl?" in the FAQ.
>>>>>>
>>>>>> Jon Gorman
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thank you Jon,
>>>>>
>>>>> I do not think the error is due to perl.
>>>>>
>>>>> In fact I only have warnings from perl:
>>>>>
>>>>> buildcol.pl> not well-formed (invalid token) at line 31, column 34,
>>>>> byte 1572 at /usr/lib/perl5/XML/Parser.pm line 187
>>>>> buildcol.pl> WARNING: No plugin could process
>>>>> HASH7bca/b456434f/1d719200/0bs809.dir/doc.xml
>>>>>
>>>>
>>
>>
>