more [greenstone-users] Importing CDS/ISIS - metadata non ascii bugs]

From ruben pandolfi
Date Tue, 28 Feb 2006 19:37:07 +0100
Subject more [greenstone-users] Importing CDS/ISIS - metadata non ascii bugs]
Hi,

In addition to the 2 previous bugs reported in messages below:

1 - non-ASCII metadata cannot be read after the .mst explosion
2 - metadata containing a colon ":" causes problems

I would like to add a further bug.

3 - the character "&" in record values prevents the parser from correctly
reading metadata.xml:


import.pl> RecPlug: ERROR
/var/www/gsdl/collect/babeluni/import/BABEL3/metadata.xml is not a well
formed metadata.xml file (
import.pl> not well-formed (invalid token) at line 4979, column 146,
byte 281720 at /usr/lib/perl5/XML/Parser.pm line 187
import.pl> )

The build process fails.


This character is used in website URLs.

...................................

Removing the character "&" manually solves the problem, and the
collection builds ok, though the URL links will then not work.

I hope this helps to fix the next gsdl version.
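
A possible workaround that keeps the URLs intact would be to escape the
character rather than remove it: in XML, a bare "&" in text has to be
written as "&amp;". The snippet below is only a sketch in Python to show the
idea, not Greenstone code; the function name escape_metadata_value, the
sample URL and the element name are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def escape_metadata_value(value: str) -> str:
    """Escape &, < and > so the value is safe as XML element text.

    Assumes the value is raw (not already escaped); calling this twice
    would turn "&amp;" into "&amp;amp;".
    """
    return escape(value)


# Hypothetical record value: a URL with a query string containing "&".
raw_value = "http://www.example.org/search?db=babel&lang=it"
safe_value = escape_metadata_value(raw_value)

# Illustrative element only; the real metadata.xml layout may differ.
snippet = '<Metadata name="dc.Identifier">' + safe_value + "</Metadata>"

# A standard XML parser accepts the escaped form and returns the
# original URL unchanged, so the link still works.
parsed = ET.fromstring(snippet)
print(parsed.text)
```

With the raw value in place of the escaped one, the same parser rejects the
snippet as not well-formed, which matches the error above.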

Thank you!

Ruben


-------- Original Message --------
Subject: Re: [greenstone-users] Importing CDS/ISIS - metadata non ascii bugs
Date: Wed, 22 Feb 2006 16:58:25 +0100
From: ruben pandolfi <pandolfi.r@inwind.it>
To: Katherine Don <kjdon@cs.waikato.ac.nz>,
greenstone-users@list.scms.waikato.ac.nz
References: <43EC3393.4040505@inwind.it>
<Pine.WNT.4.61.0602100823230.3520@LIBSTFSYS11.LIBRARY.UIUC.EDU>
<43EDEC87.1070809@inwind.it>
<20060211214448.GM14862@matai.cs.waikato.ac.nz>
<43F2E7DD.5000107@inwind.it><43F39C3E.60602@cs.waikato.ac.nz>
<43F42F6C.2090901@inwind.it><43F5538B.3010508@cs.waikato.ac.nz>

Hi,

To continue testing the ISIS db import: I confirm I cannot go any
further after having exploded the database, because of the XML parse
error caused by the non-ASCII charset in the metadata.
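
To illustrate what seems to go wrong (a Python sketch only, not Greenstone
code): text stored in DOS code page 850 is not valid UTF-8 as it stands, so
it has to be decoded and re-encoded before it is written to an XML file that
is read as UTF-8. The function names and the sample field name "Città" are
made up for the example.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def cp850_to_utf8(raw: bytes) -> bytes:
    """Decode bytes stored in DOS code page 850 and re-encode as UTF-8."""
    return raw.decode("cp850").encode("utf-8")


# Hypothetical field name "Città" as stored in a cp850 file:
# byte 0x85 is "à" in code page 850.
cp850_bytes = b"Citt\x85"

# Copied unchanged into a UTF-8 XML file, this byte sequence is invalid,
# which is the kind of input an XML parser reports as "invalid token".
print(is_valid_utf8(cp850_bytes))

# After conversion the same text is valid UTF-8.
utf8_bytes = cp850_to_utf8(cp850_bytes)
print(is_valid_utf8(utf8_bytes))
```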

I then changed my metadata to ASCII only. Everything is ok, but there is
still a minor bug:

if the metadata contains the character ":", e.g. Dc.Dcterms:issued (because I
have remapped to Dublin Core),

strange things happen if I choose to build an index on it:

On the search page I have:

"Search _is_ for"

instead of


the option does not display correctly and finally the search does not work.

Removing this index resolves the error, and the search works correctly.

Possibly this also causes a mismatch in the add/merge/ignore function when
adding new metadata sets.

Bye!

ruben


Katherine Don wrote:
> Hi Ruben
>
> This appears to be a bug in Greenstone. The metadata.xml files end up
> encoded in UTF-8, but when the metadata names get to the archive doc.xml
> files, they are no longer in UTF-8, and hence the XML parse error.
>
> We'll try and have a look at this next week, but Michael and I may both
> be away, so it may be the week after.
>
> Cheers,
> Katherine
>
> ruben pandolfi wrote:
>
>>Thank you Michael, Thank you guys!
>>
>>Now I can import the DB correctly :-), and setting dos-850 does show
>>the correct charset, great.
>>
>>I have imported correctly, and want to explode the .mst to be able to
>>use gsdl to add/edit metadata, and associate full text docs when
>>available to the relevant record.
>>
>>Unfortunately I have the same encoding error; perhaps there is another
>>fix for this?
>>
>>
>>import.pl> NULPlug processing
>>"/var/www/gsdl/collect/babel/import/EM20/0019.nul"
>>import.pl> NULPlug processing
>>"/var/www/gsdl/collect/babel/import/EM20/0020.nul"
>>import.pl> *********************************************
>>import.pl> Import complete
>>import.pl> *********************************************
>>import.pl> * 20 documents were considered for processing
>>import.pl> * 20 were processed and included in the collection
>>import.pl> Command complete.
>>import.pl> Extracting new metadata from archive files.
>>import.pl> Archived metadata extraction complete.
>>Command: /var/www/gsdl/bin/script/buildcol.pl -gli -language en
>>-collectdir /var/www/gsdl/collect/ -removeold babel
>>buildcol.pl> *** creating the compressed text
>>buildcol.pl> collecting text statistics
>>buildcol.pl> ArcPlug: processing
>>/var/www/gsdl/collect/babel/archives/archives.inf
>>buildcol.pl> GAPlug: processing HASHedda.dir/doc.xml
>>buildcol.pl> **** Error is:
>>buildcol.pl> not well-formed (invalid token) at line 12, column 23, byte
>>509 at /usr/lib/perl5/XML/Parser.pm line 187
>>buildcol.pl> WARNING: No plugin could process HASHedda.dir/doc.xml
>>buildcol.pl> GAPlug: processing HASH01c2.dir/doc.xml
>>buildcol.pl> **** Error is:
>>
>>
>>
>>
>>and finally
>>
>>
>>
>>buildcol.pl> WARNING: No plugin could process HASH73fe.dir/doc.xml
>>buildcol.pl> *** creating auxiliary files
>>buildcol.pl> arcinfo::save_info couldn't write
>>/var/www/gsdl/collect/babel/archives/HASH73fe.dir/doc.xml/archives.inf
>>buildcol.pl> Command failed.
>>
>>
>>
>>
>>thank you again
>>
>>Ruben
>>
>>
>>
>>
>>Michael Dewsnip wrote:
>>
>>
>>>Hi Ruben,
>>>
>>>It turns out your problem is caused by a bug in ISISPlug -- obviously
>>>you're the first person to try it on a database with non-ASCII
>>>characters in the field names! (The .fdt file wasn't being read using
>>>the encoding provided).
>>>
>>>I've fixed this; you can download a new version of ISISPlug.pm from
>>>http://www.cs.waikato.ac.nz/~mdewsnip/greenstone/temp-2.63/ISISPlug.pm
>>>(this should overwrite your existing ISISPlug.pm file in Greenstone's
>>>"perllib/plugins" directory).
>>>
>>>Regards,
>>>
>>>Michael
>>>
>>>PS Your database seems to be a bit inconsistent: it contains data for
>>>tags that are not defined in the .fdt file. For example, the .mst file
>>>seems to have two Date tags: 45 and 50, but only 50 is defined in the
>>>.fdt file.
>>>
>>>
>>>
>>>ruben pandolfi wrote:
>>>
>>>
>>>
>>>>Hi,
>>>>
>>>>John R. McPherson wrote:
>>>>
>>>>
>>>>
>>>>>Normally, a "not well-formed" error in the XML Parser means that a
>>>>>source file has badly encoded data, and the plugin has not detected
>>>>>this and has made a non-utf8 archive .xml file. It might also mean
>>>>>that the plugin has used or passed in an invalid xml tag.
>>>>
>>>>
>>>>
>>>>Yes, I can see there is an encoding problem.
>>>>
>>>>Anyway, I have set GAPlug, ArcPlug, RecPlug and ISISPlug to dos 850
>>>>
>>>>(I'm 50% sure this is the correct code, although I thought it was
>>>>called ibm 850)
>>>>
>>>>It contains Italian, French and Portuguese characters.
>>>>
>>>>
>>>>
>>>>
>>>>>Most of the plugins are careful enough to convert any wrongly encoded
>>>>>metadata/text into the correct encoding, so perhaps the ISIS plugin
>>>>>doesn't. Are you able to make your input documents available for
>>>>>testing? That might be the quickest way for a developer to work out
>>>>>where the problem is.
>>>>
>>>>
>>>>
>>>>If someone has time and wants to check ;-), you can temporarily
>>>>download the complete ISIS db files here:
>>>>
>>>>http://www.evk2cnr.org/ruben/Babel809.zip
>>>>
>>>>
>>>>thank you for your help!
>>>>
>>>>ruben
>>>>
>>>>John R. McPherson wrote:
>>>>
>>>>
>>>>
>>>>>On Sat, Feb 11, 2006 at 02:54:15PM +0100, ruben pandolfi wrote:
>>>>>
>>>>>
>>>>>
>>>>>>Jonathan Gorman wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>Check "How do I fix XML::Parser errors during import.pl?" in the FAQ.
>>>>>>>
>>>>>>>Jon Gorman
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>Thank you Jon,
>>>>>>
>>>>>>I do not think the error is due to perl.
>>>>>>
>>>>>>In fact I only have warnings from perl:
>>>>>>
>>>>>>buildcol.pl> not well-formed (invalid token) at line 31, column 34,
>>>>>>byte 1572 at /usr/lib/perl5/XML/Parser.pm line 187
>>>>>>buildcol.pl> WARNING: No plugin could process
>>>>>>HASH7bca/b456434f/1d719200/0bs809.dir/doc.xml
>>>>>>
>>>>>
>>>
>

--
..................

Ruben Pandolfi

-------------------------------------------------------------
"...I Think This is the Beginning of a Beautiful Friendship."
-------------------------------------------------------------

_______________________________________________
greenstone-users mailing list
greenstone-users@list.scms.waikato.ac.nz
https://list.scms.waikato.ac.nz/mailman/listinfo/greenstone-users

