[greenstone-users] Unicode data in classifiers

From: Vladimir Risojevic
Date: Thu Dec 6 21:13:38 2007
Subject: [greenstone-users] Unicode data in classifiers
In-Reply-To: (1259-59-92-173-245-1196789470-squirrel-ncsi-iisc-ernet-in)

Dear Anuradha,

I have a collection of scanned magazines published in the late 19th and
early 20th centuries. Its concept is very similar to the PagedImg example
from the Greenstone tutorials. We created item files for this collection but
did not run OCR. The item files have the following structure:

<PagedDocument>
<Metadata name="Series">AAA</Metadata>
<Metadata name="Date">19100101</Metadata>
<Metadata name="Volume">1</Metadata>
<Metadata name="Number">1</Metadata>
<PageGroup>
<Metadata name="dc.Title">xxx</Metadata>
<Metadata name="dc.Creator">abc</Metadata>
<Page pagenum="1" imgfile="001.tif"/>
<Page pagenum="2" imgfile="002.tif"/>
<Page pagenum="3" imgfile="003.tif"/>
<Page pagenum="4" imgfile="004.tif"/>
</PageGroup>
<PageGroup>
<Metadata name="dc.Title">yyy</Metadata>
<Metadata name="dc.Creator">def</Metadata>
<Page pagenum="5" imgfile="005.tif"/>
</PageGroup>
<PageGroup>
<Metadata name="dc.Title">zzz</Metadata>
<Metadata name="dc.Creator">xyz</Metadata>
<Page pagenum="5" imgfile="005.tif"/>
<Page pagenum="6" imgfile="006.tif"/>
<Page pagenum="7" imgfile="007.tif"/>
<Page pagenum="8" imgfile="008.tif"/>
</PageGroup>
</PagedDocument>
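
To check that item files in this shape are well-formed before building, a
small script can read them back. This is only a sketch in Python using the
standard library XML parser, not anything Greenstone itself provides; the
function names and the shortened sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the same shape as the item file above.
SAMPLE = """<PagedDocument>
<Metadata name="Series">AAA</Metadata>
<Metadata name="Number">1</Metadata>
<PageGroup>
<Metadata name="dc.Title">xxx</Metadata>
<Metadata name="dc.Creator">abc</Metadata>
<Page pagenum="1" imgfile="001.tif"/>
<Page pagenum="2" imgfile="002.tif"/>
</PageGroup>
<PageGroup>
<Metadata name="dc.Title">yyy</Metadata>
<Metadata name="dc.Creator">def</Metadata>
<Page pagenum="5" imgfile="005.tif"/>
</PageGroup>
</PagedDocument>"""


def metadata_of(element):
    """Collect the direct <Metadata> children of an element into a dict."""
    return {m.get("name"): (m.text or "") for m in element.findall("Metadata")}


def summarize_item(xml_text):
    """Return (document-level metadata, list of per-PageGroup summaries)."""
    root = ET.fromstring(xml_text)
    groups = []
    for group in root.findall("PageGroup"):
        entry = metadata_of(group)
        entry["pages"] = [p.get("imgfile") for p in group.findall("Page")]
        groups.append(entry)
    return metadata_of(root), groups


doc_meta, groups = summarize_item(SAMPLE)
print(doc_meta)
for g in groups:
    print(g["dc.Title"], g["dc.Creator"], g["pages"])
```

For a real item file on disk, ET.parse(path).getroot() could replace the
ET.fromstring call; the parser reports an error if the XML is malformed.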

Besides dc.Title and dc.Creator I also have dc.Description and dc.Subject
metadata. All metadata are non-English, with entries in both Cyrillic and
Latin script, so they are UTF-8 encoded. I wanted to create hierarchical
browsing of volumes, issues and titles in the magazines, as well as an index
of authors. Browsing of volumes, issues and titles was easy, and I created it
with
AZCompactSectionList -metadata Series -sort Number
However, the index of authors wasn't that easy, because Greenstone
classifiers are not very good at handling UTF-8 metadata. One exception is
the GenericList classifier, which I used here. As Michael pointed out, for
the GenericList classifier to work with Unicode one has to download
allkeys.txt from
http://www.unicode.org/Public/UCA/latest/allkeys.txt
This file essentially defines how to sort Unicode data. You have to put it
into the %Greenstone%\bin\windows\perl\lib\Unicode\Collate directory, where
%Greenstone% is the directory where Greenstone is installed. This path is
obviously Windows-specific, but I think that on other platforms you have to
find the directory where the Perl modules are installed and put allkeys.txt
into the Unicode/Collate directory there.
After that, the classifier from my previous post (quoted below) worked, and
I got my index of authors. The -sort_using_unicode_collation option enables
the use of allkeys.txt.
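
To illustrate why a collation table matters for an author index, here is a
rough sketch in Python. It is not what Greenstone or Perl's Unicode::Collate
does internally: it only contrasts plain code point order with a crude key
that ignores diacritics, which loosely approximates how a default Unicode
collation groups accented letters with their base letters. The author names
are sample values, not taken from the collection, and rough_key is a
hypothetical helper.

```python
import unicodedata

# Sample author names (illustrative only), normalized to composed form.
authors = [
    unicodedata.normalize("NFC", name)
    for name in ["Šantić", "Andrić", "Zmaj", "Ćopić", "Кочић", "Selimović"]
]


def rough_key(name):
    """Crude sort key: strip combining marks, then ignore case.

    This is only a stand-in for real collation weights such as those
    defined in allkeys.txt; it handles far fewer cases.
    """
    decomposed = unicodedata.normalize("NFD", name)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), name)


# Plain sorting compares code points, so accented capitals land after "Z".
by_code_point = sorted(authors)

# With the rough key, accented letters sort next to their base letters.
by_rough_key = sorted(authors, key=rough_key)

print("code point order:", by_code_point)
print("rough key order: ", by_rough_key)
```

In the first list the names starting with Ć and Š end up after Zmaj; in the
second they sit next to the C and S entries, which is closer to what a
reader of an alphabetical author index would expect.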

Hope this helps. If you have further questions feel free to ask.

Regards,

Vladimir


anu@ncsi.iisc.ernet.in wrote:
> Hello Vladimir,
>
> Could you please give me more details of the collection you have created?
>
> 1. Have you created metadata in non english and unicode compatible format?
>
> 2. What is allkeys.txt and where it should be kept and what is the format
> of allkeys.txt?
>
> Regards,
> Anuradha
>
>
>> Hi Michael,
>>
>> Thank you very much for your reply. It was allkeys.txt that was missing.
>> I have finally created the classifier that works. I reproduce it here
>> in hope that it can be useful to someone.
>> GenericList -sort_leaf_nodes_using dc.Creator -always_bookshelf_last_level
>> -metadata dc.Creator -partition_type_within_level per_letter
>> -classify_sections -sort_using_unicode_collation
>>
>> Thanks again.
>>
>> Regards
>>
>> Vladimir


--
Vladimir Risojevic
Teaching Assistant
Faculty of Electrical Engineering
University of Banjaluka
Patre 5
78000 Banjaluka
Bosnia and Herzegovina

Phone: +387 51 221 847, +387 51 221 876
Fax: +387 51 211 408
Email: vlado@etfbl.net
WWW: http://www.etfbl.net