RE: [greenstone-users] Sorting Russian PDF in Classifiers

From Jonathan Tremblay
Date Fri, 16 Jun 2006 10:19:39 -0400
Subject RE: [greenstone-users] Sorting Russian PDF in Classifiers
In-Reply-To (043E0C44967387478662B95CDAEBC0A71E4D147E-quebec-praxnet-local)
Hi Michael,

You were right about the "-no_metadata_formatting" option for hierarchy
classifiers. I'm still puzzled about what it has to do with my dc.SortCode
problem, but I'll use it!

The main reason I use hierarchy is that I wanted an HList at the top to
allow the user to select the language of the documents independently of
the interface language.

I also wanted to create different classifiers for each language, so that
bookshelves are labeled and sorted in the user's language. (My macro code
generates the navigation bar according to the user's language to enable the
required classifiers.)

I guess it would be possible to create something similar with GenericList,
but now that my metadata is ready and formatted, I prefer to stick with
hierarchy.

Greenstone is very powerful software, but it's hard for a new user to make
the right decisions when there are two or more options to achieve the same
result...

By the way, I would have liked to have CSVPlug at the beginning of my
project... XML files are quite hard to manage when the data comes from
Excel!

Keep up your good work!

Thanks,

Jonathan Tremblay


-----Original Message-----
From: Michael Dewsnip [mailto:mdewsnip@cs.waikato.ac.nz]
Sent: 16 June 2006 01:01
To: Jonathan Tremblay
Subject: Re: [greenstone-users] Sorting Russian PDF in Classifiers

PS Out of curiosity, what will the finished subject classifier look
like? I'm just interested to see whether GenericList can be used for
this classifier (if so it might be easier than those horrible hierarchy
files and numbers).

Michael Dewsnip wrote:

>Hi Jonathan,
>
>The "-no_metadata_formatting" option will fix your problem in this case.
>
>All the best,
>
>Michael
>
>Jonathan Tremblay wrote:
>
>>Thanks Michael, GenericList seems to work. However, the same problem also
>>exists with hierarchy classifiers. I hope hierarchy is not the same kind of
>>"absolute shambles" as the AZCompactList is...
>>
>>Here's an updated test collection with a hierarchy classifier to demonstrate
>>the problem. Click on the link "_textdescrSubject_" and open the section:
>>"Gender, Women and Girls".
>>
>>Thanks,
>>
>>Jonathan Tremblay
>>
>>-----Original Message-----
>>From: Michael Dewsnip [mailto:mdewsnip@cs.waikato.ac.nz]
>>Sent: 15 June 2006 17:59
>>To: Jonathan Tremblay
>>Subject: Re: [greenstone-users] Sorting Russian PDF in Classifiers
>>
>>Hi Jonathan,
>>
>>Thanks for the collection -- it helps a lot with this sort of problem.
>>
>>This certainly does look like a bug in AZCompactList. Unfortunately
>>AZCompactList is an absolute shambles and I don't have time to track the
>>bug down right now, but I have a better solution for you: try using
>>GenericList instead. This is a classifier I wrote a while back that has
>>most of the features of AZCompactList, but is a lot smaller and better
>>organised, and has much better Unicode, metadata and sorting capabilities.
>>
>>I've tried using GenericList with your collection and it seems to work fine:
>>
>> classify GenericList -metadata dc.Publisher -sort_leaf_nodes_using dc.SortCode
>>
>>It even sorts the four documents correctly based on their title:
>>
>> classify GenericList -metadata dc.Publisher -sort_leaf_nodes_using dc.Title
>>
>>(This may not be true in general though so you're probably better
>>sticking with dc.SortCode).
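
To make the two classifier lines above more concrete, here is a rough Python sketch of the idea behind them: group documents by one metadata field, then order each group by another. It is not Greenstone's implementation (the classifiers are Perl modules), and the function name, the titles and the sort codes are invented for illustration. The Russian code starts with "R" only because the original question mentions codes such as RAAA.

```python
from collections import defaultdict

# Hypothetical records; titles and sort codes are made up for this sketch.
PUBLISHER = "Commission on the Status of Women"
documents = [
    {"dc.Title": "Доклад", "dc.Publisher": PUBLISHER, "dc.SortCode": "RAAA"},
    {"dc.Title": "Informe", "dc.Publisher": PUBLISHER, "dc.SortCode": "AAAC"},
    {"dc.Title": "Report", "dc.Publisher": PUBLISHER, "dc.SortCode": "AAAA"},
    {"dc.Title": "Rapport", "dc.Publisher": PUBLISHER, "dc.SortCode": "AAAB"},
]


def classify(docs, group_by, sort_leaves_by):
    """Group docs by one metadata field and sort each group by another."""
    shelves = defaultdict(list)
    for doc in docs:
        shelves[doc[group_by]].append(doc)
    # Shelves are ordered by name, leaves by the chosen sort field.
    return {
        name: sorted(members, key=lambda d: d[sort_leaves_by])
        for name, members in sorted(shelves.items())
    }


by_code = classify(documents, "dc.Publisher", "dc.SortCode")
for name, members in by_code.items():
    print(name)
    for doc in members:
        print("  ", doc["dc.SortCode"], doc["dc.Title"])
```

Sorting on dc.Title instead would compare raw Unicode code points in this sketch, which is not language-aware ordering. That is one reason a dedicated sort field can be the more predictable choice.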
>>
>>Let me know if you have any problems with GenericList, as I'd like this
>>to eventually replace AZCompactList (and AZList, and AZSectionList, and
>>AZCompactSectionList...) completely.
>>
>>All the best,
>>
>>Michael
>>
>>Jonathan Tremblay wrote:
>>
>>>Hi Michael,
>>>
>>>I was already specifying the "-Sort" option.
>>>
>>>I tried the "-no_metadata_formatting" option but it made no change.
>>>
>>>To show you the problem I made a small test collection with only one
>>>classifier and four documents (one for each language used: English, French,
>>>Spanish and Russian). You will find this in the attached zip file.
>>>
>>>To see the problem, click on the link "_textdescrPublisher_" and open the
>>>only available section: "Commission on the Status of Women". You will see
>>>the four documents and their "SortCode".
>>>
>>>Let me know what you think about this.
>>>
>>>Thanks,
>>>
>>>Jonathan Tremblay
>>>
>>>-----Original Message-----
>>>From: Michael Dewsnip [mailto:mdewsnip@cs.waikato.ac.nz]
>>>Sent: 14 June 2006 18:45
>>>To: Jonathan Tremblay
>>>Cc: 'Greenstone'
>>>Subject: Re: [greenstone-users] Sorting Russian PDF in Classifiers
>>>
>>>Hi Jonathan,
>>>
>>>Please check you have specified the "-sort" and
>>>"-no_metadata_formatting" options to the classifier. If this still
>>>doesn't work then it is probably a bug: please let us know and we'll
>>>look into it further.
>>>
>>>All the best,
>>>
>>>Michael
>>>
>>>Jonathan Tremblay wrote:
>>>
>>>>Hi,
>>>>
>>>>My project contains English, Spanish, French and Russian documents
>>>>(all in PDF).
>>>>
>>>>Since the beginning, sorting has been a problem. So I created a
>>>>metadata field specifically for sorting. At first I used numbers, but
>>>>since I still had problems, I started using characters in that field
>>>>(AAAA, AAAB, AAAC, etc.)
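
The letter-code scheme described above can be sketched in a few lines of Python. This is only an illustration of fixed-width codes of the AAAA, AAAB, AAAC kind; the function name and the width of four are assumptions, and the thread does not say how the codes were actually produced. The sketch also shows one common way unpadded numbers go wrong when they are compared as strings, which may or may not be what happened in this project.

```python
import string

ALPHABET = string.ascii_uppercase  # the 26 letters A..Z


def sort_code(index: int, width: int = 4) -> str:
    """Return a fixed-width, letters-only code for a non-negative position."""
    if index < 0 or index >= len(ALPHABET) ** width:
        raise ValueError("index does not fit in the requested width")
    letters = []
    for _ in range(width):
        # Peel off the least significant base-26 digit each time round.
        index, remainder = divmod(index, len(ALPHABET))
        letters.append(ALPHABET[remainder])
    return "".join(reversed(letters))


codes = [sort_code(i) for i in range(30)]
numbers = [str(i) for i in range(30)]

# Fixed-width codes keep their numeric order under plain string sorting.
print(sorted(codes) == codes)
# Unpadded numbers do not: "10" sorts before "2" when compared as text.
print(sorted(numbers) == numbers)
```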
>>>>
>>>>It worked perfectly for the search results. But Russian documents are
>>>>not sorted correctly in the classifiers (AZCompactList and Hierarchy):
>>>>they always appear before other documents (e.g. RAAA, RAAB, AAAA, AAAB,
>>>>AAAC, etc.)
>>>>
>>>>Why?
>>>>
>>>>I had a similar problem with an English PDF that contained no
>>>>editable text (it contained only images from a scan). As soon as I
>>>>replaced the document with a PDF version containing text, the document
>>>>was sorted correctly. By the way, all my Russian documents contain
>>>>editable text.
>>>>
>>>>Thanks,
>>>>
>>>>Jonathan Tremblay