Re: [greenstone-users] Sort order problem in french (accentuated words at the bottom of the lists)

From Michael Dewsnip
Date Wed, 10 Jan 2007 14:27:12 +1300
Subject Re: [greenstone-users] Sort order problem in french (accentuated words at the bottom of the lists)
In-Reply-To (45A2D5C0-8030208-argon7-be)

I have just added a new option to the GenericList classifier:
-sort_using_unicode_collation. This uses the Unicode::Collate module to
sort the metadata values so "Pé" is sorted with "Pe" etc.
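
To illustrate why "Pé" ends up below "Puttemans" in the first place, here is a rough sketch in Python. It is not the GenericList code and it does not use Unicode::Collate: it only approximates accent-insensitive ordering by stripping combining marks, and the function name collation_key is made up for this example. The sample titles come from the report quoted below.

```python
import unicodedata

def collation_key(value):
    """Approximate an accent-insensitive sort key (hypothetical helper).

    Decompose the string (NFD), drop the combining accents, and compare
    the remaining base letters case-insensitively. The original value is
    kept as a tie-breaker so "Pe" and "Pé" have a stable relative order.
    """
    decomposed = unicodedata.normalize("NFD", value)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), value)

# "\u00e9" is the precomposed character "é"
titles = ["Puttemans", "P\u00e9", "Pe", "P\u00e9ch\u00e9 Jean-Jacques", "Paris"]

# Plain code point order: "é" (U+00E9) compares greater than "u" (U+0075),
# so the accented names land after "Puttemans".
naive = sorted(titles)

# Accent-insensitive order: "Pé" is placed next to "Pe".
collated = sorted(titles, key=collation_key)

print(naive)
print(collated)
```

A real collation algorithm such as the one in Unicode::Collate handles far more than this (multi-level weights, language tailoring), so treat the sketch purely as an explanation of the symptom.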

To use this option you need to:

1. Make sure you're using Greenstone 2.72.
2. Download the updated version of from
and overwrite your existing version in the Greenstone "perllib/classify" folder.
3. Download the file
and put this in your system Perl "lib/Unicode/Collate" directory (Unix)
or in the Greenstone "bin/windows/perl/lib/Unicode/Collate" folder (Windows).
4. Change your "Titles A-Z" and "Creators A-Z" classifiers to use
GenericList with the "-sort_using_unicode_collation" option.
5. Rebuild the collection.
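
As a sketch of step 4, the two classifier lines in collect.cfg might end up looking like the fragment below. The metadata fields are taken from the collect.cfg quoted further down. I have assumed that GenericList takes the field through "-metadata" and have left out the other options from the existing lines ("-allvalues", "-sort dc.Title"), since they may not carry over to GenericList unchanged.

```
classify GenericList -metadata dc.Title -sort_using_unicode_collation
classify GenericList -metadata dc.Creator -sort_using_unicode_collation
```

These would replace the existing "classify AZList -metadata dc.Title" and "classify AZCompactList -metadata dc.Creator ..." lines.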

Let me know if you have any problems with this new functionality.



Argon7 User List wrote:

> Hi,
> I've a problem with my "Titles A-Z" and "Creators A-Z" lists:
> accentuated words are not sorted properly... for example, you can
> check this page:
> as you will see "Pé", "Péché Jean-Jacques", etc... are placed at the
> bottom of the list, under "Puttemans"...
> I've attached the main and the collection configs.
> Thanks for your help!
> -- J
> o
>indexes document:text document:dc.Title document:dc.Subject,text
>indexsubcollections disque,video,film,autre disque video film autre
>plugin HTMLPlug -input_encoding utf8 -smart_block -rename_assoc_files -no_metadata
>plugin RecPlug -use_metadata_files
>classify AZList -metadata dc.Title
>classify AZCompactList -metadata dc.Creator -allvalues -sort dc.Title
>#classify AZCompactList -metadata dc.Creator
>classify DateList -metadata dc.Date -sort dc.Title -nogroup
>classify Hierarchy -hfile themes.txt -metadata dc.Subject -sort dc.Title -allvalues -buttonname Subject
>format HList "[link][ex.Title][/link] "
>format CL2HList "[link][ex.Title][/link] "
>format DocumentButtons ""
>#format DocumentButtons "Highlight"
>#format CL1VList "<td valign=top>[link][icon][/link]</td>
>#<td valign=top>[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td>
>#<td valign=top>[highlight]
>#format CL4VList "<td valign=top>[link][icon][/link]</td>
>#<td valign=top>[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td>
>#<td valign=top>{If}{[numleafdocs],<b>[Title]</b>}<b>[dc.Subject]</b> [dc.Title]</td>"
>format CL2VList "<td valign=top>[link][icon][/link]</td> <td valign=top>[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td> <td valign=top>{If}{[numleafdocs],[link][Title][/link]}[link][dc.Creator][/link]{If}{[numleafdocs], ([numleafdocs] fiches)}{If}{[dc.Title], - [dc.Title]}{If}{[dc.Date],&sbquo; [dc.Date]}</td>"
>format CL4VList "<td valign=top>[link][icon][/link]</td> <td valign=top>[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td> <td valign=top>[link][Title][/link]{If}{[numleafdocs], ([numleafdocs] fiches)}{If}{[dc.Title],[link][dc.Title][/link]}{If}{[dc.Creator],&sbquo; de [dc.Creator]}{If}{[dc.Date],&sbquo; [dc.Date]}</td>"
>format VList "<td valign=top>[link][icon][/link]</td><td valign=top>[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td><td valign=top>[link]{Or}{[dc.Title],[ex.Title],Sans titre}[/link]{If}{[dc.Creator],&sbquo; de [dc.Creator]}{If}{[dc.Date],&sbquo; [dc.Date]}{If}{[dc.Source],&sbquo; [dc.Source]}{If}{[dc.Format], ([dc.Format])}</td>"
>format DateList "<td>[link][icon][/link]</td><td>[link]{Or}{[dc.Title],[ex.Title],Sans titre}[/link]{If}{[dc.Creator],&sbquo; de [dc.Creator]}</td><td>[ex.Date]</td>"
>format DocumentHeading "&nbsp;"
>#format DocumentHeading "{If}{[dc.Creator],[link][dc.Creator][/link]}"
>#format DocumentHeading "<H1>[dc.Title]</H1><hr>
>#{If}{[dc.Creator], [dc.Creator], Réalisateur inconnu} {If}{[dc.Date], [dc.Date]}"
>format DocumentText "[Text]"
>collectionmeta collectionname [l=fr] "ccfbcata"
>collectionmeta collectionextra [l=fr] "Le catalogue de la Cinémathèque peut dorénavant être consulté sur ce site. Vous trouverez des films culturels et éducatifs de portée générale qui restent d'actualité mais aussi de très nombreux films didaticques ne présentant plus aucune valeur pédagogique actuelle, mais qui n'en conservent pas moins une valeur incomparable pour l'éducation ou la simple évocation d'une époque ou d'un moment de notre histoire."
>collectionmeta .document:text [l=fr] "Fiches"
>collectionmeta .document:dc.Subject,text [l=fr] "Sujets"
>collectionmeta .document:dc.Title [l=fr] "Titres"
> [l=fr] "Pellicules"
> [l=fr] "Vidéos"
>collectionmeta .disque [l=fr] "CD - DVD"
>collectionmeta .autre [l=fr] "Autres"
>collectionmeta .disque,video,film,autre [l=fr] "Tous les supports"
>collectionmacro Style:cssheader '
> <link rel="stylesheet" href="_httpcollection_/images/style.css" type="text/css" media="screen">
># This file must be utf-8 encoded
># This is the main configuration file for configuring
># your Greenstone receptionist (the bit responsible for the way
># things are displayed) and contains information common
># to the interface of all collections served by the site.
># Email address of the webmaster of this Greenstone installation
># If maintainer is set to "NULL" EmailEvents and EmailUserEvents
># will be disabled.
>maintainer NULL
># Outgoing (SMTP) mail server for this Greenstone installation.
># This will default to mail.maintainer-domain if it's not set
># (i.e. if maintainer is then MailServer
># will default to If MailServer doesn't
># resolve to a valid SMTP server then the EmailEvents and
># EmailUserEvents options (see below) won't be functional. Likewise,
># turning off EmailEvents and EmailUserEvents will remove any
># reliance on MailServer.
>MailServer NULL
># Set status to "enabled" if you want the Maintenance and
># Administration facility to be available.
>status enabled
># Set collector to "disabled" if you don't want the "collector"
># end-user collection building facility to be available.
>collector disabled
># Set depositor to "disabled" if you don't want the "depositor"
># (aka institutional repository) facility to be available.
>depositor disabled
># Set gliapplet to "disabled" if you don't want the remote users
># to be able to build collections on your server through an applet
># version of GLI
>gliapplet disabled
># Set logcgiargs to true to keep a log of usage information in
># $GSDLHOME/etc/usage.txt.
>logcgiargs false
># Set usecookies to true to use cookies to identify users (cookie
># information will be written to the usage log if logcgiargs is
># true).
>usecookies false
># LogDateFormat sets the format that timestamps will be stored in the usage
># log (i.e. if logcgiargs is enabled). It takes the following values:
># LocalTime: (the default) The local time and date in the form
># "Thu Dec 07 23:47:00 NZDT 2000".
># UTCTime: Coordinated universal time (GMT) in the same format as LocalTime.
># Absolute: Integer value representing the number of seconds since
># 00:00:00 1/1/1970 GMT
>LogDateFormat LocalTime
># Log any events that Greenstone deems important in
># $GSDLHOME/etc/events.txt.
># The only events that are currently implemented come from the
># collector (e.g. someone just built/deleted the following collection)
># LogEvents may take values of:
># AllEvents: All important events
># CollectorEvents: Just those events originating from the collector
># (e.g. someone just built a collection)
># disabled: Don't log events
>LogEvents disabled
># Email the maintainer whenever any event occurs. EmailEvents
># takes the same values as LogEvents.
># Note that perl must be installed for EmailEvents or
># EmailUserEvents to work.
>EmailEvents disabled
># In some cases it may be appropriate to email the user about a
># certain event (e.g. notification from the collector that a collection
># was built successfully)
>EmailUserEvents false
># The list of display macro files used by this receptionist
>macrofiles \
> \
> \
> \
> \
> \
> \
> \
> \
> \
> \
> \
> \
> \
> \
> \
># Define the interface languages and encodings supported by this receptionist
># An "Encoding" line defines an encoding to be used by the receptionist.
># Uncomment "Encoding" lines to include an encoding on your "preferences" page.
># Encoding line options are:
># shortname -- The standard charset label for the given encoding. The
># shortname option is mandatory.
># longname -- The display name of the given encoding. If longname isn't set
># it will default to using shortname instead.
># map -- The name of the map file (i.e. the .ump file) for use when
># converting between unicode and the given encoding. The map
># option is mandatory for all encoding lines except the
># special case for utf8.
># multibyte -- This optional argument should be set for all encodings that use
># multibyte characters.
># The utf8 encoding is handled internally and doesn't require a map file.
># As a rule the utf8 encoding should always be enabled, especially if you
># have collections of documents that may not all be in the same
># language/encoding.
>Encoding shortname=utf-8 "longname=Unicode (UTF-8)"
># This is very experimental, and you almost certainly don't need it
>#Encoding shortname=utf-16be "longname=Unicode (UTF-16BE)"
># The ISO-8859 series
>Encoding shortname=iso-8859-1 "longname=Western (ISO-8859-1)" map=8859_1.ump
>#Encoding shortname=iso-8859-2 "longname=Central European (ISO-8859-2)" map=8859_2.ump
>#Encoding shortname=iso-8859-3 "longname=Latin 3 (ISO-8859-3)" map=8859_3.ump
>#Encoding shortname=iso-8859-4 "longname=Latin 4 (ISO-8859-4)" map=8859_4.ump
>#Encoding shortname=iso-8859-5 "longname=Cyrillic (ISO-8859-5)" map=8859_5.ump
>#Encoding shortname=iso-8859-6 "longname=Arabic (ISO-8859-6)" map=8859_6.ump
>#Encoding shortname=iso-8859-7 "longname=Greek (ISO-8859-7)" map=8859_7.ump
>#Encoding shortname=iso-8859-8 "longname=Hebrew (ISO-8859-8)" map=8859_8.ump
>#Encoding shortname=iso-8859-9 "longname=Turkish (ISO-8859-9)" map=8859_9.ump
>#Encoding shortname=iso-8859-15 "longname=Western (ISO-8859-15)" map=8859_15.ump
># Windows codepages
>Encoding shortname=windows-1250 "longname=Central European (Windows-1250)" map=win1250.ump
>Encoding shortname=windows-1251 "longname=Cyrillic (Windows-1251)" map=win1251.ump
>#Encoding shortname=windows-1252 "longname=Western (Windows-1252)" map=win1252.ump
>Encoding shortname=windows-1253 "longname=Greek (Windows-1253)" map=win1253.ump
>Encoding shortname=windows-1254 "longname=Turkish (Windows-1254)" map=win1254.ump
>Encoding shortname=windows-1255 "longname=Hebrew (Windows-1255)" map=win1255.ump
>Encoding shortname=windows-1256 "longname=Arabic (Windows-1256)" map=win1256.ump
>#Encoding shortname=windows-1257 "longname=Baltic (Windows-1257)" map=win1257.ump
>#Encoding shortname=windows-1258 "longname=Vietnamese (Windows-1258)" map=win1258.ump
>#Encoding shortname=windows-874 "longname=Thai (Windows-874)" map=win874.ump
>#Encoding shortname=cp866 "longname=Cyrillic (DOS)" map=dos866.ump
>#Encoding shortname=cp850 "longname=Latin-1 (DOS)" map=dos850.ump
>#Encoding shortname=cp852 "longname=Central European (DOS)" map=dos852.ump
># KOI8 Cyrillic encodings
>#Encoding shortname=koi8-r "longname=Cyrillic (KOI8-R)" map=koi8_r.ump
>#Encoding shortname=koi8-u "longname=Cyrillic (KOI8-U)" map=koi8_u.ump
># CJK encodings (note that Shift-JIS Japanese isn't currently supported)
>Encoding shortname=gbk "longname=汉语 (Chinese Simplified GBK)" map=gbk.ump multibyte
>Encoding shortname=big5 "longname=漢語 (Chinese Traditional Big5)" map=big5.ump multibyte
>Encoding shortname=euc-jp "longname=Japanese (EUC)" map=euc_jp.ump multibyte
>Encoding shortname=euc-kr "longname=Korean (UHC)" map=uhc.ump multibyte
># A "Language" line defines an interface language to be used by the
># interface. Note that it is possible to display only a subset of the
># specified languages on the preferences page for a given collection by
># using the "PreferenceLanguages" format option in your collect.cfg
># configuration file.
># Arguments are:
># shortname -- ISO 639 two letter language symbol. The shortname
># argument is mandatory.
># longname -- The display name for the given language. If longname
># isn't set it will default to using shortname instead.
># default_encoding -- The encoding to use by default when using the given
># interface language. This should be set to the
># "shortname" of a valid "Encoding" line
>Language shortname=ar longname=Arabic default_encoding=windows-1256
>Language shortname=bn "longname=বাংলা (Bengali)" default_encoding=utf-8
>Language shortname=ca "longname=Català (Catalan)" default_encoding=utf-8
>Language shortname=cs "longname=Česky (Czech)" default_encoding=utf-8
>Language shortname=de "longname=Deutsch (German)" default_encoding=utf-8
>Language shortname=el "longname=Ελληνικά (Greek)" default_encoding=windows-1253
>Language shortname=en longname=English default_encoding=utf-8
>Language shortname=es "longname=Español (Spanish)" \
> default_encoding=utf-8
>Language shortname=fa longname=Farsi default_encoding=utf-8
>Language shortname=fi longname=Finnish default_encoding=utf-8
>Language shortname=fr "longname=Français (French)" \
>Language shortname=gd "longname=Gaelic (Scottish)" default_encoding=utf-8
>Language shortname=gl longname=Galician default_encoding=utf-8
>Language shortname=he longname=Hebrew default_encoding=windows-1255
>Language shortname=hi longname=Hindi default_encoding=utf-8
>Language shortname=hr longname=Croatian default_encoding=windows-1250
>Language shortname=hy longname=Armenian default_encoding=utf-8
>Language shortname=id "longname=Bahasa Indonesia (Indonesian)" default_encoding=utf-8
>Language shortname=it longname=Italiano default_encoding=utf-8
>Language shortname=ja "longname=日本語 (Japanese)" default_encoding=utf-8
>Language shortname=ka longname=Georgian default_encoding=utf-8
>Language shortname=kk "longname=Қазақ (Kazakh)" default_encoding=utf-8
>Language shortname=kn longname=Kannada default_encoding=utf-8
>Language shortname=ky "longname=Кыргызча (Kirghiz)" default_encoding=utf-8
>Language shortname=lv longname=Latvian default_encoding=utf-8
>Language shortname=mi "longname=Māori" default_encoding=utf-8
>Language shortname=mn "longname=Монгол (Mongolian)" default_encoding=utf-8
>Language shortname=nl "longname=Nederlands (Dutch)" default_encoding=utf-8
>Language shortname=pl "longname=polski (Polish)" default_encoding=utf-8
>Language shortname=pt-br "longname=português-BR (Brasil)" \
>Language shortname=pt-pt "longname=português-PT (Portugal)" \
>Language shortname=ru "longname=Русский (Russian)" default_encoding=windows-1251
>Language shortname=sk "longname=Slovenčina (Slovak)" default_encoding=utf-8
>Language shortname=sr longname=Serbian default_encoding=utf-8
>Language shortname=th longname=Thai default_encoding=utf-8
>Language shortname=tr longname=Turkish default_encoding=windows-1254
>Language shortname=uk longname=Ukrainian default_encoding=utf-8
>Language shortname=vi "longname=Tiếng Việt (Vietnamese)" default_encoding=utf-8
>Language shortname=zh "longname=简体中文 (Simplified Chinese)" default_encoding=gbk
>Language shortname=zh-tr "longname=繁體中文 (Traditional Chinese)" default_encoding=big5
># Define any additional page parameters to be used by the above macro files
># (the current default page parameters are c (collection) and l (language)
># Define v (version -- text or graphic) page parameter and give it a default
># value of 0 (0 = text version off)
>pageparam v 0
># Set the precedence given to the page parameters. This affects which macro
># will be selected for display when there are multiple versions of the same
># macro with different page parameters.
># e.g. Given a macroprecedence of "c,v,l" and the following macro definitions:
># _content_ []
># _content_ [l=en]
># _content_ [c=demo]
># _content_ [v=1]
># _content_ [l=fr,v=1,c=hdl]
># If the corresponding cgi arguments were set to l=en&v=1&c=hdl then the
># _content_[v=1] macro would be selected for display. It would be selected
># ahead of the _content_[l=en] macro because "v" has a higher precedence
># than "l". The _content_[l=fr,v=1,c=hdl] macro would not be selected
># because one of the page parameters is completely wrong ("l").
>macroprecedence c,v,l
># Define any additional cgi arguments. Most cgi arguments are built into
># Greenstone but it's possible to define them here (or set defaults for
># existing built-in cgi arguments).
># define the "v" cgi argument (to correspond to the "v" page parameter defined
># above).
>cgiarg shortname=v longname=version multiplechar=false argdefault=0 \
> defaultstatus=weak savedarginfo=must
># set a default value for the built-in "a" cgi argument
>cgiarg shortname=a argdefault=p
># set a default value for the built-in "p" cgi argument
>cgiarg shortname=p argdefault=home
># set the default encoding to utf-8
>cgiarg shortname=w argdefault=utf-8
>cgiarg shortname=l argdefault=fr
>cgiarg shortname=m argdefault=200
>cgiarg shortname=o argdefault=40