Re: [greenstone-devel] AZCompactList sorting

From Stephen.DeGabrielle@cdu.edu.au
Date Mon, 8 Sep 2003 09:42:23 +0930
Subject Re: [greenstone-devel] AZCompactList sorting

Hi Gordon,

these are the changes we made to AZCompactList.pm to get it to behave the way we wanted.

There are two things: a little one to specify how the leaves are sorted, and a bigger one to change how the branches are sorted.

The first is that we removed the conditional that calls the <evil>&sorttools::format_string_name_english ($formatted_metavalue);</evil> line. It fails on corporate authors and on my surname, and all our metadata is already Lastname,Firstname, so we are better off without it.

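The second is to pass a -sort argument. I forget exactly what it gets passed to, but in the code below it is pushed onto the arguments used to create the List or SectionList sub-classifier for each group (look for the 'SORT LEAVES' comment).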
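
For the reverse-sort-by-date itself, the comparison is the easy part. The following is only a minimal sketch and is not taken from AZCompactList.pm: it assumes the Date values are strings such as YYYYMMDD that sort correctly as text, and the OIDs, the %date_of hash and the reverse_by_date helper are made up for illustration. Where to hook it into the classifier is a separate question.

```perl
use strict;
use warnings;

# Hypothetical data: OID => Date metadata, assumed to be YYYYMMDD strings.
my %date_of = (
    'HASH01' => '20030115',
    'HASH02' => '20010930',
    'HASH03' => '20030908',
);

# Newest first: compare $b with $a. Ties fall back to the OID so the
# order is predictable.
sub reverse_by_date {
    my ($dates) = @_;
    return sort { $dates->{$b} cmp $dates->{$a} or $a cmp $b } keys %$dates;
}

my @newest_first = reverse_by_date(\%date_of);
print join(", ", @newest_first), "\n";   # HASH03, HASH01, HASH02
```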

I also have a MultiAZCompactList classifier sent to me by Michael Dewsnip, which you may find helpful. (I have added the text of the classifier to the end of this message, but beware: weird formatting happens. We are using this but changed the hlist to a vlist.)

Sorry I can't tell you much more. I have been using the db2txt utility and the '-mode infodb' flag when building, to speed up my testing and quickly get a look at the output from my classifier lines.

I am thinking of a 'new books' facility for our collection, so I am keen to have a look at anything you come up with.

Let us know how you go,

 

Stephen

 

 

 

________________________________________________
Stephen De Gabrielle
Digitisation Officer
AraDA Project

Northern Territory University Library
http://www.ntu.edu.au/library
Tel: (08) 8946 7009 from overseas: 61 8 8946 7009
Postal address: P.O.Box 41246, Casuarina, NT, 0811, Australia
CRICOS Provider No: 00300K

>Hi all,
>
>Does anyone know exactly how AZCompactList classifiers sort the documents
>inside each category?  The global sortmeta in import doesn't work, and
>there's no classifier-specific option.  The code is not easy to read (gsdl
>2.39).
>
>In fact, what I really want to do is a reverse-sort-by-date.  Can anyone
>suggest a way to do this (given I have a Date metadata field)?
>
>Gordon
>
>_______________________________________________
>greenstone-devel mailing list
>greenstone-devel@list.scms.waikato.ac.nz
>https://list.scms.waikato.ac.nz/mailman/listinfo/greenstone-devel

------------------------

sub classify
{
   my $self = shift (@_);
   my ($doc_obj) = @_;

    my $doc_OID = $doc_obj->get_OID();

    my @sectionlist = ();
   my $topsection = $doc_obj->get_top_section();

    my $metaname = $self->{'metaname'};
   my $outhandle = $self->{'outhandle'};

    $metaname =~ s/(\/.*)//; # grab first name in n1/n2/n3 list

    if ($self->{'doclevel'} =~ /^top(level)?/i)
   {
push(@sectionlist,$topsection);
   }
   else
   {
my $thissection = $doc_obj->get_next_section($topsection);
while (defined $thissection)
{
    push(@sectionlist,$thissection);
    $thissection = $doc_obj->get_next_section ($thissection);
}
   }

    my $thissection;
   foreach $thissection (@sectionlist)
   {
my $full_doc_OID
   = ($thissection ne "") ? "$doc_OID.$thissection" : $doc_OID;

 if (defined $self->{'list'}->{$full_doc_OID})
{
    print $outhandle "WARNING: NTUAZCompactList::classify called multiple times for $full_doc_OID\n";
}
$self->{'list'}->{$full_doc_OID} = [];
$self->{'listmetavalue'}->{$full_doc_OID} = [];

 my $metavalues = $doc_obj->get_metadata($thissection,$metaname);
my $metavalue;
foreach $metavalue (@$metavalues)
{
    # if this document doesn't contain the metadata element we're
    # sorting by we won't include it in this classification
    if (defined $metavalue && $metavalue =~ /\w/)
   {
 if ($self->{'removeprefix'}) {
     $metavalue =~ s/^$self->{'removeprefix'}//;
 }

  my $formatted_metavalue = $metavalue;

############# THIS #######################################################
     &sorttools::format_string_english ($formatted_metavalue);
############## REPLACED THIS ###############################################
#  if ($self->{'metaname'} =~ m/^Creator(:.*)?$/)
#  {
#      &sorttools::format_string_name_english ($formatted_metavalue);
#  }
#  else
#  {
#      &sorttools::format_string_english ($formatted_metavalue);
#  }
###### SD IR 2003 ###########################################################
 
 #### prefix-str
 if (! defined($formatted_metavalue)) {
     print $outhandle "Warning: NTUAZCompactList: metavalue is ";
     print $outhandle "empty\n";
     $formatted_metavalue="";
 }

  push(@{$self->{'list'}->{$full_doc_OID}},$formatted_metavalue);
 push(@{$self->{'listmetavalue'}->{$full_doc_OID}} ,$metavalue);

  last if ($self->{'onlyfirst'});
    }
}
my $date = $doc_obj->get_metadata_element($thissection,"Date");
$self->{'reclassify'}->{$full_doc_OID} = [$doc_obj,$date];
   }
}

sub reinit
{
   my ($self,$classlist_ref) = @_;
   my $outhandle = $self->{'outhandle'};
   
   my %mtfreq = ();
   my @single_classlist = ();
   my @multiple_classlist = ();

    # find out how often each metavalue occurs
   map
  {
my $mv;
foreach $mv (@{$self->{'listmetavalue'}->{$_}} )
{
    $mtfreq{$mv}++;
}
   } @$classlist_ref;

    # use this information to split the list: single metavalue/repeated value
   map
   {
my $i = 1;
my $metavalue;
foreach $metavalue (@{$self->{'listmetavalue'}->{$_}})
{
    if ($mtfreq{$metavalue} >= $self->{'mingroup'})
    {
 push(@multiple_classlist,[$_,$i,$metavalue]);
   }
   else
    {
 push(@single_classlist,[$_,$metavalue]);
$metavalue =~ tr/[A-Z]/[a-z]/;
 $self->{'reclassifylist'}->{"Metavalue_$i.$_"} = $metavalue;
    }
    $i++;
}
   } @$classlist_ref;
   
   
   # Setup sub-classifiers for multiple list

    $self->{'classifiers'} = {};

    my $pm;
   foreach $pm ("List", "SectionList")
   {
my $listname
   = &util::filename_cat($ENV{'GSDLHOME'},"perllib/classify/$pm.pm");
if (-e $listname) { require $listname; }
else
{
   print $outhandle "NTUAZCompactList ERROR - couldn't find classifier \"$listname\"\n";
   die "\n";
}
   }

    # Create classifiers objects for each entry >= mingroup
   my $metavalue;
   foreach $metavalue (keys %mtfreq)
   {
if ($mtfreq{$metavalue} >= $self->{'mingroup'})
{
    my $listclassobj;
    my $doclevel = $self->{'doclevel'};
    my $metaname  = $self->{'metaname'};
    my @metaname_list = split('/',$metaname);
    $metaname = shift(@metaname_list);
    if (@metaname_list==0)
    {
 my @args;
 push @args, ("-metadata", "$metaname");
# buttonname is also used for the node's title
 push @args, ("-buttonname", "$metavalue");
###################################################
#  push @args, ("-sort", "Date");
###################################################
## SORT LEAVES  (s.degabrielle/I.Rohoza 2003)
 push @args, ("-sort", "Title");
###################################################
 if ($doclevel =~ m/^top(level)?/i)
 {
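     # The \$ and \@ escapes stop interpolation, so the eval'd string
     # refers to the variables themselves rather than their current values.
     eval ("\$listclassobj = new List(\@args)"); warn $@ if $@;
 }
 else
 {
     eval ("\$listclassobj = new SectionList(\@args)");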
 }
    }
    else
    {
 $metaname = join('/',@metaname_list);
 
 my @args;
 push @args, ("-metadata", "$metaname");
# buttonname is also used for the node's title
 push @args, ("-buttonname", "$metavalue");
 push @args, ("-doclevel", "$doclevel");
 push @args, "-recopt";

eval ("\$listclassobj = new NTUAZCompactList(\@args)");
    }
    if ($@) {
 print $outhandle "$@";
 die "\n";
    }
   
   $listclassobj->init();

     if (defined $metavalue && $metavalue =~ /\w/)
   {
 my $formatted_node = $metavalue;

  if ($self->{'removeprefix'}) {
     $formatted_node =~ s/^$self->{'removeprefix'}//;
 }
############# THIS #######################################################
      &sorttools::format_string_english($formatted_node);
############## REPLACED THIS ###############################################
#  if ($self->{'metaname'} =~ m/^Creator(:.*)?$/)
#  {
#  &sorttools::format_string_name_english($formatted_node);
#  }
#  else
#  {
#      &sorttools::format_string_english($formatted_node);
#  }
######## SD/IR 2003        #################################################

  # In case our formatted string is empty...
 if (! defined($formatted_node)) {
     print $outhandle "Warning: NTUAZCompactList: metavalue is ";
     print $outhandle "empty\n";
     $formatted_node="";
 }

  $self->{'classifiers'}->{$metavalue}
= { 'classifyobj'   => $listclassobj,
     'formattednode' => $formatted_node };
    }
}
   }


  return (@single_classlist,@multiple_classlist);
}
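
To make the mingroup logic in reinit easier to follow, here is a cut-down sketch of just the counting and splitting step. It is an illustration only, not the classifier code: %values_of, the OIDs and split_by_frequency are hypothetical, and the real sub also records a position index for grouped entries and fills 'reclassifylist', which is left out here.

```perl
use strict;
use warnings;

# Hypothetical input: OID => list of metadata values for that document.
my %values_of = (
    'D1' => ['Smith, J.'],
    'D2' => ['Smith, J.', 'Jones, A.'],
    'D3' => ['Brown, K.'],
);

sub split_by_frequency {
    my ($values_of, $mingroup) = @_;

    # Count how often each metadata value occurs across all documents.
    my %freq;
    $freq{$_}++ for map { @$_ } values %$values_of;

    # Values seen at least $mingroup times get grouped; the rest stay single.
    my (@single, @multiple);
    for my $oid (sort keys %$values_of) {
        for my $mv (@{ $values_of->{$oid} }) {
            if ($freq{$mv} >= $mingroup) { push @multiple, [$oid, $mv]; }
            else                         { push @single,   [$oid, $mv]; }
        }
    }
    return (\@single, \@multiple);
}

my ($single, $multiple) = split_by_frequency(\%values_of, 2);
printf "%d ungrouped, %d grouped\n", scalar @$single, scalar @$multiple;
# prints "2 ungrouped, 2 grouped"
```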

#############

###########################################################################
#
# MultiAZCompactList.pm --
# A component of the Greenstone digital library software
# from the New Zealand Digital Library Project at the
# University of Waikato, New Zealand.
#
# Copyright (C) 1999 New Zealand Digital Library Project
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
###########################################################################
#Unfortunately, the classifier starts to look pretty ugly when
# browsing past the first metadata level. This can only be fixed
# by editing the C++ receptionist code, which I might do when
# time allows.
#
#Anyway, have a play if you want (just remember it's only beta).
# I've tentatively called it the MultiAZCompactList, so you
# would add something like:
#
#classify    MultiAZCompactList -metadata Subject,Title
#               -groupsize 20,20
#
#to your collection configuration file. This would create a
# classifier which first classifies on Subject, then on Title.
# The groupsize specifies the number of child items allowed
# (for the corresponding metadata element) before an hlist
# partition is added. If you try it on the demo collection,
# use something small like -groupsize 2,2 to see the effect.
#
#


package MultiAZCompactList;


use BasClas;


sub BEGIN {
   @ISA = ('BasClas');
}


my $arguments =
  [ { 'name' => "metadata",
'desc' => "Metadata fields used for classification, comma separated.",
'type' => "metalist",
'reqd' => "yes" } ,
     { 'name' => "buttonname",
  'desc' => "Button name for this classifier.",
'type' => "string",
'deft' => "First metadata field specified with -metadata",
'reqd' => "no" },
     { 'name' => "alwaysgroup",
'desc' => "Create a bookshelf icon even if there is only one item in the group.",
'type' => "string",
'deft' => "True for all metadata fields except the last",
'reqd' => "no" },
     { 'name' => "groupsize",
'desc' => "The number of items in each hlist group.",
'type' => "string"} ];

my $options = { 'name'     => "MultiAZCompactList",
 'desc'     => "",
 'inherits' => "Yes",
 'args'     => $arguments };


sub new
{
   my $class = shift(@_);
   my $self = new BasClas($class, @_);

    # To allow for proper inheritance of arguments
   local $option_list = $self->{'option_list'};
   push(@{$option_list}, $options);

    local $metadata;
   local $buttonname;
   local $alwaysgroup;
   local $groupsize;
   if (!parsargv::parse(\@_,
   q^metadata/.*/^, \$metadata,
   q^buttonname/.*/^, \$buttonname,
   q^alwaysgroup/.*/^, \$alwaysgroup,
   q^groupsize/.*/^, \$groupsize,
   "allow_extra_options")) {
print STDERR "\nIncorrect options passed to $class, check your collect.cfg file\n";
$self->print_txt_usage();
die "\n";
   }

    # The metadata elements to use (required)
   if (!$metadata) {
die "Error: No metadata fields specified for MultiAZCompactList.\n";
   }
   local @metalist = split(/,/, $metadata);
   $self->{'metalist'} = \@metalist;

    # Create an empty list for the OID values
   $self->{'OIDlist'} = [];

    # Create an empty hash for the metadata values of each metadata element
   foreach $metaelem (@metalist) {
$self->{$metaelem . ".list"} = {};
   }

    # The classifier button name
   if (!$buttonname) {
# Default: the first metadata field specified
$buttonname = $metalist[0];
   }
   $self->{'title'} = $buttonname;

    # Whether to group single items into a bookshelf
   if (!$alwaysgroup) {
# Default: true for all metadata fields except the last
foreach $metaelem (@metalist) {
    $self->{$metaelem . ".alwaysgroup"} = "t";
}
local $lastelem = $metalist[$#metalist];
$self->{$lastelem . ".alwaysgroup"} = "f";
   }
   else {
local @alwaysgrouplist = split(/,/, $alwaysgroup);

 # Assign values based on the always group parameter
foreach $metaelem (@metalist) {
    local $alwaysgroupelem = shift(@alwaysgrouplist);
    if (defined($alwaysgroupelem)) {
 $self->{$metaelem . ".alwaysgroup"} = $alwaysgroupelem;
    }
    else {
 if ($metaelem ne $metalist[$#metalist]) {
     $self->{$metaelem . ".alwaysgroup"} = "t";
 }
 else {
     $self->{$metaelem . ".alwaysgroup"} = "f";
 }
    }
}
   }

    # The number of items in each group
   if (!$groupsize) {
# Default: 20 in first level, 19 in second level, ... etc.
local $thisgroupsize = 20;
foreach $metaelem (@metalist) {
    $self->{$metaelem . ".groupsize"} = $thisgroupsize;
    $thisgroupsize--;
}
   }
   else {
local @groupsizelist = split(/,/, $groupsize);

 # Assign values based on the groupsize parameter
foreach $metaelem (@metalist) {
    local $groupsizeelem = shift(@groupsizelist);
    if (defined($groupsizeelem)) {
 $self->{$metaelem . ".groupsize"} = $groupsizeelem;
    }
    else {
 $self->{$metaelem . ".groupsize"} = $self->{$metalist[0] . ".groupsize"};
    }
}
   }

    return bless $self, $class;
}


sub init
{
   # Nothing to do...
   local $self = shift(@_);
}


sub classify
{
   local $self = shift(@_);
   local $doc_obj = shift(@_);

    local $doc_OID = $doc_obj->get_OID();
   local $doc_top = $doc_obj->get_top_section();

    local @metalist = @{$self->{'metalist'}};

    # Only classify the document if it has a value for the first metadata element
   local $firstelem = $metalist[0];
   if (defined($doc_obj->get_metadata_element($doc_top, $firstelem))) {
push(@{$self->{'OIDlist'}}, $doc_OID);

 # Get the value of each metadata element for this document
foreach $metaelem (@metalist) {
    local $metavalue = $doc_obj->get_metadata_element($doc_top, $metaelem);

     # If there is no value for this metadata element, use "Unknown"
    if (!defined($metavalue)) {
 $metavalue = "Unknown";
    }

     # Make the value title case
    substr($metavalue, 0, 1) =~ tr/a-z/A-Z/;
    # print "Metaelem: $metaelem, Value: $metavalue "
    $self->{$metaelem . ".list"}->{$doc_OID} = $metavalue;
}
   }
}


sub get_classify_info
{
   local $self = shift(@_);

    # The metadata elements to classify by
   local @metalist = @{$self->{'metalist'}};

    # The OID values of the documents to include in the classification
   local @OIDlist = @{$self->{'OIDlist'}};
   # print "Number of OIDs to include in classification: " . @OIDlist . " "

    # The root node of the classification hierarchy
   local %classifyinfo = ( 'thistype' => "Invisible",
      'Title' => $self->{'title'},
      'contains' => [] );

    # Recursively create the classification hierarchy, one level for each metadata element
   # Pass references so add_az_list can dereference them and fill in the hash
   &add_az_list($self, \@metalist, \@OIDlist, \%classifyinfo);
   return \%classifyinfo;
}


sub add_az_list
{
   local $self = shift(@_);
   local @metalist = @{shift(@_)};
   local @OIDlist = @{shift(@_)};
   local $classifyinfo = shift(@_);
   # print " Adding AZ list for " . $classifyinfo->{'Title'} . " "

    local $metaelem = $metalist[0];
   # print "Processing metadata element: " . $metaelem . " "
   # print "Number of OID values: " . @OIDlist . " "

    local %OIDtometavaluehash = %{$self->{$metaelem . ".list"}};

    # Create a mapping from metadata value to OID
   local %metavaluetoOIDhash = ();
   foreach $OID (@OIDlist) {
local $metavalue = $OIDtometavaluehash{$OID};
push(@{$metavaluetoOIDhash{$metavalue}}, $OID);
   }
   # print "Number of distinct values: " . scalar(keys %metavaluetoOIDhash) . " "

    # Partition the values (if necessary)
   local $groupsize = $self->{$metaelem . ".groupsize"};
   if (scalar(keys %metavaluetoOIDhash) > $groupsize) {
local @sortedmetavalues = sort(keys %metavaluetoOIDhash);
local $itemsdone = 0;
local %metavaluetoOIDsubhash = ();
local $lastpartitionend = "";
local $partitionstart;
foreach $metavalue (@sortedmetavalues) {
    # print "Metavalue: $metavalue "
    $metavaluetoOIDsubhash{$metavalue} = $metavaluetoOIDhash{$metavalue};
    $itemsdone++;
    local $itemsinpartition = scalar(keys %metavaluetoOIDsubhash);

     # Is this the start of a new partition?
    if ($itemsinpartition == 1) {
 $partitionstart = &generate_partition_start($metavalue, $lastpartitionend);
    }

     # Is this the end of the partition?
    if ($itemsinpartition == $groupsize || $itemsdone == @sortedmetavalues) {
 local $partitionend = &generate_partition_end($metavalue, $partitionstart);

  local $partitionname = $partitionstart;
 if ($partitionend ne $partitionstart) {
     $partitionname = $partitionname . "-" . $partitionend;
 }
 # print "Partition: $partitionname "
 &add_hlist_partition($self, \@metalist, $classifyinfo, $partitionname, \%metavaluetoOIDsubhash);
 %metavaluetoOIDsubhash = ();
 $lastpartitionend = $partitionend;
    }
}

 # The partitions are stored in an HList
$classifyinfo->{'childtype'} = "HList";
   }

    # Otherwise just add all the values to a VList
   else {
&add_vlist($self, \@metalist, $classifyinfo, \%metavaluetoOIDhash);
$classifyinfo->{'childtype'} = "VList";
   }
}


sub generate_partition_start
{
   local $metavalue = shift(@_);
   local $lastpartitionend = shift(@_);

    local $partitionstart = substr($metavalue, 0, 1);
   if ($partitionstart le $lastpartitionend) {
$partitionstart = substr($metavalue, 0, 2);
# Give up after three characters
if ($partitionstart le $lastpartitionend) {
    $partitionstart = substr($metavalue, 0, 3);
}
   }

    return $partitionstart;
}


sub generate_partition_end
{
   local $metavalue = shift(@_);
   local $partitionstart = shift(@_);

    local $partitionend = substr($metavalue, 0, length($partitionstart));
   if ($partitionend gt $partitionstart) {
$partitionend = substr($metavalue, 0, 1);
if ($partitionend le $partitionstart) {
    $partitionend = substr($metavalue, 0, 2);
    # Give up after three characters
    if ($partitionend le $partitionstart) {
 $partitionend = substr($metavalue, 0, 3);
    }
}
   }

    return $partitionend;
}


sub add_hlist_partition
{
   local $self = shift(@_);
   local @metalist = @{shift(@_)};
   local $classifyinfo = shift(@_);
   local $partitionname = shift(@_);
   local $metavaluetoOIDhash = shift(@_);

    # Create an hlist partition
   local %subclassifyinfo = ( 'Title' => $partitionname,
         'childtype' => "VList",
         'contains' => [] );

    # Add the children to the hlist partition
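   # Note: %metavaluetoOIDsubhash is the caller's hash, visible here because
   # it was declared with 'local' (dynamic scope) in add_az_list.
   &add_vlist($self, \@metalist, \%subclassifyinfo, \%metavaluetoOIDsubhash);
   push(@{$classifyinfo->{'contains'}}, \%subclassifyinfo);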
}


sub add_vlist
{
   local $self = shift(@_);
   local @metalist = @{shift(@_)};
   local $classifyinfo = shift(@_);
   local $metavaluetoOIDhash = shift(@_);

    local $metaelem = shift(@metalist);

    # Create an entry in the vlist for each value
   foreach $metavalue (sort(keys %{$metavaluetoOIDhash})) {
local @OIDlist = @{$metavaluetoOIDhash->{$metavalue}};

 # If there is only one item and 'alwaysgroup' is false, add the item to the list
if (@OIDlist == 1 && $self->{$metaelem . ".alwaysgroup"} eq "f") {
    push(@{$classifyinfo->{'contains'}}, { 'OID' => $OIDlist[0] });
}

 # Otherwise create a sublist (bookshelf) for the metadata value
else {
    local %subclassifyinfo = ( 'Title' => $metavalue,
          'childtype' => "VList",
          'contains' => [] );

     # If there are metadata elements remaining, recursively apply the process
    if (@metalist > 0) {
 &add_az_list($self, \@metalist, \@OIDlist, \%subclassifyinfo);
    }
    # Otherwise just add the documents as children of this list
    else {
 foreach $OID (@OIDlist) {
     push(@{$subclassifyinfo{'contains'}}, { 'OID' => $OID });
 }
    }

     # Add the sublist to the list
    push(@{$classifyinfo->{'contains'}}, \%subclassifyinfo);
}
   }
}


1;
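
To see what -groupsize does to the hlist labels, the following standalone sketch may help. It is a simplified illustration, not the code above: partition_labels is a hypothetical helper that always labels a partition by the first letter of its first and last values, whereas generate_partition_start and generate_partition_end extend the label to two or three characters when neighbouring partitions would otherwise collide.

```perl
use strict;
use warnings;

# Split the sorted values into runs of at most $groupsize and label
# each run by the first letter of its first and last value.
sub partition_labels {
    my ($groupsize, @values) = @_;
    my @sorted = sort @values;
    my @labels;
    while (my @group = splice(@sorted, 0, $groupsize)) {
        my $start = substr($group[0],  0, 1);
        my $end   = substr($group[-1], 0, 1);
        push @labels, $start eq $end ? $start : "$start-$end";
    }
    return @labels;
}

my @labels = partition_labels(2, qw(Apples Bananas Cherries Dates Figs));
print join(" | ", @labels), "\n";   # A-B | C-D | F
```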

###########

 

 

>
>"Gordon Paynter" <gordon.paynter@ucr.edu>
>Sent by: greenstone-devel-bounces@list.scms.waikato.ac.nz
>07/09/2003 05:06 AM MST
>Please respond to gordon.paynter
>
> To: greenstone-devel@list.scms.waikato.ac.nz
> cc:
> bcc:
> Subject: [greenstone-devel] AZCompactList sorting