[greenstone-users] Re: MGPP -- Can't Load Document

From Katherine Don
Date Mon, 04 Aug 2003 09:53:03 +1200
Subject [greenstone-users] Re: MGPP -- Can't Load Document
In-Reply-To (Sea1-F51OMHffNPL83m0001606b-hotmail-com)
hi dave,

I tried your collection here, and it worked fine: I could search and view the
document (using Linux).
However, the document was treated as a single section. To use the sections that
you have added to the HTML file, you need to give HTMLPlug the -description_tags
option in the configuration file,
ie
plugin HTMLPlug -description_tags
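
The configuration change can be sketched in Perl as below. This is only an illustration: the helper name enable_description_tags and the sample configuration text are made up here, and the only thing taken from the message is the "plugin HTMLPlug -description_tags" line itself.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Greenstone): given the text of a
# collect.cfg, append -description_tags to the HTMLPlug plugin line
# unless the option is already there. Other lines are left untouched.
sub enable_description_tags {
    my ($cfg_text) = @_;
    # limit -1 keeps a trailing empty field, so a final newline survives
    my @lines = split /\n/, $cfg_text, -1;
    foreach my $line (@lines) {
        next unless $line =~ /^\s*plugin\s+HTMLPlug\b/;
        next if $line =~ /(?:^|\s)-description_tags\b/;
        $line =~ s/\s*$/ -description_tags/;
    }
    return join "\n", @lines;
}

# Made-up sample configuration, for illustration only.
my $cfg = "indexes text\nplugin GAPlug\nplugin HTMLPlug\n";
print enable_description_tags($cfg);
```

Running the helper a second time on its own output changes nothing, so it is safe to apply to a file that already has the option.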

Note that you do not need to put Document tags into the HTML files; these are
ignored.

When I tried to build the collection with -description_tags turned on, it
didn't build. There was a bug in the Perl building code (my fault), so I have
attached a modified mgppbuilder.pm file, which you should put in the
gsdl/perllib directory.
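
One cautious way to drop the attached file into place is sketched below. The helper name install_mgppbuilder and the ".orig" backup suffix are assumptions made for this example; the message only says to put mgppbuilder.pm in the gsdl/perllib directory.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Spec;

# Hypothetical helper: copy $new_file into $perllib_dir as mgppbuilder.pm.
# If a copy is already installed and no backup exists yet, the installed
# copy is first saved as mgppbuilder.pm.orig (suffix chosen for this sketch).
sub install_mgppbuilder {
    my ($new_file, $perllib_dir) = @_;
    my $target = File::Spec->catfile($perllib_dir, 'mgppbuilder.pm');
    if (-e $target && !-e "$target.orig") {
        copy($target, "$target.orig") or die "backup failed: $!\n";
    }
    copy($new_file, $target) or die "install failed: $!\n";
    return $target;
}

# Runs only when both arguments are given, for example:
#   perl install.pl mgppbuilder.pm /path/to/gsdl/perllib
if (@ARGV == 2) {
    print "installed ", install_mgppbuilder(@ARGV), "\n";
}
```

Keeping the first installed version as a backup makes it easy to go back if the modified module does not help.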

So, please try to build your collection again with -description_tags and the
new Perl module (you will also need to reimport: use -removeold with import.pl
to get rid of the old archives files). Hopefully that will work; otherwise
maybe it's something to do with Windows, and we'll have to look into it further.
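
The reimport and rebuild could be driven from a small Perl script along these lines. This is a sketch under assumptions: the message names only import.pl and -removeold, so the buildcol.pl step, the use of "perl -S" to find the scripts on the PATH, and the helper name rebuild_commands are additions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the two command lines for a clean rebuild.
# -removeold asks import.pl to delete the old archives before importing.
sub rebuild_commands {
    my ($collection) = @_;
    return (
        ['perl', '-S', 'import.pl', '-removeold', $collection],
        ['perl', '-S', 'buildcol.pl', $collection],
    );
}

# Runs only when a collection name is given on the command line.
my $collection = shift @ARGV;
if (defined $collection) {
    foreach my $cmd (rebuild_commands($collection)) {
        print "running: @$cmd\n";
        system(@$cmd) == 0 or die "command failed: @$cmd\n";
    }
}
```

Stopping at the first failing command keeps an import error from being hidden by a later build step.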

good luck,
katherine


Dave S wrote:

> Hi Katherine,
>
> Really appreciate the help you have been providing. Enclosed with this
> email is a sample html file that I inserted the <document>, <section> tags
> into, and my config file for the collection. If you find anything wrong,
> please let me know so I can fix the problem.
> thanks a lot for the help.
>
> Best Regards,
> Dave S
>
> >From: Katherine Don <kjdon@cs.waikato.ac.nz>
> >To: Dave S <precalc_x@hotmail.com>
> >Subject: Re: MGPP -- Can't Load Document
> >Date: Fri, 01 Aug 2003 13:32:11 +1200
> >
> >hi dave,
> >
> >I don't know what's going wrong. If you like, you can send me your
> >configuration file and a few documents, and I can have a go at building
> >your collection here.
> >
> >cheers,
> >katherine
> >
> >Dave S wrote:
> >
> > > Hi,
> > >
> > > Yes, this is the first time I have attempted to use MGPP. Currently, I am
> > > using Greenstone V.2.4 on Windows XP. I am really not sure what's wrong
> > > now. I tried to build the collection from the command line, but the same
> > > problem still occurs. It just comes up with a page saying, "Couldn't load
> > > text data". What can I do now? I have tried changing the <document>,
> > > <section> tags that I added to the HTML files. I have also changed the
> > > configuration files. But none of this helps at all.
> > >
> > > Thanks
> > > Dave S,
> > >
> > > >From: Katherine Don <kjdon@cs.waikato.ac.nz>
> > > >To: Dave S <precalc_x@hotmail.com>
> > > >CC: greenstone-users@list.scms.waikato.ac.nz
> > > >Subject: Re: [greenstone-users] MGPP -- Can't Load Document
> > > >Date: Thu, 31 Jul 2003 16:47:58 +1200
> > > >
> > > >hi
> > > >this sounds like your compressed text files are strange somehow.
> > > >have you been able to use mgpp collections before, or is this your first
> > > >attempt? did you build the collection using the collector or on the
> > > >command line? try rebuilding using the command line building method, and
> > > >make sure that no errors have occurred during the text compression phase.
> > > >you can also try running the Queryer program to see if that can access
> > > >the text of the documents (although I suspect it won't be able to).
> > > >
> > > >if you are using windows and an older version of greenstone, it may help
> > > >to get the latest release.
> > > >
> > > >good luck,
> > > >Katherine.
> > > >
> > > >
> > > >Dave S wrote:
> > > >
> > > > > HI,
> > > > >
> > > > > I have a problem loading documents that were built with MGPP. I
> > > > > learnt from the MGPP user guide that we need to add one more level of
> > > > > <document>..</document> tags in order for MGPP to handle the document
> > > > > properly. However, even after adding those tags, I still can't open
> > > > > the document. I just get the error message: Couldn't load text data.
> > > > > However, when I do the search, Greenstone does return the right
> > > > > document, but I just can't get access to it. Has anyone encountered
> > > > > this problem before?
> > > > >
> > > > > Thanks a lot for the help,
> > > > > Dave S,


<<attachment>>
Type: text/plain
Filename: mgppbuilder.pm

###########################################################################
#
# mgppbuilder.pm -- MGBuilder object
# A component of the Greenstone digital library software
# from the New Zealand Digital Library Project at the
# University of Waikato, New Zealand.
#
# Copyright (C) 1999 New Zealand Digital Library Project
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
###########################################################################

package mgppbuilder;

use classify;
use cfgread;
use colcfg;
use plugin;
use util;
use FileHandle;


BEGIN {
# set autoflush on for STDERR and STDOUT so that mgpp
# doesn't get out of sync with plugins
STDOUT->autoflush(1);
STDERR->autoflush(1);
}

END {
STDOUT->autoflush(0);
STDERR->autoflush(0);
}

$maxdocsize = 12000;

%level_map = ('document'=>'Doc',
'section'=>'Sec',
'paragraph'=>'Para',
'Doc'=>'Document',
'Sec'=>'Section',
'Para'=>'Paragraph');

#$doc_level = "Doc";
#$sec_level = "Sec";
#$para_level = "Para";

%wanted_index_files = ('td'=>1,
't'=>1,
'tl'=>1,
'ti'=>1,
'idb'=>1,
'ib1'=>1,
'ib2'=>1,
'ib3'=>1,
'i'=>1,
'il'=>1,
'w'=>1,
'wa'=>1);

# change this so a user can add their own ones in via a file or cfg
#add AND, OR, NOT NEAR to this list - these cannot be used as field names
#also add the level names (Doc, Sec, Para)
%static_indexfield_map = ('Title'=>'TI',
'TI'=>1,
'Subject'=>'SU',
'SU'=>1,
'Creator'=>'CR',
'CR'=>1,
'Organization'=>'ORG',
'ORG'=>1,
'Source'=>'SO',
'SO'=>1,
'Howto'=>'HT',
'HT'=>1,
'ItemTitle'=>'IT',
'IT'=>1,
'ProgNumber'=>'PN',
'PN'=>1,
'People'=>'PE',
'PE'=>1,
'allfields'=>'ZZ',
'ZZ'=>1,
'text'=>'TX',
'TX'=>1,
'AND'=>1,
'OR'=>1,
'NOT'=>1,
'NEAR'=>1,
'Doc'=>1,
'Sec'=>1,
'Para'=>1);

sub new {
my ($class, $collection, $source_dir, $build_dir, $verbosity,
$maxdocs, $debug, $keepold, $allclassifications,
$outhandle, $no_text) = @_;

$outhandle = STDERR unless defined $outhandle;
$no_text = 0 unless defined $no_text;

# create an mgppbuilder object
my $self = bless {'collection'=>$collection,
'source_dir'=>$source_dir,
'build_dir'=>$build_dir,
'verbosity'=>$verbosity,
'maxdocs'=>$maxdocs,
'debug'=>$debug,
'keepold'=>$keepold,
'allclassifications'=>$allclassifications,
'outhandle'=>$outhandle,
'no_text'=>$no_text,
'notbuilt'=>[], # indexes not built
'indexfieldmap'=>\%static_indexfield_map
}, $class;


# read in the collection configuration file
my $colcfgname = "$ENV{'GSDLCOLLECTDIR'}/etc/collect.cfg";
if (!-e $colcfgname) {
die "mgppbuilder::new - couldn't find collect.cfg for collection $collection\n";
}
$self->{'collect_cfg'} = &colcfg::read_collect_cfg ($colcfgname);

# sort out the indexes
#indexes are specified with spaces, but we put them into one index
my $indexes = $self->{'collect_cfg'}->{'indexes'};
$self->{'collect_cfg'}->{'indexes'} = [];
push (@{$self->{'collect_cfg'}->{'indexes'}}, join(',', @$indexes));


# sort out subcollection indexes
if (defined $self->{'collect_cfg'}->{'indexsubcollections'}) {
my $indexes = $self->{'collect_cfg'}->{'indexes'};
$self->{'collect_cfg'}->{'indexes'} = [];
foreach $subcollection (@{$self->{'collect_cfg'}->{'indexsubcollections'}}) {
foreach $index (@$indexes) {
push (@{$self->{'collect_cfg'}->{'indexes'}}, "$index:$subcollection");
}
}
}

# sort out language subindexes
if (defined $self->{'collect_cfg'}->{'languages'}) {
my $indexes = $self->{'collect_cfg'}->{'indexes'};
$self->{'collect_cfg'}->{'indexes'} = [];
foreach $language (@{$self->{'collect_cfg'}->{'languages'}}) {
foreach $index (@$indexes) {
if (defined ($self->{'collect_cfg'}->{'indexsubcollections'})) {
push (@{$self->{'collect_cfg'}->{'indexes'}}, "$index:$language");
}
else { # add in an empty subcollection field
push (@{$self->{'collect_cfg'}->{'indexes'}}, "${index}::$language");
}
}
}
}

# make sure that the same index isn't specified more than once
my %tmphash = ();
my @tmparray = @{$self->{'collect_cfg'}->{'indexes'}};
$self->{'collect_cfg'}->{'indexes'} = [];
foreach my $i (@tmparray) {
if (!defined ($tmphash{$i})) {
push (@{$self->{'collect_cfg'}->{'indexes'}}, $i);
$tmphash{$i} = 1;
}
}


# get the levels (Section, Paragraph) for indexing and compression
$self->{'levels'} = {};
$self->{'levelorder'} = [];
if (defined $self->{'collect_cfg'}->{'levels'}) {
foreach $level ( @{$self->{'collect_cfg'}->{'levels'}} ){
$level =~ tr/A-Z/a-z/;
$self->{'levels'}->{$level} = 1;
push (@{$self->{'levelorder'}}, $level);
}
} else { # default to document
$self->{'levels'}->{'document'} = 1;
push (@{$self->{'levelorder'}}, 'document');
}

$self->{'doc_level'} = "document";
if (! $self->{'levels'}->{'document'}) {
if ($self->{'levels'}->{'section'}) {
$self->{'doc_level'} = "section";
} else {
die "you must have either document or section level specified!!\n";
}
}
print $outhandle "doclevel = ". $self->{'doc_level'}."\n";
# get the list of plugins for this collection
my $plugins = [];
if (defined $self->{'collect_cfg'}->{'plugin'}) {
$plugins = $self->{'collect_cfg'}->{'plugin'};
}

# load all the plugins
$self->{'pluginfo'} = &plugin::load_plugins ($plugins, $verbosity, $outhandle);
if (scalar(@{$self->{'pluginfo'}}) == 0) {
print $outhandle "No plugins were loaded.\n";
die "\n";
}

# get the list of classifiers for this collection
my $classifiers = [];
if (defined $self->{'collect_cfg'}->{'classify'}) {
$classifiers = $self->{'collect_cfg'}->{'classify'};
}

# load all the classifiers
$self->{'classifiers'} = &classify::load_classifiers ($classifiers, $build_dir, $outhandle);

# load up any dontgdbm fields
$self->{'dontgdbm'} = {};
if (defined ($self->{'collect_cfg'}->{'dontgdbm'})) {
foreach $dg (@{$self->{'collect_cfg'}->{'dontgdbm'}}) {
$self->{'dontgdbm'}->{$dg} = 1;
}
}

# load up the document processor for building
# if a buildproc class has been created for this collection, use it
# otherwise, use the mgpp buildproc
my ($buildprocdir, $buildproctype);
if (-e "$ENV{'GSDLCOLLECTDIR'}/perllib/${collection}buildproc.pm") {
$buildprocdir = "$ENV{'GSDLCOLLECTDIR'}/perllib";
$buildproctype = "${collection}buildproc";
} else {
$buildprocdir = "$ENV{'GSDLHOME'}/perllib";
$buildproctype = "mgppbuildproc";
}
require "$buildprocdir/$buildproctype.pm";

eval("\$self->{'buildproc'} = new $buildproctype(\$collection, " .
"\$source_dir, \$build_dir, \$verbosity, \$outhandle)");
die "$@" if $@;


return $self;
}

sub init {
my $self = shift (@_);

if (!$self->{'debug'} && !$self->{'keepold'}) {
# remove any old builds
&util::rm_r($self->{'build_dir'});
&util::mk_all_dir($self->{'build_dir'});

# make the text directory
my $textdir = "$self->{'build_dir'}/text";
&util::mk_all_dir($textdir);
}
}

sub set_strip_html {
my $self = shift (@_);
my ($strip) = @_;

$self->{'strip_html'} = $strip;
$self->{'buildproc'}->set_strip_html($strip);
}

sub compress_text {

my $self = shift (@_);
my ($textindex) = @_;

my $exedir = "$ENV{'GSDLHOME'}/bin/$ENV{'GSDLOS'}";
my $exe = &util::get_os_exe ();
my $mgpp_passes_exe = &util::filename_cat($exedir, "mgpp_passes$exe");
my $mgpp_compression_dict_exe = &util::filename_cat($exedir, "mgpp_compression_dict$exe");
my $outhandle = $self->{'outhandle'};

&util::mk_all_dir (&util::filename_cat($self->{'build_dir'}, "text"));

my $basefilename = "text/$self->{'collection'}";
my $fulltextprefix = &util::filename_cat ($self->{'build_dir'}, $basefilename);

my $osextra = "";
if ($ENV{'GSDLOS'} =~ /^windows$/i) {
$fulltextprefix =~ s@/@\\@g;
}
else {
$osextra = " -d /";
}


# define the section names and possibly the doc name for mgpasses
# the compressor doesn't need to know about paragraphs - never want to
# retrieve them
my $mgpp_passes_sections = "";
my ($doc_level) = $self->{'doc_level'};
$mgpp_passes_sections .= "-J " . $level_map{$doc_level} . " ";
foreach $level (keys %{$self->{'levels'}}) {
if ($level ne $doc_level && $level ne "paragraph") {
$mgpp_passes_sections .= "-K " . $level_map{$level} . " ";
}
}

print $outhandle "\n*** creating the compressed text\n" if ($self->{'verbosity'} >= 1);

# collect the statistics for the text
# -b $maxdocsize sets the maximum document size to be 12 meg
print $outhandle "\n    collecting text statistics (mgpp_passes -T1)\n" if ($self->{'verbosity'} >= 1);

my ($handle);
if ($self->{'debug'}) {
$handle = STDOUT;
} else {
#print $outhandle "trying to run (compress 1) mgpp_passes$exe $mgpp_passes_sections -f \"$fulltextprefix\" -T1 $osextra\n";
if (!-e "$mgpp_passes_exe" ||
!open (PIPEOUT, "| mgpp_passes$exe $mgpp_passes_sections -f \"$fulltextprefix\" -T1 $osextra")) {
die "mgppbuilder::compress_text - couldn't run $mgpp_passes_exe\n";
}
$handle = mgppbuilder::PIPEOUT;
}
$self->{'buildproc'}->set_output_handle ($handle);
$self->{'buildproc'}->set_mode ('text');
$self->{'buildproc'}->set_index ($textindex);
$self->{'buildproc'}->set_indexing_text (0);
if ($self->{'no_text'}) {
$self->{'buildproc'}->set_store_text(0);
} else {
$self->{'buildproc'}->set_store_text(1);
}
$self->{'buildproc'}->set_indexfieldmap ($self->{'indexfieldmap'});
$self->{'buildproc'}->set_levels ($self->{'levels'});
$self->{'buildproc'}->reset();
&plugin::begin($self->{'pluginfo'}, $self->{'source_dir'},
$self->{'buildproc'}, $self->{'maxdocs'});
&plugin::read ($self->{'pluginfo'}, $self->{'source_dir'},
"", {}, $self->{'buildproc'}, $self->{'maxdocs'});
&plugin::end($self->{'pluginfo'});
close (PIPEOUT);

close ($handle) unless $self->{'debug'};

$self->print_stats();

# create the compression dictionary
# the compression dictionary is built by assuming the stats are from a seed
# dictionary (-S), if a novel word is encountered it is spelled out (-H),
# and the resulting dictionary must be less than 5 meg with the most
# frequent words being put into the dictionary first (-2 -k 5120)
# note: these options are left over from mg version
if (!$self->{'debug'}) {
print $outhandle "\n    creating the compression dictionary\n" if ($self->{'verbosity'} >= 1);
if (!-e "$mgpp_compression_dict_exe") {
die "mgppbuilder::compress_text - couldn't run $mgpp_compression_dict_exe\n";
}
system ("mgpp_compression_dict$exe -f \"$fulltextprefix\" -S -H -2 -k 5120 $osextra");

if (!$self->{'debug'}) {
#print $outhandle "trying to run (compress 2) mgpp_passes$exe $mgpp_passes_sections -f \"$fulltextprefix\" -T2 $osextra\n";
if (!-e "$mgpp_passes_exe" ||
!open ($handle, "| mgpp_passes$exe $mgpp_passes_sections -f \"$fulltextprefix\" -T2 $osextra")) {
die "mgppbuilder::compress_text - couldn't run $mgpp_passes_exe\n";
}
}
}

$self->{'buildproc'}->reset();
# compress the text
print $outhandle "\n    compressing the text (mgpp_passes -T2)\n" if ($self->{'verbosity'} >= 1);
&plugin::read ($self->{'pluginfo'}, $self->{'source_dir'},
"", {}, $self->{'buildproc'}, $self->{'maxdocs'});
close ($handle) unless $self->{'debug'};

$self->print_stats();
}

sub want_built {
my $self = shift (@_);
my ($index) = @_;

if (defined ($self->{'collect_cfg'}->{'dontbuild'})) {
foreach $checkstr (@{$self->{'collect_cfg'}->{'dontbuild'}}) {
if ($index =~ /^$checkstr$/) {
push (@{$self->{'notbuilt'}}, $self->{'index_mapping'}->{$index});
return 0;
}
}
}

return 1;
}

sub build_indexes {
my $self = shift (@_);
my ($indexname) = @_;
my $outhandle = $self->{'outhandle'};

my $indexes = [];
if (defined $indexname && $indexname =~ /\w/) {
push @$indexes, $indexname;
} else {
$indexes = $self->{'collect_cfg'}->{'indexes'};
}

# create the mapping between the index descriptions
# and their directory names (includes subcolls and langs)
$self->{'index_mapping'} = $self->create_index_mapping ($indexes);

# build each of the indexes
foreach $index (@$indexes) {
if ($self->want_built($index)) {
print $outhandle "\n*** building index $index in subdirectory " .
"$self->{'index_mapping'}->{$index}\n" if ($self->{'verbosity'} >= 1);
$self->build_index($index);
} else {
print $outhandle "\n*** ignoring index $index\n" if ($self->{'verbosity'} >= 1);
}
}
}

# creates directory names for each of the index descriptions
sub create_index_mapping {
my $self = shift (@_);
my ($indexes) = @_;

my %mapping = ();
$mapping{'indexmaporder'} = [];
$mapping{'subcollectionmaporder'} = [];
$mapping{'languagemaporder'} = [];

# dirnames is used to check for collisions. Start this off
# with the mandatory directory names
my %dirnames = ('text'=>'text',
'extra'=>'extra');
my %pnames = ('index' => '', 'subcollection' => '', 'languages' => '');

foreach $index (@$indexes) {
my ($fields, $subcollection, $languages) = split (":", $index);

# the directory name starts with a processed version of index fields
#my ($pindex) = $self->process_field($fields);
#$pindex = lc ($pindex);
# now we only ever have one index, and its called 'idx'
$pindex = 'idx';

# next comes a processed version of the subcollection if there is one.
my $psub = $self->process_field ($subcollection);
$psub = lc ($psub);

# next comes a processed version of the language if there is one.
my $plang = $self->process_field ($languages);
$plang = lc ($plang);

my $dirname = $pindex . $psub . $plang;

# check to be sure all index names are unique
while (defined ($dirnames{$dirname})) {
$dirname = $self->make_unique (\%pnames, $index, \$pindex, \$psub, \$plang);
}

$mapping{$index} = $dirname;

# store the mapping orders as well as the maps
# also put index, subcollection and language fields into the mapping thing -
# (the full index name (eg text:subcol:lang) is not used on
# the query page) -these are used for collectionmeta later on
if (!defined $mapping{'indexmap'}{"$fields"}) {
$mapping{'indexmap'}{"$fields"} = $pindex;
push (@{$mapping{'indexmaporder'}}, "$fields");
if (!defined $mapping{"$fields"}) {
$mapping{"$fields"} = $pindex;
}
}
if ($psub =~ /\w/ && !defined ($mapping{'subcollectionmap'}{$subcollection})) {
$mapping{'subcollectionmap'}{$subcollection} = $psub;
push (@{$mapping{'subcollectionmaporder'}}, $subcollection);
$mapping{$subcollection} = $psub;
}
if ($plang =~ /\w/ && !defined ($mapping{'languagemap'}{$languages})) {
$mapping{'languagemap'}{$languages} = $plang;
push (@{$mapping{'languagemaporder'}}, $languages);
$mapping{$languages} = $plang;
}
$dirnames{$dirname} = $index;
$pnames{'index'}{$pindex} = "$fields";
$pnames{'subcollection'}{$psub} = $subcollection;
$pnames{'languages'}{$plang} = $languages;
}

return \%mapping;
}

# returns a processed version of a field.
# if the field has only one component the processed
# version will contain the first character and next consonant
# of that component (eg "Title" gives "Tt") - otherwise it will
# contain the first character of the first two components
# (eg "text,Title" gives "tT")
sub process_field {
my $self = shift (@_);
my ($field) = @_;

return "" unless (defined ($field) && $field =~ /\w/);

my @components = split /,/, $field;
if (scalar @components >= 2) {
splice (@components, 2);
map {s/^(.).*$/$1/;} @components;
return join("", @components);
} else {
my ($a, $b) = $field =~ /^(.).*?([bcdfghjklmnpqrstvwxyz])/i;
($a, $b) = $field =~ /^(.)(.)/ unless defined $a && defined $b;
return "$a$b";
}
}

sub make_unique {
my $self = shift (@_);
my ($namehash, $index, $indexref, $subref, $langref) = @_;
my ($fields, $subcollection, $languages) = split (":", $index);

if ($namehash->{'index'}->{$$indexref} ne "$fields") {
$self->get_next_version ($indexref);
} elsif ($namehash->{'subcollection'}->{$$subref} ne $subcollection) {
$self->get_next_version ($subref);
} elsif ($namehash->{'languages'}->{$$langref} ne $languages) {
$self->get_next_version ($langref);
}
return "$$indexref$$subref$$langref";
}

sub get_next_version {
my $self = shift (@_);
my ($nameref) = @_;

if ($$nameref =~ /(\d\d)$/) {
my $num = $1; $num ++;
$$nameref =~ s/\d\d$/$num/;
} elsif ($$nameref =~ /(\d)$/) {
my $num = $1;
# only one trailing digit here, so a trailing 9 is replaced by 10
if ($num == 9) {$$nameref =~ s/\d$/10/;}
else {$num ++; $$nameref =~ s/\d$/$num/;}
} else {
$$nameref =~ s/.$/0/;
}
}

sub build_index {
my $self = shift (@_);
my ($index) = @_;
my $outhandle = $self->{'outhandle'};

# get the full index directory path and make sure it exists
my $indexdir = $self->{'index_mapping'}->{$index};
&util::mk_all_dir (&util::filename_cat($self->{'build_dir'}, $indexdir));
my $fullindexprefix = &util::filename_cat ($self->{'build_dir'},
$indexdir,
$self->{'collection'});
my $fulltextprefix = &util::filename_cat ($self->{'build_dir'}, "text",
$self->{'collection'});

# get any os specific stuff
my $exedir = "$ENV{'GSDLHOME'}/bin/$ENV{'GSDLOS'}";

my $exe = &util::get_os_exe ();
my $mgpp_passes_exe = &util::filename_cat($exedir, "mgpp_passes$exe");

# define the section names for mgpasses
# define the section names and possibly the doc name for mgpasses
my $mgpp_passes_sections = "";
my ($doc_level) = $self->{'doc_level'};
$mgpp_passes_sections .= "-J " . $level_map{$doc_level} ." ";

foreach $level (keys %{$self->{'levels'}}) {
if ($level ne $doc_level) {
$mgpp_passes_sections .= "-K " . $level_map{$level}. " ";
}
}

my $mgpp_perf_hash_build_exe =
&util::filename_cat($exedir, "mgpp_perf_hash_build$exe");
my $mgpp_weights_build_exe =
&util::filename_cat ($exedir, "mgpp_weights_build$exe");
my $mgpp_invf_dict_exe =
&util::filename_cat ($exedir, "mgpp_invf_dict$exe");
my $mgpp_stem_idx_exe =
&util::filename_cat ($exedir, "mgpp_stem_idx$exe");

my $osextra = "";
if ($ENV{'GSDLOS'} =~ /^windows$/i) {
$fullindexprefix =~ s@/@\\@g;
} else {
$osextra = " -d /";
if ($outhandle ne "STDERR") {
# so mgpp_passes doesn't print to stderr if we redirect output
$osextra .= " 2>/dev/null";
}
}

# get the index expression if this index belongs
# to a subcollection
my $indexexparr = [];

# there may be subcollection info, and language info.
my ($fields, $subcollection, $language) = split (":", $index);
my @subcollections = ();
@subcollections = split /,/, $subcollection if (defined $subcollection);

foreach $subcollection (@subcollections) {
if (defined ($self->{'collect_cfg'}->{'subcollection'}->{$subcollection})) {
push (@$indexexparr, $self->{'collect_cfg'}->{'subcollection'}->{$subcollection});
}
}

# add expressions for languages if this index belongs to
# a language subcollection - only put languages expressions for the
# ones we want in the index

my @languages = ();
@languages = split /,/, $language if (defined $language);
foreach $language (@languages) {
my $not=0;
if ($language =~ s/^!//) {
$not = 1;
}
foreach $lang (@{$self->{'collect_cfg'}->{'languages'}}) {
if ($lang eq $language) {
if ($not) {
push (@$indexexparr, "!Language/$language/");
} else {
push (@$indexexparr, "Language/$language/");
}
last;
}
}
}

# Build index dictionary. Uses verbatim stem method
print $outhandle "\n    creating index dictionary (mgpp_passes -I1)\n" if ($self->{'verbosity'} >= 1);
my ($handle);
if ($self->{'debug'}) {
$handle = STDOUT;
} else {
if (!-e "$mgpp_passes_exe" ||
!open (PIPEOUT, "| mgpp_passes$exe $mgpp_passes_sections -f \"$fullindexprefix\" -I1 $osextra")) {
die "mgppbuilder::build_index - couldn't run $mgpp_passes_exe\n";
}
$handle = mgppbuilder::PIPEOUT;
}

# set up the document processor
$self->{'buildproc'}->set_output_handle ($handle);
$self->{'buildproc'}->set_mode ('text');
$self->{'buildproc'}->set_index ($index, $indexexparr);
$self->{'buildproc'}->set_indexing_text (1);
$self->{'buildproc'}->set_store_text(1);
$self->{'buildproc'}->set_indexfieldmap ($self->{'indexfieldmap'});
$self->{'buildproc'}->set_levels ($self->{'levels'});
$self->{'buildproc'}->reset();
&plugin::read ($self->{'pluginfo'}, $self->{'source_dir'},
"", {}, $self->{'buildproc'}, $self->{'maxdocs'});
close ($handle) unless $self->{'debug'};

$self->print_stats();

if (!$self->{'debug'}) {
# create the perfect hash function
if (!-e "$mgpp_perf_hash_build_exe") {
die "mgppbuilder::build_index - couldn't run $mgpp_perf_hash_build_exe\n";
}
system ("mgpp_perf_hash_build$exe -f \"$fullindexprefix\" $osextra");

if (!-e "$mgpp_passes_exe" ||
!open ($handle, "| mgpp_passes$exe $mgpp_passes_sections -f \"$fullindexprefix\" -I2 $osextra")) {
die "mgppbuilder::build_index - couldn't run $mgpp_passes_exe\n";
}
}

# invert the text
print $outhandle "\n    inverting the text (mgpp_passes -I2)\n" if ($self->{'verbosity'} >= 1);

$self->{'buildproc'}->reset();
&plugin::read ($self->{'pluginfo'}, $self->{'source_dir'},
"", {}, $self->{'buildproc'}, $self->{'maxdocs'});

$self->print_stats ();

if (!$self->{'debug'}) {

close ($handle);

# create the weights file
print $outhandle "\n    create the weights file\n" if ($self->{'verbosity'} >= 1);
if (!-e "$mgpp_weights_build_exe") {
die "mgppbuilder::build_index - couldn't run $mgpp_weights_build_exe\n";
}
system ("mgpp_weights_build$exe -f \"$fullindexprefix\" $osextra");

# create 'on-disk' stemmed dictionary
print $outhandle "\n    creating 'on-disk' stemmed dictionary\n" if ($self->{'verbosity'} >= 1);
if (!-e "$mgpp_invf_dict_exe") {
die "mgppbuilder::build_index - couldn't run $mgpp_invf_dict_exe\n";
}
system ("mgpp_invf_dict$exe -f \"$fullindexprefix\" $osextra" );


# creates stem index files for the various stemming methods
print $outhandle "\n    creating stem indexes\n" if ($self->{'verbosity'} >= 1);
if (!-e "$mgpp_stem_idx_exe") {
die "mgppbuilder::build_index - couldn't run $mgpp_stem_idx_exe\n";
}
system ("mgpp_stem_idx$exe -b 4096 -s1 -f \"$fullindexprefix\" $osextra");
system ("mgpp_stem_idx$exe -b 4096 -s2 -f \"$fullindexprefix\" $osextra");
system ("mgpp_stem_idx$exe -b 4096 -s3 -f \"$fullindexprefix\" $osextra");

#define the final field lists
$self->make_final_field_list();

# remove unwanted files
my $tmpdir = &util::filename_cat ($self->{'build_dir'}, $indexdir);
opendir (DIR, $tmpdir) || die
"mgppbuilder::build_index - couldn't read directory $tmpdir\n";
foreach $file (readdir(DIR)) {
next if $file =~ /^\./;
my ($suffix) = $file =~ /\.([^.]+)$/;
if (defined $suffix && !defined $wanted_index_files{$suffix}) {
# delete it!
print $outhandle "deleting $file\n" if $self->{'verbosity'} > 2;
#&util::rm (&util::filename_cat ($tmpdir, $file));
}
}
closedir (DIR);
}
}

sub make_infodatabase {
my $self = shift (@_);
my $outhandle = $self->{'outhandle'};


my $textdir = &util::filename_cat($self->{'build_dir'}, "text");
my $assocdir = &util::filename_cat($self->{'build_dir'}, "assoc");
&util::mk_all_dir ($textdir);
&util::mk_all_dir ($assocdir);

# get db name
my $dbext = ".bdb";
$dbext = ".ldb" if &util::is_little_endian();
my $fulldbname = &util::filename_cat ($textdir, "$self->{'collection'}$dbext");
$fulldbname =~ s/\//\\/g if ($ENV{'GSDLOS'} =~ /^windows$/i);

my $exedir = "$ENV{'GSDLHOME'}/bin/$ENV{'GSDLOS'}";
my $exe = &util::get_os_exe ();
my $txt2db_exe = &util::filename_cat($exedir, "txt2db$exe");

# define the indexed field mapping if not already done so (ie if infodb called separately from build_index)
if (!defined $self->{'build_cfg'}) {
$self->read_final_field_list();
}
print $outhandle "\n*** creating the info database and processing associated files\n"
if ($self->{'verbosity'} >= 1);

# init all the classifiers
&classify::init_classifiers ($self->{'classifiers'});

# set up the document processor
my ($handle);
if ($self->{'debug'}) {
$handle = STDOUT;
} else {
if (!-e "$txt2db_exe" || !open (PIPEOUT, "| txt2db$exe \"$fulldbname\"")) {
die "mgppbuilder::make_infodatabase - couldn't run $txt2db_exe\n";
}
$handle = mgppbuilder::PIPEOUT;
}

$self->{'buildproc'}->set_output_handle ($handle);
$self->{'buildproc'}->set_mode ('infodb');
$self->{'buildproc'}->set_assocdir ($assocdir);
$self->{'buildproc'}->set_dontgdbm ($self->{'dontgdbm'});
$self->{'buildproc'}->set_classifiers ($self->{'classifiers'});
$self->{'buildproc'}->set_indexing_text (0);
$self->{'buildproc'}->set_store_text(1);
#$self->{'buildproc'}->set_indexfieldmap ($self->{'indexfieldmap'});

$self->{'buildproc'}->reset();

# do the collection info
print $handle "[collection]\n";

# first do the collection meta stuff - everything without a dot
my $collmetadefined = 0;
if (defined $self->{'collect_cfg'}->{'collectionmeta'}) {
$collmetadefined = 1;
foreach $cmeta (keys (%{$self->{'collect_cfg'}->{'collectionmeta'}})) {
next if ($cmeta =~ /^\./); # for now, ignore ones with dots
my ($metadata_entry) = $self->create_language_db_map($cmeta, $cmeta);
#write the entry to the file
print $handle $metadata_entry;

} # foreach collmeta key
}
#add the index field macros to [collection]
# eg <TI>Title
# <SU>Subject
# these now come from collection meta. if that is not defined, uses the metadata name
$field_entry="";
foreach $longfield (@{$self->{'build_cfg'}->{'indexfields'}}){
$shortfield = $self->{'buildproc'}->{'indexfieldmap'}->{$longfield};
next if $shortfield eq 1;

# we need to check if some coll meta has been defined
my $collmeta = ".$longfield";
if ($collmetadefined && defined $self->{'collect_cfg'}->{'collectionmeta'}->{$collmeta}) {
$metadata_entry = $self->create_language_db_map($collmeta, $shortfield);
$field_entry .= $metadata_entry;
} else { #use the metadata names, or the text macros for allfields and textonly
if ($longfield eq "allfields") {
$field_entry .= "<$shortfield>all fields\n";
} elsif ($longfield eq "text") {
$field_entry .= "<$shortfield>text only\n";
} else {
$field_entry .= "<$shortfield>$longfield\n";
}
}
}
print $handle $field_entry;

# now add the level names
$level_entry = "";
foreach $level (@{$self->{'collect_cfg'}->{'levels'}}) {
my $collmeta = ".$level"; # based on the original specification
$level =~ tr/A-Z/a-z/; # make it lower case
my $levelid = $level_map{$level}; # find the actual value we used in the index
if ($collmetadefined && defined $self->{'collect_cfg'}->{'collectionmeta'}->{$collmeta}) {
$metadata_entry = $self->create_language_db_map($collmeta, $levelid);
$level_entry .= $metadata_entry;
} else {
# use the default macro
$level_entry .= "<$levelid>" . $level_map{$levelid} . "\n";
}
}
print $handle $level_entry;
#end the collection entry
print $handle "\n" . ('-' x 70) . "\n";

&plugin::read ($self->{'pluginfo'}, $self->{'source_dir'},
"", {}, $self->{'buildproc'}, $self->{'maxdocs'});

# output classification information
&classify::output_classify_info ($self->{'classifiers'}, $handle,
$self->{'allclassifications'});

#output doclist
my @doclist = $self->{'buildproc'}->get_doc_list();
my $docs = join (";",@doclist);
print $handle "[browselist]\n";
print $handle "<hastxt>0\n";
print $handle "<childtype>VList\n";
print $handle "<numleafdocs>" . ($#doclist+1) . "\n";
print $handle "<thistype>Invisible\n";
print $handle "<contains>$docs";
print $handle "\n" . ('-' x 70) . "\n";
close ($handle) if !$self->{'debug'};

}

sub create_language_db_map {
my $self = shift (@_);
my ($metaname, $mapname) = @_;
my $outhandle = $self->{'outhandle'};
my $defaultfound=0;
my $first=1;
my $metadata_entry = "";
my $default="";
#iterate through the languages
foreach $lang (keys (%{$self->{'collect_cfg'}->{'collectionmeta'}->{$metaname}})) {
if ($first) {
$first=0;
#set the default default to the first entry
$default=$self->{'collect_cfg'}->{'collectionmeta'}->{$metaname}->{$lang};
}
if ($lang =~ /default/) {
$defaultfound=1;
#the default entry goes first
$metadata_entry = "<$mapname>" .
$self->{'collect_cfg'}->{'collectionmeta'}->{$metaname}->{'default'} . "\n" . $metadata_entry;
}
else {
my ($l) = $lang =~ /^\[l=(\w*)\]$/;
if ($l) {
$metadata_entry .= "<$mapname:$l>" .
$self->{'collect_cfg'}->{'collectionmeta'}->{$metaname}->{$lang} . "\n";
}
}
} #foreach lang
#if we haven't found a default, put one in
if (!$defaultfound) {
$metadata_entry = "<$mapname>$default\n" . $metadata_entry;
}
return $metadata_entry;

}
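# Example for create_language_db_map (a sketch; the values are hypothetical).
# Assume $self->{'collect_cfg'}->{'collectionmeta'}->{'.section'} holds
#   { 'default' => 'chapter', '[l=fr]' => 'chapitre' }
# Then create_language_db_map('.section', 'Sec') is intended to return the
# default entry first, followed by one entry per language:
#   <Sec>chapter
#   <Sec:fr>chapitre
# If no 'default' key is present, the first key visited is used as the
# default; hash order is not fixed, so which one that is may vary.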
sub collect_specific {
my $self = shift (@_);
}

# at the end of building, we have an indexfieldmap with all the mappings, plus
# some extras, and an indexmap with any indexes in it that weren't specified
# in the index definition. we want to make an ordered list of fields that are
# indexed, and a list of mappings that are used. this will be used for the
# build.cfg file, and for the collection meta definition.
# we store these in a build.cfg bit
sub make_final_field_list {
my $self = shift (@_);

$self->{'build_cfg'} = {};

# store the indexfieldmap information
my @indexfieldmap = ();
my @indexfields = ();
my $specifiedfields = {};
my @specifiedfieldorder = ();
# go through the index definition and add each thing to a map, so we can
# easily check if it is already specified - when doing the metadata, we print
# out all the individual fields, but some may already be specified in the
# index definition, so we don't want to add those again.
foreach $field (@{$self->{'collect_cfg'}->{'indexes'}}) {
my @fs = split(',', $field);
foreach $f(@fs) {
$specifiedfields->{$f}=1;
push (@specifiedfieldorder, "$f");
}
}

#add all fields bit
foreach $field (@specifiedfieldorder) {
if ($field eq "metadata") {
foreach $newfield (keys %{$self->{'buildproc'}->{'indexfields'}}) {
if (!defined $specifiedfields->{$newfield}) {
push (@indexfieldmap, "$newfield->$self->{'buildproc'}->{'indexfieldmap'}->{$newfield}");
push (@indexfields, "$newfield");
}
}

} elsif ($field eq 'text') {
push (@indexfieldmap, "text->TX");
push (@indexfields, "text");
} elsif ($field eq 'allfields') {
push (@indexfieldmap, "allfields->ZZ");
push (@indexfields, "allfields");
} else {
push (@indexfieldmap, "$field->$self->{'buildproc'}->{'indexfieldmap'}->{$field}");
push (@indexfields, "$field");

}
}
# store references to the lists, not their element counts
$self->{'build_cfg'}->{'indexfieldmap'} = \@indexfieldmap;
$self->{'build_cfg'}->{'indexfields'} = \@indexfields;


}
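
# Example for make_final_field_list (a sketch; the index definition is
# hypothetical). Assume $self->{'collect_cfg'}->{'indexes'} is
#   ['text,allfields']
# The loops above then build
#   @indexfieldmap = ("text->TX", "allfields->ZZ")
#   @indexfields   = ("text", "allfields")
# Any other field name is mapped through
# $self->{'buildproc'}->{'indexfieldmap'}, and the special name "metadata"
# expands to every indexed field not already listed.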


# recreate the field list from the build.cfg file. look first in building,
# then in index to find it. if there is no build.cfg, we can't do the field
# list (there is unlikely to be any index anyway).
sub read_final_field_list {
my $self = shift (@_);
$self->{'build_cfg'} = {};
my @indexfieldmap = ();
my @indexfields = ();

if (scalar(keys %{$self->{'buildproc'}->{'indexfieldmap'}}) == 0) {
# set the default mapping
$self->{'buildproc'}->set_indexfieldmap ($self->{'indexfieldmap'});
}
# we read the stuff in from the build.cfg file - if its there
$buildconfigfile = &util::filename_cat($self->{'build_dir'}, "build.cfg");

if (!-e $buildconfigfile) {
# try the index dir - but do we know where it is?? try here
$buildconfigfile = &util::filename_cat($ENV{'GSDLCOLLECTDIR'}, "index", "build.cfg");
if (!-e $buildconfigfile) {
#we cant find a config file - just ignore the field list
return;
}
}
$buildcfg = &colcfg::read_build_cfg( $buildconfigfile);
if (defined $buildcfg->{'indexfields'}) {
foreach $field (@{$buildcfg->{'indexfields'}}) {
push (@indexfields, "$field");
}
}
if (defined $buildcfg->{'indexfieldmap'}) {
foreach $field (@{$buildcfg->{'indexfieldmap'}}) {
push (@indexfieldmap, "$field");
($f, $v) = $field =~ /^(.*)->(.*)$/;
$self->{'buildproc'}->{'indexfieldmap'}->{$f} = $v;
}
}

# store references to the lists, not their element counts
$self->{'build_cfg'}->{'indexfieldmap'} = \@indexfieldmap;
$self->{'build_cfg'}->{'indexfields'} = \@indexfields;

}
sub make_auxiliary_files {
my $self = shift (@_);
my ($index);

my $build_cfg = {};
# this already includes indexfieldmap and indexfields
if (defined $self->{'build_cfg'}) {
$build_cfg = $self->{'build_cfg'};
}
#my %build_cfg = ();

my $outhandle = $self->{'outhandle'};
print $outhandle "\n*** creating auxiliary files\n" if ($self->{'verbosity'} >= 1);

# get the text directory
&util::mk_all_dir ($self->{'build_dir'});

# store the build date
$build_cfg->{'builddate'} = time;
$build_cfg->{'buildtype'} = "mgpp"; #do we need this??

# store the level info
my @indexlevels = ();
foreach $l (@{$self->{'levelorder'}}) {
push (@indexlevels, $level_map{$l});
}
$build_cfg->{'indexlevels'} = \@indexlevels;

if ($self->{'levels'}->{'section'}) {
$build_cfg->{'textlevel'} = $level_map{'section'};
} else {
$build_cfg->{'textlevel'} = $level_map{'document'};
}
# store the number of documents and number of bytes
$build_cfg->{'numdocs'} = $self->{'buildproc'}->get_num_docs();
$build_cfg->{'numbytes'} = $self->{'buildproc'}->get_num_bytes();

# store the mapping between the index names and the directory names
my @indexmap = ();
foreach $index (@{$self->{'index_mapping'}->{'indexmaporder'}}) {
push (@indexmap, "$index->$self->{'index_mapping'}->{'indexmap'}->{$index}");
}
$build_cfg->{'indexmap'} = \@indexmap;

my @subcollectionmap = ();
foreach $subcollection (@{$self->{'index_mapping'}->{'subcollectionmaporder'}}) {
push (@subcollectionmap, "$subcollection->" .
$self->{'index_mapping'}->{'subcollectionmap'}->{$subcollection});
}
$build_cfg->{'subcollectionmap'} = \@subcollectionmap if scalar (@subcollectionmap);

my @languagemap = ();
foreach $language (@{$self->{'index_mapping'}->{'languagemaporder'}}) {
push (@languagemap, "$language->" .
$self->{'index_mapping'}->{'languagemap'}->{$language});
}
$build_cfg->{'languagemap'} = \@languagemap if scalar (@languagemap);

$build_cfg->{'notbuilt'} = $self->{'notbuilt'};

# write out the build information
&cfgread::write_cfg_file("$self->{'build_dir'}/build.cfg", $build_cfg,
'^(builddate|buildtype|numdocs|numbytes|textlevel)$',
'^(indexmap|subcollectionmap|languagemap|indexfieldmap|notbuilt|indexfields|indexlevels)$');

}

sub deinit {
my $self = shift (@_);
}

sub print_stats {
my $self = shift (@_);

my $outhandle = $self->{'outhandle'};
my $indexing_text = $self->{'buildproc'}->get_indexing_text();
my $index = $self->{'buildproc'}->get_index();
my $num_bytes = $self->{'buildproc'}->get_num_bytes();
my $num_processed_bytes = $self->{'buildproc'}->get_num_processed_bytes();

if ($indexing_text) {
print $outhandle "Stats (Creating index $index)\n";
} else {
print $outhandle "Stats (Compressing text from $index)\n";
}
print $outhandle "Total bytes in collection: $num_bytes\n";
print $outhandle "Total bytes in $index: $num_processed_bytes\n";

if ($num_processed_bytes < 50 && ($indexing_text || !$self->{'no_text'})) {
print $outhandle "***************\n";
if ($indexing_text) {
print $outhandle "WARNING: There is very little or no text to process for $index\n";
} elsif (!$self->{'no_text'}) {
print $outhandle "WARNING: There is very little or no text to compress\n";
}
print $outhandle "         Was this your intention?\n";
print $outhandle "***************\n";
}

}

1;