Re: MG++ build collection fails

From Puneet Pawaia
Date Mon, 20 Jan 2003 09:39:45 +0530
Subject Re: MG++ build collection fails
Hello Katherine,

Based on your suggestions, I attempted to build my mgpp collection. I first
reduced my collect.cfg to a bare minimum, as follows:

creator puneet@pawaia.com
maintainer puneet@pawaia.com
public true
beta true
buildtype mgpp
groupsize 200

indexes text
defaultindex text

plugin ZIPPlug
plugin GMLPlug
plugin TEXTPlug
plugin HTMLPlug
plugin EMAILPlug
plugin PDFPlug
plugin RTFPlug
plugin WordPlug
plugin ArcPlug
plugin RecPlug


classify AZList -metadata Title

collectionmeta collectionname "mgpp test"
collectionmeta iconcollection ""
collectionmeta collectionextra ""
collectionmeta .text "documents"

format DocumentText "[Text][SearchMe]"

I then attempted to build the collection with just one HTML file, and this
was the result:

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

C:\Documents and Settings\Puneet>cd \program files

C:\Program Files>cd gsdl

C:\Program Files\gsdl>setup
Your environment has successfully been set up to run Greenstone.
Note that these settings will only have effect within this MS-DOS
session. You will therefore need to rerun setup.bat if you want
to run Greenstone programs from a different MS-DOS session.
C:\Program Files\gsdl>perl -S import.pl mgpptest
RecPlug: getting directory C:\Program Files\gsdl\collect\mgpptest\import
HTMLPlug: processing file1.html

*********************************************
Import Complete
*********************************************
* 1 document was considered for processing
* 1 was processed and included in the collection

C:\Program Files\gsdl>perl -S buildcol.pl mgpptest

*** creating the compressed text

collecting text statistics (mgpp_passes -T1)
ArcPlug: processing C:\Program Files\gsdl\collect\mgpptest\archives\archives.inf

WARNING - no plugin could process HASHae69.dir\doc.xml
doc.xml: no plugin could process this file
Stats (Compressing text from text)
Total bytes in collection: 0
Total bytes in text: 0
***************
WARNING: There is very little or no text to compress
Was this your intention?
***************

creating the compression dictionary

compressing the text (mgpp_passes -T2)
ArcPlug: processing C:\Program Files\gsdl\collect\mgpptest\archives\archives.inf

WARNING - no plugin could process HASHae69.dir\doc.xml
doc.xml: no plugin could process this file
Stats (Compressing text from text)
Total bytes in collection: 0
Total bytes in text: 0
***************
WARNING: There is very little or no text to compress
Was this your intention?
***************

*** building index text in subdirectory tx

creating index dictionary (mgpp_passes -I1)
ArcPlug: processing C:\Program Files\gsdl\collect\mgpptest\archives\archives.inf

WARNING - no plugin could process HASHae69.dir\doc.xml
doc.xml: no plugin could process this file
Stats (Creating index text)
Total bytes in collection: 0
Total bytes in text: 0
***************
WARNING: There is very little or no text to process for text
Was this your intention?
***************

inverting the text (mgpp_passes -I2)
ArcPlug: processing C:\Program Files\gsdl\collect\mgpptest\archives\archives.inf

WARNING - no plugin could process HASHae69.dir\doc.xml
doc.xml: no plugin could process this file
Stats (Creating index text)
Total bytes in collection: 0
Total bytes in text: 0
***************
WARNING: There is very little or no text to process for text
Was this your intention?
***************

create the weights file

creating 'on-disk' stemmed dictionary

creating stem indexes

*** creating the info database and processing associated files
ArcPlug: processing C:\Program Files\gsdl\collect\mgpptest\archives\archives.inf

WARNING - no plugin could process HASHae69.dir\doc.xml
doc.xml: no plugin could process this file
'C:\Program' is not recognized as an internal or external command,
operable program or batch file.

*** creating auxiliary files

C:\Program Files\gsdl>perl -S buildcol.pl mgpptest > test.txt

*** creating the compressed text

collecting text statistics (mgpp_passes -T1)
ArcPlug: processing C:\Program Files\gsdl\collect\mgpptest\archives\archives.inf

WARNING - no plugin could process HASHae69.dir\doc.xml
doc.xml: no plugin could process this file
Stats (Compressing text from text)
Total bytes in collection: 0
Total bytes in text: 0
***************
WARNING: There is very little or no text to compress
Was this your intention?
***************

creating the compression dictionary

compressing the text (mgpp_passes -T2)
ArcPlug: processing C:\Program Files\gsdl\collect\mgpptest\archives\archives.inf

WARNING - no plugin could process HASHae69.dir\doc.xml
doc.xml: no plugin could process this file
Stats (Compressing text from text)
Total bytes in collection: 0
Total bytes in text: 0
***************
WARNING: There is very little or no text to compress
Was this your intention?
***************

*** building index text in subdirectory tx

creating index dictionary (mgpp_passes -I1)
ArcPlug: processing C:\Program Files\gsdl\collect\mgpptest\archives\archives.inf

WARNING - no plugin could process HASHae69.dir\doc.xml
doc.xml: no plugin could process this file
Stats (Creating index text)
Total bytes in collection: 0
Total bytes in text: 0
***************
WARNING: There is very little or no text to process for text
Was this your intention?
***************

inverting the text (mgpp_passes -I2)
ArcPlug: processing C:\Program Files\gsdl\collect\mgpptest\archives\archives.inf

WARNING - no plugin could process HASHae69.dir\doc.xml
doc.xml: no plugin could process this file
Stats (Creating index text)
Total bytes in collection: 0
Total bytes in text: 0
***************
WARNING: There is very little or no text to process for text
Was this your intention?
***************

create the weights file

creating 'on-disk' stemmed dictionary

creating stem indexes

*** creating the info database and processing associated files
ArcPlug: processing C:\Program Files\gsdl\collect\mgpptest\archives\archives.inf

WARNING - no plugin could process HASHae69.dir\doc.xml
doc.xml: no plugin could process this file
'C:\Program' is not recognized as an internal or external command,
operable program or batch file.

*** creating auxiliary files

C:\Program Files\gsdl>cd \program files

Going through the above, I concluded that there is probably a bug which
prevents buildcol.pl from handling file or directory names that contain
spaces (note the "'C:\Program' is not recognized" error).
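
The "'C:\Program' is not recognized" message is what a Windows shell prints when a command line is built from an unquoted path containing a space. The following is only a minimal Python sketch of that failure mode and of the usual workaround (quoting). It is not Greenstone's code; the helper names and the tool path are hypothetical and chosen to mirror the session above.

```python
def quote_for_cmd(arg):
    """Wrap an argument in double quotes if it contains whitespace."""
    already_quoted = arg.startswith('"') and arg.endswith('"')
    if any(ch.isspace() for ch in arg) and not already_quoted:
        return '"' + arg + '"'
    return arg


def build_command(program, *args):
    """Join a program path and its arguments into one command-line string."""
    return " ".join(quote_for_cmd(part) for part in (program, *args))


# Hypothetical tool path under an install directory that contains a space.
program = r"C:\Program Files\gsdl\bin\windows\some_tool.exe"

# Naive concatenation: everything up to the first space is taken as the
# command name, which is the 'C:\Program' seen in the error message.
naive = program + " -d output"
print(naive.split(" ")[0])

# Quoting keeps the whole path together as a single token.
safe = build_command(program, "-d", "output")
print(safe)
```

Installing to a path without spaces, as tried next, sidesteps the problem without touching any quoting.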

I then copied the gsdl directory and all its subdirectories to the root of
the drive, modified setup.bat to set the home directory to this new
location, and attempted to build the collection again.

This is the result I got.

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

C:\Documents and Settings\Puneet>cd \gsdl

C:\gsdl>setup
Your environment has successfully been set up to run Greenstone.
Note that these settings will only have effect within this MS-DOS
session. You will therefore need to rerun setup.bat if you want
to run Greenstone programs from a different MS-DOS session.
C:\gsdl>perl -S import.pl mgpptest
RecPlug: getting directory C:\gsdl\collect\mgpptest\import
HTMLPlug: processing file1.html

*********************************************
Import Complete
*********************************************
* 1 document was considered for processing
* 1 was processed and included in the collection

C:\gsdl>perl -S buildcol.pl mgpptest

*** creating the compressed text

collecting text statistics (mgpp_passes -T1)
ArcPlug: processing C:\gsdl\collect\mgpptest\archives\archives.inf
WARNING - no plugin could process HASHae69.dir\doc.xml
doc.xml: no plugin could process this file
Stats (Compressing text from text)
Total bytes in collection: 0
Total bytes in text: 0
***************
WARNING: There is very little or no text to compress
Was this your intention?
***************

creating the compression dictionary

compressing the text (mgpp_passes -T2)
ArcPlug: processing C:\gsdl\collect\mgpptest\archives\archives.inf
WARNING - no plugin could process HASHae69.dir\doc.xml
doc.xml: no plugin could process this file
Stats (Compressing text from text)
Total bytes in collection: 0
Total bytes in text: 0
***************
WARNING: There is very little or no text to compress
Was this your intention?
***************

*** building index text in subdirectory tx

creating index dictionary (mgpp_passes -I1)
ArcPlug: processing C:\gsdl\collect\mgpptest\archives\archives.inf
WARNING - no plugin could process HASHae69.dir\doc.xml
doc.xml: no plugin could process this file
Stats (Creating index text)
Total bytes in collection: 0
Total bytes in text: 0
***************
WARNING: There is very little or no text to process for text
Was this your intention?
***************

inverting the text (mgpp_passes -I2)
ArcPlug: processing C:\gsdl\collect\mgpptest\archives\archives.inf
WARNING - no plugin could process HASHae69.dir\doc.xml
doc.xml: no plugin could process this file
Stats (Creating index text)
Total bytes in collection: 0
Total bytes in text: 0
***************
WARNING: There is very little or no text to process for text
Was this your intention?
***************

create the weights file

creating 'on-disk' stemmed dictionary

creating stem indexes

*** creating the info database and processing associated files
ArcPlug: processing C:\gsdl\collect\mgpptest\archives\archives.inf
WARNING - no plugin could process HASHae69.dir\doc.xml
doc.xml: no plugin could process this file

*** creating auxiliary files

C:\gsdl>

As you can see, the earlier "not recognized" error is gone, but the
collection still does not build: every pass still reports that no plugin
could process doc.xml.
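
One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: the minimal collect.cfg above lists GMLPlug but not GAPlug, whereas the demo configuration quoted further down lists GAPlug, and the demo build log shows GAPlug processing doc.xml files. The Python sketch below simply compares the plugin lists of the two configurations as they appear in this message. The function name is hypothetical, and the parsing assumes nothing beyond the "plugin <Name> [options]" line shape seen here.

```python
def listed_plugins(cfg_text):
    """Return plugin names from 'plugin <Name> [options]' lines."""
    names = []
    for line in cfg_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "plugin":
            names.append(parts[1])
    return names


# Plugin lines copied from the minimal collect.cfg in this message.
minimal_cfg = """\
plugin ZIPPlug
plugin GMLPlug
plugin TEXTPlug
plugin HTMLPlug
plugin EMAILPlug
plugin PDFPlug
plugin RTFPlug
plugin WordPlug
plugin ArcPlug
plugin RecPlug
"""

# Plugin lines copied from the quoted demo collect.cfg.
demo_cfg = """\
plugin GAPlug
plugin HTMLPlug -description_tags -input_encoding iso_8859_1 -cover_image
plugin ArcPlug
plugin RecPlug -use_metadata_files
"""

# Plugins the demo configuration uses that the minimal one lacks.
minimal = listed_plugins(minimal_cfg)
missing = [name for name in listed_plugins(demo_cfg) if name not in minimal]
print(missing)
```

If GAPlug turns out to be the plugin responsible for doc.xml in this release, adding it to the minimal collect.cfg would be a cheap experiment before digging further.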

I then installed an earlier version of gsdl, 2.36, at c:\greenstone. When
I attempted to build the same collection with this version, it succeeded. I
then added my sample of 500 HTML files and built the collection successfully.

Hope this helps.

Regards
Puneet

At 04:04 AM 1/14/2003, you wrote:
>hi
>
>the collector doesn't give very informative error messages so it's hard to
>see what is happening. If you can build the demo collection ok with mg but
>not with mgpp it's possible that the mgpp indexing
>programs aren't available or working.
>please try to build the collection using the command line programs
>
>(in brief:
>
>cd to the gsdl directory
>setup.bat
>perl -S import.pl demo
>perl -S buildcol.pl demo
>then copy building to index
>see the developers guide for more details )
>
>this will give much better error messages and you should be able to work
>out what the problem is from these.
>
>hope this helps
>Katherine
>
>Puneet Pawaia wrote:
>
> > Hi All,
> >
> > The configuration that I am using is as follows
> > Windows XP Home SP1 (administrative rights login), Greenstone 2.38, Active
> > Perl 5.6.1 build 631, IE6 SP1
> >
> > I attempted to build the demo collection using mg++ with the configuration
> > provided with the compilation after adding the buildtype information and
> > changing the indexes information
> >
> > creator greenstone@cs.waikato.ac.nz
> > maintainer greenstone@cs.waikato.ac.nz
> > public true
> > buildtype mgpp <-- changed
> > indexes text <-- changed
> >
> > ;indexes section:text section:Title document:text <-- commented out
> > ;defaultindex section:text <-- commented out
> >
> > plugin GAPlug
> > plugin HTMLPlug -description_tags -input_encoding iso_8859_1
> > -cover_image
> > plugin ArcPlug
> > plugin RecPlug -use_metadata_files
> >
> > classify Hierarchy -hfile sub.txt -metadata Subject -sort Title
> > classify Hierarchy -hfile AZList.txt -metadata AZList -sort Title
> > -buttonname Title -hlist_at_top
> > classify Hierarchy -hfile org.txt -metadata Organization -sort Title
> > #classify Hierarchy -hfile keyword.txt -metadata Keyword -sort Title
> > -buttonname Howto
> > classify List -metadata Keyword -buttonname Howto
> >
> > format SearchVList
> > "<td valign=top>[link][icon][/link]</td><td>{If}{[parent(All':
> > '):Title],[parent(All': '):Title]:}[link][Title][/link]</td>"
> > format VList
> > "<td valign=top>[link][icon][/link]</td><td
> > valign=top>[highlight]{Or}{[Title],Untitled}[/highlight]<i><small>{If}{[Date],<br>publication date: [Date]}{If}{[NumPages],<br>no. of pages: [NumPages]}{If}{[Source],<br>source ref: [Source]}</small></i></td>"
> >
> > format CL4VList "<br>[link][Keyword][/link]"
> >
> > format DocumentText "<h3>[Title]</h3>\n\n<p>[Text]"
> > format DocumentImages true
> > format DocumentButtons "Expand Text|Expand Contents|Detach|Highlight"
> > format HelpBookDocs true
> >
> > collectionmeta collectionname "greenstone demo"
> > collectionmeta collectionextra "This is a demonstration collection for
> > the Greenstone digital library software. It contains a small subset (11
> > books) of the Humanity Development Library"
> > collectionmeta collectionextra [l=fr] "C'est une collection pour
> > démonstration du logiciel Greenstone. Elle contient une petite partie du
> > projet de bibliothèques humanitaires et de développement (11 livres)."
> > collectionmeta iconcollectionsmall
> > "/gsdl/collect/demo/images/demosm.gif"
> > collectionmeta iconcollection "/gsdl/collect/demo/images/demo.gif"
> > collectionmeta .section:Title "section titles"
> > collectionmeta .section:Title [l=fr] "titres des sections"
> > collectionmeta .document:text "entire books"
> > collectionmeta .document:text [l=fr] "livres entiers"
> > collectionmeta .section:text "chapters"
> > collectionmeta .section:text [l=fr] "chapitres"
> >
> > This compilation resulted in the following errors
> >
> > The collection could not be built (buildcol.pl failed).
> > The build log contains the following information:
> > GAPLug: processing HASHf6dc.dir\doc.xml
> > GAPLug: processing HASH0163.dir\doc.xml
> > GAPLug: processing HASHebf5.dir\doc.xml
> > GAPLug: processing HASH0182.dir\doc.xml
> > *** creating auxiliary files
> > build: ERROR: buildcol.pl failed
> > Please restart the collector and try again.
> >
> > I would like to bring to your notice here that when I used my own sample of
> > 500 files for building a collection using mg++, I got 4 file errors at that
> > time too. When I reduced the number of files to 100, I still got 4 file
> > errors and when I further reduced the files to 10, again 4 file errors were
> > reported. Could this be significant ?
> >
> > Regards
> > Puneet
> >