building mgpp collection from non-ascii

From: r c
Date: Wed, 19 Feb 2003 15:59:33 +0100 (CET)
Subject: building mgpp collection from non-ascii

Hi all,

I’m trying to build an MGPP collection from pages that contain
non-ASCII characters (encoding Windows-1250). Import and building
both went smoothly, but the displayed results contain "cabalistic
characters" (as someone pointed out before).

I’m sure that:
- the import was OK (I made a second, mg collection from the same
  archive files);
- the encoding preferences for the receptionist were set (Windows
  1250, UTF-8 later on; the mg collection is displayed correctly).

The metadata is the only thing I can see correctly, i.e. when I
get search results, the First200 metadata may be displayed, and it
contains all the non-ASCII characters (I read that mgpp doesn’t
compress metadata by default). I can’t search for strings
that contain non-ASCII characters at all (zero hits).
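
To make the symptom concrete, here is a minimal Python sketch of how
text stored in one encoding looks when it is read back in another.
It does not touch Greenstone at all: the sample string and the
helper name `misread` are assumptions made up for illustration, and
the sketch only shows the general kind of mismatch that produces
garbled characters, not where the mismatch happens in mgpp.

```python
# Illustration only: nothing here calls Greenstone or mgpp.
# SAMPLE is an arbitrary Czech test string (an assumption, not from the collection).
SAMPLE = "Příliš žluťoučký kůň"


def misread(text, stored_as, shown_as):
    """Encode text in one charset, then decode the same bytes in another.

    Bytes that are invalid in the target charset become U+FFFD
    instead of raising an error.
    """
    return text.encode(stored_as).decode(shown_as, errors="replace")


if __name__ == "__main__":
    pairs = [("cp1250", "cp1250"), ("cp1250", "utf-8"), ("utf-8", "cp1250")]
    for stored, shown in pairs:
        print(f"{stored:>6} read as {shown:<6}: {misread(SAMPLE, stored, shown)}")

    # Windows-1250 is a single-byte encoding; UTF-8 needs two bytes
    # for each accented Czech letter, so the byte counts differ.
    print("bytes in cp1250:", len(SAMPLE.encode("cp1250")))
    print("bytes in utf-8: ", len(SAMPLE.encode("utf-8")))
```

Only the first pair round-trips cleanly. A search term typed in one
encoding and compared against an index built in another would likewise
fail to match, which is consistent with (but does not prove) the zero
hits described above.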

Has anyone built an *mgpp* collection from non-ASCII encoded
files? Please let me know what I am doing wrong. Here are both
collect.cfg files and some build messages (the interesting part
is the different number of bytes reported for the two
collections; don't be put off by the amount of text).

Regards
Roman Chyla


##### mg collect.cfg######
indexes document:text,Title
defaultindex document:text,Title

#plugin ZIPPlug
plugin GAPlug
plugin TEXTPlug
plugin HTMLPlug -first 200 -input_encoding windows_1250 -default_language cs -extract_email -keep_head -rename_assoc_files -file_is_url
plugin EMAILPlug
plugin PDFPlug
plugin RTFPlug
#plugin WordPlug
plugin PSPlug
plugin ArcPlug
plugin RecPlug


classify AZList -metadata Title
classify AZList -metadata URL

collectionmeta collectionname "ikaros"
collectionmeta iconcollection ""
collectionmeta collectionextra ""
collectionmeta .document:text,Title "text"
##### mgpp collect.cfg #######
buildtype mgpp
indexes text,Title

defaultindex text,Title

#plugin ZIPPlug
plugin GAPlug
plugin TEXTPlug
plugin HTMLPlug -first 200 -input_encoding windows_1250 -default_language cs -extract_email -keep_head -rename_assoc_files -file_is_url
plugin EMAILPlug
plugin PDFPlug
plugin RTFPlug
#plugin WordPlug
plugin PSPlug
plugin ArcPlug
plugin RecPlug


classify AZList -metadata Title
classify AZList -metadata URL

collectionmeta collectionname "ik_mgpp"
collectionmeta iconcollection ""
collectionmeta collectionextra ""

collectionmeta .text,Title "text a Title"
######################
Building mg collection.
######################

C:\gsdl>perl -S buildcol.pl ikaros

*** creating the compressed text

collecting text statistics
ArcPlug: processing C:\gsdl\collect\ikaros\archives\archives.inf
................
Stats (Compressing text from section:text)
Total bytes in collection: 470192
Total bytes in section:text: 470192

creating the compression dictionary

compressing the text
.............
Stats (Compressing text from section:text)
Total bytes in collection: 470192
Total bytes in section:text: 470192

*** building index document:text,Title in subdirectory dtt

creating index dictionary
.................
Stats (Creating index document:text,Title)
Total bytes in collection: 470192
Total bytes in document:text,Title: 473023

inverting the text
ArcPlug: processing C:\gsdl\collect\ikaros\archives\archives.inf
.............
Stats (Creating index document:text,Title)
Total bytes in collection: 470192
Total bytes in document:text,Title: 473023
ivf.pass2 : M

create the weights file
...
L = 7.988359
U = 305.014160
B = 1.014330

creating 'on-disk' stemmed dictionary

creating stem indexes

*** creating the info database and processing associated files
.
.
*** creating auxiliary files

C:\gsdl>
#####################
Here is mgpp building
#####################

C:\gsdl>perl -S buildcol.pl ik_mgpp

*** creating the compressed text

collecting text statistics (mgpp_passes -T1)
ArcPlug: processing C:\gsdl\collect\ik_mgpp\archives\archives.inf
...........
Stats (Compressing text from text)
Total bytes in collection: 470192
Total bytes in text: 470192

creating the compression dictionary

compressing the text (mgpp_passes -T2)
...............
Stats (Compressing text from text)
Total bytes in collection: 470192
Total bytes in text: 470192

*** building index text,Title in subdirectory tt

creating index dictionary (mgpp_passes -I1)
...............
Stats (Creating index text,Title)
Total bytes in collection: 470192
Total bytes in text,Title: 304245

inverting the text (mgpp_passes -I2)
.................
Stats (Creating index text,Title)
Total bytes in collection: 470192
Total bytes in text,Title: 304245
create the weights file

creating 'on-disk' stemmed dictionary

creating stem indexes

*** creating the info database and processing associated files
...............


*** creating auxiliary files

C:\gsdl>
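
The difference in reported bytes can be read off the two logs with a
small script. This is only a sketch: the function name `byte_totals`
and the regular expression are assumptions written for this post, and
the two sample strings are copied from the "Creating index" stats
above. It compares the numbers; it does not explain why mgpp reports
fewer bytes for the index.

```python
import re

# Matches lines such as "Total bytes in document:text,Title: 473023".
TOTAL_LINE = re.compile(r"^Total bytes in (.+?): (\d+)\s*$")


def byte_totals(log_text):
    """Return {label: byte count} for every 'Total bytes in ...' line."""
    totals = {}
    for line in log_text.splitlines():
        match = TOTAL_LINE.match(line.strip())
        if match:
            totals[match.group(1)] = int(match.group(2))
    return totals


# Index stats copied from the two buildcol.pl runs quoted above.
MG_LOG = """Stats (Creating index document:text,Title)
Total bytes in collection: 470192
Total bytes in document:text,Title: 473023
"""
MGPP_LOG = """Stats (Creating index text,Title)
Total bytes in collection: 470192
Total bytes in text,Title: 304245
"""

if __name__ == "__main__":
    mg = byte_totals(MG_LOG)
    mgpp = byte_totals(MGPP_LOG)
    diff = mg["document:text,Title"] - mgpp["text,Title"]
    print("mg index bytes:  ", mg["document:text,Title"])
    print("mgpp index bytes:", mgpp["text,Title"])
    print("difference:      ", diff)
```

Both runs report the same 470192 bytes for the collection text; only
the index totals differ.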