On Wed, Mar 26, 2003 at 11:27:42AM +0100, Roman Chyla wrote:
> this mode I can not use operator NEAR - I tried advanced mode too
> (near not supported yet?)
> - form searching/simple - only searching with short vowels is
>   possible - I may search '?ena' but not '?ens' --> it results
>   as '0 counts for word ?ensk'
> - form searching/advanced - this is not responding at all
> - output from PDFPlug mangled - but if I do it manually then text
>   is extracted right (-enc utf8) - are you sure that output from
>   pdftohtml is read using utf8? - I was successful only when
>   collect.cfg contained plugin HTMLPlug -input_encoding utf8
>   (with -input_encoding windows_1250 long vowels are missed)
Hi. It's hard to tell exactly what the problem is here. mgpp
handles unicode ok - have a look at one of our demo collections:
One of the authors is named "Čerāns"
(C-caron, e, r, a-macron, n, s), or do a search on the title for
"Efficient learning good examples" (without the speech marks), and
you can search on unicode words.
If the output from pdftohtml is garbage, then the PDF file was
created using binary bitmap fonts and no textual data is extractable
from it. The input_encoding argument tells greenstone what encoding
is used by the input files, so only put utf-8 if your input files
are in utf-8.
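
To illustrate the point about input_encoding, here is a minimal Python
sketch. It is plain Python, not Greenstone code, and the sample word and
the helper name decode_as are assumptions made up for the example. It
shows that the same bytes only read back correctly when the declared
encoding matches the one the file was really written in.

```python
# Illustrative sketch only: plain Python, not part of Greenstone.
# The sample word and the helper name decode_as are hypothetical.

def decode_as(raw: bytes, encoding: str) -> str:
    """Decode raw bytes with the given encoding, marking undecodable bytes."""
    return raw.decode(encoding, errors="replace")

# "zena" with a z-caron, written as an escape so this source stays ASCII.
word = "\u017eena"

utf8_bytes = word.encode("utf-8")      # what a UTF-8 input file would hold
cp1250_bytes = word.encode("cp1250")   # what a windows-1250 file would hold

# Declared encoding matches the real one: the text survives.
print(decode_as(utf8_bytes, "utf-8") == word)
print(decode_as(cp1250_bytes, "cp1250") == word)

# Declared encoding does not match: the accented letter is mangled or lost.
print(decode_as(utf8_bytes, "cp1250") == word)
print(decode_as(cp1250_bytes, "utf-8") == word)
```

The first two comparisons hold and the last two do not, which is why
input_encoding should name the encoding the input files are really in.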
It could possibly be a problem with your web browser settings,
as greenstone expects that any query fields are encoded using the
same encoding greenstone is using. So if greenstone is using utf-8
(the default), it expects that anything in the query field is also
utf-8. You could try using iso-8859-2 for Central Europe, if Windows
and your browser are defaulting to that... I'm not familiar with
Windows 98's internationalisation support.
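
A rough Python sketch of the kind of query-field mismatch described
above. It is a simplified model of a form submission, not how
greenstone or any particular browser is implemented; the function names
browser_submit and server_read and the sample query are assumptions for
illustration only.

```python
from urllib.parse import quote, unquote

# Simplified, hypothetical model of a form submission. Neither function
# is taken from greenstone or from a real browser.

def browser_submit(query: str, page_encoding: str) -> str:
    """Percent-encode a query using the encoding the browser is set to."""
    return quote(query, encoding=page_encoding)

def server_read(wire: str, expected_encoding: str) -> str:
    """Decode a percent-encoded query, replacing bytes that do not fit."""
    return unquote(wire, encoding=expected_encoding, errors="replace")

# "zena" with a z-caron, written as an escape so this source stays ASCII.
query = "\u017eena"

wire_utf8 = browser_submit(query, "utf-8")
wire_latin2 = browser_submit(query, "iso-8859-2")

# Browser and server agree on utf-8: the query arrives intact.
print(server_read(wire_utf8, "utf-8") == query)

# Browser sends iso-8859-2 but the server expects utf-8: the accented
# letter turns into a replacement character, so the word cannot match.
print(server_read(wire_latin2, "utf-8") == query)

# Both sides agree on iso-8859-2: the query arrives intact again.
print(server_read(wire_latin2, "iso-8859-2") == query)
```

In this model the query only survives when both sides use the same
encoding, whichever one it is.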
> Thank you for getting here. I have installation disks of Mandrake
> or I may use Windows XP; will this help me to create
> full-mgpp-searchable and windows-exportable collection quickly?
Possibly, if the problem is in fact a browser issue. But greenstone
itself should behave the same on all platforms.
On another topic, I'm looking into why the Czech images were
corrupted - it is some interaction between gimp and perl that makes
gimp use only single-byte encodings.