[greenstone-users] mgpp_passes bit buffer overrun problem : solution hint

From: Peter
Date: Fri Nov 30 00:05:07 2007
Subject: [greenstone-users] mgpp_passes bit buffer overrun problem : solution hint
Hi

There have been various postings on this list concerning a nasty error
when building Greenstone 2 collections:

mgpp_passes: bit buffer overrun

I had this problem and now it's gone. I have not yet fully investigated
why, but in my case it is clearly connected to improper filename encoding.

I am running:
Greenstone 2.74 on a grml distribution (grml.org, Debian based) inside
a VMware Workstation 6 (vmware.com) guest on a Windows 2000 host system,
using Samba 3.

I did the following:

1) Within VMware Workstation (a virtualizer), I tried to copy the
import data for the collection that I could not build to the VMware
shared folder on the Windows host. (Do not worry if you do not know
VMware; I only give the details for further investigation, and the
problem is not related to VMware.) The Linux copy command (cp) reported
errors: many files could not be copied ("files or directory do not
exist" messages). However, the Greenstone import.pl script did NOT
report errors.

2) I then noticed that the files to import opened as ISO-8859-15
in various editors. This is a Latin-1 based European encoding that
includes the euro sign. However, my Linux box is set to UTF-8, so I ran
tidy (see w3c.org for tidy) on the import data, converting everything
to UTF-8.

3) I then noticed that Samba (for Windows-Linux file sharing) defaults
to ISO-8859-15 on my installation (grml, a Debian based distribution)
and that I had imported the data using Samba.

4) I then configured Samba to use UTF-8 and reimported the data in the
Linux guest for collection building.

Everything works fine now!
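
The filename side of the steps above can be checked before building. The
following is only a minimal sketch in Python, not part of Greenstone: the
function names are made up for illustration, and it assumes that a name
which does not decode as UTF-8 is the kind of name that caused trouble
here. It walks a directory using byte paths, so that badly encoded names
are reported instead of being silently re-interpreted.

```python
import os


def is_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def find_non_utf8_names(root: bytes) -> list:
    """Walk root (given as bytes) and collect paths whose last
    component is not valid UTF-8 (hypothetical helper)."""
    bad = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_utf8(name):
                bad.append(os.path.join(dirpath, name))
    return bad
```

For example, find_non_utf8_names(b"import") would list the suspicious
entries below a directory called import (the directory name is an
assumption; use the import folder of your own collection). An empty list
means all names are valid UTF-8, which also covers plain ASCII names.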

I will investigate this further, but for people hitting the bit buffer
overrun, the hint is: check the filename encoding (and the data
encoding), especially if you downloaded HTML from the web (for example
with wget). Also use tidy outside Greenstone (that is, as a standalone
program) in order to control the cleaning (and recoding) process
_before_ importing HTML files. You will be amazed how dirty most HTML is.

Hope that helps
Peter