Fw: Greenstone Word/PDF/PS import problems

From Stefan Boddie
Date Tue, 28 Jan 2003 20:57:38 +1300
Subject Fw: Greenstone Word/PDF/PS import problems

This is a follow-up to my previous long, boring message about Word and PDF
encoding problems with Greenstone. It is also intended to help Cao Minh Kiem,
who has had some problems importing Vietnamese documents into Greenstone.

Cao Minh Kiem: I'm not sure if you are on the Greenstone mailing list, so if
not, see the message I wrote to Leonid Kalinichenko earlier (attached below).
Many of the problems he was having importing Russian documents were also
affecting your Vietnamese documents. Take particular note of the bug caused
by wvWare's dependence on cygwin, as fixing it should allow your MS-Word
documents to import correctly.

Anyway, my findings when testing the sample Vietnamese documents Cao Minh
Kiem supplied were:

1. As already mentioned, the sample Word document imported fine once cygwin
was installed. I'm now fairly confident that our Word plugin can handle most
languages and encodings, and would be interested to hear if anyone has any
docs that it fails on.

2. The sample Vietnamese PDF document was something of a failure,
unfortunately. The pdftohtml converter doesn't manage to convert it to UTF-8
successfully. All I can suggest is that someone might like to rummage in the
pdftohtml source code and attempt to add support for the Vietnamese
character set. Alternatively, maybe it'd be possible to persuade the
pdftohtml developers to add this support.

3. The sample Vietnamese HTML document wasn't working because it was encoded
with HTML entities rather than raw characters. That is, it was made up of
entities looking something like "&#7897;" (which displays as "ộ"). There was
a bug in gsdl-2.38 preventing these entities from being converted correctly.
The bug was fixed in the CVS tree some time ago but you can patch a gsdl-2.38
installation by downloading
http://www.greenstone.org/tmp/gsdl-patch-htmlplug.zip and installing the
files it contains as follows:

* HTMLPlug.pm --> gsdl\perllib\plugins\HTMLPlug.pm
* BasPlug.pm --> gsdl\perllib\plugins\BasPlug.pm
* ghtml.pm --> gsdl\perllib\ghtml.pm
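
For anyone wondering what "converting entities" amounts to, here is a minimal
Python sketch. It is only an illustration and not the Perl code in
HTMLPlug.pm or ghtml.pm; the function name decode_numeric_entities and the
sample string are made up for the example.

```python
import html
import re


def decode_numeric_entities(text: str) -> str:
    """Replace decimal (&#7897;) and hex (&#x1ED9;) character references
    with the characters they name. Named entities are left untouched."""

    def replace(match):
        body = match.group(1)
        # A leading x or X marks a hexadecimal code point.
        code = int(body[1:], 16) if body[0] in "xX" else int(body)
        return chr(code)

    return re.sub(r"&#([xX][0-9a-fA-F]+|[0-9]+);", replace, text)


sample = "m&#7897;t"  # hypothetical input: one Vietnamese word written with an entity
decoded = decode_numeric_entities(sample)

# The standard library's html.unescape gives the same result for this input.
assert decoded == html.unescape(sample)

print(decoded)                  # the word with the raw character in place
print(decoded.encode("utf-8"))  # the bytes a UTF-8 encoded page would contain
```

Once the entities have been turned into real characters like this, the text
can be stored and indexed as UTF-8 in the same way as any other document.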

All these changes and bug fixes will of course be included in the next
Greenstone release. If anyone has similar problems with their documents, I'd
be interested in hearing about them. It'd be nice to get as many of these
encoding-related problems fixed as possible before that release.


----- Original Message -----
From: "Stefan Boddie" <sjboddie@cs.waikato.ac.nz>
To: "Leonid Kalinichenko" <leonidk@synth.ipi.ac.ru>
Cc: <greenstone@tripath.colosys.net>
Sent: Tuesday, January 28, 2003 5:39 PM
Subject: Greenstone Word/PDF/PS import problems

> Hi Leonid,
> I've copied this message to the Greenstone mailing list in the hope that
> it'll help some others out too. I hope you don't mind.
> First, a bit of background for everyone else's benefit. Leonid has been
> trying to import Russian language Word, PDF, and Postscript documents using
> gsdl-2.38. All were failing miserably for various reasons. Many others have
> reported similar problems with Word and PDF docs, particularly (but not
> exclusively) those containing non-ascii characters.
> Those interested in making their Greenstone installation import these
> documents properly (or at least better than at present) should read on.
> Everyone else, please stop now as this is going to be long.
> A couple of general comments before going into detail about the import
> problems.
> 1. If you're having trouble getting a collection to build the way you think
> it should, please try building it from the command line. Using the
> web interface is fine for simple collections but it does tend to hide error
> messages and other useful stuff from you. Using the command line is more
> flexible and more likely to give you useful information when things go
> wrong. There's a fairly extensive section on how to build a collection from
> the command line in the Greenstone Developer's Guide.
> 2. The default character set for output from gsdl-2.38 is iso-8859-1
> (western european/latin1). This means that even if your Russian or other
> non-latin document is imported and built successfully it will be displayed
> with all the non-latin characters turned into question marks. To fix this
> you just need to go to the preferences page and select utf-8 or some other
> appropriate encoding from the "Encoding" menu. Alternatively, if you want
> Greenstone to use utf-8 as its default encoding you can add the following
> line to the bottom of your gsdl\etc\main.cfg file:
> cgiarg shortname=w argdefault=utf-8
> Note that gsdl-2.39 (due out very soon) will use utf-8 as its default.
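
The question-mark effect described in point 2 is easy to reproduce outside
Greenstone. The following is a minimal Python sketch of the general
mechanism, on the assumption that Greenstone's output conversion behaves
like an ordinary lossy re-encoding; the sample word is made up.

```python
text = "Привет"  # sample Russian text (six Cyrillic letters)

# iso-8859-1 has no Cyrillic letters, so a lossy conversion has to
# substitute something; Python's "replace" handler substitutes "?".
latin1 = text.encode("iso-8859-1", errors="replace")

# utf-8 can represent every character, so nothing is lost.
utf8 = text.encode("utf-8")

print(latin1)                # b'??????'
print(utf8.decode("utf-8"))  # the original text, round-tripped intact
```

Selecting utf-8 on the preferences page, or setting it as the default in
main.cfg, avoids the lossy step.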
> MS-Word files:
> ----------------
> Greenstone processes MS-Word files by converting them to HTML before
> importing. The conversion process is done with an open source conversion
> utility called wvWare. GSDL-2.38 includes version 0.7.1 of wvWare which
> works perfectly well at converting non-latin character sets. There are a
> couple of problems with gsdl-2.38 itself though.
> 1. Unfortunately the wvWare.exe binary we distributed with the Windows
> version of gsdl-2.38 relies on cygwin (cygwin is a Unix-like environment
> for Windows), which prevents it from working if cygwin isn't installed. The
> easiest way around this is to download and install cygwin from
> http://www.cygwin.com.
> 2. You might notice some broken images appearing in the HTML
> Greenstone produces after converting and importing your MS-Word file. This
> is most likely because the image appearing within the Word file itself is a
> WMF image. Greenstone doesn't include support for extracting these images so
> they end up broken. Since a popular way to use Greenstone is to use the
> extracted text for indexing but retain the source Word document for display,
> I haven't considered this a huge problem. The wvWare converter itself is,
> however, quite capable of extracting these images. To do so requires libwmf
> and various other components, the inclusion of which would make Greenstone
> even bigger and slower to download than it is already. Those requiring this
> feature can download the required components themselves, however. The latest
> versions of everything required can be found at http://www.wvware.com. You
> simply need to install the new binary files into your gsdl\bin\windows (or
> gsdl/bin/linux or whatever) directory, replacing the wvWare binary that's
> already there if required. For Windows users there are pre-compiled versions
> of wvWare.exe, libwmf, and everything else you need at
> http://sourceforge.net/projects/gnuwin32.
> PDF Files:
> -----------
> Greenstone processes PDF files in the same way as Word docs. That is, it
> first converts them to HTML then imports the HTML. The conversion process
> for PDF is done using another open source tool called pdftohtml. The version
> of pdftohtml included with gsdl-2.38 is based on version 0.22, with some
> code added in from version 0.31 (why this was done is a long story). This
> version of pdftohtml does not do well with documents containing non-latin
> characters. The good news is that pdftohtml is once again under active
> development and the latest versions are much improved.
> To patch your gsdl-2.38 installation to do the right thing with PDFs you
> should download http://www.greenstone.org/gsdl-patch-pdfplug.zip, unzip it
> and install the files it contains as follows:
> * pdftohtml.exe --> gsdl\bin\windows
> (linux or other *nix users should get the latest version of pdftohtml from
> http://pdftohtml.sourceforge.net and install the pdftohtml binary in their
> gsdl/bin/$GSDLOS directory).
> * gsConvert.pl --> gsdl\bin\script
> * pdftohtml.pl --> gsdl\bin\script
> * PDFPlug.pm --> gsdl\perllib\plugins
> * ConvertToPlug.pm --> gsdl\perllib\plugins
> Once the patch is installed the Greenstone PDF plugin should handle PDF
> files containing non-latin characters.
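
If you would rather script the copying than do it by hand, something along
these lines might do. It is a rough Python sketch and not part of the patch:
the function name install_patch, the ".orig" backup suffix, and the
assumption that the zip unpacks into one flat directory are all mine.

```python
import shutil
from pathlib import Path

# Destination (relative to the Greenstone home directory) for each file
# named in the list above. pdftohtml.exe only applies to Windows installs.
PATCH_TARGETS = {
    "pdftohtml.exe": Path("bin") / "windows",
    "gsConvert.pl": Path("bin") / "script",
    "pdftohtml.pl": Path("bin") / "script",
    "PDFPlug.pm": Path("perllib") / "plugins",
    "ConvertToPlug.pm": Path("perllib") / "plugins",
}


def install_patch(patch_dir, gsdl_home, backup_suffix=".orig"):
    """Copy the patch files found in patch_dir into gsdl_home.

    Any file about to be overwritten is first copied alongside itself
    with backup_suffix appended. Returns the list of installed paths.
    """
    patch_dir, gsdl_home = Path(patch_dir), Path(gsdl_home)
    installed = []
    for name, rel_dir in PATCH_TARGETS.items():
        src = patch_dir / name
        if not src.is_file():
            continue  # e.g. no pdftohtml.exe in the unzipped directory
        dest = gsdl_home / rel_dir / name
        dest.parent.mkdir(parents=True, exist_ok=True)
        if dest.exists():
            shutil.copy2(dest, dest.with_name(dest.name + backup_suffix))
        shutil.copy2(src, dest)
        installed.append(dest)
    return installed


# Example call (both paths are hypothetical):
# install_patch(r"C:\temp\gsdl-patch-pdfplug", r"C:\gsdl")
```

Keeping a backup of each replaced file makes it easy to go back to the
stock gsdl-2.38 versions if the patch misbehaves.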
> Note that there are some new options available in the patched version of the
> PDF plugin. The most important is the new "-complex" option. The default
> configuration (without -complex) will extract all the text from a PDF
> document but it may not look much like the original PDF. This is fine in
> cases where the extracted text is mostly used for searching purposes while
> the original PDF is retained for display. By using the -complex option
> though (i.e. specifying "plugin PDFPlug -complex" in your collection's
> collect.cfg file) you can get the output to be formatted much more like the
> original PDF. An important thing to note though is that for this to work
> properly you must have ghostscript installed on your machine. Windows users
> can download a free, precompiled version of ghostscript from
> http://www.ghostscript.com. After installing it you should copy the
> gswin32c.exe file (typically installed to C:\gs\gs8.00\bin) to your
> gsdl\bin\windows directory. *nix users should simply make sure they've got
> gs installed and that it's on their search path.
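
A quick way to check whether ghostscript is visible on the search path is
sketched below in Python. The function name find_ghostscript is made up, and
the check only looks at the search path; it does not look inside the
Greenstone bin directory where Windows users are told to copy gswin32c.exe.

```python
import shutil


def find_ghostscript(candidates=("gs", "gswin32c")):
    """Return the full path of the first ghostscript executable found on
    the search path, or None if none of the candidate names is found."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


location = find_ghostscript()
if location is None:
    print("ghostscript was not found on the search path")
else:
    print("ghostscript found at", location)
```

If nothing is found, the -complex option is unlikely to work until
ghostscript is installed as described above.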
> None of the above has had much testing so please let me know if anyone has
> problems with it. Of particular note is that it hasn't been tested at all on
> Windows 95/98.
> Postscript documents:
> ----------------------
> The PSPlug uses ghostscript to convert the input postscript document to text
> for importing. I had a quick go at getting Leonid's Russian document to
> convert using gswin32c on Windows 2000 but could only convert it to
> Perhaps someone else who knows a little more about postscript and
> ghostscript will be able to help me here. All we really need is a way
> to convert the document to utf-8 encoded text.
> cheers,
> Stefan.