Greenstone Word/PDF/PS import problems

From: Stefan Boddie
Date: Tue, 28 Jan 2003 17:39:01 +1300
Subject: Greenstone Word/PDF/PS import problems

Hi Leonid,

I've copied this message to the Greenstone mailing list in the hope that
it'll help some others out too. I hope you don't mind.

First, a bit of background for everyone else's benefit. Leonid has been trying
to import Russian language Word, PDF, and Postscript documents using
gsdl-2.38. All were failing miserably for various reasons. Many others have
reported similar problems with Word and PDF docs, particularly (but not
exclusively) those containing non-ASCII characters.

Those interested in making their Greenstone installation import these
documents properly (or at least better than at present) should read on.
Everyone else, please stop now as this is going to be long.


A couple of general comments before going into detail about the import
problems.

1. If you're having trouble getting a collection to build the way you think
it should, please try building it from the command line. Using the collector
web interface is fine for simple collections but it does tend to hide error
messages and other useful stuff from you. Using the command line is more
flexible and more likely to give you useful information when things go
wrong. There's a fairly extensive section on how to build a collection from
the command line in the Greenstone Developer's guide.

2. The default character set for output from gsdl-2.38 is iso-8859-1
(Western European/Latin-1). This means that even if your Russian or other
non-Latin document is imported and built successfully, it will be displayed
with all the non-Latin characters turned into question marks. To fix this
you just need to go to the preferences page and select utf-8 or some other
appropriate encoding from the "Encoding" menu. Alternatively, if you want
Greenstone to use utf-8 as its default encoding you can add the following
line to the bottom of your gsdl\etc\main.cfg file:

cgiarg shortname=w argdefault=utf-8

Note that gsdl-2.39 (due out very soon) will use utf-8 as its default.
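
To see why the iso-8859-1 default produces question marks, here is a small
Python sketch. Python is not part of Greenstone and this is not Greenstone's
own code; it only mimics the effect, and the sample word is an arbitrary
example.

```python
# Illustration only: mimics what happens when Cyrillic text is forced
# into iso-8859-1 versus utf-8. The sample word is an arbitrary example.
text = "Привет"

# iso-8859-1 has no Cyrillic letters, so each one is replaced by "?".
latin1_bytes = text.encode("iso-8859-1", errors="replace")

# utf-8 can represent every character, so the text survives a round trip.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(latin1_bytes)        # six question marks, one per Cyrillic letter
print(round_trip == text)  # True
```

The same loss applies to any script outside Latin-1, which is why switching
the output encoding to utf-8 is the safer choice for mixed-language
collections.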


MS-Word files:
----------------

Greenstone processes MS-Word files by converting them to HTML before
importing. The conversion process is done with an open source conversion
utility called wvWare. gsdl-2.38 includes version 0.7.1 of wvWare which
works perfectly well at converting non-latin character sets. There are a
couple of problems with gsdl-2.38 itself though.

1. Unfortunately the wvWare.exe binary we distributed with the Windows
version of gsdl-2.38 depends on Cygwin (a Unix environment for Windows),
which prevents it from working if Cygwin isn't installed. The easiest way
around this is to download and install Cygwin from
http://www.cygwin.com.

2. You might notice some broken images appearing in the HTML output
Greenstone produces after converting and importing your MS-Word file. This
is most likely because the image appearing within the Word file itself is a
WMF image. Greenstone doesn't include support for extracting these images so
they end up broken. Since a popular way to use Greenstone is to use the
extracted text for indexing but retain the source Word document for display
I haven't considered this a huge problem. The wvWare converter itself is
however quite capable of extracting these images. To do so requires libwmf
and various other components, the inclusion of which would make Greenstone
even bigger and slower to download than it is already. Those requiring this
feature can, however, download the required components themselves. The latest
versions of everything required can be found at http://www.wvware.com. You
simply need to install the new binary files into your gsdl\bin\windows (or
gsdl/bin/linux or whatever) directory, replacing the wvWare binary that's
already there if required. For Windows users there are pre-compiled binaries
of wvWare.exe, libwmf, and everything else you need at
http://sourceforge.net/projects/gnuwin32.


PDF Files:
-----------

Greenstone processes PDF files in the same way as Word docs. That is, it
first converts them to HTML then imports the HTML. The conversion process
for PDF is done using another open source tool called pdftohtml. The version
of pdftohtml included with gsdl-2.38 is based on version 0.22, with some
code added in from version 0.31 (why this was done is a long story). This
version of pdftohtml does not do well with documents containing non-latin
characters. The good news is that pdftohtml is once again under active
development and the latest versions are much improved.

To patch your gsdl-2.38 installation to do the right thing with PDFs you
should download http://www.greenstone.org/gsdl-patch-pdfplug.zip, unzip it
and install the files it contains as follows:

* pdftohtml.exe --> gsdl\bin\windows
(linux or other *nix users should get the latest version of pdftohtml from
http://pdftohtml.sourceforge.net and install the pdftohtml binary in their
gsdl/bin/$GSDLOS directory).

* gsConvert.pl --> gsdl\bin\script

* pdftohtml.pl --> gsdl\bin\script

* PDFPlug.pm --> gsdl\perllib\plugins

* ConvertToPlug.pm --> gsdl\perllib\plugins

Once the patch is installed the Greenstone PDF plugin should handle PDF
files containing non-latin characters.
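
For anyone who would rather script the copy step than do it by hand, here is
a rough Python sketch. It is not part of the patch: the function name, the
PATCH_LAYOUT table and the example paths are all made up for illustration,
and it assumes the zip has been unpacked into a single flat directory.

```python
import shutil
from pathlib import Path

# Destination of each patch file, relative to the Greenstone home directory,
# following the file list in this message. "{os}" stands for the platform
# directory (windows, linux, ...). This table is an illustrative assumption.
PATCH_LAYOUT = {
    "pdftohtml.exe": "bin/{os}",
    "gsConvert.pl": "bin/script",
    "pdftohtml.pl": "bin/script",
    "PDFPlug.pm": "perllib/plugins",
    "ConvertToPlug.pm": "perllib/plugins",
}


def install_patch(patch_dir, gsdl_home, os_dir="windows"):
    """Copy the patch files into place and return the paths written.

    Files missing from patch_dir are skipped (for example pdftohtml.exe
    on *nix, where the binary is obtained separately).
    """
    written = []
    for name, rel_dir in PATCH_LAYOUT.items():
        source = Path(patch_dir) / name
        if not source.is_file():
            continue
        target_dir = Path(gsdl_home) / rel_dir.format(os=os_dir)
        target_dir.mkdir(parents=True, exist_ok=True)
        # copy2 overwrites an existing file of the same name.
        written.append(Path(shutil.copy2(source, target_dir / name)))
    return written
```

A call might look like install_patch("gsdl-patch-pdfplug", r"C:\gsdl"), where
both paths are hypothetical. Since existing files are overwritten, it is worth
backing up the originals first.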

Note that there are some new options available in the patched version of the
PDF plugin. The most important is the new "-complex" option. The default
configuration (without -complex) will extract all the text from a PDF
document but it may not look much like the original PDF. This is fine in
cases where the extracted text is mostly used for searching purposes while
the original PDF is retained for display. With the -complex option, though
(i.e. specifying "plugin PDFPlug -complex" in your collection's collect.cfg
file), the output is formatted much more like the original PDF. Note that
for this to work properly you must have Ghostscript installed on your
machine. Windows users can download a free, precompiled version of
Ghostscript from http://www.ghostscript.com. After installing it you should
copy the gswin32c.exe file (typically installed to C:\gs\gs8.00\bin) to your
gsdl\bin\windows directory. *nix users should simply make sure they've got
gs installed and that it's on their search path.
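
A quick way to check whether a Ghostscript executable is on the search path
is sketched below in Python. This is only a convenience check, not something
Greenstone itself runs; the function name is made up, and it looks only at
the search path, so on Windows it will not notice a gswin32c.exe that has
been copied into the Greenstone bin directory.

```python
import shutil


def find_ghostscript(candidates=("gs", "gswin32c")):
    """Return the full path of the first Ghostscript executable found
    on the search path, or None if none of the candidates is present."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    location = find_ghostscript()
    print(location or "Ghostscript not found on the search path")
```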

None of the above has had much testing so please let me know if anyone has
problems with it. Of particular note is that it hasn't been tested at all on
Windows 95/98.


Postscript documents:
----------------------

The PSPlug uses Ghostscript to convert the input PostScript document to text
for importing. I had a quick go at getting Leonid's Russian document to
convert using gswin32c on Windows 2000 but could only convert it to garbage.
Perhaps someone else who knows a little more about PostScript and
Ghostscript will be able to help me here. All we really need is the ability
to convert the document to utf-8 encoded text.

cheers,
Stefan.