Fw: Greenstone Word/PDF/PS import problems

From Stefan Boddie
Date Tue, 28 Jan 2003 20:57:38 +1300
Subject Fw: Greenstone Word/PDF/PS import problems
Hi,

This is a follow up to my previous long boring message about word and PDF
encoding problems with Greenstone. It is also intended to help Cao Minh Kiem
who has had some problems importing Vietnamese documents into Greenstone.

Cao Minh Kiem: I'm not sure if you are on the Greenstone mailing list, so if
not, see the message I wrote to Leonid Kalinichenko earlier (attached below).
Many of the problems he was having importing Russian documents were also
affecting your Vietnamese documents. Take particular note of the bug caused
by wvWare's dependence on cygwin, as fixing it should allow your MS-Word
documents to import correctly.

Anyway, my findings when testing the sample Vietnamese documents Cao Minh
Kiem supplied were:

1. As already mentioned the sample Word document imported fine once cygwin
was installed. I'm now fairly confident that our word plugin can handle most
languages and encodings and would be interested to hear if anyone has any
docs that it fails on.

2. The sample Vietnamese PDF document was something of a failure
unfortunately. The pdftohtml converter doesn't manage to convert it to UTF-8
successfully. All I can suggest is that someone might like to rummage in the
pdftohtml source code and attempt to add support for the Vietnamese
character set. Alternatively, maybe it'd be possible to persuade the
pdftohtml developers to add this support.

3. The sample Vietnamese HTML document wasn't working because it was encoded
with HTML entities rather than raw characters. That is, it was made up of
entities looking something like "&#7897;". There was a bug in gsdl-2.38
preventing these entities from being converted correctly. The bug was fixed
in the CVS tree some time ago but you can patch a gsdl-2.38 installation by
downloading http://www.greenstone.org/tmp/gsdl-patch-htmlplug.zip and
installing the files it contains as follows:

* HTMLPlug.pm --> gsdl\perllib\plugins\HTMLPlug.pm
* BasPlug.pm --> gsdl\perllib\plugins\BasPlug.pm
* ghtml.pm --> gsdl\perllib\ghtml.pm
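
For anyone wanting to see what "converting entities correctly" means in practice, here is a minimal sketch in Python. It is not the Greenstone code (HTMLPlug is Perl); it only uses Python's standard html module to show a numeric entity being turned into a raw character and then into UTF-8 bytes. The sample string and the function name entities_to_utf8 are made up for illustration.

```python
import html


def entities_to_utf8(text: str) -> bytes:
    """Decode HTML character references, then encode the result as UTF-8."""
    return html.unescape(text).encode("utf-8")


# Hypothetical sample: &#7879; is the numeric entity for U+1EC7.
sample = "Vi&#7879;t Nam"

decoded = html.unescape(sample)   # entities replaced by raw characters
raw_bytes = entities_to_utf8(sample)

print(decoded)
print(raw_bytes)
```

A document made up entirely of such entities is plain ASCII on disk, which is why the conversion has to happen at import time rather than being handled by the encoding setting.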


All these changes and bug fixes will of course be included in the next
Greenstone release. If anyone has similar problems with their documents I'd
be interested in hearing about them though. It'd be nice to get as many of
these encoding related problems fixed as possible before releasing
gsdl-2.39.

cheers,
Stefan.

----- Original Message -----
From: "Stefan Boddie" <sjboddie@cs.waikato.ac.nz>
To: "Leonid Kalinichenko" <leonidk@synth.ipi.ac.ru>
Cc: <greenstone@tripath.colosys.net>
Sent: Tuesday, January 28, 2003 5:39 PM
Subject: Greenstone Word/PDF/PS import problems


> Hi Leonid,
>
> I've copied this message to the Greenstone mailing list in the hope that
> it'll help some others out too. I hope you don't mind.
>
> First a bit of background for everyone else's benefit. Leonid has been
> trying to import Russian-language Word, PDF, and Postscript documents using
> gsdl-2.38. All were failing miserably for various reasons. Many others have
> reported similar problems with Word and PDF docs, particularly (but not
> exclusively) those containing non-ASCII characters.
>
> Those interested in making their Greenstone installation import these
> documents properly (or at least better than at present) should read on.
> Everyone else, please stop now as this is going to be long.
>
>
> A couple of general comments before going into detail about the import
> problems.
>
> 1. If you're having trouble getting a collection to build the way you think
> it should, please try building it from the command line. Using the collector
> web interface is fine for simple collections but it does tend to hide error
> messages and other useful stuff from you. Using the command line is more
> flexible and more likely to give you useful information when things go
> wrong. There's a fairly extensive section on how to build a collection from
> the command line in the Greenstone Developer's guide.
>
> 2. The default character set for output from gsdl-2.38 is iso-8859-1
> (western european/latin1). This means that even if your Russian or other
> non-latin document is imported and built successfully it will be displayed
> with all the non-latin characters turned into question marks. To fix this
> you just need to go to the preferences page and select utf-8 or some other
> appropriate encoding from the "Encoding" menu. Alternatively, if you want
> Greenstone to use utf-8 as its default encoding you can add the following
> line to the bottom of your gsdl\etc\main.cfg file:
>
> cgiarg shortname=w argdefault=utf-8
>
> Note that gsdl-2.39 (due out very soon) will use utf-8 as its default.
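
The question-mark effect is easy to reproduce outside Greenstone. The following is a small Python sketch, not Greenstone's own output code, and the sample word is made up. It assumes the output stage substitutes "?" for anything the target character set cannot represent, which is what Python's errors="replace" does.

```python
# Hypothetical sample text: a six-letter Russian word.
text = "Привет"

# iso-8859-1 has no Cyrillic letters, so every character is replaced by "?".
as_latin1 = text.encode("iso-8859-1", errors="replace")

# utf-8 can represent every character, so the text survives a round trip.
as_utf8 = text.encode("utf-8")
round_trip = as_utf8.decode("utf-8")

print(as_latin1)
print(round_trip == text)
```

The collection itself is not damaged in this situation; only the output encoding is unable to show the characters, which is why changing the preference is enough.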
>
>
> MS-Word files:
> ----------------
>
> Greenstone processes MS-Word files by converting them to HTML before
> importing. The conversion process is done with an open source conversion
> utility called wvWare. GSDL-2.38 includes version 0.7.1 of wvWare which
> works perfectly well at converting non-latin character sets. There are a
> couple of problems with gsdl-2.38 itself though.
>
> 1. Unfortunately the wvWare.exe binary we distributed with the Windows
> version of gsdl-2.38 has a reliance on cygwin (cygwin is a unix environment
> for Windows) which prevents it from working if cygwin isn't installed. The
> easiest way around this is to download and install cygwin from
> http://www.cygwin.com.
>
> 2. You might notice you get some broken images appearing in the html output
> Greenstone produces after converting and importing your MS-Word file. This
> is most likely because the image appearing within the Word file itself is a
> WMF image. Greenstone doesn't include support for extracting these images so
> they end up broken. Since a popular way to use Greenstone is to use the
> extracted text for indexing but retain the source Word document for display
> I haven't considered this a huge problem. The wvWare converter itself is
> however quite capable of extracting these images. To do so requires libwmf
> and various other components, the inclusion of which would make Greenstone
> even bigger and slower to download than it is already. Those requiring this
> feature can download the required components themselves however. The latest
> versions of everything required can be found at http://www.wvware.com. You
> simply need to install the new binary files into your gsdl\bin\windows (or
> gsdl/bin/linux or whatever) directory, replacing the wvWare binary that's
> already there if required. For Windows users there are pre-compiled binaries
> of wvWare.exe, libwmf, and everything else you need at
> http://sourceforge.net/projects/gnuwin32.
>
>
> PDF Files:
> -----------
>
> Greenstone processes PDF files in the same way as Word docs. That is, it
> first converts them to HTML then imports the HTML. The conversion process
> for PDF is done using another open source tool called pdftohtml. The version
> of pdftohtml included with gsdl-2.38 is based on version 0.22, with some
> code added in from version 0.31 (why this was done is a long story). This
> version of pdftohtml does not do well with documents containing non-latin
> characters. The good news is that pdftohtml is once again under active
> development and the latest versions are much improved.
>
> To patch your gsdl-2.38 installation to do the right thing with PDFs you
> should download http://www.greenstone.org/gsdl-patch-pdfplug.zip, unzip it
> and install the files it contains as follows:
>
> * pdftohtml.exe --> gsdl\bin\windows
> (linux or other *nix users should get the latest version of pdftohtml from
> http://pdftohtml.sourceforge.net and install the pdftohtml binary in their
> gsdl/bin/$GSDLOS directory).
>
> * gsConvert.pl --> gsdl\bin\script
>
> * pdftohtml.pl --> gsdl\bin\script
>
> * PDFPlug.pm --> gsdl\perllib\plugins
>
> * ConvertToPlug.pm --> gsdl\perllib\plugins
>
> Once the patch is installed the Greenstone PDF plugin should handle PDF
> files containing non-latin characters.
>
> Note that there are some new options available in the patched version of the
> PDF plugin. The most important is the new "-complex" option. The default
> configuration (without -complex) will extract all the text from a PDF
> document but it may not look much like the original PDF. This is fine in
> cases where the extracted text is mostly used for searching purposes while
> the original PDF is retained for display. By using the -complex option
> though (i.e. specifying "plugin PDFPlug -complex" in your collection's
> collect.cfg file) you can get the output to be formatted much more like the
> original PDF. An important thing to note though is that for this to work
> properly you must have ghostscript installed on your machine. Windows users
> can download a free, precompiled version of ghostscript from
> http://www.ghostscript.com. After installing it you should copy the
> gswin32c.exe file (typically installed to C:\gs\gs8.00\bin) to your
> gsdl\bin\windows directory. *nix users should simply make sure they've got
> gs installed and that it's on their search path.
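
One quick way to check the "on their search path" part is sketched below in Python. This is only an illustration using the standard shutil.which function; the helper name find_tool is made up, and Greenstone itself does not run this check.

```python
import shutil
from typing import Optional


def find_tool(name: str) -> Optional[str]:
    """Return the full path of an executable found on the search path, or None."""
    return shutil.which(name)


# "gs" is the usual ghostscript command name on *nix systems.
gs_path = find_tool("gs")

if gs_path is None:
    print("gs was not found on the search path")
else:
    print("gs found at", gs_path)
```

If the check reports that gs was not found, the -complex option is unlikely to work until ghostscript is installed or its directory is added to the search path.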
>
> None of the above has had much testing so please let me know if anyone has
> problems with it. Of particular note is that it hasn't been tested at all on
> Windows 95/98.
>
>
> Postscript documents:
> ----------------------
>
> The PSPlug uses ghostscript to convert the input postscript document to text
> for importing. I had a quick go at getting Leonid's Russian document to
> convert using gswin32c on Windows 2000 but could only convert it to garbage.
> Perhaps someone else who knows a little more about postscript and
> ghostscript will be able to help me here. All we really need is the ability
> to convert the document to utf-8 encoded text.
>
> cheers,
> Stefan.
>
>