RE: RV: [greenstone-users] PDFPlug

From: Azael Barrera
Date: Thu, 11 Mar 2004 08:36:53 -0500
Subject: RE: RV: [greenstone-users] PDFPlug
John, Katherine, GSDL folks,

I am having problems with PDF files, too (again).

I had used GSDL 2.40a on RHLinux8 and had to go through the process
several times, and something always went wrong.

I am using GSDL 2.41 now, in both a local Windows version and with
RHLinux9, which I recompiled to enable z39.50 (as I did with 2.40a).

The problem is this. I wrote files in OpenOffice Writer, saved them as
.doc, and then converted them with the internal PDF converter, which I
believe is based on ghostscript-to-pdf; I have heard that what it does
is create images (bitmaps or JPEGs, I don't know). When I use either the
Collector or GLI with the local gsdl (I do this first, before working on
the Linux gsdl), no PDF file is shown in the final list.
In fact, no file is shown at all.

Then, going a step back, I used GLI with the .doc generated by OOWriter
instead of the PDF version of the file. The list contained an HTML text
file and an icon that is supposed to bring up the .doc file (the icon
does not work at all). Only the HTML text works.

Am I missing something with the PDFPlug parameters? Or with the classify
parameters? Is this functionality crippled in the local Windows version,
while it should work in the Linux server version?

What I need is simple: a list of PDF files, with title and filename next
to the PDF icon, and perhaps the text version too, nothing else (no
.doc files, since they do not seem to work properly).

Any help? Sorry for the time spent if this has been answered before.
If it has been, then please point me to this not-so-FAQ.

Azael Barrera, Ph.D.
Director - Transferencia de Tecnologías de Información y Comunicación.
Secretaría Nacional de Ciencia, Tecnología e Innovación

-----Original Message-----
From: John R. McPherson
Sent: Wednesday, 10 March 2004 18:24
To: Diego Spano
CC: Greenstone (users)
Subject: Re: RV: [greenstone-users] PDFPlug

Diego Spano wrote:
> Hi John, you are right, png files are better than jpg, but Greenstone
> doesn't process them! I made a PDF document composed of png files and
> imported it into Greenstone, but it doesn't export each page, so when I
> browse the document in the collection I see no images! If I use jpg
> files, Greenstone processes them with no problems!
> Is it something about the PDFPlug? I also used the -complex option, but
> nothing happens.

It looks like "pdftohtml", the converter program we use, handles .JPG
images differently to other image types. The older versions of pdftohtml
used to always extract images. It looks like the newer version only
extracts JPG images by default, and will only extract other image types
if the -complex option is used.
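
For anyone who wants to compare the two modes outside Greenstone, here
is a minimal sketch that only builds the pdftohtml command line. It
assumes that PDFPlug's -complex option corresponds to pdftohtml's "-c"
("complex document") switch; that mapping is an assumption, and the
function name and file names are made up for illustration.

```python
import shlex


def pdftohtml_command(pdf_path, out_base, complex_mode=False):
    """Build (but do not run) an argument list for pdftohtml.

    pdf_path and out_base are hypothetical example names.
    """
    args = ["pdftohtml"]
    if complex_mode:
        # "-c" asks pdftohtml for "complex" output. Assumption: this is
        # the switch that PDFPlug's -complex option turns on.
        args.append("-c")
    args += [pdf_path, out_base]
    return args


# Show both command lines so the difference is easy to see.
print(shlex.join(pdftohtml_command("report.pdf", "report")))
print(shlex.join(pdftohtml_command("report.pdf", "report", complex_mode=True)))
```

Running each command by hand on the same PDF and comparing the output
files is one way to see whether the missing images are a pdftohtml
matter or a Greenstone one.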

If the complex option is used, then the images are extracted, but then
pdftohtml does some annoying things:
1) It uses horrible javascript to place the extracted text in particular
places in the .HTML file, which means that if you add other stuff around
it (such as greenstone html code), all the text is overlapping and out
of place.
2) It makes a big image of the page and uses that as the background,
drawing the extracted text on top of the image.

Basically, if you have too much text at the top of the page, it might
make it all render funny. Also, I don't know how well Internet Explorer
handles all the placement javascript.
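
Since the symptom in this thread is pages showing no images, a quick
check after a conversion is to count the image files by type in the
output directory. The sketch below is only an illustration: the
function name is made up, the list of suffixes is my own guess at what
is worth counting, and you have to point it at wherever your converted
files actually end up.

```python
from collections import Counter
from pathlib import Path

# Suffixes counted as images (an assumption; extend as needed).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif"}


def count_extracted_images(output_dir):
    """Return a dict mapping image suffix to number of files found.

    Walks output_dir recursively and ignores non-image files.
    """
    counts = Counter()
    for path in Path(output_dir).rglob("*"):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in IMAGE_SUFFIXES:
            counts[suffix] += 1
    return dict(counts)
```

If a PDF built from PNG files gives an empty result without -complex
and a non-empty one with it, that matches the behaviour described
above.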

Anyway, I tried it with a pdf file I created using "pdflatex" and
embedding .PNG images in it, and it worked ok when I gave PDFPlug the
"-complex" option. I'm using greenstone v 2.41 on linux.