Re: PDF images problem

From steve.brophy@altarum.org
Date Thu, 11 Jul 2002 15:14:13 -0400
Subject Re: PDF images problem
I'm also having a problem with PDF images which might be related to the
pdftohtml version.

Using Greenstone 2.38 on Windows NT, the PDF import works fine except
for files with security on them. Those cause the pdftohtml
(version 3.1) program to fail with the error "decryption support not included".
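
To sort the secured files from the rest before importing, a rough Python sketch like the one below might help. It assumes a protected PDF carries an /Encrypt entry in its trailer and simply scans the raw bytes for that name. The function name looks_encrypted is made up for illustration, and this is a heuristic, not a real PDF parser.

```python
from pathlib import Path


def looks_encrypted(pdf_path):
    """Heuristic check: True if the raw PDF bytes contain an /Encrypt key.

    Assumption: secured PDFs reference an encryption dictionary via
    /Encrypt. The whole file is read into memory, so this suits
    ordinary-sized documents only.
    """
    data = Path(pdf_path).read_bytes()
    return b"/Encrypt" in data
```

Files for which this returns True are the ones likely to trigger the "decryption support not included" error.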

I tried installing the pdftohtml.exe v3.4, as the newer versions do not
have the encryption restriction, along with the ghostscript 'gswin32c'
program. This setup extracts the text from all the pdf files, but
doesn't do images for any of them. Running the 'pdftohtml.exe' program
directly only creates the .html output file, plus occasionally some .jpeg
files, but none of the .ppm files or the "image.log" file.

Is the greenstone 'pdftohtml.exe' program customized to work with ppm
images, or has the pdftohtml program changed its image support along the
way? Could this be a problem with my windows pdftohtml or ghostscript
setup, or a more general situation similar to what Elias is seeing?

thanks for the continuing help,

Steve Brophy
Altarum, Ann Arbor MI USA


Elias Estrada Torres <eestrada@cie-mexico.com.mx>
Sent by: owner-tripath-users@colosys.net
07/10/2002 11:05 AM


To: "John R. McPherson" <jrm21@cs.waikato.ac.nz>
cc: greenstone@tripath.colosys.net
Subject: Re: PDF images problem

I'm running Greenstone 2.37 on a Red Hat Linux 7.2 box.
Yesterday I upgraded the pdftohtml version and the images problem was
solved.
Now the images are generated correctly but the HTML template does not
include the images.
Any clues?
Thanks!


John R. McPherson wrote:
Elias Estrada Torres wrote:
I have a little problem with some PDF files and wonder if someone had it
too.

I have to upload PDF files that contain no text, only images.
When a file is processed, the result seems to work fine, printing a
message like this:

Page 1
Page 2
.
.
.
Page N

but the images are neither displayed nor stored in the directory.

Hi,
it would help if you said what operating system and which version
of Greenstone you are using.

To see how the PDF file is converted before Greenstone can
process it, you could run "pdftohtml" from the command line:

1) Start a command prompt if under Windows.

2) Change directory to the Greenstone home directory.

3) Run "setup.bat" if under Windows, or ". setup.bash" under Unix.

4) You can now run the pdftohtml converter to see what is extracted
from the PDF file. Go into the import directory (or wherever
you have a PDF file) and type the command:
"pdftohtml -noframes <<yourfile.pdf>> output.html", replacing
yourfile.pdf with the proper name. This will create output.html
which contains any extracted text, and any images that could be
extracted (these will be .PPM format) and a file called "image.log".
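
After step 4, a quick way to see what the converter actually left behind is to count the files by type. The following is only a sketch in Python: it assumes the output lands in a single directory, and the function name summarize_outputs is invented for this example.

```python
from collections import Counter
from pathlib import Path


def summarize_outputs(directory):
    """Count converter output files by extension in one directory.

    Also reports whether a file named "image.log" is present.
    Only regular files directly inside `directory` are considered.
    """
    d = Path(directory)
    counts = Counter(p.suffix.lower() for p in d.iterdir() if p.is_file())
    return {
        "html": counts.get(".html", 0),
        "ppm": counts.get(".ppm", 0),
        "jpeg": counts.get(".jpg", 0) + counts.get(".jpeg", 0),
        "png": counts.get(".png", 0),
        "has_image_log": (d / "image.log").is_file(),
    }
```

Zero .ppm files and no image.log, alongside a non-empty HTML file, would match the symptom where text is extracted but images are not.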

This is the input to Greenstone - it processes the HTML files,
which should link to the images that were extracted (but in .PNG
format - the conversion scripts will convert .PPM images to .PNG
images). If pdftohtml can't handle a particular PDF file, then
Greenstone won't be able to process it.
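
To check whether the HTML really links to the images, something along these lines could be used. It is a minimal sketch, assuming the images are referenced through ordinary img tags with relative src paths. The names ImgCollector and check_image_links are made up for illustration and are not part of Greenstone or pdftohtml.

```python
from html.parser import HTMLParser
from pathlib import Path


class ImgCollector(HTMLParser):
    """Collects the src attribute of every img tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def check_image_links(html_path):
    """Return (all img sources, sources with no matching file on disk).

    Sources are resolved relative to the directory of the HTML file.
    """
    html_file = Path(html_path)
    parser = ImgCollector()
    parser.feed(html_file.read_text(errors="replace"))
    missing = [s for s in parser.sources
               if not (html_file.parent / s).is_file()]
    return parser.sources, missing
```

An empty first list means the HTML contains no image links at all, while entries in the second list point to links whose image files were never written.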

John McPherson