Re: PDF images problem

From John R. McPherson
Date Wed, 10 Jul 2002 12:01:06 +1200
Subject Re: PDF images problem
In-Reply-To (3D2A2723-8080903-cie-mexico-com-mx)
Elias Estrada Torres wrote:
> I have a small problem with some PDF files and wonder if anyone else has
> had it too.
> I have to upload PDF files that contain no text, only images.
> When a file is processed, the result seems to work fine, producing
> messages like this:
> Page 1
> Page 2
> .
> .
> .
> Page N
> and the images are neither displayed nor stored in the directory.

It would help if you said which operating system and which version
of Greenstone you are using.

To see how the PDF file is converted before Greenstone processes
it, you could run "pdftohtml" from the command line:

1) Start a command prompt if under Windows.

2) Change directory to the Greenstone home directory.

3) Run "setup.bat" if under Windows, or ". setup.bash" under Unix.

4) You can now run the pdftohtml converter to see what is extracted
from the PDF file. Go into the import directory (or wherever
you have a PDF file) and type the command:
"pdftohtml -noframes yourfile.pdf output.html", replacing
yourfile.pdf with the proper name. This will create output.html,
which contains any extracted text, along with any images that could
be extracted (these will be in .PPM format) and a file called
"image.log".
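If you would rather script step 4 than type it by hand, the following
is a minimal Python sketch, not part of Greenstone. It assumes it is
started from a shell where the setup script from step 3 has already
been run, so that "pdftohtml" is on the PATH. The function names
build_pdftohtml_command and run_pdftohtml are made up for this example;
only the "-noframes" option and the argument order come from the
command above.

```python
import shutil
import subprocess


def build_pdftohtml_command(pdf_path, html_path="output.html"):
    """Return the argument list for the command given in step 4."""
    return ["pdftohtml", "-noframes", str(pdf_path), str(html_path)]


def run_pdftohtml(pdf_path, html_path="output.html"):
    """Run pdftohtml and return the completed process (hypothetical helper)."""
    # Assumption: the Greenstone setup script has put pdftohtml on the PATH.
    if shutil.which("pdftohtml") is None:
        raise FileNotFoundError("pdftohtml not found; run the setup script first")
    command = build_pdftohtml_command(pdf_path, html_path)
    # Capture the converter's messages so they can be inspected afterwards.
    return subprocess.run(command, capture_output=True, text=True, check=False)
```

Calling run_pdftohtml("yourfile.pdf") from the import directory and
printing the result's stdout and stderr should show the same messages
you would see when running the command by hand.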

This output is the input to Greenstone: it processes the HTML file,
which should link to the images that were extracted (but in .PNG
format, since the conversion scripts convert .PPM images to .PNG
images). If pdftohtml can't handle a particular PDF file, then
Greenstone won't be able to process it either.
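A quick way to check whether pdftohtml extracted any images at all is
to list the image files in the directory where you ran it. This is a
rough Python sketch under the assumption that extracted images end in
.ppm (or .png after conversion), as described above; the helper name
list_extracted_images is made up for this example.

```python
from pathlib import Path

# Suffixes mentioned above: .PPM from pdftohtml, .PNG after conversion.
IMAGE_SUFFIXES = {".ppm", ".png"}


def list_extracted_images(directory="."):
    """Return the sorted names of image files found in the directory."""
    folder = Path(directory)
    return sorted(
        entry.name
        for entry in folder.iterdir()
        if entry.is_file() and entry.suffix.lower() in IMAGE_SUFFIXES
    )
```

An empty list after running pdftohtml on an image-only PDF would
suggest that the converter could not extract the images from that file.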

John McPherson