[greenstone-users] Design and Build of pdf in linux - Challenge

From John Rose
Date Wed Apr 6 02:37:42 2011
Subject [greenstone-users] Design and Build of pdf in linux - Challenge
Dear Colleagues, I am sending the message below again because I
inadvertently included attachments with the initial sending:

1. I have a problem with pdf files which may be related to Boye's. I
have Ubuntu 10.04 and have set up cups-pdf to print to pdf files. This
works fine and generates pdf version 1.3 files which can be read with
Acrobat Reader in Ubuntu or in Windows, and no one has complained about them.

However, when I try to put these files into a collection in Greenstone
2.84 GLI, no text is extracted by Greenstone when the collection is
created. On the other hand, when I try pdf files which came from
somewhere else, they are handled fine in Greenstone 2.84.

When I ask about the character encoding in Ubuntu (file -bi filename) it
says charset=binary in both cases [thus no information about the real
character set]
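
Since `file -bi` reports only charset=binary for any PDF, the header line of the file is a more direct way to confirm the PDF version. A minimal sketch, assuming a standard PDF whose first bytes are "%PDF-x.y"; the function name pdf_version and the file name in the usage comment are made up for illustration:

```shell
# Print the PDF version declared in the file header (for example "1.3").
# Assumes the file starts with the standard "%PDF-x.y" marker.
pdf_version() {
    # Read only the first 8 bytes: "%PDF-" plus the 3-character version.
    head -c 8 -- "$1" | sed -n 's/^%PDF-\([0-9]\.[0-9]\).*/\1/p'
}

# Usage (hypothetical file name):
#   pdf_version sample.pdf
```

Nothing is printed if the file does not begin with a PDF header, which is itself a useful signal.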

I have tried to see whether I can configure cups-pdf better, but have
not seen how to do so thus far. Ideally, Greenstone should handle these
files if they are well formed, no? I am sending a sample pdf file
off-list to Anupama.

Anupama said that the PDF Box extension is mainly for recent versions of
pdf. Thinking that it could also help with the above problem, I followed
the instructions. Now my problem pdf files show text but it is in the
form "glyph19glyph25glyph27glyph29....". What does this mean?
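
Output of this kind can at least be detected mechanically, which helps to list the affected documents. A rough sketch, assuming the extracted text of a document has been saved to a plain text file; the function name is hypothetical. One possible reading, not confirmed anywhere in this thread, is that the extractor is emitting internal glyph names because the fonts in the file give it no mapping back to characters.

```shell
# Return success (0) if the given text file contains runs of
# "glyphNN" placeholder names instead of readable words.
has_glyph_placeholders() {
    # Two or more consecutive glyph<number> tokens count as a run.
    grep -Eq -- '(glyph[0-9]+){2,}' "$1"
}

# Usage (hypothetical file name):
#   has_glyph_placeholders extracted.txt && echo "placeholder text found"
```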

2. It would be interesting to learn whether there is any downside to
using the PDF Box extension, and if not why it is not included by
default in Greenstone 2.84. It is certainly very important for
Greenstone to keep up with extensions of pdf. If I understand correctly,
the major problems with pdftohtml (fidelity of formatting is often not
good with the earliest pdf versions) begin with Extension Level 3 of pdf
version 1.7 (Acrobat 9.0). It would be very useful if list members could
share their experience with the PDF Box extension.

Best regards, John

-------- Original Message --------
> From: Greenstone Team <greenstone_team@cs.waikato.ac.nz>
> To: Sandton Consulting Ltd <mgmt@sandtonconsulting.com>,
> Anupama of Greenstone Team <greenstone_team@cs.waikato.ac.nz>,
> greenstone-users@list.scms.waikato.ac.nz
> Date: Mon, 04 Apr 2011 17:29:45 +1200
> Message-ID: <4D995749.4070205@cs.waikato.ac.nz>
> Subject: Re: [greenstone-users] Design and Build of pdf in linux - Challenge
>
> Hello Boye,
>
> There's not enough information provided as to the manner in which things
> are going wrong, but there are 2 things I can think of that you could try
> (please also answer my questions below).
> So we know that your Greenstone can work with the same PDF files on
> Windows, and that it is unable to process these files on Linux.
> Perhaps your environment is not set up right.
>
> - You appear to be using GLI (the Greenstone Librarian Interface
> application). Are there any errors shown in the output? More detailed
> error messages would appear if you went to the File menu > Preferences >
> Mode. Then tick "Expert" and click OK. Rebuild your collection and this
> time there may be more specific error messages in the build log area of
> the Create Panel. Can you copy the relevant section of the output (the
> error messages) and send this to us?
> - If your Linux system is an Ubuntu and the Greenstone version you
> happen to be using is 2.83 or earlier, then it may be a problem with
> Perl. In that case, see suggestion 2 below.
>
> 1. Otherwise, try the following first. We're going to try building from
> the command-line, instead of from GLI (the Greenstone Librarian
> application), just to check whether it has something to do with the
> environment.
>
> a) Open a linux terminal (x-term) and go into your Greenstone
> installation folder:
> $ cd /type/the/full/path/to/your/greenstone/installation/
>
> b) Next, set up the Greenstone environment by typing the following in
> your x-term:
> $ source setup.bash
>
> c) Run the import script -- which is the first step of the build process
> - and provide the name of the collection you wish to build as argument to it:
> $ import.pl -removeold <type your collection's name>
>
> Are there any errors at this stage (check for errors in the text that
> moves past in the terminal during the execution of the import.pl command)?
>
> d) Next run the 2nd step of the build process, once again providing the
> collection name as argument:
> $ buildcol.pl -removeold <type your collection's name>
>
> Once again, does the output show any errors?
>
> e) If all went well, rename the folder "building" inside your collection
> folder to "index":
> $ cd collect/<type your collection's name>
> $ mv building index
>
> ! If the above worked for some reason, then the environment GLI runs in
> when it is launched is different from the environment that the
> command-line scripts manage to work in.
>
> f) If steps c and d above showed up no errors to do with your PDF files,
> then go back to your Greenstone installation folder and run the
> Greenstone server from there to visit your collection page:
> $ cd /type/the/full/path/to/your/greenstone/installation/
> $ ./gs2-server.sh
> This should launch a dialog. Press its central button to go to your
> Greenstone digital library's home page and from there manoeuvre to your
> collection.
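
Steps a) to e) above could be collected into one function along these lines. This is only a sketch: both arguments are placeholders for your own installation path and collection name, and the check for an existing "index" folder is an added precaution that the steps above do not mention.

```shell
# Sketch of steps a) to e); run under bash, since setup.bash is a bash script.
# Both arguments are placeholders: the installation path and collection name.
build_collection() (
    gs_home=$1
    collection=$2

    # a) go into the Greenstone installation folder
    cd "$gs_home" || exit 1
    # b) set up the Greenstone environment (same effect as "source setup.bash")
    . ./setup.bash || exit 1
    # c) import, then d) build; stop at the first step that fails
    import.pl -removeold "$collection" || exit 1
    buildcol.pl -removeold "$collection" || exit 1
    # e) rename "building" to "index"; refuse to nest it inside an old index
    cd "collect/$collection" || exit 1
    if [ -e index ]; then
        echo "index already exists; move it aside first" >&2
        exit 1
    fi
    mv building index
)

# Usage (placeholder path and name):
#   build_collection /full/path/to/greenstone mycollection
```

The function body runs in a subshell, so the directory changes and the Greenstone environment do not leak into the calling terminal.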
>
>
> 2. If suggestion 1 above did not work, and your work is urgent, then I
> think it a good idea for you to try out the latest version of Greenstone
> (released last Friday): Greenstone 2.84. This latest Greenstone version
> can work with a plugin extension which makes it cope with later versions
> of PDF. It may be that doing so will bypass your linux-specific problems
> of handling the PDFs you have:
>
> a) Download the Greenstone 2.84 installer for *Linux* by clicking the
> link at the top of http://www.greenstone.org/download
> (or visit http://sourceforge.net/projects/greenstone/ and press the
> green Download button)
>
> b) Then run the installer to install Greenstone 2.84. Make sure to
> install it somewhere else than your previous Greenstone installation.
>
> c) Next, point your browser to
> http://trac.greenstone.org/browser/gs2-extensions/pdf-box/trunk/pdf-box-java.tar.gz
> Click on the small red "download" link on that page and make sure to
> save this file (the PDF-Box Greenstone extension) into your Greenstone
> 2.84 installation's "ext" folder.
>
> d) Use a terminal (x-term) to cd into your Greenstone installation
> folder and then go into its ext folder where you have saved the tar.gz
> file downloaded above. Then extract this archive file in this location:
> $ cd /type/the/full/path/to/your/Greenstone 2.84/installation/
> $ cd ext
> $ tar -xvzf pdf-box-java.tar.gz
>
> e) Copy your collection folder across from the old Greenstone
> installation into the new one:
> $ cp -r \
>   /<full/path/to/your/OLD-Greenstone-installation>/collect/<type name of PDF collection> \
>   /<full/path/to/your/Greenstone 2.84/installation>/collect/.
>
> (Note that the folder "collect" is the name of the directory containing
> your collection which is to be copied. The "collect" folder exists in
> all normal Greenstone installations, so you need to type it as shown.
> Just replace the strings inside the <> marks.)
>
> f) In a *fresh* terminal (this is important, so make sure to open a
> brand new x-term), go back to your Greenstone installation folder and
> run GLI from here:
> $ cd /full/path/to/your/Greenstone 2.84/installation/
> $ ./gli/gli.sh
>
> The reason you need a fresh x-term is because when GLI is run this time,
> it will know to set up the Greenstone environment all over again. And
> this time, it will detect the new PDF Box extension that you downloaded
> and unpacked in steps c and d.
>
> g) Go to File > Open. Click the "Change Dir..." button at the bottom
> and, in the dialog that appears, make sure it is pointing to the collect
> folder inside your new Greenstone 2.84 installation. (Else use the
> Change Dir dialog to go to the Greenstone 2.84 installation's collect
> directory.) Now open your collection. This should open the collection
> you copied into your Greenstone 2.84.
>
> h) Go into the Design panel. Make sure that on the left hand side,
> "Document Plugins" is selected. Then, to the right, double click on the
> PDFPlugin in the list of Document plugins. In the Plugin Configuration
> dialog that appears, scroll down to the section titled "Autoload
> Converters" and tick the checkbox next to "pdfbox_conversion".
> This tells GLI to use the PDFBox extension to process PDF files. This
> extension to the PDFPlugin allows newer PDF versions to be processed as
> well.
>
> i) Now go to GLI's Create Panel and click the Build button. Hopefully
> there will be no errors this time and your PDFs will get processed. Then
> click the Preview button to preview your collection.
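
The command-line parts of the procedure above, steps d) and e), could be wrapped with a few sanity checks so that a mistyped path fails early instead of copying to the wrong place. A sketch only: the function names are hypothetical, and every path argument is a placeholder for your own installation folders. Quoting the variables keeps paths containing spaces (such as "Greenstone 2.84") intact.

```shell
# d) Unpack the PDF-Box extension inside the new installation's ext folder.
# The argument is the path of the Greenstone 2.84 installation (placeholder).
unpack_pdfbox_extension() (
    new_home=$1
    cd "$new_home/ext" || exit 1
    if [ ! -f pdf-box-java.tar.gz ]; then
        echo "pdf-box-java.tar.gz not found in $new_home/ext" >&2
        exit 1
    fi
    tar -xvzf pdf-box-java.tar.gz
)

# e) Copy one collection from the old installation into the new one.
copy_collection() {
    old_home=$1
    new_home=$2
    collection=$3
    if [ ! -d "$old_home/collect/$collection" ]; then
        echo "no such collection: $old_home/collect/$collection" >&2
        return 1
    fi
    if [ ! -d "$new_home/collect" ]; then
        echo "no collect folder in $new_home" >&2
        return 1
    fi
    cp -r "$old_home/collect/$collection" "$new_home/collect/."
}

# Usage (placeholder paths and collection name):
#   unpack_pdfbox_extension "/full/path/to/Greenstone 2.84"
#   copy_collection /full/path/to/old-greenstone "/full/path/to/Greenstone 2.84" mypdfs
```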
>
>
> If you still have problems after trying the above, then, in your next
> e-mail will you send us the error messages from the build output? And
> also tell us what version of Greenstone you are using and what
> particular Linux (Ubuntu for example).
>
> Best of luck,
> Anupama
>
> > Sandton Consulting Ltd wrote:
>
> > Hello,
> > We are a company interested in Greenstone Digital library software.
> > We have this challenge with pdf files in the Greenstone linux
> > environment. When we attach pdf files and go on to Design and Build,
> > the pdf files are rejected. The same pdf files are accepted
> > and we build with them in Greenstone for Windows.
> >
> > Please assist us to solve this problem of Design and Build of pdf
> > files in Greenstone for the linux environment.
> >
> > Please, it is urgent.
> >
> > Thanking you
> >
> > Boye Adesanya


--
************

John B. Rose
1 Bis rue des Châtre-Sacs
92310 Sèvres, France

Email: john.rose1@free.fr
Alternate email: johnrose@alumni.caltech.edu