Re: Formatting Issues of Documents

From John R. McPherson
Date Fri, 21 Mar 2003 12:08:18 +1200
Subject Re: Formatting Issues of Documents
In-Reply-To (20030314054419-82567-qmail-web40909-mail-yahoo-com)
atta wrote:
>
> Dear Sir,
> I am in the process of building a digital library for UN Systems in Pakistan.
> I have uploaded my digital documents in various formats, but unfortunately the
> display is not as appealing as that of the examples available on the official
> GSDL website. I am using documents in PDF and Microsoft Word format. In my
> opinion, most of the examples shown on the official site use HTML documents,
> which is why they are properly formatted. What do you say about that?
> You can check my site at the following URL; kindly tell me what I can
> do to improve the formatting of PDF and Word documents so that they appear (in
> the converted HTML format) just like the original source documents.
> http://unintra.un.org.pk/gsdl/cgi-bin/library.exe
>
> Also, please let me know why my MS Word documents show the word HYPERLINK
> instead of showing the text as a hyperlink. (You will have to visit the
> above URL to understand what I really mean.)
> Thanking you in anticipation,

Because the PDF and MS Word (.doc) formats are complicated (and, in the
case of .doc, closed and proprietary), we rely on third-party converters
(pdftohtml and wvWare) to convert them to HTML. PDF and .doc are designed
to control the visual layout, and some of that layout information is
lost when transforming to HTML. Sometimes these tools cannot even
extract the text. Some features (such as MS Word hyperlinks) do not
appear to be supported in the current versions of these tools.

You can download the .DOC and .PDF files used in our demonstration
collection and use pdftohtml and wvWare to convert them yourself.
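
If you want to try the converters outside of Greenstone, something like
the following rough sketch might help. It is only an illustration, not
what Greenstone itself runs: the function names build_command and convert
are made up for this example, and it assumes that pdftohtml and wvHtml
(the HTML front end of wvWare) are installed and on your PATH. The
command lines reflect the basic usage of each tool as I understand it,
so please check them against the documentation for your installed
versions.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_command(source: Path, out_dir: Path) -> List[str]:
    """Return a converter command line for a .pdf or .doc file.

    Hypothetical helper for illustration only. Assumed usage:
      pdftohtml <input.pdf> <output-basename>
      wvHtml --targetdir=<dir> <input.doc> <output.html>
    """
    suffix = source.suffix.lower()
    if suffix == ".pdf":
        return ["pdftohtml", str(source), str(out_dir / source.stem)]
    if suffix == ".doc":
        return [
            "wvHtml",
            f"--targetdir={out_dir}",
            str(source),
            source.stem + ".html",
        ]
    raise ValueError(f"unsupported file type: {source}")


def convert(source: Path, out_dir: Path) -> bool:
    """Run the converter; return True only if it exited successfully."""
    cmd = build_command(source, out_dir)
    if shutil.which(cmd[0]) is None:
        # The converter is not installed, so there is nothing to run.
        return False
    out_dir.mkdir(parents=True, exist_ok=True)
    return subprocess.run(cmd).returncode == 0
```

Calling convert(Path("report.pdf"), Path("html_out")) would then attempt
the conversion and report whether the tool succeeded. Comparing the HTML
it produces with what your collection displays should show whether a
formatting problem comes from the converter or from somewhere else.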

In summary, depending on the advanced features used to create the
PDF and .doc files, some might be converted to HTML with minimal
loss of information, while others might lose formatting information or
even the textual content.

Hope this helps,
John McPherson