Re: converting from .doc to HTML

From John R. McPherson
Date Mon, 1 Apr 2002 23:07:48 +1200
Subject Re: converting from .doc to HTML
In-Reply-To (Pine-LNX-4-33-0204011244170-7102-100000-trinetra-ncb-ernet-in)
On Mon, Apr 01, 2002 at 12:45:45PM +0530, Jayalakshmi-DS wrote:
> Hi,
> We have a problem converting .doc files to html. The WordPlug doesn't
> seem to convert the docs even though the import process completes fine.
> When we click on the icon in the A-Z list what we see is some gibberish,
> not in English.
> We have tried using the _input-encoding option with ascii and utf
> formats. It just doesn't seem to extract the text properly. The .doc files
> are saved in Word Document format with language encoding Western European
> (Windows).
> What could be the problem? Thanks in advance.
> _Jayalakshmi

If wvWare (the third-party program we use) can't successfully
convert an MS Word file into HTML, we fall back to running the unix
"strings" command over the file. That extracts every run of
printable characters, which is why you see the "gibberish".
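
To make the fallback concrete, here is a rough Python sketch of what a
"strings"-style pass does. It is only an illustration, not the code
our import process runs: the function name extract_strings and the
minimum run length of 4 (the usual default for the unix tool) are
assumptions made for the example.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII found in arbitrary binary data.

    Hypothetical helper for illustration: it mimics the general idea of
    the unix "strings" command, not its exact behaviour.
    """
    # A run is min_len or more bytes that are tab or space through tilde.
    pattern = re.compile(rb"[\t\x20-\x7e]{" + str(min_len).encode() + rb",}")
    return [run.decode("ascii") for run in pattern.findall(data)]


if __name__ == "__main__":
    sample = b"\x00\x01Hello world\x00\xff\x02ab\x00Word97\x00"
    for run in extract_strings(sample):
        print(run)
```

Because the pass knows nothing about the Word file layout, it also
picks up font names, field codes and other internal strings, so the
result reads as noise rather than as the document text.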

Perhaps your files were created using Word 2000 or Word XP? The
converter can only handle MS Word versions 2 up to 97.
The people who write wvWare do a good job, but unfortunately it
is a moving target with proprietary formats like MS Word.

Another possibility is that the files have a .DOC extension but
aren't proper Word files. For example, you can give a plain text
file or a Rich Text Format (RTF) file a .DOC extension and MS Word
will still open it. I think our import process detects this case
and works if the file is in fact RTF, but I don't know how many
other formats it works for.
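
If you want to check this on your own files, one way is to look at the
first few bytes rather than the extension. The sketch below is an
assumption about how such a check could be done, not a description of
what our import process does; sniff_doc_type is a made-up name. It
relies on two well-known signatures: RTF files start with "{\rtf", and
OLE2 compound files (the container used by Word 97, among others)
start with the bytes D0 CF 11 E0 A1 B1 1A E1.

```python
# Signature of an OLE2 compound file. Other Office applications use the
# same container, so a match does not prove the file is a Word document.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")
# Every RTF file begins with this control word.
RTF_MAGIC = b"{\\rtf"


def sniff_doc_type(header: bytes) -> str:
    """Classify a file from its leading bytes (hypothetical helper)."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(RTF_MAGIC):
        return "rtf"
    # Plain text, older pre-OLE Word formats, or anything else.
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 8 bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return sniff_doc_type(fh.read(8))
```

A result of "unknown" for a file that Word opens happily would suggest
it is plain text or some other format hiding behind the .DOC
extension.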

John McPherson.