Re: [greenstone-devel] MS Office Drawing Object, again

From John R. McPherson
Date Wed, 23 Mar 2005 11:27:59 +1200
Subject Re: [greenstone-devel] MS Office Drawing Object, again
In-Reply-To (20050322070419-94009-qmail-web54204-mail-yahoo-com)
On Mon, 2005-03-21 at 23:04 -0800, Leho@nq wrote:
>
> Dear John McPherson,
>
> Thank you for answering at once! I agree with you that this isn't a
> problem with Greenstone. It's the tools like pdftohtml or wvware that
> have some limits. But I tested and GS worked well with Web archives!
> Even in the gsdl-workshop-materials workshop (1 day and 3 days - that
> I got from greenstone.org) they gave us some web archives (html_large
> and html_small) to test, and it's OK!
>
> I have uploaded my test collection with test.doc, test.pdf and test web
> archives to www.yousendit.com
> (http://s23.yousendit.com/d.aspx?id=2TDPU8B1YG8RW0TJZPSE34TDU9) so
> that someone will be able to experiment with it!

wvWare can't extract the drawing objects from the .DOC file at all.
pdftohtml doesn't extract them as separate images, but it can include
them as part of the page background if you use pdftohtml's "complex"
layout option. However, this uses absolute positioning of text inside
the HTML file, so it can sometimes cause problems (drawing some text
over other text, etc). You can test whether it works OK for your
documents by having

PDFPlug -complex

in your collect.cfg file.
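
As a rough sketch, assuming the usual collect.cfg layout in which each
plugin is listed on its own "plugin" line, the relevant part of the file
might look something like the following. The other plugin lines here are
only illustrative; keep whatever your collection already lists and just
add the -complex option to the PDFPlug line.

```
# Illustrative plugin list; only the PDFPlug line is the actual change.
plugin GAPlug
plugin HTMLPlug
plugin WordPlug
plugin PDFPlug -complex
plugin ArcPlug
plugin RecPlug
```

The collection needs to be rebuilt after editing collect.cfg for the
change to take effect.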

> If someone has compiled a version of wvware on Windows that supports
> extracting embedded images like Drawing Objects, please share it with
> me and everyone! Thank you so much!

I'm not aware of anyone doing so.

John