Word Metadata (was Re: my 8 questions)

From Illtud Daniel
Date Thu, 03 Oct 2002 12:21:26 +0100
Subject Word Metadata (was Re: my 8 questions)
In-Reply-To (3D9B8551-9E7FD6D6-cs-waikato-ac-nz)
"John R. McPherson" wrote:
>
> > Phan Vo Minh Thang wrote:

> > 5. About assigning the value for metadata.
> > You can see that the Dublin Core information for each document in MSWord
> > format is available inside the document's own properties (MSWord
> > supports that), so how can we reuse that information? I looked at the wvWare
> > output file. Only the Title field is available. So what can the solution be?
>
> Since MS Word is a proprietary format, we use an external 3rd-party
> conversion tool. I don't think wvware can extract this metadata. If you
> figure it out, tell the wvware authors how you did it!

wvware can do this (I've no idea if the plugin can). There's a
utility in the wvware package called wvSummary. From the man page:

wvSummary(1) wvSummary(1)

NAME
wvSummary - view word document's summary info

SYNOPSIS
wvSummary word_doc

DESCRIPTION
wvSummary displays the summary info included in all MSWord
documents. This often includes the Author, time created,
etc...

Here's a sample output:

[ild@bychan pc]$ wvSummary Test.doc
The title is Test Document
The subject is Just to test the document property extraction
The author is Illtud Daniel
The keywords are wvware greenstone word bill gates satan
The comments are This is the comment field, which can
contain lots of words.
The template was Normal.dot
The last author was Illtud Daniel
The rev # was 2
The app name was Microsoft Word 8.0
PageCount is 1
WordCount is 38
CharCount is 130
Security is 0
Codepage is 0x4e4 (1252)

It hasn't picked up the custom property fields which I added,
but I don't think adding that would be much work, given that
wvSummary can already do the standard fields.

It should be possible to write a plugin that uses wvSummary to
extract the properties into metadata for Greenstone.
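
As a rough illustration of the parsing half of that idea, here is a minimal
Python sketch (not a Greenstone plugin, which would be written in Perl). It
assumes the output always looks like the wvWare 0.6.5 sample above, i.e.
"The <field> is/are/was <value>" or "<Field> is <value>", and that any other
line continues the previous value. The function names parse_wvsummary and
read_summary are made up for this example.

```python
import re
import subprocess

# Patterns inferred from the sample output above (wvWare 0.6.5).
# Assumption: other versions may word these lines differently.
_THE_LINE = re.compile(r"^The (.+?) (?:is|are|was|were) (.*)$")
_PLAIN_LINE = re.compile(r"^(\w+) is (.*)$")


def parse_wvsummary(text):
    """Turn wvSummary's text output into a {field: value} dict."""
    fields = {}
    last = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = _THE_LINE.match(line) or _PLAIN_LINE.match(line)
        if m:
            last = m.group(1)
            fields[last] = m.group(2)
        elif last is not None:
            # Assumption: an unrecognised line continues the previous value
            # (e.g. a long comment spread over several lines).
            fields[last] += " " + line
    return fields


def read_summary(path):
    """Run wvSummary on a file and parse what it prints (needs wvWare installed)."""
    out = subprocess.run(
        ["wvSummary", path], capture_output=True, text=True, check=True
    )
    return parse_wvsummary(out.stdout)
```

Calling read_summary("Test.doc") would then give a dictionary keyed by names
such as "title", "author" and "keywords", which a plugin could map onto
whatever metadata names the collection uses. One known weakness: a
continuation line that itself starts "Something is ..." would be mistaken for
a new field.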

The above is all based on wvWare 0.6.5, and the current release
is 0.7.2, but the summary stuff doesn't seem to have changed. There
*is* some sort of xml format for specifying the output, and
the docs suggest that you could write an xml file which would
dump the info straight into an xml format of choice (gml?),
but I can't find documentation on how to produce the xml file -
there are many examples, though.

The custom properties can be read from the OLE stream - see
this translated Italian page from MS:
http://translate.google.com/translate?u=http%3A%2F%2Fsupport.microsoft.com%2Fdefault.aspx%3Fscid%3Dkb%253Bit%253BI14944&langpair=it%7Cen&hl=en&ie=UTF-8&oe=UTF-8&safe=off&prev=%2Flanguage_tools

The references section contains:

Information on OLE Structured Storage is in the MS Press text: "Inside
OLE - Second Edition", Brockschmidt, Chapter 7: Structured
Storage.
Information on how to access the 'Summary Information Property
Set' of an OLE Compound Document is on MSDN in: Platform SDK > COM and
ActiveX Object Services > COM > Structured Storage > Using Property Set.

I might have a look at the book mentioned - the libole2 stuff in
wvWare is simple enough even for me to understand. It'll be on
the shelves here somewhere!

--
Illtud Daniel illtud.daniel@llgc.org.uk
Uwch Ddadansoddwr Systemau Senior Systems Analyst
Llyfrgell Genedlaethol Cymru National Library of Wales
Yn siarad drosof fy hun, nid LlGC - Speaking personally, not for NLW