[greenstone-users] Instructions for improved Word, PPT, PDF Plugins

From Chi-Yu Huang
Date Thu, 18 Aug 2005 10:06:40 +1200
Subject [greenstone-users] Instructions for improved Word, PPT, PDF Plugins
Hi everyone,

Here are some simple instructions for trying out the improved Word, PPT and PDF plugins.

WordPlug: Switching on -windows_scripting uses VBScript to convert the Word document to HTML. It also lets you define your own heading styles for up to three levels and extract metadata from the document. These features are only available on Windows.

Example:
WordPlug -windows_scripting -level1_header (level1Header1|level1Header2|...) -level2_header (level2Header1|level2Header2|...) -extracted_word_metadata_fields Author<Creator>,Subject,Keyword<Subject>,...

Each regular expression lists the user-defined heading styles that may occur in the documents you have collected. The default is to split the document on <H1>, <H2> tags and so forth. Word documents that use the built-in "Heading 1", "Heading 2" styles are automatically mapped to <H1>, <H2> respectively. If your Word documents already use these styles, you do not need to set these options: the enhanced hierarchical sectioning happens automatically.
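As a rough illustration of how such patterns pick a section level, here is a minimal Python sketch. It is not WordPlug's own code (the plugin itself is not shown here); the style names "Chapter Title", "Part Title" and "Section Title" are made up, and matching the whole style name is an assumption.

```python
import re

def heading_level(style_name, level_patterns):
    """Return the 1-based level whose pattern matches style_name, else None.

    level_patterns holds one regular expression per level, in the same
    (name1|name2|...) form as the -level1_header / -level2_header values.
    Assumption: the whole style name has to match the pattern.
    """
    for level, pattern in enumerate(level_patterns, start=1):
        if re.fullmatch(pattern, style_name):
            return level
    return None

# Hypothetical heading styles, for illustration only.
patterns = ["(Chapter Title|Part Title)", "(Section Title)"]

print(heading_level("Part Title", patterns))
print(heading_level("Section Title", patterns))
print(heading_level("Body Text", patterns))
```

A paragraph whose style matches no pattern (here "Body Text") is treated as ordinary text rather than a section heading.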

The extracted_word_metadata_fields option takes a comma-separated list of metadata fields. This works similarly to HTMLPlug's metadata_fields. Use 'tag<tagname>' to have the contents of the first 'tag' put in a metadata element called 'tagname'. Capitalise tagname as you want the metadata capitalised in Greenstone, since the tag extraction is case insensitive. This option is only available when windows_scripting is on.
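To make the 'tag<tagname>' notation concrete, here is a small Python sketch of how such a list could be read. It only illustrates the format described above and is not the plugin's actual parser; the function name parse_metadata_fields is made up.

```python
def parse_metadata_fields(spec):
    """Map each document tag (lower-cased) to the metadata element name.

    'Author<Creator>' stores the Author tag as Creator; a bare 'Subject'
    keeps its own name. Tags are lower-cased because the extraction is
    case insensitive; the element name keeps the capitalisation given.
    """
    mapping = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if "<" in item and item.endswith(">"):
            tag, name = item[:-1].split("<", 1)
        else:
            tag = name = item
        mapping[tag.strip().lower()] = name.strip()
    return mapping

fields = parse_metadata_fields("Author<Creator>,Subject,Keyword<Subject>")
print(fields)
```

With this input, both the Subject and Keyword tags end up in the Subject metadata element, and Author ends up in Creator.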

PPTPlug: Under Windows, PowerPoint documents can be converted to HTML, TEXT, GIF, PNG or JPEG. Under Linux, they can only be converted to HTML and TEXT. To enable conversion to the image types on Windows, you need to switch on windows_scripting. Once the document has been converted, it can be processed by PagedImgPlug, and each slide in the PPT document will be displayed as a single image.

Example:
PPTPlug -windows_scripting -convert_to pagedimg_gif

PDFPlug: A PDF document can be converted to HTML, TEXT, GIF, JPEG and PNG formats. Conversion to the image formats uses the convert utility from the cross-platform, open source ImageMagick software package, so it requires ImageMagick to be installed. ImageMagick is bundled with the CD-ROM distribution of Greenstone, is typically already present on Linux systems, and can as a last resort be downloaded from the ImageMagick web site, www.imagemagick.org. Once the document has been converted, it can be processed by PagedImgPlug, and each page will be viewed as a single image.

Example:
PDFPlug -convert_to pagedimg_gif
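Since the image conversion depends on ImageMagick, it can be worth checking that the convert utility is on your PATH before building. A minimal Python sketch, assuming the command is simply called convert (the helper name find_convert is made up, and Greenstone may locate the utility differently):

```python
import shutil

def find_convert(command="convert"):
    """Return the full path of the given command if it is on PATH, else None.

    Caveat: Windows ships its own, unrelated convert.exe (a disk utility),
    so on Windows a hit here does not prove that ImageMagick is installed.
    """
    return shutil.which(command)

path = find_convert()
if path is None:
    print("convert not found on PATH; ImageMagick may not be installed")
else:
    print("convert found at", path)
```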


Hope these instructions are useful. Please let us know if any of this does not work the way it should. It will help us to improve and stabilise the new features.



Regards,
Chi