Re: [greenstone-devel] Duplicate meta field in HTML

From Katherine Don
Date Fri, 15 Aug 2003 09:15:20 +1200
Subject Re: [greenstone-devel] Duplicate meta field in HTML
In-Reply-To (5944ef4c-b35640d7-818ba00-express-cites-uiuc-edu)
Hi,

I assume that you are using the modified version to extract metadata from meta tags in the HTML header. HTMLPlug can do two things, but it can't do both at once:
1. Extract metadata from the header.
2. Use Section tags that have been inserted in the documents.

I think that you probably have -description_tags specified in your collect.cfg file next to HTMLPlug. This means it will be doing 2 and not 1. To get it to do 1, remove that option from the config file.
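
For example, assuming the HTMLPlug line in your collect.cfg currently looks like the commented-out line below (the exact line in your file may differ), the change would be roughly:

```
# before: Section tags are used, so header metadata is not extracted
# plugin HTMLPlug -description_tags

# after: option removed, so the plugin can extract metadata from the header
plugin HTMLPlug
```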

Having said that, if the plugin can't find any Section tags at all, it _will_ try to extract metadata. Judging by the warning you get ("appears to contain no Section tags so will be processed as a single section document"), the plugin should have tried to extract metadata.

Another thing to remember is that by default it will only extract Title metadata, so if you have other fields, you need to add -metadata_fields <comma separated list of metadata> to the HTMLPlug options.
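
As a rough example, the plugin line might then look like this. Author and Subject are only placeholder field names, so use whatever names appear in your own meta tags:

```
plugin HTMLPlug -metadata_fields Title,Author,Subject
```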

Hope this helps,
Katherine Don


xiaohu@uiuc.edu wrote:

> Dear Katherine, Greg and all:
>
> Sorry for bothering you again. I modified HTMLPlugin.pm as Katherine and Greg said, but couldn't make it work. I examined the new HTMLPlugin.pm and diffed it against the original one. I am sure there is no syntax error in my new HTMLPlugin.pm. As I am not an expert in Perl, could you please give me a hint as to what the problem might be?
>
> Here is the output Greenstone gives for the import.pl command:
>
> HTMLPlug: processing 19980417v104i23.html
> HTMLPlug: WARNING: 19980417v104i23.html appears to contain no Section tags so
> will be processed as a single section document
> Use of uninitialized value at /usr/local/gsdl/perllib/plugins/HTMLPlug.pm line 277.
> Use of uninitialized value at /usr/local/gsdl/perllib/plugins/HTMLPlug.pm line 283.
> Use of uninitialized value at /usr/local/gsdl/perllib/doc.pm line 830.
>
> ***********
> Actually, lines 277 and 283 are not in the extract_meta method, which is the one I modified.
>
> And after the import process, doc.xml in the /archives directory contains no metadata besides the following:
>
> <Description>
> <Metadata name="gsdlsourcefilename">/usr/local/gsdl/collect/argus8/import/19980410v104i22.html</Metadata>
> <Metadata name="gsdldoctype">indexed_doc</Metadata>
> <Metadata name="Language">en</Metadata>
> <Metadata name="Encoding">iso_8859_1</Metadata>
> <Metadata name="Source">19980410v104i22.html</Metadata>
> <Metadata name="URL">http://19980410v104i22.html</Metadata>
> <Metadata name="Identifier">HASH0140666be3ef98900434117a</Metadata>
> </Description>
>
>
> Thank you very much! Any idea will be greatly appreciated!!
>
> Xiao
>