RE: [greenstone-devel] Duplicate meta field in HTML

From: Gregory S. Williamson
Date: Thu, 14 Aug 2003 12:51:40 -0700
Subject: RE: [greenstone-devel] Duplicate meta field in HTML

I've attached a copy of the HTML plugin module that I have used. Try renaming yours and putting this one in its place. If you are on a *nix machine you can run

    diff progname1 progname2

and it will show the differences between the two files.

Offhand, it sounds like something got deleted from a procedure.

HTH,

Greg Williamson

-----Original Message-----
From: xiaohu@uiuc.edu [mailto:xiaohu@uiuc.edu]
Sent: Thu 8/14/2003 9:53 AM
To: Katherine Don; Gregory S. Williamson
Cc: Greenston Digital Library
Subject: Re: [greenstone-devel] Duplicate meta field in HTML

Dear Katherine, Greg and all:

Sorry to bother you again. I modified HTMLPlug.pm as Katherine and Greg suggested, but couldn't make it work. I checked the new HTMLPlug.pm and diffed it against the original one, and I am sure there is no syntax error in my new version. As I am not an expert in Perl, could you please give me a hint about what the problem might be?

Here is the Greenstone output from the import.pl command:

HTMLPlug: processing 19980417v104i23.html
HTMLPlug: WARNING: 19980417v104i23.html appears to contain no Section tags so
will be processed as a single section document
Use of uninitialized value at /usr/local/gsdl/perllib/plugins/HTMLPlug.pm line 277.
Use of uninitialized value at /usr/local/gsdl/perllib/plugins/HTMLPlug.pm line 283.
Use of uninitialized value at /usr/local/gsdl/perllib/doc.pm line 830.

***********
Actually, lines 277 and 283 are not in the extract_metadata method, which is the only one I modified.
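
For readers unfamiliar with this warning: Perl prints "Use of uninitialized value" when warnings are enabled and an undefined value is used in a string or numeric operation, for example a hash entry that was never set. The sketch below is only an illustration of that mechanism, not the Greenstone code at lines 277 or 283. The %plugin hash and its 'file_type' key are hypothetical names invented for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a plugin object; 'file_type' is deliberately never set.
my %plugin = (verbosity => 3);

# Collect warnings in an array so they can be inspected instead of going to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Concatenating an undefined value triggers the warning.
my $unguarded = "type: " . $plugin{file_type};

# Checking with defined() first avoids it.
my $guarded = "type: "
    . (defined $plugin{file_type} ? $plugin{file_type} : "unknown");

print scalar(@warnings), " warning(s) collected\n";
print $warnings[0] if @warnings;
```

The warning is not fatal, so the import can continue after it, but it usually means that some value the code expects (an option, a regex capture, a hash entry) was never filled in.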

After the import process, doc.xml in the archives directory contains no metadata besides the following:

<Description>
<Metadata name="gsdlsourcefilename">/usr/local/gsdl/collect/argus8/import/19980410v104i22.html</Metadata>
<Metadata name="gsdldoctype">indexed_doc</Metadata>
<Metadata name="Language">en</Metadata>
<Metadata name="Encoding">iso_8859_1</Metadata>
<Metadata name="Source">19980410v104i22.html</Metadata>
<Metadata name="URL">http://19980410v104i22.html</Metadata>
<Metadata name="Identifier">HASH0140666be3ef98900434117a</Metadata>
</Description>


Thank you very much! Any idea will be greatly appreciated!!

Xiao

---- Original message ----
>Date: Wed, 30 Jul 2003 10:50:18 +1200
>From: Katherine Don <kjdon@cs.waikato.ac.nz>
>Subject: Re: [greenstone-devel] Duplicate meta field in HTML
>To: xiaohu@uiuc.edu
>Cc: Greenston Digital Library <greenstone-devel@list.scms.waikato.ac.nz>
>
>hi Xiao
>
>I have appended a revised version of the extract_metadata method that will extract both (or more) values. You need to replace the existing extract_metadata method in perllib/plugins/HTMLPlug.pm with this new one.
>
>hope this helps,
>Katherine Don.
>
>xiaohu@uiuc.edu wrote:
>
>> Hi, dear colleagues,
>>
>> I am building a collection of HTML files, and my files have duplicate meta fields like:
>>
>> <meta name="People" content="James Bond">
>> <meta name="People" content="Tom Hanks">
>> ....
>>
>> I did add the -metadata_fields option of HTMLPlug in the configuration settings, but the plugin only processed the first of the duplicate metadata values. (For the above example, Greenstone only picked up "James Bond".) My Greenstone version is 2.38.
>>
>> Did you ever encounter this problem? How did you solve it?
>>
>> Thank you very much!
>>
>> Xiao
>> Xiao Hu
>> *******************************
>> Graduate student
>> Graduate School of Library and Information Science
>> University of Illinois at Urbana-Champaign
>> *******************************
>>
>> _______________________________________________
>> greenstone-devel mailing list
>> greenstone-devel@list.scms.waikato.ac.nz
>> https://list.scms.waikato.ac.nz/mailman/listinfo/greenstone-devel
>
>sub extract_metadata {
>    my $self = shift (@_);
>    my ($textref, $metadata, $doc_obj, $section) = @_;
>    my $outhandle = $self->{'outhandle'};
>    # if we don't want metadata, we may as well not be here ...
>    return if (!defined $self->{'metadata_fields'});
>
>    # hunt for an author look in the metadata elements:
>    if (defined $self->{'hunt_creator_metadata'}) {
>        for my $name (split /,/, "AUTHOR,AUTHOR.EMAIL,CREATOR,DC.CREATOR,DC.CREATOR.CORPORATENAME") {
>            if ($$textref =~ /<meta(\s*?)(?:name|http-equiv)\s*=\s*"?$name"?([^>]*)/is) {
>                my $content = $1 . $2;
>                if ($content =~ /content\s*=\s*"?(.*)"?/is) {
>                    if (defined $1) {
>                        my $value = $1;
>                        $value =~ s/"$//;
>                        $value =~ s/\s+/ /gs;
>                        $doc_obj->add_utf8_metadata($section, "Creator", $value);
>                        print $outhandle " extracted Creator metadata \"$value\"\n"
>                            if ($self->{'verbosity'} > 2);
>                        next;
>                    }
>                }
>            }
>        }
>    }
>
>    foreach my $field (split /,/, $self->{'metadata_fields'}) {
>        my $found = 0;
>        # don't need to extract field if it was passed in from a previous
>        # (recursive) plugin
>        next if defined $metadata->{$field};
>
>        # see if there's a <meta> tag for this field
>        while ($$textref =~ /<meta(\s*?)(?:name|http-equiv)\s*=\s*"?$field"?([^>]*)/isg) {
>            my $content = $1 . $2;
>            if ($content =~ /content\s*=\s*"?(.*)"?/is) {
>                if (defined $1) {
>                    my $value = $1;
>                    $value =~ s/"$//;
>                    $value =~ s/\s+/ /gs;
>                    $value =~ s/".*//gs;
>                    $doc_obj->add_utf8_metadata($section, $field, $value);
>                    print $outhandle " extracted \"$field\" metadata \"$value\"\n"
>                        if ($self->{'verbosity'} > 2);
>                    $found = 1;
>                }
>            }
>        }
>        next if $found;
>        # TITLE: extract the document title
>
>        if ($field =~ /^title$/i) {
>
>            # see if there's a <title> tag
>            if ($$textref =~ /<title[^>]*>([^<]*)<\/title[^>]*>/is) {
>                if (defined $1) {
>                    my $title = $1;
>                    if ($title =~ /\w/) {
>                        $title =~ s/<[^>]*>/ /g;
>                        $title =~ s/&nbsp;/ /g;
>                        $title =~ s/\s+/ /gs;
>                        $title =~ s/^\s+//;
>                        $title =~ s/\s+$//;
>                        $doc_obj->add_utf8_metadata ($section, $field, $title);
>                        print $outhandle " extracted \"$field\" metadata \"$title\"\n"
>                            if ($self->{'verbosity'} > 2);
>                        next;
>                    }
>                }
>            }
>
>            # if no title use first 100 characters
>            my $tmptext = $$textref;
>            $tmptext =~ s/<\/([^>]+)><\1>//g; # (eg) </b><b> - no space
>            $tmptext =~ s/<[^>]*>/ /g;
>            $tmptext =~ s/(?:&nbsp;|\xc2\xa0)/ /g; # utf-8 for nbsp...
>            $tmptext =~ s/^\s+//s;
>            $tmptext =~ s/\s+$//;
>            $tmptext =~ s/\s+/ /gs;
>            $tmptext =~ s/^$self->{'title_sub'}// if ($self->{'title_sub'});
>            $tmptext =~ s/^\s+//s; # in case title_sub introduced any...
>            $tmptext = substr ($tmptext, 0, 100);
>            $tmptext =~ s/\s\S*$/.../;
>            $doc_obj->add_utf8_metadata ($section, $field, $tmptext);
>            print $outhandle " extracted \"$field\" metadata \"$tmptext\"\n"
>                if ($self->{'verbosity'} > 2);
>            next;
>        }
>
>        # tag: extract the text between the first <H1> and </H1> tags
>        if ($field =~ /^tag[a-z0-9]+$/i) {
>
>            my $tag = $field;
>            $tag =~ s/^tag//i;
>            my $tmptext = $$textref;
>            $tmptext =~ s/\s+/ /gs;
>            if ($tmptext =~ /<$tag[^>]*>/i) {
>                foreach my $word ($tmptext =~ m/<$tag[^>]*>(.*?)<\/$tag[^>]*>/g) {
>                    $word =~ s/&nbsp;/ /g;
>                    $word =~ s/<[^>]*>/ /g;
>                    $word =~ s/^\s+//;
>                    $word =~ s/\s+$//;
>                    $word =~ s/\s+/ /gs;
>                    if ($word ne "") {
>                        $doc_obj->add_utf8_metadata ($section, $tag, $word);
>                        print $outhandle " extracted \"$tag\" metadata \"$word\"\n"
>                            if ($self->{'verbosity'} > 2);
>                    }
>                }
>            }
>            next;
>        }
>    }
>}
>
>
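
The revised method quoted above loops over the <meta> tags with while and the /g modifier, which is what lets it pick up every value of a repeated field rather than only the first. The standalone sketch below illustrates that idea on the two "People" tags from the original question. It is a simplified illustration, not the plugin code: the helper names all_meta_values and first_meta_value are hypothetical, and the pattern assumes the name attribute comes before a double-quoted content attribute.

```perl
use strict;
use warnings;

# Sample input modelled on the duplicate <meta> tags from the original question.
my $html = <<'HTML';
<meta name="People" content="James Bond">
<meta name="People" content="Tom Hanks">
HTML

# Hypothetical helper: return every content value for the given meta field.
sub all_meta_values {
    my ($text, $field) = @_;
    my @values;
    # In scalar context a /g match resumes where the previous match ended,
    # so the loop visits each matching tag in turn.
    while ($text =~ /<meta\s+name\s*=\s*"?\Q$field\E"?\s+content\s*=\s*"([^"]*)"/isg) {
        push @values, $1;
    }
    return @values;
}

# Hypothetical helper: a single match without /g stops at the first tag.
sub first_meta_value {
    my ($text, $field) = @_;
    return $text =~ /<meta\s+name\s*=\s*"?\Q$field\E"?\s+content\s*=\s*"([^"]*)"/is
        ? $1
        : undef;
}

print join(", ", all_meta_values($html, "People")), "\n";
print first_meta_value($html, "People"), "\n";
```

The \Q...\E around $field keeps characters such as the dots in DC.CREATOR from being treated as regex metacharacters. The quoted method interpolates the field name directly, so this is a choice made for the sketch rather than a description of the plugin.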


<<attachment>>
Type: application/x-unknown-content-type-pm_auto_file
Filename: HTMLPlug.pm
