Re: [greenstone-users] How does the indexing of paragraph work?

From Katherine Don
Date Mon, 07 Jul 2003 09:39:11 +1200
Subject Re: [greenstone-users] How does the indexing of paragraph work?
As far as I know, if you build a paragraph index, a search still returns sections,
but they are selected based on the paragraphs inside them. If you do a ranked
(or "some") search, the resulting documents are ranked, but if you do a boolean (or
"and") search, the resulting docs are not ranked. For a ranked search, 'java C++'
should rank sections that have both words in one paragraph higher than sections
with only one of the words (it uses cosine ranking). For a boolean search,
'java AND C++' should return only sections containing a paragraph with both
words.
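
To make the difference concrete, here is a toy Python sketch of section
results selected by paragraph-level matching. It is not Greenstone's actual
code: the sample data, the function names, the plain term-frequency cosine
score, and the choice to score a section by its best paragraph are all
assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical collection: each section is a list of paragraphs.
SECTIONS = {
    "sec1": ["Java and C++ are both compiled languages.", "Unrelated text."],
    "sec2": ["Java runs on a virtual machine.", "C++ compiles to native code."],
    "sec3": ["Only Java is mentioned here."],
}

def tokens(text):
    # Lowercase word tokens; '+' is kept so that "C++" survives as "c++".
    return re.findall(r"[a-z0-9+]+", text.lower())

def boolean_and(query_terms):
    # A section matches if at least one of its paragraphs contains every term.
    hits = []
    for sec_id, paragraphs in SECTIONS.items():
        if any(all(t in tokens(p) for t in query_terms) for p in paragraphs):
            hits.append(sec_id)
    return hits

def cosine(query_terms, paragraph):
    # Cosine similarity between raw term-frequency vectors (no idf weighting).
    q = Counter(query_terms)
    d = Counter(tokens(paragraph))
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def ranked(query_terms):
    # Assumption: a section's score is the score of its best paragraph.
    scores = {}
    for sec_id, paragraphs in SECTIONS.items():
        best = max(cosine(query_terms, p) for p in paragraphs)
        if best > 0:
            scores[sec_id] = best
    return sorted(scores, key=scores.get, reverse=True)

print(boolean_and(["java", "c++"]))  # only sec1: both words share a paragraph
print(ranked(["java", "c++"]))       # sec1 comes first; sec2 and sec3 follow
```

Note that sec2 contains both words, but in different paragraphs, so the
boolean query skips it; with a section-level index it would match.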

The building code uses HTML <p> tags to indicate where paragraphs start, so you
don't need to add special tags.
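
As a rough illustration of what splitting on <p> tags means, here is a small
Python sketch. It only mimics the idea; the function name and the regular
expressions are hypothetical and are not taken from the Greenstone build
code.

```python
import re

def split_paragraphs(html):
    # Start a new paragraph at every opening <p> tag (attributes allowed).
    parts = re.split(r"<p\b[^>]*>", html, flags=re.IGNORECASE)
    # Remove closing </p> tags and surrounding whitespace.
    cleaned = (re.sub(r"</p\s*>", "", part, flags=re.IGNORECASE).strip()
               for part in parts)
    # Skip pieces that end up empty.
    return [p for p in cleaned if p]

sample = "<p>Java is a language.</p>\n<p class='x'>C++ is another.<P>Unclosed third."
for number, paragraph in enumerate(split_paragraphs(sample), start=1):
    print(number, paragraph)
```

Because only the opening tag marks a boundary, a paragraph without a closing
</p> is still picked up in this sketch.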

In collect.cfg, you need:
indexes paragraph:text
collectionmeta .paragraph:text "paragraph"

(Almost what you have written below; there should be no space in paragraph:text.)

Hope this helps.
Katherine

Kim Hsieh wrote:

> Hi,
> I realize that for my collection of html files, there's no significant
> difference between section and paragraph indexing search. The only noticeable
> difference is that the documents are returned in a different order. However,
> if I do a Boolean search such as "Java and C++", the paragraph searching
> option is not necessarily giving priority to those paragraphs containing both
> words Java and C++ in the same paragraph.
> Therefore, I am just wondering under what kind of situation will it be useful
> to have paragraph indexing and searching? And we know that in order for
> section searching to work, I need to have <section>...</section> type of tags
> inserted in the html files. So are there some similar tags that we need to
> insert into the html files, in order for paragraph indexing to work properly?
> My last question is, how do we specify that we want paragraph level indexing
> in collect.cfg?
> Is it like
> index paragraph: text
> ...
> collectionmeta .paragraph:text "paragraph"
> Are these the 2 lines we ever need for paragraph indexing and searching in
> collect.cfg?
> Thanks for the help,
> Kim Hsieh
