Re: MGPP

From Katherine Don
Date Fri, 24 Jan 2003 15:39:24 +1300
Subject Re: MGPP
In-Reply-To (0FBCE92BFFDBD4119BA300A0C9F2BEF2A4E726-SERVCORREO)
hi albert

I am not the person who implemented mgpp, so I can only give you my guess as to
why it was done the way it is.

apart from returning a list of documents and doing NEAR and phrase searches, we
also want to search metadata, and to search at different granularities. for
example, we want to be able to do all of the following queries on the same index:

find documents containing 'snail'
find sections containing 'snail'
find documents (or sections) that have 'snail' in the Title and 'Smith' in the
Author.
find paragraphs containing snail NEAR farming

with a word position list, identifying which document/section/paragraph a
particular word is in can all be handled the same way:
we store the start and end position of each document/section/paragraph and just
see which range the word position falls into.
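
The range lookup described above can be sketched roughly as follows. This is
only an illustration in Python, not mgpp's actual code or file format: the
range tables, the positions and the function names are all made up, and the
ranges are assumed to be sorted, non-overlapping and inclusive at both ends.

```python
from bisect import bisect_right

# Hypothetical (start, end) word-position ranges, one table per granularity.
# Assumed sorted by start, non-overlapping, inclusive at both ends.
DOCUMENT_RANGES = [(0, 99), (100, 249)]
SECTION_RANGES = [(0, 39), (40, 99), (100, 179), (180, 249)]


def find_unit(ranges, position):
    """Return the index of the range that contains position, or None."""
    starts = [start for start, _ in ranges]
    i = bisect_right(starts, position) - 1
    if i >= 0 and ranges[i][0] <= position <= ranges[i][1]:
        return i
    return None


def units_containing(ranges, positions):
    """Map a word position list onto the units (documents, sections, ...) it hits."""
    hits = {find_unit(ranges, p) for p in positions}
    hits.discard(None)
    return sorted(hits)


# Assumed word position list for 'snail' (global positions, no document numbers).
snail_positions = [12, 57, 181]

documents_with_snail = units_containing(DOCUMENT_RANGES, snail_positions)
sections_with_snail = units_containing(SECTION_RANGES, snail_positions)
```

The same position list answers both the document-level and the section-level
query; only the range table passed in changes.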

the metadata stuff is also handled in this way - we store start and end tags
for each metadata element. so you can check whether a word position is inside a
metadata element, and which section it is inside.
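
A query like 'snail' in the Title and 'Smith' in the Author could then be
sketched like this. Again this is a hedged illustration with invented names and
positions, not the real mgpp structures: each metadata element is assumed to be
stored as an inclusive (start, end) pair of word positions.

```python
# Hypothetical tag ranges: inclusive (start, end) word positions.
TITLE_RANGES = [(0, 4), (100, 103)]
AUTHOR_RANGES = [(5, 6), (104, 105)]
SECTION_RANGES = [(0, 99), (100, 249)]


def inside(ranges, position):
    """True if position falls inside any of the given ranges."""
    return any(start <= position <= end for start, end in ranges)


def section_of(position):
    """Return the index of the section containing position, or None."""
    for index, (start, end) in enumerate(SECTION_RANGES):
        if start <= position <= end:
            return index
    return None


def sections_with(positions, field_ranges):
    """Sections in which the word occurs inside the given metadata field."""
    found = {section_of(p) for p in positions if inside(field_ranges, p)}
    found.discard(None)
    return found


# Assumed word position lists.
snail_positions = [2, 57, 150]
smith_positions = [5, 105]

# Sections with 'snail' in the Title AND 'Smith' in the Author.
matches = sorted(
    sections_with(snail_positions, TITLE_RANGES)
    & sections_with(smith_positions, AUTHOR_RANGES)
)
```

Here only section 0 matches: 'snail' appears in its Title range and 'Smith' in
its Author range, while section 1 has 'Smith' as Author but no 'snail' in the
Title.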

I guess you could do all this with an adaptation of your doc num + offset
method, but our way was probably easier.

hope this helps
katherine

"Juhé, Albert" wrote:

> Hi
>
> I'm Albert, a Spanish programmer.
> I'm developing a C++ version of mg, and I have been looking at mgpp.
> I have a question about the implementation of the NEAR operator.
> The Greenstone implementation doesn't store word + document number + offset in
> the collections file,
> it only stores word and offset. This implementation is good for
> NEAR but bad for exact queries, which are very slow.
>
> Why not use the implementation explained in Managing
> Gigabytes?
>
> Word <freq,number Document1 [offset1,offset2,....],number Document2
> [offset1,offset2,....] , ... n >
>
> Thanks for all.