I am not the person who implemented mgpp, so I can only give you my guess as to
why it was done the way it is.
Apart from returning a list of documents and doing NEAR and phrase searches, we
also want to search metadata, and to search at different granularities. For
example, we want to be able to run all of the following queries on the same index:
find documents containing 'snail'
find sections containing 'snail'
find documents (or sections) that have 'snail' in the Title and 'Smith' in the
find paragraphs containing snail NEAR farming
With a word position list, identifying which document/section/paragraph a
particular word is in can be handled the same way in every case:
we store a start and end position for each document/section/paragraph and just
see which range the word position falls into.
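
As a rough illustration of the range lookup described above (not the actual
mgpp code), here is a minimal C++ sketch. The names Range and find_unit are
made up for this example, and it assumes the ranges at one granularity are
sorted by start position and do not overlap:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One indexed unit (document, section or paragraph), stored as a range of
// word positions. Hypothetical type, for illustration only.
struct Range {
    std::uint32_t start;  // position of the first word in the unit
    std::uint32_t end;    // position of the last word in the unit (inclusive)
};

// Returns the index of the range that contains pos, or -1 if none does.
// Assumes ranges is sorted by start and the ranges do not overlap.
int find_unit(const std::vector<Range>& ranges, std::uint32_t pos) {
    // Find the first range that starts after pos; the only possible match
    // is the range just before it.
    auto it = std::upper_bound(
        ranges.begin(), ranges.end(), pos,
        [](std::uint32_t p, const Range& r) { return p < r.start; });
    if (it == ranges.begin()) return -1;  // pos is before every range
    --it;
    assert(it->start <= pos);
    return pos <= it->end ? static_cast<int>(it - ranges.begin()) : -1;
}
```

The same function works for any granularity: pass it the document ranges to
get a document number, or the paragraph ranges to get a paragraph number,
for the same word position.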
The metadata is handled in the same way: we store start and end tags
for each metadata element, so you can check whether a word position is inside a
metadata element, and which section it is inside.
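
To make the metadata check concrete, here is a small C++ sketch under the
same assumptions (sorted, non-overlapping spans of word positions). Span,
containing and sections_with_word_in_field are hypothetical names for this
example only, and the real mgpp data structures may well look different:

```cpp
#include <algorithm>
#include <cstdint>
#include <set>
#include <vector>

// Inclusive range of word positions: either a section, or one occurrence of
// a metadata element (the positions of its start and end tags).
struct Span {
    std::uint32_t start;
    std::uint32_t end;
};

// Index of the span containing pos, or -1 if there is none.
// Assumes spans is sorted by start and the spans do not overlap.
int containing(const std::vector<Span>& spans, std::uint32_t pos) {
    auto it = std::upper_bound(
        spans.begin(), spans.end(), pos,
        [](std::uint32_t p, const Span& s) { return p < s.start; });
    if (it == spans.begin()) return -1;
    --it;
    return pos <= it->end ? static_cast<int>(it - spans.begin()) : -1;
}

// Sections in which the word occurs inside the given metadata field,
// e.g. sections that have 'snail' in the Title.
std::set<int> sections_with_word_in_field(
    const std::vector<std::uint32_t>& word_positions,
    const std::vector<Span>& field_spans,
    const std::vector<Span>& sections) {
    std::set<int> result;
    for (std::uint32_t pos : word_positions) {
        if (containing(field_spans, pos) < 0) continue;  // outside the field
        int sec = containing(sections, pos);
        if (sec >= 0) result.insert(sec);
    }
    return result;
}
```

A query on two fields would then just be the intersection of the two
section sets.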
I guess you could do all this with an adaptation of your doc num + offset
method, but our way was probably easier.
Hope this helps.
"Juhé, Albert" wrote:
> I'm Albert a Spanish programmer.
> I'm developing a C++ versión of mg, I have been seeing mgpp.
> I have a question about the implementation of the operator NEAR.
> The Greenstone implementation don't store word + document number + offset in
> the collections file,
> only store Word and Offset. This implementation is good for
> NEAR but is bad for exact queries, are very slow.
> Why don't use the implementation explained in Managing
> Gigabytes ?
> Word <freq,number Document1 [offset1,offset2,....],number Document2
> [offset1,offset2,....] , ... n >
> Thanks for all.