Re: [greenstone-devel] information about MGPP

From Katherine Don
Date Fri, 17 Jun 2005 09:06:05 +1200
Subject Re: [greenstone-devel] information about MGPP
In-Reply-To (1118838654-42b01f7e5dab5-mail-studenti-unicam-it)
Hi

MGPP uses inverted file indexing.
You have a list of the words occurring in the collection. Each word has a
list of the positions at which it occurs.
For a document level index, these positions are document numbers.
e.g.
snail 3, 6, 7
would mean that snail occurred in documents 3, 6 and 7. Per-document
counts might also be included.
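
As a rough illustration of the document level case, here is a minimal Python
sketch. It is not MGPP's code: the function name build_document_index, the
whitespace tokenizer and the sample documents are all made up for the example,
and the real index is stored compressed rather than as a Python dict.

```python
from collections import defaultdict

def build_document_index(docs):
    """Map each word to a sorted list of (document number, count) pairs.

    `docs` is {document number: text}. Splitting on whitespace and
    lowercasing are simplifications for illustration only.
    """
    counts = defaultdict(dict)
    for doc_num, text in docs.items():
        for word in text.lower().split():
            counts[word][doc_num] = counts[word].get(doc_num, 0) + 1
    return {word: sorted(per_doc.items()) for word, per_doc in counts.items()}

# Hypothetical sample collection in which "snail" occurs in documents 3, 6 and 7.
docs = {
    3: "the snail crossed the path",
    5: "no molluscs here",
    6: "a snail met another snail",
    7: "snail",
}
index = build_document_index(docs)
print(index["snail"])  # [(3, 1), (6, 2), (7, 1)]
```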

For a word level index, the positions are word positions.
Then you have another index which tells you where the document start and
end points are.
For example, you might have the document endings like
54, 377, 858
This would mean there are three documents: the first one includes words
0-54, the second 55-377 and the third 378-858.
Then if the inverted file contained
snail 266, 300, 700
you can tell that the second document has snail twice, and the third
document has snail once.
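
The lookup in that example can be sketched in Python as a binary search over
the document endings. The names document_of and per_document_counts are
invented for this sketch, and it assumes the endings are sorted and inclusive,
as in the numbers above; it says nothing about how MGPP stores them on disk.

```python
from bisect import bisect_left
from collections import Counter

def document_of(position, doc_ends):
    """Return the 1-based number of the document containing a word position.

    `doc_ends` is the sorted list of last word positions, one per document.
    """
    idx = bisect_left(doc_ends, position)
    if position < 0 or idx == len(doc_ends):
        raise ValueError(f"position {position} is outside the collection")
    return idx + 1

def per_document_counts(positions, doc_ends):
    """Count how often a word occurs in each document."""
    return Counter(document_of(p, doc_ends) for p in positions)

doc_ends = [54, 377, 858]          # document endings from the example
snail_positions = [266, 300, 700]  # word level postings for "snail"

counts = per_document_counts(snail_positions, doc_ends)
print(dict(counts))  # {2: 2, 3: 1}
```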
With a word level index you can do phrase queries and proximity queries
because you can find out the position of each search term and you can
tell if they are next to or near the other search terms.
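
A small Python sketch of how phrase and proximity checks could work on word
positions follows. The function names and the postings for "garden" and
"snail" are hypothetical sample data, and the rule that a match must not
cross a document boundary is an assumption of this sketch rather than a
statement about MGPP's query code.

```python
from bisect import bisect_left

def same_document(a, b, doc_ends):
    """True if word positions a and b fall inside the same document."""
    return bisect_left(doc_ends, a) == bisect_left(doc_ends, b)

def phrase_matches(first, second, doc_ends):
    """Positions of `first` that are immediately followed by `second`."""
    second_set = set(second)
    return [p for p in first
            if p + 1 in second_set and same_document(p, p + 1, doc_ends)]

def near_matches(a_positions, b_positions, max_distance, doc_ends):
    """Pairs of positions at most `max_distance` words apart, in one document."""
    return [(a, b) for a in a_positions for b in b_positions
            if 0 < abs(a - b) <= max_distance and same_document(a, b, doc_ends)]

doc_ends = [54, 377, 858]
# Hypothetical word level postings.
index = {"garden": [265, 377, 690], "snail": [266, 300, 378, 700]}

# "garden snail" as a phrase: 377/378 are adjacent but span two documents.
print(phrase_matches(index["garden"], index["snail"], doc_ends))    # [265]
# "garden" within 10 words of "snail".
print(near_matches(index["garden"], index["snail"], 10, doc_ends))  # [(265, 266), (690, 700)]
```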

MGPP also keeps a record of field positions, e.g. the start and end points
of metadata fields that are to be indexed. So you can do fielded
searching because you can check if a word occurrence falls inside the
range of a field.
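
The range check for fielded searching can be sketched the same way. The
function occurrences_in_field, the "Title" ranges and the postings here are
invented for illustration, and the sketch assumes field ranges are sorted,
non-overlapping and inclusive.

```python
from bisect import bisect_right

def occurrences_in_field(word_positions, field_ranges):
    """Keep only the word positions that fall inside one of the field ranges.

    `field_ranges` is a sorted list of non-overlapping (start, end) pairs,
    both ends inclusive.
    """
    starts = [start for start, _ in field_ranges]
    hits = []
    for pos in word_positions:
        # Find the last range that starts at or before this position.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos <= field_ranges[i][1]:
            hits.append(pos)
    return hits

# Hypothetical start and end points of a Title field in three documents.
title_ranges = [(0, 4), (55, 61), (378, 385)]
snail_positions = [58, 266, 300, 700]

print(occurrences_in_field(snail_positions, title_ranges))  # [58]
```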

I hope this gives you some idea of how it works.
The Managing Gigabytes book talks about all these concepts and I really
recommend it if you want to know how the index works.

Regards,
Katherine Don

BARBINI SAMUELE wrote:
> Thanks for this information.
>
> However, I need further help because the concept of a "word level
> index" is not clear to me.
>
> Can you briefly explain this index and how it works?
>
> Thanks a lot.
>
> Best regards
>
>
>>Hi
>>
>>MGPP is a C++ version of MG, and uses the same indexing and compression
>>algorithms. The best source of information about MG is the Managing
>>Gigabytes book.
>>http://www.cs.mu.oz.au/mg/
>>The main difference between the two is that MG uses a document level
>>index, while MGPP uses a word level index, and can therefore handle
>>sections and metadata fields of documents.
>>
>>There is no architectural documentation specific to MGPP.
>>
>>Regards,
>>Katherine Don
>>
>>BARBINI SAMUELE wrote:
>>
>>>I need some detailed information about the architectural aspects of MGPP.
>>>In particular I am interested in the indexing and compression algorithms.
>>>
>>>Where can I find documentation on these topics?
>>>
>>>Thank you in advance for any suggestions.
>>>
>>>Hi