Re: [greenstone-devel] re Sort order of search result list and browse list

From Ying-Hsang Liu
Date Mon, 3 Jul 2006 19:42:40 -0400
Subject Re: [greenstone-devel] re Sort order of search result list and browse list
In-Reply-To (44A49EBB-1050003-cs-waikato-ac-nz)
Hi

Thanks for the details about the ranking of search results.

I found a discrepancy in the ranking between a cross-collection
search (30 sub-collections) and a search that queries the library
program under cgi-bin directly. (A Boolean search with ranked results
was used in both cases.)

Here are the top 10 results:

Cross-collection search:
Word count: temperature: 132750, coli: 200346, low: 535519,
E: 646487, protein: 1129787
More than 50 documents matched the query.

[11873911]
[12950181]
[8631696]
[9406430]
[2838606]
[10949518]
[12403178]
[7750561]
[11101142]
[9766205]

Querying cgi-bin library:
Mon Jul 03 19:28:19 -0400 2006
c=genomics&a=q&q=[E coli ]:TX AND [protein ]:TX AND [low temperature ]:TX&m=1000&o=1000
11873911
12950181
9406430
8631696
7752247
12403178
12838606
7750561
10949518
9663545
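
To make the differences easier to see, the two top-10 lists above can be
compared position by position. This is only a rough sketch: the function
name compare_rankings is made up for illustration, and the IDs are copied
exactly as they appear above (so 2838606 and 12838606 are treated as
different documents).

```python
# Top-10 document IDs copied verbatim from the two result lists above.
cross = ["11873911", "12950181", "8631696", "9406430", "2838606",
         "10949518", "12403178", "7750561", "11101142", "9766205"]
cgi = ["11873911", "12950181", "9406430", "8631696", "7752247",
       "12403178", "12838606", "7750561", "10949518", "9663545"]


def compare_rankings(a, b):
    """Compare two ranked lists of IDs (hypothetical helper, not part of Greenstone)."""
    pos_a = {doc: i for i, doc in enumerate(a)}
    pos_b = {doc: i for i, doc in enumerate(b)}
    # Documents that sit at the same rank in both lists.
    same_rank = [doc for i, doc in enumerate(a) if pos_b.get(doc) == i]
    # Documents present in both lists but at different ranks (1-based ranks).
    moved = [(doc, i + 1, pos_b[doc] + 1)
             for i, doc in enumerate(a)
             if doc in pos_b and pos_b[doc] != i]
    only_a = [doc for doc in a if doc not in pos_b]
    only_b = [doc for doc in b if doc not in pos_a]
    return same_rank, moved, only_a, only_b


same_rank, moved, only_cross, only_cgi = compare_rankings(cross, cgi)
print("same rank:", same_rank)
print("moved (id, cross rank, cgi rank):", moved)
print("only in cross-collection list:", only_cross)
print("only in cgi-bin list:", only_cgi)
```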

My question is: how can I get exactly the same ranking from these
two queries?

Thanks,
Ying-Hsang Liu


On Jun 29, 2006, at 11:47 PM, Katherine Don wrote:

> Hi
>
> We do a very simplistic version of ranking for cross collection
> searching - we take the ranks from each subcollection at face value. I
> think that the same document in a different collection would receive a
> different rank - it depends on term frequencies within a collection as
> a whole.
> So how you split up the documents into collections could have an effect
> on the ranking.
>
> Regards,
> Katherine
>
> Ying-Hsang Liu wrote:
>> Hi
>>
>> Thanks for your helpful information regarding the details of ranking.
>>
>> Since Greenstone supports cross-collection search and I use this
>> function in my collection, I am wondering whether the ranking results
>> will differ. More specifically, will the ranking be different with
>> one big collection than with several sub-collections (given the same
>> data set)?
>>
>> Thanks!
>>
>>
>> On Jun 18, 2006, at 8:00 PM, Katherine Don wrote:
>>
>>> Hi
>>>
>>> MG does either boolean or ranked queries (but not both at once) while
>>> MGPP ranks boolean queries. So the "display results in ranked/natural
>>> order" just switches the ranking on/off.
>>> Only documents which match the boolean query will be included in the
>>> results.
>>>
>>> The ranking is done using a cosine measure (based on term frequency,
>>> document frequency, document weights...) - see the book mentioned below
>>> for more information about this.
>>>
>>> There is a website about MG, at http://www.cs.mu.oz.au/mg/ and there is
>>> a link there to more information about the software. I thought it may
>>> have info about the ranking, but it is down at the moment. I'm not sure
>>> if this is a permanent error, so you may like to check there.
>>>
>>> MGPP is a reimplementation of MG which is written in C++ instead of C,
>>> and uses word level indexing instead of document level. I think that the
>>> compression, indexing and ranking algorithms are pretty much the same as
>>> for MG.
>>>
>>> Regards,
>>> Katherine
>>>
>>
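
The point made in the thread above, that the same document can receive a
different score depending on how documents are split into collections,
can be illustrated with a toy example. This is only a sketch under loud
assumptions: it uses a generic textbook TF-IDF cosine-style score, not
the actual MG/MGPP formula, and the documents, the names idf and score,
and the weighting log(1 + N/df) are all invented for illustration.

```python
import math


def idf(term, docs):
    """Inverse document frequency of term within one collection (assumed weighting)."""
    df = sum(1 for d in docs if term in d)
    return math.log(1 + len(docs) / df) if df else 0.0


def score(query, doc, docs):
    """Cosine-style score of doc for query, using statistics of the collection docs."""
    weights = {t: doc.count(t) * idf(t, docs) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return 0.0
    return sum(weights.get(t, 0.0) for t in query) / norm


# Toy documents as lists of words (made up, not from the genomics collection).
doc_a = ["protein", "coli", "temperature"]
sub1 = [doc_a, ["protein", "protein", "low"]]
sub2 = [["coli", "coli", "temperature"], ["coli", "low", "temperature"]]
merged = sub1 + sub2  # the same data set built as one big collection

query = ["coli", "protein"]
score_in_sub = score(query, doc_a, sub1)
score_in_merged = score(query, doc_a, merged)
print("score of doc_a in sub-collection 1:", round(score_in_sub, 4))
print("score of doc_a in merged collection:", round(score_in_merged, 4))
```

Because the document frequencies of "coli" and "protein" differ between
sub-collection 1 and the merged collection, doc_a gets a different score
in each. If scores from sub-collections are then merged at face value,
the combined order need not match the order from one big collection.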