[greenstone-users] Re: Lucene building support in greenstone

From Chaitra Rao
Date Tue, 31 May 2005 12:12:43 +0530 (IST)
Subject [greenstone-users] Re: Lucene building support in greenstone
In-Reply-To (429B8287-6020301-cs-waikato-ac-nz)
Hi Katherine,
I think it's better to explain the scenario here so that you can
simulate it on your system and help me debug.

I am currently using 2 collections, say test and test1. I have 3
files in test (doc, html, text) and 3 files in test1 (htm, pdf, text). I
have not assigned any metadata, but I'm using the extracted metadata and 2
AZList classifiers, on ex.Title and ex.Source, in both collections. I
have checked the "Enable advanced searches" option in both collections
and have also chosen the BuildType lucene option from the drop-down list
(using the new GLI.jar file).

Now, when I build test, I get the following exception:
buildcol.pl> **** Indexing Doc 3
buildcol.pl> org.xml.sax.SAXParseException: Next character must be ";" terminating reference to entity "A".
buildcol.pl> at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3182)
buildcol.pl> at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3176)
buildcol.pl> at org.apache.crimson.parser.Parser2.nextChar(Parser2.java:3098)
buildcol.pl> at org.apache.crimson.parser.Parser2.maybeReferenceInContent(Parser2.java:2421)
buildcol.pl> at org.apache.crimson.parser.Parser2.content(Parser2.java:1833)
buildcol.pl> at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1507)
buildcol.pl> at org.apache.crimson.parser.Parser2.content(Parser2.java:1779)
buildcol.pl> at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1507)
buildcol.pl> at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:500)
buildcol.pl> at org.apache.crimson.parser.Parser2.parse(Parser2.java:305)
buildcol.pl> at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:442)
buildcol.pl> at javax.xml.parsers.SAXParser.parse(SAXParser.java:345)
buildcol.pl> at Indexer.index(Indexer.java:87)
buildcol.pl> at IndexXML.indexFile(IndexXML.java:43)
buildcol.pl> at GS2LuceneIndexer.main(GS2LuceneIndexer.java:51)
buildcol.pl> parse error:

Each time I rebuild the collection, this exception appears for a different
doc number (it's Doc 1 in the first build, Doc 2 or Doc 3 in the second
build, and so on, and sometimes there is no doc number at all). Everything
works fine when the indexes are built using mgpp, and no exceptions are
thrown! The error is thrown for only one doc, but the other docs are also
not displayed on the Greenstone page.
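
A hedged aside on the exception above: a parser message of this form
generally means an "&" was followed by a name (here "A") that was not
closed with ";", which is what a bare ampersand in text such as "Q&A" would
produce. The sketch below is only an illustration in Python and is not part
of Greenstone; the function names and the sample document are hypothetical,
and whether this is the actual cause in this collection is unconfirmed.

```python
import re
import xml.sax

# Matches an '&' that does NOT begin a well-formed reference
# such as &amp; or &#38; or &#x26;
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def find_bare_ampersands(text):
    """Return (line_number, line) pairs for lines holding a suspicious '&'."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), 1)
        if BARE_AMP.search(line)
    ]


def is_well_formed(xml_text):
    """Return True if a SAX parser accepts the text, False on a parse error."""
    try:
        xml.sax.parseString(xml_text.encode("utf-8"), xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False


# Hypothetical sample standing in for a doc.xml file with an unescaped '&'.
bad = "<Doc>\n<Sec>Q&A session notes</Sec>\n</Doc>"
# The same text with the ampersand escaped.
good = bad.replace("&A", "&amp;A")

print(find_bare_ampersands(bad))
print(is_well_formed(bad), is_well_formed(good))
```

Running a check like find_bare_ampersands over the text of each doc.xml
would point at the lines worth inspecting by hand.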


> I'll have a look. But just to check, do both your collections have the
> same structure? most importantly they need to have the same index names.
> this means that unless you manually rename indexes, you can't cross
> collection search across different types (mg, mgpp, lucene) of collections.

I tried renaming the .cfs files in one of the collections to have the same
name as the other, but I am unable to search the collections.

> > 1. It seems to be working for word docs only. When I use text and pdf
> > files, the text is not displayed on the greenstone page.
>
> this is strange. Lucene works off the archive files, not off the import
> documents, so it shouldn't make any difference what type of documents
> they were.
> Have a look at the documents in index/text/hashid/doc.xml - this is
> where the text is taken from. Is there text in there?
> are there other differences between the word and pdf docs? eg do your
> word docs have sections while the others don't?
> are you using section level in your index or just document level?

The text is present in the doc.xml files. I am using document level in my
indexes.

> > 2. No unicode search
> Michael is currently fixing this.

Oh! that's nice!

> > 3. The page display is very slow. It takes about 15-30 secs to open a page
> > on localhost
> >
> yes, we have noticed. Will look into it at some stage.

Okay.

> Just out of interest, why do you want to use lucene?

We have 2 search applications that run on lucene but have different
document bases. We want to integrate these applications and make them
searchable by a single search engine. That explains the struggles with
lucene!!! :)

Hope the info furnished above helps! :)

Regards,
Chaitra