SEMI-STRUCTURED (FIELD) INDEXING

From David W J Fourie
Date Fri, 10 Jan 2003 10:58:57 +0200
Subject SEMI-STRUCTURED (FIELD) INDEXING
We exported some 20.000 documents (records) from DB/TextWorks (Inmagic,
Inc., USA) in tagged format, i.e. with delimiters to indicate each field,
paragraph and wraparound. (This is in ASCII and all in one big file.)

For instance, the Author field has a tag AU followed by a blank space and
then the names of the authors. The Title field has a tag TI, also followed
by a space and then the document's title. A wraparound is indicated by a
blank space at the start of the next line, followed by the rest of the
text. A new paragraph or forced new line is indicated by a ";" (semicolon)
followed by a blank and then the text, etc. Each document is ended by a
"$" (dollar) sign, and the next document begins immediately after it. All
20.000 documents are thus in one big file, directly following one another.
Not all fields are present in every document: some documents may have
descriptors, others not, etc. The same field varies in length from
document to document, and the different fields also differ in length from
one another.
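
To make the format described above concrete, here is a minimal Python
sketch of how such a file could be read into separate records. It is not
a Greenstone plugin and not DB/TextWorks code. It assumes that the "$"
sits on a line of its own, that the ";" marker appears at the start of a
line, and that a tag never contains a space. The function name
parse_tagged and the dict-of-lists record layout are made up for
illustration.

```python
def parse_tagged(text):
    """Split a tagged export into records: one dict per document,
    mapping each tag (e.g. "AU", "TI") to a list of its values."""
    records = []
    record = {}
    tag = None  # tag of the field currently being read
    for line in text.splitlines():
        if line.strip() == "$":
            # End-of-document marker (assumed to be alone on its line).
            if record:
                records.append(record)
            record, tag = {}, None
        elif line.startswith(" ") and tag:
            # Wraparound: the line continues the current field value.
            record[tag][-1] += " " + line.strip()
        elif line.startswith("; ") and tag:
            # Forced new line / new paragraph inside the current field.
            record[tag][-1] += "\n" + line[2:].strip()
        elif line.strip():
            # A new field: tag, one blank, then the value.
            tag, _, value = line.strip().partition(" ")
            record.setdefault(tag, []).append(value.strip())
    if record:
        # Last document, in case the file does not end with "$".
        records.append(record)
    return records


sample = (
    "AU Smith, J.\n"
    "TI A long title that\n"
    " wraps around\n"
    "; Second paragraph\n"
    "$\n"
    "TI Only a title\n"
    "$\n"
)
records = parse_tagged(sample)
print(len(records))  # 2
```

Fields that are absent from a document are simply absent from its dict,
so records with and without descriptors can sit side by side.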

We wish to import this into Greenstone, with the documents separated. So
Greenstone should tell me I have 20.000 docs or records or titles. I
thought of doing this using the Organizer, but was told that the Organizer
would not work for this.
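
One possible preprocessing step, independent of Greenstone itself, would
be to cut the big file into one small file per document before any
import. The sketch below only illustrates that idea. It assumes the "$"
is alone on its line, and the names split_records, write_records and the
record_NNNNN.txt file pattern are hypothetical. Whether Greenstone would
then treat each file as a separate document depends on the plugin used,
which this sketch does not address.

```python
from pathlib import Path


def split_records(text):
    """Return the raw text of each document, split on lines holding only '$'."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$":
            if current:
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep a trailing document that is not closed by "$".
    if any(part.strip() for part in current):
        records.append("\n".join(current))
    return records


def write_records(text, out_dir):
    """Write each document to its own file and return how many were written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    records = split_records(text)
    for number, body in enumerate(records, start=1):
        target = out / f"record_{number:05d}.txt"
        target.write_text(body + "\n", encoding="utf-8")
    return len(records)
```

The count returned by write_records gives a quick check that the split
produced the expected number of documents.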

(For this exercise, I use Greenstone on my PC, a HP KAYAK XA.)

1. Is this feasible at all?

2. How do I tell Greenstone that each document ends with the "$" sign and
that the next line in the file is the number (or whatever the case may be)
of the next document?

3. If this is feasible, is there a way to export the new collection later
on in ASCII format (in the same way as with DB/TextWorks)?
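
To illustrate what question 3 is asking for, here is a minimal Python
sketch that writes one record back out in the tagged layout described
earlier: tag plus blank, wraparound lines starting with a blank, "; " for
a forced new line, and "$" to close the document. The record layout (a
dict mapping each tag to a list of values), the function name
export_record and the default line width of 72 are assumptions made for
illustration. This is not a Greenstone or DB/TextWorks export routine.

```python
import textwrap


def export_record(record, width=72):
    """Serialize one record (dict: tag -> list of values) to tagged text."""
    lines = []
    for tag, values in record.items():
        for value in values:
            # A "\n" inside a value marks a forced new line / paragraph.
            for index, paragraph in enumerate(value.split("\n")):
                prefix = f"{tag} " if index == 0 else "; "
                wrapped = textwrap.wrap(paragraph, width=width - len(prefix)) or [""]
                lines.append(prefix + wrapped[0])
                # Wraparound lines start with a single blank.
                lines.extend(" " + rest for rest in wrapped[1:])
    lines.append("$")  # end-of-document marker
    return "\n".join(lines) + "\n"


record = {
    "AU": ["Smith, J."],
    "TI": ["A long title that wraps around\nSecond paragraph"],
}
print(export_record(record, width=20), end="")
```

Calling export_record on every record and concatenating the results would
give one big file again, with the documents directly following one
another.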

I know the Collector, but have no experience with the Organizer.

I would appreciate any help or tips. I've read the FAQ, but I can't find
my specific problem in there. I also have the User's and Developer's
Guides, but haven't found anything in there to help me; maybe I just
don't see it...

Thanx
David Fourie

