Kea-4.0 - Description
Kea-4.0 is a keyphrase extraction algorithm for controlled indexing of documents from the agricultural domain. Compared to Kea-3.0, it uses a different candidate selection and term conflation strategy, offers additional stemmers to choose from, and adds three new features.
Controlled indexing is realized with the domain-specific thesaurus Agrovoc, which contains 16,600 descriptors and 10,600 non-descriptors. It defines three semantic relations between descriptors: related term (RT), broader term (BT) and narrower term (NT). All non-descriptors are connected by preferential links to descriptors, which avoids indexing the same concept with different terms. The Agrovoc database is accessed through several text files stored in the directory AGROVOC.
Candidate selection works as in Kea-3.0, but with reference to Agrovoc. To achieve the best possible matching and a high degree of conflation, each n-gram is transformed into a pseudo phrase in three steps:
This matches similar phrases such as "algorithm efficiency", "the algorithms' efficiency", "an efficient algorithm" and even "these algorithms are very efficient" to the same pseudo phrase "algorithm effici", where "algorithm" and "effici" are the stemmed versions of the corresponding full forms.
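The examples above suggest that building a pseudo phrase amounts to dropping stopwords, stemming the remaining words and putting the stems into a fixed order. The Java sketch below illustrates that reading. It is not Kea's code: the class name PseudoPhraseSketch, the six-word stopword list and the toy suffix stripper are assumptions made for illustration, and a real run would use one of Kea's stemmers and its full stopword list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PseudoPhraseSketch {
    // Tiny illustrative stopword list (assumption, not Kea's actual list).
    static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "these", "are", "very"));

    // Toy suffix stripper standing in for a real stemmer (assumption).
    static String toyStem(String word) {
        for (String suffix : new String[] {"ency", "ent", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Lowercase, drop stopwords, stem, then sort the stems alphabetically.
    static String pseudoPhrase(String ngram) {
        List<String> stems = new ArrayList<>();
        for (String token : ngram.toLowerCase().split("\\s+")) {
            String word = token.replaceAll("[^a-z]", "");
            if (word.isEmpty() || STOPWORDS.contains(word)) {
                continue;
            }
            stems.add(toyStem(word));
        }
        Collections.sort(stems);
        return String.join(" ", stems);
    }

    public static void main(String[] args) {
        String[] phrases = {
            "algorithm efficiency",
            "the algorithms' efficiency",
            "an efficient algorithm",
            "these algorithms are very efficient"
        };
        for (String phrase : phrases) {
            System.out.println(phrase + " -> " + pseudoPhrase(phrase));
        }
    }
}
```

With this toy stemmer, all four input phrases are reduced to "algorithm effici". Sorting the stems is what makes "an efficient algorithm" and "algorithm efficiency" collapse to the same key.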
In the next step each pseudo phrase is matched against vocabulary terms, also represented as pseudo phrases. If they are the same, the n-gram is identified with the corresponding vocabulary term.
For semantic term conflation, non-descriptors are replaced by their equivalent descriptors using the links in the thesaurus (these are called USE-FOR links in the Agrovoc thesaurus employed in this work).
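The lookup and the replacement of non-descriptors can be pictured as two map lookups. The Java sketch below is a minimal illustration under stated assumptions: the class VocabularyLookupSketch and its methods are hypothetical, and the sample entries are invented for the example rather than read from the Agrovoc files.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class VocabularyLookupSketch {
    // Pseudo phrase of a vocabulary term -> the term itself.
    private final Map<String, String> termByPseudoPhrase = new HashMap<>();
    // Non-descriptor -> the descriptor it points to (preferential link).
    private final Map<String, String> descriptorByNonDescriptor = new HashMap<>();

    void addDescriptor(String pseudoPhrase, String descriptor) {
        termByPseudoPhrase.put(pseudoPhrase, descriptor);
    }

    void addNonDescriptor(String pseudoPhrase, String nonDescriptor, String descriptor) {
        termByPseudoPhrase.put(pseudoPhrase, nonDescriptor);
        descriptorByNonDescriptor.put(nonDescriptor, descriptor);
    }

    // Returns the descriptor for an n-gram's pseudo phrase, if the vocabulary has a match.
    Optional<String> resolve(String pseudoPhrase) {
        String term = termByPseudoPhrase.get(pseudoPhrase);
        if (term == null) {
            return Optional.empty();
        }
        // Non-descriptors are replaced by their descriptor; descriptors stay as they are.
        return Optional.of(descriptorByNonDescriptor.getOrDefault(term, term));
    }

    public static void main(String[] args) {
        VocabularyLookupSketch vocabulary = new VocabularyLookupSketch();
        // Illustrative entries only (assumption).
        vocabulary.addDescriptor("maiz", "maize");
        vocabulary.addNonDescriptor("corn", "corn", "maize");

        System.out.println(vocabulary.resolve("corn"));
        System.out.println(vocabulary.resolve("maiz"));
        System.out.println(vocabulary.resolve("tractor"));
    }
}
```

Keeping the two maps separate mirrors the two stages in the text: exact matching on pseudo phrases first, semantic conflation through the thesaurus links second.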
As an optional extension, the candidate set is enriched with all terms that are related to the candidate terms, even though they may not correspond to pseudo phrases that appear in the document. For each candidate, its one-path related terms (RTs, BTs and NTs) are included. To use this feature, manually set the private variable m_RELused to "true" in KEAFilter.java.
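The extension can be sketched as a single expansion step over the thesaurus links. The Java code below is an illustration only: the class RelatedTermExpansionSketch, the link map and the sample terms are assumptions, and the sketch does not reproduce how KEAFilter.java stores or reads the Agrovoc relations.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class RelatedTermExpansionSketch {
    // Adds every term that is exactly one link (RT, BT or NT) away from a candidate.
    static Set<String> expand(Set<String> candidates, Map<String, Set<String>> links) {
        Set<String> expanded = new LinkedHashSet<>(candidates);
        for (String candidate : candidates) {
            // Only the original candidates are followed, so the expansion stops after one step.
            expanded.addAll(links.getOrDefault(candidate, Collections.emptySet()));
        }
        return expanded;
    }

    public static void main(String[] args) {
        // Illustrative links (assumption): RT, BT and NT are merged into one map here.
        Map<String, Set<String>> links = new HashMap<>();
        links.put("maize", new HashSet<>(Set.of("cereals", "sweet corn")));
        links.put("cereals", new HashSet<>(Set.of("plants")));

        Set<String> candidates = new LinkedHashSet<>(Set.of("maize"));
        System.out.println(expand(candidates, links));
    }
}
```

In this example "plants" is not added, because it is two links away from the only candidate "maize".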
Additional stemmers can be selected for both KEAModelBuilder and KEAKeyphraseExtractor with the "-t" option:
New features in Kea-4.0 are length of phrase in words, node degree and phrase appearance. The first two are used by default, while the third can be enabled manually by setting the private variable m_APfeature to "true" in KEAFilter.java. Using this feature only makes sense when extended candidate selection (m_RELused set to "true") is enabled.
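The two default features can be illustrated with a short Java sketch. This description does not define the features, so the sketch rests on assumptions: length is taken as the number of whitespace-separated words, and node degree is taken as the number of thesaurus links that connect a term to other candidates of the same document. The class CandidateFeaturesSketch and the sample data are hypothetical. Phrase appearance is left out because its definition is not given here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class CandidateFeaturesSketch {
    // Length of a phrase in words (assumption: words are separated by whitespace).
    static int lengthInWords(String phrase) {
        String trimmed = phrase.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Node degree (assumption): number of linked terms that are themselves candidates.
    static int nodeDegree(String term, Set<String> candidates, Map<String, Set<String>> links) {
        int degree = 0;
        for (String linked : links.getOrDefault(term, Collections.emptySet())) {
            if (candidates.contains(linked)) {
                degree++;
            }
        }
        return degree;
    }

    public static void main(String[] args) {
        // Illustrative data only (assumption).
        Map<String, Set<String>> links = new HashMap<>();
        links.put("maize", new HashSet<>(Set.of("cereals", "sweet corn", "zea")));
        Set<String> candidates = new HashSet<>(Set.of("maize", "cereals", "sweet corn"));

        System.out.println(lengthInWords("sweet corn"));
        System.out.println(nodeDegree("maize", candidates, links));
    }
}
```

Under these assumptions, a term that is linked to many other candidates of the same document gets a higher node degree than an isolated one.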