A. Al-Ani and M. Deriche
Volume 17, 2002
Links to Full Text:Journal of Artificial Intelligence Research 17 {2002} 333-361 Submitted 2/2002;
published 11/2002
A New Technique for Combining Multiple Classifiers using
The Dempster-Shafer Theory of Evidence
Ahmed Al-Ani a.alani@qut.edu.au
Mohamed Deriche m.deriche@qut.edu.au
Signal Processing Research Centre
Queensland University of Technology
GPO Box 2434, Brisbane, Q 4001, Australia Abstract
This paper presents a new classifier combination technique based on the
Dempster-
Shafer theory of evidence. The Dempster-Shafer theory of evidence is a powerful
method
for combining measures of evidence from different classifiers. However,
since each of the available methods that estimates the evidence of classifiers
has its own limitations, we
propose here a new implementation which adapts to training data so that
the overall mean
square error is minimized. The proposed technique is shown to outperform
most available
classifier combination methods when tested on three different classification
problems.
1. Introduction
In the field of pattern recognition, the main ob jective is to achieve the
highest possible clas- sification accuracy. To attain this ob jective, researchers,
throughout the past few decades,
have developed numerous systems working with different features dependingupon
the ap-
plication of interest. These features are extracted from data and can be
of different types like continuous variables, binary values, etc. As such,
a classification algorithm used with a
specific set of features may not be appropriate with a different set of
features. In addition,
classification algorithms are different in their theories, and hence achieve
different degrees
ofsuccess for different applications. Even though, a specific feature set
used with a specific classifier might achieve better results than those obtained
using another feature set and/or classification scheme, we can not conclude
that this set and this classification scheme achieve
the best possible classification results {Kittler, Hatef, Duin,& Matas,
1998}. As different
classifiers may offer complementary information about the patterns to be
classified, combin- ing classifiers, in an efficient way, can achieve better
classification results than any single
classifier {even the best one}.
As explained by Xu et al. {1992}, the problem of combining multiple classifiers
consists
of two parts. The first part, closely dependent on specific applications,
includes the problems of \How many and what type of classifiers should be
used for a specific application?, and
for each classifier what type of features should we use?", as well as other
problems that are
related to the construction of those individual and complementary classifiers.
The second
part, which is common to various applications, includes the problems related
to the question \How to combine the results from different existing classifiers
so that a better result can be
obtained?". In our work, we will be concentrating on problems related to
the second issue. c
fl2002 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.Al-Ani
& Deriche The output information from various classification algorithms can
be categorized into
three levels: the abstract, the rank, and the measurement levels. In the
abstract level,
a classifier only outputs a unique label, as in the case of syntactic classifiers.
For the
rank level, a classifier ranks all labels or a subset of the labels in a
queue with the label at the top being the first choice. This type was discussed
by Ho et al. {1994}. For the
measurement level, a classifier attributes to each class a measurement value
that reflects the degree of confidence that a specific input belongs to a
given class. Among the three levels,
the measurement level contains the highest amount of information while the
abstract level
contains the lowest. For this reason, we adopted, in this work, the measurement
level. Kittler et al. {1998} differentiated between two classifiercombination
scenarios. In the
first scenario, all the classifiers use the same representation of the input
pattern. On the
other hand, each classifier uses its own representation of the input pattern
in the second
scenario. They illustrated that in the first case, each classifier can be
considered to pro- duce an estimate of the same a posteriori class probability.
However, in the second case
it is no longer possible to consider the computed a posteriori probabilities
to be estimates
of the same functional value, as the classification systems operate in different
measure-
ment systems. Kittler et al. {1998} focused on the second scenario, and
they conducted
a comparative study of the performance of several combination schemes namely;
product,
sum, min, max, and median. By assuming the joint probability distributions
to be con- ditionally independent, they found that the sum rule gave the
best results. A well known
approach that has been used in combining the results of different classifiers
is the weighted
sum, where the weights are determined through a Bayesian decision rule {Lam
& Suen, 1995}. An alternative method was presented by Hashem & Schmeiser
{1995}, where a cost
function was used to minimize the mean square error {MSE} in order to calculate
a linear
combination of the corresponding outputs from a number of trained artificial
neural net-
works {ANNs}. The expectation maximization algorithm was used by Chen &
Chi {1998}
to perform the linear combination. The fuzzy integral has been used by Cho
& Kim {1995a,
1995b} to combine multiple ANNs, while {Rogova, 1994; Mandler & Schurmann,
1988} have used the Dempster-Shafer theory of evidence to combine the result
of several ANNs. Many
other combination methods have also been used to combine classifiers, such
as bagging and
boosting {Dietterich, 1999}, which are powerful methods for diversifying
and combining classification results obtained using a single classification
algorithm and a specific feature set. In bagging, we get a family of classifiers
by training on different portions of the training
set. The method works as follows. We first create N training bags. A single
training bag
is obtained by taking a training set of size S and sampling this training
set S times with
replacement. Some training instances will occur multiple times in a bag,
while others may not appear at all. Next, each bag is used to train a classifier.
These classifiers are then
combined. Boosting, on the other hand, is based on multiple learning iterations.
At each iteration, instances that are incorrectly classified are given a
greater weight in the next it-
eration. By doing so, in each iteration, the classifier is forced to concentrate
on instances it
was unable to correctly classify in earlier iterations. In the end, all
of the trained classifiers are combined.
In this paper, we will focus on combining classification results obtained
using N different
feature sets, f
1
; 001 001 001 ; f
N
. Each feature set willbe used to train a classifier, and hence there willbe
N different classifiers, c
1
; 001 001 001 ; c N
. For a specific input x, each classifier c n
produces
334A New Technique for Combining Multiple Classifiers
Ahmed al-AniFeatureExtraction1Classifierc1ClassifierNcxExtractionNFeaturefCombinationzyyfN11NFigure
1: A multi-classifier recognition system
a real vector y
n
= [y
n
{1}; 001 001 001 y
n
{k}; 001 001 001 y
n
{K }] T
, where K is the number of class labels and y
n
{k} corresponds to the degree that c n
considers x has the label k. This degree could be a probability, as in the
Bayesian classifier, or any other scoring system. Fig. 1 shows the
blockdiagram of a multi-classifier recognition system. Unlike statistical-based
combination techniques, the Dempster-Shafer theory of evidence has the ability
to represent uncertainties and lack of knowledge. This is quite important
for
the problem of classifier combination, because there is usually a certain
level of uncertainty
associated with the performance of each of the classifiers. Since available
classifier combi-
nation methods based on this theory do not accurately estimate the evidence
of classifiers,
this paper attempts to solve this issue by proposing a new technique based
on the gradient
descent learning algorithm, which aims at minimizingthe MSE between the
combined out-
put and the target output of a given training set. Aha {1995} gave the following
definition
for learning:
Learningdenotes changes in the system that are adaptive in the sense that
they
enable the system to do the same task or tasks drawn from the same population
more effectively the next time. Basedon the above, we show that instead of
attempting to find an analytical formula which
accurately measures evidence, one can obtain a very good estimate of evidence
by just using
appropriate learning procedures, as will be discussed later. Some basic
concepts of the Dempster-Shafer theory of evidence are presented in the next
section. Section three discusses the existing methods for computing evidence.
The
proposed combination technique is presented in section four. Section five
compares the
proposed algorithm to other conventional methods used by Kittler et al.
{1998}, the fuzzy integral, and a previous implementation of the Dempster-Shafer
theory. Section six provides
a conclusion to the paper.
2. The Dempster-Shafer Theory of Evidence
The Dempster-Shafer {D-S} theory of evidence {Shafer, 1976} is a powerful
tool for rep-
resenting uncertain knowledge. This theory has inspired many researchers
to investigate
335Al-Ani & Deriche different aspects related to uncertainty and lack of
knowledge and their applications to real
life problem. Today , the D-S theory covers several different models, such
as the theory of
hints {Kohlas & Monney, 1995} and the transferable belief model {TBM} {Smets,
1998}. The latter will be adopted in this paper as it represents a powerful
tool for combining
measures of evidence.
Let002 = f022
1 ; :::::; 022
K
g be a finite set of possiblehypotheses. This set is referred to as the
frame of discernment, and its powerset denoted by 2 002
. Following are the basic concepts of
the theory: Basic belief assignment {BBA}. A basic belief assignment m is
a function that assigns
a value in [0; 1] to every subset A of 002 and satisfies the following:
m{;}= 0; and
X
A022002 m{A} = 1 {1} It is worth mentioning that m{;} could be positive
when considering unnormalized combi-
nation rule as will be explained later. While in probability theory a measure
of prob ability is
assigned to atomic hypotheses 022
i
, m{A} is the part of belief that supports A, but does not
supportanything more specific, i.e., strict subsets of A. For A 6= 022
i
, m{A} reflects some ig-
norancebecause it is a belief that we cannot subdivide into finer subsets.
m{A} is a measure
of support we are willingto assign to a composite hypothesis A at the expense
of support m{022
i
} of atomic hypotheses 022 i
. A subset A for which m{A} > 0 is called a foc al element.
The partial ignorant associated with A leads to the following inequality:
m{A} + m{A} 024 1,
whereA is the compliment of A. In other words, the D-S theory of evidence
allows us
to represent only our actual know ledge without being forced to overcommit
when we are ignorant.
Belief function. The belief function, bel{:}, associated with the BBA m{:}
is a function
that assigns a value in [0; 1] to every nonempty subset B of 002. It is
called \degr e e of belief
in B" and is defined by
bel{B} =
X
A022B
m{A} {2}
We can consider a basic belief assignment as a generalization of a probability
density func-
tion whereas a belief function is a generalization of a probability function.
Combination rule. Consider two BBAs m
1
{:} and m 2
{:} for belief functions bel
1
{:} and
bel
2 {:} respectively. Let A
j
and B
k
be focal elements of bel
1
and bel 2
respectively. Then
m
1
{:} and m
2
{:} can be combined to obtain the belief mass committed to C 032 002 according
to the following combination or orthogonal sum formula {Shafer, 1976},
m{C } = m
1
010 m
2
{C } =
X
j;k;A
j
\B
k
=C
m
1
{A
j }m
2
{B
k
}1 000
X j;k;A
j
\B
k
=; m
1
{A j
}m
2 {B
k
} ; C 6= ; {3}
336A New Technique for Combining Multiple Classifiers
The denominator is a normalizing factor, whichintuitively measures how much
m
1 {:} and
m
2
{:} are conflicting. Smets {1990} proposedthe unnormalized combination rule:
m
1 flm
2 {C } =
X A
j
\B
k
=C
m
1
{A
j
}m
2
{B k
}; 8C 022 002 {4}
This rule implies that m{;} could be positive, and in such case reflects
some kind of con- tradiction in the belief state. In this work we willconsider
that m{;} = 0 and use the normalized combination rule. A comparison between
normalized and unnormalized combi-
nation rules for the problem of combining classifiers will be considered
in the future.
Combining several belief functions. The combination rule can be easily extended
to
several belief functions by repeating the rule for new belief functions.
Thus the pairwise
orthogonal sum of n belief functions bel
1
; bel 2
; 001 001 001 ; bel
n
, can be formed as
{(bel 1
010 bel
2
} 010 bel
3
} 001 001 001 010 bel
n
= n
M
i=1 bel
i
{5} Notation. According to Smets {2000}, the full notation for bel andits
related functions is:
bel
002<
Y;t
[E C
Y;t
]{w 0
2 A} = x where Y represents the agent, tthe time, 002 the frame of discernment,
< a boolean algebra
of subsets of 002, w
0
the actual world, A a subset of 002, and E C
Y;t
all what agent Y knows
at t. Thus, the above expression denotes that the degree of belief held
by Y at t that w
0
belongs to the set A of worlds is equal to x. The belief is based on the
evidential corpus
E C
Y;t
held be Y at t. Inpractice, many indices can be omitted for simplicity sake.
Usually < is the power set
of 002, which is 2
002
. When bel is defined on 2
002 , < is not explicitly stated. 'w
0
2 A' is denoted
as 'A'. Y and/or t are omitted when the values of the missing elements are
clearly defined
from the context. Furthermore, E C is usually just a conditioning event.
So, bel{A} is one
of the most often used notations {Smets, 2000}. In the proposed method,
we will adopt the
following notation: bel
n
{022
k
}, where the agent isthe classifier, and the subsets of concern
are the class labels.
It is important to mention that the combination rule given by Eq. 3 assumes
that the
belief functions to be combined are independent. Consider that we have certain
information and would like to measure its belief, then we can think of this
process as a mapping from
the \original information level" to the \belief level". Liu & Bundy {1992}
explainedthat
independence in the original information level would lead to independence
in the belief level. But, if two independentbelief functions are rooted to
the original information level, then
their original information may or may not be independent. For the problem
of combining
multiple classifiers, the original information level consists of outputs
of the classifiers to be
combined, while the belief level consists of the evidence of these classifiers
{or their BBAs}.
The assumption that these BBAs are independent, whether obtained from independent
or
dependent original information, can hence justify the use of D-S theory.
In fact, many
337Al-Ani & Deriche existing classifier combination methods assume the classification
results of different classi-
fiers to be independent {Mandler & Schurmann, 1988; Hansen & Salamon, 1990;
Xu et al.,
1992}. Since the classifiers' evidence plays a crucial role in the combination
performance,
there is an increased interest in the proper estimation of such evidence.
In the next section,
we discuss how a number of existing classifier combination methods estimate
evidence of
classifiers, and in section 4 we present our proposed method. 3. Existing
Methods for Computing Evidence
Mandler & Schurmann {1988} proposed a method that transforms distance measures
of the
different classifiers into evidence. This was achieved by first calculating
a distance between
learning data sets and a number of reference points in order to estimate
statistical distri- butions of intra- and interclass distances. For both,
the a posteriori probability function
was estimated, indicating to which degree an input pattern belongs to a
certain reference
point. Then, for each class label, the class conditional probabilities were
combined into evidence value ranging between 0 and 1, whichwas considered
as the BBA of that class.
Finally, Dempster's combination rule was used to combine the BBAs of the
different classi-
fiers to give the final result. As explained by Rogova {1994}, this method
brought forward questions about the choice of reference vectors and the distance
measure. Moreover, ap-
proximations associated with estimation of parameters of statistical models
for intra- and
interclassdistances can lead to inaccurate measure of the evidence. Xu et
al. {1992} used K + 1 classes to perform the classification task, where for
the
{K + 1}
th
class denotes that the classifier has no idea about which class the input
comes
from. For each classifier c
n
, n= 1::N , recognition, substitution, and rejection rates {
*
n
r
,
*
n
s ,
and 1 000
* n
r
000
* n
s
} were used as a measure of BBA, m
n
, on 002 as follows:
1. If the maximum output of a specific classifier belongs to K + 1, then
m
n
has only a focal element 002 with m n
{002} = 1.
2. When the maximum output belongs to one of the K classes, m
n
has two focal elements 022
k
and022
k
with m
n
{022 k
} =
*
n
r
, m
n
{022
k
}=
*
n s
. As the classifier says nothing about any
other propositions, m
n
{002} = 1000 m
n
{022 k
} 000 m
n
{022
k
}.
The drawback of this method is again the way evidence is measured. There
are two problems
associated with this method. Firstly, many classifiers do not produce binary
outputs, but rather probability like outputs. So, inthe first case, it is
inaccurate to assign 0 to both
m
n
{022
k
} and m
n
{022
k
}. Secondly, this way of measuring evidence ignores the fact that classifiers
normally do not have the same performance with different classes. This had
a clear impact
on the performance of this combination method when compared with other conventional
methods especially the Bayesian {Xu et al., 1992}.
Rogova {1994} used several proximity measures between a reference vector
and a clas- sifier's output vector. The proximity measure that gives the
highest classification accuracy
waslater transformed into evidences. The reference vector used was the mean
vector, 026
n k
, of
the output set of each classifierc
n and each class label k. A number of proximity measures, d
n
k
, for 026
n
k
and y
n
were considered. For each classifier, the proximity measure of each class
338A New Technique for Combining Multiple Classifiers
is transformed into the following BBAs:
m
k
{022
k } = d
n
k ; m
k
{002} = 1000 d
n
k mk
{022
k
} = 1 000
Y
l6=k
{1 000 d n
l
}; mk
{002} = Y
l6=k {1000 d
n
l
}
The evidence of classifier c
n
and class k isobtained by combining the knowledge about
022
k
, thusm
k
010 mk
. Finally, Dempster's combination rule was used to combine evidences
for all classifiers to obtain a measure of confidence for each class label.
Note that the first
combination was performed with respect to the class label {Rogova usedthe
notations k andk}, while in the second one the agent was n. This idea was
a promising one. However, the
ma jor drawback is the way the reference vectors are calculated, where the
mean of output vectors may not be the best choice. Also, trying several proximity
measures and choosing
the one that gives the highest classification accuracy is itself questionable.
4. The Proposed Combination Technique
In this section we willestimate the value of m n
{022
k }, whichrepresents the belief in class label
k that is produced by classifier c
n
. In addition, we willalso estimate m
n
{002}, which reflects
the ignorance associated with classifier c
n
. Since the ultimate ob jective is to minimize the MSE between the combined
classification results and the target output, m
n
{022
k } and m
n
{002}
will be estimated using an iterative procedure that aims at attaining this
ob jective. We will
first compare y
n
, which is the output classification vector produced by classifier c
n
, to a
reference vector, w
n
k , and the obtained distance will be used to estimate the BBAs. These
BBAs will then be combined to obtain a new output vector, z, that represents
the combined
confidence in each class label. w
n
k
will be measured such that the MSE between z and the
target vector, t, of a training dataset is minimized. Note that there are
two indices for w
n
k
. Thus, for class label k, we don't only consider the value assigned to
it by classifier c
n
, but rather the whole output vector {values assigned to each class label}.
Let the frame of discernment 002 = f022
1
; 001 001 001 022
k ; 001 001 001 ; 022 K
g, where 022 k
is the hypothesis that the input x is of class k. Considering a BBA, m
n
, such that m n
{022
k } 025 0, m
n
{002} =
1 000 P
K
k=1 m
n
{022 k
}, and m
n
is 0 elsewhere. Let d
n
{022 k
} be a distance measure and g
n
the
unnormalized ignorance of classifier c
n , then m
n
{022
k
} and m
n
{002} willbe estimated according
to the following formulas:
d
n
{022
k
}= exp{000kw
n
k
000 y
n
k 2
} {6}
m
n
{022 k
} =
d n
{022
k }K X
k=1
d n
{022
k } + g
n
{7}
m
n
{002} =
g
nK
X
k=1
d
n {022
k
} + g
n
{8} where m
n
{022
k
} and m n
{002} are the normalized values of d
n
{022
k
} and g n
respectively. Similar
to w
n k
, the minimized MSE will be used to estimate g
n
. 339Al-Ani & Deriche Evidences of all classifiers are combined according
to the normalized combination rule toobtain a measure of confidence of each
class label. The k
th element of the new combined vector is given by:
z{k} = m{022 k
} = m
1 {022
k
}010 001 001 001 010 m N
{022
k } =
M
n2N
m
n
{022
k
} {9}
For a given classifier c
n
, let I = f1 001 001 001 N g n fng, m
I
=
L
i2I
m
i
, then Eq. 9 can be written as:
z{k}= m
I
{022 k
} 010 m n
{022
k } {10}
where according to Eq. 3, the combination of two BBAs is: m
j
{022 k
} 010 m l
{022
k } =
m
j {022
k
}m
l
{022 k
} + m
j
{022
k }m
l
{002} + m
j
{002}m l
{022
k }1 000 X
p
X
q
q6=p
m
j
{022 p
}m
l {022
q
} {11}
w
n
k
and g
n
will be initialized randomly, then their values will be adjusted according
to a
training dataset so that the MSE of z is minimized.
E rr = kz000 tk
2
{12}
The values of w
n
k and g
n
are adjusted according to the formulas:
w n
k
[new] = w
n
k [old] 000 ff @ E rr@ w
n
k
[old]
{13}
g
n
[new] = g
n
[old] 000 fi
@ E rr@ g
n [old]
{14} where ff and fi are the learning rates. The terms @ E rr=@ w
n
k and @ E rr=@ g
n
are derived as
follows:
@ E rr@ w
n
k
=
@ E rr@ z{k} @ z{k}@ m
n
{022
k
} @ m
n
{022
k
}@ w
n
k
{15}
@ E rrg
n =
@ E rr@ z{k} @ z{k}@ m
n
{022
k
} @ m
n
{022
k
}@ g
n
{16} 340A New Technique for Combining Multiple Classifiers
where, @ E rr@ z{k}
= 2[z{k} 000 t{k}] {17}
@ z{k}@ m
n {022
k
} =
{"
1 000 X
p
X
q
q6=p
m
n
{022 p
}m
I {022
q
} #
[m
I {022
k
} + m
I
{002}] + [m
n
{022 k
}m
I {022
k
} + m
n
{022 k
}m
I {002} + m
n
{002}m
I
{022
k
}]
"
X
p
p6=k
m
I {022
p
} #},
"
1 000 X
p
X
q
q6=p
m
n
{022 p
}m
I {022
q
} #
2
{18} @ m
n
{022
k
}@ w
n
k = 000
2 exp{000kw
n
k
000 y
n
k 2
}[w
n k
000 y
n ][
X
p
p6=k
d
n
{022
p } + g
n
][
X p
d
n
{022
p
} + g
n
]
2 {19}
@ m
n {022
k
}@ g
n = 000
d
n
{022
k }[
X
p
d
n
{022
p } + g
n
]
2
{20}
Fig. 2 shows a flow chart of these learning procedures. It has been found
that adjusting
the values of g n
can be achieved during the first few iterations. By continuing the training
to fine-tune the values of w
n
k
until there is no further improvement on the training set, or
wereach a pre-defined maximum number of epochs
1
, the result could be further enhanced.
Note that the weight values are adjusted by each pattern {not batch training}.
We fix the value of fi = 10 0006
, while ff is first initialized to 5 002 10
0004
, and is then changed according to
the value of MSE, as described in the flow chart.
Althoughthe computational cost involved in implementing our technique is
higher than
that of other combination methods
2
, we only need to perform training once, which can be done off-line. Then,
with the optimal values of w
n
k
and g
n
, we can perform the on-line combination, which is comparable to other combination
methods.
Onthe other hand, as indicated in the beginning of this section, we consider
a reference
vector, w n
k
, for each class. This leads to an increase in training time as the number
of classes
and/or classifiers increases. An alternative is to consider only using a
reference value for
each class, w
n
k
. This will save more than 50045 of training time for the case of several
classifiers and classes. Note that the same learning formulas are applicableby
replacing w
n k
with w
n k
and y
n with y
n
k . We willrefer to these two alternative approaches as DS1 and DS2, respectively.
In the following section, we will compare DS1 and DS2 with other well-known
combination methods.1. The maximum number of epochs is set to 50 in all
experiments described in this paper 2. Training time of most of the experiments
conducted in section 5 required less than 3 minutes on a
conventional PC
341Al-Ani & DericheAhmed al-AniStarta = a*0.7a = a*1.03EndFor each patternErr
= ErroldnewIt_no=It_no+1k{q }and{Q}mmnnComputenkwgnandAdjust-6b=10 a=5*10
,-4Initialize learningratesErr -Err >ThnewoldIFnkwgnandn k=1:N,=1:KzandErrnewComputeIt_no 10and-4IFErrnewErroldIt_noIt_nomax= max. no. of iterationsTh= current
no. of iterations= previous MSE= error threshold= current MSENoYesYesNoRandomly
initializeFigure 2: Training procedure of the proposed technique It is worth
mentioning that although the training procedures of both the proposed method
and the backpropagation algorithm of ANN are based on minimizingthe MSE using
iterative approaches, the proposed method and ANN are not similar. The backpropagation
training operates by passing the weighted sum of its input through an activation
function,
usually in a multi-layer architecture known as multi-layer perceptron {MLP}.
Extracting
rules from a trained MLP is a very challenging problem. On the other hand,
the training of the proposed method operates by measuring a distance between
a classification vector and
a reference vector. This distance would later be used to measure the belief
of each class
label for all classifiers. The final confidence of each class label is obtained
by combining thebeliefs of all classifiers. Unlike MLP, the belief of a given
class label for each classifier indicates its contribution towards the final
confidence. The reader may refer to {Denoeux,
2000} for a description of an ANN classifier based on the D-S theory.
342A New Technique for Combining Multiple Classifiers
5. Performance Analysis of Different Combination Methods The following three
classification problems have been considered: texture classification,
classification of speech segments according to their manner of articulation,
and speaker
identification. ANNs are used to perform classification for the three problems.
For each case, classifiers will be sorted according to their performance,
such that the best classifier
isreferred to as c 1
, the 2
nd best as c
2
, and the worst one as c N
.
For each problem, we willconsider different number of classes, and combine
the results of different number of classifiers, where combining results of
the best, the worst and mixtures
of best and worst classifiers will be investigated. For example, ifwe have
five classifiers
and would like to combine two of these, then we willconsider combining the
best two,
fc
1
; c
2
g, best one and worst one, fc
1
; c
5
g, and worst two classifiers, fc
4 ; c
5
g. The following
combination methods were tested: the weighted sum {WS} 3
, average {Av}, median {Md},
maximum {Mx}, ma jority voting {MV}, fuzzy integral {FI} {Cho & Kim, 1995a}
4
, Rogova's
D-S method {DS0} {Rogova, 1994}, and our proposed method with its two alternatives
{DS1
& DS2}. The training set used to train the ANNs will be used to estimate
the confusion matrix for WS and FI, as well as to estimate the evidence of
DS0, DS1, and DS2. Two measures will be used to compare the performance of
the different combination methods, namely: overal l performanc e and error
re duction rate {ERR}. The overall perfor-
mance is the mean of classification accuracy obtained by combining all considered
subsets
of 2; 001 001 001 ; N classifiers. E RR is the percentage of error reduction
obtained by combining classifiers with reference to the best single classifier:
E RR =
E R
BSC 000 E R
C CE R
BSC 002 100 {21}
where E R
BSC
is the error rate of the best single classifier and E R
C C
is the error rate ob-
tained by combining the considered classifiers. Unlike classification accuracy,
E RR clearly
shows how the performance of the combined classifiers improves or deteriorates
compared
to the best single classifier. In other words, it shows the merit of performing
the combi-
nation. We willspecifically concentrate on the maximum ERR obtained by combining
all
the considered subsets of 2; 001 001 001 ; N classifiers. In addition,
we will also investigate how the
value of E RR gets affected by increasing the number of combined classifiers.
5.1 Texture Classification Several experiments have been carried out for
the classification of texture images. The
textures considered here are: bark, brick, bubbles, leather, raffia, water,
weave, wood and
wool {USC, 1981}. In order to obtain a better comparison between the different
combination
methods, we considered classifying the first two textures, then the first
three, the first five
and finally all the nine textures. Additive Gaussian noise, with different
signal-to-noise ratio,
has been added to {1024 002 1024} pixels image of each texture class to
form the training and
testing sets. 961 patterns were obtained from each image using {64 002
64} windows with an overlap of 32 pixels.3. The weights of each classifier
are determined according to the classification accuracy of each class label
using the training dataset
4. The reader may refer to Appendix A for a brief description of this method
343Al-Ani & DericheNo. of classesS DH 1S DH 2S DH 3S DH 4E n286.9685.7384.4485.4591.14384.5884.5283.9186.2489.72585.1084.6284.3483.4688.84980.9777.4477.5175.7283.65Table
1: Texture classification accuracy of the five original classifiers for different
number
of class labels Four nine-feature vectors were calculated using statistics
of sum and difference histogram
{S DH } of the co-occurrence matrix with different directions, vertical
{S DH 1
}, horizontal {S DH
2
}, and the two diagonals {S DH
3
and S DH
4
} . For each direction, the features used
were: mean, variance, energy, correlation, entropy, contrast, homogeneity,
cluster shade,
and cluster prominence. The fractal dimension {FD} has also been used to
form the tenth
feature of each vector. The energy contents of texture images {E n} has
been used to form another feature vector using 9 different masks. Again the
tenth feature was FD. Each of these five feature vectors has been used as
input to an ANN. The numbers of
training and testing patterns depend upon number of classes considered,
i.e. for the case
oftwo classes, 15376 patterns were used to train the networks and 5766 to
test them. The
resultsobtained are shown in Table 1. Note that as the number of classes
increases the overall accuracy decreases. In addition, the performance of
the E n classifiers is found to be better than that of the other four.No.
of classesWSAvMdMxMVFIDS0DS1DS2289.1689.0487.6690.1288.0990.0888.7090.6690.72388.5288.3987.4188.8687.3088.7188.4090.2190.08589.6089.4187.9989.2387.8390.2889.5292.6991.50984.9684.5583.3782.9083.2386.7684.8789.8386.79Table
2: Overall performance of the various combination methods for different number
of
class labels {texture classification}
The overall performance of the tested combination methods for different
number of class
labelsare shown in Table 2. For the case of 2 classes, it is clear that
the overall performances
of DS1 and DS2 are better than that of the other combination methods. When
mixtures of
good and bad classifiers are considered, the performance of combination
methods, except
for DS1 and DS2, is closer to or worse than that of the best single classifier.
This is shown in Table 3 for the combination of fc
1
; c 3
; c
4 ; c
5
g, fc
1
; c 4
; c
5 g, fc
1 ; c
5
g, etc
5
. When 3 and 5
classes are considered, DS1 performs slightly better than DS2, andboth outperform
the
other methods. The gap between DS1 and other methods gets wider when all
9 classes are
considered. The superiority of DS1 reflects the advantage of using the whole
output vector
in measuring evidences of classifiers.5. The reader may refer to Appendix
B for detailed results of other cases
344A New Technique for Combining Multiple ClassifiersClassifiersWSAvMdMxMVFIDS0DS1DS2c
1
, c
292.5692.5992.5992.5192.5192.6192.4092.4692.46c
1
, c
591.1691.1291.1291.0991.0991.0091.3391.6291.61c
4 , c
585.0785.0785.0785.2285.2285.0785.1585.1085.12c
1
, c
2 , c
391.2191.2188.9291.6288.9291.6990.8192.4092.53c
1
, c
2 , c
591.0390.8188.6891.4888.7191.4890.4392.4792.39c
1
, c 4
, c
589.8089.5986.2191.2186.2191.2488.8891.6891.78c
3
, c 4
, c
585.3885.3885.4785.4085.4885.3385.2285.4385.40c
1 , c
2
, c
3
, c
489.9489.7087.8491.4789.1391.5989.1392.2192.33c
1
, c
2
, c
3
, c 589.7089.4287.5391.4289.0491.2988.9292.3292.42c
1
, c
2
, c
4
, c 589.7289.4987.3791.4888.9491.4089.0092.2592.26c
1
, c
3
, c
4
, c 588.5788.4586.3091.0987.0390.9887.8191.7891.87c
2
, c
3
, c
4
, c 586.0786.1185.8786.2586.2686.1385.9386.6686.80c
1
, c
2
, c
3
, c 4
, c
588.8188.5486.6391.3386.5991.2188.1092.2192.33Table 3: Classification accuracy
of texture images using different combination methods {2
textures} The best E RR values of WS, FI, DS0, DS1 and DS2 are determined
according to Eq.
21. Since WS has been widely used in the literature, and it outperforms
other conventional
methods {Av, Md, Mx, and MV}, as observed in Table 2, then we will use it
as a represen-
tative of the conventional methods when performing the comparison with FI,
DS0, DS1 and
DS2. Figure 3a shows the E RR values when 2 classes are considered. It is
clear that the maximum E RR values of these five combination methods are
very close, ranging between
14045 to 16045. They are obtained by combining the best two classifiers
for WS, FI and DS0, while DS1 and DS2 use three classifiers to obtain their
maximum E RR. As mentioned
earlier, The performance of the first four individualclassifiers is weaker
than that of the
E n. Notice that, for both DS1 and DS2, there is no significant degradation
in E RR as the
number of combined classifiers increases.
For the case of 3 classes, bothDS1 and DS2 outperform other combination
methods
in terms of the maximum E RR. They achieve values of 17:3045 and 19:6045
respectively,
compared to 11:4045 or less for other methods as shown in Figure 3b. In
addition, E RR of
DS1 and DS2 are not affected as the number of combined classifiers increases.
For the case of 5 classes, the maximum E RR values sorted in a descending
order are:
DS1 50:7045, DS2 40:2045, FI 31:6045, WS 28:1045, and DS0 23:8045,
as shown in Figure 3c. In
addition, E RR values of DS1 improve as the number of combined classifiers
increases, DS2 is the second best, while E RR values of other methods degrade
as the number of combined
classifiers increases. For the case of 9 classes, the superiority of DS1
becomes clearer, where
as shown in Figure 3d, the maximum E RR value of DS1 is 54045 compared
to 37:5045 or less for other methods. It is worth mentioning that even though
the maximum E RR values of
other methods degrade, they still perform better than the best single classifier.
This leads
us to conclude that as the number of classes increases, the performance
of most classifier
combination methods gets better overall.
345Al-Ani & Deriche-30-20-1001020WSFIDS0DS1DS2-20-10010203022.533.544.5501020304050ERRNo.
of combined classifiers22.533.544.551020304050{a} 2 classes{b} 3 classes{c}
5 classes{d} 9 classesFigure 3: E RR of different classifier combination
methods obtained by considering different number of classifiers for the cases:
{a} 2 classes, {b} 3 classes, {c} 5 classes, and
{d} 9 classes
Taking all these facts into consideration, we can sort the methods in a
descending order asfollows: DS1, DS2, FI, WS, DS0, andthe other conventional
methods. Thus, in summary,
for the problem of texture classification, our proposed technique with its
two alternatives {DS1and DS2} clearly outperforms other standard combination
methods with an increase in classification accuracy of about 2 000 7045.
For the cases of 2 and 3 classes, there is a little
difference in performance between DS1 and DS2. This is because using reference
vectors of
small size, 2 002 1 and 3 002 1, does not make a big impact upon the estimation
of evidence
compared to that obtained using a single reference value. As the size of
the reference vector
increases, 5 002 1 and 9 002 1 for the other two cases, its impact on
estimating the evidence
becomesclearer, which leads to better results, but at the cost of increasing
computational
load.
5.2 Speech Segment Classification Six different input feature sets have
been used to classify speech segments according to their
manner of articulation, these were: 13 mel-frequency cepstral coefficients
{MFC}, 16 log
mel-filter bank {MFB}, 12 linear predictive cepstral coefficients {LPC},
12 linear predictive 346A New Technique for Combining Multiple ClassifiersNo.
of classesMFCMFBLPCLPRWVTARP388.2190.9881.6480.6990.6470.87683.1685.5074.7774.0684.3362.90978.4883.2471.6470.0381.3356.66Table
4: Speech segment classification accuracy of the six original classifiers
for different
number of class labels
reflectioncoefficients {LPR}, 10 wavelet energy bands {WVT}, and 12 autoregressive
model
parameters {ARP}. For this experiment, speech was obtained from the TIMIT
database {MIT, SRI, & TI, 1990}. Segments of 152 speakers {56456 segments}
were used to train
the ANNs, and 52 speakers {19228 segments} to test them. Three cases were
considered:
3 classes {vowel, consonant, and silence}, 6 classes {vowel, nasal, fricative,
stop, glide, and silence}, andfinally 9 classes {vowel, semi-vowel, nasal,
fricative, stop, closure, lateral,
rhotic, and silence}. The classification results for these three cases are
summarized in Table
4.No. of classesWSAvMdMxMVFIDS0DS1DS2390.8090.4190.2086.1589.5190.6590.9091.5791.31685.5484.9184.6281.1684.0385.2985.1887.1886.37983.0582.3181.9375.6381.0082.7382.8685.2084.22Table
5: Overall performance of the various combination methods for different number
of class labels {speech segment classification}
The two best individualclassifiers are MFB and WVT in all three cases, followedby
MFC then other methods. Unlike texture classifiers that had one good classifier
and four,
relatively, weak classifiers, we have here three good classifiers {MFB,
MFC and WVT} and
three weak classifiers {LPC, LPR and ARP}.
The overall performance values of the various combination methods are displayed
in
Table 5. For the case of the 3 classes, it can be seen that the overall
performance of DS1 is
better than that of DS2 and they both outperform the other methods. This
becomes even
clearer as the number of classes increases {with more than 2045 increase
in accuracy}.
The E RR values for the case of 3 classes are shown in Figure 4a. The maximum
E RR
value of DS1 is 23:4045, whichis achieved by combining all six classifiers,
compared to 20:3045
for DS2 and 19:6045 or less for the other methods. The gap between DS1
and the other methods gets wider when we consider 6 and 9 classes as shown
in Figures 4b and 4c. Because there are more good classifiers in this experiment
compared to that of the texture
experiment, the variations of the E RR values when the number of classifiers
increases are found to be smaller. In addition, we can see that as the number
classes increases DS1 keeps
its steady and superior performance in terms of E RR with more than 10045
increase.
As a summary, DS1 outperforms other methods in terms of overall performance
and
E RR measurements. It is followed by DS2, WS, and the rest of the methods.
347Al-Ani & Deriche234565101520252345610152025WSFIDS0DS1DS223456510152025No.
of combined classifiersERR{a} 3 classes{b} 6 classes{c} 9 classesFigure 4:
E RR of different classifier combination methods obtained by considering
different number of classifiers for the cases: {a} 3 classes, {b} 6 classes,
and {c} 9 classes
5.3 Speaker Identification
Three limited-scope experiments were carried out to perform speaker identification
using 2,
3, and 4 speakers. Speech data from the TIMIT database was also used {MIT
et al., 1990}.
The number of training patterns were 3232, 4481 and 5931 respectively, and
the number of testing patterns were 1358, 1921 and 2542 respectively. The
same features used to classify
speech segments according to their manner of articulation were used to identify
speakers.
Classification results of the six classifiers are shown in Table 6. The
performance of the individual classifiers are not quite similar to the speech
segment problem, where the three
good classifiers are: MFB, MFC and LPC and the three weak classifiers are:
LPR, WVT
andARP. The overall performance of the various combination methods are shown
in Table 7. For
the case of 2 classes, it is clear that the overall performance of most
combination methods is very comparable. The superiority of DS1, and to a
lesser degree DS2, becomes clear as
the number of classes increases {more patterns were includedto estimate
evidence}.
Note that, becauseof the high performance of individual classifiers for
the case of 2
classes, a small difference in the performance of combination methods will
have great impact
on E RR, which explains the graphs' fluctuations, as shown in Figure 5a.
It can be seen 348A New Technique for Combining Multiple ClassifiersNo. of
classesMFCMFBLPCLPRWVTARP294.5896.1792.4989.6087.8084.55385.8487.2582.2081.0074.3973.03485.0185.9680.8477.9770.9364.59Table
6: Speaker identification accuracy of the six original classifiers with different
number of speakersNo. of classesWSAvMdMxMVFIDS0DS1DS2295.5395.5095.2695.3695.2195.2595.4695.4895.45390.8090.4190.2086.1589.5190.6590.9091.5791.31483.0582.3181.9375.6381.0082.7382.8685.2084.22Table
7: Overall performance of the various combination methods for different number
of
class labels {speaker identification}
that both maximum E RR and overall performance of most combination methods
are close. These results do not favor DS1 nor DS2, because they have an additional
computational cost. Let's now consider the case of 3 classes, Figure 5b shows
that the maximum E RR
of DS2 is the highest followed by DS1, and they both outperform the other
methods. For
the case of 4 classes, the maximum E RR of DS1 is 30045, compared to 27045
or less for other
methods, as shown in Figure 5c. The figure also shows that E RR values of
DS2 and WS are close. However, as the overall performance of DS2 is better
than that of WS, DS2 can
be considered as the second best method followed by WS, DS0 and finally
FI.
The above results clearly show how the performance of DS1 and DS2 get affected
by the
number of training patterns, which is crucial in achieving good estimation
of the evidence of
each classifier. This is very clear for the case of 2 speakers. Their performance,
however, get
better as the number of speakers and training patterns increase. In other
words, DS1 and
DS2 require a larger number of patterns to work properly. Failing to provide
such number
of patterns, other conventional methods, such as WS, can achieve similar
performance.
The experiments of textures, speech segments and speaker classification
show that our
proposed technique clearly outperforms the other methods in terms of overall
performance
and E RR, providing that a sufficient number of patterns to estimate evidence
of classifiers exists. Also, among the different combination methods, DS1
and DS2 are the least effected
by the inclusion of weak classifiers. The experiments also show that the
BBAs could be
better estimated using reference vectors rather than reference values, especially
for large number of classes.
It is worth mentioning that each one of the combination methods has its
own merit. For example, the MV is very useful combination method when dealing
with classifiers that
produce results of the abstract level. When working in the measurement level,
other com-
bination methods could have better performance.
TheMx method can provide good results when the performance of the combined
classi-
fiers are close. In such case, the classifier with higher confidenc e can
provide better results
349Al-Ani & Deriche23456-50510152023456152025303540WSFIDS0DS1DS22345651015202530No.
of combined classifiersERR{a} 2 classes {b} 3 classes {c} 4 classes Figure
5: E RR of different classifier combination methods obtained by considering
different number of classifiers for the cases: {a} 2 classes, {b} 3 classes,
and {c} 4 classes
than any individual classifier. This is shown in Tables 11-13 {refer to
Appendix B}, where
good results are achieved when combining the best two or three classifiers
of the speech segment experiment compared to the best individual classifier.
However, if there is a clear
difference in the performance of classifiers, as in the case when considering
mixtures of good and bad classifiers, then using Mx to combine the classification
results will not be a good
choice. In case we don't have any information about the performance of the
classifiers, i.e.,
there is no training dataset, the Av and Md methods could provide an attractive
choice. Similar to the findings of {Kittler et al., 1998; Alkoot & Kittler,
1999}, the performance of
these two methods are found to be close with slight favor of the Av method.
If the clas-
sification accuracy of the different classifiers are available, then the
WS method represents a good choice, where it outperforms Av in almost all
the conducted experiments. This is
expected, as associating each classifier with a weight that reflects its
performance, would
make the better classifier contributes more towards the final decision.
If the performance
of the combined classifiers are very close, then combining their results
using both the Av and WS methods would lead to very similar performance,
as shown in Tables 11-13 for the
cases of combining the best two and three speech segment classifiers.
The FI and DS0 represent two non-linearcombination methods. According to
{Cho &
Kim, 1995a}, the performance of FI was slightly better than the WS when
tested using an
350A New Technique for Combining Multiple Classifiers
optical character recognition database, whichis similar to the results we
obtained for the
texture experiments. However, for speech segment classification and speaker
identification experiments,the performance of FI was not as good as that
of WS. On the other hand, the
experiments conducted here show that WS slightly outperforms DS0. Note that
Rogova
{1994} onlycompared DS0 to the original classifiers. The main problem with
both FI
and DS0 is the appropriate estimation of their parameters. For example,
the desired sum of fuzzy densities affects the combination results of FI,
while the choice of the proximity
measure and reference vector plays an important role in the performance
of DS0.
DS1 and DS2 differ from DS0 by the appropriate measure of the reference
vectors, and hence the accurate estimation of the evidence of each classifier.
This will exploit the complementary information provided by the different
classifiers. In other words, the
accurate estimation of evidence of each classifier will lead to minimizing
the MSE of the
combined results, and hence resolving the conflicts between classifiers.
6. Conclusion
We have developed in this work a new powerful classifier combination technique
based on the D-S theory of evidence. The technique, based on adjusting the
evidence of different
classifiers by minimizingthe MSE of training data, gave very good results
in terms of overall
performance and error reduction rate. To test the algorithm, three experiments
were carried
out: texture classification, speech segments classification, and speaker
identification. All
of the experiments showed the superiority of the proposed technique when
compared to
conventional methods, fuzzy integral, and another D-S implementation that
uses a different measure of evidence. We have shown that accurate estimation
of the evidence from different
classifiers based on the whole output vectors {DS1} gives the best performance,
especially
for higher number of class labels. The only drawback of the algorithm is
that training
can be computationally expensive {this is used to accurately estimate the
evidence of each classifier}. However, this can be executed off-line, and
as such, has no ma jor effect on the
performance of the algorithm. We have also shown that the proposed algorithm
can easily
achieve an increase in classification accuracy of the order of 2045 to
7045 compared to other combination methods. We believe that with more work
on enhancing the technique, the
scheme can form a new framework for pattern classification in the future.
Acknowledgment
The authors wish to thank Dr. J. Chebil and Dr. M. Mesbah for their valuable
comments on the paper. The authors also acknowledge the support of Queensland
University of
Technology for the work presented in this paper. Dr. Deriche acknowledges
the support of
King Fahd University, Saudi Arabia, where he is currently on leave.
Appendix A. Classifier Combination Based on the Fuzzy Integral
Fuzzy integral is a non-linear combination method defined with respect to
a fuzzy measure. Detailed explanation of classifier combination based on
the g
025
fuzzy measure can be found
in the work of Cho & Kim {1995a, 1995b}. 351Al-Ani & Deriche For a finite
set of elements, Z , the g
025 fuzzy measure {Sugeno, 1977} is defined as the set
function g: 2 Z
! [0; 1] that satisfies the following conditions: 1. g{;} = 0; g{Z } = 1,
2. g{A} 024 g{B } if A 032 B ,
3. if fA i
g
1
i=1
is an increasing sequence of measurable sets, then lim
i!1
g{A
i
} = g{lim
i!1
A
i
},
4. g{A [ B } = g{A}+ g{B } + 025g{A}g{B }
for all A; B 032 Z and A \ B = ;, and for some 025 > 0001. Let h : Z
! [0; 1] be a fuzzy subset of Z . The fuzzy integral over Z of the function
h with respect to a fuzzy measure g
is defined by
h{z} ffi g{:} = max
E022Z
024
min
022
min
z2E
h{z}:g{E }
025 = max
ff2[0;1]
[min{ff; g{F
ff
})]; where
F
ff = fzjh{z} 025 ffg
Let Z = fz
1
; 001 001 001 z
n g, and suppose that h{z
1
}025 h{z
2
} 025 001 001 001 025 h{z
n
}, {if not, Z is rearranged
so that this relation holds}. Then a fuzzy integral e, with respect to a
fuzzy measure g over
Z can be computed by
e =
n max
i=1
[min{h{z
i
}; g{A
i
})]; where
A i
= fz
1
; 001 001 001 z
i
g
g{A
1
} = g{fz
1
g} = g
1
g{A
i
} = g
i
+ g{A
i0001
} + 025g
i
g{A
i0001
}; for 1 < i 024 n 025 is given by solving: 025 + 1 =
Q n
i=1
{1 + 025g
i
}, where 0252 {0001; 1} and 025 6= 0. This can be
calculated by solving an {n 000 1}
st degree polynomial and finding the unique root greater
than 0001. For the problem of combining classifiers, Z represents the set
of classifiers, A the ob ject
under consideration for classification, and h
k
{z
i
} is the partial evaluation of the ob ject A for class !
k
. Corresponding to each classifier z
i
, the degree of importance, g
i
, that reflects how
good is z
i
in the classification of class !
k
must be given. These densities can be induced
from a training dataset.
352A New Technique for Combining Multiple Classifiers
Appendix B. Tables of Classification Accuracy for Different Combination
MethodsClassifiersWSAvMdMxMVFIDS0DS1DS2c
1
, c
290.8990.7990.7990.8390.1790.2290.8191.0990.81c
1
, c
589.6989.5789.5789.6889.1088.9989.9290.4590.29c 4
, c
585.0585.0585.0584.6784.8185.1385.5185.4985.44c
1
, c
2
, c
389.9289.8388.0490.4887.7789.8889.5991.2691.73c 1
, c
2 , c
589.5589.3987.5490.2387.0989.5689.1891.1291.10c
1
, c 4
, c
589.0588.8086.9589.6286.6289.2688.8090.7790.73c
3 , c
4
, c
585.7685.7385.4984.8185.4985.6785.9187.2786.88c 1
, c
2 , c
3
, c
489.2989.0787.9890.1787.9189.8588.8291.4491.51c 1
, c
2 , c
3
, c
589.0788.8387.2490.0187.5089.5888.8191.4891.53c 1
, c
2 , c
4
, c
588.8788.7887.3490.0287.6189.6488.6691.1691.26c 1
, c
3 , c
4
, c
588.6088.3887.0589.3987.5289.4488.3991.3191.00c 2
, c
3 , c
4
, c
586.4586.4486.1885.5286.3186.3686.4688.3787.20c 1
, c
2 , c
3
, c
4
, c
588.5488.4787.0689.7387.0389.6588.2991.5091.61Table 8: Classification accuracy
of texture images using different combination methods {3 textures}ClassifiersWSAvMdMxMVFIDS0DS1DS2c
1 , c
291.9891.8791.8790.2889.8992.2191.4593.4092.81c
1
, c
591.9591.7491.7490.7289.5891.3091.5593.2792.72c
4 , c
585.1485.1985.1984.7784.5084.9485.8286.2885.36c
1
, c 2
, c
391.4991.2988.7890.3988.6092.3591.0894.3793.19c
1 , c
2
, c
590.7790.5087.9290.3487.7491.5490.3293.6692.96c 1
, c
4 , c
590.6190.3587.6790.2987.4391.4190.2593.6992.90c
3
, c 4
, c
586.2986.2285.4085.9985.6485.9186.9889.1687.29c
1 , c
2
, c
3
, c
490.2690.0388.3790.1288.9091.8689.9394.4893.19c
1
, c
2
, c
3
, c 590.1489.9188.1790.2589.0091.8390.0994.4193.26c
1
, c
2
, c
4
, c 589.7389.4387.6290.1988.4591.1189.4193.8292.56c
1
, c
3
, c
4
, c 590.3290.0687.9190.5088.5191.8390.2094.5093.33c 2
, c
3 , c
5
, c
686.4186.3885.9985.9885.7986.1587.2189.4786.97c 1
, c
2 , c
3
, c
4
, c
589.6589.3987.2790.1187.8191.2289.5194.4693.01Table 9: Classification accuracy
of texture images using different combination methods {5
textures} 353Al-Ani & DericheClassifiersWSAvMdMxMVFIDS0DS1DS2c
1
, c
288.4588.1388.1385.4485.9189.7889.2691.4289.20c
1
, c
586.3985.9785.9782.2783.0088.1887.1090.0788.08c 4
, c
579.5579.4079.4078.7478.5379.3779.2881.2279.91c
1
, c 2
, c
387.1086.5384.2084.6584.6089.5187.0692.0088.59c
1
, c 2
, c
586.4885.8583.6684.0884.1788.7286.9292.1388.53c
1
, c 4
, c
585.9285.3283.2282.9783.6788.2585.9892.0188.30c
3
, c
4
, c
580.2780.2179.6379.5479.3879.9779.8082.4880.89c
1
, c
2
, c
3
, c 486.4285.9584.5684.3385.1689.1885.9692.4588.70c
1
, c
2
, c
3
, c 585.7085.1183.3783.9184.2688.5285.4192.0288.30c
1
, c
2
, c
4
, c 585.7985.3984.2983.8384.9188.5285.9492.4888.66c 1
, c
3 , c
4
, c
585.4284.9283.0683.4784.0888.3484.8192.3588.35c 2
, c
3 , c
4
, c
581.6081.4481.1580.7581.0381.5481.0284.7082.17c 1
, c
2 , c
3
, c
4
, c
585.3384.8883.1983.7483.3588.0284.8392.4388.65Table 10: Classification accuracy
of texture images using different combination methods {9
textures}ClassifiersWSAvMdMxMVFIDS0DS1DS2c
1
, c 292.3492.3492.3492.3492.2291.8492.3892.4592.29c
1
, c
689.9988.1188.1183.9582.6491.0690.9291.4891.42c
5
, c
681.0980.6580.6576.1976.1082.0381.5482.2982.01c 1
, c
2 , c
392.6392.6192.3792.3492.2792.3692.6392.7992.67c
1
, c
2
, c 692.1791.7991.8385.9791.6092.0392.2792.5392.23c
1
, c
5
, c
689.8588.7988.2084.1787.5089.6690.3491.9291.79c
4
, c 5
, c
684.9884.9684.7679.3684.0884.8984.6285.6585.36c
1 , c
2
, c
3
, c
492.5992.5792.4291.6492.0492.3892.6192.7792.64c
1
, c
2
, c
3
, c 692.6292.4792.5086.9492.0292.4992.7392.7892.54c 1
, c
2 , c
5
, c
692.2591.9391.8286.0191.9692.1492.3792.7792.49c 1
, c
4 , c
5
, c
690.1589.5189.3484.6489.8489.4890.1291.9291.84c 3
, c
4 , c
5
, c
688.7988.4288.3083.2588.4688.4688.7690.5189.94c 1
, c
2 , c
3
, c
4
, c
592.7592.6492.2391.4292.1392.4992.7493.0792.81c 1
, c
2 , c
3
, c
4
, c
692.5792.4892.3086.9691.9292.2592.5692.7792.56c
1
, c
2
, c
3
, c 5
, c
692.7392.5092.2086.9391.9192.5392.7693.0392.72c
1
, c
2
, c
4
, c 5
, c
692.1491.7091.0786.1990.9791.7192.2392.7892.46c
1 , c
3
, c
4
, c
5
, c
691.4191.0490.6385.7990.4190.9891.4692.4892.14c
2
, c
3
, c
4
, c
5
, c
691.5190.9890.5285.7390.6091.1591.5092.6792.40c
1
, c 2
, c
3 , c
4
, c
5
, c
692.6292.3692.2486.9591.9792.3892.6193.0992.64Table 11: Classification accuracy
of speech segments using different combination methods {3 classes}
354A New Technique for Combining Multiple ClassifiersClassifiersWSAvMdMxMVFIDS0DS1DS2c
1
, c 286.9786.9586.9586.4686.3086.6786.6488.1287.59c
1
, c
684.3081.9881.9877.9776.6685.0184.5286.7186.39c
5
, c
674.6073.9173.9170.3870.2075.0074.2076.4775.68c 1
, c
2 , c
388.0988.0887.7787.1187.7687.5187.4988.7388.10c
1 , c
2
, c
686.4885.7686.0180.3885.3186.4986.4588.1987.50c 1
, c
5 , c
684.3382.8081.8878.6481.2483.8884.5687.1686.45c
4
, c
5
, c
679.2778.7377.8474.0777.2678.4678.5180.1779.39c
1
, c
2
, c
3
, c 488.0387.9187.8186.5887.2987.6787.4888.9888.09c 1
, c
2 , c
3
, c
687.6487.1787.2682.5986.9587.3787.2588.6188.08c 1
, c
2 , c
5
, c
686.5185.8485.9080.5486.1086.6986.5988.4787.57c 1
, c
4 , c
5
, c
684.6183.6483.2879.4783.8583.9784.2987.3886.68c 3
, c
4 , c
5
, c
683.8483.1282.8479.2883.0583.2283.5686.0084.95c 1
, c
2 , c
3
, c
4
, c
587.9087.8787.4486.0187.3587.7287.3489.1588.15c
1
, c
2
, c
3
, c
4
, c 687.5787.2486.9882.7686.8587.3687.0888.9588.11c
1
, c
2
, c
3
, c 5
, c
687.6987.1786.9182.6486.9687.5287.3288.9388.12c
1 , c
2
, c
4
, c
5
, c
686.7186.0085.5481.2185.6386.4686.3988.6487.56c
1
, c
3
, c
4
, c
5
, c
686.3685.8585.2681.5085.5786.1085.9588.2687.38c
2
, c 3
, c
4 , c
5
, c
686.7085.9585.1881.6385.2386.0685.8188.3787.28c 1
, c
2 , c
3
, c
4
, c
5
, c
687.6787.3387.0282.7686.9587.3586.9889.1488.01Table 12: Classification accuracy
of speech segments using different combination methods
{6 classes}
355Al-Ani & DericheClassifiersWSAvMdMxMVFIDS0DS1DS2c
1
, c 284.5884.6184.6183.7783.6684.0684.1586.1185.60c
1
, c
681.8578.5978.5971.4870.5082.5382.5184.4183.72c
5
, c
670.4869.0469.0462.8762.6571.2269.9773.2471.94c 1
, c
2 , c
385.4185.4285.0083.8185.0285.0085.3486.8586.15c
1
, c
2
, c 684.2083.4483.5374.1882.7084.1184.0186.0585.44c
1
, c
5
, c
682.1880.2679.1972.3277.8681.2082.5385.1084.56c
4
, c 5
, c
675.9475.2374.4367.2373.4675.3075.5478.2176.92c
1 , c
2
, c
3
, c
485.4485.3085.2083.4184.9685.1185.1587.0586.13c 1
, c
2 , c
3
, c
685.3484.9784.9976.4984.8285.1985.1486.7986.16c 1
, c
2 , c
5
, c
684.4883.7783.7774.6183.9884.5884.5386.6785.86c 1
, c
4 , c
5
, c
682.4181.4481.0373.6181.4981.6182.4485.6584.76c 3
, c
4 , c
5
, c
681.0280.5180.1572.7479.8380.0680.4183.8682.25c 1
, c
2 , c
3
, c
4
, c
585.5285.4384.9482.9384.7785.4585.4287.3286.32c 1
, c
2 , c
3
, c
4
, c
685.5385.0184.4177.0484.5384.9685.0087.0986.09c
1
, c
2
, c
3
, c 5
, c
685.5485.1684.7676.6284.6685.3885.2587.1186.22c
1
, c
2
, c
4
, c 5
, c
684.6084.0383.3475.5183.6984.1784.4286.8285.78c
1 , c
3
, c
4
, c
5
, c
683.9783.3282.7475.8083.1683.6783.9186.5985.45c
2
, c
3
, c
4
, c
5
, c
684.1483.4882.3275.4082.6683.2983.4486.5585.56c
1
, c 2
, c
3 , c
4
, c
5
, c
685.3284.9384.7077.0884.5185.0785.1987.2685.23Table 13: Classification accuracy
of speech segments using different combination methods
{9 classes} 356A New Technique for Combining Multiple ClassifiersClassifiersWSAvMdMxMVFIDS0DS1DS2c
1
, c 296.1096.1096.1096.1096.1095.1696.0295.5195.80c
1
, c
695.3894.9594.9594.9594.9596.2595.3696.1796.17c
5
, c
690.2590.6990.6990.8390.8387.5189.3288.5988.66c 1
, c
2 , c
396.6896.6896.9096.9796.9096.6196.9896.9196.54c
1 , c
2
, c
695.8195.7495.2395.3895.2396.1795.7395.5195.88c 1
, c
5 , c
694.6694.7394.3795.0994.5194.9594.6296.2496.17c
4
, c
5
, c
692.2092.1391.7792.6491.7091.4891.7591.1691.16c
1
, c
2
, c
3
, c 496.9096.9096.9796.6896.6196.6197.0596.6996.69c
1
, c
2
, c
3
, c
697.0496.8296.9796.1096.5396.9796.9196.7696.69c
1
, c
2
, c 5
, c
695.8895.8895.7495.1695.9696.1795.9595.8095.80c
1 , c
4
, c
5
, c
695.8895.8195.5295.5295.4594.9595.9595.8896.24c
3
, c
4
, c
5
, c 695.0294.9594.3094.0194.3094.3095.1494.5594.40c
1
, c
2
, c
3
, c 4
, c
596.8296.7596.2596.5396.2596.8296.6996.7696.61c
1 , c
2
, c
3
, c
4
, c
696.1096.1095.4596.3295.4596.2596.0296.2496.10c
1
, c
2
, c
3
, c
5
, c
696.5396.4696.1795.8896.1096.7596.4796.7696.54c
1
, c 2
, c
4 , c
5
, c
695.7495.6795.3895.6095.3195.8195.8896.0295.95c 1
, c
3 , c
4
, c
5
, c
696.3996.3295.6096.3995.5295.9696.2496.3996.24c
2
, c
3
, c
4
, c 5
, c
695.1695.2395.0295.4594.9595.1695.1495.5895.58c
1 , c
2
, c
3
, c
4
, c
5
, c
696.5396.6196.5396.1796.2595.9696.5496.5496.39Table 14: Speaker identification
accuracy using different combination methods {2 speakers}
357Al-Ani & DericheClassifiersWSAvMdMxMVFIDS0DS1DS2c
1
, c 289.2289.1789.1788.7088.9187.8289.2289.3389.17c
1
, c
688.9188.5588.5587.6185.5889.3888.8689.9089.95c
5
, c
680.3780.9080.9079.4477.9375.9079.4480.3779.85c 1
, c
2 , c
390.5390.4790.3289.2890.0689.8090.4790.5891.20c
1
, c
2
, c
690.3790.5889.9088.6590.0689.4390.5890.3290.21c
1
, c
5
, c
688.7088.6087.3587.2586.9388.2988.9690.1190.58c
4
, c 5
, c
684.5484.1783.6083.2483.3483.0883.8684.6485.06c
1 , c
2
, c
3
, c
491.1091.1090.5889.5990.3290.9491.3191.7891.51c 1
, c
2 , c
3
, c
691.2591.5191.3688.7090.0690.8991.6291.6292.04c 1
, c
2 , c
5
, c
690.6390.4790.4788.8188.3990.1690.5890.9490.21c 1
, c
4 , c
5
, c
690.3790.3289.4887.8788.5589.5490.4791.5791.46c 3
, c
4 , c
5
, c
686.8886.7885.9484.6484.5485.7986.1587.4087.35c 1
, c
2 , c
3
, c
4
, c
591.2091.2090.5889.5989.5990.8491.7291.9391.78c 1
, c
2 , c
3
, c
4
, c
691.3691.3191.2588.9690.3290.8991.3691.9392.40c 1
, c
2 , c
3
, c
5
, c
690.9991.1590.5888.8190.0690.7391.2091.6291.88c
1
, c
2
, c
4
, c 5
, c
691.4191.6790.9988.8690.4790.8991.4691.8392.09c
1
, c
3
, c
4
, c 5
, c
691.0590.9989.6988.3489.5989.8591.0491.8391.41c
2 , c
3
, c
4
, c
5
, c
690.3790.2789.2288.2488.8188.9689.3890.4290.32c
1
, c
2
, c
3
, c
4
, c
5
, c 691.5191.3691.7888.9190.8991.4691.4691.9391.98Table 15: Speaker identification
accuracy using different combination methods {3 speakers}
358A New Technique for Combining Multiple ClassifiersClassifiersWSAvMdMxMVFIDS0DS1DS2c
1
, c 287.4587.5387.5387.2986.9886.5587.6987.7387.41c
1
, c
684.7883.7983.7983.4081.5585.4883.8785.3784.78c
5
, c
674.0074.4774.4772.0371.2869.8371.9973.2973.13c 1
, c
2 , c
389.1889.1888.2488.1687.9288.3289.1089.3888.87c
1
, c
2
, c 687.9688.0887.5386.3587.3387.2587.6588.0887.41c
1
, c
5
, c
685.6085.4183.3283.5283.1284.6684.6286.1585.13c
4
, c 5
, c
680.8480.6879.5877.1878.7680.2978.8081.5180.68c
1 , c
2
, c
3
, c
489.7789.8589.6588.0488.5989.6589.6190.1789.61c
1
, c
2
, c
3
, c
689.2689.2288.7187.2988.2088.7588.9189.5088.71c
1
, c 2
, c
5 , c
687.8487.3386.9086.3587.1087.3787.6588.3287.69c
1
, c
4
, c
5
, c
687.0686.7486.2383.9985.4486.8286.6688.1687.14c
3
, c
4
, c
5
, c
684.7484.3884.3080.7283.0184.0783.8385.6884.66c
1
, c
2
, c
3
, c
4
, c
589.6189.6189.0288.0088.4789.1889.6190.0189.73c
1
, c
2
, c
3
, c
4
, c
689.7789.7388.8387.4588.8789.0689.6189.9389.50c
1
, c
2
, c
3
, c
5
, c
689.2689.1488.3687.2988.0488.5589.1889.7388.59c
1
, c 2
, c
4 , c
5
, c
689.1089.0688.2086.7488.3688.3688.9589.6588.95c 1
, c
3 , c
4
, c
5
, c
688.0087.8886.9885.2187.4587.9287.8889.3889.02c
2
, c
3
, c
4
, c 5
, c
688.9988.8787.2185.8087.0688.0087.9289.2688.75c
1 , c
2
, c
3
, c
4
, c
5
, c
689.4289.2688.9187.3788.2489.2289.5489.8989.73Table 16: Speaker identification
accuracy using different combination methods {4 speakers}
359Al-Ani & Deriche References
Aha, D. {1995}. Machine learning. Tutorial presented at The 1995 Artificial
Intelligence and Statistics Workshop.
Alkoot, F., & Kittler, J. {1999}. Experimental evaluation of expert fusion
strategies. Pattern
Re c o gnition Letters, 20, 1361{1369. Chen, K., & Chi, H. {1998}. A method
of combining multiple classifiers through soft com-
petition on different feature sets. Neuro c omputing, 20, 227{252.
Cho, S., & Kim, J. {1995a}. Combining multiple neural networks by fuzzy
integral for robust
classification. IEEE Tr ansactions on Systems, Man and Cybernetics, 25,
380{384.
Cho, S., & Kim, J. {1995b}. Multiple networks fusion using fuzzy logic.
IEEE Tr ansactions
on Neural Networks, 6, 497{501. Denoeux, T. {2000}. A neural network classifier
based on Dempster{Shafer theory. IEEE
Tr ansactions on Systems, Man and Cybernetics, 30, 131{150.
Dietterich, T. {1999}. An experimental comparison of three methods for constructing
en-
sembles of decision trees: Bagging boosting and randomization. Machine Le
arning,
40, 139{158.
Hansen, L., & Salamon, P. {1990}. Neural network ensembles. IEEE Tr ansactions
on Pattern Analysis and Machine Intel ligence, 12, 993{1001. Hashem, S.,
& Schmeiser, B. {1995}. Improving model accuracy using optimal linear com-
binations of trained neural networks. IEEE Tr ansactions on Neural Networks,
6,
792{794. Ho, T., Hull, J., & Srihari, S. {1994}. Decision combination in
multiple classifier system.
IEEE Tr ansactions on Pattern Analysis and Machine Intel ligence, 16, 66{75.
Kittler, J., Hatef, M., Duin, R., & Matas, J. {1998}. On combining classifiers.
IEEE
Tr ansactions on Pattern Analysis and Machine Intel ligence, 20, 226{239.
Kohlas, J., & Monney, P. {1995}. A mathematical theory of hints. An appro
ach to the
Dempster-Shafer theory of evidence. Berlin: Springer{Verlag.
Lam, L., & Suen, C. {1995}. Optimal combinations of pattern classifiers.
Pattern Re c o gnition L etters, 16, 945{954. Liu, W., & Bundy, A. {1992}.
The combination of different pieces of evidence using incidence
calculus. Tech. rep. RP 599, Dept. of Artificial Intelligence, Univ. of
Edinburgh.
Mandler, E., & Schurmann, J. {1988}. Combining the classification results
of independent
classifiers based on the dempster{shafer theory of evidence. In Gelsema,
E., & Kanal,
L. {Eds.}, Pattern re c o gnition and artificial intel ligence, pp. 381{393.
North-Holland.
MIT, SRI, & TI {1990}. DARPA TIMIT acoustic-phonetic continuous speech corpus..
http://www.ldc.upenn.edu/doc/TIMIT.html.
Rogova, G. {1994}. Combining the results of several neural network classifiers.
Neural
Networks, 7, 777{781.
Shafer, G. {1976}. A mathematical theory of evidence. Princeton University
Press.
360A New Technique for Combining Multiple Classifiers
Smets, P. {1990}. The combination of evidence in the transferable belief
model. IEEE
Tr ansactions on Pattern Analysis and Machine Intel ligence, 12, 447{458.
Smets, P. {1998}. The transferable belief model for quantified belief representation.
In
Gabbay, D., & Smets, P. {Eds.}, Handbo ok of defeasible re asoning and uncertainty,
pp. 267{301. Kluwer. Smets, P. {2000}. Data fusion in transferable belief
model. In 3rd Intl. Conf. Information
Fusion, pp. 21{33.
Sugeno,M. {1977}. Fuzzy measures and fuzzy integrals: a survey. In Gupta,
M., Saridis, G., & Gaines, B. {Eds.}, Fuzzy automata and decision pro c esses,
pp. 89{102. Amsterdam:
North-Holland. USC {1981}. USC-SIPI image database.. http://sipi.usc.edu/services/database/.
Xu, L., Krzyzak, A., & Suen, C. {1992}. Methods of combining multiple classifiers
and their
applications to handwriting recognition. IEEE Tr ansactions on Systems,
Man and
Cybernetics, 22, 418{435.
361