To appear, International Conference on Artificial Intelligence, Expert Systems and Neural Networks (AIE'96), August 1996.

Value Difference Metrics for Continuously Valued Attributes

D. Randall Wilson, Tony R. Martinez
e-mail: randy@axon.cs.byu.edu, martinez@cs.byu.edu
Computer Science Department, Brigham Young University, Provo, UT 84602, U.S.A.

Key words: instance-based learning, generalization, distance metrics, nearest neighbor

Abstract. Nearest neighbor and instance-based learning techniques typically handle continuous and linear input values well, but often do not handle symbolic input attributes appropriately. The Value Difference Metric (VDM) was designed to find reasonable distance values between symbolic attribute values, but it largely ignores continuous attributes, using discretization to map continuous values into symbolic values. This paper presents two heterogeneous distance metrics, called the Interpolated VDM (IVDM) and Windowed VDM (WVDM), that extend the Value Difference Metric to handle continuous attributes more appropriately. In experiments on 21 data sets, the new distance metrics achieve higher classification accuracy in most cases involving continuous attributes.

1. Introduction

Nearest neighbor (NN) techniques [1][2][3], Instance-Based Learning (IBL) algorithms [4][5][6][7], and Memory-Based Reasoning methods [8] have had much success on a wide variety of applications. Such algorithms typically store some or all available training examples during learning. During generalization, a new input vector is presented to the system for classification and a distance function is used to determine how far each stored instance is from the new input vector. The stored instance or instances which are closest to the new vector are used to classify it.
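As a sketch of this generalization step, the following Python fragment classifies a new input vector by majority vote among its k nearest stored instances. The names are illustrative rather than taken from any of the cited systems, and the distance function is left pluggable:

from collections import Counter

def knn_classify(training_set, y, distance, k=1):
    """Classify input vector y using (vector, class) pairs in training_set."""
    # Measure how far each stored instance is from the new input vector.
    neighbors = sorted(training_set, key=lambda inst: distance(inst[0], y))
    # The k closest stored instances vote on the classification.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]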

There are many distance functions that can be used to decide which instance is closest to a given input vector, including Euclidean distance and Manhattan distance (defined in Section 2). These metrics work well for numerical attributes, but they do not appropriately handle symbolic (i.e., nonlinear and perhaps unordered) attributes.
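Both metrics reduce to per-attribute sums of differences, which is why they break down on unordered symbolic values: a subtraction such as 'red' minus 'blue' is meaningless. A minimal sketch of the two standard definitions:

def euclidean_distance(x, y):
    # Square root of the sum of squared per-attribute differences.
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan_distance(x, y):
    # Sum of absolute per-attribute differences.
    return sum(abs(a - b) for a, b in zip(x, y))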

The Value Difference Metric (VDM) was introduced by Stanfill & Waltz [8] to provide an appropriate distance function for symbolic attributes. The Modified Value Difference Metric (MVDM), introduced by Cost & Salzberg [9] and used in the PEBLS system [10], modifies the weighting scheme of the VDM. These distance metrics work well in many symbolic domains, but they do not handle continuous attributes directly; instead, they rely upon discretization, which often degrades generalization accuracy [11].
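To make the idea concrete before the formal definitions in Section 2, the sketch below computes a simplified, unweighted value difference between two symbolic values of a single attribute: it estimates the conditional class probabilities P(c | a = v) from training counts and sums the per-class differences raised to a power q. The full VDM of Stanfill & Waltz [8] adds a per-value weight term; here the function name is illustrative and both values are assumed to occur in the training set (so the counts are nonzero):

from collections import defaultdict

def value_difference(training_set, attr, v1, v2, q=2):
    # Count N(a,v,c): occurrences of value v for attribute attr with class c,
    # and N(a,v): total occurrences of value v for attribute attr.
    class_counts = defaultdict(lambda: defaultdict(int))
    value_counts = defaultdict(int)
    classes = set()
    for vector, label in training_set:
        v = vector[attr]
        class_counts[v][label] += 1
        value_counts[v] += 1
        classes.add(label)
    # Sum over classes of |P(c|v1) - P(c|v2)|^q.
    return sum(
        abs(class_counts[v1][c] / value_counts[v1]
            - class_counts[v2][c] / value_counts[v2]) ** q
        for c in classes
    )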

This paper presents two extensions of the Value Difference Metric that allow for more appropriate use of continuous attributes. Section 2 provides more background on the original VDM and subsequent extensions to it. Section 3 introduces the Interpolated Value Difference Metric (IVDM), and Section 4 presents the Windowed Value Difference Metric (WVDM). These two sections present empirical results showing that IVDM and WVDM provide a significant improvement in classification accuracy over VDM on several data sets. Section 5 draws conclusions and outlines areas of future research.

2. Background: Value Difference Metric (VDM)

The learning systems described in this paper address the problem of classification. A system is presented with a training set T, which consists of n instances. Each instance has an input vector x and an output class c. The problem of classification is to decide what the output class of a new input vector y should be, based on what was learned from the training set, even if y did not appear in the training set.
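Under this notation, and matching the sketches above, T can be represented as a list of (input vector, output class) pairs; the values below are hypothetical and mix continuous and symbolic attributes, the heterogeneous case the metrics in this paper are designed for:

# Hypothetical training set T with mixed continuous and symbolic attributes.
# Euclidean and Manhattan distance cannot handle the symbolic middle
# attribute, which is what motivates the VDM family of metrics.
T = [
    ((0.5, 'red', 3.2), 'classA'),
    ((1.1, 'blue', 2.7), 'classB'),
    ((0.9, 'red', 3.0), 'classA'),
]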