Harnad, S., Hanson, S.J. & Lubin, J. (1991) Categorical Perception and the Evolution of Supervised Learning in Neural Nets. In: Working Papers of the AAAI Spring Symposium on Machine Learning of Natural Language and Ontology (DW Powers & L Reeker, Eds.) pp. 65-74. Presented at Symposium on Symbol Grounding: Problems and Practice, Stanford University, March 1991; also reprinted as Document D91-09, Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH Kaiserslautern FRG.

Categorical Perception and the Evolution of Supervised Learning in Neural Nets

Stevan Harnad,
SJ Hanson*,** & J Lubin*
*Princeton Univ., **Siemens Res. Ctr.

harnad@cogsci.soton.ac.uk

Presented at 1991 AAAI Symposium on Symbol Grounding: Problem and Practice

ABSTRACT: Some of the features of animal and human categorical perception (CP) for color, pitch and speech are exhibited by neural net simulations of CP with one-dimensional inputs: When a backprop net is trained to discriminate and then categorize a set of stimuli, the second task is accomplished by "warping" the similarity space (compressing within-category distances and expanding between-category distances). This natural side-effect also occurs in humans and animals. Such CP categories, consisting of named, bounded regions of similarity space, may be the ground level out of which higher-order categories are constructed; nets are one possible candidate for the mechanism that learns the sensorimotor invariants that connect arbitrary names (elementary symbols?) to the nonarbitrary shapes of objects. This paper examines how and why such compression/expansion effects occur in neural nets.

1. Categorical Perception

One of the most remarkable properties of human perception is that it seems to carve the world at its joints. The physical signals that bombard our sensory surfaces do not give rise to a "blooming, buzzing confusion" but to relatively orderly experiences, segmented into "chunks" (Miller 1956) or categories. How does our brain sort things into categories on the basis of the sensory signals it receives?

A relevant phenomenon in human and animal perception that has received a good deal of attention is "categorical perception" (CP) (Harnad 1987): Equal-sized physical differences in the physical signals arriving at our sensory receptors are perceived as smaller within categories and larger between categories. For example, differences in wavelength within the range we call "yellow" are perceived as smaller than equal-sized differences that straddle the boundary between yellow and the range we call "green." The wavelength continuum has somehow been "warped," with some regions getting compressed and other regions getting stretched out.

In the case of color CP, although learning may have played a role, most of the warping seems to have been done by evolution, with the result that it is probably an inborn property of our sensory systems, modifiable only minimally (if at all) by experience. Other prominent examples of CP have been found in human speech perception as well as in some animal signalling systems (see chapters in Harnad 1987 for examples). These too seem to be largely innate, although they are modifiable by experience. Musical pitch categories may be examples of CP effects that arise primarily as a result of learning. CP effects have also been reported to occur purely as a result of learning in experiments with artificial continua; similar "warping" effects might be expected to arise from learning complex multidimensional categories, as in learning to sort baby chicks as male and female, or histological slides as cancerous or noncancerous.

The generation of CP (enhanced within-category similarity and enhanced between-category differences) by perceptual learning has been described as the "acquired similarity [difference] of cues" but no mechanism has been proposed to explain how or why it occurs.[1]

In this paper we will show how CP might arise as a natural side-effect of the means by which certain standard neural net models (backpropagation, Rumelhart & McClelland 1986) accomplish learning. They acquire the capacity to sort their inputs into the categories imposed by supervised learning through altering the pairwise distances between them (where distance is the degree to which a pair of inputs is discriminable by the net) until there is sufficient within-category compression and between-category separation to accomplish reliable categorization. As we shall see, however, the nets don't necessarily stop at a minimal degree of compression/separation; rather, they overshoot, producing much stronger CP effects than seem necessary to accomplish the categorization.

CP is of interest not only in its own right, as a very basic perceptual phenomenon, but also as a possible contributor to solving the "symbol grounding problem" (Harnad 1990): In a formal symbol system such as a computer program, or in the actual implementation of such a system on a machine, symbols are manipulated on the basis of formal rules or algorithms that apply to the shapes of the symbols, not their meanings (i.e., symbol manipulation is syntactic rather than semantic). The meanings of the symbols are projected onto them by the user who interprets the symbols and the symbol manipulations; they are not intrinsic to the system itself. By contrast, if, using the sensory projections on its transducer surfaces, a robot were able to discriminate and categorize the real-world objects, events and states of affairs to which its symbols can be interpreted as referring, then those symbols would be grounded in the robot's causal capacity rather than just being parasitic on the meanings an interpreter projects onto them.

So there is a close connection between the sensorimotor capacity to carve the world at its joints and the cognitive capacity to produce symbolic descriptions of that world: For the compressed and separated "chunks" of the similarity space originating from our sensory receptors can be given names, and those category names can then be combined syntactically to form propositions about the world. Whatever mechanism successfully maps the sensory projections onto their category names is also what grounds the symbol system.

What will be analyzed here is one possible candidate mechanism for mapping simple sensory inputs onto category names and, in particular, the dynamical role that the warping of similarity space characteristic of CP may be playing in its successful performance.

2. Learning to Split a Line.

Both the neural net architecture and the task used were very simple. A backpropagation net with 8 input units, 2 - 12 hidden units and 8 or 9 output units was used. The net's task was to learn to sort 8 "lines" into 2 categories (let us call them "short" and "long"). The lines were represented in 6 different ways, in order to test the effects of the input coding. One variable of interest was the "iconicity" of the coding (i.e., how analog, nonarbitrary, or structure-preserving it was in relation to what it represented).

The lines were either "place" coded (e.g., a line of length 4 would be 0 0 0 1 0 0 0 0) or "thermometer" coded (e.g., line 4 would be 1 1 1 1 0 0 0 0). The place code was assumed to be more arbitrary and the thermometer code more analog, in that the thermometer code preserved some multi-unit constraints whereas the place code did not. In addition, the thermometer-coded lines and the place-coded lines could be discrete-coded (as above) or they could be coarse-coded, allowing some gaussian spillover to adjacent units (e.g., line 4 coarse/place-coded might be 0 .001 .1 .99 .1 .001 0 0, and line 4 coarse/thermometer-coded might be .90 .99 .99 .90 .1 .001 0 0). Finally, because CP concerns the formation of boundaries between categories, a lateral inhibition coding was also tested, in which adjacent coarse-coded units were inhibited so as to enhance boundaries (e.g., line 4 lateral-inhibition/place-coded might be .1 .1 .001 .99 .001 .1 .1 .1, and line 4 lateral-inhibition/thermometer-coded would be .8 .9 .9 .99 .001 .1 .1 .1). Coarse coding was assumed to be more analog than the discrete binary coding, again because it preserved multi-unit constraints. Lateral Inhibition was likewise more analog than the discrete code, but also more complicated, because the width and placement of the boundary effects from the lateral inhibition could in principle help or hinder the formation of a CP boundary, depending on whether the two effects happened to be in or out of phase.
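The discrete and coarse codings just described can be sketched as follows. This is a minimal illustration, not the code used in the simulations: the function names, the gaussian width `sigma` and the ceiling `peak` are assumptions chosen only to give spillover values of roughly the size quoted above, and the lateral-inhibition variants are omitted because their envelope is not specified in the text.

```python
import numpy as np

N_UNITS = 8  # eight input units, one per possible line length


def place_code(length, n=N_UNITS):
    """Discrete place code: only the unit at position `length` is on."""
    v = np.zeros(n)
    v[length - 1] = 1.0
    return v


def thermometer_code(length, n=N_UNITS):
    """Discrete thermometer code: the first `length` units are on."""
    v = np.zeros(n)
    v[:length] = 1.0
    return v


def coarse_code(discrete, sigma=0.5, peak=0.99):
    """Coarse code: let each active unit spill over onto its neighbours.

    `sigma` (gaussian width) and `peak` (maximum activation) are
    hypothetical parameters, not values taken from the paper.
    """
    idx = np.arange(len(discrete))
    # kernel[i, j] is the gaussian spillover from unit j onto unit i
    kernel = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / sigma) ** 2)
    spread = kernel @ np.asarray(discrete, dtype=float)
    # Overlapping spillover is capped so that activations stay in [0, peak]
    return peak * np.clip(spread, 0.0, 1.0)


line4_place = place_code(4)              # 0 0 0 1 0 0 0 0
line4_thermo = thermometer_code(4)       # 1 1 1 1 0 0 0 0
line4_coarse_place = coarse_code(line4_place)
line4_coarse_thermo = coarse_code(line4_thermo)
```

With these assumed parameters the coarse/place version of line 4 peaks at unit 4 and falls off symmetrically on either side, while the coarse/thermometer version keeps its plateau and acquires a graded edge.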

In human experiments the CP effect is defined as an interaction between discrimination (the capacity to tell pairs of stimuli apart, a relative judgment) and identification (the capacity to categorize or name individual stimuli, an absolute judgment). Normally, along a one-dimensional stimulus intensity continuum the discrimination function is log-linear (i.e., equal-sized logarithmic increases in stimulus intensity produce equal-sized increases in sensation intensity, and hence response measures of it, such as same/difference and degree of similarity judgments). CP is a systematic departure from this log-linearity, with relative compression (attenuation) of discriminability within categories and/or relative dilation of discriminability (separation) between categories. The neural net accordingly had to be given an initial discrimination function, which could then be re-examined after categorization training to see whether it had "warped."

The method used to generate the precategorization discrimination function was "auto-association" (Hanson & Kegl 1987; Cottrell, Munro & Zipser 1987). Different nets were trained, separately for each of the 6 representations of the 8 lines, to produce as output exactly the same pattern they received as input. For each net trained to a predefined criterion level of performance on auto-association, the interstimulus distances for all pairs of the 8 lines were then calculated as the euclidean distance between the vectors of hidden unit activations for each pair of lines. For example, if there were four hidden units and their activation values after training for line X were (x1 x2 x3 x4) and for line Y (y1 y2 y3 y4), then the distance between the two inputs, and hence their discriminability for that net, would be the euclidean distance between X and Y, i.e., the square root of (x1 - y1)^2 + (x2 - y2)^2 + (x3 - y3)^2 + (x4 - y4)^2 (see Hanson & Burr 1990 for prior work on using this internal measure of interstimulus distance).
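The distance measure can be written compactly. The sketch below assumes the hidden-unit activations for the 8 lines have been collected into an array with one row per line; the function name and array layout are illustrative, not taken from the original simulations.

```python
import numpy as np


def pairwise_hidden_distances(hidden):
    """Euclidean distance between every pair of hidden-unit vectors.

    hidden: array of shape (n_stimuli, n_hidden), one row per line.
    Returns a symmetric (n_stimuli, n_stimuli) matrix with a zero diagonal.
    """
    hidden = np.asarray(hidden, dtype=float)
    diff = hidden[:, None, :] - hidden[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Two stimuli represented on four hidden units
X = [0.0, 0.0, 0.0, 0.0]
Y = [1.0, 1.0, 1.0, 1.0]
D = pairwise_hidden_distances([X, Y])
```

Here `D[0, 1]` is the distance between X and Y, which for these two opposite corners of the four-dimensional unit cube is 2.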

After auto-association the trained weights for the connections between the hidden layer and the output layer were reloaded (and then all weights were left free to vary) and the net was given a double task: Auto-association (again) and categorization, i.e., lines 1 - 4 had to be given one (arbitrary) "name" and lines 5 - 8 had to be given another (e.g., "short" and "long"). In practice, this naming required one more bit on the output, the usual eight for the auto-association, and then one more for the categorization (initially seeded randomly with weights in the (-1.0, 1.0) range).
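The two training phases can be sketched with a small NumPy backprop net. Everything specific here is an assumption made for illustration: full-batch gradient descent on summed squared error, the learning rate, the fixed number of epochs (rather than training to a criterion), the discrete/thermometer input code, and carrying over all trained weights into the second phase. Only the ninth output unit and its (-1.0, 1.0) random seeding follow the description above.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, W1, b1, W2, b2):
    """Return hidden and output activations for all input patterns."""
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    return H, Y


def train(X, T, W1, b1, W2, b2, lr=0.25, epochs=2000):
    """Full-batch backprop on summed squared error (hypothetical settings)."""
    losses = []
    for _ in range(epochs):
        H, Y = forward(X, W1, b1, W2, b2)
        err = Y - T
        losses.append(0.5 * np.sum(err ** 2))
        dY = err * Y * (1.0 - Y)               # delta at the output units
        dH = (dY @ W2.T) * H * (1.0 - H)       # delta at the hidden units
        W2 = W2 - lr * (H.T @ dY)
        b2 = b2 - lr * dY.sum(axis=0)
        W1 = W1 - lr * (X.T @ dH)
        b1 = b1 - lr * dH.sum(axis=0)
    return W1, b1, W2, b2, losses


rng = np.random.default_rng(0)
n_in, n_hid = 8, 4
X = np.tril(np.ones((8, 8)))  # discrete/thermometer code: row i is line i+1

# Phase 1: auto-association (8 inputs -> 4 hidden -> 8 outputs)
W1 = rng.uniform(-1.0, 1.0, (n_in, n_hid))
b1 = np.zeros(n_hid)
W2 = rng.uniform(-1.0, 1.0, (n_hid, n_in))
b2 = np.zeros(n_in)
W1, b1, W2, b2, loss_auto = train(X, X, W1, b1, W2, b2)
H_pre, _ = forward(X, W1, b1, W2, b2)

# Phase 2: reload the trained weights and add a ninth (category) output unit
W2c = np.hstack([W2, rng.uniform(-1.0, 1.0, (n_hid, 1))])
b2c = np.append(b2, rng.uniform(-1.0, 1.0))
category = (np.arange(8) >= 4).astype(float)[:, None]  # lines 1-4 vs. 5-8
T = np.hstack([X, category])
W1c, b1c, W2c, b2c, loss_cat = train(X, T, W1, b1, W2c, b2c)
H_post, _ = forward(X, W1c, b1c, W2c, b2c)
```

`H_pre` and `H_post` hold the hidden-unit representations of the 8 lines before and after categorization training, which is what the pairwise distance comparison operates on.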

For each of the six representations, 50 auto-association nets were trained, and the results of each of these were used to train 10 categorization nets; except where noted, the results reported here refer to averages. Once each net was trained on the categorization task, the pairwise interstimulus distances were again computed, as before, and then compared to their precategorization values for that net. A CP effect was defined as a decrease in within-category interstimulus distances and/or an increase in between-category interstimulus distances relative to the auto-association-alone baseline.
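The definition of a CP effect can be made concrete as a comparison of two distance matrices. The sketch below assumes 8 x 8 matrices of pairwise hidden-unit distances before and after categorization training; the function name and the choice to summarize by mean change are illustrative assumptions. The sign convention matches the one used for the figures: positive means the distance has shrunk.

```python
import numpy as np


def cp_effect(dist_pre, dist_post, boundary=4):
    """Mean change (pre minus post) for within- and between-category pairs.

    Stimuli are indexed 0..n-1; indices below `boundary` form one category.
    A positive within-category value indicates compression; a negative
    between-category value indicates separation.
    """
    dist_pre = np.asarray(dist_pre, dtype=float)
    dist_post = np.asarray(dist_post, dtype=float)
    i, j = np.triu_indices(dist_pre.shape[0], k=1)  # each pair once, i < j
    change = dist_pre[i, j] - dist_post[i, j]
    between = (i < boundary) & (j >= boundary)
    return change[~between].mean(), change[between].mean()


# Synthetic example: flat distances before, warped distances after
cat = np.arange(8) >= 4
same = cat[:, None] == cat[None, :]
pre = np.ones((8, 8)) - np.eye(8)
post = np.where(same, 0.5, 2.0)
np.fill_diagonal(post, 0.0)
within, between = cp_effect(pre, post)
```

In this synthetic case `within` is 0.5 (compression) and `between` is -1.0 (separation), the pattern that would count as a CP effect under the definition above.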

3. Results.

We will first report the results for auto-association alone, and then for the pre/post comparison. Finally, we will analyze some of the details of the evolution of the CP effects that were observed.

The auto-association-alone results for each of the 6 representations for 4-hidden-unit nets are shown in the corresponding upper portions of Figure 1a-f. Plotted are the interstimulus distances (computed as described earlier) between each pair of inputs for the trained net. As expected, the most arbitrary representation (discrete/place) produced the flattest discrimination function: All interstimulus distances were equal. To an extent, this is true of all the place-coded representations, but it can be seen that the effect of the coarse coding produces some rounding and spillover. All the thermometer-coded representations are more iconic (in the sense that a monotonically increasing relationship, sometimes even a linear one, is maintained as the pairs move further apart on the continuum, as in human discrimination functions). This seems to be reflected equally by the discrete/thermometer and coarse/thermometer codes, but the coarse/thermometer code has some more of the properties of human discrimination, as we will see later. The lateral inhibition representations are more complicated, because of interactions between the (arbitrarily chosen) size of the lateral inhibition envelope and the interstimulus increment.

The lower portions of Figure 1a-f show the difference between the interstimulus distances for auto-association alone and the interstimulus distances for auto-association-plus-categorization for each of the six representations. A positive deviation means that the interstimulus distance has decreased after categorization and a negative deviation means it has increased.[2]

Hence positive deviations within categories (compression) and/or negative deviations between categories (separation) would be CP effects. As is clear from Figure 1, pronounced CP effects occurred for all 6 representations. (Although there may be some trend toward greater magnitude CP effects with the more iconic representations, the scales vary and the relative magnitude is probably not comparable across representations with this methodology.)
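The ordering of the 28 pairwise comparisons, and which of them cross the category boundary, can be enumerated directly. The sketch below is illustrative only; the function name and tuple layout are assumptions.

```python
def ordered_pairs(n_lines=8, boundary=4):
    """List (i, j, increment, crosses_boundary) for all pairs of lines.

    Pairs are ordered by increment (1-unit pairs first, then 2-unit pairs,
    and so on) and, within an increment, by the shorter line. Lines are
    numbered from 1; the boundary lies between `boundary` and `boundary + 1`.
    """
    pairs = []
    for inc in range(1, n_lines):
        for i in range(1, n_lines - inc + 1):
            j = i + inc
            pairs.append((i, j, inc, i <= boundary < j))
    return pairs


pairs = ordered_pairs()
n_between = sum(1 for p in pairs if p[3])
n_within = len(pairs) - n_between
```

With the boundary between lines 4 and 5 this gives 16 between-category and 12 within-category comparisons, and every pair with an increment of 4 or more crosses the boundary.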

Having observed strong CP effects in all representations, our next question was: Why were they there and what, if anything, were they for? To examine this more closely we first hypothesized that CP effects may arise as a consequence of compressing the input data into a smaller number of hidden units, so we re-ran the nets with hidden units varying in number from 2 - 12, predicting that the CP effect would diminish with more units. We also thought that whereas a small number of hidden units may give rise to global representations, a large number would allow local ones to form. The prediction was that the global representations would show more of a CP effect.

The categorization task turned out to be very difficult to learn with only 2 hidden units; most nets did not succeed even after a very large number of training trials. With 3 there was CP just as there had been with the 4-hidden-unit nets in Figure 1, and CP continued to be present even when the number of hidden units was increased to 12, exceeding the number of input units. So CP is not merely a consequence of compression. With more hidden units, however, there was more overall separation and less compression in all directions superimposed on the CP effect, both within and between categories.

The next hypothesis was that CP might arise gradually after the first point of separation in the task, as the net overlearned to more extreme values. However, when we trained nets just to the first epsilon of separation and checked for CP, we found the CP pattern was already there then, smaller than in Figure 1, but present.

Another test was whether CP might be an artifact of using the same net, with reloaded weights, to do the auto-association as well as the auto-association-plus-categorization. Now, in some respects this seems the natural thing to do: After all, we are the same systems that do discrimination as well as categorization. So although it was a bit like comparing apples and oranges (or at least like making between-subject rather than within-subject comparisons), we also compared performance averaged over many nets for auto-association alone with performance averaged over many other, independent nets, for auto-association-plus-categorization. Here too, although the effect was much weaker and not present in all representations, there was still evidence of a CP effect.

A final test concerned iconicity and interpolation: Was the CP restricted to trained stimuli, or would it "spill over" (or "generalize") to untrained ones? Nets were trained on auto-association the usual way, and then, during categorization training, some of the lines were left untrained (say, line 3 and line 6) to see whether they would nevertheless "warp" in the "right" direction. We found interpolation of the CP effects to untrained lines, but only for the coarse-coded representations.

Our provisional conclusion was that, whatever was responsible for it, CP had to be something very basic to how these nets learned, in particular, to how they accomplished supervised category learning. So the next step was to look more closely at the time-course and evolution of the learning itself. Instead of looking only at the pre/post-categorization comparison of the interstimulus distances, we analyzed how the interstimulus distances evolved across trials for each of the 8 stimuli. For this we used nets with 3 hidden units. This gave us a visualizable 3-dimensional hidden unit space in which we could follow the locus of the representation of each of the lines in hidden unit space during the course of learning. The results are shown in Figure 2.

Three factors were found to influence the generation of the CP during the course of learning. Two were related to the sigmoid or logistic activation function and one was related to the degree of iconicity of the input representation.

First, a finite, bounded hidden unit space arises because the units saturate to 0 and 1. In the three-dimensional case illustrated here, the hidden unit representations for each of the inputs move into the farthest corners of the unit cube during the course of auto-association learning, maximizing their pairwise distances from one another. This extreme cornering was found with the discrete/place coding (Fig. 2a); there was movement into corners and edges with the discrete/thermometer coding. The other representations showed less of this tendency to move to the extreme periphery of hidden unit space.

This separation tendency thus interacts with the second factor, the iconicity of the thermometer-coded and coarse-coded inputs: Some hidden unit representations are forced by the auto-association to stay closer to one another than they would otherwise have "liked" to stay because of the input structure they are constrained to inherit (see Figure 2b). Thermometer-coded and coarse-coded inputs accordingly arrive at the categorization stage after auto-association with linearly separable[3] configurations of hidden-unit representations whereas place-coded inputs may arrive with more random configurations (depending on the random initial "seeding" values given to each of the weights prior to learning) and hence more of them may fail to be linearly separable (hence failing to be categorizable) after categorization training. Thermometer- and coarse-coded inputs produce faster and more reliable CP effects than place-coded inputs, in that they rarely or never get caught in the local minima that may block linear separability (cf. Figs. 2c - 2e).

The third factor is peculiar to categorization learning and arises from the dynamics of the learning (again because of the logistic function): Because of the error metric of the learning equation, the hidden-unit representations will be pushed with a force that is inversely proportional to an exponential function of their distances from the (hyper)plane separating the two categories.
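One way to make this dependence on distance explicit is the following sketch. It assumes a single logistic category unit and a squared-error metric; the symbols y, t, z, w, b, h and d are introduced here for illustration and do not appear elsewhere in the paper.

```latex
\begin{align*}
y &= \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
z = \mathbf{w} \cdot \mathbf{h} + b, \qquad
d = \frac{|z|}{\lVert \mathbf{w} \rVert}, \\
E &= \tfrac{1}{2}\,(t - y)^{2}, \qquad
-\frac{\partial E}{\partial \mathbf{h}} = (t - y)\, y\,(1 - y)\, \mathbf{w}, \\
y\,(1 - y) &= \frac{e^{-|z|}}{\left(1 + e^{-|z|}\right)^{2}}, \qquad
\tfrac{1}{4}\, e^{-\lVert \mathbf{w} \rVert d} \;\le\; y\,(1 - y) \;\le\; e^{-\lVert \mathbf{w} \rVert d}.
\end{align*}
```

Here h is a hidden-unit representation, t its target category value and d its distance from the separating (hyper)plane z = 0. Under these assumptions the factor y(1 - y) falls off exponentially with d, so representations near the plane receive the strongest push and those far from it receive almost none.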

The codings that generated the largest number of nets that were unable to learn the categorization task were the 2 most arbitrary (noniconic) ones, discrete/place (Fig. 2e) and especially lateral-inhibition/place. Our diagnosis is that with place-coding the output of the auto-associator is more likely to generate configurations in hidden-unit 3-space in which the representations of the eight lines are not readily linearly separable into the two 4-member categories imposed by the task. More training trials are hence required to move such nets into a configuration where the eight representations are linearly separable (see Figure 2d). The lateral inhibition probably acts to add bumps to the representational space and hence to the error surface. Sometimes the configuration even gets trapped in a local minimum, in which case the categorization cannot be learned at all (see Figure 2e).
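Whether a given configuration of hidden-unit representations is linearly separable can be probed with a simple perceptron search. This is a stand-in technique chosen for illustration, not a procedure used in the simulations; the function name and the epoch limit are assumptions. If the points are separable the perceptron will find a plane given enough epochs, but a failure within the limit is only suggestive, not proof of non-separability.

```python
import numpy as np


def find_separating_plane(points, labels, max_epochs=1000):
    """Perceptron search for a (hyper)plane separating two labelled sets.

    points: array (n_points, n_dims); labels: 0/1 category per point.
    Returns (w, b) with sign(w . x + b) matching the category of every
    point, or None if no such plane was found within `max_epochs`.
    """
    X = np.asarray(points, dtype=float)
    y = np.where(np.asarray(labels) > 0, 1.0, -1.0)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # point on the wrong side (or on the plane)
                w = w + yi * xi
                b = b + yi
                mistakes += 1
        if mistakes == 0:
            return w, b
    return None


# Eight representations in the corners of the unit cube, split by the first axis
corners = np.array([[a, c, e] for a in (0, 1) for c in (0, 1) for e in (0, 1)])
separable = find_separating_plane(corners, corners[:, 0])

# An XOR-like configuration in the plane, which no straight line can split
xor_points = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
not_found = find_separating_plane(xor_points, [0, 0, 1, 1])
```

The first configuration yields a plane; the second returns `None`, as it must for any configuration that is not linearly separable.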

What can be inferred so far about the evolution of CP learning can be stated as follows: During auto-association the iconic properties of the inputs are "imprinted" onto them, and are then reflected in their interstimulus distances in hidden-unit space. Apart from having to remain faithful to these constraints, the effect of auto-association is to maximize the pairwise interstimulus distance among all the stimuli within a bounded, finite space. The categorization phase then has no choice, if it is to generate successful performance, but to "warp" the finite space of this maximal separation, moving some of the stimuli (those within the same category) closer together than they would "like" in order to successfully separate them from the others (those in the other category); the magnitude of the warping effect is proportional to the distance of each stimulus from the plane that marks the boundary between the two categories. A complicating factor, and one affecting either the magnitude of the CP or the probability or number of trials before successful performance is attained, is the initial structure of the 8 stimuli at the end of successful auto-association and the beginning of categorization training: If their initial configuration is at odds with the partition that is needed, more warping is needed, and in some particularly bad configurations (arising mostly with lateral-inhibition/place coding) convergence may not be possible at all.

4. Conclusions.

We have analyzed how one particular family of neural nets accomplishes categorization by "warping" interstimulus similarity space in a way that resembles human categorical perception. Other kinds of nets generate CP too (e.g., unsupervised ones), but this analysis seems to be especially revealing about supervised learning, an important form of learning, because the contingencies of survival and successful behavioral adaptation do not always follow the natural lay of the land: Or, to put it another way, where nature's joints are may not be at all obvious from the input alone. Supervision in the form of feedback from the consequences of miscategorization may be our best guide as to how to carve up objects, events and states of affairs. If so, then the plasticity afforded by a mechanism that can "warp" the landscape in the service of the partition dictated by behavioral contingencies would be a useful one indeed, especially when the behavior is symbolic, and the task is not just to survive, reproduce and get around in the environment, but to describe and explain it -- a mechanism that allows you to "see" the world differently as you carve out ever subtler categories with the fine edge of human language.

References

Cottrell, G. W., Munro, P. & Zipser, D. (1987) Image compression by back propagation: an example of extensional programming. ICS Report 8702, Institute for Cognitive Science, UCSD.

Hanson, S. J. & Burr, D. J. (1990) What connectionist models learn: Learning and representation in connectionist networks. "Behavioral and Brain Sciences" 13: 471-518.

Hanson, S. J. and Kegl, J. (1987) Parsnip: A Connectionist Model that Learns Natural Language Grammar from Exposure to Natural Language Sentences. "Ninth Annual Cognitive Science Conference, Seattle."

Harnad, S. (ed.) (1987) "Categorical Perception: The Groundwork of Cognition". New York: Cambridge University Press.

Harnad, S. (1990) The Symbol Grounding Problem. "Physica D" 42: 335-346.

McClelland, J.L., Rumelhart, D. E., and the PDP Research Group (1986) "Parallel distributed processing: Explorations in the microstructure of cognition," Volume 1. Cambridge MA: MIT/Bradford.

Miller, G. A. (1956) The magical number seven, plus or minus two: Some limits on our capacity for processing information. "Psychological Review" 63: 81 - 97.

Figure 1.

Pairwise distances between the 8 lines in hidden-unit space (4 hidden units) for each of the 6 input representations: discrete/place (1a), coarse/place (1b), lateral-inhibition/place (1c), discrete/thermometer (1d), coarse/thermometer (1e), and lateral-inhibition/thermometer (1f). In each case the upper figure displays the pairwise distances following auto-association alone and the lower figure displays the difference between auto-association alone and auto-association plus categorization. The polarity of these differences is positive if the interstimulus distance has become smaller (compression) and negative if it has become larger (separation). To visualize within-category and between-category effects more easily, the comparisons have all been ordered as follows: first the one-unit comparisons 1-2, 2-3,... 7-8; then the two-unit comparisons 1-3, 2-4, etc., and so on until the last, seven-unit comparison: 1-8. Note that the category boundary is between stimuli 4 and 5, hence all pairs that cross that boundary are between-category comparisons; otherwise they are within-category comparisons. Almost without exception, within-category distances are compressed and between-category distances are expanded by the categorization learning. Notice also that interstimulus distances before categorization (auto-association alone) tend to be equal (flat) for the more arbitrary codes (discrete/place, lateral-inhibition/place) and ascending with increasing distance in units for the more iconic representations (thermometer and coarse codes).

Figure 2.

The evolution of the 8 line representations in hidden-unit space for 3-hidden-unit nets. Each line's representation is displayed as a point in the unit cube, its value on each axis corresponding to the activations of each of the hidden units (the connecting lines are just to help visualize in 3 dimensions). Figure 2a shows how the arbitrary discrete/place codings evolve during auto-association from their initial random configuration (left) to extreme separation in the corners and edges of the space after auto-association learning (right). Figure 2b, again auto-association alone, shows how the iconic factors in the coarse/thermometer representation constrain this separation. Figure 2c shows the evolution of categorization with the iconic discrete/thermometer code from the final configuration after auto-association alone (left) to the configuration after successful category learning (right). Figure 2d shows in four stages from left to right the more difficult evolution of the configuration with the arbitrary discrete/place code; after considerable movement, linear separability between the two categories is achieved. Finally, Figure 2e shows a discrete/place net that cannot accomplish categorization because it is stuck in a local minimum in which the two categories are not linearly separable.

Footnotes

1. Behaviorists proposed an associative explanation -- that members of the same category grew more similar because they were more closely associated with one another and with their shared category name than with members of different categories and their names, but this is more a restatement of the phenomenon than a model that explains it. The "motor theory of speech perception" explained speech CP by the similarities and differences between the motor pattern required to produce, say, a BA and a DA, but this model applies only to the special case of speech, where there is a perception/production analogue, and has given rise to decades of unfruitful debate about whether or not speech is "special." The last "theory" of CP is the Whorf Hypothesis, according to which CP is a manifestation of how language and culture shape our view of reality. This too seems more a restatement of the phenomenon than an explanation of it.

2. To facilitate comparison, the 28 possible pairwise comparisons of the 8 lines are displayed in terms of the size of the increment: Lines differing by 1 unit first, then 2 units, etc. Note that because the category boundary was between lines 4 and 5, increments of 4 or greater are all between-category differences.

3. Two sets of points in a plane are "linearly separable" if and only if they can be divided into their respective categories by a straight line cutting across the plane. In three dimensional space, linear separability is accomplished by a plane; in higher dimensions, by a hyperplane, etc.