In Mark Johnson (Ed.), Brain Development and Cognition: A Reader. Oxford: Blackwell Publishers. 1993. Pp. 623-642.

CONNECTIONISM AND THE STUDY OF CHANGE

Elizabeth A. Bates
Jeffrey L. Elman
University of California, San Diego



Developmental psychology and developmental neuropsychology have traditionally focussed on the study of children. But these two fields are also supposed to be about the study of change, i.e., changes in behavior, changes in the neural structures that underlie behavior, and changes in the relationship between mind and brain across the course of development. Ironically, there has been relatively little interest in the mechanisms responsible for change in the last 15 - 20 years of developmental research. The reasons for this de-emphasis on change have a great deal to do with a metaphor for mind and brain that has influenced most of experimental psychology, cognitive science and neuropsychology for the last few decades, i.e., the metaphor of the serial digital computer. We will refer to this particular framework for the study of mind as the First Computer Metaphor, to be contrasted with a new computer metaphor, variously known as connectionism, parallel distributed processing, and/or neural networks. In this brief chapter, we will argue that the First Computer Metaphor has had some particularly unhappy consequences for the study of mental and neural development. By contrast, the Second Computer Metaphor (despite its current and no doubt future limitations) offers some compelling advantages for the study of change, at both the mental and the neural level.

The chapter is organized as follows: (1) a brief discussion of the way that change has (or has not) been treated in the last decade of research in developmental psychology, (2) a discussion of the First Computer Metaphor, and its implications for developmental research, (3) an introduction to the Second Computer Metaphor, and the promise it offers for research on the development of mind and brain, ending with (4) a response to some common misconceptions about connectionism.

(1) What Happened to the Study of Change?

Traditionally, there are three terms that have been used to describe changes in child behavior over time: maturation, learning, and development. For our purposes here, these terms can be defined as follows.

(a) Maturation. As the term is typically used in the psychological literature (although this use may not be entirely accurate from a biological perspective), ``maturation'' refers to the timely appearance or unfolding of behaviors that are predetermined, in their structure and their sequence, by a well-defined genetic program. The role of experience in a strong maturational theory is limited to a ``triggering'' function (providing the general or specific conditions that allow some predetermined structures to emerge) or a ``blocking'' function (providing conditions that inhibit the expression of some predetermined event). The environment does not, in and of itself, provide or cause behavioral structure.

(b) Learning. ``Learning'' is typically defined as a systematic change in behavior as a result of experience. Under some interpretations, learning refers to a copying or transfer of structure from the environment to the organism (as in ``acquisition'' or ``internalization''). Under a somewhat weaker interpretation, learning may refer to a shaping or alteration of behavior that is caused by experience, although the resulting behavior does not resemble structures in the environment in any direct or interesting way.

(c) Development. As defined by Werner (1948) in his elaboration of the ``orthogenetic principle'', ``development'' refers to any positive change in the internal structure of a system, where ``positive'' is further defined as an increase in the number of internal parts (i.e., differentiation), accompanied by an increase in the amount of organization that holds among those parts. Under this definition, the term ``development'' is neutral to the genetic or experiential sources of change, and may include emergent forms that are not directly predictable from genes or experience considered separately (i.e., the sum is greater than and qualitatively different from the parts).

Although all three terms have been used to describe behavioral change in the psychological literature, the most difficult and (in our view) most interesting proposals are the ones that have involved emergent form, i.e., changes that are only indirectly related to structure in the genes or the environment. We are referring here not to the banal interactionism in which black and white yield grey, but to a much more challenging interactionism in which black and white converge and interact to yield an unexpected red. Because this interactionist view appears to be the only way to explain how new structures arise, it may be our only way out of a fruitless Nature/Nurture debate that has hampered progress in developmental psychology for most of its history.

Within our field, the most complete interactionist theory of behavioral change to date is the theory offered by Jean Piaget, across a career that spanned more than fifty years (Piaget, 1952, 1970a, 1970b, 1971). Piaget's genetic epistemology concentrated on the way that new mental structures emerge at the interface between an active child and a structured world. The key mechanism for change in Piaget's theory is the consummate biological notion of adaptation. Starting with a restricted set of sensorimotor schemes (i.e., structured ``packages'' of perception and action that permit activities like sucking, reaching, tracking, and/or grasping), the child begins to act upon the world (assimilation). Actions are modified in response to feedback from that world (accommodation), and in response to the degree of internal coherence or stability that action schemes bear to one another (reciprocal assimilation). The proximal cause that brings about adaptation is a rather poorly defined notion of equilibration, i.e., the re-establishment of a stable and coherent state after a perturbation that created instability or disequilibrium. In the infant years, adaptation of simple sensorimotor schemes to a structured world leads to an increasingly complex and integrated set of schemes or ``plans'', structures that eventually permit the child to ``re-present'' the world (i.e., to call potential perceptuo-motor schemes associated with a given object or event into an organized state-of-readiness, in the absence of direct perceptual input from the represented object or event). This developmental notion of representation constituted Piaget's explanation for the appearance of mental imagery, language and other symbolic or representational forms somewhere in the second year of life. After this point, the process of adaptation continues at both the physical and representational level (i.e., operations on the real world, and operations on the new ``mental world''), passing through a series of semi-stable ``stages'' or moments of system-wide equilibrium, ultimately leading to our human capacity for higher forms of logic and reasoning.

This ``bootstrapping'' approach to cognitive development does involve a weak form of learning (as defined above), but the mental structures that characterize each stage of development are not predictable in any direct way from either the structure of the world or the set of innate sensorimotor schemes with which the child began. Furthermore, Piaget insisted that these progressive increases in complexity were a result of activity (``construction''), and not a gradual unfolding of predetermined forms (maturation). In this fashion, Piaget strove to save us from the Nature/Nurture dilemma. Behavioral outcomes were determined not only by genes, or by environment, but by the mathematical, physical and biological laws that determine the kinds of solutions that are possible for any given problem. As Piaget once stated in a criticism of his American colleague Noam Chomsky, ``That which is inevitable does not have to be innate'' (Piaget, 1970a).

There was a period in the history of developmental psychology in which Piagetian theory assumed a degree of orthodoxy that many found stifling. Decades later, it now appears that much of Piaget's theory was wrong in detail. For one thing, it is now clear that the infant's initial stock of innate sensorimotor schemes is far richer than Piaget believed. It is also clear that Piaget overestimated the degree of cross-domain stability that children are likely to display at any given point in development (i.e., the notion of a coherent ``stage''). Once the details of his stage theory were proven inadequate, all that really remained were the principles of change that formed the bedrock of Piaget's genetic epistemology -- notions of adaptation and equilibration that struck many of his critics as hopelessly vague, and a notion of emergent form that many found downright mystical. Piaget was aware of these problems, and spent the latter part of his career seeking a set of formalisms to concretize his deep insights about change. Most critics agree that these efforts failed. This failure, coupled with new empirical information showing that many other aspects of the theory were incorrect, has led to a widespread repudiation of Piaget. Indeed, we are in a period of ``anti-Piagetianism'' of patricidal dimensions.

But what have we put in Piaget's place? We have never replaced his theory with a better account of the epistemology of change. In fact, the most influential developmental movements of the last two decades have essentially disavowed change. Alas, we fear that we are back on the horns of the Nature-Nurture dilemma from which Piaget tried in vain to save us.

On the one hand, we have seen a series of strong nativist proposals in the last few years, including proposals by some neo-Gibsonian theorists within the so-called ``competent infant movement'' (Baillargeon & de Vos, 1991; Spelke, 1990, 1991), and proposals within language acquisition inspired by Chomsky's approach to the nature and origins of grammar (Hyams, 1986; Roeper & Williams, 1987; Lightfoot, 1991). In both these movements, it is assumed that the essence of what it means to be human is genetically predetermined. Change -- insofar as we see change at all -- is attributed to the maturation of predetermined mental content, to the release of preformed material by an environmental ``trigger'', and/or to the gradual removal of banal sensory and motor limitations that hid all this complex innate knowledge from view. Indeed, the term ``learning'' has taken on such negative connotations in some quarters that efforts are underway to eliminate it altogether. The following quotes from Piatelli-Palmarini (1989) illustrate how far things have gone:
I, for one, see no advantage in the preservation of the term learning. We agree with those who maintain that we would gain in clarity if the scientific use of the term were simply discontinued. (p. 2)
Problem-solving...adaptation, simplicity, compensation, equilibration, minimal disturbance and all those universal, parsimony-driven forces of which the natural sciences are so fond, recede into the background. They are either scaled down, at the physico-chemical level, where they still make a lot of sense, or dismissed altogether. (pp. 13-14).

On the other hand, the neo-Vygotskian movement and associated approaches to the social bases of cognition have provided us with another form of preformationalism, insisting that the essence of what it means to be human is laid out for the child in the structure of social interactions (Bruner & Sherwood, 1976; Rogoff, 1990). In these theories, change is viewed primarily as a process of internalization, as the child takes in preformed solutions to problems that lie in the ``zone of proximal development'', i.e., in joint activities that are just outside his current ability to act alone. Related ideas are often found in research on ``motherese'', i.e., on the special, simplified and caricatured form of language that adults direct to small children (for a review, see Ferguson & Snow, 1978). In citing these examples, we do not want to deny that society has an influence on development, because we are quite sure that it does. Our point is, simply, that the pendulum has swung too far from the study of child-initiated change. The most influential movements in developmental psychology for the last two decades are those that have deemphasized change in favor of an emphasis on some kind of preformation: either a preformation by Nature and the hand of God, or a preformation by the competent adult.

Why have we accepted these limits? Why haven't we moved on to study the process by which new structures really do emerge? We believe that developmental psychology has been influenced for many years by a metaphor for mind in which it is difficult to think about change in any interesting form -- which brings us to the First Computer Metaphor.

(2) The First Computer Metaphor and its Implications for Development

At its core, the serial digital computer is a machine that manipulates symbols. It takes individual symbols (or strings of symbols) as its input, applies a set of stored algorithms (a program) to that input, and produces more symbols (or strings of symbols) as its output. These steps are performed one at a time (albeit very quickly) by a central processor. Because of this serial constraint, problems to be solved by the First Computer must be broken down into a hierarchical structure that permits the machine to reach solutions with maximum efficiency (e.g., moving down a decision tree until a particular subproblem is solved, and then back up again to the next step in the program).

Without question, exploitation of this machine has led to huge advances in virtually every area of science, industry and education. After all, computers can do things that human beings simply cannot do, permitting quantitative advances in information processing and numerical analysis that were unthinkable a century ago. The problem with this device for our purposes here lies not in its utility as a scientific tool, but in its utility as a scientific metaphor, in particular as a metaphor for the human mind/brain. Four properties of the serial digital computer have had particularly unfortunate consequences for the way that we have come to think about mental and neural development.

(1) Discrete representations. The symbols that are manipulated by a serial digital computer are discrete entities. That is, they either are or are not present in the input. There is no such thing as 50% of the letter A or 99% of the number 7. For example, if a would-be user types in a password that is off by only one key-stroke, the computer does not respond with ``What the heck, that's close enough.'' Instead, the user is damned just as thoroughly as he would be if he did not know the password at all.

People (particularly children) rarely behave like this. We can respond to partial information (degraded input) in a systematic way; and we often transform our inputs (systematic or not) into partial decisions and imperfect acts (degraded output). We are error-prone, but we are also forgiving, flexible, willing and able to make the best of what we have. This mismatch between human behavior and the representations manipulated by serial digital computers has of course been known for some time. To resolve this well-known discrepancy, the usual device adopted by proponents of the First Computer Metaphor for Mind is the competence/performance distinction. That is, it is argued that our knowledge (competence) takes a discrete and idealized form that is compatible with the computer metaphor, but our behavior (performance) is degraded by processing factors and other sources of noise that are irrelevant to a characterization of knowledge and (by extension) acquisition of knowledge. This is a perfectly reasonable intellectual move, but as we will see in more detail below, it has led to certain difficulties in characterizing the nature of learning that often result in the statement that learning is impossible.

(2) Absolute rules. Like the symbolic representations described above, the algorithms contained in a computer program also take a discrete form. If the discrete symbols that trigger a given rule are present in the input, then that rule must apply, and give an equally discrete symbol or string of symbols as its output. Conversely, if the relevant symbols are not present in the input, then the rule in question will not apply. There is no room for anything in between, no coherent way of talking about 50% of a rule, or (for that matter) weak vs. strong rules. Indeed, this is exactly the reason why computers are so much more reliable than human beings for many computational purposes.

Presented with the well-known mismatch between human behavior and the absolute status of rules in a serial digital computer, proponents of the First Computer Metaphor for Mind usually resort to the same competence/performance distinction described above. Alternatively, there have been attempts to model the probabilistic nature of human behavior by adding weights to rules, a device that permits the model to decide which rule to apply (or in what order of preference) when a choice has to be made. The problem is that these weights are in no way a natural product or property of the architecture in which they are embedded, nor are they produced automatically by the learning process. Instead, these weights are arbitrary, ad hoc devices that must be placed in the system by hand -- which brings us to the next point.

(3) Learning as programming. The serial digital computer is not a self-organizing system. It does not learn easily. Indeed, the easiest metaphor for learning in a system of this kind is programming; that is, the rules that must be applied to inputs of some kind are placed directly into the system -- by man, by Nature or by the hand of God. To be sure, there is a literature on computer learning in the field of artificial intelligence. However, most of these efforts are based on a process of hypothesis testing. In such learning models, two essential factors are provided a priori: a set of hypotheses that will be tested against the data, and an algorithm for deciding which hypothesis provides the best fit to those data. This is by its very nature a strong nativist approach to learning. It is not surprising that learning theories of this kind are regularly invoked by linguists and psycholinguists with a strong nativist orientation. There is no graceful way for the system to derive new hypotheses (as opposed to modifications of a pre-existing option). Everything that really counts is already there at the beginning.

Once again, however, we have an unfortunate mismatch between theory and data in cognitive science. Because the hypotheses tested by a traditional computer learning model are discrete in nature (based on the rules and representations described above), learning (a.k.a. ``selection'') necessarily involves a series of discrete decisions about the truth or falsity of each hypothesis. Hence we would expect change to take place in a crisp, step-wise fashion, as decisions are made, hypotheses are discarded, and new ones are put in their place. But human learning rarely proceeds in this fashion; it is characterized more often by error, vacillation and backsliding. In fact, the limited value of the serial digital computer as a metaphor for learning is well known. Perhaps for this reason, learning and development have receded into the background in modern cognitive psychology, while the field has concentrated instead on issues like the nature of representation, processes of recognition and retrieval, and the various stages through which discrete bits of information are processed (e.g., various buffers and checkpoints in a serial process of symbol manipulation). Developmental psychologists working within this framework (or indirectly influenced by it) have moved away from the study of change and self-organization toward a catalogue of those representations that are there at the beginning (e.g., the ``competent infant'' movement in cognition and perception; the parameter-setting movement in developmental psycholinguistics), and/or a characterization of how the processes that elaborate information mature or expand across the childhood years (i.e., changes in performance that ``release'' the expression of pre-existing knowledge).

(4) The hardware/software distinction. One of the most unfortunate consequences of the First Computer Metaphor for cognitive science in general and developmental psychology in particular has been the acceptance of a strong separation between software (the knowledge -- symbols, rules, hypotheses, etc. -- that is contained in a program) and hardware (the machine that is used to implement that program). From this perspective, the machine itself places very few constraints on our theory of knowledge and (by extension) behavior, except perhaps for some relatively banal concerns about capacity (e.g., there are some programs that one simply cannot run on a small personal computer with limited memory).

The distinction between hardware and software has provided much of the ammunition for an approach to philosophy of mind and cognitive science called Functionalism (Fodor, 1981; See Footnote 1). Within the functionalist school, the essential properties of mind are derived entirely from the domains on which the mind must operate: language, logic, mathematics, three-dimensional space, etc. To be sure, these properties have to be implemented in a machine of some kind, but the machine itself does not place interesting constraints on mental representations (i.e., the objects manipulated by the mind) or functional architecture (i.e., the abstract system that manipulates those objects). This belief has justified an approach to cognition that is entirely independent of neuroscience, thereby reducing the number and range of constraints to which our cognitive theories must respond. As a by-product (since divorces usually affect both parties), this approach has also reduced the impact of cognitive theories and cognitive phenomena on the field of neuroscience.

The separation between biology and cognition has had particularly serious consequences for developmental psychology, a field in which biology has traditionally played a major role (i.e., a tradition that includes Freud, Gesell, Baldwin, and Piaget, to name a few). Not only have we turned away from our traditional emphasis on change, but we have also turned away from the healthy and regular use of biological constraints on the study of developing minds. Ironically, some of the strongest claims about innateness in the current literature have been put forth in complete disregard of biological facts. Very rich forms of object perception and deep inferences about three-dimensional space are ascribed to infants before 3 - 4 months of age, conclusions which are difficult to square with (for example) well-known limitations on visual acuity and/or the immaturity of higher cortical regions in that age range. The underlying assumption appears to be that our cognitive findings have priority, and if there is a mismatch between cognitive and biological conclusions, we probably got the biology wrong (which may be the case some of the time -- but surely not all the time!).

It seems to us that we need all the constraints that can be found to make sense of a growing mass of information about cognitive development, language development, perceptual development, social development. Furthermore, we suspect that developmental neuroscience would also profit from a healthy dose of knowledge about the behavioral functions of the neural systems under study. Finally, we would all be better off if we could find a computational model (or class of models) in which it would be easier to organize and study the mutual constraints that hold between mental and neural development -- which brings us to the next computer metaphor.

(3) The Second Computer Metaphor and Its Implications for Development

During the 1950's and 60's, when the First Computer Metaphor for mind began to influence psychological research, some information scientists were exploring the properties of a different and competing computational device called the Perceptron (Rosenblatt, 1958, 1962). The roots of this approach can be traced to earlier work in cybernetics (Minsky, 1956; von Neumann, 1951, 1958) and in neurophysiology (Eccles, 1953; Hebb, 1949; McCulloch & Pitts, 1943). In a perceptron network, unlike the serial digital computer, there was not a clear distinction between processor and memory, nor did it operate on symbols in the usual sense of the term. Instead, the perceptron network was composed of a large number of relatively simple ``local'' units that worked in parallel to perceive, recognize and/or categorize an input. These local units or ``nodes'' were organized into two layers, an ``input set'' and an ``output set''. In the typical perceptron architecture, every unit on the input layer was connected by a single link to each and every unit on the output layer (see Figure 1). These connections varied in degree or strength, from 0 to 1 (in a purely excitatory system) or from -1 to +1 (in a system with both activation and inhibition). A given output unit would ``fire'' as a function of the amount of input that it received from the various input units, with activation collected until a critical firing threshold was reached (see also McCulloch and Pitts, 1943). Individual acts of recognition or categorization in a Perceptron reflect the collective activity of all these units. Knowledge is a property of the connection strengths that hold between the respective input and output layers; the machine can be said to ``know'' a pattern when it gives the correct output for a given class of inputs (including novel members of the input class that it has never seen before, i.e., generalization).

Figure 1. A simple perceptron network.
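To make this architecture concrete, here is a minimal sketch (in Python, with sizes we have chosen purely for illustration: four input units fully connected to three output units) of how a perceptron's output layer responds to a stimulus. Everything the network ``knows'' is carried by the matrix of connection strengths and the firing threshold.

```python
import numpy as np

# A minimal sketch of the two-layer perceptron described above. The sizes
# (4 input units, 3 output units) and the threshold value are illustrative
# assumptions, not taken from any particular published simulation.
rng = np.random.default_rng(0)
weights = rng.uniform(-1.0, 1.0, size=(4, 3))   # one link from every input unit to every output unit
threshold = 0.5

input_pattern = np.array([1.0, 0.0, 1.0, 1.0])          # a stimulus registered on the input layer
net_input = input_pattern @ weights                      # each output unit sums its weighted inputs
output_pattern = (net_input > threshold).astype(float)   # a unit "fires" once its threshold is reached
print(output_pattern)
```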

There are some obvious analogies between this system and the form of computation carried out in real neural systems, e.g., excitatory and inhibitory links, summation of activation, firing thresholds, and above all the distribution of patterns across a large number of inter-connected units. But this was not the only advantage that perceptrons offered, compared with their competitors. The most important property of perceptrons was (and is) their ability to learn by example.

During the teaching and learning phase, a stimulus is registered on the input layer in a distributed fashion, by turning units on or off to varying degrees. The system produces the output that it currently prefers (based, in the most extreme tabula rasa case, on a random set of connections). Each unit in this distributed but ``ignorant'' output is then compared with the corresponding unit in the ``correct'' output. If a given output unit within a distributed pattern has ``the right answer'', its connection strengths are left unchanged. If a given output has ``the wrong answer'', the size of the error is calculated by a simple difference score (i.e., ``delta''). All of the connections to that erroneous output are then increased or decreased in proportion to the amount of error that they were responsible for on that trial. This procedure then continues in a similar fashion for other trials. Because the network is required to find a single set of connection weights which allow it to respond correctly to all of the patterns it has seen, it typically succeeds only by discovering the underlying generalizations which relate inputs to outputs. The important and interesting result is that the network is then able to respond appropriately not only to stimuli it has seen before, but to novel stimuli as well. The learning procedure is thus an example of learning inductively.
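A rough sketch of this teaching procedure, for a single-layer network and a linearly separable problem (logical OR, an example of our own choosing), might look as follows; the learning rate and number of training sweeps are arbitrary illustrative values.

```python
import numpy as np

# A sketch of delta-rule (perceptron) learning by example. Logical OR is used
# here because it is linearly separable, so a single layer of connections suffices.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([0, 1, 1, 1], dtype=float)

rng = np.random.default_rng(1)
weights = rng.uniform(-1.0, 1.0, size=2)   # the "ignorant" starting state: random connections
bias = 0.0
learning_rate = 0.2

for sweep in range(50):                    # repeated presentations of the training patterns
    for x, target in zip(X, targets):
        output = 1.0 if weights @ x + bias > 0 else 0.0   # the output the network currently prefers
        delta = target - output                           # size and sign of the error
        weights += learning_rate * delta * x              # adjust each connection in proportion
        bias += learning_rate * delta                     # to its responsibility for the error

print([1.0 if weights @ x + bias > 0 else 0.0 for x in X])   # should reproduce [0, 1, 1, 1]
```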

Compared with the cumbersome hypothesis-testing procedures that constitute learning in serial digital computers, learning really appears to be a natural property of the perceptron. Indeed, perceptrons are able to master a broad range of patterns, with realistic generalization to new inputs as a function of their similarity to the initial learning set. The initial success of these artificial systems had some impact on theories of pattern recognition in humans. The most noteworthy example is Selfridge's ``Pandemonium Model'' (Selfridge, 1958), in which simple local feature detectors or ``demons'' work in parallel to recognize a complex pattern. Each demon scans the input for evidence of its preferred feature; depending on its degree of certainty that the relevant feature has appeared, each demon ``shouts'' or ``whispers'' its results. In the Pandemonium Model (as in the Perceptron), there is no final arbiter, no homunculus or central executive who puts all these daemonical inputs together. Rather, the ``solution'' is an emergent property of the system as a whole, a global pattern produced by independent, local computations. This also means that results or solutions can vary in their degree of resemblance to the ``right'' answer, capturing the rather fuzzy properties of human categorization that are so elusive in psychological models inspired by the serial digital computer.

So far so good. And yet this promising line of research came to a virtual end in 1969, when Minsky and Papert published their famous book Perceptrons. Minsky and Papert (who were initial enthusiasts and pioneers in perceptron research) were able to prove that perceptrons are only capable of learning a limited class of first-order, linearly separable patterns. These systems are incapable of learning second-order relations like ``A or B but not both'' (i.e., logical exclusive OR), and by extension, any pattern of equivalent or greater complexity and inter-dependence. This fatal flaw is a direct product of the fact that perceptrons are two-layered systems, with a single direct link between each input and output unit. If A and B are both ``on'' in the input layer, then they each automatically ``turn on'' their collaborators on the output layer. There is simply no place in the system to record the fact that A and B are both on simultaneously, and hence no way to ``warn'' their various collaborators that they should shut up on this particular trial. It was clear even in 1969 that this problem could be addressed by adding another layer somewhere in the middle, a set of units capable of recording the fact that A and B are both on simultaneously, and therefore capable of inhibiting output nodes that would normally turn on in the presence of either A or B. So why not add a set of ``in between'' units, creating 3 or 4 or N-layered perceptrons? Unfortunately, the learning rules available at that time (e.g., the simple delta rule) did not work with multilayered systems. Furthermore, Minsky and Papert offered the conjecture that such a learning rule would prove impossible in principle, due to the combinatorial complexity of delta calculations and ``distribution of blame'' in an n-layered system. As it turns out, this conjecture was wrong (after all, a conjecture is not a proof). Nevertheless, it was very influential. Interest in the perceptron as a model of complex mental processes dwindled in many quarters. From 1970 on, most artificial intelligence research abandoned this architecture in favor of the fast, flexible and highly programmable serial digital computer. And most of cognitive psychology followed suit. (For a somewhat different account of this history, see Papert, 1988; a good collection of historically important documents can be found in Anderson & Rosenfeld, 1989.)
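The limitation is easy to see by brute force. The sketch below (our own illustration, not a proof) searches a grid of candidate connection strengths and biases for a single threshold unit and finds none that reproduces ``A or B but not both''.

```python
import itertools
import numpy as np

# Exclusive OR: the four input patterns and their required outputs.
patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def fires(w_a, w_b, bias, a, b):
    """A single threshold unit with one direct link from A and one from B."""
    return 1 if w_a * a + w_b * b + bias > 0 else 0

grid = np.linspace(-2.0, 2.0, 41)          # candidate connection strengths and biases
solutions = [
    (w_a, w_b, bias)
    for w_a, w_b, bias in itertools.product(grid, repeat=3)
    if all(fires(w_a, w_b, bias, a, b) == target for (a, b), target in patterns)
]
print(len(solutions))   # 0 -- no input-to-output weight setting handles all four cases
```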

Parallel distributed processing was revived in the late 1970's and early 1980's, for a variety of reasons. In fact, the computational advantages of such systems were never entirely forgotten (Anderson, 1972; Feldman & Ballard, 1980; Hinton & Anderson, 1981; Kohonen, 1977; Willshaw, Buneman, & Longuet-Higgins, 1969), and their resemblance to real neural systems continued to exert some appeal (Grossberg, 1968, 1972, 1987). But the current ``boom'' in parallel distributed processing or ``connectionism'' was inspired in large measure by the discovery of a learning rule that worked for multi-layered systems (Rumelhart, Hinton and Williams, 1986; Le Cun, 1985). The Minsky-Papert conjecture was overturned, and there are now many impressive demonstrations of learning in multilayered neural nets, including learning of n-order dependencies like ``A or B but not both'' (Rumelhart & McClelland, 1986; see Footnote 2). Multilayer networks have been shown to be universal function approximators, which means that they can approximate any function to an arbitrary degree of precision (Hornik, Stinchcombe, & White, 1989). Such a network is shown in Figure 2.

Figure 2. A multi-layer network with a layer of hidden units.
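For illustration, the sketch below (with layer sizes, learning rate and number of training sweeps chosen arbitrarily by us) trains a small multi-layer network on exclusive OR by back-propagating the error through a layer of hidden units -- exactly the kind of ``distribution of blame'' that Minsky and Papert conjectured to be intractable.

```python
import numpy as np

# A sketch of back-propagation learning of exclusive OR in a 2-4-1 network.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)   # input -> hidden connections
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)   # hidden -> output connections
learning_rate = 0.5

for sweep in range(20000):
    hidden = sigmoid(X @ W1 + b1)                            # forward pass
    output = sigmoid(hidden @ W2 + b2)
    d_output = (output - y) * output * (1 - output)          # error at the output layer
    d_hidden = (d_output @ W2.T) * hidden * (1 - hidden)     # "blame" propagated back to the hidden layer
    W2 -= learning_rate * hidden.T @ d_output; b2 -= learning_rate * d_output.sum(axis=0)
    W1 -= learning_rate * X.T @ d_hidden;      b1 -= learning_rate * d_hidden.sum(axis=0)

print(np.round(output, 2))   # typically close to [[0], [1], [1], [0]]; an unlucky seed may settle elsewhere
```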

Another reason for the current popularity of connectionism derives from technical advances in the design of parallel computing systems. It has become increasingly clear to computer scientists that we are close to the absolute physical limits on speed and efficiency in serial systems -- and yet the largest and fastest serial computers still cannot come close to the speed with which our small, slow, energy-efficient brains recognize patterns and decide where and how to move. As Carver Mead has pointed out (Mead, 1989), it is time to ``reverse-engineer Nature'', to figure out the principles by which real brains compute information. It is still the case that most connectionist simulations are actually carried out on serial digital computers (which mimic parallelism by carrying out a set of would-be parallel computations in a series, and waiting for the result until the next wave of would-be parallel computations is ready to go). But new, truly parallel architectures are coming on line (e.g., the now-famous Connection Machine) to implement those discoveries that have been made with pseudo-parallel simulations. Parallel distributed processing appears to be the solution elected by Evolution, and (if Mead is right) computer science will have to move in this direction to capture the kinds of processing that human beings do so well.

For developmental psychologists, the Second Computer Metaphor holds some clear advantages for the study of change in human beings. The first set involves the same four areas in which the First Computer Metaphor has let us down: the nature of representations, rules or ``mappings'', learning, and the hardware/software issue. The last two are advantages peculiar to connectionist networks: non-linear dynamics, and emergent form.

(1) Distributed representations. The representations employed in connectionist nets differ radically from the symbols manipulated by serial digital computers. First, these representations are ``coarse-coded'', distributed across many different units. Because of this property, it is reasonable to talk about the degree to which a representation is active or the amount of a representation that is currently available in this system (i.e., 50% of an ``A'' or 99% of the number ``7''). This also means that patterns can be built up or torn down in bits and pieces, accounting for the graded nature of learning in most instances, and for the gradual or graded patterns of breakdown that are typically displayed by brain-damaged individuals (Hinton and Shallice, 1991; Marchman, 1992; Seidenberg & McClelland, 1989; Schwartz, Saffran, & Dell, 1990). Second, the same units can participate in many different patterns, and many different patterns coexist in a super-imposed fashion across the same set of units. This fact can be used to account for degrees of similarity between patterns, and for the ways in which patterns penetrate, facilitate and/or interfere with one another at various points in learning and development (for an expanded discussion of this point, see Bates, Thal and Marchman, 1991).
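As a small illustration of what ``graded'' means here (with an eight-unit pattern that we have simply invented for the purpose), a degraded input can still yield a partial degree of match rather than an all-or-none decision:

```python
import numpy as np

# An invented eight-unit distributed pattern standing for the letter "A".
full_A = np.array([1, 0, 1, 1, 0, 1, 0, 1], dtype=float)

partial_A = full_A.copy()
partial_A[[2, 5]] = 0.0                      # a degraded input, missing some of its units

match = partial_A @ full_A / (full_A @ full_A)
print(match)                                 # 0.6 -- "60% of an A", a graded rather than all-or-none fit
```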

(2) Graded Rules. Contrary to rumor, it is not the case that connectionist systems have no rules. However, the rules or ``mappings'' employed by connectionist nets take a very different form from the crisp algorithms contained within the programs employed by a serial digital computer. These include the learning rule itself (i.e., the principle by which the system reduces error and ``decides'' when it has reached a good fit between input and output), and the functions that determine when and how a unit will fire. But above all, the ``rules'' in a connectionist net include the connections that hold among units, i.e., the links or ``weights'' that embody all the potential mappings from input to output across the system as a whole. This means that rules (like representations) can exist by degree, and vary in strength.

It should also be clear from this description that it is difficult to distinguish between rules and representations in a connectionist net. The knowledge or ``mapping potential'' of a network is composed of the units that participate in distributed patterns, and the connections among those units. Because all these potential mappings coexist across the same ``territory'', they must compete with one another to resolve a given input. In the course of this competition, the system does not ``decide'' between alternatives in the usual sense; rather, it ``relaxes'' or ``resolves'' into a (temporary) state of equilibrium. In a stochastic system of this kind, it is possible for several different networks to reach the same solution to a problem, each with a totally different set of weights. This fact runs directly counter to the tendency in traditional cognitive and linguistic research to seek ``the rule'' or ``the grammar'' that underlies a set of behavioral regularities. In other words, rules are not absolute in any sense -- they can vary by degree within a given individual, and they can also vary in their internal structure from one individual to another. We believe that these properties are far more compatible with the combination of universal tendencies and individual variation that we see in the course of human development, and they are compatible with the remarkable neural and behavioral plasticity that is evident in children who have suffered early brain injury (Thal et al., 1991; Marchman, 1992).

(3) Learning as structural change. As we pointed out earlier, much of the current excitement about connectionist systems revolves around their capacity for learning and self-organization. Indeed, the current boom in connectionism has brought learning and development back onto center stage in cognitive science. These systems really do change as a function of learning, displaying forms of organization that were not placed there by the programmer (or by Nature, or by the Hand of God). To be sure, the final product is co-determined by the initial structure of the system and the data to which it is exposed. These systems are not anarchists, nor solipsists. But in no sense is the final product ``copied'' or programmed in. Furthermore, once the system has learned it is difficult for it to ``unlearn'', if by ``unlearning'' we mean a return to its pristine prelearning state. This is true for the reasons described in (1) and (2): the knowledge contained in connectionist nets is contained in and defined by its very architecture, in the connection weights that currently hold among all units as a function of prior learning. Knowledge is not ``retrieved'' from some passive store, nor is it ``placed in'' or ``passed between'' spatially localized buffers. Learning is structural change, and experience involves the activation of potential states in that system as it is currently structured.

From this point of view, the term ``acquisition'' is an infelicitous way of talking about learning or change. Certain states become possible in the system, but they are not acquired in the usual sense, i.e., found or purchased or stored away like nuts in preparation for the winter. This property of connectionist systems permits us to do away with problems that have been rampant in certain areas of developmental psychology, e.g., the problem of determining ``when'' a given piece of knowledge is acquired, or ``when'' a rule finally becomes productive. Instead, development (like the representations and mappings on which it is based) can be viewed as a gradual process; there is no single moment at which learning can be said to occur (but see non-linearity, below).

(4) Software as Hardware. We have stated that knowledge in connectionist nets is defined by the very structure of the system. For this reason, the hardware/software distinction is impossible to maintain under the Second Computer Metaphor. This is true whether or not the structure of connectionist nets as currently conceived is ``neurally real'', i.e., like the structure that holds in real neural systems. We may still have the details wrong (indeed, we probably do), but the important point for present purposes is that there is no further excuse for ignoring potential neural constraints on proposed cognitive architectures. The distinction that has separated cognitive science and neuroscience for so long has fallen, like the Berlin Wall. Some cognitive psychologists and philosophers of science believe that this is not a good thing (and indeed, the same might be said someday for the Berlin Wall). But we are convinced that this historic change is a good one, especially for those of us who are interested in the codevelopment of mind and brain. We are going in the right direction, even though we have a long way to go.

(5) Non-linear dynamics. Connectionist networks are non-linear dynamical systems, a fact that follows from several properties of connectionist architecture including the existence of intervening layers between inputs and outputs (permitting the system to go beyond linear mappings), the non-linear threshold functions that determine how and when a single unit will fire, and the learning rules that bring about a change in the weighted connections between units. Because these networks are non-linear systems, they can behave in unexpected ways, mimicking the U-shaped learning functions and sudden moments of ``insight'' that challenged old Stimulus-Response theories of learning, and helped to bring about the cognitive revolution in the 1960's (Plunkett & Marchman, 1991a, 1991b; MacWhinney, 1991).
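The non-linearity at the level of a single unit is easy to see in the usual ``squashing'' activation function; the logistic function sketched below is one common choice (our own illustration). Near its threshold region, small changes in net input produce large changes in activation, while far from that region even large changes produce almost none.

```python
import numpy as np

# The logistic ("sigmoid") activation function, one common non-linear choice
# for determining how strongly a unit fires given its summed net input.
def activation(net_input):
    return 1.0 / (1.0 + np.exp(-net_input))

for net in (-6.0, -1.0, -0.5, 0.0, 0.5, 1.0, 6.0):
    print(f"net input {net:+.1f} -> activation {activation(net):.3f}")
```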

(6) Emergent form. Because connectionist networks are non-linear systems, capable of unexpected forms of change, they are also capable of producing truly novel outputs. In trying to achieve stability across a large number of superimposed, distributed patterns, the network may hit on a solution that was ``hidden'' in bits and pieces of the data; that solution may be transformed and generalized across the system as a whole, resulting in what must be viewed as a qualitative shift. This is the first precise, formal embodiment of the notion of emergent form -- an idea that stood at the heart of Piaget's theory of change in cognitive systems. As such, connectionist systems may have the very property that we need to free ourselves from the Nature-Nurture controversy. New structures can emerge at the interface between ``nature'' (the initial architecture of the system) and ``nurture'' (the input to which that system is exposed). These new structures are not the result of black magic, or vital forces. They are the result of laws that govern the integration of information in non-linear systems -- which brings us to our final section.


(4) Some Common Misconceptions about Connectionism

It is no doubt quite clear to the reader that we are enthusiastic about the Second Computer Metaphor, because we believe that it will help us to pick up a cold trail that Piaget first pioneered, moving toward a truly interactive theory of change. But we are aware of how much there is to do, and how many pitfalls lie before us. We are also aware of some of the doubts and worries about this movement that are currently in circulation. Perhaps it would be useful to end this essay with some answers to some common misconceptions about connectionism, with special reference to the application of connectionist principles within developmental psychology.

Worry #1. ``Connectionism is nothing but associationism, and we already know the limits of associationism'' (e.g., Fodor and Pylyshyn, 1988). As we pointed out above, multi-layer connectionist nets are non-linear dynamical systems, whereas the familiar associationist models of the past rested on assumptions of linearity. This is both the good news, and the bad news. The good news is that non-linear systems can learn relationships of considerable complexity, and they can produce surprising and (of course) non-linear forms of change. The bad news is that no one really understands the limits and capabilities of non-linear dynamical systems. Maybe this is also good news: we have finally met our goal, after years of physics envy, because we have finally reached the same frontiers of ignorance as the physicists! Presumably, the limits of these systems will someday be known (although probably not within our lifetimes). But right now, it would be grossly premature to claim that connectionist networks can ``never'' perform certain functions. Anyone who claims that we already know the limits of this kind of associationism has been misinformed.

Worry #2. ``There are no interesting internal representations in connectionist nets'' (e.g., Pinker & Prince, 1988). There are indeed complex and rich representations in connectionist networks, and transformations that do the same work as rules in classical systems. However, these rules and representations take a radically different form from the familiar symbols and algorithms of serial digital computers and/or generative linguistics. The representations and rules embodied in connectionist nets are implicit and highly distributed. Part of the challenge of modern research on neural networks is to understand exactly what a net has learned after it has reached some criterion of performance. So far, the answer appears to be that these representations do not look like anything we have ever seen before (for examples, see Elman, 1989, 1990, 1991).

Worry #3. ``Connectionist nets only yield interesting performance on cognitive problems when the experimenter `sneaks in' the solution by (a) fixing the internal weights until they work, or (b) laying out the solution in the input'' (e.g., Lachter & Bever, 1988). Part of the fascination of connectionist modelling lies in the fact that it offers the experimenter so many surprises. These are self-organizing systems that learn how to solve a problem. As the art is currently practiced, NO ONE fiddles with the internal weights but the system itself, in the course of learning. Indeed, in a simulation of any interesting level of complexity, it would be virtually impossible to reach a solution by ``hand-tweaking'' of the weights. As for the issue of ``sneaking the solution into the input'', we have seen several simulations in which the Experimenter did indeed try to make the input as explicit as possible -- and yet the system stubbornly found a different way to solve the problem. Good connectionist modelers approach their simulations with the same spirit of discovery and breathless anticipation that is very familiar to those who carry out real experiments with real children. Aside from being close to impossible, cheating would not be any fun at all -- and the hand-crafting of solutions is usually considered a form of cheating.

Worry #4. ``The supposed commitment to neural plausibility is a scam; no one really takes it seriously.'' Connectionists work at many different levels between brain and behavior. In current simulations of higher cognitive processes, it is true that the architecture is ``brain- like'' only in a very indirect sense. In fact, the typical 100-neuron connectionist toy is ``brain-like'' only in comparison with the serial digital computer (which is wildly unlike nervous systems of any known kind). The many qualities that separate real brains from connectionist simulations have been described in detail elsewhere (Hertz, Krogh and Palmer, 1991; Crick, 1989; Churchland and Sejnowski, in press). The real questions are: (a) is there anything of interest that can be learned from simulations in simplified systems, and (b) can connectionists ``add in'' constraints from real neural systems in a series of systematic steps, approaching something like a realistic theory of mind and brain? Of course we still do not know the answer to either of these questions, but there are many researchers in the connectionist movement who are trying to bring these systems closer to neural reality. For example, efforts are underway to study the computational properties of different neuronal types. Some researchers are exploring analogues to synaptogenesis and synaptic pruning in neural nets. Others are looking into the computational analogues of neural transmitters within a fixed network structure. The current hope is that work at all these different levels will prove to be compatible, and that a unified theory of the mind and brain will someday emerge. Of course we are a long way off, but the commitment by most of the researchers that we know in this field is a very serious one. It has launched a new spirit of interdisciplinary research in cognitive neuroscience, one with important implications for developmental psychology.

Worry #5. ``Connectionism is anti-nativist, and efforts are underway to reinstate a tabula rasa approach to mind and development'' (e.g., Kirsh, 1992). It is true that many current simulations assume something like a tabula rasa in the first stages of learning (e.g., a random ``seeding'' of weights among fully-connected units before learning begins). This has proven to be a useful simplifying assumption, in order to learn something about the amount and type of structure that has to be assumed for a given type of learning to go through. But there is no logical incompatibility between connectionism and nativism. Indeed, just as many historians have argued that Franklin Delano Roosevelt saved capitalism, connectionism may prove to be the salvation of nativist approaches to mind. The problem with current nativist theories is that they offer no serious account of what it might mean in biological terms for a given structure or idea to be innate. In neural networks, it is possible to explore various avenues for building in innate structure, including minor biases that have major structural consequences across a range of environmental conditions (Jacobs, Jordan, & Barto, 1991). In fact, within connectionist models there are coherent ways to talk about 90% or 10% of any innate idea! This is an approach that has not been explored in any detail to date, but the possibilities are intriguing, and might (ironically enough) end up being connectionism's greatest contribution to developmental cognitive neuroscience.
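One hedged sketch of what a ``degree of innateness'' could look like in this framework (our own illustration, not a description of any published model): seed the starting weights as a weighted blend of a pre-specified pattern and random noise, so that learning begins 90% of the way toward one particular solution.

```python
import numpy as np

# Blend an "innate" weight pattern with random noise before learning begins.
# The 2x2 pattern and the 0.9 blending factor are purely illustrative assumptions.
rng = np.random.default_rng(0)
innate_pattern = np.array([[1.0, -1.0],
                           [-1.0, 1.0]])
degree_innate = 0.9                          # "90% of an innate idea"
initial_weights = (degree_innate * innate_pattern
                   + (1.0 - degree_innate) * rng.normal(size=innate_pattern.shape))
print(initial_weights)                       # the starting point from which learning then proceeds
```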

To conclude, we are willing to speculate that we will soon see a revival of Piagetian theory within a connectionist framework -- not a mindless reinterpretation of the old theory in modern jargon, but a return to Piaget's program of genetic epistemology, instantiating his principles of equilibration and adaptation in concrete systems that really work -- and really change. As we said before, Piaget spent the later decades of his life seeking a way of formalizing the theory, to answer critics (including Piaget himself) who charged that his principles of change were much too vague. We think that Piaget would have loved these new possibilities if he had lived to see them. We now have an opportunity to pick up the threads of his old program and move it forward into an exciting new decade, incorporating all the new insights and new empirical information that has been gained in the interim, without abandoning the fundamental commitment of developmental psychology to the study of change.

FOOTNOTES




1. This particular school of Functionalism has little to do with, and is indeed diametrically opposed to, an approach within linguistics and psycholinguistics alternatively called Functional Grammar or Cognitive Linguistics. For discussions, see Bates and MacWhinney, 1989; Langacker, 1987; Lakoff, 1987; Givón, 1984.

2. A number of readable introductions to connectionism are now available. See Bechtel and Abrahamsen, 1991; Churchland and Sejnowski, in press; Dayhoff, 1990. An excellent but more technical introduction can be found in Hertz, Krogh, & Palmer, 1991.

REFERENCES




Anderson, J.A. (1972). A simple neural network generating an interactive memory. Mathematical Bio-Sciences, 8, 137-160.

Anderson, J.A., & Rosenfeld, E. (1989). Neurocomputing: Foundations of research. Cambridge, MA: MIT Press/Bradford Books.

Baillargeon, R., & de Vos, J. (1991). Object permanence in young infants: Further evidence. Child Development, 62, 1227-1246.

Bates, E., Thal, D. and Marchman, V. (1991). Symbols and syntax: A Darwinian approach to language development. In N. Krasnegor, D. Rumbaugh, E. Schiefelbusch and M. Studdert-Kennedy (Eds.) The biological and behavioral determinants of language development. Hillsdale, NJ: Erlbaum.

Bruner, J., & Sherwood, V. (1976). Peekaboo and the learning of rule structures. In J. S. Bruner, A. Jolly & K. Sylva (Eds.), Play: Its role in development and evolution. New York: Basic Books, Inc.

Bechtel, W. and Abrahamsen, A. (1991). Connectionism and the mind. Oxford: Basil Blackwell.

Churchland, P. and Sejnowski, T. (in press). The net effect. Cambridge, MA: MIT Press/Bradford Books.

Crick, F. (1989). The recent excitement about neural networks. Nature, 337, 129 - 132.

Dayhoff, J. (1990.) Neural network architectures. New York: Van Nostrand Reinhold.

Eccles, J.L. (1953). The neurophysiological basis of mind. Oxford: Clarendon.

Elman, J. (1989). Structured representations and connectionist models. In The Eleventh Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.

Elman, J. (1990). Finding structure in time. Cognitive Science, 14, 179 - 211.

Elman, J. (1991) Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195-225.

Feldman, J. A., & Ballard, D.H. (1980). Computing with connections. TR 72. University of Rochester: Computer Science Department.

Ferguson, C., & Snow, C. (1978). Talking to children. Cambridge: Cambridge University Press.

Fodor, J.A. (1981) Representations. Brighton (Sussex): Harvester Press.

Fodor, J.A., & Pylyshyn, Z.W. (1988). Connectionism and cognitive architecture: A critical analysis. In S. Pinker & J. Mehler (Eds.), Connections and Symbols. Cambridge, MA: MIT Press/Bradford Books. Pp. 3-71.

Givón, T. (1984). Syntax: A functional-typological introduction. Volume I. Amsterdam: John Benjamins.

Grossberg, S. (1968). Some physiological and biochemical consequences of psychological postulates. Proceedings of the National Academy of Science, USA, 60, 758 - 765.

Grossberg, S. (1972). Neural expectation: Cerebellar and retinal analogs of cells fired by learnable or unlearned pattern classes. Kybernetik, 10, 49 - 57.

Grossberg, S. (1987). The adaptive brain, 2 vols. Amsterdam: Elsevier.

Hebb, D. (1949) The organization of behavior. New York: Wiley.

Hertz, J., Krogh, A. and Palmer, R. (1991). Introduction to the theory of neural computation. Redwood City, California: Addison Wesley.

Hinton, G.E., & Shallice, T. (1991). Lesioning a connectionist network: Investigations of acquired dyslexia. Psychological Review, 98, 74-95.

Hinton, G.E., & Anderson, J.A. (1981). Parallel models of associative memory. Hillsdale, NJ: Erlbaum.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359-366.

Hyams, N. (1986). Language acquisition and the theory of parameters. Dordrecht & Boston: Reidel.

Jacobs, R., Jordan, M. and Barto, A. (1991). Task decomposition through competition in a modular connectionist architecture: the what and where visual tasks. Cognitive Science, 15, 219 - 250.

Kellman, P. J., Spelke, E. S., & Short, K. R. (1986). Infant perception of object unity from translatory motion in depth and vertical translation. Child Development, 57, 72-86.

Kirsh, D. (1992). PDP Learnability and innate knowledge of language. Center for Research in Language Newsletter Vol. 6, no. 3. University of California, San Diego.

Kohonen, T. (1977). Associative memory: A system-theoretical approach. Berlin: Springer.

Lachter, J., & Bever, T.G. (1988). The relation between linguistic structure and associative theories of language learning: A constructive critique of some connectionist learning models. In S. Pinker & J. Mehler (Eds.), Connections and Symbols. Cambridge, MA: MIT Press/Bradford Books. Pp. 3-71.

Lakoff, G. (1987). Fire, women, and dangerous things: What categories reveal about the mind. Chicago: University of Chicago Press.

Langacker, R. (1987). Foundations of cognitive grammar: Theoretical perspectives. Volume I. Stanford: Stanford University Press.

Le Cun, Y. (1985). Une procédure d'apprentissage pour réseau à seuil asymétrique. In Cognitiva 85: à la Frontière de l'Intelligence Artificielle des Sciences de la Connaissance des Neurosciences (Paris 1985), 599 - 604.

Lightfoot, D. (1991). The child's trigger experience -- Degree-0 learnability. Behavioral and Brain Sciences, 14(2).

MacWhinney, B. (1991). Implementations are not conceptualizations: Revising the verb-learning model. Cognition, 40, 121 - 157.

Marchman, V. (1992). Language learning in children and neural networks: Plasticity, capacity, and the critical period. (Technical Report 9201). Center for Research in Language, University of California, San Diego.

McClelland, J. and Rumelhart, D. (1986). Parallel distributed processing: explorations in the microstructure of cognition, Vol. 2. Cambridge, Mass.: MIT Press/Bradford Books.

McCulloch, W. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115 - 133. Reprinted in J. Anderson and E. Rosenfeld (Eds.), Neurocomputing: Foundations of research. Cambridge, Mass.: MIT Press.

Mead, C. (1989). Analog VLSI and neural systems. Inaugural address presented to the Institute for Neural Computation, October, 1989. University of California, San Diego.

Minsky, M. (1956). Some universal elements for finite automata. In C.E. Shannon & J. McCarthy (Eds.), Automata studies. Princeton: Princeton University Press. Pp. 117-128.

Minsky, M. and Papert, S. (1969). Perceptrons. Cambridge, Mass.: MIT Press.

Papert, S. (1988). One AI or Many? Daedalus: Artificial Intelligence. Winter, 1988.

Piaget, J. (1952). The origins of intelligence in children. New York: International Universities Press.

Piaget, J. (1970a). Structuralism. New York: Basic Books.

Piaget, J. (1970b). Genetic epistemology. New York: Columbia University Press.

Piaget, J. (1971). Biology and knowledge: An essay on the relations between organic regulations and cognitive processes. Chicago: University of Chicago Press.

Piatelli-Palmarini, M. (1989). Evolution, selection, and cognition: From ``learning'' to parameter setting in biology and the study of language. Cognition, 31, 1-44.

Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. In S. Pinker & J. Mehler (Eds.), Connections and Symbols. Cambridge, MA: MIT Press/Bradford Books. Pp. 3-71.

Plunkett, K. and Marchman, V. (1991a) U-shaped learning and frequency effects in a multi-layered perceptron: implications for child language acquisition. Cognition, 38, 43-102.

Plunkett, K. and Marchman, V. (1991b). From rote learning to system building. (Technical Report 9020). Center for Research in Language, University of California, San Diego.

Roeper, T. and Williams, E., Eds. (1987). Parameter setting. Dordrecht and Boston: Reidel.

Rogoff, B. (1990). Apprenticeship in thinking: Cognitive development in social context. New York: Oxford University Press.

Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386-408.

Rosenblatt, F. (1962). Principles of neurodynamics. New York: Spartan.

Rumelhart, D., Hinton, G. and Williams, R. (1986). Learning representations by back- propagating errors. Nature, 323, 533 - 536.

Rumelhart, D., McClelland, J. and the PDP Research Group (1986). Parallel distributed processing: explorations in the microstructure of cognition, Vol. 1. Cambridge, Mass.: MIT/Bradford Books.

Schwartz, M.F., Saffran, E.M., & Dell, G.S. (1990). Comparing speech error patterns in normals and jargon aphasics: Methodological issues and theoretical implications. Presented to the Academy of Aphasia, Baltimore, MD.

Seidenberg, M., & McClelland, J.L. (1989). A distributed developmental model of visual word recognition and naming. Psychological Review, 96, 523-568.

Selfridge, O.G. (1958). Pandemonium: a paradigm for learning. In Mechanisation of Thought Processes: Proceedings of a Symposium Held at the National Physical Laboratory, November 1958. London: HMSO. Pp. 513-526.

Spelke, E. (1990). Principles of object perception. Cognitive Science, 14, 29-56.

Spelke, E. (1991). Physical knowledge in infancy: Reflections on Piaget's theory. In S. Carey and R. Gelman (Eds.), The epigenesis of mind: essays on biology and cognition. Hillsdale, New Jersey: Erlbaum, 133 - 169.

Thal, D., Marchman, V., Stiles, J., Aram, D., Trauner, D., Nass, R., & Bates, E. (1991). Early lexical development in children with focal brain injury. Brain and Language, 40, 491-527.

von Neumann, J. (1951). The general and logical theory of automata. In L.A. Jeffress (Ed.), Cerebral mechanisms in behavior. New York: Wiley.

von Neumann, J. (1958). The computer and the brain. New Haven: Yale University Press.

Werner, H. (1948). Comparative psychology of mental development. New York: International Universities Press.

Willshaw, D.J., Buneman, O.P., & Longuet-Higgins, H.C. (1969). Nonholographic associative memory. Nature, 222, 960-962.



