Re: Schyns 3.4. - 4.

From: HARNAD Stevan (harnad@coglit.soton.ac.uk)
Date: Sat Mar 07 1998 - 21:17:59 GMT


> From: Walker-Hall James <jrwh195@soton.ac.uk>

>> One of the problems with this idea is that high-dimensional spaces
>> are mostly empty. To illustrate, imagine discretizing a line, a
>> squared plane, a cube and a hypercube with tiles of equal size
>> (e.g., 10 tiles per side). There is a geometric increase (in this
>> example, 10^1, 10^2, 10^3, 10^4) in the number of tiles that cover the
>> objects. If each tile is represented by an n-dimensional data point,
>> the example shows that one needs approximately 10^n tiles to cover an
>> n-dimensional space. If the input distribution varies along many
>> degrees of freedom, a learning problem in high-dimensional space may
>> require an unrealistically large training set to discover robust
>> features, even if an asymptotic solution exists in principle.

jwh> This may (?) be saying that a model concerned with high-dimensional
jwh> properties (like the one proposed here) is inefficient, because 3d
jwh> objects that are similar in form, but vary in size, will differ a lot
jwh> in their representation in n-dimensional space. This would surely
jwh> imply that we would find it difficult to categorize a large cube and a
jwh> small cube as "cubes". But I think I am probably going wrong here...

Partly right. I think it's saying that if you represent all the
features of a set of objects in a very big n-dimensional space, and
then you try to train a model to sort the objects into, say, 2
categories (on the basis of a feature or features shared by one of the
categories and not the other -- i.e., an invariant feature or
features), then (1) the problem may be too hard, requiring the
sampling of unrealistically many of the members of the two categories, or,
even worse, (2) there may be no invariant feature(s) to be found at
all, and all you can do is memorise every single object.

The intuition is that when an invariant feature is hidden somewhere in a
many-dimensional space, finding it is like looking for a needle in a
haystack.
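
To make the intuition concrete, here is a rough Python sketch of the
tile-counting argument quoted above (the bin count and the sample size
are arbitrary illustrations, not figures from the target article):

    import random

    bins_per_dim = 10      # 10 tiles per side, as in the quoted example
    sample_size = 1000     # an arbitrary, fixed training sample

    for n_dims in (1, 2, 3, 4, 10):
        total_tiles = bins_per_dim ** n_dims   # 10^n tiles cover the space
        # Draw random objects and record which tile each one lands in.
        visited = {
            tuple(random.randrange(bins_per_dim) for _ in range(n_dims))
            for _ in range(sample_size)
        }
        print(f"{n_dims:2d} dimensions: {total_tiles:>14,d} tiles, "
              f"sample covers {len(visited) / total_tiles:.4%}")

The same fixed sample covers the whole line but only a vanishing
fraction of the 10-dimensional space, which is why an invariant feature
hidden there is so hard to find.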

jwh> If you have a system that's good at categorizing a limited set of
jwh> classes, then if you expose it to stimuli that are outside what it can
jwh> do, it will be biased towards the sets that it's predisposed to.

That's right. And depending on the way the world really is, that bias
may be helpful (if most of the rest of the world is similar to your
training sample) or it may be unhelpful (if the training sample was not
representative enough or not representative of enough of the world).

This is called, in machine learning, the "credit/blame assignment
problem": If you or a machine are trying to learn to sort a set of
objects correctly into their categories (let's say there are two
categories, and you get corrective feedback after every trial, letting
you know whether you sorted correctly or incorrectly), in general, if
there are a lot of features, then when you sort an object correctly,
you can't be sure what you did right (which of the many features that
it had were the relevant ones) and when you sort one incorrectly, you
can't be sure what you did wrong. At any point, you are sorting, on the
basis of some feature(s) or other(s), but maybe you're basing it on the
wrong feature(s).
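
A minimal sketch of what that feels like to a learning machine (a toy
perceptron-style learner, not the target article's model): the only
feedback is right/wrong, so after every error the blame is spread
across all the features that were active, and only gradually does the
weight pile up on the one feature that actually matters.

    import random

    n_features = 50
    relevant = 3               # only this feature really defines the category
    weights = [0.0] * n_features
    learning_rate = 0.1

    def make_object():
        # Feature values in [-0.5, 0.5]; the label depends on one hidden feature.
        x = [random.random() - 0.5 for _ in range(n_features)]
        return x, (1 if x[relevant] > 0 else 0)

    for trial in range(2000):
        x, label = make_object()
        guess = 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0
        error = label - guess   # corrective feedback only: right or wrong
        # Credit/blame is spread over every feature in proportion to its
        # activity, because the learner cannot tell which one was responsible.
        for i in range(n_features):
            weights[i] += learning_rate * error * x[i]

    print("feature with the largest learned weight:",
          max(range(n_features), key=lambda i: weights[i]))

With enough trials the weight usually ends up concentrated on the
relevant feature; with fewer trials, or more features, the learner may
well still be sorting on the wrong ones.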

Learning algorithms or learning rules are methods that modelers have
designed for trying to get around the credit/blame assignment problem.
The worst method is of course to try to remember everything by rote by
sampling every single object. (If there is an infinite number of
objects, or even an extremely large number, this is not possible, even
if there were the time and the space to sample that many.) So learning
rules need to find ways of reducing the features to a manageable
number. And they need to make it possible to generalise from a training
set that is not unreasonably large to all the rest of the cases that the
categoriser is likely to encounter.

A famous variant of the credit/blame assignment problem is the problem
of "underdetermination" of scientific theories by scientific data.
Patterns of data are often so complex that more than one theory can
explain them, and so there is no way of knowing which theory is the
correct one: Tomorrow, or someday, the one you chose may go wrong.

Sometimes the theory is so underdetermined by the data that it is impossible
to find it by learning. That's how it is with Universal Grammar, UG,
and the data the child has from which to try to learn UG: The "poverty
of the stimulus" is a form of underdetermination.

>> However, low bias comes at the cost of high variance, the second
>> component of the error (where variance means the discrepancy between
>> the correct categorization and the categorization of the network).

jwh> If you address the bias by making the "machine" (or whatever it is we
jwh> are talking about here... I'm not particularly sure myself...) more
jwh> general, you are then going to get more error (or high variance) in the
jwh> output (the categorizations that the machine makes).

The bias of a learning machine is a measure of how closely it clings to
the tentative features it has learned so far. If it gives them up too
easily in the face of error, that's not good, because that way it may
never settle on the correct solution. If it clings to them too hard, it
will not learn quickly enough from its mistakes.

By the way, these learning machines include us.

>> There is high variance because a flexible system is too sensitive to
>> the data: It learns many idiosyncrasies of the exemplars (e.g.,
>> differences in lighting conditions, rotation in depth, translation
>> in the plane, and so forth) before learning the invariants of a
>> category. Consequently, experience with many exemplars is necessary
>> for the network to "forget" idiosyncrasies and learn relevant
>> abstractions. Only with great experience is the system able to
>> categorize accurately (keep the variance low).

jwh> So the way to get around this bias/variance trade-off is to expose the
jwh> system to lots and lots of different types of stimuli. If you were to
jwh> expose this thing to a series of objects that are similar in most
jwh> dimensions, then the machine would learn to focus on the limited set of
jwh> dimensions that vary between those stimuli. In future encounters with
jwh> other sorts of stimuli, it will still focus on this limited set of
jwh> dimensions it has used in the past, and this would not be conducive to
jwh> successful categorizations. An example off the top of my head would be
jwh> to train the machine in human face discrimination (make the example
jwh> even more potent by saying they are sibling faces), and then require it
jwh> to discriminate between aircraft or something (or something very
jwh> different from human faces... you know what I mean.)

The example's fine. But sometimes the learner does not have the time to
keep sampling on and on; and sometimes the training sample turns out to
be unrepresentative of the kinds of categorisation problems the learner
will encounter later on.
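
Here is a rough illustration of that trade-off in Python (polynomial
curve-fitting stands in for "a flexible categoriser"; the numbers are
arbitrary, not from the article):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_exemplars(n):
        x = rng.uniform(-1, 1, n)
        y = x ** 2 + rng.normal(0, 0.1, n)   # a regularity plus idiosyncrasies
        return x, y

    x_test = np.linspace(-1, 1, 200)
    y_true = x_test ** 2

    for n_exemplars in (8, 1000):
        x, y = sample_exemplars(n_exemplars)
        flexible_model = np.polyfit(x, y, deg=7)   # a very flexible learner
        y_hat = np.polyval(flexible_model, x_test)
        print(f"{n_exemplars:4d} exemplars -> error on new cases: "
              f"{np.mean((y_hat - y_true) ** 2):.4f}")

With only 8 exemplars the flexible model fits their idiosyncrasies and
does badly on new cases; with 1000 it has "forgotten" the idiosyncrasies
and recovered the regularity. And, as above, sometimes there simply
isn't time to go on sampling.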

jwh> I would imagine this process is analogous to the one that children go
jwh> through in the "critical" years of development, and that the biased
jwh> machines may be analogous to the cats that can only see vertical lines
jwh> after being raised in a room that only has vertical lines on the wall.

Good connections!

Critical periods are probably best thought of as machines with a
"prepared bias" even before encountering any data. Once the critical
period is over, the prepared bias is gone and learning is much harder.
That's how it is for the learning of pattern vision in kittens (through
sensorimotor trial and error), and for the "learning" of language by
children. They start out with prepared biases, but lose them if they
don't encounter the genetically anticipated experience during the
critical period.

>> Complex supervised categorization problems in high-dimensional
>> spaces would be simplified if it were possible to reduce the
>> dimensionality of the input. Several linear and nonlinear
>> dimensionality reduction techniques have been designed to achieve
>> this goal. Underlying dimensionality reduction is the idea that
>> information processing is divided into two distinct stages. A first
>> stage constructs a representation of the environment and a second
>> stage uses this representation for higher-level cognition such as
>> categorization and object recognition. It is hoped that the
>> constructed representation in a smaller dimensional space is more
>> useful than the raw input representation.

jwh> I don't get this as I thought Schyns was arguing for categorization at
jwh> a high-dimensional level, and now he is trying to reduce it.

The point is that even after the dimensionality is reduced, it's still
high. All approaches to category learning need to do SOME dimensional
reduction.

jwh> Anyway,
jwh> it then says that human information processing starts off with raw
jwh> visual input that becomes more and more "chunked" and "organized" as it
jwh> gets to higher stages of visual processing, and then into even
jwh> higher, more general processes (such as remembering).

>> To illustrate, consider the popular technique called Principal
>> Components Analysis. If redundancies exist in the input data, there
>> should be fewer sources of variation than there are dimensions
>> (i.e., p << n). PCA finds the first k orthogonal directions... etc...

jwh> I think this is just a mathematical method for reducing data in
jwh> high-dimensional space. When we are analyzing objects in such a
jwh> high-dimensional space, there will obviously be lots of empty space, and
jwh> this just cuts that out.

Not quite: It's not that a lot of it is empty, necessarily, but a lot of
it is redundant. Think of a lot of the variance as covariance, which
is like correlation: If I have 70 dimensions, but the values on many
of them are correlated (so whenever something is high on dimension
27, it's also high on dimensions 50-70), then I can try to reduce it to
a few uncorrelated (orthogonal, independent) dimensions.
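
A small numeric sketch of that point (synthetic data, not from the
article): 70 measured dimensions generated from only 3 underlying
sources, where a PCA-style decomposition finds that a few orthogonal
directions carry almost all the variance.

    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, n_dims, n_sources = 500, 70, 3

    sources = rng.normal(size=(n_samples, n_sources))  # the real, independent variation
    mixing = rng.normal(size=(n_sources, n_dims))      # how it spreads across 70 dimensions
    data = sources @ mixing + 0.05 * rng.normal(size=(n_samples, n_dims))

    # PCA via the singular values of the centred data matrix.
    centred = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centred, compute_uv=False)
    variance_explained = singular_values ** 2 / np.sum(singular_values ** 2)

    print("share of variance carried by the first 3 of 70 directions:",
          f"{variance_explained[:3].sum():.1%}")

The remaining 67 measured dimensions are not empty, just redundant:
they repeat information already carried by the first few.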

>> In analogy to the functional (re)organization of somatosensory maps,
>> we would like the formal definition of "important lower-dimensional
>> structures" to be closer to the categorization task the system needs
>> to solve.

jwh> So, if we are to reduce to lower-dimensional models, the reduction
jwh> should take place not blindly (i.e., simply cutting out all redundant
jwh> info); instead, the reduction should be done with the aims of the
jwh> categorization in mind....

Well, you couldn't even find the redundant or correlated dimensions
"blindly." But you needn't worry about the technical details of actual
dimensionality reduction techniques. Many methods exist, some completely
general, some specialised for certain kinds of data.

>> In the reviewed dimensionality reduction techniques, the feature
>> extraction stage operates independently of higher-level processes;
>> thus there is no guarantee that the extracted features will be
>> useful for higher-level processes (Mozer, 1994).

jwh> I think this is similar to the point I just made. If the machine
jwh> reduces data at a high dimensional level, it should do it bearing in
jwh> mind some sort of context. (The higher-level processes). For example,
jwh> current goals and motivations etc.

(If we only knew what "motivations" are in machine terms.) Roughly
speaking, higher-level abstractions may or may not go "against the
grain" of lower-level ones. Your lower level features may or may not be
useful for the higher level categorisations you need to make.

>> The functionality principle suggests that the categorizations being
>> learned should influence the features that are extracted. In other
>> words, top-down information should constrain the search for
>> relevant dimensions/features of categorization.

jwh> When learning a new category, we are told the defining features of that
jwh> category. This should therefore determine the dimensions and features
jwh> that we are concerned with initially. As we get better at
jwh> discriminating the category concerned, the process becomes more
jwh> refined, in that we become more specific on what data is relevant and
jwh> what can be reduced.

The Schyns et al. target article is not very concrete about this. What
they mean literally is something like this: When I look at objects such
as microscope slides, it should not just be their simple features that
guide my sorting; using biological theory and examples of plant and
animal cells I should be able to develop "constrained" cell-detectors
with complex features "created" just for that high level problem. Using
these features should in turn influence how I perceive slides (and
perhaps other objects as well).
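
One way to see what letting the task constrain the features could buy,
in a toy setting (this is just the standard contrast between label-blind
and label-informed dimensionality reduction, not the authors' proposal;
it assumes numpy and scikit-learn are available):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 1000
    labels = rng.integers(0, 2, n)
    # Dimension 0: a small but category-relevant difference.
    # Dimensions 1-19: large variation that has nothing to do with the category.
    data = np.column_stack([labels * 0.5 + rng.normal(0, 0.3, n),
                            rng.normal(0, 5.0, (n, 19))])

    X_tr, X_te, y_tr, y_te = train_test_split(data, labels, random_state=0)

    for name, reducer in [("PCA (ignores the labels)", PCA(n_components=1)),
                          ("LDA (uses the labels)   ",
                           LinearDiscriminantAnalysis(n_components=1))]:
        Z_tr = reducer.fit_transform(X_tr, y_tr)   # PCA silently ignores y_tr
        Z_te = reducer.transform(X_te)
        clf = LogisticRegression().fit(Z_tr, y_tr)
        print(name, "accuracy on held-out objects:",
              round(clf.score(Z_te, y_te), 2))

The label-blind reduction keeps the single direction with the most
variance, which here is irrelevant to the categorization; the
label-informed one keeps the direction that actually separates the
categories.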

>> Thus, we believe the serial process of (1) projecting high dimension
>> space onto a new lower dimension space, then (2) determining
>> categorization with new dimensions, will have to be modified such
>> that the second process informs the first (see also Intrator, 1993).

jwh> I think this is saying that objects are initially analyzed at a very
jwh> detailed level, (in high dimensional space) and are then reduced to
jwh> more manageable representations. The process takes advantage of a kind
jwh> of feedback loop, so that the defining low level dimensional features
jwh> of the category become more and more refined.

It's a bit vague, but roughly it means that the features we "construct"
to do more abstract categorisation can also influence more concrete
categorisation.


