TOWARD RESOLVING THE SIGNIFICANCE TESTING DEBATE:

ELECTRONIC PUBLISHING AND EDITORIAL DECISION MAKING

Keywords: statistical significance; electronic publishing; publication bias; null hypothesis;

meta analysis; editorial decision making; hypothesis testing

David M. Lane

Department of Psychology

Rice University

Houston, TX

77005 USA

lane@rice.edu

Miguel A. Quinones

Department of Psychology

Rice University

Houston, TX

77005 USA

mickey@rice.edu

ABSTRACT: The debate over the use of statistical significance testing in the social

sciences has heated up in recent years. This paper presents a brief review of the common

criticisms of significance testing and argues that, as long as journal editors must

choose among a number of manuscripts, significance testing provides useful

information for making these choices. Electronic publishing is presented as a way

of resolving the current impasse in this debate.

I. INTRODUCTION

1. In a recent article, Schmidt (1996) portrays two visions of statistics and its

role in psychological research. In the first, statistical significance testing is the

primary means of data analysis. Although researchers are happily using the

significance tests they learned in graduate school, these researchers have an

inadequate understanding of the underlying logic of statistical inference. As a

result, they use significance tests to answer questions that significance tests, in

principle, cannot answer. This frequently leads to such serious misinterpretations

of results that the accumulation of scientific knowledge is severely impede d.

In the second vision, the shackles of significance testing have been thrown off

and researchers rely primarily on interval estimation to analyze their results. Data

from well-designed experiments are published without regard to significance

testing and meta-analysis is used to combine the results across studies to reach

valid conclusions.

2. Schmidt argues that significance testing has nothing valid to contribute to the

analysis of data and that it should neither be taught in graduate school statistics

courses nor included in published reports of scientific works. Although we are

sympathetic to many of Schmidt's points, and agree with him that his second

vision is one to be strived for, we believe Schmidt has overlooked an important

characteristic of significance testing: significant results are more conclusive than

nonsignificant results.

3. We begin this paper by summarizing the arguments of Schmidt and others

(e.g., Cohen, 1994) against the use of significance tests. We then present the

argument that significant results are more conclusive than nonsignificant results.

This leads us to conclude that given the premium on journal space, editorial

decisions will be and should be heavily influenced by the results of significance

tests. This makes it very difficult to move toward Schmidt's second vision of the

role of statistics. The proposed solution lies in a new model for publishing and

editorial decision making based on electronic publishing on the World Wide Web.

II. ARGUMENTS AGAINST SIGNIFICANCE TESTING

4. Numerous authors have pointed out errors that psychologists and other

researchers have made interpreting and applying significance tests (see Chow,

1996, for a review). Among these errors are (1) accepting the null hypothe sis for

nonsignificant results, (2) interpreting the probability value as the probability the

null hypothesis is false, (3) confusing statistical significance and effect size, and

(4) interpreting the probability value as the probability of obtaining a significant

outcome in a subsequent finding. Schmidt argued convincingly that the first of

these errors, accepting the null hypothesis, has been the most detrimental to the

advance of knowledge. As an example, Schmidt described how the failure of some

studies to find significant correlations between employment tests and job

performance led to numerous investigations of a phenomenon that is best explained

as sampling error.

5. The crux of criticisms of significance testing is not that significance testing is

misused, for that would only indicate that it should be used correctly. Instead, the

argument is that significance testing has nothing of value to offer and that

researchers use it because they mistakenly believe that significance testing

provides information that, in fact, it does not. For example, most researchers

would like their statistical analysis to provide the probability that the null

hypothesis is false. Instead, significance testing gives only the probability of the

outcome (or a more extreme outcome) given that the null hypothesis is true.

Critics of significance testing claim that most users of significance testing

believe they have an objective way to determine whether or not a null hypothesis is

true. However, in most if not all realistic experimental situations, the prior

probability of the null hypotheses being true is essentially zero. To illustrate,

suppose an experimenter were interested in comparing two methods for teaching

subjects how to use a piece of computer software. It is not conceivable that the

population difference between the two methods could be exactly zero. If the

(population) mean time to perform a task after being trained with Method A is

3.6242541 minutes, is it conceivable that 3.6242541 minutes is also the mean time

to perform after being trained with Method B? The argument, therefore, is that this

whole to do about rejecting the null hypothesis is much ado about nothing s ince the

null hypothesis is virtually always known to be false in the first place (cf. Cohen,

1994).

III. SIGNIFICANCE TESTING AS CONCLUSIVENESS TESTING

6. Despite these and other criticisms of significance testing, significance tests do

make one important contribution: they indicate whether or not a set of

experimental data is conclusive. Consider a "crucial" experiment in which

competing theories make opposite predictions: Theory 1 predicts subjects in

Condition A should outperform subjects in Condition B whereas Theory 2 predicts

the opposite. Assume, for the sake of argument, that it is implausible that the two

conditions result in exactly the same (population) level of performance. That

leaves two possibilities: (1) Condition A > Condition B and (2) Condition A <

Condition B.

7. Assume that an experimenter does a significance test and finds that the

difference is statistically significant at the .01 level and that the mean for Condition

A is greater than the mean for condition B. This finding allows the researcher to

make a statement about the 99% confidence interval on the difference between

population means. It is well known that if a statistic (the difference between

sample means in this case) is significantly different from a hypothesized value

(zero in this case) then the confidence interval associated with the significance test

(99% confidence interval for .01 level) will not contain the hypothesized

parameter. For this example, the significant outcome means that the 99%

confidence interval on mean A -mean B does not contain zero. Instead, all values in the

interval will be greater than zero. Thus all "plausible" values of mean A - mean B will be

positive and the experimenter will be justified in concluding that mean A > mean B and

therefore that Theory 1 is supported.

8. If the result had not been significant, then values on both sides of zero would

have been included in the 99% confidence interval on mean A -mean B. This means that

both theoretical outcomes are still plausible: mean A could be greater than mean B but mean

B could be greater than mean A.

9. Although theories rarely live or die based on the result of a single study, a

significant result certainly leads to a stronger conclusion than a nonsignificant

result. Specifically, a significant result provides the basis for a researcher to draw a

conclusion about the direction of an effect (see also Frick, 1996). Once this

conclusion is reached, the experimenter may wish to try variations on the

experiment to determine the boundary conditions of the effect or to see if the size

of the effect depends on other factors. A nonsignificant result is an inconclusive

result. As such, it does not support or confirm the null hypothesis. Instead, it fails

to determine conclusively the direction of the effect.

10. As an aside, if significant results were called "conclusive results" then the

propensity of researchers to accept the null hypothesis implicitly would be

diminished. Instead of reporting "the difference between means was not

significant" researchers would report "the direction of the difference between

means was not determined conclusively."

IV. SIGNIFICANCE TESTING AND EDITORIAL DECISION MAKING

11. The negative consequences of using significance testing in editorial decision

making have been discussed for some time (Bakan, 1966; Greenwald, 1975) . We

believe the most serious problem is the incompatibility between significance

testing and effect-size estimation. Specifically, if significant results are a criterion

for publication, then published articles will contain inflated estimates of effect size

(cf. Hedges, 1984; Lane & Dunlap, 1978). Since the power of psychological

experiments is often relatively low (Cohen, 1962), this inflation can be substantial.

For example, Lane and Dunlap found that when the true difference between two

groups (each with n=3D10) was 8 IQ points and alpha was set at .05, the observed

mean difference between the two groups was over 18 points when only significant

results were considered. Naturally, as the alpha level was decreased (made more

stringent), the amount of overestimation of the true difference rose dramatically.

There have been some methods proposed for dealing with the bias inherent in

our current system. Rosenthal (1979) presented a procedure for addressing the "file

drawer" problem in meta-analytic research where only significant results are

included. He showed that as the number of published studies increases, the

probability of drawing incorrect conclusions by failing to include nonsignificant

unpublished studies becomes trivial. However, Rosenthal's analysis does not

speak to the issue of bias in estimating effect size. Hedges (1984) developed an

analytical procedure for estimating effect size based on a set of estimates from a

distribution truncated by including only significant results. Although Hedges's

procedure is a major contribution, it is not a perfect solution. For example,

complications arise when some but not all of the published articles report

significant results. Hedges recommends that in such situations, the nonsign ificant

results be discarded and his procedure applied to the significant outcomes.

Although this is a generally good solution, there are occasions in which this would

result in an unacceptably large amount of information being lost. Moreover,

complex situations such as one in which the probability of a paper being accepted

is a continuous monotonic function of the probability level are difficult to accommodate. In

short, no method for correcting for a bias can be quite as good as not having the

bias in the first place.

12. The bias would be eliminated by basing editorial decisions solely on the basis

of a paper's introduction and method section. However, there are two problems

with this approach. First, as pointed out by Lane and Dunlap (1978), an experiment

based on an unconventional theoretical perspective would not be very interesting if

the data contradicted the theory. Second, studies with conclusive results will (and

should) be preferred to studies with inconclusive results. Consider an editor who,

due to limited (and expensive) journal space, can only accept one of two papers

being considered. Both papers address equally-important topics using equal ly

rigorous methods. Paper 1 seeks to determine the relative effectiveness Conditions

A and B while Paper 2 seeks to determine the relative effectiveness of Conditions

C and D. In Paper 1, Condition A is significantly better than Condition B,

allowing the conclusion that, in the population, Condition A is better than

Condition B. In paper 2, Condition C is better, but not significantly better, than

Condition D. Paper 2, therefore, is unable to conclude which condition is better in

the population. The editor is faced with choosing between a conclusive experiment

and an inconclusive one. There seems little doubt that the conclusive experiment

should have a higher priority.

13. Although only one would be accepted, both of the above papers are worthy of

publication in the sense that they contain contributions to the field. A scientist

doing a meta-analysis of the difference between Conditions C and D would

certainly be interested in the results of Paper 2 even though the results of that paper

do not stand on their own. Moreover, if other experiments are conducted

comparing Conditions C and D and only the significant ones are published, the true

difference between these conditions will be vastly overestimated (Hedges, 1984;

Lane & Dunlap, 1978). Nonetheless, given the premium on journal space, the first

paper would have priority over the second. Editorial decision making must involve

judging the contribution of one paper relative to the contributions of other papers

vying for publication. Since a major aspect of a paper's contribution is the

conclusiveness of its results, statistical significance necessarily plays a critical role

in the decision process.

V. ELECTRONIC PUBLISHING AND SIGNIFICANCE TESTING

14. A recent article in Science reports on the huge explosion in electronic

publishing in the physical sciences (Taubes, 1996): At the end of 1995, over 100

peer-reviewed science journals were available over the internet. Some of these

journals use electronic publishing as a supplement to their regular paper

publication. However, an increasing number of journals are becoming strictly on-

line publications. For example, the psychology journal Psycoloquy has been

around for several years and is only available in electronic form.

A critical difference between electronic and paper publications is that the

marginal cost of an electronically-published article is negligible. It is known from

microeconomic theory that a firm should continue to increase production as long as

the marginal revenue is greater than the marginal cost. In the present context, this

means that a paper should be published as long as an article makes a positive

contribution to the field. Therefore, unlike the present publication system where

the contributions of papers are judged relative to contributions of other papers,

electronic publishing allows papers to be judged on their own merit. A well-

designed study producing inconclusive results makes a positive contribution to the

field and should be published. Since the cost of making this information available

to the research community is negligible, it is hard to justify keeping the

information from being disseminated.

15. The policy of ignoring the outcome of significance tests would be of great

benefit. Researchers could use significance tests as a short hand for whether a

confidence interval contains zero. However, they would be encouraged to refer to

these as "conclusiveness" tests thus avoiding two potential misuses of significance

testing: (a) using significance testing as a measure of effect size and (b) accepting

the null hypothesis when it is not rejected.

16. Naturally, researchers interested in estimating effect size would find the

elimination of significance testing as a criterion for publication highly desirable.

Schmidt's concern that significance testing is hindering the accumulation of

scientific knowledge would be addressed.

17. There are a number of other benefits of on-line publications. One important

benefit is speed of publication. By eliminating the production phase, it is possible

to go from submission to publication in a manner of weeks rather than months.

Another advantage is the ability to search the journal using key words and phrases.

It is also possible for articles to provide links to related articles or data throughout

the text or in the reference section. Authors could provide a link to the raw data

used in the study. This would allow scientists to double check the work and

replicate the analyses.

18. Several objections could be raised to the proposition that journals should be

published electronically and that these electronic journals should not use

significance testing as a criterion for publication. One objection is based on what

Greenwald (1975) calls the cultural truism that "... incompetence is more likely to

lead to erroneous nonsignificant, 'negative,' or null results." (p. 2). In refuting

this "cultural truism," Greenwald acknowledges that incompetence can have the

effect of introducing noise into the data. For example, an incompetent

experimenter could increase error variance by making random errors in data

transcribing, by running the experiment in an environment with distracting noise,

or by inaccurately placing electrodes. However, as Greenwald points out, o ther

more common types of incompetence result in systematic errors and thus a

tendency to falsely reject the null hypothesis. Examples include demand

characteristics, nonrandom sampling, invalid or contaminated manipulations, and

apparatus malfunctions.

19. A second objection to eliminating significance testing as a criterion for

publication is that so many studies would be published that there would be an

information overload. We believe a policy that rejects valid studies simply

because their publication would make it more difficult for researchers to stay

current would be a mistake. Although publishing more papers would certainly

require some adjustments such as increasing the number of articles that are

literature reviews and/or meta-analyses, we believe the problems would be

relatively minor.

20. The single largest barrier to electronic publishing is probably the attitude of

the academic community itself. It is clear that articles published in new on-line

journals would not carry the same weight as those published in more established

print journals. It is probably a matter of time, however, before attitudes towards

on-line publications change and the major journals are published electronically.

Electronic publishing will need to maintain a high level of quality control and

editorial oversight. The major difference is that the quality control will be focused

more on the theoretical justification and experimental methods and less on the

outcome of significance tests.

VI. CONCLUSION

21. All indications are that the debate over significance testing will continue for

some time. Our position is that until all meritorious studies can be publis hed, the

present system for deciding which studies are more worthy of publication is

necessary. Significance testing allows one to make statements about the

conclusiveness of results and, therefore, in spite of the adverse consequences of

doing so, significance testing should continue to be used as an important criterion

in editorial decision making.

22. We propose that electronic publishing may provide the answer to the current

dilemma. By decreasing the marginal costs of publication, practically all

theoretically sound and well-designed studies can be published without regard to

the statistical significance of the results. This should decrease the emphasis on

null-hypothesis testing as well as increase the validity of meta-analyses. Given the

current rate of growth on the internet, this vision may soon be practical.

ACKNOWLEDGMENT

An earlier version of this paper was presented at the 12th Annual

Conference of the Society for Industrial and Organizational Psychology, April,

1997. St. Louis, MO. We thank Larry James for his helpful comments on an

earlier version of this manuscript.

REFERENCES

Bakan, D. (1966). The test of significance in psychological research.

Psychological Bulletin, 66, 432-437.

Chow, S.L. (1988). Significance test or effect size? Psychological Bulletin,

103, 105-110.

Chow, S.L. (1996). Statistical significance: Rationale, validity, and utility.

Thousand Oaks, CA: Sage.

Cohen, J. (1962). The statistical power of abnormal-social psychological

research: A review. Journal of Abnormal and Social Psychology, 65, 145-153 .

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49,

997-1003.

Frick, R.A. (1996). The appropriate use of null hypothesis testing.

Psychological Methods, 1, 379-390.

Greenwald, A.G. (1975). Consequences of prejudice against the null

hypothesis. Psychological Bulletin, 82, 1-20.

Hedges, L.V. (1984). Estimation of effect size under nonrandom sampling:

The effects of censoring studies yielding statistically insignificant mean

differences. Journal of Educational Statistics, 9, 61-85.

Lane, D.M. & Dunlap, W.P (1978). Estimating effect size: Bias resulting

from the significance criterion in editorial decisions. British Journal of

Mathematical and Statistical Psychology, 31, 107-112.

Rosenthal, R. (1979). The "file drawer problem" and tolerance for null

results. Psychological Bulletin, 86, 638-641.

Schmidt, F.L. (1996). Statistical significance testing and cumulative

knowledge in psychology: Implications for training researchers. Psychological

Methods, 1, 115-129.

Taubes, G. (1996). Science journals go wired. Science, 271, 764-766.

17

----------