Psychologist Daryl Bem of Cornell University is just about to publish a parapsychology paper entitled ‘Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect’ in the Journal of Personality and Social Psychology (there is a draft of the article here). Bem’s paper suggests that future events can affect participants’ performance on well-known psychological tasks. The work has received lots of media attention, and several journalists have asked what I think about it.
Bem describes several studies in the paper, but two of them (Studies 8 and 9) have been the centre of much of the attention because (i) Study 9 produced the largest effect size of any of the experiments and (ii) Bem has released the software from these studies to researchers interested in replicating his work (Caroline Watt (Edinburgh University) and I have set up a registry for anyone attempting to do this here).
Stuart Ritchie (Edinburgh University) and I are planning to replicate the study. Yesterday we went over the procedure in detail and I think that the studies contain a potential methodological problem.
The studies were run by student experimenters, with other students acting as participants. The study software presented participants with a list of 48 words (e.g., CAT, SOFA, MUG, DESK), and then asked them to type all of the words that they could remember into the computer. The software then randomly selected half of the words in the original list (e.g., CAT, MUG) and presented them to the participants again. The participant did not see the non-selected words (e.g., SOFA, DESK). Let’s refer to the selected words as the ‘target’ words and the non-selected words as the ‘control’ words. According to Bem’s results, participants were significantly more likely to remember the words in the ‘target’ list than those in the ‘control’ list (i.e., they appeared to be better able to remember those words that they would later see a second time).
The potential problem is in the scoring. The experimenters used a second piece of software to score participants’ responses. Of course, participants may have misspelled remembered words (e.g., typing ‘CTT’ instead of ‘CAT’) or come up with words that were not on the original list (e.g., typing ‘CAR’ instead of ‘CAT’). To deal with this, the scoring software was designed to automatically go through the participant’s responses and to flag up any words that were not absolutely identical to a word in the original list. The experimenter then had to go through these ‘unknown’ words manually, and either correct the spelling or tell the software to ignore them because they did not appear on the original list. To prevent any possibility of unconscious bias, the experimenter should have been doing this blind to the words in the ‘target’ and ‘control’ lists. Unfortunately, this was not the case.
The scoring programme listed the words submitted by the participant in one column. To the right of this were two more columns showing the ‘target’ and ‘control’ lists. Furthermore, when the experimenter made each decision about an ‘unknown’ word they had to change data in the columns containing the ‘target’ and ‘control’ lists. This procedure presented an opportunity for subjective bias to enter the scoring system. For example, if one of the words presented in the original list was ‘CAT’, and the participant typed ‘CAR’, does the experimenter re-code this as ‘CAT’? Or what if the participant typed ‘CTT’ – again, how should this be scored? In making these decisions the experimenter could have been unconsciously biased by whether the word ‘CAT’ appears in the ‘target’ or ‘control’ list.
The good news is that Bem’s programme stores all of the original data, so it should be possible to go through and recode the participants’ responses blind to whether the responses are present in the target or control lists. Until that happens it is problematic to interpret the results from these two studies. In addition, it is important that any replications of these studies don’t duplicate this error.
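To make the idea of blind recoding concrete, here is a minimal sketch in Python. It is not Bem’s scoring software, and every name in it (`blind_recode`, `score`, `example_decide`) is hypothetical. The assumption is simply that the person correcting ‘unknown’ words sees only the response and an alphabetical list of all the original words, and that the ‘target’ and ‘control’ lists are consulted only once every correction has been made.

```python
def blind_recode(responses, word_list, decide):
    """Recode typed responses without exposing target/control membership.

    `decide(word, options)` is a hypothetical stand-in for the human scorer:
    it returns a corrected word from `options`, or None to discard the
    response as an intrusion.
    """
    known = {w.upper() for w in word_list}
    options = sorted(known)  # alphabetical, so order carries no condition info
    recoded = []
    for response in responses:
        word = response.strip().upper()
        if word in known:
            recoded.append(word)  # exact match, no judgement needed
            continue
        choice = decide(word, options)  # scorer never sees the two lists
        if choice is not None and choice in known:
            recoded.append(choice)
    return recoded


def score(recoded, targets, controls):
    """Count recalled target and control words, after recoding is finished."""
    recalled = set(recoded)
    return len(recalled & set(targets)), len(recalled & set(controls))


# Toy example using the words from the post.
word_list = ["CAT", "SOFA", "MUG", "DESK"]
responses = ["CTT", "sofa", "CAR", "MUG"]


def example_decide(word, options):
    # Illustrative rule only: treat 'CTT' as a misspelling of 'CAT' and
    # discard everything else (so 'CAR' counts as an intrusion).
    return {"CTT": "CAT"}.get(word)


recoded = blind_recode(responses, word_list, example_decide)
hits = score(recoded, targets=["CAT", "MUG"], controls=["SOFA", "DESK"])
print(recoded, hits)
```

The point of the design is the order of operations: the decision about ‘CTT’ or ‘CAR’ is made and fixed before anyone looks up which list ‘CAT’ belongs to.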
UPDATE 22/11/10. Daryl Bem replied as follows:
This is a response to Richard (Wiseman’s) concern about the ability of the experimenter to correct misspelled words while being able to observe which corrections will help the psi hypothesis (because the misspelled word is a practice word) or work against the psi hypothesis. This is a legitimate concern and I will modify the database so that the category information is not available to the experimenter when he or she makes spelling corrections.
The program that runs the experiment automatically calculates the results of the session, ignoring all words it doesn’t recognize as literal copies of the test words. This analysis is also transferred to the database, which is set up so that the experimenter cannot change it or any of the original words as typed by the participant. Any changes made by the experimenter in the database are explicitly shown as changes, and a security check flags records in which the experimenter has corrected any of the original words. In other words, there is a complete record of the original data that cannot be altered. As an additional check, the critical data appear in the output file in both unencrypted and encrypted form, and only I know the encryption formula. If anything is changed in the output, the security flag in the database will read “False.”
Any experimenter who wishes can simply ignore the option to correct misspellings. It will make little difference to the results, as the following shows.
My two experiments included 150 participants, who recalled a total of 2,920 words, of which 45 (1.5%) were misspelled. 23 of those were practice words and 22 of those words were non-practice control words, for a net “gain” of one word for the psi hypothesis. Here are the results reported in my article (in which I corrected misspelled words) compared with the original program-calculated results (which ignores all unrecognized words). The score is a Differential Recall% score, which can range from -100% to +100%, with scores > 0 being in the “psi-predicted” direction.
Experiment 8:
Corrected DR% score = 2.27%, t(99) = 1.91, p = .029, d = .19
Uncorrected DR% score = 2.29%, t(99) = 1.95, p = .027, d = .20
Stimulus Seekers: Corrected DR% = 6.46%, t(42) = 3.76, p = .0003, d = .57
Uncorrected DR% = 6.50%, t(42) = 3.91, p = .0002, d = .60
Experiment 9:
Corrected DR% = 4.21%, t(49) = 2.96, p = .002, d = .42
Uncorrected DR% = 4.05%, t(49) = 2.86, p = .003, d = .40
As can be seen, Experiment 8 is trivially helped by the corrections; Experiment 9 is trivially hurt.
Additional observations: Half of the words used in this experiment are common words, as determined by “Frequency Analysis of English Usage” by Francis and Kucera (e.g., apple, doctor) and half are uncommon (e.g., gorilla, rabbi). Although Richard uses CTT and CAT as examples to illustrate the ambiguity of correcting misspellings, in fact only a few different words were misspelled by anyone, and they are among the uncommon words or commonly misspelled words in the list (e.g., potatoe for potato). So, Richard’s hypothetical example notwithstanding, in practice the correction of misspelled words is actually very straightforward and unambiguous. “Intrusions,” words that aren’t in the original list, are also very easy to spot. (I can furnish the list to whoever wants to try a blind correction exercise, but I don’t want to publish it here lest it ruin future participants.)
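For anyone who wants to check the arithmetic in the figures quoted above, here is a small Python sketch. It assumes, and this is an assumption rather than something stated in the reply, that each t value comes from a one-sample t-test of DR% scores against zero, in which case Cohen’s d can be recovered as t divided by the square root of N, with N = df + 1. The numbers are copied from the reply; the variable and function names are illustrative.

```python
import math

# (label, t, degrees of freedom, reported d), as quoted in the reply.
reported = [
    ("Exp 8 corrected", 1.91, 99, 0.19),
    ("Exp 8 uncorrected", 1.95, 99, 0.20),
    ("Exp 8 stimulus seekers corrected", 3.76, 42, 0.57),
    ("Exp 8 stimulus seekers uncorrected", 3.91, 42, 0.60),
    ("Exp 9 corrected", 2.96, 49, 0.42),
    ("Exp 9 uncorrected", 2.86, 49, 0.40),
]


def d_from_t(t, df):
    """Cohen's d for a one-sample t-test: d = t / sqrt(N), where N = df + 1."""
    return t / math.sqrt(df + 1)


# Recompute d from each reported t and compare with the reported d.
estimates = {label: d_from_t(t, df) for label, t, df, _ in reported}
for label, t, df, d in reported:
    print(f"{label}: d from t = {estimates[label]:.3f}, reported d = {d:.2f}")

# Misspelling figures from the reply: 45 of 2,920 recalled words,
# 23 practice (target) words versus 22 control words.
misspelled_pct = 100 * 45 / 2920
net_gain = 23 - 22
print(f"misspelled: {misspelled_pct:.1f}% of recalled words, net gain: {net_gain}")
```

Under that assumption the recomputed effect sizes agree with the reported ones to within rounding of the t values, and the misspelling rate comes out at the stated 1.5%. This only shows that the quoted numbers are internally consistent; it says nothing about whether the corrections themselves were made without bias.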