UWA Psychology: MRC Psycholinguistic Database (Reference)

MRC Psycholinguistic Database: Machine Usable Dictionary. Version 2.00

Informatics Division
Science and Engineering Research Council
Rutherford Appleton Laboratory
Chilton, Didcot, Oxon, OX11 0QX

Michael Wilson
1 April 1987

MRC Machine Usable Dictionary. Version 2.00

Version 1 of the MRC Psycholinguistic Database was provided as an on-line service (see Coltheart, 1981b). The service drew on three files and several access programs. The first file was a dictionary of words; the second and third files were sets of word association norms from the Edinburgh Thesaurus. The service has now been discontinued.

This second version of the MRC Psycholinguistic Database is being provided as a computer usable resource rather than as a service. An updated version of the dictionary file from the database (referred to here as MRC2.DCT) is being provided for public research use along with some programs which can be used either to access the dictionary, or as examples on which to model programs which match users' specific needs. This database dictionary differs from other machine usable dictionaries in that it includes not only syntactic information but also psychological data for the entries (see Amsler, 1984 for a review of other machine-readable dictionaries). It also differs from most conventional dictionaries in that it does not currently attempt to provide any semantic information. It is designed to be of use to psycholinguists in selecting stimulus materials for testing; for use by researchers in Artificial Intelligence as a source of information required for natural language processing and cognitive simulation; and for use by computer scientists who wish to use the word lists and syntactic information in the design of text processors.

The MRC Psycholinguistic Database: Machine Usable Dictionary and utility programs are available for research purposes from the Oxford Text Archive as item 1054 on their list at a nominal fee to cover handling costs. Their address is:

The Machine Usable Dictionary File.

The file contains 150837 words and provides information about 26 different linguistic properties, although it is not the case that information about every property is available for every one of the 150837 words: nobody, for example, has yet collected imagery ratings on such a large set of words, and thus only 9240 of the words possess an imagery rating.

The dictionary file does not contain any information which is original to it, but was assembled by merging a number of smaller databases of limited availability:

The dictionary file currently occupies 11 Mbyte as a sequential plain text file. Each line of the file holds the entry for one word. The longest entry is 130 characters. An example entry is:

040320021615167000000093057530228435500000 JJ SABLE|eI/bl|eIbl|20

The composition of the dictionary file is summarised in Table 1, which specifies the linguistic properties described in an entry. The first column of Table 1 indicates the columns/field in the file containing the data. The last four properties are held in variable length fields separated by a | character. The second column indicates the name of the data field used elsewhere in programs and documentation. The third column specifies the identity of the linguistic property, and the fourth column indicates the number of words in the database for which information about a particular linguistic property is available. The first fourteen properties are stored in the file as numerical values. For these properties, the occurrence count refers to the number of non zero entries.

COLUMN	NAME	PROPERTY	OCCURRENCES
1-2	NLET	Number of letters in the word	150837
3-4	NPHON	Number of phonemes in the word	38438
5	NSYL	Number of syllables in the word	89402
6-10	K-F-FREQ	Kucera and Francis written frequency	29778
11-12	K-F-NCATS	Kucera and Francis number of categories	29778
13-15	K-F-NSAMP	Kucera and Francis number of samples	29778
16-21	T-L-FREQ	Thorndike-Lorge frequency	25308
22-25	BROWN-FREQ	Brown verbal frequency	14529
26-28	FAM	Familiarity	9392
29-31	CONC	Concreteness	8228
32-34	IMAG	Imagery	9240
35-37	MEANC	Mean Colorado Meaningfulness	5450
38-40	MEANP	Mean Paivio Meaningfulness	1504
41-43	AOA	Age of Acquisition	3503
44	TQ2	Type	44976
45	WTYPE	Part of Speech	150769
46	PDWTYPE	PD Part of Speech	38390
47	ALPHSYL	Alphasyllable	15938
48	STATUS	Status	89550
49	VAR	Variant Phoneme	1445
50	CAP	Written Capitalised	4585
51	IRREG	Irregular Plural	23111
|	WORD	the actual word	150837
|	PHON	Phonetic Transcription	38420
|	DPHON	Edited Phonetic Transcription	136982
|	STRESS	Stress Pattern	38390
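
As a minimal sketch of how the layout in Table 1 could be read, the Python function below slices one line into its fixed-width fields and splits the remainder on the | character. The function name `parse_entry`, the dictionary keys and the sample line are illustrative assumptions and are not part of the distributed utility programs. The sample line is synthetic (a made-up nonword with made-up values), and the sketch assumes that the numerical fields are zero-padded digits.

```python
# Fixed-width layout from Table 1: (name, first column, last column), 1-based inclusive.
FIXED_FIELDS = [
    ("NLET", 1, 2), ("NPHON", 3, 4), ("NSYL", 5, 5),
    ("K-F-FREQ", 6, 10), ("K-F-NCATS", 11, 12), ("K-F-NSAMP", 13, 15),
    ("T-L-FREQ", 16, 21), ("BROWN-FREQ", 22, 25), ("FAM", 26, 28),
    ("CONC", 29, 31), ("IMAG", 32, 34), ("MEANC", 35, 37),
    ("MEANP", 38, 40), ("AOA", 41, 43), ("TQ2", 44, 44),
    ("WTYPE", 45, 45), ("PDWTYPE", 46, 46), ("ALPHSYL", 47, 47),
    ("STATUS", 48, 48), ("VAR", 49, 49), ("CAP", 50, 50), ("IRREG", 51, 51),
]
# The first fourteen properties are stored as numerical values.
NUMERIC_FIELDS = {name for name, _, _ in FIXED_FIELDS[:14]}
# The last four properties are variable length, separated by '|'.
VARIABLE_FIELDS = ["WORD", "PHON", "DPHON", "STRESS"]


def parse_entry(line):
    """Split one dictionary line into a dict keyed by the names in Table 1."""
    line = line.rstrip("\n")
    entry = {}
    for name, first, last in FIXED_FIELDS:
        raw = line[first - 1:last]
        if name in NUMERIC_FIELDS:
            entry[name] = int(raw.strip() or 0)  # assumption: blank counts as 0
        else:
            entry[name] = raw.strip()            # single-character code or ''
    for name, value in zip(VARIABLE_FIELDS, line[51:].split("|")):
        entry[name] = value
    return entry


# Synthetic line for a made-up nonword; none of these values come from the file.
SAMPLE = (
    "04" "04" "1" "00000" "00" "000" "000000" "0000"
    "512" "000" "000" "000" "000" "000"
    " " "N" "N" " " "S" " " " " " "
    "BLIM|blIm|blIm|2"
)

entry = parse_entry(SAMPLE)
print(entry["WORD"], entry["NLET"], entry["FAM"], entry["WTYPE"], entry["STATUS"])
```

Keeping the column positions in one table makes it easy to compare the code against Table 1 if the layout needs checking.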

Some of the properties listed in Table 1 are obvious; others require explanation as follows:

NLET is the number of letters in the word. The distribution of entries in the WORD field by the number of letters that they contain is shown in Table 2.

NUMBER OF OCCURRENCES	NLET
31	1
168	2
1342	3
4719	4
10199	5
16818	6
21118	7
22302	8
20426	9
16409	10
11697	11
7566	12
4451	13
2342	14
1158	15
479	16
250	17
81	18
32	19
14	20
4	21
1	22
2	23

NPHON is the number of phonemes in the word. The distribution of entries in the WORD field by the number of phonemes that they contain is shown in Table 3.

NSYL is the number of syllables in the word. The distribution of entries in the WORD field by the number of syllables that they contain is shown in Table 4.

K-F-FREQ is a word's frequency of occurrence as given in the norms of Kucera and Francis (1967). The maximum frequency in the file is 69971 and the minimum is 0. The meanings of K-F-NCATS and K-F-NSAMP are defined by Kucera and Francis (1967).

T-L-FREQ is the frequency of occurrence as given in the L count of Thorndike and Lorge (1944). If you plan to use this frequency count, you are advised to read the details about it in the Thorndike-Lorge book. For example, the frequency value of a singular word which has a regular plural includes the frequency of the plural form, and the same is true for other kinds of derivation.

BROWN-FREQ is the frequency of occurrence in verbal language, derived from the London-Lund Corpus of English Conversation by Brown (1984). There are 14529 entries for 8985 different strings in the WORD field. The range of entries is 0 - 6833 with a mean of 35 and a standard deviation of 252.

FAM stands for 'printed familiarity'. The FAM values were derived from merging three sets of familiarity norms: Paivio (unpublished), Toglia and Battig (1978) and Gilhooly and Logie (1980). The method by which these three sets of norms were merged is described in detail in Appendix 2 of the MRC Psycholinguistic Database User Manual (Coltheart, 1981a). This method may not meet with everyone's approval. FAM values lie in the range 100 to 700, with a maximum entry of 657, a mean of 488 and a standard deviation of 99. Note that they are integer values (in the original norms the equivalent range was 1.00 to 7.00).
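
The relationship between the stored integers and the original rating scale can be illustrated with a short Python sketch. The helper name `to_original_scale` is hypothetical, and treating a stored 0 as "no rating available" is an assumption based on the statement that, for the numerical properties, the occurrence count refers to the number of non zero entries.

```python
def to_original_scale(stored):
    """Convert a stored integer rating (e.g. 488) to the original scale (4.88)."""
    if stored == 0:
        return None  # assumed to mean that no rating was collected
    return stored / 100


print(to_original_scale(657))  # the maximum FAM entry quoted above
print(to_original_scale(0))    # an entry without a rating
```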

CONC is concreteness, and it too is derived from a merging of the Paivio, Colorado, and Gilhooly-Logie norms: details of the merging are given in Appendix 2 of the MRC Psycholinguistic Database User Manual (Coltheart, 1981a). CONC values are integers in the range 100 to 700 (min 158; max 670; mean 438; s.d. 120).

IMAG is imageability, derived from merging the three sets of norms referred to above, with values in the range 100 to 700 (min 129; max 669; mean 450; s.d. 108).

MEANC holds the meaningfulness ratings from the Colorado norms of Toglia and Battig (1978), multiplied by 100 to produce a range from 100 to 700 (min 127; max 667; mean 415; s.d. 78).

MEANP is the meaningfulness rating from the norms of Paivio (unpublished), multiplied by 100 to produce a range from 100 to 700. The two sets of meaningfulness ratings were not merged because their correlations were low (only +.529) and the mean values for a set of words common to the two sets of norms were very low (see Toglia and Battig, 1978, Table 2).

These differences are due to differences in the instructions given to subjects. Thus the two sets of meaningfulness ratings are not comparable, and so were kept separate (min 192; max 922; mean 600; s.d. 107).

AOA is age of acquisition from the norms of Gilhooly and Logie (1980), multiplied by 100 to produce a range from 100 to 700 (min 125; max 697; mean 405; s.d. 120).
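
Since the rating fields are intended to help in selecting stimulus materials, the following Python sketch shows one way a set of parsed entries might be filtered by a rating range. The function name `select_stimuli`, the dictionary representation of an entry and the toy entries are all assumptions made for illustration; the values are invented and do not come from the dictionary. A stored value of 0 is assumed to mean that no rating is available.

```python
def select_stimuli(entries, field, low, high):
    """Return the WORD of each entry whose `field` rating lies in [low, high].

    Entries with a stored value of 0 are skipped (assumed to have no rating).
    """
    return [
        entry["WORD"]
        for entry in entries
        if entry[field] != 0 and low <= entry[field] <= high
    ]


# Made-up entries for illustration only; these are not values from the dictionary.
toy_entries = [
    {"WORD": "WORD_A", "IMAG": 620, "AOA": 210},
    {"WORD": "WORD_B", "IMAG": 310, "AOA": 540},
    {"WORD": "WORD_C", "IMAG": 0, "AOA": 0},
]

print(select_stimuli(toy_entries, "IMAG", 500, 700))  # ['WORD_A']
print(select_stimuli(toy_entries, "AOA", 100, 600))   # ['WORD_A', 'WORD_B']
```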

When TQ2 has the value Q (40810 occurrences), this word is a derivational variant of another.

WTYPE is the syntactic category as represented in the SOED database assembled by Dolby, Resnikoff and MacMurray (1963). There are ten different syntactic categories, coded as shown in Table 5.

SYNTACTIC CATEGORY	CODE	OCCURRENCES
Noun	N	77355
Adjective	J	25547
Verb	V	30725
Adverb	A	4243
Preposition	R	230
Conjunction	C	108
Pronoun	U	134
Interjection	I	352
Past Participle	P	5939
Other	O	6136

When you are interested in syntactic category, WTYPE can sometimes be unsatisfactory. For example, the words FREEZE and HARASS are Nouns according to WTYPE (as well as verbs); and indeed when these are looked up in SOED or Webster's, they are described as nouns. If you want to avoid such esoteric usages, PDWTYPE may be useful. It refers to the syntactic categories given in Jones' Pronouncing Dictionary (Jones, 1963), and very unusual uses of words are not considered. However PDWTYPE uses only four categories, not ten: these four are noun (N, 22061 occurrences), verb (V, 6333 occurrences), adjective (J, 8817 occurrences) and other (O, 1179 occurrences). The mapping from WTYPE to PDWTYPE is shown in Table 6.

If ALPHSYL = A, then the word is an abbreviation (130 occurrences); if S, the word is a suffix (282 occurrences); if P, a prefix (1374 occurrences); if H, the word is hyphenated (13716 occurrences); if T, a multi-word phrasal unit (436 occurrences). For all of these categories, NSYL = 0. For all other words ALPHSYL is blank.
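
The ALPHSYL codes lend themselves to a simple lookup table. The Python sketch below is illustrative only: the names `ALPHSYL_CODES` and `describe_alphsyl` are hypothetical, and the label "ordinary word" for a blank code is wording chosen here, not terminology from the database. The final line checks that the occurrence counts quoted above add up to the 15938 ALPHSYL entries reported in Table 1.

```python
# Code -> (description, occurrences), as quoted in the text above.
ALPHSYL_CODES = {
    "A": ("abbreviation", 130),
    "S": ("suffix", 282),
    "P": ("prefix", 1374),
    "H": ("hyphenated word", 13716),
    "T": ("multi-word phrasal unit", 436),
}


def describe_alphsyl(code):
    """Return a description for an ALPHSYL code; a blank code has no special type."""
    code = code.strip()
    if code in ALPHSYL_CODES:
        return ALPHSYL_CODES[code][0]
    return "ordinary word"  # label chosen for this sketch


print(describe_alphsyl("H"))
print(describe_alphsyl(" "))
print(sum(count for _, count in ALPHSYL_CODES.values()))  # 15938
```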

The 15 possible categories of STATUS are listed in Table 7; these are as given in the Dolby database (Dolby et al., 1963) derived from the Shorter Oxford English Dictionary, and perusal of Table 7 should make the meanings of these categories sufficiently clear.

STATUS OF WORD	CODE	OCCURRENCES
Dialect	D	2780
Alien	F	6003
Archaic	A	959
Colloquial	Q	405
Capital	C	2
Erroneous	N	0
Nonsense	E	62
Nonce Word	W	33
Obsolete	O	10549
Poetical	P	183
Rare	R	2756
Rhetorical	H	22
Specialised	$	7731
Standard	S	58065
Substandard	Z	0

VAR marks words which have the same spelling but different pronunciations and syntactic classes. When the pronunciations differ only in respect of stress (e.g. object, insult), VAR = O (212 occurrences). When the pronunciations differ phonemically (e.g. moderate, abuse), VAR = B (1233 occurrences).

If CAP = C, then the word is normally written with an initial capital letter. This can be used as an indicator of proper nouns such as the names of people, towns, states and countries.

IRREG refers to the plurality of words. Where IRREG = Z, the word is plural (17441 occurrences); this can be used in conjunction with TQ2 to select irregular forms. Where IRREG = Y, the word is a singular form (1024 occurrences); where IRREG = B, the word is both the singular and the plural form (151 occurrences); where IRREG = N, the word has no plural form (4407 occurrences); where IRREG = P, the word is plural but acts singular (88 occurrences).

The dictionary is ordered by the ASCII sequence of the strings in the WORD field. Although there are 150837 entries in the dictionary, there are only 115331 different strings. The distribution of homographs is given in the table of NUMBER OF ENTRIES against NUMBER OF WORDS.

The 12th edition of Daniel Jones's Pronouncing Dictionary (Jones, 1963) was transferred to magnetic tape by Professor L. Guierre (Guierre, 1966). This tape is the basis of the phonetic transcriptions in the PHON field. The phonetic symbols used on the tape were adjusted following suggestions from Roger Mitton (see Mitton, 1986) to conform to the U.K. Alvey standard for machine readable phonetic transcription (Wells, 1986). The changes made by Quinlan (1986) to the phonetic symbols used in Coltheart (1981a) include: devoiced consonants have been folded into their voiced equivalents; the symbol 3 referred to by Coltheart (1981a) has been dropped, as no occurrence of it could be found; and I( and U( have been mapped into I and U respectively.

The symbols currently used in the PHON field are a '/' character to denote syllable boundaries and those presented in Table 8 with, where printable, the International Phonetic Alphabet equivalents. The DPHON field uses these symbols without the syllable boundary marker, but with the inclusion of the TQ2 symbols following the phonetic transcription. DPHON also includes the following three characters: - + R. The hyphen represents the hyphen in hyphenated spellings. The 'R' character represents a final R in the first part of a hyphenated word, which is pronounced only if the second part of the word begins with a vowel. The '+' sign indicates the division between the two parts of a compound noun written without a space (indicated by ALPHSYL = T) or hyphenation (indicated by ALPHSYL = H).

CONSONANTS			VOWELS
IPA PHONETIC SYMBOL	EXAMPLE	DATABASE PHONETIC SYMBOL	IPA PHONETIC SYMBOL	EXAMPLE	DATABASE PHONETIC SYMBOL
p	put	p	i:	bean	i
b	but	b	a:	barn	A
t	ten	t	ɔ:	born	O (oh)
d	den	d	u:	boon	u
k	can	k	ɜ:	burn	3
m	man	m	i	pit	I
n	not	n	ɛ	pet	e
l	like	l	æ	pat	&
r	run	r	ʌ	putt	V
f	full	f	o	pot	0 (zero)
v	very	v	ʊ	good	U
s	some	s	ə	about	@
z	zeal	z	ei	bay	eI
h	hat	h	ai	buy	aI
w	went	w	ɔi	boy	oI (oh)
g	game	g	oʊ	no	@U
tʃ	chain	tS	aʊ	now	aU
dʒ	Jane	dZ	iə	peer	I@
ŋ	long	9	ɛə	pair	e@
θ	thin	T	ʊə	poor	u@
ð	then	D
ʃ	ship	S
ʒ	measure	Z
j	yes	j
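
To illustrate how a PHON string might be broken down using the symbols in Table 8, the Python sketch below splits a transcription into syllables on the '/' character and then into individual symbols, trying the two-character database symbols first. The function names and the greedy matching strategy are assumptions made for this sketch and are not taken from the distributed programs. Greedy matching can misread a sequence such as t followed by S as the single symbol tS, so the result should be treated as approximate. The extra DPHON characters (-, + and R) are not handled.

```python
# Two-character database symbols listed in Table 8 (affricates and diphthongs).
TWO_CHAR_SYMBOLS = {"tS", "dZ", "eI", "aI", "oI", "@U", "aU", "I@", "e@", "u@"}


def split_syllables(phon):
    """Split a PHON string on the '/' syllable boundary marker."""
    return phon.split("/")


def tokenize(transcription):
    """Greedy left-to-right split into symbols, preferring two-character symbols."""
    symbols = []
    i = 0
    while i < len(transcription):
        pair = transcription[i:i + 2]
        if pair in TWO_CHAR_SYMBOLS:
            symbols.append(pair)
            i += 2
        else:
            symbols.append(transcription[i])
            i += 1
    return symbols


phon = "eI/bl"  # PHON value from the example entry given earlier
syllables = split_syllables(phon)
print(syllables)                          # ['eI', 'bl']
print([tokenize(s) for s in syllables])   # [['eI'], ['b', 'l']]
```

Tokenizing one syllable at a time, as in the last line, prevents a two-character symbol from being matched across a syllable boundary.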

References

Amsler, R.A. (1984). Machine-Readable Dictionaries. In M.E. Williams (Ed.), Annual Review of Information Science and Technology (ARIST), 19, 161-209. American Society for Information Science (ASIS); Knowledge Industry Publications, Inc.

Brown, G.D.A. (1984). A frequency count of 190,000 words in the London-Lund Corpus of English Conversation. Behavior Research Methods, Instruments, & Computers, 16 (6), 502-532.

Coltheart, M. (1981a). MRC Psycholinguistic Database User Manual: Version 1. [This is a now hard-to-find "in house" production. Mike Wilson has kindly provided an OCR transcript online.]

Coltheart, M. (1981b). The MRC Psycholinguistic Database. Quarterly Journal of Experimental Psychology, 33A, 497-505.

Dolby, J.L., Resnikoff, H.L. and MacMurray, F.L. (1963). A tape dictionary for linguistic experiments. In Proceedings of the American Federation of Information Processing Societies: Fall Joint Computer Conference, Volume 24. Baltimore, MD: Spartan Books. 419-23.

Gilhooly, K.J. and Logie, R.H. (1980). Age of acquisition, imagery, concreteness, familiarity and ambiguity measures for 1944 words. Behavior Research Methods & Instrumentation, 12, 395-427.

Guierre, L. (1966). Un codage des mots anglais en vue de l'analyse automatique de leur structure phonetique. Etudes de linguistique appliquee, 4, 48-64.

Kiss, G.R., Armstrong, C., Milroy, R. and Piper, J. (1973). An associative thesaurus of English and its computer analysis. In Aitken, A.J., Bailey, R.W., and Hamilton-Smith, N. (Eds.), The Computer and Literary Studies. Edinburgh: University Press.

Kucera, H. and Francis, W.N. (1967). Computational Analysis of Present-Day American English. Providence: Brown University Press.

Mitton, R. (1986). A description of the files CUVOALD.DAT and CUV2.DAT. The machine usable form of the Oxford Advanced Learner's Dictionary. The Oxford Text Archive: Oxford, U.K.

Paivio, A., Yuille, J.C. and Madigan, S.A. (1968). Concreteness, imagery and meaningfulness values for 925 words. Journal of Experimental Psychology Monograph Supplement, 76 (3, part 2).

Quinlan, P. (1986). Description of machine-readable dictionary files. Report. Dept. of Psychology, Birkbeck College, London.

Svartvik, J. and Quirk, R. (1980). A Corpus of English Conversation. Lund: Gleerup.

Thorndike, E.L. and Lorge, I. (1944). The Teacher's Word Book of 30,000 Words. New York: Teachers College, Columbia University.

Toglia, M.P. and Battig, W.F. (1978). Handbook of Semantic Word Norms. New York: Erlbaum.

Wells, J.C. (1986). A standardised machine-readable phonetic notation. In Proceedings of the IEE conference on speech input/output: techniques and applications. London, Easter 1986.

Contents of file mrcs2.doc (distributed with MRC Psycholinguistic Database)
Edited/Hyperized March 6 1997, Craig Clark, UWA Psychology

Table 4. Distribution of entries by number of syllables (NSYL)
NUMBER OF OCCURRENCES	NSYL
58081	0
12485	1
32837	2
27751	3
14159	4
4530	5
856	6
134	7
14	8
1	9

Distribution of homographs (number of entries sharing the same string in the WORD field)
NUMBER OF ENTRIES	NUMBER OF WORDS
1	94225
2	22132
3	2967
4	703
5	96
6	20
7	5
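
A distribution of this kind can be computed from the WORD column with two counting passes, as in the Python sketch below. The function name `homograph_distribution` and the toy word list are assumptions for illustration. The last lines check the table above for internal consistency: multiplying each number of entries by its number of words and summing gives the total number of entries in the dictionary.

```python
from collections import Counter


def homograph_distribution(words):
    """Map 'entries sharing one string' -> 'number of distinct strings'."""
    entries_per_string = Counter(words)
    return Counter(entries_per_string.values())


# Toy WORD column, made up for illustration (not taken from the dictionary).
toy_words = ["A", "A", "B", "C", "C", "C", "D"]
print(sorted(homograph_distribution(toy_words).items()))  # [(1, 2), (2, 1), (3, 1)]

# Rows of the homograph table: number of entries -> number of words.
HOMOGRAPH_TABLE = {1: 94225, 2: 22132, 3: 2967, 4: 703, 5: 96, 6: 20, 7: 5}
total_entries = sum(n_entries * n_words for n_entries, n_words in HOMOGRAPH_TABLE.items())
print(total_entries)  # 150837
```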



Table 3. Distribution of entries by number of phonemes (NPHON)
NUMBER OF OCCURRENCES	NPHON
109060	0
32	1
276	2
1442	3
3396	4
4561	5
4985	6
4691	7
4199	8
3317	9
2429	10
1536	11
862	12
450	13
206	14
110	15
42	16
9	17
3	18
3	19

Table 6. Mapping from WTYPE to PDWTYPE (a blank PDWTYPE means that no PDWTYPE is recorded)
OCCURRENCES	WTYPE	PDWTYPE
3751	A
492	A	O
47	C
61	C	O
261	I
91	I	O
16730	J
8817	J	J
55294	N
22061	N	N
5785	O
351	O	O
5939	P
115	R
115	R	O
65	U
69	U	O
24392	V
6333	V	V
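
The mapping table can be cross-checked against the counts given elsewhere in this document. The Python sketch below is illustrative only (the variable names are hypothetical): it reads the WTYPE to PDWTYPE mapping off the rows that have a PDWTYPE, then totals the occurrences per PDWTYPE and per WTYPE so they can be compared with the PDWTYPE counts quoted in the text and with Table 5.

```python
from collections import Counter

# Rows of the table above as (occurrences, WTYPE, PDWTYPE); None marks a blank PDWTYPE.
ROWS = [
    (3751, "A", None), (492, "A", "O"),
    (47, "C", None), (61, "C", "O"),
    (261, "I", None), (91, "I", "O"),
    (16730, "J", None), (8817, "J", "J"),
    (55294, "N", None), (22061, "N", "N"),
    (5785, "O", None), (351, "O", "O"),
    (5939, "P", None),
    (115, "R", None), (115, "R", "O"),
    (65, "U", None), (69, "U", "O"),
    (24392, "V", None), (6333, "V", "V"),
]

# WTYPE -> PDWTYPE, taken from the rows where a PDWTYPE is present.
wtype_to_pdwtype = {wtype: pd for _, wtype, pd in ROWS if pd is not None}

pd_totals = Counter()
wtype_totals = Counter()
for count, wtype, pd in ROWS:
    wtype_totals[wtype] += count
    if pd is not None:
        pd_totals[pd] += count

print(wtype_to_pdwtype)
print(dict(pd_totals))             # N 22061, J 8817, V 6333, O 1179
print(sum(pd_totals.values()))     # 38390, the PDWTYPE count in Table 1
print(sum(wtype_totals.values()))  # 150769, the WTYPE count in Table 1
```

Past participles (WTYPE = P) have no row with a PDWTYPE, so they are absent from the mapping.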
