MRC Psycholinguistic Database: Machine Usable Dictionary. Version 2.00
Informatics Division, Science and Engineering Research Council,
Rutherford Appleton Laboratory, Chilton, Didcot, Oxon, OX11 0QX
Michael Wilson
1 April 1987
MRC Machine Usable Dictionary. Version 2.00
The MRC Psycholinguistic Database version 1 was provided as an
on-line service (see Coltheart, 1981b). The service drew on three
files and several access programs. The first file was a
dictionary of words; the second and third files were sets of word
association norms from the Edinburgh Thesaurus. The service has
now been discontinued.
This second version of the MRC Psycholinguistic Database is being
provided as a computer usable resource rather than as a service.
An updated version of the dictionary file from the database
(referred to here as MRC2.DCT) is being provided for public
research use along with some programs which can be used either to
access the dictionary, or as examples on which to model programs
which match users' specific needs. This database dictionary
differs from other machine usable dictionaries in that it
includes not only syntactic information but also psychological
data for the entries (see Amsler, 1984 for a review of other
machine-readable dictionaries). It also differs from most
conventional dictionaries in that it does not currently attempt
to provide any semantic information. It is designed to be of use
to psycholinguists in selecting stimulus materials for testing;
for use by researchers in Artificial Intelligence as a source of
information required for natural language processing and
cognitive simulation; and for use by computer scientists who wish
to use the word lists and syntactic information in the design of
text processors.
The MRC Psycholinguistic Database: Machine Usable Dictionary and
utility programs are available for research purposes from the
Oxford Text Archive as item 1054 on their list at a nominal fee
to cover handling costs. Their address is:
Oxford Text Archive
Oxford University Computing Service
13 Banbury Road, Oxford OX2 6NN, U.K.
Tel: Oxford (0865) 56721
JANET electronic mail address: [email protected]
The Machine Usable Dictionary File.
The file contains 150837 words and provides information about 26
different linguistic properties, although it is not the case that
information about every property is available for every one of
the 150837 words: nobody, for example, has yet collected imagery
ratings on such a large set of words, and thus only 9240 of the
words possess an imagery rating.
The dictionary file does not contain any information which is
original to it, but was assembled by merging a number of smaller
databases of limited availability:
- the tape dictionary of Dolby, Resnikoff and MacMurray (1963), which was created by taking all the left-justified bold-faced words from the Shorter Oxford English Dictionary together with the parts of speech given by that dictionary. In addition, words were taken from the Cornell University tape of 20,000 commonly used words, with the parts of speech for these words taken from the third edition of Webster's New International Dictionary;
- the Edinburgh Associative Thesaurus (Kiss, Armstrong, Milroy and Piper, 1973);
- the Colorado Norms (Toglia and Battig, 1978);
- the Paivio Norms (unpublished; these are an expansion of the norms of Paivio, Yuille and Madigan, 1968);
- the Gilhooly-Logie norms (Gilhooly and Logie, 1980);
- the Kucera-Francis written frequency count (Kucera and Francis, 1967);
- the Thorndike-Lorge written frequency count (Thorndike and Lorge, 1944; L count);
- the phonetic transcriptions from Daniel Jones' Pronouncing Dictionary of the English Language, 12th Edition (see Guierre, 1966);
- 2500 proper names from the Machine Usable Version of the Oxford Advanced Learner's Dictionary (Mitton, 1986), which were added to the published version of the dictionary and are not covered by the copyright held by the Oxford University Press;
- the frequency count for spoken English from the London-Lund Corpus of English Conversation (Svartvik and Quirk, 1980; Brown, 1984).
The dictionary file currently occupies 11 Mbyte
as a sequential plain text file. Each line of the file represents
the entry for one word. The longest entry is 130 characters. An
example entry is:
040320021615167000000093057530228435500000 JJ
SABLE|eI/bl|eIbl|20
The composition of the dictionary file is summarised in Table 1,
which specifies the linguistic properties described in an entry.
The first column of Table 1 indicates the columns/field in the
file containing the data. The last four properties are held in
variable length fields separated by a | character. The second
column indicates the name of the data field used elsewhere in
programs and documentation. The third column specifies the
identity of the linguistic property, and the fourth column
indicates the number of words in the database for which
information about a particular linguistic property is available.
The first fourteen properties are stored in the file as numerical
values. For these properties, the occurrence count refers to the
number of non zero entries.
Table 1. The Dictionary File.
COLUMN | NAME | PROPERTY | OCCURRENCES |
1-2 | NLET | Number of letters in the word | 150837 |
3-4 | NPHON | Number of phonemes in the word | 38438 |
5 | NSYL | Number of syllables in the word | 89402 |
6-10 | K-F-FREQ | Kucera and Francis written frequency | 29778 |
11-12 | K-F-NCATS | Kucera and Francis number of categories | 29778 |
13-15 | K-F-NSAMP | Kucera and Francis number of samples | 29778 |
16-21 | T-L-FREQ | Thorndike-Lorge frequency | 25308 |
22-25 | BROWN-FREQ | Brown verbal frequency | 14529 |
26-28 | FAM | Familiarity | 9392 |
29-31 | CONC | Concreteness | 8228 |
32-34 | IMAG | Imagery | 9240 |
35-37 | MEANC | Mean Colorado Meaningfulness | 5450 |
38-40 | MEANP | Mean Paivio Meaningfulness | 1504 |
41-43 | AOA | Age of Acquisition | 3503 |
44 | TQ2 | Type | 44976 |
45 | WTYPE | Part of Speech | 150769 |
46 | PDWTYPE | PD Part of Speech | 38390 |
47 | ALPHSYL | Alphasyllable | 15938 |
48 | STATUS | Status | 89550 |
49 | VAR | Variant Phoneme | 1445 |
50 | CAP | Written Capitalised | 4585 |
51 | IRREG | Irregular Plural | 23111 |
(variable) | WORD | The actual word | 150837 |
(variable) | PHON | Phonetic Transcription | 38420 |
(variable) | DPHON | Edited Phonetic Transcription | 136982 |
(variable) | STRESS | Stress Pattern | 38390 |
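The layout in Table 1 lends itself to a small parsing routine. The following Python sketch is illustrative only and is not one of the distributed utility programs. It assumes that the fixed-width codes end at column 51, that the variable-length fields begin at column 52 in the order WORD, PHON, DPHON, STRESS, and that a blank numeric field can be treated as zero. The names `FIXED_FIELDS`, `VARIABLE_FIELDS` and `parse_entry` are hypothetical.

```python
# Fixed-width layout from Table 1: (name, first column, last column),
# with columns 1-indexed and inclusive.
FIXED_FIELDS = [
    ("NLET", 1, 2), ("NPHON", 3, 4), ("NSYL", 5, 5),
    ("K-F-FREQ", 6, 10), ("K-F-NCATS", 11, 12), ("K-F-NSAMP", 13, 15),
    ("T-L-FREQ", 16, 21), ("BROWN-FREQ", 22, 25), ("FAM", 26, 28),
    ("CONC", 29, 31), ("IMAG", 32, 34), ("MEANC", 35, 37),
    ("MEANP", 38, 40), ("AOA", 41, 43), ("TQ2", 44, 44),
    ("WTYPE", 45, 45), ("PDWTYPE", 46, 46), ("ALPHSYL", 47, 47),
    ("STATUS", 48, 48), ("VAR", 49, 49), ("CAP", 50, 50), ("IRREG", 51, 51),
]
NUMERIC_COUNT = 14  # the first fourteen properties are numerical values
# Assumed order of the "|"-separated fields (the order used in Table 1).
VARIABLE_FIELDS = ["WORD", "PHON", "DPHON", "STRESS"]


def parse_entry(line):
    """Split one dictionary line into a dict keyed by the Table 1 names."""
    line = line.rstrip("\r\n")
    entry = {}
    for index, (name, first, last) in enumerate(FIXED_FIELDS):
        raw = line[first - 1:last]
        if index < NUMERIC_COUNT:
            # Assumption: a blank numeric field is treated like zero.
            entry[name] = int(raw) if raw.strip() else 0
        else:
            entry[name] = raw.strip()  # a blank code becomes ""
    # Assumption: the variable-length part starts at column 52.
    parts = line[51:].split("|")
    parts += [""] * (len(VARIABLE_FIELDS) - len(parts))
    for name, value in zip(VARIABLE_FIELDS, parts):
        entry[name] = value
    return entry
```

A whole file could then be read with `[parse_entry(line) for line in open(path)]`, where `path` is wherever the dictionary file has been stored locally.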
Some of the properties listed in Table 1 are obvious; others require explanation as follows:
NLET
The distribution of entries in the WORD field by the number of letters that they contain is shown in Table 2.
Table 2. The Distribution of Word Lengths Given by NLET.
NUMBER OF OCCURRENCES | NLET |
31 | 1 |
168 | 2 |
1342 | 3 |
4719 | 4 |
10199 | 5 |
16818 | 6 |
21118 | 7 |
22302 | 8 |
20426 | 9 |
16409 | 10 |
11697 | 11 |
7566 | 12 |
4451 | 13 |
2342 | 14 |
1158 | 15 |
479 | 16 |
250 | 17 |
81 | 18 |
32 | 19 |
14 | 20 |
4 | 21 |
1 | 22 |
2 | 23 |
NPHON
The distribution of entries in the WORD field by the number of phonemes that they contain is shown in Table 3.
Table 3. The Distribution of Phoneme Counts Given by NPHON.
NUMBER OF OCCURRENCES | NPHON |
109060 | 0 |
32 | 1 |
276 | 2 |
1442 | 3 |
3396 | 4 |
4561 | 5 |
4985 | 6 |
4691 | 7 |
4199 | 8 |
3317 | 9 |
2429 | 10 |
1536 | 11 |
862 | 12 |
450 | 13 |
206 | 14 |
110 | 15 |
42 | 16 |
9 | 17 |
3 | 18 |
3 | 19 |
NSYL
The distribution of entries in the WORD field by the number of syllables that they contain is shown in Table 4.
Table 4. The Distribution of Syllable Counts Given by NSYL.
NUMBER OF OCCURRENCES | NSYL |
58081 | 0 |
12485 | 1 |
32837 | 2 |
27751 | 3 |
14159 | 4 |
4530 | 5 |
856 | 6 |
134 | 7 |
14 | 8 |
1 | 9 |
K-F-FREQ, K-F-NCATS, K-F-NSAMP
The first of these refers to a word's frequency of occurrence as given in the norms of Kucera and Francis (1967). The maximum frequency in the file is 69971, the minimum is 0. The meanings of K-F-NCATS and K-F-NSAMP are defined by Kucera and Francis (1967).
T-L-FREQ
This is the frequency of occurrence as given in the L count of Thorndike and Lorge (1944). If you plan to use this frequency count, you are advised to read details about it in the Thorndike-Lorge book. For example, the frequency value of a singular word which has a regular plural includes the frequency of the plural form, and this is true for other kinds of derivations too.
BROWN-FREQ
This stands for the frequency of occurrence in verbal language derived from the London-Lund Corpus of English Conversation by Brown (1984). There are 14529 entries for 8985 different strings in the WORD field. The range of entries is 0 - 6833 with a mean of 35 and a standard deviation of 252.
FAM
This stands for 'printed familiarity'. The FAM values were derived from merging three sets of familiarity norms: Paivio (unpublished), Toglia and Battig (1978) and Gilhooly and Logie (1980). The method by which these three sets of norms were merged is described in detail in Appendix 2 of the MRC Psycholinguistic Database User Manual (Coltheart, 1981a). This method may not meet with everyone's approval. FAM values lie in the range 100 to 700 with a maximum entry of 657, a mean of 488 and a standard deviation of 99: note that they are integer values (in the original norms the equivalent range was 1.00 to 7.00).
CONC
This is concreteness, and it too is derived from a merging of the Paivio, Colorado, and Gilhooly-Logie norms: details of merging are given in Appendix 2 of the MRC Psycholinguistic Database User Manual (Coltheart, 1981a). CONC values are integers in the range 100 to 700 (min 158; max 670; mean 438; s.d. 120).
IMAG
This is imageability, derived from merging the three sets of norms referred to above, and having values in the range 100 to 700 (min 129; max 669; mean 450; s.d. 108).
MEANC
These are the meaningfulness ratings from the norms of Toglia and Battig (1978), multiplied by 100 to produce a range from 100 to 700 (min 127; max 667; mean 415; s.d. 78).
MEANP
This is the meaningfulness from the norms of Paivio (unpublished), multiplied by 100 to produce a range from 100 to 700. The two sets of meaningfulness ratings were not merged because their correlation was low (only +.529) and the mean values for a set of words common to the two sets of norms were very different (see Toglia and Battig, 1978, Table 2). These differences are due to differences in the instructions to subjects. Thus the two sets of meaningfulness ratings are not comparable, and so were kept separate (min 192; max 922; mean 600; s.d. 107).
AOA
This is age of acquisition from the norms of Gilhooly and Logie (1980), multiplied by 100 to produce a range from 100 to 700 (min 125; max 697; mean 405; s.d. 120).
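Since the rating fields (FAM, CONC, IMAG, MEANC, MEANP and AOA) are stored as integers equal to the original rating multiplied by 100, a value can be mapped back to the original scale by dividing by 100. The Python sketch below shows one way to do this. It assumes that a stored value of 0 means that no rating is available, which is inferred from the statement that occurrence counts refer to non-zero entries. The function name `to_original_scale` is hypothetical.

```python
# Rating fields stored as integers (original rating multiplied by 100).
RATING_FIELDS = ("FAM", "CONC", "IMAG", "MEANC", "MEANP", "AOA")


def to_original_scale(value):
    """Map a stored integer rating back to the scale of the original norms.

    Assumption: 0 marks a word for which no rating was collected,
    so it is returned as None rather than as 0.0.
    """
    if value == 0:
        return None
    return value / 100
```

For instance, the maximum FAM entry of 657 corresponds to 6.57 on the original 1.00 to 7.00 scale.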
TQ2
When TQ2 has the value Q (40810 occurrences), the word is a derivational variant of another.
WTYPE
This is syntactic category as represented in the SOED database assembled by Dolby, Resnikoff and MacMurray (1963). There are ten different syntactic categories, coded as shown in Table 5.
Table 5. Syntactic Category Codes for WTYPE
SYNTACTIC CATEGORY | CODE | OCCURRENCES |
Noun | N | 77355 |
Adjective | J | 25547 |
Verb | V | 30725 |
Adverb | A | 4243 |
Preposition | R | 230 |
Conjunction | C | 108 |
Pronoun | U | 134 |
Interjection | I | 352 |
Past Participle | P | 5939 |
Other | O | 6136 |
PDWTYPE
When you are interested in syntactic category, WTYPE can sometimes be unsatisfactory. For example, the words FREEZE and HARASS are nouns according to WTYPE (as well as verbs); and indeed when these are looked up in SOED or Webster's, they are described as nouns. If you want to avoid such esoteric usages, PDWTYPE may be useful. It refers to the syntactic categories given in Jones' Pronouncing Dictionary (Jones, 1963), and very unusual uses of words are not considered. However, PDWTYPE uses only four categories, not ten: these four are noun (N, 22061 occurrences), verb (V, 6333 occurrences), adjective (J, 8817 occurrences) and other (O, 1179 occurrences). The mapping from WTYPE to PDWTYPE is shown in Table 6.
Table 6. The Mapping from WTYPE to PDWTYPE
OCCURRENCES | WTYPE | PDWTYPE |
3751 | A | |
492 | A | O |
47 | C | |
61 | C | O |
261 | I | |
91 | I | O |
16730 | J | |
8817 | J | J |
55294 | N | |
22061 | N | N |
5785 | O | |
351 | O | O |
5939 | P | |
115 | R | |
115 | R | O |
65 | U | |
69 | U | O |
24392 | V | |
6333 | V | V |
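The counts in Table 6 can be cross-checked against Table 5 and against the PDWTYPE totals quoted in the text. The Python sketch below transcribes both tables and sums the Table 6 rows by WTYPE and by PDWTYPE. The helper name `totals_by` is hypothetical, and an empty string stands for a blank PDWTYPE.

```python
from collections import defaultdict

# Rows of Table 6: (occurrences, WTYPE, PDWTYPE); "" means no PDWTYPE code.
TABLE_6 = [
    (3751, "A", ""), (492, "A", "O"),
    (47, "C", ""), (61, "C", "O"),
    (261, "I", ""), (91, "I", "O"),
    (16730, "J", ""), (8817, "J", "J"),
    (55294, "N", ""), (22061, "N", "N"),
    (5785, "O", ""), (351, "O", "O"),
    (5939, "P", ""),
    (115, "R", ""), (115, "R", "O"),
    (65, "U", ""), (69, "U", "O"),
    (24392, "V", ""), (6333, "V", "V"),
]

# Occurrence counts per WTYPE code, from Table 5.
TABLE_5 = {
    "N": 77355, "J": 25547, "V": 30725, "A": 4243, "R": 230,
    "C": 108, "U": 134, "I": 352, "P": 5939, "O": 6136,
}


def totals_by(rows, index):
    """Sum the occurrence counts, grouped by the code at position `index`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[index]] += row[0]
    return dict(totals)


wtype_totals = totals_by(TABLE_6, 1)    # grouped by WTYPE
pdwtype_totals = totals_by(TABLE_6, 2)  # grouped by PDWTYPE
```

Summed by WTYPE, the rows reproduce the Table 5 counts; summed by PDWTYPE, they give 22061 nouns, 6333 verbs, 8817 adjectives and 1179 others, as stated above.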
ALPHSYL
If this = A, then the word is an abbreviation (130 occurrences); if S, the word is a suffix (282 occurrences); if P, a prefix (1374 occurrences); if H, the word is hyphenated (13716 occurrences); if T, a multi-word phrasal unit (436 occurrences). For all of these categories, NSYL = 0. For all other words ALPHSYL is blank.
STATUS
The 15 possible categories of STATUS are listed in Table 7; these are as given in the Dolby database (Dolby et al., 1963) derived from the Shorter Oxford English Dictionary, and perusal of Table 7 should make the meanings of these categories sufficiently clear.
Table 7. The Possible Values of STATUS
STATUS OF WORD | CODE | OCCURRENCES |
Dialect | D | 2780 |
Alien | F | 6003 |
Archaic | A | 959 |
Colloquial | Q | 405 |
Capital | C | 2 |
Erroneous | N | 0 |
Nonsense | E | 62 |
Nonce Word | W | 33 |
Obsolete | O | 10549 |
Poetical | P | 183 |
Rare | R | 2756 |
Rhetorical | H | 22 |
Specialised | $ | 7731 |
Standard | S | 58065 |
Substandard | Z | 0 |
VAR
This refers to words which have the same spelling but different pronunciation and syntactic classes. When the pronunciations differ only in respect of stress (e.g. object, insult), VAR = O (212 occurrences). When the pronunciations differ phonemically (e.g. moderate, abuse), VAR = B (1233 occurrences).
CAP
If this = C, then the word is normally written with an initial capital letter. This can be used as an indicator of proper nouns such as the names of people, towns, states and countries.
IRREG
This refers to the plurality of words. Where IRREG = Z, the word is plural (17441 occurrences); this can be used in conjunction with TQ2 to select irregular forms. Where IRREG = Y, the word is a singular form (1024 occurrences); where IRREG = B, the word is both the singular and the plural form (151 occurrences); where IRREG = N, the word has no plural form (4407 occurrences); where IRREG = P, the word is plural but acts as a singular (88 occurrences).
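The single-character codes combine naturally with the numeric ratings when selecting stimulus words. The Python sketch below is one possible filter, not part of the distributed programs. It assumes that each entry is a dictionary keyed by the field names of Table 1, and the function name `select_words` and the default thresholds are hypothetical. As unrated words hold 0 in the rating fields, any threshold above 0 also excludes words without a rating.

```python
def select_words(entries, wtype="N", status="S", min_imag=500, min_fam=500):
    """Return the WORD strings of entries that match all four criteria.

    entries  : iterable of dicts keyed by the Table 1 field names
    wtype    : required WTYPE code (Table 5), e.g. "N" for noun
    status   : required STATUS code (Table 7), e.g. "S" for standard
    min_imag : lowest acceptable IMAG value (stored scale, 100 to 700)
    min_fam  : lowest acceptable FAM value (stored scale, 100 to 700)
    """
    return [
        entry["WORD"]
        for entry in entries
        if entry["WTYPE"] == wtype
        and entry["STATUS"] == status
        and entry["IMAG"] >= min_imag
        and entry["FAM"] >= min_fam
    ]
```

Further conditions, such as IRREG = Z for plural forms or CAP = C for capitalised words, could be added in the same way.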
WORD
The dictionary is ordered by the ASCII sequence of these strings. Although there are 150837 entries in the dictionary, there are only 115331 different strings. The distribution of homographs is as follows:
NUMBER OF ENTRIES | NUMBER OF WORDS |
1 | 94225 |
2 | 22132 |
3 | 2967 |
4 | 703 |
5 | 96 |
6 | 20 |
7 | 5 |
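A distribution of this kind can be derived by counting how many entries share each WORD string and then counting how many strings have each entry count. The Python sketch below illustrates the idea on an arbitrary list of strings; the function name `homograph_distribution` is hypothetical, and the sketch is not claimed to be the procedure used to produce the table above.

```python
from collections import Counter


def homograph_distribution(words):
    """Map 'number of entries per string' to 'number of strings'.

    words : iterable of WORD strings, one per dictionary entry
    """
    entries_per_string = Counter(words)           # string -> entry count
    return Counter(entries_per_string.values())   # entry count -> strings


def total_entries(distribution):
    """Recover the total number of entries from a distribution."""
    return sum(count * strings for count, strings in distribution.items())
```

The second function gives a simple consistency check: the total it returns should equal the number of lines read from the file.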
PHON and DPHON
The 12th edition of Daniel Jones's Pronouncing Dictionary (Jones, 1963) was transferred to magnetic tape by Professor L. Guierre (Guierre, 1966). This tape is used as the basis of the phonetic transcriptions in the PHON field. The phonetic symbols used on this tape were adjusted following suggestions from Roger Mitton (see Mitton, 1986) to conform to the U.K. Alvey standard for machine readable phonetic transcription (Wells, 1986). The changes in phonetic symbols used from Coltheart (1981a) made by Quinlan (1986) include: devoiced consonants have been folded into their voiced equivalents; Coltheart (1981a) refers to the symbol 3, which has been dropped as no occurrence could be found; I( and U( have been mapped into I and U respectively.
The symbols currently used in the PHON field are a '/' character to denote syllable boundaries and those presented in Table 8 with, where printable, the International Phonetic Alphabet equivalents. The DPHON field uses these symbols without the syllable distinguisher, but with the inclusion of the TQ2 symbols following the phonetic transcription.
DPHON also includes the following three characters: - + R. The hyphen is used to represent the hyphen in hyphenated spellings. The 'R' character is used to represent a final R in the first part of hyphenated words which is only pronounced if the second part of a hyphenated word begins with a vowel. The '+' sign is used to indicate the division between the two parts of a compound noun written without a space (indicated by ALPHSYL = T) or hyphenation (indicated by ALPHSYL = H).
Table 8. Phonetic Symbols used in the Dictionary
CONSONANTS | | | VOWELS | | |
IPA PHONETIC SYMBOL | EXAMPLE | DATABASE PHONETIC SYMBOL | IPA PHONETIC SYMBOL | EXAMPLE | DATABASE PHONETIC SYMBOL |
p | put | p | i: | bean | i |
b | but | b | a: | barn | A |
t | ten | t | ɔ: | born | O (oh) |
d | den | d | u: | boon | u |
k | can | k | ɜ: | burn | 3 |
m | man | m | i | pit | I |
n | not | n | ɛ | pet | e |
l | like | l | æ | pat | & |
r | run | r | ʌ | putt | V |
f | full | f | o | pot | 0 (zero) |
v | very | v | ʊ | good | U |
s | some | s | ə | about | @ |
z | zeal | z | ei | bay | eI |
h | hat | h | ai | buy | aI |
w | went | w | ɔi | boy | oI (oh) |
g | game | g | oʊ | no | @U |
tʃ | chain | tS | aʊ | now | aU |
dʒ | Jane | dZ | iə | peer | I@ |
ŋ | long | 9 | ɛə | pair | e@ |
θ | thin | T | ʊə | poor | u@ |
ð | then | D | | | |
ʃ | ship | S | | | |
ʒ | measure | Z | | | |
j | yes | j | | | |
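A PHON string can be split into syllables on the '/' character and then into individual database symbols using the inventory of Table 8. The Python sketch below does this with a greedy longest-match rule: at each position a two-character symbol (such as eI or tS) is preferred over a one-character symbol. This rule is an assumption, since the documentation does not say how a sequence such as t followed by S is distinguished from the single symbol tS. The names `split_syllables` and `tokenize` are hypothetical, and the sketch does not handle the extra DPHON characters.

```python
# Database phonetic symbols transcribed from Table 8.
CONSONANTS = ["p", "b", "t", "d", "k", "m", "n", "l", "r", "f", "v", "s",
              "z", "h", "w", "g", "tS", "dZ", "9", "T", "D", "S", "Z", "j"]
VOWELS = ["i", "A", "O", "u", "3", "I", "e", "&", "V", "0", "U", "@",
          "eI", "aI", "oI", "@U", "aU", "I@", "e@", "u@"]
SYMBOLS = set(CONSONANTS) | set(VOWELS)


def split_syllables(phon):
    """Split a PHON transcription into syllables at the '/' marker."""
    return phon.split("/")


def tokenize(transcription):
    """Split a transcription without '/' into database symbols.

    Assumption: two-character symbols take priority over
    one-character symbols (greedy longest match).
    """
    symbols = []
    position = 0
    while position < len(transcription):
        pair = transcription[position:position + 2]
        single = transcription[position]
        if len(pair) == 2 and pair in SYMBOLS:
            symbols.append(pair)
            position += 2
        elif single in SYMBOLS:
            symbols.append(single)
            position += 1
        else:
            raise ValueError("unknown symbol %r at position %d"
                             % (single, position))
    return symbols
```

Applied to each syllable in turn, `[tokenize(s) for s in split_syllables(phon)]` gives the symbols grouped by syllable.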
References
Amsler, R.A. (1984). Machine-Readable
Dictionaries. In M.E. Williams (Ed.), Annual Review of
Information Science and Technology (ARIST), 19, 161-209. American
Society for Information Science (ASIS); Knowledge Industry
Publications, Inc.
Brown, G.D.A. (1984). A frequency count
of 190,000 words in the London-Lund Corpus of English
Conversation. Behavior Research Methods, Instruments, &
Computers, 16 (6), 502-532.
Coltheart, M. (1981a). MRC
Psycholinguistic Database User Manual: Version 1. [This is a now hard-to-find
"in house" production. Mike Wilson has kindly provided
an OCR transcript online.]
Coltheart, M. (1981b). The MRC
Psycholinguistic Database. Quarterly Journal of Experimental
Psychology, 33A, 497-505.
Dolby, J.L., Resnikoff, H.L. and MacMurray,
F.L. (1963). A tape dictionary for linguistic experiments.
In Proceedings of the American Federation of Information
Processing Societies: Fall Joint Computer Conference, Volume 24.
Baltimore, MD: Spartan Books. 419-23.
Gilhooly, K.J. and Logie, R.H. (1980). Age
of acquisition, imagery, concreteness, familiarity and ambiguity
measures for 1944 words. Behaviour Research Methods and
Instrumentation, 12, 395-427.
Guierre, L. (1966). Un codage des mots
anglais en vue de l'analyse automatique de leur structure
phonetique. Etudes de linguistique appliquee, 4, 48-64.
Kiss, G.R., Armstrong, C., Milroy, R. and
Piper, J. (1973). An associative thesaurus of English and its
computer analysis. In Aitken, A.J., Bailey, R.W., and
Hamilton-Smith, N. (Eds.), The Computer and Literary Studies.
Edinburgh: University Press.
Kucera, H. and Francis, W.N. (1967). Computational
Analysis of Present-Day American English. Providence: Brown
University Press.
Mitton, R. (1986). A description of the
files CUVOALD.DAT and CUV2.DAT. The machine usable form of the
Oxford Advanced Learner's Dictionary. The Oxford Text
Archive: Oxford, U.K.
Paivio, A., Yuille, J.C. and Madigan, S.A.
(1968). Concreteness, imagery and meaningfulness values for
925 words. Journal of Experimental Psychology Monograph
Supplement, 76 (3, part 2).
Quinlan, P. (1986). Description of
machine-readable dictionary files. Report. Dept. of
Psychology, Birkbeck College, London.
Svartvik, J. and Quirk, R. (1980). A
Corpus of English Conversation. Lund: Gleerup.
Thorndike, E.L. and Lorge, I. (1944). The
Teacher's Word Book of 30,000 Words. New York: Teachers
College, Columbia University.
Toglia, M.P. and Battig, W.F. (1978). Handbook
of Semantic Word Norms. New York: Erlbaum.
Wells, J.C. (1986). A standardised
machine-readable phonetic notation. In Proceedings of the IEE
conference on speech input/output: techniques and applications.
London, Easter 1986.
Contents of file mrcs2.doc (distributed with MRC
Psycholinguistic Database)
Edited/Hyperized March 6 1997, Craig Clark, UWA Psychology