
MRC Psycholinguistic Database

 

  
MRC Psycholinguistic Database: Machine Usable Dictionary. Version 2.00
  
Informatics Division
Science and Engineering Research Council
Rutherford Appleton Laboratory
Chilton, Didcot, Oxon, OX11 0QX

Michael Wilson
1 April 1987


MRC Machine Usable Dictionary. Version 2.00  

Version 1 of the MRC Psycholinguistic Database was provided as an on-line service (see Coltheart, 1981b). The service drew on three files and several access programs. The first file was a dictionary of words; the second and third files were sets of word association norms from the Edinburgh Thesaurus. The service has now been discontinued.

This second version of the MRC Psycholinguistic Database is being provided as a computer usable resource rather than as a service. An updated version of the dictionary file from the database (referred to here as MRC2.DCT) is being provided for public research use along with some programs which can be used either to access the dictionary, or as examples on which to model programs which match users' specific needs. This database dictionary differs from other machine usable dictionaries in that it includes not only syntactic information but also psychological data for the entries (see Amsler, 1984 for a review of other machine-readable dictionaries). It also differs from most conventional dictionaries in that it does not currently attempt to provide any semantic information. It is designed to be of use to psycholinguists in selecting stimulus materials for testing; for use by researchers in Artificial Intelligence as a source of information required for natural language processing and cognitive simulation; and for use by computer scientists who wish to use the word lists and syntactic information in the design of text processors. 

The MRC Psycholinguistic Database: Machine Usable Dictionary and utility programs are available for research purposes from the Oxford Text Archive as item 1054 on their list at a nominal fee to cover handling costs. Their address is: 

Oxford Text Archive
Oxford University Computing Service
13 Banbury Road,
Oxford OX2 6NN
U.K.

Tel: Oxford (0865) 56721
JANET electronic mail address: [email protected]

The Machine Usable Dictionary File. 

The file contains 150837 words and provides information about 26 different linguistic properties, although it is not the case that information about every property is available for every one of the 150837 words: nobody, for example, has yet collected imagery ratings on such a large set of words, and thus only 9240 of the words possess an imagery rating. 

The dictionary file does not contain any information which is original to it, but was assembled by merging a number of smaller databases of limited availability.

The dictionary file currently occupies 11 Mbyte as a sequential plain text file. Each line of the file holds the entry for one word. The longest entry is 130 characters. An example entry is:

040320021615167000000093057530228435500000 JJ SABLE|eI/bl|eIbl|20 

The composition of the dictionary file is summarised in Table 1, which specifies the linguistic properties described in an entry. The first column of Table 1 indicates the character columns in the file that contain the data. The last four properties are held in variable length fields separated by a | character. The second column gives the name of the data field as used elsewhere in programs and documentation. The third column specifies the identity of the linguistic property, and the fourth column indicates the number of words in the database for which information about that property is available. The first fourteen properties are stored in the file as numerical values. For these properties, the occurrence count refers to the number of non-zero entries.
 

Table 1. The Dictionary File.  

COLUMN NAME PROPERTY OCCURRENCES
1-2  NLET  Number of letters in the word  150837 
3-4  NPHON  Number of phonemes in the word  38438 
5  NSYL  Number of syllables in the word  89402 
6-10  K-F-FREQ  Kucera and Francis written frequency  29778 
11-12  K-F-NCATS  Kucera and Francis number of categories  29778 
13-15  K-F-NSAMP  Kucera and Francis number of samples  29778 
16-21  T-L-FREQ  Thorndike-Lorge frequency  25308 
22-25  BROWN-FREQ  Brown verbal frequency  14529 
26-28  FAM  Familiarity  9392 
29-31  CONC  Concreteness  8228 
32-34  IMAG  Imagery  9240 
35-37  MEANC  Mean Colorado Meaningfulness  5450 
38-40  MEANP  Mean Paivio Meaningfulness  1504 
41-43  AOA  Age of Acquisition  3503 
44  TQ2  Type  44976 
45  WTYPE  Part of Speech  150769 
46  PDWTYPE  PD Part of Speech  38390 
47  ALPHSYL  Alphasyllable  15938 
48  STATUS  Status  89550 
49  VAR  Variant Phoneme  1445 
50  CAP  Written Capitalised  4585 
51  IRREG  Irregular Plural  23111 
|  WORD  the actual word  150837 
|  PHON  Phonetic Transcription  38420 
|  DPHON  Edited Phonetic Transcription  136982 
|  STRESS  Stress Pattern  38390 
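
The layout in Table 1 can be read with simple string slicing. The following is a minimal sketch, not one of the distributed utility programs: the function name parse_entry and the sample record are hypothetical, and the sketch assumes that the WORD field begins in column 52, immediately after the 51 fixed columns, and that a blank numeric field can be treated as zero.

```python
# Fixed-width fields from Table 1 as (name, first column, last column), 1-indexed.
FIXED_FIELDS = [
    ("NLET", 1, 2), ("NPHON", 3, 4), ("NSYL", 5, 5),
    ("K-F-FREQ", 6, 10), ("K-F-NCATS", 11, 12), ("K-F-NSAMP", 13, 15),
    ("T-L-FREQ", 16, 21), ("BROWN-FREQ", 22, 25), ("FAM", 26, 28),
    ("CONC", 29, 31), ("IMAG", 32, 34), ("MEANC", 35, 37),
    ("MEANP", 38, 40), ("AOA", 41, 43), ("TQ2", 44, 44),
    ("WTYPE", 45, 45), ("PDWTYPE", 46, 46), ("ALPHSYL", 47, 47),
    ("STATUS", 48, 48), ("VAR", 49, 49), ("CAP", 50, 50), ("IRREG", 51, 51),
]
# The first fourteen properties (columns 1-43) are numeric.
NUMERIC = {name for name, _, last in FIXED_FIELDS if last <= 43}
# The last four properties are variable length, separated by '|'.
VARIABLE_FIELDS = ["WORD", "PHON", "DPHON", "STRESS"]


def parse_entry(line):
    """Split one dictionary line into a dict keyed by the Table 1 names."""
    line = line.rstrip("\r\n")
    entry = {}
    for name, first, last in FIXED_FIELDS:
        raw = line[first - 1:last]
        if name in NUMERIC:
            # Assumption: a blank numeric field is treated as 0 (no data).
            entry[name] = int(raw) if raw.strip() else 0
        else:
            entry[name] = raw.strip()
    # Assumption: WORD starts in column 52, right after the fixed columns.
    parts = line[51:].split("|")
    parts += [""] * (len(VARIABLE_FIELDS) - len(parts))
    entry.update(zip(VARIABLE_FIELDS, parts))
    return entry


# Hypothetical record built for illustration, not copied from MRC2.DCT.
sample = (
    "04" "03" "2" "00216" "15" "167"   # NLET .. K-F-NSAMP
    "000000" "0000"                    # T-L-FREQ, BROWN-FREQ
    "000" "000" "000" "000" "000" "000"  # FAM .. AOA
    " " "J" "J" " " "S" "   "          # TQ2 .. IRREG
    "ABLE|eI/bl|eIbl|20"
)
entry = parse_entry(sample)
print(entry["WORD"], entry["NLET"], entry["K-F-FREQ"], entry["STATUS"])
```

Because the file is a sequential text file with one entry per line, the same function could be applied line by line while iterating over the open file, without loading all 11 Mbyte at once.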

 

Some of the properties listed in Table 1 are obvious; others require explanation as follows: 

 

NLET

The distribution of entries in the WORD field by the number of letters that they contain is shown in Table 2. 

Table 2. The Distribution of Word Lengths Given by NLET.  

NUMBER OF OCCURRENCES  NLET 
31  1 
168  2 
1342  3 
4719  4 
10199  5 
16818  6 
21118  7 
22302  8 
20426  9 
16409  10 
11697  11 
7566  12 
4451  13 
2342  14 
1158  15 
479  16 
250  17 
81  18 
32  19 
14  20 
4  21 
1  22 
2  23 

 

NPHON 

The distribution of entries in the WORD field by the number of phonemes that they contain is shown in Table 3. 

 

NSYL

The distribution of entries in the WORD field by the number of syllables that they contain is shown in Table 4. 

 

K-F-FREQ, K-F-NCATS, K-F-NSAMP

The first of these refers to a word's frequency of occurrence as given in the norms of Kucera and Francis (1967). The maximum frequency in the file is 69971, the minimum is 0. The meanings of K-F-NCATS and K-F-NSAMP are defined by Kucera and Francis (1967). 

Table 3. The Distribution of Phoneme Counts Given by NPHON.
 

 NUMBER OF OCCURRENCES  NPHON 
109060  0 
32  1 
276  2 
1442  3 
3396  4 
4561  5 
4985  6 
4691  7 
4199  8 
3317  9 
2429  10 
1536  11 
862  12 
450  13 
206  14 
110  15 
42  16 
9  17 
3  18 
3  19 

 

Table 4. The Distribution of Syllable Counts Given by NSYL.   

NUMBER OF OCCURRENCES  NSYL 
58081  0 
12485  1 
32837  2 
27751  3 
14159  4 
4530  5 
856  6 
134  7 
14  8 
1  9 

 

T-L-FREQ

This is the frequency of occurrence as given in the L count of Thorndike and Lorge (1944). If you plan to use this frequency count, you are advised to read the details about it in the Thorndike-Lorge book. For example, the frequency value of a singular word which has a regular plural includes the frequency of the plural form, and this is true for other kinds of derivations too.

 

BROWN-FREQ

This stands for the frequency of occurrence in verbal language derived from the London-Lund Corpus of English Conversation by Brown (1984). There are 14529 entries for 8985 different strings in the WORD field. The range of entries is 0 - 6833 with a mean of 35 and a standard deviation of 252.

 

FAM 

This stands for 'printed familiarity'. The FAM values were derived by merging three sets of familiarity norms: Paivio (unpublished), Toglia and Battig (1978) and Gilhooly and Logie (1980). The method by which these three sets of norms were merged is described in detail in Appendix 2 of the MRC Psycholinguistic Database User Manual (Coltheart, 1981a). This method may not meet with everyone's approval. FAM values lie in the range 100 to 700, with a maximum entry of 657, a mean of 488 and a standard deviation of 99. Note that they are integer values (in the original norms the equivalent range was 1.00 to 7.00).
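
Converting a stored integer rating back to the scale of the original norms is a division by 100. The sketch below is illustrative only: the function name to_original_scale is hypothetical, and it assumes that a stored value of 0 means that no rating is available for the word, in line with the occurrence counts of Table 1 being counts of non-zero entries.

```python
def to_original_scale(stored):
    """Convert a stored rating such as 488 to the 1.00-7.00 scale.

    Assumption: 0 marks a word with no rating, so None is returned.
    """
    if stored == 0:
        return None
    return stored / 100


print(to_original_scale(488))  # the mean FAM value quoted above
print(to_original_scale(0))
```

The same conversion would apply to the other properties that this document describes as multiplied by 100.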

 

CONC 

This is concreteness, and it too is derived from a merging of the Paivio, Colorado, and Gilhooly-Logie norms; details of the merging are given in Appendix 2 of the MRC Psycholinguistic Database User Manual (Coltheart, 1981a). CONC values are integer, in the range 100 to 700 (min 158; max 670; mean 438; s.d. 120).

 

IMAG 

This is imageability, derived from merging the three sets of norms referred to above, and having values in the range 100 to 700 (min 129; max 669; mean 450; s.d. 108).

 

MEANC 

These are the meaningfulness ratings from Toglia and Battig (1978), multiplied by 100 to produce a range from 100 to 700 (min 127; max 667; mean 415; s.d. 78).

 

MEANP 

This is the meaningfulness from the norms of Paivio (unpublished), multiplied by 100 to produce a range from 100 to 700. The two sets of meaningfulness ratings were not merged because their correlation was low (only +.529) and the mean values for a set of words common to the two sets of norms were very different (see Toglia and Battig, 1978, Table 2). 

These differences are due to differences in the instructions to subjects. Thus the two sets of meaningfulness ratings are not comparable, and so were kept separate (min 192; max 922; mean 600; s.d. 107).

 

AOA 

This is age of acquisition from the norms of Gilhooly and Logie (1980), multiplied by 100 to produce a range from 100 to 700 (min 125; max 697; mean 405; s.d. 120).

 

TQ2 

When TQ2 has the value Q (40810 occurrences), the word is a derivational variant of another.

 

WTYPE 

This is syntactic category as represented in the SOED database assembled by Dolby, Resnikoff and MacMurray (1963). There are ten different syntactic categories, coded as shown in Table 5. 

Table 5. Syntactic Category Codes for WTYPE   

SYNTACTIC CATEGORY  CODE  OCCURRENCES 
Noun  N  77355 
Adjective  J  25547 
Verb  V  30725 
Adverb  A  4243 
Preposition  R  230 
Conjunction  C  108 
Pronoun  U  134 
Interjection  I  352 
Past Participle  P  5939 
Other  O  6136 

 

 

PDWTYPE 

When you are interested in syntactic category, WTYPE can sometimes be unsatisfactory. For example, the words FREEZE and HARASS are Nouns according to WTYPE (as well as verbs); and indeed when these are looked up in SOED or Webster's, they are described as nouns. If you want to avoid such esoteric usages, PDWTYPE may be useful. It refers to the syntactic categories given in Jones' Pronouncing Dictionary (Jones, 1963), and very unusual uses of words are not considered. However PDWTYPE uses only four categories, not ten: these four are noun (N, 22061 occurrences), verb (V, 6333 occurrences), adjective (J, 8817 occurrences) and other (O, 1179 occurrences). The mapping from WTYPE to PDWTYPE is shown in Table 6. 

Table 6. The Mapping from WTYPE to PDWTYPE  

OCCURRENCES  WTYPE  PDWTYPE 
3751  A   
492  A  O 
47  C   
61  C  O 
261  I   
91  I  O 
16730  J   
8817  J  J 
55294  N   
22061  N  N 
5785  O   
351  O  O 
5939  P   
115  R   
115  R  O 
65  U   
69  U  O 
24392  V   
6333  V  V 
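
One way to use PDWTYPE to avoid esoteric usages is to keep only those entries whose WTYPE is confirmed by a matching PDWTYPE. This is a small sketch under stated assumptions: entries are dicts keyed by the field names of Table 1, the function name is hypothetical, and the three sample entries are invented for illustration rather than taken from the file.

```python
def has_confirmed_category(entry, code):
    """True when both WTYPE and PDWTYPE carry the given category code."""
    pd_code = entry.get("PDWTYPE", "").strip()
    return entry.get("WTYPE") == code and pd_code == code


# Invented entries; the real codes for these words may differ.
entries = [
    {"WORD": "FREEZE", "WTYPE": "N", "PDWTYPE": ""},   # noun in WTYPE only
    {"WORD": "FREEZE", "WTYPE": "V", "PDWTYPE": "V"},
    {"WORD": "TABLE", "WTYPE": "N", "PDWTYPE": "N"},
]

nouns = [e["WORD"] for e in entries if has_confirmed_category(e, "N")]
print(nouns)
```

Note from Table 6 that this test only works for the codes N, V and J. The other WTYPE codes map either to a blank PDWTYPE or to O, so they would need a separate rule.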

 

ALPHSYL

If this = A, then the word is an abbreviation (130 occurrences); if S, the word is a suffix (282 occurrences); if P, a prefix (1374 occurrences); if H, the word is hyphenated (13716 occurrences); if T, a multi-word phrasal unit (436 occurrences). For all of these categories, NSYL = 0. For all other words ALPHSYL is blank.

 

STATUS 

The 15 possible categories of STATUS are listed in Table 7; these are as given in the Dolby database (Dolby et al., 1963) derived from the Shorter Oxford English Dictionary, and perusal of Table 7 should make the meanings of these categories sufficiently clear. 

Table 7. The Possible Values of STATUS   

STATUS OF WORD  CODE  OCCURRENCES 
Dialect  D  2780 
Alien  F  6003 
Archaic  A  959 
Colloquial  Q  405 
Capital  C  2 
Erroneous  N  0 
Nonsense  E  62 
Nonce Word  W  33 
Obsolete  O  10549 
Poetical  P  183 
Rare  R  2756 
Rhetorical  H  22 
Specialised  $  7731 
Standard  S  58065 
Substandard  Z  0 
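
When selecting stimulus materials it may be useful to restrict a word list to ordinary standard words. The sketch below combines the STATUS code S from Table 7 with a blank ALPHSYL field. It is one possible filter, not a recommended procedure: the function name and the sample entries are hypothetical, and entries are assumed to be dicts keyed by the field names of Table 1.

```python
def is_plain_standard_word(entry):
    """True for STATUS = S entries that are not abbreviations, affixes,
    hyphenated words or phrasal units (that is, ALPHSYL is blank)."""
    return (
        entry.get("STATUS") == "S"
        and entry.get("ALPHSYL", "").strip() == ""
    )


# Invented entries for illustration only.
entries = [
    {"WORD": "ABLE", "STATUS": "S", "ALPHSYL": ""},
    {"WORD": "ABLE-BODIED", "STATUS": "S", "ALPHSYL": "H"},
    {"WORD": "ABLINS", "STATUS": "D", "ALPHSYL": ""},
]

selected = [e["WORD"] for e in entries if is_plain_standard_word(e)]
print(selected)
```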

 

VAR

This refers to words which have the same spelling but different pronunciations and syntactic classes. When the pronunciations differ only in respect of stress (e.g. object, insult), VAR = O (212 occurrences). When the pronunciations differ phonemically (e.g. moderate, abuse), VAR = B (1233 occurrences).

 

CAP 

If this = C, then the word is normally written with an initial capital letter. This can be used as an indicator of proper nouns such as the names of people, towns, states and countries.

 

IRREG 

This refers to the plurality of words. Where IRREG = Z, the word is plural (17441 occurrences); this can be used in conjunction with TQ2 to select irregular forms. Where IRREG = Y, the word is a singular form (1024 occurrences); where IRREG = B, the word is both the singular and the plural form (151 occurrences); where IRREG = N, the word has no plural form (4407 occurrences); and where IRREG = P, the word is plural but acts as a singular (88 occurrences).

 

WORD 

The dictionary is ordered by the ASCII sequence of these strings. Although there are 150837 entries in the dictionary, there are only 115331 different strings. The distribution of homographs is as follows:

NUMBER OF ENTRIES  NUMBER OF WORDS 
1  94225 
2  22132 
3  2967 
4  703 
5  96 
6  20 
7  5 

 

PHON and DPHON

The 12th edition of Daniel Jones's Pronouncing Dictionary (Jones, 1963) was transferred to magnetic tape by Professor L. Guierre (Guierre, 1966). This tape is the basis of the phonetic transcriptions in the PHON field. The phonetic symbols used on the tape were adjusted, following suggestions from Roger Mitton (see Mitton, 1986), to conform to the U.K. Alvey standard for machine readable phonetic transcription (Wells, 1986). The changes made by Quinlan (1986) to the phonetic symbols used in Coltheart (1981a) include: devoiced consonants have been folded into their voiced equivalents; the symbol 3 referred to by Coltheart (1981a) has been dropped, as no occurrence of it could be found; and I( and U( have been mapped into I and U respectively.

The symbols currently used in the PHON field are a '/' character to denote syllable boundaries and those presented in Table 8 together with, where printable, their International Phonetic Alphabet equivalents. The DPHON field uses these symbols without the syllable separator, but with the TQ2 symbols appended after the phonetic transcription. DPHON also includes the following three characters: - + R. The hyphen represents the hyphen in hyphenated spellings. The 'R' character represents a final R in the first part of a hyphenated word, which is pronounced only if the second part of the word begins with a vowel. The '+' sign indicates the division between the two parts of a compound noun written without a space (indicated by ALPHSYL = T) or hyphenation (indicated by ALPHSYL = H).

Table 8. Phonetic Symbols used in the Dictionary     

CONSONANTS  VOWELS
IPA PHONETIC SYMBOL  EXAMPLE  DATABASE PHONETIC SYMBOL  IPA PHONETIC SYMBOL  EXAMPLE  DATABASE PHONETIC SYMBOL 
p  put  p  i:  bean  i 
b  but  b  a:  barn  A 
t  ten  t  ɔ:  born  O (oh) 
d  den  d  u:  boon  u 
k  can  k  ɜ:  burn  3 
m  man  m  i  pit  I 
n  not  n  ɛ  pet  e 
l  like  l  æ  pat  & 
r  run  r  ʌ  putt  V 
f  full  f  o  pot  0 (zero) 
v  very  v  ʊ  good  U 
s  some  s  ə  about  @ 
z  zeal  z  ei  bay  eI 
h  hat  h  ai  buy  aI 
w  went  w  ɔi  boy  oI (oh) 
g  game  g  oʊ  no  @U 
tʃ  chain  tS  aʊ  now  aU 
dʒ  Jane  dZ  iə  peer  I@ 
ŋ  long  9  ɛə  pair  e@ 
θ  thin  T  ʊə  poor  u@ 
ð  then  D       
ʃ  ship  S       
ʒ  measure  Z       
j  yes  j       
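
A PHON string can be broken into syllables and symbols using the '/' separator and the database symbols of Table 8. The sketch below is an approximation under stated assumptions: the function name is hypothetical, it assumes that PHON contains only the Table 8 symbols and '/', and it matches two-character symbols greedily, so a sequence such as t followed by S across a morpheme boundary would be read as the single symbol tS.

```python
# Two-character database symbols listed in Table 8.
TWO_CHAR_SYMBOLS = {"eI", "aI", "oI", "@U", "aU", "I@", "e@", "u@", "tS", "dZ"}


def split_phon(phon):
    """Return a list of syllables, each a list of database symbols."""
    syllables = []
    for chunk in phon.split("/"):
        symbols = []
        i = 0
        while i < len(chunk):
            pair = chunk[i:i + 2]
            if pair in TWO_CHAR_SYMBOLS:
                # Greedy match: prefer a two-character symbol when one fits.
                symbols.append(pair)
                i += 2
            else:
                symbols.append(chunk[i])
                i += 1
        syllables.append(symbols)
    return syllables


syllables = split_phon("eI/bl")
print(syllables)
print(len(syllables), sum(len(s) for s in syllables))
```

For the transcription eI/bl this gives two syllables and three symbols, which agrees with the NSYL and NPHON values at the start of the example entry shown earlier.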

 

References

Amsler, R.A. (1984). Machine-Readable Dictionaries. In M.E. Williams (Ed.), Annual Review of Information Science and Technology (ARIST), 19, 161-209. American Society for Information Science (ASIS); Knowledge Industry Publications, Inc. 

Brown, G.D.A. (1984). A frequency count of 190,000 words in the London-Lund Corpus of English Conversation. Behavior Research Methods, Instruments, and Computers, 16 (6), 502-532. 

Coltheart, M. (1981a). MRC Psycholinguistic Database User Manual: Version 1. [This is a now hard-to-find "in house" production. Mike Wilson has kindly provided an OCR transcript online.] 

Coltheart, M. (1981b). The MRC Psycholinguistic Database. Quarterly Journal of Experimental Psychology, 33A, 497-505. 

Dolby, J.L., Resnikoff, H.L. and MacMurray, F.L. (1963). A tape dictionary for linguistic experiments. In Proceedings of the American Federation of Information Processing Societies: Fall Joint Computer Conference, Volume 24. Baltimore, MD: Spartan Books. 419-23.

Gilhooly, K.J. and Logie, R.H. (1980). Age of acquisition, imagery, concreteness, familiarity and ambiguity measures for 1944 words. Behavior Research Methods and Instrumentation, 12, 395-427. 

Guierre, L. (1966). Un codage des mots anglais en vue de l'analyse automatique de leur structure phonetique. Etudes de linguistique appliquee, 4, 48-64. 

Kiss, G.R., Armstrong, C., Milroy, R. and Piper, J. (1973). An associative thesaurus of English and its computer analysis. In Aitken, A.J., Bailey, R.W., and Hamilton-Smith, N. (Eds.), The Computer and Literary Studies. Edinburgh: University Press. 

Kucera, H. and Francis, W.N. (1967). Computational Analysis of Present-Day American English. Providence: Brown University Press. 

Mitton, R. (1986). A description of the files CUVOALD.DAT and CUV2.DAT. The machine usable form of the Oxford Advanced Learner's Dictionary. The Oxford Text Archive: Oxford, U.K. 

Paivio, A., Yuille, J.C. and Madigan, S.A. (1968). Concreteness, imagery and meaningfulness values for 925 words. Journal of Experimental Psychology Monograph Supplement, 76 (3, part 2). 

Quinlan, P. (1986). Description of machine-readable dictionary files. Report. Dept. of Psychology, Birkbeck College, London. 

Svartvik, J. and Quirk, R. (1980). A Corpus of English Conversation. Lund: Gleerup. 

Thorndike, E.L. and Lorge, I. (1944). The Teacher's Word Book of 30,000 Words. New York: Teachers College, Columbia University. 

Toglia, M.P. and Battig, W.F. (1978). Handbook of Semantic Word Norms. New York: Erlbaum. 

Wells, J.C. (1986). A standardised machine-readable phonetic notation. In Proceedings of the IEE conference on speech input/output: techniques and applications. London, Easter 1986.


Contents of file mrcs2.doc (distributed with MRC Psycholinguistic Database)
Edited/Hyperized March 6 1997, Craig Clark, UWA Psychology
