A Broad-Bandwidth, Public-Domain, Personality Inventory Measuring the Lower-Level Facets of Several Five-Factor Models

Lewis R. Goldberg

University of Oregon and Oregon Research Institute
(1999) In I. Mervielde, I. Deary, F. De Fruyt, & F. Ostendorf (Eds.), Personality Psychology in Europe, Vol. 7 (pp. 7-28). Tilburg, The Netherlands: Tilburg University Press.

This chapter is a plea for help in changing the way that we construct new measures of personality characteristics. Because I am going to propose a somewhat radical alternative to conventional practice, those who are satisfied with the pace of progress in the technology of personality assessment may not be pleased with my arguments.

In my view, however, the science of personality assessment has progressed at a dismally slow pace since the first personality inventories were developed over 75 years ago. What is usually taken to be the earliest personality instrument, Woodworth's Personal Data Sheet (PDS), was published in 1917, and since that time thousands of other instruments have been developed. Like the PDS, most of these have been of limited bandwidth, typically providing measures of one, two, or at most three traits. Virtually all of these narrow-bandwidth instruments are in the public domain--the items and their scoring keys having been published in scientific books, journal articles, or student theses or dissertations. The items are freely used by other scientists, either in their original form or quite commonly in some customized format. Examples of attributes measured by such narrow-bandwidth instruments include Achievement-Motivation, Adjustment, Conservatism, Coronary-Risk, Dogmatism, Empathy, Extraversion-Introversion, Guilt, Hostility, Locus of Control, Masculinity and/or/versus Femininity, Narcissism, Neuroticism, Openness to Experience, Optimism, Private and Public Self-Consciousness, Right-Wing Authoritarianism, Self-Disclosure, Self-Esteem, Self-Monitoring, Sensation-Seeking, Test-Anxiety, and Trust.

On the other hand, most broad-bandwidth personality inventories (like the MMPI, CPI, 16PF, and NEO-PI) are proprietary instruments, whose items are copyrighted by the test authors. As a consequence, the instruments cannot be used freely by other scientists, who thus cannot contribute to their further development and refinement. Indeed, broad-bandwidth inventories are rarely revised. At most, after many decades of commercial use, some of the most dated items might be changed and/or new norms established. For many inventories, nothing is ever done at all.

The manuals for some of these commercial inventories include tables of correlations between the scale scores and various criterion indices. But such empirical findings are rarely used to influence scale development, much less to continually improve the quality of the scales. Even worse, virtually all of the findings from different inventories are incommensurate. Test authors are not encouraged to conduct comparative validity studies, pitting their instrument against one or more others as predictors of the same set of criterion indices. As a result, neither the science of personality assessment nor its applied practitioners have any information about the comparative performance of the different instruments available in the marketplace. There is no Consumers Union for testing our tests.

One basic problem is that scientific goals may become subjugated to commercial interests. I believe that it is time for a change: I envisage an international effort to develop and continually refine a broad-bandwidth personality inventory, whose items are in the public domain, and whose scales can be used for both scientific and commercial purposes. No one investigator alone has access to many diverse criterion settings; but the international scientific community has such access, and by pooling our findings we should be able to devise instruments over the next decade that make our present ones seem like ancient relics.

To get there, we need to start somewhere. To begin, we must agree on the solutions to at least three problems: (1) We need a taxonomic framework for organizing the nearly infinite variety of individual differences that might be measured. (2) We need a common item format, one that is amenable to faithful translation across diverse languages. And, (3) we need a mode of communication--an effective logistical procedure for investigators to easily obtain the items and the findings from previous studies, as well as the data for re-analyses; in addition, we need a way for investigators to add new items to the pool, along with findings about their properties. For the first time, the solutions to all three problems may now be at hand.


While the technology of personality assessment has remained stagnant over the last decades, the fundamental taxonomic problem in personality assessment may be close to a solution. In spite of strong denials by some vocal critics (e.g., Block, 1995), I think that most investigators would agree that the general framework for a comprehensive structure of phenotypic personality attributes seems finally to be visible (Digman, 1990; Goldberg, 1981, 1993b, 1995; John, 1990; Saucier & Goldberg, 1996b). In a variety of Indo-European and other languages, analyses of large samples of trait-descriptive adjectives have generally led to a structural representation--often referred to as the Big-Five factor structure--which seems to incorporate most phenotypic personality attributes (Goldberg, 1990; Saucier & Goldberg, 1996a).

One way of viewing this model is as a hierarchical structure with the Big-Five factors at or near the top of the hierarchy, below which are located the various lower-level "facets" that are measured by particular narrow-bandwidth personality measures (Goldberg, 1993a). Although there is some agreement in the personality literature about the characteristics of the higher-level factors, there is no such agreement about an optimal set of lower-level facets. For example, there are 45 bipolar dimensions in the AB5C model of the Big Five proposed by Hofstee, de Raad, and Goldberg (1992); there are 30 bipolar dimensions in the Five-Factor model of Costa and McCrae as operationalized in their revised NEO inventory (NEO-PI-R); there are about 30 to 35 facets implied in the scales in Gough's California Psychological Inventory (CPI); and there are the well-known 16 primary factors in the hierarchical structure incorporated by Cattell in his Sixteen Personality Factors Questionnaire (16PF). Because agreement has not yet been reached on the relative superiority of any one of these competing lower-level structures, it behooves us to incorporate them all in our preliminary inventory, so that they can be compared empirically.

Although an inventory that includes a systematic set of lower-level facets can easily generate the higher-level Big-Five factors, the reverse is not true. Inventories that incorporate only five dimensions cannot provide the specific variance associated with each of the lower-level facets. Because most of the variance in our instruments is specific to each particular trait, inventories that measure only the Big Five will necessarily be less useful than more comprehensive ones in most applied contexts. Indeed, the optimum number of variables to include in regression analyses of individual differences is limited only by considerations of statistical power, and thus of sample size. Recent empirical "demonstrations" of this psychometric principle by Mershon and Gorsuch (1988) and by Ashton, Jackson, Paunonen, Helmes, and Rothstein (1995) are hardly needed, unless one assumes that the only reliable variance in personality measures is that common variance associated with the Big-Five factors.


One major source of the Big-Five factor structure has been findings from analyses based on the "Lexical Hypothesis"--namely that the most important ways that individuals differ from each other will eventually come to be encoded as single attribute-descriptive terms (e.g., trait adjectives and type nouns) in the lexicons of the world's languages. Although the use of such single terms is necessary for the establishment of an indigenous structure in each new language under study, these descriptors are not ideal for use as the items in multi-scale personality inventories. There are at least three interrelated problems with their use: (1) First of all, the same property that provides their major strength in fundamental taxonomic studies, namely their relatively finite number within any language, necessarily limits their utility as purveyors of the complex nuances of lower-level personality description; said another way, there are not enough of them--certainly not enough for redundant and thus reliable measurement in all regions of personality space. (2) In addition, trait adjectives and type nouns encode personality traits at an extremely high level of abstractness. Although research by Hampson, John, and Goldberg (1986) demonstrates substantial differences in breadth within the total set of English trait adjectives (e.g., Extraverted versus Talkative, or Reliable versus Punctual), even the most narrow of such terms (e.g., Talkative and Punctual) are still quite abstract. Most test authors prefer items that are more behaviorally and/or contextually specified. (3) Perhaps as a consequence of the abstractness of trait adjectives and type nouns, it is often not possible to find one-to-one translations for them in different languages, even languages as close linguistically as Dutch, German, and English (Hofstee, Kiers, de Raad, Goldberg, & Ostendorf, 1997). Given the desirability of international collaboration in the development of new assessment methods, this is a highly undesirable feature of their use as test stimuli.

Instead, I propose that we begin this project using an item format that is more contextualized and thus longer than trait adjectives, yet is more compact and thus shorter than the items in many modern personality inventories. The Groningen personality team of Hofstee, de Raad, and Hendriks have been the major proponents of this item format, and they have used it to develop an initial pool of over a thousand Dutch items which they hoped might cover many of the facets of the Big-Five factor structure; findings from analyses of 914 of these Dutch items can be found in Hendriks (1997). I worked with the Groningen team to translate most of these items into their English equivalents. From this initial English item pool, I selected about 750 in their original translations, and then added about 500 new English items that have as yet no Dutch translations. The resulting pool of 1,252 English items--which I have dubbed the International Personality Item Pool (IPIP)--has now been administered in three parts to participants in an adult community sample. Participants in this sample have also been administered an inventory of 360 trait-descriptive adjectives, which include 100 unipolar markers of the Big-Five factor structure (Goldberg, 1992), as well as an inventory of 525 of the most familiar person-descriptive adjectives in English. In addition, these participants have completed a variety of commercial personality inventories, including the NEO-PI-R, CPI, TCI, and the 16PF.


If scientists world-wide are to participate together to construct the next generation of personality inventories, they need an effective method for obtaining previous findings and data, and for adding their own new findings and data. With the rapid development of the World-Wide-Web (WWW) and the associated expansion of File Transfer Protocol (FTP) sites, it is now possible to access scientific data banks easily and economically through electronic means. Indeed, it is my prediction that over the next few years the phrase "public domain" will come to mean "accessible via the World-Wide-Web." To start this process, I have set up a WWW site (http://ipip.ori.org/ipip), which includes the tables in this chapter, so as to provide easy access to this information from computer terminals throughout the world.


Table 1 presents some of the characteristics of preliminary IPIP scales targeted at the 45 bipolar AB5C facets. For each of these new scales, Table 1 lists the number of items keyed positively, the number keyed negatively, and the total number of items in the scale; in addition, the mean item intercorrelation is provided, along with the Coefficient Alpha reliability estimate. My aspiration is to develop roughly 10-item scales, with Alphas that range from .70 to .90, and average .80. Of the 45 preliminary scales, 43 have Alphas of .70 or above, and 18 have reliabilities of .80 or above; the mean of all 45 scales is .78. The items included in each of the 45 preliminary AB5C scales are listed in the Appendix.
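
For readers who want to compute the same summary statistics on their own item responses, the following is a minimal sketch in Python, not the analysis code used for this chapter. It assumes a respondents-by-items score matrix and a 1-to-5 rating scale for reverse-keying; the rating range and all function names are illustrative assumptions.

```python
import numpy as np

def reverse_key(items, negative_cols, low=1, high=5):
    """Flip negatively keyed items on an assumed low..high rating scale."""
    items = np.asarray(items, dtype=float).copy()
    items[:, negative_cols] = (low + high) - items[:, negative_cols]
    return items

def cronbach_alpha(items):
    """Coefficient Alpha for an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the scale total
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

def mean_interitem_r(items):
    """Mean of the off-diagonal item intercorrelations."""
    r = np.corrcoef(np.asarray(items, dtype=float), rowvar=False)
    k = r.shape[0]
    return (r.sum() - k) / (k * (k - 1))         # drop the k diagonal ones

# Made-up responses: 5 respondents, 3 items, the last one negatively keyed.
raw = np.array([[1, 2, 5],
                [2, 2, 4],
                [3, 4, 3],
                [4, 4, 1],
                [5, 5, 1]])
scored = reverse_key(raw, negative_cols=[2])
alpha = cronbach_alpha(scored)
mean_r = mean_interitem_r(scored)
```

Negatively keyed items must be reversed before either statistic is computed; otherwise their negative correlations with the other items pull both the mean intercorrelation and Alpha downward.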

Table 2 provides a comparison between some characteristics of the 30 facet scales from the NEO-PI-R (Costa & McCrae, 1992) and 30 similar constructs measured in the IPIP pool. Of the two sets of 30 constructs, only 13 are labeled identically; differences in scale labels are most pronounced in the Openness domain, where all six facets have different labels. On average, the IPIP scales include 10 items, with about half keyed in each direction. The average of the Coefficient Alpha values is a bit higher for the IPIP scales (.80) than for the NEO scales (.75). The average correlation between corresponding scales in the two sets is .73, which translates into a correlation of .94 when corrected for attenuation due to the unreliabilities of the two scales in each pair.
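
The correction for attenuation mentioned above divides the observed correlation by the square root of the product of the two reliabilities. The sketch below applies that standard formula to the averaged Table 2 values; the function name is illustrative, and working from averages rather than from each scale pair is a simplifying assumption, so results for other tables may differ slightly from the reported figures.

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Correlation corrected for unreliability in both measures."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Averaged values from the NEO-PI-R comparison: r = .73, Alphas of .80 and .75.
corrected = disattenuate(0.73, 0.80, 0.75)
print(round(corrected, 2))  # prints 0.94
```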

Table 3 presents a comparison between the scales on the 16PF (Conn & Rieke, 1994) and 16 new IPIP scales constructed to measure the same constructs; the scales are ordered by their associations with the Big-Five factor structure. All but one of the IPIP scales include 10 items, about half keyed in each direction; of the 16PF scales, which are also balanced in their keying, nine include 10 items, five include 11 items, and one each includes 14 and 15 items. Cattell has preferred labeling his scales with letters (A through Q4), but the manual also lists the short verbal labels presented in Table 3; only two of these trait labels are the same as those of the corresponding IPIP scales. Because the average item intercorrelations are slightly higher for the IPIP than for the 16PF scales (.29 versus .21), the average IPIP Coefficient Alpha is also somewhat higher (.80 versus .74). The average correlation between corresponding scales in the two sets is .66, which translates into a corrected correlation of .86.

An analogous table focused on the lower-level constructs in Cloninger's Temperament and Character Inventory (TCI) is available from the author. Of the 31 TCI constructs, 30 are included in the corresponding IPIP scale set--11 with Coefficient Alpha reliability estimates of .80 or more, 15 others with reliabilities of .70 or more, and 4 others with reliabilities just slightly below .70. The average Alpha values of the two sets of scales are quite similar (.77 and .78); on average, the pairs of scales correlated .64, which corresponds to a correlation of .83 when corrected for the scale reliabilities.

Also available from the author is a similar table comparing 33 of the scales in the CPI (Gough, 1996) with the corresponding 33 preliminary IPIP scales targeted at those constructs. The original 33 CPI scales vary in length from 28 to 70 items, averaging 39; in contrast, the IPIP scales include about 10 items, usually about half keyed in each direction. The IPIP scales are far more homogeneous than the CPI scales, with mean item intercorrelations that average .26, as compared to a mere .08 for the CPI scales. As a consequence, although the IPIP scales are much shorter than the CPI scales, both sets of scales have similar Alpha coefficients; for the original CPI scales the Alphas vary from .53 to .88, averaging .74, whereas for the IPIP scales the values range from .62 to .87, averaging .76. The average correlation between the corresponding CPI and IPIP scales is .62, which translates into a corrected correlation of .84.
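
The trade-off between scale length and homogeneity can be illustrated with the standardized-Alpha (Spearman-Brown) formula, which estimates reliability from the number of items and their mean intercorrelation. This is a rough sketch only: it assumes equal item variances and plugs in the averaged figures quoted above, so it approximates rather than reproduces the reported Alpha values.

```python
def standardized_alpha(k, mean_r):
    """Standardized Alpha from k items with mean intercorrelation mean_r."""
    return k * mean_r / (1.0 + (k - 1) * mean_r)

# About 10 homogeneous items versus about 39 heterogeneous ones.
ipip_like = standardized_alpha(10, 0.26)
cpi_like = standardized_alpha(39, 0.08)
print(round(ipip_like, 2), round(cpi_like, 2))  # prints 0.78 0.77
```

Under these assumptions, a 10-item scale with a mean intercorrelation of .26 and a 39-item scale with a mean intercorrelation of .08 reach nearly the same estimated reliability.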

Table 4 provides a summary of the reliability estimates for the five sets of preliminary IPIP scales, plus a comparison with the original 16PF, CPI, TCI, and NEO scales. Given the large size of this subject sample, one would not expect much attenuation of the IPIP reliabilities in new samples. As a consequence, the most reasonable conclusion one can make from these findings is that the average reliability coefficients for the five sets of preliminary IPIP scales are quite similar, and all of them are at least as high as the values for the average NEO, CPI, TCI, and 16PF scales. With the help of other investigators, I hope that even more reliable IPIP scales will become available over the next decade.


Most of us would agree that the single most important question to be asked of new personality measures concerns their utility as predictors of diverse and important human outcomes. Given the great scientific interest of late in behavioral medicine and health psychology, I used as initial criterion variables some indices of health-related behaviors and practices. A factor analysis of the 39 items in a reasonably comprehensive Health Activities Questionnaire (HAQ: Vickers, Conway, & Hervig, 1990) yielded three orthogonal factors: (1) Risk-Avoidance (12 questions such as "I carefully obey traffic rules," "I do not drink," "I avoid high-crime areas" versus "I take chances crossing the street," "I speed while driving," "I drive after drinking"); (2) Good Health Practices (12 items such as "I exercise to stay healthy," "I eat a balanced diet," "I see a dentist for regular checkups," "I watch my weight," "I don't smoke"); and (3) General Health Concerns (15 items such as "I gather information on things that affect my health," "I avoid areas with high pollution," "I take health food supplements," "I watch for possible signs of major health problems").

The orthogonal factor scores on each of these three health factors were used as criterion variables, along with the factor scores on the first unrotated component from an analysis of all 39 HAQ items (Total Health-Related Practices). In stepwise regression analyses, the scores from each of the five IPIP preliminary scale sets and each of the four original 16PF, CPI, TCI, and NEO scale sets were included separately as the predictor variables. To control for any effects of subject sex, age, or educational level, all the analyses were repeated using two additional procedures: (a) Using hierarchical regression procedures, the three demographic variables were entered first, followed by the personality scale scores, and (b) the residual factor scores from the HAQ were included as criterion variables after the effects of the three demographic variables had been partialed out.
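
The two demographic controls described above can be sketched as follows. This is an illustration on synthetic data, not the chapter's analysis: it omits the stepwise selection of predictors, and every variable name, sample size, and effect size is an assumption made for the example.

```python
import numpy as np

def multiple_r(X, y):
    """Multiple correlation between y and its least-squares prediction from X."""
    X1 = np.column_stack([np.ones(len(y)), X])       # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return float(np.corrcoef(y, X1 @ beta)[0, 1])

def residualize(y, covariates):
    """Remove the linear effect of the covariates from y."""
    C = np.column_stack([np.ones(len(y)), covariates])
    beta, *_ = np.linalg.lstsq(C, y, rcond=None)
    return y - C @ beta

# Synthetic stand-ins: 3 demographic variables, 4 personality scale scores.
rng = np.random.default_rng(0)
n = 500
demographics = rng.normal(size=(n, 3))
scales = rng.normal(size=(n, 4))
criterion = (0.3 * demographics[:, 0] + 0.5 * scales[:, 0]
             + 0.3 * scales[:, 1] + rng.normal(size=n))

# (a) Hierarchical entry: demographics first, then the personality scales.
r_demo = multiple_r(demographics, criterion)
r_full = multiple_r(np.column_stack([demographics, scales]), criterion)

# (b) Residual criterion: partial out demographics, then predict from scales.
resid = residualize(criterion, demographics)
r_resid = multiple_r(scales, resid)
```

In procedure (a) the gain from `r_demo` to `r_full` reflects what the personality scales add beyond sex, age, and education; in procedure (b) the criterion itself is purged of the demographic variance before the scales are entered.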

The validity findings are easy to summarize: At each step in the regression analyses, for each of the four criterion variables, and for each of the three types of regression procedures, the preliminary IPIP scale sets were typically more highly predictive than the original inventory scale sets. Of the four criteria, the most predictable were Risk-Avoidance and Total Health-Related Practices, and for these two criteria the IPIP scales were always more predictive than the original inventory scales. For example, in the standard regression analyses predicting Risk-Avoidance from the TCI and corresponding IPIP scale sets, the multiple correlations at each of the first four steps, for the TCI scales versus the IPIP scales, were: (1) .42/.57; (2) .49/.63; (3) .52/.64; and (4) .54/.65. The corresponding values for predicting the same criterion using hierarchical regression analyses were: (1) .49/.60; (2) .54/.65; (3) .55/.66; and (4) .57/.67.

In summary, then, the initial evidence regarding the reliability and predictive utility of the preliminary IPIP scales is quite favorable. With the help of other investigators throughout the world, we should be able to refine these scales, and develop new ones, so as to provide substantially improved measures of important personality attributes. As a reader of this chapter, will you join me in this scientific adventure?


This research program has been funded by Grant MH-49227 from the National Institute of Mental Health, U. S. Public Health Service. The author is enormously indebted to Willem K. B. Hofstee, Boele de Raad, and Jolijn Hendriks for their development of the initial Dutch item pool from which many of the IPIP items were translated.


Table 1
Table 2
Table 3
Table 4

