What Happens If We Compare Chopsticks With Forks? The
Impact of Making Inappropriate Comparisons in Cross-Cultural
Research
By: Fang Fang Chen
University of Delaware
Acknowledgement: I would like to express appreciation to
Donna Coffman, Larry Cohen, Samuel Gaertner, Kimberly
Juliano, Shanhong Luo, Beth Morling, Kristopher Preacher,
Robert Simons, Stephen West, and Zugui Zhang for their
thoughtful comments. Special thanks go to Lyle Jones and
Roger Millsap for their insights on scale development and
measurement invariance. I am also grateful to the
Quantitative Forum in the Psychology Department at the
University of North Carolina at Chapel Hill for fruitful
discussion at the early stage of this work.
Correspondence concerning this article should be addressed to: Fang Fang Chen, Department of Psychology, University of Delaware, Wolf Hall, Newark, DE 19716. Electronic mail may be sent to [email protected].
Culture affects people in a variety of basic psychological
domains, including self-concept, attribution and reasoning,
interpersonal communication, negotiation, intergroup relations,
and psychological well-being (for review, see Brewer & Chen, 2007; Fiske, Kitayama, Markus, & Nisbett, 1998; Lehman, Chiu, & Schaller, 2004; Markus & Kitayama, 1991; Oyserman, Coon, & Kemmelmeier, 2002). Suppose we
were interested in studying self-esteem and life satisfaction in
the People's Republic of China and the United States. We may
wish to test the mean differences between the two cultural
groups on the two constructs and, further, to examine whether
the relationship of self-esteem to life satisfaction is stronger
in one culture than in the other. Could we simply use scales
developed in one culture, such as Rosenberg's self-esteem scale
(Rosenberg,
1965), in both cultural groups and then
compare the results? To make valid comparisons across different
cultural or ethnic groups, we must address an important
question: Are we comparing the same constructs across different
groups?
What Is Measurement Invariance and Why Is It Important in
Cross-Cultural Research?
When we compare scale scores, such as self-esteem, across
different groups, we make a critical assumption that the scale
measures the same construct in all of the groups. If that
assumption is true, comparisons and analyses of those scores are
valid, and subsequent interpretations are meaningful. However,
if that assumption does not hold, such comparisons do not
produce meaningful results. This is the general issue of
measurement invariance.
Measurement invariance is the equivalence of a measured
construct in two or more groups, such as people from different
cultures. It ensures that the same constructs are being assessed
in each group. Measurement invariance is an important issue if a
researcher wishes to make group comparisons (e.g.,
Byrne &
Watkins, 2003; Reise, Widaman, & Pugh,
1993; Riordan & Vandenberg,
1994; Van de Vijver & Leung,
1997; Widaman & Reise,
1997). Meaningful comparisons of statistics,
such as means and regression coefficients, can only be made if
the measures are comparable across different groups.
Cross-cultural researchers have long recognized the importance
of ensuring construct comparability in different cultural or
ethnic groups (Berry,
1969; Irvine & Carroll,
1980; Poortinga, 1989;
Van de Vijver
& Leung, 1997). However, it is
the development of measurement invariance tests
(Jöreskog,
1971; Meredith, 1993;
Millsap &
Everson, 1993; Sörbom,
1978; Widaman & Reise, 1997)
and the recent development of advanced statistical tools that
have made it possible to perform rigorous tests of measurement
invariance.
Measurement invariance can be tested when a scale is composed
of multiple items or subscales. With continuous variables, the
most frequently used technique for testing measurement
invariance is multiple-group confirmatory factor analysis (CFA;
F. F. Chen,
2007; F. F. Chen, Sousa, & West,
2005; F. F. Chen & West,
2008; Meredith, 1993;
Millsap &
Everson, 1993; Widaman & Reise,
1997). In factor analytic terms, the items
serve as indicators of the common factor (i.e., the construct
that the items intend to measure) in a CFA model. The basic idea
of applying multiple-group CFA to test measurement invariance is
to examine the interrelations between the indicators (i.e.,
items or subscales) and the factors that the indicators are
supposed to measure. Multiple-group CFA can be used to test the
equivalence of the factor structure (i.e., number of factors),
factor loadings (i.e., unit of a scale), intercepts (i.e.,
origin of a scale), residual variance (i.e., precision of a
scale), and other aspects of a construct across different groups
in a series of hierarchical models.
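In factor-analytic notation, each invariance level constrains one part of the measurement model x = τ + λξ + ε: the intercept τ is the scale's origin, the loading λ its unit, and the residual variance its precision. A minimal simulation sketch (the parameter values below are illustrative assumptions, not values from this article) shows how a single non-invariant loading changes the observed statistics even when the latent trait is identical across groups:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # large n so sample moments approximate population values

def simulate_group(loadings, intercepts, n, rng):
    """Item responses from the one-factor model x = tau + lambda * xi + eps."""
    xi = rng.normal(0.0, 1.0, n)                    # latent factor, variance 1
    eps = rng.normal(0.0, 0.5, (len(loadings), n))  # residual SD = .5
    x = (np.asarray(intercepts)[:, None]
         + np.asarray(loadings)[:, None] * xi + eps)
    return x

# Reference group: all loadings .8; focal group: item 3 loads only .5
x_ref = simulate_group([0.8, 0.8, 0.8], [3.0, 3.0, 3.0], n, rng)
x_foc = simulate_group([0.8, 0.8, 0.5], [3.0, 3.0, 3.0], n, rng)

# An item's variance equals loading^2 + residual variance, so the
# non-invariant item has visibly different variance across groups
# (about .89 vs. .50 here) despite identical latent distributions.
var_ref_item3 = x_ref[2].var()
var_foc_item3 = x_foc[2].var()
```

Multiple-group CFA recovers these group-specific loadings and tests whether constraining them to equality worsens model fit.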
The most basic level of measurement invariance is known as
configural invariance (Horn, McArdle, & Mason, 1983)
or factor-form invariance (Cheung & Rensvold,
2000). It tests whether similar, but not
identical, factors are measured in the groups
(Widaman &
Reise, 1997). The same item must be
associated with the same latent factor in each group, but the
factor loadings can differ across groups.
The second level of invariance is factor loading or metric
invariance. Factor loadings represent the strength of the
relationships between each factor and its associated items
(Bollen,
1989; Jöreskog & Sörbom,
1999). Factor loadings can be conceptualized
as the slopes of regression lines, that is, the weights obtained
by regressing the item responses on the underlying latent
factors. When factor loadings are equal, the unit of the
measurement is identical, and thus predictive relationships can
be compared across groups.
The third level of invariance is intercept or scalar
invariance. It tests whether an item has the same point of
origin across different groups. When invariance is achieved at
both the factor loading and intercept levels, scores from
different groups have the same unit of measurement (i.e., factor
loading) as well as the same origin (i.e., intercept), and thus
factor means can be compared across groups. Otherwise, it is not
certain whether group differences on factor means are
attributable to valid cultural differences or to measurement
artifacts.
The fourth level is the invariance of residual variance. It
tests the equivalence of the precision of a
scale. 1 Measurement invariance can be used to
test the invariance of a scale (i.e., an omnibus test in which
all items are tested simultaneously) as well as the invariance
of individual items (i.e., planned contrast in which one or more
items are tested). When items meet the standards of measurement
invariance, they are considered invariant; otherwise, they are
defined as non-invariant, lacking invariance, or having
measurement bias. It is possible that some of the items are
invariant, whereas others are not in a given scale. For detailed
procedures on testing measurement invariance and criteria on
evaluating measurement invariance, see F. F. Chen (2007);
F. F. Chen, Sousa,
and West (2005); and Widaman and Reise
(1997).
What Factors Can Cause Lack of Measurement Invariance?
When scale scores are compared across different cultural
groups, a variety of sources can affect the equivalence of the
construct. Lack of configural invariance (i.e., the number of
factors that underlies a construct is different) is most likely
to occur when a construct is simply imported from one cultural
setting to another, because a construct can be more
differentiated in one culture than in another. For example, the
concept of individuation (Maslach, Stapp, & Santee,
1985) is best represented by two factors in
China, whereas it is unidimensional in the United States
(Kwan, Bond,
Boucher, Maslach, & Gan, 2002).
Similarly, filial piety is also a more elaborated concept in
China than in the United States (Hsieh, 1967).
Lack of loading invariance (i.e., unit of a scale) is likely to
arise from multiple causes. First, it can happen when a scale is
imported from one culture, such as the United States, to
another, such as China, but the definitions and meanings of that
concept do not fully overlap across different cultures. As a
result, the item content is more appropriate for one culture
than for the other. For example, for North Americans,
self-esteem mainly stems from having unique personal attributes
and individual achievements. In contrast, for people from
Eastern cultures, the self is deeply connected with family,
friends, groups, etc., and thus the sense of
“we” and interdependence with others may be
the most important source of self-esteem. Consequently, items
that tap the Western view of self-esteem, such as “I
am a person of worth,” and “I feel that I
have a number of good qualities,” may not be good
indicators of self-esteem in an Eastern context. The association
between Chinese participants' self-esteem and endorsement of
Western items (i.e., factor loadings) may be weaker than for
American participants. Second, lack of loading invariance can
come from inappropriate translation. When items are translated
from one language to another, their meanings can change,
particularly for idiomatic expressions. For example, items like
“I feel blue” as a measure of depression
would make Chinese participants feel that this item is out of
the blue. The American participants would thus respond to the
content of the item, whereas the Chinese participants would give
inconsistent answers. As a result, the strength of the
relationship (i.e., factor loading) between the items and the
depression construct would be weaker for the Chinese
participants than for their American counterparts. Third,
response sets, particularly the tendency to use or avoid extreme
responses, can result in lack of loading invariance. For
example, evidence suggests that U.S. participants have an
inclination to use the extreme ends of a response scale, whereas
Chinese participants are more likely to use the middle points
(C. Chen, Lee,
& Stevenson, 1995;
Hui &
Triandis, 1985), resulting in a
restricted range of responses among the Chinese participants.
Accordingly, factor loadings differ across the two groups.
Several factors can affect the origin of a scale, that is, the
intercept of the scale. First, social desirability, a tendency
to follow the social norms, can lead participants in one group
to consistently give higher or lower ratings than those in other
groups (Hui &
Triandis, 1985). For example, for the
item “How happy were you in the past week?,”
the true happy state might be 3 on a 5-point scale for
participants from both the United States and China. However, the American participants may respond with 4 because of the need to preserve
positive self-esteem (e.g., Heine, Lehman, Markus, & Kitayama,
1999). Second, when a group is preoccupied
with its own defects or deficiencies, it may convey a stronger
desire for these values or traits. For example, survey ratings
indicate that some minority parents and students place greater value on education than do their European and Asian
counterparts. However, behavioral observations, such as the
amount of time that students stay in school and study, tell a
different story (cf. Peng, Nisbett, & Wong, 1997).
Third, people from different cultural groups may use different
reference frameworks in making judgments about themselves. For
example, current trait or attitude measures of individualism and
collectivism often fail to reveal the expected cultural
differences. However, when participants from Japan and Canada
were asked to compare themselves with either Canadians or
Japanese, the expected cultural differences were enhanced when
the cross-reference group was used (Heine, Lehman, Peng, &
Greenholtz, 2002). Under all three
scenarios, the origin of a scale would be different. A 3 in
Culture A may be equal to a 4 in Culture B, resulting in lack of
intercept invariance.
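The "3 in Culture A may be equal to a 4 in Culture B" scenario can be sketched numerically: if two groups have identical latent means but one group's item intercept is a full scale point higher, the observed means differ by a point even though no true difference exists (the values below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Both cultures have the same true (latent) happiness distribution...
latent_a = rng.normal(0.0, 1.0, n)
latent_b = rng.normal(0.0, 1.0, n)

# ...but Culture B's item intercept is 4 rather than 3 (e.g., through
# socially desirable responding). Loadings are equal across groups (.5).
item_a = 3.0 + 0.5 * latent_a + rng.normal(0.0, 0.5, n)
item_b = 4.0 + 0.5 * latent_b + rng.normal(0.0, 0.5, n)

# The observed mean gap is about 1.0, purely a measurement artifact.
observed_gap = item_b.mean() - item_a.mean()
```

A naive mean comparison would declare Culture B "happier"; intercept invariance testing is what catches this artifact.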
Given the comparative nature of the studies, it is quite a
challenging task to achieve measurement invariance in
cross-cultural research, particularly when we simply apply
instruments developed in one culture to other cultural contexts.
However, this is a common practice in applied research. To what
extent are these scales invariant cross-culturally, and how
confident are we about the conclusions drawn from these studies?
To address these issues, a literature review was conducted on
the instruments used in cross-cultural studies.
Are the Instruments Comparable Cross-Culturally? Analysis of
the Current Practice
The following key words and 30 other similar words were used to
search articles published from 1993 2 to 2006 in the
PsycINFO database: “cross-cultural
invariance,” “factor invariance,”
“measurement invariance.” One hundred thirty
comparisons 3 met the following selection criteria: (a)
the instrument was originally developed in North America, (b)
Caucasian Americans/Canadians were used as the reference group,
(c) the article was published in a peer-reviewed journal, and
(d) factor loadings of each cultural or ethnic group were
reported or obtained upon request.
Analyses were performed to examine the pattern and severity of
factor-loading differences across the cultural or ethnic
comparisons. The analysis results, such as effect size, pattern
of non-invariance, and sample size, were used as the basis for
conducting subsequent simulation studies, in which bias in regression slopes and means resulting from lack of measurement invariance was examined.
Following the convention of Holland and Thayer (1988), the
mainstream cultural group is defined as the reference group
(e.g., United States), and the other ethnic minority or cultural
groups are defined as focal groups (e.g., China). To clarify the
nature of loading differences, two patterns of non-invariance
are defined in this review: (a) When all the non-invariant
loadings are higher in the reference group than in the focal
group, it is classified as a uniform pattern of non-invariance;
(b) when some of the non-invariant loadings are higher in the
reference group and some are higher in the focal group, it is
classified as a mixed pattern of non-invariance. In both cases,
the magnitude of the loading difference is the numerical
difference between the loadings for a given item across two
groups.
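The two patterns defined above can be operationalized in a few lines. The helper below is hypothetical (not part of any published toolkit) and follows the review's definitions; a comparison in which every non-invariant loading favored the focal group falls outside the review's dichotomy and is grouped with "mixed" here:

```python
def classify_pattern(ref_loadings, focal_loadings, tol=1e-8):
    """Classify non-invariance between two standardized loading vectors.

    Returns 'invariant', 'uniform' (all non-invariant loadings higher in
    the reference group), or 'mixed', plus the per-item differences
    (reference minus focal), i.e., the magnitude of loading difference.
    """
    diffs = [r - f for r, f in zip(ref_loadings, focal_loadings)]
    noninv = [d for d in diffs if abs(d) > tol]
    if not noninv:
        pattern = "invariant"
    elif all(d > 0 for d in noninv):
        pattern = "uniform"
    else:
        pattern = "mixed"
    return pattern, diffs
```

For example, `classify_pattern([0.8, 0.7, 0.6], [0.7, 0.6, 0.6])` yields a uniform pattern, whereas `classify_pattern([0.8, 0.5], [0.7, 0.6])` yields a mixed one.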
Among the 130 cross-cultural and cross-ethnic comparisons, 9
lacked configural invariance, which means that the number of
factors that underlie the items was different across groups.
These cases are excluded from further analysis, because it is
not meaningful to compare factor loadings when configural
invariance is not achieved. In the remaining 122 comparisons, 97
were based on standardized factor loadings, and 25 were based on
unstandardized factor loadings. Further analyses are based on
the standardized factor loadings, because unstandardized factor
loadings are subject to scaling, which prevents direct
comparisons across studies.
For 74 of the 97 standardized comparisons (76.3%), the average
loading was higher in the reference group than in the focal
group, and the average loading difference was .13
(SD = .08). Although the magnitude of the
average loading difference between the groups appears small, its
impact may not be trivial.
Findings further indicate that 14 of the 97 comparisons (14.4%)
had all loadings higher in the reference group (e.g., United
States) than in the focal group (e.g., China), showing a uniform
pattern of non-invariance. However, it was more common that only
a proportion of the items, rather than all items, had higher
loadings in the reference group than in the focal group: 26 of
the comparisons (26.8%) had at least 90% of the loadings higher
in the reference group, 48 of the comparisons (49.5%) had at
least 75% of the loadings higher in the reference group, 81 of
the comparisons (83.5%) had at least 50% of the loadings higher
in the reference group, and 94 of the comparisons (96.9%) had at
least 30% of the loadings higher in the reference group.
It is interesting that 7 of the 97 comparisons (7.2%) had about
half of the loadings higher in the reference group and the other
half higher in the focal group, showing a mixed pattern of
non-invariance.
Given these findings, it is important to examine bias in group
comparisons resulting from a proportion of non-invariant items,
in addition to bias associated with the condition in which all
loadings are higher in one group than in the other. It is also
meaningful to investigate bias associated with the pattern of
non-invariance, that is, whether the non-invariant loadings are
uniformly higher in one group or the pattern is mixed.
Although no studies have systematically examined the pattern of
factor loadings across different cultural and ethnic groups, the
findings from this review are consistent with the literature on
reliabilities. For example, in compensatory education research,
test scores obtained from the disadvantaged minority groups
often have lower reliability, compared with those of the
advantaged group (Campbell & Boruch, 1975).
Reviews of self-reported measures on values indicate that higher
reliability was more often reported in the American samples than
in other cultural groups (Peng et al., 1997). The lower
reliability in the focal groups is a reasonable indication of
lower factor loadings and is thus a sign of measurement
bias. 4
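The link between loadings and reliability can be made concrete: for a one-factor scale with unit factor variance, the model-implied covariance matrix is ΛΛ′ + Θ, and coefficient alpha computed from it drops as the loadings drop, holding residual variances fixed. A sketch with illustrative values (not drawn from the reviewed studies):

```python
import numpy as np

def alpha_from_model(loadings, residual_vars):
    """Cronbach's alpha from the model-implied covariance of a one-factor
    scale with factor variance fixed at 1: Sigma = lam lam' + diag(theta)."""
    lam = np.asarray(loadings, dtype=float)
    sigma = np.outer(lam, lam) + np.diag(residual_vars)
    k = len(lam)
    return (k / (k - 1)) * (1 - np.trace(sigma) / sigma.sum())

a_ref = alpha_from_model([0.80] * 10, [0.36] * 10)  # reference-group loadings
a_foc = alpha_from_model([0.67] * 10, [0.36] * 10)  # loadings lower by .13

# a_ref > a_foc: weaker loadings alone depress internal consistency,
# which is why lower focal-group reliability can signal loading bias.
```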
Given that reliabilities are more routinely reported than
factor loadings in published articles, a second search was
conducted. To limit the scope of the search, Rosenberg's (1965)
self-esteem scale was chosen, as it is perhaps one of the most
widely used scales cross-culturally. Using key words
“culture and Rosenberg self-esteem,” and
“cross-cultural and Rosenberg self-esteem,”
a search was performed on the PsycINFO database, and it was
limited to articles published from 1995 to 2006. Seventy-five
comparisons met the following criteria: (a) Rosenberg's
self-esteem measure was used cross-culturally or cross-ethnically, (b) Caucasian Americans/Canadians were used as the
reference group, (c) the article was published in a
peer-reviewed journal, and (d) reliability of the scale for each
cultural or ethnic group was reported or obtained upon request.
In 59 of the 75 comparisons (78.7%), reliability of the
Rosenberg self-esteem scale was higher in the Caucasian
Americans or Canadians than in the other cultural or ethnic
group(s), and the average difference in reliability was .07
(M(U.S./Canada) = .87, SD = .02, vs. M(non-U.S./Canada) = .80, SD = .08).
This pattern is particularly true when comparing North Americans
with Asians, because in 18 of the 21 comparisons (85.7%), scores
of North Americans had higher reliability than the scores of
Asians, and the difference in reliability was .09
(M(U.S./Canada) = .87, SD = .02, vs. M(non-U.S./Canada) = .78, SD = .05).
This analysis also indicates that, consistent with the
literature, North Americans have higher self-esteem than other
cultural or ethnic groups (Cohen's d = .31),
and this difference is moderately large between North Americans
and Asians (Cohen's d = .59). However, it is
possible that the lower reliability in the focal groups is, at
least in part, responsible for the commonly reported cultural
and ethnic difference in self-esteem.
What Happens When Instruments Are Not Comparable
Cross-Culturally? The Present Simulation Studies
When we compare diverse groups on the basis of instruments that
do not have the same psychometric properties, we may discover
erroneous “group differences” that are in
fact artifacts of measurement, or we may miss true group
differences that have been masked by these artifacts. As a
result, a harmful education program may be regarded as
beneficial to the students, or an effective health intervention program may be considered of no use to depressed patients.
Although measurement invariance has been increasingly tested in
cross-cultural comparisons (e.g., Byrne & Campbell, 1999;
Little,
1997; Rhee, Uleman, & Lee, 1996; Steenkamp & Baumgartner,
1998), it is still usually assumed, rather
than tested. The author's review of articles in the Journal of Personality and Social Psychology from 1985 to 2005 indicates that although 48
articles involved cross-cultural comparisons of attitudes,
values, personality, and other self-reported surveys, only 8
studies (less than 17%) tested measurement invariance across
different cultural groups, with the remainder using a sum score
or mean score. The sum-score approach takes the total score of
the items in a scale, and similarly, the mean score takes the
average of the items. Both approaches assume that the measures
under study are invariant across different groups. In addition,
it is not uncommon to pool participants from different cultural
or ethnic groups for evaluation, a procedure that assumes
measurement invariance as well. However, as discovered in the
author's review, this assumption does not hold in many
applications.
To explore the consequences of making comparisons based on
non-invariant measures on the conclusions drawn from a study,
Millsap and Kwok
(2004) conducted an important series of
simulation studies. Given that school admission committees or
employers often select students or employees from different
ethnic or cultural backgrounds, Millsap and Kwok examined
selection bias based on a criterion that is only partially
invariant. Selection bias was defined by the accuracy of
classifying people according to two standards: a factor score
for each group and a composite score in which the group
difference in factor loadings was ignored. 5 Four categories
were created: (a) true positive, should be
selected on the basis of the factor score and was selected on
the basis of the composite score; (b) true
negative, should not be selected on the basis of the
factor score and was not selected on the basis of the composite
score; (c) false positive, should not be
selected but was selected; and (d) false
negative, should be selected but was not selected. It
was found that even small group differences in factor structure
could have substantial influence on selection accuracy, particularly for sensitivity: the number of individuals selected on the basis of both their factor score and their composite score, divided by the total number selected on the basis of their factor score.
example, when the proportion of non-invariance varied from 0%
(control condition) to 75%, sensitivity could drop from 64.2% to
22.1%.
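Millsap and Kwok's four-way classification and the sensitivity index described above can be sketched as follows; the function name and cutoffs are illustrative, not taken from their study:

```python
import numpy as np

def selection_accuracy(factor_scores, composite_scores,
                       cutoff_factor, cutoff_composite):
    """Cross-classify cases selected by the 'true' factor score vs. the
    observed composite score; sensitivity = selected by both / selected
    by the factor score."""
    f_sel = np.asarray(factor_scores) >= cutoff_factor
    c_sel = np.asarray(composite_scores) >= cutoff_composite
    true_pos = np.sum(f_sel & c_sel)    # should be and was selected
    true_neg = np.sum(~f_sel & ~c_sel)  # should not be and was not
    false_pos = np.sum(~f_sel & c_sel)  # should not be but was selected
    false_neg = np.sum(f_sel & ~c_sel)  # should be but was not selected
    return {"TP": int(true_pos), "TN": int(true_neg),
            "FP": int(false_pos), "FN": int(false_neg),
            "sensitivity": float(true_pos / f_sel.sum())}
```

With toy scores, `selection_accuracy([1, 2, 3, 4], [4, 1, 3, 2], 2.5, 2.5)` classifies one case into each category and yields a sensitivity of .5: half the deserving candidates are missed by the composite score.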
No studies have examined the bias that lack of measurement
invariance may introduce to commonly used statistics, such as
means and regression slopes, in group comparisons. For example,
suppose we were interested in asking whether self-esteem would
predict life satisfaction to the same degree for Chinese as for
Caucasian students. In what direction and to what extent would
the predictive relationship (i.e., the beta weight or regression
slope) be affected by lack of invariance in the self-esteem
measure? How would the relationship be biased if the outcome
measure, life satisfaction, also lacks invariance? In what
direction would the group means for self-esteem and life
satisfaction be biased? To address these issues, three
simulation studies were conducted to fill in this gap.
Overview of Present Simulation Studies
Many researchers have discussed the importance of testing
measurement invariance and are well aware that lack of
invariance can lead to possible bias in conclusions (e.g.,
Widaman &
Reise, 1997). However, this is the first
investigation that examines both the direction and degree of
bias resulting from various forms of non-invariance in
cross-cultural research. This information could be vital to
researchers when interpreting findings based on non-invariant
measures, because it can warn readers by specifying the
direction and degree of bias in each cultural or ethnic group,
given that the requirements for measurement invariance are often
difficult to meet in applied research. Second, this is also the
first study in which the simulation conditions are based on the
empirical findings in the cross-cultural literature, and it
therefore maximizes the external validity of the study. Third,
this investigation is particularly relevant to the
cross-cultural study of personality and social psychological
phenomena.
There are three major goals in the present investigation: (a)
to examine bias in regression slopes (beta weights) when factor
loadings are not invariant, as factor loading invariance is a
prerequisite for regression slope comparisons (e.g., When using
self-esteem to predict subjective well-being, how would the
predictive relationship be affected if the factor loadings of
self-esteem were different across groups?); (b) to explore bias
in means when factor loadings are not invariant, because factor
loading invariance is also a prerequisite for proper mean
comparisons (e.g., How would group means be biased when factor
loadings of self-esteem differ?); (c) to investigate bias in
means when intercepts (i.e., point of origin) are not invariant,
as intercept invariance is a prerequisite for mean comparisons,
in addition to factor loading invariance (Widaman & Reise,
1997; e.g., When one group has higher
intercepts in self-esteem than the other group, in what
direction would the means be biased in each group?). Given the
computational complexity and intensity, the Mplus software
program (Muthén & Muthén,
1998) was used to conduct the
simulation.
Study 1: Lack of Loading Invariance and Bias in Regression
Slopes
As discussed earlier, lack of invariance in factor loadings can
come from insufficient overlap in meaning of a construct between
cultural groups, inappropriate content of the items, translation
problems, the tendency to use or avoid extreme responses on a
response scale, differential responses to positively versus
negatively worded items, and other sources. Study 1 was
conducted to examine predictive bias between two constructs when
a predictor or an outcome measure lacks invariance in factor
loadings. This would allow us to examine bias in a predictive
relationship, such as using self-esteem to predict life
satisfaction across groups. When bias is found, one may discover
a bogus interaction effect of culture by predictor. For example,
self-esteem may be found to be a stronger predictor of
life-satisfaction for Caucasians than for Chinese, when in fact
the relationship is the same for both groups.
Design
To systematically examine bias in
regression slopes when the predictor or criterion lacks
loading invariance and to maximize the external validity
simultaneously, 4 (Proportion of Non-invariance: 87.5%, 75%,
50%, and 25%) × 2 (Pattern of Invariance: uniform
vs. mixed) × 2 (Ratio of Sample Size: 1 vs. 1, 4
vs. 1; total N = 300) experimental
conditions were generated (see Appendix for
detailed model parameters and additional justification for
parameter selections).
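The resulting factorial design can be enumerated directly. The group-size splits below (150/150 and 240/60) are not stated explicitly in the text but follow from the total N = 300 and the 1:1 and 4:1 ratios:

```python
from itertools import product

proportions = [0.875, 0.75, 0.50, 0.25]  # proportion of non-invariant loadings
patterns = ["uniform", "mixed"]          # pattern of non-invariance
sample_splits = [(150, 150), (240, 60)]  # 1:1 and 4:1 splits of N = 300

# Cross the three factors to obtain the 4 x 2 x 2 = 16 experimental cells.
conditions = list(product(proportions, patterns, sample_splits))
n_cells = len(conditions)
```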
The proportion of non-invariance
conditions correspond approximately to the findings in the
author's literature review: 26 of the 97 comparisons had at
least 90% of the loadings higher in the reference group, 48
of the comparisons had at least 75% of the loadings higher
in the reference group, 81 of them had at least 50% of the
loadings higher in the reference group, and 94 of them had
at least 30% of the loadings higher in the reference group.
The proportion of non-invariance was
varied to serve two purposes: (a) to maximize the external
validity of the study (as found in the author's literature
review, in many of the applications, only a proportion of
the items, rather than all the items, in a scale are
non-invariant); (b) to explore whether the degree of bias corresponds monotonically to the degree of non-invariance (i.e., to examine whether a greater
degree of non-invariance in factor loadings leads to a
greater degree of bias in regression slopes, which is
particularly important when the power of testing measurement invariance is considered). This issue is addressed further in the
discussion.
In the uniform pattern of non-invariance
condition, all non-invariant loadings were set higher in the
reference group (e.g., United States) than in the focal
group (e.g., China). In the mixed pattern of non-invariance
condition, about half of the items were set higher in the
reference group, whereas the other half were set higher in
the focal group. This condition was designed to match the
finding in the review as well, given that 7 of the 97
comparisons showed this pattern of non-invariance. The ratio
of sample size (1 vs. 1 and 4 vs. 1) also reflects the
findings in the review, because among 36.4% of the
comparisons, the ratio of sample size was less than 1.5, and
the average ratio of sample size was 4.67 across all
comparisons. 6 Finally, given that in applied
research, both the predictor and outcome variable may lack
invariance, such a condition was also examined. For
simplicity, the degree and direction of bias were equivalent
in both variables, and only the uniform condition was
considered.
The expected mean and covariance
structures were generated in Version 3.01 of Mplus
(Muthén & Muthén,
1998), and maximum likelihood estimation
was used to estimate models. First, a population matrix was
generated, corresponding to the parameterization of a target
two-group model. In the target model, the factor loadings
were different between the groups (except for the marker
variable, 7 which was set equal across the
groups); all other parameters (i.e., factor variance and
covariance, and residual variances) were set equal across
the groups. Second, a configural invariance model was fit to
the generated population matrix, in which the pattern of the
factor loadings was the same (i.e., the same item loaded on
the same factor[s]), whereas all loadings were freely
estimated in both groups. Third, a factor loading invariance
model was fit to the population matrix, in which all the
loadings were equated across the groups. Regression slopes obtained from the loading invariance model were then compared with the true values from the configural invariance model to determine the direction and degree of bias in the regression slopes.
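The first step, generating the population covariance matrix from the target two-group model, amounts to computing Σ(g) = Λ(g) Φ Λ(g)′ + Θ(g) for each group. A sketch with hypothetical parameter values (the article's exact parameters appear in its Appendix):

```python
import numpy as np

def implied_cov(loadings, phi, residual_vars):
    """Model-implied covariance matrix: Sigma = Lambda Phi Lambda' + Theta."""
    lam = np.asarray(loadings, dtype=float)  # items x factors
    return lam @ phi @ lam.T + np.diag(residual_vars)

# Factor variances/covariance (e.g., self-esteem and life satisfaction).
phi = np.array([[1.0, 0.5],
                [0.5, 1.0]])

# Two factors, three items each; the marker variable (first item per
# factor) is fixed equal across groups to set the scale.
lam_ref = np.array([[1.0, 0.0], [0.8, 0.0], [0.8, 0.0],
                    [0.0, 1.0], [0.0, 0.8], [0.0, 0.8]])
lam_foc = lam_ref.copy()
lam_foc[1:3, 0] = 0.6  # non-invariant (weaker) loadings in the focal group

theta = [0.4] * 6      # residual variances, equal across groups
sigma_ref = implied_cov(lam_ref, phi, theta)
sigma_foc = implied_cov(lam_foc, phi, theta)
```

These two population matrices are then the input to which the configural and loading-invariance models are fit.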
Results
Tables 1 and 2 present bias in regression slopes
when the predictor lacks loading invariance and when the
criterion lacks loading invariance, respectively.
Table 3 displays the results when both the predictor and the
criterion violate loading invariance. Relative bias was calculated by subtracting the true regression slope from the estimated slope in the loading invariance model and then dividing the difference by the true regression slope. A positive value indicates that the slope was overestimated, and a negative value indicates that the slope was underestimated.
When a Predictor Lacks Factor Loading Invariance: Bias in Regression Slopes (Study 1)
When a Criterion Lacks Factor Loading Invariance: Bias in Regression Slopes (Study 1)
When Both the Predictor and Outcome Variable Lack Factor Loading Invariance: Bias in Regression Slopes (Study 1)
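The relative-bias computation reduces to a one-liner, with positive values indicating overestimation:

```python
def relative_bias(estimated, true):
    """Relative bias of an estimated regression slope: (estimated - true) / true.
    Positive -> slope overestimated; negative -> slope underestimated."""
    return (estimated - true) / true

# Illustrative values: an estimated slope of .36 against a true slope of .30
# gives a relative bias of about .20, i.e., a 20% overestimate.
bias = relative_bias(0.36, 0.30)
```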
Predictor or criterion lack of loading
invariance—uniform
When the predictor, such as
self-esteem, lacked loading invariance, the regression
slope was underestimated in the reference group (e.g.,
United States) but overestimated in the focal group
(e.g., China). For example, in the case of self-esteem
predicting life satisfaction, when self-esteem is a
better measure for Americans than for Chinese, the
predictive relationship is weaker for Americans than for
Chinese, even when the true relationship (as specified
in the simulation) is the same for both groups. As a
result, an artificial interaction effect of Culture
× Self-Esteem is created. The degree of bias
(i.e., the extent to which the slope is overestimated or
underestimated, or the artificially created group
difference in the slope) is affected by the proportion
of non-invariant items, group membership, and ratio of
sample size (i.e., sample size of the reference group
vs. focal group). That is, when the proportion of
non-invariant items increases, bias increases; bias is
bigger in the focal group than in the reference group,
especially when the proportion of non-invariance is
large. When sample size increases in the reference group
relative to the focal group, bias decreases in that
group but increases in the focal group.
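The direction of this bias can be illustrated with a stylized moment-matching sketch. It is my own single-factor simplification, not the paper's Mplus models: when unequal loadings are constrained to a common value, the estimated factor variance must absorb the difference, which rescales the slope by the reciprocal of the loading ratio.

```python
def constrained_slope(lam_true: float, lam_common: float, b_true: float) -> float:
    """Implied slope when items with true loading lam_true are forced to a
    common loading lam_common (simplified moment matching).

    Inter-item covariances lam_true**2 * var(xi) must be reproduced as
    lam_common**2 * phi_hat, so phi_hat = (lam_true/lam_common)**2; the
    factor-criterion covariance lam_true * b_true is reproduced as
    lam_common * phi_hat * b_hat.
    """
    phi_hat = (lam_true / lam_common) ** 2               # distorted factor variance
    return (lam_true * b_true) / (lam_common * phi_hat)  # = (lam_common/lam_true) * b_true

b_true = 0.5                           # same latent slope in both groups
lam_ref, lam_foc = 0.9, 0.5            # e.g., United States vs. China
lam_common = (lam_ref + lam_foc) / 2   # compromise loading under the constraint

print(constrained_slope(lam_ref, lam_common, b_true))  # ~0.39: underestimated (reference)
print(constrained_slope(lam_foc, lam_common, b_true))  # ~0.70: overestimated (focal)
```

The artificial Culture × Self-Esteem interaction appears because the same true slope of 0.5 is recovered as roughly 0.39 in the higher-loading group but roughly 0.70 in the lower-loading group.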
When the criterion lacked loading
invariance, the opposite pattern was found, that is, the
regression slope was overestimated in the reference
group (e.g., United States) but underestimated in the
focal group (e.g., China). Given the same example, when
life satisfaction is a more appropriate instrument for
Americans than for Chinese, the regression slope is
larger for Americans than for Chinese, even when the
predictive relationship is the same for both groups.
Consequently, lack of invariance in life satisfaction
creates a pseudo interaction effect of Culture
× Self-Esteem. As in the case when the
predictor is non-invariant, degree of bias is affected
by the proportion of non-invariance, group membership,
and ratio of sample size. That is, when the proportion
of non-invariant items increases, bias increases; bias
is bigger in the focal group than in the reference
group, especially when the proportion of non-invariant
items is large. When the reference group has a larger
sample size, bias decreases in that group but increases
in the focal group.
Predictor or criterion lacking loading
invariance—mixed
When the pattern of non-invariant
items in the predictor was mixed, bias in the regression
slope was reduced in both groups. Similarly, when the
pattern of lack of loading invariance in the criterion
was mixed, bias in the regression slope was also reduced
in both groups. Thus, when some of the loadings are
higher in the reference group and some are higher in the
focal group, artificially created group difference in
the predictive relationship is reduced because bias
associated with the reference group and bias associated
with the focal group tend to cancel each other out.
However, reduced bias in regression slopes does not
imply that the measures are invariant.
Both predictor and criterion lacking loading
invariance—uniform
When both the predictor, such as
self-esteem, and the outcome variable, such as life
satisfaction, lacked loading invariance, and when the
direction and degree of non-invariance were comparable
in both groups, bias was reduced. However, this result
does not imply that using non-invariant measures
simultaneously in the predictor and the criterion is the
solution to lack of measurement invariance. Instead, it
suggests that when lack of invariance occurs in both the
predictor and the outcome variable, statistical bias
associated with the non-invariant predictor and bias
associated with the non-invariant outcome variable tend
to cancel each other out.
Summary
The results of Study 1 indicate that
lack of factor-loading invariance could lead to
substantial bias in regression slopes. The direction of
bias depends on whether a predictor or criterion lacks
invariance. When the reference group had higher loadings
in the predictor, the regression slope was
underestimated in the reference group but overestimated
in the focal group. When the reference group had higher
loadings in the criterion, the opposite pattern was
found. Under both conditions, a bogus interaction effect
was produced. However, when some of the loadings were
higher in the reference group and some were higher in
the focal group, bias in the regression slopes was
reduced. When lack of loading invariance occurred in
both the predictor and outcome variable, bias was also
reduced. However, the construct validity of the scales
is still in question, as they may measure different
concepts in different cultures.
Study 2: Lack of Loading Invariance and Bias in Means
The goal of Study 2 was to explore bias in means when factor
loadings are not invariant, given that loading invariance is a
prerequisite for mean comparisons. The experimental conditions
were the same as in Study 1, except that the tested model was a
one-factor measurement model with no predictor or criterion
involved. Intercepts and residual variances were set equal
across the groups in the target model. Model fitting procedures
were also similar to those in Study 1, except that in Step 2,
both factor loadings and intercepts were equated. Relative bias
was calculated by subtracting the true factor mean from the
mean estimated under the invariance model and then dividing the
difference by the true factor mean. A positive value indicates
that the mean was overestimated, and a negative value indicates
that the mean was underestimated.
Results
Bias in factor means resulting from lack
of loading invariance is presented in Table 4. When the reference group
(e.g., United States) had higher loadings, the factor mean
was overestimated in the reference group but underestimated
in the focal group (e.g., China). As a result, an artificial
group difference was created. The degree of bias was
affected by the proportion of non-invariance, ratio of
sample size, and pattern of non-invariance. That is, when
lack of loading invariance was uniform, as the proportion of
non-invariant items increased, bias increased; the degree of
bias was larger in the focal group than in the reference
group. When sample size increased in the reference group
relative to the focal group, bias decreased in that group
but increased in the focal group. In contrast, when lack of
loading invariance was mixed, bias in the factor mean was
minimized in both groups. As discussed earlier, lack of bias
in the means does not imply the construct is equivalent
across groups.
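The direction of this mean bias follows from the same moment-matching logic used above (again my own simplification, not the paper's fitted models): with invariant intercepts, an item mean equals tau + lam * alpha, so forcing a common loading rescales the estimated factor mean by the loading ratio.

```python
def constrained_factor_mean(lam_true: float, lam_common: float,
                            alpha_true: float) -> float:
    """Implied factor mean when the true loading lam_true is constrained to
    lam_common while intercepts are held invariant: the observed item mean
    tau + lam_true * alpha_true must be reproduced as
    tau + lam_common * alpha_hat."""
    return (lam_true / lam_common) * alpha_true

alpha_true = 1.0                       # same latent mean in both groups
lam_ref, lam_foc = 0.9, 0.5            # reference group has higher loadings
lam_common = (lam_ref + lam_foc) / 2   # compromise loading under the constraint

print(constrained_factor_mean(lam_ref, lam_common, alpha_true))  # >1: overestimated
print(constrained_factor_mean(lam_foc, lam_common, alpha_true))  # <1: underestimated
```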
Table 4: When Loadings Lack Invariance: Bias in Factor Means (Study 2)
Study 3: Lack of Intercept Invariance and Bias in Means
Study 3 was conducted to investigate the impact of lack of
intercept (i.e., point of origin) invariance on factor means,
given that intercept invariance is the prerequisite for factor
mean comparisons. A 4 (Proportion of Non-invariance: 100%, 75%,
50%, 25%) × 2 (Pattern of Invariance: uniform vs.
mixed) × 2 (Ratio of Sample Size: 1 vs. 1, 4 vs. 1;
total N = 300) design was created. Factor
loadings and residual variances were set equal across the groups
in the target model (see Appendix for detailed model
parameters). As in Studies 1 and 2, Mplus was used to generate
the mean and covariance structure, and model-fitting procedures
were similar to those in previous studies.
Results
Lack of intercept (i.e., point of origin)
invariance can lead to appreciable bias in factor means (see
Table
5). The direction of bias depends on the direction of
intercept differences. When the reference group (e.g.,
United States) has higher intercepts than the focal group
(e.g., China), that is, when a U.S. 4 is equal to a Chinese
3, the factor mean is overestimated in the reference group
but underestimated in the focal group. 8 The degree
of bias depends on the degree of non-invariance and ratio of
sample size. The larger the degree of non-invariance, the
larger the bias is in both groups. Consistent with the
findings in Studies 1 and 2, when the reference group
had a larger sample size, bias became smaller in that group
but larger in the focal group; when the pattern of intercept
non-invariance was mixed, that is, when some of the
intercepts were higher in the reference group, whereas
others were higher in the focal group, bias in the means was
substantially reduced in both groups. Once again, the
reduced bias does not indicate that the measures are
invariant.
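The "a U.S. 4 is equal to a Chinese 3" situation can be illustrated with a toy simulation (the numbers are my own, not drawn from Table 5): both groups share the same latent mean, but lower item intercepts in the focal group depress its observed item means.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
latent_mean = 3.0                     # identical true level in both groups

def observed_item_mean(intercept: float, loading: float = 1.0) -> float:
    """Mean of a simulated item rating: intercept + loading*latent + error."""
    latent = latent_mean + 0.5 * rng.standard_normal(n)
    return float((intercept + loading * latent + 0.3 * rng.standard_normal(n)).mean())

us_mean = observed_item_mean(intercept=1.0)  # reference group: higher intercept
cn_mean = observed_item_mean(intercept=0.0)  # focal group: "a U.S. 4 equals a Chinese 3"
print(round(us_mean - cn_mean, 1))           # 1.0: a spurious "cultural difference"
```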
Table 5: When Intercepts Lack Invariance: Bias in Factor Means (Study 3)
Discussion
To make valid comparisons across different cultural or ethnic
groups, we must ensure that we are not comparing chopsticks with
forks. Given that researchers often import measures developed
for one cultural group to other populations, the issue of
measurement invariance becomes a serious challenge. Findings
from Study 1 indicate that lack of factor loading invariance can
produce artificial interaction effects in predictive
relationships. Results of Studies 2 and 3 demonstrate that lack
of loading and intercept (i.e., point of origin) invariance can
lead to bogus cultural differences in means.
Comparison of the Current Investigations With
Millsap and
Kwok's (2004) Studies
Unlike the current studies,
Millsap and
Kwok (2004) did not examine bias in
regression slopes and means due to lack of invariance in
factor loadings or intercepts. However, there is some
comparability between the two independent investigations.
Millsap and Kwok studied selection accuracy by comparing
selection rate based on the distribution of a sum score
(i.e., pooled score from two different groups) and selection
rate based on the distribution of the latent factor score
for each group. They also studied sensitivity (i.e., the
number of individuals who were selected on the basis of both
the pooled sum score and their latent mean score over the
number of individuals who were selected solely on the basis
of their latent mean score). It was found that both the
selection rate and sensitivity were artificially increased
in the reference group (e.g., United States) but decreased
in the focal group (e.g., China) when the reference group
had higher loadings and intercepts than the focal group. The
results of the success ratio (i.e., the number of
individuals who were selected on the basis of the pooled sum
score and their latent mean over the number of individuals
who were selected solely on the basis of the pooled sum
score) also favored the reference group. These findings are
consistent with the results from the current studies, in
which the means were overestimated in the reference group
but underestimated in the focal group when factor loadings
or intercepts favored the reference group. Also as found in
the present studies, as the proportion of non-invariant
items increased, the degree of bias increased accordingly.
In addition, when the reference group had a larger sample
size, bias decreased in that group but increased for the
focal group, a result obtained in the current study as well.
These similar patterns of findings across the two
investigations provide support for the validity of the
current studies.
Implications in Cross-Cultural Research
Given the high incidence of measurement invariance
violations in cross-cultural studies, these
findings cast serious doubt on the conclusions drawn from
past cross-cultural research. For example, a robust
cross-cultural finding is that North Americans have higher
self-esteem than East Asians (e.g., Oishi & Sullivan,
2005). However, in light of the findings
from the current simulation studies and the author's review
in this article of cross-cultural differences in self-esteem
reliability, the observed cultural difference in
self-esteem is, at least in part, due to the lower reliability
(an indication of lower factor loadings) of the self-esteem
scale (Rosenberg, 1965) for East Asians. In addition,
East Asians' value of modesty toward one's personal
attributes (Markus
& Kitayama, 1991) could have
contributed to this cultural difference. This is because the
self-effacing tendency results in lower intercepts in item
ratings, which in turn lead to lower means. Most of the
current self-esteem measures focus on the inner aspect of
self-esteem or feelings of self-competence, which might be
more relevant to North Americans. For East Asians, the
social aspect of self-esteem (i.e., being accepted and
valued by other people) might be more important. Future
research should develop scales that measure self-esteem in a
culturally appropriate manner, such as by including both the
inner and social aspects of self-esteem.
The present literature review on existing
measures encompasses a wide range of topics, including
personality, depression, stress reaction, social competence,
cognitive ability, emotional intelligence, life
satisfaction, organizational commitment, affect,
self-concept, self-esteem, anxiety, and attachment. When
these scales are used as predictors, the predictive
relationship is likely to be underestimated in the reference
group (e.g., United States) but overestimated in the focal
group (e.g., China), and the opposite is likely to happen
when these scales are used as outcome measures. Perhaps the
most routine use of these scales is the comparison of means
across different cultural groups. Most likely, the means are
artificially inflated in the reference group but deflated in
the focal group, given the lower loadings in the latter
group. In particular, for measures related to self-concept,
self-esteem, and satisfaction with life, the means are
likely to be underestimated for East Asians but
overestimated for North Americans, given that both the
loadings and intercepts (resulting from conceptual
differences and the modesty tendency) are likely to be lower
for East Asians. For other measures, the direction of bias
in the means is difficult to predict, given the uncertainty
in intercept differences.
As discussed earlier, measurement
invariance is still assumed, rather than tested, in many
applications. When we fail to examine measurement
invariance, we may uncover spurious “cultural
differences” that are in fact artifacts of
measurement, or we may fail to reveal true cultural
differences that are masked by measurement artifacts and
could have been discovered had we used an invariant
instrument. Results of the present studies also suggest that
we are more likely to draw erroneous conclusions for the
focal group (e.g., Asian Americans) than for the reference
group (e.g., European Americans) when comparing different
ethnic groups, given that the focal group often has a much
smaller sample size than the reference group. If erroneous
conclusions were used to guide school admission, medical
diagnosis, personnel selection and promotion, clinical
trials, or health and education prevention programs, serious
consequences could occur. Healthy people can be falsely
diagnosed and sick ones overlooked. Results of these studies
highlight the importance of testing measurement invariance
in cross-cultural comparisons and the significance of
understanding the consequences of lack of invariance.
Implications in Testing Measurement Invariance
The present investigation has important
implications for testing measurement invariance. It suggests
that we may need a more dynamic approach to evaluating
measurement invariance. In other words, measurement
invariance should be tested within the context of its impact
on the statistics that a researcher is comparing. The
conventional wisdom (e.g., Widaman & Reise,
1997) is that we should test measurement
invariance as a first step in group comparisons. When
measurement invariance is achieved at the appropriate level,
we then move to the next step, which is making group
comparisons. When measurement invariance is not achieved, we
should avoid making group comparisons until an invariant
measure is available. However, when results from this
investigation are interpreted with findings from a series of
recent simulation studies (F. F. Chen, 2007), the
picture is much more complex: The relation between the
probability of detecting non-invariance and the degree of
bias in group comparisons resulting from non-invariance is
not straightforward. Counterintuitively, when both the degree of
non-invariance and its corresponding bias in statistics are
the highest, the probability of revealing non-invariance is
the lowest (F. F.
Chen, 2007); when the degree of
non-invariance and associated bias are only moderate, the
probability of detecting non-invariance is the highest. In
addition, bias is larger when lack of invariance is uniform,
rather than mixed; however, the likelihood of detecting lack
of invariance is smaller when lack of invariance is uniform,
rather than mixed. These findings indicate that meeting the
standards of measurement invariance does not guarantee
unbiased group comparisons. On the other hand, the
discovery of lack of invariance may not result in
statistical bias in group comparisons, depending on the
pattern of non-invariance. Nevertheless, lack of statistical
bias in group comparisons does not imply that the constructs
are comparable at the conceptual level.
The reduced bias in regression slopes and
means due to a mixed pattern of non-invariance also has
implications for comparing constructs that are composed of
common aspects (i.e., shared by different cultural or ethnic
groups), as well as unique components (i.e., specific to
each group). These constructs will not meet the standards of
measurement invariance, as culture-specific items are unique
to each culture. However, the results of the present
investigation suggest that if these culturally unique items
are balanced across groups, it is possible to make unbiased
comparisons. Conceptually, however, it is still arguable
whether a construct is comparable when culturally unique
components are involved.
Implications of these studies go beyond
cross-cultural research. Measurement invariance is an
important issue whenever heterogeneous groups are involved.
The groups can be gender, age in longitudinal research, or
treatment and control groups in experimental and prevention
studies. For example, Smith and Reise (1999)
conducted a study to examine gender differences in
neuroticism using the Revised NEO Personality Inventory
Neuroticism scale (Costa & McCrae, 1992). It
was found that several items related to being sensitive to
interpersonal stress tended to inflate women's scores,
whereas several items related to tension and worry tended to
inflate men's scores. Similarly, in longitudinal studies,
the meaning of a construct may change over time. For
example, the way people display racism is more subtle in the
21st century than in the 1960s. An instrument developed to
measure explicit racism in the 1960s may not be able to
capture the more subtle and implicit nature of the construct
today. In experimental studies, when a treatment is
introduced, it has the potential to change the meaning of
the constructs under study.
Recommendations—When Invariance Fails
This investigation systematically
examined the direction and degree of bias under varying
conditions of non-invariance. The results can be
particularly useful for substantive researchers in deciding
whether a comparison should be made in the face of lack of
measurement invariance. As discussed earlier, the goal of
testing measurement invariance is to ensure that group
comparisons are valid. However, it is a challenging task to
achieve measurement invariance in cross-cultural research. A
variety of factors, such as translation, inappropriate item
coverage, different response format and style, and social
desirability, can affect the psychometric properties of
instruments when different cultural or ethnic minority
groups are compared (e.g., Van de Vijver & Leung,
2000). What should a researcher do when
invariance fails? On the basis of current simulations,
readers may be tempted to make the following inference: If
we allow the non-invariant factor loadings (and/or
intercepts) to vary across groups (i.e., if we do not
impose measurement invariance under the condition of
non-invariance), bias in statistics (e.g., regression slopes
or means) will not occur, and thus, it is appropriate to
make group comparisons. However, there are two issues
associated with this line of reasoning. First, when a
construct does not meet the standards of measurement
invariance, it implies that, conceptually, the construct
conveys different meanings in different groups. Second, lack
of invariance can introduce bias in statistics indirectly,
even when measurement invariance is not imposed. If the
construct had been measured appropriately, the regression
slopes (and/or means) would be different.
Dealing with non-invariant scales has
become one of the unresolved questions in measurement
invariance research (Millsap, 2005). As
Millsap and
Kwok (2004) point out, four typical
approaches have been suggested in practice. The first option
is to eliminate the non-invariant items, which results in
many different versions of a scale for different groups
(Cheung
& Rensvold, 1998). It can
also lead to incomplete coverage of the construct. The
second choice is to keep all non-invariant items in the
scale, and thus, the sum/mean score contains both invariant
and non-invariant items. The assumption of this approach is
that the non-invariant items may introduce little bias in
group comparisons. As discovered in the present study and
Millsap and
Kwok's (2004) work, this is an
assumption about which we cannot be confident. As found in
the present literature review, it is common to have a
proportion of the items invariant and another proportion
non-invariant. The third option is a partial
measurement-invariance model, proposed to address this issue
(Byrne, Shavelson, & Muthen, 1989). This
approach constrains the invariant items to be equal across
the groups while allowing the non-invariant items to be
different, and it seems less likely to introduce statistical
bias, compared with the mean/sum score methods, as the
non-invariant items are not forced to be invariant. However,
as discussed earlier, some critical questions are still not
addressed: What would the regression slopes and means be had
the construct been measured properly in different cultures?
Under what conditions should one employ a partial invariance
model? As the proportion of non-invariant items increases,
confidence decreases about the validity of this approach.
Even when only a small proportion of the items are
different, the following conceptual questions remain: Why
are those items different? Is it due to specific samples or
due to the scale? How could those aspects of the construct
be measured differently? What are the implications for
rethinking the construct? It is important to take one step
further to examine the non-invariant items, as well as the
conceptualization of the construct. The fourth option is to
avoid making direct group comparisons. Other researchers
have suggested that it seems reasonable to statistically
adjust for bias introduced by non-invariant items
(Cheung
& Rensvold, 1998). However,
there are currently no sound methods for achieving this
goal. Finally, lack of measurement invariance in a
one-factor model may indicate more factors or more complex
loading patterns. Once additional factors or different
factor-loading patterns are allowed, measurement invariance
can be achieved (McArdle & Cattell, 1994;
Meredith,
1993).
This article recommends a different
approach. When measurement invariance is not achieved at an
appropriate level, a researcher may still wish to draw some
useful conclusions with regard to cross-cultural comparisons
after spending a tremendous amount of time, effort, and
resources. It is possible that the consequences of lack of
invariance on the research questions are limited. To help
researchers decide when it is appropriate to make group
comparisons when facing lack of invariance, the following
steps are proposed: (1) Testing measurement invariance for
each construct independently and examining the pattern of
lack of invariance. The goal is to understand the degree and
direction of lack of invariance. As demonstrated in the
present investigation, when lack of invariance is uniform,
bias in regression slopes or means tends to occur; however,
when lack of invariance is mixed, bias tends to be reduced.
(2) Imposing corresponding invariance constraints on
invariant items (e.g., factor loading invariance for
comparing regression slopes and loading and intercept
invariance for comparing means). (3) Imposing corresponding
invariance constraints on non-invariant items, as well as
invariant items. (4) Comparing groups on statistics under
study, such as regression coefficients and means, with and
without imposing corresponding invariance constraints on
non-invariant items (i.e., comparing results from Step 2 and
Step 3 to determine the discrepancy between the statistics
under study). The purpose is to understand the impact of
non-invariance on these statistics. If the differences are
small, it may be justifiable to make group comparisons.
However, future research should examine the effect size of
these group differences as well as their practical
implications. The reader is warned again that trivial
differences in statistics do not imply that the construct is
conceptually equivalent.
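The Step 4 discrepancy check might be operationalized as in the sketch below; the helper function and the slope values are hypothetical placeholders, not part of the proposed procedure's formal specification (in practice, the estimates come from the two fitted multiple-group models of Steps 2 and 3).

```python
def relative_discrepancy(partial_est: float, full_est: float) -> float:
    """Relative change in a statistic when non-invariant items are also
    constrained (Step 3) versus left free (Step 2)."""
    return (full_est - partial_est) / partial_est

# Hypothetical group-specific regression slopes from the two models:
slopes_partial = {"reference": 0.48, "focal": 0.51}  # Step 2 estimates
slopes_full = {"reference": 0.45, "focal": 0.56}     # Step 3 estimates

for group in slopes_partial:
    d = relative_discrepancy(slopes_partial[group], slopes_full[group])
    print(f"{group}: {d:+.1%}")
```

If such discrepancies are small, group comparisons may be justifiable; if they are large, the non-invariant items are driving the statistics, and comparisons warrant caution.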
Limitations
A major assumption in the present
investigation is that all variables are continuous and are
normally distributed. When the response scales are discrete
categories, alternative estimation methods, such as weighted
least squares (Bollen, 1989), or alternative
frameworks, such as item response theory (Embretson & Reise,
2000), should be used. One direction in
future research is to systematically examine bias under
various levels of invariance for categorical variables. This
project also focused on only two-group comparisons, and
future research should expand the scope to more groups, as
in many applications, several different cultural groups can
be involved at the same time.
Multiple group confirmatory factor
analysis provides a method of testing construct equivalence
(i.e., whether the same construct is measured across
cultural groups). Construct equivalence, however, cannot
always be tested statistically (Cheung & Rensvold,
2000), particularly when a construct has
a wider scope in one culture than in another. For example,
filial piety is a more highly elaborated construct in China
than it is in the West (Hsieh, 1967). Measuring
this kind of construct may require more items in one culture
than it does in the other. Therefore, a particular set of
items may be conceptually adequate for assessing a construct
in one culture, be inadequate in another culture, and yet
display measurement equivalence when tested across both
cultures. To avoid this type of bias, both common and
culturally specific features should be included in the
measurement. When this is the case, one would expect that
the common features are invariant, whereas the specific
features are not invariant across cultural groups.
Conclusion
Lack of measurement invariance can have a
significant impact on the conclusions drawn from group
comparisons. When measurement invariance is not achieved,
one may discover “spurious” group
differences that are in fact artifacts of measurement, or
one may miss true group differences that have been masked by
measurement artifacts. The implications of these simulation
studies go beyond cross-cultural research; they extend to any
situation in which heterogeneous groups are compared. These
groups include ethnicity, gender,
age, measurement occasion in longitudinal research, and
treatment/control groups in experimental and prevention
studies.
Footnotes
1 Residual variance consists of unique variance to that item and
random error. It is assumed that the expected value of the random
error is zero.
2 In
1993, Meredith published the most influential article on measurement
invariance.
3 The
search ended in August of 2006. Thirty-two more articles met the
first three criteria, but the factor loadings for each
cultural/ethnic group were not available, and these articles were
thus excluded from the analysis.
4 Reliability also depends on the variability of the scores.
5 A
composite score is formed by taking the sum of all items.
6 The
average ratio of sample size was 1.34 and 8.01 for standardized and
unstandardized comparisons, respectively. The average of these two
numbers was 4.67. An outlier of 47.15 was excluded from the
analysis.
7 A
model is identified only if there is a unique numerical solution for
each of the parameters (Ullman, 2001). One common approach is to
set one of the factor loadings to 1 for each factor, and that item
is called the marker variable.
8 This pattern of results holds as long as the marker variable is
invariant.
References
Berry, J.
W. (1969). On
cross-cultural comparability.
International Journal of Psychology,
2,
119–128.
Bollen, K.
A. (1989). Structural equations
with latent variables. New
York: Wiley.
Brewer,
M. B., &
Chen,
Y.-R.
(2007). Where (who) are
collectives in collectivism? Toward conceptual clarification
of individualism and collectivism.
Psychological Review, 114,
133–151.
Byrne,
B. M., &
Campbell, T.
L.
(1999). Cross-cultural comparisons
and the presumption of equivalent measurement and
theoretical structure: A look beneath the
surface. Journal of Cross-Cultural
Psychology, 30,
555–574.
Byrne,
B.,
Shavelson,
R., &
Muthen,
B.
(1989). Testing for the
equivalence of factor covariance and mean structures: The
issue of partial measurement invariance.
Psychological Bulletin,
105,
456–466.
Byrne,
B. M., &
Watkins,
D.
(2003). The issue of measurement
invariance revisited. Journal of
Cross-Cultural Psychology, 34,
155–175.
Campbell,
D. T., &
Boruch, R.
F.
(1975). Making the case for
randomized assignment to treatments by considering the
alternatives: Six ways in which quasi-experimental
evaluations in compensatory education tend to underestimate
effects. In C. A.Bennett & A. A.Lumsdaine (Eds.),
Evaluation and experiment: Some critical issues in
assessing social programs (pp.
195–296).
New York:
Academic Press.
Chen,
C.,
Lee, S.
Y., &
Stevenson, H.
W.
(1995). Response style and
cross-cultural comparisons of rating scales among East Asian
and North American students.
Psychological Science, 6,
170–175.
Chen, F.
F. (2007). Sensitivity of
goodness of fit indices to lack of measurement
invariance. Structural Equation
Modeling, 14,
464–504.
Chen,
F. F.,
Sousa, K.
H., &
West, S.
G.
(2005). Testing measurement
invariance of second-order factor models.
Structural Equation Modeling,
12,
471–492.
Chen,
F. F., &
West, S.
G.
(2008). Measuring individualism and
collectivism: The importance of considering different
components, reference groups, and measurement
invariance. Journal of Research in
Personality, 42,
259–294.
Cheung,
G. W., &
Rensvold, R.
B.
(1998). Cross-cultural comparisons
using non-invariant measurement items.
Applied Behavioral Science Review,
6,
93–110.
Cheung,
G. W., &
Rensvold, R.
B.
(2000). Assessing extreme and
acquiescence response sets in cross-cultural research using
structural equation modeling. Journal
of Cross-Cultural Research, 31,
187–212.
Costa,
P. T., &
McCrae, R.
R.
(1992). Professional manual: Revised NEO
Personality Inventory (NEO PI–R) and NEO five
factor inventory (NEO–FFI).
Odessa, FL:
Psychological Assessment
Resources.
Diener,
E.,
Emmons, R.
A.,
Larsen, R.
J., &
Griffin,
S.
(1985). The Satisfaction With
Life Scale. Journal of Personality
Assessment, 49,
71–75.
Embretson,
S. E., &
Reise, S.
P.
(2000). Item response theory for
psychologists.
Mahwah:
NJ: Erlbaum.
Fiske,
A. P.,
Kitayama,
S.,
Markus, H.
R., &
Nisbett, R.
E.
(1998). The cultural matrix of social
psychology. In D. T.Gilbert, S. T.Fiske, & G.Linzey (Eds.), Handbook
of social psychology (4th ed.,
pp.
915–981).
Boston:
McGraw-Hill.
Heine,
S. J.,
Lehman, D.
R.,
Markus, H.
R., &
Kitayama,
S.
(1999). Is there a universal need
for positive self-regard?Psychological Review, 106,
766–794.
Heine,
S. J.,
Lehman, D.
R.,
Peng,
K., &
Greenholtz,
J.
(2002). What's wrong with
cross-cultural comparisons of subjective Likert scales? The
reference group effect. Journal of
Personality and Social Psychology,
82,
903–918.
Holland,
P. W., &
Thayer, D.
T.
(1988). Differential item performance
and the Mantel-Haenszel procedure. In H.Wainer & H.Braum (Eds.), Test
validity (pp.
129–145).
Hillsdale, NJ:
Erlbaum.
Horn,
J. L.,
McArdle, J.
J., &
Mason,
R.
(1983). When is invariance not
invariant: A practical scientist's look at the ethereal
concept of factor invariance.
Southern Psychologist, 4,
179–188.
Hsieh, Y.-W. (1967). Filial piety and Chinese society. In C. A. Moore (Ed.), The Chinese mind: Essentials of Chinese philosophy and culture (pp. 165–187). Honolulu: University of Hawaii Press.
Hui, C. H., & Triandis, H. C. (1985). Measurement in cross-cultural psychology: A review and comparison of strategies. Journal of Cross-Cultural Psychology, 16, 131–152.
Irvine, S. H., & Carroll, W. K. (1980). Testing and assessment across cultures: Issues in methodology and theory. In H. C. Triandis & J. W. Berry (Eds.), Handbook of cross-cultural psychology (Vol. 2, pp. 181–244). Newton, MA: Allyn & Bacon.
Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36, 409–426.
Jöreskog, K. G., & Sörbom, D. (1999). LISREL 8: User's reference guide (2nd ed.). Chicago: Scientific Software International.
Kwan, V. S. Y., Bond, M. H., Boucher, H. C., Maslach, C., & Gan, Y. (2002). The construct of individuation: More complex in collectivist than in individualist cultures. Personality and Social Psychology Bulletin, 28, 300–310.
Lehman, D., Chiu, C., & Schaller, M. (2004). Psychology and culture. Annual Review of Psychology, 55, 689–717.
Little, T. D. (1997). Mean and covariance structures (MACS) analyses of cross-cultural data: Practical and theoretical issues. Multivariate Behavioral Research, 32, 53–76.
Markus, H. R., & Kitayama, S. (1991). Culture and the self: Implications for cognition, emotion, and motivation. Psychological Review, 98, 224–253.
Maslach, C., Stapp, J., & Santee, R. T. (1985). Individuation: Conceptual analysis and assessment. Journal of Personality and Social Psychology, 49, 729–738.
McArdle, J. J., & Cattell, R. B. (1994). Structural equation models of factorial invariance in parallel proportional profiles and oblique confactor problems. Multivariate Behavioral Research, 29, 63–113.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525–543.
Millsap, R. E. (2005). Four unresolved problems in studies of factorial invariance. In A. Maydeu-Olivares & J. McArdle (Eds.), Contemporary psychometrics: A festschrift for Roderick P. McDonald (Multivariate applications book series, pp. 153–171). Mahwah, NJ: Erlbaum.
Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297–334.
Millsap, R. E., & Kwok, O.-M. (2004). Evaluating the impact of partial factorial invariance on selection in two populations. Psychological Methods, 9, 93–115.
Moorman, R. H., & Podsakoff, P. M. (1992). A meta-analytic review and empirical test of the potential confounding effects of social desirability response sets in organizational behavior research. Journal of Occupational and Organizational Psychology, 65, 131–149.
Muthén, L. K., & Muthén, B. O. (1998). Mplus user's guide. Los Angeles: Muthén & Muthén.
Oishi, S., & Sullivan, H. W. (2005). The mediating role of parental expectations in culture and well-being. Journal of Personality, 73, 1267–1294.
Oyserman, D., Coon, H. M., & Kemmelmeier, M. (2002). Rethinking individualism and collectivism: Evaluation of theoretical assumptions and meta-analyses. Psychological Bulletin, 128, 3–72.
Peng, K., Nisbett, R. E., & Wong, N. Y. (1997). Validity problems comparing values across cultures and possible solutions. Psychological Methods, 2, 329–344.
Poortinga, Y. H. (1989). Equivalence of cross-cultural data: An overview of basic issues. International Journal of Psychology, 24, 737–756.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552–566.
Rhee, E., Uleman, J. S., & Lee, H. K. (1996). Variations in collectivism and individualism by ingroup and culture: Confirmatory factor analysis. Journal of Personality and Social Psychology, 71, 1037–1054.
Riordan, C. M., & Vandenberg, R. J. (1994). A central question in cross-cultural research: Do employees of different cultures interpret work-related measures in an equivalent manner? Journal of Management, 20, 643–671.
Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press.
Smith, L. L., & Reise, S. P. (1998). Gender differences on negative affectivity: An IRT study of differential item functioning on the Multidimensional Personality Questionnaire Stress Reaction Scale. Journal of Personality and Social Psychology, 75, 1350–1362.
Sörbom, D. (1978). An alternative to the methodology for analysis of covariance. Psychometrika, 43, 381–396.
Steenkamp, J.-B. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78–90.
Ullman, J. B. (2001). Structural equation modeling. In B. G. Tabachnick & L. S. Fidell (Eds.), Using multivariate statistics (pp. 653–771). Boston: Allyn & Bacon.
Van de Vijver, F. J. R., & Leung, K. (1997). Methods and data analysis for comparative research. In J. Berry, Y. Poortinga, & J. Pandey (Eds.), Handbook of cross-cultural psychology (Vol. 1, pp. 259–300). Boston: Allyn & Bacon.
Van de Vijver, F., & Leung, K. (2000). Methodological issues in psychological research on culture. Journal of Cross-Cultural Psychology, 31, 33–51.
Widaman, K. F., & Reise, S. P. (1997). Exploring the measurement invariance of psychological instruments: Applications in the substance use domain. In K. J. Bryant, M. Windle, & S. G. West (Eds.), The science of prevention: Methodological advances from alcohol and substance abuse research (pp. 281–324). Washington, DC: American Psychological Association.
APPENDIX
Appendix A: Model Parameters for Study 1
Appendix B: Model Parameters for Study 2
Appendix C: Model Parameters for Study 3
Submitted: September 5, 2005 Revised: May 9, 2008 Accepted: May 20, 2008
Copyright 2008 American Psychological Association
Source: Journal of Personality and Social Psychology, 95(5), 1005–1018. Digital Object Identifier: 10.1037/a0013193