What Happens If We Compare Chopsticks With Forks? The
Impact of Making Inappropriate Comparisons in Cross-Cultural
Research
By: Fang Fang Chen
University of Delaware
Acknowledgement: I would like to express appreciation to
Donna Coffman, Larry Cohen, Samuel Gaertner, Kimberly
Juliano, Shanhong Luo, Beth Morling, Kristopher Preacher,
Robert Simons, Stephen West, and Zugui Zhang for their
thoughtful comments. Special thanks go to Lyle Jones and
Roger Millsap for their insights on scale development and
measurement invariance. I am also grateful to the
Quantitative Forum in the Psychology Department at the
University of North Carolina at Chapel Hill for fruitful
discussion at the early stage of this work.
Correspondence concerning this article should be addressed to: Fang Fang Chen, Department of Psychology, University of Delaware, Wolf Hall, Newark, DE 19716. Electronic mail may be sent to [email protected].
Culture affects people in a variety of basic psychological
domains, including self-concept, attribution and reasoning,
interpersonal communication, negotiation, intergroup relations,
and psychological well-being (for review, see Brewer & Chen, 2007; Fiske, Kitayama, Markus, & Nisbett, 1998; Lehman, Chiu, & Schaller, 2004; Markus & Kitayama, 1991; Oyserman, Coon, & Kemmelmeier, 2002). Suppose we
were interested in studying self-esteem and life satisfaction in
the People's Republic of China and the United States. We may
wish to test the mean differences between the two cultural
groups on the two constructs and, further, to examine whether
the relationship of self-esteem to life satisfaction is stronger
in one culture than in the other. Could we simply use scales
developed in one culture, such as Rosenberg's self-esteem scale
(Rosenberg,
1965), in both cultural groups and then
compare the results? To make valid comparisons across different
cultural or ethnic groups, we must address an important
question: Are we comparing the same constructs across different
groups?
What Is Measurement Invariance and Why Is It Important in
Cross-Cultural Research?
When we compare scale scores, such as self-esteem, across
different groups, we make a critical assumption that the scale
measures the same construct in all of the groups. If that
assumption is true, comparisons and analyses of those scores are
valid, and subsequent interpretations are meaningful. However,
if that assumption does not hold, such comparisons do not
produce meaningful results. This is the general issue of
measurement invariance.
Measurement invariance is the equivalence of a measured
construct in two or more groups, such as people from different
cultures. It ensures that the same constructs are being assessed
in each group. Measurement invariance is an important issue if a
researcher wishes to make group comparisons (e.g.,
Byrne &
Watkins, 2003; Reise, Widaman, & Pugh,
1993; Riordan & Vandenberg,
1994; Van de Vijver & Leung,
1997; Widaman & Reise,
1997). Meaningful comparisons of statistics,
such as means and regression coefficients, can only be made if
the measures are comparable across different groups.
Cross-cultural researchers have long recognized the importance
of ensuring construct comparability in different cultural or
ethnic groups (Berry,
1969; Irvine & Carroll,
1980; Poortinga, 1989;
Van de Vijver
& Leung, 1997). However, it is
the development of measurement invariance tests
(Jöreskog,
1971; Meredith, 1993;
Millsap &
Everson, 1993; Sörbom,
1978; Widaman & Reise, 1997)
and the recent development of advanced statistical tools that
have made it possible to perform rigorous tests of measurement
invariance.
Measurement invariance can be tested when a scale is composed
of multiple items or subscales. With continuous variables, the
most frequently used technique for testing measurement
invariance is multiple-group confirmatory factor analysis (CFA;
F. F. Chen,
2007; F. F. Chen, Sousa, & West,
2005; F. F. Chen & West,
2008; Meredith, 1993;
Millsap &
Everson, 1993; Widaman & Reise,
1997). In factor analytic terms, the items
serve as indicators of the common factor (i.e., the construct
that the items intend to measure) in a CFA model. The basic idea
of applying multiple-group CFA to test measurement invariance is
to examine the interrelations between the indicators (i.e.,
items or subscales) and the factors that the indicators are
supposed to measure. Multiple-group CFA can be used to test the
equivalence of the factor structure (i.e., number of factors),
factor loadings (i.e., unit of a scale), intercepts (i.e.,
origin of a scale), residual variance (i.e., precision of a
scale), and other aspects of a construct across different groups
in a series of hierarchical models.
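In factor-analytic notation, each invariance level constrains one part of the measurement model x = τ + λξ + ε: the intercept τ is the scale's origin, the loading λ its unit, and the residual variance its precision. A minimal simulation sketch (the parameter values below are illustrative assumptions, not values from this article) shows how a single non-invariant loading changes the observed statistics even when the latent trait is identical across groups:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # large n so sample moments approximate population values

def simulate_group(loadings, intercepts, n, rng):
    """Item responses from the one-factor model x = tau + lambda * xi + eps."""
    xi = rng.normal(0.0, 1.0, n)                    # latent factor, variance 1
    eps = rng.normal(0.0, 0.5, (len(loadings), n))  # residual SD = .5
    x = (np.asarray(intercepts)[:, None]
         + np.asarray(loadings)[:, None] * xi + eps)
    return x

# Reference group: all loadings .8; focal group: item 3 loads only .5
x_ref = simulate_group([0.8, 0.8, 0.8], [3.0, 3.0, 3.0], n, rng)
x_foc = simulate_group([0.8, 0.8, 0.5], [3.0, 3.0, 3.0], n, rng)

# An item's variance equals loading^2 + residual variance, so the
# non-invariant item has visibly different variance across groups
# (about .89 vs. .50 here) despite identical latent distributions.
var_ref_item3 = x_ref[2].var()
var_foc_item3 = x_foc[2].var()
```

Multiple-group CFA recovers these group-specific loadings and tests whether constraining them to equality worsens model fit.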
The most basic level of measurement invariance is known as
configural invariance (Horn, McArdle, & Mason, 1983)
or factor-form invariance (Cheung & Rensvold,
2000). It tests whether similar, but not
identical, factors are measured in the groups
(Widaman &
Reise, 1997). The same item must be
associated with the same latent factor in each group, but the
factor loadings can differ across groups.
The second level of invariance is factor loading or metric
invariance. Factor loadings represent the strength of the
relationships between each factor and its associated items
(Bollen,
1989; Jöreskog & Sörbom,
1999). Factor loadings can be conceptualized
as the slopes of regression lines, that is, the weights obtained
by regressing the item responses on the underlying latent
factors. When factor loadings are equal, the unit of the
measurement is identical, and thus predictive relationships can
be compared across groups.
The third level of invariance is intercept or scalar
invariance. It tests whether an item has the same point of
origin across different groups. When invariance is achieved at
both the factor loading and intercept levels, scores from
different groups have the same unit of measurement (i.e., factor
loading) as well as the same origin (i.e., intercept), and thus
factor means can be compared across groups. Otherwise, it is not
certain whether group differences on factor means are
attributable to valid cultural differences or to measurement
artifacts.
The fourth level is the invariance of residual variance. It
tests the equivalence of the precision of a
scale. 1 Measurement invariance can be used to
test the invariance of a scale (i.e., an omnibus test in which
all items are tested simultaneously) as well as the invariance
of individual items (i.e., planned contrast in which one or more
items are tested). When items meet the standards of measurement
invariance, they are considered invariant; otherwise, they are
defined as non-invariant, lacking invariance, or having
measurement bias. It is possible that some of the items are
invariant, whereas others are not in a given scale. For detailed
procedures on testing measurement invariance and criteria on
evaluating measurement invariance, see F. F. Chen (2007);
F. F. Chen, Sousa,
and West (2005); and Widaman and Reise
(1997).
What Factors Can Cause Lack of Measurement Invariance?
When scale scores are compared across different cultural
groups, a variety of sources can affect the equivalence of the
construct. Lack of configural invariance (i.e., the number of
factors that underlies a construct is different) is most likely
to occur when a construct is simply imported from one cultural
setting to another, because a construct can be more
differentiated in one culture than in another. For example, the
concept of individuation (Maslach, Stapp, & Santee,
1985) is best represented by two factors in
China, whereas it is unidimensional in the United States
(Kwan, Bond,
Boucher, Maslach, & Gan, 2002).
Similarly, filial piety is also a more elaborated concept in
China than in the United States (Hsieh, 1967).
Lack of loading invariance (i.e., unit of a scale) is likely to
arise from multiple causes. First, it can happen when a scale is
imported from one culture, such as the United States, to
another, such as China, but the definitions and meanings of that
concept do not fully overlap across different cultures. As a
result, the item content is more appropriate for one culture
than for the other. For example, for North Americans,
self-esteem mainly stems from having unique personal attributes
and individual achievements. In contrast, for people from
Eastern cultures, the self is deeply connected with family,
friends, groups, etc., and thus the sense of
“we” and interdependence with others may be
the most important source of self-esteem. Consequently, items
that tap the Western view of self-esteem, such as “I
am a person of worth,” and “I feel that I
have a number of good qualities,” may not be good
indicators of self-esteem in an Eastern context. The association
between Chinese participants' self-esteem and endorsement of
Western items (i.e., factor loadings) may be weaker than for
American participants. Second, lack of loading invariance can
come from inappropriate translation. When items are translated
from one language to another, their meanings can change,
particularly for idiomatic expressions. For example, items like
“I feel blue” as a measure of depression
would make Chinese participants feel that this item is out of
the blue. The American participants would thus respond to the
content of the item, whereas the Chinese participants would give
inconsistent answers. As a result, the strength of the
relationship (i.e., factor loading) between the items and the
depression construct would be weaker for the Chinese
participants than for their American counterparts. Third,
response sets, particularly the tendency to use or avoid extreme
responses, can result in lack of loading invariance. For
example, evidence suggests that U.S. participants have an
inclination to use the extreme ends of a response scale, whereas
Chinese participants are more likely to use the middle points
(C. Chen, Lee,
& Stevenson, 1995;
Hui &
Triandis, 1985), resulting in a
restricted range of responses among the Chinese participants.
Accordingly, factor loadings differ across the two groups.
Several factors can affect the origin of a scale, that is, the
intercept of the scale. First, social desirability, a tendency
to follow the social norms, can lead participants in one group
to consistently give higher or lower ratings than those in other
groups (Hui &
Triandis, 1985). For example, for the
item “How happy were you in the past week?,”
the true happy state might be 3 on a 5-point scale for
participants from both the United States and China. However, the American participants may respond with 4 because of the need to preserve
positive self-esteem (e.g., Heine, Lehman, Markus, & Kitayama,
1999). Second, when a group is preoccupied
with its own defects or deficiencies, it may convey a stronger
desire for these values or traits. For example, survey ratings
indicate that some minority parents and students place greater value on education than do their European and Asian
counterparts. However, behavioral observations, such as the
amount of time that students stay in school and study, tell a
different story (cf. Peng, Nisbett, & Wong, 1997).
Third, people from different cultural groups may use different
reference frameworks in making judgments about themselves. For
example, current trait or attitude measures of individualism and
collectivism often fail to reveal the expected cultural
differences. However, when participants from Japan and Canada
were asked to compare themselves with either Canadians or
Japanese, the expected cultural differences were enhanced when
the cross-reference group was used (Heine, Lehman, Peng, &
Greenholtz, 2002). Under all three
scenarios, the origin of a scale would be different. A 3 in
Culture A may be equal to a 4 in Culture B, resulting in lack of
intercept invariance.
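The "3 in Culture A may be equal to a 4 in Culture B" scenario can be sketched numerically: if two groups have identical latent means but one group's item intercept is a full scale point higher, the observed means differ by a point even though no true difference exists (the values below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Both cultures have the same true (latent) happiness distribution...
latent_a = rng.normal(0.0, 1.0, n)
latent_b = rng.normal(0.0, 1.0, n)

# ...but Culture B's item intercept is 4 rather than 3 (e.g., through
# socially desirable responding). Loadings are equal across groups (.5).
item_a = 3.0 + 0.5 * latent_a + rng.normal(0.0, 0.5, n)
item_b = 4.0 + 0.5 * latent_b + rng.normal(0.0, 0.5, n)

# The observed mean gap is about 1.0, purely a measurement artifact.
observed_gap = item_b.mean() - item_a.mean()
```

A naive mean comparison would declare Culture B "happier"; intercept invariance testing is what catches this artifact.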
Given the comparative nature of the studies, it is quite a
challenging task to achieve measurement invariance in
cross-cultural research, particularly when we simply apply
instruments developed in one culture to other cultural contexts.
However, this is a common practice in applied research. To what
extent are these scales invariant cross-culturally, and how
confident are we about the conclusions drawn from these studies?
To address these issues, a literature review was conducted on
the instruments used in cross-cultural studies.
Are the Instruments Comparable Cross-Culturally? Analysis of
the Current Practice
The following key words and 30 other similar words were used to
search articles published from 1993 2 to 2006 in the
PsycINFO database: “cross-cultural
invariance,” “factor invariance,”
“measurement invariance.” One hundred thirty
comparisons 3 met the following selection criteria: (a)
the instrument was originally developed in North America, (b)
Caucasian Americans/Canadians were used as the reference group,
(c) the article was published in a peer-reviewed journal, and
(d) factor loadings of each cultural or ethnic group were
reported or obtained upon request.
Analyses were performed to examine the pattern and severity of
factor-loading differences across the cultural or ethnic
comparisons. The analysis results, such as effect size, pattern
of non-invariance, and sample size, were used as the basis for
conducting subsequent simulation studies, in which bias in regression slopes and means resulting from lack of measurement invariance was examined.
Following the convention of Holland and Thayer (1988), the
mainstream cultural group is defined as the reference group
(e.g., United States), and the other ethnic minority or cultural
groups are defined as focal groups (e.g., China). To clarify the
nature of loading differences, two patterns of non-invariance
are defined in this review: (a) When all the non-invariant
loadings are higher in the reference group than in the focal
group, it is classified as a uniform pattern of non-invariance;
(b) when some of the non-invariant loadings are higher in the
reference group and some are higher in the focal group, it is
classified as a mixed pattern of non-invariance. In both cases,
the magnitude of the loading difference is the numerical
difference between the loadings for a given item across two
groups.
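The two patterns defined above can be operationalized in a few lines. The helper below is hypothetical (not part of any published toolkit) and follows the review's definitions; a comparison in which every non-invariant loading favored the focal group falls outside the review's dichotomy and is grouped with "mixed" here:

```python
def classify_pattern(ref_loadings, focal_loadings, tol=1e-8):
    """Classify non-invariance between two standardized loading vectors.

    Returns 'invariant', 'uniform' (all non-invariant loadings higher in
    the reference group), or 'mixed', plus the per-item differences
    (reference minus focal), i.e., the magnitude of loading difference.
    """
    diffs = [r - f for r, f in zip(ref_loadings, focal_loadings)]
    noninv = [d for d in diffs if abs(d) > tol]
    if not noninv:
        pattern = "invariant"
    elif all(d > 0 for d in noninv):
        pattern = "uniform"
    else:
        pattern = "mixed"
    return pattern, diffs
```

For example, `classify_pattern([0.8, 0.7, 0.6], [0.7, 0.6, 0.6])` yields a uniform pattern, whereas `classify_pattern([0.8, 0.5], [0.7, 0.6])` yields a mixed one.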
Among the 130 cross-cultural and cross-ethnic comparisons, 9
lacked configural invariance, which means that the number of
factors that underlie the items was different across groups.
These cases are excluded from further analysis, because it is
not meaningful to compare factor loadings when configural
invariance is not achieved. In the remaining 122 comparisons, 97
were based on standardized factor loadings, and 25 were based on
unstandardized factor loadings. Further analyses are based on
the standardized factor loadings, because unstandardized factor
loadings are subject to scaling, which prevents direct
comparisons across studies.
For 74 of the 97 standardized comparisons (76.3%), the average
loading was higher in the reference group than in the focal
group, and the average loading difference was .13
(SD = .08). Although the magnitude of the
average loading difference between the groups appears small, its
impact may not be trivial.
Findings further indicate that 14 of the 97 comparisons (14.4%)
had all loadings higher in the reference group (e.g., United
States) than in the focal group (e.g., China), showing a uniform
pattern of non-invariance. However, it was more common that only
a proportion of the items, rather than all items, had higher
loadings in the reference group than in the focal group: 26 of
the comparisons (26.8%) had at least 90% of the loadings higher
in the reference group, 48 of the comparisons (49.5%) had at
least 75% of the loadings higher in the reference group, 81 of
the comparisons (83.5%) had at least 50% of the loadings higher
in the reference group, and 94 of the comparisons (96.9%) had at
least 30% of the loadings higher in the reference group.
It is interesting that 7 of the 97 comparisons (7.2%) had about
half of the loadings higher in the reference group and the other
half higher in the focal group, showing a mixed pattern of
non-invariance.
Given these findings, it is important to examine bias in group
comparisons resulting from a proportion of non-invariant items,
in addition to bias associated with the condition in which all
loadings are higher in one group than in the other. It is also
meaningful to investigate bias associated with the pattern of
non-invariance, that is, whether the non-invariant loadings are
uniformly higher in one group or the pattern is mixed.
Although no studies have systematically examined the pattern of
factor loadings across different cultural and ethnic groups, the
findings from this review are consistent with the literature on
reliabilities. For example, in compensatory education research,
test scores obtained from the disadvantaged minority groups
often have lower reliability, compared with those of the
advantaged group (Campbell & Boruch, 1975).
Reviews of self-reported measures on values indicate that higher
reliability was more often reported in the American samples than
in other cultural groups (Peng et al., 1997). The lower
reliability in the focal groups is a reasonable indication of
lower factor loadings and is thus a sign of measurement
bias. 4
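The link between loadings and reliability can be made concrete: for a one-factor scale with unit factor variance, the model-implied covariance matrix is ΛΛ′ + Θ, and coefficient alpha computed from it drops as the loadings drop, holding residual variances fixed. A sketch with illustrative values (not drawn from the reviewed studies):

```python
import numpy as np

def alpha_from_model(loadings, residual_vars):
    """Cronbach's alpha from the model-implied covariance of a one-factor
    scale with factor variance fixed at 1: Sigma = lam lam' + diag(theta)."""
    lam = np.asarray(loadings, dtype=float)
    sigma = np.outer(lam, lam) + np.diag(residual_vars)
    k = len(lam)
    return (k / (k - 1)) * (1 - np.trace(sigma) / sigma.sum())

a_ref = alpha_from_model([0.80] * 10, [0.36] * 10)  # reference-group loadings
a_foc = alpha_from_model([0.67] * 10, [0.36] * 10)  # loadings lower by .13

# a_ref > a_foc: weaker loadings alone depress internal consistency,
# which is why lower focal-group reliability can signal loading bias.
```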
Given that reliabilities are more routinely reported than
factor loadings in published articles, a second search was
conducted. To limit the scope of the search, Rosenberg's (1965)
self-esteem scale was chosen, as it is perhaps one of the most
widely used scales cross-culturally. Using key words
“culture and Rosenberg self-esteem,” and
“cross-cultural and Rosenberg self-esteem,”
a search was performed on the PsycINFO database, and it was
limited to articles published from 1995 to 2006. Seventy-five
comparisons met the following criteria: (a) Rosenberg's
self-esteem measure was used cross-culturally or cross-ethnically, (b) Caucasian Americans/Canadians were used as the
reference group, (c) the article was published in a
peer-reviewed journal, and (d) reliability of the scale for each
cultural or ethnic group was reported or obtained upon request.
In 59 of the 75 comparisons (78.7%), reliability of the
Rosenberg self-esteem scale was higher in the Caucasian
Americans or Canadians than in the other cultural or ethnic
group(s), and the average difference in reliability was .07
(M(U.S./Canada) = .87, SD = .02, vs. M(non-U.S./Canada) = .80, SD = .08).
This pattern is particularly true when comparing North Americans
with Asians, because in 18 of the 21 comparisons (85.7%), scores
of North Americans had higher reliability than the scores of
Asians, and the difference in reliability was .09
(M(U.S./Canada) = .87, SD = .02, vs. M(non-U.S./Canada) = .78, SD = .05).
This analysis also indicates that, consistent with the
literature, North Americans have higher self-esteem than other
cultural or ethnic groups (Cohen's d = .31),
and this difference is moderately large between North Americans
and Asians (Cohen's d = .59). However, it is
possible that the lower reliability in the focal groups is, at
least in part, responsible for the commonly reported cultural
and ethnic difference in self-esteem.
What Happens When Instruments Are Not Comparable
Cross-Culturally? The Present Simulation Studies
When we compare diverse groups on the basis of instruments that
do not have the same psychometric properties, we may discover
erroneous “group differences” that are in
fact artifacts of measurement, or we may miss true group
differences that have been masked by these artifacts. As a
result, a harmful education program may be regarded as
beneficial to the students, or an effective health intervention program may be considered of no use to depressed patients.
Although measurement invariance has been increasingly tested in
cross-cultural comparisons (e.g., Byrne & Campbell, 1999;
Little,
1997; Rhee, Uleman, & Lee, 1996; Steenkamp & Baumgartner,
1998), it is still usually assumed, rather
than tested. The author's review of articles in the Journal of Personality and Social Psychology from 1985 to 2005 indicates that although 48
articles involved cross-cultural comparisons of attitudes,
values, personality, and other self-reported surveys, only 8
studies (less than 17%) tested measurement invariance across
different cultural groups, with the remainder using a sum score
or mean score. The sum-score approach takes the total score of
the items in a scale, and similarly, the mean score takes the
average of the items. Both approaches assume that the measures
under study are invariant across different groups. In addition,
it is not uncommon to pool participants from different cultural
or ethnic groups for evaluation, a procedure that assumes
measurement invariance as well. However, as discovered in the
author's review, this assumption does not hold in many
applications.
To explore the consequences of making comparisons based on
non-invariant measures on the conclusions drawn from a study,
Millsap and Kwok
(2004) conducted an important series of
simulation studies. Given that school admission committees or
employers often select students or employees from different
ethnic or cultural backgrounds, Millsap and Kwok examined
selection bias based on a criterion that is only partially
invariant. Selection bias was defined by the accuracy of
classifying people according to two standards: a factor score
for each group and a composite score in which the group
difference in factor loadings was ignored. 5 Four categories
were created: (a) true positive, should be
selected on the basis of the factor score and was selected on
the basis of the composite score; (b) true
negative, should not be selected on the basis of the
factor score and was not selected on the basis of the composite
score; (c) false positive, should not be
selected but was selected; and (d) false
negative, should be selected but was not selected. It
was found that even small group differences in factor structure
could have substantial influence on selection accuracy, particularly for sensitivity: the number of individuals selected on the basis of both their factor score and their composite score, divided by the total number selected on the basis of their factor score.
example, when the proportion of non-invariance varied from 0%
(control condition) to 75%, sensitivity could drop from 64.2% to
22.1%.
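Millsap and Kwok's four-way classification and the sensitivity index described above can be sketched as follows; the function name and cutoffs are illustrative, not taken from their study:

```python
import numpy as np

def selection_accuracy(factor_scores, composite_scores,
                       cutoff_factor, cutoff_composite):
    """Cross-classify cases selected by the 'true' factor score vs. the
    observed composite score; sensitivity = selected by both / selected
    by the factor score."""
    f_sel = np.asarray(factor_scores) >= cutoff_factor
    c_sel = np.asarray(composite_scores) >= cutoff_composite
    true_pos = np.sum(f_sel & c_sel)    # should be and was selected
    true_neg = np.sum(~f_sel & ~c_sel)  # should not be and was not
    false_pos = np.sum(~f_sel & c_sel)  # should not be but was selected
    false_neg = np.sum(f_sel & ~c_sel)  # should be but was not selected
    return {"TP": int(true_pos), "TN": int(true_neg),
            "FP": int(false_pos), "FN": int(false_neg),
            "sensitivity": float(true_pos / f_sel.sum())}
```

With toy scores, `selection_accuracy([1, 2, 3, 4], [4, 1, 3, 2], 2.5, 2.5)` classifies one case into each category and yields a sensitivity of .5: half the deserving candidates are missed by the composite score.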
No studies have examined the bias that lack of measurement
invariance may introduce to commonly used statistics, such as
means and regression slopes, in group comparisons. For example,
suppose we were interested in asking whether self-esteem would
predict life satisfaction to the same degree for Chinese as for
Caucasian students. In what direction and to what extent would
the predictive relationship (i.e., the beta weight or regression
slope) be affected by lack of invariance in the self-esteem
measure? How would the relationship be biased if the outcome
measure, life satisfaction, also lacks invariance? In what
direction would the group means for self-esteem and life
satisfaction be biased? To address these issues, three
simulation studies were conducted to fill in this gap.
Overview of Present Simulation Studies
Many researchers have discussed the importance of testing
measurement invariance and are well aware that lack of
invariance can lead to possible bias in conclusions (e.g.,
Widaman &
Reise, 1997). However, this is the first
investigation that examines both the direction and degree of
bias resulting from various forms of non-invariance in
cross-cultural research. This information could be vital to
researchers when interpreting findings based on non-invariant
measures, because it can warn readers by specifying the
direction and degree of bias in each cultural or ethnic group,
given that the requirements for measurement invariance are often
difficult to meet in applied research. Second, this is also the
first study in which the simulation conditions are based on the
empirical findings in the cross-cultural literature, and it
therefore maximizes the external validity of the study. Third,
this investigation is particularly relevant to the
cross-cultural study of personality and social psychological
phenomena.
There are three major goals in the present investigation: (a)
to examine bias in regression slopes (beta weights) when factor
loadings are not invariant, as factor loading invariance is a
prerequisite for regression slope comparisons (e.g., When using
self-esteem to predict subjective well-being, how would the
predictive relationship be affected if the factor loadings of
self-esteem were different across groups?); (b) to explore bias
in means when factor loadings are not invariant, because factor
loading invariance is also a prerequisite for proper mean
comparisons (e.g., How would group means be biased when factor
loadings of self-esteem differ?); (c) to investigate bias in
means when intercepts (i.e., point of origin) are not invariant,
as intercept invariance is a prerequisite for mean comparisons,
in addition to factor loading invariance (Widaman & Reise,
1997; e.g., When one group has higher
intercepts in self-esteem than the other group, in what
direction would the means be biased in each group?). Given the
computational complexity and intensity, the Mplus software
program (Muthén & Muthén,
1998) was used to conduct the
simulation.
Study 1: Lack of Loading Invariance and Bias in Regression
Slopes
As discussed earlier, lack of invariance in factor loadings can
come from insufficient overlap in meaning of a construct between
cultural groups, inappropriate content of the items, translation
problems, the tendency to use or avoid extreme responses on a
response scale, differential responses to positively versus
negatively worded items, and other sources. Study 1 was
conducted to examine predictive bias between two constructs when
a predictor or an outcome measure lacks invariance in factor
loadings. This would allow us to examine bias in a predictive
relationship, such as using self-esteem to predict life
satisfaction across groups. When bias is found, one may discover
a bogus interaction effect of culture by predictor. For example,
self-esteem may be found to be a stronger predictor of
life-satisfaction for Caucasians than for Chinese, when in fact
the relationship is the same for both groups.
Design
To systematically examine bias in
regression slopes when the predictor or criterion lacks
loading invariance and to maximize the external validity
simultaneously, 4 (Proportion of Non-invariance: 87.5%, 75%,
50%, and 25%) × 2 (Pattern of Invariance: uniform
vs. mixed) × 2 (Ratio of Sample Size: 1 vs. 1, 4
vs. 1; total N = 300) experimental
conditions were generated (see Appendix for
detailed model parameters and additional justification for
parameter selections).
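The resulting factorial design can be enumerated directly. The group-size splits below (150/150 and 240/60) are not stated explicitly in the text but follow from the total N = 300 and the 1:1 and 4:1 ratios:

```python
from itertools import product

proportions = [0.875, 0.75, 0.50, 0.25]  # proportion of non-invariant loadings
patterns = ["uniform", "mixed"]          # pattern of non-invariance
sample_splits = [(150, 150), (240, 60)]  # 1:1 and 4:1 splits of N = 300

# Cross the three factors to obtain the 4 x 2 x 2 = 16 experimental cells.
conditions = list(product(proportions, patterns, sample_splits))
n_cells = len(conditions)
```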
The proportion of non-invariance
conditions correspond approximately to the findings in the
author's literature review: 26 of the 97 comparisons had at
least 90% of the loadings higher in the reference group, 48
of the comparisons had at least 75% of the loadings higher
in the reference group, 81 of them had at least 50% of the
loadings higher in the reference group, and 94 of them had
at least 30% of the loadings higher in the reference group.
The proportion of non-invariance was
varied to serve two purposes: (a) to maximize the external
validity of the study (as found in the author's literature
review, in many of the applications, only a proportion of
the items, rather than all the items, in a scale are
non-invariant); (b) to explore whether the degree of bias corresponds monotonically to the degree of non-invariance (i.e., to examine whether a greater
degree of non-invariance in factor loadings leads to a
greater degree of bias in regression slopes, which is
particularly important when the power of testing measurement invariance is considered). This issue is addressed further in the
discussion.
In the uniform pattern of non-invariance
condition, all non-invariant loadings were set higher in the
reference group (e.g., United States) than in the focal
group (e.g., China). In the mixed pattern of non-invariance
condition, about half of the items were set higher in the
reference group, whereas the other half were set higher in
the focal group. This condition was designed to match the
finding in the review as well, given that 7 of the 97
comparisons showed this pattern of non-invariance. The ratio
of sample size (1 vs. 1 and 4 vs. 1) also reflects the
findings in the review, because among 36.4% of the
comparisons, the ratio of sample size was less than 1.5, and
the average ratio of sample size was 4.67 across all
comparisons. 6 Finally, given that in applied
research, both the predictor and outcome variable may lack
invariance, such a condition was also examined. For
simplicity, the degree and direction of bias were equivalent
in both variables, and only the uniform condition was
considered.
The expected mean and covariance
structures were generated in Version 3.01 of Mplus
(Muthén & Muthén,
1998), and maximum likelihood estimation
was used to estimate models. First, a population matrix was
generated, corresponding to the parameterization of a target
two-group model. In the target model, the factor loadings
were different between the groups (except for the marker
variable, 7 which was set equal across the
groups); all other parameters (i.e., factor variance and
covariance, and residual variances) were set equal across
the groups. Second, a configural invariance model was fit to
the generated population matrix, in which the pattern of the
factor loadings was the same (i.e., the same item loaded on
the same factor[s]), whereas all loadings were freely
estimated in both groups. Third, a factor loading invariance
model was fit to the population matrix, in which all the
loadings were equated across the groups. Regression slopes obtained from the loading invariance model were then compared with the true values from the configural invariance model to determine the direction and degree of bias in the regression slopes.
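The first step, generating the population covariance matrix from the target two-group model, amounts to computing Σ(g) = Λ(g) Φ Λ(g)′ + Θ(g) for each group. A sketch with hypothetical parameter values (the article's exact parameters appear in its Appendix):

```python
import numpy as np

def implied_cov(loadings, phi, residual_vars):
    """Model-implied covariance matrix: Sigma = Lambda Phi Lambda' + Theta."""
    lam = np.asarray(loadings, dtype=float)  # items x factors
    return lam @ phi @ lam.T + np.diag(residual_vars)

# Factor variances/covariance (e.g., self-esteem and life satisfaction).
phi = np.array([[1.0, 0.5],
                [0.5, 1.0]])

# Two factors, three items each; the marker variable (first item per
# factor) is fixed equal across groups to set the scale.
lam_ref = np.array([[1.0, 0.0], [0.8, 0.0], [0.8, 0.0],
                    [0.0, 1.0], [0.0, 0.8], [0.0, 0.8]])
lam_foc = lam_ref.copy()
lam_foc[1:3, 0] = 0.6  # non-invariant (weaker) loadings in the focal group

theta = [0.4] * 6      # residual variances, equal across groups
sigma_ref = implied_cov(lam_ref, phi, theta)
sigma_foc = implied_cov(lam_foc, phi, theta)
```

These two population matrices are then the input to which the configural and loading-invariance models are fit.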
Results
Tables 1 and 2 present bias in regression slopes
when the predictor lacks loading invariance and when the
criterion lacks loading invariance, respectively.
Table 3 displays the results when both the predictor and the
criterion violate loading invariance. Relative bias was calculated by subtracting the true regression slope from the estimated slope in the loading invariance model and then dividing the difference by the true regression slope. A positive value indicates that the slope was overestimated, and a negative value indicates that the slope was underestimated.
When a Predictor Lacks Factor Loading Invariance: Bias in Regression Slopes (Study 1)
When a Criterion Lacks Factor Loading Invariance: Bias in Regression Slopes (Study 1)
When Both the Predictor and Outcome Variable Lack Factor Loading Invariance: Bias in Regression Slopes (Study 1)
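The relative-bias computation reduces to a one-liner, with positive values indicating overestimation:

```python
def relative_bias(estimated, true):
    """Relative bias of an estimated regression slope: (estimated - true) / true.
    Positive -> slope overestimated; negative -> slope underestimated."""
    return (estimated - true) / true

# Illustrative values: an estimated slope of .36 against a true slope of .30
# gives a relative bias of about .20, i.e., a 20% overestimate.
bias = relative_bias(0.36, 0.30)
```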
Predictor or criterion lack of loading
invariance—uniform
When the predictor, such as
self-esteem, lacked loading invariance, the regression
slope was underestimated in the reference group (e.g.,
United States) but overestimated in the focal group
(e.g., China). For example, in the case of self-esteem
predicting life satisfaction, when self-esteem is a
better measure for Americans than for Chinese, the
predictive relationship is weaker for Americans than for
Chinese, even when the true relationship (as specified
in the simulation) is the same for both groups. As a
result, an artificial interaction effect of Culture
× Self-Esteem is created. The degree of bias
(i.e., the extent to which the slope is overestimated or
underestimated, or the artificially created group
difference in the slope) is affected by the proportion
of non-invariant items, group membership, and ratio of
sample size (i.e., sample size of the reference group
vs. focal group). That is, when the proportion of
non-invariant items increases, bias increases; bias is
bigger in the focal group than in the reference group,
especially when the proportion of non-invariance is
large. When sample size increases in the reference group
relative to the focal group, bias decreases in that
group but increases in the focal group.
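The direction of this bias can be illustrated with a stylized moment-matching sketch. It is my own single-factor simplification, not the paper's Mplus models: when unequal loadings are constrained to a common value, the estimated factor variance must absorb the difference, which rescales the slope by the reciprocal of the loading ratio.

```python
def constrained_slope(lam_true: float, lam_common: float, b_true: float) -> float:
    """Implied slope when items with true loading lam_true are forced to a
    common loading lam_common (simplified moment matching).

    Inter-item covariances lam_true**2 * var(xi) must be reproduced as
    lam_common**2 * phi_hat, so phi_hat = (lam_true/lam_common)**2; the
    factor-criterion covariance lam_true * b_true is reproduced as
    lam_common * phi_hat * b_hat.
    """
    phi_hat = (lam_true / lam_common) ** 2               # distorted factor variance
    return (lam_true * b_true) / (lam_common * phi_hat)  # = (lam_common/lam_true) * b_true

b_true = 0.5                           # same latent slope in both groups
lam_ref, lam_foc = 0.9, 0.5            # e.g., United States vs. China
lam_common = (lam_ref + lam_foc) / 2   # compromise loading under the constraint

print(constrained_slope(lam_ref, lam_common, b_true))  # ~0.39: underestimated (reference)
print(constrained_slope(lam_foc, lam_common, b_true))  # ~0.70: overestimated (focal)
```

The artificial Culture × Self-Esteem interaction appears because the same true slope of 0.5 is recovered as roughly 0.39 in the higher-loading group but roughly 0.70 in the lower-loading group.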
When the criterion lacked loading
invariance, the opposite pattern was found, that is, the
regression slope was overestimated in the reference
group (e.g., United States) but underestimated in the
focal group (e.g., China). Given the same example, when
life satisfaction is a more appropriate instrument for
Americans than for Chinese, the regression slope is
larger for Americans than for Chinese, even when the
predictive relationship is the same for both groups.
Consequently, lack of invariance in life satisfaction
creates a pseudo interaction effect of Culture
× Self-Esteem. As in the case when the
predictor is non-invariant, degree of bias is affected
by the proportion of non-invariance, group membership,
and ratio of sample size. That is, when the proportion
of non-invariant items increases, bias increases; bias
is bigger in the focal group than in the reference
group, especially when the proportion of non-invariant
items is large. When the reference group has a larger
sample size, bias decreases in that group but increases
in the focal group.
Predictor or criterion lacking loading
invariance—mixed
When the pattern of non-invariant
items in the predictor was mixed, bias in the regression
slope was reduced in both groups. Similarly, when the
pattern of lack of loading invariance in the criterion
was mixed, bias in the regression slope was also reduced
in both groups. Thus, when some of the loadings are
higher in the reference group and some are higher in the
focal group, artificially created group difference in
the predictive relationship is reduced because bias
associated with the reference group and bias associated
with the focal group tend to cancel each other out.
However, reduced bias in regression slopes does not
imply that the measures are invariant.
Both predictor and criterion lacking loading
invariance—uniform
When both the predictor, such as
self-esteem, and the outcome variable, such as life
satisfaction, lacked loading invariance, and when the
direction and degree of non-invariance were comparable
in both groups, bias was reduced. However, this result
does not imply that using non-invariant measures
simultaneously in the predictor and the criterion is the
solution to lack of measurement invariance. Instead, it
suggests that when lack of invariance occurs in both the
predictor and the outcome variable, statistical bias
associated with the non-invariant predictor and bias
associated with the non-invariant outcome variable tend
to cancel each other out.
Summary
The results of Study 1 indicate that
lack of factor-loading invariance could lead to
substantial bias in regression slopes. The direction of
bias depends on whether a predictor or criterion lacks
invariance. When the reference group had higher loadings
in the predictor, the regression slope was
underestimated in the reference group but overestimated
in the focal group. When the reference group had higher
loadings in the criterion, the opposite pattern was
found. Under both conditions, a bogus interaction effect
was produced. However, when some of the loadings were
higher in the reference group and some were higher in
the focal group, bias in the regression slopes was
reduced. When lack of loading invariance occurred in
both the predictor and outcome variable, bias was also
reduced. However, the construct validity of the scales
is still in question, as they may measure different
concepts in different cultures.
Study 2: Lack of Loading Invariance and Bias in Means
The goal of Study 2 was to explore bias in means when factor
loadings are not invariant, given that loading invariance is a
prerequisite for mean comparisons. The experimental conditions
were the same as in Study 1, except that the tested model was a
one-factor measurement model with no predictor or criterion
involved. Intercepts and residual variances were set equal
across the groups in the target model. Model fitting procedures
were also similar to those in Study 1, except that in Step 2,
both factor loadings and intercepts were equated. Relative bias
was calculated by subtracting the true factor mean from the
mean estimated under the invariance model and then dividing the
difference by the true factor mean. A positive value indicates
that the mean was overestimated, and a negative value indicates
that the mean was underestimated.
Results
Bias in factor means resulting from lack
of loading invariance is presented in Table 4. When the reference group
(e.g., United States) had higher loadings, the factor mean
was overestimated in the reference group but underestimated
in the focal group (e.g., China). As a result, an artificial
group difference was created. The degree of bias was
affected by the proportion of non-invariance, ratio of
sample size, and pattern of non-invariance. That is, when
lack of loading invariance was uniform, as the proportion of
non-invariant items increased, bias increased; the degree of
bias was larger in the focal group than in the reference
group. When sample size increased in the reference group
relative to the focal group, bias decreased in that group
but increased in the focal group. In contrast, when lack of
loading invariance was mixed, bias in the factor mean was
minimized in both groups. As discussed earlier, lack of bias
in the means does not imply the construct is equivalent
across groups.
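The direction of this mean bias follows from the same moment-matching logic used above (again my own simplification, not the paper's fitted models): with invariant intercepts, an item mean equals tau + lam * alpha, so forcing a common loading rescales the estimated factor mean by the loading ratio.

```python
def constrained_factor_mean(lam_true: float, lam_common: float,
                            alpha_true: float) -> float:
    """Implied factor mean when the true loading lam_true is constrained to
    lam_common while intercepts are held invariant: the observed item mean
    tau + lam_true * alpha_true must be reproduced as
    tau + lam_common * alpha_hat."""
    return (lam_true / lam_common) * alpha_true

alpha_true = 1.0                       # same latent mean in both groups
lam_ref, lam_foc = 0.9, 0.5            # reference group has higher loadings
lam_common = (lam_ref + lam_foc) / 2   # compromise loading under the constraint

print(constrained_factor_mean(lam_ref, lam_common, alpha_true))  # >1: overestimated
print(constrained_factor_mean(lam_foc, lam_common, alpha_true))  # <1: underestimated
```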
Table 4: When Loadings Lack Invariance: Bias in Factor Means (Study 2)
Study 3: Lack of Intercept Invariance and Bias in Means
Study 3 was conducted to investigate the impact of lack of
intercept (i.e., point of origin) invariance on factor means,
given that intercept invariance is the prerequisite for factor
mean comparisons. A 4 (Proportion of Non-invariance: 100%, 75%,
50%, 25%) × 2 (Pattern of Invariance: uniform vs.
mixed) × 2 (Ratio of Sample Size: 1 vs. 1, 4 vs. 1;
total N = 300) design was created. Factor
loadings and residual variances were set equal across the groups
in the target model (see Appendix for detailed model
parameters). As in Studies 1 and 2, Mplus was used to generate
the mean and covariance structure, and model-fitting procedures
were similar to those in previous studies.
Results
Lack of intercept (i.e., point of origin)
invariance can lead to appreciable bias in factor means (see
Table
5). The direction of bias depends on the direction of
intercept differences. When the reference group (e.g.,
United States) has higher intercepts than the focal group
(e.g., China), that is, when a U.S. 4 is equal to a Chinese
3, the factor mean is overestimated in the reference group
but underestimated in the focal group. 8 The degree
of bias depends on the degree of non-invariance and ratio of
sample size. The larger the degree of non-invariance, the
larger the bias is in both groups. Consistent with the
findings in Studies 1 and 2, when the reference group
had a larger sample size, bias became smaller in that group
but larger in the focal group; when the pattern of intercept
non-invariance was mixed, that is, when some of the
intercepts were higher in the reference group, whereas
others were higher in the focal group, bias in the means was
substantially reduced in both groups. Once again, the
reduced bias does not indicate that the measures are
invariant.
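The "a U.S. 4 is equal to a Chinese 3" situation can be illustrated with a toy simulation (the numbers are my own, not drawn from Table 5): both groups share the same latent mean, but lower item intercepts in the focal group depress its observed item means.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
latent_mean = 3.0                     # identical true level in both groups

def observed_item_mean(intercept: float, loading: float = 1.0) -> float:
    """Mean of a simulated item rating: intercept + loading*latent + error."""
    latent = latent_mean + 0.5 * rng.standard_normal(n)
    return float((intercept + loading * latent + 0.3 * rng.standard_normal(n)).mean())

us_mean = observed_item_mean(intercept=1.0)  # reference group: higher intercept
cn_mean = observed_item_mean(intercept=0.0)  # focal group: "a U.S. 4 equals a Chinese 3"
print(round(us_mean - cn_mean, 1))           # 1.0: a spurious "cultural difference"
```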
Table 5: When Intercepts Lack Invariance: Bias in Factor Means (Study 3)
Discussion
To make valid comparisons across different cultural or ethnic
groups, we must ensure that we are not comparing chopsticks with
forks. Given that researchers often import measures developed
for one cultural group to other populations, the issue of
measurement invariance becomes a serious challenge. Findings
from Study 1 indicate that lack of factor loading invariance can
produce artificial interaction effects in predictive
relationships. Results of Studies 2 and 3 demonstrate that lack
of loading and intercept (i.e., point of origin) invariance can
lead to bogus cultural differences in means.
Comparison of the Current Investigations With
Millsap and
Kwok's (2004) Studies
Unlike the current studies,
Millsap and
Kwok (2004) did not examine bias in
regression slopes and means due to lack of invariance in
factor loadings or intercepts. However, there is some
comparability between the two independent investigations.
Millsap and Kwok studied selection accuracy by comparing
selection rate based on the distribution of a sum score
(i.e., pooled score from two different groups) and selection
rate based on the distribution of the latent factor score
for each group. They also studied sensitivity (i.e., the
number of individuals who were selected on the basis of both
the pooled sum score and their latent mean score over the
number of individuals who were selected solely on the basis
of their latent mean score). It was found that both the
selection rate and sensitivity were artificially increased
in the reference group (e.g., United States) but decreased
in the focal group (e.g., China) when the reference group
had higher loadings and intercepts than the focal group. The
results of the success ratio (i.e., the number of
individuals who were selected on the basis of the pooled sum
score and their latent mean over the number of individuals
who were selected solely on the basis of the pooled sum
score) also favored the reference group. These findings are
consistent with the results from the current studies, in
which the means were overestimated in the reference group
but underestimated in the focal group when factor loadings
or intercepts favored the reference group. Also as found in
the present studies, as the proportion of non-invariant
items increased, the degree of bias increased accordingly.
In addition, when the reference group had a larger sample
size, bias decreased in that group but increased for the
focal group, a result obtained in the current study as well.
These similar patterns of findings across the two
investigations provide support for the validity of the
current studies.
Implications in Cross-Cultural Research
Given the high incidence of measurement invariance
violations in cross-cultural studies, these
findings cast serious doubt on the conclusions drawn from
past cross-cultural research. For example, a robust
cross-cultural finding is that North Americans have higher
self-esteem than East Asians (e.g., Oishi & Sullivan,
2005). However, in light of the findings
from the current simulation studies and the author's review
in this article of cross-cultural differences in self-esteem
reliability, the observed cultural difference in
self-esteem is, at least in part, due to the lower reliability
(an indication of lower factor loadings) of the self-esteem
scale (Rosenberg, 1965) for East Asians. In addition,
East Asians' value of modesty toward one's personal
attributes (Markus
& Kitayama, 1991) could have
contributed to this cultural difference. This is because the
self-effacing tendency results in lower intercepts in item
ratings, which in turn lead to lower means. Most of the
current self-esteem measures focus on the inner aspect of
self-esteem or feelings of self-competence, which might be
more relevant to North Americans. For East Asians, the
social aspect of self-esteem (i.e., being accepted and
valued by other people) might be more important. Future
research should develop scales that measure self-esteem in a
culturally appropriate manner, such as by including both the
inner and social aspects of self-esteem.
The present literature review on existing
measures encompasses a wide range of topics, including
personality, depression, stress reaction, social competence,
cognitive ability, emotional intelligence, life
satisfaction, organizational commitment, affect,
self-concept, self-esteem, anxiety, and attachment. When
these scales are used as predictors, the predictive
relationship is likely to be underestimated in the reference
group (e.g., United States) but overestimated in the focal
group (e.g., China), and the opposite is likely to happen
when these scales are used as outcome measures. Perhaps the
most routine use of these scales is the comparison of means
across different cultural groups. Most likely, the means are
artificially inflated in the reference group but deflated in
the focal group, given the lower loadings in the latter
group. In particular, for measures related to self-concept,
self-esteem, and satisfaction with life, the means are
likely to be underestimated for East Asians but
overestimated for North Americans, given that both the
loadings and intercepts (resulting from conceptual
differences and the modesty tendency) are likely to be lower
for East Asians. For other measures, the direction of bias
in the means is difficult to predict, given the uncertainty
in intercept differences.
As discussed earlier, measurement
invariance is still assumed, rather than tested, in many
applications. When we fail to examine measurement
invariance, we may uncover spurious “cultural
differences” that are in fact artifacts of
measurement, or we may fail to reveal true cultural
differences that are masked by measurement artifacts and
could have been discovered had we used an invariant
instrument. Results of the present studies also suggest that
we are more likely to draw erroneous conclusions for the
focal group (e.g., Asian Americans) than for the reference
group (e.g., European Americans) when comparing different
ethnic groups, given that the focal group often has a much
smaller sample size than the reference group. If erroneous
conclusions were used to guide school admission, medical
diagnosis, personnel selection and promotion, clinical
trials, or health and education prevention programs, serious
consequences could occur. Healthy people can be falsely
diagnosed and sick ones overlooked. Results of these studies
highlight the importance of testing measurement invariance
in cross-cultural comparisons and the significance of
understanding the consequences of lack of invariance.
Implications in Testing Measurement Invariance
The present investigation has important
implications for testing measurement invariance. It suggests
that we may need a more dynamic approach to evaluating
measurement invariance. In other words, measurement
invariance should be tested within the context of its impact
on the statistics that a researcher is comparing. The
conventional wisdom (e.g., Widaman & Reise,
1997) is that we should test measurement
invariance as a first step in group comparisons. When
measurement invariance is achieved at the appropriate level,
we then move to the next step, which is making group
comparisons. When measurement invariance is not achieved, we
should avoid making group comparisons until an invariant
measure is available. However, when results from this
investigation are interpreted with findings from a series of
recent simulation studies (F. F. Chen, 2007), the
picture is much more complex: The relation between the
probability of detecting non-invariance and the degree of
bias in group comparisons resulting from non-invariance is
not straightforward. Counterintuitively, when both the degree of
non-invariance and its corresponding bias in statistics are
the highest, the probability of revealing non-invariance is
the lowest (F. F.
Chen, 2007); when the degree of
non-invariance and associated bias are only moderate, the
probability of detecting non-invariance is the highest. In
addition, bias is larger when lack of invariance is uniform,
rather than mixed; however, the likelihood of detecting lack
of invariance is smaller when lack of invariance is uniform,
rather than mixed. These findings indicate that meeting the
standards of measurement invariance does not guarantee
unbiased group comparisons. On the other hand, the
discovery of lack of invariance may not result in
statistical bias in group comparisons, depending on the
pattern of non-invariance. Nevertheless, lack of statistical
bias in group comparisons does not imply that the constructs
are comparable at the conceptual level.
The reduced bias in regression slopes and
means due to a mixed pattern of non-invariance also has
implications for comparing constructs that are composed of
common aspects (i.e., shared by different cultural or ethnic
groups), as well as unique components (i.e., specific to
each group). These constructs will not meet the standards of
measurement invariance, as culture-specific items are unique
to each culture. However, the results of the present
investigation suggest that if these culturally unique items
are balanced across groups, it is possible to make unbiased
comparisons. Conceptually, however, it is still arguable
whether a construct is comparable when culturally unique
components are involved.
Implications of these studies go beyond
cross-cultural research. Measurement invariance is an
important issue whenever heterogeneous groups are involved.
The groups can be gender, age in longitudinal research, or
treatment and control groups in experimental and prevention
studies. For example, Smith and Reise (1999)
conducted a study to examine gender differences in
neuroticism using the Revised NEO Personality Inventory
Neuroticism scale (Costa & McCrae, 1992). It
was found that several items related to being sensitive to
interpersonal stress tended to inflate women's scores,
whereas several items related to tension and worry tended to
inflate men's scores. Similarly, in longitudinal studies,
the meaning of a construct may change over time. For
example, the way people display racism is more subtle in the
21st century than in the 1960s. An instrument developed to
measure explicit racism in the 1960s may not be able to
capture the more subtle and implicit nature of the construct
today. In experimental studies, when a treatment is
introduced, it has the potential to change the meaning of
the constructs under study.
Recommendations—When Invariance Fails
This investigation systematically
examined the direction and degree of bias under varying
conditions of non-invariance. The results can be
particularly useful for substantive researchers in deciding
whether a comparison should be made in the face of lack of
measurement invariance. As discussed earlier, the goal of
testing measurement invariance is to ensure that group
comparisons are valid. However, it is a challenging task to
achieve measurement invariance in cross-cultural research. A
variety of factors, such as translation, inappropriate item
coverage, different response format and style, and social
desirability, can affect the psychometric properties of
instruments when different cultural or ethnic minority
groups are compared (e.g., Van de Vijver & Leung,
2000). What should a researcher do when
invariance fails? On the basis of current simulations,
readers may be tempted to make the following inference: If
we allow the non-invariant factor loadings (and/or
intercepts) to vary across groups (i.e., if we do not
impose measurement invariance under the condition of
non-invariance), bias in statistics (e.g., regression slopes
or means) will not occur, and thus, it is appropriate to
make group comparisons. However, there are two issues
associated with this line of reasoning. First, when a
construct does not meet the standards of measurement
invariance, it implies that, conceptually, the construct
conveys different meanings in different groups. Second, lack
of invariance can introduce bias in statistics indirectly,
even when measurement invariance is not imposed. If the
construct had been measured appropriately, the regression
slopes (and/or means) would be different.
Dealing with non-invariant scales has
become one of the unresolved questions in measurement
invariance research (Millsap, 2005). As
Millsap and
Kwok (2004) point out, four typical
approaches have been suggested in practice. The first option
is to eliminate the non-invariant items, which results in
many different versions of a scale for different groups
(Cheung
& Rensvold, 1998). It can
also lead to incomplete coverage of the construct. The
second choice is to keep all non-invariant items in the
scale, and thus, the sum/mean score contains both invariant
and non-invariant items. The assumption of this approach is
that the non-invariant items may introduce little bias in
group comparisons. As discovered in the present study and
Millsap and
Kwok's (2004) work, this is an
assumption about which we cannot be confident. As found in
the present literature review, it is common to have a
proportion of the items invariant and another proportion
non-invariant. The third option is a partial
measurement-invariance model, proposed to address this issue
(Byrne, Shavelson, & Muthen, 1989). This
approach constrains the invariant items to be equal across
the groups while allowing the non-invariant items to be
different, and it seems less likely to introduce statistical
bias, compared with the mean/sum score methods, as the
non-invariant items are not forced to be invariant. However,
as discussed earlier, some critical questions are still not
addressed: What would the regression slopes and means be had
the construct been measured properly in different cultures?
Under what conditions should one employ a partial invariance
model? As the proportion of non-invariant items increases,
confidence decreases about the validity of this approach.
Even when only a small proportion of the items are
different, the following conceptual questions remain: Why
are those items different? Is it due to specific samples or
due to the scale? How could those aspects of the construct
be measured differently? What are the implications for
rethinking the construct? It is important to take one step
further to examine the non-invariant items, as well as the
conceptualization of the construct. The fourth option is to
avoid making direct group comparisons. Other researchers
have suggested that it seems reasonable to statistically
adjust for bias introduced by non-invariant items
(Cheung
& Rensvold, 1998). However,
there are currently no sound methods for achieving this
goal. Finally, lack of measurement invariance in a
one-factor model may indicate more factors or more complex
loading patterns. Once additional factors or different
factor-loading patterns are allowed, measurement invariance
can be achieved (McArdle & Cattell, 1994;
Meredith,
1993).
This article recommends a different
approach. When measurement invariance is not achieved at an
appropriate level, a researcher may still wish to draw some
useful conclusions with regard to cross-cultural comparisons
after spending a tremendous amount of time, effort, and
resources. It is possible that the consequences of lack of
invariance on the research questions are limited. To help
researchers decide when it is appropriate to make group
comparisons when facing lack of invariance, the following
steps are proposed: (1) Testing measurement invariance for
each construct independently and examining the pattern of
lack of invariance. The goal is to understand the degree and
direction of lack of invariance. As demonstrated in the
present investigation, when lack of invariance is uniform,
bias in regression slopes or means tends to occur; however,
when lack of invariance is mixed, bias tends to be reduced.
(2) Imposing corresponding invariance constraints on
invariant items (e.g., factor loading invariance for
comparing regression slopes and loading and intercept
invariance for comparing means). (3) Imposing corresponding
invariance constraints on non-invariant items, as well as
invariant items. (4) Comparing groups on statistics under
study, such as regression coefficients and means, with and
without imposing corresponding invariance constraints on
non-invariant items (i.e., comparing results from Step 2 and
Step 3 to determine the discrepancy between the statistics
under study). The purpose is to understand the impact of
non-invariance on these statistics. If the differences are
small, it may be justifiable to make group comparisons.
However, future research should examine the effect size of
these group differences as well as their practical
implications. The reader is warned again that trivial
differences in statistics do not imply that the construct is
conceptually equivalent.
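The Step 4 discrepancy check might be operationalized as in the sketch below; the helper function and the slope values are hypothetical placeholders, not part of the proposed procedure's formal specification (in practice, the estimates come from the two fitted multiple-group models of Steps 2 and 3).

```python
def relative_discrepancy(partial_est: float, full_est: float) -> float:
    """Relative change in a statistic when non-invariant items are also
    constrained (Step 3) versus left free (Step 2)."""
    return (full_est - partial_est) / partial_est

# Hypothetical group-specific regression slopes from the two models:
slopes_partial = {"reference": 0.48, "focal": 0.51}  # Step 2 estimates
slopes_full = {"reference": 0.45, "focal": 0.56}     # Step 3 estimates

for group in slopes_partial:
    d = relative_discrepancy(slopes_partial[group], slopes_full[group])
    print(f"{group}: {d:+.1%}")
```

If such discrepancies are small, group comparisons may be justifiable; if they are large, the non-invariant items are driving the statistics, and comparisons warrant caution.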
Limitations
A major assumption in the present
investigation is that all variables are continuous and are
normally distributed. When the response scales are discrete
categories, alternative estimation methods, such as weighted
least squares (Bollen, 1989), or alternative
frameworks, such as item response theory (Embretson & Reise,
2000), should be used. One direction in
future research is to systematically examine bias under
various levels of invariance for categorical variables. This
project also focused on only two-group comparisons, and
future research should expand the scope to more groups, as
in many applications, several different cultural groups can
be involved at the same time.
Multiple group confirmatory factor
analysis provides a method of testing construct equivalence
(i.e., whether the same construct is measured across
cultural groups). Construct equivalence, however, cannot
always be tested statistically (Cheung & Rensvold,
2000), particularly when a construct has
a wider scope in one culture than in another. For example,
filial piety is a more highly elaborated construct in China
than it is in the West (Hsieh, 1967). Measuring
this kind of construct may require more items in one culture
than it does in the other. Therefore, a particular set of
items may be conceptually adequate for assessing a construct
in one culture, be inadequate in another culture, and yet
display measurement equivalence when tested across both
cultures. To avoid this type of bias, both common and
culturally specific features should be included in the
measurement. When this is the case, one would expect that
the common features are invariant, whereas the specific
features are not invariant across cultural groups.
Conclusion
Lack of measurement invariance can have a
significant impact on the conclusions drawn from group
comparisons. When measurement invariance is not achieved,
one may discover “spurious” group
differences that are in fact artifacts of measurement, or
one may miss true group differences that have been masked by
measurement artifacts. The implications of these simulation
studies go beyond cross-cultural research; they extend to any
situation in which heterogeneous groups are compared. These
groups include ethnicity, gender,
age, measurement occasion in longitudinal research, and
treatment/control groups in experimental and prevention
studies.
Footnotes
1 Residual variance consists of unique variance to that item and
random error. It is assumed that the expected value of the random
error is zero.
2 In
1993, Meredith published the most influential article on measurement
invariance.
3 The
search ended in August of 2006. Thirty-two more articles met the
first three criteria, but the factor loadings for each
cultural/ethnic group were not available, and these articles were
thus excluded from the analysis.
4 Reliability also depends on the variability of the scores.
5 A
composite score is formed by taking the sum of all items.
6 The
average ratio of sample size was 1.34 and 8.01 for standardized and
unstandardized comparisons, respectively. The average of these two
numbers was 4.67. An outlier of 47.15 was excluded from the
analysis.
7 A
model is identified only if there is a unique numerical solution for
each of the parameters (Ullman, 2001). One common approach is to
set one of the factor loadings to 1 for each factor, and that item
is called the marker variable.
8 This pattern of results holds as long as the marker variable is
invariant.
References
Berry, J.
W. (1969). On
cross-cultural comparability.
International Journal of Psychology,
2,
119–128.
Bollen, K.
A. (1989). Structural equations
with latent variables. New
York: Wiley.
Brewer,
M. B., &
Chen,
Y.-R.
(2007). Where (who) are
collectives in collectivism? Toward conceptual clarification
of individualism and collectivism.
Psychological Review, 114,
133–151.
Byrne,
B. M., &
Campbell, T.
L.
(1999). Cross-cultural comparisons
and the presumption of equivalent measurement and
theoretical structure: A look beneath the
surface. Journal of Cross-Cultural
Psychology, 30,
555–574.
Byrne,
B.,
Shavelson,
R., &
Muthen,
B.
(1989). Testing for the
equivalence of factor covariance and mean structures: The
issue of partial measurement invariance.
Psychological Bulletin,
105,
456–466.
Byrne,
B. M., &
Watkins,
D.
(2003). The issue of measurement
invariance revisited. Journal of
Cross-Cultural Psychology, 34,
155–175.
Campbell,
D. T., &
Boruch, R.
F.
(1975). Making the case for
randomized assignment to treatments by considering the
alternatives: Six ways in which quasi-experimental
evaluations in compensatory education tend to underestimate
effects. In C. A.Bennett & A. A.Lumsdaine (Eds.),
Evaluation and experiment: Some critical issues in
assessing social programs (pp.
195–296).
New York:
Academic Press.
Chen,
C.,
Lee, S.
Y., &
Stevenson, H.
W.
(1995). Response style and
cross-cultural comparisons of rating scales among East Asian
and North American students.
Psychological Science, 6,
170–175.
Chen, F.
F. (2007). Sensitivity of
goodness of fit indices to lack of measurement
invariance. Structural Equation
Modeling, 14,
464–504.
Chen,
F. F.,
Sousa, K.
H., &
West, S.
G.
(2005). Testing measurement
invariance of second-order factor models.
Structural Equation Modeling,
12,
471–492.
Chen,
F. F., &
West, S.
G.
(2008). Measuring individualism and
collectivism: The importance of considering different
components, reference groups, and measurement
invariance. Journal of Research in
Personality, 42,
259–294.
Cheung,
G. W., &
Rensvold, R.
B.
(1998). Cross-cultural comparisons
using non-invariant measurement items.
Applied Behavioral Science Review,
6,
93–110.
Cheung,
G. W., &
Rensvold, R.
B.
(2000). Assessing extreme and
acquiescence response sets in cross-cultural research using
structural equation modeling. Journal
of Cross-Cultural Research, 31,
187–212.
Costa,
P. T., &
McCrae, R.
R.
(1992). Professional manual: Revised NEO
Personality Inventory (NEO PI–R) and NEO five
factor inventory (NEO–FFI).
Odessa, FL:
Psychological Assessment
Resources.
Diener,
E.,
Emmons, R.
A.,
Larsen, R.
J., &
Griffin,
S.
(1985). The Satisfaction With
Life Scale. Journal of Personality
Assessment, 49,
71–75.
Embretson,
S. E., &
Reise, S.
P.
(2000). Item response theory for
psychologists.
Mahwah:
NJ: Erlbaum.
Fiske,
A. P.,
Kitayama,
S.,
Markus, H.
R., &
Nisbett, R.
E.
(1998). The cultural matrix of social
psychology. In D. T.Gilbert, S. T.Fiske, & G.Linzey (Eds.), Handbook
of social psychology (4th ed.,
pp.
915–981).
Boston:
McGraw-Hill.
Heine,
S. J.,
Lehman, D.
R.,
Markus, H.
R., &
Kitayama,
S.
(1999). Is there a universal need
for positive self-regard?Psychological Review, 106,
766–794.
Heine,
S. J.,
Lehman, D.
R.,
Peng,
K., &
Greenholtz,
J.
(2002). What's wrong with
cross-cultural comparisons of subjective Likert scales? The
reference group effect. Journal of
Personality and Social Psychology,
82,
903–918.
Holland,
P. W., &
Thayer, D.
T.
(1988). Differential item performance
and the Mantel-Haenszel procedure. In H.Wainer & H.Braum (Eds.), Test
validity (pp.
129–145).
Hillsdale, NJ:
Erlbaum.
Horn,
J. L.,
McArdle, J.
J., &
Mason,
R.
(1983). When is invariance not
invariant: A practical scientist's look at the ethereal
concept of factor invariance.
Southern Psychologist, 4,
179–188.
Hsieh, Y.-W. (1967). Filial piety and Chinese society. In C. A. Moore (Ed.), The Chinese mind: Essentials of Chinese philosophy and culture (pp. 165–187). Honolulu: University of Hawaii Press.
Hui, C. H., & Triandis, H. C. (1985). Measurement in cross-cultural psychology: A review and comparison of strategies. Journal of Cross-Cultural Psychology, 16, 131–152.
Irvine, S. H., & Carroll, W. K. (1980). Testing and assessment across cultures: Issues in methodology and theory. In H. C. Triandis & J. W. Berry (Eds.), Handbook of cross-cultural psychology (Vol. 2, pp. 181–244). Newton, MA: Allyn & Bacon.
Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36, 409–426.
Jöreskog, K. G., & Sörbom, D. (1999). LISREL 8: User's reference guide (2nd ed.). Chicago: Scientific Software International.
Kwan, V. S. Y., Bond, M. H., Boucher, H. C., Maslach, C., & Gan, Y. (2002). The construct of individuation: More complex in collectivist than in individualist cultures. Personality and Social Psychology Bulletin, 28, 300–310.
Lehman, D., Chiu, C., & Schaller, M. (2004). Psychology and culture. Annual Review of Psychology, 55, 689–717.
Little, T. D. (1997). Mean and covariance structures (MACS) analyses of cross-cultural data: Practical and theoretical issues. Multivariate Behavioral Research, 32, 53–76.
Markus, H. R., & Kitayama, S. (1991). Culture and the self: Implications for cognition, emotion, and motivation. Psychological Review, 98, 224–253.
Maslach, C., Stapp, J., & Santee, R. T. (1985). Individuation: Conceptual analysis and assessment. Journal of Personality and Social Psychology, 49, 729–738.
McArdle, J. J., & Cattell, R. B. (1994). Structural equation models of factorial invariance in parallel proportional profiles and oblique confactor problems. Multivariate Behavioral Research, 29, 63–113.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525–543.
Millsap, R. E. (2005). Four unresolved problems in studies of factorial invariance. In A. Maydeu-Olivares & J. McArdle (Eds.), Contemporary psychometrics: A festschrift for Roderick P. McDonald (Multivariate applications book series, pp. 153–171). Mahwah, NJ: Erlbaum.
Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297–334.
Millsap, R. E., & Kwok, O.-M. (2004). Evaluating the impact of partial factorial invariance on selection in two populations. Psychological Methods, 9, 93–115.
Moorman, R. H., & Podsakoff, P. M. (1992). A meta-analytic review and empirical test of the potential confounding effects of social desirability response sets in organizational behavior research. Journal of Occupational and Organizational Psychology, 65, 131–149.
Muthén, L. K., & Muthén, B. O. (1998). Mplus user's guide. Los Angeles: Muthén & Muthén.
Oishi, S., & Sullivan, H. W. (2005). The mediating role of parental expectations in culture and well-being. Journal of Personality, 73, 1267–1294.
Oyserman, D., Coon, H. M., & Kemmelmeier, M. (2002). Rethinking individualism and collectivism: Evaluation of theoretical assumptions and meta-analyses. Psychological Bulletin, 128, 3–72.
Peng, K., Nisbett, R. E., & Wong, N. Y. (1997). Validity problems comparing values across cultures and possible solutions. Psychological Methods, 2, 329–344.
Poortinga, Y. H. (1989). Equivalence of cross-cultural data: An overview of basic issues. International Journal of Psychology, 24, 737–756.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552–566.
Rhee, E., Uleman, J. S., & Lee, H. K. (1996). Variations in collectivism and individualism by ingroup and culture: Confirmatory factor analysis. Journal of Personality and Social Psychology, 71, 1037–1054.
Riordan, C. M., & Vandenberg, R. J. (1994). A central question in cross-cultural research: Do employees of different cultures interpret work-related measures in an equivalent manner? Journal of Management, 20, 643–671.
Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press.
Smith, L. L., & Reise, S. P. (1998). Gender differences on negative affectivity: An IRT study of differential item functioning on the Multidimensional Personality Questionnaire Stress Reaction Scale. Journal of Personality and Social Psychology, 75, 1350–1362.
Sörbom, D. (1978). An alternative to the methodology for analysis of covariance. Psychometrika, 43, 381–396.
Steenkamp, J.-B. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78–90.
Ullman, J. B. (2001). Structural equation modeling. In B. G. Tabachnick & L. S. Fidell (Eds.), Using multivariate statistics (pp. 653–771). Boston: Allyn & Bacon.
Van de Vijver, F. J. R., & Leung, K. (1997). Methods and data analysis for comparative research. In J. Berry, Y. Poortinga, & J. Pandey (Eds.), Handbook of cross-cultural psychology (Vol. 1, pp. 259–300). Boston: Allyn & Bacon.
Van de Vijver, F., & Leung, K. (2000). Methodological issues in psychological research on culture. Journal of Cross-Cultural Psychology, 31, 33–51.
Widaman, K. F., & Reise, S. P. (1997). Exploring the measurement invariance of psychological instruments: Applications in the substance use domain. In K. J. Bryant, M. Windle, & S. G. West (Eds.), The science of prevention: Methodological advances from alcohol and substance abuse research (pp. 281–324). Washington, DC: American Psychological Association.
APPENDIX
Appendix A: Model Parameters for Study 1
Appendix B: Model Parameters for Study 2
Appendix C: Model Parameters for Study 3
Submitted: September 5, 2005 Revised: May 9, 2008 Accepted: May 20, 2008
Copyright 2008 American Psychological Association
Source: Journal of Personality and Social Psychology, 95(5), 1005–1018. Digital Object Identifier: 10.1037/a0013193