Validation of Psychometric Instruments with Classical Test Theory in Social and Health Sciences: A practical guide
Abstract
In recent years there has been a significant rise in the number of psychometric studies, together with crucial statistical advances in the assessment of validity and reliability. Given the importance of rigorous procedures in both the methodology and the score interpretation of tests and measurement scales, the editors-in-chief of the journal Annals of Psychology have drafted this guide to address the most relevant issues in applied psychometrics. To this end, the manuscript reviews the main topics within the Classical Test Theory framework (e.g., exploratory and confirmatory factor analysis, reliability, validity, and bias), aiming to synthesize and clarify best practices and to improve publication standards.
Copyright (c) 2024 Servicio de Publicaciones, University of Murcia (Spain)
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
The works published in this journal are subject to the following terms:
1. The Publications Service of the University of Murcia (the publisher) retains the property rights (copyright) of published works, and encourages and enables the reuse of the same under the license specified in paragraph 2.
2. The works are published in the online edition of the journal under a Creative Commons Attribution-ShareAlike 4.0 license (legal text). You can copy, use, distribute, transmit and publicly display them, provided that: i) you cite the author and the original source of publication (journal, publisher and URL of the work); ii) they are not used for commercial purposes; iii) you mention the existence and specifications of this license.
3. Conditions of self-archiving. Authors are allowed and encouraged to disseminate electronically the pre-print versions (the version before being evaluated and sent to the journal) and/or post-print versions (the version reviewed and accepted for publication) of their works before publication, as this encourages earlier circulation and dissemination and thus a possible increase in citations and reach within the academic community. RoMEO Color: Green.