Comparison of the procedures of Fleishman and Ramberg et al. for generating non-normal data in simulation studies

Abstract: Simulation techniques must be able to generate the types of distributions most commonly encountered in real data, for example, non-normal distributions. Two recognized procedures for generating non-normal data are Fleishman's linear transformation method and the method proposed by Ramberg et al., which is based on a generalization of the Tukey lambda distribution. This study compares these procedures in terms of the extent to which the distributions they generate fit their respective theoretical models, and it also examines the number of simulations needed to achieve this fit. To this end, the paper considers, in addition to the normal distribution, a series of non-normal distributions that are commonly found in real data, and then analyses fit according to the extent to which normality is violated and the number of simulations performed. The results show that the two data generation procedures behave similarly. As the degree of contamination of the theoretical distribution increases, so does the number of simulations required to ensure a good fit to the generated data. The two procedures generate more accurate normal and non-normal distributions when at least 7000 simulations are performed, although when the degree of contamination is severe (with values of skewness and kurtosis of 2 and 6, respectively) it is advisable to perform 15000 simulations.
Keywords: Simulation; Monte Carlo; data generators; non-normal data; number of simulations.


Introduction
Monte Carlo simulation studies are widely used by researchers in the health and social sciences (Burton, Altman, Royston & Holder, 2006). One of the aims of these studies is to evaluate and compare the robustness of different statistical procedures when the assumptions regarding the underlying distribution are not fulfilled. The parametric tests most commonly used in applied research (e.g. ANOVA) require that the assumption of normality be fulfilled; in other words, the dependent variable must be distributed according to the normal curve. However, the variables encountered in the field of health and social sciences often do not follow a normal distribution (Blanca, Arnau, Bono, López-Montiel & Bendayan, 2013; Limpert, Stahel & Abbt, 2001; Micceri, 1989). Examples of such variables in the health sciences are survival times for certain types of cancer (Claret et al., 2009; Qazi, DuMez & Uckun, 2007) and the age at onset of Alzheimer's disease (Horner, 1987), while examples in the social sciences include social support (Matud, Carballeira, Lopez, Marrero & Ibáñez, 2002), physical and verbal aggression in couple relationships (Soler, Vinyak & Quadagno, 2000), certain psychosocial aspects of addictions (Deluchi & Bostrom, 2004), post-traumatic stress (Sullivan & Holt, 2008), reaction times or response latency (Shang-Wen & Ming-Hua, 2010; Ulrich & Miller, 1993; Van der Linden, 2006), certain attentional skills (Brown, Weatherholt & Burns, 2010) and variables of a psychophysiological nature (Keselman, Wilcox & Lix, 2003).
The first step in any Monte Carlo simulation study is to generate data that reflect the characteristics of the distributions that one wishes to simulate. Consequently, the quality of the results produced by such studies largely depends on the accuracy and suitability of the data generation procedure used (Niederreiter, 1992). The criteria used to evaluate this include how long the process takes (Afflerbach, 1990), its replicability (Ripley, 1990) and the degree to which the generated distribution fits the theoretical model (Bang, Shumacker & Schlieve, 1998), with this latter criterion being of particular interest for determining the accuracy of the procedure. In this context, it is especially important to evaluate the suitability of data generators for generating non-normal distributions, both known and unknown (Demirtas, 2007), as these types of distributions are commonly encountered in real data (Blanca et al., 2013; Limpert et al., 2001; Micceri, 1989).
There are many useful procedures for generating non-normal data, although in the health and social sciences particular mention should be made of Fleishman's linear transformation method (Fleishman, 1978) and the method proposed by Ramberg, Dudewicz, Tadikamalla and Mykytka (1979), which is based on a generalization of the Tukey lambda distribution.
Anales de Psicología, 2014, vol. 30, no. 1 (January)
The procedure proposed by Fleishman (1978) uses a polynomial transformation to generate non-normal data. Specifically, it takes a linear combination of a random normal variable, its square and its cube, and for the univariate case it is defined as shown in (1),
$$Y = a + bX + cX^2 + dX^3 \tag{1}$$
where Y is the generated variable and X is a normally distributed random variable with mean 0 and variance 1. The coefficients a, b, c and d are obtained by solving a system of polynomial equations determined by the desired values of the third and fourth moments (i.e. skewness, γ1, and kurtosis, γ2). The first and second moments are arbitrarily set at 0 and 1, respectively.
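To make this calculation concrete, the system that Fleishman (1978) solves for b, c and d is usually written as below. This is a sketch in a layout chosen here, not a reproduction of the paper's appendix, and it assumes that the transformation is Y = a + bX + cX² + dX³ with X standard normal and that γ2 denotes excess kurtosis (0 for the normal distribution).

```latex
\begin{aligned}
b^2 + 6bd + 2c^2 + 15d^2 - 1 &= 0 && \text{(unit variance)}\\
2c\,\bigl(b^2 + 24bd + 105d^2 + 2\bigr) - \gamma_1 &= 0 && \text{(skewness)}\\
24\,\bigl[\,bd + c^2(1 + b^2 + 28bd) + d^2(12 + 48bd + 141c^2 + 225d^2)\,\bigr] - \gamma_2 &= 0 && \text{(kurtosis)}\\
a &= -c && \text{(zero mean)}
\end{aligned}
```

As a quick check, b = d = 0 and c = 1/√2 give Y = (X² − 1)/√2, a standardized chi-square variable with one degree of freedom. The equations then return a variance of 1, a skewness of 2√2 ≈ 2.83 and a kurtosis of 12, which are the known values for that distribution.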
The procedure proposed by Ramberg et al. (1979) involves a generalization of the Tukey lambda distribution, originally developed by Ramberg and Schmeiser (1972, 1974) with the aim of generating random variables (Karian & Dudewicz, 2000). The inverse distribution function of the generalized lambda distribution (GLD), which includes the original lambda distribution as the special case λ1 = 0 and λ2 = λ3 = λ4, is defined as shown in (2),
$$x = F^{-1}(p) = \lambda_1 + \frac{p^{\lambda_3} - (1 - p)^{\lambda_4}}{\lambda_2}, \qquad 0 \le p \le 1 \tag{2}$$
where p is a uniform random variable on (0, 1) and x follows the GLD. Ramberg et al. (1979) explored how to determine the distribution parameters using the first four moments, and how to fit the resulting distribution. The skewness and kurtosis of the GLD are determined by λ3 and λ4. Given these values, the variance is determined by λ2, whereas the mean can be set to any value through λ1.
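Because the GLD is defined through its inverse distribution function, data can be drawn by plugging uniform random numbers into that function. The following Python sketch illustrates the idea; it is not the SAS/IML code used in the study. The function names are hypothetical, and the lambda values are the commonly cited approximation to the standard normal distribution, an assumption made here for illustration rather than values taken from Table 3.

```python
import random


def gld_quantile(p, lam1, lam2, lam3, lam4):
    """Inverse distribution function of the GLD: maps p in [0, 1] to x."""
    return lam1 + (p ** lam3 - (1.0 - p) ** lam4) / lam2


def gld_sample(n, lam1, lam2, lam3, lam4, rng):
    """Draw n GLD values by inverse-transform sampling of uniform numbers."""
    return [gld_quantile(rng.random(), lam1, lam2, lam3, lam4) for _ in range(n)]


# Assumed illustrative values: a widely quoted GLD approximation to N(0, 1).
NORMAL_LAMBDAS = (0.0, 0.1975, 0.1349, 0.1349)

rng = random.Random(12345)  # fixed seed so the run is replicable
sample = gld_sample(7000, *NORMAL_LAMBDAS, rng)

mean = sum(sample) / len(sample)
var = sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)
print(round(mean, 3), round(var, 3))
```

With λ3 = λ4 the distribution is symmetric about λ1, so the median of the generated values should lie near 0, and the sample variance should be close to 1 for these lambda values.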
In recent research, controversy has arisen regarding which data generation procedure is the most suitable for generating accurate non-normal distributions. The Fleishman (1978) method is one of the most widely used in simulation studies, and it has the advantage of being simple, quick and easily generalized to the generation of multivariate non-normal data, through the procedure described by Vale and Maurelli (1983). However, some authors have argued that the procedure proposed by Ramberg et al. (1979) is able to generate more extreme non-normal distributions, although for non-extreme non-normal distributions the two procedures offer equivalent speed and ease of execution (Tadikamalla, 1980). Likewise, other studies have pointed out that the two procedures are similar and present the same limitations when generating non-normal distributions with extreme values of skewness and/or kurtosis (Headrick & Kowalchuk, 2007; Headrick, Sheng & Hodis, 2007).
As noted above, the fit of the generated distribution to the theoretical model is a key criterion for determining the accuracy of a data generation procedure, and some studies have highlighted the importance of investigating how the number of simulations performed affects the quality of Monte Carlo studies in general (Harwell, Stone, Hsu & Kirisci, 1996; Díaz-Emparanza, 2002), and the fit of the generated distribution to the theoretical model in particular (Bang et al., 1998; Luo, 2011). Recent research has provided fairly consistent results in this regard, and it is generally accepted that a larger number of simulations yields better quality (Burton et al., 2006; Díaz-Emparanza, 2002). In Monte Carlo studies the number of simulations is usually somewhere between 100 and 100000 (Burton et al., 2006), most commonly 1000 (e.g. Collier, Baker, Mandeville & Hayes, 1967; Kowalchuk, Keselman, Algina & Wolfinger, 2004), 5000 (e.g. Keselman, Othman, Wilcox & Fradette, 2004; Lix & Keselman, 1996) or 10000 (e.g. Livacic-Rojas, Vallejo & Fernández, 2006; Wilcox, 2004).
However, very few studies have examined the effect of the number of simulations on the fit of the generated distribution to the theoretical model, and the results to date are inconclusive. For example, Kashyap, Butt and Bhattacharjee (2009) found that 5400 simulations were sufficient for an adequate fit to the Bernoulli distribution, while Bang et al. (1998) reported that 10000 simulations were enough to generate data from the normal distribution. However, these studies did not consider the wide range of possible distributions or the number of simulations most commonly used in Monte Carlo simulation studies. More recently, Luo (2011) examined the accuracy of the Fleishman (1978) method and took into account a wide variety of non-normal distributions with values of skewness between 0 and 1.25 and kurtosis values between 1 and 4. This study only considered a maximum of 2000 simulations, and the results show that the procedure becomes less accurate as skewness increases, and also that this number of simulations is insufficient when the degree of contamination is severe.
In light of the above, the aim of the present study was to compare the suitability of the data generation procedures proposed by Fleishman (1978) and Ramberg et al. (1979) in terms of the fit of the generated distribution to the corresponding theoretical model, with both the number of simulations and the degree of contamination of the distribution being modified. To this end, we selected a series of non-normal distributions defined by the skewness and kurtosis values most commonly found in real data. These distributions were generated by means of the two above-mentioned procedures and the number of simulations was varied. Each distribution was then compared with its respective theoretical distribution, and the degree of fit was calculated according to the deviation in the skewness and kurtosis coefficients.

Method
The data were generated using SAS/IML, which was chosen because it is one of the most suitable software packages for simulating data (Kashyap et al., 2009). The study variables were:
a. Degree of contamination of the distribution. The theoretical distributions selected were the normal distribution and a series of unknown non-normal distributions, defined by the skewness and kurtosis values most commonly encountered in real health and social sciences data. Blanca et al. (2013) calculated the skewness and kurtosis coefficients in 693 real data sets derived from measures of psychological variables and found, in line with other authors (Micceri, 1989), that only a small percentage of distributions were normal. More specifically, they found that skewness values ranged between -2.49 and 2.33, while those for kurtosis were between -1.92 and 7.41. Considering skewness and kurtosis together, only 5.5% of distributions were close to expected values under normality; in terms of absolute values of skewness and kurtosis, 39.9% fell between 0.26 and 0.75, 34.5% between 0.76 and 1.25, 10.4% between 1.26 and 1.75, 2.6% between 1.76 and 2.25, and 7.1% were greater than 2.25. On the basis of these results we selected skewness and kurtosis coefficients that represented different degrees of contamination with respect to normality, from mild to severe. These values are shown in Table 1.
b. Type of data generator. Data were generated by means of two procedures, the Fleishman (1978) method and that developed by Ramberg et al. (1979). In the former we used the coefficients a, b, c and d for the values of skewness and kurtosis that are shown in the table of Fleishman (1978). For those values that do not appear in this table the coefficients a, b, c and d were calculated by means of a polynomial transformation in SAS, using the syntax shown in Appendix I. In order to generate data by means of the procedure proposed by Ramberg et al. (1979) we used lambda values (λ1, λ2, λ3 and λ4) for the different values of skewness and kurtosis shown in the tables of Karian and Dudewicz (2000). The values of the coefficients a, b, c and d and the lambda values are shown in Tables 2 and 3, respectively.
c. Number of simulations. Data were generated for the different theoretical distributions using between 1000 and 15000 simulations (number of iterations), in steps of 1000. Subsequently, and so as to ensure that the statistical analysis was interpretable, the number of simulations was grouped into five categories: 1000-3000, 4000-6000, 7000-9000, 10000-12000 and 13000-15000.
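The SAS syntax of Appendix I is not reproduced in this section. As a rough illustration of how Fleishman coefficients can be obtained for a pair of skewness and kurtosis values that is not tabulated, the Python sketch below applies Newton's method to Fleishman's moment equations. It is an assumption-laden sketch rather than the authors' code: the function names are hypothetical, kurtosis is taken to be excess kurtosis (0 for the normal distribution), the Jacobian is approximated numerically, and convergence from the chosen starting point is not guaranteed for extreme targets.

```python
import random


def fleishman_residuals(b, c, d, skew, kurt):
    """Fleishman's moment equations; all three are zero at a solution."""
    f1 = b**2 + 6*b*d + 2*c**2 + 15*d**2 - 1
    f2 = 2*c*(b**2 + 24*b*d + 105*d**2 + 2) - skew
    f3 = 24*(b*d + c**2*(1 + b**2 + 28*b*d)
             + d**2*(12 + 48*b*d + 141*c**2 + 225*d**2)) - kurt
    return [f1, f2, f3]


def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, 3):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, 4):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        known = sum(aug[r][k] * x[k] for k in range(r + 1, 3))
        x[r] = (aug[r][3] - known) / aug[r][r]
    return x


def fleishman_coefficients(skew, kurt, tol=1e-10, max_iter=100, h=1e-6):
    """Newton iteration on (b, c, d), started at the identity transform."""
    x = [1.0, 0.0, 0.0]
    for _ in range(max_iter):
        f = fleishman_residuals(*x, skew, kurt)
        if max(abs(v) for v in f) < tol:
            b, c, d = x
            return -c, b, c, d  # a = -c keeps the mean at 0
        jac = [[0.0] * 3 for _ in range(3)]
        for j in range(3):  # central-difference Jacobian, column by column
            up, down = x[:], x[:]
            up[j] += h
            down[j] -= h
            f_up = fleishman_residuals(*up, skew, kurt)
            f_down = fleishman_residuals(*down, skew, kurt)
            for i in range(3):
                jac[i][j] = (f_up[i] - f_down[i]) / (2 * h)
        step = solve3(jac, [-v for v in f])
        x = [xi + si for xi, si in zip(x, step)]
    raise RuntimeError("Newton iteration did not converge")


def fleishman_sample(n, coefficients, rng):
    """Apply a + b*X + c*X^2 + d*X^3 to n standard normal draws."""
    a, b, c, d = coefficients
    draws = (rng.gauss(0.0, 1.0) for _ in range(n))
    return [a + b*z + c*z**2 + d*z**3 for z in draws]


# Illustrative target only; not one of the pairs listed in Table 1.
coefficients = fleishman_coefficients(skew=0.5, kurt=0.5)
data = fleishman_sample(7000, coefficients, random.Random(2014))
```

Starting from b = 1 and c = d = 0 means that a normal target (skewness 0, kurtosis 0) is returned immediately, and mild targets are reached in a handful of iterations.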
In order to analyse the accuracy of the data generators used in the simulations we examined the fit of the generated distribution with respect to the theoretical model, calculating the differences in absolute values between the respective theoretical coefficients of skewness and kurtosis and those obtained in the generated distribution; these were labelled, respectively, the skewness deviation and the kurtosis deviation. For both variables, values of 0 indicate a perfect fit between the theoretical and generated distributions. The coefficients of skewness (3) and kurtosis (4) were calculated by means of SAS, which uses the following unbiased estimators:
$$\hat{\gamma}_1 = \frac{N}{(N-1)(N-2)} \sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{S_x} \right)^3 \tag{3}$$
$$\hat{\gamma}_2 = \frac{N(N+1)}{(N-1)(N-2)(N-3)} \sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{S_x} \right)^4 - \frac{3(N-1)^2}{(N-2)(N-3)} \tag{4}$$
where N is the number of observations, x_i represents the i-th value of the variable, x̄ corresponds to the mean and S_x is the standard deviation.
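The two fit measures can be sketched in a few lines of Python. The functions below assume the bias-adjusted skewness and kurtosis formulas that SAS documents for its default divisor (N - 1 for the variance), with kurtosis expressed as excess kurtosis; the function names are hypothetical and this is an illustration, not the code used in the study.

```python
import math


def _standardized(xs):
    """Return the values as z-scores, using the N - 1 standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return [(x - mean) / sd for x in xs]


def sample_skewness(xs):
    """Bias-adjusted skewness coefficient (requires at least 3 values)."""
    n = len(xs)
    z = _standardized(xs)
    return n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)


def sample_kurtosis(xs):
    """Bias-adjusted excess kurtosis coefficient (requires at least 4 values)."""
    n = len(xs)
    z = _standardized(xs)
    leading = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return leading * sum(v ** 4 for v in z) - correction


def deviations(xs, skew_theory, kurt_theory):
    """Absolute differences between theoretical and observed coefficients."""
    return (abs(skew_theory - sample_skewness(xs)),
            abs(kurt_theory - sample_kurtosis(xs)))


# Tiny hand-checkable example: a symmetric, flat set of values.
skew_dev, kurt_dev = deviations([1, 2, 3, 4, 5], skew_theory=0.0, kurt_theory=0.0)
```

For the values 1 to 5 the functions return a skewness of 0 and a kurtosis of about -1.2, so the deviations from a normal target are 0 and roughly 1.2. A deviation of 0 on both measures would indicate a perfect fit.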

Results
In order to analyse differences according to the type of data generator (Ramberg vs. Fleishman), the degree of contamination of the distribution (normal, mild, moderate, high and severe) and the number of simulations (1000-3000, 4000-6000, 7000-9000, 10000-12000 and 13000-15000) we conducted two 2 x 5 x 5 analyses of variance, one with skewness deviation and one with kurtosis deviation as the dependent variable. Each cell of the design contains three observations. Table 4 shows the results of the analysis.
The main effects show that there are no differences between the two data generators as regards the skewness deviation or kurtosis deviation. The total mean deviation for skewness was 0.04 (SD = 0.05), while that for kurtosis was 0.20 (SD = 0.30). With respect to the degree of contamination there were differences in terms of the kurtosis deviation but not in the skewness deviation. Specifically, the kurtosis deviation increased in line with the degree of contamination, and was at its highest when the simulated distribution showed severe contamination (Figure 1). As regards the number of simulations, this was associated with differences in both skewness deviation and kurtosis deviation. In general, both deviations decreased as the number of simulations increased, with lower deviations being produced with 7000 simulations or more (Figures 2 and 3).
The interaction effects show that the two data generators produce a similar pattern as regards the skewness and kurtosis deviations and the number of simulations, there being no interaction between these factors. However, the kurtosis deviation as a function of the degree of contamination did vary according to the number of simulations (Figure 4). In general, the kurtosis deviation is greater with 6000 simulations or fewer, and lower with 7000 simulations or more, for all the distributions. Note, however, that the greatest deviation occurs when simulating a distribution with severe contamination, especially when the number of simulations is less than 7000.
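The layout of the 2 x 5 x 5 design can be checked with a short Python sketch. It assumes, as the Method section describes, that each combination of generator and degree of contamination was run once at every number of simulations from 1000 to 15000 in steps of 1000; the names below are hypothetical.

```python
from collections import Counter
from itertools import product

GENERATORS = ("Fleishman", "Ramberg")
CONTAMINATION = ("normal", "mild", "moderate", "high", "severe")
SIZES = range(1000, 15001, 1000)  # 1000, 2000, ..., 15000


def size_band(n):
    """Map a number of simulations to its category, e.g. 5000 -> '4000-6000'."""
    low = 1000 + 3000 * ((n - 1000) // 3000)
    return f"{low}-{low + 2000}"


# Count how many runs fall into each generator x contamination x band cell.
cells = Counter(
    (generator, level, size_band(n))
    for generator, level, n in product(GENERATORS, CONTAMINATION, SIZES)
)

print(len(cells), sum(cells.values()))
```

Under that assumption there are 50 cells and 150 runs in total, so every cell holds the three observations reported above, one for each number of simulations in its category.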

Discussion
The main aim of this study was to compare two data generation procedures, the Fleishman (1978) method and that proposed by Ramberg et al. (1979), as regards the fit of the generated distribution to the theoretical model, with both the number of simulations and the degree of contamination of the distribution being modified. In addition to the normal distribution we considered a series of distributions with different degrees of contamination, specifically those most frequently encountered in health and social sciences research (Blanca et al., 2013). Fit was evaluated by calculating the differences in absolute values between the respective theoretical coefficients of skewness and kurtosis and those obtained in the generated distribution.
The results show that the data generation procedure proposed by Ramberg et al. (1979) is as accurate as the Fleishman (1978) method for both normal and non-normal distributions, a finding that is consistent with previous research (Headrick & Kowalchuk, 2007; Headrick, Sheng & Hodis, 2007). Both procedures also show less skewness deviation than kurtosis deviation when generating data. However, their accuracy differs as a function of the number of simulations and the degree of contamination with respect to normality. In general, as the degree of contamination of the distribution increases, so does the number of simulations required to ensure a good fit. These results confirm the findings of previous research (Burton et al., 2006; Díaz-Emparanza, 2002), which reported that increasing the number of simulations improves the quality of a simulation study, and they are also consistent with the conclusions reached by Luo (2011), who found that when using the Fleishman (1978) method a higher number of simulations is required as the degree of skewness in the desired distribution increases.
More specifically, one of the main conclusions to be drawn from the present study is that when deciding how many simulations are required to ensure accurate data generation it is necessary to take into account the values of skewness and kurtosis that one wishes to simulate. Both the data generators studied here are more accurate when generating normal and non-normal distributions with at least 7000 simulations, from which point the values of skewness and kurtosis deviation are close to zero. However, when the simulated distribution presents severe contamination, defined by skewness of 2 and kurtosis of 6, both procedures become less accurate and yield higher values of kurtosis deviation. In these cases the lowest deviation is produced between 13000 and 15000 simulations, where the kurtosis deviation takes a value of 0.20. Although these results cannot be directly compared with other studies, as this is the first report to compare the procedures of Fleishman (1978) and Ramberg et al. (1979) in terms of the number of simulations required to generate non-normal data, our findings are partially consistent with previous research stating that the quality of simulation studies can be increased by using 10000 simulations (Bang et al., 1998; Rasch & Guiard, 2004; Robey & Barcikowski, 1992).
In summary, the results show that the two data generation procedures behave similarly. However, as the degree of contamination of the theoretical distribution increases, so does the number of simulations required to ensure a good fit to the generated data. Future research should analyse the degree of fit with other types of known non-normal distributions that are widely used in Monte Carlo studies, for example, the double exponential or lognormal. Similarly, it would be useful to consider distributions with different values of skewness and kurtosis, with one of these being set at zero. Finally, it would also be interesting to replicate the present study with multivariate data so as to analyse the accuracy of the multivariate extensions of these two procedures.

Figure 1. Mean kurtosis deviation as a function of the degree of contamination of the distribution.

Figure 2. Mean skewness deviation as a function of the number of simulations.

Figure 3. Mean kurtosis deviation as a function of the number of simulations.

Figure 4. Mean kurtosis deviation as a function of the number of simulations and the degree of contamination.

Table 1. Values of skewness and kurtosis used in the present study to determine the degree of contamination of the distribution with respect to normality.

Table 2. Fleishman's (1978) a, b, c and d coefficients for each value of skewness and kurtosis for the distributions generated in the present study.
Note. γ1: value of the skewness coefficient; γ2: value of the kurtosis coefficient.

Table 3. Lambda values for each value of skewness and kurtosis for the distributions generated in the present study.

Table 4. Results of the 2 x 5 x 5 ANOVA, with the factors type of generator, degree of contamination and number of simulations, and the skewness deviation and kurtosis deviation as the dependent variables.