The 2015 Gartner CIO Survey gathered data from 2,810 CIO respondents in 84 countries and all major industries, representing approximately $12.1 trillion in revenue/public-sector budgets.
For this report, we analyzed this data and supplemented it with interviews of 11 CIOs.
businesses need forward-looking predictive analytics combined with data-led experimentation (see figure below). Information and technology flip 3:
From backward-looking reporting, passive analysis of data, structured information and separate analytics, to forward-looking predictive analytics, active experimentation informed by data, new types of information,
IMPUTATION OF MISSING DATA
4.1 Single imputation
4.1.1 Unconditional mean imputation
4.1.2 Regression imputation
4.1.3 Expectation maximization imputation
4.2 Multiple imputation
5. NORMALISATION OF DATA
5.1
and factor analysis
6.1.2 Data envelopment analysis and Benefit of the doubt
Benefit of the doubt approach
6.1.3 Regression approach
General framework for the analysis
7.1.3 Inclusion/exclusion of individual sub-indicators
7.1.4 Data quality
We deal with the problem of missing data and with the techniques used to bring indicators of a very different nature into a common unit.
whereby a lot of work in data collection and editing is wasted or hidden behind a single number of dubious significance.
multivariate analysis, imputation of missing data and normalization techniques aim at supplying a sound and defensible dataset.
(d) a method for selecting groups of countries to impute missing data with a view to decreasing the variance of the imputed values.
Missing data are present in almost all composite indicators and they can be missing either in a random or in a nonrandom fashion.
whether data are missing at random or systematically, whilst most of the methods of imputation require a missing at random mechanism.
Three generic approaches for dealing with missing data can be distinguished, i.e. case deletion, single imputation or multiple imputation.
The other two approaches see the missing data as part of the analysis and therefore try to impute values through either Single Imputation (e.g.
and the use of 'expensive to collect' data that would otherwise be discarded. In the words of Dempster and Rubin (1983:
because it can lull the user into the pleasurable state of believing that the data are complete after all,
and imputed data have substantial bias. Whenever indicators in a dataset are incommensurate with each other,
The normalization method should take into account the data properties and the objectives of the composite indicator.
whether hard or soft data are available, whether exceptional behaviour needs to be rewarded/penalised, whether information on absolute levels matters,
partially, to correct for data quality problems in such extreme cases. The functional transformation is applied to the raw data to represent the significance of marginal changes in its level.
Different weights may be assigned to indicators to reflect their economic significance (collection costs, coverage, reliability and economic reason), statistical adequacy, cyclical conformity, speed of available data, etc.
such as weighting schemes based on statistical models (e.g. factor analysis, data envelopment analysis, unobserved components models), or on participatory methods (e.g. budget allocation, analytic hierarchy processes).
Weights may also reflect the statistical quality of the data, thus higher weight could be assigned to statistically reliable data (data with low percentages of missing values, large coverage, sound values).
In this case the concern is to reward only sub-indicators that are easy to measure and readily available, punishing the information that is more problematic to identify and measure.
selection of data, data quality, data editing (e.g. imputation), data normalisation, weighting scheme/weights, weights' values and aggregation method.
The composite indicator is no longer a magic number corresponding to crisp data treatment, weighting set or aggregation method,
along with the data, the weights and the documentation of the methodology. Given that composite indicators can be decomposed
or disaggregated so as to introduce alternative data, weighting, normalisation approaches etc., the components of composites should be available electronically so as to allow users to change variables, weights,
etc. and to replicate sensitivity tests.
2.1 Requirements for quality control
As mentioned above, the quality of a composite indicator is not only a function of the quality of its underlying data (in terms of relevance, accuracy, credibility, etc.)
) Table 2.1 The Pedigree Matrix for Statistical Information. Columns: Grade, Definitions & Standards, Data-collection & Analysis, Institutional Culture, Review. Row for grade 4: Negotiation, Task-force, Dialogue,
(h) a method for selecting groups of countries to impute missing data with a view to decreasing the variance of the imputed values.
say P<Q principal components that preserve a high amount of the cumulative variance of the original data.
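As a rough sketch of this step (a hypothetical illustration, not the report's own computation; the 95% cumulative-variance target and the column-standardisation choice are assumptions):

```python
import numpy as np

def retain_components(X, cum_var=0.95):
    """Keep the first P < Q principal components that preserve a given share
    of the cumulative variance of the (column-standardised) data X."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardise the indicators
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]                 # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    P = int(np.searchsorted(explained, cum_var)) + 1  # smallest P reaching the target
    return Z @ eigvecs[:, :P], explained, P
```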
because it means that the principal components are measuring different statistical dimensions in the data.
When the objective of the analysis is to present a large data set using a few variables, the hope in applying PCA is that some degree of economy can be achieved
Bootstrap refers to the process of randomly resampling the original data set to generate new data sets.
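In code, the resampling step might look like the following minimal sketch (the statistic, the number of replicates and the data array are placeholders):

```python
import numpy as np

def bootstrap(data, statistic, n_boot=1000, seed=0):
    """Randomly resample rows of the original data set (with replacement)
    to generate n_boot new data sets, returning the statistic of each."""
    rng = np.random.default_rng(seed)
    n = len(data)
    return np.array([statistic(data[rng.integers(0, n, size=n)])
                     for _ in range(n_boot)])

# e.g. a bootstrap distribution of the mean of one sub-indicator:
# dist = bootstrap(np.asarray(values), np.mean)
```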
whether the TAI data set for the 23 countries can be viewed as a 'random' sample of the entire population as required by the bootstrap procedures (Efron 1987;
Several points can be made regarding the issues of randomness and representativeness of the data. First, it is often difficult to obtain complete information for a data set in the social sciences because, unlike the natural sciences,
controlled experiments are not always possible, as in the case here. As Efron and Tibshirani (1993) state:
A third point on data quality is that a certain amount of measurement error is likely to exist.
While such measurement error can only be controlled at the data collection stage rather than at the analytical stage, it is argued that the data represent the best estimates currently available (United Nations, 2001, p. 46).
Figure 3.1 (right) demonstrates graphically the relationship between the eigenvalues from the deterministic PCA,
it is unlikely that they share common factors.
2. Identify the number of factors that are necessary to represent the data
Although social scientists may be attracted to factor analysis as a way of exploring data whose structure is unknown,
which variables are associated most with the outlier cases.
4. Assumption of interval data. Kim and Mueller (1978b
pp. 74-75) note that ordinal data may be used if it is thought that the assignment of ordinal categories to the data does not seriously distort the underlying metric scaling.
Likewise, these authors allow the use of dichotomous data if the underlying metric correlations between the variables are thought to be moderate (.7) or lower. The result of using ordinal data is that the factors may be much harder to interpret.
Note that categorical variables with similar splits will necessarily tend to correlate with each other, regardless of their content (see Gorsuch, 1983).
the more important it is to screen data for linearity.
6. Multivariate normality of data is required for related significance tests.
The smaller the sample size, the more important it is to screen data for normality.
Likewise, the inclusion of multiple definitionally-similar sub-indicators representing essentially the same data will lead to tautological results.
8. Strong intercorrelations are required not mathematically,
thereby defeating the data reduction purposes of factor analysis. On the other hand, too high inter-correlations may indicate a multi-collinearity problem
Sensitive to modifications in the basic data: data revisions and updates (e.g. new countries). Sensitive to the presence of outliers,
which may introduce a spurious variability in the data. Sensitive to small-sample problems, which are particularly relevant
when the focus is limited to a set of countries. Minimisation of the contribution of sub-indicators which do not move with the other sub-indicators.
Useful if the data are categorical in nature. Having decided how to measure similarity (the distance measure),
and hence different classifications may be obtained for the same data, even using the same distance measure.
which indicates that the data are represented best by ten clusters: Finland alone, Sweden and USA, the group of countries located between the Netherlands and Hungary, then alone Canada, Singapore, Australia, New Zealand, Korea, Norway, Japan.
Figure 3.2. Country clusters for the sub-indicators of technology achievement (standardised data). Type: k-means clustering (standardised data). Finally, expectation maximization (EM) clustering extends the simple k-means clustering in two ways:
so as to maximize the overall likelihood of the data, given the final clusters (Binder, 1981).
2. Unlike k-means,
EM can be applied both to continuous and categorical data. Ordinary significance tests are not valid for testing differences between clusters.
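A minimal sketch of the two approaches for the continuous case, using scikit-learn (the standardised array X of countries by sub-indicators and the number of clusters are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def cluster_countries(X, k=10, seed=0):
    """Hard k-means partition versus EM-based (Gaussian mixture) clustering,
    which assigns each country a probability of belonging to each cluster."""
    km_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    gm = GaussianMixture(n_components=k, random_state=seed).fit(X)
    em_probs = gm.predict_proba(X)    # soft memberships, one row per country
    return km_labels, em_probs
```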
Principal component analysis or Factor analysis) that summarize the common information in the data set by detecting non-observable dimensions.
or when it is believed that some of these do not contribute much to identifying the clustering structure in the data set,
because PCA or FA may identify dimensions that do not necessarily contribute much to perceiving the clustering structure in the data and that,
A discrete clustering model together with a continuous factorial one are fitted simultaneously to two-way data,
the data reduction and synthesis, simultaneously in the direction of objects and variables; Originally applied to short-term macroeconomic data,
factorial k-means analysis has a fast alternating least-squares algorithm that extends its application to large data sets.
The methodology can therefore be recommended as an alternative to the widely used tandem analysis.
3.3 Conclusions
Application of multivariate statistics,
then it must take its place as one of the important steps during the development of composite indicators.
4. Imputation of missing data
Missing data are present in almost all the case studies of composite indicators.
Data can be missing either in a random or in a nonrandom fashion. They can be missing at random because of malfunctioning equipment, weather issues, lack of personnel,
but there is no particular reason to consider that the collected data are substantially different from the data that could not be collected.
On the other hand, data are often missing in a nonrandom fashion. For example, if studying school performance as a function of social interactions in the home, it is reasonable to expect that data from students in particular types of home environments would be more likely to be missing than data from people in other types of environments.
More formally, the missing patterns could be: - MCAR (Missing Completely At Random): missing values do not depend on the variable of interest or on any other observed variable in the data set.
For example the missing values in variable income would be of MCAR type if (i) people who do not report their income have on average,
but they are conditional on some other variables in the data set. For example the missing values in income would be MAR
if the probability of missing data on income depends on marital status but, within each category of marital status,
One of the problems with missing data is that there is no statistical test for NMAR and often no basis upon
whether data are missing at random or systematically, whilst most of the methods that impute (i.e. fill in) missing values require an MCAR or at least an MAR mechanism.
Three generic approaches for dealing with missing data can be distinguished, i.e. case deletion, single imputation or multiple imputation.
The other two approaches see the missing data as part of the analysis and therefore try to impute values through either Single Imputation (e.g.
and the use of 'expensive to collect' data that would otherwise be discarded. The main disadvantage of imputation is that it can allow data to influence the type of imputation.
In the words of Dempster and Rubin (1983: The idea of imputation is both seductive and dangerous.
because it can lull the user into the pleasurable state of believing that the data are complete after all,
and imputed data have substantial bias. The uncertainty in the imputed data should be reflected by variance estimates.
This allows taking into account the effects of imputation in the course of the analysis. However
The literature on the analysis of missing data is extensive and in rapid development. Therefore
The predictive distribution must be created by employing the observed data. There are, in general, two approaches to generate this predictive distribution:
the danger of this type of modelling of missing data is to consider the resulting data set as complete
fill in blank cells with individual data drawn from similar responding units, e.g. missing values for individual income may be replaced with the income of another respondent with similar characteristics (age, sex, race, place of residence, family relationships, job, etc.).
and the time to converge depends on the proportion of missing data and the flatness of the likelihood function.
Another common method (called imputing means within adjustment cells) is to classify the data for the sub-indicator with some missing values in classes
thus the inference based on the entire dataset (including the imputed data) does not fully account for imputation uncertainty.
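A minimal pandas sketch of these two single-imputation options (the column names and the adjustment-cell variable are hypothetical):

```python
import pandas as pd

def mean_impute(df, col):
    """Unconditional mean imputation: replace missing values of a
    sub-indicator by the overall mean of the recorded values."""
    return df[col].fillna(df[col].mean())

def mean_impute_within_cells(df, col, cell_col):
    """Imputing means within adjustment cells: replace a missing value by
    the mean of the recorded values in the same class (e.g. region)."""
    return df[col].fillna(df.groupby(cell_col)[col].transform("mean"))
```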
For nominal variables, frequency statistics such as the mode or hot- and cold-deck imputation methods might be more appropriate.
4.1.3 Expectation maximization (EM) imputation
Suppose that X denotes the data.
In the likelihood based estimation the data are assumed to be generated by a model described by a probability
The probability function captures the relationship between the data set and the parameter θ of the data model.
If the observed variables are dummies for a categorical variable, then the predictions (4.2) are respondent means within classes defined by the variable
while the data set X is known, it makes sense to reverse the argument and look for the probability of observing a certain θ given the data set X: this is the likelihood function. Therefore, given X, the likelihood function L(θ|X) is any function of θ proportional to f(X|θ).
Assuming that missing data are MAR or MCAR, the EM consists of two components, the expectation (E) and maximization (M) steps.
just as if there were no missing data (thus missing values are replaced by estimated values, i.e. initial conditions in the first round of maximization).
In the E step the missing data are estimated by their expectations given the observed data and current estimated parameter values.
which, for complex patterns of incomplete data, can be a very complicated function of θ. As a result these algorithms often require algebraic manipulations and complex programming.
but careful computation is needed. 8 For NMAR mechanisms one needs to make assumptions on the missing-data mechanism
Ch. 15. Parameters in θ are re-estimated using maximum likelihood applied to the observed data augmented by the estimates of the unobserved data (coming from the previous round).
Effectively, this process maximizes, in each cycle, the expectation of the complete data log likelihood.
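The E/M cycle can be made concrete for the simplest non-trivial case: a bivariate normal model in which x is fully observed and y has missing values (a hypothetical sketch under MAR; the function and variable names are not from the report):

```python
import numpy as np

def em_bivariate_normal(x, y, n_iter=200, tol=1e-9):
    """EM estimation of the mean vector and covariance matrix of a bivariate
    normal (x, y) when y contains np.nan values and x is fully observed.
    E step: replace the missing sufficient statistics (y, y^2, x*y) by their
    conditional expectations given x and the current parameter values.
    M step: maximum-likelihood update of the parameters from the completed
    sufficient statistics."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    miss, obs = np.isnan(y), ~np.isnan(y)
    n = len(x)
    mu_x, var_x = x.mean(), x.var()                       # x is complete
    mu_y, var_y = y[obs].mean(), y[obs].var()             # initial conditions
    cov_xy = np.cov(x[obs], y[obs], bias=True)[0, 1]
    for _ in range(n_iter):
        beta = cov_xy / var_x
        cond_mean = mu_y + beta * (x[miss] - mu_x)        # E[y | x]
        cond_var = var_y - beta * cov_xy                  # Var(y | x)
        s_y = y[obs].sum() + cond_mean.sum()
        s_yy = (y[obs] ** 2).sum() + (cond_mean ** 2 + cond_var).sum()
        s_xy = (x[obs] * y[obs]).sum() + (x[miss] * cond_mean).sum()
        new_mu_y = s_y / n                                # M step
        new_var_y = s_yy / n - new_mu_y ** 2
        new_cov_xy = s_xy / n - mu_x * new_mu_y
        done = abs(new_mu_y - mu_y) < tol
        mu_y, var_y, cov_xy = new_mu_y, new_var_y, new_cov_xy
        if done:
            break
    y_imp = y.copy()
    y_imp[miss] = mu_y + (cov_xy / var_x) * (x[miss] - mu_x)
    return (mu_x, mu_y), np.array([[var_x, cov_xy], [cov_xy, var_y]]), y_imp
```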
To test this, different initial starting values for each parameter can be used.
4.2 Multiple imputation
Multiple imputation (MI) is a general approach that does not require a specification of a parametrized likelihood for all data.
The idea of MI is depicted in Figure 4.1. The imputation of missing data is performed with a random process that reflects uncertainty.
Figure 4.1. The logic of multiple imputation: the data set with missing values generates N completed sets (Set 1, Set 2, ..., Set N), each is analysed separately (Result 1, Result 2, ..., Result N), and the results are then combined.
It assumes that data are drawn from a multivariate Normal distribution and requires MAR or MCAR assumptions.
The theory of MCMC is understood most easily using Bayesian methodology (see Figure 4.2). Let us denote the observed data as Xobs and the complete dataset as X = (Xobs, Xmis).
we shall estimate it from the data, yielding an estimate θ̂, and use the distribution f(Xmis | Xobs).
Figure 4.2. Functioning of MCMC imputation: estimate the mean vector and covariance matrix from the data without missing values and use them to form the prior distribution; imputation step: simulate values for missing data items by randomly selecting a value from the available distribution of values; posterior step: re-estimate the mean vector and covariance matrix from the completed data; iterate until the distribution is stationary (i.e. the mean vector and covariance matrix are unchanged as we iterate); then use the imputation from the final iteration to form a data set without missing values.
whose distribution depends on the data. So the first step for its estimation is to obtain the posterior distribution of θ from the data.
Usually this posterior is approximated by a normal distribution. After formulating the posterior distribution of θ, the following imputation algorithm can be used:
9 The missing-data generating process may depend on additional parameters φ, but if φ and θ are independent,
the process is called ignorable and the analyst can concentrate on modelling the missing data, given the observed data and θ.
then we have a non-ignorable missing-data generating process, which cannot be solved adequately without making assumptions on the functional form of the interdependency. 10 Rearranged from K. Chantala and C. Suchindran,
http://www.cpc.unc.edu/services/computer/presentations/mi_presentation2.pdf Use the completed data X and the model to estimate the parameter of interest (e.g. the mean) β
In conclusion, the Multiple Imputation method imputes several values (N) for each missing value (from the predictive distribution of the missing data),
The N versions of completed data sets are analyzed by standard complete-data methods and the results are combined using simple rules to yield single combined estimates (e.g. p-values) that formally incorporate missing-data uncertainty. The pooling of the results of the analyses performed on the multiply imputed data sets,
implies that the resulting point estimates are averaged over the N completed sample points, and the resulting standard errors and p-values are adjusted according to the variance of the corresponding N completed sample point estimates.
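The combination step can be written compactly; the sketch below assumes each of the N analyses returns a point estimate and its within-imputation variance (Rubin's combining rules):

```python
import numpy as np

def combine_mi(estimates, variances):
    """Combine N point estimates and their within-imputation variances
    (one per completed data set) into a single estimate and a total
    variance that includes the between-imputation component."""
    q = np.asarray(estimates, float)
    w = np.asarray(variances, float)
    N = len(q)
    q_bar = q.mean()                     # combined point estimate
    w_bar = w.mean()                     # average within-imputation variance
    b = q.var(ddof=1)                    # between-imputation variance
    return q_bar, w_bar + (1 + 1 / N) * b
```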
Thus, the 'between-imputation variance' provides a measure of the extra inferential uncertainty due to missing data
5. Normalisation of data
The indicators selected for aggregation convey at this stage quantitative information of different kinds.
and their robustness to possible outliers in the data. Different normalization methods will supply different results for the composite indicator.
Another transformation which is often used to reduce the skewness of (positive) data varying across many orders of magnitude is the logarithmic transformation:
yet s/he has to beware that the normalized data will surely be affected by the log transformation.
Therefore, data have to be processed via specific treatment. An example is offered in the Environmental Sustainability Index
when data for a new time point become available. This implies an adjustment of the analysis period T,
In such cases, to maintain comparability between the existing and the new data, the composite indicator would have to be recalculated for the existing data.
5.2.4 Distance to a reference country
This method takes the ratio of the indicator \( x_{qc}^{t} \) for a generic country c and time t to the sub-indicator \( x_{q\bar{c}}^{t_0} \) for the reference country \( \bar{c} \) at the initial time \( t_0 \): \( I_{qc}^{t} = x_{qc}^{t} / x_{q\bar{c}}^{t_0} \).
if there is little variation within the original scores, the percentile banding forces the categorization on the data, irrespective of the distribution of the underlying data.
which each component of the regulatory framework is weighted according to its contribution to the overall variance in the data.
Data have been gathered basically from Member countries' responses to the OECD Regulatory Indicators Questionnaire, which include both qualitative and quantitative information.
while the quantitative information (such as data on ownership shares or notice periods for individual dismissals) is subdivided into classes.
Examples of the above transformations are shown in Table 5.6 using the TAI data. The data are sensitive to the choice of the transformation
and this might cause problems in terms of loss of the 52 interval level of the information, sensitivity to outliers, arbitrary choice of categorical scores and sensitivity to weighting.
normalisation techniques using the TAI data. Table 5.6 reports, for the sub-indicator Mean years of school (age 15 and above), the values obtained with: rank, z-score, re-scaling, distance to reference country, log10, above/below the mean, percentile and categorical scale.
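A compact sketch of some of these options for a single sub-indicator across countries (the reference value and the choice of methods shown are assumptions, not the full set used in Table 5.6):

```python
import numpy as np
from scipy.stats import rankdata

def normalise(x, reference=None):
    """A few common normalisation options: rank, z-score, min-max
    re-scaling, distance to a reference value, and a log10 transform
    (meaningful only for positive, highly skewed data)."""
    x = np.asarray(x, float)
    out = {
        "rank": rankdata(x),
        "z-score": (x - x.mean()) / x.std(),
        "re-scaled": (x - x.min()) / (x.max() - x.min()),
        "log10": np.log10(x),
    }
    if reference is not None:
        out["distance to reference"] = x / reference
    return out
```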
coverage, reliability and economic reason), statistical adequacy, cyclical conformity, speed of available data, etc. In this section a number of techniques are presented ranging from weighting schemes based on statistical models (such as factor analysis, data envelopment analysis, unobserved components models),
to participatory methods (e g. budget allocation or analytic hierarchy processes). Weights usually have an important impact on the value of the composite
Weights may also reflect the statistical quality of the data, thus higher weight could be assigned to statistically reliable data (data with low percentages of missing values, large coverage, sound values).
In this case the concern is to reward only base indicators that are easy to measure and readily available, punishing the information that is more problematic to identify and measure.
but it is rather based on the statistical dimensions of the data. According to PCA/FA, weighting only intervenes to correct for the overlapping information of two or more correlated indicators,
Methodology
The first step in FA is to check the correlation structure of the data: if the correlation between the indicators is low then it is unlikely that they share common factors.
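A minimal sketch of this first check (the 0.3 cut-off used to flag low correlations is only an illustrative rule of thumb):

```python
import numpy as np

def check_correlation_structure(X, threshold=0.3):
    """First FA step: inspect the correlation structure of the indicators.
    Returns the correlation matrix and the share of off-diagonal
    correlations whose absolute value exceeds the threshold."""
    R = np.corrcoef(X, rowvar=False)
    off_diag = R[~np.eye(R.shape[0], dtype=bool)]
    return R, float(np.mean(np.abs(off_diag) > threshold))
```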
smaller than the number of sub-indicators, representing the data. Summarizing briefly what has been explained in Section 3,
Sensitive to modifications of basic data: data revisions and updates (e g. new observations and new countries) may change the set of weights
(i.e. the estimated loadings) used in the composite. Sensitive to the presence of outliers, that may introduce spurious variability in the data. Sensitive to small-sample problems
and data shortage that may make the statistical identification or the economic interpretation difficult (in general a ratio between data and unknown parameters of 3:1 is required for a stable solution).
Minimize the contribution of indicators, which do not move with other indicators. Sensitive to the factor extraction and to the rotation methods.
Examples of use: Indicators of product market regulation (Nicoletti et al., OECD, 2000), Internal Market Index (EC-DG MARKT, 2001b), Business Climate Indicator (EC-DG ECFIN, 2000), General Indicator of S&T (NISTEP
, 1995), Success of Software Process Improvement (Emam et al., 1998). 16 To preserve comparability, final weights could be rescaled to sum up to one.
6.1.2 Data envelopment analysis and Benefit of the doubt
Data envelopment analysis (DEA) employs linear programming tools (popular in Operations Research) to retrieve an efficiency frontier
Figure 6.1. Performance frontier determined with Data Envelopment Analysis (axes: Indicator 1, Indicator 2; points a, b, c, d, d', 0). Rearranged from Mahlberg and Obersteiner (2001).
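A hypothetical sketch of the Benefit-of-the-Doubt idea as a small linear program (scipy-based; the normalisation of X and the upper bound of 1 are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import linprog

def benefit_of_the_doubt(X):
    """DEA-style composite scores: X is an (n_countries, n_indicators) array
    of normalised sub-indicators (higher = better). For each country the
    weights are chosen to maximise its own composite, subject to no country
    scoring above 1 with those same weights."""
    n, q = X.shape
    scores = np.empty(n)
    for i in range(n):
        res = linprog(c=-X[i],                   # maximise X[i] @ w
                      A_ub=X, b_ub=np.ones(n),   # X[k] @ w <= 1 for every k
                      bounds=[(0, None)] * q,
                      method="highs")
        scores[i] = -res.fun
    return scores
```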
It requires a large amount of data to produce estimates with known statistical properties. Examples of use: Composite Economic Sentiment Indicator (ESIN) http://europa.eu.int/comm/economy_finance, National Innovation Capacity Index (Porter and Stern, 1999
The observed data consist of a cluster of q = 1, ..., Q(c) indicators, each measuring an aspect of ph(c). Let c = 1,
However, since not all countries have data on all sub-indicators, the denominator of w c,
q). The likelihood function of the observed data is maximized with respect to the unknown parameters, the α(q)'s and β(q)'s,
Reliability and robustness of results depend on the availability of enough data. With highly correlated sub-indicators there could be identification problems.
AHP allows for the application of data, experience, insight, and intuition in a logical and thorough way within a hierarchy as a whole.
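A minimal sketch of how AHP weights are typically derived from a pairwise-comparison matrix via its principal eigenvector (Saaty's eigenvector method with the standard random-index values; the 3-indicator example is hypothetical):

```python
import numpy as np

def ahp_weights(pairwise):
    """Derive weights from a reciprocal pairwise-comparison matrix (1-9
    scale) as the normalised principal eigenvector, and return a rough
    consistency ratio (values above ~0.1 suggest incoherent judgements)."""
    A = np.asarray(pairwise, float)
    n = A.shape[0]
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmax(eigvals.real)
    w = np.abs(eigvecs[:, k].real)
    w /= w.sum()                                   # weights sum to one
    ci = (eigvals.real[k] - n) / (n - 1)           # consistency index
    ri = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32}.get(n, 1.45)
    return w, ci / ri

# e.g. indicator 1 judged twice as important as 2 and four times as 3:
w, cr = ahp_weights([[1, 2, 4], [1/2, 1, 2], [1/4, 1/2, 1]])
```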
Analytic Hierarchy Process. Advantages: the method can be used for both qualitative and quantitative data; the method increases the transparency of the composite. Disadvantages: the method requires a high number of pairwise comparisons
Although this methodology uses statistical analysis to treat data, it operates with people (experts, politicians, citizens) who are asked to choose which set of sub-indicators they prefer,
25 Data are not normalised. Normalization does not change the result of the multicriteria method whenever it does not change the ordinal information of the data matrix. \( e_{jk} = \sum_{q=1}^{Q} \left( w_q(\mathrm{Pr}_{jk}) + \tfrac{1}{2}\, w_q(\mathrm{In}_{jk}) \right) \) (6.15), where \( w_q(\mathrm{Pr}_{jk}) \) and \( w_q(\mathrm{In}_{jk}) \) are the weights of sub-indicators presenting a preference and an indifference relation respectively.
only if data are all expressed in a partially comparable interval scale (i.e. temperature in Celsius or Fahrenheit) of type \( x \rightarrow \alpha_i + \beta x \), with \( \alpha_i \) varying across sub-indicators and \( \beta > 0 \).
Non-comparable data measured in a ratio scale (i.e. kilograms and pounds), of type \( x \rightarrow \alpha_i x \) with \( \alpha_i > 0 \) (i.e. \( \alpha_i \) varying across sub-indicators), can only be aggregated meaningfully by using geometric functions,
because it lets the data decide on the weighting issue, and it is sensitive to national priorities.
i. selection of sub-indicators, ii. data quality, iii. data editing, iv. data normalisation, v. weighting scheme, vi. weights' values, vii. composite
i. inclusion/exclusion of sub-indicators, ii. modelling of data error, e.g. based on available information on variance estimation, iii. alternative editing schemes,
e.g. multiple imputation, described in Section 4, iv. using alternative data normalisation schemes, such as rescaling, standardisation,
Also modelling of the data error, point (ii) above, will not be included, as in the case of TAI no standard error estimate is available for the sub-indicators.
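As a rough illustration of how such discrete triggers could drive a Monte Carlo re-computation of the composite (the two editing options, the two normalisation options and the equal-weight linear aggregation are placeholders, not the scheme actually analysed in the report):

```python
import numpy as np

def composite_run(X, rng):
    """One run: randomly pick an editing rule (X1) and a normalisation
    rule (X2), then aggregate with equal weights."""
    Xr = X.copy()
    if rng.integers(2) == 0:                      # X1 = 1: mean imputation
        Xr = np.where(np.isnan(Xr), np.nanmean(Xr, axis=0), Xr)
    else:                                         # X1 = 2: zero substitution
        Xr = np.nan_to_num(Xr, nan=0.0)
    if rng.integers(2) == 0:                      # X2 = 1: z-scores
        Xr = (Xr - Xr.mean(axis=0)) / Xr.std(axis=0)
    else:                                         # X2 = 2: min-max re-scaling
        Xr = (Xr - Xr.min(axis=0)) / (Xr.max(axis=0) - Xr.min(axis=0))
    return Xr.mean(axis=1)                        # equal-weight aggregation

def rank_uncertainty(X, n_runs=1000, seed=0):
    """Distribution of country ranks over repeated random trigger choices."""
    rng = np.random.default_rng(seed)
    scores = np.array([composite_run(X, rng) for _ in range(n_runs)])
    return (-scores).argsort(axis=1).argsort(axis=1) + 1   # 1 = best per run
```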
the relative sub-indicator q will be almost neglected for that run.
7.1.4 Data quality
This is not considered here, as discussed above.
7.1.5 Normalisation
As described in Section II-5 several methods are available
X1 (Editing): 1 = use bivariate correlation to impute missing data; 2 = assign zero to the missing datum. The second input factor X2 is the trigger to select the normalisation
(either for the BAL or AHP schemes) are assigned to the data. Clearly the selection of the expert has no bearing
i.e. by censoring all countries with missing data. As a result, only 34 countries could in theory be analysed.
as this is the first country with missing data, and it was preferred to analyse the set of countries whose rank was not altered by the omission of missing records.
Data retrieved on 4 October 2004. A number of lines are usually superimposed in the same chart to allow comparisons between countries.
an assessment of progress can be made by comparing the latest data with the position at a number of baselines.
in a direction away from meeting the objective; insufficient or no comparable data.
8.5 Rankings
A quick and easy way to display country performance is to use rankings.
However its graphical features can be helpful for presentational purposes. www.nationmaster.com is a massive central data source on the internet with a handy way to graphically compare nations.
NationMaster is a vast compilation of data from such sources as the CIA World Factbook, United Nations, World Health Organization, World Bank, World Resources Institute, UNESCO,
Data selection
The quality of composite indicators depends also on the quality of the underlying indicators.
Imputation of missing data
The idea of imputation is both seductive and dangerous. Several imputation methods are available,
and the use of 'expensive to collect' data that would otherwise be discarded. The main disadvantage of imputation is that the results are affected by the imputation algorithm used.
and Seiford L.M. (1995), Data Envelopment Analysis: Theory, Methodology and Applications, Boston: Kluwer.
Cherchye L. (2001), Using data envelopment analysis to assess macroeconomic policy performance, Applied Economics, 33, 407-416.
Dempster A.P. and Rubin D.B. (1983), Introduction, pp. 3-10, in Incomplete Data in Sample Surveys (vol. 2:
Funtowicz S.O., Munda G., Paruccini M. (1990), The aggregation of environmental data using multicriteria methods, Environmetrics, Vol. 1(4), pp. 353-36.
An Empirical Analysis Based on Survey Data for Swiss Manufacturing, Research Policy, 25, 633-45. Hollenstein, H. (2003),
A Cluster Analysis Based on Firm-level Data, Research Policy, 32(5), 845-863. Homma, T. and Saltelli, A. (1996), Importance measures in global sensitivity analysis of model output.
and Schenker N. (1994), Missing Data, in Handbook for Statistical Modeling in the Social and Behavioral Sciences (G. Arminger, C.C. Clogg,
Little R.J.A. (1997), Biostatistical Analysis with Missing Data, in Encyclopedia of Biostatistics (P. Armitage and T. Colton, eds.,
Little R.J.A. and Rubin D.B. (2002), Statistical Analysis with Missing Data, Wiley Interscience, J. Wiley & Sons, Hoboken, New Jersey.
Mahlberg B. and Obersteiner M. (2001), Remeasuring the HDI by Data Envelopment Analysis, Interim report IR-01-069, International Institute for Applied Systems Analysis, Laxenburg.
Massart, D.L. and Kaufman, L. (1983), The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, New York:
Milligan, G.W. and Cooper, M.C. (1985), "An Examination of Procedures for Determining the Number of Clusters in a Data Set," Psychometrika, 50, 159-179.
and Kiers, H. (2001), Factorial k-means analysis for two-way data, Computational Statistics and Data Analysis, 37(1), 49-64.
However, the original data set contains a large number of missing values, mainly due to missing data in Patents and Royalties.