Contents (excerpt): 1.3 Imputation of missing data; 1.4 Multivariate analysis; 1.5 Normalisation of data; 1.6 Weighting and aggregation; 1.7 Robustness and sensitivity; 1.8 Back to the details; 2.2 Quality dimensions for basic data; 2.3 Quality dimensions for procedures to build composite indicators; Step 3. Imputation of missing data; 3.1 Single imputation; 3.2 Unconditional mean imputation; 4.1 Principal components analysis; 4.2 Factor analysis; 4.3 Cronbach coefficient alpha; 4.4 Cluster analysis; 4.5 Other methods for multivariate analysis; 6.1 Weights based on principal components analysis or factor analysis; 6.2 Data envelopment analysis (DEA); 6.3 Benefit of the doubt approach (BOD); 6.4 Unobserved components model (UCM); Step 7. Uncertainty and sensitivity analysis; 7.1 General framework; 7.2 Uncertainty analysis (UA); 7.3 Sensitivity analysis using variance-based techniques; 7.3.1 Analysis 1; 7.3.2 Analysis 2; Step 8. Back to the details
List of tables and figures (excerpt): Table 13. K-means for clustering TAI countries; Table 14. Normalisation based on interval scales; Table 15. Examples of normalisation techniques using TAI data; Table 16. Eigenvalues of TAI data set; Table 17. Factor loadings of TAI based on principal components; Table 18. Weights for the TAI indicators based on maximum likelihood (ML) or principal components (PC) method for the extraction of the common factors; Figure 16. Data envelopment analysis (DEA) performance frontier; Figure 17. Analytical hierarchy process (AHP) weighting of the TAI indicators
Abbreviations (excerpt): C-K-Y-L Condorcet-Kemeny-Young-Levenglick; CLA Cluster analysis; DEA Data envelopment analysis; DFA Discriminant function analysis; DQAF Data Quality Framework; EC European Commission; EM Expected maximisation; EU European Union; EW Equal weighting; FA Factor analysis; GCI Growth Competitiveness Index; GDP Gross domestic product; GME Geometric aggregation; HDI Human Development Index; ICT Information and communication technologies
whereby a lot of work in data collection and editing is wasted or hidden behind a single number of dubious significance.
and use of composite indicators in order to avoid data manipulation and misrepresentation. In particular, to guide constructors
Data selection. Indicators should be selected on the basis of their analytical soundness, measurability, country coverage,
relevance to the phenomenon being measured and relationship to each other. The use of proxy variables should be considered
when data are scarce. Imputation of missing data. Consideration should be given to different approaches for imputing missing values.
Extreme values should be examined as they can become unintended benchmarks. Multivariate analysis. An exploratory analysis should investigate the overall structure of the indicators
assess the suitability of the data set and explain the methodological choices, e.g. weighting and aggregation.
Skewed data should also be identified and accounted for. Weighting and aggregation. Indicators should be aggregated and weighted according to the underlying theoretical framework.
Uncertainty and sensitivity analysis should be undertaken to assess the robustness of the composite indicator in terms of, e.g., the mechanism for including
or excluding single indicators, the normalisation scheme, the imputation of missing data, the choice of weights and the aggregation method.
Back to the real data. Composite indicators should be transparent and fit to be decomposed into their underlying indicators or values.
and the underlying data are freely available on the Internet. For the sake of simplicity, only the first 23 of the 72 original countries measured by the TAI are considered here.
dimensions of technological capacity (data given in Table A.1). Creation of technology. Two individual indicators are used to capture the level of innovation in a society: (
The quality of a composite indicator as well as the soundness of the messages it conveys depend not only on the methodology used in its construction but primarily on the quality of the framework and the data used.
A composite based on a weak theoretical background or on soft data containing large measurement errors can lead to disputable policy messages
especially as far as methodologies and basic data are concerned. To avoid these risks, the Handbook puts special emphasis on documentation and metadata.
process. 2. Data selection should be based on the analytical soundness, measurability, country coverage, and relevance of the indicators to the phenomenon being measured, and on their relationship to each other.
when data are scarce (involvement of experts and stakeholders is envisaged at this step). To check the quality of the available indicators.
To create a summary table on data characteristics, e.g. availability (across country, time), source, type (hard,
3. Imputation of missing data is needed in order to provide a complete dataset (e.g. by means of single or multiple imputation).
To check the underlying structure of the data along the two main dimensions, namely individual indicators and countries (by means of suitable multivariate methods, e.g. principal components analysis, cluster analysis). To identify groups of indicators or groups of countries that are statistically similar
To compare the statistically determined structure of the data set to the theoretical framework and discuss possible differences. 5. Normalisation should be carried out to render the variables comparable. To select suitable normalisation procedure(s) that respect both the theoretical framework and the data properties.
To discuss the presence of outliers in the dataset as they may become unintended benchmarks.
To select appropriate weighting and aggregation procedure(s) that respect both the theoretical framework and the data properties.
7. Uncertainty and sensitivity analysis should be undertaken to assess the robustness of the composite indicator in terms of, e.g., the mechanism for including
the normalisation scheme, the imputation of missing data, the choice of weights, the aggregation method.
To conduct sensitivity analysis of the inference (assumptions) and determine what sources of uncertainty are more influential in the scores
and/or ranks. 8. Back to the data is needed to reveal the main drivers for an overall good or bad performance.
To correlate the composite indicator with other relevant measures, taking into consideration the results of sensitivity analysis.
To develop data-driven narratives based on the results. 10. Visualisation of the results should receive proper attention,
Criteria for assuring the quality of the basic data set for composite indicators are discussed in detail in Section 2:
the data selection process can be quite subjective as there may be no single definitive set of indicators.
A lack of relevant data may also limit the developer's ability to build sound composite indicators.
Given a scarcity of internationally comparable quantitative (hard) data, composite indicators often include qualitative (soft) data from surveys or policy reviews.
Proxy measures can be used when the desired data are unavailable or when cross-country comparability is limited.
For example, data on the number of employees that use computers might not be available. Instead, the number of employees who have access to computers could be used as a proxy.
As in the case of soft data, caution must be taken in the utilisation of proxy indicators.
To the extent that data permit, the accuracy of proxy measures should be checked through correlation and sensitivity analysis.
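As an illustration of this check, the minimal sketch below (in Python, with purely hypothetical numbers) computes the correlation between a target series and a candidate proxy over the periods where both are observed; a high correlation lends some support to using the proxy where the target is missing.

```python
import numpy as np

# Hypothetical example: check how well a proxy tracks the target series
# where both are available, before relying on the proxy elsewhere.
target = np.array([12.1, 14.3, 15.0, 17.8, 21.2, 22.9])  # e.g. % employees using computers
proxy = np.array([15.4, 16.9, 18.1, 20.5, 24.0, 25.8])   # e.g. % employees with computer access

r = np.corrcoef(target, proxy)[0, 1]
print(f"Pearson correlation between target and proxy: {r:.3f}")
```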
The builder should also pay close attention to whether the indicator in question is dependent on GDP or other size-related factors.
The quality and accuracy of composite indicators should evolve in parallel with improvements in data collection and indicator development.
The current trend towards constructing composite indicators of country performance in a range of policy areas may provide further impetus to improving data collection,
identifying new data sources and enhancing the international comparability of statistics. On the other hand, we do not subscribe to the idea that using whatever data happen to be available is good enough: poor data will produce poor results in a garbage-in, garbage-out logic. From a pragmatic point of view,
Created a summary table on data characteristics, e.g. availability (across country, time), source, type (hard,
1.3. Imputation of missing data. The idea of imputation could be both seductive and dangerous. Missing data often hinder the development of robust composite indicators.
Data can be missing in a random or nonrandom fashion. The missing patterns could be:
Missing values do not depend on the variable of interest or on any other observed variable in the data set.
and if (ii) each of the other variables in the data set would have to be the same, on average,
but are conditional on other variables in the data set. For example, the missing values in income would be MAR
if the probability of missing data on income depends on marital status but, within each category of marital status,
whether data are missing at random or systematically, while most of the methods that impute missing values require a missing at random mechanism, i.e.
There are three general methods for dealing with missing data: (i) case deletion, (ii) single imputation or (iii) multiple imputation.
The other two approaches consider the missing data as part of the analysis and try to impute values through either single imputation, e.g. mean/median/mode substitution, regression imputation,
Data imputation could lead to the minimisation of bias and the use of 'expensive to collect' data that would otherwise be discarded by case deletion.
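The following minimal Python sketch, using a small invented indicator matrix, contrasts case deletion with single imputation by unconditional column means; it is only meant to make the two options concrete, not to suggest that mean substitution is generally adequate.

```python
import numpy as np

# Toy indicator matrix (countries x indicators) with missing values (NaN).
X = np.array([
    [0.72, 0.55, np.nan],
    [0.64, np.nan, 0.48],
    [0.58, 0.61, 0.52],
    [np.nan, 0.70, 0.60],
])

# (i) Case deletion: keep only countries with complete records.
complete = X[~np.isnan(X).any(axis=1)]

# (ii) Single imputation by unconditional column means.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

print("complete cases:\n", complete)
print("mean-imputed data:\n", X_imputed)
```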
However, it can also allow data to influence the type of imputation. In the words of Dempster & Rubin (1983):
The idea of imputation is both seductive and dangerous. It is seductive because it can lull the user into the pleasurable state of believing that the data are complete after all,
and it is dangerous because it lumps together situations where the problem is sufficiently minor that it can be handled legitimately in this way
and imputed data have substantial bias. The uncertainty in the imputed data should be reflected by variance estimates.
This makes it possible to take into account the effects of imputation in the course of the analysis. However
A complete data set without missing values. A measure of the reliability of each imputed value
and the results. 1.4. Multivariate analysis. Analysing the underlying structure of the data is still an art. Over the last few decades,
The underlying nature of the data needs to be analysed carefully before the construction of a composite indicator.
This preliminary step is helpful in assessing the suitability of the data set and will provide an understanding of the implications of the methodological choices, e.g. weighting and aggregation,
and analysed along at least two dimensions of the data set: individual indicators and countries. Grouping information on individual indicators.
or appropriate to describe the phenomenon (see Step 2). This decision can be based on expert opinion and the statistical structure of the data set.
Factor analysis (FA) is similar to PCA, but is based on a particular statistical model. An alternative way to investigate the degree of correlation among a set of variables is to use the Cronbach coefficient alpha (c-alpha),
These multivariate analysis techniques are useful for gaining insight into the structure of the data set of the composite.
Cluster analysis is another tool for classifying large amounts of information into manageable sets. It has been applied to a wide variety of research problems and fields, from medicine to psychiatry and archaeology.
Cluster analysis is used also in the development of composite indicators to group information on countries based on their similarity on different individual indicators.
Cluster analysis serves as: (i) a purely statistical method of aggregation of the indicators, (ii) a diagnostic tool for exploring the impact of the methodological choices made during the construction phase of the composite indicator,
and (iv) a method for selecting groups of countries for the imputation of missing data with a view to decreasing the variance of the imputed values.
or when it is believed that some of them do not contribute to identifying the clustering structure in the data set,
and then apply a clustering algorithm on the object scores on the first few components,
as PCA or FA may identify dimensions that do not necessarily help to reveal the clustering structure in the data
and weaknesses of multivariate analysis.
Principal components/factor analysis. Strengths: can summarise a set of individual indicators while preserving the maximum possible proportion of the total variation in the original data set. Weaknesses: sensitive to modifications in the basic data (data revisions and updates, e.g. new countries); sensitive to the presence of outliers, which may introduce a spurious variability in the data; sensitive to small-sample problems, which are particularly relevant when the focus is limited to a set of countries; minimisation of the contribution of individual indicators which do not move with other individual indicators.
Cluster analysis. Strengths: offers a different way to group countries; gives some insight into the structure of the data set. Weaknesses: purely a descriptive tool; may not be transparent if the methodological choices made during the analysis are not motivated
Various alternative methods combining cluster analysis and the search for a low-dimensional representation have been proposed, focusing on multidimensional scaling or unfolding analysis. Factorial k-means analysis combines k-means
cluster analysis with aspects of FA and PCA. A discrete clustering model together with a continuous factorial model are fitted simultaneously to two-way data to identify the best partition of the objects, described by the best orthogonal linear combinations of the variables (factors) according to the least-squares criterion.
This has a wide range of applications since it achieves a double objective: data reduction and synthesis, simultaneously in the direction of objects and variables.
Originally applied to short-term macroeconomic data, factorial k-means analysis has a fast alternating least-squares algorithm that extends its application to large data sets.
This methodology can be recommended as an alternative to the widely-used tandem analysis. By the end of Step 4 the constructor should have:
Checked the underlying structure of the data along various dimensions, i.e. individual indicators, countries. Applied the suitable multivariate methodology, e.g.
PCA, FA, cluster analysis. Identified subgroups of indicators or groups of countries that are statistically similar.
Analysed the structure of the data set and compared this to the theoretical framework. Documented the results of the multivariate analysis
and the interpretation of the components and factors. 1.5. Normalisation of data. Avoid adding up apples and oranges. Normalisation is required prior to any data aggregation as the indicators in a data set often have different measurement units.
A number of normalisation methods exist (Table 3) (Freudenberg, 2003; Jacobs et al., 2004): 1. Ranking is the simplest normalisation technique.
the percentile bands force the categorisation on the data, irrespective of the underlying distribution. A possible solution is to adjust the percentile brackets across the individual indicators
The normalisation method should take into account the data properties, as well as the objectives of the composite indicator.
Selected the appropriate normalisation procedure(s) with reference to the theoretical framework and to the properties of the data.
A number of weighting techniques exist (Table 4). Some are derived from statistical models, such as factor analysis, data envelopment analysis and unobserved components models (UCM
Weights may also be chosen to reflect the statistical quality of the data.
Higher weights could be assigned to statistically reliable data with broad coverage. However, this method could be biased towards the readily available indicators,
and cons. Statistical models such as principal components analysis (PCA) or factor analysis (FA) could be used to group individual indicators according to their degree of correlation.
such as the benefit of the doubt (BOD) approach, are extremely parsimonious about weighting assumptions as they allow the data to decide on the weights
as in the case of environmental indices that include physical, social and economic data. If the analyst decides that an increase in economic performance cannot compensate for a loss in social cohesion or a worsening in environmental sustainability,
1.7. Robustness and sensitivity. Sensitivity analysis can be used to assess the robustness of composite indicators. Several judgements have to be made
when constructing composite indicators, e.g. on the selection of indicators, data normalisation, weights and aggregation methods, etc.
A combination of uncertainty and sensitivity analysis can help gauge the robustness of the composite indicator
Sensitivity analysis assesses the contribution of the individual source of uncertainty to the output variance. While uncertainty analysis is used more often than sensitivity analysis
and is treated almost always separately, the iterative use of uncertainty and sensitivity analysis during the development of a composite indicator could improve its structure (Saisana et al.,
2005a; Tarantola et al., 2000; Gall, 2007). Ideally, all potential sources of uncertainty should be addressed: selection of individual indicators, data quality, normalisation, weighting, aggregation method, etc.
The approach taken to assess uncertainties could include the following steps: 1. Inclusion and exclusion of individual indicators. 2. Modelling data error based on the available information on variance estimation. 3. Using alternative editing schemes, e.g. single or multiple imputation. 4. Using alternative data normalisation schemes, such as Min-Max, standardisation, use of rankings. 5. Using different weighting schemes, e.g. methods from the participatory family (budget allocation,
No index can be better than the data it uses. But this is an argument for improving the data,
not abandoning the index (UN, 1992). The results of the robustness analysis are generally reported as country rankings with their related uncertainty bounds,
The sensitivity analysis results are generally shown in terms of the sensitivity measure for each input source of uncertainty. The results of a sensitivity analysis are often also shown as scatter plots with the values of the composite indicator for a country on the vertical axis
does the derived theoretical model provide a good fit to the data? What does the lack of fit tell us about the conceptual definition of the composite or about the indicators chosen for it?
Conducted sensitivity analysis of the inference, e.g. to show what sources of uncertainty are more influential in determining the relative ranking of two entities.
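A hedged illustration of such an exercise is sketched below in Python on invented data: the normalisation scheme and the weights are treated as uncertain inputs, the composite is recomputed many times, and the spread of each country's rank is summarised. It is a simplified stand-in for the fuller uncertainty analysis described in this Handbook.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 4))          # 10 countries, 4 indicators (illustrative data)
n_runs = 2000
ranks = np.empty((n_runs, X.shape[0]), dtype=int)

for r in range(n_runs):
    # Uncertain normalisation scheme: min-max or z-scores, picked at random.
    if rng.random() < 0.5:
        Z = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    else:
        Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Uncertain weights: random weights that sum to one.
    w = rng.dirichlet(np.ones(X.shape[1]))
    ci = Z @ w                                # linear aggregation
    ranks[r] = (-ci).argsort().argsort() + 1  # rank 1 = best score

median_rank = np.median(ranks, axis=0)
rank_5, rank_95 = np.percentile(ranks, [5, 95], axis=0)
for c in range(X.shape[0]):
    print(f"country {c}: median rank {median_rank[c]:.0f}, "
          f"90% interval [{rank_5[c]:.0f}, {rank_95[c]:.0f}]")
```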
While they can be used as summary indicators to guide policy and data work, they can also be decomposed such that the contribution of subcomponents
Correlation analysis should not be mistaken for causality analysis. Correlation simply indicates that the variation in the two data sets is similar.
Tested the links with variations of the composite indicator as determined through sensitivity analysis. Developed data-driven narratives on the results. Documented
and explained the correlations and the results. 1.10. Presentation and dissemination. A well-designed graph can speak louder than words. The way composite indicators are presented is not a trivial issue.
JRC elaboration, data source: Eurostat, 2007. http://ec.europa.eu/eurostat
related both to the quality of elementary data used to build the indicator and the soundness of the procedures used in its construction.
when quality was equated with accuracy. It is now generally recognised that there are other important dimensions. Even if data are accurate,
they cannot be said to be of good quality if they are produced too late to be useful,
or appear to conflict with other data. Thus, quality is a multifaceted concept. The most important quality characteristics depend on user perspectives, needs and priorities,
With the adoption of the European Statistics Code of practice in 2005, the Eurostat quality framework is now quite similar to the IMF's Data Quality Framework (DQAF),
in the sense that both frameworks provide a comprehensive approach to quality, through coverage of governance, statistical processes and observable features of the outputs.
3. Accuracy and reliability: Are the source data, statistical techniques, etc. adequate to portray the reality to be captured?
4. Serviceability: How well are users' needs met in terms of timeliness of the statistical products, their frequency, consistency,
Are effective data and metadata easily available to data users and is there assistance to users?
Given the institutional setup of the European Statistical System,
2. Accuracy refers to the closeness of computations or estimates to the exact or true values;
Punctuality refers to the time lag between the target delivery date and the actual date of the release of the data;
availability of micro or macro data, media (paper, CD-ROM, Internet, etc.). Clarity refers to the statistics' information environment:
6. Coherence refers to the adequacy of the data to be combined reliably in different ways and for various uses.
and (ii) the quality of internal processes for collection, processing, analysis and dissemination of data and metadata.
(i) the quality of basic data, and (ii) the quality of procedures used to build
the application of the most advanced approaches to the development of composite indicators based on inaccurate or incoherent data would not produce good quality results,
In the following section each is considered separately. 2.2. Quality dimensions for basic data. The selection of basic data should maximise the overall quality of the final result.
In particular, in selecting the data the following dimensions (drawing on the IMF, Eurostat and OECD) are to be considered:
Relevance The relevance of data is a qualitative assessment of the value contributed by these data.
It depends upon both the coverage of the required topics and the use of appropriate concepts.
Careful evaluation and selection of basic data have to be carried out to ensure that the right range of domains is covered in a balanced way.
Given the actual availability of data, proxy series are often used, but in this case some evidence of their relationship with the target series should be produced whenever possible.
Accuracy The accuracy of basic data is the degree to which they correctly estimate or describe the quantities
Accuracy refers to the closeness between the values provided and the (unknown) true values. Accuracy has many attributes,
and in practical terms it has no single aggregate or overall measure. Of necessity, these attributes are typically measured
In the case of sample survey-based estimates, the major sources of error include coverage, sampling, non-response,
and censuses that provide source data; from the fact that source data do not fully meet the requirements of the accounts in terms of coverage, timing,
and valuation and that the techniques used to compensate can only partially succeed; from seasonal adjustment;
An aspect of accuracy is the closeness of the initially released value(s) to the subsequent value(s) of estimates.
which include (i) replacement of preliminary source data with later data, (ii) replacement of judgemental projections with source data,
however, the absence of revisions does not necessarily mean that the data are accurate.
The accuracy of basic data is extremely important. Here the issue of credibility of the source becomes crucial.
The credibility of data products refers to confidence that users place in those products based simply on their image of the data producer, i.e.,
One important aspect is trust in the objectivity of the data. This implies that the data are perceived to be produced professionally in accordance with appropriate statistical standards
and policies, and that practices are transparent (for example, data are not manipulated, nor is their release timed, in response to political pressure).
Other things being equal, data produced by official sources (e.g. national statistical offices or other public bodies working under national statistical regulations
or codes of conduct) should be preferred to other sources. Timeliness The timeliness of data products reflects the length of time between their availability
and the event or phenomenon they describe, but considered in the context of the time period that permits the information to be of value
The concept applies equally to short-term or structural data; the only difference is the timeframe.
Closely related to the dimension of timeliness, the punctuality of data products is also very important, both for national and international data providers.
and reflects the degree to which data are released in accordance with it. In the context of composite indicators, timeliness is especially important to minimise the need for the estimation of missing data or for revisions of previously published data.
As individual basic data sources establish their optimal trade-off between accuracy and timeliness, taking into account institutional, organisational and resource constraints
data covering different domains are often released at different points of time. Therefore special attention must be paid to the overall coherence of the vintages of data used to build composite indicators (see also coherence).
Accessibility The accessibility of data products reflects how readily the data can be located and accessed from original sources,
i.e. the conditions in which users can access statistics (such as distribution channels, pricing policy, copyright, etc.).
The range of different users leads to considerations such as multiple dissemination formats and selective presentation of metadata.
which the data are available, the media of dissemination, and the availability of metadata and user support services.
It also includes the affordability of the data to users in relation to its value to them
and whether the user has a reasonable opportunity to know that the data are available
In the context of composite indicators, accessibility of basic data can affect the overall cost of production and updating of the indicator over time.
if poor accessibility of basic data makes it difficult for third parties to replicate the results of the composite indicators.
In this respect, given improvements in electronic access to databases released by various sources, the issue of coherence across data sets can become relevant.
Therefore, the selection of the source should not always give preference to the most accessible source,
Interpretability The interpretability of data products reflects the ease with which the user may understand
and analyse the data. The adequacy of the definitions of concepts, target populations, variables and terminology underlying the data
and of the information describing the limitations of the data, if any, largely determines the degree of interpretability.
The range of different users leads to considerations such as the presentation of metadata in layers of increasing detail.
the wide range of data used to build them and the difficulties due to the aggregation procedure require the full interpretability of basic data.
The availability of definitions and classifications used to produce basic data is essential to assess the comparability of data over time
and across countries (see coherence): for example, series breaks need to be assessed when composite indicators are built to compare performances over time.
Therefore the availability of adequate metadata is an important element in the assessment of the overall quality of basic data.
Coherence The coherence of data products reflects the degree to which they are connected logically and mutually consistent,
i e. the adequacy of the data to be combined reliably in different ways and for various uses.
Coherence implies that the same term should not be used without explanation for different concepts or data items;
that different terms should not be used for the same concept or data item without explanation;
and that variations in methodology that might affect data values should not be made without explanation.
Coherence in its loosest sense implies the data are at least reconcilable. For example, if two data series purporting to cover the same phenomena differ, the differences in time of recording, valuation,
and coverage should be identified so that the series can be reconciled. In the context of composite indicators, two aspects of coherence are especially important:
coherence over time and across countries. Coherence over time implies that the data are based on common concepts, definitions and methodology over time,
or that any differences are explained and can be allowed for. Incoherence over time refers to breaks in a series resulting from changes in concepts, definitions, or methodology.
Coherence across countries implies that from country to country the data are based on common concepts, definitions, classifications and methodology,
the imputation of missing data, as well as the normalisation and the aggregation, can affect its accuracy, etc.
In the following matrix, the most important links between each phase of the building process and quality dimensions are identified,
The imputation of missing data affects the accuracy of the composite indicator and its credibility.
The normalisation phase is crucial both for the accuracy and the coherence of final results.
The quality of basic data chosen to build the composite indicator strongly affects its accuracy and credibility.
Timeliness can also be influenced greatly by the choice of appropriate data. The use of multivariate analysis to identify the data structure can increase both the accuracy and the interpretability of final results.
This step is also very important to identify redundancies among selected phenomena and to evaluate possible gaps in basic data.
One of the key issues in the construction of composite indicators is the choice of the weighting and aggregation model.
Almost all quality dimensions are affected by this choice, especially accuracy, coherence and interpretability. This is also one of the most criticised characteristics of composite indicators:
Analysis of this type can improve the accuracy, credibility and interpretability of the final results.
which are highly correlated with the reference data. The presentation of composite indicators and their visualisation affects both the relevance and interpretability of the results.
The OECD has recently developed the Data and Metadata Reporting and Presentation Handbook (OECD, 2007), which describes practices useful to improve the dissemination of statistical products.
Table 5. Quality dimensions of composite indicators. The table cross-tabulates the construction phases (theoretical framework, data selection, imputation of missing data, multivariate analysis, normalisation, weighting and aggregation, back to the data, robustness and sensitivity, links to other variables, visualisation, dissemination) against the quality dimensions (relevance, accuracy, credibility, timeliness, accessibility, interpretability, coherence).
The problem of missing data is discussed first. The need for multivariate analysis prior to the aggregation of the individual indicators is stressed.
as well as the need to test the robustness of the composite indicator using uncertainty and sensitivity analysis.
STEP 3. IMPUTATION OF MISSING DATA. The literature on the analysis of missing data
The predictive distribution must be generated by employing the observed data either through implicit or explicit modelling:
The danger of this type of modelling of missing data is the tendency to consider the resulting data set as complete,
Filling in blank cells with individual data drawn from similar responding units. For example, missing values for individual income may be replaced with the income of another respondent with similar characteristics, e.g. age, sex, race, place of residence, family relationships, job, etc.
and the time to convergence depends on the proportion of missing data and the flatness of the likelihood function.
or the robustness of the composite index derived from the imputed data set. 3.2. Unconditional mean imputation. Let Xq be the random variable associated with the individual indicator q, with q = 1, ..., Q,
Hence, the inference based on the entire data set, including the imputed data, does not fully account for imputation uncertainty.
For nominal variables, frequency statistics such as the mode or hot- and cold-deck imputation methods might be more appropriate. 3.4. Expected maximisation imputation. Suppose that X denotes the matrix of data.
In likelihood-based estimation, the data are assumed to be generated by a model, described by a probability or density function f(X | θ), which describes the probability of observing a data set for a given θ. Since θ is unknown,
Assuming that missing data are MAR or MCAR, the EM consists of two components: the expectation (E) and maximisation (M) steps.
as if there were no missing data, and second (E), the expected values of the missing variables are calculated,
Effectively, this process maximises the expectation of the complete data log-likelihood in each cycle, conditional on the observed data and parameter vector.
however, an initial estimate of the missing data is needed. This is obtained by running the first M-step on the non-missing observations only
It can be used for a broad range of problems, e g. variance component estimation or factor analysis.
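The sketch below illustrates the E/M alternation on an artificial two-indicator example, using a simplified regression-based E-step (expected values only, without the residual-variance correction a full EM implementation would carry); it is intended only to make the iteration described above concrete.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two correlated indicators; some values of the second one are missing (MCAR).
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
miss = rng.random(n) < 0.3
x2_obs = np.where(miss, np.nan, x2)

# Initial fill: run the first "M-step" on the non-missing observations only.
x2_fill = np.where(miss, np.nanmean(x2_obs), x2_obs)

for _ in range(50):
    # M-step: estimate the regression of x2 on x1 from the current completed data.
    beta1, beta0 = np.polyfit(x1, x2_fill, 1)
    # E-step: replace the missing x2 values by their expected values given x1.
    x2_fill = np.where(miss, beta0 + beta1 * x1, x2_obs)

print("estimated slope and intercept:", round(beta1, 3), round(beta0, 3))
print("true values used to simulate: 0.8, 0.0")
```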
To test this, different initial starting values for θ can be used. 3.5. Multiple imputation. Multiple imputation (MI) is a general approach that does not require a specification of a parameterised likelihood for all data (Figure 10).
The imputation of missing data is performed with a random process that reflects uncertainty. Imputation is done N times,
The parameters of interest are estimated on each data set, together with their standard errors. Average (mean or median) estimates are combined using the N sets
Figure 10. Logic of multiple imputation: a data set with missing values is imputed N times (Set 1, Set 2, ..., Set N); each completed set is analysed (Result 1, Result 2, ..., Result N) and the N results are then combined.
It assumes that data are drawn from a multivariate normal distribution and requires MAR or MCAR assumptions. The theory of MCMC is most easily understood using Bayesian methodology (Figure 11).
The observed data are denoted Xobs, and the complete data set, X=(Xobs, Xmis), where Xmis is to be filled in via multiple imputation.
If the distribution of Xmis, with parameter vector θ, were known, then Xmis could be imputed by drawing from the conditional distribution f(Xmis | Xobs, θ). Since θ is unknown, it shall be estimated from the data, and the distribution f(Xmis | Xobs, θ) used with θ replaced by its estimate. Since this estimate is itself a random variable,
The missing-data generating process may also depend on additional parameters φ, but if φ and θ are independent, the missing-data mechanism is ignorable and the analyst may concentrate on modelling the missing data given the observed data and θ. If the two processes are not independent, then a non-ignorable missing-data generating process pertains, which cannot be solved adequately without making assumptions on the functional form of the interdependency.
of which depends on the data. The first step in its estimation is to obtain the posterior distribution of θ from the data. Usually this posterior is approximated by a normal distribution. After formulating the posterior distribution of θ, the following imputation algorithm can be used:
Use the completed data X and the model to estimate the parameter of interest (e.g. the mean) θ* and its variance V(θ*) (the within-imputation variance).
The MCMC imputation scheme (Figure 11) proceeds as follows: estimate the mean vector and covariance matrix from the data that do not have missing values and use them to estimate the prior distribution. Imputation step: simulate values for the missing data items by randomly selecting a value from the available distribution of values. Posterior step: re-compute the posterior parameters. If the distribution is stationary (i.e. the mean vector and covariance matrix are unchanged as we iterate), use the imputation from the final iteration to form a data set without missing values; otherwise perform more iterations.
This combination will be the value that fills in the blank space in the data set.
The Multiple Imputation method imputes several values (N) for each missing value (from the predictive distribution of the missing data),
The N versions of completed data sets are analysed by standard complete data methods and the results combined using simple rules to yield single combined estimates (e g.
which formally incorporate missing data uncertainty. The pooling of the results of the analyses performed on the multiple imputed data sets implies that the resulting point estimates are averaged over the N completed sample points,
and the resulting standard errors and p-values are adjusted according to the variance of the corresponding N completed sample point estimates.
Thus, the between-imputation variance provides a measure of the extra inferential uncertainty due to missing data
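The pooling rules can be written in a few lines. The sketch below, with invented estimates and variances from N hypothetical imputed data sets, combines them into a single estimate and a total variance equal to the within-imputation variance plus (1 + 1/N) times the between-imputation variance.

```python
import numpy as np

# Suppose N imputed data sets each yield an estimate of the parameter of interest
# (e.g. a country's composite score) and its within-imputation variance.
estimates = np.array([0.612, 0.598, 0.621, 0.605, 0.615])   # hypothetical values
variances = np.array([0.0004, 0.0005, 0.0004, 0.0006, 0.0005])
N = len(estimates)

pooled = estimates.mean()                 # combined point estimate
within = variances.mean()                 # within-imputation variance
between = estimates.var(ddof=1)           # between-imputation variance
total = within + (1 + 1 / N) * between    # total variance of the pooled estimate

print(f"pooled estimate: {pooled:.4f}, total std. error: {np.sqrt(total):.4f}")
```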
which method he/she has to use to fill in empty data spaces. To the best of our knowledge there is no definitive answer to this question, but a number of rules of thumb (and a lot of common sense).
The choice principally depends on the dataset available (e.g. data expressed on a continuous scale, or ordinal data where methods like MCMC cannot be used),
the number of missing data as compared to the dimension of the dataset (few missing data in a large dataset probably do not require sophisticated imputation methods),
and the identity of the country and the indicator for which the data is missing.
Therefore there is no single method we advise using; rather, the method should be fitted to the characteristics of the missing information.
eliminate some of the data (for the same countries and in the same proportion of the complete dataset),
and $\bar{O}$ ($\bar{P}$) the average of the observed (imputed) data, and $\sigma_O$ ($\sigma_P$) the standard deviation of the observed (imputed) data.
As noticed by Willmott et al. (1985), the value of R2 could be unrelated to the sizes of the difference between the predicted and the observed values.
$\mathrm{RMSE} = \left[ \frac{1}{N} \sum_{i=1}^{N} (P_i - O_i)^2 \right]^{1/2}$ and $\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert P_i - O_i \rvert$. Finally, a complementary measure of accuracy
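A minimal sketch of such an accuracy check is given below: some values are treated as deliberately set aside 'observed' data, compared with the values an imputation method returned for them (simulated here), and RMSE, MAE and R2 are computed.

```python
import numpy as np

rng = np.random.default_rng(2)
observed = rng.normal(size=50)                        # values deliberately set aside
imputed = observed + rng.normal(scale=0.3, size=50)   # values returned by an imputation method

rmse = np.sqrt(np.mean((imputed - observed) ** 2))
mae = np.mean(np.abs(imputed - observed))
r2 = np.corrcoef(observed, imputed)[0, 1] ** 2

print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}, R^2 = {r2:.3f}")
```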
The majority of methods in this section are intended for data expressed on an interval or ratio scale,
although some of the methods have been used with ordinal data (for example principal components and factor analysis, see Vermunt
& Magidson, 2005). 4.1. Principal components analysis. The objective is to explain the variance of the observed data through a few linear combinations of the original data.
Even though there are Q variables, x1, x2, ..., xQ, much of the data's variation can often be accounted for by a small number of variables, the principal components, or linear relations of the original data, Z1, Z2, ..., ZQ, that are uncorrelated.
At this point there are still Q principal components, i.e. as many as there are variables. The next step is to select the first, e.g., P < Q, principal components that preserve a high amount of the cumulative variance of the original data.
It indicates that the principal components are measuring different statistical dimensions in the data. When the objective of the analysis is to present a huge data set using a few variables,
some degree of economy can be achieved by applying principal components analysis (PCA) if the variation in the Q original x variables can be accounted for by a small number of Z variables.
with a variance of 1.7. The third and fourth principal components have an eigenvalue close to 1. The last four principal components explain the remaining 12.8% of the variance in the data set.
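The eigenvalue and cumulative-variance calculations behind such statements can be reproduced with a few lines of linear algebra, as in the sketch below; the data are random stand-ins for the 23 x 8 TAI matrix, so the printed numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.random((23, 8))                    # illustrative stand-in for 23 countries x 8 indicators

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardise the indicators
corr = np.corrcoef(Z, rowvar=False)        # correlation matrix
eigval, eigvec = np.linalg.eigh(corr)      # eigen-decomposition (ascending order)
eigval = eigval[::-1]                      # sort eigenvalues in descending order

explained = eigval / eigval.sum()
print("eigenvalues:", np.round(eigval, 2))
print("cumulative variance explained:", np.round(np.cumsum(explained), 2))
print("components with eigenvalue > 1:", int((eigval > 1).sum()))
```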
Bootstrap refers to the process of randomly resampling the original data set to generate new data sets.
Although social scientists may be attracted to factor analysis as a way of exploring data whose structure is unknown,
Assumption of interval data. Kim & Mueller (1978) note that ordinal data may be used if it is thought that the assignment of ordinal categories to the data will not seriously distort the underlying metric scaling.
Likewise, the use of dichotomous data is allowed if the underlying metric correlation between the variables is thought to be moderate (0.7) or lower. The result of using ordinal data is that the factors may be much harder to interpret.
Note that categorical variables with similar splits will necessarily tend to correlate with each other, regardless of their content (see Gorsuch, 1983).
This is particularly apt to occur when dichotomies are used. The correlation will reflect similarity of"difficulty"for items in a testing context;
Principal component factor analysis (PFA), which is the most common variant of FA, is a linear procedure.
The smaller the sample size, the more important it is to screen data for linearity.
Multivariate normality of data is required for related significance tests.
Note, however, that a variant of factor analysis, maximum likelihood factor analysis, does assume multivariate normality. The smaller the sample size, the more important it is to screen data for normality.
Moreover, as factor analysis is based on correlation (or sometimes covariance), both correlation and covariance will be attenuated when variables come from different underlying distributions (e.g. a normal vs. a bimodal variable will correlate less than 1.0 even when both series are perfectly co-ordered).
Underlying dimensions shared by clusters of individual indicators are assumed. If this assumption is not met, the "garbage in,
Factor analysis cannot create valid dimensions (factors) if none exist in the input data. In such cases, factors generated by the factor analysis algorithm will not be comprehensible.
Likewise, the inclusion of multiple definitionally-similar individual indicators representing essentially the same data will lead to tautological results.
Strong intercorrelations are not mathematically required, but applying factor analysis to a correlation matrix with only low intercorrelations will require nearly as many factors as there are original variables,
thereby defeating the data reduction purposes of factor analysis. On the other hand, too high intercorrelations may indicate a multicollinearity problem
and collinear terms should be combined or otherwise eliminated prior to factor analysis. Notice also that PCA and Factor analysis (as well as Cronbach's alpha) assume uncorrelated measurement errors.
a) The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy is a statistic for comparing the magnitudes of the observed correlation coefficients to the magnitudes of the partial correlation coefficients.
The concept is that the partial correlations should not be very large if distinct factors are expected to emerge from factor analysis (Hutcheson & Sofroniou, 1999).
A KMO statistic is computed for each individual indicator, and their sum is the KMO overall statistic.
KMO varies from 0 to 1.0. The overall KMO should be 0.60 or higher to proceed with factor analysis (Kaiser & Rice, 1974), though realistically it should exceed 0.80 if the results of the principal components analysis are to be reliable.
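For reference, the overall KMO statistic can be computed from the correlation matrix and the partial correlations derived from its inverse, as in the following sketch on simulated correlated data (the kmo_overall helper is ours, not a library function).

```python
import numpy as np

def kmo_overall(X):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy (a sketch)."""
    R = np.corrcoef(X, rowvar=False)           # correlation matrix of the indicators
    R_inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(R_inv))
    P = -R_inv / np.outer(d, d)                # partial correlation matrix
    off = ~np.eye(R.shape[0], dtype=bool)      # off-diagonal elements only
    return (R[off] ** 2).sum() / ((R[off] ** 2).sum() + (P[off] ** 2).sum())

rng = np.random.default_rng(4)
# Correlated illustrative data: a common factor plus noise.
factor = rng.normal(size=(100, 1))
X = factor + 0.7 * rng.normal(size=(100, 6))
print(f"overall KMO: {kmo_overall(X):.2f}")
```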
but common cut-off criterion for suggesting that there is a multicollinearity problem. Some researchers use the more lenient cut-off VIF value of 5.0. c) Bartlett's test of sphericity is used to test the null hypothesis that the individual indicators in a correlation matrix are uncorrelated,
however, is whether the TAI data set for the 23 countries can be viewed as a random sample of the entire population, as required by the bootstrap procedures (Efron, 1987;
Several points can be made regarding the issues of randomness and representativeness of the data. First, it is often difficult to obtain complete information for a data set in the social sciences,
as controlled experiments are not always possible, unlike in natural sciences. As Efron and Tibshirani (1993) state, "in practice the selection process is seldom this neat,
A third point on data quality is that a certain amount of measurement error is likely to pertain.
While such measurement error can only be controlled at the data collection stage, rather than at the analytical stage
it is argued that the data represent the best estimates currently available (UN, 2001). Figure 12 (right graph) demonstrates graphically the relationship between the eigenvalues from the deterministic PCA,
and of how the interpretation of the components might be improved are addressed in the following section on factor analysis. 4.2. Factor analysis. Factor analysis (FA) is similar to PCA.
However, while PCA is based simply on linear data combinations, FA is based on a rather special model.
Contrary to PCA, the FA model assumes that the data are based on the underlying factors of the model,
and that the data variance can be decomposed into that accounted for by common and unique factors.
Principal components factor analysis is preferred most in the development of composite indicators, e g. in the Product Market Regulation Index (Nicoletti et al.
and are not sorted into descending order according to how much of the original data set's variance is explained.
which indicates that university does not move with the other individual indicators in the data set,
This conclusion does not depend on the factor analysis method, as it has been confirmed by different methods (centroid method, principal axis method).
it is unlikely that they share common factors. 2. Identify the number of factors necessary to represent the data
The c-alpha is 0.70 for the data set of the 23 countries, which is equal to Nunnally's cut-off value.
Note also that the factor analysis in the previous section had indicated university as the individual indicator that shared the least amount of common variance with the other individual indicators.
Although both factor analysis and the Cronbach coefficient alpha are based on correlations among individual indicators, their conceptual framework is different.
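The coefficient itself is straightforward to compute: alpha = k/(k-1) * (1 - sum of item variances / variance of the total score). The sketch below applies this formula to simulated standardised indicators; the cronbach_alpha helper is illustrative, not a library routine.

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach coefficient alpha for an indicator matrix (rows = cases, columns = indicators)."""
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)          # variance of each indicator
    total_var = X.sum(axis=1).var(ddof=1)      # variance of the summed score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(5)
common = rng.normal(size=(23, 1))              # shared component across indicators
X = common + 0.8 * rng.normal(size=(23, 8))    # 23 countries x 8 standardised indicators
print(f"Cronbach coefficient alpha: {cronbach_alpha(X):.2f}")
```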
Cronbach coefficient alpha results for the 23 countries after deleting one individual indicator (standardised values) at a time. 4.4. Cluster analysis. Cluster analysis (CLA) is a collection of algorithms to classify objects such as countries, species,
The classification aims to reduce the dimensionality of a data set by exploiting the similarities/dissimilarities between cases.
if the classification has an increasing number of nested classes, e.g. tree clustering; or non-hierarchical when the number of clusters is decided ex ante,
e g. k-means clustering. However, care should be taken that classes are meaningful and not arbitrary or artificial.
including Euclidean and non-Euclidean distances. The next step is to choose the clustering algorithm,
and hence different classifications may be obtained for the same data, even using the same distance measure.
if the data are categorical in nature. Figure 13 shows the country clusters based on the individual Technology Achievement Index
Standardised data. Type: hierarchical, single linkage, squared Euclidean distances. Sudden jumps in the level of similarity (abscissa) could indicate that dissimilar groups
which indicates that the data are best represented by ten clusters: Finland; Sweden and the USA;
A nonhierarchical method of clustering,
is k-means clustering (Hartigan, 1975). This method is useful when the aim is to divide the sample into k clusters of the greatest possible distinction.
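A minimal k-means run on standardised data might look like the following sketch (using scikit-learn on random stand-in data for the 23 countries x 8 indicators); the printed group memberships are of course arbitrary here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
X = rng.random((23, 8))                    # stand-in for 23 countries x 8 TAI indicators
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # k-means is run on standardised data

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
for g in range(3):
    print(f"group {g + 1}: countries {np.where(labels == g)[0].tolist()}")
```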
k-means clustering (standardised data). Table 13. K-means for clustering TAI countries into Group 1 (leaders), Group 2 (potential leaders) and Group 3 (dynamic adopters), covering Finland, Netherlands, Sweden, USA, Australia, Canada, New Zealand, Norway, Austria, Belgium, Czech Rep., France, Germany, Hungary, Ireland, Israel, Italy, Japan, Korea, Singapore, Slovenia, Spain and the UK. Finally, expectation
EM estimates mean and standard deviation of each cluster so as to maximise the overall likelihood of the data
Second, unlike k-means, EM can be applied to both continuous and categorical data.
Various alternative methods combining cluster analysis and the search for a low-dimensional representation have been proposed and focus on multidimensional scaling or unfolding analysis (e g.
A method that combines k-means cluster analysis with aspects of factor analysis and PCA is offered by Vichi & Kiers (2001).
A discrete clustering model and a continuous factorial model are fitted simultaneously to two-way data with the aim of identifying the best partition of the objects, described by the best orthogonal linear combinations of the variables (factors) according to the least-squares criterion.
This methodology, known as factorial k-means analysis, has a wide range of applications, since it achieves a double objective:
data reduction and synthesis simultaneously in the direction of objects and variables. Originally applied to short-term macroeconomic data,
factorial k-means analysis has a fast alternating least-squares algorithm that extends its application to large data sets, i.e. multivariate data sets with more than two variables. The methodology can therefore be recommended as an alternative to the widely used tandem analysis that sequentially performs PCA
and CLA. 4.5. Other methods for multivariate analysis. Other methods can be used for multivariate analysis of the data set.
The characteristics of some of these methods are sketched below citing textbooks where the reader may find additional information and references.
(or relevant relationships between rows and columns of the table) by reducing the dimensionality of the data set.
unlike factor analysis. This technique finds scores for the rows and columns on a small number of dimensions which account for the greatest proportion of the χ2 for association between the rows and columns,
Correspondence analysis starts with tabular data, e.g. a multidimensional time series describing the number of doctorates in 12 scientific disciplines (categories) given in the USA between 1960 and 1975 (Greenacre
The correspondence analysis of this data would show, for example, whether anthropology and engineering degrees are at a distance from each other (based on the
However, while conventional factor analysis determines which variables cluster together (parametric approach), correspondence analysis determines which category values are close together (nonparametric approach).
As in PCA, CCA implies the extraction of the eigenvalues and eigenvectors of the data matrix.
When the dependent variable has more than two categories then it is a case of multiple discriminant analysis (also called discriminant factor analysis or canonical discriminant analysis), e.g. to discriminate countries on the basis of employment patterns in nine industries (predictors).
This is the main difference from cluster analysis, in which groups are not predetermined. There are also conceptual similarities with principal components and factor analysis,
but while PCA maximises the variance in all the variables accounted for by a factor, DFA maximises the differences between values of the dependent.
and their robustness against possible outliers in the data (Ebert & Welsch, 2004). Different normalisation methods will produce different results for the composite indicator.
Using Celsius data normalised based on distance to the best performer, the level of Country A has increased over time.
Normalised data in Celsius: Country A 0.949, 0.949; Country B 0.833, 0.821. Normalised data in Fahrenheit: Country A 0.965, 0.965; Country B 0.833, 0.821. The example illustrated above is a case of an interval scale (Box 3
Another transformation of the data, often used to reduce the skewness of (positive) data, is the logarithmic transformation f:
bearing in mind that the normalised data will be affected by the log transformation. In some circumstances outliers can reflect the presence of unwanted information.
when data for a new time point become available. This implies an adjustment of the analysis period T,
To maintain comparability between the existing and the new data, the composite indicator for the existing data must be recalculated. 5.4. Distance to a reference. This method takes the ratio of the indicator $x_{qc}^{t}$ for a generic country c and time t with respect to the individual indicator $x_{q\bar{c}}^{t_0}$ for the reference country $\bar{c}$ at the initial time $t_0$: $I_{qc}^{t} = x_{qc}^{t} / x_{q\bar{c}}^{t_0}$.
Table 15. Examples of normalisation techniques using TAI data (indicator: mean years of school, age 15 and above): rank*, z-score, min-max, distance to reference country (c), above/below the mean**, percentile. (*) High value = top in the list. (**) p = 20%. Examples of the above normalisation methods are shown in Table 15 using the TAI data.
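The sketch below applies several of the normalisation methods discussed (ranking, z-scores, min-max, distance to a reference country, above/below the mean with a 20% threshold) to one invented indicator; the choice of the reference country and the threshold p are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.random(23) * 10 + 2          # one illustrative indicator across 23 countries
ref = x[0]                           # hypothetical reference country (first row)

rank = (-x).argsort().argsort() + 1            # rank 1 = highest value
z_score = (x - x.mean()) / x.std()
min_max = (x - x.min()) / (x.max() - x.min())
dist_to_ref = x / ref                          # distance to a reference country

p = 0.20                                       # threshold around the mean
w = x / x.mean()
above_below = np.where(w > 1 + p, 1, np.where(w < 1 - p, -1, 0))

print(np.round(np.c_[rank, z_score, min_max, dist_to_ref, above_below], 3))
```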
The data are sensitive to the choice of the transformation and this might cause problems in terms of loss of the interval level of the information,
thus poor data availability may hamper its use. In formal terms, let $x_{qi}$ be, as usual, the level of indicator q for country i.
STEP 6. WEIGHTING AND AGGREGATION. WEIGHTING METHODS. 6.1. Weights based on principal components analysis or factor analysis. Principal components analysis,
and more specifically factor analysis, groups together individual indicators which are collinear to form a composite indicator that captures as much as possible of the information common to individual indicators.
the composite no longer depends upon the dimensionality of the data set but rather is based on the statistical dimensions of the data.
According to PCA/FA, weighting intervenes only to correct for overlapping information between two or more correlated indicators and is not a measure of the theoretical importance of the associated indicator.
The first step in FA is to check the correlation structure of the data, as explained in the section on multivariate analysis.
The second step is the identification of a certain number of latent factors (fewer than the number of individual indicators) representing the data.
For a factor analysis, only a subset of m principal components is retained, i.e. those that account for the largest amount of the variance.
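A simplified sketch of the overall idea follows: unrotated principal-component loadings stand in for the rotated factor loadings, each indicator is assigned to the factor on which it loads highest, within-factor weights are proportional to squared loadings, and the factors are combined according to their share of explained variance. It approximates, rather than reproduces, the procedure described in this section.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.random((23, 8))                          # stand-in for the TAI indicators
Z = (X - X.mean(axis=0)) / X.std(axis=0)

corr = np.corrcoef(Z, rowvar=False)
eigval, eigvec = np.linalg.eigh(corr)
order = eigval.argsort()[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

m = int((eigval > 1.0).sum())                    # retain factors with eigenvalue > 1
loadings = eigvec[:, :m] * np.sqrt(eigval[:m])   # component loadings

# Each indicator is assigned to the factor on which it loads highest;
# its weight is the squared loading, scaled to sum to one within the factor.
sq = loadings ** 2
assign = np.abs(loadings).argmax(axis=1)
w_within = np.array([sq[i, assign[i]] / sq[assign == assign[i], assign[i]].sum()
                     for i in range(Z.shape[1])])

# Intermediate composites are then combined with weights equal to each
# factor's share of the explained variance.
factor_share = eigval[:m] / eigval[:m].sum()
weights = w_within * factor_share[assign]
print("indicator weights:", np.round(weights / weights.sum(), 3))
```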
With the reduced data set in TAI (23 countries) the factors with eigenvalues close to unity are the first four. Table 16. Eigenvalues of TAI data set (eigenvalue; variance %; cumulative variance %): 1: 3.3, 41.9, 41.9; 2: 1.7, 21.8, 63.7; 3: 1
Rotation is a standard step in factor analysis; it changes the factor loadings and hence the interpretation of the factors,
With the TAI data set there are four intermediate composites (Table 17). The first includes Internet (with a weight of 0.24),
The four intermediate composites are aggregated by assigning a weight to each one of them equal to the proportion of the explained variance in the data set:
0.08; Electricity 0.11, 0.12; Schooling 0.19, 0.14; University 0.02, 0.16. 6.2. Data envelopment analysis (DEA). Data Envelopment Analysis (DEA) employs linear programming tools to estimate an efficiency frontier that would be used as a benchmark to measure the relative performance of countries. This requires construction of a benchmark (the frontier) and the measurement
[Figure — Data envelopment analysis (DEA) performance frontier; axes: Indicator 1 and Indicator 2; countries a, b, c, d, d′. Source: rearranged from Mahlberg and Obersteiner (2001).]
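One common DEA-style formulation for composite indicators (often called "benefit of the doubt") lets each country choose the weights most favourable to it, subject to the constraint that no country scores above one with those weights. A minimal sketch, with invented scores and scipy's linear-programming routine:

```python
import numpy as np
from scipy.optimize import linprog

# Invented normalised scores: 5 countries x 3 indicators
X = np.array([
    [0.9, 0.4, 0.7],
    [0.6, 0.8, 0.5],
    [0.7, 0.7, 0.9],
    [0.3, 0.5, 0.4],
    [0.8, 0.6, 0.6],
])

def bod_score(c):
    """Benefit-of-the-doubt style score: pick the weights most favourable to
    country c, subject to no country scoring above 1 with those weights."""
    n_ind = X.shape[1]
    # linprog minimises, so negate the objective to maximise X[c] . w
    res = linprog(c=-X[c],
                  A_ub=X, b_ub=np.ones(X.shape[0]),   # X_k . w <= 1 for every country k
                  bounds=[(0, None)] * n_ind,
                  method="highs")
    return -res.fun

for c in range(X.shape[0]):
    print(f"country {c}: score = {bod_score(c):.3f}")
```

The country-specific optimal weights make each score a relative measure of performance against the frontier spanned by the best performers.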
The observed data consist of a cluster of q = 1, ..., Q(c) indicators, each measuring an aspect of ph(c). Let c = 1,
However, since not all countries have data on all individual indicators, the denominator of the weight w(c, q),
and $\sigma^2(q)$ (hence at least three indicators per country are needed for an exactly identified model), so the likelihood function of the observed data, based on equation (25), will be maximised with respect to the $\alpha(q)$'s, $\beta(q)$'s and $\sigma^2(q)$'s,
AHP allows for the application of data, experience, insight, and intuition in a logical and thorough way within a hierarchy as a whole.
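As an illustration of how weights can be extracted from such pairwise judgements, the sketch below uses an invented 3x3 comparison matrix and the principal-eigenvector method commonly associated with AHP; the matrix values are illustrative, not taken from the Handbook:

```python
import numpy as np

# Invented pairwise comparison matrix for three indicators (Saaty-style 1-9 scale):
# entry [i, j] expresses how much more important indicator i is judged than indicator j.
A = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 2.0],
    [1/5, 1/2, 1.0],
])

# Weights as the normalised principal eigenvector of the comparison matrix
eigval, eigvec = np.linalg.eig(A)
k = eigval.real.argmax()
w = eigvec[:, k].real
w = w / w.sum()
print("AHP-style weights:", np.round(w, 3))

# Consistency index: CI = (lambda_max - n) / (n - 1); values near zero indicate
# that the pairwise judgements are close to internally consistent.
n = A.shape[0]
ci = (eigval.real.max() - n) / (n - 1)
print("consistency index:", round(ci, 4))
```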
Although this methodology uses statistical analysis to treat the data, it relies on the opinion of people (e.g. experts, politicians, citizens),
when dealing with environmental issues.

6.9. Performance of the different weighting methods

The weights for the TAI example are calculated using different weighting methods: equal weighting (EW), factor analysis (FA), budget allocation (BAP
The role of the variability in the weights and their influence on the value of the composite are discussed in the section on sensitivity analysis.
[Table — TAI weights based on different methods: equal weighting (EW), factor analysis (FA), budget allocation (BAP), analytic hierarchy process (AHP). Columns: method; weights for the indicators (fixed for all countries): Patents, Royalties, Internet, Tech exports, Telephones, Electricity, Schooling, University. First row (EW) truncated in the source.]
Reliability and robustness of results depend on the availability of sufficient data. With highly correlated individual indicators there could be identification problems.
Can be used for both qualitative and quantitative data. Transparency of the composite is higher. Weighting is based on expert opinion and not on technical manipulations.
Let us then apply Borda's rule to the data presented in Table 25. To begin, the information can be presented in a frequency matrix fashion,
let us apply it to the data presented in Table 25. The outranking matrix is shown in Table 27;
For explanatory purposes, consider only five of the countries included in the TAI data set.
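As a hedged illustration of Borda's rule (on each indicator the worst-placed country receives zero points, the next one point, and so on, with points summed across indicators), the sketch below uses invented scores rather than the Table 25 data:

```python
import numpy as np

# Invented scores: rows = 5 hypothetical countries, columns = 3 indicators
scores = np.array([
    [0.72, 0.55, 0.80],
    [0.64, 0.70, 0.40],
    [0.58, 0.62, 0.75],
    [0.90, 0.35, 0.50],
    [0.45, 0.80, 0.60],
])

n_countries = scores.shape[0]
borda = np.zeros(n_countries)
for q in range(scores.shape[1]):
    # position in ascending order: worst country gets 0 points, best gets n-1
    # (ties would need an explicit rule; none occur in this toy example)
    points = scores[:, q].argsort().argsort()
    borda += points

print("Borda scores:", borda)
print("Borda ranking (best first):", borda.argsort()[::-1])
```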
only if data are all expressed on a partially comparable interval scale (i.e. temperature in Celsius or Fahrenheit) of the type $x \mapsto \alpha_i + \beta x$, with $\beta > 0$ and $\alpha_i$ varying across individual indicators.
Non-comparable data measured on a ratio scale (i.e. kilograms and pounds), where $x \mapsto \alpha_i x$ with $\alpha_i > 0$ varying across individual indicators, can only be aggregated meaningfully by using geometric functions,
technique for the TAI data set with 23 countries. Although in all cases equal weighting is used,
Uncertainty and sensitivity analysis

Sensitivity analysis is considered a necessary requirement in econometric practice (Kennedy, 2003) and has been defined as the modeller's equivalent of orthopaedists' X-rays.
This is what sensitivity analysis does: it performs the 'X-rays' of the model by studying the relationship between information flowing in and out of the model.
More formally, sensitivity analysis is the study of how the variation in the output can be apportioned, qualitatively or quantitatively, to different sources of variation in the assumptions,
Sensitivity analysis is thus closely related to uncertainty analysis, which aims to quantify the overall uncertainty in country rankings as a result of the uncertainties in the model input.
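A minimal Monte Carlo sketch of such an uncertainty analysis, with invented data and an illustrative (not the Handbook's) set of uncertain choices — normalisation scheme, aggregation rule and weights — propagated through to the country ranks:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((10, 4))                         # 10 hypothetical countries x 4 indicators

def normalise(X, how):
    if how == "minmax":
        return (X - X.min(0)) / (X.max(0) - X.min(0))
    return (X - X.mean(0)) / X.std(0, ddof=1)   # z-scores

def aggregate(Z, w, how):
    if how == "geometric":
        Z = Z - Z.min(0) + 0.1                  # shift so all values are strictly positive
        return np.prod(Z ** w, axis=1)
    return Z @ w                                # linear aggregation

ranks = []
for _ in range(1000):                           # one Monte Carlo experiment
    norm = rng.choice(["minmax", "zscore"])     # trigger: normalisation scheme
    aggr = rng.choice(["linear", "geometric"])  # trigger: aggregation rule
    w = rng.dirichlet(np.ones(4))               # trigger: perturbed weights summing to one
    ci = aggregate(normalise(X, norm), w, aggr)
    ranks.append((-ci).argsort().argsort() + 1) # rank 1 = best composite score

ranks = np.array(ranks)
print("median rank per country:", np.median(ranks, axis=0))
print("5th-95th percentile of rank for country 0:",
      np.percentile(ranks[:, 0], [5, 95]))
```

The spread of each country's rank across the simulations is the uncertainty-analysis output; sensitivity analysis then asks which of the triggers drives that spread.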
A combination of uncertainty and sensitivity analysis can help to gauge the robustness of the composite indicator ranking
Below we describe how to apply uncertainty and sensitivity analysis to composite indicators. Our synergistic use of uncertainty and sensitivity analysis has recently been applied for the robustness assessment of composite indicators (Saisana et al.
2005a; Saltelli et al. 2008) and has proven to be useful in dissipating some of the controversy surrounding composite indicators such as the Environmental Sustainability Index (Saisana et al.
and sensitivity analysis discussed below in relation to the TAI case study is only illustrative. In practice the setup of the analysis will depend upon which sources of uncertainty and
inclusion/exclusion of one indicator at a time, imputation of missing data, different normalisation methods, different weighting schemes and different aggregation schemes.
$\text{Rank}(CI_c)$ is an output of the uncertainty/sensitivity analysis. The average shift in country rankings is also explored.
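One common way to write this average shift in rankings (the notation here is a reconstruction, consistent with how the measure is used below) is:

$$\bar{R}_S = \frac{1}{M}\sum_{c=1}^{M}\left|\text{Rank}_{\text{ref}}(CI_c) - \text{Rank}(CI_c)\right|$$

where M is the number of countries and $\text{Rank}_{\text{ref}}(CI_c)$ is country c's rank under the reference version of the composite.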
The investigation of $\text{Rank}(CI_c)$ and $\bar{R}_S$ is the scope of the uncertainty and sensitivity analysis.

7.1. General framework

The analysis is conducted as a single Monte Carlo experiment,
[Table excerpt — input factor $X_1$: estimation of missing data; alternative 1: use bivariate correlation to impute missing data; alternative 2: assign zero to the missing datum.] The second input factor,
A scatter plot based sensitivity analysis would be used to track which indicator affects the output the most
(either for the BAP or AHP schemes) are assigned to the data. Clearly the selection of the expert has no bearing
such as the variance and higher order moments, can be estimated with an arbitrary level of precision related to the size of the simulation N.

7.3. Sensitivity analysis using variance-based techniques

A necessary step
when designing a sensitivity analysis is to identify the output variables of interest. Ideally these should be relevant to the issue addressed by the model.
2008), with nonlinear models, robust, model-free techniques should be used for sensitivity analysis. Variance-based techniques for sensitivity analysis are model-free
and display additional properties convenient in the present analysis, such as the following: they allow an exploration of the whole range of variation of the input factors, instead of just sampling factors over a limited number of values, e.g. in fractional factorial design (Box et al.
They allow for a sensitivity analysis whereby uncertain input factors are treated in groups instead of individually, and they can be justified in terms of rigorous settings for sensitivity analysis.
To compute a variance-based sensitivity measure for a given input factor $X_i$, start from the fractional contribution to the model output variance,
The $S_i$ and $S_{Ti}$, in the case of non-independent input factors, could also be interpreted as settings for sensitivity analysis.
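For intuition, the first-order index of factor $X_i$ is $S_i = V[E(Y \mid X_i)]/V(Y)$. A crude way to approximate it from plain Monte Carlo samples is to bin $X_i$ and compare the variance of the within-bin means of the output with its total variance; the test model below is invented, and more efficient estimators (e.g. those described in Saltelli et al., 2008) would be used in practice:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20000
X1 = rng.random(N)
X2 = rng.random(N)
Y = X1 ** 2 + 0.3 * X2 + 0.1 * rng.normal(size=N)   # invented test model

def first_order_index(x, y, bins=50):
    """Crude estimate of S_i = V[E(Y|X_i)] / V(Y) by binning X_i."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1))
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, bins - 1)
    cond_means = np.array([y[idx == b].mean() for b in range(bins)])
    return cond_means.var() / y.var()

print("S_1 approx.", round(first_order_index(X1, Y), 3))
print("S_2 approx.", round(first_order_index(X2, Y), 3))
```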
i.e. by censoring all countries with missing data. As a result, only 34 countries, in theory, may be analysed.
Hong Kong, as this is the first country with missing data. The analysis is restricted to the set of countries
Figure 19 shows the sensitivity analysis based on the first-order indices. The total variance in each country's rank is presented
The sensitivity analysis results for the average shift in rank output variable (equation (38)) are shown in Table 40.
Practically, however, in the absence of a genuine theory on what causes what, the correlation structure of the data set can be of some help in at least excluding causal relationships between variables (but not necessarily between the theoretical constructs
However, a distinction should be made between spatial data (as in the case of TAI) and data
The case of spatial data is more complicated, but tools such as Path Analysis and Bayesian networks (the probabilistic version of path analysis) could be of some help in studying the many possible causal structures
whereas a low value would point to the absence of a linear relationship (at least as far the data analysed are concerned).
However, the resulting path coefficients or correlations only reflect the pattern of correlation found in the data.
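To make this concrete, here is a minimal sketch of path analysis on invented data: for a hypothesised chain X1 -> X2 -> Y with standardised variables, the path coefficients are the standardised regression coefficients of each structural equation, and the implied X1-Y correlation is their product. The variables and the structure are illustrative, not the TAI model:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
# Invented data consistent with a hypothesised causal chain X1 -> X2 -> Y
X1 = rng.normal(size=n)
X2 = 0.6 * X1 + rng.normal(scale=0.8, size=n)
Y = 0.5 * X2 + rng.normal(scale=0.9, size=n)

def standardise(v):
    return (v - v.mean()) / v.std(ddof=1)

X1s, X2s, Ys = map(standardise, (X1, X2, Y))

# Path coefficients as standardised OLS coefficients of each structural equation
p21 = np.linalg.lstsq(X1s[:, None], X2s, rcond=None)[0][0]   # X1 -> X2
pY2 = np.linalg.lstsq(X2s[:, None], Ys, rcond=None)[0][0]    # X2 -> Y

print("path X1 -> X2:", round(p21, 2))
print("path X2 -> Y:", round(pY2, 2))
# The implied X1-Y correlation is the product of the two path coefficients,
# which is why the coefficients only reflect the observed correlation pattern.
print("implied corr(X1, Y):", round(p21 * pY2, 2),
      "observed:", round(np.corrcoef(X1, Y)[0, 1], 2))
```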
exclude the two least influential indicators from the TAI data set. Even in this scenario, TAI6, the indicators are weighted equally.
In the 1970s, factor analysis and latent variables enriched path analysis, giving rise to the field of Structural Equation Modelling (SEM; see Kline, 1998).
Measurement techniques such as factor analysis and item response theory are used to relate latent variables to the observed indicators (the measurement model),
(i) determine that the pattern of covariances in the data is consistent with the model specified;(
of which path diagram is more likely to be supported by the data. Bayesian networks are graphical models encoding probabilistic relationships between variables.
when combined with the data via Bayes' theorem, produces a posterior distribution. This posterior, in short a weighted average of the prior density and of the conditional density (conditional on the data), is the output of Bayesian analysis. Note that the output of classical analysis is rather a point estimate.
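The "weighted average" character of the posterior is easiest to see in the simplest conjugate case, a normal prior for a mean combined with normally distributed data of known variance; the numbers below are invented for illustration and are not taken from the Handbook:

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=2.0, scale=1.0, size=25)   # invented observations, known sigma = 1

mu0, tau0 = 0.0, 2.0        # prior mean and prior standard deviation
sigma, n = 1.0, len(data)

# Posterior precision is the sum of prior and data precisions; the posterior mean
# is the precision-weighted average of the prior mean and the sample mean.
prior_prec = 1 / tau0 ** 2
data_prec = n / sigma ** 2
post_var = 1 / (prior_prec + data_prec)
post_mean = post_var * (prior_prec * mu0 + data_prec * data.mean())

print("sample mean:", round(data.mean(), 3))
print("posterior mean:", round(post_mean, 3), " posterior sd:", round(post_var ** 0.5, 3))
```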
ii) to see how the different evidence (the data available) modifies the probability of each given node;(
the use of Bayesian networks is becoming increasingly common in bioinformatics, artificial intelligence and decision-support systems; 45 however, their theoretical complexity and the amount of computer power required to perform relatively simple graph searches make them difficult to implement in a convenient manner.
and the testing of the robustness of the composite using uncertainty and sensitivity analysis. The present work is perhaps timely,
When questioned on the data reliability/quality of the HDI, Haq said that it should be used to improve data quality,
rather than to abandon the exercise. In fact, all the debate on development and public policy and media attention would not have been possible
Similarly, Nicoletti and others make use of factor analysis in the analysis of, for example, product market regulation in OECD countries (Nicoletti et al.,
The media coverage of events such as the publishing of the World Economic Forum's World Competitiveness Index and Environmental Sustainability Index
REFERENCES

Anderberg M.R. (1973), Cluster Analysis for Applications, New York:
Binder D.A. (1978), Bayesian Cluster Analysis, Biometrika, 65: 31-38. Borda J.C. de (1784), Mémoire sur les élections au scrutin, in Histoire de l'Académie Royale des Sciences, Paris. Boscarino
. and Yarnold P.R. (1995), Principal components analysis and exploratory and confirmatory factor analysis, in Grimm and Yarnold, Reading and Understanding Multivariate Analysis.
In Sensitivity Analysis (eds. Saltelli A., Chan K., Scott M.), 167-197. New York: John Wiley & Sons.
Chantala K., Suchindran C. (2003), Multiple Imputation for Missing Data, SAS OnlineDoc: Version 8. Charnes A., Cooper W.W., Lewin A.Y.
and Seiford L.M. (1995), Data Envelopment Analysis: Theory, Methodology and Applications. Boston: Kluwer.
Cherchye L. (2001), Using data envelopment analysis to assess macroeconomic policy performance, Applied Economics, 33: 407-416. Cherchye L. and Kuosmanen T. (2002), Benchmarking sustainable development:
Dempster A.P. and Rubin D.B. (1983), Introduction (pp. 3-10), in Incomplete Data in Sample Surveys (vol. 2:
(2004b), Composite Indicator on e-business readiness, DG JRC, Brussels. Everitt B.S. (1979), Unresolved Problems in Cluster Analysis, Biometrics, 35: 169-181.
Funtowicz S.O., Munda G., Paruccini M. (1990), The aggregation of environmental data using multicriteria methods, Environmetrics, Vol. 1(4): 353-36.
in the 20th century, Journal of Computational and Applied Mathematics, Vol. 123(1-2). Gorsuch R.L. (1983), Factor Analysis.
Haq M. (1995), Reflections on Human Development, Oxford University Press, New York. Hartigan J.A. (1975), Clustering Algorithms, New York:
John Wiley & Sons, Inc. Hatcher L. (1994), A step-by-step approach to using the SAS system for factor analysis and structural equation modeling.
Heiser W.J. (1993), Clustering in low-dimensional space, in: Opitz O., Lausen B. and Klar R. (eds.), 1993.
Homma T. and Saltelli A. (1996), Importance measures in global sensitivity analysis of model output, Reliability Engineering and System Safety, 52(1): 1-17.
Kim J., Mueller C.W. (1978), Factor Analysis: Statistical Methods and Practical Issues, Sage Publications, Beverly Hills, California, pp. 88.
Covers confirmatory factor analysis using SEM techniques. See esp. Ch. 7. Knapp T.R., Swoyer V.H. (1967), Some empirical results concerning the power of Bartlett's test of the significance of a correlation matrix.
Factor Analysis as a Statistical Method, London: Butterworth and Co.
Little R.J.A. and Schenker N. (1994), Missing Data, in Arminger
Little R.J.A. (1997), Biostatistical Analysis with Missing Data, in Armitage P. and Colton T. (eds.)
Little R.J.A. and Rubin D.B. (2002), Statistical Analysis with Missing Data, Wiley Interscience, J. Wiley & Sons, Hoboken, New Jersey.
Mahlberg B. and Obersteiner M. (2001), Remeasuring the HDI by Data Envelopment Analysis, Interim Report IR-01-069, International Institute for Applied Systems Analysis, Laxenburg, Austria.
Massart D.L. and Kaufman L. (1983), The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, New York:
Milligan G.W. and Cooper M.C. (1985), An Examination of Procedures for Determining the Number of Clusters in a Data Set, Psychometrika, 50: 159-179.
/60/34002216.pdf. OECD (2007), Data and Metadata Reporting and Presentation Handbook, available at http://www.oecd.org/dataoecd/46/17/37671574.pdf). Parker
Saisana M., Nardo M. and Saltelli A. (2005b), Uncertainty and Sensitivity Analysis of the 2005 Environmental Sustainability Index, in Esty D., Levy M., Srebotnjak T. and de Sherbinin
. and Tarantola S. (2008), Global Sensitivity Analysis. The Primer, John Wiley & Sons. Saltelli A. (2007), Composite indicators between analysis and advocacy, Social Indicators Research, 81: 65-77.
Saltelli A., Tarantola S., Campolongo F. and Ratto M. (2004), Sensitivity Analysis in Practice: A Guide to Assessing Scientific Models, New York:
Software for sensitivity analysis is available at http://www.jrc.ec.europa.eu/uasa/prj-sa-soft.asp.
11-30. Sobol' I.M. (1993), Sensitivity analysis for nonlinear mathematical models, Mathematical Modelling & Computational Experiment, 1: 407-414.
Spath H. (1980), Cluster Analysis Algorithms, Chichester, England: Ellis Horwood. Storrie D. and Bjurek H. (1999), Benchmarking European labour market performance with efficiency frontier technique, Discussion Paper FS I 00-2011.
Tarantola S., Jesinghaus J. and Puolamaa M. (2000), Global sensitivity analysis: a quality assurance tool in environmental policy modelling.
Sensitivity Analysis, pp. 385-397. New York: John Wiley & Sons. Tarantola S., Saisana M., Saltelli A., Schmiedel F. and Leapman N. (2002), Statistical techniques and participatory approaches for the composition of the European Internal Market Index 1992
Vermunt J.K. and Magidson J. (2005), Factor analysis with categorical indicators:
). Available at http://spitswww.uvt.nl/vermunt/vanderark2004.pdf. Vichi M. and Kiers H. (2001), Factorial k-means analysis for two-way data, Computational Statistics
Decisions and evaluations by hierarchical aggregation of information, Fuzzy Sets and Systems, 10: 243-260.
Economists have long been hostile to subjective data. Caution is prudent, but hostility is not warranted.
Another common method (called imputing means within adjustment cells) is to classify the data for the individual indicator with some missing values in classes
which, for complex patterns of incomplete data, can be a very complicated function. As a result, these algorithms often require algebraic manipulations and complex programming.
but careful computation is needed. 14 For NMAR mechanisms one needs to make assumptions on the missing-data mechanism
as in most American cities it is not possible to go directly between two points. 20 The Bartlett test is valid under the assumption that data are a random sample from a multivariate normal distribution.
z Y C, see the note above. 35 Compensability of aggregations is studied widely in fuzzy set theory, for example Zimmermann & Zysno (1983) use the geometric operator
in the multiplicative aggregation it is proportional to the relative score of the indicator with respect to the others. 39. Data are not normalised.
whenever it does not change the ordinal information of the data matrix.