and the upper quartile covers the lowest 75 per cent of the data. The difference between the upper and lower quartiles is called the interquartile range.
It represents the middle 50 per cent of the data. Radar Chart A radar chart is a graphical method of displaying multivariate data in the form of a two-dimensional chart of quantitative variables represented on axes starting from the same point.
In the following example the data of the benchmarked cluster is indicated by a green line
and compared to the data of the clusters in its specific technology area (orange line) and all technology areas (blue line).
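To make such a figure easier to reproduce, the following is a minimal sketch of a three-series radar chart drawn with matplotlib; the indicator labels, cluster scores and colour assignments are illustrative placeholders rather than values from the benchmarking exercise.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative benchmarking indicators and scores (0-100); values are placeholders.
labels = ["Age", "Size", "Composition", "Financing", "Services", "Internationalisation"]
benchmarked = [70, 55, 80, 60, 75, 50]   # benchmarked cluster (green)
tech_area   = [65, 60, 70, 55, 65, 45]   # clusters in the same technology area (orange)
all_areas   = [60, 58, 65, 50, 60, 40]   # clusters across all technology areas (blue)

# One axis per indicator; repeat the first point so each polygon closes.
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
angles += angles[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
for values, colour, name in [(benchmarked, "green", "Benchmarked cluster"),
                             (tech_area, "orange", "Same technology area"),
                             (all_areas, "blue", "All technology areas")]:
    vals = values + values[:1]
    ax.plot(angles, vals, color=colour, label=name)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))
plt.show()
```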
Box 1: Explanation of figures used to present the results of the benchmarking
The overview includes data on the age of cluster management organizations, the size of clusters, the composition of clusters,
Since 2006 the annual performance of the clusters that are supported through the program has been measured using quantitative data, e.g. indicators on the number of new services or products
and other types of evaluations, as those databases do not contain all the detailed data that is usually required for the analysis or evaluation of a certain program.
restricting the scope of the analysis. Another limitation refers to the predominantly quantitative character of the data collected,
and data is not commonly available for non-technological innovation as a proportion of firm employment or turnover.
Chapter 2 provides data on SME innovation performance and constraints across 40 economies and examines the major and new policies that have been introduced.
Definitions Supporting Frameworks for Data Collection, OECD Statistics Working Papers, 2008/1, OECD Publishing, Paris, doi: 10.1787/243164686763.
New Evidence from Micro Data, Ch. 1, pp. 15-82, University of Chicago Press, Chicago. Baumol, W. (2002), The Free-Market Innovation Machine:
and advertising data. And, as we would expect of a dynamic open innovation strategy based on the innovation value chain, its ability to capture value at different points in the innovation value chain shifts,
OECD trade data does confirm that the past decade saw a dramatic increase in the scale of international trade in knowledge services;
US data suggests that SMEs have been increasing their R&D spending in past decades;
*Data was not compiled for 2009. Ireland is not alone in experiencing this decline in entrepreneurial activity.
allowing for data availability and any necessary time-lags. The portfolio approach adopted proved valuable as it allowed the analysis to focus not just on individual programmes,
According to EVCA data, all Irish VC firms have invested circa €963 million in Irish firms since 2000.
IVCA data included investments by angel investors and corporations that are not considered to be VC firms. 23 PwC review. 24 For example,
data required for evaluation purposes is not currently being collected or collated, and that this needs to be addressed.
and Collect and collate data required for programme evaluation, and in particular facilitate the delineation of activities/supports directed toward the stimulation of entrepreneurship and start-ups.
Appropriate metrics and approaches to data collection, collation and analysis should be identified at the outset relating to programme inputs, activities/processes, outputs and outcomes.
Using data from the Annual Business Survey of Economic Impact (ABSEI), comparator groups from the wider population of Enterprise Ireland supported firms have been constructed, controlling for age of firm, numbers employed, turnover and sector:
For survival rates, CRO registration data for 2004-2006 provides a reference group for companies' trading status up to early 2011.
using the rich data collected annually from agency supported firms. A previous review of HPSU supports was carried out by Enterprise Ireland in 2010.
and recipients' feedback on the importance of EnterpriseSTART in attaining HPSU status. A limitation in terms of data collected was identified during the course of the evaluation. Entrepreneurs may be registered formally on the internal monitoring system,
and whether it is value for money, due in part to a lack of data,
Enterprise Ireland data and Forfás analysis. Effectiveness: Effectiveness covers the extent to which the outputs have led to the desired outcomes.
Data on the progression of participants from the pilot phase to become HPSU/Pre-HPSU is not available.
and the Ideagen events. Any displacement effect of Ideagen is small.
The data on the Enterprise Ireland partner funds is provided in terms of the two separate Schemes as it is not generally possible
EU level data shows that seed capital funds typically experience greater challenges than VC funds in raising private capital
and over US$3.1 trillion in revenue in the United States, representing 11 per cent of private sector employment and 21 per cent of GDP (2010 data).
They provide data on the partner funds themselves and on the investments they make in terms of the size of individual investments in companies, the company's stage of development and
A note of caution relates to the challenges faced in providing comprehensive and comparative data for seed and VC funding.
The analysis below is based on data from the Enterprise Ireland Seed & Venture Capital Programme Annual Reports and data provided by the European Venture Capital Association (EVCA).
The EVCA compiles data provided to it by national VC associations. The Irish Venture Capital Association (IVCA) is the relevant body in Ireland.
Companies and entrepreneurs benefit from an expanded pool of funds available for export oriented high technology start-ups
and scaling companies. Data from the EVCA shows that Irish VC firms have invested circa €963 million in Irish firms since 2000.
Private funds are attracted into the Irish market. Data from the IVCA states that there has been €3 billion of investment in Irish SMEs since 2000.
The IVCA data is broad in scope and includes investments by angel investors and corporations that are not considered to be VC firms.
However, it should be noted that the IVCA data is quite broad in scope and includes investments by angel investors
They have also engaged in high levels of follow-on funding. Specific data on fund management
The OECD data shows that VC investment in Ireland still only accounts for a small proportion of GDP
(Figure: Germany, France, UK, Canada, USA; 2000-2003, 2005, 2008, 2009.) Data on the numbers of investments and the numbers of companies invested in by Irish Seed
Secondly, data from the IVCA indicates that, aside from the EI partner funds, further private VC investment has been attracted into Irish-based SMEs.
Secondly, data from the IVCA states that there has been €3 billion of venture, angel and related investment in Irish SMEs since 2000.
the IVCA data is broad in scope and includes investments by angel investors and corporations that are not considered to be VC firms.
Data on this is not available due to commercial and confidentiality considerations. Over the medium to longer term, there are also real and positive impacts associated with the programme in terms of employment,
Based on a range of data sources, it is estimated that each year on average a typical CEB: handles some 800 to 1,000 queries;
Using 2009 data this averages out as follows per individual CEB ('000): Current costs 383; Measure One Grants 269; Measure Two Grants 304; Total 956. The CEBs operate within national policy
Data for 2010 and 2011, where available, has also been used. The methodology follows the template for entrepreneurship and start-up programmes developed in the Forfás Evaluation Framework.
The methodology included analysis of the data contained in the management information systems operated by the CEBs,
existing reports and data provided by the Central Co-ordination Unit in Enterprise Ireland, case studies of 7 CEBs including office visits and analysis of locally available data, a survey of former SYOB participants,
a client focus group, specific enquiries to CEBs, and an international literature review. The CCU has greatly facilitated this evaluation by providing aggregate data on the CEBs' activities.
As a large part of the period under review predated the establishment of the CCU in 2007, there were considerable data challenges associated with this evaluation. 117 Report of the Enterprise Strategy Group, Ahead of the Curve, 2004. 118 Framework for Evaluation of Enterprise Supports, 2011,
Forfás. 119 The case studies included a representative sample of CEBs, taking into account location, urban/rural split and size.
noting that the data currently being collected is not collected for the purposes of evaluation, in that the data appropriate to monitoring the impact of the CEBs' activities is not generally available.
together with an allocation of a share of the current costs of the CEBs. 120 The data gathered by the CEBs
Analysis of CCU data The Central Co-ordination Unit collects data on the number of participants on training courses,
Based on the data from the Central Co-ordination Unit and the results of a complementary survey, it has been estimated that 52 per cent of attendees at training courses run by CEBs were from new start-ups.
Analysis of CCU data and CEB survey (includes indirect costs). Over the period 2008-2010 the total expenditure by CEBs on start-up supports is estimated at between €17m and €18.8m per annum.
the available data do not enable this to be encapsulated easily. There are two reasons for this:
Derived from CCU data. Over the period 2004 to 2010 as a whole, the vast bulk of the grants made were in respect of capital or employment projects.
Derived from CCU data. Note: The data in Table 9.11 above includes grants from both the European Regional Development Fund and the European Globalisation Fund.
The new CEB Financial Instruments of Feasibility/Innovation, Priming and Business Expansion came into being in November 2009
which is why there are no data entries before that time. There is evidence that in 2010 the average grant size fell
Derived from CCU data. Start Your Own Business Training: A total of 18,899 individuals participated in SYOB soft support courses from 2005-2010 (data is unavailable for 2004).
Derived from CCU data. The analysis conducted for this evaluation indicates that 80 per cent of financial supports are directed to start-up enterprises,
Precise data are not available to measure these attributes; however, we have looked to job creation estimates for all grant-aided firms as an indicator of the scale of impacts.
Derived from CCU data. These metrics indicate that the CEBs have become more efficient in more recent years (Table 9.24).
The survey data indicates a high start-up rate but unfortunately, the response from course participants from former years is too low to enable longevity to be assessed.
Precise data are not available to measure these attributes. However, the analysis indicates that over the seven year period under review approximately 5,400 start-up companies received financial supports.
Analysis of the CEB activities for policy-making purposes requires data which are not currently being collected
and Collect and collate data required for programme evaluation, and in particular facilitate the delineation of activities/supports directed toward the stimulation of entrepreneurship and start-ups.
To convert these minimal pressure differences into a convenient tool for recording weather data, the metal cells were brought into contact with a liquid that reacts accurately to these small differences
Analysing empirical data for EU companies, the report 7 shows that innovative companies are more likely to export,
published innovation performance data from the Innovation Union Scoreboard (IUS) 11 and the Regional Innovation Scoreboard (RIS) 12 has been used to benchmark performance against the EU-27 average and the UK.
albeit with older RIS data and against fewer indicators, the business community in Ireland tended to be better across the following metrics:
There is a gap in the data for financial organisations regarding venture capital in Northern Ireland, although recent analysis of BVCA data suggests that just over 2% of UK-wide VC investment between 1989 and 2010 was in Northern Ireland. 13 From EU-wide metrics Ireland in 2011 lags
significantly behind both the EU-27 average and the UK (see Figure 3). (Figure: Innovation Ecosystem Actors: firms, financial services organisations, higher education institutes, innovation support agencies, business services organisations, intermediary bodies, policy makers.)
and the 2009 RIS uses data from 2004 and 2006 for all EU27 regions. 13 Northern Ireland Science Park,
that of moderate innovator. 14 In the absence of data to assess the performance of each category of innovation actor,
The data on human resources help form a view on firms' absorptive capacity. A significant majority of past innovators (68%) said they possessed ambition to grow.
While there is limited data to benchmark their contribution, InterTradeIreland research from 2009 on the design services sector on the island suggested that such services were under-utilised, both because of a lack of design companies (the sector is approximately one third the relative size of the UK's) and because of a lack of understanding of the application and benefits of such
While data exists on trade balances for virtually all nations, data on the extent of export discounts
and import restrictions (especially through non-tariff barriers) are difficult to obtain. Despite this, at a cursory level it would appear that nations like Austria, Germany,
and monitor performance data to ensure that actions taken are showing positive results. Until now there has been no designated 'entrepreneurship policy unit' within the Government system that seeks to coordinate responses and programmes across government for startups.
The data for this information does exist; however, it is not accessible because of the variety of systems in
which the data is kept, between Revenue, the Companies Registration Office, and other sources. Use of up-to-date performance data to monitor startup levels will be important.
The Forum proposes the establishment of a 'Startup Monitor' by the Department of Jobs, Enterprise and Innovation to facilitate performance monitoring.
The collation of appropriate metrics could be supported by relevant available data obtained from the Revenue Commissioners.
and clearly developed value propositions and, where appropriate, are export-oriented in their thinking early in their development. CSO data indicates that in 2011 there were almost 190,
25-34 (10%) and 45-54 (9%). It is lowest amongst those aged 18-24 (7.6%) and 55-64 (4.6%). This data suggests there is perhaps untapped potential amongst females, youth,
It is based on analysis of comprehensive data sets from more than 120 countries that marshal information about the 3As of development:
Altogether, the index construction integrates 31 variables, 16 from GEM and 15 from other data sources, into 14 pillars and three sub-indexes.
(Rankings and scores: Ireland 19, 61.8; Puerto Rico 20, 61.7; Brazil 81, 30.4; Bangladesh 121, 13.8.) Using this entrepreneurship-related data to compare countries,
Using this data we will create an accurate picture of the entrepreneurship ecosystem in Ireland.
and data for Ireland compiled across international benchmarks. This analysis will reveal the particular conditions that are driving high
Data on take-up of the scheme is not yet available. 2.1.3 Share-Based Remuneration in Private Companies: Share-based employee remuneration can significantly reduce fixed labour costs
according to CSO data for the period 2007 to 2012. This suggests there is a strong need to ensure that the framework conditions
evidence based case for the viability of their proposed solution (for example a desk based feasibility study with some supporting practical work/data).
The data for these indicators will be collated by DJEI in collaboration with the relevant agencies. The attached tables of indicators are not exhaustive
and data for Ireland compiled across international benchmarks. Many of the performance indicators listed below focus on output.
initiated (CRO): 1,967; Survival rate for enterprises at 5 years (CSO): 48.4%. The performance indicators identified above are not exhaustive (for example the data on startups
The data values for each variable are gathered from a wide range of sources. Appendix 3:
Economists David Autor and Mark Dugan argue that the SSDI eligibility application process should focus on objective data with specific maladies for
The evidence in these data is that hours of work are, as found in much of the previous work,
accessed October 15, 2013), http://dx.doi.org/10.1787/strd-data-en. 18. Luke A. Stewart and Robert D. Atkinson, University Research Funding:
A Profile (HCFA, July 2000), 12, http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Thechartseries/Downloads/35chartbk.pdf. 32.
accessed October 15, 2013), http://dx.doi.org/10.1787/lfs-data-en. 41. Social Security Administration, Disability Insurance Benefit Payments (annual benefits paid from the DI Trust Fund;
Author calculations using data from SSDI website. See: Social Security Administration, 2013 OASDI Trustees Report (Table V.A3).
Congressional Budget Office, Historical Budget Data Statistics (historical data on revenues, outlays, and the deficit or surplus for fiscal years 1973 to 2012;
Historic post-recession economic data suggest that private-sector R&D spending will also increase in 2014.
according to historic data from the National Science Foundation. Research intensity has been correlated with macroeconomic growth, and has been the foundation of U.S. technological innovation.
Forecasting must also accommodate updates and corrections to historic economic data. For example, this 2014 Forecast incorporates a revision of the NSF's 2011 baseline research expenditures1.
as well as the most recent data (2011) from the NSF's Business R&D and Innovation Survey (BRDIS).
which comprehensive data are available. Several institutions exceeded $1 billion in research that year, including Johns Hopkins University (including the Applied Physics Laboratory), the University of Michigan at Ann Arbor, the University of Washington at Seattle,
and certain regulatory requirements like meaningful use of electronic data will accelerate new markets and healthcare efficiency,
REFERENCES AND METHODOLOGY ENDNOTES RESOURCES 1 During 2013 the release of new data from NSF modified the historical data for industry
B Stakeholders Consulted; Appendix C Statistical Data; Table of Figures: Figure 1 Economic Performance Indicators...
and the EU average. 2 The source of all data in this section is the Hungarian Central Statistical Office,
expensive research 5 The source of all data in this section is the Hungarian Central Statistical Office,
(Figure: HU vs EU27, SME innovation data, regional estimates 2010: technological innovators (% of SMEs); non-technological innovators (% of SMEs).)
Although no data are available about the share of Structural Funds in GBAORD in the 2011-2013 period, the official statement that 95%(!
Opten Ltd. http://www. opten. hu/1758-2013-x-24-korm-hatarozat-j238613. html 17 Data for the number of beneficiaries and the volume of the allocated support
Publicly Regional Innovation Monitor Plus 33 available data about the applicant companies should be checked, including the number of their employees and the volume of applicants'net sales.
01.07.2014). Appendix C Statistical Data: HU22 Nyugat-Dunántúl (table columns: Country, EU27, Year, Source, Performance relative to...)
The latest CSO data shows that 80,000 additional people are at work compared to Q1 2012
launch of an Open Data Portal for access to public sector databases; launch of an Energy efficiency Fund and other measures such as raising standards in the retrofitting of homes;
Benchmark metrics are supplemented with key data from the annual surveys of the Enterprise Agencies, such as employment, expenditure, sales,
with the CSO data indicating that some of the largest employment increases have been in the domestic economy areas. 5 Framework for the Evaluation of Enterprise Supports (2011), Forfás.
across a range of areas including ICT, data analytics, international sales, engineering and entrepreneurship in initiatives such as the ICT Skills Action Plan, Springboard and Momentum.
and our recognised success in Big Data and data analytics; grid integration of renewables, with associated 'smart grid' components.
DJEI data indicates that currently 7 per cent of sales of indigenous firms is from new products and services,
including fit-for-purpose data infrastructure (see Section 3.3 below). In 2014 Knowledge Transfer Ireland was launched
Building world class data management infrastructure. The overall ambition of the Disruptive Reform is to build on existing enterprise strengths to make Ireland a leading country in Europe in the area of Big Data and Data Analytics.
A number of significant initiatives and investments were progressed in 2013 and 2014 in partnership with the enterprise sector
and in 2014 the Government launched the Open Data Portal to act as the primary source of public sector datasets.
2. Building on our research strengths, consolidate Ireland's leadership position in Big Data/Data Analytics within Horizon 2020
and 5. Develop an internationally competitive data infrastructure. In light of the new policy actions required in 2015 the mission
and in order to achieve the benefits of data-driven innovation, policy must take into account the full data value cycle and the role of all stakeholders.
By focusing on developing a coherent ecosystem Ireland can bridge the gap between R&D innovation and adoption and take the lead in developing concrete solutions and applications.
An important element of the ecosystem for data-intensive companies operating in Ireland is the system of data protection
and focus of the Big Data Taskforce with the goal to oversee progress towards the strategic goals of the Disruptive Reform.
The RFT reflected the requirements of licensing authorities including managing licence application forms, registration of licensees, managing licence applications and renewals, remittance of licence fees, transmission and security of data
and examined on a monthly basis granular data from both the Bank of Ireland and the AIB.
It is important to recognise that in relation to commercial entities the type of data available varies in terms of both quality
and data and as such part of the work of the SME State Bodies group will involve engaging with commercial entities around the issue of improving the provision of data on SME lending.
and monitor data, including Central Bank data, on lending to SMEs from both bank and non-bank sources, including the full range of state-sponsored initiatives, and report on this issue to the Cabinet Committee on Economic Recovery and Jobs twice yearly.
(SME State Bodies Group) 105 Detailed data from AIB and Bank of Ireland will be collated and examined on a monthly basis, ensuring a more informed understanding of the SME bank lending environment,
with a particular focus on new lending. D/Finance, Credit Review Office) 106 Following the passing of the appropriate primary legislation implement
This data suggests that 70 per cent of Irish SMEs and large corporates use trade credit.
The latest available data shows that exports of Irish goods and services have risen to a record €184 billion in 2013
DAFM) 242 A Memorandum of Understanding covering enhanced data cooperation between Revenue and the CSO to produce wider and deeper statistical analyses will reduce the administrative burden on businesses arising from CSO surveys.
targeted support for the beef sector via the highly innovative Beef Data and Genomics Programme;
IoT will generate huge volumes of data through smart connected objects, leading to challenges around quality and interoperability that will have to be addressed.
better use of data derived from process and product analytics. In addition, the existing Principal Investigators in NIBRT are developing a research plan to get started in ADC manufacturing in collaboration with other centres such as SSPC (Synthesis
Data Analytics Research; CER Comprehensive Expenditure Review; CRFs Clinical Research Facilities; CRO Credit Review Office; CSO Central Statistics Office; CSSO
personalised public services, using open data and services, enhancing transparency and decision-making processes of public administrations,
Against the update of structural data, the project will test these hypotheses on the qualitative impacts of the Third Sector in terms of capital building (e g. social networks,
1.3 Imputation of missing data; 1.4 Multivariate analysis; 1.5 Normalisation of data;
1.6 Weighting and aggregation; 1.7 Robustness and sensitivity; 1.8 Back to the details;
2.2 Quality dimensions for basic data; 2.3 Quality dimensions for procedures to build
Step 3. Imputation of missing data; 3.1 Single imputation; 3.2 Unconditional mean imputation;
6.2 Data envelopment analysis (DEA); 6.3 Benefit of the doubt approach (BOD);
Examples of normalisation techniques using TAI data; Table 16. Eigenvalues of TAI data set;
Table 17. Factor loadings of TAI based on principal components; Table 18. Weights for the TAI indicators based on maximum likelihood (ML) or principal components (PC) method for the extraction of the common factors;
Data envelopment analysis (DEA) performance frontier; Figure 17. Analytical hierarchy process (AHP) weighting of the TAI indicators
-Young-Levenglick; CLA Cluster Analysis; DEA Data Envelopment Analysis; DFA Discriminant Function Analysis; DQAF Data Quality Framework; EC European Commission; EM Expected
whereby a lot of work in data collection and editing is wasted or hidden behind a single number of dubious significance.
and use of composite indicators in order to avoid data manipulation and misrepresentation. In particular, to guide constructors
Data selection. Indicators should be selected on the basis of their analytical soundness, measurability, country coverage,
when data are scarce. Imputation of missing data. Consideration should be given to different approaches for imputing missing values.
Extreme values should be examined as they can become unintended benchmarks. Multivariate analysis. An exploratory analysis should investigate the overall structure of the indicators
assess the suitability of the data set and explain the methodological choices, e g. weighting, aggregation.
Skewed data should also be identified and accounted for. Weighting and aggregation. Indicators should be aggregated and weighted according to the underlying theoretical framework.
or excluding single indicators, the normalisation scheme, the imputation of missing data, the choice of weights and the aggregation method.
Back to the real data. Composite indicators should be transparent and fit to be decomposed into their underlying indicators or values.
and the underlying data are freely available on the Internet. For the sake of simplicity, only the first 23 of the 72 original countries measured by the TAI are considered here.
dimensions of technological capacity (data given) in Table A. 1 Creation of technology. Two individual indicators are used to capture the level of innovation in a society:(
The quality of a composite indicator as well as the soundness of the messages it conveys depend not only on the methodology used in its construction but primarily on the quality of the framework and the data used.
A composite based on a weak theoretical background or on soft data containing large measurement errors can lead to disputable policy messages
especially as far as methodologies and basic data are concerned. To avoid these risks, the Handbook puts special emphasis on documentation and metadata.
process. 2. Data selection Should be based on the analytical soundness, measurability, country coverage, and relevance of the indicators to the phenomenon being measured and relationship to each other.
when data are scarce (involvement of experts and stakeholders is envisaged at this step). To check the quality of the available indicators.
To create a summary table on data characteristics, e.g. availability (across country, time), source, type (hard,
3. Imputation of missing data is needed in order to provide a complete dataset (e g. by means of single or multiple imputation).
To check the underlying structure of the data along the two main dimensions, namely individual indicators and countries (by means of suitable multivariate methods, e g.,
To compare the statistically-determined structure of the data set to the theoretical framework and discuss possible differences. 5. Normalisation should be carried out to render the variables comparable. To select suitable normalisation procedure(s) that respect both the theoretical framework and the data properties.
To discuss the presence of outliers in the dataset as they may become unintended benchmarks.
To select appropriate weighting and aggregation procedure (s) that respect both the theoretical framework and the data properties.
the normalisation scheme, the imputation of missing data, the choice of weights, the aggregation method.
and/or ranks. 8. Back to the data is needed to reveal the main drivers for an overall good or bad performance.
To develop data-driven narratives based on the results. 10. Visualisation of the results Should receive proper attention,
Criteria for assuring the quality of the basic data set for composite indicators are discussed in detail in Section 2:
the data selection process can be quite subjective as there may be no single definitive set of indicators.
A lack of relevant data may also limit the developer's ability to build sound composite indicators.
Given a scarcity of internationally comparable quantitative (hard) data, composite indicators often include qualitative (soft) data from surveys or policy reviews.
Proxy measures can be used when the desired data are unavailable or when cross-country comparability is limited.
For example data on the number of employees that use computers might not be available. Instead, the number of employees who have access to computers could be used as a proxy.
As in the case of soft data, caution must be taken in the utilisation of proxy indicators.
To the extent that data permit, the accuracy of proxy measures should be checked through correlation and sensitivity analysis.
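As a small illustration of that check, the sketch below computes the correlation between a hypothetical proxy and the target series on the observations where both are available; the series names and numbers are placeholders.

```python
import numpy as np

# Hypothetical example: checking how well a proxy tracks the target series
# on the observations where both are available.
target = np.array([12.0, 15.5, 9.8, 20.1, 17.3, np.nan, 14.2])   # desired indicator (with a gap)
proxy  = np.array([11.5, 16.0, 10.2, 19.5, 18.0, 13.0, 13.8])    # more widely available proxy

mask = ~np.isnan(target) & ~np.isnan(proxy)
corr = np.corrcoef(target[mask], proxy[mask])[0, 1]
print(f"Pearson correlation between proxy and target: {corr:.2f}")
```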
The builder should also pay close attention to whether the indicator in question is dependent on GDP or other size-related factors.
The quality and accuracy of composite indicators should evolve in parallel with improvements in data collection and indicator development.
The current trend towards constructing composite indicators of country performance in a range of policy areas may provide further impetus to improving data collection,
identifying new data sources and enhancing the international comparability of statistics. On the other hand we do not marry the idea that using
Poor data will produce poor results in a garbage-in garbage-out logic. From a pragmatic point of view,
Created a summary table on data characteristics, e.g. availability (across country, time), source, type (hard,
1.3. Imputation of missing data. The idea of imputation could be both seductive and dangerous. Missing data often hinder the development of robust composite indicators.
Data can be missing in a random or nonrandom fashion. The missing patterns could be:
Missing values do not depend on the variable of interest or on any other observed variable in the data set.
and if (ii) each of the other variables in the data set would have to be the same, on average,
but are conditional on other variables in the data set. For example, the missing values in income would be MAR
if the probability of missing data on income depends on marital status but, within each category of marital status,
whether data are missing at random or systematically, while most of the methods that impute missing values require a missing at random mechanism, i e.
There are three general methods for dealing with missing data: (i) case deletion, (ii) single imputation or (iii) multiple imputation.
The other two approaches consider the missing data as part of the analysis and try to impute values through either single imputation, e g. mean/median/mode substitution, regression imputation,
Data imputation could lead to the minimisation of bias and the use of 'expensive to collect' data that would otherwise be discarded by case deletion.
However, it can also allow data to influence the type of imputation. In the words of Dempster & Rubin (1983:
The idea of imputation is both seductive and dangerous. It is seductive because it can lull the user into the pleasurable state of believing that the data are complete after all,
and it is dangerous because it lumps together situations where the problem is sufficiently minor that it can be handled legitimately in this way
and imputed data have substantial bias. The uncertainty in the imputed data should be reflected by variance estimates.
This makes it possible to take into account the effects of imputation in the course of the analysis. However
A complete data set without missing values. A measure of the reliability of each imputed value
and the results. 1.4. Multivariate analysis. Analysing the underlying structure of the data is still an art. Over the last few decades,
The underlying nature of the data needs to be analysed carefully before the construction of a composite indicator.
This preliminary step is helpful in assessing the suitability of the data set and will provide an understanding of the implications of the methodological choices, e g. weighting and aggregation,
and analysed along at least two dimensions of the data set: individual indicators and countries. Grouping information on individual indicators.
or appropriate to describe the phenomenon (see Step 2). This decision can be based on expert opinion and the statistical structure of the data set.
These multivariate analysis techniques are useful for gaining insight into the structure of the data set of the composite.
and (iv) a method for selecting groups of countries for the imputation of missing data with a view to decreasing the variance of the imputed values.
or when it is believed that some of them do not contribute to identifying the clustering structure in the data set,
as PCA or FA may identify dimensions that do not necessarily help to reveal the clustering structure in the data
and weaknesses of multivariate analysis Strengths Weaknesses Principal Components/Factor analysis Can summarise a set of individual indicators while preserving the maximum possible proportion of the total variation in the original data set.
Sensitive to modifications in the basic data: data revisions and updates, e g. new countries. Sensitive to the presence of outliers,
which may introduce a spurious variability in the data. Sensitive to small-sample problems, which are particularly relevant
when the focus is limited on a set of countries. Minimisation of the contribution of individual indicators which do not move with other individual indicators.
gives some insight into the structure of the data set. Purely a descriptive tool; may not be transparent
A discrete clustering model together with a continuous factorial model are fitted simultaneously to two-way data to identify the best partition of the objects, described by the best orthogonal linear combinations of the variables (factors) according to the least-squares criterion.
data reduction and synthesis, simultaneously in the direction of objects and variables. Originally applied to short-term macroeconomic data,
factorial k-means analysis has a fast alternating least-squares algorithm that extends its application to large data sets.
This methodology can be recommended as an alternative to the widely-used tandem analysis. By the end of Step 4 the constructor should have:
Checked the underlying structure of the data along various dimensions, i e. individual indicators, countries. Applied the suitable multivariate methodology, e g.
Analysed the structure of the data set and compared this to the theoretical framework. Documented the results of the multivariate analysis
and the interpretation of the components and factors. 1.5. Normalisation of data. Avoid adding up apples
and oranges. Normalisation is required prior to any data aggregation, as the indicators in a data set often have different measurement units.
the percentile bands force the categorisation on the data, irrespective of the underlying distribution. A possible solution is to adjust the percentile brackets across the individual indicators
The normalisation method should take into account the data properties, as well as the objectives of the composite indicator.
Selected the appropriate normalisation procedure (s) with reference to the theoretical framework and to the properties of the data.
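For illustration, the sketch below applies two common normalisation procedures, min-max rescaling and z-score standardisation, to a single synthetic indicator; the values are placeholders.

```python
import numpy as np

# Synthetic values of one individual indicator across countries (placeholders).
x = np.array([2.5, 4.0, 7.5, 3.2, 9.1, 5.6])

# Min-max normalisation: rescales the indicator to [0, 1].
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardisation: mean 0, standard deviation 1.
zscore = (x - x.mean()) / x.std(ddof=0)

print("min-max:", np.round(minmax, 2))
print("z-score:", np.round(zscore, 2))
```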
A number of weighting techniques exist (Table 4). Some are derived from statistical models, such as factor analysis, data envelopment analysis and unobserved components models (UCM
Weights may also be chosen to reflect the statistical quality of the data.
Higher weights could be assigned to statistically reliable data with broad coverage. However, this method could be biased towards the readily available indicators,
such as the benefit of the doubt (BOD) approach, are extremely parsimonious about weighting assumptions as they allow the data to decide on the weights
as in the case of environmental indices that include physical, social and economic data. If the analyst decides that an increase in economic performance cannot compensate for a loss in social cohesion or a worsening in environmental sustainability,
when constructing composite indicators, e g. on the selection of indicators, data normalisation, weights and aggregation methods, etc.
selection of individual indicators, data quality, normalisation, weighting, aggregation method, etc. The approach taken to assess uncertainties could include the following steps:
1. Inclusion and exclusion of individual indicators. 2. Modelling data error based on the available information on variance estimation. 3. Using alternative editing schemes,
e.g. single or multiple imputation. 4. Using alternative data normalisation schemes, such as Min-Max, standardisation, and so on (a toy illustration of varying these assumptions follows below).
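The toy sketch below illustrates the flavour of such an uncertainty analysis on entirely synthetic data: it repeatedly rebuilds a simple linear composite while randomly switching the normalisation scheme and drawing alternative weight vectors, then summarises how much each country's rank moves. It is a stylised Monte Carlo illustration only, not the Handbook's full robustness procedure.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic raw indicator values: rows = countries, columns = indicators (placeholders).
X = rng.normal(loc=50, scale=10, size=(10, 4))

def minmax(x):
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def zscore(x):
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)

ranks = []
for _ in range(1000):
    norm = minmax if rng.random() < 0.5 else zscore   # alternative normalisation schemes
    w = rng.dirichlet(np.ones(X.shape[1]))            # alternative weight vectors summing to 1
    composite = norm(X) @ w                           # simple linear aggregation
    ranks.append((-composite).argsort().argsort() + 1)  # rank 1 = highest composite score

ranks = np.array(ranks)
for country in range(X.shape[0]):
    lo, hi = np.percentile(ranks[:, country], [5, 95])
    print(f"country {country}: median rank {np.median(ranks[:, country]):.0f}, "
          f"90% interval [{lo:.0f}, {hi:.0f}]")
```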
No index can be better than the data it uses. But this is an argument for improving the data,
not abandoning the index. UN, 1992. The results of the robustness analysis are reported generally as country rankings with their related uncertainty bounds,
does derived the theoretical model provide a good fit to the data? What the lack of fit tells about the conceptual definition of the composite of the indicators chosen for it?
. While they can be used as summary indicators to guide policy and data work, they can also be decomposed such that the contribution of subcomponents
Correlation analysis should not be mistaken with causality analysis. Correlation simply indicates that the variation in the two data sets is similar.
Developed data-driven narratives on the results Documented and explained the correlations and the results. 1. 10.
JRC elaboration, data source: Eurostat, 2007. http://ec.europa.eu/eurostat
related both to the quality of elementary data used to build the indicator and the soundness of the procedures used in its construction.
Even if data are accurate, they cannot be said to be of good quality if they are produced too late to be useful,
or appear to conflict with other data. Thus, quality is a multifaceted concept. The most important quality characteristics depend on user perspectives, needs and priorities,
With the adoption of the European Statistics Code of practice in 2005, the Eurostat quality framework is now quite similar to the IMF's Data Quality Framework (DQAF),
Are the source data, statistical techniques, etc. adequate to portray the reality to be captured? 4. Serviceability:
Are effective data and metadata easily available to data users and is there assistance to users?
Given the institutional setup of the European Statistical System,
Punctuality refers to the time lag between the target delivery date and the actual date of the release of the data;
availability of micro or macro data, media (paper, CD-ROM, Internet, etc.). Clarity refers to the statistics' information environment:
6. Coherence refers to the adequacy of the data to be combined reliably in different ways and for various uses.
and (ii) the quality of internal processes for collection, processing, analysis and dissemination of data and metadata.
i) the quality of basic data, and (ii) the quality of procedures used to build
the application of the most advanced approaches to the development of composite indicators based on inaccurate or incoherent data would not produce good quality results,
In the following section each is considered separately. 2.2. Quality dimensions for basic data. The selection of basic data should maximise the overall quality of the final result.
In particular, in selecting the data the following dimensions (drawing on the IMF, Eurostat and OECD) are to be considered:
Relevance The relevance of data is a qualitative assessment of the value contributed by these data.
Careful evaluation and selection of basic data have to be carried out to ensure that the right range of domains is covered in a balanced way.
Given the actual availability of data, proxy series are often used, but in this case some evidence of their relationships with target series should be produced whenever possible.
Accuracy The accuracy of basic data is the degree to which they correctly estimate or describe the quantities
and censuses that provide source data; from the fact that source data do not fully meet the requirements of the accounts in terms of coverage, timing,
and valuation and that the techniques used to compensate can only partially succeed; from seasonal adjustment;
which include (i) replacement of preliminary source data with later data,(ii) replacement of judgemental projections with source data,
however, the absence of revisions does not necessarily mean that the data are accurate.
accuracy of basic data is extremely important. Here the issue of credibility of the source becomes crucial.
The credibility of data products refers to confidence that users place in those products based simply on their image of the data producer, i e.,
One important aspect is trust in the objectivity of the data. This implies that the data are perceived to be produced professionally in accordance with appropriate statistical standards
and policies and that practices are transparent (for example, data are not manipulated, nor their release timed in response to political pressure).
Other things being equal, data produced by official sources (e.g. national statistical offices or other public bodies working under national statistical regulations
or codes of conduct) should be preferred to other sources. Timeliness The timeliness of data products reflects the length of time between their availability
and the event or phenomenon they describe, but considered in the context of the time period that permits the information to be of value
The concept applies equally to short-term or structural data; the only difference is the timeframe.
Closely related to the dimension of timeliness, the punctuality of data products is also very important, both for national and international data providers.
and reflects the degree to which data are released in accordance with it. In the context of composite indicators, timeliness is especially important to minimise the need for the estimation of missing data or for revisions of previously published data.
As individual basic data sources establish their optimal trade-off between accuracy and timeliness, taking into account institutional, organisational and resource constraints
data covering different domains are often released at different points in time. Therefore special attention must be paid to the overall coherence of the vintages of data used to build composite indicators (see also coherence).
Accessibility The accessibility of data products reflects how readily the data can be located and accessed from original sources,
i e. the conditions in which users can access statistics (such as distribution channels, pricing policy, copyright, etc.).
The range of different users leads to considerations such as multiple dissemination formats and selective presentation of metadata.
which the data are available, the media of dissemination, and the availability of metadata and user support services.
It also includes the affordability of the data to users in relation to its value to them
and whether the user has a reasonable opportunity to know that the data are available
In the context of composite indicators, accessibility of basic data can affect the overall cost of production and updating of the indicator over time.
if poor accessibility of basic data makes it difficult for third parties to replicate the results of the composite indicators.
the issue of coherence across data sets can become relevant. Therefore, the selection of the source should not always give preference to the most accessible source,
Interpretability The interpretability of data products reflects the ease with which the user may understand
and analyse the data. The adequacy of the definitions of concepts, target populations, variables and terminology underlying the data
and of the information describing the limitations of the data, if any, largely determines the degree of interpretability.
The range of different users leads to considerations such as the presentation of metadata in layers of increasing detail.
the wide range of data used to build them and the difficulties due to the aggregation procedure require the full interpretability of basic data.
The availability of definitions and classifications used to produce basic data is essential to assess the comparability of data over time
and across countries (see coherence): for example, series breaks need to be assessed when composite indicators are built to compare performances over time.
Therefore the availability of adequate metadata is an important element in the assessment of the overall quality of basic data.
Coherence The coherence of data products reflects the degree to which they are connected logically and mutually consistent,
i e. the adequacy of the data to be combined reliably in different ways and for various uses.
Coherence implies that the same term should not be used without explanation for different concepts or data items;
that different terms should not be used for the same concept or data item without explanation;
and that variations in methodology that might affect data values should not be made without explanation.
Coherence in its loosest sense implies the data are at least reconcilable. For example, if two data series purporting to cover the same phenomena differ, the differences in time of recording, valuation,
and coverage should be identified so that the series can be reconciled. In the context of composite indicators, two aspects of coherence are especially important:
Coherence over time implies that the data are based on common concepts, definitions and methodology over time,
Coherence across countries implies that from country to country the data are based on common concepts, definitions, classifications and methodology,
the imputation of missing data, as well as the normalisation and the aggregation, can affect its accuracy, etc.
The imputation of missing data affects the accuracy of the composite indicator and its credibility.
The quality of basic data chosen to build the composite indicator strongly affects its accuracy and credibility.
Timeliness can also be influenced greatly by the choice of appropriate data. The use of multivariate analysis to identify the data structure can increase both the accuracy and the interpretability of final results.
and to evaluate possible gaps in basic data. HANDBOOK ON CONSTRUCTING COMPOSITE INDICATORS: METHODOLOGY AND USER GUIDE ISBN 978-92-64-04345-9-OECD 2008 49 One of the key issues in the construction of composite indicators is the choice of the weighting and aggregation model.
which are highly correlated with the reference data. The presentation of composite indicators and their visualisation affects both the relevance and interpretability of the results.
The OECD has recently developed the Data and Metadata Reporting and Presentation Handbook (OECD, 2007), which describes practices useful for improving the dissemination of statistical products.
Table 5. Quality dimensions of composite indicators. Quality dimensions (columns): Relevance, Accuracy, Credibility, Timeliness, Accessibility, Interpretability, Coherence. Construction phases (rows): Theoretical framework; Data selection; Imputation of missing data; Multivariate analysis; Normalisation; Weighting and aggregation; Back to the data; Robustness and sensitivity; Links to other variables; Visualisation; Dissemination.
The problem of missing data is discussed first. The need for multivariate analysis prior to the aggregation of the individual indicators is stressed.
STEP 3. IMPUTATION OF MISSING DATA. The literature on the analysis of missing data
The predictive distribution must be generated by employing the observed data either through implicit or explicit modelling:
The danger of this type of modelling of missing data is the tendency to consider the resulting data set as complete,
Filling in blanks cells with individual data, drawn from similar responding units. For example, missing values for individual income may be replaced with the income of another respondent with similar characteristics, e g. age, sex, race, place of residence, family relationships, job, etc.
and the time to convergence depends on the proportion of missing data and the flatness of the likelihood function.
or the robustness of the composite index derived from the imputed data set. 3.2. Unconditional mean imputation. Let Xq be the random variable associated with the individual indicator q, with q = 1, ..., Q,
Hence, the inference based on the entire data set, including the imputed data, does not fully account for imputation uncertainty.
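A minimal sketch of unconditional mean imputation on a toy country-by-indicator matrix (placeholder values): every missing entry is replaced by the observed mean of its indicator, which shrinks the variance and carries no measure of imputation uncertainty.

```python
import numpy as np

# Toy data matrix: rows = countries, columns = individual indicators (placeholders).
X = np.array([
    [1.2, 3.4, np.nan],
    [0.8, np.nan, 2.1],
    [1.5, 2.9, 2.4],
    [np.nan, 3.1, 1.9],
])

# Unconditional mean imputation: replace each missing value with the
# column (indicator) mean computed over the observed values only.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

print(X_imputed)
# The imputed columns now have a smaller variance than the observed data alone,
# and no measure of the uncertainty of the imputed values is produced.
```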
For nominal variables, frequency statistics such as the mode or hot- and cold-deck imputation methods might be more appropriate. 3.4. Expectation maximisation imputation. Suppose that X denotes the matrix of data.
In the likelihood-based estimation, the data is assumed to be generated by a model, described by a probability or density function f (X),
and describes the probability of observing a data set for a given θ. Since θ is unknown,
Assuming that missing data are MAR or MCAR14, the EM consists of two components: the expectation (E) and maximisation (M) steps.
as if there were no missing data, and second (E), the expected values of the missing variables are calculated,
Effectively, this process maximises the expectation of the complete data log-likelihood in each cycle, conditional on the observed data and parameter vector.
however, an initial estimate of the missing data is needed. This is obtained by running the first M-step on the non-missing observations only
To test this, different initial starting values for θ can be used. 3.5. Multiple imputation. Multiple imputation (MI) is a general approach that does not require a specification of parameterised likelihood for all data (Figure 10).
The imputation of missing data is performed with a random process that reflects uncertainty. Imputation is done N times,
The parameters of interest are estimated on each data set, together with their standard errors. Average (mean or median) estimates are combined using the N sets
Figure 10. Logic of multiple imputation: a data set with missing values is imputed N times (Set 1, Set 2, ..., Set N), each completed set is analysed (Result 1, Result 2, ..., Result N), and the results are combined.
It assumes that data are drawn from a multivariate normal distribution and requires MAR or MCAR assumptions. The theory of MCMC is most easily understood using Bayesian methodology (Figure 11).
The observed data are denoted Xobs, and the complete data set, X=(Xobs, Xmis), where Xmis is to be filled in via multiple imputation.
If the distribution of Xmis, with parameter vector θ, were known, then Xmis could be imputed by drawing from the conditional distribution f(Xmis | Xobs, θ).
it shall be estimated from the data, yielding θ̂, and using the distribution f(Xmis | Xobs, θ̂). Since θ is itself a random variable,
The missing-data generating process may also depend on additional parameters φ, but if θ and φ are independent, the missing-data generating process is said to be ignorable
and the analyst may concentrate on modelling the missing data given the observed data and θ. If the two processes are not independent,
then a non-ignorable missing-data generating process pertains, which cannot be solved adequately without making assumptions on the functional form of the interdependency.
of which depends on the data. The first step in its estimation is to obtain the posterior distribution of θ from the data.
Usually this posterior is approximated by a normal distribution. After formulating the posterior distribution of θ, the following imputation algorithm can be used:
Use the completed data X and the model to estimate the parameter of interest (e.g. the mean) θ* and its variance V(θ*) (within-imputation variance).
and covariance matrix from the data that does not have missing values, and use these to estimate the prior distribution.
Imputation step: simulate values for missing data items by randomly selecting a value from the available distribution of values. Posterior step: re-estimate the mean vector and covariance matrix from the completed data.
Iterate until the distribution is stationary (i.e. the mean vector and covariance matrix are unchanged as we iterate), then use the imputation from the final iteration to form a data set without missing values.
This combination will be the value that fills in the blank space in the data set.
The Multiple Imputation method imputes several values (N) for each missing value (from the predictive distribution of the missing data),
The N versions of completed data sets are analysed by standard complete data methods and the results combined using simple rules to yield single combined estimates (e g.
which formally incorporate missing data uncertainty. The pooling of the results of the analyses performed on the multiple imputed data sets implies that the resulting point estimates are averaged over the N completed sample points,
and the resulting standard errors and p-values are adjusted according to the variance of the corresponding N completed sample point estimates.
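The pooling rules described above (often referred to as Rubin's rules) can be sketched as follows, with placeholder point estimates and within-imputation variances from N = 5 completed data sets: the combined estimate is their average, and the total variance adds a between-imputation component.

```python
import numpy as np

# Placeholder results from N = 5 completed (imputed) data sets:
# point estimates of the parameter of interest and their within-imputation variances.
estimates = np.array([10.2, 9.8, 10.5, 10.1, 9.9])
within_var = np.array([0.40, 0.38, 0.45, 0.41, 0.39])
N = len(estimates)

pooled_estimate = estimates.mean()        # average over the N completed data sets
W = within_var.mean()                     # average within-imputation variance
B = estimates.var(ddof=1)                 # between-imputation variance
total_var = W + (1 + 1 / N) * B           # total variance including imputation uncertainty

print(f"pooled estimate = {pooled_estimate:.2f}, total variance = {total_var:.3f}")
```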
Thus, the between-imputation variance provides a measure of the extra inferential uncertainty due to missing data
which method he/she has to use to fill in empty data spaces. To the best of our knowledge there is no definitive answer to this question, but a number of rules of thumb (and a lot of common sense).
The choice principally depends on the data set available (e.g. data expressed on a continuous scale, or ordinal data for which methods like MCMC cannot be used), the number of missing values compared to the dimension of the data set (a few missing values in a large data set probably do not require sophisticated imputation methods),
and the identity of the country and the indicator for which the data is missing.
Therefore there is no single method we advise using; rather, the method should be fitted to the characteristics of the missing information.
A simple way to check the reliability of an imputation method is to eliminate some of the observed data (for the same countries and in the same proportion as in the complete data set), impute them, and compare the imputed values with the original ones, for instance through the coefficient of determination R², the squared correlation between observed and imputed values, where O̅ (P̅) denotes the average of the observed (imputed) data and σ_O (σ_P) the standard deviation of the observed (imputed) data. As noted by Willmott et al. (1985), the value of R² can be unrelated to the size of the differences between the predicted and the observed values.
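Under the assumption that a set of observed values has been deliberately removed and then imputed, such a check can be sketched as follows (illustrative names; NumPy assumed); it returns both R² and the mean absolute difference, since R² alone can be misleading.

    import numpy as np

    def imputation_check(observed, imputed):
        # Compare imputed values with the observed values that were deliberately removed.
        o = np.asarray(observed, float)
        p = np.asarray(imputed, float)
        r = np.corrcoef(o, p)[0, 1]                     # correlation between observed and imputed
        return r ** 2, np.mean(np.abs(o - p))           # R^2 and mean absolute difference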
The majority of methods in this section are designed for data expressed on an interval or ratio scale,
although some of the methods have been used with ordinal data (for example principal components and factor analysis; see Vermunt & Magidson, 2005).
4.1. Principal components analysis
The objective is to explain the variance of the observed data through a few linear combinations of the original data.15
Although there are Q original variables, much of the data's variation can often be accounted for by a small number of variables: the principal components, or linear combinations of the original data, Z_1, Z_2, ..., Z_Q, that are uncorrelated.
The analysis then retains the first P, with P < Q, principal components that preserve a high amount of the cumulative variance of the original data.
This lack of correlation indicates that the principal components are measuring different statistical dimensions in the data. When the objective of the analysis is to present a large data set using a few variables, some degree of economy can be achieved by applying Principal Components Analysis (PCA) if the variation in the Q original x variables can be accounted for by a small number of Z variables.
with a variance of 1.7. The third and fourth principal components have eigenvalues close to 1. The last four principal components explain the remaining 12.8% of the variance in the data set.
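The eigenvalue and explained-variance figures discussed above can be reproduced with a few lines of linear algebra. The sketch below (illustrative only; NumPy assumed) works on the correlation matrix of the indicators, so each eigenvalue's share of the trace is the proportion of variance accounted for by the corresponding principal component.

    import numpy as np

    def pca_eigenvalues(X):
        # X: countries x indicators; the correlation matrix standardises the indicators implicitly.
        corr = np.corrcoef(np.asarray(X, float), rowvar=False)
        eigvals = np.linalg.eigvalsh(corr)[::-1]        # eigenvalues, largest first
        return eigvals, eigvals / eigvals.sum()         # eigenvalues and variance shares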
Bootstrap refers to the process of randomly resampling the original data set to generate new data sets.
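A minimal sketch of such a bootstrap applied to the eigenvalues of the correlation matrix follows; countries (rows) are resampled with replacement, and the spread of the recomputed eigenvalues indicates their sampling variability. The function name and the number of replications are illustrative assumptions.

    import numpy as np

    def bootstrap_eigenvalues(X, n_boot=1000, seed=0):
        # Resample countries with replacement and recompute correlation-matrix eigenvalues.
        rng = np.random.default_rng(seed)
        X = np.asarray(X, float)
        n = X.shape[0]
        out = []
        for _ in range(n_boot):
            sample = X[rng.integers(0, n, size=n)]
            corr = np.corrcoef(sample, rowvar=False)
            out.append(np.sort(np.linalg.eigvalsh(corr))[::-1])
        return np.array(out)                            # n_boot x Q matrix of eigenvalues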
Although social scientists may be attracted to factor analysis as a way of exploring data whose structure is unknown,
Assumption of interval data. Kim & Mueller (1978) note that ordinal data may be used if it is thought that the assignment of ordinal categories to the data will not seriously distort the underlying metric scaling.
Likewise, the use of dichotomous data is allowed if the underlying metric correlation between the variables is thought to be moderate (0.7) or lower. The result of using ordinal data is that the factors may be much harder to interpret.
Note that categorical variables with similar splits will necessarily tend to correlate with each other, regardless of their content (see Gorsuch, 1983).
This is particularly apt to occur when dichotomies are used. The correlation will reflect similarity of "difficulty" for items in a testing context;
The smaller the sample size, the more important it is to screen data for linearity.
Multivariate normality of data is required for related significance tests.
The smaller the sample size, the more important it is to screen data for normality.
Likewise, the inclusion of multiple definitionally-similar individual indicators representing essentially the same data will lead to tautological results.
thereby defeating the data reduction purposes of factor analysis. On the other hand, too high intercorrelations may indicate a multicollinearity problem
A further issue, however, is whether the TAI data set for the 23 countries can be viewed as a random sample of the entire population, as required by the bootstrap procedures (Efron, 1987; Efron & Tibshirani, 1993). Several points can be made regarding the issues of randomness and representativeness of the data. First, it is often difficult to obtain complete information for a data set in the social sciences,
as controlled experiments are not always possible, unlike in the natural sciences. As Efron and Tibshirani (1993) state, "in practice the selection process is seldom this neat".
A third point on data quality is that a certain amount of measurement error is likely to pertain.
While such measurement error can only be controlled at the data collection stage, rather than at the analytical stage
it is argued that the data represent the best estimates currently available (UN, 2001). Figure 12 (right graph) demonstrates graphically the relationship between the eigenvalues from the deterministic PCA,
However, while PCA is based simply on linear data combinations, FA is based on a rather special model.
Contrary to the PCA, the FA model assumes that the data is based on the underlying factors of the model,
and that the data variance can be decomposed into that accounted for by common and unique factors.
and are not sorted into descending order according to how much of the original data set's variance is explained.
which indicates that university does not move with the other individual indicators in the data set,
it is unlikely that they share common factors. 2. Identify the number of factors necessary to represent the data
The c-alpha is 0.70 for the data set of the 23 countries, which is equal to Nunnally's cut-off value.
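Cronbach's coefficient alpha can be computed directly from its definition, alpha = Q/(Q-1) * (1 - sum of indicator variances / variance of the total score). The sketch below is illustrative (NumPy assumed; names hypothetical).

    import numpy as np

    def cronbach_alpha(X):
        # X: countries x indicators matrix.
        X = np.asarray(X, float)
        q = X.shape[1]
        item_var = X.var(axis=0, ddof=1).sum()          # sum of indicator variances
        total_var = X.sum(axis=1).var(ddof=1)           # variance of the total score
        return q / (q - 1) * (1 - item_var / total_var)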
The classification aims to reduce the dimensionality of a data set by exploiting the similarities/dissimilarities between cases.
and hence different classifications may be obtained for the same data, even using the same distance measure.
if the data are categorical in nature. Figure 13 shows the country clusters based on the individual Technology Achievement Index
which indicates that the data are represented best by ten clusters: Finland; Sweden and the USA;
k-means clustering (standardised data). Table 13 reports the k-means clustering of the TAI countries into Group 1 (leaders), Group 2 (potential leaders) and Group 3 (dynamic adopters); the countries listed include Finland, the Netherlands, Sweden, the USA and Australia.
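A sketch of such a grouping, using scikit-learn's KMeans on standardised indicators, is given below; both the library choice and the names are assumptions for illustration, not part of the handbook.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_groups(X, k=3, seed=0):
        # X: countries x indicators; standardise before clustering, as in Table 13.
        X = np.asarray(X, float)
        Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
        return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)   # group label per country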
EM estimates mean and standard deviation of each cluster so as to maximise the overall likelihood of the data
Second, unlike k-means, EM can be applied to both continuous and categorical data.
In factorial k-means analysis, a discrete clustering model and a continuous factorial model are fitted simultaneously to two-way data with the aim of identifying the best partition of the objects, described by the best orthogonal linear combinations of the variables (factors) according to the least-squares criterion. The method thus achieves data reduction and synthesis simultaneously in the direction of objects and variables. Originally applied to short-term macroeconomic data,
factorial k-means analysis has a fast alternating least-squares algorithm that extends its application to large data sets, i.e. multivariate data sets with more than two variables. The methodology can therefore be recommended as an alternative to the widely used tandem analysis that sequentially performs PCA
and CLA.
4.5. Other methods for multivariate analysis
Other methods can be used for multivariate analysis of the data set.
The characteristics of some of these methods are sketched below citing textbooks where the reader may find additional information and references.
(or relevant relationships between rows and columns of the table) by reducing the dimensionality of the data set.
Correspondence analysis starts with tabular data, e.g. a multidimensional time series describing the number of doctorates in 12 scientific disciplines (categories) awarded in the USA between 1960 and 1975 (Greenacre). The correspondence analysis of these data would show, for example, whether anthropology and engineering degrees are at a distance from each other.
As in PCA, CCA implies the extraction of the eigenvalues and eigenvectors of the data matrix.
and their robustness against possible outliers in the data (Ebert & Welsch, 2004). Different normalisation methods will produce different results for the composite indicator.
Using Celsius data normalised based on distance to the best performer, the level of Country A has increased over time.
Normalised data in Celsius: Country A 0.949 and 0.949; Country B 0.833 and 0.821. Normalised data in Fahrenheit: Country A 0.965 and 0.965; Country B 0.833 and 0.821. The example illustrated above is a case of an interval scale (Box 3).
Another transformation of the data, often used to reduce the skewness of (positive) data, is the logarithmic transformation f: x → ln(x), for x > 0,
bearing in mind that the normalised data will be affected by the log transformation. In some circumstances outliers21 can reflect the presence of unwanted information.
when data for a new time point become available. This implies an adjustment of the analysis period T,
To maintain comparability between the existing and the new data, the composite indicator for the existing data must be recalculated.
5.4. Distance to a reference
This method takes the ratio of the indicator x_qc^t for a generic country c and time t to the value of the same individual indicator x_q,ref^t0 for the reference country at the initial time t0:
I_qc^t = x_qc^t / x_q,ref^t0
Table 15. Examples of normalisation techniques using TAI data (mean years of school, age 15 and above): rank(*), z-score, min-max, distance to reference country (c), above/below the mean(**) and percentile. (*) A high value means top of the list; (**) p = 20%. Examples of the above normalisation methods are shown in Table 15 using the TAI data.
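The main normalisation formulas behind Table 15 can be sketched as follows (illustrative Python, NumPy assumed); each method is applied to one indicator observed across countries, and ref is the index of the reference country for the distance-to-reference method.

    import numpy as np

    def normalise(x, method="minmax", ref=0):
        x = np.asarray(x, float)
        if method == "rank":
            return x.argsort().argsort() + 1            # highest value gets the highest rank
        if method == "zscore":
            return (x - x.mean()) / x.std(ddof=1)
        if method == "minmax":
            return (x - x.min()) / (x.max() - x.min())
        if method == "reference":
            return x / x[ref]                           # distance to a reference country
        raise ValueError(method)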
The data are sensitive to the choice of the transformation and this might cause problems in terms of loss of the interval level of the information,
thus poor data availability may hamper its use. In formal terms, let x_qc be, as usual, the level of indicator q for country c.
the composite no longer depends upon the dimensionality of the data set but rather is based on the statistical dimensions of the data.
According to PCA/FA, weighting intervenes only to correct for overlapping information between two or more correlated indicators and is not a measure of the theoretical importance of the associated indicator.
The first step in FA is to check the correlation structure of the data, as explained in the section on multivariate analysis.
The second step is the identification of a certain number of latent factors (fewer than the number of individual indicators) representing the data.
With the reduced data set in TAI (23 countries) the factors with eigenvalues close to unity are the first four:
Table 16. Eigenvalues of the TAI data set: component 1, eigenvalue 3.3 (41.9% of the variance, 41.9% cumulative); component 2, eigenvalue 1.7 (21.8%, 63.7% cumulative); component 3, eigenvalue close to 1.
With the TAI data set there are four intermediate composites (Table 17). The first includes Internet (with a weight of 0.24),
The four intermediate composites are aggregated by assigning a weight to each one of them equal to the proportion of the explained variance in the data set:
Table 17 (excerpt): Electricity 0.11 and 0.12; Schooling 0.19 and 0.14; University 0.02 and 0.16.
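As an illustration of this aggregation rule, the minimal sketch below (NumPy assumed; names hypothetical) turns the eigenvalues of the retained factors into weights for the intermediate composites, each weight being that factor's share of the explained variance.

    import numpy as np

    def variance_share_weights(eigvals, n_factors=4):
        # Keep the largest n_factors eigenvalues and rescale them to sum to one.
        ev = np.sort(np.asarray(eigvals, float))[::-1][:n_factors]
        return ev / ev.sum()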
6.2. Data envelopment analysis (DEA)
Data Envelopment Analysis (DEA) employs linear programming tools to estimate an efficiency frontier that would be used as a benchmark to measure the relative performance of countries.26 This requires construction of a benchmark (the frontier) and the measurement of the distance between countries in a multi-dimensional framework.
Figure: the data envelopment analysis (DEA) performance frontier, with Indicator 1 and Indicator 2 on the axes and countries a, b, c, d and d'. Source: rearranged from Mahlberg & Obersteiner (2001).
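In the benefit-of-the-doubt spirit of DEA, the composite score of each country can be obtained from a small linear programme: choose the non-negative indicator weights most favourable to that country, subject to no country scoring above one with those same weights. The sketch below is an illustration only (SciPy's linprog and NumPy assumed; names hypothetical).

    import numpy as np
    from scipy.optimize import linprog

    def benefit_of_the_doubt(X, country):
        # X: countries x indicators matrix of positive, normalised indicators.
        X = np.asarray(X, float)
        res = linprog(c=-X[country],                    # linprog minimises, so negate the score
                      A_ub=X, b_ub=np.ones(X.shape[0]), # every country's weighted score <= 1
                      bounds=(0, None), method="highs")
        return -res.fun                                 # composite score in (0, 1]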
The observed data consist of a cluster of q = 1, ..., Q(c) indicators, each measuring an aspect of ph(c). Let c = 1,
However, since not all countries have data on all individual indicators, the denominator of w c,
and σ(q)² (hence at least three indicators per country are needed for an exactly identified model), so the likelihood function of the observed data based on equation (25) will be maximised with respect to α(q), β(q) and σ(q)²,
AHP allows for the application of data, experience, insight, and intuition in a logical and thorough way within a hierarchy as a whole.
Although this methodology uses statistical analysis to treat the data, it relies on the opinion of people (e.g. experts, politicians, citizens),
Reliability and robustness of results depend on the availability of sufficient data. With highly correlated individual indicators there could be identification problems.
Can be used both for qualitative and quantitative data. Transparency of the composite is higher. Weighting is based on expert opinion and not on technical manipulations.
Let us then apply Borda's rule to the data presented in Table 25. To begin, the information can be presented in a frequency matrix fashion,
let us apply it to the data presented in Table 25. The outranking matrix is shown in Table 27;
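Borda's rule itself is straightforward to compute: for each indicator, a country receives as many points as there are countries ranked below it, and the points are summed over all indicators. A minimal sketch (ties ignored; NumPy assumed; names illustrative):

    import numpy as np

    def borda_scores(X):
        # X: countries x indicators, higher values being better.
        X = np.asarray(X, float)
        points = X.argsort(axis=0).argsort(axis=0)      # 0 points for the worst, M-1 for the best
        return points.sum(axis=1)                       # Borda score of each country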
For explanatory purposes, consider only five of the countries included in the TAI data set39
Linear aggregation can be applied only if data are all expressed on a partially comparable interval scale (i.e. temperature in Celsius or Fahrenheit) of type f_i: x_i → α_i + βx_i, with β > 0 and α_i varying across individual indicators, or on a fully comparable interval scale. Non-comparable data measured on a ratio scale (i.e. kilograms and pounds), i.e. of type f_i: x_i → α_i x_i with α_i > 0 varying across individual indicators, can only be aggregated meaningfully by using geometric functions, provided that the indicators are strictly positive.
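A minimal sketch of such a geometric (multiplicative) aggregation, assuming strictly positive indicator values and weights summing to one, is:

    import numpy as np

    def geometric_aggregation(x, w):
        # x: indicator values for one country (all > 0); w: weights summing to one.
        x = np.asarray(x, float)
        w = np.asarray(w, float)
        return float(np.prod(x ** w))                   # weighted geometric mean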
technique for the TAI data set with 23 countries. Although in all cases equal weighting is used,
inclusion/exclusion of one indicator at a time, imputation of missing data, different normalisation methods, different weighting schemes and different aggregation schemes.
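A minimal Monte Carlo sketch of this kind of robustness exercise is given below: in each run the normalisation method is switched at random and the weights are perturbed around equality, and the spread of each country's rank across runs is reported. All choices here (the two normalisation methods, Dirichlet-perturbed weights, 500 runs) are illustrative assumptions, not the handbook's exact design.

    import numpy as np

    def robustness_ranks(X, n_runs=500, seed=0):
        # X: countries x indicators, higher values being better.
        rng = np.random.default_rng(seed)
        X = np.asarray(X, float)
        m, q = X.shape
        ranks = np.empty((n_runs, m), dtype=int)
        for r in range(n_runs):
            if rng.random() < 0.5:                      # randomly pick a normalisation method
                Z = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
            else:
                Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
            w = rng.dirichlet(np.ones(q))               # random weights that sum to one
            score = Z @ w
            ranks[r] = (-score).argsort().argsort() + 1 # rank 1 = best
        return ranks.min(axis=0), np.median(ranks, axis=0), ranks.max(axis=0)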
The first input factor, X1, concerns the estimation of missing data: (1) use bivariate correlation to impute missing data; (2) assign zero to the missing datum. The second input factor,
(either for the BAP or AHP schemes) are assigned to the data. Clearly the selection of the expert has no bearing
i.e. by censoring all countries with missing data. As a result, only 34 countries, in theory, may be analysed.
Hong Kong, as this is the first country with missing data. The analysis is restricted to the set of countries
Practically, however, in the absence of a genuine theory on what causes what, the correlation structure of the data set can be of some help in at least excluding causal relationships between variables (but not necessarily between the theoretical constructs).
However, a distinction should be made between spatial data (as in the case of TAI) and data
The case of spatial data is more complicated, but tools such as Path Analysis and Bayesian networks (the probabilistic version of path analysis) could be of some help in studying the many possible causal structures
whereas a low value would point to the absence of a linear relationship (at least as far as the data analysed are concerned).
However, the resulting path coefficients or correlations only reflect the pattern of correlation found in the data.
exclude the two least influential indicators from the TAI data set. Even in this scenario, TAI6, the indicators are weighted equally.
(i) determine that the pattern of covariances in the data is consistent with the model specified;
of which path diagram is more likely to be supported by the data. Bayesian networks are graphical models encoding probabilistic relationships between variables.
when combined with the data via Bayes' theorem, produces a posterior distribution. This posterior, in short a weighted average of the prior density and of the conditional density (conditional on the data), is the output of Bayesian analysis. Note that the output of classical analysis is rather a point estimate.
(ii) to see how the different evidence (the data available) modifies the probability of each given node;
When questioned on the data reliability/quality of the HDI, Haq said that it should be used to improve data quality,
rather than to abandon the exercise. In fact, all the debate on development and public policy and media attention would not have been possible
Chantala K. and Suchindran C. (2003), Multiple Imputation for Missing Data, SAS OnlineDoc: Version 8.
Charnes A., Cooper W.W., Lewin A.Y. and Seiford L.M. (1995), Data Envelopment Analysis: Theory, Methodology and Applications, Boston: Kluwer.
Cherchye L. (2001), Using data envelopment analysis to assess macroeconomic policy performance, Applied Economics, 33: 407-416.
Cherchye L. and Kuosmanen T. (2002), Benchmarking sustainable development:
Dempster A.P. and Rubin D.B. (1983), Introduction (pp. 3-10), in Incomplete Data in Sample Surveys (vol. 2:
Funtowicz S.O., Munda G. and Paruccini M. (1990), The aggregation of environmental data using multicriteria methods, Environmetrics, Vol. 1(4): 353-36.
Little R.J.A. and Schenker N. (1994), Missing Data, in Arminger
Little R.J.A. (1997), Biostatistical Analysis with Missing Data, in Armitage P. and Colton T. (eds.),
Little R.J.A. and Rubin D.B. (2002), Statistical Analysis with Missing Data, Wiley Interscience, J. Wiley & Sons, Hoboken, New Jersey.
Mahlberg B. and Obersteiner M. (2001), Remeasuring the HDI by Data Envelopment Analysis, Interim Report IR-01-069, International Institute for Applied Systems Analysis, Laxenburg, Austria.
Massart D.L. and Kaufman L. (1983), The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, New York:
Milligan G.W. and Cooper M.C. (1985), An Examination of Procedures for Determining the Number of Clusters in a Data Set, Psychometrika, 50: 159-179.
/60/34002216.pdf.
OECD (2007), Data and Metadata Reporting and Presentation Handbook, available at http://www.oecd.org/dataoecd/46/17/37671574.pdf.
Parker
Available at http://spitswww.uvt.nl/vermunt/vanderark2004.pdf.
Vichi M. and Kiers H. (2001), Factorial k-means analysis for two-way data, Computational Statistics
Economists have long been hostile to subjective data. Caution is prudent, but hostility is not warranted.
Another common method (called imputing means within adjustment cells) is to classify the data for the individual indicator with some missing values in classes
which, for complex patterns of incomplete data, can be a very complicated function of θ. As a result these algorithms often require algebraic manipulations and complex programming.
but careful computation is needed. 14 For NMAR mechanisms one needs to make assumptions on the missing-data mechanism
as in most American cities it is not possible to go directly between two points. 20 The Bartlett test is valid under the assumption that data are a random sample from a multivariate normal distribution.
in the multiplicative aggregation it is proportional to the relative score of the indicator with respect to the others. 39 Data are not normalised.
whenever it does not change the ordinal information of the data matrix.