WPS4644
Policy Research Working Paper 4644
India Shining and Bharat Drowning:
Comparing Two Indian States to the Worldwide
Distribution in Mathematics Achievement
Jishnu Das
Tristan Zajonc
The World Bank
Development Research Group
Human Development and Public Services Team
June 2008
Abstract
This paper uses student answers to publicly released questions from an international testing agency
together with statistical methods from Item Response Theory to place secondary students from two
Indian states--Orissa and Rajasthan--on a worldwide distribution of mathematics achievement. These
two states fall below 43 of the 51 countries for which data exist. The bottom 5 percent of children
rank higher than the bottom 5 percent in only three countries--South Africa, Ghana and Saudi Arabia.
But not all students test poorly. Inequality in the test-score distribution for both states is next
only to South Africa in the worldwide ranking exercise. Consequently, and to the extent that these
two states can represent India, the two statements "for every ten top performers in the United
States there are four in India" and "for every ten low performers in the United States there are
two hundred in India" are both consistent with the data. The combination of India's size and large
variance in achievement gives rise to both perceptions: that India is shining even as Bharat, the
vernacular for India, is drowning. Comparable estimates of inequalities in learning are the building
blocks for substantive research on the correlates of earnings inequality in India and other
low-income countries; the methods proposed here allow for independent testing exercises to build up
such data by linking scores to internationally comparable tests.
This paper--a product of the Human Development and Public Services Team, Development Research Group--is part of a
larger effort in the department to measure and understand inequality in the provision of education. Policy Research Working
Papers are also posted on the Web at http://econ.worldbank.org. The author may be contacted at jdas1@worldbank.org.
The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development
issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the
names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those
of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and
its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.
Produced by the Research Support Team
India Shining and Bharat Drowning: Comparing Two Indian States to the
Worldwide Distribution in Mathematics Achievement
Jishnu Das (Center for Policy Research, New Delhi; World Bank, Washington DC)
Tristan Zajonc (Harvard University)
"It has long been claimed that everything one can say about India is true--and so
is the opposite." -Devesh Kapur, 2005
1 Introduction
Net primary enrollment in India has risen steadily over the last several decades and now exceeds
90 percent in most of the country. Large planned increases in the government education budget
suggest renewed interest and action on the part of the state, with an emphasis on secondary
schooling. Not surprisingly, increasing enrollments and resources have shifted the debate from
how many children are in school to what they are learning. A consensus is building that getting
children into schools may not be enough. Filmer et al. (2006) go so far as to propose augmenting
the Millennium Development Goals with a Millennium Learning Goal that provides international
benchmarks on how much children know at a pre-specified age. We ask the following question: Is
there a way to place Indian children in secondary schools on an international scale (given India's
reluctance to participate in internationally benchmarked tests) and, if so, what would we find in
terms of the average score and variance of the achievement distribution?
We thank Lant Pritchett for extensive discussions of the paper. Kin Bing Wu, who led a World Bank sector
study on secondary education in India, designed the collection of the data we use here, and we are grateful to
her for making the data and her report available to us. Eric Hanushek and Eugene Gonzalez provided invaluable
comments on an early version of this paper and their insights have been critical for the current revision. Michelle
Riboud and Sam Carlson provided useful comments that pertain to India's education sector. The findings,
interpretations, and conclusions expressed in this paper are those of the authors and do not necessarily represent
the views of the World Bank, its Executive Directors, or the governments they represent. Working papers describe
research in progress by the authors and are published to elicit comments and to further debate.
We propose a method that uses publicly released questions (items) from the Trends in International
Mathematics and Science Study (TIMSS) 1999 8th-Grade Mathematics test to place
Indian students on an internationally comparable achievement scale. The test, which consists
of 36 items taken from the full TIMSS item bank, was administered to 6,000 students in public
and private schools in two Indian states--Rajasthan and Orissa. Using the published item
parameters for these 36 questions in conjunction with the Item Response Theory test-equating
methods used by TIMSS, we construct a distribution of scores for the tested children that is
directly comparable to the worldwide distribution; this allows us to compare the tested children
to the international average and to place them in reference to the 51 other countries tested by
TIMSS in 1999 and 2003.
The average scores of children in Rajasthan and Orissa place these states below 46 and 42 of
the 51 countries tested in 1999 or 2003. After nine years of education, between 30 and 40 percent
of enrolled children in these two states cannot pass a low international benchmark, described
as "some basic mathematical knowledge." Children enrolled in secondary schools in these two
Indian states are 3.1 (OECD) standard-deviations below the OECD mean. Where children in
these two states are relative to the rest of the world is harder to ascertain. On the one hand, the
TIMSS sample is heavily biased towards relatively high-income countries. The median scores,
for instance, in Rajasthan and Orissa do not look too bad compared to the Philippines and Chile.
On the other hand, secondary school enrollments in India are also lower--53 percent of the
appropriate age group is enrolled, compared to more than 90 percent in South Africa, the worst
performer in the TIMSS sample. To the extent that children currently out of school are less
"motivated" or "able", test scores would arguably look worse at higher levels of enrollment.
The test-score distribution is also highly unequal--the difference between the top 5 percent
and bottom 5 percent in both states is among the highest in the world, next only to South
Africa. Students at the bottom of the distribution in both states score similarly or worse than
the bottom students in the three worst performing countries. At the same time, students at the
top of the distribution score higher than the top students in other low performing countries, and
higher than the median student in all but the best countries. The top 5 percent of students in
Orissa, for example, score higher than the median student in more than 42 of 46 countries tested
in 2003.
Faced with similar results on learning, defenders of the quality of education in Indian schools
often point to the large number of globally competitive Indians. We perform the following
thought experiment: Suppose that these two states represent India (more on this below). Could
the country's size combined with the large variance in scores explain how divergent beliefs can
be sustained by the same data? As it turns out, in absolute terms, India has just under half the
number of 14-year olds who pass the advanced international benchmark as the United States--100
thousand compared to 250 thousand--and roughly the same number who pass the intermediate
international benchmark. Indeed, India has more top achievers than any European country
tested, which, although not surprising given India's size, helps explain India's visible position on
the academic stage. But another view is also sustainable. The average child scores far below
any reasonable curricular standard and a large minority in these two states fails completely. If
the results from these two states hold more generally, over 18 million 14-year olds in India are
either not enrolled or are failing the lowest international benchmark if enrolled. That number is
22 times the number of failing children in the United States and more than any other country
tested.
Beyond providing illustrative results for India, this paper is about the building blocks for
research on learning and learning inequality in low-income countries where data on internationally
comparable tests are typically absent. This requires 1) techniques to place individual students on
a single comparable achievement metric and 2) methods to calculate other population quantities,
such as the fraction of children passing particular criterion-referenced thresholds or the 5th to
95th percentile achievement spread. Clarifying what is required for comparable measures of
learning and learning dispersion allows the research to focus on substantive rather than statistical
issues, without worrying about whether results are driven by measurement tools and differing
methodologies.
To preview the methodology, independent tests can be linked to the TIMSS achievement
distribution provided at least one question is drawn from the TIMSS item bank to fix the free
parameters. The primary methodological difficulty arises because "knowledge" or "achievement"
is inferred from the data rather than directly observed. Since individual knowledge is measured
with error, the variance of the achievement distribution aggregated from Maximum Likelihood
estimates of individual knowledge overestimates the true variance. An alternate method, outlined
by Mislevy, Beaton, Kaplan & Sheehan (1992), draws from the posterior of every student's
achievement distribution to obtain an unbiased measure of the full learning distribution. These
draws--known as "plausible values"--are interpreted as individual achievement with the property
that when aggregated to a population distribution they recover the correct population moments.
We show that the variance of the distribution is sensitive to the estimation method used (i.e.
Maximum Likelihood, Bayesian, or Plausible Values), primarily because the TIMSS test is too
difficult for a large fraction of Indian children.1 The method of plausible values offers an alternative
for the calculation of higher moments in any setting--such as poverty mapping--where
individual attributes are estimated with a known standard error.
Linking scores to an international distribution contributes to the literature on education
in low-income countries in several ways. First, linked test scores are comparable across space
and time. Despite increasing worldwide testing using standardized methods--e.g. TIMSS (51
countries), PIRLS (35 countries), IALS (22 countries) and PISA (49 countries)--the Indian government,
like many others, is reluctant to participate in such large-scale testing exercises. As
a result, what little is known about learning achievement in India, and most low-income countries,
arises from an ad-hoc collection of criterion-referenced exams.2 These tests, administered
by independent agencies, are typically not validated using standard testing tools, cannot be
equated over time or across countries, and are not subject to a battery of robustness checks that
accompany large-scale testing in the OECD countries. The methods applied here allow independent
researchers to report achievement distributions for the tests they control that are directly
comparable to those obtained worldwide.3
1Brown & Micklewright (2004) also highlight the importance of using a consistent methodology. They find,
for instance, that rankings of countries by within-country difference in TIMSS changed substantially for some
countries when the scoring model used in 1999 was retrospectively applied to 1995 data.
2Examples for India include a large national study by the National Council of Educational Research and
Training (NCERT) in 1994, which found that children scored an average of 47 percent in language and 41 percent
in mathematics (Shukla et al. 1994), and state-wide studies with smaller samples in Bihar, Tamil Nadu, Delhi,
Uttar Pradesh, Madhya Pradesh and Rajasthan (Bashir 1994, Hasan 1995, Govinda & Varghese 1993, Aggarwal
2000, Goyal 2007). In a major recent effort, the NGO Pratham tested children from almost all districts and found
low levels of learning: 52 percent of children between the ages of 7 and 10 could read a small paragraph with
short sentences at first grade difficulty levels, 32 percent could read a story text and 54 percent were unable
to divide or subtract (Pratham 2006). Similar results have been reported for Africa. In a relatively large effort,
the Monitoring Learning Achievement Project (Chinapah et al. 2000, Strauss & Burger 2000) covered 13 African
countries and found literacy, numeracy, and life-skills scores for fourth graders between 30 and 70 percent.
Comparable achievement measures contribute to our understanding of earnings inequality
and its correlates. A growing literature examines the relationship between earnings inequality
and test-score dispersion. Nickell (2004) and Blau & Kahn (2005) report a high correlation
between test-score dispersion and wage inequality; Nickell (2004) for instance suggests that 70
percent of the dispersion in earnings internationally is attributable to the dispersion in test-scores.
Similarly, Bedard & Ferrall (2003) show that test-score inequality at early ages is correlated
with wage inequality in the same cohort later in life. In contrast to this literature, Devroye
& Freeman (2001) argue that wage dispersion within narrowly defined skill sets is higher than
that across and that institutional mechanisms of collective bargaining matter more. India has
recently seen a dramatic increase in inequality (Debroy & Bhandari 2007), at the same time
that inequality in educational attainment is falling (Jalan & Murgai 2007). It is likely that as
inequality in attainment declines further and returns to skill increase (Kijima 2006), attention
will increasingly focus on the inequality in cognitive ability.
The remainder of this paper is structured as follows. Section 2 outlines the Item Response
Theory method for equating test scores. The technical section and accompanying appendix
provide sufficient details for critique and replication. Section 3 discusses the data, sampling
strategy, and test design. Section 4 reports the international benchmarking results and variance
decompositions. Section 5 outlines some caveats to our method and several robustness checks;
Section 6 concludes.
2 Overview of Linking Methodology
Properly linking India's mathematics achievement to the world distribution requires either a
single test given across all countries (and each year) or a means of linking alternate test forms
which may include different items. Since giving a single test is clearly infeasible in most situations,
educational testing organizations have developed statistical tools that allow scores from different
exams to be expressed on a unified scale. Item Response Theory (IRT) is one such technique
and is used in most large-scale testing situations such as TIMSS, PIRLS, NAEP and the SAT
and GRE. The basic intuition behind this technique is to model the behavior of each item--i.e.
its difficulty, ability to discriminate between two children, and likelihood of being guessed--so
that any differences in items can be removed from the score. This contrasts with the commonly
reported percent correct score, which gives performance on a test-specific scale.
3Inter alia, such standardization would help benchmark the relative efficacy of different educational
interventions. High inequality in the Indian learning distribution, for instance, implies that interventions leading to a,
say, 0.2 s.d. increase in learning represent a greater increase in "knowledge" than a similar effect in the United
States.
The fundamental building block of IRT is therefore the item response function (IRF), which
links the latent ability, $\theta$, to the probability that a randomly drawn examinee of a given ability will
answer the item correctly. One of the most popular models for dichotomous responses is the
three-parameter logistic (3PL) model introduced by Birnbaum (1968) and used by TIMSS for
multiple choice items. Letting $X_{ig}$ represent the (0/1) response for individual $i$ on item $g$, the
IRF for the 3PL model is
$$P_g(X_{ig} = 1 \mid \theta) = c_g + \frac{1 - c_g}{1 + \exp[-a_g(\theta - b_g)]}. \tag{1}$$
This function describes all 36 items administered to our sample and gives the probability of
observing a correct response given ability $\theta$ and item parameters $(a_g, b_g, c_g)$.
Figure 1 provides the intuition behind the 3PL model parameters. The pseudo guessing
parameter, $c_g$, incorporates the fact that on multiple choice exams even the worst performers
($\theta \to -\infty$) will sometimes guess correctly. The difficulty parameter, $b_g$, measures the item's
overall difficulty since the probability of answering correctly depends equally on ability and
difficulty. The discrimination parameter, $a_g$, captures how quickly the likelihood of success
changes with respect to ability. Intuitively, an item with a high discrimination parameter can
distinguish between examinees with abilities just below and above $b_g$. Overall, this relatively
flexible functional form has proved adept at fitting item response patterns.
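To make the roles of the three parameters concrete, the following minimal Python sketch evaluates equation (1) at a few ability levels. The item parameters are hypothetical values chosen for illustration; they are not taken from the TIMSS item bank.

```python
import math

def irf_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model, equation (1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item parameters (assumed for illustration only).
a, b, c = 1.2, 0.0, 0.25

p_at_b = irf_3pl(b, a, b, c)       # ability equal to the item difficulty
p_low = irf_3pl(-10.0, a, b, c)    # very low ability: approaches the guessing floor c
p_high = irf_3pl(10.0, a, b, c)    # very high ability: approaches 1
```

When ability equals the difficulty parameter, the probability of success is halfway between the guessing floor and one, (1 + c)/2, which is one way to read the difficulty parameter off the curve.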
To illustrate graphically how IRT links items and tests, Figure 2 plots the item response
functions for two TIMSS items that map ability on the horizontal axis to the percentage correct
on the test. A third curve plots the test characteristic curve for a test composed of these two
items only. Since the item response functions are fully characterized by the published TIMSS
item parameters and the structural assumption of a logistic function, it is easy to read off the mean
ability of children from the percentage correct on the test. For instance, if item 19 is administered
and 60 percent of children respond correctly, the mean ability is 425. By comparison, the same
result on item 21 would suggest a higher mean ability level since the question is more difficult.
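Reading ability off a characteristic curve amounts to inverting a monotone function. The sketch below does this by bisection for two hypothetical items and for the two-item test they form. It works on an unscaled logit metric rather than the TIMSS reporting scale, and the parameters are assumptions for illustration, not those of items 19 and 21.

```python
import math

def irf_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected fraction correct at ability theta."""
    return sum(irf_3pl(theta, a, b, c) for a, b, c in items) / len(items)

def ability_from_fraction(target, items, lo=-8.0, hi=8.0, tol=1e-10):
    """Invert the (monotone increasing) curve by bisection."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tcc(mid, items) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

easy_item = (1.0, -0.5, 0.2)   # hypothetical (a, b, c)
hard_item = (1.0, 0.5, 0.2)    # same shape, higher difficulty

theta_easy = ability_from_fraction(0.60, [easy_item])
theta_hard = ability_from_fraction(0.60, [hard_item])
theta_both = ability_from_fraction(0.60, [easy_item, hard_item])
```

The same 60 percent success rate maps to a higher ability on the harder item, which is the comparison drawn in the text.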
The key advantage of IRT in large testing situations is this ability to link tests, either in a
cross-section (when different children are administered different test questions) or over time (when
children are tested more than once). Formally, IRT equates competence levels by identifying
off the set of common items across the tests and defining a reference population. Absent a
reference population, the IRF given by (1) provides competence levels and item parameters
that are identified up to an affine transformation--poor performance cannot be distinguished
from a difficult test and a large variance in achievement cannot be distinguished from a highly
discriminating test. Specifically, the transformations
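$$\theta^* = \alpha\theta + k \tag{2}$$
$$b_g^* = \alpha b_g + k \tag{3}$$
$$a_g^* = a_g / \alpha \tag{4}$$
$$c_g^* = c_g \tag{5}$$
will yield identical characteristic curves, so that $P_g(\theta^*; a_g^*, b_g^*, c_g^*) = P_g(\theta; a_g, b_g, c_g)$. However,
if item parameters are fixed, the scale of $\theta$--the mean and variance--is fixed as well. Thus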
by calibrating items using a defined reference group we can score the performance of all other
children relative to that group, regardless of which items children actually complete. In our case,
the reference group is given by the TIMSS knowledge scale. This scale fixes the item parameters
such that the TIMSS 1995 sample of eighth grade children have mean 500 and standard deviation
100 (Yamamoto & Kulick 2000).
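The indeterminacy can be checked numerically. The sketch below assumes the affine transformation takes the usual form: ability and difficulty are multiplied by a common positive factor and shifted by a common constant, discrimination is divided by that factor, and the guessing parameter is unchanged. The names `alpha` and `k` and all numerical values are illustrative assumptions.

```python
import math

def irf_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def rescale(theta, a, b, c, alpha, k):
    """Apply the affine change of scale; returns (theta, a, b, c) on the new scale."""
    return alpha * theta + k, a / alpha, alpha * b + k, c

theta, a, b, c = 0.3, 1.1, -0.4, 0.2   # hypothetical values
alpha, k = 100.0, 500.0                # e.g. moving to a mean-500, s.d.-100 metric

p_original = irf_3pl(theta, a, b, c)
p_rescaled = irf_3pl(*rescale(theta, a, b, c, alpha, k))
```

Because the two probabilities coincide, response data alone cannot pin down the scale; fixing the item parameters, as done here with the TIMSS values, removes the ambiguity.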
In our application, all students receive the same exam and all item parameters are fixed using
TIMSS. In general, however, this need not be the case. Students can receive different exams and
new items so long as each item can be linked to a common set of fixed items or a fixed reference
population. For example, three two-item exams with item pairs (1,2), (2,3) and (3,4) can all be
linked provided that one of the four items is fixed, even if each test is administered to a different
population.4
4To see this, note that if we fix item 1--drawing it from the TIMSS item bank, say--we can estimate the
parameters for item 2 using the first exam. Given parameters for item 2, we can then estimate the parameters for
item 3 using students who received the second test. These students need not have the same ability distribution
as the first group because they can be compared directly using item 2. Using a similar argument we can link the
third exam to the first two.
2.1 Estimating the Mean
Given a set of individuals who were administered the same test, the likelihood function of observing
the $N \times G$ response matrix $x$ is
$$L(\theta, a, b, c \mid x) = \prod_i \prod_g P_g(\theta_i; a_g, b_g, c_g)^{x_{ig}} \, [1 - P_g(\theta_i; a_g, b_g, c_g)]^{1 - x_{ig}}, \tag{6}$$
where $P_g$ is the 3PL model given by (1) and $x_{ig}$ is the 0/1 response for individual $i$ on item $g$.
Because of convergence issues associated with joint maximum likelihood methods that iterate
between solutions for item parameters and individual abilities, most researchers use marginal
maximum likelihood (MML) to estimate the 3PL model. To estimate any unknown item parameters,
this method integrates out the ability distribution $f(\theta)$ to get the marginal likelihood
function. Bock & Aitkin (1981) propose an efficient EM algorithm to solve the resulting
maximization problem. In addition to the parameter estimates, this algorithm returns a summary
measure of the ability distribution $f(\theta)$ such as a mean and variance or a quadrature
approximation. To obtain individual ability estimates, one can maximize the full likelihood function
(6) treating the item parameters as fixed. For our application, this is all that is required to
produce MLEs since all item parameters are known. The sample means--the average score in Orissa
and Rajasthan--can be computed from the individual ability estimates or, potentially, the means
obtained during the marginalization of the full distribution.
While maximum likelihood methods are usually perfectly adequate to estimate sample means,
there are some exceptions. One significant problem is that MLE proficiency is undefined if
children answer fewer items correctly than would be expected by chance. If even one child
has an undefined ability estimate, the sample average is undefined as well. As a result, researchers commonly
limit the proficiency scale to some finite range. We follow TIMSS and bound MLE scores between
5 and 995--in our sample, 91 of the 6000 tested children are bounded below by 5. A second,
more technical concern relates to the methods used to maximize the likelihood function (6) for
ability. Yen et al. (1991) find that this likelihood function is often multimodal even for tests
with up to 50 items, which is a potential pitfall for many numerical maximization algorithms
commonly employed.
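As a rough illustration of bounded maximum likelihood scoring with fixed item parameters, the sketch below searches a grid of scores between 5 and 995 for a single student. A grid search is used only because it is insensitive to multimodality; it is not claimed to be the routine used in the paper. The item parameters are hypothetical, and the mapping score = 500 + 100 × ability is an assumption made for illustration.

```python
import math

def irf_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log of equation (6) for one student, item parameters held fixed."""
    total = 0.0
    for x, (a, b, c) in zip(responses, items):
        p = irf_3pl(theta, a, b, c)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def mle_score(responses, items, lo=5.0, hi=995.0, step=1.0):
    """Grid-search MLE on a reporting scale, bounded to [lo, hi].

    Assumes, for illustration only, score = 500 + 100 * theta.
    """
    best_score, best_ll = lo, -math.inf
    score = lo
    while score <= hi:
        ll = log_likelihood((score - 500.0) / 100.0, responses, items)
        if ll > best_ll:
            best_score, best_ll = score, ll
        score += step
    return best_score

# Five hypothetical items with a guessing floor of 0.2.
items = [(1.0, -1.0, 0.2), (1.0, -0.5, 0.2), (1.0, 0.0, 0.2),
         (1.0, 0.5, 0.2), (1.0, 1.0, 0.2)]

score_none = mle_score([0, 0, 0, 0, 0], items)   # hits the lower bound
score_all = mle_score([1, 1, 1, 1, 1], items)    # hits the upper bound
score_mixed = mle_score([1, 1, 0, 0, 0], items)  # interior maximum
```

A student with no correct answers has a likelihood that keeps rising as ability falls, so without the bound the estimate would not exist; with it, the score is simply set to 5.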
Bayesian methods avoid some of these problems by incorporating additional information
through a prior. Leaving just enough notation to capture the basic idea, the Bayesian approach
focuses on the posterior distribution,
$$p(\theta \mid x) \propto L(\theta \mid x)\, p(\theta), \tag{7}$$
which is proportional to the product of the likelihood and prior. The expected a posteriori (EAP)
estimate of ability is simply the mean of the posterior distribution for each individual i. One
advantage of EAP scores is that they are always well defined, even for poorly performing students;
when the likelihood function provides no additional information, the posterior simply converges
to the prior. Moreover, provided that the prior distribution is correctly specified, the mean of
the EAP scores is an unbiased estimate of the sample mean and has a smaller mean squared
error than the corresponding MLE based estimate.
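A minimal sketch of an EAP calculation follows. It approximates the posterior mean on an evenly spaced grid and assumes a normal prior; the prior, the grid, and the item parameters are all illustrative assumptions rather than the specification used in the paper.

```python
import math

def irf_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def likelihood(theta, responses, items):
    """Equation (6) for a single student with fixed item parameters."""
    value = 1.0
    for x, (a, b, c) in zip(responses, items):
        p = irf_3pl(theta, a, b, c)
        value *= p if x == 1 else 1.0 - p
    return value

def eap_estimate(responses, items, prior_mean=0.0, prior_sd=1.0, n_points=161):
    """Posterior mean of ability on a grid spanning +/- 4 prior s.d."""
    grid = [prior_mean + prior_sd * (-4.0 + 8.0 * j / (n_points - 1))
            for j in range(n_points)]
    weights = []
    for t in grid:
        prior = math.exp(-0.5 * ((t - prior_mean) / prior_sd) ** 2)
        weights.append(prior * likelihood(t, responses, items))
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)

# Hypothetical item parameters.
items = [(1.0, -1.0, 0.2), (1.0, -0.5, 0.2), (1.0, 0.0, 0.2),
         (1.0, 0.5, 0.2), (1.0, 1.0, 0.2)]

eap_low = eap_estimate([0, 0, 0, 0, 0], items)    # finite, unlike the unbounded MLE
eap_high = eap_estimate([1, 1, 1, 1, 1], items)
eap_no_data = eap_estimate([], [])                # falls back to the prior mean
```

With no responses the likelihood is flat and the estimate equals the prior mean, which is the sense in which the posterior reverts to the prior when the data carry no information.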
2.2 Estimating the Variance and Quantiles
In addition to the average performance level in Rajasthan and Orissa, we are also interested in
the shape of the full distribution. The primary difficulty here is that if the test is too short,
too easy or too difficult, the individual errors become too large to ignore and the distribution of
estimated individual proficiencies no longer converges to the population distribution (Yamamoto
& Kulick 2000, Mislevy, Beaton, Kaplan & Sheehan 1992). To get a sense for whether this is
an issue in the Indian case, Figure 3 plots the distribution of MLE abilities in a histogram (left
axis) and the associated ±1.96 se confidence interval on the right axis.5 For children below the
mean, the precision of the ability estimate is very low. Simply put, for most Indian children,
the test is too hard. In this situation, the mean of the sample will still generally approach the
population mean, but the same is not true for the estimated variance.
5Item Response Theory provides the standard error for each score from the inverse Fisher information matrix
after ML estimation of the IRT model. As the number of items grows large, this standard error summarizes the
normal sampling distribution of the estimator. However, as the number of items shrinks, the sampling distribution
becomes highly non-normal. In particular, our test is weakly informative for poorly performing students because
we cannot distinguish between students scoring poorly and those scoring very poorly; we can, however, reject that such
students are high achievers. Consistent with how ML standard errors are calculated, Figure 3 does not capture
this non-normal behavior and instead graphs ±1.96 se.
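To see this, consider the variance of the MLE scores $\hat{\theta}$ and the EAP scores $\bar{\theta}$. The variance
of the MLE scores includes both the variance of true scores and measurement error $e$. That is,
$$\mathrm{Var}(\hat{\theta}) = \mathrm{Var}(\theta) + \mathrm{Var}(e). \tag{8}$$
Defining the test reliability ratio as $\rho \equiv \mathrm{Var}(\theta) / \mathrm{Var}(\hat{\theta})$, we have
$$\mathrm{Var}(\hat{\theta}) = \frac{\mathrm{Var}(\theta)}{\rho} > \mathrm{Var}(\theta). \tag{9}$$
By comparison, the EAP scores are a weighted average of the MLE score and the population
mean, $\bar{\theta} = \rho\hat{\theta} + (1 - \rho)\mu$. The variance of the EAP scores is therefore
$$\mathrm{Var}(\bar{\theta}) = \mathrm{Var}[\rho\hat{\theta} + (1 - \rho)\mu] = \rho^2\, \mathrm{Var}(\hat{\theta}) = \rho\, \mathrm{Var}(\theta) < \mathrm{Var}(\theta). \tag{10}$$
The true variance, $\mathrm{Var}(\theta)$, is bounded above by the MLE score variance and below by the EAP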
score variance. It should be clear that this argument extends to any percentile moments such
as the top and bottom quintile. Unfortunately, the error structure in IRT is complicated and
closed-form corrections are not readily available.
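The bounding argument can be illustrated with a small simulation. The sketch below uses a simple linear model with normal, homoskedastic measurement error, which is an assumption made for clarity; as noted above, the actual IRT error structure is more complicated. All numerical values are hypothetical.

```python
import random
import statistics

rng = random.Random(0)
n, mu, sd_true, sd_error = 20000, 0.0, 1.0, 0.8

# True abilities and noisy "MLE-like" scores (true score plus measurement error).
true_scores = [rng.gauss(mu, sd_true) for _ in range(n)]
mle_like = [t + rng.gauss(0.0, sd_error) for t in true_scores]

# Reliability ratio Var(theta) / Var(theta_hat), known here by construction.
rho = sd_true ** 2 / (sd_true ** 2 + sd_error ** 2)

# "EAP-like" scores shrink each noisy score toward the population mean.
eap_like = [rho * m + (1.0 - rho) * mu for m in mle_like]

var_true = statistics.pvariance(true_scores)
var_mle = statistics.pvariance(mle_like)
var_eap = statistics.pvariance(eap_like)
```

In this setup the noisy scores are overdispersed and the shrunken scores are underdispersed relative to the true abilities, so the two estimates bracket the true variance.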
One simple way to address this issue is to bound the distribution estimates using MLE and
EAP scores. Where these estimates are similar, no further work may be required--convenient
because both MLE and EAP scores are readily available from standard reports in test analysis
programs such as BILOG-MG. Unfortunately, in parts of the distribution where the test is only
weakly informative the bounds may be quite large; in our application, this turns out to be true
for estimates of lower quantiles.6
A more satisfactory solution, and the one followed by TIMSS, is to draw "plausible values"
from the posterior distribution of each student's ability estimate and then use these draws to approximate
the true achievement distribution (Mislevy 1991, Mislevy, Beaton, Kaplan & Sheehan
1992, Mislevy, Johnson & Muraki 1992, Yamamoto & Kulick 2000). Staying with our simplified
posterior notation, we draw five plausible values for each child
$$\tilde{\theta}_{i,k} \sim p(\theta_i \mid x) \quad (k = 1, \ldots, 5) \tag{11}$$
and then estimate the sample moment of interest as
$$\hat{s} = \frac{1}{5} \sum_{k=1}^{5} s(\tilde{\theta}_k) \tag{12}$$
where $s(\tilde{\theta}_k)$ may be the variance, 90th percentile, etc., of the $N$-element vector of plausible values
$\tilde{\theta}_k$.
6Another approach would be to use the information obtained during the integration process of the marginal
maximum likelihood procedure. While this would give unbiased estimates of certain moments it depends crucially
on the integration strategy used and is generally rather cumbersome.
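The two steps in (11) and (12) can be sketched as follows. For simplicity the posterior is approximated on a discrete grid with a standard normal prior, and the students and items are simulated; these are assumptions for illustration and do not reproduce the TIMSS conditioning model or the data used here.

```python
import math
import random
import statistics

def irf_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def posterior_weights(responses, items, grid):
    """Unnormalised posterior on a grid: standard normal prior times likelihood."""
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)
        for x, (a, b, c) in zip(responses, items):
            p = irf_3pl(theta, a, b, c)
            w *= p if x == 1 else 1.0 - p
        weights.append(w)
    return weights

def draw_plausible_values(responses, items, grid, rng, n_draws=5):
    """Equation (11): random draws from one student's (gridded) posterior."""
    weights = posterior_weights(responses, items, grid)
    return rng.choices(grid, weights=weights, k=n_draws)

def pv_estimate(pv_by_student, stat):
    """Equation (12): apply stat to each vector of draws, then average."""
    n_draws = len(pv_by_student[0])
    per_draw = [stat([pv[k] for pv in pv_by_student]) for k in range(n_draws)]
    return sum(per_draw) / n_draws

rng = random.Random(1)
items = [(1.0, b / 2.0, 0.2) for b in range(-2, 3)]   # hypothetical parameters
grid = [-4.0 + 0.05 * j for j in range(161)]

# Simulated responses for 300 hypothetical students.
students = []
for _ in range(300):
    theta = rng.gauss(0.0, 1.0)
    students.append([1 if rng.random() < irf_3pl(theta, a, b, c) else 0
                     for a, b, c in items])

pvs = [draw_plausible_values(resp, items, grid, rng) for resp in students]
pv_variance = pv_estimate(pvs, statistics.pvariance)
```

Any other statistic, such as a percentile function, can be passed in place of `statistics.pvariance`.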
Unfortunately, no publicly available software can draw plausible values for the model we
estimate, making it difficult for other researchers to replicate the TIMSS methodology precisely.
We use the Markov Chain Monte Carlo (MCMC) algorithm proposed by Patz & Junker (1999a,b)
to compute the EAP scores and plausible values. This technique differs from the computational
approach used by TIMSS but is highly flexible and relatively straightforward to implement. We
provide a fuller explanation of our estimation strategy in Appendix A.
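For readers unfamiliar with MCMC, the sketch below shows a generic random-walk Metropolis sampler for a single student's ability with item parameters held fixed. It is a simplified stand-in meant to convey the idea, not the estimation routine described in Appendix A; the standard normal prior, step size, burn-in length, and item parameters are all assumptions.

```python
import math
import random

def irf_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_posterior(theta, responses, items):
    """Log posterior up to a constant, assuming a standard normal prior."""
    value = -0.5 * theta * theta
    for x, (a, b, c) in zip(responses, items):
        p = irf_3pl(theta, a, b, c)
        p = min(max(p, 1e-12), 1.0 - 1e-12)   # guard against log(0)
        value += math.log(p if x == 1 else 1.0 - p)
    return value

def metropolis_draws(responses, items, rng, n_draws=2000, burn_in=500, step=1.0):
    """Random-walk Metropolis draws from one student's ability posterior."""
    theta = 0.0
    current = log_posterior(theta, responses, items)
    draws = []
    for iteration in range(burn_in + n_draws):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, responses, items)
        # Accept with probability min(1, posterior ratio).
        if math.log(1.0 - rng.random()) < candidate - current:
            theta, current = proposal, candidate
        if iteration >= burn_in:
            draws.append(theta)
    return draws

# Hypothetical item parameters.
items = [(1.0, -1.0, 0.2), (1.0, -0.5, 0.2), (1.0, 0.0, 0.2),
         (1.0, 0.5, 0.2), (1.0, 1.0, 0.2)]

draws_low = metropolis_draws([0, 0, 0, 0, 0], items, random.Random(2))
draws_high = metropolis_draws([1, 1, 1, 1, 1], items, random.Random(3))
eap_low = sum(draws_low) / len(draws_low)
eap_high = sum(draws_high) / len(draws_high)
```

The mean of the retained draws approximates the EAP score, and a handful of well-separated draws from the same chain could serve as plausible values.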
To see whether these concerns are of practical importance, Figure 4 shows the estimated MLE,
EAP and PV distributions of ability. The MLE and EAP estimates represent the upper and lower
bounds of the variance of the distribution, with the PV estimates somewhere in between. There
is a considerable divergence in the shape of the distribution to the left of the mean, while at
higher levels of ability, the three distributions look roughly similar. Because individual scores are
only weakly informative for poor performers, the MLE and EAP estimates diverge and plausible
value methodology is crucial to estimating percentile cutoffs at the bottom of the distribution.
3 Data
We use data collected in 2005 as part of a larger World Bank study designed and led by Kin
Bing Wu, and conducted by the Social and Rural Research Institute (SRI) unit of IMRB
International.7 The study includes detailed surveys of principals, teachers, and students in 288
schools in the Indian states of Rajasthan (in the West) and Orissa (in the East). The sample was
designed by first selecting districts using probability proportional to size (PPS) sampling, where,
in the absence of data on school-by-school enrollment, the PPS methodology was applied to the
population of schools across districts. Following the selection of districts, the total number of
schools to be selected in each district was determined and schools were surveyed in both urban
and rural areas, as well as across institutional affiliation; that is, government schools, private
unaided and private aided schools were all included in the sample.
7For further details on the survey design and an analysis of the correlates of performance, see Wu et al. (2006,
2007).
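As a generic illustration of PPS selection, the sketch below draws districts systematically with the number of schools as the size measure. The district names and counts are invented, and the systematic variant shown is one common way to implement PPS; the survey's actual selection procedure is documented in Wu et al. and may differ.

```python
import random

def pps_systematic(sizes, n_select, rng):
    """Systematic PPS selection.

    sizes maps each unit to its size measure. A unit larger than the
    sampling interval can be selected more than once.
    """
    total = sum(sizes.values())
    interval = total / n_select
    start = rng.random() * interval           # random start in [0, interval)
    targets = [start + j * interval for j in range(n_select)]
    selected, cumulative, idx = [], 0.0, 0
    for unit, size in sizes.items():
        cumulative += size
        # Select this unit for every target falling inside its size range.
        while idx < n_select and targets[idx] < cumulative:
            selected.append(unit)
            idx += 1
    return selected

# Hypothetical districts and school counts.
schools_per_district = {"A": 120, "B": 80, "C": 200, "D": 50, "E": 150}
chosen = pps_systematic(schools_per_district, 2, random.Random(3))
```

Districts with more schools cover a longer stretch of the cumulative total and are therefore more likely to contain a selection point.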
There are three important features of the data. First, only children enrolled in school were
tested. According to the World Development Indicators, gross enrollment in India in lower
secondary schools is only 53 percent, so this leaves out half the children of the relevant age
group. Consequently, there are intractable issues of trying to generalize the comparison of tested
children to all children across countries. For instance, the gross enrollment in secondary schools
in three other countries close to Orissa and Rajasthan in the world rankings varies dramatically,
from 44 percent in Ghana to 75 percent in Botswana and 90 percent in South Africa. To the
extent that marginal performers are less likely to be enrolled, comparisons of India with South
Africa would therefore favor India; alternatively comparisons between India and Ghana favor
the latter. Although problematic for the mean, the lack of information on non-enrolled children
may not be as problematic for different percentiles--it may be plausible to assume, for instance,
that the 50 percent of children not tested are likely to join the group that performs "poorly", in
a sense to be made precise below.
Second, although all attempts were made to ensure that no type of school or location was
left out of the sampling procedure, it has been difficult to accurately weight the data given
paucity of data on enrollments in private unaided and aided schools at the district level. This is
a general problem that any testing exercise has to address and it calls for an urgent compilation
of a universal dataset that can be used for sampling in the future.
Third, the data are from two states only, and therefore generalizations to all of India may
be misleading--Rajasthan and Orissa are both poorer states with larger tribal populations.
Learning outcomes though may be different from those suggested by income rankings. The
results from a countrywide testing exercise in rural areas (Pratham 2006) give us some sense of
where these states lie in the Indian distribution. Among children tested in Grade 8 countrywide
(rural areas only), 82.4 percent could read a story, 75.2 percent could divide and 95.5 percent
could write. The average of Orissa (83.98 percent (read), 71 percent (divide) and 95.3 percent
(write)) and Rajasthan (92.9 percent can read, 92.4 percent can divide and 98.5 percent can write)
is surprisingly not far off the Indian average--if anything, these results suggest that children in
these two states may be scoring higher than the rest of the country. However, lots of caution
is still warranted--particularly since Orissa performs better than Rajasthan in the tests we use
while Pratham finds the opposite.
In the selected schools, students in ninth grade were administered a 36-item test where all
items were selected from the list of publicly released items published by the TIMSS. The test
sought to cover the content domains tested under the TIMSS with 11 items on Algebra, 5 on Data
Representation, Analysis and Probability, 9 on Fractions and Number Sense, 7 on Geometry and
4 on Measurement. The performance expectation across these content domains also varied and
ranged from "Communicating and Reasoning" to "Using Complex Procedures" (Table 1). The
items selected were neither too easy nor too hard in the TIMSS calibration, with difficulty parameters ranging from -1.07
(a student 1 standard deviation below the mean would have a 50 percent chance of answering this
question correctly, absent guessing) to 1.244; the items were also uniformly distributed across
this difficulty range.
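The interpretation of the difficulty parameter can be illustrated with the standard three-parameter logistic (3PL) response function. The sketch below assumes the usual parameterization with discrimination a, difficulty b and guessing c, and a scaling constant D = 1.7; the discrimination and guessing values are hypothetical and are not taken from the TIMSS calibration.

```python
import math

def p_correct_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct answer under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# An item with difficulty b = -1.07, answered by a student with theta = -1.07,
# when there is no guessing (c = 0): the probability is one half.
p_no_guess = p_correct_3pl(theta=-1.07, a=1.0, b=-1.07, c=0.0)

# The same item with a hypothetical guessing parameter c = 0.2:
# the probability rises to c + (1 - c) / 2.
p_guess = p_correct_3pl(theta=-1.07, a=1.0, b=-1.07, c=0.2)
```

When ability equals difficulty the logistic term is exactly one half whatever the discrimination, which is why the 50 percent reading holds only "absent guessing".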
4 Results
4.1 International Benchmarking
There are two views that currently dominate thinking about educational policy in India. One
view--active proponents of which include prominent NGOs--is that Bharat is drowning. Average
learning levels are so low that the typical child will leave primary school without knowing how
to read or perform elementary mathematical operations. A second view--often expressed by
those in the government and in the media--is that India is shining. This group points to India's
increasing global presence, the large number of Indian professionals in high paying jobs, and the
dramatic growth of its service industry, particularly in information technology. As it turns out,
both views contain an element of truth, and both views can be justified by presenting different
pieces of the same data.
Mirroring the view that Bharat is drowning, absolute achievement, as measured by the percent
correct score, is low compared to curricular standards. A significant fraction of children have
not mastered the content categories expected for their grade (Table 1). By ninth grade, only
11 percent of children in Rajasthan and 17 percent in Orissa can correctly choose the smallest
value from the set 0.625, 0.25, 0.375, 0.5, and 0.125 (Table 1, Q2). The question "A runner ran
3000m in exactly 8 minutes. What was his average speed in meters per second?" (Q7) stumps
all but 23 percent of children in Rajasthan and 32 percent of children in Orissa. A simple test
of division and fractions--"if 4 times a number is 48 what is 1/3 of the number?" (Q10)--is too
difficult for 65 percent of children in Rajasthan and 64 percent of children in Orissa.
While the item-by-item comparison suggests that Indian children are performing significantly
below the international average, interpreting the magnitude of this effect is difficult because it
depends on a test-specific metric. As discussed, the percentage correct score is a function of
latent achievement differences--our true parameter of interest--and the discriminating power of
the test, and thus inseparable from the specific test design.
Figure 5 uses the linking methodology proposed previously to present cross-country compar-
isons on the TIMSS achievement scale.8 Based on the average score, Rajasthan and Orissa rank
below 46 (42) of the 51 countries tested with a score of 382 and 404. This ranking straddles
Bahrain, Chile and Morocco and is boxed in by Egypt above and the Philippines below. Averag-
ing across the entire tested sample, India scores 392--below 43 of 51 countries. This compares
to the international average of 487 in 1999 and 467 in 2003. Seen in standard deviations of all
children tested, the two Indian states are 0.7 student standard-deviations or 1 country standard-
deviation below the TIMSS mean. Relative to the OECD mean, the tested Indian children are
3.1 (OECD) standard deviations below. That is, if we rank all the OECD countries, India would
lie below the 1st percentile in the distribution of OECD country scores.
The true picture may be worse. Since the tests included only enrolled children, the comparisons
favor India to the extent that enrollment is lower relative to other countries. In both
Botswana (75 percent) and South Africa (90 percent) gross enrollment in secondary schools is
higher. It is likely that a representative sample of children (enrolled and unenrolled) would place
India below additional countries.
8 We follow the TIMSS methodology as closely as possible and compute sample averages using the EAP scores,
which is, in this case, simply more efficient than using plausible values. The MLE scores, which are estimated
using BILOG-MG rather than our custom MCMC routines, yield somewhat lower estimates of the average: 374
and 386. The discrepancy between the EAP and MLE averages is likely due to students scoring in an area where
the likelihood function is virtually flat or undefined. In this situation, regularity and stability become a major
concern with MLE.
That the average child is performing poorly masks the considerable variation in the distribu-
tion. At the bottom, children score extremely poorly. There is no evidence that the distribution
is more compressed at the bottom than for other low-performing countries. In fact, only three
countries--Saudi Arabia, Ghana, and South Africa--score worse than Rajasthan or Orissa if
ranked by the 5th percentile cutoff score (Figure 6). When the education system fails, it fails
completely.
4.2 Inequality in the Learning Distribution
Following Micklewright & Schnepf (2006), we report a simple statistic measuring test-score
dispersion--the difference between 5th and 95th percentiles of the test score distribution. Figure
7 shows the significant educational inequality in the Indian learning distribution. In both the
Indian states, the 5-95 percentile spread is greater than 300, and just below the most unequal
country in the TIMSS sample--South Africa.
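The dispersion statistic is simple to compute from a vector of scaled scores. The sketch below is illustrative only: the function name is hypothetical, and the "inclusive" interpolation rule for percentiles is an assumption, since the text does not state which percentile convention is used.

```python
import statistics

def percentile_spread(scores, lower=5, upper=95):
    """Difference between the upper and lower percentiles of a score list."""
    # quantiles(..., n=100) returns the 99 percentile cut points;
    # element k - 1 is the k-th percentile.
    cuts = statistics.quantiles(scores, n=100, method="inclusive")
    return cuts[upper - 1] - cuts[lower - 1]

# Toy data: scores 0, 1, ..., 100, for which the 5th and 95th
# percentiles are 5 and 95 under the inclusive rule.
toy_spread = percentile_spread(list(range(101)))
```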
TIMSS 2003 also presents achievement benchmarks based on an intensive effort to anchor
performance to objective criteria. Table 2, drawn from TIMSS 2003 (Exhibit 2.1), describes the
low (400), intermediate (475), high (550), and advanced (625) international benchmarks; Table 3
shows the results. In Rajasthan and Orissa, 1 percent of children pass the advanced benchmark.
This actually is above many other poor performing countries. At the same time, only 42 percent
in Rajasthan and 50 percent in Orissa pass the lowest benchmark. Put another way, only 40
to 50 percent of Rajasthan and Orissa's enrolled ninth graders have "some basic mathematical
knowledge"--the description of the low international benchmark.
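The share of students reaching each benchmark follows directly from the cutoffs listed above. In the sketch below the cutoffs are those quoted in the text, while the function name, the toy scores and the convention that a student "passes" when scoring at or above the cutoff are assumptions.

```python
# International benchmark cutoffs quoted in the text (TIMSS 2003).
BENCHMARKS = {"low": 400, "intermediate": 475, "high": 550, "advanced": 625}

def share_reaching(scores, benchmarks=BENCHMARKS):
    """Fraction of scores at or above each benchmark cutoff."""
    n = len(scores)
    return {name: sum(s >= cut for s in scores) / n
            for name, cut in benchmarks.items()}

# Five hypothetical students, for illustration only.
toy_shares = share_reaching([350, 400, 480, 560, 630])
```

Because the benchmarks are nested, the shares are non-increasing from the low to the advanced benchmark, which is what makes the re-ranking exercise in the next paragraph informative.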
A second useful exercise that demonstrates the vast differences between tested children is to
rank Table 3 by those who reach each of the different international benchmarks. Ranked by the
low international benchmark, Rajasthan is 8th from the bottom and Orissa 9th; ranked by the
intermediate benchmark, they are now 9th and 14th from the bottom respectively; ranked by
the high international benchmark they are now 11th and 16th from the bottom. The advanced
international benchmarks put both states at the respectable positions of 12th and 18th, although
the precise ranking is difficult to obtain given rounding.
To the extent that these two states represent India, the combination of a wide achievement
distribution and immense population explains why perceptions of India can vary so dramatically.
In Table 4, we use population age-cohort estimates and enrollment rates to estimate the number of
14-year olds in each country who pass the international benchmarks set by TIMSS. The results
are striking. If one percent of Indian children reach the advanced international benchmark--
the average suggested by Rajasthan and Orissa--the total cohort size ranks 5th out of all the
countries tested. Only Japan, the United States, South Korea, and Taiwan have more students
passing the top benchmark. For every ten children who pass the advanced benchmark in the
United States, there are four children who pass the benchmark in India. Indeed, the 101 thousand
Indian children who pass the advanced benchmark exceed the total number of children who pass
in the bottom 32 countries combined. If India were added to the TIMSS sample, one out of every
fourteen children who pass the advanced benchmark would be Indian.
The view from the top--that Indians form a substantial fraction of top performers worldwide--
contrasts sharply with the view from the bottom. The sheer magnitude of India's youth population
and poor average performance mean that over 17 million Indian 14-year-olds are either
not enrolled or fail the low international benchmark. This number is 22 times the number in the
United States, 217 times the number in South Korea, and 726 times the number in Japan. Indeed,
there are more Indians either not enrolled or who fail the low benchmark than in all the other
TIMSS countries combined.
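The arithmetic behind these head counts can be sketched as follows. The exact construction of Table 4 is not reproduced here; the formula (cohort size times enrollment rate times benchmark share) is an assumption about how cohort estimates and enrollment rates are combined, and the inputs are round hypothetical numbers rather than the figures used in the paper.

```python
def benchmark_counts(cohort, enrollment_rate, share_advanced, share_low):
    """Head counts implied by a cohort size, an enrollment rate and benchmark shares.

    Returns (number passing the advanced benchmark,
             number either not enrolled or failing the low benchmark).
    Benchmark shares are measured among enrolled children only.
    """
    enrolled = cohort * enrollment_rate
    passing_advanced = enrolled * share_advanced
    not_enrolled_or_failing = cohort - enrolled * share_low
    return passing_advanced, not_enrolled_or_failing

# Hypothetical cohort of one million 14-year-olds, half of them enrolled,
# with 1 percent of the enrolled passing the advanced benchmark and
# 40 percent passing the low benchmark.
top, bottom = benchmark_counts(1_000_000, 0.5, 0.01, 0.4)
```

In this toy case about 5,000 children pass the advanced benchmark while about 800,000 are either out of school or below the low benchmark, which shows how a large cohort can generate both a sizeable top group and a very large bottom group.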
4.3 Variance Decomposition
The striking disparity between top- and bottom-achievers hints that children receive different
educational inputs, both based on the state in which they live and the characteristics of their
families and schools. While it is impossible to draw definitive causal conclusions using simple
correlations or variance decompositions, the patterns that emerge from even a basic analysis are
broadly consistent with a view of an education system rife with inequality but rich in potential.
In a hopeful sign, the form inequality takes suggests that public policy plays a role. The impact
of household attributes--educational inputs that the government has little power to control--
appears mitigated by the institutional structure of states and schools.
We present a heuristic approach towards examining the source of achievement in Figure
8. Here, we first regress test scores on district dummies and then plot the residuals--this is
a measure of how much of the variation is accounted for by districts. We then add in child
and household characteristics--age, gender, caste, parental literacy, and wealth--and plot the
residuals again; finally we repeat the exercise including school dummies. To the extent that
districts, households, or schools explain a large portion of the variation in the test score data, we
expect that residual plot to be more "concentrated" once the appropriate dummies are accounted
for. So, if districts matter a lot, we expect the residual plot from a regression of test scores on
district dummies to be "tighter" than the distribution of all test scores.
As Figure 8 shows, schools seem to matter most. Progressively adding district effects and
family characteristics compresses the distribution slightly. Only when we add school fixed effects
is the collapse noticeable; the gaps between schools account for more than the gaps between
children with different household characteristics.
Table 5 confirms this result more formally using a simple regression-based variance decomposition.
Here, we first regress achievement on district dummies. The $R^2$ from this regression
gives a measure of the variance explained by districts alone. Examining the change in $R^2$ after
adding household controls gives the fraction of achievement variation explained by observable
characteristics above and beyond the district effect. While indicative of households' contribution
to learning, we cannot claim households causally explain this fraction of variance since children
sort into schools. If this occurs, observable household characteristics may explain achievement
simply because schools determine learning and children sort. Proceeding onward, we add school
dummy variables and report the increase in $R^2$. This gives some sense for the importance of
schools, but again we cannot make definitive causal statements. A significant increase in variance
explained at this stage implies either that schools matter or that children sort on unobservable
characteristics. After accounting for districts, observables, and schools, the remaining variation is
idiosyncratic. As Figure 8 shows, measurement error, which cannot be decomposed by definition,
forms a significant portion of this idiosyncratic variation.
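A stripped-down version of the sequential decomposition can be sketched as below. For brevity the sketch omits the household-controls step, which requires a multivariate regression, and uses the fact that regressing on a full set of group dummies is equivalent to fitting group means. The function name and the toy data are hypothetical.

```python
from collections import defaultdict

def r2_group_dummies(y, groups):
    """R-squared from an OLS regression of y on a full set of group dummies."""
    grand_mean = sum(y) / len(y)
    sums, counts = defaultdict(float), defaultdict(int)
    for value, g in zip(y, groups):
        sums[g] += value
        counts[g] += 1
    means = {g: sums[g] / counts[g] for g in sums}
    total_ss = sum((value - grand_mean) ** 2 for value in y)
    resid_ss = sum((value - means[g]) ** 2 for value, g in zip(y, groups))
    return 1.0 - resid_ss / total_ss

# Toy data: two districts, each with one weak and one strong school.
scores    = [300, 320, 500, 520, 320, 340, 520, 540]
districts = ["d1", "d1", "d1", "d1", "d2", "d2", "d2", "d2"]
schools   = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"]

r2_district = r2_group_dummies(scores, districts)
# Schools are nested within districts, so school dummies absorb district dummies.
r2_school = r2_group_dummies(scores, schools)
school_increment = r2_school - r2_district
```

In this constructed example districts explain about 1 percent of the variance and schools add about 98 percentage points. As in the text, the increment cannot by itself distinguish school effects from sorting.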
Table 5 shows the results of this exercise. In Orissa (Rajasthan), schools explain an additional
32 percent (41 percent) of the test score variation above districts and observable household
characteristics. This is twice the amount of variation explained by districts and household
characteristics in Orissa and five times the variation explained by those attributes in Rajasthan.
Even if half of this effect is due to selection on unobservables, schools remain important. For
comparison, the maximum variation possibly attributable to school specific factors in OECD
countries is 14 percent--less than half the value for India (Pritchett 2004). If we were to remove
the variation due to measurement error and renormalize our decomposition to sum to one, the
schools' role would appear even more significant.
5 Robustness Checks
Some caveats are in order. TIMSS uses a complex test design where children are given a subset
of items in a specific format. Our results are based on a test that includes 36 TIMSS questions,
but the test-design is clearly different. The educational testing literature has many examples of
design effects, where test scores are shown to change depending on the design of the test. By
presenting results using IRT equating methods, we are essentially ignoring this rich literature.
One robustness check used in the item response literature compares the actual responses of
children, averaged across ability groups with that predicted on the basis of item parameters.
In our particular case, these tests of "item fit" reveal the extent to which the shape of the item
response function predicted from the TIMSS item parameters corresponds to the actual responses
of examinees. Figure A1 shows the predicted and actual responses for all 36 items.
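The logic of such an item-fit check can be sketched for a single item as follows: examinees are grouped by estimated ability, and the observed share answering correctly in each group is compared with the share predicted by the fixed item parameters. The bin edges, item parameters, scaling constant D = 1.7 and toy data below are hypothetical, and the sketch is not the procedure used to produce Figure A1.

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_fit_table(thetas, responses, a, b, c, edges):
    """Rows of (bin low, bin high, n, observed share correct, mean predicted)."""
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = [k for k, t in enumerate(thetas) if lo <= t < hi]
        if not idx:
            continue  # skip empty ability groups
        observed = sum(responses[k] for k in idx) / len(idx)
        predicted = sum(p_3pl(thetas[k], a, b, c) for k in idx) / len(idx)
        rows.append((lo, hi, len(idx), observed, predicted))
    return rows

# Six hypothetical examinees and their 0/1 responses to one item.
thetas = [-1.5, -1.2, 0.1, 0.3, 1.2, 1.4]
responses = [0, 0, 1, 0, 1, 1]
fit_rows = item_fit_table(thetas, responses, a=1.0, b=0.0, c=0.2,
                          edges=[-2, -1, 0, 1, 2])
```

Large gaps between the observed and predicted columns in a given ability group would flag an item for which the international parameters describe local response behavior poorly.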
For the majority of items, both the 3PL model and the item parameters closely predicted how
children would perform. In a few instances, however, the fit could be improved. As an example,
item 33 is a poorly-fitted item where high ability Indian children seem to struggle more than
their international peers. While these few items are unlikely to introduce significant bias, future
researchers should carefully select items during the pilot phase to minimize deviations from the
expected response patterns.
Further, a factor model of item responses generated a first eigenvalue (3.9) roughly 9 times greater
than the second (0.4), easily satisfying Drasgow & Lissak's (1983) rule-of-thumb for assessing the
unidimensionality assumption. Nevertheless, we could not conduct formal tests of Differential
Item Functioning (DIF) given that we do not have access to item-by-item responses for other
TIMSS examinations (and these are typically not available in the public domain). Mullis &
Martin (2000), however, conduct the required analysis for the TIMSS 1999 sample and there is
little reason to suspect the results would not extend to India.
The methods and results discussed here should not be taken as advocacy for dispensing with
TIMSS altogether and using their publicly released items to place tested children on international
distributions. TIMSS provides a level of analysis and robustness checking that independent
researchers cannot easily replicate. We view the methods presented here more as a bridge between
current practices and TIMSS-like comparability rather than an alternative. Even in this case,
a larger pilot that compares TIMSS results with those obtained by the methods suggested here
would yield important information on the biases inherent in our equating methods.
6 Conclusion
The educational administration in India has often shaken off the bad news emerging from the
primary educational sector on the grounds that the Indian system is based on the rigors of
selection. A gruelling primary schooling would weed out all but the best performers, who would
then graduate onwards to secondary schools and receive a higher quality education. One response
to the poor testing results from the primary level has in fact been to point to India's position in
the global economy and the comparable performance of its top firms and professionals to their
international counterparts. In essence, if the schooling system is so poor, how is it that India
has all these top global performers?
But this misses the point. Both positions are sustained by the data. Children from these
two states clearly fail any potential Millennium Learning Goal. If results are similar for the rest
of the country, over 17 million 14-year-olds, around 80 percent of the population, are either not
enrolled or cannot pass the lowest international benchmark. But India's massive population and
wide variance in achievement also ensure that Indians are amply represented in the worldwide
cohort of top performers. One out of fourteen children who pass the advanced benchmark in the
TIMSS sample is Indian, a ratio only four other countries can match. For every ten children
in the United States who pass the advanced benchmark--and only 7 percent do--there are four
who pass it in India.
How this situation plays out over the next decade has much to do with how production
technologies evolve in the labor market. If Indian firms manage to adopt "Ford Model-T" tech-
nologies that require a handful of highly skilled and educated workers to match with a large
number of unskilled workers, India shining can act as a "rising tide that lifts all boats." But if
Indian firms adopt "McKinsey" technologies that require skilled workers and unskilled workers
to match among themselves (as the IT consulting firms require, but not necessarily call-centers)
it is likely that the country will be characterized by increasing inequalities; an enclave of a few
privileged and self-perpetuating rich surrounded by a majority poor.
There is some hope in the variance decompositions and associations that inequalities in the
educational system can be addressed through government policies. A consistent finding across
OECD countries is the low explanatory power of schools in explaining the variation in test scores
compared to households. This is problematic for policy, since it is easier to change behavior
among teachers and to improve schools, than it is to do the same among parents. That a
large fraction of the variation in achievement arises from differences across schools suggests that
there are school-level variables, manipulable by policy, that could result in positive impacts.
What these might be, and where to go from here, should form the basis of future research and
evaluations.
More generally, the methods proposed in this paper highlight the potential benefits of linking
scores to the worldwide achievement distribution. While such efforts cannot replace the im-
portant work undertaken by TIMSS, they represent a clear improvement over the collection of
ad-hoc exams employed by most researchers, and require little additional work. India is hardly
alone in its absence from the TIMSS rankings, and many countries could benefit from an analysis
similar to ours. Over time, through such efforts, independent researchers may help make tracking
a Millennium Learning Goal a reality.
A Item Response Theory
A.1 Estimating MLE Scores
Linking our test form to the TIMSS knowledge score distribution requires an underlying model
of the response process. In our case, all 36 items presented can be described by the 3PL model
given in (1). Letting $x_{ig} \in \{0, 1\}$ denote the response for individual $i$ on item $g$ and $X$ be the full
data matrix, the likelihood of observing $X$ given a vector of associated abilities, $\theta$, is
$$
\begin{aligned}
P(X \mid \theta) &= \prod_{i}^{N} \prod_{g}^{G} P_g(x_{ig} \mid \theta_i) && (13) \\
&= \prod_{i}^{N} \prod_{g}^{G} P_g(\theta_i)^{x_{ig}} \left[1 - P_g(\theta_i)\right]^{1 - x_{ig}}, && (14)
\end{aligned}
$$
where the product form arises from assuming independence across items and individuals. Unlike
most IRT models we have suppressed the notation of the item parameters to highlight the fact
that they are fixed. In many cases there may be a mix of fixed anchor items and new uncalibrated
items, but we do not face that situation here.
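In code, the likelihood in (13)-(14) is most conveniently evaluated on the log scale, where the double product becomes a double sum. The sketch below assumes the standard 3PL form with scaling constant D = 1.7 and uses hypothetical item parameters in place of the fixed TIMSS values.

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log of the inner product over items in (14) for one individual.

    responses: list of 0/1 answers; items: list of fixed (a, b, c) tuples.
    """
    total = 0.0
    for x, (a, b, c) in zip(responses, items):
        p = p_3pl(theta, a, b, c)
        total += x * math.log(p) + (1 - x) * math.log(1.0 - p)
    return total

def sample_log_likelihood(thetas, X, items):
    """Log of (14): independence across individuals turns the product into a sum."""
    return sum(log_likelihood(t, row, items) for t, row in zip(thetas, X))

# Two hypothetical items with no guessing, evaluated at theta = b.
toy_items = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
toy_ll = log_likelihood(0.0, [1, 0], toy_items)
```

Because the item parameters are fixed, each individual's term depends on that individual's ability alone, which is what allows the person-by-person maximization described next.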
With fixed parameters it is relatively trivial to maximize the likelihood function associated
with each individual using Newton-Raphson or some other numerical procedure; each first order
condition is independent of the others so we do not face a curse of dimensionality. But some
difficulties remain. In particular, the 3PL model's guessing parameter makes MLEs undefined
for those scoring below the guessing rate. These flat parts of the likelihood function can make
numerical estimates unstable. Yen et al. (1991) also find that some response vectors can produce
likelihood functions with multiple modes even for tests of a reasonable length (such as 36 items).
These modes can trap derivative based maximization algorithms at local rather than global peaks.
To study these issues, we computed ML estimates using both a Newton-Raphson algorithm and
BILOG-MG. While the estimates agreed perfectly for most individuals, there appeared to be
some instability, particularly near the bottom of the distribution where our test is only weakly
informative and where students often score below the guessing rate. Given these differences we
choose to report only BILOG based ML estimates.
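The boundary problem can be seen in a small example. The sketch below maximizes the individual likelihood by a plain grid search, which is a deliberate substitute for Newton-Raphson and for BILOG-MG: a grid cannot be trapped at a local mode, and it makes the behavior at the edge of the ability range visible. The item parameters, grid bounds and scaling constant D = 1.7 are hypothetical.

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Individual log-likelihood with fixed item parameters (a, b, c)."""
    total = 0.0
    for x, (a, b, c) in zip(responses, items):
        p = p_3pl(theta, a, b, c)
        total += x * math.log(p) + (1 - x) * math.log(1.0 - p)
    return total

def mle_grid(responses, items, lo=-4.0, hi=4.0, steps=801):
    """Ability value on an evenly spaced grid with the highest log-likelihood."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Three hypothetical items with a guessing parameter of 0.2.
items = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
theta_mixed = mle_grid([1, 1, 0], items)      # interior maximum
theta_all_wrong = mle_grid([0, 0, 0], items)  # runs to the lower grid bound
```

For the all-incorrect response vector the likelihood keeps rising as ability falls, so the estimate sits at whatever lower bound the routine imposes. This is the sense in which the MLE is undefined for students scoring below the guessing rate.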
A.2 Estimating EAP Scores and Plausible Values by Markov Chain Monte
Carlo
Both EAP and plausible values are based on the posterior distribution of individuals' ability. In
Section 2 we introduced the basics of the Bayesian approach using simplified notation. To be
more precise, we now change the setup slightly and introduce notation for manifest predictors of
the score. Letting Y denote the matrix of predictors such as state, gender, age, wealth, parental
literacy and school type, we follow TIMSS and assume that covariates are linked to ability using
a simple linear model
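$$\theta = Y\beta + \epsilon, \qquad (15)$$
where $\epsilon_i \sim N(0, \sigma^2)$. Given this model, we can express the joint posterior distribution for all
parameters as
$$
\begin{aligned}
P(\theta, \beta, \sigma \mid X, Y) &\propto P(X \mid \theta, \beta, \sigma, Y)\, P(\theta, \beta, \sigma \mid Y) && (16) \\
&= P(X \mid \theta)\, P(\theta, \beta, \sigma \mid Y) && (17) \\
&= P(X \mid \theta)\, P(\theta \mid \beta, \sigma, Y)\, P(\beta, \sigma \mid Y) && (18) \\
&= P(X \mid \theta)\, P(\theta \mid \beta, \sigma, Y)\, P(\beta)\, P(\sigma) && (19) \\
&= \prod_i \Big\{ \Big[ \prod_j P_j(x_{ij} \mid \theta_i) \Big] P(\theta_i \mid \beta, \sigma, Y_i) \Big\}\, P(\beta)\, P(\sigma) && (20)
\end{aligned}
$$
where (16) follows from Bayes Rule, (17) follows from unidimensionality, (18) follows from the
multiplication rule, (19) follows from independence of $\beta$, $\sigma$ and $Y$, and (20) follows from the
independence across individuals and items. Our parameters of interest--the EAP and plausible
value estimates of ability--are the expected value of the posterior of $\theta_i$ or simply independent draws
from this distribution. One can therefore think of plausible values as an empirical approximation
of the posterior.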
The computational problem becomes how to draw from this posterior distribution. Patz &
Junker (1999a,b) illustrate how Markov Chain Monte Carlo (MCMC) techniques, particularly
so-called Metropolis-Hastings within Gibbs, can be used to draw from the posterior distribution
even in very complicated IRT settings. The basic idea of MCMC is to simulate observations
from a Markov chain whose stationary distribution is the joint posterior distribution of interest.
There are many strategies for constructing a chain with this property. In the IRT context,
MH-within-Gibbs achieves the objective in a relatively straightforward manner.
The basic motivation behind "Gibbs samplers" is to reduce the simulation problem to lower
dimensional, perhaps univariate, space. In our case, we are interested in the distribution of
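$N + K + 1$ random variables, $(\theta_1, \dots, \theta_N, \beta_1, \dots, \beta_K, \sigma \mid X, Y)$. Gibbs sampling constructs
a Markov chain $M_t = (\theta_1^{(t)}, \dots, \theta_N^{(t)}, \beta_1^{(t)}, \dots, \beta_K^{(t)}, \sigma^{(t)})$ by sampling from the full conditionals as
follows:
· $\theta_1^{(t+1)} \sim p(\theta_1 \mid \theta_2^{(t)}, \dots, \theta_N^{(t)}, \beta_1^{(t)}, \dots, \beta_K^{(t)}, \sigma^{(t)}, X, Y)$
· $\theta_2^{(t+1)} \sim p(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \dots, \theta_N^{(t)}, \beta_1^{(t)}, \dots, \beta_K^{(t)}, \sigma^{(t)}, X, Y)$
· ...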
It can be shown this chain converges to a stationary distribution (e.g. Casella & George
1992, Tierney 1994). In the IRT context the full conditionals simplify considerably because of
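independence between individuals. That is, the transition probabilities for each type of the $N + K + 1$
parameters are given by
$$
\begin{aligned}
p(\theta_i \mid \beta, \sigma, X_i, Y_i) &= \frac{\prod_j P_j(x_{ij} \mid \theta_i)\, P(\theta_i \mid \beta, \sigma, Y_i)\, P(\beta)\, P(\sigma)}{\int \prod_j P_j(x_{ij} \mid \theta_i)\, P(\theta_i \mid \beta, \sigma, Y_i)\, P(\beta)\, P(\sigma)\, d\theta_i} && (21) \\
p(\beta_k \mid \theta, \beta_{-k}, \sigma, X, Y) &= \frac{\prod_i \Big\{ \Big[ \prod_j P_j(x_{ij} \mid \theta_i) \Big] P(\theta_i \mid \beta, \sigma, Y_i) \Big\}\, P(\beta)\, P(\sigma)}{\int \prod_i \Big\{ \Big[ \prod_j P_j(x_{ij} \mid \theta_i) \Big] P(\theta_i \mid \beta, \sigma, Y_i) \Big\}\, P(\beta)\, P(\sigma)\, d\beta_k} && (22) \\
p(\sigma \mid \theta, \beta, X, Y) &= \frac{\prod_i \Big\{ \Big[ \prod_j P_j(x_{ij} \mid \theta_i) \Big] P(\theta_i \mid \beta, \sigma, Y_i) \Big\}\, P(\beta)\, P(\sigma)}{\int \prod_i \Big\{ \Big[ \prod_j P_j(x_{ij} \mid \theta_i) \Big] P(\theta_i \mid \beta, \sigma, Y_i) \Big\}\, P(\beta)\, P(\sigma)\, d\sigma} && (23)
\end{aligned}
$$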
If sampling from these full conditional distributions is easy, Gibbs sampling provides a means to
generate a sample from the posterior of each parameter.
In practice, computing the normalizing constant in the denominator of each conditional may
be difficult--e.g. a closed form solution may not exist. The MH-within-Gibbs algorithm avoids
this complication by inserting a Metropolis step when sampling from the full conditionals. Chib &
Greenberg (1995) provide an excellent pedagogic introduction to Metropolis-Hastings algorithms.
A representative example of the algorithm for parameter $\theta_i$ is:
1. Simulate
$$\tilde{\theta}_i = \theta_i^{(t)} + \eta_i, \qquad \eta_i \sim N(0, s_i) \qquad (24)$$
2. Accept the proposed value as follows:
$$\theta_i^{(t+1)} = \begin{cases} \theta_i^{(t)} & \text{with probability } 1 - \alpha \\ \tilde{\theta}_i & \text{with probability } \alpha \end{cases} \qquad (25)$$
where
$$\alpha = \min\left\{ 1, \frac{p(\tilde{\theta}_i \mid \beta^{(t)}, \sigma^{(t)}, X_i, Y_i)}{p(\theta_i^{(t)} \mid \beta^{(t)}, \sigma^{(t)}, X_i, Y_i)} \right\} \qquad (26)$$
By using a symmetric proposal distribution $N(0, s_i)$ the normal MH criterion does not include
the proposal distribution. Moreover, note that by substituting (21) into (26) we are left
with an algorithm that includes only known functions since the denominator cancels. We can
therefore easily compute $\alpha$ and simulate a Markov chain that converges to the posterior of
interest. The MH steps for the regression parameters $\beta_k$ and $\sigma$ are completely analogous. For a
more comprehensive description of MCMC methods applied to IRT problems see Patz & Junker
(1999a,b).
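The Metropolis step for a single ability parameter can be sketched as follows. This is a simplified illustration rather than the routine used for the paper: the regression parameters are held fixed, so that the prior mean mu stands in for the fitted value of the linear model and sigma for its residual standard deviation, and the item parameters, step size, scaling constant D = 1.7 and function names are hypothetical.

```python
import math
import random

def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability, with the logistic written in an overflow-safe form."""
    z = D * a * (theta - b)
    logistic = 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))
    return c + (1.0 - c) * logistic

def log_kernel(theta, responses, items, mu, sigma):
    """Log numerator of the full conditional for theta, up to constants:
    item likelihood times a normal prior with mean mu and sd sigma."""
    total = -0.5 * ((theta - mu) / sigma) ** 2
    for x, (a, b, c) in zip(responses, items):
        p = min(max(p_3pl(theta, a, b, c), 1e-12), 1.0 - 1e-12)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def mh_chain(responses, items, mu, sigma, n_draws, step_sd, rng, theta0=0.0):
    """Random-walk Metropolis chain for one individual's ability."""
    theta = theta0
    current = log_kernel(theta, responses, items, mu, sigma)
    draws, accepted = [], 0
    for _ in range(n_draws):
        proposal = theta + rng.gauss(0.0, step_sd)      # symmetric proposal
        candidate = log_kernel(proposal, responses, items, mu, sigma)
        alpha = math.exp(min(0.0, candidate - current)) # acceptance probability
        if rng.random() < alpha:
            theta, current = proposal, candidate
            accepted += 1
        draws.append(theta)                             # keep current value either way
    return draws, accepted / n_draws

# Five hypothetical items; one student answers all correctly, another none.
items = [(1.0, -1.0, 0.2), (1.0, -0.5, 0.2), (1.0, 0.0, 0.2),
         (1.0, 0.5, 0.2), (1.0, 1.0, 0.2)]
rng = random.Random(12345)
high, rate_high = mh_chain([1, 1, 1, 1, 1], items, 0.0, 1.0, 4000, 1.5, rng)
low, rate_low = mh_chain([0, 0, 0, 0, 0], items, 0.0, 1.0, 4000, 1.5, rng)
eap_high = sum(high[2000:]) / 2000
eap_low = sum(low[2000:]) / 2000
```

Only the ratio of kernels enters the acceptance probability, so the normalizing integral in the denominator of the full conditional never has to be computed. Unlike the MLE, the posterior mean for the all-incorrect response vector is finite, because the prior pulls the estimate toward mu.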
To compute the EAP and plausible values estimates we ran a chain of 4,000 observations,
discarding the first 2,000 as a burn-in period. As part of the linear model, we included private
school attendance, age, age-squared, family size, family size squared, gender, father literacy,
mother literacy, wealth category, caste, state, school facilities category, an intercept and a missing
data dummy as explanatory variables. Including these manifest predictors makes our estimates
more precise and is required for subsequent analysis using plausible values to be valid (Mislevy,
Beaton, Kaplan & Sheehan 1992). We assumed flat priors for the $\beta$ and $\sigma$ parameters, making
the EAP estimates analogous to empirical Bayes, although this assumption has little effect since
the data dominate the prior for these parameters. To ensure convergence, we experimented
with the proposal distribution variances until the acceptance rates averaged around 44 percent
with no significant outliers. Visually checking the chain graphs and running multiple chains
and comparing the results confirmed that the chains rapidly converged after several hundred
observations and autocorrelations were modest. Finally, we averaged the last 2,000 observations
to compute the EAP estimate. Even with this relatively modest chain length, the Monte
Carlo error was tiny compared to the variance associated with each score. We also took five
evenly spaced draws from the posterior as plausible values.
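The post-processing of a chain into an EAP estimate and plausible values can be sketched as below. The function name is hypothetical, and the particular spacing rule (the last draw of each of five equal blocks of the retained chain) is an assumption, since the text says only that the draws were evenly spaced.

```python
def summarize_chain(chain, burn_in, n_plausible=5):
    """Return (EAP, plausible values) from one individual's chain of ability draws."""
    kept = chain[burn_in:]                    # discard the burn-in period
    eap = sum(kept) / len(kept)               # posterior mean
    spacing = len(kept) // n_plausible
    # Take the last draw of each of n_plausible equal-length blocks.
    plausible = [kept[(k + 1) * spacing - 1] for k in range(n_plausible)]
    return eap, plausible

# Toy chain 0, 1, ..., 3999 standing in for 4,000 MCMC draws.
toy_chain = [float(t) for t in range(4000)]
toy_eap, toy_pv = summarize_chain(toy_chain, burn_in=2000)
```

Spacing the plausible values far apart in the chain reduces the autocorrelation between them, so that they behave more like independent draws from the posterior.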
References
Aggarwal, Y. (2000), Primary Education in Delhi. How much do the Children learn?, NIEPA,
New Delhi.
Bashir, S. (1994), `Achievement Performance at the Primary Level in Public and Private Schools
of Tamil Nadu', Indian Education Review 29(3-4), 126.
Bedard, K. & Ferrall, C. (2003), `Wage and test score dispersion: some international evidence',
Economics of Education Review 22(1), 31-43.
Birnbaum, A. (1968), Some Latent Trait Models and Their Use in Inferring an Examinee's
Ability, in F. M. Lord & M. R. Novick, eds, `Statistical Theories of Mental Test Scores',
Addison-Wesley Publishing Company.
Blau, F. D. & Kahn, L. M. (2005), `Do Cognitive Test Scores Explain Higher US Wage Inequality?',
The Review of Economics and Statistics 87(1), 184-193.
Bock, R. & Aitkin, M. (1981), `Marginal maximum likelihood estimation of item parameters:
An application of an EM algorithm', Psychometrika 46, 443-459.
Brown, G. & Micklewright, J. (2004), `Using International Surveys of Achievement and Literacy:
A View from the Outside', UNESCO Institute for Statistics Working Paper .
Casella, G. & George, E. (1992), `Explaining the Gibbs Sampler', The American Statistician
46(3), 167-174.
Chib, S. & Greenberg, E. (1995), `Understanding the Metropolis-Hastings Algorithm', The American
Statistician 49(4), 327-335.
Chinapah, V., H'ddigui, E. M., Kanjee, A., Falayajo, W., Fomba, C. O., Hamissou, O., Rafali-
manana, A. & Byomugisha, A. (2000), With Africa for Africa. Towards Quality Education for
All, Human Sciences Research Council, Pretoria, South Africa.
Debroy, B. & Bhandari, L. (2007), `Exclusive growth inclusive inequality', Center for Policy
Research Working Paper .
Devroye, D. & Freeman, R. (2001), `Does Inequality in Skills Explain Inequality in Earnings
Across Advanced Countries?'.
Drasgow, F. & Lissak, R. (1983), `Modified parallel analysis: A procedure for examining the
latent dimensionality of dichotomously scored item responses', Journal of Applied Psychology
68(3), 363-373.
Filmer, D., Hasan, A. & Pritchett, L. (2006), `A Millennium Learning Goal: Measuring Real
Progress in Education', CGD Working Paper 97 .
Govinda, R. & Varghese, N. (1993), Quality of Primary Schooling in India: A Case Study of
Madhya Pradesh, International Institute for Educational Planning, UNESCO; NIEPA, Na-
tional Institute of Educational Planning and Administration.
Goyal, S. (2007), Learning achievements in India: A study of primary education in Rajasthan,
Technical report, Human Development Unit, South Asia Region. The World Bank.
Hasan, A. (1995), Baseline Survey of Learning Achievement in Primary Grades in Bihar, AN
Sinha Institute of Social Studies, Patna.
Jalan, J. & Murgai, R. (2007), `Intergenerational mobility in education in India', Processed.
Delhi: The World Bank.
Kijima, Y. (2006), `Why did wage inequality increase? Evidence from urban India 1983-99',
Journal of Development Economics 81(1), 97-117.
Micklewright, J. & Schnepf, S. V. (2006), `Inequality of learning in industrialized countries', IZA
Discussion Paper No. 2517 .
Mislevy, R. (1991), `Randomization-based inference about latent variables from complex samples',
Psychometrika 56(2), 177-196.
Mislevy, R., Beaton, A., Kaplan, B. & Sheehan, K. (1992), `Estimating Population Characteristics
from Sparse Matrix Samples of Item Responses', Journal of Educational Measurement
29(2), 133-161.
Mislevy, R., Johnson, E. & Muraki, E. (1992), `Scaling Procedures in NAEP', Journal of Educational
Statistics 17(2), 131-154.
Mullis, I. V. & Martin, M. O. (2000), Item Analysis and Review, in M. O. Martin, K. D.
Gregory & S. E. Stemler, eds, `TIMSS 1999 Technical Report', International Study Center
Boston College, Chestnut Hill, Massachusetts, pp. 225234.
Nickell, S. (2004), `Poverty and Worklessness in Britain', Economic Journal 114(494), C1C25.
Patz, R. & Junker, B. (1999a), `A Straightforward Approach to Markov Chain Monte Carlo Meth-
ods for Item Response Models', Journal of Educational and Behavioral Statistics 24(2), 146.
Patz, R. & Junker, B. (1999b), `Applications and Extensions of MCMC in IRT: Multiple Item
Types, Missing Data, and Rated Responses', Journal of Educational and Behavioral Statistics
24(4), 342.
Pratham (2006), Annual Status of Education Report, Pratham, Mumbai.
Pritchett, L. (2004), `Towards a New Consensus for Addressing the Global Challenge of the Lack
of Education', Copenhagen Consensus Challenge Paper .
Shukla, S., Garg, V., Jain, V., Rajput, S. & Arora, O. (1994), Attainment of Primary School
Children in India, National Council of Educational Research and Training, New Delhi.
Strauss, J. & Burger, M. (2000), Monitoring Learning Achievement Project, Pretoria: Depart-
ment of Education.
Tierney, L. (1994), `Markov Chains for Exploring Posterior Distributions', The Annals of Statis-
tics 22(4), 17011728.
Wu, K. B., Goldschmidt, P., Boscardin, C. K. & Azam, M. (2007), Girls in india: Poverty,
location and social disparities, in M. A. Lewis & M. E. Lockheed, eds, `Exclusion, Gender
and Education: Case Studies from the Developing World.', Center For Global Development,
Washington D.C.
Wu, K. B., Goldschmidt, P., Boscardin, C. K. & Sankar, D. (2006), Student achievement in
mathematics and its determinants in Rajasthan and Orissa, in `Report on the Survey of Public
and Private Secondary and Senior Secondary Schools', The World Bank. Processed.
Yamamoto, K. & Kulick, E. (2000), Scaling Methodology and Procedures for the TIMSS
Mathematics and Science Scales, in M. O. Martin, K. D. Gregory & S. E. Stemler, eds, `TIMSS 1999
Technical Report', International Study Center Boston College, Chestnut Hill, Massachusetts,
pp. 237-264.
Yen, W., Burket, G. & Sykes, R. (1991), `Nonunique solutions to the likelihood equation for the
three-parameter logistic model', Psychometrika 56(1), 39-54.
TABLE 1. ITEM COMPARISON OF 2005 ASSESSMENT OF 9TH GRADE MATHEMATICS IN
RAJASTHAN & ORISSA WITH TIMSS 1999 ASSESSMENT OF 8TH GRADE MATHEMATICS
Q No.   Content Category                               Performance Expectation             Rajasthan (All Students)  Orissa (All Students)  International Average
No. 1   Data Representation, Analysis and Probability  Using Complex Procedures            33                        32                     60
No. 2   Fractions and Number Sense                     Knowing                             11                        17                     46
No. 3   Geometry                                       Using Complex Procedures            26                        31                     59
No. 4   Algebra                                        Knowing                             48                        47                     65
No. 5   Geometry                                       Investigating and Solving Problems  39                        48                     62
No. 6   Algebra                                        Knowing                             32                        30                     50
No. 7   Fractions and Number Sense                     Investigating and Solving Problems  23                        32                     33
No. 8   Data Representation, Analysis and Probability  Knowing                             43                        24                     57
No. 9   Measurement                                    Knowing                             32                        20                     49
No. 10  Algebra                                        Investigating and Solving Problems  35                        36                     47
No. 11  Fractions and Number Sense                     Knowing                             30                        21                     50
No. 12  Data Representation, Analysis and Probability  Using Complex Procedures            54                        46                     64
No. 13  Algebra                                        Knowing                             24                        40                     49
No. 14  Measurement                                    Investigating and Solving Problems  29                        36                     42
No. 15  Geometry                                       Knowing                             38                        48                     54
No. 16  Fractions and Number Sense                     Using Routine Procedures            16                        26                     39
No. 17  Geometry                                       Using Routine Procedures            36                        36                     58
No. 18  Algebra                                        Using Routine Procedures            38                        51                     65
No. 19  Fractions and Number Sense                     Using Complex Procedures            55                        54                     75
No. 20  Data Representation, Analysis and Probability  Using Complex Procedures            43                        39                     58
No. 21  Algebra                                        Communicating and Reasoning         28                        39                     45
No. 22  Algebra                                        Using Routine Procedures            32                        36                     33
No. 23  Geometry                                       Investigating and Solving Problems  23                        31                     40
No. 24  Fractions and Number Sense                     Using Routine Procedures            40                        35                     52
No. 25  Fractions and Number Sense                     Knowing                             47                        49                     61
No. 26  Measurement                                    Knowing                             51                        49                     60
No. 27  Fractions and Number Sense                     Investigating and Solving Problems  32                        37                     44
No. 28  Measurement                                    Investigating and Solving Problems  19                        31                     22
No. 29  Algebra                                        Knowing                             59                        66                     71
No. 30  Geometry                                       Using Routine Procedures            25                        23                     37
No. 31  Algebra                                        Knowing                             33                        43                     57
No. 32  Fractions and Number Sense                     Investigating and Solving Problems  34                        39                     45
No. 33  Data Representation, Analysis and Probability  Using Complex Procedures            31                        31                     79
No. 34  Algebra                                        Knowing                             17                        29                     37
No. 35  Geometry                                       Using Complex Procedures            25                        28                     46
No. 36  Algebra                                        Knowing                             32                        40                     47
Average                                                                                    34                        37                     52
Source: This table, from Wu et al (2006), summarizes the test results from the Rajasthan and Orissa
Secondary School Survey, 2005 and TIMSS 1999.
FIGURE 1. THREE-PARAMETER LOGISTIC (3PL) ITEM RESPONSE FUNCTION
Notes: The parameters a, b and c represent the item discrimination, difficulty and pseudo guessing
parameters, respectively.
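As a minimal sketch of the function named in Figure 1, the 3PL item response function can be written as below. The scaling constant D = 1.7 and the example parameter values are assumptions for illustration (the figure notes name only a, b and c), and ability is expressed here on a standardized scale rather than the 0-800 reporting scale.

```python
import math


def irf_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct answer under the 3PL model.

    a: discrimination, b: difficulty, c: pseudo-guessing (lower asymptote).
    D is a scaling constant; 1.7 is a common convention and is assumed here.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical item: at theta == b the probability lies halfway between c and 1.
p_at_difficulty = irf_3pl(theta=0.0, a=1.0, b=0.0, c=0.2)
```

With these illustrative values, a student whose ability equals the item difficulty answers correctly with probability halfway between the guessing floor c and 1, and very low-ability students approach c rather than zero.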
FIGURE 2. RELATIONSHIP BETWEEN PERCENT CORRECT, ABILITY, AND THE TEST
CHARACTERISTIC CURVE
[Figure: three stacked panels plotting percent correct (y-axis, 0 to 1) against ability (x-axis, 0 to 800). Panel 1: Item 19 (easier, lower discrimination). Panel 2: Item 21 (harder, higher discrimination). Panel 3: test characteristic curve (both items).]
Notes: The top two panels give the item characteristic curves for items 19 and 21. The bottom panel shows
the test characteristic curve of an exam which only presents these two items. One can read the link between
the percent correct and latent ability using the x- and y-axes (dashed lines).
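The link between percent correct and ability described in these notes can be sketched as follows: the test characteristic curve is the average of the item response functions, and because it is increasing in ability it can be inverted numerically. The two items below are hypothetical stand-ins for an easier, less discriminating item and a harder, more discriminating one; they are not the TIMSS parameters for items 19 and 21, and D = 1.7 is an assumed scaling constant.

```python
import math


def irf_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct answer (D = 1.7 is an assumed convention)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def tcc(theta, items):
    """Expected proportion correct on a test made of `items` (a, b, c) tuples."""
    return sum(irf_3pl(theta, a, b, c) for a, b, c in items) / len(items)


def ability_from_percent_correct(p, items, lo=-6.0, hi=6.0, tol=1e-8):
    """Invert the test characteristic curve by bisection."""
    if not tcc(lo, items) < p < tcc(hi, items):
        raise ValueError("percent correct is outside the range the curve can reach")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tcc(mid, items) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical items: (discrimination, difficulty, guessing).
example_items = [(0.8, -0.5, 0.2), (1.5, 0.5, 0.2)]
expected_share = tcc(0.3, example_items)
recovered_theta = ability_from_percent_correct(expected_share, example_items)
```

A percent correct at or below the guessing floor has no finite ability estimate, which is why the sketch raises an error in that case rather than returning a number.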
FIGURE 3. PRECISION OF MLE SCORES
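[Figure: MLE mathematics achievement (x-axis, 0 to 800) plotted against itself as a 45-degree line (right axis, Mathematics Achievement, 0 to 800) with dotted 95% confidence interval bands, overlaid on a histogram of MLE scores (left axis, Density, 0 to .004).]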
Notes: Plot of MLE achievement on MLE achievement (slope=1) with upper and lower 95% confidence
intervals for individual scores plotted as dotted lines. The MLE score histogram (an overestimate of the true
population distribution) is plotted in gray. MLE standard errors are computed from the inverse Fisher
information matrix and thus graphed as symmetrical +/- 1.96*se. In reality, the precision is not
symmetrical; large standard errors arise because it is difficult to discriminate between low and very low
achievers, and between high and very high achievers.
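A rough sketch of how such standard errors arise: under the 3PL model each item contributes Fisher information that depends on ability, and the standard error of the MLE is the inverse square root of the summed information. The three items and D = 1.7 below are hypothetical choices for illustration, not the test used in the paper.

```python
import math


def irf_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct answer (D = 1.7 is an assumed convention)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c, D=1.7):
    """Fisher information of one 3PL item at ability theta."""
    p = irf_3pl(theta, a, b, c, D)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def mle_standard_error(theta, items):
    """Asymptotic standard error of the ability MLE: 1 / sqrt(total information)."""
    total = sum(item_information(theta, a, b, c) for a, b, c in items)
    return 1.0 / math.sqrt(total)


def confidence_interval(theta_hat, items, z=1.96):
    """Symmetric interval theta_hat +/- z * se, as graphed in the figure."""
    se = mle_standard_error(theta_hat, items)
    return theta_hat - z * se, theta_hat + z * se


# Hypothetical test: three items of moderate discrimination with guessing.
example_items = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
se_low = mle_standard_error(-3.0, example_items)
se_mid = mle_standard_error(0.0, example_items)
se_high = mle_standard_error(3.0, example_items)
```

In this example the standard error is smallest near the middle of the item difficulties and grows toward both tails, with the low tail the least precise because guessing masks differences among very low achievers.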
FIGURE 4. DISTRIBUTION OF MLE, EAP AND PLAUSIBLE VALUE SCORES
[Figure: kernel densities (y-axis, Density, 0 to .005) of Score (MLE), Score (EAP) and Score (PV) over mathematics achievement (x-axis, 0 to 800).]
Notes: The MLE, EAP, and PV score distributions are represented by a kernel density. As discussed in the
text, the true population distribution is bounded by the MLE and EAP estimates and given by the PV
estimates. The PV kernel density was averaged over five plausible values per student. For reference, the
average international score is 487 in 2001 and 467 in 2003.
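The difference between EAP and plausible value (PV) scores can be illustrated with a small simulation. This is only a sketch under stated assumptions: a standard normal prior, a grid approximation to each student's posterior, ten hypothetical 3PL items, and no conditioning variables, whereas operational plausible value procedures are more elaborate. EAP is the posterior mean; each plausible value is a random draw from the same posterior.

```python
import math
import random
import statistics


def irf_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct answer (D = 1.7 is an assumed convention)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def posterior_on_grid(responses, items, grid):
    """Normalized posterior weights over grid points under a N(0, 1) prior."""
    weights = []
    for t in grid:
        log_w = -0.5 * t * t  # log prior density up to a constant
        for x, (a, b, c) in zip(responses, items):
            p = irf_3pl(t, a, b, c)
            log_w += math.log(p if x else 1.0 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)
    return [w / total for w in weights]


def simulate_scores(n_students=400, n_pv=5, seed=0):
    """Simulate students, then return their EAP scores and plausible values."""
    rng = random.Random(seed)
    items = [(1.0, -1.8 + 0.4 * j, 0.2) for j in range(10)]  # hypothetical items
    grid = [-4.0 + 0.1 * k for k in range(81)]
    eaps, pvs = [], []
    for _ in range(n_students):
        theta = rng.gauss(0.0, 1.0)
        responses = [int(rng.random() < irf_3pl(theta, *it)) for it in items]
        post = posterior_on_grid(responses, items, grid)
        eaps.append(sum(t * w for t, w in zip(grid, post)))
        pvs.append(rng.choices(grid, weights=post, k=n_pv))
    return eaps, pvs


eaps, pvs = simulate_scores()
var_eap = statistics.pvariance(eaps)
# Average the variance across the five sets of plausible values.
var_pv = statistics.mean([statistics.pvariance([pv[k] for pv in pvs]) for k in range(5)])
```

In a simulation of this kind the EAP scores are shrunk toward the prior mean and so are less dispersed than the plausible values, which is consistent with the notes' description of EAP as one bound on the population distribution.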
FIGURE 5. AVERAGE INTERNATIONAL MATHEMATICS ACHIEVEMENT, INCLUDING ORISSA
AND RAJASTHAN
Singapore
Korea, Rep. of
Hong Kong, SAR
Chinese Taipei
Japan
Belgium (Flemish)
Netherlands
Estonia
Canada
Hungary
Finland
Czech Republic
Slovak Republic
Russian Federation
Malaysia
Latvia
Australia
United States
Lithuania
Sweden
Scotland
Israel
England
New Zealand
Slovenia
Italy
Armenia
Serbia
Bulgaria
Romania
Thailand
Norway
Moldova, Rep. of
Cyprus
Macedonia, Rep. of
Lebanon
Turkey
Jordan
Iran, Islamic Rep. of
Indonesia
Tunisia
Egypt
Orissa
Bahrain
India* (see notes)
Palestinian Nat'l Auth.
Morocco
Chile
Rajasthan
Philippines
Botswana
Saudi Arabia
Ghana
South Africa
[Horizontal bar chart: countries listed above in descending order of average score; x-axis: Mathematics Achievement, 0 to 600.]
Source: TIMSS Mathematics 2001, Grade 8, Exhibit 1.1; TIMSS Mathematics 2003, Grade 8, Exhibit D.1;
authors' calculations from Rajasthan and Orissa Secondary School Survey, Grade 9, 2005. Most recent year
is used when both are available.
Notes: *India is given purely as a best guess and assumes the tested sample within Orissa and Rajasthan
represents India as a whole. We only report EAP estimates of the mean since these are most closely
analogous to the TIMSS methodology.
FIGURE 6. BOTTOM 5TH PERCENTILE OF MATHEMATICS ACHIEVEMENT, INCLUDING
ORISSA AND RAJASTHAN
Singapore
Hong Kong, SAR
Korea, Rep. of
Japan
Netherlands
Estonia
Chinese Taipei
Hungary
Belgium (Flemish)
Malaysia
Latvia
Russian Federation
Sweden
Slovenia
England
Slovak Republic
Lithuania
United States
Scotland
Australia
New Zealand
Italy
Israel
Norway
Bulgaria
Armenia
Serbia
Lebanon
Romania
Moldova, Rep. of
Cyprus
Tunisia
Iran, Islamic Rep. of
Macedonia, Rep. of
Jordan
Bahrain
Morocco
Indonesia
Chile
Egypt
Botswana
Philippines
Palestinian Nat'l Auth.
Orissa
Rajasthan
Saudi Arabia
Ghana
South Africa
[Horizontal bar chart: countries listed above in descending order of the 5th percentile score; x-axis: Bottom 5% Mathematics Achievement, 0 to 500.]
Source: TIMSS Mathematics 2001, Grade 8, Exhibit 1.1; TIMSS Mathematics 2003, Grade 8, Exhibit D.1;
authors' calculations from Rajasthan and Orissa Secondary School Survey, Grade 9, 2005. Most recent year
is used when both are available.
Notes: The Indian percentiles are calculated using the plausible values methodology described in the text.
FIGURE 7. 95TH - 5TH PERCENTILE SPREAD OF MATHEMATICS ACHIEVEMENT, INCLUDING
ORISSA AND RAJASTHAN
South Africa
Orissa
Rajasthan
Chinese Taipei
Egypt
Palestinian Nat'l Auth.
Ghana
Romania
Serbia
Indonesia
Macedonia, Rep. of
Jordan
Philippines
Bulgaria
Israel
Korea, Rep. of
Armenia
Chile
Slovak Republic
Singapore
United States
Australia
Cyprus
Moldova, Rep. of
Japan
New Zealand
Lithuania
Hungary
Saudi Arabia
England
Russian Federation
Italy
Bahrain
Scotland
Belgium (Flemish)
Iran, Islamic Rep. of
Malaysia
Latvia
Sweden
Hong Kong, SAR
Botswana
Slovenia
Norway
Estonia
Netherlands
Morocco
Lebanon
Tunisia
[Horizontal bar chart: countries listed above in descending order of spread; x-axis: 95%-5% Spread in Mathematics Achievement, 0 to 400.]
Source: TIMSS Mathematics 2001, Grade 8, Exhibit 1.1; TIMSS Mathematics 2003, Grade 8, Exhibit D.1;
authors' calculations from Rajasthan and Orissa Secondary School Survey, Grade 9, 2005. Most recent year
is used when both are available.
Notes: The Indian 95%-5% spread is calculated using the plausible values methodology described in the text.
TABLE 2. TIMSS 2003 INTERNATIONAL BENCHMARKS OF MATHEMATICS ACHIEVEMENT
Advanced International Benchmark 625
Students can organize information, make generalizations, solve non-routine problems, and draw and
justify conclusions from data. They can compute percent change and apply their knowledge of numeric
and algebraic concepts and relationships to solve problems. Students can solve simultaneous linear
equations and model simple situations algebraically. They can apply their knowledge of measurement
and geometry in complex problem situations. They can interpret data from a variety of tables and
graphs, including interpolation and extrapolation.
High International Benchmark 550
Students can apply their understanding and knowledge in a wide variety of relatively complex
situations. They can order, relate, and compute with fractions and decimals to solve word problems,
operate with negative integers, and solve multi-step word problems involving proportions with whole
numbers. Students can solve simple algebraic problems including evaluating expressions, solving
simultaneous linear equations, and using a formula to determine the value of a variable. Students can
find areas and volumes of simple geometric shapes and use knowledge of geometric properties to solve
problems. They can solve probability problems and interpret data in a variety of graphs and tables.
Intermediate International Benchmark 475
Students can apply basic mathematical knowledge in straightforward situations. They can add, subtract,
or multiply to solve one-step word problems involving whole numbers and decimals. They can identify
representations of common fractions and relative sizes of fractions. They understand simple algebraic
relationships and solve linear equations with one variable. They demonstrate understanding of
properties of triangles and basic geometric concepts including symmetry and rotation. They recognize
basic notions of probability. They can read and interpret graphs, tables, maps, and scales.
Low International Benchmark 400
Students have some basic mathematical knowledge.
Source: Exhibit 2.1, TIMSS 2003.
TABLE 3. PERCENT OF ENROLLED CHILDREN PASSING INTERNATIONAL MATHEMATICS
BENCHMARKS, INCLUDING ORISSA AND RAJASTHAN
Country    Low International Benchmark (>400)    Intermediate International Benchmark (>475)    High International Benchmark (>550)    Advanced International Benchmark (>625)
Singapore 99 93 77 44
Korea, Rep. of 98 90 70 35
Hong Kong, SAR 98 93 73 31
Japan 98 88 62 24
Netherlands 97 80 44 10
Estonia 97 79 39 9
Chinese Taipei 96 85 66 38
Hungary 95 75 41 11
Belgium (Flemish) 95 82 47 9
Malaysia 93 66 30 6
Latvia 93 68 29 5
Russian Federation 92 66 30 6
Sweden 91 64 24 3
Slovak Republic 90 66 31 8
Australia 90 65 29 7
United States 90 64 29 7
Lithuania 90 63 28 5
Scotland 90 63 25 4
Slovenia 90 60 21 3
New Zealand 88 59 24 5
Israel 86 60 27 6
Italy 86 56 19 3
Bulgaria 82 51 19 3
Armenia 82 54 21 2
Norway 81 44 10 0
Serbia 80 52 21 4
Romania 79 52 21 4
Cyprus 77 45 13 1
Moldova, Rep. of 77 45 13 1
Lebanon 68 27 4 0
Macedonia, Rep. Of 66 34 9 1
Jordan 60 30 8 1
Indonesia 55 24 6 1
Iran, Islamic Rep. of 55 20 3 0
Tunisia 55 15 1 0
Egypt 52 24 6 1
Bahrain 51 17 2 0
Orissa 50 27 9 1
Palestinian, Nat'l Auth. 46 19 4 0
Rajasthan 42 17 4 1
Morocco 42 10 1 0
Chile 41 15 3 0
Philippines 39 14 3 0
Botswana 32 7 1 0
Saudi Arabia 19 3 0 0
South Africa 10 6 2 0
Ghana 9 2 0 0
Source: TIMSS Mathematics 2003, Grade 8, Exhibit 2.2 and authors' calculations from Rajasthan and Orissa
Secondary School Survey, Grade 9, 2005.
Notes: Countries ranked by percent passing low benchmark. Estimates based on plausible values. All
percentiles are for enrolled and tested children only.
TABLE 4. ESTIMATED NUMBER OF 14-YEAR OLDS PASSING INTERNATIONAL
MATHEMATICS BENCHMARKS, IN THOUSANDS
Country    Not Enrolled or Below Low International Benchmark (<400)    Low International Benchmark (>400)    Intermediate International Benchmark (>475)    High International Benchmark (>550)    Advanced International Benchmark (>625)
Japan 26 1189 1068 753 291
United States 792 3316 2358 1069 258
Korea, Rep. of 81 633 581 452 226
Chinese Taipei 28 291 257 200 115
India* 17589 4634 2216 705 101
Russian Federation 389 1080 775 352 70
Indonesia 3128 1424 622 155 26
Malaysia 155 363 258 117 23
Singapore 0 52 48 40 23
Hong Kong, SAR 19 59 56 44 19
Netherlands 27 174 144 79 18
Australia 64 215 155 69 17
Italy 112 436 284 96 15
Egypt 967 675 312 78 13
Hungary 16 100 79 43 12
Belgium (Flemish) 10 111 96 55 11
Romania 87 153 101 41 8
Israel 26 83 58 26 6
Serbia 52 82 53 21 4
Slovak Republic 18 50 37 17 4
Sweden 12 104 73 27 3
New Zealand 10 49 33 13 3
Bulgaria 20 52 33 12 2
Lithuania 7 38 26 12 2
Armenia 14 38 25 10 1
Jordan 70 66 33 9 1
Latvia 6 19 14 6 1
Estonia 2 13 10 5 1
Slovenia 3 17 11 4 1
Philippines 1528 478 172 37 0
Iran, Islamic Rep. of 787 592 215 32 0
South Africa 900 59 35 12 0
Chile 194 89 33 7 0
Norway 13 49 26 6 0
Moldova, Rep. of 25 38 22 6 0
Morocco 597 103 25 2 0
Lebanon 31 34 13 2 0
Macedonia, Rep. of 14 16 8 2 0
Tunisia 120 70 19 1 0
Cyprus 3 9 5 1 0
Saudi Arabia 535 59 9 0 0
Ghana 521 18 4 0 0
Botswana 36 9 2 0 0
Bahrain 7 6 2 0 0
* We assume that the average passing rates for Rajasthan and Orissa approximate the percentage of
children passing each benchmark for India as a whole.
Notes: Cells contain the estimated number of 14-year olds passing each benchmark, in thousands, based
on Table 3, the net enrollment rate (WDI, 2005), and population data (U.S. Census Bureau, Population
Division, International Programs Center, 2005). We assume children who are not enrolled do not pass the
low-benchmark. Scotland and the Palestinian National Authority Territories were dropped for lack of
population data. Enrollment rate is for the most recent reported years or imputed if only the gross rate is
available.
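The arithmetic behind each cell can be sketched as follows: the number passing a benchmark is the population of 14-year olds times the net enrollment rate times the share of enrolled children passing, and everyone else falls into the first column. The function name and the input values below are hypothetical round numbers chosen for illustration; they are not figures from the table.

```python
def benchmark_counts(population_thousands, net_enrollment_rate, pct_passing):
    """Estimated children (in thousands) passing each benchmark.

    pct_passing maps benchmark name -> percent of ENROLLED children passing.
    Children who are not enrolled are assumed not to pass the low benchmark.
    """
    enrolled = population_thousands * net_enrollment_rate
    counts = {name: enrolled * pct / 100.0 for name, pct in pct_passing.items()}
    counts["not_enrolled_or_below_low"] = population_thousands - counts["low"]
    return counts


# Hypothetical country: 1,000 thousand 14-year olds, 80 percent enrolled.
example = benchmark_counts(
    population_thousands=1000.0,
    net_enrollment_rate=0.8,
    pct_passing={"low": 50, "intermediate": 25, "high": 10, "advanced": 1},
)
```

In this illustration 400 thousand children pass the low benchmark and the remaining 600 thousand are either not enrolled or below it.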
FIGURE 8. COLLAPSING THE SCORE DISTRIBUTION
[Figure: "What Would Happen if Children Were..." Scores: actual and counterfactuals. Four histogram panels (y-axis: Percent, 0 to 15): Panel 1, as they are; Panel 2, in same district; Panel 3, in same district with same family characteristics; Panel 4, in same school with same family characteristics.]
Notes: Panel 1 shows the distribution of MLE math scores for all children. We use MLE scores since fixed
effects were not included in the construction of plausible values (see Mislevy et al, 1992). Panel 2 shows the
residual distribution controlling for a district fixed effect. Panel 3 shows the residual distribution controlling
for a district fixed effect and child age, age squared, gender, caste, mother literacy, father literacy, and
household wealth. Panel 4 shows the residual distribution including an additional school fixed effect. A
considerable portion of the distribution in Panel 4 is due to measurement error.
TABLE 5. VARIANCE DECOMPOSITION: DISTRICTS, HOUSEHOLDS, SCHOOLS, AND
CHILDREN
Percent of variance explained by: Orissa Rajasthan
Same district 9 4
And household characteristics 6 4
And same school 32 41
Remaining (idiosyncratic) 53 52
Notes: Cells contain the percentage of variance (partial R-squared) explained by (a) a district fixed effect;
(b) a district fixed effect and child age, age-squared, gender, caste, mother literacy, father literacy, and
household wealth; and (c) a school fixed effect and child age, age squared, gender, caste, mother literacy,
father literacy, and household wealth. The idiosyncratic variation includes any remaining variation, a
substantial portion of which is measurement error.
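A simplified sketch of this kind of decomposition is given below. It uses only district and school fixed effects, with schools nested in districts, and omits the household covariates, so it is not the authors' specification. With fixed effects alone, the R-squared is the between-group sum of squares divided by the total sum of squares. The scores are made-up numbers.

```python
from collections import defaultdict


def between_group_ss(scores, keys):
    """Sum of squares of group means around the grand mean, weighted by group size."""
    grand = sum(scores) / len(scores)
    groups = defaultdict(list)
    for score, key in zip(scores, keys):
        groups[key].append(score)
    return sum(len(v) * (sum(v) / len(v) - grand) ** 2 for v in groups.values())


def variance_shares(scores, districts, schools):
    """Shares of variance: district, school beyond district, and remaining."""
    grand = sum(scores) / len(scores)
    total = sum((s - grand) ** 2 for s in scores)
    r2_district = between_group_ss(scores, districts) / total
    # Schools are nested in districts, so key each school by (district, school).
    r2_school = between_group_ss(scores, list(zip(districts, schools))) / total
    return {
        "district": r2_district,
        "school_given_district": r2_school - r2_district,
        "remaining": 1.0 - r2_school,
    }


# Made-up scores: two districts, two schools each, two children per school.
scores = [1, 3, 5, 7, 9, 11, 13, 15]
districts = ["A", "A", "A", "A", "B", "B", "B", "B"]
schools = [1, 1, 2, 2, 3, 3, 4, 4]
shares = variance_shares(scores, districts, schools)
```

The three shares sum to one by construction; adding household covariates, as in the table, would require a regression at each step rather than group means alone.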
TABLE A1. AVERAGE INTERNATIONAL MATHEMATICS ACHIEVEMENT IN 1999 AND 2003,
INCLUDING ORISSA AND RAJASTHAN
Country Average Score 2003 Average Score 1999
Singapore 605 (3.6) 604 (6.3)
Korea, Rep. of 589 (2.2) 587 (2.0)
Hong Kong, SAR 586 (3.3) 582 (4.3)
Chinese Taipei 585 (4.6) 585 (4.0)
Japan 570 (2.1) 579 (1.7)
Belgium (Flemish) 537 (2.8) 558 (3.3)
Netherlands 536 (3.8) 540 (7.1)
Canada 531 (2.5)
Estonia 531 (3.0)
Hungary 529 (3.2) 532 (3.7)
Finland 520 (2.7)
Czech Republic 520 (4.2)
Malaysia 508 (4.1) 519 (4.4)
Latvia 508 (3.2) 505 (3.4)
Russian Federation 508 (3.7) 526 (5.9)
Slovak Republic 508 (3.3) 534 (4.0)
Australia 505 (4.6) 525 (4.8)
United States 504 (3.3) 502 (4.0)
Lithuania 502 (2.5) 482 (4.3)
Sweden 499 (2.6)
England 496 (4.1)
Scotland 498 (3.7)
Israel 496 (3.4) 466 (3.9)
New Zealand 494 (5.3) 491 (5.2)
Slovenia 493 (2.2) 530 (2.8)
Italy 484 (3.2) 479 (3.8)
Armenia 478 (3.0)
Serbia 477 (2.6)
Bulgaria 476 (4.3) 511 (5.8)
Romania 475 (4.8) 472 (5.8)
Thailand 467 (5.1)
Norway 461 (2.5)
Moldova, Rep. of 460 (4.0) 469 (3.9)
Cyprus 459 (1.7) 476 (1.8)
Macedonia, Rep. Of 435 (3.5) 447 (4.2)
Lebanon 433 (3.1)
Turkey 429 (4.3)
Jordan 424 (4.1) 428 (3.6)
Iran, Islamic Rep. of 411 (2.4) 422 (3.4)
Indonesia 411 (4.8) 403 (4.9)
Tunisia 410 (2.2) 448 (2.4)
Egypt 406 (3.5)
Orissa 404 (1.7)
Bahrain 401 (1.7)
India* (see notes) 392 (1.1)
Palestinian, Nat'l Auth. 390 (3.1)
Chile 387 (3.3) 392 (4.4)
Morocco 387 (2.5) 337 (2.6)
Rajasthan 382 (1.4)
Philippines 378 (5.2) 345 (6.0)
Botswana 366 (2.6)
Saudi Arabia 332 (4.6)
Ghana 276 (4.7)
South Africa 264 (5.5) 275 (6.8)
Source: TIMSS Mathematics 2001, Grade 8, Exhibit 1.1; TIMSS Mathematics 2003, Grade 8, Exhibit D.1;
authors' calculations from Rajasthan and Orissa Secondary School Survey, Grade 9, 2005.
Notes: *India is given purely as a best guess and assumes the tested sample within Orissa and Rajasthan
represents India as a whole. We only report EAP estimates of the mean since these are most closely
analogous to the TIMSS methodology. Parentheses contain standard errors of the mean.
TABLE A2. DISTRIBUTION OF INTERNATIONAL MATHEMATICS ACHIEVEMENT IN 2003,
INCLUDING ORISSA AND RAJASTHAN
Country    5th percentile    25th percentile    50th percentile    75th percentile    95th percentile
Singapore 455(6.6) 556(6.7) 614(4.0) 662(3.5) 723(2.8)
Chinese Taipei 407(6.0) 518(7.0) 596(4.6) 657(5.0) 733(6.0)
Korea, Rep. of 439(3.1) 537(3.2) 596(2.5) 647(2.5) 715(3.0)
Hong Kong, SAR 455(11.9) 546(4.0) 593(3.3) 635(3.0) 691(4.6)
Japan 433(4.4) 519(2.0) 572(2.6) 623(2.2) 697(5.1)
Belgium (Flemish) 398(8.9) 495(3.7) 545(3.1) 588(2.8) 643(3.3)
Netherlands 417(8.4) 488(4.5) 540(5.8) 587(4.8) 644(6.8)
Estonia 416(4.8) 484(3.6) 531(4.0) 577(2.7) 645(4.0)
Hungary 398(8.1) 476(2.9) 531(3.5) 584(4.1) 656(4.2)
Latvia 386(5.2) 458(5.2) 510(2.9) 559(3.5) 625(5.4)
Russian Federation 381(5.5) 456(4.2) 509(4.5) 561(4.0) 632(7.5)
Slovak Republic 371(6.5) 453(4.7) 509(3.9) 564(4.3) 642(4.2)
Malaysia 388(3.7) 455(3.9) 507(5.5) 562(6.1) 630(5.3)
Australia 368(10.4) 450(3.9) 506(3.7) 561(5.8) 634(6.6)
United States 369(4.7) 450(2.9) 505(3.0) 560(3.5) 635(3.8)
Lithuania 370(4.5) 448(2.9) 503(2.4) 557(4.0) 628(2.5)
Scotland 368(8.5) 449(5.0) 501(4.3) 550(3.9) 615(6.0)
Sweden 378(4.0) 452(4.3) 501(2.6) 548(2.9) 614(6.3)
Israel 353(5.9) 438(4.8) 498(5.3) 555(3.5) 630(5.3)
England 373(5.3) 445(5.9) 497(5.9) 552(9.2) 627(5.6)
New Zealand 364(9.9) 441(5.2) 495(5.3) 548(7.1) 623(12.5)
Slovenia 375(9.3) 445(2.4) 492(2.0) 542(1.6) 610(3.7)
Italy 355(6.0) 432(4.0) 486(2.9) 537(3.2) 606(5.0)
Armenia 330(7.5) 423(5.1) 483(3.3) 539(3.2) 605(3.5)
Romania 321(7.8) 413(4.6) 479(4.9) 540(4.9) 619(9.0)
Serbia 326(6.2) 417(4.8) 479(4.0) 540(3.1) 618(4.8)
Bulgaria 333(7.5) 421(5.5) 478(4.6) 535(4.6) 611(6.6)
Norway 340(5.2) 414(2.2) 465(3.3) 511(1.7) 573(2.4)
Moldova, Rep. of 321(5.8) 405(7.3) 464(4.9) 518(4.4) 585(5.1)
Cyprus 321(3.8) 405(3.4) 463(1.8) 518(1.5) 586(1.6)
Macedonia, Rep. of 283(4.8) 376(5.1) 439(2.9) 497(3.4) 574(4.7)
Lebanon 324(3.4) 387(3.9) 432(3.7) 479(4.0) 545(5.8)
Jordan 279(5.3) 362(4.1) 427(4.9) 488(5.0) 567(5.2)
Indonesia 266(11.6) 350(7.9) 411(6.0) 472(4.0) 558(3.6)
Iran, Islamic Rep. of 294(4.8) 360(3.5) 408(3.0) 461(2.4) 537(6.2)
Tunisia 316(2.2) 368(2.4) 407(2.4) 450(2.6) 515(6.2)
Egypt 256(3.0) 341(6.0) 405(4.1) 471(3.7) 560(3.2)
Bahrain 277(3.2) 347(1.5) 402(1.8) 455(2.2) 525(1.4)
Orissa 233 329 401 482 577
Palestinian Nat'l Auth. 241(5.2) 326(3.2) 389(4.1) 455(4.2) 542(5.4)
Morocco 275(4.8) 340(3.0) 387(3.0) 434(3.0) 497(2.8)
Chile 258(4.5) 328(4.8) 382(3.4) 441(4.5) 531(4.9)
Rajasthan 215 312 381 449 544
Philippines 241(3.6) 316(5.6) 373(6.4) 437(6.5) 527(8.0)
Botswana 251(5.1) 316(3.0) 365(2.5) 415(2.7) 487(5.0)
Saudi Arabia 204(10.0) 279(6.6) 331(5.1) 385(4.5) 460(5.4)
Ghana 130(5.8) 213(4.3) 274(5.3) 337(7.3) 430(9.1)
South Africa 117(5.2) 191(3.5) 248(4.0) 316(7.5) 484(20.1)
Source: TIMSS Mathematics 2003, Grade 8, Exhibit D.1 and authors' calculations from Rajasthan and
Orissa Secondary School Survey, Grade 9, 2005.
Notes: Countries ranked by median score. Estimates of population percentiles computed using plausible
values. Parentheses contain standard errors.
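The 95th-5th percentile spread used to compare inequality across countries follows directly from the columns of this table. The short sketch below computes it for a handful of rows copied from Table A2; the variable and function names are illustrative.

```python
# (5th percentile, 95th percentile) point estimates copied from Table A2.
percentiles = {
    "South Africa": (117, 484),
    "Orissa": (233, 577),
    "Rajasthan": (215, 544),
    "Chinese Taipei": (407, 733),
    "Tunisia": (316, 515),
}


def spread_ranking(table):
    """Return (country, p95 - p5) pairs sorted from widest to narrowest spread."""
    spreads = {country: p95 - p5 for country, (p5, p95) in table.items()}
    return sorted(spreads.items(), key=lambda item: item[1], reverse=True)


ranking = spread_ranking(percentiles)
```

For these rows the spreads are 367 for South Africa, 344 for Orissa, 329 for Rajasthan, 326 for Chinese Taipei and 199 for Tunisia, so the two Indian states rank just behind South Africa.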
FIGURE A1 TIMSS ITEM RESPONSE FUNCTIONS AND OBSERVED RESPONSES
[Figure: 36 panels, Item 1 through Item 36, each plotting the probability of a correct response (y-axis, 0 to 1) against achievement (theta) (x-axis, 0 to 800).]
Notes: Observed responses (dots) are means of 25 achievement bins. Expected responses (lines) use fixed
TIMSS item parameters.
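The observed points in such a plot can be produced by sorting students by estimated achievement, splitting them into bins, and averaging both achievement and the correct/incorrect indicator within each bin. The sketch below assumes equal-count bins; the notes state only that 25 bins were used, so the binning rule and the function name are assumptions.

```python
def binned_observed(thetas, correct, n_bins=25):
    """Return (mean achievement, share correct) for each equal-count bin."""
    pairs = sorted(zip(thetas, correct))
    n = len(pairs)
    points = []
    for i in range(n_bins):
        chunk = pairs[i * n // n_bins:(i + 1) * n // n_bins]
        if not chunk:
            continue  # skip empty bins when there are fewer students than bins
        mean_theta = sum(t for t, _ in chunk) / len(chunk)
        share_correct = sum(x for _, x in chunk) / len(chunk)
        points.append((mean_theta, share_correct))
    return points


# Toy data: 100 students, the lower half answer incorrectly, the upper half correctly.
toy_thetas = list(range(100))
toy_correct = [0] * 50 + [1] * 50
toy_points = binned_observed(toy_thetas, toy_correct, n_bins=4)
```

Each resulting point can then be compared with the model's expected probability at the bin's mean achievement to judge item fit.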