When admin data is the only data

For low-frequency outcomes in remote Indian districts (rural employment days, housing completion status, foodgrain offtake, school attendance), survey data is expensive, dated, and often unavailable at the geographic scale policy operates on. Administrative data is the alternative: NREGA muster rolls and the MGNREGA MIS, PMAY-G housing records, PDS dealer ledgers, UDISE+ school information, the Aadhaar-enabled Public Financial Management System. Each is large, each is granular to the district or block level, each updates more frequently than the National Sample Survey, and each is freely available through a public portal.

We use these sources continually. Most of our district-level analytical work would not be possible without them. We also keep a private list, refined over the years, of the ways in which they go quietly wrong — not because the data is bad in any simple sense, but because it was not collected for the purposes we put it to. The gap between what an administrative record is and what we treat it as is exactly where the most consequential errors live.

The first thing to keep in mind about NREGA muster rolls is that they are not a record of work done. They are a record of work as the worksite operator reported it, signed off by the panchayat, and entered into the MIS by a block-level data entry operator. Each of those steps can introduce variation that has nothing to do with the underlying activity. We have seen blocks where MIS attendance figures are systematically rounded to multiples of five. We have seen districts where disputes over delayed payments resulted in attendance being entered retroactively, weeks after the work, with whatever data the operator had to hand. Treating the MIS attendance variable as a direct measure of person-days worked is treating a bureaucratic output as a behavioural measurement.
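The rounding pattern described above can be screened for mechanically. What follows is a minimal sketch, not our production check: the record layout (a block identifier paired with an attendance figure) and the block names are hypothetical, and the numbers are made up. The benchmark is also an assumption. If final digits were spread roughly evenly, about one entry in five would be a multiple of five, so a block far above that share is worth a closer look, though legitimate clustering can produce the same pattern.

```python
from collections import defaultdict


def share_multiple_of_five(records):
    """Return {block_id: share of positive attendance entries divisible by five}.

    `records` is an iterable of (block_id, days) pairs. Both field names are
    hypothetical; they are not MIS column names.
    """
    totals = defaultdict(int)
    heaped = defaultdict(int)
    for block_id, days in records:
        if days <= 0:
            continue  # zero is trivially divisible by five, so skip it
        totals[block_id] += 1
        if days % 5 == 0:
            heaped[block_id] += 1
    return {block: heaped[block] / totals[block] for block in totals}


# Made-up figures for two hypothetical blocks.
records = (
    [("block_A", d) for d in (3, 7, 12, 14, 6, 9, 11, 10, 4, 13)]
    + [("block_B", d) for d in (5, 10, 10, 15, 5, 10, 20, 15, 10, 12)]
)
shares = share_multiple_of_five(records)
for block, share in sorted(shares.items()):
    print(block, round(share, 2))
```

A high share is a prompt for a conversation with the block office, not proof of misreporting.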

The second is that job card holders are a selected population. Households without job cards are not in the NREGA data at all. Whether a household has a job card is correlated with caste, literacy, awareness of the scheme, distance from the panchayat office, and political position within the village. Using NREGA participation as a measure of rural distress is sensible if you remember that the population observable through NREGA is the population that successfully completed an administrative enrolment step. The most-distressed households are not always in that population. Sometimes they are systematically absent from it.

The third is definitional drift. The MIS defines a "person-day generated" in a specific way. That definition has changed over time, has been operationalised differently in different states, and has been gamed where the official metric and the field reality have diverged. The published time series suggests comparability across years that the underlying definitions do not support. Sensitivity to which years are being compared and what the definition was in each is the basic literacy required to read NREGA trend data honestly.

The fourth is geographic coverage. Districts where the MIS is poorly maintained look like districts with low NREGA activity. They are sometimes the opposite — districts where the administrative state is too thin to record activity that is happening. This is the classic problem of administrative data: absence is ambiguous between "did not happen" and "was not recorded." For an outcome variable, ambiguous absence is a serious problem.
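One practical consequence is to keep "no record" and "recorded zero" as separate states in the analysis file rather than filling gaps with zeros. The sketch below assumes a hypothetical layout in which each cell is a (district, month) pair and the reported value is either a number or missing; the names and figures are illustrative only.

```python
def classify_cells(expected_cells, reported):
    """Label each expected (district, month) cell by what the record shows.

    `expected_cells` lists every cell that should exist; `reported` maps
    cells to a reported value (None or absent means nothing was recorded).
    """
    status = {}
    for cell in expected_cells:
        value = reported.get(cell)
        if value is None:
            status[cell] = "no_record"  # ambiguous: may or may not have happened
        elif value == 0:
            status[cell] = "recorded_zero"  # someone entered a zero
        else:
            status[cell] = "recorded_positive"
    return status


# Made-up example: two hypothetical districts over two months.
expected = [
    ("district_1", "2023-04"),
    ("district_1", "2023-05"),
    ("district_2", "2023-04"),
    ("district_2", "2023-05"),
]
reported = {
    ("district_1", "2023-04"): 1200,
    ("district_1", "2023-05"): 0,
    ("district_2", "2023-04"): None,
}
status = classify_cells(expected, reported)
for cell in expected:
    print(cell, status[cell])
```

The share of "no_record" cells per district is itself a rough indicator of how well the MIS is maintained there.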

PMAY-G illustrates a related dynamic. Housing scheme records are often used as a proxy for poverty status or for housing deprivation. The records have their own selection pipeline: who applied, who was found eligible by the local administration, who was approved, who received the first instalment, who completed construction, who got the final geotagged photograph uploaded to the MIS. Households drop out at each step. The published count of "houses sanctioned" is therefore consistent with widely different counts of houses actually inhabited, and any figure drawn from the records depends on which step the analyst is reading.
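The pipeline can be laid out as a simple funnel table. This is a sketch under stated assumptions: the stage labels paraphrase the steps listed above and are not PMAY-G MIS field names, the pipeline is treated as strictly linear, and the household data is invented.

```python
# Hypothetical stage labels, in pipeline order.
STAGES = [
    "applied",
    "eligible",
    "approved",
    "first_instalment",
    "completed",
    "geotagged",
]


def funnel(furthest_stage_by_household):
    """Return [(stage, households reaching it, retention from previous stage)].

    Assumes a household that reached a stage also passed every earlier one.
    """
    reached = {stage: 0 for stage in STAGES}
    for stage in furthest_stage_by_household.values():
        last = STAGES.index(stage)
        for earlier in STAGES[: last + 1]:
            reached[earlier] += 1
    rows = []
    previous = None
    for stage in STAGES:
        retention = reached[stage] / previous if previous else None
        rows.append((stage, reached[stage], retention))
        previous = reached[stage]
    return rows


# Made-up example: the furthest stage each of eight households reached.
households = {
    "h1": "geotagged", "h2": "geotagged", "h3": "completed",
    "h4": "first_instalment", "h5": "approved", "h6": "approved",
    "h7": "eligible", "h8": "applied",
}
rows = funnel(households)
for stage, count, retention in rows:
    print(stage, count, retention)
```

Reporting the whole table, rather than a single stage, makes it explicit which step a headline number refers to.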

What is the practical guidance? Three things.

First, document the administrative process that produced the record. Read the operational guidelines for the scheme, talk to a block-level functionary about how the data entry actually happens, and find at least one case where the official record and the field reality diverged. The purpose is not to discredit the data; it is to know what the data is.

Second, where stakes are high, triangulate with a small survey-based ground-truth exercise in a randomly drawn subsample. Comparing the administrative record to direct observation in twenty randomly selected villages costs perhaps fifteen days of fieldwork and is worth more than any amount of additional analysis on the original records. The match rate between the two is itself an important finding.
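The draw and the match rate are both easy to make reproducible. The sketch below is illustrative: the village identifiers, the status labels, and the seed are hypothetical, and it assumes a simple random sample from a complete village list. A unit that was observed in the field but has no administrative record is counted here as a mismatch, which is a design choice rather than a rule.

```python
import random


def draw_villages(village_ids, n=20, seed=2024):
    """Draw n villages without replacement; a fixed seed makes the draw auditable."""
    rng = random.Random(seed)
    return rng.sample(sorted(village_ids), n)


def match_rate(admin, observed):
    """Share of field-observed units whose administrative record agrees."""
    if not observed:
        raise ValueError("no observed units to compare")
    matches = sum(1 for unit, seen in observed.items() if admin.get(unit) == seen)
    return matches / len(observed)


# Hypothetical sampling frame of 150 villages.
frame = [f"village_{i:03d}" for i in range(1, 151)]
sample = draw_villages(frame)

# Made-up comparison for four units in one audited village.
admin = {"h1": "completed", "h2": "completed", "h3": "completed", "h4": "completed"}
observed = {"h1": "completed", "h2": "incomplete", "h3": "completed", "h4": "not_located"}
rate = match_rate(admin, observed)
print(len(sample), rate)
```

Recording the seed alongside the sampling frame lets someone else confirm that the villages were not hand-picked.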

Third, do not headline-report the administrative variable as if it were the underlying outcome. Report it as what it is — the administrative record — and where the gap to the underlying outcome is known, quantify it. "Of the 4,217 PMAY-G houses sanctioned in this district between 2022 and 2024, our ground-truth audit of 80 randomly selected geotagged completions found 71 inhabited, 6 incomplete, 3 not located" is a different kind of sentence than "this district sanctioned 4,217 PMAY-G houses." Both are true. They support different policy conclusions.
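Quantifying the gap also means stating its uncertainty. As a worked example on the audit figures quoted above (71 of 80 inhabited), the sketch below computes a Wilson score interval for the inhabited share. The choice of the Wilson interval and the 95% level are our assumptions for illustration, and the calculation treats the 80 houses as a simple random sample.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    )
    return centre - half_width, centre + half_width


# Audit figures from the example sentence: 71 inhabited out of 80 audited.
inhabited, audited = 71, 80
low, high = wilson_interval(inhabited, audited)
print(f"point estimate: {inhabited / audited:.3f}")
print(f"interval: {low:.3f} to {high:.3f}")
```

On these figures the interval runs from roughly 80 to 94 percent. It describes geotagged completions only, since that is the population the audit sampled; it says nothing directly about the 4,217 sanctioned houses that never reached that stage.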

The administrative state in India produces more data per district than any external research team will ever match. The opportunity is real. So is the responsibility to know what that data is. We have come to treat the published MIS variable as the start of the work, not the end of it.

Useful references: the NREGA MIS for muster roll and works data; the PMAY-G MIS for housing scheme records; and UDISE+ for school-level data. Card and DellaVigna's JEP essay on the use of administrative data in applied research is a useful methodological cross-reference.