
Why we stopped trusting satisfaction scores

Beneficiary satisfaction scores are everywhere in development monitoring. The standard module asks respondents to rate a service from one to five, or from strongly disagree to strongly agree, on dimensions like courtesy of staff, accessibility, quality of service, and overall satisfaction. The data are collected, aggregated, and reported in dashboards. Eighty-two per cent of beneficiaries are very satisfied. The figure becomes the headline.

The figure is also, in most contexts, almost meaningless. Satisfaction scores in development settings are systematically inflated, and the inflation has a structure that conventional reporting does not surface.

Three mechanisms drive the inflation. The first is courtesy bias: the well-documented tendency of respondents to give the answer they think the enumerator wants to hear, particularly in cultures where direct criticism of an outsider's work is considered impolite. Courtesy bias is large in absolute magnitude and consistent across studies that have measured it. The second is withdrawal fear: respondents worry, often correctly, that low scores will be read as evidence that the programme is not working and could lead to its termination. The relationship between honest feedback and continued service is rarely as transparent to beneficiaries as it is to evaluators. The third is reciprocity: programmes that have provided tangible benefits create a sense of obligation that makes negative reporting feel ungrateful. None of these mechanisms is rare. Together they pull the response distribution sharply toward the favourable end.

The empirical signature of inflated satisfaction scores is recognisable. Ceiling effects are extreme: 75 to 90 per cent of responses sit in the top category. Variance is low; respondents pile into the same answer regardless of the service variation the survey was designed to detect. Discriminant validity collapses; satisfaction scores correlate with each other across dimensions but do not vary meaningfully with externally observable service quality. A measure that does not vary across visibly different facilities is not measuring service quality. It is measuring the survey relationship.
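These checks can be scripted directly against the respondent-level file. The sketch below is illustrative only: it assumes a table with 1-to-5 satisfaction items and an externally collected facility audit score, and all column names are hypothetical.

```python
import pandas as pd

# Hypothetical respondent-level data: 1-5 satisfaction items plus an
# externally observed facility quality score (e.g. from an audit).
df = pd.DataFrame({
    "facility_id":   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "sat_courtesy":  [5, 5, 4, 5, 5, 5, 5, 4, 5],
    "sat_access":    [5, 4, 5, 5, 5, 5, 5, 5, 4],
    "sat_overall":   [5, 5, 5, 5, 4, 5, 5, 5, 5],
    "audit_quality": [0.9, 0.9, 0.9, 0.4, 0.4, 0.4, 0.7, 0.7, 0.7],
})

sat_items = ["sat_courtesy", "sat_access", "sat_overall"]

# 1. Ceiling effect: share of responses in the top category, per item.
top_box = (df[sat_items] == 5).mean()

# 2. Variance: near-zero spread means the item cannot discriminate.
spread = df[sat_items].var()

# 3. Discriminant validity: satisfaction items correlating highly with
#    each other but not with observed quality is the warning sign.
inter_item = df[sat_items].corr()
facility_means = df.groupby("facility_id")[sat_items + ["audit_quality"]].mean()
external = facility_means[sat_items].corrwith(facility_means["audit_quality"])

print("Top-box rates:\n", top_box)
print("Item variances:\n", spread)
print("Inter-item correlations:\n", inter_item)
print("Correlation with audit quality (facility level):\n", external)
```

High top-box rates, low variance, and high inter-item correlation alongside a weak correlation with the audit score is the full signature described above.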

What is the work-around? Three approaches, none of which is a complete solution.

The first is behavioural measurement. Exit choices — whether beneficiaries return for follow-up appointments, whether they refer family members, whether they choose this provider over alternatives when alternatives exist — carry information that satisfaction scales do not. Complaint patterns, when complaint mechanisms exist and are accessible, are also informative; the rate of formal complaints is almost always lower than the rate of dissatisfaction, but variation in complaint rates across facilities tracks variation in service quality more closely than satisfaction scores do. Repeat-usage data from administrative records is the cheapest and most reliable behavioural measure when it is available.
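Where an administrative visit log exists, the repeat-usage measure is a short computation. The sketch below assumes a one-row-per-visit table; the schema and column names are illustrative, not a prescribed format.

```python
import pandas as pd

# Hypothetical administrative visit log: one row per visit.
visits = pd.DataFrame({
    "beneficiary_id": [101, 101, 102, 103, 103, 103, 104, 105],
    "facility_id":    [1,   1,   1,   2,   2,   2,   2,   2],
    "visit_date": pd.to_datetime([
        "2024-01-05", "2024-02-10", "2024-01-20",
        "2024-01-07", "2024-01-28", "2024-03-02",
        "2024-02-15", "2024-02-20",
    ]),
})

# A beneficiary "returned" if they have more than one visit at the facility.
per_beneficiary = (
    visits.groupby(["facility_id", "beneficiary_id"])
    .size()
    .rename("n_visits")
    .reset_index()
)
per_beneficiary["returned"] = per_beneficiary["n_visits"] > 1

# Facility-level repeat-usage rate: share of beneficiaries who came back.
repeat_rate = per_beneficiary.groupby("facility_id")["returned"].mean()
print(repeat_rate)
```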

The second is comparative phrasing. The standard satisfaction question puts the respondent in a one-on-one relationship with the service: how satisfied are you? Comparative phrasings shift the frame: "Would you recommend this service to your sister?" produces lower endorsement rates than direct satisfaction questions, and the decline is informative rather than artefactual. The respondent has to weigh the benefit and harm to a specific other person, which is a more demanding cognitive task than rating one's own experience and is less susceptible to courtesy bias. "What would you change about this service if you could change one thing?" is a related move; it solicits criticism while preserving the respondent's politeness norm.
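If both phrasings are fielded in the same survey, the gap between them is easy to quantify. A minimal sketch, assuming a direct 1-to-5 item and a yes/no recommendation item, with hypothetical column names:

```python
import pandas as pd

# Hypothetical responses: a direct satisfaction item (1-5) and a
# comparative recommendation item (yes/no) from the same respondents.
survey = pd.DataFrame({
    "sat_overall":     [5, 5, 5, 4, 5, 5, 5, 5, 4, 5],
    "would_recommend": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],  # 1 = yes
})

# Endorsement under the direct phrasing: top-box share.
direct_endorsement = (survey["sat_overall"] == 5).mean()

# Endorsement under the comparative phrasing: share who would recommend.
comparative_endorsement = survey["would_recommend"].mean()

# The gap is the quantity of interest: how much of the headline
# satisfaction melts away once the respondent has to commit someone
# they care about to the service.
print(f"Direct top-box:  {direct_endorsement:.0%}")
print(f"Would recommend: {comparative_endorsement:.0%}")
print(f"Gap:             {direct_endorsement - comparative_endorsement:.0%}")
```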

The third is exploiting mode and enumerator effects. Satisfaction scores collected by phone, by a third-party enumerator unaffiliated with the programme, or through interactive voice response (IVR) are systematically lower than scores collected face-to-face by enumerators perceived to be associated with the programme. The differences are not small. A programme that wants to know whether its service is improving over time should pick one mode and hold it constant; a programme that wants to know how good its service actually is should consider running parallel data collection in multiple modes, knowing that the estimate from a third-party CATI vendor is closer to the truth than the face-to-face number that ends up in the report.
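Once the two rounds are pooled, the parallel-mode comparison reduces to a small pivot. The sketch below assumes the same item was collected face-to-face and through a third-party CATI round; mode labels and column names are assumptions.

```python
import pandas as pd

# Hypothetical pooled dataset: the same satisfaction item collected
# face-to-face by programme staff and by phone by a third-party vendor.
responses = pd.DataFrame({
    "facility_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "mode":        ["face_to_face", "face_to_face", "cati", "cati",
                    "face_to_face", "face_to_face", "cati", "cati"],
    "sat_overall": [5, 5, 4, 3, 5, 4, 3, 3],
})

# Mode gap per facility: how much higher the face-to-face number sits
# above the third-party phone number for the same service.
by_mode = responses.pivot_table(index="facility_id", columns="mode",
                                values="sat_overall", aggfunc="mean")
by_mode["mode_gap"] = by_mode["face_to_face"] - by_mode["cati"]
print(by_mode)
```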

The political economy of satisfaction scores is worth naming. They are easy to report. The numbers are large, the trend is reliably positive, and they appeal to funders who want headline indicators of programme acceptance. They are also, increasingly, easy to dismiss. Sophisticated funders read the eighty-two-per-cent figure with the same scepticism that sophisticated readers bring to net promoter scores in commercial settings. A monitoring system that produces only satisfaction scores produces an output that is uninformative to the audience that matters most.

What we now recommend, in monitoring system design, is to spend perhaps twenty per cent of the beneficiary feedback effort on conventional satisfaction scores (they remain useful for tracking changes within a single facility over time, where the inflation is roughly constant) and the remaining eighty per cent on behavioural data, comparative items, complaint mechanisms, and structured open-ended feedback. The headline numbers come down. The information content goes up. Programmes that take this trade seriously tend to discover problems they previously did not know they had, which is exactly the function the monitoring system was meant to perform.

Useful references: Iarossi's The Power of Survey Design (World Bank) on courtesy bias and response set effects; the Feedback Labs library on closing-the-loop feedback design; and Reichheld's original HBR essay on Net Promoter Score, which despite its commercial framing makes the comparative-phrasing argument cleanly.