Crucial decisions are increasingly being made by automated machine learning algorithms. These algorithms rely on data, and without high quality data, the resulting decisions may be inaccurate and/or unfair. In some cases, data is readily available: for example, location data passively collected by smartphones. In other cases, data may be difficult to obtain by automated means, and it is necessary to directly survey the population.

However, individuals are not always motivated to take surveys if they receive no benefit. Offering a monetary reward may incentivize some individuals to participate, but there is a problem with this approach: what if an individual’s data is correlated with their willingness to take the survey?

For concreteness, imagine that you are a health administrator trying to estimate the average weight in a population. This is a sensitive attribute that individuals may be reluctant to disclose, especially if their weight is not considered healthy. A generic survey may yield disproportionately more respondents with “healthy” weights, and thus may result an an inaccurate estimate (see, e.g, Shields et al., 2011).

In this post, we discuss three papers which propose solutions to this problem through the lens of mechanism design. The idea is to carefully design payments so that we received an unbiased sample, leading to a hopefully accurate estimate.

## Model

We use $z_i$ to denote agent $i$‘s data (e.g., her weight). We assume that each agent also has a personal cost $c_i$, representing her level of reluctance to reveal her data. Agent $i$ is willing to reveal $z_i$ if and only if she receives a payment of at least $c_i$. Our goal is to allocate higher payments to agents with higher $c_i$‘s, in order to get an unbiased sample. However, we also must obey an budget constraint: we cannot spend more than $B$ total. The solution is to transact non-deterministically: with some probability, offer to purchase an agent’s data. Agents with higher costs will receive higher payments, but lower transaction probabilities.

We assume that agents are drawn at random independently from some distribution. Our crucial assumption is that we known the marginal distribution of agent costs, which we denote $\mathcal{F}$ (we will explore later what happens when this assumption is removed). However, we do not know the distribution of $z_i$, and that distribution can be arbitrarily correlated with $\mathcal{F}$. As mentioned above, one might expect agents with less “desirable” $z_i$‘s may have higher costs, but one can imagine more complex correlations as well.

Our mechanisms consist of two parts: an allocation rule $A$, and a payment rule $P$. Given $A$ and $P$, the mechanism works as follows:

1. Ask each agent $i$ to report $c_i$. Let $c$ denote the actual reported cost.
2. With probability $A(c)$, we purchase the agent’s data and pay her $P(c)$. With probability $1 - A(c)$, we do not buy the data, and no payment is made.
3. At the end, use the data we learned to form an estimate of the population average of $z_i$. Let $\bar{z}$ denote our estimate.

In this model, agent $i$‘s expected utility for reporting $c$ is $u_i(c) = A(c) (P(c) - c_i)$.

We have four main requirements:

1. Truthfulness. It should be in each agent’s best interest to truthfully report $c_i$.
2. Individual rationality. Agents should not receive negative utility if they are honest, i.e., we should have $P(c_i) \ge c_i$ for all $i$.
3. Budget constrained. Our total expected payment should not exceed $B$, i.e., $\mathbb{E}[\sum_i A(c_i) P(c_i)] \le B$.
4. Unbiased. Our estimate isn’t consistently too high or too low. Specifically, the expected value of our estimate should be equal to the true average, i.e., $\mathbb{E}[\bar{z}] = \mathbb{E}[z_i]$.

Lack of bias doesn’t mean that our estimate is accurate, however. To this end, our primary goal is to minimize the variance, subject to the mechanism obeying the four above criteria. We evaluate variance via a worst-case framework: given a mechanism, we wish minimize the variance with respect to the worst-case distribution of agents for that mechanism. The idea is that the distribution of $z_i$‘s is not known to the mechanism, so we require it to perform well for all distributions.

When we refer to the “optimal” mechanism, we mean minimum variance, subject to being truthful, individually rational, budget constrained, and unbiased (henceforth TIBU).

The Horvitz Thomspon Estimator

Once we have learned the $z_i$‘s, how do we actual form an estimate of the mean? Luckily for us, this question has a simple answer. If we restrict ourselves to linear unbiased estimators, there is a unique way to do this, known as the Horvitz-Thompson estimator: $\bar{z} = \displaystyle\sum\limits_i \cfrac{z_i}{A(c_i)}$

Thus our task is simply to choose $A$ and $P$.

## Approach #1

This model was first considered by Roth and Schoenebeck (2012). They are able to characterize a mechanism which is TIBU and has variance at most $1/n$ more than the optimal variance, where $n$ is the number of agents. However, they do make the strong assumption that $z_i$ is either 0 or 1.

Their approach relies on Take-It-Or-Leave-It mechanisms. Such a mechanism is defined by a distribution $G$ over the positive real numbers, and works as follows:

1. Each agent $i$ reports a cost $c$.
2. Sample a payment $p$ from $G$.
3. If $p \ge c$, buy the agent’s data with payment $p$. If $p < c$, do not buy the agent’s data.

This amounts to an allocation rule $A(c) = 1 - \text{Pr}[p\ge c]$, and a payment rule $P(c)$ equal to the distribution $G$ conditioned on being at least $c$. The authors show that these mechanisms are fully general, i.e., any allocation and payment rule can be implemented by a Take-It-Or-Leave-It mechanism.

The proof of their main result is primarily based on using the calculus of variations to optimize over the space of distributions $G$. The paper contains some additional results, for example regarding an alternate model where we wish to minimize the budget, but that is outside the scope of this blog post.

## Approach #2

Although the above result is a great step, it leaves room for improvement. First of all, the assumption that $z_i$ is binary is quite strong, and does not apply to our running example of body weight. Secondly, their mechanism does not quite achieve the optimal variance. Chen et al. (2018) remedy both of these concerns. That is, they allow $z_i$ to be any real number, and they characterize the TIBU mechanism with optimal variance. Their result also generalizes to more complex statistical estimates, not just the average $z_i$, and it holds for both continuous and discrete agent distributions.

The approach of Chen et al. (2018) is based on two primary ideas. First, they show that any monotone allocation rule (i.e., we are always less likely to purchase data from an agent with higher cost) can be implemented in a TIBU fashion by a unique payment rule. Thus we only need to identify the optimal allocation rule. (This is similar to the standard result from auction theory about implementable monotone allocation rules (Myerson 1981).)

The second idea is to view the problem as a zero-sum game between ourselves (the mechanism designer) and an adversary who chooses the distribution of agents. Given a distribution, we choose an allocation rule to minimize the variance, and given an allocation rule, the adversary chooses a distribution to maximize the variance.

The authors are able to solve for the equilibrium of this game and thus identify the TIBU mechanism with minimum possible variance.

## Approach #3

Approach #2 gave us our desired result: a minimum variance mechanism subject to our four desired properties (TIBU), for any distribution of $z_i$‘s. But we are still making a very strong assumption: that we know the distribution of agent costs.

Chen and Zheng (2019) do away with this assumption in a follow-up paper. They consider a model where the mechanism has no prior information on the distribution of costs (or on the distribution of data), and $n$ agents arrive one-by-one in a uniformly random order. Each agent reports a cost, and we decide whether to buy her data, and what to pay her. In order to price well, we need to learn the cost distribution, but we must do this while simultaneously making irrevocable purchasing decisions. The main result is a TIBU mechanism with variance at most a constant factor worse than optimal.

The authors note that after each step $i$, the reported costs up to that point induce an empirical cost distribution $\mathcal{F}_i$. Using the results of Chen et al. (2018), we can determine the optimal mechanism for $\mathcal{F}_i$. The basic idea is to use that mechanism for the current step, learn a new agent cost $c_i$ (note that the agent reports $c_i$ regardless of whether we purchase her data) and then update our empirical distribution accordingly. (The authors actually end up using an approximately optimal allocation rule, but the idea is the same.) The mechanism also uses more budget in the earlier rounds, to make up for the pricing being less accurate.

## Discussion

In this post, we considered the problem of surveying a sensitive attribute where an agent’s data may be correlated with their willingness to participate. We discussed three different approaches, all of which rely of giving higher payments to agents with higher costs, in order to incentivize them to participate and to obtain an unbiased estimate. The final approach was able to give a truthful, individually rational, budget feasible, and unbiased mechanism with approximately optimal variance, without making any prior assumptions on the distribution of agents.

However, all three of the approaches assume that agents cannot lie about the data. This is reasonable for some attributes, such as a body weight, where an agent can be asked to step onto a physical scale. However, requiring participants come in person to a particular location will certainly lead to less engagement. Furthermore, for other sensitive attributes, there may not be a verifiable way to obtain the data. Future work could investigate alternative models where this assumption is not necessary. For example, perhaps agents do not maliciously lie, but rather are simply inaccurate at reporting their own attributes. For example, research has demonstrated that people consistently over-report height and under-report weight (e.g., Gorber et al., 2007). Could a mechanism learn the pattern of inaccuracy and compensate for that to still obtain an unbiased estimate?

## References

1. Yiling Chen, Nicole Immorlica, Brendan Lucier, Vasilis Syrgkanis, and Juba Ziani. “Optimal data acquisition for statistical estimation.” In Proceedings of the 2018 ACM Conference on Economics and Computation. 2018.
2. Yiling Chen and Shuran Zheng. “Prior-free data acquisition for accurate statistical estimation.” Proceedings of the 2019 ACM Conference on Economics and Computation. 2019.
3. Sarah Connor Gorber, Mark S. Tremplay, David Moher, and B. Gorber (2007). A comparison of direct vs. self‐report measures for assessing height, weight and body mass index: a systematic review. Obesity reviews8(4), 307-326.
4. Roger Myerson. Optimal auction design. Mathematics of Operations Research, 6(1):58–73. 1981.
5. Aaron Roth and Grant Schoenebeck. “Conducting truthful surveys, cheaply.” Proceedings of the 2012 ACM Conference on Electronic Commerce. 2012.
6. Margot Shields, Sarah Connor Gorber, Ian Janssen, and Mark S. Tremblay. (2011). Bias in self-reported estimates of obesity in Canadian health surveys: an update on correction equations for adults. Health Reports22(3), 35.