The theory group at Stanford invites applications for the Motwani postdoctoral fellowship in theoretical computer science. Information and application instructions below. Applications will be accepted until the positions are filled, but review of applicants will begin after Dec 15.

Website: https://academicjobsonline.org/ajo/jobs/17685

Email: theory.stanford@gmail.com

The Simons Collaboration on the Theory of Algorithmic Fairness seeks highly qualified candidates (within five years of the award of their PhD) for a postdoctoral research position. Appointments will begin in Summer or Fall 2021.

This multi-year program will host several postdoctoral researchers working on modeling and theory aimed at understanding (a) the sources of discriminatory behavior of algorithms, and (b) how best to mitigate such impacts.

Descriptions of the scientific agendas of this research collaboration can be found at https://toc4fairnesses.org/.

Fellows will be able to collaborate broadly, including with researchers at partner institutions: Stanford University, Toyota Technological Institute at Chicago, Massachusetts Institute of Technology, Harvard University, UC Berkeley, Cornell University, Hebrew University of Jerusalem, Weizmann Institute of Science, University of Toronto, University of Washington, and University of Pennsylvania.

The anticipated term for a fellowship is one or two years – to be decided at the time of appointment, with the possibility of extension based on mutual agreement. In addition to competitive salary and benefits, the fellowship also includes funding for independent travel to workshops, conferences and other universities and research labs.

To apply, please email a CV and research statement, and arrange for two reference letters to be sent, to jamiemmt@cs.washington.edu. Applications and reference letters are due Dec. 31, 2020, though we will consider applications that arrive after that date. Decisions will be made in February.

Jamie Morgenstern, chair of the postdoctoral search committee

Simons Collaboration on the Foundations of Fairness in Machine Learning

- The Simons Collaboration will offer additional postdoc opportunities across the participating institutions. To be advertised soon.

- The perspective taken in these projects is that of TOC, but the research will gain from collaborations with other fields within and outside of CS (Statistics, Economics, Philosophy, Law, the Social Sciences at large). I will therefore be open to postdocs with various backgrounds (and would consider co-hosting with colleagues in other fields).

- Candidates from under-represented populations are encouraged to apply and I promise to give those applications my close attention.

https://sigact.org/tcswomen/.

Crucial decisions are increasingly being made by automated machine learning algorithms. These algorithms rely on data, and without high quality data, the resulting decisions may be inaccurate and/or unfair. In some cases, data is readily available: for example, location data passively collected by smartphones. In other cases, data may be difficult to obtain by automated means, and it is necessary to directly survey the population.

However, individuals are not always motivated to take surveys if they receive no benefit. Offering a monetary reward may incentivize some individuals to participate, but there is a problem with this approach: what if an individual’s data is correlated with their willingness to take the survey?

For concreteness, imagine that you are a health administrator trying to estimate the average weight in a population. This is a sensitive attribute that individuals may be reluctant to disclose, especially if their weight is not considered healthy. A generic survey may yield disproportionately more respondents with “healthy” weights, and thus may result in an inaccurate estimate (see, e.g., Shields et al., 2011).

In this post, we discuss three papers that propose solutions to this problem through the lens of *mechanism design*. The idea is to carefully design payments so that we receive an unbiased sample, leading to a hopefully accurate estimate.

We use $x_i$ to denote agent $i$'s data (e.g., her weight). We assume that each agent also has a personal cost $c_i$, representing her level of reluctance to reveal her data. Agent $i$ is willing to reveal $x_i$ if and only if she receives a payment of at least $c_i$. Our goal is to allocate higher payments to agents with higher $c_i$'s, in order to get an unbiased sample. However, we also must obey a budget constraint: we cannot spend more than $B$ in total. The solution is to transact non-deterministically: with some probability, offer to purchase an agent's data. Agents with higher costs will receive higher payments, but lower transaction probabilities.

We assume that the agents $(x_i, c_i)$ are drawn independently at random from some distribution. Our crucial assumption is that we know the marginal distribution of agent costs, which we denote $\mathcal{D}$ (we will explore later what happens when this assumption is removed). However, we do not know the distribution of $x_i$, and that distribution can be arbitrarily correlated with $c_i$. As mentioned above, one might expect that agents with less “desirable” $x_i$'s have higher costs, but one can imagine more complex correlations as well.

Our mechanisms consist of two parts: an *allocation rule* $A$, and a *payment rule* $P$. Given $A$ and $P$, the mechanism works as follows:

- Ask each agent $i$ to report her cost. Let $\hat{c}_i$ denote the actual reported cost.
- With probability $A(\hat{c}_i)$, we purchase the agent's data and pay her $P(\hat{c}_i)$. With probability $1 - A(\hat{c}_i)$, we do not buy the data, and no payment is made.
- At the end, use the data we learned to form an estimate of the population average of the $x_i$'s. Let $\hat{\mu}$ denote our estimate.
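The flow above can be sketched in a few lines of Python. This is a minimal sketch: the particular allocation and payment rules here are made-up placeholders for illustration, not rules from any of the papers discussed.

```python
import random

def run_mechanism(reported_costs, data, alloc, pay):
    """Run one round of the generic mechanism sketched above.

    alloc(c) -> probability of purchasing from an agent reporting cost c
    pay(c)   -> payment made if we do purchase
    Returns the purchased (x_i, A(c_i)) pairs and the total amount paid.
    """
    purchased, total_paid = [], 0.0
    for c_hat, x in zip(reported_costs, data):
        prob = alloc(c_hat)
        assert 0.0 <= prob <= 1.0, "allocation rule must return a probability"
        if random.random() < prob:        # buy with probability A(c_hat)
            purchased.append((x, prob))
            total_paid += pay(c_hat)
    return purchased, total_paid

# Hypothetical toy rules: higher reported costs get higher payments
# but lower purchase probabilities, as described in the text.
alloc = lambda c: max(0.1, 1.0 - c / 10.0)
pay = lambda c: 2.0 * c

random.seed(0)
purchased, spent = run_mechanism([1.0, 4.0, 8.0], [60.0, 75.0, 90.0], alloc, pay)
```

The purchased pairs keep each agent's allocation probability alongside her data, which is exactly what the estimator introduced below needs.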

In this model, agent $i$'s expected utility for reporting $\hat{c}_i$ is $A(\hat{c}_i)\big(P(\hat{c}_i) - c_i\big)$.

We have four main requirements:

- **Truthfulness.** It should be in each agent's best interest to truthfully report $c_i$.
- **Individual rationality.** Agents should not receive negative utility if they are honest, i.e., we should have $A(c_i)\big(P(c_i) - c_i\big) \ge 0$ for all $i$.
- **Budget constrained.** Our total expected payment should not exceed $B$, i.e., $\mathbb{E}\big[\sum_i A(c_i) P(c_i)\big] \le B$.
- **Unbiased.** Our estimate isn't consistently too high or too low. Specifically, the expected value of our estimate should be equal to the true average, i.e., $\mathbb{E}[\hat{\mu}] = \frac{1}{n}\sum_i x_i$.

Lack of bias doesn’t mean that our estimate is accurate, however. To this end, our primary goal is to **minimize the variance**, subject to the mechanism obeying the four above criteria. We evaluate variance via a worst-case framework: given a mechanism, we wish to minimize the variance with respect to the worst-case distribution of agents for that mechanism. The idea is that the distribution of $(x_i, c_i)$'s is not known to the mechanism, so we require it to perform well for all distributions.

When we refer to the “optimal” mechanism, we mean minimum variance, subject to being truthful, individually rational, budget constrained, and unbiased (henceforth TIBU).

**The Horvitz-Thompson Estimator**

Once we have learned the $x_i$'s, how do we actually form an estimate of the mean? Luckily for us, this question has a simple answer. If we restrict ourselves to linear unbiased estimators, there is a unique way to do this, known as the *Horvitz-Thompson estimator*:

$$\hat{\mu} \;=\; \frac{1}{n} \sum_{i \,:\, \text{data purchased}} \frac{x_i}{A(\hat{c}_i)}.$$

Thus our task is simply to choose $A$ and $P$.
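In code, the Horvitz-Thompson estimator is a one-liner, and a quick Monte Carlo simulation illustrates its unbiasedness. This is a minimal sketch with made-up weights and allocation probabilities; the point is only that inverse-probability weighting recovers the true mean in expectation.

```python
import random

def horvitz_thompson(purchased, n):
    """purchased: list of (x_i, A(c_i)) pairs for agents whose data we bought.
    Each purchased value is inverse-weighted by its purchase probability."""
    return sum(x / p for x, p in purchased) / n

# Monte Carlo check: averaged over many runs, the estimate should match the
# true population mean, for any (here arbitrary) allocation probabilities.
random.seed(1)
xs = [55.0, 70.0, 90.0]      # hypothetical weights of three agents
probs = [0.9, 0.5, 0.2]      # purchase probability for each agent
true_mean = sum(xs) / len(xs)

trials = 200_000
avg = 0.0
for _ in range(trials):
    sample = [(x, p) for x, p in zip(xs, probs) if random.random() < p]
    avg += horvitz_thompson(sample, len(xs)) / trials
```

Note that agents purchased with low probability are strongly up-weighted when they do appear, which is exactly what drives the variance that the mechanisms below try to minimize.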

This model was first considered by Roth and Schoenebeck (2012). They are able to characterize a mechanism which is TIBU and has variance at most $o(1)$ more than the optimal variance, where the asymptotics are in $n$, the number of agents. However, they do make the strong assumption that each $x_i$ is either 0 or 1.

Their approach relies on *Take-It-Or-Leave-It* mechanisms. Such a mechanism is defined by a distribution $G$ over the positive real numbers, and works as follows:

- Each agent reports a cost $\hat{c}$.
- Sample a payment $p$ from $G$.
- If $p \ge \hat{c}$, buy the agent's data with payment $p$. If $p < \hat{c}$, do not buy the agent's data.

This amounts to an allocation rule $A(\hat{c}) = \Pr_{p \sim G}[p \ge \hat{c}]$, and a payment rule equal to the distribution $G$ conditioned on being at least $\hat{c}$. The authors show that these mechanisms are fully general, i.e., any allocation and payment rule can be implemented by a Take-It-Or-Leave-It mechanism.
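To make the correspondence concrete, here is a small sketch using a hypothetical discrete distribution $G$: for a reported cost we can compute exactly the induced allocation probability $\Pr[p \ge \hat{c}]$ and the expected payment conditioned on a purchase.

```python
def induced_rules(G, c_hat):
    """G: list of (price, probability) pairs describing a discrete
    take-it-or-leave-it distribution.  Returns the induced allocation
    probability A(c_hat) = Pr[p >= c_hat], and the expected payment
    conditioned on a purchase, E[p | p >= c_hat] (None if A = 0)."""
    alloc = sum(q for p, q in G if p >= c_hat)
    if alloc == 0:
        return 0.0, None
    expected_pay = sum(p * q for p, q in G if p >= c_hat) / alloc
    return alloc, expected_pay

# Hypothetical G: offer price 1, 2, or 3, each with probability 1/3.
G = [(1.0, 1/3), (2.0, 1/3), (3.0, 1/3)]
alloc, pay = induced_rules(G, 2.0)   # an agent reporting cost 2
```

For this $G$, an agent reporting cost 2 is bought with probability 2/3 at an expected price of 2.5, illustrating the pattern from the text: higher reported costs mean lower purchase probability but a higher conditional payment.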

The proof of their main result is primarily based on using the calculus of variations to optimize over the space of distributions $G$. The paper contains some additional results, for example regarding an alternate model where we wish to minimize the budget, but that is outside the scope of this blog post.

Although the above result is a great step, it leaves room for improvement. First of all, the assumption that $x_i$ is binary is quite strong, and does not apply to our running example of body weight. Secondly, their mechanism does not quite achieve the optimal variance. Chen et al. (2018) remedy both of these concerns. That is, they allow $x_i$ to be any real number, and they characterize the TIBU mechanism with optimal variance. Their result also generalizes to more complex statistical estimates, not just the average of the $x_i$'s, and it holds for both continuous and discrete agent distributions.

The approach of Chen et al. (2018) is based on two primary ideas. First, they show that any monotone allocation rule (i.e., we are always less likely to purchase data from an agent with higher cost) can be implemented in a TIBU fashion by a unique payment rule. Thus we only need to identify the optimal allocation rule. (This is similar to the standard result from auction theory about implementable monotone allocation rules (Myerson 1981).)

The second idea is to view the problem as a zero-sum game between ourselves (the mechanism designer) and an adversary who chooses the distribution of agents. Given a distribution, we choose an allocation rule to minimize the variance, and given an allocation rule, the adversary chooses a distribution to maximize the variance.

The authors are able to solve for the equilibrium of this game and thus identify the TIBU mechanism with minimum possible variance.

Approach #2 gave us our desired result: a minimum variance mechanism subject to our four desired properties (TIBU), for any distribution of $x_i$'s. But we are still making a very strong assumption: that we know the distribution of agent costs.

Chen and Zheng (2019) do away with this assumption in a follow-up paper. They consider a model where the mechanism has no prior information on the distribution of costs (or on the distribution of data), and $n$ agents arrive one-by-one in a uniformly random order. Each agent reports a cost, and we decide whether to buy her data, and what to pay her. In order to price well, we need to learn the cost distribution, but we must do this while simultaneously making irrevocable purchasing decisions. The main result is a TIBU mechanism with variance at most a constant factor worse than optimal.

The authors note that after each step $t$, the reported costs up to that point induce an empirical cost distribution $\hat{\mathcal{D}}_t$. Using the results of Chen et al. (2018), we can determine the optimal mechanism for $\hat{\mathcal{D}}_t$. The basic idea is to use that mechanism for the current step, learn a new agent cost (note that the agent reports regardless of whether we purchase her data), and then update our empirical distribution accordingly. (The authors actually end up using an approximately optimal allocation rule, but the idea is the same.) The mechanism also uses more budget in the earlier rounds, to make up for the pricing being less accurate.
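A heavily simplified sketch of the online idea follows. This is not the actual Chen-Zheng mechanism: it just posts each arriving agent a take-it-or-leave-it price at a fixed quantile of the empirically observed costs, under a hard budget, to illustrate learning the cost distribution while making irrevocable purchasing decisions.

```python
def online_survey(arrival_costs, data, budget, quantile=0.5, initial_price=1.0):
    """Simplified prior-free sketch (illustrative only): each arriving agent
    is offered a price based on the empirical distribution of previously
    reported costs; purchases are irrevocable and budget-constrained."""
    seen = []                          # reported costs observed so far
    purchased, spent = [], 0.0
    for c, x in zip(arrival_costs, data):
        if seen:
            ranked = sorted(seen)
            price = ranked[int(quantile * (len(ranked) - 1))]
        else:
            price = initial_price      # no information yet: use a default
        if c <= price and spent + price <= budget:
            purchased.append(x)
            spent += price
        seen.append(c)                 # agents report c even if we don't buy
    return purchased, spent

purchased, spent = online_survey(
    arrival_costs=[1.0, 3.0, 2.0, 1.0],
    data=[60.0, 80.0, 70.0, 65.0],
    budget=5.0)
```

Note that this sketch buys deterministically and so would be biased; the real mechanism instead uses a (near-)optimal randomized allocation rule derived from the empirical distribution at each step.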

In this post, we considered the problem of surveying a sensitive attribute where an agent’s data may be correlated with their willingness to participate. We discussed three different approaches, all of which rely on giving higher payments to agents with higher costs, in order to incentivize them to participate and so obtain an unbiased estimate. The final approach was able to give a truthful, individually rational, budget feasible, and unbiased mechanism with approximately optimal variance, without making any prior assumptions on the distribution of agents.

However, all three of the approaches assume that agents cannot lie about their data. This is reasonable for some attributes, such as body weight, where an agent can be asked to step onto a physical scale. However, requiring participants to come in person to a particular location will certainly lead to less engagement. Furthermore, for other sensitive attributes, there may not be a verifiable way to obtain the data. Future work could investigate alternative models where this assumption is not necessary. For example, perhaps agents do not maliciously lie, but rather are simply inaccurate at reporting their own attributes: research has demonstrated that people consistently over-report height and under-report weight (e.g., Gorber et al., 2007). Could a mechanism learn the pattern of inaccuracy and compensate for it to still obtain an unbiased estimate?

- Yiling Chen, Nicole Immorlica, Brendan Lucier, Vasilis Syrgkanis, and Juba Ziani. “Optimal data acquisition for statistical estimation.” In *Proceedings of the 2018 ACM Conference on Economics and Computation*. 2018.
- Yiling Chen and Shuran Zheng. “Prior-free data acquisition for accurate statistical estimation.” In *Proceedings of the 2019 ACM Conference on Economics and Computation*. 2019.
- Sarah Connor Gorber, Mark S. Tremblay, David Moher, and B. Gorber (2007). “A comparison of direct vs. self-report measures for assessing height, weight and body mass index: a systematic review.” *Obesity Reviews*, 8(4), 307-326.
- Roger Myerson. “Optimal auction design.” *Mathematics of Operations Research*, 6(1), 58-73. 1981.
- Aaron Roth and Grant Schoenebeck. “Conducting truthful surveys, cheaply.” In *Proceedings of the 2012 ACM Conference on Electronic Commerce*. 2012.
- Margot Shields, Sarah Connor Gorber, Ian Janssen, and Mark S. Tremblay (2011). “Bias in self-reported estimates of obesity in Canadian health surveys: an update on correction equations for adults.” *Health Reports*, 22(3), 35.


———–

Dear colleagues,

We invite you to nominate speakers for TCS Women Rising Star talks at STOC 2020, which are planned as part of our virtual TCS Women Spotlight Workshop. To be eligible, your nominee has to be a female or a minority researcher working in theoretical computer science (all topics represented at STOC are welcome) and has to be a graduating PhD student or a postdoc. You can make your nomination by filling out this form by May 28th:

STOC 2020 workshops will happen between June 23 and 25, with exact day/time TBD.

You can see the list of speakers from last year here:

Looking forward to your nominations and to seeing you at our TCS Women Spotlight Workshop,

Barna Saha, Virginia Vassilevska Williams, and Sofya Raskhodnikova

But this is also an example of the gravity of decisions by researchers and software developers. Taking it to the extreme, imagine a predictor that is used to determine which patients are denied treatment in an overwhelmed hospital. The booming research area of algorithmic fairness sees a very short turnover from research ideas (in many areas) to deployment. In an ideal world, it would have been much better to first have a couple of decades to develop the computational foundations of algorithmic fairness before the practical need arose. But in the real world, the huge scale of algorithmic decision making creates immense demand for solutions. Industry, as well as policy and law makers, are unlikely to wait decades or even years, nor is it clear that they should. From my perspective, this reality underscores the urgency for *principled* and *deliberate* research – rather than *hasty* research – continuously developing the foundations of algorithmic fairness and offering answers to real-world challenges.

There are plenty of resources about research talks, and mostly they emphasize form over matter. How many words in a slide? How many slides in a talk? How to and how not to use font colors? How to and how not to use animation? And so on. While all of these are important, I find that the failing of many research talks is on a much more basic level.

Think back to a research talk you heard recently, or to one you heard a few months ago. You may remember how you felt and what you thought of the talk, but what do you remember of it in terms of content? Most of us will find that we don’t remember much; I rarely do. Yet in our presentations, we often follow a research-paper-like mold and squeeze in many little details that are somehow important to us, forgetting that they will all vanish from our audience’s memory soon after (or be completely missed in the first place). Giving a talk (writing a paper, writing a blog post, etc.) is about communication: who is your audience? What are the limitations of the medium? What is the message you want to convey? Since so little stays with the audience long term, it makes sense to ensure that this little is what seems most important for you to convey.

The idea I am promoting here is not new, and there are various techniques towards this goal. One (which I think Oded Goldreich shared with me) is to think of the audience’s attention as a limited currency. Whenever you share a big idea you spend a big token, and other ideas cost smaller tokens. Imagine you have one or two big tokens and a few smaller tokens. Another approach emphasizes the notion of **a premise**. The idea is that a talk needs a premise, and this should be the title of the talk. Furthermore, every slide needs a premise, and it should be the title of the slide. A premise is a main idea, expressed as a complete sentence. It is not unusual to find a slide titled “Analysis” or “Efficiency,” but neither of these is a premise; “Problem X has an efficient algorithm” could be. The talk’s premise can help you distill what you want the audience to take away from your talk. It also helps shape the talk, as everything that doesn’t serve the premise shouldn’t be there. Note that each paper can provoke many different premises, and thus many different talks.

Here I want to play with a different idea that I find intriguing, even if it may seem a bit extreme. It will not be controversial that a good talk (and paper) tells a story. After all, humans understand and remember narratives. But could we take inspiration from the form of storytelling in fiction writing? A vast literature classifies different kinds of stories and explores their templates (see for example this short discussion). Can we find analogues to these types in scientific research talks?

The type of story that is easiest to relate to is the **Quest/Hero’s Journey** (think Lord of the Rings). These have several distinct ingredients: a call to adventure, tests, allies, enemies, ordeal, reward, victorious return. Some research talks that follow this template do it well and preserve a sense of suspense and excitement; others seem like a long list of problems and the tricks that the work uses to handle them.

I believe that many other story templates can find analogues in research talks as well. Here are my initial attempts:

- **Coming of age** stories: this area of research previously had only naive ideas, but this work brings significant depth.
- **The Underdog** (think David and Goliath): a modest technique that conquered a great challenge.
- **Rags to Riches** (think the Ugly Duckling): an area or technique that was previously unsuccessful proves powerful. Similarly: **Rebirth** (reinvention, renewal).
- **Comedy** (or the Clarity Tale): conceptual works shedding a new perspective.
- **Tragedy** (or the Cautionary Tale): some impossibility results come to mind (couldn’t we view Arrow’s impossibility theorem as tragic?).
- **Redemption stories**: the field so far has missed the point, or was misleading or harmful, but this work makes amends.

Can you suggest papers and a story type that could fit them?
