Crucial decisions are increasingly being made by automated machine learning algorithms. These algorithms rely on data, and without high quality data, the resulting decisions may be inaccurate and/or unfair. In some cases, data is readily available: for example, location data passively collected by smartphones. In other cases, data may be difficult to obtain by automated means, and it is necessary to directly survey the population.

However, individuals are not always motivated to take surveys if they receive no benefit. Offering a monetary reward may incentivize some individuals to participate, but there is a problem with this approach: what if an individual’s data is correlated with their willingness to take the survey?

For concreteness, imagine that you are a health administrator trying to estimate the average weight in a population. This is a sensitive attribute that individuals may be reluctant to disclose, especially if their weight is not considered healthy. A generic survey may yield disproportionately more respondents with “healthy” weights, and thus may result in an inaccurate estimate (see, e.g., Shields et al., 2011).

In this post, we discuss three papers which propose solutions to this problem through the lens of *mechanism design*. The idea is to carefully design payments so that we receive an unbiased sample, leading to a hopefully accurate estimate.

We use $x_i$ to denote agent $i$’s data (e.g., her weight). We assume that each agent also has a personal cost $c_i$, representing her level of reluctance to reveal her data. Agent $i$ is willing to reveal $x_i$ if and only if she receives a payment of at least $c_i$. Our goal is to allocate higher payments to agents with higher $c_i$’s, in order to get an unbiased sample. However, we must also obey a budget constraint: we cannot spend more than $B$ in total. The solution is to transact non-deterministically: with some probability, offer to purchase an agent’s data. Agents with higher costs will receive higher payments, but lower transaction probabilities.

We assume that agents are drawn independently at random from some distribution. Our crucial assumption is that we know the marginal distribution of agent costs (we will explore later what happens when this assumption is removed). However, we do not know the distribution of the data $x_i$, and that distribution can be arbitrarily correlated with $c_i$. As mentioned above, one might expect agents with less “desirable” $x_i$’s to have higher costs, but one can imagine more complex correlations as well.

Our mechanisms consist of two parts: an *allocation rule* $A$, and a *payment rule* $P$. Given $A$ and $P$, the mechanism works as follows:

- Ask each agent to report $c_i$. Let $\hat{c}_i$ denote the actual reported cost.
- With probability $A(\hat{c}_i)$, we purchase the agent’s data and pay her $P(\hat{c}_i)$. With probability $1 - A(\hat{c}_i)$, we do not buy the data, and no payment is made.
- At the end, use the data we learned to form an estimate of the population average of the $x_i$’s. Let $\hat{\mu}$ denote our estimate.
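
The steps above can be sketched as a minimal Python simulation. The allocation and payment rules here are placeholders chosen for readability, not the optimal rules from the papers discussed below:

```python
import random

def run_mechanism(agents, allocation, payment, seed=0):
    """Simulate one run of the survey mechanism.

    agents:      list of (x, c) pairs -- data value and (truthfully reported) cost.
    allocation:  A(c), the probability of purchasing data at reported cost c.
    payment:     P(c), the payment made when we do purchase.
    Returns the purchased (x, c) records and the total amount spent.
    """
    rng = random.Random(seed)
    purchased, spent = [], 0.0
    for x, c in agents:
        if rng.random() < allocation(c):   # transact non-deterministically
            purchased.append((x, c))
            spent += payment(c)
    return purchased, spent

# Placeholder rules: higher cost -> lower purchase probability, and we
# simply pay the reported cost (so honest agents never lose utility).
A = lambda c: 1.0 / (1.0 + c)
P = lambda c: c

agents = [(150.0, 0.1), (230.0, 2.0), (180.0, 0.5)]  # (weight, cost) pairs
records, spent = run_mechanism(agents, A, P)
```

Note how the high-cost agent is the least likely to be purchased from, which is exactly the tension the optimal rules below must manage.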

In this model, agent $i$’s expected utility for reporting $\hat{c}_i$ is $A(\hat{c}_i)\big(P(\hat{c}_i) - c_i\big)$.

We have four main requirements:

- **Truthfulness.** It should be in each agent’s best interest to truthfully report $c_i$.
- **Individual rationality.** Agents should not receive negative utility if they are honest, i.e., we should have $A(c_i)\big(P(c_i) - c_i\big) \ge 0$ for all $c_i$.
- **Budget constrained.** Our total expected payment should not exceed $B$, i.e., $\mathbb{E}\big[\sum_i A(\hat{c}_i) P(\hat{c}_i)\big] \le B$.
- **Unbiased.** Our estimate isn’t consistently too high or too low. Specifically, the expected value of our estimate should be equal to the true average, i.e., $\mathbb{E}[\hat{\mu}] = \frac{1}{n}\sum_i x_i$.

Lack of bias doesn’t mean that our estimate is accurate, however. To this end, our primary goal is to **minimize the variance**, subject to the mechanism obeying the four above criteria. We evaluate variance via a worst-case framework: given a mechanism, we wish to minimize the variance with respect to the worst-case distribution of agents for that mechanism. The idea is that the distribution of $(x_i, c_i)$ pairs is not known to the mechanism, so we require it to perform well for all distributions.

When we refer to the “optimal” mechanism, we mean minimum variance, subject to being truthful, individually rational, budget constrained, and unbiased (henceforth TIBU).

**The Horvitz-Thompson Estimator**

Once we have learned the purchased $x_i$’s, how do we actually form an estimate of the mean? Luckily for us, this question has a simple answer. If we restrict ourselves to linear unbiased estimators, there is a unique way to do this, known as the *Horvitz-Thompson estimator*: each purchased value is weighted inversely by its purchase probability,

$$\hat{\mu} = \frac{1}{n} \sum_{i \text{ purchased}} \frac{x_i}{A(\hat{c}_i)}.$$

Thus our task is simply to choose $A$ and $P$.
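
To see the estimator in action, here is a small Python sketch (again with a placeholder allocation rule, not an optimal one) that checks unbiasedness empirically by averaging the Horvitz-Thompson estimate over many simulated runs:

```python
import random

def horvitz_thompson(purchased, A, n):
    """Horvitz-Thompson estimate of the population mean: each purchased
    value x is inverse-weighted by its purchase probability A(c)."""
    return sum(x / A(c) for x, c in purchased) / n

A = lambda c: 1.0 / (1.0 + c)                    # placeholder allocation rule
agents = [(150.0, 0.1), (230.0, 2.0), (180.0, 0.5)]
true_mean = sum(x for x, _ in agents) / len(agents)

rng = random.Random(1)
trials = 100_000
total = 0.0
for _ in range(trials):
    bought = [(x, c) for x, c in agents if rng.random() < A(c)]
    total += horvitz_thompson(bought, A, len(agents))
avg_estimate = total / trials   # close to true_mean, by unbiasedness
```

The average over runs lands near the true mean for *any* allocation rule with nonzero purchase probabilities; what the choice of rule changes is the variance.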

This model was first considered by Roth and Schoenebeck (2012). They characterize a mechanism which is TIBU and whose variance exceeds the optimal variance by at most a small additive term depending on the number of agents $n$. However, they do make the strong assumption that $x_i$ is either 0 or 1.

Their approach relies on *Take-It-Or-Leave-It* mechanisms. Such a mechanism is defined by a distribution $G$ over the positive real numbers, and works as follows:

- Each agent reports a cost $\hat{c}_i$.
- Sample a payment $p$ from $G$.
- If $p \ge \hat{c}_i$, buy the agent’s data with payment $p$. If $p < \hat{c}_i$, do not buy the agent’s data.

This amounts to an allocation rule $A(c) = \Pr_{p \sim G}[p \ge c]$, and a payment rule given by the distribution of $p \sim G$ conditioned on $p$ being at least the reported cost. The authors show that these mechanisms are fully general, i.e., any allocation and payment rule can be implemented by a Take-It-Or-Leave-It mechanism.
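
A quick sketch of how a Take-It-Or-Leave-It mechanism induces an allocation rule: with $G = \mathrm{Uniform}[0,1]$ (a choice made here purely for illustration), the probability of a transaction at reported cost $c$ is $\Pr[p \ge c] = 1 - c$, which we can verify by simulation:

```python
import random

def take_it_or_leave_it(c_hat, sample_G, rng):
    """One Take-It-Or-Leave-It interaction: draw a price p from G and
    transact iff p >= the reported cost. Returns (bought, payment)."""
    p = sample_G(rng)
    return (True, p) if p >= c_hat else (False, 0.0)

sample_G = lambda rng: rng.random()   # G = Uniform[0, 1]
rng = random.Random(0)
c_hat, trials = 0.3, 100_000
bought = sum(take_it_or_leave_it(c_hat, sample_G, rng)[0] for _ in range(trials))
frequency = bought / trials           # induced A(0.3) = 1 - 0.3 = 0.7
```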

The proof of their main result is primarily based on using the calculus of variations to optimize over the space of distributions $G$. The paper contains some additional results, for example regarding an alternate model where we wish to minimize the budget, but that is outside the scope of this blog post.

Although the above result is a great step, it leaves room for improvement. First of all, the assumption that $x_i$ is binary is quite strong, and does not apply to our running example of body weight. Secondly, their mechanism does not quite achieve the optimal variance. Chen et al. (2018) remedy both of these concerns. That is, they allow $x_i$ to be any real number, and they characterize the TIBU mechanism with optimal variance. Their result also generalizes to more complex statistical estimates, not just the average, and it holds for both continuous and discrete agent distributions.

The approach of Chen et al. (2018) is based on two primary ideas. First, they show that any monotone allocation rule (i.e., we are always less likely to purchase data from an agent with higher cost) can be implemented in a TIBU fashion by a unique payment rule. Thus we only need to identify the optimal allocation rule. (This is similar to the standard result from auction theory about implementable monotone allocation rules (Myerson 1981).)

The second idea is to view the problem as a zero-sum game between ourselves (the mechanism designer) and an adversary who chooses the distribution of agents. Given a distribution, we choose an allocation rule to minimize the variance, and given an allocation rule, the adversary chooses a distribution to maximize the variance.

The authors are able to solve for the equilibrium of this game and thus identify the TIBU mechanism with minimum possible variance.

Approach #2 gave us our desired result: a minimum variance mechanism subject to our four desired properties (TIBU), for any distribution of $(x_i, c_i)$ pairs. But we are still making a very strong assumption: that we know the distribution of agent costs.

Chen and Zheng (2019) do away with this assumption in a follow-up paper. They consider a model where the mechanism has no prior information on the distribution of costs (or on the distribution of data), and $n$ agents arrive one-by-one in a uniformly random order. Each agent reports a cost, and we decide whether to buy her data, and what to pay her. In order to price well, we need to learn the cost distribution, but we must do this while simultaneously making irrevocable purchasing decisions. The main result is a TIBU mechanism with variance at most a constant factor worse than optimal.

The authors note that after each step $t$, the reported costs up to that point induce an empirical cost distribution. Using the results of Chen et al. (2018), we can determine the optimal mechanism for that empirical distribution. The basic idea is to use that mechanism for the current step, learn a new agent cost (note that the agent reports her cost regardless of whether we purchase her data), and then update our empirical distribution accordingly. (The authors actually end up using an approximately optimal allocation rule, but the idea is the same.) The mechanism also uses more budget in the earlier rounds, to make up for the pricing being less accurate.

In this post, we considered the problem of surveying a sensitive attribute where an agent’s data may be correlated with their willingness to participate. We discussed three different approaches, all of which rely on giving higher payments to agents with higher costs, in order to incentivize them to participate and to obtain an unbiased estimate. The final approach was able to give a truthful, individually rational, budget feasible, and unbiased mechanism with approximately optimal variance, without making any prior assumptions on the distribution of agents.

However, all three of the approaches assume that agents cannot lie about the data. This is reasonable for some attributes, such as body weight, where an agent can be asked to step onto a physical scale. However, requiring participants to come in person to a particular location will certainly lead to less engagement. Furthermore, for other sensitive attributes, there may not be a verifiable way to obtain the data. Future work could investigate alternative models where this assumption is not necessary. For example, perhaps agents do not maliciously lie, but rather are simply inaccurate at reporting their own attributes: research has demonstrated that people consistently over-report height and under-report weight (e.g., Gorber et al., 2007). Could a mechanism learn the pattern of inaccuracy and compensate for it to still obtain an unbiased estimate?

- Yiling Chen, Nicole Immorlica, Brendan Lucier, Vasilis Syrgkanis, and Juba Ziani. “Optimal data acquisition for statistical estimation.” In *Proceedings of the 2018 ACM Conference on Economics and Computation*. 2018.
- Yiling Chen and Shuran Zheng. “Prior-free data acquisition for accurate statistical estimation.” In *Proceedings of the 2019 ACM Conference on Economics and Computation*. 2019.
- Sarah Connor Gorber, Mark S. Tremblay, David Moher, and B. Gorber (2007). “A comparison of direct vs. self-report measures for assessing height, weight and body mass index: a systematic review.” *Obesity Reviews*, 8(4), 307–326.
- Roger Myerson (1981). “Optimal auction design.” *Mathematics of Operations Research*, 6(1), 58–73.
- Aaron Roth and Grant Schoenebeck. “Conducting truthful surveys, cheaply.” In *Proceedings of the 2012 ACM Conference on Electronic Commerce*. 2012.
- Margot Shields, Sarah Connor Gorber, Ian Janssen, and Mark S. Tremblay (2011). “Bias in self-reported estimates of obesity in Canadian health surveys: an update on correction equations for adults.” *Health Reports*, 22(3), 35.


———–

Dear colleagues,

We invite you to nominate speakers for TCS Women Rising Star talks at STOC 2020, which are planned as part of our virtual TCS Women Spotlight Workshop. To be eligible, your nominee has to be a female or a minority researcher working in theoretical computer science (all topics represented at STOC are welcome) and has to be a graduating PhD student or a postdoc. You can make your nomination by filling out this form by May 28th:

STOC 2020 workshops will happen between June 23 and 25, with exact day/time TBD.

You can see the list of speakers from last year here:

Looking forward to your nominations and to seeing you at our TCS Women Spotlight Workshop,

Barna Saha, Virginia Vassilevska Williams, and Sofya Raskhodnikova

But this is also an example of the gravity of decisions by researchers and software developers. Taking it to the extreme, imagine a predictor that is used to determine which patients are denied treatment in an overwhelmed hospital. The booming research area of algorithmic fairness sees a very short turnover from research ideas (in many areas) to deployment. In an ideal world, it would have been much better to first have a couple of decades to develop the computational foundations of algorithmic fairness before the practical need arose. But in the real world, the huge scale of algorithmic decision making creates immense demand for solutions. Industry, as well as policymakers and lawmakers, are unlikely to wait decades or even years, nor is it clear that they should. From my perspective, this reality underscores the urgency for *principled* and *deliberate* research – rather than *hasty* research – continuously developing the foundations of algorithmic fairness and offering answers to real-world challenges.

There are plenty of resources about research talks, and mostly they emphasize form over matter. How many words in a slide? How many slides in a talk? How to and how not to use font colors? How to and how not to use animation? And so on. While all of these are important, I find that the failing of many research talks is on a much more basic level.

Think back to a research talk you heard recently, or to one you heard a few months ago. You may remember how you felt and what you thought of the talk, but what do you remember of this talk in terms of content? Most of us will find that we don’t remember much; I rarely do. Yet in our presentations, we often follow a research-paper-like mold and squeeze in many little details that are somehow important to us, forgetting that they will all vanish from our audience’s memory soon after (or were completely missed in the first place). Giving a talk (writing a paper, writing a blog post, etc.) is about communication: who is your audience? What are the limitations of the medium? What is the message you want to convey? Since so little stays with the audience long term, it makes sense to make sure that this little will be what seems most important for you to convey.

The idea I am promoting here is not new, and there are various techniques towards this goal. One technique (which I think Oded Goldreich shared with me) is to think of the audience’s attention as a limited currency. Whenever you share a big idea you spend a big token, and other ideas cost smaller tokens. Imagine you have one or two big tokens and a few smaller tokens. Another approach emphasizes the notion of **a premise**. The idea here is that a talk needs a premise, and this should be the title of the talk. Furthermore, every slide needs a premise, and it should be the title of the slide. A premise is a main idea and is a complete sentence. It is not unusual to find a slide titled “Analysis” or “Efficiency”, but neither of these is a premise. “Problem X has an efficient algorithm” could be. The talk’s premise can help you distill what you want the audience to take out of your talk. It also helps shape the talk, as everything that doesn’t serve the premise shouldn’t be there. Note that each paper can provoke many different premises and thus many different talks.

Here I want to play with a different idea that I find intriguing, even if it may seem a bit extreme. It will not be controversial that a good talk (and paper) tells a story. After all, humans understand and remember narratives. But could we take inspiration from the form of storytelling in fiction writing? A vast literature classifies different kinds of stories and explores their templates (see for example this short discussion). Can we find analogues to these types in scientific research talks?

The type of story that is easiest to relate to is the **Quest/Hero’s Journey** (think Lord of the Rings). These have several distinct ingredients: a call to adventure, tests, allies, enemies, ordeal, reward, victorious return. Some research talks that follow this template do it well and preserve a sense of suspense and excitement, others seem like a long list of problems and the tricks that the work uses to handle them.

I believe that many other story templates can find analogues in research talks as well. Here are my initial attempts:

- **Coming of age** stories – this area of research previously had only naive ideas, but this work brings significant depth.
- **The Underdog** (think David and Goliath): a modest technique that conquered a great challenge.
- **Rags to Riches** (think the Ugly Duckling): an area or technique that was not successful proves powerful.
- Similarly: **Rebirth** (reinvention, renewal).
- **Comedy** (or the Clarity Tale) – conceptual works shedding a new perspective.
- **Tragedy** (or the Cautionary Tale) – some impossibility results come to mind (couldn’t we view Arrow’s impossibility theorem as being tragic?).
- **Redemption stories**: the field so far has missed the point, was misleading or harmful, but this work makes amends.

Can you suggest papers and a story type that could fit them?

One of the great unsung perks of being a college student is having access to the university library. There is something thrilling about hunting down exactly the right reference deep in the stacks, or reading through the archived papers of a public figure from years back.

The pandemic has closed all of our libraries for the time being. Even so, through the fruits of computer science—databases, the Internet, e-readers, and so on—we can get access to much of the same information even when we are cooped up at home.

But for me, one of the true pleasures of using a library is the fact that I can browse through any book I want in complete privacy. If I want to go up to the stacks and read about tulip gardening, or road-bike maintenance, or strategies for managing anxiety, I can do that pretty much without anyone else knowing.

In contrast, if I go online today and search for “tulip gardening,” Google will take careful note of my interest in tulips and I will be seeing ads about gardening tools for months.

An ideal digital library would let us download and read books without anyone—not even the library itself—learning which books we are reading. How could we build such a privacy-respecting digital library?

In this post, we will discuss the private-library problem and how our recent work on private information retrieval might be able to help solve it.

Let us define the problem a little more precisely. We will imagine a protocol running between a library, which holds the books, and a student, who wants to download a particular book.

Say that the library has $n$ books—let’s call them $x_1, \dots, x_n$. To keep things simple, let’s pretend that each book consists of just a single bit of information, so $x_i \in \{0, 1\}$ for all $i$.

The student starts out holding the index $i \in \{1, \dots, n\}$ of her desired book. To fetch the digital book from the library, the student and library exchange some messages. At the end of the interaction, we want the following two properties to hold:

- **Correctness.** The student should have her desired book (i.e., the bit $x_i$).
- **Privacy.** The library should not have learned any information, in a cryptographic sense, about which book the student downloaded.

Of course, we have grossly simplified the problem: a real book is more than a single bit in length, book titles are not consecutive integers, maybe the student would like to find a book using a keyword search, etc. But even this simplified private-information-retrieval problem, which Chor, Goldreich, Kushilevitz, and Sudan introduced in the 90s, is already interesting enough.

There is a simple solution to this problem: the student can just ask the library to send her the contents of all books. This solution achieves both correctness and privacy, so what’s the problem? Are we done?

Well, there are two problems:

- The amount of **communication** is large: just to read a single book, the client must download the contents of the entire library! So this is terribly inefficient.
- The amount of **computation** is large: just to fetch a single book, the library must do work proportional to the size of the entire library. So “checking out” a book from this digital library will take a long time.

Research on private information retrieval typically focuses on the first problem: how can we reduce the *communication* cost? Using a variety of clever techniques, it is possible to drive down the communication cost to something very small—sub-polynomial or even logarithmic in the library size $n$.

But today we are interested in the *computational* burden on the library. Is there any way that the student can privately download a book from the library while requiring the library to do only sublinear work in the process?

To have both correctness and privacy, it seems that the library needs to touch each of the $n$ books in the process of responding to each student’s request. And, in some sense, this is true. So, to allow the library to run in time sublinear in $n$, we will have to tweak the problem slightly.

Our idea is to have the library do the linear-time computation in an **offline phase**, which takes place *before* the student decides which book she wants to read. For example, this offline phase might happen overnight while the library’s servers would otherwise be idle.

Later on, once the student decides which book in the library she wants to read, the student and library can run a sublinear-time **online phase** in which the student is able to retrieve her desired book. The total communication cost, in both offline and online phases, will be sublinear in $n$ as well.

So, by pushing the library’s expensive linear scan to an offline phase, the library can service the student’s request for a book in sublinear online time.

Let’s see how to construct such an offline/online scheme. To make things simple for the purposes of this post, let’s assume that the student has access to two non-colluding libraries that hold the same set of books. To be concrete, let’s call the two libraries “Stanford” and “Berkeley.”

The privacy property will hold as long as the librarians at Stanford and Berkeley don’t get together and share the information that they learned while running the protocol with the student. So Stanford and Berkeley here are “non-colluding.” (Equivalently, our scheme protects privacy against an adversary that controls one of the two libraries—but not both.)

Now, let’s describe an offline/online protocol by which the student can privately fetch a book from the digital library:

**Offline Phase.**

- The student partitions the integers $\{1, \dots, n\}$ into $\sqrt{n}$ non-overlapping sets chosen at random, where each set has size $\sqrt{n}$. Call these sets $S_1, \dots, S_{\sqrt{n}}$.
- The student sends these sets to Stanford (the first library). To reduce the communication cost here, the student can compress these sets using pseudorandomness.
- For each set $S_k$, the Stanford library computes the *parity* of all of the books indexed by $S_k$ and returns the $\sqrt{n}$ parity bits to the student. In other words, if the books are $x_1, \dots, x_n \in \{0, 1\}$, then the $k$-th parity bit is $h_k = \bigoplus_{i \in S_k} x_i$.

The total communication in this phase is only $O(\sqrt{n})$ bits, and the student and the Stanford library can run this step *before* the student decides which book she wants to read.

**Online Phase.** Once the student decides that she wants to read book $i$, the student and Berkeley (the second library) run the following steps:

- The student finds the set $S_j$ that contains the index $i$ of her desired book.
- The student flips a coin that is weighted to come up heads with some probability $q$, to be fixed later.
- If the coin lands heads:
  - The student sends the set $S_j \setminus \{i\}$ to the Berkeley library.
- If the coin lands tails:
  - The student samples a uniformly random index $r$ from $S_j \setminus \{i\}$.
  - The student sends the set $S_j \setminus \{r\}$ to the Berkeley library.
- The Berkeley library receives the set from the student and returns the contents of all books whose indices appear in that set.
- Now, the student can recover her desired book as follows:
  - If heads: the student now has the parity of the books in $S_j$ (from the offline phase) and the value of all books in $S_j$ other than book $i$. This is enough to recover the contents of book $i$.
  - If tails: the set sent to Berkeley contains $i$, so the Berkeley library has sent the contents of book $i$ to the student in the online phase.

Even before we fix the weight of the coin, we see that the protocol satisfies **correctness**, since no matter how the coin lands the student recovers her desired book. Also, the total communication cost is $O(\sqrt{n})$ bits, which is sublinear as we had hoped. Finally, the **online computation cost** is also sublinear: the Berkeley library just needs to return $\sqrt{n} - 1$ books to the student, which it can do in time roughly $\sqrt{n}$.
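
The offline and online phases above can be sketched end-to-end in Python. This is a toy simulation of the two-library scheme (with both libraries’ roles played locally), checking that the student always recovers her bit:

```python
import math, random

def toy_pir(i, books, rng):
    """Recover books[i] (0-based index) via the two-library toy scheme.

    Offline (Stanford's role): random partition into sqrt(n) sets of size
    sqrt(n), plus the parity of each set. Online (Berkeley's role): answer
    a single size-(sqrt(n)-1) query. Assumes n is a perfect square."""
    n = len(books)
    k = int(math.isqrt(n))
    idx = list(range(n))
    rng.shuffle(idx)
    sets = [idx[a * k:(a + 1) * k] for a in range(k)]
    parities = [0] * k
    for a, S in enumerate(sets):
        for t in S:
            parities[a] ^= books[t]                  # Stanford's parity bits
    j = next(a for a, S in enumerate(sets) if i in S)
    if rng.random() < 1 - (k - 1) / n:               # heads
        query = [t for t in sets[j] if t != i]       # send S_j minus i
        bit = parities[j]
        for t in query:
            bit ^= books[t]                          # XOR out Berkeley's answers
        return bit
    r = rng.choice([t for t in sets[j] if t != i])   # tails
    query = [t for t in sets[j] if t != r]           # query contains i
    return books[i]                                  # i's bit arrives directly

rng = random.Random(0)
books = [rng.randrange(2) for _ in range(25)]        # n = 25 one-bit books
assert all(toy_pir(i, books, rng) == books[i] for i in range(25))
```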

The last matter to address is **privacy**. Again, we are assuming that the adversary controls only one of the two libraries.

- In the offline phase, the student’s message to the Stanford library is independent of the index $i$, so the protocol is perfectly private with respect to Stanford.
- In the online phase, we must be more careful. It turns out that if we choose the weight of the coin as $q = 1 - (\sqrt{n}-1)/n$, then the set that the student sends to UC Berkeley in the online phase is just a uniformly random size-$(\sqrt{n}-1)$ subset of $\{1, \dots, n\}$.
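
We can also spot-check the privacy claim numerically: with the coin weighted this way, the query should contain any fixed index with probability $(\sqrt{n}-1)/n$, no matter which book the student actually wants. A small Monte Carlo sketch:

```python
import math, random

def online_query(i, n, rng):
    """Generate only the online query for desired index i (0-based):
    partition at random, find i's set, flip the weighted coin."""
    k = int(math.isqrt(n))
    idx = list(range(n))
    rng.shuffle(idx)
    sets = [idx[a * k:(a + 1) * k] for a in range(k)]
    S = next(S for S in sets if i in S)
    if rng.random() < 1 - (k - 1) / n:        # heads: omit i itself
        return [t for t in S if t != i]
    r = rng.choice([t for t in S if t != i])  # tails: omit a random other index
    return [t for t in S if t != r]

# For n = 16 the query should contain index 0 with probability 3/16 = 0.1875,
# whether the student wants book 3 or book 7.
rng = random.Random(0)
n, trials = 16, 100_000
freqs = [sum(0 in online_query(i, n, rng) for _ in range(trials)) / trials
         for i in (3, 7)]
```

Both empirical frequencies land near $3/16$, which is exactly what Berkeley would see if the query were a uniformly random size-3 subset chosen with no reference to $i$.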

So, the student can privately fetch a book from our digital libraries in sublinear online time. What else is left to do?

- Getting rid of the need for two non-colluding libraries is a clear next step. Our work has some results along these lines, but they pay a price either in (a) asymptotic efficiency or in (b) the strength of the cryptographic assumptions required.
- A beautiful paper of Beimel, Ishai, and Malkin shows that if the library can store its collection of books using a special type of error-correcting encoding, the **total** computational time at the libraries (not just the online time) can be sublinear in $n$. As far as we know, these schemes are not concretely efficient enough to use in practice. Could they be made so?
- Privacy is just one of the many pleasures of using a physical library. During this period of confinement, I also miss the smell of the books, the beauty of light filtering through the stacks, and the peacefulness of thinking in a study carrel. Can a digital library ever give us these things too?

If any of these questions catch your fancy, please check out our Eurocrypt paper for more background, pointers, and results.

Don Knuth has reportedly said “Using a great library to solve a specific problem… Now *that* […] is real living.” With better digital libraries, maybe we could all live a little bit more during these challenging days.

Unfortunately, but unsurprisingly, the conference will be virtual this year. But I’m sure that, thanks to Aaron Roth and to the inaugural PC, we will make the best of what’s possible and have a great event.

In this post, we’ll discuss four works about secure distributed computation. First, we’ll talk about a method of using MDS (maximum distance separable) error correcting codes to add security and privacy to general data storage (“Cross Subspace Alignment and the Asymptotic Capacity of X-Secure T-Private Information Retrieval” by Jia, Sun, Jafar).

Then we’ll discuss a method of adapting a coding strategy for straggler mitigation in matrix multiplication (“Polynomial codes: an optimal design for high-dimensional coded matrix multiplication” by Yu, Maddah-Ali, and Avestimehr) to instead add security (“On the capacity of secure distributed matrix multiplication” by Chang and Tandon) or privacy (“Private Coded Matrix Multiplication” by Kim, Yang, and Lee).

Throughout this post we will use variations on the following communication model:

The data in the grey box is only given to the master, so workers only have access to what they receive (via green arrows). Later on we will also suppose the workers have a shared library not available to the master. The workers do not communicate with each other as part of the computation, but we want to prevent them from figuring out anything about the data if they do talk to each other.

This model is related to *private computation* but not exactly the same. We assume the servers are “honest but curious”, meaning they won’t introduce malicious computations. We also only require the master to receive the final result, and don’t need to protect any data from the master. This is close to the BGW scheme ([Ben-Or, Goldwasser, Wigderson ’88]), but we do not allow workers to communicate with each other as part of the computation of the result.

We consider *unconditional* or *information-theoretic* security, meaning the data is protected even if the workers have unbounded computational power. Furthermore, we will consider having *perfect secrecy*, in which the mutual information between the information revealed to the workers and the actual messages is zero.

Before we get into matrix-matrix multiplication, consider the problem of storing information on the workers to be retrieved by the master, such that it is “protected.” What do we mean by that? [Jia, Sun, and Jafar ’19] define X-secure T-private information retrieval as follows:

Let $W_1, \dots, W_K$ be a data set of $K$ messages, such that each $W_k$ consists of $L$ random bits. A storage scheme of the messages on $N$ nodes is

1. **X-secure** if any set of up to $X$ servers cannot determine anything about any message $W_k$, and
2. **T-private** if, given a query from the user to retrieve some data element $W_\theta$, any set of up to $T$ servers cannot determine the value of $\theta$. [Jia, Sun, and Jafar ’19]

Letting $Q_1, \dots, Q_N$ be the queries sent to the nodes and $S_1, \dots, S_N$ be the information stored on the nodes (all vectors of length $L$), we depict this as:

The information-theoretic requirements for this system to be correct can be summarized as follows (using the notation $[N]$ for the set $\{1, \dots, N\}$, writing $\theta$ for the index of the desired message and $A_n$ for node $n$’s answer):

| Property | Information-Theoretic Requirement |
|---|---|
| Data messages are size $L$ bits | $H(W_k) = L$ for all $k$ |
| Data messages are independent | $H(W_1, \dots, W_K) = KL$ |
| Data can be determined from the stored information | $H(W_1, \dots, W_K \mid S_1, \dots, S_N) = 0$ |
| User has no prior knowledge of server data | $I(S_{[N]}; \theta, Q_{[N]}) = 0$ |
| X-security | $I(S_{\mathcal{X}}; W_1, \dots, W_K) = 0$ for all $\mathcal{X} \subseteq [N]$ with $\lvert\mathcal{X}\rvert \le X$ |
| T-privacy | $I(Q_{\mathcal{T}}; \theta) = 0$ for all $\mathcal{T} \subseteq [N]$ with $\lvert\mathcal{T}\rvert \le T$ |
| Nodes answer only based on their data and received query | $H(A_n \mid S_n, Q_n) = 0$ |
| User can decode desired message from answers | $H(W_\theta \mid A_{[N]}, Q_{[N]}, \theta) = 0$ |

Given these constraints, Jia et al. give bounds on the capacity of the system. Capacity is the maximum achievable rate, where rate is defined as the number of bits requested by the user ($L$, the length of a single message) divided by the number of bits downloaded by the user. The bounds are in terms of the capacity of T-private information retrieval (which corresponds to the above definition with only requirement 2).

For example, when $N > X + T$, the asymptotic capacity (as the number of messages $K \to \infty$) is $1 - (X + T)/N$, and the paper also characterizes the capacity in the degenerate regime $N \le X + T$ for arbitrary $K$. [Jia, Sun, and Jafar ’19]

Jia et al. give schemes that achieve these bounds while preserving the privacy and security constraints by introducing random noise vectors into how data is stored and queries are constructed. The general scheme for $N > X + T$ uses *cross subspace alignment*, which essentially chooses how to construct the stored information and the queries such that the added noise mostly “cancels out” when the master combines all the responses from the servers. The scheme for the degenerate regime $N \le X + T$ is straightforward to explain, and demonstrates the idea of using error correcting codes that treat the random values as the message and the actual data as the “noise.”

For this scheme, the message length is set to $L = N - X$ (the number of nodes $N$, minus the security parameter $X$). First, we generate $K$ random bit vectors $Z_1, \dots, Z_K$, each of length $X$. Next, apply an $(N, X)$ MDS code to each $Z_k$ to get $\bar{Z}_1, \dots, \bar{Z}_K$, which are encoded vectors of length $N$. For our data $W_1, \dots, W_K$, we pad each vector with $X$ leading zeros to get $\tilde{W}_1, \dots, \tilde{W}_K$, each of length $N$. Now that the dimensions line up, we can add the two together, arranging the sums $\bar{Z}_k + \tilde{W}_k$ as the rows of a $K \times N$ matrix and storing the $n$-th column at node $n$.

To access the data, the user downloads all $KN$ bits. The length-$N$ string downloaded from row $k$ can be used to decode $W_k$: the first $X$ entries of $\tilde{W}_k$ are all zero, so columns $1$ through $X$ of row $k$ contain the values of $\bar{Z}_k$ alone. This gives the user $X$ values from the MDS code used on each row, so they can decode and recover $Z_k$, and hence all of $\bar{Z}_k$. Then a subtraction from the downloaded data gives $W_k$. Because of the MDS property of the code used to get $\bar{Z}_k$, this scheme is X-secure, and because the user downloads all bits, it is T-private.
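
Here is a toy Python version of this degenerate-regime scheme in the simplest case $X = 1$, where the $(N, 1)$ MDS code is just the repetition code (so the noise acts like a one-time pad across the nodes):

```python
import random

def store(messages, N, rng):
    """Store K messages of length L = N - 1 on N nodes, 1-securely.
    storage[node][k] is node's bit of message k."""
    K = len(messages)
    z = [rng.randrange(2) for _ in range(K)]   # one random noise bit per row
    storage = [[0] * K for _ in range(N)]
    for k, w in enumerate(messages):
        assert len(w) == N - 1
        padded = [0] + list(w)                 # pad with X = 1 leading zero
        for node in range(N):                  # (N, 1) MDS code = repetition
            storage[node][k] = padded[node] ^ z[k]
    return storage

def retrieve(storage, k):
    """Download row k from every node; node 0 holds pure noise, so XOR it out."""
    col = [storage[node][k] for node in range(len(storage))]
    return [b ^ col[0] for b in col[1:]]

rng = random.Random(0)
msgs = [[1, 0, 1, 1], [0, 0, 1, 0]]            # K = 2 messages, L = 4, N = 5
S = store(msgs, 5, rng)
recovered = [retrieve(S, k) for k in range(2)]
# Any single node sees each of its stored bits XOR'd with a uniform bit,
# so it learns nothing (X = 1 security); the user downloads everything,
# which trivially hides which message she wanted (T-privacy).
```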

We now move on to the task of matrix-matrix multiplication. The methods for secure and private distributed matrix multiplication we will discuss shortly are based on *polynomial codes*, used by [Yu, Maddah-Ali, Avestimehr ’17] for doing distributed matrix multiplications robust to stragglers. Suppose the master has matrices $A$ and $B$ over some finite field $\mathbb{F}_q$, and wants to compute the product $AB$. Assume the row dimension of $A$ and the column dimension of $B$ are divisible by $m$, so we can represent the matrices divided into submatrices:

$$A = \begin{bmatrix} A_1 \\ \vdots \\ A_m \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} B_1 & \cdots & B_m \end{bmatrix}.$$

So to recover $AB$, the master needs each entry of:

$$AB = \begin{bmatrix} A_1 B_1 & \cdots & A_1 B_m \\ \vdots & \ddots & \vdots \\ A_m B_1 & \cdots & A_m B_m \end{bmatrix}.$$

The key idea of polynomial codes is to encode $A$ and $B$ into polynomials $\tilde{A}_i$ and $\tilde{B}_i$ to be sent to the $i$th worker, where they are multiplied and the result $\tilde{A}_i \tilde{B}_i$ is returned. The goal of Yu et al. was to create robustness to stragglers, and so they add redundancy in this process so that not all workers need to return a result for the master to be able to determine $C$. In particular, only $mn$ returned values are needed, so $N - mn$ servers can be slow or fail completely without hurting the computation. This method can be thought of as setting up the encodings of $A$ and $B$ so that the resulting multiplications are evaluations of a single polynomial with $mn$ coefficients (the products $A_j B_k$) at different points — equivalent to a Reed-Solomon code.
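A small NumPy sketch of this construction with $m = n = 2$ and $N = 6$ workers, using the exponent choices $\tilde{A}(x) = A_0 + A_1 x$ and $\tilde{B}(x) = B_0 + B_1 x^m$ from the polynomial-code paper. It runs over the reals for readability (the actual scheme works over a finite field), and drops two responses to simulate stragglers:

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 2                              # split A into m row blocks, B into n column blocks
A = rng.integers(0, 10, (4, 4)).astype(float)
B = rng.integers(0, 10, (4, 4)).astype(float)
A_blocks = [A[:2], A[2:]]              # A = [A_0; A_1]
B_blocks = [B[:, :2], B[:, 2:]]        # B = [B_0, B_1]

N = 6                                  # workers; only m*n = 4 responses will be needed
xs = np.arange(1.0, N + 1)             # a distinct evaluation point per worker

# worker i multiplies A~(x_i) = A_0 + A_1 x_i  by  B~(x_i) = B_0 + B_1 x_i^m
responses = [(A_blocks[0] + A_blocks[1] * x) @ (B_blocks[0] + B_blocks[1] * x**m)
             for x in xs]

# pretend workers 1 and 4 straggled: any m*n responses determine the
# degree-(mn-1) polynomial sum_{j,k} A_j B_k x^(j + k*m), entry by entry
idx = [0, 2, 3, 5]
V = np.vander(xs[idx], m * n, increasing=True)     # powers x^0 .. x^(mn-1)
Y = np.stack([responses[i] for i in idx])          # shape (4, 2, 2)
coeffs = np.linalg.solve(V, Y.reshape(m * n, -1)).reshape(m * n, 2, 2)
# the coefficient of x^(j + k*m) is A_j B_k; reassemble C block by block
C = np.block([[coeffs[j + k * m] for k in range(n)] for j in range(m)])
assert np.allclose(C, A @ B)
```

The exponents $j$ and $km$ are chosen so that every product $A_j B_k$ lands on a distinct power of $x$, which is what makes the interpolation invertible.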

This idea is adapted by [Chang, Tandon ’18] to protect the data from colluding servers: noise is incorporated into the encodings such that the number of encoded matrices required to determine anything about the data is greater than the security threshold $T$. Since the master receives all $N$ responses, it is able to decode the result $AB$, but no set of $T$ nodes can decode $A$, $B$, or $AB$. Similarly, [Kim, Yang, Lee ’19] adapts this idea to impose privacy on a matrix-matrix multiplication: the workers are assumed to have a shared library of matrices $B_1, \dots, B_M$, and the user would like to multiply $A B_\theta$ for some index $\theta$ without revealing the value of $\theta$ to the workers. The workers encode the entire library such that when the encoding is multiplied by an encoded input from the master, the result is useful to the master in decoding $A B_\theta$.

Chang and Tandon consider the following two privacy models, where up to $T$ servers may collude. The master also has $R$ (and, in the second model, $S$), which are matrices of random values with the same dimensions as $A$ (and $B$). These are used in creating the encodings $\tilde{A}$ (and $\tilde{B}$).

$B$ is public, $A$ is private: only $A$ is masked, by folding the random matrix $R$ into the encoding $\tilde{A}$.

Both private: $A$ is masked with $R$ and $B$ is masked with $S$ in the encodings $\tilde{A}$ and $\tilde{B}$.
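For intuition about the “both private” model, here is a deliberately simplified sketch (not Chang and Tandon’s exact construction): take $T = 1$ and leave the matrices unpartitioned, mask each factor with one random matrix, and recover $AB$ as the constant term of the degree-2 response polynomial. Each individual worker sees only a uniformly random share of each matrix.

```python
import numpy as np

p = 251                                # prime field; all arithmetic is mod p
rng = np.random.default_rng(1)
T = 1                                  # security threshold: any single worker learns nothing
A = rng.integers(0, p, (2, 2))
B = rng.integers(0, p, (2, 2))
R = rng.integers(0, p, (2, 2))         # random masks with the same shapes as A and B
S = rng.integers(0, p, (2, 2))

xs = [1, 2, 3]                         # one point per worker; degree 2T => 2T+1 = 3 responses
# worker at x sees A~(x) = A + R x and B~(x) = B + S x (each uniform for uniform R, S)
responses = [((A + R * x) % p) @ ((B + S * x) % p) % p for x in xs]

# A~(x) B~(x) = AB + (AS + RB) x + RS x^2, so AB is the value of the degree-2
# response polynomial at x = 0; recover it by Lagrange interpolation at 0
C = np.zeros((2, 2), dtype=np.int64)
for i, xi in enumerate(xs):
    lam = 1                            # Lagrange basis polynomial for x_i, evaluated at 0
    for j, xj in enumerate(xs):
        if j != i:
            lam = lam * xj % p * pow((xj - xi) % p, -1, p) % p
    C = (C + responses[i] * lam) % p

assert np.array_equal(C, (A @ B) % p)
```

The full scheme combines this masking with the block decomposition above, which is why more than $2T + 1$ workers are needed in general.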

Kim, Yang, and Lee take a similar approach, applying polynomial codes to *private* matrix multiplication. As before, there are $N$ workers, but now the master wants to multiply its matrix $A$ with some matrix $B_\theta$ in a shared library $B_1, \dots, B_M$ (all of the workers have the shared library).

Since the master isn’t itself encoding the library, it has to tell the workers how to encode it so that the master can reconstruct the desired product. This is done by having the master tell the workers what values they should use to evaluate the polynomial that corresponds to encoding each library matrix. We denote the encoding of the library done by each worker as a multivariate polynomial, which is evaluated at the master-supplied values together with a node-specific vector to get that node’s encoding $\tilde{B}_n$. The worker multiplies this with the encoding $\tilde{A}_n$ it receives from the master, and returns the resulting value $\tilde{A}_n \tilde{B}_n$. All together, we get the following communication model: the master sends each worker its encoded input and the evaluation values, each worker returns one encoded product, and the master decodes $A B_\theta$ from the responses.

As we’ve seen, coding techniques originally designed to add redundancy and protect against data loss can also be used to intentionally incorporate noise for data protection. In particular, this can be done when outsourcing matrix multiplications, making it a useful technique in many data processing and machine learning applications.

References:

- Jia, Zhuqing, Hua Sun, and Syed Ali Jafar. “Cross Subspace Alignment and the Asymptotic Capacity of X-Secure T-Private Information Retrieval.” *IEEE Transactions on Information Theory* 65.9 (2019): 5783-5798.
- Yu, Qian, Mohammad Maddah-Ali, and Salman Avestimehr. “Polynomial codes: an optimal design for high-dimensional coded matrix multiplication.” *Advances in Neural Information Processing Systems*. 2017.
- Chang, Wei-Ting, and Ravi Tandon. “On the capacity of secure distributed matrix multiplication.” *2018 IEEE Global Communications Conference (GLOBECOM)*. IEEE, 2018.
- Kim, Minchul, Heecheol Yang, and Jungwoo Lee. “Private Coded Matrix Multiplication.” *IEEE Transactions on Information Forensics and Security* (2019).