For years now—especially since the landmark work of Krizhevsky et al.—learning deep neural networks has been a method of choice in prediction and regression tasks, especially in perceptual domains found in computer vision and natural language processing. How effective might it be for solving *theoretical* tasks?

Specifically, focusing on supervised learning:

Can a deep neural network, paired with a stochastic gradient method, be shown to PAC learn any interesting concept class in polynomial time?

Depending on assumptions, and on one’s definition of “interesting,” present-day learning theory gives answers ranging from “no, that would solve hard problems,” to, more recently:

**Theorem:** Networks with depth between 2 and $O(\log n)$,^{1} having standard activation functions,^{2} with weights initialized at random and trained with stochastic gradient descent, learn, in polynomial time, constant-degree large-margin polynomial thresholds.

Learning constant-degree polynomials can also be done simply *with a linear predictor* over a polynomial embedding, or, in other words, by learning a halfspace. That said, what a linear predictor can do is also *essentially the state of the art* in PAC learning, so this result pushes neural net learning at least as far as one might hope at first. We will return to this point later, and discuss some limitations of PAC analysis once they are more apparent. In this sense, this post will turn out to be as much an overview of some PAC learning theory as it is about neural networks.

Naturally, there is a wide variety of theoretical perspectives on neural network analysis, especially from the past couple of years. Our goal in this post is not to survey or cover any extensive body of work, but simply to summarize our own recent line of work (from two papers: DFS’16 and D’17), and to highlight its interaction with PAC learning.

First, let’s define a learning task. To keep things simple, we’ll focus on binary classification over the boolean cube, without noise. Formally:

**(Binary classification.)** Given examples of the form $(x, f(x))$, where $x$ is sampled from some unknown distribution $\mathcal{D}$ on $\{\pm 1\}^n$, and $f : \{\pm 1\}^n \to \{\pm 1\}$ is some unknown function (the one that we wish to learn), find a function $h$ whose error, $\Pr_{x \sim \mathcal{D}}[h(x) \neq f(x)]$, is small.

Second, define a neural network formally as a directed acyclic graph whose vertices are called neurons. Of them, $n$ are input neurons, one is an output neuron, and the rest are called hidden neurons.^{3} A network $G$ together with a weight vector $w$ defines a predictor $h_{G,w}$ whose prediction on an input $x$ is computed by propagating $x$ forward through the network, with each neuron $v$ computing an output $o_v(x)$. Concretely:

- For an input neuron $v$, $o_v(x)$ is the corresponding coordinate of $x$.
- For a hidden neuron $v$, define $o_v(x) = \sigma\big(\sum_{u \to v} w_{uv}\, o_u(x) + b_v\big)$, where the sum runs over the neurons $u$ with an edge into $v$. The scalar weight $b_v$ is called a “bias.” In this post, the activation function $\sigma$ is the ReLU activation $\sigma(t) = \max(0, t)$, though others are possible as well.
- For the output neuron $v$, we drop the activation: $o_v(x) = \sum_{u \to v} w_{uv}\, o_u(x) + b_v$.

Finally, let $h_{G,w}(x) = o_v(x)$ for the output neuron $v$. This computes a real-valued function, so where we’d like to use it for classification, we do so by thresholding, and abuse the notation $h_{G,w}(x)$ to mean $\operatorname{sign}(h_{G,w}(x))$.

Some intuition for this definition would come from verifying that:

- Any function $f : \{\pm 1\}^n \to \{\pm 1\}$ can be computed by a network of depth two and $O(2^n)$ hidden neurons.
- The parity function $x \mapsto \prod_{i=1}^n x_i$ can be computed by a network of depth two and $O(n)$ hidden neurons. (NB: this one is a bit more challenging.)
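As a concrete illustration of the second bullet, here is one standard construction (our own sketch, not necessarily the one the post intends): the parity of $x \in \{\pm 1\}^n$ depends only on the number $s$ of $+1$ coordinates, and a triangle wave in $s$, built from $n$ ReLUs, evaluates to $(-1)^{n-s} = \prod_i x_i$.

```python
import itertools
import numpy as np

def parity_net(n):
    """Weights of a depth-two ReLU network with n hidden neurons computing
    parity on {-1, +1}^n. With s = number of +1 coordinates, hidden unit k
    computes ReLU(s - k), and the linear output neuron combines the units
    into a triangle wave equal to (-1)^(n - s) = prod_i x_i."""
    W1 = np.full((n, n), 0.5)                     # s = 0.5 * sum_i x_i + n/2
    b1 = np.array([n / 2 - k for k in range(n)])  # unit k computes ReLU(s - k)
    sign = (-1.0) ** n
    w2 = sign * np.array([-2.0] + [4.0 * (-1.0) ** (k + 1) for k in range(1, n)])
    b2 = sign * 1.0
    return W1, b1, w2, b2

def net_forward(x, params):
    W1, b1, w2, b2 = params
    hidden = np.maximum(0.0, W1 @ x + b1)  # the single ReLU layer
    return w2 @ hidden + b2                # linear output, no activation

# Exhaustive check on a small cube:
params = parity_net(4)
for x in itertools.product([-1.0, 1.0], repeat=4):
    x = np.array(x)
    assert np.isclose(net_forward(x, params), np.prod(x))
```

The output weights alternate in sign so that the piecewise-linear output interpolates $(-1)^s$ at the integers $s = 0, 1, \ldots, n$, using exactly $n$ hidden neurons.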

In practice, the network architecture (this DAG) is designed based on some domain knowledge, and its design can impact the predictor that’s later selected by SGD. One default architecture, useful in the absence of domain knowledge, is the multi-layer perceptron, comprising layers of complete bipartite graphs:

Convolutional nets capture the notion of spatial input locality in signals such as images and audio.^{4} In the toy example drawn, each clustered triple of neurons is a so-called convolution filter applied to two components below it. In image domains, convolution filters are two-dimensional and capture responses to spatial 2-D patches of the image or of an intermediate layer.

Training a neural net comprises (i) initialization, and (ii) iterative optimization, run on sufficiently many examples $(x, f(x))$. The initialization step sets the starting values of the weights at random:

**(Glorot initialization.)** Draw weights from centered Gaussians with variance inversely proportional to fan-in, and biases from independent standard Gaussians.^{5}

While other initialization schemes exist, this one is canonical, simple, and, as the reader can verify, keeps the expected squared output $\mathbb{E}_w\, o_v(x)^2$ of constant order for every neuron $v$ and input $x$.
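To make the normalization concrete, here is a quick numerical check for a single ReLU neuron, assuming the common $2/\text{fan-in}$ variance for ReLU (one member of this family of schemes; our choice for illustration) and ignoring the bias:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                  # fan-in of a single ReLU neuron
x = rng.choice([-1.0, 1.0], size=n)      # a point of the boolean cube

# With Var(w_i) = 2/n and ||x||^2 = n, the pre-activation <w, x> is N(0, 2);
# the ReLU keeps half of the mass, so E[ReLU(<w, x>)^2] = 1.
draws = 200_000
W = rng.normal(0.0, np.sqrt(2.0 / n), size=(draws, n))
second_moment = np.mean(np.maximum(0.0, W @ x) ** 2)
print(second_moment)  # should land close to 1.0
```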

The optimization step is essentially a local search method from the initial point, using stochastic gradient descent (SGD) or a variant thereof.^{6} To apply SGD, we need a function suitable for descent, and we’ll use the commonplace logistic loss $\ell(z) = \log_2(1 + e^{-z})$, which bounds the zero-one loss from above:

Define $L_{\mathcal{D}}(w) = \mathbb{E}_{x \sim \mathcal{D}}\, \ell\big(f(x)\, h_{G,w}(x)\big)$. Note that $\Pr_{x \sim \mathcal{D}}\big[h_{G,w}(x) \neq f(x)\big] \leq L_{\mathcal{D}}(w)$, so finding weights for which the upper bound is small enough implies low error in turn. Meanwhile, $L_{\mathcal{D}}$ is amenable to iterative gradient-based minimization.
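As a sanity check, taking the logistic loss to be $\ell(z) = \log_2(1 + e^{-z})$ (one standard normalization, so that $\ell(0) = 1$), we can verify numerically that it dominates the zero-one loss:

```python
import numpy as np

z = np.linspace(-10.0, 10.0, 2001)        # z stands for f(x) * h_w(x)
surrogate = np.log2(1.0 + np.exp(-z))     # logistic loss, base-2 normalization
zero_one = (z <= 0).astype(float)         # 1 on misclassification, else 0
assert np.all(surrogate >= zero_one)      # the surrogate dominates everywhere
```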

Given samples from $\mathcal{D}$, stochastic gradient descent creates an unbiased estimate of the gradient $\nabla L_{\mathcal{D}}(w)$ at each step by drawing a batch of i.i.d. samples from $\mathcal{D}$. The gradient at a point can be computed efficiently by the backpropagation algorithm.

In more complete detail, our prototypical neural network training algorithm is as follows. On input a network $G$, an iteration count $T$, a batch size $m$, and a step size $\eta$:

**Algorithm: SGDNN**

- Let $w^{(0)}$ be random weights sampled per Glorot initialization.
- For $t = 1, \ldots, T$:
  - Sample a batch $(x_1, f(x_1)), \ldots, (x_m, f(x_m))$, where $x_1, \ldots, x_m$ are i.i.d. samples from $\mathcal{D}$.
  - Update $w^{(t)} = w^{(t-1)} - \eta\, g^{(t)}$, where $g^{(t)} = \frac{1}{m}\sum_{i=1}^m \nabla_w\, \ell\big(f(x_i)\, h_{G,w^{(t-1)}}(x_i)\big)$.
- Output $w^{(T)}$.
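The loop above can be sketched in a few dozen lines of numpy. This is our own toy implementation, not the authors’ code: a ReLU multi-layer perceptron with variance-$2/\text{fan-in}$ Gaussian initialization (biases started at zero, a simplification), trained by mini-batch SGD on the logistic loss against a hypothetical conjunction target.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

def init_mlp(sizes):
    # Centered Gaussian weights with variance 2/fan-in; biases start at zero
    # (a simplification of the initialization described in this post).
    return [[rng.normal(0.0, np.sqrt(2.0 / m), size=(k, m)), np.zeros(k)]
            for m, k in zip(sizes, sizes[1:])]

def forward(params, x):
    """Forward pass: returns the real-valued network output together with
    the per-layer activations needed for backpropagation."""
    acts = [x]
    for W, b in params[:-1]:
        acts.append(relu(W @ acts[-1] + b))  # ReLU hidden layers
    W, b = params[-1]
    return (W @ acts[-1] + b)[0], acts       # linear output neuron

def sgd_step(params, batch, eta):
    """One SGD update on the average logistic loss over a batch."""
    grads = [[np.zeros_like(W), np.zeros_like(b)] for W, b in params]
    for x, y in batch:
        out, acts = forward(params, x)
        # d/d_out of log(1 + exp(-y * out)) is -y * sigmoid(-y * out).
        delta = np.array([-y / (1.0 + np.exp(y * out))])
        for layer in reversed(range(len(params))):
            W, _ = params[layer]
            grads[layer][0] += np.outer(delta, acts[layer])
            grads[layer][1] += delta
            if layer > 0:  # backpropagate through the ReLU below
                delta = (W.T @ delta) * (acts[layer] > 0)
    for (W, b), (gW, gb) in zip(params, grads):
        W -= eta * gW / len(batch)
        b -= eta * gb / len(batch)

# Toy run: learn the conjunction "x_1 = x_2 = 1" on {-1, +1}^8.
n, width, T, m, eta = 8, 64, 2000, 32, 0.1
target = lambda x: 1.0 if x[0] > 0 and x[1] > 0 else -1.0
params = init_mlp([n, width, width, 1])
for _ in range(T):
    xs = rng.choice([-1.0, 1.0], size=(m, n))
    sgd_step(params, [(x, target(x)) for x in xs], eta)

test_xs = rng.choice([-1.0, 1.0], size=(1000, n))
err = np.mean([np.sign(forward(params, x)[0]) != target(x) for x in test_xs])
print(f"test error: {err:.3f}")
```

The width, step size, and iteration count here are arbitrary toy choices, far from the polynomial bounds the theorem actually requires.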

Learning a predictor from example data is a general task, and a hard one in the worst case. We cannot efficiently (i.e. in time polynomial in $n$) compute, let alone learn, general functions from $\{\pm 1\}^n$ to $\{\pm 1\}$. In fact, any learning algorithm that is guaranteed to succeed in general (i.e. with any target predictor $f$ over any data distribution $\mathcal{D}$) runs, in the worst case, in time exponential in $n$. This is true even for rather weak definitions of “success,” such as finding a predictor with error less than $1/2 - \epsilon$ for a constant $\epsilon > 0$, i.e. one that slightly outperforms a random guess.

While it is impossible to efficiently learn general functions under general distributions, it might still be possible to learn efficiently under some assumptions on the target or the distribution . Charting out such assumptions is the realm of learning theorists: by now, they’ve built up a broad catalog of function classes, and have studied the complexity of learning when the target function is in each such class. Although their primary aim has been to develop theory, the potential guidance for practice is easy to imagine: if one’s application domain happens to be modeled well by one of these easily-learnable function classes, there’s a corresponding learning algorithm to consider as well.

The vanilla PAC model makes no assumptions on the data distribution $\mathcal{D}$, but it does assume the target $f$ belongs to some simple, predefined class $\mathcal{H}$. Formally, a *PAC learning problem* is defined by a function class^{7} $\mathcal{H}$ of functions from $\{\pm 1\}^n$ to $\{\pm 1\}$. A learning algorithm *learns* the class $\mathcal{H}$ if, whenever $f \in \mathcal{H}$, and provided an accuracy parameter $\epsilon > 0$, it runs in time $\mathrm{poly}(n, 1/\epsilon)$, and returns a function of error at most $\epsilon$, with probability at least 0.9. Note that:

- The learning algorithm need not return a function from the learnt class.
- The polynomial-time requirement means in particular that the learning algorithm cannot output a complete truth table, as its size would be exponential. Instead, it must output a short description of a hypothesis that can be evaluated in polynomial time.

For a taste of the computational learning theory literature, here are some of the function classes studied by theorists over the years:

- *Linear thresholds (halfspaces):* functions that map a halfspace to 1 and its complement to -1. Formally, functions of the form $x \mapsto \operatorname{sign}(\langle w, x \rangle)$ for some $w \in \mathbb{R}^n$, where $\operatorname{sign}(t) = 1$ when $t \geq 0$ and $\operatorname{sign}(t) = -1$ when $t < 0$.
- *Large-margin linear thresholds:* for a margin parameter $\delta > 0$, the class of linear thresholds $x \mapsto \operatorname{sign}(\langle w, x \rangle)$ restricted to examples satisfying $|\langle w, x \rangle| \geq \delta\, \|w\| \|x\|$.
- *Intersections of halfspaces:* functions that map an intersection of polynomially many halfspaces to $1$ and its complement to $-1$.
- *Polynomial threshold functions:* thresholds of constant-degree polynomials.
- *Large-margin polynomial threshold functions:* the analogous large-margin restriction of constant-degree polynomial thresholds.

- *Decision trees*, *deterministic automata*, and *DNF formulas* of polynomial size.
- *Monotone conjunctions:* functions that, for some $S \subseteq [n]$, map $x$ to $1$ if $x_i = 1$ for all $i \in S$, and to $-1$ otherwise.
- *Parities:* functions of the form $x \mapsto \prod_{i \in S} x_i$ for some $S \subseteq [n]$.
- *Juntas:* functions that depend on at most $O(\log n)$ variables.

Learning theorists look at these function classes and work to distinguish those that are efficiently learnable from those that are *hard* to learn. They establish hardness results by reduction from other computational problems that are conjectured to be hard, such as random XOR-SAT (though none today are conditioned outright on NP hardness); see for example these two results. Meanwhile, halfspaces are learnable by linear programming. Parities, or more generally, $\mathbb{F}$-linear functions for a field $\mathbb{F}$, are learnable by Gaussian elimination. In turn, via reductions, many other classes are efficiently learnable. This includes polynomial thresholds, decision lists, and more. Here is an artist’s depiction of some of what’s currently known:
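For concreteness, here is a sketch of the Gaussian-elimination learner for parities (our own illustration; the variable names are ours). Writing $x_i = (-1)^{a_i}$ and $y = (-1)^b$, each example $(x, \chi_S(x))$ of a parity $\chi_S(x) = \prod_{i \in S} x_i$ becomes one linear equation $\sum_{i \in S} a_i = b$ over $\mathrm{GF}(2)$, and solving the system recovers the support $S$:

```python
import numpy as np

def learn_parity(examples):
    """Recover the support S of a parity from consistent examples by
    Gauss-Jordan elimination over GF(2)."""
    A = np.array([[(1 - xi) // 2 for xi in x] for x, _ in examples], dtype=np.int64)
    b = np.array([(1 - y) // 2 for _, y in examples], dtype=np.int64)
    n = A.shape[1]
    row, pivots = 0, []
    for col in range(n):
        pivot = next((r for r in range(row, len(A)) if A[r, col]), None)
        if pivot is None:
            continue
        A[[row, pivot]] = A[[pivot, row]]    # bring a pivot row up
        b[[row, pivot]] = b[[pivot, row]]
        for r in range(len(A)):              # eliminate the column elsewhere
            if r != row and A[r, col]:
                A[r] = (A[r] + A[row]) % 2
                b[r] = (b[r] + b[row]) % 2
        pivots.append(col)
        row += 1
    # Read off one consistent solution (free variables set to 0).
    s = np.zeros(n, dtype=np.int64)
    for r, col in enumerate(pivots):
        s[col] = b[r]
    return {i for i in range(n) if s[i]}

rng = np.random.default_rng(2)
n, S = 10, {1, 4, 7}
xs = rng.choice([-1, 1], size=(100, n))
examples = [(x, int(np.prod(x[list(S)]))) for x in xs]
print(learn_parity(examples))  # with enough random examples, recovers S
```

With many random examples the system has a unique solution with overwhelming probability, so the learner outputs exactly $S$; this is precisely why parities, despite being hard for gradient methods, are easy for algebra.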

At a high level, the upshot from all of this—and if you take away just one thing from this quick tour of PAC—is that:

Barring a small handful of exceptions, all known efficiently learnable classes can be reduced to halfspaces or $\mathbb{F}$-linear functions.

Or, to put it more bluntly, **the state of the art in PAC-learnability is essentially linear prediction**.

Research in algorithms and complexity often follows these steps:

- define a computational problem,
- design an algorithm that solves it, and then
- establish bounds on the resource requirements of that algorithm.

A bound on the algorithm’s performance forms, in turn, a bound on the *computational problem’s* inherent complexity.

By contrast, we have already decided on our SGDNN algorithm, and we’d like to attain some grasp on its capabilities. So we’d like to do things in a different order:

- define an *algorithm* (done),
- design a computational problem to which the algorithm can be applied, and then
- establish bounds on the resource requirements of the algorithm in solving the problem.

Our computational problem will be a PAC learning problem, corresponding to a function class. For SGDNN, an ambitious function class we might consider is the class of all functions realizable by the network. But if we were to follow this approach, we would run up against the same hardness results mentioned before.

So instead, we’ve established the theorem stated at the top of this post. That is, that SGDNN, over a range of network configurations, learns a class that we *already know* to be learnable: large margin polynomial thresholds. Restated:

**Theorem, again:** There is a choice of SGDNN step size $\eta$ and number of steps $T$, as well as a batch size $m$, all polynomial in $n$, such that SGDNN on a multi-layer perceptron of depth between 2 and $O(\log n)$, and of polynomial width,^{8} learns constant-degree large-margin polynomial thresholds.

How rich are large-margin polynomial thresholds? They contain disjunctions, conjunctions, DNF and CNF formulas with constantly many terms, and DNF and CNF formulas with constantly many literals in each term. By corollary, SGDNN can PAC learn these classes as well. And at this point, we’ve covered a considerable fraction of the function classes known to be poly-time PAC learnable by *any* method.

Exceptions include constant-degree polynomial thresholds with no restriction on the coefficients, decision lists, and parities. It is well known that SGDNN cannot learn parities, and in ongoing work with Vitaly Feldman, we show that SGDNN can learn neither decision lists nor constant-degree polynomial thresholds with unrestricted coefficients. So the picture becomes clearer:

The theorem above runs SGDNN with a multi-layer perceptron. What happens if we change the network architecture? It can be shown then that SGDNN learns a qualitatively different function class. For instance, with convolutional networks, the learnable functions include certain polynomials of *super-constant* degree.

The path to the theorem traverses two papers, and the proof has a corresponding two-step outline.

The first step is to show that, with high probability, the Glorot random initialization renders the network in a state where the final hidden layer (just before the output node) is rich enough to approximate all large-margin polynomial threshold functions (LMPTs). Namely, every LMPT can be approximated by the network up to some setting of the weights that enter the output neuron (all remaining weights random). The tools for this part of the proof include (i) the connection between kernels and random features, (ii) a characterization of symmetric kernels of the sphere, and (iii) a variety of properties of Hermite polynomials. It’s described in our 2016 paper.

An upshot of this correspondence is that if we run SGD *only on the top layer* of a network, leaving the remaining weights as they were randomly initialized, we learn LMPTs. (Remember when we said that we won’t beat what a linear predictor can do? There it is again.) The second step of the proof, then, is to show that the correspondence continues to hold even if we train all the weights. In the assumed setting (e.g. provided at most logarithmic depth, sufficient width, and so forth), what’s represented in the final hidden layer changes sufficiently slowly that, over the course of SGDNN’s iterations, it *remains* rich enough to approximate all LMPTs. The final layer does the remaining work of picking out the right LMPT. The argument is in Amit’s 2017 paper.

To what extent should we be satisfied, knowing that our algorithm of interest (SGDNN) can solve a (computationally) easy problem?

On the positive side, we’ve managed to say something at all about neural network training in the PAC framework. Roughly speaking, some class of non-trivially layered neural networks, trained as they typically are, learns any known learnable function class that isn’t “too sensitive.” It’s also appealing that the function classes vary across different architectures.

On the pessimistic side, we’re confronted with a major limitation of the “function class” perspective, prevalent in PAC analysis and elsewhere in learning theory. All of the classes that SGDNN learns, *under the assumptions* touched on in this post, are so-called large-margin classes. Large-margin classes are essentially linear predictors over a *fixed and data-independent* embedding of input examples, as alluded to before. These are inherently “shallow models.”

That seems rather problematic in pursuing any kind of theory for learning layered networks, where the entire working premise is that a deep network uses its hidden layers to learn a representation adapted to the example domain. Our analysis—both its goal and its proof—clashes with this intuition: it works out that a “shallow model” can be learned when the assumptions imply that “not too much” change takes place in the hidden layers. It seems that the representation-learning phenomenon is what’s interesting, yet the typical PAC approach, as well as the analysis touched on in this post, avoids capturing it.

- Here, $n$ is the dimension of the instance space.
- For instance, ReLU activations, of the form $\sigma(t) = \max(0, t)$.
- Recurrent networks allow for cycles, but in this post we stick to DAGs.
- Convolutional networks often also constrain subsets of their weights to be equal; that turns out not to bear much on this post.
- Although not essential to the results described, it also simplifies this post to zero the weights on edges incident to the output node as part of the initialization.
- Variants of SGD are used in practice, including algorithms used elsewhere in optimization (e.g. SGD with momentum, AdaGrad) or techniques developed more specifically for neural nets (e.g. RMSprop, Adam, batch norm). We’ll stick to plain SGD.
- More accurately, a sequence of function classes $\mathcal{H}_n$ for $n = 1, 2, \ldots$.
- The width of a multi-layer perceptron is the number of neurons in each hidden layer.

The call for nominations for the 2019 Gödel Prize is out, and the deadline is February 15th. For all awards, we sometimes have the tendency to think that worthy candidates have surely been nominated by others. Often that is not the case (and thus worthy candidates are left behind). So if there is a paper or papers deserving nomination, please nominate! The call for nominations is below.

Deadline: February 15, 2019

The Gödel Prize for outstanding papers in the area of theoretical computer science is sponsored jointly by the European Association for Theoretical Computer Science (EATCS) and the Association for Computing Machinery, Special Interest Group on Algorithms and Computation Theory (ACM SIGACT). The award is presented annually, with the presentation taking place alternately at the International Colloquium on Automata, Languages, and Programming (ICALP) and the ACM Symposium on Theory of Computing (STOC). The 27th Gödel Prize will be awarded at the 51st Annual ACM Symposium on the Theory of Computing, to be held June 23-26, 2019 in Phoenix, AZ. The Prize is named in honor of Kurt Gödel in recognition of his major contributions to mathematical logic and of his interest, discovered in a letter he wrote to John von Neumann shortly before von Neumann’s death, in what has become the famous “P versus NP” question. The Prize includes an award of USD 5,000.

**Award Committee: **The 2019 Award Committee consists of Anuj Dawar (Cambridge University), Robert Krauthgamer (Weizmann Institute), Joan Feigenbaum (Yale University), Giuseppe Persiano (Università di Salerno), Omer Reingold (Chair, Stanford University) and Daniel Spielman (Yale University).

**Eligibility:** The 2019 Prize rules are given below, and they supersede any different interpretation of the generic rule to be found on the websites of both SIGACT and EATCS. Any research paper or series of papers by a single author or by a team of authors is deemed eligible if:

- The main results were not published (in either preliminary or final form) in a journal or conference proceedings before January 1st, 2006.
- The paper was published in a recognized refereed journal no later than December 31, 2018.

The research work nominated for the award should be in the area of theoretical computer science. Nominations are encouraged from the broadest spectrum of the theoretical computer science community so as to ensure that potential award-winning papers are not overlooked. The Award Committee shall have the ultimate authority to decide whether a particular paper is eligible for the Prize.

**Nominations:**

Nominations for the award should be submitted by email to the Award Committee Chair: reingold@stanford.edu. Please make sure that the Subject line of all nominations and related messages begin with “Goedel Prize 2019.” To be considered, nominations for the 2019 Prize must be received by February 15, 2019.

A nomination package should include:

1. A printable copy (or copies) of the journal paper(s) being nominated, together with a complete citation (or citations) thereof.

2. A statement of the date(s) and venue(s) of the first conference or workshop publication(s) of the nominated work(s) or a statement that no such publication has occurred.

3. A brief summary of the technical content of the paper(s) and a brief explanation of its significance.

4. A support letter or letters signed by at least two members of the scientific community.

Additional support letters may also be received and are generally useful. The nominated paper(s) may be in any language. However, if a nominated publication is not in English, the nomination package must include an extended summary written in English.

Those intending to submit a nomination should contact the Award Committee Chair by email well in advance. The Chair will answer questions about eligibility, encourage coordination among different nominators for the same paper(s), and also accept informal proposals of potential nominees or tentative offers to prepare formal nominations. The committee maintains a database of past nominations for eligible papers, but fresh nominations for the same papers (especially if they highlight new evidence of impact) are always welcome.

**Selection Process:**

The Award Committee is free to use any other sources of information in addition to the ones mentioned above. It may split the award among multiple papers, or declare no winner at all. All matters relating to the selection process left unspecified in this document are left to the discretion of the Award Committee.

**Recent Winners**

(all winners since 1993 are listed at http://www.sigact.org/Prizes/Godel/ and http://eatcs.org/index.php/goedel-prize):

**2018:** Oded Regev, On lattices, learning with errors, random linear codes, and cryptography, Journal of the ACM (JACM), Volume 56 Issue 6, 2009 (preliminary version in Symposium on Theory of Computing, STOC 2005).

**2017:** Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith, Calibrating Noise to Sensitivity in Private Data Analysis, Journal of Privacy and Confidentiality, Volume 7, Issue 3, 2016 (preliminary version in Theory of Cryptography, TCC 2006).

**2016:** Stephen Brookes, A Semantics for Concurrent Separation Logic. Theoretical Computer Science 375(1-3): 227-270 (2007). Peter W. O’Hearn, Resources, Concurrency, and Local Reasoning. Theoretical Computer Science 375(1-3): 271-307 (2007).

**2015:** Dan Spielman and Shang-Hua Teng, Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems, Proc. 36th ACM Symposium on Theory of Computing, pp. 81-90, 2004; Spectral sparsification of graphs, SIAM J. Computing 40:981-1025, 2011; A local clustering algorithm for massive graphs and its application to nearly linear time graph partitioning, SIAM J. Computing 42:1-26, 2013; Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems, SIAM J. Matrix Anal. Appl. 35:835-885, 2014.

**2014: **Ronald Fagin, Amnon Lotem, and Moni Naor, Optimal Aggregation Algorithms for Middleware, Journal of Computer and System Sciences 66(4): 614–656, 2003.

**2013: **Antoine Joux, A one round protocol for tripartite Diffie-Hellman, J. Cryptology 17(4): 263-276, 2004. Dan Boneh and Matthew K. Franklin, Identity-Based Encryption from the Weil pairing, SIAM J. Comput. 32(3): 586-615, 2003.

**Shafi Goldwasser** for the Motwani colloquium, telling us about *Pseudo-Deterministic Algorithms and Proofs*, **Avishay Tal** about *Oracle Separation of BQP and the Polynomial Hierarchy*, and **Badih Ghazi** about *Resource-Efficient Common Randomness and Secret Key Generation*. We will also have student talks, food and drink, and a great and diverse group of theoreticians as usual.

I will devote several posts (by myself and others) to the (beautiful) “emerging theory of algorithmic fairness.” Most of these posts will be more technical, but I’d like to devote today’s post to a short discussion of what theoreticians can contribute to this multidisciplinary effort.

My own belief is that computer scientists cannot solve Algorithmic Fairness (and privacy in data analysis or any other issue of this sort) on their own. On the other hand, these issues, in their current computation-driven large-scale incarnation, cannot be seriously addressed without major involvement of computer scientists. Furthermore, what is needed (as I will try to demonstrate in future posts) is a true collaboration, rather than a division of work, where one community sub-contracts another for specific expertise.

One of the reasons the Theory of Computing is particularly suited to this challenge is our basic optimism in the face of complexities and even impossibilities. The topic of Algorithmic Fairness seems to be particularly entangled with such complexities. This is the source of a line of criticism of the inherent limitations of the “tech solutionist” approach to Algorithmic Fairness. For example, “discrimination is the result of biases in the data and cannot be addressed at the level of machine learning.” Another example: “unless we understand the causal structure we are analyzing, fairness cannot be obtained.” These criticisms (while not as devastating as they are sometimes presented) are not without merit, and they deserve a much more technical discussion (that will hopefully come in future posts). At this point I’d like to make two comments:

- The computational lens has served us well in the study of Cryptography, Game Theory, Learning, Privacy and beyond. There is already evidence that it is serving us well in the study of Algorithmic Fairness. I believe that the pessimistic view of what I would call “all-or-nothing-ism” ignores an incredible track record of Theory of Computing in addressing complicated human-involving subject areas, and ignores the progress already made on Algorithmic Fairness.
- Furthermore, no one is planning to stop analyzing data (for example in medical research) because our data is imperfect or because we haven’t figured out causality. Algorithmic Fairness requires both the best solutions we can come up with right now and a concerted research effort to guarantee better fairness in the future.

While all too common, the term “technologists” in this context is unfortunate. Who are those mysterious “technologists?” Are they software engineers? Are they computer scientists? (and which sub area: Machine Learning? Theory? Others?) Or perhaps CEOs of technology companies? Or perhaps this refers to the investment firms and Wall Street, who seem to have such a huge sway over technology companies? Perhaps users of technology are to blame? Each of those is a completely different group of individuals with completely different sets of constraints and incentives. Lumping them all together is close to meaningless.

In a sequence of posts (by me and others and of increasing level of technical details), I hope to discuss the role of Theory of Computing in the study of the particularly important societal issue of Algorithmic Fairness. In this post, I’d like to briefly discuss the role of Academia more generally.

**The power and weakness of education**

An idea that is getting traction is that ethics and the societal impact of computation should be embedded in essentially all Computer Science courses. I am all for it! (In fact, ethics should be a major part of every curriculum on campus, not just Computer Science.) As a huge fraction of students these days take some Computer Science courses, this will improve technology consumers’ awareness of ethics in computation. It will also improve the awareness of software engineers, and eventually that of the leadership of technology companies and, as importantly, of policy makers.

But awareness, in itself, may not have much of an impact. Software engineers often have very little flexibility in shaping the products they develop, even when it comes to topics that more clearly affect the bottom line of their companies (this has to do with the quick pace and incentive structure of companies). Even the most philanthropic CEOs seem to run companies that violate basic ethical considerations. Here too, the incentive structure is much more to blame than lack of awareness. And even consumers that want to punish violators, often do not, as many software companies are to a large extent a monopoly. In other cases, violators operate behind the scenes, hidden from consumers.

**Developing the Knowledge and Tools **

I would also add that topics like privacy and algorithmic fairness require significant sophistication and much of the required knowledge and tools are yet to be developed. This means that academia (and funding agencies) should perform and support much more research. But (big) companies (that make their living exploiting sensitive data) should also hire many more researchers (of various disciplines) to develop the tools they need.

The great breakthroughs in Machine Learning within industry did not occur because the employees of those companies increased their awareness of the importance of data analysis. They happened because those companies employed talented and knowledgeable individuals and poured a lot of money into machine learning. Unless companies invest far more resources in ethics, we are going to see the same recurring failures in protecting their users.

**Regulations**

As we already mentioned, users are very limited in their ability to punish big companies, so it is unlikely that we will see the needed investment across the board (some companies are much better in this regard, but they are the exception rather than the rule). In addition to education, we need to enforce good behavior through legislation and regulation. Unfortunately, the direction of the current administration is to remove protections for consumers. Still, we can hope that Europe (as well as some of the more progressive U.S. states) will come to our rescue once again. As for the role of scientists, we should work with policy makers to develop and advocate for the “right” regulations.


*Search problem* means that we’re looking for something. *Total *means that what we’re looking for is guaranteed to exist. A famous example is Nash equilibrium: every game has at least one equilibrium, but [Daskalakis, Goldberg, Papadimitriou 2009] proved that it is PPAD-hard to find any of them.

A specific focus of this workshop is on connections to different sub-fields of Theory of Computer Science. We’ve seen exciting progress on those recently, and we hope the workshop can further bridge together all of the above (actually, all of TCS). Which brings me to my next point…

You!

We already have some fantastic speakers confirmed (schedule and workshop website coming soon), including an opening overview talk by Costis Daskalakis, who recently received the Nevanlinna Prize (in part) for his work on total search problems.

By the way, if you have something interesting to tell the community about total search problems, and we haven’t contacted you yet about giving a talk, please let us know. We can probably still accommodate you in the schedule, even if you don’t have a Nevanlinna Prize.

Looking forward to seeing y’all there – it will be totally awesome!

(Sorry- I couldn’t resist the bad pun…)


By Scarlett Sozzani

In response to and in support of recent activism around sexual assault and inclusivity at large, I want to take this opportunity to argue that issues of harassment, discrimination, bullying, and other egregiously insupportable actions can only exist when the victims are perceived to be weak, vulnerable, and powerless. And this kind of perception is often (and unwittingly) perpetuated through microaggressions by many members of our academic community. Even though microaggressions are arguably even more frequent than outright forms of discrimination and harassment, this issue remains largely unaddressed.

Microaggression is formally defined (on dictionary.com) as:

*a statement, action, or incident regarded as an instance of indirect, subtle, or unintentional discrimination against members of a marginalized group such as a racial or ethnic minority.*

A microaggression is difficult to identify because it is so subtle – what one person may consider a microaggression may seem like merely a rude or tactless comment to another person. And comments that do not overtly mention race, gender, class, sexual orientation, etc. are more difficult to directly attribute as an act of microaggression. Furthermore, microaggressions are sometimes unintentional, so the perpetrator might not even realize they are committing a microaggression against someone. It’s a very personal judgment, so perhaps a good question to ask is: “What is the likelihood that the perpetrator would have made this same comment to a person who identifies with the privileged majority?”

Microaggressions also come in many forms: not just in words, but also in tone, attitude, gestures, writing, and in all forms of interaction. The accumulated damage over many instances, over time, cannot be overstated. It elicits an intuitive response in the receivers of such microaggressions – a nagging feeling of self-doubt that one doesn’t belong, or isn’t good enough, or isn’t as good as the rest of the people in the room.

And going beyond the definition’s focus on discrimination, I would argue that any action that makes a person feel that their contributions are not valuable, and that they are not good enough to be standing where they are, is counterproductive to the collective aspirations of a community, especially an ambitious, high-flying research community.

Here are a few examples of microaggressions that I have felt in my very short time as a graduate student and in my various roles as a colleague, advisee, collaborator, and teaching assistant.

-“It’s a hard paper to read, especially if you don’t have the necessary background.”

-“You should be able to get the fellowship, right? You’re a young girl.”

-“How is that not what I just said?”

-“I erased your Piazza post answering a student’s ask for supplementary materials because I didn’t like the paper you referenced.”

-And the classic: “Hey guys…”

Let’s all aim to do a little bit better. Perhaps even go above and beyond in pushing against the current by acknowledging, highlighting, and talking about hard-earned and worthy contributions from our under-represented mentees and colleagues.

(See this video and this Quanta magazine article for more.)

In this post we want to celebrate another aspect of Costis’s work, on tackling statistical and modeling questions at the intersection of statistics, machine learning, and theory. Indeed, over the past years Costis and his students and collaborators have been at the forefront of some fundamental, yet quite topical questions: *how to make sense of data when we have too little of it, or too little time?*

On topics ranging from the daunting “curse of dimensionality” (how to say something meaningful about high-dimensional data, given limited computational power and/or observations? Under which assumptions, and in which scenarios can one still have a principled and sound approach to hypothesis testing, or density estimation, in this case?) to societal issues such as the tension between efficiency and privacy in hypothesis testing (is such a tension even necessary, or can we sometimes get differential privacy at no cost?), while exploring applications to biology and inference on genomic data, Costis’s contributions to these broad questions have been many.

Eagerly waiting for the next breakthroughs, once again — congratulations on this well-deserved award!

*(Image credit: Sarah A. King, from the MIT Technology Review.)*

Given strings $x$ and $y$ of $n$ characters each, the textbook dynamic programming algorithm finds their edit distance in time $O(n^2)$ (if you haven’t seen this in your undergrad algorithms class, consider asking your university for a refund on your tuition). Recent complexity breakthroughs [1][2] show that under plausible assumptions like SETH, quadratic time is almost optimal for exact algorithms. This is too bad, because we would like to compute the edit distance between very long strings, like the entire genomes of two organisms (see also this article in Quanta). There is a sequence of near-linear time algorithms with improved approximation factors [1][2][3][4], but until now the state of the art was polylogarithmic; actually, for near-linear time, this is still the state of the art:

**Open question 1**: Is there a constant-factor approximation to edit distance that runs in near-linear time?
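For concreteness, here is a minimal Python sketch of the textbook quadratic dynamic program mentioned above (the function name is mine):

```python
def edit_distance(x: str, y: str) -> int:
    """Textbook O(n^2) dynamic program for edit distance
    (unit-cost insertions, deletions, and substitutions)."""
    n, m = len(x), len(y)
    # dp[i][j] = edit distance between the prefixes x[:i] and y[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i  # delete all of x[:i]
    for j in range(m + 1):
        dp[0][j] = j  # insert all of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete x[i-1]
                           dp[i][j - 1] + 1,        # insert y[j-1]
                           dp[i - 1][j - 1] + sub)  # match or substitute
    return dp[n][m]
```

Each cell depends on only three earlier cells, which is exactly what makes the quadratic table fill work — and, per the hardness results above, beating it seems to require giving up exactness.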

Here is a sketch of *an* algorithm. It is somewhat different from the algorithm in the paper because I wrote this post before finding the full paper online.

We partition each string into *windows*, or consecutive substrings of length $w$ each. We then restrict our attention to *window-compatible* matchings: that is, instead of looking for the globally optimum way to transform $x$ to $y$, we look for a partial matching between the $x$- and $y$-windows, and transform each $x$-window into its matched $y$-window (unmatched $x$-windows are deleted, and unmatched $y$-windows are inserted). It turns out that restricting to window-compatible matchings is almost without loss of generality.

In order to find the optimum window-compatible matching, we can find the distances between every pair of windows, and then use a (weighted) dynamic program of size $O((n/w)^2)$. The reason I call it “Step 0” is that so far we have made zero progress on running time: we still have to compute the edit distance between $(n/w)^2$ pairs of windows, and each computation takes time $O(w^2)$, so $O(n^2)$ time in total.
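Taking the pairwise window distances as a black-box oracle, this window-level dynamic program can be sketched as follows (a minimal sketch; the names and representation are my own, not from the paper):

```python
def window_compatible_distance(xs, ys, ed):
    """Cheapest window-compatible matching.  Each x-window is either
    deleted (cost = its length) or transformed into one y-window
    (cost = ed of the pair); unmatched y-windows are inserted.
    xs, ys: lists of windows (strings); ed: pairwise distance oracle."""
    n, m = len(xs), len(ys)
    INF = float("inf")
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if dp[i][j] == INF:
                continue
            if i < n:  # delete x-window i
                dp[i + 1][j] = min(dp[i + 1][j], dp[i][j] + len(xs[i]))
            if j < m:  # insert y-window j
                dp[i][j + 1] = min(dp[i][j + 1], dp[i][j] + len(ys[j]))
            if i < n and j < m:  # match x-window i to y-window j
                dp[i + 1][j + 1] = min(dp[i + 1][j + 1],
                                       dp[i][j] + ed(xs[i], ys[j]))
    return dp[n][m]
```

The matching must preserve window order, so this is just edit distance again, one level up: the "alphabet" is windows and the substitution cost is the oracle's answer.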

Approximating all the pairwise distances reduces to the following problem: given a threshold $\Delta$, compute the bipartite graph $G_\Delta$ over the windows, where two windows $a$ and $b$ share an edge if $ED(a,b) \le \Delta$. In fact it suffices to compute an approximate $\tilde{G}_\Delta$, where $a$ and $b$ may share an edge even if their edit distance is a little more than $\Delta$.
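As a baseline, the exact graph can be built by thresholding all pairwise distances — precisely the quadratic cost we want to avoid. A minimal sketch, with hypothetical names:

```python
def build_G_delta(xs, ys, delta, ed):
    """Naive construction of the bipartite graph G_delta on the
    x-windows and y-windows: edge (i, j) iff ed(xs[i], ys[j]) <= delta.
    Costs (n/w)^2 oracle calls -- the baseline to beat."""
    adj_x = {i: set() for i in range(len(xs))}  # y-neighbors of each x-window
    adj_y = {j: set() for j in range(len(ys))}  # x-neighbors of each y-window
    for i, a in enumerate(xs):
        for j, b in enumerate(ys):
            if ed(a, b) <= delta:
                adj_x[i].add(j)
                adj_y[j].add(i)
    return adj_x, adj_y
```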

**New Goal**: Compute $\tilde{G}_\Delta$ faster than by naively computing all pairwise edit distances.

While there are many edges in $G_\Delta$, say average degree $d$: draw a random edge $(a,b)$, and let $a', b'$ be two other neighbors of $b$ and $a$, respectively. Applying the **triangle inequality** (twice), we have that $ED(a',b') \le ED(a',b) + ED(b,a) + ED(a,b') \le 3\Delta$, so we can immediately add $(a',b')$ to $\tilde{G}_{3\Delta}$. In expectation, $a$ and $b$ have $d$ neighbors each, so we discovered a total of $d^2$ pairs, of which we expect that roughly a constant fraction correspond to *new* edges in $\tilde{G}_{3\Delta}$. Repeat at most $\tilde{O}(n/(wd))$ times until we have discovered almost all the edges in $G_\Delta$. Notice that each iteration took us time $\tilde{O}(nw)$ (computing all the edit distances from $a'$ and $b'$); hence in total only $\tilde{O}(n^2/d)$. Thus we reduced to the sparse case in truly subquadratic time.
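The discovery step for a single sampled edge can be sketched as follows (a toy version with hypothetical names; the adjacency maps send each window index to its neighbors on the other side, as a known subset of $G_\Delta$'s edges):

```python
def discover_from_edge(adj_x, adj_y, a, b):
    """Given a known edge (a, b) of G_Delta, i.e. ed(a, b) <= Delta,
    pair every x-neighbor a' of b with every y-neighbor b' of a.
    Two applications of the triangle inequality give
        ed(a', b') <= ed(a', b) + ed(b, a) + ed(a, b') <= 3 * Delta,
    so every returned pair is guaranteed to be an edge of G~_{3 Delta}
    without computing any further edit distances."""
    return {(a2, b2) for a2 in adj_y[b] for b2 in adj_x[a]}
```

In the full algorithm the edge $(a,b)$ is sampled at random and the step is repeated until almost all edges are found; the sketch above only shows why each discovered pair is safe to add.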

The algorithm up to this point is actually due to a recent paper by Boroujeni et al.; for the case when $G_\Delta$ is relatively sparse, they use Grover search to discover all the remaining edges in quantum subquadratic time. It remains to see how to do it classically…

The main observation we need for this part is that if two $x$-windows $a_1$ and $a_2$ are close, then in an optimum window-compatible matching they are probably not matched to $y$-windows that are very far apart. And in the rare event that they are matched to far-apart $y$-windows, the cost of inserting so many characters between them outweighs the cost of completely replacing $a_2$ if we had to. So once we have a candidate list of $y$-windows that $a_1$ might match to, it is safe to search for good matches for $a_2$ only around each of those candidates. But when the graph is sparse, we have exactly such a short list: the neighbors of $a_1$!

We have to be a bit careful: for example, it is possible that $a_1$ is not matched to any of its neighbors in $\tilde{G}$. But if we sample enough $a_1$’s from some interval around $a_2$, then either (i) at least one of them is matched to a neighbor in $\tilde{G}$; or (ii) the interval doesn’t contribute much to reducing the edit distance, so it’s OK to miss some of those edges.

On the back of my virtual envelope, I think the above ideas give a (large) constant-factor approximation. But as far as genomes go, up to such large constant factors, you’re as closely related to your dog as to a banana. So it would be great to improve the approximation factor:

**Open question 2**: Is there a $(1+\epsilon)$-approximation algorithm for edit distance in truly subquadratic (or near-linear) time?

Note that only the sparsification step loses more than a $(1+\epsilon)$ factor in approximation. Also, none of the existing fine-grained hardness results rule out a $(1+\epsilon)$-approximation, even in linear time!

In the same vein, Michael Ekstrand and Michael Veale (the Publicity Chairs for FAT* 2019) have asked me to disseminate the following announcement and CFP.

———–

We are pleased to announce the Call for Papers for the 2019 ACM Conference on Fairness, Accountability, and Transparency (FAT*), to be held in Atlanta, Georgia in January/February 2019.

FAT* is an interdisciplinary conference to connect social, technical, and policy domains around broad questions of fairness, accountability, and transparency of machine learning, information retrieval, and other computing systems. The conference this year features tracks on Theory and Security, Statistics, Machine Learning, and Data Mining. The inaugural conference at NYU in February 2018 had an acceptance rate of 25% and was sold out, with 450 international attendees from across academia, industry, and public policy.

Papers (8-10 pages, due August 23) are double-blind peer reviewed and published in conference proceedings in the ACM Digital Library. Authors can also opt for non-archival submission, subject to the same review process but only appearing as an abstract in the proceedings. The theoretical computer science community has been involved in work on algorithmic fairness since its inception, and we hope that you’ll consider FAT* as a venue for your work.

Please forward this call to other people or groups you think may be interested.

For more details, see https://fatconference.org/2019/cfp.html
