On the Importance of Disciplinary Pride for Multidisciplinary Collaboration

I am a big fan of collaborations, even if they come with their own challenges. I always got further and enjoyed research much more because of my collaborators. I’m forever indebted to so many colleagues and dear, dear friends. Each and every one of them was better than me in some ways. To contribute, I had to remember my own strengths and bring them to the table. The premise of this post is that the same holds for collaboration between fields. It should be read as a call for theoreticians to bring the tools and the powerful way of thinking of TOC into collaborations. We shouldn’t be blind to the limitations of our field, but obsessing over those limitations is misguided and would only limit our impact. Instead, we should bring our best and trust the other disciplines we collaborate with to do the same (allowing each to complement and compensate for the other).

The context in which these thoughts came to my mind is Algorithmic Fairness. In this and other areas on the interface between society and computing, true collaboration is vital. Not surprisingly, attending multidisciplinary programs on Algorithmic Fairness is a major part of my professional activities these days. And I love it – I get to learn so much from people and disciplines that have been thinking about fairness for many decades and centuries. In addition, the Humanities are simply splendid. Multidisciplinary collaborations come with even more challenges than other collaborations: the language, tools and perspectives are different. But for exactly the same reasons they can be even more rewarding. Nevertheless, my fear, and the reason for this post, is that my less experienced TOC colleagues might come out of those interdisciplinary meetings frustrated and might lose confidence in what TOC can contribute. It feels to me that old lessons about the value of TOC need to be learned again. There is a lot to be proud of, and holding on to this pride would in fact make us better collaborators, not worse.

In the context of Algorithmic Fairness, we should definitely acknowledge (as we often do) that science exists within political structures, that algorithms are not objective and that mathematical definitions cannot replace social norms as expressed by policy makers. But let’s not take these as excuses for inaction and let’s not withdraw to the role of spectators. In this era of algorithms, other disciplines need us just as much as we need them.


**tl;dr:** the ITCS’20 CFP has been posted. Read it, and submit your work there!

We invite you to submit your papers to the 11th Innovations in Theoretical Computer Science (ITCS). The conference will be held at the University of Washington in Seattle, Washington from January 12-14, 2020.

ITCS seeks to promote research that carries a strong conceptual message (e.g., introducing a new concept, model or understanding, opening a new line of inquiry within traditional or interdisciplinary areas, introducing new mathematical techniques and methodologies, or new applications of known techniques). ITCS welcomes both conceptual and technical contributions whose contents will advance and inspire the greater theory community.

**Important dates**

*Submission deadline:* September 9, 2019 (05:59pm PDT)

*Notification to authors:* October 31, 2019

*Conference dates:* January 12-14, 2020

See the website at http://itcs-conf.org/itcs20/itcs20-cfp.html for detailed information regarding submissions.

**Program committee**

Nikhil Bansal, CWI + TU Eindhoven

Nir Bitansky, Tel-Aviv University

Clement Canonne, Stanford

Timothy Chan, University of Illinois at Urbana-Champaign

Edith Cohen, Google and Tel-Aviv University

Shaddin Dughmi, University of Southern California

Sumegha Garg, Princeton

Ankit Garg, Microsoft Research

Ran Gelles, Bar-Ilan University

Elena Grigorescu, Purdue

Tom Gur, University of Warwick

Sandy Irani, UC Irvine

Dakshita Khurana, University of Illinois at Urbana-Champaign

Antonina Kolokolova, Memorial University of Newfoundland

Pravesh Kothari, Carnegie Mellon University

Rasmus Kyng, Harvard

Katrina Ligett, Hebrew University

Nutan Limaye, IIT Bombay

Pasin Manurangsi, UC Berkeley

Tamara Mchedlidze, Karlsruhe Institute of Technology

Dana Moshkovitz, UT Austin

Jelani Nelson, UC Berkeley

Merav Parter, Weizmann Institute

Krzysztof Pietrzak, IST Austria

Elaine Shi, Cornell

Piyush Srivastava, Tata Institute of Fundamental Research, Mumbai

Li-Yang Tan, Stanford

Madhur Tulsiani, TTIC

Gregory Valiant, Stanford

Thomas Vidick, California Institute of Technology (chair)

Virginia Vassilevska Williams, MIT

Ronald de Wolf, CWI and University of Amsterdam

David Woodruff, Carnegie Mellon University

http://acm-stoc.org/stoc2019/callforworkshops.html

Submission deadline: March 24, 2019

Notification: April 5, 2019

Workshop Day: Sunday June 23, 2019

**STOC 2019 will hold a Workshop and Tutorial Day on Sunday June 23. We invite groups of interested researchers to submit workshop or tutorial proposals.**

The STOC 2019 Workshop and Tutorial Day provides an informal forum for researchers to discuss important research questions, directions, and challenges of the field. We also encourage workshops that focus on connections between theoretical computer science and other areas, topics that are not well represented at STOC, new directions, and open problems. The program may also include tutorials, each consisting of a few survey talks on a particular area.

**Format**: We have room for three workshops/tutorials running in parallel for the course of the day with a break for lunch. In addition, workshops or tutorials may be full-day (6 hrs) or half-day (3 hrs).

Workshop and tutorial proposals should, ideally, fit one page. Please include a list of names and email addresses of the organizers, a brief description of the topic and the goals of the workshop or tutorial, the proposed workshop format (invited talks, contributed talks, contributed posters, panel, etc.), and proposed or tentatively confirmed speakers if known. Please also indicate the preferred length of time (3 hrs or 6 hrs) for your workshop/tutorial. If your proposal is accepted and you wish to solicit contributed talks (which we strongly encourage), we can link to your call-for-contributions from the STOC 2019 page. Feel free to contact the Chair of the Workshop and Tutorials Committee directly or at the email address below if you have any questions.

Proposals should be submitted by **March 24, 2019** via email to stoc2019events@gmail.com. Proposers will be notified by April 5, 2019 whether their proposals have been accepted.

**Workshop and Tutorials Committee:** Moses Charikar (Chair)

Here is the official website for registration (free!) and other useful information:

https://sites.google.com/view/stoca19/home

For years now, especially since the landmark work of Krizhevsky et al., learning deep neural networks has been a method of choice in prediction and regression tasks, especially in perceptual domains found in computer vision and natural language processing. How effective might it be for solving *theoretical* tasks?

Specifically, focusing on supervised learning:

Can a deep neural network, paired with a stochastic gradient method, be shown to PAC learn any interesting concept class in polynomial time?

Depending on assumptions, and on one’s definition of “interesting,” present-day learning theory gives answers ranging from “no, that would solve hard problems,” to, more recently:

Theorem: Networks with depth between 2 and $\log(n)$,^{1} having standard activation functions,^{2} with weights initialized at random and trained with stochastic gradient descent, learn, in polynomial time, constant degree large margin polynomial thresholds.

Learning constant-degree polynomials can also be done simply *with a linear predictor* over a polynomial embedding, or, in other words, by learning a halfspace. That said, what a linear predictor can do is also *essentially the state of the art* in PAC learning, so this result pushes neural net learning at least as far as one might hope at first. We will return to this point later, and discuss some limitations of PAC analysis once they are more apparent. In this sense, this post will turn out to be as much an overview of some PAC learning theory as it is about neural networks.
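To make the "linear predictor over a polynomial embedding" idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the construction from the papers: the embedding lists all multilinear monomials up to a given degree, the linear learner is the classic perceptron (swapped in for simplicity), and the target `target(x) = x[0] * x[1]` is a hypothetical degree-2 example chosen because no halfspace over the raw inputs computes it.

```python
from itertools import combinations, product

def monomial_embedding(x, degree):
    """Map x in {-1,+1}^n to all multilinear monomials of degree <= `degree`."""
    feats = []
    for d in range(degree + 1):
        for idx in combinations(range(len(x)), d):
            m = 1
            for i in idx:
                m *= x[i]
            feats.append(m)
    return feats

def perceptron(examples, max_epochs=1000):
    """Classic perceptron: returns w with y * <w, phi> > 0 on every example."""
    w = [0] * len(examples[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for phi, y in examples:
            if y * sum(wi * pi for wi, pi in zip(w, phi)) <= 0:
                w = [wi + y * pi for wi, pi in zip(w, phi)]
                mistakes += 1
        if mistakes == 0:
            return w
    raise RuntimeError("no separator found within max_epochs")

# Hypothetical target: a degree-2 polynomial threshold (XOR of two bits in +-1 form).
def target(x):
    return x[0] * x[1]

n = 4
cube = list(product([-1, 1], repeat=n))
train = [(monomial_embedding(x, 2), target(x)) for x in cube]
w = perceptron(train)

def predict(x):
    s = sum(wi * pi for wi, pi in zip(w, monomial_embedding(x, 2)))
    return 1 if s > 0 else -1
```

The learner only ever sees the embedded vectors, so from its point of view this is plain halfspace learning; all of the "polynomial" structure lives in the fixed, data-independent embedding.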

Naturally, there is a wide variety of theoretical perspectives on neural network analysis, especially in the past couple of years. Our goal in this post is not to survey or cover any extensive body of work, but simply to summarize our own recent line (from two papers: DFS’16 and D’17), and to highlight the interaction with PAC learning.

First, let’s define a learning task. To keep things simple, we’ll focus on binary classification over the boolean cube, without noise. Formally:

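(Binary classification.) Given examples of the form $(x, f^*(x))$, where $x$ is sampled from some unknown distribution $\mathcal{D}$ on $\{\pm 1\}^n$, and $f^* : \{\pm 1\}^n \to \{\pm 1\}$ is some unknown function (the one that we wish to learn), find a function $h : \{\pm 1\}^n \to \{\pm 1\}$ whose error, $\mathrm{Err}_{\mathcal{D}}(h) = \Pr_{x \sim \mathcal{D}}[h(x) \neq f^*(x)]$, is small.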

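Second, define a neural network formally as a directed acyclic graph $G = (V, E)$ whose vertices are called neurons. Of them, $n$ are input neurons, one is an output neuron, and the rest are called hidden neurons.^{3} A network together with a weight vector $w$ (one weight $w_{uv}$ per edge $uv \in E$ and one weight $b_v$ per hidden neuron $v$) defines a predictor $h_w : \{\pm 1\}^n \to \mathbb{R}$ whose prediction is computed by propagating $x$ forward through the network. Concretely:

- For an input neuron $v$, $h_{v,w}(x)$ is the corresponding coordinate in $x$.
- For a hidden neuron $v$, define $h_{v,w}(x) = \sigma\left(\sum_{u : uv \in E} w_{uv} h_{u,w}(x) + b_v\right)$. The scalar weight $b_v$ is called a “bias.” In this post, the function $\sigma$ is the ReLU activation $\sigma(t) = \max(t, 0)$, though others are possible as well.
- For the output neuron $o$, we drop the activation: $h_{o,w}(x) = \sum_{u : uo \in E} w_{uo} h_{u,w}(x)$.

Finally, let $h_w(x) = h_{o,w}(x)$. This computes a real-valued function, so where we’d like to use it for classification, we do so by thresholding, and abuse the notation $h_w(x)$ to mean $\mathrm{sign}(h_w(x))$.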
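The forward-propagation rule can be sketched in a few lines of Python. This is only an illustration: the data layout (a topological order plus dictionaries of incoming edges and biases), the names `forward`, `classify` and `and_net`, and the convention that a real output of exactly 0 is thresholded to +1 are all assumptions made here, not something fixed by the post.

```python
def relu(t):
    return max(t, 0.0)

def forward(x, order, in_edges, bias, output):
    """Evaluate a DAG network on input x.

    order    : hidden and output neurons, listed in topological order
    in_edges : dict mapping a neuron to a list of (predecessor, weight) pairs
    bias     : dict mapping each hidden neuron to its bias
    output   : name of the output neuron
    Input neurons are named 0 .. len(x) - 1.
    """
    value = {i: float(xi) for i, xi in enumerate(x)}
    for v in order:
        pre = sum(w * value[u] for u, w in in_edges[v])
        if v == output:
            value[v] = pre                   # output neuron: no bias, no activation
        else:
            value[v] = relu(pre + bias[v])   # hidden neuron: bias, then ReLU
    return value[output]

def classify(x, *net):
    """Threshold the real-valued output (assumption: 0 maps to +1)."""
    return 1 if forward(x, *net) >= 0 else -1

# Toy network computing AND of two +-1 bits:
#   h = relu(x0 + x1 - 1) is 1 only when both inputs are +1,
#   c = relu(1) is a constant-1 neuron, and the output is 2*h - c.
and_net = (
    ["h", "c", "o"],
    {"h": [(0, 1.0), (1, 1.0)], "c": [], "o": [("h", 2.0), ("c", -1.0)]},
    {"h": -1.0, "c": 1.0},
    "o",
)
```

Since the output neuron has no bias of its own, the toy network gets its constant term from a hidden neuron with no incoming edges and bias 1.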

Some intuition for this definition would come from verifying that:

- Any function can be computed by a network of depth two and hidden neurons.
- The parity function can be computed by a network of depth two and hidden neurons. (NB: this one is a bit more challenging.)
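For the parity exercise, here is one possible construction, sketched in Python and checked by brute force. It is not necessarily the construction the authors have in mind, and its neuron count ($n + 1$ hidden neurons) is specific to this sketch. The idea: parity depends only on $k$, the number of $-1$ coordinates, and $(-1)^k$ can be interpolated by a piecewise-linear function of $k$, which is a sum of ReLUs.

```python
from itertools import product

def relu(t):
    return max(t, 0.0)

def parity_network(n):
    """Weights of a depth-two ReLU network computing prod(x) on {-1,+1}^n.

    Hidden neuron j (0 <= j < n) computes relu(k - j), where
    k = (n - sum(x)) / 2 is the number of -1 coordinates of x.
    One extra hidden neuron outputs the constant 1.
    """
    hidden = [([-0.5] * n, n / 2 - j) for j in range(n)]   # (weights, bias)
    hidden.append(([0.0] * n, 1.0))                         # constant-1 neuron
    # Output weights: slope changes of the piecewise-linear interpolation of (-1)^k.
    out = [-2.0] + [4.0 * (-1) ** (j + 1) for j in range(1, n)] + [1.0]
    return hidden, out

def evaluate(x, hidden, out):
    acts = [relu(sum(w * xi for w, xi in zip(ws, x)) + b) for ws, b in hidden]
    return sum(c * a for c, a in zip(out, acts))

def check(n):
    """Compare the network against the true parity on the whole cube."""
    hidden, out = parity_network(n)
    for x in product([-1, 1], repeat=n):
        true_parity = 1
        for xi in x:
            true_parity *= xi
        if evaluate(x, hidden, out) != true_parity:
            return False
    return True
```

All intermediate values are multiples of one half, so the floating-point comparison in `check` is exact.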

In practice, the network architecture (this DAG) is designed based on some domain knowledge, and its design can impact the predictor that’s later selected by SGD. One default architecture, useful in the absence of domain knowledge, is the multi-layer perceptron, comprised of layers of complete bipartite graphs:

Convolutional nets capture the notion of spatial input locality in signals such as images and audio.^{4} In the toy example drawn, each clustered triple of neurons is a so-called convolution filter applied to two components below it. In image domains, convolution filters are two-dimensional and capture responses to spatial 2-D patches of the image or of an intermediate layer.

Training a neural net comprises (i) initialization, and (ii) iterative optimization run until $\mathrm{sign}(h_w(x)) = f^*(x)$ for sufficiently many examples $x$. The initialization step sets the starting values of the weights at random:

(Glorot initialization.)Draw weights from centered Gaussians with variance and biases from independent standard Gaussians.^{5}

While other initialization schemes exist, this one is canonical, simple, and, as the reader can verify, satisfies for every neuron and input .

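The optimization step is essentially a local search method from the initial point, using stochastic gradient descent (SGD) or a variant thereof.^{6} To apply SGD, we need a function suitable for descent, and we’ll use the commonplace logistic loss $\ell(z) = \log_2(1 + e^{-z})$, which bounds the zero-one loss $\ell^{0\text{-}1}(z) = \mathbb{1}[z \le 0]$ from above.

Define $L_{\mathcal{D}}(w) = \mathbb{E}_{x \sim \mathcal{D}}\left[\ell\left(h_w(x) f^*(x)\right)\right]$. Note that $\mathrm{Err}_{\mathcal{D}}(h_w) \le L_{\mathcal{D}}(w)$, so finding weights $w$ for which the upper bound $L_{\mathcal{D}}(w)$ is small enough implies low error in turn. Meanwhile, $L_{\mathcal{D}}$ is amenable to iterative gradient-based minimization.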
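The claim that the logistic loss sits above the zero-one loss is easy to check numerically. The sketch below assumes the base-2 form of the logistic loss (so that the loss at margin 0 is exactly 1) and counts a margin of exactly 0 as a mistake; both conventions are assumptions made for this illustration.

```python
import math

def logistic_loss(z):
    """Base-2 logistic loss of the margin z = y * h(x)."""
    return math.log2(1.0 + math.exp(-z))

def zero_one_loss(z):
    """1 if the margin is non-positive (a mistake), else 0."""
    return 1.0 if z <= 0 else 0.0

# Margins on a grid from -10 to 10 in steps of 0.1.
margins = [k / 10 for k in range(-100, 101)]
dominated = all(logistic_loss(z) >= zero_one_loss(z) for z in margins)
```

Taking expectations over examples on both sides of the pointwise inequality is what turns it into a bound of the error by the expected logistic loss.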

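Given samples from $\mathcal{D}$, stochastic gradient descent creates an unbiased estimate of the gradient at each step by drawing a batch of i.i.d. samples from $\mathcal{D}$. The gradient at a point $w$ can be computed efficiently by the backpropagation algorithm.

In more complete detail, our prototypical neural network training algorithm is as follows. On input a network $\mathcal{N}$, an iteration count $T$, a batch size $m$, and a step size $\eta$:

**Algorithm: SGDNN**

- Let $w^0$ be random weights sampled per Glorot initialization
- For $t = 1, \ldots, T$:
  - Sample a batch $S_t = \{(x^t_1, f^*(x^t_1)), \ldots, (x^t_m, f^*(x^t_m))\}$, where $x^t_1, \ldots, x^t_m$ are i.i.d. samples from $\mathcal{D}$.
  - Update $w^t = w^{t-1} - \eta v^t$, where $v^t = \frac{1}{m} \sum_{i=1}^{m} \nabla_w \ell\left(h_{w^{t-1}}(x^t_i) f^*(x^t_i)\right)$.
- Output $w^T$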
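Below is a minimal NumPy sketch of an SGDNN-style training loop for a multi-layer perceptron. Several details are assumptions made for illustration and are not taken from the papers: weights are drawn with variance 1/fan-in, the output weights start at zero (as footnote 5 suggests), the loss is the natural-log logistic loss (a constant factor away from a base-2 version), and examples come from a hypothetical `sample(batch, rng)` callback that returns inputs and ±1 labels.

```python
import numpy as np

def init_weights(sizes, rng):
    """Random init: W ~ N(0, 1/fan_in), b ~ N(0, 1); output layer zeroed, no output bias."""
    Ws = [rng.normal(0, 1 / np.sqrt(m), size=(k, m))
          for m, k in zip(sizes[:-2], sizes[1:-1])]
    bs = [rng.normal(0, 1, size=k) for k in sizes[1:-1]]
    Ws.append(np.zeros((1, sizes[-2])))
    return Ws, bs

def forward(Ws, bs, X):
    """Return the activations of every layer (inputs first) and the real-valued outputs."""
    acts = [X]
    for W, b in zip(Ws[:-1], bs):
        acts.append(np.maximum(acts[-1] @ W.T + b, 0.0))
    return acts, (acts[-1] @ Ws[-1].T)[:, 0]

def gradients(Ws, bs, X, y):
    """Backpropagation for the batch-averaged loss log(1 + exp(-y * h(x)))."""
    acts, out = forward(Ws, bs, X)
    delta = (-y / (1.0 + np.exp(y * out)))[:, None] / len(y)   # d loss / d output
    gWs, gbs = [None] * len(Ws), [None] * len(bs)
    gWs[-1] = delta.T @ acts[-1]
    delta = delta @ Ws[-1]
    for l in range(len(bs) - 1, -1, -1):
        delta = delta * (acts[l + 1] > 0)    # ReLU derivative
        gWs[l] = delta.T @ acts[l]
        gbs[l] = delta.sum(axis=0)
        delta = delta @ Ws[l]
    return gWs, gbs

def sgdnn(sizes, sample, T, batch, eta, seed=0):
    """Run T steps of minibatch SGD from a random initialization."""
    rng = np.random.default_rng(seed)
    Ws, bs = init_weights(sizes, rng)
    for _ in range(T):
        X, y = sample(batch, rng)
        gWs, gbs = gradients(Ws, bs, X, y)
        Ws = [W - eta * g for W, g in zip(Ws, gWs)]
        bs = [b - eta * g for b, g in zip(bs, gbs)]
    return Ws, bs

# Hypothetical data source: uniform inputs on {-1,+1}^5, labelled by the first coordinate.
def sample(batch, rng):
    X = rng.choice([-1.0, 1.0], size=(batch, 5))
    return X, X[:, 0]

Ws, bs = sgdnn([5, 8, 1], sample, T=20, batch=16, eta=0.1)
```

Here `sizes` lists the layer widths from input to output, so `[5, 8, 1]` is a depth-two network with one hidden layer of width 8. Because the output weights start at zero, the first step only moves the output layer; the hidden layers start receiving gradient from the second step on.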

Learning a predictor from example data is a general task, and a hard one in the worst case. We cannot efficiently (i.e. in $\mathrm{poly}(n)$ time) compute, let alone learn, general functions from $\{\pm 1\}^n$ to $\{\pm 1\}$. In fact, any learning algorithm that is guaranteed to succeed in general (i.e. with any target predictor $f^*$ over any data distribution $\mathcal{D}$) runs, in the worst case, in time exponential in $n$. This is true even for rather weak definitions of “success,” such as finding a predictor with error less than , i.e. one that slightly outperforms a random guess.

While it is impossible to efficiently learn general functions under general distributions, it might still be possible to learn efficiently under some assumptions on the target $f^*$ or the distribution $\mathcal{D}$. Charting out such assumptions is the realm of learning theorists: by now, they’ve built up a broad catalog of function classes, and have studied the complexity of learning when the target function is in each such class. Although their primary aim has been to develop theory, the potential guidance for practice is easy to imagine: if one’s application domain happens to be modeled well by one of these easily-learnable function classes, there’s a corresponding learning algorithm to consider as well.

The vanilla PAC model makes no assumptions on the data distribution $\mathcal{D}$, but it does assume the target $f^*$ belongs to some simple, predefined class $\mathcal{H}$. Formally, a *PAC learning problem* is defined by a function class^{7} $\mathcal{H}$ of functions from $\{\pm 1\}^n$ to $\{\pm 1\}$. A learning algorithm *learns* the class $\mathcal{H}$ if, whenever $f^* \in \mathcal{H}$, and provided $\epsilon > 0$, it runs in time $\mathrm{poly}(n, 1/\epsilon)$, and returns a function of error at most $\epsilon$, with probability at least 0.9. Note that:

- The learning algorithm need not return a function from the learnt class.
- The polynomial-time requirement means in particular that the learning algorithm cannot output a complete truth table, as its size would be exponential. Instead, it must output a short description of a hypothesis that can be evaluated in polynomial time.

For a taste of the computational learning theory literature, here are some of the function classes studied by theorists over the years:

- *Linear thresholds (halfspaces):* functions that map a halfspace to 1 and its complement to -1. Formally, functions of the form $x \mapsto \theta(\langle w, x \rangle)$ for some $w \in \mathbb{R}^n$, where $\theta(z) = 1$ when $z \ge 0$ and $\theta(z) = -1$ when $z < 0$.
- *Large-margin linear thresholds:* for the class
- *Intersections of halfspaces:* functions that map an intersection of polynomially many halfspaces to 1 and its complement to -1.
- *Polynomial threshold functions:* thresholds of constant-degree polynomials.
- *Large-margin polynomial threshold functions:* the class
- *Decision trees*, *deterministic automata*, and *DNF formulas* of polynomial size.
- *Monotone conjunctions:* functions that, for some $I \subseteq [n]$, map $x$ to 1 if $x_i = 1$ for all $i \in I$, and to -1 otherwise.
- *Parities:* functions of the form $x \mapsto \prod_{i \in I} x_i$ for some $I \subseteq [n]$.
- *Juntas:* functions that depend on at most variables.

Learning theorists look at these function classes and work to distinguish those that are efficiently learnable from those that are *hard* to learn. They establish hardness results by reduction from other computational problems that are conjectured to be hard, such as random XOR-SAT (though none today are conditioned outright on NP hardness); see for example these two results. Meanwhile, halfspaces are learnable by linear programming. Parities, or more generally, $\mathbb{F}$-linear functions for a field $\mathbb{F}$, are learnable by Gaussian elimination. In turn, via reductions, many other classes are efficiently learnable. This includes polynomial thresholds, decision lists, and more. To give an idea of what’s known in the literature, here is an artist’s depiction of some of what’s currently known:
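As an illustration of the Gaussian-elimination route, here is a small Python sketch that learns a parity from labelled examples. It rewrites the ±1 values as bits, so a parity over an index set becomes a linear equation over GF(2), and then solves the resulting system. The function names and the choice to set free variables to 0 are assumptions of this sketch.

```python
def solve_gf2(rows, rhs):
    """Solve A s = b over GF(2) by Gaussian elimination; returns one solution or None."""
    n = len(rows[0])
    aug = [list(r) + [b] for r, b in zip(rows, rhs)]
    pivots, r = [], 0
    for c in range(n):
        p = next((i for i in range(r, len(aug)) if aug[i][c]), None)
        if p is None:
            continue
        aug[r], aug[p] = aug[p], aug[r]
        for i in range(len(aug)):
            if i != r and aug[i][c]:
                aug[i] = [u ^ v for u, v in zip(aug[i], aug[r])]
        pivots.append(c)
        r += 1
    if any(row[-1] for row in aug[r:]):
        return None                  # inconsistent: the labels fit no parity
    s = [0] * n
    for i, c in enumerate(pivots):
        s[c] = aug[i][-1]            # free variables default to 0
    return s

def learn_parity(examples):
    """examples: list of (x, y) with x in {-1,+1}^n and y in {-1,+1}.

    Returns an index set I such that y = prod_{i in I} x_i on every example,
    or None if no parity is consistent with the examples.
    """
    rows = [[(1 - xi) // 2 for xi in x] for x, _ in examples]   # +1 -> 0, -1 -> 1
    rhs = [(1 - y) // 2 for _, y in examples]
    s = solve_gf2(rows, rhs)
    return None if s is None else [i for i, bit in enumerate(s) if bit]
```

On a sample that does not pin down the index set uniquely, the sketch returns some parity consistent with the sample, which is all a PAC learner for this class needs to output.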

At a high level, the upshot of all this (and the one thing to take away from this quick tour of PAC) is that:

Barring a small handful of exceptions, all known efficiently learnable classes can be reduced to halfspaces or $\mathbb{F}$-linear functions.

Or, to put it more bluntly, **the state of the art in PAC-learnability is essentially linear prediction**.

Research in algorithms and complexity often follows these steps:

- define a computational problem,
- design an algorithm that solves it, and then
- establish bounds on the resource requirements of that algorithm.

A bound on the algorithm’s performance forms, in turn, a bound on the *computational problem’s* inherent complexity.

By contrast, we have already decided on our SGDNN algorithm, and we’d like to attain some grasp on its capabilities. So we’d like to do things in a different order:

- define an *algorithm* (done),
- design a computational problem to which the algorithm can be applied, and then
- establish bounds on the resource requirements of the algorithm in solving the problem.

Our computational problem will be a PAC learning problem, corresponding to a function class. For SGDNN, an ambitious function class we might consider is the class of all functions realizable by the network. But if we were to follow this approach, we would run up against the same hardness results mentioned before.

So instead, we’ve established the theorem stated at the top of this post. That is, that SGDNN, over a range of network configurations, learns a class that we *already know* to be learnable: large margin polynomial thresholds. Restated:

Theorem, again: There is a choice of SGDNN step size $\eta$ and number of steps $T$, as well as a width parameter $r$, where , such that SGDNN on a multi-layer perceptron of depth between 2 and $\log(n)$, and of width^{8} $r$, learns large margin polynomials.

How rich are large margin polynomials? They contain disjunctions, conjunctions, DNF and CNF formulas with constantly many terms, and DNF and CNF formulas with constantly many literals in each term. As a corollary, SGDNN can PAC learn these classes as well. And at this point, we’ve covered a considerable fraction of the function classes known to be poly-time PAC learnable by *any* method.

Exceptions include constant-degree polynomial thresholds with no restriction on the coefficients, decision lists, and parities. It is well known that SGDNN cannot learn parities, and in ongoing work with Vitaly Feldman, we show that SGDNN can learn neither decision lists nor constant-degree polynomial thresholds with unrestricted coefficients. So the picture becomes clearer:

The theorem above runs SGDNN with a multi-layer perceptron. What happens if we change the network architecture? It can be shown then that SGDNN learns a qualitatively different function class. For instance, with convolutional networks, the learnable functions include certain polynomials of *super-constant* degree.

The path to the theorem traverses two papers. There’s a corresponding outline for the proof.

The first step is to show that, with high probability, the Glorot random initialization renders the network in a state where the final hidden layer (just before the output node) is rich enough to approximate all large-margin polynomial threshold functions (LMPTs). Namely, every LMPT can be approximated by the network up to some setting of the weights that enter the output neuron (all remaining weights random). The tools for this part of the proof include (i) the connection between kernels and random features, (ii) a characterization of symmetric kernels of the sphere, and (iii) a variety of properties of Hermite polynomials. It’s described in our 2016 paper.

An upshot of this correspondence is that if we run SGD *only on the top layer* of a network, leaving the remaining weights as they were randomly initialized, we learn LMPTs. (Remember when we said that we won’t beat what a linear predictor can do? There it is again.) The second step of the proof, then, is to show that the correspondence continues to hold even if we train all the weights. In the assumed setting (e.g. provided at most logarithmic depth, sufficient width, and so forth), what’s represented in the final hidden layer changes sufficiently slowly that, over the course of SGDNN’s iterations, it *remains* rich enough to approximate all LMPTs. The final layer does the remaining work of picking out the right LMPT. The argument is in Amit’s 2017 paper.
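The "train only the top layer" regime can be sketched as random features followed by a linear predictor. The NumPy snippet below is an illustration under assumptions of its own: a single frozen hidden layer with Gaussian weights of variance 1/fan-in, the natural-log logistic loss, full-batch gradient descent instead of SGD, and a hypothetical degree-2 target on 4 input bits.

```python
import numpy as np

def random_relu_features(X, width, rng):
    """Hidden layer with frozen random weights: x -> relu(W x + b)."""
    W = rng.normal(0, 1 / np.sqrt(X.shape[1]), size=(width, X.shape[1]))
    b = rng.normal(0, 1, size=width)
    return np.maximum(X @ W.T + b, 0.0)

def logistic(Phi, y, v):
    """Average logistic loss of the linear predictor v over the features Phi."""
    return np.mean(np.log1p(np.exp(-y * (Phi @ v))))

def train_top_layer(Phi, y, steps):
    """Full-batch gradient descent on the output weights only (a convex problem)."""
    v = np.zeros(Phi.shape[1])
    eta = 1.0 / np.mean(np.sum(Phi ** 2, axis=1))   # conservative step size
    for _ in range(steps):
        grad = Phi.T @ (-y / (1.0 + np.exp(y * (Phi @ v)))) / len(y)
        v -= eta * grad
    return v

rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(64, 4))
y = X[:, 0] * X[:, 1]                  # hypothetical degree-2 target
Phi = random_relu_features(X, 50, rng)
v = train_top_layer(Phi, y, 200)
```

Because the hidden weights never move, the optimization over `v` is ordinary logistic regression on a fixed embedding, which is the sense in which this regime does not go beyond what a linear predictor can do.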

To what extent should we be satisfied, knowing that our algorithm of interest (SGDNN) can solve a (computationally) easy problem?

On the positive side, we’ve managed to say something at all about neural network training in the PAC framework. Roughly speaking, some class of non-trivially layered neural networks, trained as they typically are, learns any known learnable function class that isn’t “too sensitive.” It’s also appealing that the function classes vary across different architectures.

On the pessimistic side, we’re confronted with a major limitation of the “function class” perspective, prevalent in PAC analysis and elsewhere in learning theory. All of the classes that SGDNN learns, *under the assumptions* touched on in this post, are so-called large-margin classes. Large-margin classes are essentially linear predictors over a *fixed and data-independent* embedding of input examples, as alluded to before. These are inherently “shallow models.”

That seems rather problematic in pursuing any kind of theory for learning layered networks, where the entire working premise is that a deep network uses its hidden layers to learn a representation adapted to the example domain. Our analysis (both its goal and its proof) clashes with this intuition: it works out that a “shallow model” can be learned when assumptions imply that “not too much” change takes place in hidden layers. It seems that the representation learning phenomenon is what’s interesting, yet the typical PAC approach, as well as the analysis touched on in this post, avoids capturing it.

1. Here $n$ is the dimension of the instance space.
2. For instance, ReLU activations, of the form $\sigma(t) = \max(t, 0)$.
3. Recurrent networks allow for cycles, but in this post we stick to DAGs.
4. Convolutional networks often also constrain subsets of their weights to be equal; that turns out not to bear much on this post.
5. Although not essential to the results described, it also simplifies this post to zero the weights on edges incident to the output node as part of the initialization.
6. Variants of SGD are used in practice, including algorithms used elsewhere in optimization (e.g. SGD with momentum, AdaGrad) or techniques developed more specifically for neural nets (e.g. RMSprop, Adam, batch norm). We’ll stick to plain SGD.
7. More accurately, a sequence of function classes $\mathcal{H}_n$ for $n = 1, 2, \ldots$.
8. The width of a multi-layer perceptron is the number of neurons in each hidden layer.

The call for nomination for the 2019 Gödel Prize is out and the deadline is February 15th. For all awards, we sometimes have the tendency to think that worthy candidates have surely been nominated by others. Often it is not the case (and thus worthy candidates are often left behind). So if there is a paper or papers deserving nomination, please nominate! The call for nomination is below.

Deadline: February 15, 2019

The Gödel Prize for outstanding papers in the area of theoretical computer science is sponsored jointly by the European Association for Theoretical Computer Science (EATCS) and the Association for Computing Machinery, Special Interest Group on Algorithms and Computation Theory (ACM SIGACT). The award is presented annually, with the presentation taking place alternately at the International Colloquium on Automata, Languages, and Programming (ICALP) and the ACM Symposium on Theory of Computing (STOC). The 27th Gödel Prize will be awarded at the 51st Annual ACM Symposium on the Theory of Computing, to be held during June 23-26, 2019 in Phoenix, AZ. The Prize is named in honor of Kurt Gödel in recognition of his major contributions to mathematical logic and of his interest, discovered in a letter he wrote to John von Neumann shortly before von Neumann’s death, in what has become the famous “P versus NP” question. The Prize includes an award of USD 5,000.

**Award Committee:** The 2019 Award Committee consists of Anuj Dawar (Cambridge University), Robert Krauthgamer (Weizmann Institute), Joan Feigenbaum (Yale University), Giuseppe Persiano (Università di Salerno), Omer Reingold (Chair, Stanford University) and Daniel Spielman (Yale University).

**Eligibility:** The 2019 Prize rules are given below and they supersede any different interpretation of the generic rule to be found on websites of both SIGACT and EATCS. Any research paper or series of papers by a single author or by a team of authors is deemed eligible if:

- The main results were not published (in either preliminary or final form) in a journal or conference proceedings before January 1st, 2006.
- The paper was published in a recognized refereed journal no later than December 31, 2018.

The research work nominated for the award should be in the area of theoretical computer science. Nominations are encouraged from the broadest spectrum of the theoretical computer science community so as to ensure that potential award winning papers are not overlooked. The Award Committee shall have the ultimate authority to decide whether a particular paper is eligible for the Prize.

**Nominations:**

Nominations for the award should be submitted by email to the Award Committee Chair: reingold@stanford.edu. Please make sure that the Subject line of all nominations and related messages begins with “Goedel Prize 2019.” To be considered, nominations for the 2019 Prize must be received by February 15, 2019.

A nomination package should include:

1. A printable copy (or copies) of the journal paper(s) being nominated, together with a complete citation (or citations) thereof.

2. A statement of the date(s) and venue(s) of the first conference or workshop publication(s) of the nominated work(s) or a statement that no such publication has occurred.

3. A brief summary of the technical content of the paper(s) and a brief explanation of its significance.

4. A support letter or letters signed by at least two members of the scientific community.

Additional support letters may also be received and are generally useful. The nominated paper(s) may be in any language. However, if a nominated publication is not in English, the nomination package must include an extended summary written in English.

Those intending to submit a nomination should contact the Award Committee Chair by email well in advance. The Chair will answer questions about eligibility, encourage coordination among different nominators for the same paper(s), and also accept informal proposals of potential nominees or tentative offers to prepare formal nominations. The committee maintains a database of past nominations for eligible papers, but fresh nominations for the same papers (especially if they highlight new evidence of impact) are always welcome.

**Selection Process:**

The Award Committee is free to use any other sources of information in addition to the ones mentioned above. It may split the award among multiple papers, or declare no winner at all. All matters relating to the selection process left unspecified in this document are left to the discretion of the Award Committee.

**Recent Winners**

(all winners since 1993 are listed at http://www.sigact.org/Prizes/Godel/ and http://eatcs.org/index.php/goedel-prize):

**2018:** Oded Regev, On lattices, learning with errors, random linear codes, and cryptography, Journal of the ACM (JACM), Volume 56 Issue 6, 2009 (preliminary version in Symposium on Theory of Computing, STOC 2005).

**2017:** Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith, Calibrating Noise to Sensitivity in Private Data Analysis, Journal of Privacy and Confidentiality, Volume 7, Issue 3, 2016 (preliminary version in Theory of Cryptography, TCC 2006).

**2016:** Stephen Brookes, A Semantics for Concurrent Separation Logic. Theoretical Computer Science 375(1-3): 227-270 (2007). Peter W. O’Hearn, Resources, Concurrency, and Local Reasoning. Theoretical Computer Science 375(1-3): 271-307 (2007).

**2015:** Dan Spielman and Shang-Hua Teng, Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems, Proc. 36th ACM Symposium on Theory of Computing, pp. 81-90, 2004; Spectral sparsification of graphs, SIAM J. Computing 40:981-1025, 2011; A local clustering algorithm for massive graphs and its application to nearly linear time graph partitioning, SIAM J. Computing 42:1-26, 2013; Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems, SIAM J. Matrix Anal. Appl. 35:835-885, 2014.

**2014:** Ronald Fagin, Amnon Lotem, and Moni Naor, Optimal Aggregation Algorithms for Middleware, Journal of Computer and System Sciences 66(4): 614–656, 2003.

**2013:** Antoine Joux, A one round protocol for tripartite Diffie-Hellman, J. Cryptology 17(4): 263-276, 2004. Dan Boneh and Matthew K. Franklin, Identity-Based Encryption from the Weil pairing, SIAM J. Comput. 32(3): 586-615, 2003.

**Shafi Goldwasser** for the Motwani colloquium, telling us about *Pseudo Deterministic Algorithms and Proofs*, **Avishay Tal** about *Oracle Separation of BQP and the Polynomial Hierarchy*, and **Badih Ghazi** about *Resource-Efficient Common Randomness and Secret Key Generation*. We will also have student talks, food and drink and a great and diverse group of theoreticians as usual.

I will devote several posts (by myself and others) to the (beautiful) “emerging theory of algorithmic fairness.” Most of these posts will be more technical, but I’d like to devote today’s post to a short discussion of what theoreticians can contribute to this multidisciplinary effort.

My own belief is that computer scientists cannot solve Algorithmic Fairness (and privacy in data analysis or any other issue of this sort) on their own. On the other hand, these issues, in their current computation-driven large-scale incarnation, cannot be seriously addressed without major involvement of computer scientists. Furthermore, what is needed (as I will try to demonstrate in future posts) is a true collaboration, rather than a division of work, where one community sub-contracts another for specific expertise.

One of the reasons the Theory of Computing is particularly suited to this challenge is our basic optimism in the face of complexities and even impossibilities. The topic of Algorithmic Fairness seems to be particularly entangled with such complexities. This is the source of a line of criticism on the inherent limitations of the “tech solutionist” approach to Algorithmic Fairness. For example, “discrimination is the result of biases in the data and cannot be addressed at the level of machine learning.” Another example: “unless we understand the causal structure we are analyzing, fairness cannot be obtained.” These criticisms (while not as devastating as they are sometimes presented) are not without merit, and they deserve a much more technical discussion (that will hopefully come in future posts). At this point I’d like to make two comments:

- The computational lens has served us well in the study of Cryptography, Game Theory, Learning, Privacy and beyond. There is already evidence that it is serving us well in the study of Algorithmic Fairness. I believe that the pessimistic view of what I would call “all-or-nothing-ism” ignores an incredible track record of Theory of Computing in addressing complicated human-involving subject areas, and ignores the progress already made on Algorithmic Fairness.
- Furthermore, no one is planning to stop analyzing data (for example in medical research) because our data is imperfect or because we didn’t figure out causality. Algorithmic Fairness requires both the best solutions we can come up with right now, and a concerted research effort to guarantee better fairness in the future.

While all too common, the term “technologists” in this context is unfortunate. Who are those mysterious “technologists?” Are they software engineers? Are they computer scientists? (and which sub area: Machine Learning? Theory? Others?) Or perhaps CEOs of technology companies? Or perhaps this refers to the investment firms and Wall Street, who seem to have such a huge sway over technology companies? Perhaps users of technology are to blame? Each of those is a completely different group of individuals with completely different sets of constraints and incentives. Lumping them all together is close to meaningless.

In a sequence of posts (by me and others and of increasing level of technical details), I hope to discuss the role of Theory of Computing in the study of the particularly important societal issue of Algorithmic Fairness. In this post, I’d like to briefly discuss the role of Academia more generally.

**The power and weakness of education**

An idea that is gaining traction is that ethics and the societal impact of computation should be embedded in essentially all Computer Science courses. I am all for it! (In fact, ethics should be a major part of every curriculum on campus, not just Computer Science.) As a huge fraction of students take some Computer Science courses these days, this will improve technology consumers’ awareness of ethics in computation. It will also improve the awareness of software engineers, eventually that of the leadership of technology companies, and, just as importantly, that of policy makers.

But awareness, in itself, may not have much of an impact. Software engineers often have very little flexibility in shaping the products they develop, even when it comes to topics that more clearly affect the bottom line of their companies (this has to do with the quick pace and incentive structure of companies). Even the most philanthropic CEOs seem to run companies that violate basic ethical considerations. Here too, the incentive structure is much more to blame than lack of awareness. And even consumers who want to punish violators often do not, as many software companies are to a large extent monopolies. In other cases, violators operate behind the scenes, hidden from consumers.

**Developing the Knowledge and Tools**

I would also add that topics like privacy and algorithmic fairness require significant sophistication and much of the required knowledge and tools are yet to be developed. This means that academia (and funding agencies) should perform and support much more research. But (big) companies (that make their living exploiting sensitive data) should also hire many more researchers (of various disciplines) to develop the tools they need.

The great breakthroughs in Machine Learning within industry did not occur because the employees of those companies increased their awareness of the importance of data analysis. They happened because those companies employed talented and knowledgeable individuals and poured a lot of money into machine learning. Unless companies invest many more resources in their ethics, we are going to see the same recurring failures in protecting their users.

**Regulations**

As we already mentioned, users are very limited in their ability to punish big companies, so it is unlikely that we will see the needed investment across the board (some companies are much better in this regard, but those companies are the exception rather than the rule). In addition to education, we need to enforce good behavior through legislation and regulation. Unfortunately, the direction of the current administration is to remove protections for consumers. Still, we can hope that Europe (as well as some of the more progressive U.S. states) will come to our rescue once again. As for the role of scientists, we should work with policy makers to develop and advocate for the “right” regulations.
