# the MAD+ Seminar

We are excited to announce the MaD+ (math and data plus) seminar, jointly organized between NYU and ETH. For the time being it is a virtual seminar series, and depending on logistics and interest potentially evolve to a multi-location seminar once the world returns to normal (with talks hosted in one of the locations, and streamed). The topics will be in line with the physical MaD seminar hosted at the Center for Data Science, which features leading specialists at the interface of Applied Mathematics, Statistics and Machine Learning. We took inspiration from the fantastic virtual seminar TCS+ from our friends in Theoretical Computer Science.

This semester it runs on Wednesdays in two time slots (each week one time slot) 4pm CET (10am EST) and and 2pm EST (8pm CET), in an attempt to accommodate the various working hours of more researchers around the world. Stay tuned for more great names that will be speaking this semester!

MaD seminars are recorded and streamed live. Links to the videos are available below.
You can subscribe to a calendar here, and to our **Youtube channel** here.

### Schedule with Confirmed Speakers

Date | Speaker | Title | Stream |
---|---|---|---|

3/25, 10am EST | Jon Niles-Weed (NYU) | Matrix Concetration for Products | Zoom Meeting |

4/1, 10am EST | Weinan E (Princeton) | A Mathematical Perspective of Machine Learning | Zoom Meeting |

4/8, 10am EST | Philippe Rigollet (MIT) | Statistical and Computational aspects of Wasserstein Barycenters | Zoom Meeting |

4/22, 10am EST | David Gamarnik (MIT) | Overlap Gap Property: a Provable Barrier to Fast Optimization in Probabilistic Combinatorial Structures | Zoom Meeting |

4/29, 10am EST | Sara Van de Geer (ETH Zurich) | Total Variation Regularization | Zoom |

5/13 2pm EST | Emmanuel Candes (Stanford) | Reliable Predictions? Equitable Treatment? Some recent progress in predictive inference | Zoom |

5/20, 10am EST | Francis Bach (INRIA/ENS) | On the Effectiveness of Richardson Extrapolation in Machine Learning | Zoom |

5/27 | Lenka Zdeborova (CNRS) | TBA | |

6/3, 10am EST | Ingrid Daubechies (Duke) | TBA | |

6/10 10am EST | Andrea Montanari (Stanford) | TBA | |

6/17 10am EST | Aviv Regev (Broad Institute, MIT/Harvard) | TBA |

### Abstracts

#### Jon Niles-Weed: Matrix Concentration for Products

We develop nonasymptotic concentration bounds for products of independent random matrices. Such products arise in the study of stochastic algorithms, linear dynamical systems, and random walks on groups. Our bounds exactly match those available for scalar random variables and continue the program, initiated by Ahlswede-Winter and Tropp, of extending familiar concentration bounds to the noncommutative setting. Our proof technique relies on geometric properties of the Schatten trace class. Joint work with D. Huang, J. A. Tropp, and R. Ward.

#### Weinan E: A Mathematical Perspective of Machine Learning

The heart of modern machine learning is the approximation of high dimensional functions. Traditional approaches, such as approximation by piecewise polynomials, wavelets, or other linear combinations of fixed basis functions, suffer from the curse of dimensionality. We will discuss representations and approximations that overcome this difficulty, as well as gradient flows that can be used to find the optimal approximation. We will see that at the continuous level, machine learning can be formulated as a series of reasonably nice variational and PDE-like problems. Modern machine learning models/algorithms, such as the random feature and shallow/deep neural network models, can be viewed as special discretizations of such continuous problems. At the theoretical level, we will present a framework that is suited for analyzing machine learning models and algorithms in high dimension, and present results that are free of the curse of dimensionality. Finally, we will discuss the fundamental reasons that are responsible for the success of modern machine learning, as well as the subtleties and mysteries that still remain to be understood

#### Philippe Rigollet: Statistical and Computational aspects of Wasserstein Barycenters

The notion of average is central to most statistical methods. In this talk we study a generalization of this notion over the non-Euclidean space of probability measures equipped with a certain Wasserstein distance. This generalization is often called Wasserstein Barycenters and empirical evidence suggests that these barycenters allow to capture interesting notions of averages in graphics, data assimilation and morphometrics. However the statistical (rates of convergence) and computational (efficient algorithms) for these Wasserstein barycenters are largely unexplored. The goal of this talk is to review two recent results: 1. Fast rates of convergence for empirical barycenters in general geodesic spaces, and, 2. Provable guarantees for gradient descent and stochastic gradient descent to compute Wasserstein barycenters. Both results leverage geometric aspects of optimal transport. Based on joint works (arXiv:1908.00828, arXiv:2001.01700) with Chewi, Le Gouic, Maunu, Paris, and Stromme.

#### David Gamarnik: Overlap Gap Property: a Provable Barrier to Fast Optimization in Probabilistic Combinatorial Structures

Many combinatorial optimization problems defined on random instances exhibit an apparent gap between the optimal values, which can be computed by non-constructive means, and the best values achievable by fast (polynomial time) algorithms. Through a combined effort of mathematicians, computer scientists and statistical physicists, it became apparent that a potential barrier for designing fast algorithms bridging this gap is an intricate topology of nearly optimal solutions, in particular the presence of the Overlap Gap Property (OGP), which we will introduce in this talk. We will discuss how for many such problems the onset of the OGP phase transition introduces indeed a provable barrier to a broad class of polynomial time algorithms. Examples of such problems include the problem of finding a largest independent set of a random graph, finding a largest cut in a random hypergrah, the problem of finding a ground state of a p-spin model, and also many problems in high-dimensional statistics field. In this talk we will demonstrate in particular why OGP is a barrier for three classes of algorithms designed to find a near ground state in p-spin models arising in the field of spin glass theory: Approximate Message Passing algorithms, algorithms based on low-degree polynomial and Langevin dynamics. Joint work with Aukosh Jagannath and Alex Wein

#### Sara van de Geer: Total variation Regularization

Let Y be a n-dimensional vector of independent observations with unknown mean f^0 := E Y. We consider the estimator f_D that solves the ``analysis problem” min_f {|| Y - f ||_2^2/ n + 2 lambda || D f ||_1 } , where D is a given , m times n matrix and lambda>0 is a tuning parameter. An example for the matrix D is the (first order) difference operator (Df )*i= f*{i}- f_{i-1} , \ i \in [2:n] in which case || Df ||_1 = TV(f) is the (first order) total variation of the vector f. Other examples include higher order discrete derivatives, total variation on graphs and total variation in higher dimensions. Our aim is to show that the estimator f_D is adaptive. For example, when f^0 is a piecewise linear function, we show that the analysis estimator f_D, with D the second order differences operator, adapts to the number of kinks of f^0. As is the case with the Lasso, the theory for the analysis estimator f_D requires a form of “restricted eigenvalue” condition. We will show that this can be established using interpolating vectors. We will illustrate this (with drawings) for the various examples.

#### Emmanuel Candes: Reliable predictions? Equitable treatment? Some recent progress in predictive inference

Recent progress in machine learning (ML) provides us with many potentially effective tools to learn from datasets of ever increasing sizes and make useful predictions. How do we know that these tools can be trusted in critical and high-sensitivity systems? If a learning algorithm predicts the GPA of a prospective college applicant, what guarantees do I have concerning the accuracy of this prediction? How do we know that it is not biased against certain groups of applicants? This talk introduces statistical ideas to ensure that the learned models satisfy some crucial properties, especially reliability and fairness (in the sense that the models need to apply to individuals in an equitable manner). To achieve these important objectives, we shall not ‘open up the black box’ and try understanding its underpinnings. Rather we discuss broad methodologies — conformal inference, quantile regression, the Jackknife+ — that can be wrapped around any black box as to produce results that can be trusted and are equitable.

#### Francis Bach: On the effectiveness of Richardson Extrapolation in Machine Learning

Richardson extrapolation is a classical technique from numerical analysis that can improve the approximation error of an estimation method by combining linearly several estimates obtained from different values of one of its hyperparameters, without the need to know in details the inner structure of the original estimation method. The main goal of this presentation is to study when Richardson extrapolation can be used within machine learning. We identify two situations where Richardson interpolation can be useful: (1) when the hyperparameter is the number of iterations of an existing iterative optimization algorithm, with applications to averaged gradient descent and Frank-Wolfe algorithms, and (2) when it is a regularization parameter, with applications to Nesterov smoothing techniques for minimizing non-smooth functions and ridge regression. In all these cases, I will show that extrapolation techniques come with no significant loss in performance, but with sometimes strong gains.