# the MAD+ Seminar

We are excited to announce the MaD+ (math and data plus) seminar, jointly organized between NYU and ETH. For the time being it is a virtual seminar series, and depending on logistics and interest potentially evolve to a multi-location seminar once the world returns to normal (with talks hosted in one of the locations, and streamed). The topics will be in line with the physical MaD seminar hosted at the Center for Data Science, which features leading specialists at the interface of Applied Mathematics, Statistics and Machine Learning. We took inspiration from the fantastic virtual seminar TCS+ from our friends in Theoretical Computer Science.

This semester it runs on Wednesdays in two time slots (each week one time slot) 4pm CET (10am EST) and and 2pm EST (8pm CET), in an attempt to accommodate the various working hours of more researchers around the world. Stay tuned for more great names that will be speaking this semester!

MaD seminars are recorded and streamed live. Links to the videos are available below.
You can subscribe to a calendar here, and to our **Youtube channel** here.

### Schedule with Confirmed Speakers

### Abstracts

#### Jon Niles-Weed: Matrix Concentration for Products

We develop nonasymptotic concentration bounds for products of independent random matrices. Such products arise in the study of stochastic algorithms, linear dynamical systems, and random walks on groups. Our bounds exactly match those available for scalar random variables and continue the program, initiated by Ahlswede-Winter and Tropp, of extending familiar concentration bounds to the noncommutative setting. Our proof technique relies on geometric properties of the Schatten trace class. Joint work with D. Huang, J. A. Tropp, and R. Ward.

#### Weinan E: A Mathematical Perspective of Machine Learning

The heart of modern machine learning is the approximation of high dimensional functions. Traditional approaches, such as approximation by piecewise polynomials, wavelets, or other linear combinations of fixed basis functions, suffer from the curse of dimensionality. We will discuss representations and approximations that overcome this difficulty, as well as gradient flows that can be used to find the optimal approximation. We will see that at the continuous level, machine learning can be formulated as a series of reasonably nice variational and PDE-like problems. Modern machine learning models/algorithms, such as the random feature and shallow/deep neural network models, can be viewed as special discretizations of such continuous problems. At the theoretical level, we will present a framework that is suited for analyzing machine learning models and algorithms in high dimension, and present results that are free of the curse of dimensionality. Finally, we will discuss the fundamental reasons that are responsible for the success of modern machine learning, as well as the subtleties and mysteries that still remain to be understood

#### Philippe Rigollet: Statistical and Computational aspects of Wasserstein Barycenters

The notion of average is central to most statistical methods. In this talk we study a generalization of this notion over the non-Euclidean space of probability measures equipped with a certain Wasserstein distance. This generalization is often called Wasserstein Barycenters and empirical evidence suggests that these barycenters allow to capture interesting notions of averages in graphics, data assimilation and morphometrics. However the statistical (rates of convergence) and computational (efficient algorithms) for these Wasserstein barycenters are largely unexplored. The goal of this talk is to review two recent results: 1. Fast rates of convergence for empirical barycenters in general geodesic spaces, and, 2. Provable guarantees for gradient descent and stochastic gradient descent to compute Wasserstein barycenters. Both results leverage geometric aspects of optimal transport. Based on joint works (arXiv:1908.00828, arXiv:2001.01700) with Chewi, Le Gouic, Maunu, Paris, and Stromme.

#### David Gamarnik: Overlap Gap Property: a Provable Barrier to Fast Optimization in Probabilistic Combinatorial Structures

Many combinatorial optimization problems defined on random instances exhibit an apparent gap between the optimal values, which can be computed by non-constructive means, and the best values achievable by fast (polynomial time) algorithms. Through a combined effort of mathematicians, computer scientists and statistical physicists, it became apparent that a potential barrier for designing fast algorithms bridging this gap is an intricate topology of nearly optimal solutions, in particular the presence of the Overlap Gap Property (OGP), which we will introduce in this talk. We will discuss how for many such problems the onset of the OGP phase transition introduces indeed a provable barrier to a broad class of polynomial time algorithms. Examples of such problems include the problem of finding a largest independent set of a random graph, finding a largest cut in a random hypergrah, the problem of finding a ground state of a p-spin model, and also many problems in high-dimensional statistics field. In this talk we will demonstrate in particular why OGP is a barrier for three classes of algorithms designed to find a near ground state in p-spin models arising in the field of spin glass theory: Approximate Message Passing algorithms, algorithms based on low-degree polynomial and Langevin dynamics. Joint work with Aukosh Jagannath and Alex Wein

#### Sara van de Geer: Total variation Regularization

Let Y be a n-dimensional vector of independent observations with unknown mean f^0 := E Y. We consider the estimator f_D that solves the ``analysis problem” min_f {|| Y - f ||_2^2/ n + 2 lambda || D f ||_1 } , where D is a given , m times n matrix and lambda>0 is a tuning parameter. An example for the matrix D is the (first order) difference operator (Df )*i= f*{i}- f_{i-1} , \ i \in [2:n] in which case || Df ||_1 = TV(f) is the (first order) total variation of the vector f. Other examples include higher order discrete derivatives, total variation on graphs and total variation in higher dimensions. Our aim is to show that the estimator f_D is adaptive. For example, when f^0 is a piecewise linear function, we show that the analysis estimator f_D, with D the second order differences operator, adapts to the number of kinks of f^0. As is the case with the Lasso, the theory for the analysis estimator f_D requires a form of “restricted eigenvalue” condition. We will show that this can be established using interpolating vectors. We will illustrate this (with drawings) for the various examples.

#### Emmanuel Candes: Reliable predictions? Equitable treatment? Some recent progress in predictive inference

Recent progress in machine learning (ML) provides us with many potentially effective tools to learn from datasets of ever increasing sizes and make useful predictions. How do we know that these tools can be trusted in critical and high-sensitivity systems? If a learning algorithm predicts the GPA of a prospective college applicant, what guarantees do I have concerning the accuracy of this prediction? How do we know that it is not biased against certain groups of applicants? This talk introduces statistical ideas to ensure that the learned models satisfy some crucial properties, especially reliability and fairness (in the sense that the models need to apply to individuals in an equitable manner). To achieve these important objectives, we shall not ‘open up the black box’ and try understanding its underpinnings. Rather we discuss broad methodologies — conformal inference, quantile regression, the Jackknife+ — that can be wrapped around any black box as to produce results that can be trusted and are equitable.

#### Francis Bach: On the effectiveness of Richardson Extrapolation in Machine Learning

Richardson extrapolation is a classical technique from numerical analysis that can improve the approximation error of an estimation method by combining linearly several estimates obtained from different values of one of its hyperparameters, without the need to know in details the inner structure of the original estimation method. The main goal of this presentation is to study when Richardson extrapolation can be used within machine learning. We identify two situations where Richardson interpolation can be useful: (1) when the hyperparameter is the number of iterations of an existing iterative optimization algorithm, with applications to averaged gradient descent and Frank-Wolfe algorithms, and (2) when it is a regularization parameter, with applications to Nesterov smoothing techniques for minimizing non-smooth functions and ridge regression. In all these cases, I will show that extrapolation techniques come with no significant loss in performance, but with sometimes strong gains.

#### Lenka Zdeborova: Understanding machine learning via exactly solvable statistical physics models

The affinity between statistical physics and machine learning has a long history, this is reflected even in the machine learning terminology that is in part adopted from physics. I will describe the main lines of this long-lasting friendship in the context of current theoretical challenges and open questions about deep learning. Theoretical physics often proceeds in terms of solvable synthetic models, I will describe the related line of work on solvable models of simple feed-forward neural networks. I will highlight a path forward to capture the subtle interplay between the structure of the data, the architecture of the network, and the learning algorithm.

#### Ingrid Daubechies: Diffusion Methods in Manifold and Fibre Bundle Learning

Diffusion methods help understand and denoise data sets; when there is additional structure (as is often the case), one can use (and get additional benefit from) a fiber bundle model.

#### BLACK LIVES MATTER

The MaD+ seminar is postponed in support of the Strike for Black Lives. We encourage participants to use today to focus on anti-racist education and advocacy, to help combat anti-Blackness in academia and elsewhere.

#### Andrea Montanari: The generalization error of overparametrized models: Insights from exact asymptotics

In a canonical supervised learning setting, we are given n data samples, each comprising a feature vector and a label, or response variable. We are asked to learn a function f that can predict the the label associated to a new –unseen– feature vector. How is it possible that the model learnt from observed data generalizes to new points? Classical learning theory assumes that data points are drawn i.i.d. from a common distribution and argue that this phenomenon is a consequence of uniform convergence: the training error is close to its expectation uniformly over all models in a certain class. Modern deep learning systems appear to defy this viewpoint: they achieve training error that is significantly smaller than the test error, and yet generalize well to new data. I will present a sequence of high-dimensional examples in which this phenomenon can be understood in detail. [Based on joint work with Song Mei, Feng Ruan, Youngtak Sohn, Jun Yan]

#### Mahdi Soltanolkotabi: Learning via early stopping and untrained neural nets

Modern neural networks are typically trained in an over-parameterized regime where the parameters of the model far exceed the size of the training data. Such neural networks in principle have the capacity to (over)fit any set of labels including significantly corrupted ones. Despite this (over)fitting capacity, over-parameterized networks have an intriguing robustness capability: they are surprisingly robust to label noise when first order methods with early stopping are used to train them. Even more surprising, one can remove noise and corruption from a natural image without using any training data what-so-ever, by simply fitting (via gradient descent) a randomly initialized, over-parameterized convolutional generator to a single corrupted image. In this talk I will first present theoretical results aimed at explaining the robustness capability of neural networks when trained via early-stopped gradient descent. I will then present results towards demystifying untrained networks for image reconstruction/restoration tasks such as denoising and those arising in inverse problems such as compressive sensing.

#### Samory Kpotufe: Some Recent Insights on Transfer-Learning

A common situation in Machine Learning is one where training data is not fully representative of a target population due to bias in the sampling mechanism or high costs in sampling the target population; in such situations, we aim to ’transfer’ relevant information from the training data (a.k.a. source data) to the target application. How much information is in the source data? How much target data should we collect if any? These are all practical questions that depend crucially on ‘how far’ the source domain is from the target. However, how to properly measure ‘distance’ between source and target domains remains largely unclear.

In this talk we will argue that much of the traditional notions of ‘distance’ (e.g. KL-divergence, extensions of TV such as D_A discrepancy, density-ratios, Wasserstein distance) can yield an over-pessimistic picture of transferability. Instead, we show that some new notions of ‘relative dimension’ between source and target (which we simply term ‘transfer-exponents’) capture a continuum from easy to hard transfer. Transfer-exponents uncover a rich set of situations where transfer is possible even at fast rates, encode relative benefits of source and target samples, and have interesting implications for related problems such as multi-task or multi-source learning.

In particular, in the case of multi-source learning, we will discuss (if time permits) a strong dichotomy between minimax and adaptive rates: no adaptive procedure can achieve a rate better than single source rates, although minimax (oracle) procedures can.

The talk is based on earlier work with Guillaume Martinet, and ongoing work with Steve Hanneke.

#### Ahmed El Alaoui: Optimization of mean-field spin glass Hamiltonians

We consider the question of computing an approximate ground state configuration of an Ising (mixed) p-spin Hamiltonian H_N from a bounded number of gradient evaluations.

I will present an efficient algorithm which exploits the ultrametric structure of the superlevel sets of H_N in order to achieve an energy E_* characterized via an extended Parisi variational principle. This energy E_* is optimal when the model satisfies a `no overlap gap’ condition. At the heart of this algorithmic approach is a stochastic control problem, whose dual turns out to be the Parisi formula, thereby shedding new light on the nature of the latter.

This is joint work with Andrea Montanari and Mark Sellke.

#### Giulio Biroli: On the benefit of over-parametrization and the origin of double descent curves in artificial neural networks

Deep neural networks have triggered a revolution in machine learning, and more generally in computer science. Understanding their remarkable performance is a key scientific challenge with many open questions. For instance, practitioners find that using massively over-parameterised networks is beneficial to learning and generalization ability. This fact goes against standard theories, and defies intuition. In this talk I will address this issue. I will first contrast standard expectations based on variance-bias trade-off to the results of numerical experiments on deep neural networks, which display a “double-descent” behavior of the test error when increasing the number of parameters instead of the traditional U-curve. I will then discuss a theory of this phenomenon based on the solution of simplified models of deep neural networks by statistical physics methods.