# the MAD+ Seminar

We are excited to announce the MaD+ (math and data plus) seminar, jointly organized between NYU and ETH. For the time being it is a virtual seminar series, and depending on logistics and interest potentially evolve to a multi-location seminar once the world returns to normal (with talks hosted in one of the locations, and streamed). The topics will be in line with the physical MaD seminar hosted at the Center for Data Science, which features leading specialists at the interface of Applied Mathematics, Statistics and Machine Learning. We took inspiration from the fantastic virtual seminar TCS+ from our friends in Theoretical Computer Science;

MaD seminars are recorded and streamed live. Links to the videos are available below.
You can subscribe to a calendar here, and to our **Youtube channel** here.

### Spring 2021 Speakers

Date | Speaker | Title | Stream | Video |
---|---|---|---|---|

4/16 11am EST | Galen Reeves (Duke University) | Statistical Limits for the Matrix Tensor Product | Zoom | Video |

4/23 11am EST | Lorin Crawford (Microsoft Research & Brown University) | Statistical Frameworks for Mapping 3D Shape Variation onto Genotypic and Phenotypic Variation | Zoom | Video |

5/7 11am EST | Inbar Seroussi (Weizmann Institute of Science) | How well can we generalize in high dimension? | Zoom | Video |

5/14 11am EST | Yi Ma (UC Berkeley) | Deep (Convolution) Networks from First Principles | Zoom | Video |

5/28 11am EST | Robert Nowak (University of Wisconsin-Madison) | Banach Space Representer Theorems for Neural Networks | Zoom | Video |

### Fall 2020 Speakers

Date | Speaker | Title | Stream | Video |
---|---|---|---|---|

10/7 11.15am EST | Jelani Nelson (UC Berkeley) | Optimal bounds for approximate counting | Zoom | Video |

10/14 11.15am EST | Gabor Lugosi (Pompeu Fabra University) | On estimating the mean of a random vector | Zoom | Video |

10/21 11.15am EST | Elizaveta (Liza) Levina (University of Michigan) | Hierarchical community detection by recursive partitioning | Zoom | Video |

11/11 11.15am EST | Yuanzhi Li (CMU) | Backward Feature Correction: How can Deep Learning perform Deep Learning? | Zoom | Video |

11/25 11.15am EST | Constantinos Daskalakis (MIT) | Equilibrium Computation and the Foundations of Deep Learning | Zoom | Video |

12/2 11.15am EST | Yury Polyanskiy (MIT) | Self-regularizing Property of Nonparametric Maximum Likelihood Estimator in Mixture Models | Zoom | Video |

12/9 11.15am EST | Anna Seigal (University of Oxford) | Algebraic methods in statistics and data analysis | Zoom | Video |

12/16 11.15am EST | Amin Coja-Oghlan (Goethe University Frankfurt) | Optimal group testing | Zoom | Video |

### Past Speakers

### Abstracts

#### Jon Niles-Weed: Matrix Concentration for Products

We develop nonasymptotic concentration bounds for products of independent random matrices. Such products arise in the study of stochastic algorithms, linear dynamical systems, and random walks on groups. Our bounds exactly match those available for scalar random variables and continue the program, initiated by Ahlswede-Winter and Tropp, of extending familiar concentration bounds to the noncommutative setting. Our proof technique relies on geometric properties of the Schatten trace class. Joint work with D. Huang, J. A. Tropp, and R. Ward.

#### Weinan E: A Mathematical Perspective of Machine Learning

The heart of modern machine learning is the approximation of high dimensional functions. Traditional approaches, such as approximation by piecewise polynomials, wavelets, or other linear combinations of fixed basis functions, suffer from the curse of dimensionality. We will discuss representations and approximations that overcome this difficulty, as well as gradient flows that can be used to find the optimal approximation. We will see that at the continuous level, machine learning can be formulated as a series of reasonably nice variational and PDE-like problems. Modern machine learning models/algorithms, such as the random feature and shallow/deep neural network models, can be viewed as special discretizations of such continuous problems. At the theoretical level, we will present a framework that is suited for analyzing machine learning models and algorithms in high dimension, and present results that are free of the curse of dimensionality. Finally, we will discuss the fundamental reasons that are responsible for the success of modern machine learning, as well as the subtleties and mysteries that still remain to be understood

#### Philippe Rigollet: Statistical and Computational aspects of Wasserstein Barycenters

The notion of average is central to most statistical methods. In this talk we study a generalization of this notion over the non-Euclidean space of probability measures equipped with a certain Wasserstein distance. This generalization is often called Wasserstein Barycenters and empirical evidence suggests that these barycenters allow to capture interesting notions of averages in graphics, data assimilation and morphometrics. However the statistical (rates of convergence) and computational (efficient algorithms) for these Wasserstein barycenters are largely unexplored. The goal of this talk is to review two recent results: 1. Fast rates of convergence for empirical barycenters in general geodesic spaces, and, 2. Provable guarantees for gradient descent and stochastic gradient descent to compute Wasserstein barycenters. Both results leverage geometric aspects of optimal transport. Based on joint works (arXiv:1908.00828, arXiv:2001.01700) with Chewi, Le Gouic, Maunu, Paris, and Stromme.

#### David Gamarnik: Overlap Gap Property: a Provable Barrier to Fast Optimization in Probabilistic Combinatorial Structures

Many combinatorial optimization problems defined on random instances exhibit an apparent gap between the optimal values, which can be computed by non-constructive means, and the best values achievable by fast (polynomial time) algorithms. Through a combined effort of mathematicians, computer scientists and statistical physicists, it became apparent that a potential barrier for designing fast algorithms bridging this gap is an intricate topology of nearly optimal solutions, in particular the presence of the Overlap Gap Property (OGP), which we will introduce in this talk. We will discuss how for many such problems the onset of the OGP phase transition introduces indeed a provable barrier to a broad class of polynomial time algorithms. Examples of such problems include the problem of finding a largest independent set of a random graph, finding a largest cut in a random hypergrah, the problem of finding a ground state of a p-spin model, and also many problems in high-dimensional statistics field. In this talk we will demonstrate in particular why OGP is a barrier for three classes of algorithms designed to find a near ground state in p-spin models arising in the field of spin glass theory: Approximate Message Passing algorithms, algorithms based on low-degree polynomial and Langevin dynamics. Joint work with Aukosh Jagannath and Alex Wein

#### Sara van de Geer: Total variation Regularization

Let Y be a n-dimensional vector of independent observations with unknown mean f^0 := E Y. We consider the estimator f_D that solves the ``analysis problem” min_f {|| Y - f ||_2^2/ n + 2 lambda || D f ||_1 } , where D is a given , m times n matrix and lambda>0 is a tuning parameter. An example for the matrix D is the (first order) difference operator (Df )*i= f*{i}- f_{i-1} , \ i \in [2:n] in which case || Df ||_1 = TV(f) is the (first order) total variation of the vector f. Other examples include higher order discrete derivatives, total variation on graphs and total variation in higher dimensions. Our aim is to show that the estimator f_D is adaptive. For example, when f^0 is a piecewise linear function, we show that the analysis estimator f_D, with D the second order differences operator, adapts to the number of kinks of f^0. As is the case with the Lasso, the theory for the analysis estimator f_D requires a form of “restricted eigenvalue” condition. We will show that this can be established using interpolating vectors. We will illustrate this (with drawings) for the various examples.

#### Emmanuel Candes: Reliable predictions? Equitable treatment? Some recent progress in predictive inference

Recent progress in machine learning (ML) provides us with many potentially effective tools to learn from datasets of ever increasing sizes and make useful predictions. How do we know that these tools can be trusted in critical and high-sensitivity systems? If a learning algorithm predicts the GPA of a prospective college applicant, what guarantees do I have concerning the accuracy of this prediction? How do we know that it is not biased against certain groups of applicants? This talk introduces statistical ideas to ensure that the learned models satisfy some crucial properties, especially reliability and fairness (in the sense that the models need to apply to individuals in an equitable manner). To achieve these important objectives, we shall not ‘open up the black box’ and try understanding its underpinnings. Rather we discuss broad methodologies — conformal inference, quantile regression, the Jackknife+ — that can be wrapped around any black box as to produce results that can be trusted and are equitable.

#### Francis Bach: On the effectiveness of Richardson Extrapolation in Machine Learning

Richardson extrapolation is a classical technique from numerical analysis that can improve the approximation error of an estimation method by combining linearly several estimates obtained from different values of one of its hyperparameters, without the need to know in details the inner structure of the original estimation method. The main goal of this presentation is to study when Richardson extrapolation can be used within machine learning. We identify two situations where Richardson interpolation can be useful: (1) when the hyperparameter is the number of iterations of an existing iterative optimization algorithm, with applications to averaged gradient descent and Frank-Wolfe algorithms, and (2) when it is a regularization parameter, with applications to Nesterov smoothing techniques for minimizing non-smooth functions and ridge regression. In all these cases, I will show that extrapolation techniques come with no significant loss in performance, but with sometimes strong gains.

#### Lenka Zdeborova: Understanding machine learning via exactly solvable statistical physics models

The affinity between statistical physics and machine learning has a long history, this is reflected even in the machine learning terminology that is in part adopted from physics. I will describe the main lines of this long-lasting friendship in the context of current theoretical challenges and open questions about deep learning. Theoretical physics often proceeds in terms of solvable synthetic models, I will describe the related line of work on solvable models of simple feed-forward neural networks. I will highlight a path forward to capture the subtle interplay between the structure of the data, the architecture of the network, and the learning algorithm.

#### Ingrid Daubechies: Diffusion Methods in Manifold and Fibre Bundle Learning

Diffusion methods help understand and denoise data sets; when there is additional structure (as is often the case), one can use (and get additional benefit from) a fiber bundle model.

#### BLACK LIVES MATTER

The MaD+ seminar is postponed in support of the Strike for Black Lives. We encourage participants to use today to focus on anti-racist education and advocacy, to help combat anti-Blackness in academia and elsewhere.

#### Andrea Montanari: The generalization error of overparametrized models: Insights from exact asymptotics

In a canonical supervised learning setting, we are given n data samples, each comprising a feature vector and a label, or response variable. We are asked to learn a function f that can predict the the label associated to a new –unseen– feature vector. How is it possible that the model learnt from observed data generalizes to new points? Classical learning theory assumes that data points are drawn i.i.d. from a common distribution and argue that this phenomenon is a consequence of uniform convergence: the training error is close to its expectation uniformly over all models in a certain class. Modern deep learning systems appear to defy this viewpoint: they achieve training error that is significantly smaller than the test error, and yet generalize well to new data. I will present a sequence of high-dimensional examples in which this phenomenon can be understood in detail. [Based on joint work with Song Mei, Feng Ruan, Youngtak Sohn, Jun Yan]

#### Mahdi Soltanolkotabi: Learning via early stopping and untrained neural nets

Modern neural networks are typically trained in an over-parameterized regime where the parameters of the model far exceed the size of the training data. Such neural networks in principle have the capacity to (over)fit any set of labels including significantly corrupted ones. Despite this (over)fitting capacity, over-parameterized networks have an intriguing robustness capability: they are surprisingly robust to label noise when first order methods with early stopping are used to train them. Even more surprising, one can remove noise and corruption from a natural image without using any training data what-so-ever, by simply fitting (via gradient descent) a randomly initialized, over-parameterized convolutional generator to a single corrupted image. In this talk I will first present theoretical results aimed at explaining the robustness capability of neural networks when trained via early-stopped gradient descent. I will then present results towards demystifying untrained networks for image reconstruction/restoration tasks such as denoising and those arising in inverse problems such as compressive sensing.

#### Samory Kpotufe: Some Recent Insights on Transfer-Learning

A common situation in Machine Learning is one where training data is not fully representative of a target population due to bias in the sampling mechanism or high costs in sampling the target population; in such situations, we aim to ’transfer’ relevant information from the training data (a.k.a. source data) to the target application. How much information is in the source data? How much target data should we collect if any? These are all practical questions that depend crucially on ‘how far’ the source domain is from the target. However, how to properly measure ‘distance’ between source and target domains remains largely unclear.

In this talk we will argue that much of the traditional notions of ‘distance’ (e.g. KL-divergence, extensions of TV such as D_A discrepancy, density-ratios, Wasserstein distance) can yield an over-pessimistic picture of transferability. Instead, we show that some new notions of ‘relative dimension’ between source and target (which we simply term ‘transfer-exponents’) capture a continuum from easy to hard transfer. Transfer-exponents uncover a rich set of situations where transfer is possible even at fast rates, encode relative benefits of source and target samples, and have interesting implications for related problems such as multi-task or multi-source learning.

In particular, in the case of multi-source learning, we will discuss (if time permits) a strong dichotomy between minimax and adaptive rates: no adaptive procedure can achieve a rate better than single source rates, although minimax (oracle) procedures can.

The talk is based on earlier work with Guillaume Martinet, and ongoing work with Steve Hanneke.

#### Ahmed El Alaoui: Optimization of mean-field spin glass Hamiltonians

We consider the question of computing an approximate ground state configuration of an Ising (mixed) p-spin Hamiltonian H_N from a bounded number of gradient evaluations.

I will present an efficient algorithm which exploits the ultrametric structure of the superlevel sets of H_N in order to achieve an energy E_* characterized via an extended Parisi variational principle. This energy E_* is optimal when the model satisfies a `no overlap gap’ condition. At the heart of this algorithmic approach is a stochastic control problem, whose dual turns out to be the Parisi formula, thereby shedding new light on the nature of the latter.

This is joint work with Andrea Montanari and Mark Sellke.

#### Giulio Biroli: On the benefit of over-parametrization and the origin of double descent curves in artificial neural networks

Deep neural networks have triggered a revolution in machine learning, and more generally in computer science. Understanding their remarkable performance is a key scientific challenge with many open questions. For instance, practitioners find that using massively over-parameterised networks is beneficial to learning and generalization ability. This fact goes against standard theories, and defies intuition. In this talk I will address this issue. I will first contrast standard expectations based on variance-bias trade-off to the results of numerical experiments on deep neural networks, which display a “double-descent” behavior of the test error when increasing the number of parameters instead of the traditional U-curve. I will then discuss a theory of this phenomenon based on the solution of simplified models of deep neural networks by statistical physics methods.

#### Jelani Nelson: Optimal bounds for approximate counting

Counting up to N deterministically of course takes Theta(log N) bits. In the first ever streaming algorithm, Morris in 1978 gave a randomized algorithm that improved upon this bound exponentially. The best known analysis of his algorithm shows that it gives a 1+eps approximation to N with probability at least 1-delta using O(loglog N + log(1/eps) + log(1/delta)) bits with high probability (the space usage is itself a random variable). We show that a very slight (but necessary) tweak of his algorithm actually achieves the better bound O(loglog N + log(1/eps) + loglog(1/delta)) bits, and we also show a new matching lower bound, establishing optimality. Joint work with Huacheng Yu.

#### Gabor Lugosi: On estimating the mean of a random vector

One of the most basic problems in statistics is the estimation of the mean of a random vector, based on independent observations. This problem has received renewed attention in the last few years, both from statistical and computational points of view. In this talk we review some recent results on the statistical performance of mean estimators that allow heavy tails and adversarial contamination in the data. The basic punchline is that one can construct estimators that, under minimal conditions, achieve the same kind of performance as if the data was Gaussian. The material of this talk is based on a series of joint papers with Shahar Mendelson.

#### Elizaveta (Liza) Levina: Hierarchical community detection by recursive partitioning

Community detection in networks has been extensively studied in the form of finding a single partition into a “correct” number of communities. In large networks, however, a multi-scale hierarchy of communities is much more realistic. We show that a hierarchical tree of communities, obviously more interpretable, is also potentially more accurate and more computationally efficient. We construct this tree with a simple top-down recursive algorithm, at each step splitting the nodes into two communities with a non-iterative spectral algorithm, until a stopping rule suggests there are no more communities. The algorithm is model-free, extremely fast, and requires no tuning other than selecting a stopping rule. We propose a natural model for this setting, a binary tree stochastic block model, and prove that the algorithm correctly recovers the entire community tree under relatively mild assumptions. As a by-product, we obtain explicit and intuitive results for fitting the stochastic block model under model misspecification. We illustrate the algorithm on a statistics papers dataset constructing a highly interpretable tree of statistics research communities, and on a network based on gene co-occurrence in research papers on anemia.

Joint work with Tianxi Li, Lihua Lei, Sharmodeep Bhattacharyya, Koen van de Berge, Purnamrita Sarkar, and Peter Bickel.

#### Yuanzhi Li (CMU): Backward Feature Correction: How can Deep Learning perform Deep Learning?

How does a 110-layer ResNet learn a high-complexity classifier using relatively few training examples and short training time? We present a theory towards explaining this deep learning process in terms of hierarchical learning. We refer to hierarchical learning as the learner learns to represent a complicated target function by decomposing it into a sequence of simpler functions, to reduce sample and time complexity.

This work formally analyzes how multi-layer neural networks can perform such hierarchical learning efficiently and automatically simply by applying stochastic gradient descent (SGD) to the training objective, especially when other “shallow” models provably fail to learn the concept class efficiently due to the lack of hierarchy.

In particular, we establish a new principle called “backward feature correction” to show how the features in the lower-level layers in the network can also be improved via training together with higher-level layers, which we believe is the key to understand the deep learning process in multi-layer neural networks.

We also present empirical evidences supporting our theorem, in particular, we show “how much, how deep” the “backwards” (i.e. the improvement of lower level layers in a neural network due to the gradient from higher level layers) in a multi-layer neural network needs to be, and which part of the lower level features are getting improved through the “backwards”.

#### Constantinos Daskalakis (MIT): Equilibrium Computation and the Foundations of Deep Learning

Deep Learning has recently yielded important advances in single-agent learning challenges, much of that progress being fueled by the empirical success of gradient descent and its variants in computing local optima of non-convex optimization problems. In multi-agent learning applications, the role of single-objective optimization is played by equilibrium computation, yet our understanding of its complexity in settings that are relevant for Deep Learning remains sparse. In this talk we focus on min-max optimization of nonconvex-nonconcave objectives, which has found applications in GANs, and other adversarial learning problems. Here, not only are there no known gradient-descent based methods converging to even local and approximate min-max equilibria, but the computational complexity of identifying them remains poorly understood. We show that finding approximate local min-max equilibria of Lipschitz and smooth objectives requires a number of queries to the function and its gradient that is exponential in the relevant parameters, in sharp contrast to the polynomial number of queries required to find approximate local minima of non-convex objectives. Our oracle lower bound is a byproduct of a complexity-theoretic result showing that finding approximate local min-max equilibria is computationally equivalent to finding Brouwer fixed points, and Nash equilibria in non zero-sum games, and thus PPAD-complete. Minimal complexity theory knowledge will be assumed in the talk.

Joint work with Stratis Skoulakis and Manolis Zampetakis.

#### Yury Polyanskiy (MIT): Self-regularizing Property of Nonparametric Maximum Likelihood Estimator in Mixture Models

Introduced by Kiefer and Wolfowitz 1956, the nonparametric maximum likelihood estimator (NPMLE) is a widely used methodology for learning mixture models and empirical Bayes estimation. Sidestepping the non-convexity in mixture likelihood, the NPMLE estimates the mixing distribution by maximizing the total likelihood over the space of probability measures, which can be viewed as an extreme form of over parameterization.

In this work we discover a surprising property of the NPMLE solution. Consider, for example, a Gaussian mixture model on the real line with a subgaussian mixing distribution. Leveraging complex-analytic techniques, we show that with high probability the NPMLE based on a sample of size n has O(\log n) atoms (mass points), significantly improving the deterministic upper bound of n due to Lindsay (1983). Notably, any such Gaussian mixture is statistically indistinguishable from a finite one with O(\log n) components (and this is tight for certain mixtures). Thus, absent any explicit form of model selection, NPMLE automatically chooses the right model complexity, a property we term self-regularization. Extensions to other exponential families are given. As a statistical application, we show that this structural property can be harnessed to bootstrap existing Hellinger risk bound of the (parametric) MLE for finite Gaussian mixtures to the NPMLE for general Gaussian mixtures, recovering a result of Zhang (2009). Time permitting, we will discuss connections to approaching the optimal regret in empirical Bayes. This is based on joint work with Yihong Wu (Yale).

#### Anna Seigal (University of Oxford): Algebraic methods in statistics and data analysis

Algebraic structure is at the heart of many problems in statistics and data analysis. We aim to fit data to a model, or to approximate data by a point on some locus of interest. I will discuss how algebraic structure can be used to capture the existence and uniqueness of a solution to these problems, as well as to suggest suitable algorithms. I will first consider parameter estimation in statistical models via maximum likelihood estimation. We will see connections between maximum likelihood estimation and invariant theory. I will then discuss tensors, the higher dimensional analogues of matrices. The loci of tensors that are of interest in applications often define semi-algebraic sets, given by polynomial equations and inequalities. One example is the signature, tensors that can be used to encode a path of time series data. We will see the algebraic structure that relates a path to its signature.

#### Amin Coja-Oghlan (Goethe University Frankfurt): Optimal group testing

In group testing the aim is to identify a small set of infected individuals within a large population. At our disposal we have a test procedure that can be applied to a group of individuals, with the test returning a positive result if at least one individual in the group is infected. All tests are conducted in parallel. The aim is to devise a (possibly randomised) test design with as few tests as possible that identifies the infected individuals with high probability. We show that there occurs a sharp information-theoretic/algorithmic phase transition as the number of tests passes an explicit threshold. The talk is based on joint work with Oliver Gebhard, Max Hahn-Klimroth and Philipp Loick.

#### Galen Reeves (Duke University): Statistical Limits for the Matrix Tensor Product

High-dimensional models involving the products of large random matrices include the spiked matrix models appearing in principle component analysis and the stochastic block model appearing in network analysis. In this talk I will present recent theoretical work that provides an asymptotically exact characterization of the fundamental limits of inference for a broad class of these models. The first part of the talk will introduce the “matrix tensor product” model and describe some implications of the theory for community detection in correlated networks. The second part will highlight some of the ideas in the analysis, which builds upon ideas from information theory and statistical physics.

The material in this talk is appears in following paper:

Information-Theoretic Limits for the Matrix Tensor Product, Galen Reeves https://arxiv.org/abs/2005.11273

#### Lorin Crawford (Microsoft Research & Brown University): Statistical Frameworks for Mapping 3D Shape Variation onto Genotypic and Phenotypic Variation

The recent curation of large-scale databases with 3D surface scans of shapes has motivated the development of tools that better detect global-patterns in morphological variation. Studies which focus on identifying differences between shapes have been limited to simple pairwise comparisons and rely on pre-specified landmarks (that are often known). In this talk, we present SINATRA: a statistical pipeline for analyzing collections of shapes without requiring any correspondences. Our method takes in two classes of shapes and highlights the physical features that best describe the variation between them.

The SINATRA pipeline implements four key steps. First, SINATRA summarizes the geometry of 3D shapes (represented as triangular meshes) by a collection of vectors (or curves) that encode changes in their topology. Second, a nonlinear Gaussian process model, with the topological summaries as input, classifies the shapes. Third, an effect size analog and corresponding association metric is computed for each topological feature used in the classification model. These quantities provide evidence that a given topological feature is associated with a particular class. Fourth, the pipeline iteratively maps the topological features back onto the original shapes (in rank order according to their association measures) via a reconstruction algorithm. This highlights the physical (spatial) locations that best explain the variation between the two groups.

We use a rigorous simulation framework to assess our approach, which themselves are a novel contribution to 3D image analysis. Lastly, as a case study, we use SINATRA to analyze mandibular molars from four different suborders of primates and demonstrate its ability recover known morphometric variation across phylogenies.

#### Dr. Inbar Seroussi (Weizmann Institute of Science): How well can we generalize in high dimension?

Deep learning algorithms operate in regimes that defy classical learning theory. Neural networks architectures often contain more parameters than training samples. Despite their huge complexity, the generalization error achieved on real data is small. In this talk, we aim to study generalization properties of algorithms in high dimension. Interestingly, we show that algorithms in high dimension require a small bias for good generalization. We show that this is indeed the case for deep neural networks in the overparametrized regime. In addition, we provide lower bounds on the generalization error in various settings for any algorithm. We calculate such bounds using random matrix theory (RMT). We will review the connection between deep neural network and RMT and existing results. These bounds are particularly useful when the analytic evaluation of standard performance bounds is not possible due to the complexity and nonlinearity of the model. The bounds can serve as a benchmark for testing performance and optimizing the design of actual learning algorithms. (Joint work with Prof. Ofer Zeitouni)

#### Yi Ma (UC Berkeley): Deep (Convolution) Networks from First Principles

In this talk, we offer an entirely “white box’’ interpretation of deep (convolution) networks from the perspective of data compression (and group invariance). In particular, we show how modern deep layered architectures, linear (convolution) operators and nonlinear activations, and even all parameters can be derived from the principle of maximizing rate reduction (with group invariance). All layers, operators, and parameters of the network are explicitly constructed via forward propagation, instead of learned via back propagation. All components of so-obtained network, called ReduNet, have precise optimization, geometric, and statistical interpretation. There are also several nice surprises from this principled approach: it reveals a fundamental tradeoff between invariance and sparsity for class separability; it reveals a fundamental connection between deep networks and Fourier transform for group invariance – the computational advantage in the spectral domain (why spiking neurons?); this approach also clarifies the mathematical role of forward propagation (optimization) and backward propagation (variation). In particular, the so-obtained ReduNet is amenable to fine-tuning via both forward and backward (stochastic) propagation, both for optimizing the same objective.

This is joint work with students Yaodong Yu, Ryan Chan, Haozhi Qi of Berkeley, Dr. Chong You now at Google Research, and Professor John Wright of Columbia University.

#### Robert D. Nowak (University of Wisconsin-Madison): Banach Space Representer Theorems for Neural Networks

This talk presents a variational framework to understand the properties of functions learned by neural networks fit to data. The framework is based on total variation semi-norms defined in the Radon domain, which is naturally suited to the analysis of neural activation functions (ridge functions). Finding a function that fits a dataset while having a small semi-norm is posed as an infinite dimensional variational optimization. We derive a representer theorem showing that finite-width neural networks are solutions to the variational problem. The representer theorem is reminiscent of the classical reproducing kernel Hilbert space representer theorem, but we show that neural networks are solutions in a non-Hilbertian Banach space. While the learning problems are posed in an infinite dimensional function space, similar to kernel methods, they can be recast as finite-dimensional neural network training problems. These neural network training problems have regularizers which are related to the well-known weight decay and path-norm regularizers. Thus, the results provide new insight into functional characteristics of overparameterized neural networks and also into the design neural network regularizers. Our results also provide new theoretical support for a number of empirical findings in deep learning architectures including the benefits of “skip connections”, sparsity, and low-rank structures. This is joint work with Rahul Parhi.

Bio: Robert D. Nowak holds the Nosbusch Professorship in Engineering at the University of Wisconsin-Madison, where his research focuses on signal processing, machine learning, optimization, and statistics.