# the MIC Seminar

The Mathematics, Information and Computation (MIC) Seminar runs at irregular intervals and covers specific aspects at the interface of applied maths, information theory and theory of computation.

### Schedule Spring 19

Date Speaker Title Room
Feb 8, 2pm Ilias Zadik (MIT) Algorithms and Algorithmic Intractability in High Dimensional Linear Regression WWH 1314
Feb 11, 1pm Boris Hanin (Texas A&M) Complexity of Linear Regions in Deep Networks WWH 705
Apr 3, 2pm Cong Ma (Princeton) Noisy Matrix Completion: Understanding Statistical Guarantees for Convex Relaxation via Nonconvex Optimization WWH 1314
Apr 9, 1pm Jonathan Shi (Cornell) The “Method of Pseudo-Moments”: Tensor Decomposition for Mixture Models WWH 705

### Abstracts

#### Jonathan Shi: The “Method of Pseudo-Moments”: Tensor Decomposition for Mixture Models

Tensors are versatile mathematical objects, generalizing matrices to capture polynomial relationships between vectors rather than linear ones. In statistical learning especially, the moments of a distribution are naturally interpreted as tensors. This gives tensor techniques a central position in the method of moments—the estimation of hidden parameters of a distribution from observations of that distribution’s moments.

We describe a new class of algorithms for one such technique—tensor rank decomposition—based on the sum-of-squares meta-algorithm. In particular, these algorithms feature an improved provable level of robustness against error. Under this new robustness, statistical mixture models previously considered distinct—including independent component analysis and analytically sparse dictionary learning—may be unified as special cases of a more general mixture model. We also describe progress in designing fast spectral algorithms guided by sum-of-squares analyses, which may achieve nearly the same guarantees in the same settings.

Based on joint work with Sam Hopkins, Tengyu Ma, Tselil Schramm, David Steurer.

#### Cong Ma: Noisy Matrix Completion: Understanding Statistical Guarantees for Convex Relaxation via Nonconvex Optimization

This talk is concerned with noisy low-rank matrix completion: given partial and corrupted entries of a large low-rank matrix, the goal is to estimate the underlying matrix faithfully and efficiently. Arguably one of the most popular paradigms to tackle this problem is convex relaxation, which achieves remarkable efficacy in practice. However, the theoretical support of this approach is still far from optimal in the noisy setting, falling short of explaining the empirical success. We make progress towards demystifying the practical efficacy of convex relaxation vis-à-vis random noise. When the rank of the unknown matrix is a constant, we demonstrate that the convex programming approach achieves near-optimal estimation errors — in terms of the Euclidean loss, the entrywise loss, and the spectral norm loss — for a wide range of noise levels. All of this is enabled by bridging convex relaxation with the nonconvex Burer–Monteiro approach, a seemingly distinct algorithmic paradigm that is provably robust against noise. More specifically, we show that an approximate critical point of the nonconvex formulation serves as an extremely tight approximation of the convex solution, allowing us to transfer the desired statistical guarantees of the nonconvex approach to its convex counterpart.

#### Boris Hanin: Complexity of Linear Regions in Deep Networks

I will present several new results, joint with David Rolnick, about the number of linear regions and the sizes of the boundaries of linear regions in a network N with piecewise linear activations and random weights/biases.

I will discuss a new formula for the average complexity of linear regions that holds even for highly correlated weights and biases, and hence is valid throughout training. It shows, for example, that at initialization, the number of regions along any 1D line grows like the number of neurons in N. In particular, perhaps surprisingly, it is this number is not exponential in the depth of the network.

I will explain the analog of this result for higher input dimension and will report on a number of experiments, which demonstrate empirically that our precise theorems at initialization can be expected to hold qualitatively throughout training.

Bio: Boris Hanin is a mathematician work on deep learning and mathematical physics. Before joining the faculty in the Math Department at Texas A&M in 2017, he was an NSF Postdoc in Math at MIT. He is currently a Visiting Scientist at Facebook AI Research in NYC.

#### Ilias Zadik: Algorithms and Algorithmic Intractability in High Dimensional Linear Regression

In this talk we will focus on the high dimensional linear regression problem. The goal is to recover a hidden k-sparse binary vector \beta under n noisy linear observations Y=X\beta+W where X is an n \times p matrix with iid N(0,1) entries and W is an n-dimensional vector with iid N(0,\sigma^2) entries. In the literature of the problem, an apparent asymptotic gap is observed between the optimal sample size for information-theoretic recovery, call it n*, and for computationally efficient recovery, call it n_alg.

We will discuss several new contributions on studying this gap. We first identify tightly the information limit of the problem using a novel analysis of the Maximum Likelihood Estimator (MLE) performance. Furthermore, we establish that the algorithmic barrier n_alg coincides with the phase transition point for the appearance of a certain Overlap Gap Property (OGP) over the space of k-sparse binary vectors. The presence of such an Overlap Gap Property phase transition, which originates in spin glass theory, is known to provide evidence of an algorithmic hardness. Finally, we show that in the extreme case where the noise level is zero, i.e. \sigma=0, the computational-statistical gap closes by proposing an optimal polynomial-time algorithm using the Lenstra-Lenstra-Lov\‘asz lattice basis reduction algorithm.

This is joint work with David Gamarnik.

### Schedule Fall 18

During the Fall the MIC Seminar will usually be in room 312 Tuesdays 6-7p.

Date Speaker Title Room
Oct 16, 6p David Shirokoff (NJIT) Convex relaxations for variational problems arising from models of self-assembly WWH 312
Oct 23, 6p Didier Henrion (CNRS) Solving nonlinear PDEs with the Lasserre hierarchy WWH 312
Nov 2, 1:30p Anna Seigal (Berkeley) Structured Tensors and the Geometry of Data WWH 202
Nov 12, 12:15p Marylou Gabrie (ENS) Entropy and mutual information in models of deep neural networks CDS 650
Nov 13, 6:00p Aukosh Jagannath (Harvard) Algorithmic thresholds for tensor principle component analysis WWH 312

### Abstracts

#### Aukosh Jagannath: Algorithmic thresholds for tensor principle component analysis

Consider the problem of recovering a rank 1 tensor of order k that has been subject to Gaussian noise. The log-likelihood for this problem is highly non-convex. It is information theoretically possible to recover the tensor with a finite number of samples via maximum likelihood estimation, however, it is expected that one needs a polynomially diverging number of samples to efficiently recover it. What is the cause of this large statistical–to–algorithmic gap? To study this question, we investigate the thresholds for efficient recovery for a simple family of algorithms, Langevin dynamics and gradient descent. We view this problem as a member of a broader class of problems which correspond to recovering a signal from a non-linear observation that has been perturbed by isotropic Gaussian noise. We propose a mechanism for success/failure of recovery of such algorithms in terms of the strength of the signal on the high entropy region of the initialization. Joint work with G. Ben Arous (NYU) and R. Gheissari (NYU).

#### Marylou Gabrie: Entropy and mutual information in models of deep neural networks

The successes and the multitude of applications of deep learning methods have spurred efforts towards quantitative modeling of the performance of deep neural networks. In particular, an information-theoretic approach linking generalization capabilities to compression has been receiving increasing interest. Nevertheless, it is in practice computationally intractable to compute entropies and mutual informations in industry-sized neural networks. In this talk, we will consider instead a class of models of deep neural networks, for which an expression for these information-theoretic quantities can be derived from the replica method. We will examine how mutual informations between hidden and input variables can be reported along the training of such neural networks on synthetic datasets. This work was done in collaboration with Andre Manoel (Owkin), Clément Luneau (EPFL), Jean Barbier (EPFL), Nicolas Macris (EPFL), Florent Krzakala (LPS ENS) and Lenka Zdeborova (IPHT CEA).

#### Anna Seigal: Structured Tensors and the Geometry of Data

Abstract: Tensors are higher dimensional analogues of matrices; they are used to record data with multiple changing variables. Interpreting tensor data requires finding low rank structure, and the structure depends on the application or context. In this talk, we describe four projects in the study of structured tensors. Often tensors of interest define semi-algebraic sets, given by polynomial equations and inequalities. We give a characterization of the set of tensors of real rank two, and answer questions about statistical models using probability tensors and semi-algebraic statistics. We also study cubic surfaces as symmetric tensors, and describe work on learning a path from its three dimensional signature tensor. This talk is based on joint work with Guido Montúfar, Max Pfeffer, and Bernd Sturmfels.

#### David Shirokoff: Convex relaxations for variational problems arising from models of self-assembly

We examine the problem of minimizing a class of nonlocal, nonconvex variational problems that arise from modeling a large number of pairwise interacting particles in the presence of thermal noise (i.e. molecular dynamics). Although finding and verifying local minima to these functionals is relatively straightforward, computing and verifying global minima is much more difficult. Global minima (ground states) are important as they characterize the structure of matter in models of self-assembly. We discuss how minimizing the functionals can be viewed as testing whether an associated bilinear form is co-positive. We then develop sufficient conditions for global optimality (which in some cases are provably sharp) obtained through a convex relaxation related to the cone of co-positive functionals. The resulting convex relaxation results in a conic variational problem with an infinite number of Fourier constraints, and leads to a variety of computational challenges. Pending time, we provide details on matrix-free interior point algorithms that alleviate some of the computational difficulties (i.e. solutions may be Dirac masses) associated with the large-scale problems.

### Schedule Summer 18

Date & Time Speaker Title Room
May 30, 10:15am Yu Guang Wang (ICERM and The University of New South Wales, Sydney) Tight framelets and fast framelet filter bank transforms on manifolds WWH 317
June 6, 10:30am Roman Vershynin (UCI) From number theory to machine learning: a hunt for smooth Boolean functions WWH 317
June 13, 11:00am Aida Khajavirad (CMU and NYU) The multilinear polytope for acyclic Hypergraphs WWH 317
July 3, 10:30am Chiheon Kim (MIT) Statistical Limits of Graphical Channel Models WWH 317

### Abstracts

#### Yu Guang Wang: Tight framelets and fast framelet filter bank transforms on manifolds

Data in practical application with some structure can be viewed as sampled from a manifold, for instance, data on a graph and in astrophysics. A smooth and compact Riemannian manifold M, including examples of spheres, tori, cubes and graphs, is an important geometric struncture. In this work, we construct a type of tight framelets using quadrature rules on M to represent the data (or a function) and to exploit the derived framelets to process the data (for example, image and signal processing on the sphere or graphs).

One critical computation for framelets is to compute, from the framelet coefficients for the input data (which are assumed at the highest level), the framelet coefficients at lower levels, and also to evaluate the function values at new nodes using the framelet representation. We design an efficient computational strategy, which we call fast framelet filter bank transform (FMT), to compute the framelet coefficients and to recover the function. Assuming the fast Fourier transform (FFT) and using quadrature rules on the manifold M, the FMT has the same computational complexity as the FFT. Numerical examples illustrate the efficiency and accuracy of the algorithm for the framelets. This is joint work with Q. T. Le Gia, Ian Sloan and Rob Womersley (UNSW Sydney), Houying Zhu (University of Melbourne) and Xiaosheng Zhuang (City University of Hong Kong).

#### Roman Vershynin: From number theory to machine learning: a hunt for smooth Boolean functions

The most fundamental kind of functions studied in computer science are Boolean functions. They take n bits as an input and return one bit as an output. Most Boolean functions oscillate a lot, which is analogous to the fact that “most” continuous functions on R are nowhere differentiable. If we want to generate a “smooth” Boolean function, we can take the sign of some polynomial of low degree in n variables. Such functions are called polynomial threshold functions, and they are widely used in machine learning as classification devices. Surprisingly, we do not know how many polynomial threshold functions there are with a given degree! Even an approximate answer to this question has been known only for polynomials of degree 1, i.e. for linear functions. In a very recent joint work with Pierre Baldi, we found a way to approximately count polynomial threshold functions of any fixed degree. This solves a problem of M. Saks that goes back to 1993 and earlier. Our argument draws ideas from analytical number theory, additive combinatorics, enumerative combinatorics, probability and discrete geometry. I will describe some of these connections, focusing particularly on a beautiful interplay of zeta and Mobius funcitons in number theory, hyperplane arrangements in enumerative combinatorics and random tensors in probability theory.

#### Aida Khajavirad: The multilinear polytope for acyclic Hypergraphs

We consider the multilinear polytope defined as the convex hull of the set of binary points satisfying a collection of multilinear equations. Such sets are of fundamental importance in many types of mixed-integer nonlinear optimization problems, such as binary polynomial optimization. Utilizing an equivalent hypergraph representation, we study the facial structure of the multilinear polytope in conjunction with the acyclicity degree of the underlying hypergraph. We derive various types of facet-defining inequalities and provide explicit characterizations for the multilinear polytope of Berge-acylic and gamma-acyclic hypergraphs. As an important byproduct, we present a new class of cutting planes for constructing stronger polyhedral relaxations of mixed-integer nonlinear optimization problems with multilinear sub-expressions. Finally, we detail on the complexity of corresponding separation problems and embed the proposed cut generation algorithm at every node of the branch-and-reduce global solver BARON. Extensive computational results will be presented.

#### Chiheon Kim: Statistical Limits of Graphical Channel Models

We investigate the exact recovery problem in graphical channel models. Graphical channel model is defined as following: given a hypergraph H=(V,E) and a hidden labeling x of V, we observe mutually independent random values {y_e: e in E} where y_e is generated from a distribution which only depends on the labels {x_u: u in e}. This model encompasses many statistical models such as the stochastic block model, spiked Gaussian tensor model, and sparse graph codes. We consider the problem of exactly recovering the ground truth x from a sample y, and prove that under mild conditions on the channel, it exhibits a sharp phase transition behavior. Precisely, we find the explicit constant I (depending on the channel) such that the exact recovery is achievable w.h.p. if I > 1 and it is not achievable if I < 1. Joint work with Afonso S. Bandeira and Michel X. Goemans.

### Schedule Spring 18

Date & Time Speaker Title Room
Jan 31, 10:15am Mariano Tepper (Simons Foundation) Clustering is semidefinitely not that hard WWH 905
Feb 7, 10:15am Luca Venturi (NYU) Connectivity of Neural Networks Optimization Landscapes WWH 905
Feb 21, 10:15am Joao Pereira Estimation in multireference alignment and generalizations. WWH 905
Mar 7 11 am Levent Sagun Comparing Dynamics: Deep Neural Networks versus Glassy Systems WWH 905
Mar 27 1 pm Hugo Lavenant Harmonic mappings valued in the Wasserstein space CDS C15
Apr 30 11 am Jean-Luc Bouchot Compressed sensing Petrov-Galerkin: When data science helps solve problem in applied mathematics WWH 202

### Abstracts

#### Jean-Luc Bouchot: Compressed sensing Petrov-Galerkin: When data science helps solve problem in applied mathematics

Motivated by problems in uncertainty quantification, we introduce a scheme for the uniform approximation of high-dimensional parametric PDEs. Exploiting an analytic dependence of certain PDEs in the parameters, allows to show some convergence rates for non-linear approximation. Building on this remark, one computes (or has access to) independent snapshots of solutions for random parameters and use them in a weighted sparse recovery framework. This allows to approximate the solution map in a polynomial chaos in a number of snapshots that scales optimally (up to log factors) with the intrinsic sparsity of the solution. A further extension based on multi-level decomposition allows for efficient computation and can be shown to deliver uniform approximation (in the parameter space) in a computing time in the order of the approximation of a single solution.

#### Hugo Lavenant: Harmonic mappings valued in the Wasserstein space

The Wasserstein space, which is the space of probability measures endowed with the so-called (quadratic) Wasserstein distance coming from optimal transport, can formally be seen as a Riemannian manifold of infinite dimension. We propose, through a variational approach, a definition of harmonic mappings defined over a domain of R^n and valued in the Wasserstein space. We will show how one can build a fairly satisfying theory which captures some key features of harmonicity and how it is related to the concepts of geodesics (the so-called McCann’s interpolation) and barycenters in the Wasserstein space. Other than a better understanding of the Wasserstein space, the motivation of such a study can be found in geometric data analysis.

#### Joao Pereira: Estimation in multireference alignment and generalizations

In the multireference alignment model, a signal is observed by the action of a random circular translation and the addition of Gaussian noise. Of particular interest is the sample complexity, i.e., the number of observations/samples needed in terms of the signal-to-noise ratio (SNR), the signal energy divided by the noise variance, in order to drive the mean-square error to zero. Previous work showed that if the translations are drawn from the uniform distribution, then, in the low SNR regime, the sample complexity of the problem scales as $\omega(1/\SNR^3)$. We show, that if however the translation distribution is aperiodic, the sample complexity in the same regime drops down to $\omega(1/\SNR^2)$. This rate is achieved by a simple spectral algorithm. The lower bound follows from a generalization of the Chapman-Robbins bound for orbits and an expansion of the $\chi^2$ divergence at low SNR, and can be generalized for general group actions and projections. This suggests the method of moments is optimal in the low SNR regime.

Joint work with Emmanuel Abbe, Tamir Bendory, William Leeb, Nir Sharon and Amit Singer.

#### Luca Venturi: Connectivity of Neural Networks Optimization Landscapes

We study connectivity of sub-level sets of the square loss function of two-layers neural networks. This property implies abscence of poor local minima. In particular, we explore the hypothesis that overparametrisation convexifies the functional space of neural network architectures. I will start by extending the existing results about non-existence of bad local minima in the optimization of linear neural networks. We then move to study non-linear activations using Reproducing Kernel Hilbert Spaces, deriving general results that include Empirical Risk Minimization.
In the rest of the talk, I will focus on quadratic activations, and I will show how we can significatively improve the general RKHS bounds by exploiting the particular geometry of positive semidefinite matrices. I will conclude by discussing some directions of possible future exploration. Joint work with: A. Bandeira, J. Bruna.

#### Mariano Tepper: Clustering is semidefinitely not that hard

In recent years, semidefinite programs (SDP) have been the subject of interesting research in the field of clustering. In many cases, these convex programs deliver the same answers as non-convex alternatives and come with a guarantee of optimality. In this talk, I will argue that SDP-KM, a popular semidefinite relaxation of K-means, can learn manifolds present in the data, something not possible with the original K-means formulation. To build an intuitive understanding of SDP-KM’s manifold learning capabilities, I will present a theoretical analysis on an idealized dataset. Additionally, SDP-KM even segregates linearly non-separable manifolds. As generic SDP solvers are slow on large datasets, I will present a new, convex and yet efficient, solver for SDP-KM. Our results render SDP-KM a versatile, understandable, and powerful tool for manifold learning.

### Schedule Spring 17

Date & Time Speaker Title Room
Mar 21, 2:30pm Liza Rebrova (U Michigan) Local and global obstructions for the random matrix norm regularization CDS 650

### Abstracts

#### Liza Rebrova: Local and Global obstructions for the random matrix norm regularization

We study large n by n random matrices A with i.i.d. entries. If the distribution of the entries have mean zero and at least gaussian decay, then the operator norm ||A|| is at most of order sqrt(n) with high probability. However, for the distributions with heavier tails we cannot expect the same norm bound any more. So, we are motivated by the question: under what conditions operator norm of a heavy-tailed matrix can be improved by modifying just a small fraction of its entries (a small sub-matrix of A)? I will explain why this happens exactly when the entries of A have zero mean and bounded variance. I will also discuss the almost optimal dependence between the size of the removed sub-matrix and the resulting operator norm that we’ve obtained. This is a joint work with Roman Vershynin, inspired by the methods developed recently by Can Le and R. Vershynin and in our joint work with Konstantin Tikhomirov. Room: Center for Data Science, NYU, 60 5th ave, room 650 Time: 2:30pm-3:30pm.