# The MaD Seminar

The MaD seminar features leading specialists at the interface of Applied Mathematics, Statistics and Machine Learning.

**Room:** Auditorium Hall 150, Center for Data Science, NYU, 60 5th Ave.

**Time:** 2:00pm-3:00pm

**Subscribe to the Seminar Mailing list here**

### Schedule with Confirmed Speakers

### Abstracts

#### Jin-Peng Liu: Quantum for Science: Efficient Quantum Algorithms for Linear and Nonlinear Dynamics

Fault-tolerant quantum computers are expected to excel at simulating unitary dynamics, such as the evolution of a quantum state under a Hamiltonian. Most scientific and engineering computations, however, involve non-unitary and/or nonlinear dynamics. Efficient quantum algorithms for these more general tasks are therefore the key to unlocking the full potential of quantum computers and achieving comparable speedups.

First, we propose a simple method for simulating a general class of non-unitary dynamics as a linear combination of Hamiltonian simulation (LCHS) problems. The LCHS method can achieve optimal cost in terms of state preparation [1]. Second, we give the first efficient (polynomial time) quantum algorithm for nonlinear differential equations with sufficiently strong dissipation. This is an exponential improvement over the best previous quantum algorithms, whose complexity is exponential in the evolution time [2]. Our work shows that fault-tolerant quantum computing can potentially address complex non-unitary and nonlinear phenomena in natural and data sciences with provable efficiency [3].

References:

[1] Linear combination of Hamiltonian simulation for non-unitary dynamics with optimal state preparation cost. Physical Review Letters, 131(15):150603 (2023).

[2] Efficient quantum algorithm for dissipative nonlinear differential equations. Proceedings of the National Academy of Sciences 118, 35 (2021).

[3] Towards provably efficient quantum algorithms for large-scale machine learning models. Nature Communications 15, 434 (2024).
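For intuition, the identity underlying the LCHS method of [1] can be sketched as follows (shown here in its simplest, time-independent form; see [1] for the precise hypotheses and the time-dependent statement). Writing the generator as a Hermitian "dissipative" part plus a Hermitian "Hamiltonian" part, the non-unitary propagator becomes a weighted integral of unitaries:

```latex
% Split A into Hermitian parts: L (dissipative, assumed PSD) and H (Hamiltonian).
% A = L + iH, \quad L = \tfrac{A + A^\dagger}{2} \succeq 0, \quad H = \tfrac{A - A^\dagger}{2i}
% For t \ge 0, the non-unitary evolution is a Cauchy-weighted integral of unitaries:
e^{-At} \;=\; \int_{-\infty}^{\infty} \frac{1}{\pi(1+k^2)}\, e^{-i(kL+H)t}\, dk
% Each e^{-i(kL+H)t} is an ordinary Hamiltonian simulation, so truncating and
% discretizing the integral yields a linear combination of Hamiltonian simulations.
```

This is only a sketch of the structure of the result; consult [1] for the exact theorem and the cost analysis.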

#### Yaqi Duan: Taming “data-hungry” reinforcement learning? Stability in continuous state-action spaces

We introduce a novel framework for analyzing reinforcement learning (RL) in continuous state-action spaces, and use it to prove fast rates of convergence in both off-line and on-line settings. Our analysis highlights two key stability properties, relating to how changes in value functions and/or policies affect the Bellman operator and occupation measures. We argue that these properties are satisfied in many continuous state-action Markov decision processes, and demonstrate how they arise naturally when using linear function approximation methods. Our analysis offers fresh perspectives on the roles of pessimism and optimism in off-line and on-line RL, and highlights the connection between off-line RL and transfer learning.

#### Cun-Hui Zhang: Adaptive Inference in Sequential Experiments

Sequential data collection has emerged as a widely adopted technique for enhancing the efficiency of data gathering. Despite its advantages, such a data collection mechanism often complicates the statistical inference procedure. For instance, the ordinary least squares estimator in an adaptive linear regression model can exhibit non-normal asymptotic behavior, posing challenges for accurate inference and interpretation. We propose a general method for constructing a debiased estimator that remedies this issue. The idea is to make use of adaptive linear estimating equations. We establish theoretical guarantees of asymptotic normality, supplemented by discussions on achieving near-optimal asymptotic variance. This talk is based on joint work with Mufang Ying and Koulik Khamaru.

#### Zach Izzo: Data-driven Subgroup Identification

Medical studies frequently require extracting the relationship between each covariate and the outcome with statistical confidence measures. To do this, simple parametric models (e.g., the coefficients of a linear regression) are frequently used, but they are usually fitted on the whole dataset. However, the covariates may not have a uniform effect over the whole population, so a single simple model can miss heterogeneous signal. For example, a linear model may explain a subset of the data but fail on the rest due to nonlinearity and heterogeneity in the data. In this talk, I will discuss DDGroup (data-driven subgroup discovery), a data-driven method to effectively identify subgroups in the data with a uniform linear relationship between the features and the label. DDGroup outputs an interpretable region in which the linear model is expected to hold. It is simple to implement and computationally tractable. It also comes with statistical guarantees: given a large enough sample, DDGroup recovers a region where a single low-variance linear model is well-specified (if one exists), and experiments on real-world medical datasets confirm that it can discover regions where a local linear model has improved performance. Our experiments also show that DDGroup can uncover subgroups with qualitatively different relationships that are missed by simply applying parametric approaches to the whole dataset. Time permitting, I will also discuss the challenges of extending DDGroup to other models.

#### Gabriel Peyré: Conservation Laws for Gradient Flows

Understanding the geometric properties of gradient descent dynamics is a key ingredient in deciphering the recent success of very large machine learning models. A striking observation is that trained over-parameterized models retain some properties of their initialization. This “implicit bias” is believed to be responsible for some favorable properties of the trained models and could explain their good generalization. In this talk I will first rigorously present the definition and basic properties of “conservation laws”, which are maximal sets of independent quantities conserved during gradient flows of a given model (e.g., a ReLU network with a given architecture) for any training data and any loss. Then I will explain how to find the exact number of these quantities by performing finite-dimensional algebraic manipulations on the Lie algebra generated by the Jacobian of the model. In the specific case of linear and ReLU networks, this procedure recovers the conservation laws known in the literature and proves that there are no others.
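As a toy illustration of a conservation law (a standard example, not taken from the talk), consider the two-parameter model $f(u,v;x) = uvx$. For any dataset and any loss, gradient flow conserves the "balancedness" quantity $u^2 - v^2$, which is conserved independently of the training data and loss, exactly as in the definition above:

```latex
% Model f(u,v;x) = uvx, empirical loss E(u,v) = \sum_i \ell(uv\,x_i,\, y_i).
% Gradient flow: \dot u = -\partial_u E = -\sum_i \ell'_i\, v\, x_i,
%                \dot v = -\partial_v E = -\sum_i \ell'_i\, u\, x_i.
\frac{d}{dt}\bigl(u^2 - v^2\bigr)
  \;=\; 2u\dot u - 2v\dot v
  \;=\; -2uv\sum_i \ell'_i x_i \;+\; 2uv\sum_i \ell'_i x_i
  \;=\; 0
% So u^2 - v^2 is conserved along the flow for any data (x_i, y_i) and any loss \ell.
```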

#### Tristan Buckmaster: Singularities in fluids: Self-similar analysis, computer-assisted proofs and neural networks

In this presentation, I will provide an overview of how techniques involving self-similar analysis, computer-assisted proofs, and neural networks can be employed to investigate singularity formation in fluids.

#### Matus Telgarsky: Feature learning and margin maximization via mirror descent

This whiteboard talk will motivate and describe the margin maximization problem in neural networks, and show how it can be solved via a simple refinement of the standard mirror descent analysis. In detail, the first part of the talk will explain how margin maximization is an approach to establishing feature learning; as an example, it can achieve sample complexities better than any kernel (e.g., the NTK). The talk will then show how standard gradient descent can be analyzed as an implicit mirror descent, which leads to margin maximization. Time permitting, the talk will also cover other consequences of this refined mirror descent, for instance in the study of arbitrarily large step sizes for logistic regression.

The main content is joint work with Danny Son, and will appear on arXiv in May. The “time permitting” part is joint work with Jingfeng Wu, Peter Bartlett, and Bin Yu, and can be found at https://arxiv.org/abs/2402.15926.
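For background, the simplest instance of implicit margin maximization is the classical linear case (stated here as context, not as the talk's contribution): gradient descent on the logistic loss over linearly separable data diverges in norm but converges in direction to the maximum-margin separator:

```latex
% Logistic loss on linearly separable data with labels y_i \in \{\pm 1\}:
% R(w) = \sum_i \log\bigl(1 + e^{-y_i \langle w, x_i\rangle}\bigr)
% Gradient descent drives \|w_t\| \to \infty, but the direction converges:
\lim_{t\to\infty} \frac{w_t}{\|w_t\|} \;=\; \bar w,
\qquad
\bar w \;=\; \arg\max_{\|w\|_2 \le 1}\; \min_i\; y_i \langle w, x_i \rangle
% i.e. the \ell_2 max-margin direction; the talk's mirror descent analysis
% extends this phenomenon beyond the linear setting.
```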

#### Zhuoran Yang: Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality

In this talk, we will examine the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression. We establish the global convergence of gradient flow under suitable choices of initialization. In addition, we prove that an interesting “task allocation” phenomenon emerges during the gradient flow dynamics, where each attention head focuses on solving a single task of the multi-task model. Specifically, we prove that the gradient flow dynamics can be split into three phases: a warm-up phase where the loss decreases rather slowly and the attention heads gradually build up their inclination towards individual tasks, an emergence phase where each head selects a single task and the loss rapidly decreases, and a convergence phase where the attention parameters converge to a limit. Furthermore, we prove the optimality of gradient flow in the sense that the limiting model learned by gradient flow is on par with the best possible multi-head softmax attention model up to a constant factor. Our analysis also delineates a strict separation in terms of the prediction accuracy of ICL between single-head and multi-head attention models. The key technique in our convergence analysis is to map the gradient flow dynamics in the parameter space to a set of ordinary differential equations in the spectral domain, where the relative magnitudes of the semi-singular values of the attention weights determine task allocation. To the best of our knowledge, our work provides the first convergence result for the multi-head softmax attention model. This is joint work with Siyu Chen, Heejune Sheen, and Tianhao Wang.

#### Guy Bresler: On The Fourier Coefficients of High-Dimensional Random Geometric Graphs

The random geometric graph RGG(n,d,p) on the d-dimensional sphere is formed by sampling n i.i.d. vectors uniformly and placing an edge between pairs of vertices i and j closer than some threshold chosen so that the marginal edge probability is p. We study the low-degree Fourier coefficients of this distribution and prove sharp bounds on them. Our main conceptual contribution is a novel probabilistic representation for the RGG: we prove that the dependence among edges can be localized to few fragile edges and then show how randomness in the remaining edges acts as a noise operator. The results have implications both for hypothesis testing and for computational-statistical gaps. Joint work with Kiril Bangachev. https://arxiv.org/abs/2402.12589