# The MaD Seminar

The MaD seminar features leading specialists at the interface of Applied Mathematics, Statistics and Machine Learning. It is partly supported by the Moore-Sloan Data Science Environment at NYU.

MaD seminars are recorded and streamed live. Links to the videos are available below.

**Room:** Auditorium Hall 150, Center for Data Science, NYU, 60 5th Ave.

**Time:** 2:00pm-3:00pm. A reception will follow.

**Subscribe to the Seminar Mailing list here**

### Schedule with Confirmed Speakers

Date | Speaker | Title | Live Stream |
---|---|---|---|
Jan 23 | Paromita Dubey (UC Davis) | Fréchet Change Point Detection | |
Jan 30 | Yaniv Romano (Stanford) | Reliability, Equity, and Reproducibility in Modern Machine Learning | video |
Feb 6 | Kaizheng Wang (Princeton) | Latent variable models: spectral methods and non-convex optimization | |
Feb 13 | Laure Zanna (NYU) | Blending machine learning and physics to improve climate modeling | |
Feb 20 | Yash Deshpande (MIT) | | |
Feb 27 | Becca Willett (UChicago) | | video |
Mar 5 | Stefanie Jegelka (MIT) | | video |
Mar 12 | Samory Kpotufe (Columbia) | | video |
Mar 19 | (spring break) | | |
Mar 26 | Weijie Su (UPenn) | | video |
Apr 2 | Flori Bunea (Cornell) | | video |
Apr 9 | Yurii Nesterov (UCLouvain) | | video |
Apr 23 | Jiaming Xu (Duke) | | video |
Apr 30 | Sham Kakade (UW) | | video |

### Abstracts

#### Laure Zanna: Blending machine learning and physics to improve climate modeling

Numerical simulations used for weather and climate predictions solve approximations of the governing laws of fluid motions. The computational cost of these simulations limits the accuracy of the predictions. Uncertainties in the simulations and predictions ultimately originate from the poor or lacking representation of processes, such as turbulence, that are not resolved on the numerical grid of global climate models. I will show that machine learning algorithms with imposed physical constraints are good candidates to improve the representation of processes that occur below the scales resolved by global models. Specifically, I will propose new representations of ocean turbulence derived using relevance vector machines and convolutional neural networks trained on data from high-resolution idealized simulations. The new models of turbulent processes are interpretable and/or encapsulate physics, and lead to improved simulations of the ocean. Our results simultaneously open the door to the discovery of new physics from data and the improvement of numerical simulations of oceanic and atmospheric flows.

#### Kaizheng Wang: Latent variable models: spectral methods and non-convex optimization

Latent variable models lay the statistical foundation for data science problems with unstructured, incomplete and heterogeneous information. For the sake of computational efficiency, heuristic algorithms are proposed to extract the latent low-dimensional structures for downstream tasks. Despite their huge success in practice, theoretical understanding is lagging far behind and that hinders further advancement. In this talk, I will first show an L_p theory of eigenvector analysis that yields optimal recovery guarantees for spectral methods in many challenging problems. Then I will present a general framework for clustering based on non-convex optimization, and study its theoretical guarantees under statistical models. The results find applications in dimensionality reduction, mixture models, network analysis, recommendation systems, ranking and beyond.

#### Yaniv Romano: Reliability, Equity, and Reproducibility in Modern Machine Learning

Modern machine learning algorithms have achieved remarkable performance in a myriad of applications, and are increasingly used to make impactful decisions in the hiring process, criminal sentencing, healthcare diagnostics and even to make new scientific discoveries. The use of data-driven algorithms in high-stakes applications is exciting yet alarming: these methods are extremely complex, often brittle, and notoriously hard to analyze and interpret. Naturally, concerns have been raised about the reliability, fairness, and reproducibility of the output of such algorithms. This talk introduces statistical tools that can be wrapped around any “black-box” algorithm to provide valid inferential results while taking advantage of its impressive performance. We present novel developments in conformal prediction and quantile regression, which rigorously guarantee the reliability of complex predictive models, and show how these methodologies can be used to treat individuals equitably. Next, we focus on reproducibility and introduce an operational selective inference tool that builds upon the knockoff framework and leverages recent progress in deep generative models. This methodology allows for reliable identification of a subset of important features that is likely to explain a phenomenon under study in a challenging setting where the data distribution is unknown, e.g., mutations that are truly linked to changes in drug resistance.
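For readers unfamiliar with the idea of wrapping a black-box predictor, the following is a minimal sketch of textbook split conformal prediction. It is not the speaker's code or the specific methods of the talk. The function names, the polynomial "black box", and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def split_conformal_interval(cal_residuals, new_predictions, alpha=0.1):
    """Symmetric prediction intervals from held-out calibration residuals.

    Hypothetical helper: works with any predictor, since it only needs the
    residuals on a calibration set that the predictor was not fitted on.
    """
    scores = np.sort(np.abs(np.asarray(cal_residuals, dtype=float)))
    n = scores.size
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        raise ValueError("calibration set too small for the requested alpha")
    q = scores[k - 1]
    new_predictions = np.asarray(new_predictions, dtype=float)
    return new_predictions - q, new_predictions + q

def demo_coverage(seed=0, alpha=0.1):
    """Empirical test-set coverage on synthetic data (illustrative only)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-2, 2, size=2000)
    y = np.sin(2 * x) + 0.3 * rng.standard_normal(2000)
    # Disjoint index sets: 500 for fitting, 500 for calibration, 1000 for testing.
    train, cal, test = np.split(rng.permutation(2000), [500, 1000])
    # Stand-in "black box": a degree-5 polynomial least-squares fit.
    coefs = np.polyfit(x[train], y[train], deg=5)
    predict = lambda t: np.polyval(coefs, t)
    lo, hi = split_conformal_interval(y[cal] - predict(x[cal]), predict(x[test]), alpha)
    return np.mean((y[test] >= lo) & (y[test] <= hi))

if __name__ == "__main__":
    print(f"empirical coverage: {demo_coverage():.3f}")
```

With exchangeable data, intervals built this way cover a new response with probability at least 1 - alpha on average, whatever the fitted model is. The intervals here have constant width. Combining the same calibration step with quantile regression, as the abstract mentions, is one way to let the width adapt to the input.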

#### Paromita Dubey: Fréchet Change Point Detection

Change point detection is a popular tool for identifying locations in a data sequence where an abrupt change occurs in the data distribution, and it has been widely studied for Euclidean data. Modern data are very often non-Euclidean, for example distribution-valued data or network data. Change point detection is a challenging problem when the underlying data space is a metric space where one does not have basic algebraic operations like addition of the data points and scalar multiplication.

In this talk, I propose a method to infer the presence and location of change points in the distribution of a sequence of independent data taking values in a general metric space. Change points are viewed as locations at which the distribution of the data sequence changes abruptly in terms of either its Fréchet mean or Fréchet variance or both. The proposed method is based on comparisons of Fréchet variances before and after putative change point locations. First, I will establish that under the null hypothesis of no change point the limit distribution of the proposed scan function is the square of a standardized Brownian bridge.

It is well known that such convergence is rather slow in moderate to high dimensions. For more accurate results in finite sample applications, I will provide a theoretically justified bootstrap-based scheme for testing the presence of change points. Next, I will show that when a change point exists, (1) the proposed test is consistent under contiguous alternatives and (2) the estimated location of the change point is consistent. All of the above results hold for a broad class of metric spaces under mild entropy conditions. Examples include the space of univariate probability distributions and the space of graph Laplacians for networks. I will illustrate the efficacy of the proposed approach in empirical studies and in real data applications with sequences of maternal fertility distributions. Finally, I will talk about some future extensions and other related research directions, for instance, when one has samples of dynamic metric space data. This talk is based on joint work with Prof. Hans-Georg Müller.
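To make the idea of comparing Fréchet variances before and after a candidate location concrete, here is a simplified sketch for sequences of univariate distributions under the 2-Wasserstein metric. It is an assumption-laden illustration, not the scan function or the test from the talk. It only locates the split that most reduces the within-segment Fréchet variance, and the function names and simulated data are hypothetical.

```python
import numpy as np

def frechet_variance(quantiles):
    """Fréchet variance of distributions given as rows of quantile values.

    Under the 2-Wasserstein metric on univariate distributions, the Fréchet
    mean is the pointwise average of the quantile functions, and the squared
    distance is the mean squared difference of quantiles over the grid.
    """
    mean_q = quantiles.mean(axis=0)
    return np.mean(np.mean((quantiles - mean_q) ** 2, axis=1))

def scan_change_point(quantiles, min_segment=10):
    """Return the split index with the largest drop in Fréchet variance.

    Simplified criterion (an assumption): pooled variance minus the
    size-weighted average of the two segment variances.
    """
    n = quantiles.shape[0]
    pooled = frechet_variance(quantiles)
    best_t, best_gain = None, -np.inf
    for t in range(min_segment, n - min_segment + 1):
        within = (t * frechet_variance(quantiles[:t])
                  + (n - t) * frechet_variance(quantiles[t:])) / n
        gain = pooled - within
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

def simulate_quantiles(seed=0, n=100, change_at=60, shift=1.5):
    """Simulated sequence: each observation is a distribution, summarized by
    empirical quantiles of 200 draws; the mean shifts at `change_at`."""
    rng = np.random.default_rng(seed)
    means = np.where(np.arange(n) < change_at, 0.0, shift)
    draws = rng.normal(loc=means[:, None], scale=1.0, size=(n, 200))
    probs = np.linspace(0.05, 0.95, 19)
    return np.quantile(draws, probs, axis=1).T  # shape (n, len(probs))

if __name__ == "__main__":
    t_hat, gain = scan_change_point(simulate_quantiles())
    print(f"estimated change point: {t_hat}, variance reduction: {gain:.3f}")
```

This sketch only estimates a location. Deciding whether a change point is present at all would require a calibrated test, such as the limit distribution or bootstrap scheme described in the abstract.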