**Day 1: April 7, 2023**

**8:45-9:00**

Registration and Check-in

**9:00-9:45**

**Nati Srebro**

Learning by Overfitting: A Statistical Learning View of Benign Overfitting

Classical theory, conventional wisdom, and all textbooks tell us to avoid reaching zero training error and overfitting the noise, and instead to balance model fit and complexity. Yet recent empirical and theoretical results suggest that in many cases overfitting is benign, and even interpolating the training data can lead to good generalization. This has caused us to rethink the basic principles underlying statistical learning theory. In this talk I will discuss how much of our theory we can salvage and how much of it needs to be revised, focusing on the role of uniform convergence in understanding interpolation learning.

Based on joint work with Lijia Zhou, Fred Koehler, Danica Sutherland and Pragya Sur.
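For intuition, interpolation learning can be seen in a toy example (my own construction, not taken from the talk): with a spiked covariance, the minimum-ℓ2-norm interpolator in overparametrized linear regression fits the noisy labels exactly, yet still generalizes because the overfitted noise is absorbed by the many low-variance directions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 500                      # heavily overparametrized
X = rng.standard_normal((n, d))
X[:, :5] *= 10                      # a few high-variance "signal" directions
w_star = np.zeros(d); w_star[:5] = 1.0
y = X @ w_star + 0.5 * rng.standard_normal(n)   # noisy labels

# Minimum-l2-norm interpolator: fits the noisy training labels exactly ...
w_hat = np.linalg.pinv(X) @ y
assert np.allclose(X @ w_hat, y)    # zero training error

# ... yet still generalizes: the fitted noise spreads harmlessly over the
# many low-variance directions (the "benign" regime)
X_test = rng.standard_normal((1000, d)); X_test[:, :5] *= 10
test_mse = np.mean((X_test @ w_hat - X_test @ w_star) ** 2)
signal_var = np.mean((X_test @ w_star) ** 2)
print(test_mse / signal_var)        # small relative error
```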

**9:45-10:30**

**Pragya Sur**

A new central limit theorem for the augmented inverse probability weighting estimator in high dimensions

Estimating the average treatment effect (ATE) is a central problem in causal inference. Modern advances in the field have studied estimation and inference for the ATE in high dimensions through a variety of approaches. Doubly robust estimators such as the augmented inverse probability weighting (AIPW) estimator form a popular approach in this context. However, the high-dimensional literature surrounding these estimators relies on sparsity conditions, either on the outcome regression (OR) or the propensity score (PS) model. This talk will introduce a new central limit theorem for the classical AIPW estimator that is agnostic to such sparsity-type assumptions. Specifically, we will study properties of the cross-fit version of the estimator under well-specified OR and PS models, in the proportional asymptotics regime. Under assumptions on the covariate distribution, our CLT will uncover two crucial phenomena among others: (i) the cross-fit AIPW exhibits a substantial variance inflation that can be quantified in terms of the signal-to-noise ratio and other problem parameters; (ii) the asymptotic covariance between the estimators used while cross-fitting is non-negligible even on the root-n scale. These findings are strikingly different from their classical counterparts and open a vista of possibilities for studying other, similar high-dimensional effects.

On the technical front, our work utilizes a novel interplay between three distinct tools: approximate message passing theory, the theory of deterministic equivalents, and the leave-one-out approach. Time permitting, I will outline some of these techniques.
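For reference, the classical AIPW estimator the talk builds on has a simple closed form. The sketch below (my own toy simulation, plugging in the true nuisance functions rather than cross-fit estimates) illustrates it:

```python
import numpy as np

def aipw_ate(y, t, m1, m0, e):
    """Classical AIPW estimate of the ATE from fitted nuisances:
    m1, m0 = outcome-regression predictions E[Y | X, T=1/0], e = propensity."""
    return np.mean(m1 - m0
                   + t * (y - m1) / e
                   - (1 - t) * (y - m0) / (1 - e))

# Toy simulation with true ATE = 2.0, plugging in the *true* nuisance
# functions for illustration (in practice they are estimated, e.g. by
# cross-fitting as in the talk).
rng = np.random.default_rng(1)
n = 20_000
x = rng.standard_normal(n)
e = 1.0 / (1.0 + np.exp(-x))         # propensity score P(T=1 | X)
t = rng.binomial(1, e)
y = x + 2.0 * t + rng.standard_normal(n)
m1, m0 = x + 2.0, x                  # outcome regressions under T=1 / T=0
print(aipw_ate(y, t, m1, m0, e))     # close to 2.0
```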

**10:30-10:45**

Coffee Break

**10:45-11:30**

**Marina Meila**

Manifold Learning 2.0: Explanations and Eigenflows

Manifold learning algorithms can recover the underlying low-dimensional parametrization of high-dimensional point clouds. This talk will extend this paradigm in two directions.

First, we ask if it is possible, in the case of scientific data where quantitative prior knowledge is abundant, to explain a data manifold by new coordinates, chosen from a set of scientifically meaningful functions.

Second, I will show how some popular manifold learning tools and applications can be recreated in the space of vector fields and flows on a manifold. Central to this approach is the order-1 Laplacian of a manifold, $\Delta_1$, whose eigen-decomposition, known as the Helmholtz-Hodge decomposition, provides a basis for all vector fields on a manifold. We present a consistent estimator for $\Delta_1$, which enables visualization, analysis, smoothing and semi-supervised learning of vector fields on a manifold. In topological data analysis, we describe the first-order analogue of spectral clustering, which amounts to prime manifold decomposition. Furthermore, from this decomposition a new algorithm for finding shortest independent loops follows.

Joint work with Yu-Chia Chen, Weicheng Wu, Samson Koelle, Hanyu Zhang and Ioannis Kevrekidis.
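A minimal discrete illustration of the order-1 Laplacian (a toy graph rather than a sampled manifold): on a simplicial complex with no filled triangles, $\Delta_1 = B_1^\top B_1$ with $B_1$ the node-edge incidence matrix, and its null space consists of the harmonic flows, one per independent loop.

```python
import numpy as np

# Simplicial complex: a 4-cycle 0-1-2-3-0 with no filled triangles.
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]   # oriented tail -> head
n_nodes = 4

# Node-to-edge incidence matrix B1
B1 = np.zeros((n_nodes, len(edges)))
for j, (u, v) in enumerate(edges):
    B1[u, j], B1[v, j] = -1.0, 1.0

# With no 2-simplices, the Hodge 1-Laplacian reduces to Delta_1 = B1^T B1.
Delta1 = B1.T @ B1

eigvals, eigvecs = np.linalg.eigh(Delta1)
harmonic = eigvecs[:, np.isclose(eigvals, 0)]
print(harmonic.shape[1])   # one harmonic flow = one independent loop
```

The harmonic eigenvector here is the constant circulation around the loop, which is the discrete analogue of what the shortest-independent-loops algorithm searches for.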

**11:30-12:30**

Lightning Talks

See works marked with (*) in Friday’s poster session details.

**12:30-14:00**

Lunch Break

**14:00-15:00**

Coffee and Poster Session

Dynamic Pricing and Learning with Bayesian Persuasion*

Wei Tang (Columbia DSI)

Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data*

Minshuo Chen (Princeton ECE)

Minimax Optimal Estimation of Stability Under Distribution Shift*

Yuanzhe Ma (Columbia IEOR)

Learning from Similar Linear Representations: Adaptivity, Minimaxity, and Robustness*

Ye Tian (Columbia Statistics)

Diagnosing Model Performance Under Distribution Shift*

Tiffany Cai (Columbia Statistics)

Thompson Sampling itself is Differentially Private*

Tingting Ou (Columbia IEOR)

Transformers learn pair-wise–but not three-wise–functions*

Clayton Sanford (Columbia CS)

Tracking Significant Changes in Bandits*

Joe Suk (Columbia Statistics)

Algorithmic Barriers from Intricate Geometry in Random Computational Problems*

Eren C. Kızıldağ (Columbia Statistics)

High Probability and Risk-Averse Guarantees for Stochastic Saddle Point Problems

Yassine Laguel (Rutgers Management Science)

Beyond Lipschitz: Sharp Generalization and Excess Risk Bounds for Full-Batch GD

Konstantinos Nikolakakis (Yale EE)

Memory-Regret Trade-Off in Streaming Multi-Armed Bandits

Arpit Agarwal (Columbia DSI)

A Stochastic Subgradient Method for Distributionally Robust Non-Convex and Non-Smooth Learning

Landi Zhu (Rutgers Management Science)

Hedging against Complexity: Distributionally Robust Optimization with Parametric Approximation

Tianyu Wang (Columbia IEOR)

Beyond IID: Data-Driven Decision-Making in Heterogeneous Environments

Omar Mouchtaki (Columbia DRO)

Materials Representation Learning with Information-theory and Variational Autoencoders

Vahe Gharakhanyan (Columbia Applied Physics/Math)

Continuous LWE

Min Jae Song (NYU Courant)

**15:00-15:45**

**Mengdi Wang**

Policy Gradient: Estimation, Convergence and Beyond Cumulative Rewards

In offline RL, estimating the value or gradient of a policy often suffers from distribution shift. We will review the statistical theory of off-policy policy evaluation and consider the problem of off-policy policy gradient estimation. We present a Policy Gradient Bellman Equation and show how to leverage a Q-function approximator for policy gradient estimation via a double fitted iteration. We show that the double fitted iteration is equivalent to a model-based plug-in estimator. Further, the PG estimation error bound is determined by a restricted chi-square divergence that quantifies the interplay between distribution shift and function approximation. Next, we show how to extend the policy gradient method and its convergence analysis to RL beyond cumulative rewards. In particular, we consider objectives that are general concave utility functions of the state-action occupancy measure, containing as special cases maximal exploration, imitation, and safety-constrained RL. Such generality invalidates the Bellman equation and Sutton's Policy Gradient Theorem. We derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem. We also exploit the hidden convexity of the Markov decision process and prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, even though the optimization problem is nonconvex.
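For orientation, the cumulative-reward special case that the talk generalizes beyond can be sketched with vanilla REINFORCE (a generic score-function gradient sketch of mine, not the talk's double fitted iteration; the bandit means are made up):

```python
import numpy as np

# Minimal REINFORCE on a 2-armed bandit with a softmax policy.
rng = np.random.default_rng(0)
means = np.array([0.0, 1.0])      # hypothetical true arm rewards
theta = np.zeros(2)               # softmax policy parameters

for _ in range(2000):
    p = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, p=p)
    r = means[a] + 0.1 * rng.standard_normal()
    grad_logp = -p.copy(); grad_logp[a] += 1.0   # grad of log pi(a)
    theta += 0.1 * r * grad_logp                 # score-function gradient step

p = np.exp(theta) / np.exp(theta).sum()
print(p[1])   # probability of the better arm grows toward 1
```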

**15:45-16:30**

**Adam Klivans**

Testable Learning

Almost all provably efficient algorithms for learning basic function classes require a distributional assumption such as Gaussian marginals. These assumptions, however, require exponentially many samples to test. This motivated the model of Testable Learning due to Rubinfeld and Vasilyan where the goal is to replace hard-to-verify distributional assumptions with efficiently testable ones and to require that the learner succeed whenever the unknown distribution passes the corresponding test.

We give a powerful new approach for developing algorithms for testable learning using tools from moment matching and metric distances in probability. We obtain efficient testable learners for any concept class that admits low-degree sandwiching polynomials, capturing most important examples for which we have ordinary agnostic learners.

Surprisingly, we show that the information-theoretic sample complexity of testable learning is tightly characterized by the Rademacher complexity of the concept class, one of the most well-studied measures in statistical learning theory. In particular, uniform convergence is necessary and sufficient for testable learning. This leads to a fundamental separation from (ordinary) distribution-specific agnostic learning, where uniform convergence is sufficient but not necessary.
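As a point of reference for the quantity involved (a standard textbook example, not from the talk): for the unit-ball linear class, the supremum in the empirical Rademacher complexity has a closed form, $\hat{R}_n = \mathbb{E}_\sigma \|\frac{1}{n}\sum_i \sigma_i x_i\|_2$, which is bounded by $\sqrt{\sum_i \|x_i\|^2}/n$ and easy to estimate by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))

def empirical_rademacher(X, n_draws=2000, rng=rng):
    """Monte-Carlo estimate of the empirical Rademacher complexity of
    {x -> <w, x> : ||w||_2 <= 1} on the sample X."""
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # the sup over the unit ball is the norm of the sign-weighted average
    sups = np.linalg.norm(sigma @ X / n, axis=1)
    return sups.mean()

r = empirical_rademacher(X)
print(r)   # roughly sqrt(d / n) for isotropic data
```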

**16:30-16:45**

Coffee Break

**16:45-17:30**

**Amin Karbasi**

Statistical Limits of (Interactive) Learning

Consider the task of learning an unknown concept from a given concept class; to what extent does interacting with a domain expert accelerate the learning process? It turns out the answer is hidden in a better understanding of infinite two-player games and of who has a winning strategy. So, if you like game theory and statistical learning, then this talk is for you.

**Day 2: April 8, 2023**

**9:00-9:45**

**Vahab Mirrokni**

Privacy, Robustness, and Causal Inference in Online Advertising

We will discuss recent advances and technical problems in developing robust and privacy-aware online advertising. We will touch upon topics such as robust online optimization, privacy in causal inference, privacy-aware interest-based advertising, and anonymous learning for ads prediction systems. Time permitting, we will also discuss a number of causal inference problems.

**9:45-10:30**

**Stefanie Jegelka**

Some benefits of machine learning with invariances

In many applications, especially in the sciences, data and tasks have known invariances. Encoding such invariances directly into a machine learning model can improve learning outcomes, while it also poses challenges on efficient model design.

In the first part of the talk, we will focus on the invariances relevant to eigenvectors and eigenspaces being inputs to a neural network. Such inputs are important, for instance, for graph representation learning. We will discuss targeted architectures that can universally express functions with the relevant invariances – sign flips and changes of basis – and their theoretical and empirical benefits.

Second, we will take a broader, theoretical perspective. Empirically, it is known that encoding invariances into the machine learning model can reduce sample complexity. For the simplified setting of kernel ridge regression or random features, we will discuss new bounds that illustrate two ways in which invariances can reduce sample complexity. Our results hold for learning on manifolds and for invariances to (almost) any group action.
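The sample-complexity benefit of encoding invariances can be seen in a small kernel ridge regression experiment (my own construction, not the talk's setting): for a target invariant under the sign-flip group $\{x, -x\}$, averaging the kernel over the group orbit effectively doubles the sample, and the invariant kernel typically generalizes better from the same data.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss(A, B, h=1.0):
    """Gaussian kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h ** 2))

def fit_predict(K_train, K_test, y, lam=1e-2):
    """Kernel ridge regression predictions at test points."""
    return K_test @ np.linalg.solve(K_train + lam * np.eye(len(y)), y)

n, d = 40, 3
X, Xt = rng.standard_normal((n, d)), rng.standard_normal((500, d))
f = lambda Z: np.cos(Z[:, 0]) + Z[:, 1] ** 2   # even target: f(x) = f(-x)
y, yt = f(X), f(Xt)

# Plain kernel vs. group-averaged kernel k_G(x,z) = (k(x,z) + k(x,-z)) / 2
err_plain = np.mean((fit_predict(gauss(X, X), gauss(Xt, X), y) - yt) ** 2)
K_inv = (gauss(X, X) + gauss(X, -X)) / 2
Kt_inv = (gauss(Xt, X) + gauss(Xt, -X)) / 2
err_inv = np.mean((fit_predict(K_inv, Kt_inv, y) - yt) ** 2)
print(err_plain, err_inv)   # invariant kernel typically has lower test error
```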

**10:30-10:45**

Coffee Break

**10:45-11:30**

**Joan Bruna Estrach**

On Learning Sparse High-Dimensional Functions with Shallow Neural Networks

Neural networks are known for their distinctive ability to extract useful representations out of high-dimensional data via gradient descent. While the full mathematical picture is still full of mysteries and holes, I will discuss recent progress that exploits certain symmetries of the data distribution, e.g. rotational invariance, as well as structural assumptions about the target function, namely its dependency on a secret low-dimensional subspace.

Joint work with Alberto Bietti, Loucas Pillaud-Vivien, Min Jae Song and Clayton Sanford.

**11:30-12:30**

Lightning Talks

See works marked with (*) in Saturday’s poster session details.

**12:30-14:00**

Lunch Break

**14:00-15:00**

Coffee and Poster Session

Nonlinear Independent Component Analysis*

Alexander Schell (Columbia Statistics)

Alignment with human representations supports robust few-shot learning*

Ilia Sucholutsky (Princeton CS)

Accelerating Transformers via Kernel Density Estimation*

Insu Han (Yale EE)

Differentially Private Synthetic Control*

Saeyoung Rho (Columbia CS)

Compress Then Test: Powerful Kernel Testing in Near-linear Time*

Carles Domingo-Enrich (NYU Courant)

Smoothing the Landscape Boosts the Signal for SGD: Optimal Sample Complexity for Learning Single Index Models*

Alex Damian (Princeton Applied Math)

Intrinsic dimensionality and generalization properties of the R-norm inductive bias*

Navid Ardeshir (Columbia Statistics)

Identifying Causal Effects in Hierarchical Models*

Eli N. Weinstein (Columbia DSI)

On the Statistical Benefits of Temporal Difference Learning

David Cheikhi (Columbia DRO)

Neyman-Pearson Approach to Outlier Transfer Learning

Mohammadreza Mousavi Kalan (Columbia Statistics)

Data Banzhaf: A Robust Data Valuation Framework for Machine Learning

Jiachen T. Wang (Princeton ECE)

Enhanced Regret Bounds for Follow-The-Leader and Its Applications in Online Learning

Sudeep Raja (Columbia IEOR)

A closer look at the worst-case behavior of multi-armed bandit algorithms

Anand Kalvit (Columbia DRO)

Demystifying Disagreement-on-the-Line in High Dimensions

Behrad Moniri (Penn ESE)

Statistical Inference Under User-Level Privacy Constraints

Lekshmi Ramesh (Columbia Statistics)

The SSL Interplay: Augmentations, Inductive Bias, and Generalization

Vivien Cabannes (Meta FAIR Labs)

**15:00-15:45**

**Misha Belkin**

Feature learning in neural networks and kernel machines that recursively learn features

Neural networks have achieved impressive results on many technological and scientific tasks. Yet, their empirical successes have outpaced our fundamental understanding of their structure and function. By identifying mechanisms driving the successes of neural networks, we can provide principled approaches for improving neural network performance and develop simple and effective alternatives. In this work, we isolate the key mechanism driving feature learning in fully connected neural networks by connecting neural feature learning to the average gradient outer product. We subsequently leverage this mechanism to design *Recursive Feature Machines* (RFMs), which are kernel machines that learn features. We show that RFMs (1) accurately capture features learned by deep fully connected neural networks, (2) close the gap between kernel machines and fully connected networks, and (3) surpass a broad spectrum of models, including neural networks, on tabular data. Furthermore, we demonstrate that RFMs shed light on recently observed deep learning phenomena such as grokking, lottery tickets, simplicity biases, and spurious features.

Joint work with Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, Mikhail Belkin.
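The RFM recipe can be sketched in a few lines (my own simplification with made-up hyperparameters, not the authors' code): alternate kernel ridge regression with re-estimating a feature matrix $M$ as the average gradient outer product (AGOP) of the current fit. On a single-index target, $M$ concentrates on the relevant coordinate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, lam = 300, 10, 2.0, 1e-3
X = rng.standard_normal((n, d))
y = np.tanh(X[:, 0])                 # target depends on coordinate 0 only

M = np.eye(d)
for _ in range(3):
    # Gaussian kernel in the Mahalanobis metric induced by M
    D = X @ M @ X.T
    sq = np.diag(D)
    K = np.exp(-(sq[:, None] - 2 * D + sq[None, :]) / (2 * h ** 2))
    alpha = np.linalg.solve(K + lam * np.eye(n), y)   # kernel ridge fit
    # grad f(x_i) = sum_j alpha_j K_ij M (x_j - x_i) / h^2   (M symmetric)
    G = np.array([(alpha * K[i]) @ ((X - X[i]) @ M / h ** 2) for i in range(n)])
    M = G.T @ G / n                  # AGOP update
    M = d * M / np.trace(M)          # rescale the metric (a practical choice of mine)

# M should now concentrate on the relevant direction e_1
print(np.round(np.diag(M), 2))
```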

**15:45-16:30**

**Zongming Ma**

Matching and integration of datasets with low-rank signals and applications in single-cell data analysis

We study one-way matching of a pair of datasets with low-rank signals. Under a stylized model, we first derive information-theoretic limits of matching under a mismatch-proportion loss. We then show that linear assignment with projected data achieves fast rates of convergence, and sometimes even rate optimality, for this task. Building on this theoretical understanding, we propose two practical matching algorithms for single-cell data integration in two different scenarios and showcase their power on two real data examples.
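A schematic version of linear assignment with projected data (my own toy with made-up dimensions and noise level, not the paper's algorithm): project both datasets onto a shared low-rank subspace, then match rows by solving the assignment problem on pairwise distances.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n, p, r = 100, 50, 3
signal = rng.standard_normal((n, r)) @ rng.standard_normal((r, p))
perm = rng.permutation(n)
X = signal + 0.1 * rng.standard_normal((n, p))          # dataset 1
Y = signal[perm] + 0.1 * rng.standard_normal((n, p))    # dataset 2, rows shuffled

_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:r].T                        # shared rank-r projection from dataset 1
Xp, Yp = X @ P, Y @ P

# cost[i, j] = ||Xp_i - Yp_j||^2; linear assignment minimizes the total cost
cost = ((Xp[:, None, :] - Yp[None, :, :]) ** 2).sum(-1)
row, col = linear_sum_assignment(cost)
mismatch = np.mean(perm[col] != row)   # mismatch-proportion loss
print(mismatch)                        # 0.0 at this signal-to-noise ratio
```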

**16:30-16:45**

Coffee Break

**16:45-17:30**

**Dana Yang**

Learner-Private Convex Optimization

Convex optimization with feedback is a framework where a learner relies on iterative queries and feedback to arrive at the minimizer of a convex function. The paradigm has gained significant popularity recently thanks to its scalability in large-scale optimization and machine learning. The repeated interactions, however, expose the learner to privacy risks from eavesdropping adversaries that observe the submitted queries. In this work, we study how to optimally obfuscate the learner’s queries in convex optimization with first-order feedback, so that their learned optimal value is provably difficult to estimate for the eavesdropping adversary.