- This event has passed.

## April 7, 2023, 8:00 am - April 8, 2023, 5:00 pm EDT

The **Statistical Machine Learning Symposium, **organized by the Department of Statistics with support from the Data Science Institute (DSI), is a two-day workshop that will be held **April 7-8, 2023** at **Columbia University, School of Social Work, ****1255 Amsterdam Ave, Room 311-312**. It will consist of lectures from invited speakers in academia and industry, as well as lightning talks and poster presentations from junior researchers (graduate students and postdocs from the New York area institutions). See here for the **schedule**.

Due to room capacity, the registration is now **closed**.

Organizing committee:

Daniel Hsu, Samory Kpotufe, Arian Maleki, Cindy Rush, Eren Kizildag, Belinda Tzen.

*Invited Speakers*

__Misha Belkin__ (UCSD)

**Title:** Feature learning in neural networks and kernel machines that recursively learn features

*View Abstract*

Neural networks have achieved impressive results on many technological and scientific tasks. Yet, their empirical successes have outpaced our fundamental understanding of their structure and function. By identifying mechanisms driving the successes of neural networks, we can provide principled approaches for improving neural network performance and develop simple and effective alternatives. In this work, we isolate the key mechanism driving feature learning in fully connected neural networks by connecting neural feature learning to the average gradient outer product. We subsequently leverage this mechanism to design \textit{Recursive Feature Machines} (RFMs), which are kernel machines that learn features. We show that RFMs (1) accurately capture features learned by deep fully connected neural networks, (2) close the gap between kernel machines and fully connected networks, and (3) surpass a broad spectrum of models including neural networks on tabular data. Furthermore, we demonstrate that RFMs shed light on recently observed deep learning phenomena such as grokking, lottery tickets, simplicity biases, and spurious features.

Joint work with Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, Mikhail Belkin.

______________________

__Joan Bruna Estrach__ (NYU)

**Title:** On Learning sparse high-dimensional functions with shallow neural networks.

*View Abstract*

Neural networks are known for their distinctive ability to extract useful representations out of high-dimensional data via gradient-descent. While the full mathematical picture is still full of mysteries and holes, I will discuss recent progress that exploits certain symmetries of the data distribution, e.g. rotational invariance,

as well as structural assumptions about the target function, namely its dependency on a secret low-dimensional subspace.

Joint work with Alberto Bietti, Loucas Pillaud-Vivien, Min Jae Song and Clayton Sanford.

______________________

__Stefanie Jegelka__ (MIT)

**Title:** Some benefits of machine learning with invariances

*View Abstract*

In many applications, especially in the sciences, data and tasks have known invariances. Encoding such invariances directly into a machine learning model can improve learning outcomes, while it also poses challenges on efficient model design.

In the first part of the talk, we will focus on the invariances relevant to eigenvectors and eigenspaces being inputs to a neural network. Such inputs are important, for instance, for graph representation learning. We will discuss targeted architectures that can universally express functions with the relevant invariances – sign flips and changes of basis – and their theoretical and empirical benefits.

Second, we will take a broader, theoretical perspective. Empirically, it is known that encoding invariances into the machine learning model can reduce sample complexity. For the simplified setting of kernel ridge regression or random features, we will discuss new bounds that illustrate two ways in which invariances can reduce sample complexity. Our results hold for learning on manifolds and for invariances to (almost) any group action.

______________________

__Amin Karbasi__ (Yale)

**Title:** Statistical Limits of (Interactive) Learning

*View Abstract*

Consider the task of learning an unknown concept from a given concept class; to what extent does interacting with a domain expert accelerate the learning process? It turns out the answer is hidden in a better understanding of the infinite two-player games and who has a winning strategy. So, if you like game theory and statistical learning, then this talk is for you.

**Speaker bio:** Amin Karbasi is currently an associate professor of Electrical Engineering, Computer Science, and Statistics & Data Science at Yale University. He is also a staff scientist at Google NY. He has been the recipient of the National Science Foundation (NSF) Career Award, Office of Naval Research (ONR) Young Investigator Award, Air Force Office of Scientific Research (AFOSR) Young Investigator Award, DARPA Young Faculty Award, National Academy of Engineering Grainger Award, Amazon Research Award, Nokia Bell-Labs Award, Google Faculty Research Award, Microsoft Azure Research Award, Simons Research Fellowship, and ETH Research Fellowship. His work has also been recognized with a number of paper awards, including Graphs in Biomedical Image Analysis (GRAIL), Medical Image Computing and Computer Assisted Interventions Conference (MICCAI), International Conference on Artificial Intelligence and Statistics (AISTATS), IEEE ComSoc Data Storage, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), ACM SIGMETRICS, and IEEE International Symposium on Information Theory (ISIT). His Ph.D. thesis received the Patrick Denantes Memorial Prize from the School of Computer and Communication Sciences at EPFL, Switzerland.

______________________

__Adam Klivans__ (UT Austin)

**Title:** Testable Learning

*View Abstract*

Almost all provably efficient algorithms for learning basic function classes require a distributional assumption such as Gaussian marginals. These assumptions, however, require exponentially many samples to test. This motivated the model of Testable Learning due to Rubinfeld and Vasilyan where the goal is to replace hard-to-verify distributional assumptions with efficiently testable ones and to require that the learner succeed whenever the unknown distribution passes the corresponding test.

We give a powerful new approach for developing algorithms for testable learning using tools from moment matching and metric distances in probability. We obtain efficient testable learners for any concept class that admits low-degree sandwiching polynomials, capturing most important examples for which we have ordinary agnostic learners.

Surprisingly, we show that the information-theoretic sample complexity of testable learning is tightly characterized by the Rademacher complexity of the concept class, one of the most well-studied measures in statistical learning theory. In particular, uniform convergence is necessary and sufficient for testable learning. This leads to a

fundamental separation from (ordinary) distribution-specific agnostic learning, where uniform convergence is sufficient but not necessary.

______________________

__Zongming Ma__ (UPenn)

**Title:** Matching and integration of datasets with low-rank signals and applications in single-cell data analysis

*View Abstract*

We study one-way matching of a pair of datasets with low-rank signals. Under a stylized model, we first derive information-theoretic limits of matching under a mismatch proportion loss. We then show that linear assignment with projected data achieves fast rates of convergence and sometimes even rate optimality for this task. Built upon the theoretical understanding, we propose two practical matching algorithms for single-cell data integration in two different scenarios and showcase their power on two real data examples.

______________________

__Marina Meila__ (UW)

**Title:** Manifold Learning 2.0: Explanations and Eigenflows

*View Abstract*

Manifold learning algorithms can recover the underlying low-dimensional parametrization of high-dimensional point clouds. This talk will extend this paradigm in two directions.

First, we ask if it is possible, in the case of scientific data where quantitative prior knowledge is abundant, to explain a data manifold by new coordinates,chosen from a set of scientifically meaningful functions.

Second, I will show how some popular manifold learning tools and applications can be recreated in the space of vector fields and flows on a manifold. Central to this approach is the order 1-Laplacian of a manifold, $\Delta_1$, whose eigen-decomposition, known as the Helmholtz-Hodge Decomposition, provides a basis for all vector fields on a manifold. We present a consistent estimator for $\Delta_1$, which enables visualization, analysis, smoothing and semi-supervised learning of vector fields on a manifold. In topological data analysis, we describe the 1st-order analogue of spectral clustering, which amounts to prime manifold decomposition. Furthermore, from this decomposition a new algorithm for finding shortest independent loops follows.

Joint work with Yu-Chia Chen, Weicheng Wu, Samson Koelle, Hanyu Zhang and Ioannis Kevrekidis

**Speaker bio:** Marina Meila is Professor of Statistics at the University of Washington. She earned her MS degree in Electrical Engineering from the Polytechnic University of Bucharest in 1985, and her PhD in Computer Science and Electrical Engineering from the Massachusetts Institute of Technology in 1999. She held appointments at the Bucharest Research Institute for Computer Technology, the Polytechnic University of Bucharest, and the Robotics Institute of Carnegie Mellon University. Her long term interest is in machine learning and reasoning for data with high dimensions and with algebraic and geometric structure. Doing “statistics as usual” for structured data is one of her current interests. She is an expert in unsupervised learning paradigms, such as dimension reduction and clustering, where her group is developing methods to obtain guarantees of quality and correctness with minimal or no model assumptions.

______________________

__Vahab Mirrokni__ (Google)

**Title:** Privacy, Robustness, and Causal Inference in online advertising

*View Abstract*

We will discuss recent advances and technical problems in developing robust and privacy-aware online advertising. We touch upon topics like robust online optimization, privacy in causal Inference, privacy-aware interest-based advertising, and anonymous learning for ads prediction systems. Time-permitting, we will also discuss a number of causal inference problems.

______________________

__Nati Srebro__ (TTIC)

**Title:** Learning by Overfitting: A Statistical Learning view of Benign Overfitting

*View Abstract*

Classical theory, conventional wisdom, and all textbooks, tell us to avoid reaching zero training error and overfitting the noise, and instead balance model fit and complexity. Yet, recent empirical and theoretical results suggest that in many cases overfitting is benign, and even interpolating the training data can lead to good generalization. This has caused us to rethink the basic principles underlying statistical learning theory. In this talk I will discuss how much of our theory we can salvage and how much of it needs to be revised, focusing on the role of uniform convergence in understanding interpolation learning.

Based on joint work with Lijia Zhou, Fred Koehler, Danica Sutherland and Pragya Sur

**Speaker bio:** Nati (Nathan) Srebro is a professor at the Toyota Technological Institute at Chicago, with cross-appointments at the University of Chicago’s Department of Computer Science, and Committee on Computational and Applied Mathematics. He obtained his PhD from the Massachusetts Institute of Technology in 2004, and previously was a postdoctoral fellow at the University of Toronto, a visiting scientist at IBM, and an associate professor at the Technion.

Dr. Srebro’s research encompasses methodological, statistical and computational aspects of machine learning, as well as related problems in optimization. Some of Srebro’s significant contributions include work on learning “wider” Markov networks, introducing the use of the nuclear norm for machine learning and matrix reconstruction, work on fast optimization techniques for machine learning, and on the relationship between learning and optimization. His current interests include understanding deep learning through a detailed understanding of optimization, distributed and federated learning, algorithmic fairness and practical adaptive data analysis.

______________________

__Pragya Sur__ (Harvard)

**Title:** A new central limit theorem for the augmented inverse probability weighting estimator in high dimensions

*View Abstract*

Estimating the average treatment effect (ATE) is a central problem in causal inference. Modern advances in the field studied estimation and inference for the ATE in high dimensions through a variety of approaches. Doubly robust estimators such as the augmented inverse probability weighting (AIPW) form a popular approach in this context. However, the high-dimensional literature surrounding these estimators relies on sparsity conditions, either on the outcome regression (OR) or the propensity score (PS) model. This talk will introduce a new central limit theorem for the classical AIPW estimator, that applies agnostic to such sparsity-type assumptions. Specifically, we will study properties of the cross-fit version of the estimator under well-specified OR and PS models, and the proportional asymptotics regime. Under assumptions on the covariate distribution, our CLT will uncover two crucial phenomena among others: (i) the cross-fit AIPW exhibits a substantial variance inflation that can be quantified in terms of the signal-to-noise ratio and other problem parameters, (ii) the asymptotic covariance between the estimators used while cross-fitting is non-negligible even on the root-n scale. These findings are strikingly different from their classical counterparts, and open a vista of possibilities for studying

similar other high-dimensional effects. On the technical front, our work utilizes a novel interplay between three distinct tools—approximate message passing theory, the theory of deterministic equivalents, and the leave-one-out approach. Time permitting, I will outline some of these techniques.

**Speaker bio:** Pragya Sur is an Assistant Professor in the Statistics Department at Harvard. Prior to her current position, she was a postdoctoral fellow at the Center for Research on Computation and Society, Harvard SEAS. She obtained her Ph.D. from Stanford Statistics in 2019. Her research spans high-dimensional statistics and machine learning theory, with focus on high-dimensional regression, classification, causal inference, ensemble learning, and learning under distribution shifts. Her research is supported by an NSF DMS Award and a William F. Milton Fund Award (both solo PI). She is an International Strategy Forum (ISF) 2023 Asia Fellow, chosen by Schmidt Futures, a philanthropic initiative founded by Eric and Wendy Schmidt. As part of this, she currently participates in an 11-month, non-residential fellowship program for rising leaders ages 25 – 35 from across Africa, Asia, North America, and Europe. In Fall, 2021, she was invited to speak at the National Academies’ Board on Mathematical Sciences and Analytics symposium on Mathematical Challenges for Machine Learning and Artificial Intelligence, and also visited the Simons Institute for the Theory of Computing as a long-term participant. In 2019, she received the Theodore W. Anderson Theory of Statistics Dissertation Award for “deep original results in large sample maximum likelihood theory for logistic regression with a large number of covariates”.

______________________

__Mengdi Wang__ (Princeton)

**Title:** Policy Gradient: Estimation, Convergence and Beyond Cumulative Rewards

*View Abstract*

In offline RL, estimating the value or gradient of a policy often suffers from distribution shift. We will review the statistical theory of off-policy policy evaluation and consider the problem off-policy policy gradient estimation. We present a Policy Gradient Bellman Equation and shows how to leverage Q function approximator for policy gradient estimation via a double fitted iteration. We show that the double fitted iteration is equivalent to a model-based plug-in estimator. Further, the PG estimation error bound is determined by a restricted chi-square divergence that quantifies the interplay between distribution shift and function approximation. Next, we show how to extend policy gradient method and its convergence analysis to RL beyond cumulative rewards. In particular, consider the objective that is a general concave utility function of the state-action occupancy measure, containing as special cases maximal exploration, imitation and safety constrained RL. Such generality invalidates the Bellman equation and Sutton’s Policy Gradient Theorem. We derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem. We also exploit the hidden convexity of the Markov decision process and prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, though the optimization problem is nonconvex.

**Speaker bio:** Mengdi Wang is an associate professor at the Department of Electrical and Computer Engineering and Center for Statistics and Machine Learning at Princeton University. She is also affiliated with the Department of Computer Science, Princeton’s ML Theory Group. She was a visiting research scientist at DeepMind, IAS and Simons Institute on Theoretical Computer Science. Her research focuses on machine learning, reinforcement learning, with applications in healthcare, drug discovery, inteligent systems. Mengdi received her PhD in Electrical Engineering and Computer Science from Massachusetts Institute of Technology in 2013. At MIT, Mengdi was affiliated with the Laboratory for Information and Decision Systems and was advised by Dimitri P. Bertsekas. Before that, she got her bachelor degree from the Department of Automation, Tsinghua University. Mengdi received the Young Researcher Prize in Continuous Optimization of the Mathematical Optimization Society in 2016 (awarded once every three years), the Princeton SEAS Innovation Award in 2016, the NSF Career Award in 2017, the Google Faculty Award in 2017, and the MIT Tech Review 35-Under-35 Innovation Award (China region) in 2018, WAIC YunFan Award 2022. She serves as a Program Chair for ICLR 2023 and Senior AC for Neurips 2023, associate editor for Operations Research and Mathematics of Operations Research, area chair for ICML, AISTATS, and is on the editorial board of Journal of Machine Learning Research. Research supported by NSF, AFOSR, NIH, ONR, Google, Microsoft C3.ai DTI, FinUP, RVAC Medicines.

______________________

__Dana Yang__ (Cornell)

**Title:** Learner-Private Convex Optimization

*View Abstract*

Convex optimization with feedback is a framework where a learner relies on iterative queries and feedback to arrive at the minimizer of a convex function. The paradigm has gained significant popularity recently thanks to its scalability in large-scale optimization and machine learning. The repeated interactions, however, expose the learner to privacy risks from eavesdropping adversaries that observe the submitted queries. In this work, we study how to optimally obfuscate the learner’s queries in convex optimization with first-order feedback, so that their learned optimal value is provably difficult to estimate for the eavesdropping adversary.