Seminars & Colloquia Calendar


Mathematical Physics Seminar

Stochastic learning dynamics and generalization in neural networks: Can statistical physicists help understand AI?

Yuhai Tu - IBM T. J. Watson Research Center

Location: Zoom
Date & time: Wednesday, 01 March 2023, 10:45AM - 11:45AM

Abstract: Despite the great success of deep learning, it remains largely a black box. For example, the workhorse training algorithm for deep neural networks is Stochastic Gradient Descent (SGD), yet little is known about how SGD finds "good" solutions (those with low generalization error) in the high-dimensional weight space. In this talk, we will first give a general overview of SGD, followed by a more detailed description of our recent work [1,2] on the SGD learning dynamics, the loss-function landscape, and their relationship.
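For readers less familiar with SGD, the following minimal sketch (in Python with NumPy, on a hypothetical toy regression problem; not the speaker's code) illustrates the basic minibatch update: each step moves the weights against a gradient estimated from a small random subset of the training data, so the trajectory is a noisy version of full-batch gradient descent.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: linear regression stands in for a real training set.
X = rng.normal(size=(1000, 20))
true_w = rng.normal(size=20)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(20)            # weights to be learned
lr, batch_size = 0.05, 32   # assumed hyperparameters

for step in range(2000):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # draw a random minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size            # minibatch gradient of the MSE loss
    w -= lr * grad                                            # SGD update

print("final training MSE:", np.mean((X @ w - y) ** 2))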

More specifically, our study shows that the SGD dynamics follows a low-dimensional drift-diffusion motion in weight space and that the loss function is flat in most directions, with large values of flatness (small curvatures). Furthermore, our study reveals a robust inverse relation between the weight variance in SGD and the landscape flatness, opposite to the fluctuation-response relation in equilibrium systems. We develop a statistical theory of SGD based on properties of the ensemble of minibatch loss functions and show that the noise strength in SGD depends inversely on the landscape flatness, which explains the inverse variance-flatness relation. Our study suggests that SGD serves as an "intelligent" annealing strategy in which an effective anisotropic "temperature" self-adjusts according to the loss landscape in order to find the flat minima, which are found to be more generalizable. Finally, we discuss an application of these insights to reducing catastrophic forgetting in sequential learning of multiple tasks.
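As an illustration of the kind of measurement behind the inverse variance-flatness relation, the sketch below is one way such quantities might be probed on a toy problem. Assumptions: a toy linear-regression loss, PCA of the late-time SGD weight trajectory, and curvature used as a proxy for inverse flatness; refs [1,2] define flatness from the width of the loss profile along each direction, so this is not the authors' exact procedure, and a simple quadratic toy need not reproduce the deep-network result.

import numpy as np

rng = np.random.default_rng(1)

# Toy quadratic problem (linear regression), so the curvature along any
# direction can be read off exactly from the Hessian of the MSE loss.
X = rng.normal(size=(2000, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=2000)

# Run SGD and record the weight trajectory after an assumed burn-in,
# i.e. once the dynamics is fluctuating around a minimum.
w = np.zeros(10)
lr, batch_size, burn_in, steps = 0.05, 16, 2000, 6000
trajectory = []
for step in range(steps):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size
    w -= lr * grad
    if step >= burn_in:
        trajectory.append(w.copy())
trajectory = np.array(trajectory)

# Principal directions of the SGD weight fluctuations and the variance along each.
centered = trajectory - trajectory.mean(axis=0)
cov = centered.T @ centered / len(centered)
var_along, dirs = np.linalg.eigh(cov)      # eigenvalues = weight variance per direction

# Loss curvature along the same directions (here flatness ~ 1/curvature).
hessian = 2.0 * X.T @ X / len(X)           # exact Hessian of the MSE loss
curvature = np.einsum('ij,jk,ki->i', dirs.T, hessian, dirs)

# An inverse variance-flatness relation would show up as larger weight
# variance along directions of larger curvature (smaller flatness).
for v, c in sorted(zip(var_along, curvature)):
    print(f"weight variance {v:.3e}  vs  curvature {c:.3e}")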

Time permitting, we will also discuss more recent work on understanding why flat solutions are more generalizable and whether there are other measures for better generalization, based on an exact duality relation we found between neuron activities and network weights [3].

[1] “The inverse variance-flatness relation in Stochastic-Gradient-Descent is critical for finding flat minima”, Y. Feng and Y. Tu, PNAS, 118 (9), 2021.

[2] “Phases of learning dynamics in artificial neural networks: in the absence and presence of mislabeled data”, Y. Feng and Y. Tu, Machine Learning: Science and Technology (MLST), July 19, 2021. https://iopscience.iop.org/article/10.1088/2632-2153/abf5b9/pdf

[3] “The activity-weight duality in feed forward neural networks: The geometric determinants of generalization”, Y. Feng and Y. Tu, https://arxiv.org/abs/2203.10736

Special Note to All Travelers

Directions: map and driving directions. If you need information on public transportation, you may want to check the New Jersey Transit page.

Unfortunately, cancellations do occur from time to time. Feel free to call our department: 848-445-6969 before embarking on your journey. Thank you.