Seminars & Colloquia Calendar
Data analysis in high-dimensional spaces
Adi Ben Israel, Rutgers University (RUTCOR)
Location: Zoom
Date & time: Thursday, 06 May 2021 at 5:00PM - 6:00PM
1. The unreliability of the Euclidean distance in high dimensions, which makes a proximity query meaningless and unstable because there is poor discrimination between the nearest and the furthest neighbor [3]; see also [4]. (Illustrated in the first sketch following this list.)
2. The uniform probability distribution on the $n$-dimensional unit sphere $S_n$, and some non-intuitive results for large $n$. For example, if $x$ is any point of $S_n$, taken as the "north pole", then most of the area of $S_n$ is concentrated near the "equator". (See the second sketch following this list.)
3. The advantage of the $\ell_1$ distance, which is less sensitive to high dimensionality and has been shown to "provide the best discrimination in high-dimensional data spaces" [1, p. 427]. (Compared with the Euclidean distance in the first sketch following this list.)
4. Clustering high-dimensional data using the $\ell_1$ distance [2].
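The following is a minimal numerical sketch, not taken from the talk or from [1]-[3], illustrating items 1 and 3: for points drawn uniformly from the unit cube and a random query point, the relative contrast (furthest distance minus nearest distance, divided by the nearest distance) collapses as the dimension grows, and it collapses more slowly for the $\ell_1$ distance than for the Euclidean distance. The uniform-cube setup and sample sizes are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(d, n_points=1000, p=2):
    # (max - min) / min of the l_p distances from one random query point
    # to n_points points drawn uniformly from the unit cube [0, 1]^d.
    points = rng.random((n_points, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, ord=p, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for d in (2, 10, 100, 1000):
    c2 = relative_contrast(d, p=2)  # Euclidean (l2) distance
    c1 = relative_contrast(d, p=1)  # l1 (Manhattan) distance
    print(f"d = {d:4d}   l2 contrast = {c2:8.3f}   l1 contrast = {c1:8.3f}")

As $d$ grows, both contrasts shrink toward zero, but the $\ell_2$ contrast shrinks faster, which is the poor discrimination between nearest and furthest neighbor described in item 1.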
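A second minimal sketch, also not part of the talk, for item 2, under the assumption that $S_n$ denotes the unit sphere in $R^n$: sampling uniform points on the sphere by normalizing Gaussian vectors and counting how many fall in a thin band around the "equator" orthogonal to the chosen "north pole" suggests that, for large $n$, nearly all of the area lies in that band. The band width and sample size are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(1)

def fraction_near_equator(n, eps=0.1, samples=100_000):
    # Fraction of points, sampled uniformly on the unit sphere in R^n,
    # whose first coordinate satisfies |x_1| <= eps, i.e. that lie in a
    # thin band around the equator orthogonal to the north pole e_1.
    g = rng.normal(size=(samples, n))
    x = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on the sphere
    return float(np.mean(np.abs(x[:, 0]) <= eps))

for n in (3, 10, 100, 1000):
    print(f"n = {n:4d}   fraction with |x_1| <= 0.1: {fraction_near_equator(n):.3f}")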
References
[1] C.C. Aggarwal et al., On the surprising behavior of distance metrics in high dimensional space, Lecture Notes in Computer Science, vol. 1973 (2001), Springer, https://doi.org/10.1007/3-540-44503-X_27
[2] T. Asamov and A. Ben-Israel, A probabilistic $\ell_1$ method for clustering high-dimensional data, Probability in the Engineering and Informational Sciences, 2021, pp. 1-16
[3] K. Beyer et al., When is "nearest neighbor" meaningful?, Lecture Notes in Computer Science, vol. 1540 (1999), Springer, https://doi.org/10.1007/3-540-49257-7_15
[4] J.M. Hammersley, The distribution of distance in a hypersphere, The Annals of Mathematical Statistics 21 (1950), 447-452.
Password: 6564120420