Proceedings of the International scientific and practical conference "Education and Scientific Progress" (April 24-26, 2026) / Publisher website: www.naukainfo.com. – Manchester, United Kingdom, 2026. - 218 p.
Table 1. Characteristics of the datasets used

Dataset   Description                                              n objects   Numerical characteristics   Categorical   Text
DS-1      Customer base of the online store (UCI Online Retail)    4338        6                           3             0
DS-2      Product reviews (Amazon Product Reviews, sample)         3000        4                           2             1 (TF-IDF)
DS-3      Medical dataset (UCI Heart Disease, extended)            1025        8                           5             0

Table 2 presents the clustering quality metrics for four methods across three datasets, assuming optimal hyperparameters for each method. Several observations follow from the results.

Table 2. Clustering quality metrics (SC - Silhouette, CH - Calinski-Harabasz, DB - Davies-Bouldin)

Dataset   k-means SC   k-means CH   DBSCAN SC   DBSCAN CH   GMM SC   GMM CH   Spec. SC   Spec. CH
DS-1      0.47         412          0.38        287         0.53     491      0.51       468
DS-2      0.31         218          0.29        174         0.40     304      0.44       352
DS-3      0.42         378          0.35        251         0.49     447      0.48       432

Note - The best SC result for each dataset is highlighted in green.

For the DS-1 dataset, which consists exclusively of numerical and categorical data, the GMM algorithm demonstrated the highest performance (SC = 0.53; CH = 491), slightly outperforming spectral clustering (SC = 0.51). The k-means algorithm ranked third (SC = 0.47), which is attributed to the presence of non-compact clusters in the customer database. The DBSCAN method showed the lowest results (SC = 0.38) because the density of the customer subgroups is uneven.

For the DS-2 dataset, which includes a text modality, spectral clustering yielded the best results (SC = 0.44). This is consistent with theoretical findings that clusters in TF-IDF space have nonlinear geometry, which is more effectively handled by a