Proceedings of the International scientific and practical conference "Education and Scientific Progress" (April 24-26, 2026) / Publisher website: www.naukainfo.com. – Manchester, United Kingdom, 2026. – 218 p.
and does not require prior assumptions regarding the statistical distribution or density of the data. Selecting a kernel for constructing the similarity matrix is a critical step. When working with multimodal data, it is advisable to use composite kernels, or to kernelize each feature type separately and then integrate the results [10, p. 1407].

Multimodal datasets include features of various types: numerical (continuous and discrete), categorical, binary, text-based, and others. The main challenge with such data is the heterogeneity of scales and metrics, which makes it impossible to apply Euclidean distance directly to mixed feature types. The following approaches are most commonly used to form a unified feature space [2, p. 43]: one-hot encoding - for processing categorical features; TF-IDF or Word2Vec/FastText - for vector representation of text data; normalization (min-max or z-score) - for scaling numerical features; Gower's metric - as a generalized distance measure for datasets with mixed feature types.

The Gower metric is defined as a weighted average of normalized per-feature distances, calculated separately for numerical and categorical features [11, p. 860]. For numerical variables, the absolute difference normalized by the feature's range is used, while for categorical variables a matching indicator is used (0 if the two values coincide, 1 otherwise). Because it handles both feature types within a single measure, the Gower metric is the de facto standard for clustering mixed data.

In the absence of ground-truth labels, the quality of clustering is evaluated using internal indices. This paper employs three metrics [6, p. 80]:

The silhouette coefficient (SC) for object i is calculated using the formula: s(i) = (b(i) - a(i)) / max{a(i), b(i)}, where a(i) is the average distance from object i to the other objects within the same cluster, and b(i) is the smallest average distance from object i to the objects of any other cluster, that is, to the nearest neighboring cluster. The SC value lies in the range [-1, 1], with values close to 1 indicating a higher quality of partitioning.
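The mixed-type distance described above (usually spelled Gower distance in the English-language literature) and the silhouette formula can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this paper: the function names, the toy data, the equal feature weights, and the conventions for singleton clusters and zero denominators are all assumptions made for illustration, and the sketch assumes at least two clusters.

```python
def gower_matrix(rows, is_categorical):
    """Pairwise Gower distances, assuming equal feature weights."""
    n = len(rows)
    p = len(is_categorical)
    # Range of each numerical feature, used to normalize absolute differences.
    ranges = []
    for k in range(p):
        if is_categorical[k]:
            ranges.append(None)
        else:
            column = [row[k] for row in rows]
            ranges.append(max(column) - min(column))
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for k in range(p):
                if is_categorical[k]:
                    # Matching indicator: 0 if equal, 1 otherwise.
                    total += 0.0 if rows[i][k] == rows[j][k] else 1.0
                elif ranges[k] > 0:
                    # Range-normalized absolute difference; constant
                    # features (range 0) contribute nothing.
                    total += abs(rows[i][k] - rows[j][k]) / ranges[k]
            dist[i][j] = dist[j][i] = total / p
    return dist


def silhouette_values(dist, labels):
    """s(i) = (b(i) - a(i)) / max{a(i), b(i)} from a distance matrix."""
    n = len(labels)
    clusters = sorted(set(labels))
    scores = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical toy data: one numerical and one categorical feature.
rows = [(1.0, "red"), (2.0, "red"), (9.0, "blue"), (10.0, "blue")]
is_categorical = [False, True]
labels = [0, 0, 1, 1]

dist = gower_matrix(rows, is_categorical)
scores = silhouette_values(dist, labels)
mean_sc = sum(scores) / len(scores)
print("mean silhouette:", round(mean_sc, 4))
```

Computing the silhouette from a precomputed distance matrix, rather than from raw coordinates, is what allows it to be combined with a non-Euclidean measure on mixed data.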
The Calinski-Harabasz (CH) index is calculated as the ratio of inter-cluster variance to intra-cluster variance. Higher values of this index correspond to a more