A Comparative Analysis of Clustering Algorithms for Expedia’s Travel Dataset

Authors

  • Gregorius Airlangga, Information System Study Program, Atma Jaya Catholic University of Indonesia, Jakarta, Indonesia

DOI:

10.33395/sinkron.v9i1.14343

Keywords:

Clustering Algorithms, Travel Data Analytics, Agglomerative, KMeans, DBSCAN

Abstract

Effective segmentation of travel data is crucial for deriving actionable insights in the tourism and hospitality sectors. This study evaluates four clustering algorithms (Agglomerative Clustering, DBSCAN, Gaussian Mixture Models (GMM), and KMeans) on a travel dataset, using three widely recognized metrics: Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Score. The dataset was preprocessed through standardization and dimensionality reduction via Principal Component Analysis (PCA) to facilitate visualization and ensure computational efficiency. The results highlight significant differences in the performance of these algorithms. Agglomerative Clustering achieved the highest Silhouette Score, indicating superior cluster cohesion and separation, while KMeans recorded the highest Calinski-Harabasz Score, indicating a high ratio of between-cluster to within-cluster dispersion. In contrast, DBSCAN performed poorly, producing low scores across all metrics, which the study attributes to its sensitivity to parameter selection and to density irregularities in the dataset. Gaussian Mixture Models exhibited moderate performance but struggled with overlapping clusters due to limitations in modeling non-Gaussian data distributions. Visualization of the clustering results confirmed these findings, revealing compact clusters for Agglomerative Clustering and KMeans, while DBSCAN and GMM showed less well-defined structures. This study underscores the importance of selecting clustering algorithms based on dataset characteristics and analysis objectives.
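
To make two of the metrics named in the abstract concrete, the following is a minimal from-scratch sketch of the Silhouette Score and the Calinski-Harabasz Score, using only the Python standard library. The toy points, the two labelings, and all function names are hypothetical illustrations; they are not taken from the Expedia dataset or from the study's code, which most likely relied on a library implementation.

```python
from math import dist
from statistics import fmean


def group_by_label(points, labels):
    """Collect points into a dict keyed by cluster label."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(fmean(coords) for coords in zip(*points))


def silhouette_score(points, labels):
    """Mean silhouette coefficient; ranges from -1 to 1, higher is better."""
    clusters = group_by_label(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster (cohesion)
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to any other cluster (separation)
        b = min(
            fmean([dist(point, q) for q in members])
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return fmean(scores)


def calinski_harabasz_score(points, labels):
    """Ratio of between-cluster to within-cluster dispersion; higher is better."""
    clusters = group_by_label(points, labels)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(
        len(members) * dist(centroid(members), overall) ** 2
        for members in clusters.values()
    )
    within = sum(
        dist(p, centroid(members)) ** 2
        for members in clusters.values()
        for p in members
    )
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical toy data: two well-separated unit squares in 2D.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
good_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # matches the visible structure
poor_labels = [0, 1, 0, 1, 0, 1, 0, 1]  # mixes the two groups

for name, labels in [("good", good_labels), ("poor", poor_labels)]:
    print(
        name,
        round(silhouette_score(toy_points, labels), 3),
        round(calinski_harabasz_score(toy_points, labels), 3),
    )
```

On this toy input both scores are higher for the labeling that matches the visible grouping. On real data the two metrics can rank algorithms differently, as the abstract reports for Agglomerative Clustering and KMeans.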

How to Cite

Airlangga, G. (2025). A Comparative Analysis of Clustering Algorithms for Expedia’s Travel Dataset. Sinkron : Jurnal Dan Penelitian Teknik Informatika, 9(1), 476-483. https://doi.org/10.33395/sinkron.v9i1.14343