Application of K-Means Clustering Statistical Model in DNA Base Frequency Distribution Analysis
Keywords:
Bioinformatics, Statistics, Genomics
Abstract
Rapid advances in genomics have generated large and complex datasets, and analysing them effectively requires reliable statistical models. Statistical models are essential to many areas of genomics, including evolutionary analyses, Genome-Wide Association Studies (GWAS), transcriptomics, and the reconstruction of gene regulatory networks. Using these models, researchers can identify genetic variants associated with diseases, analyse gene expression patterns, and predict phenotypes from genetic data. Interpreting genomic data also means dealing with problems such as noise, high dimensionality, and multiple testing, for which machine learning techniques for classification, prediction, and clustering are used. One such method is K-Means clustering, which groups genomic data according to the similarity of statistical characteristics of DNA sequences, such as base frequency distributions. Such clustering improves our understanding of genetic mechanisms and of how they can be applied in clinical practice. This article demonstrates the importance of statistical models in genomics.
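To make the clustering idea concrete, the following is a minimal sketch of K-Means applied to DNA base frequencies. It is not the authors' implementation: the function names (`base_frequencies`, `farthest_point_init`, `kmeans`), the toy sequences, and the choice of k = 2 are all assumptions made for illustration, and the deterministic farthest-point seeding is a simplification standing in for the random or k-means++ initialisation that is more common in practice.

```python
from collections import Counter
from math import dist

BASES = "ACGT"


def base_frequencies(sequence):
    """Return the relative frequencies of A, C, G, T in a DNA sequence."""
    counts = Counter(base for base in sequence.upper() if base in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return tuple(counts[base] / total for base in BASES)


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start at
    points[0], then repeatedly add the point farthest from the centroids
    chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        # (Euclidean distance in the 4-dimensional frequency space).
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable, so converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


# Hypothetical toy sequences: two AT-rich and two GC-rich.
sequences = {
    "seq_at_1": "ATATTAAATTATATAAGC",
    "seq_at_2": "TTAATATAATTACATTAA",
    "seq_gc_1": "GCGGCCGCGCCGGATGCC",
    "seq_gc_2": "CCGCGGGCCGCTAGCGGC",
}

points = [base_frequencies(seq) for seq in sequences.values()]
labels, centroids = kmeans(points, k=2)

for name, label in zip(sequences, labels):
    print(name, "-> cluster", label)
```

In this toy example the AT-rich and GC-rich sequences fall into separate clusters because their frequency vectors are far apart. Real genomic data would typically call for longer sequences, a principled choice of k, and a tested library implementation of K-Means.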
License
Copyright (c) 2025 Luthfie Budie, Syaputra Ervian

This work is licensed under a Creative Commons Attribution 4.0 International License.