DNA Sequence Classification Using Machine Learning Models Based on k-mer Features
DOI:
https://doi.org/10.56427/jcbd.v4i2.762
Keywords:
DNA Classification, Machine Learning, k-mer, SVM, Random Forest, Deep Neural Network
Abstract
Cell-free DNA (cfDNA) has emerged as a promising biomarker in clinical applications, particularly cancer detection, prenatal diagnostics, and disease monitoring. Accurate classification of cfDNA sequences is crucial for improving diagnostic reliability and enabling timely clinical decisions. This study investigates three machine learning models, Decision Tree (DT), Support Vector Machine (SVM), and Deep Neural Network (DNN), for classifying cfDNA sequences using k-mer-based feature extraction with k = 3. A total of 3,000 DNA sequences, comprising both normal and tumor-derived samples, were transformed into numerical feature vectors based on the frequency of 3-mer patterns. The models were trained and evaluated using standard metrics: accuracy, precision, recall, and F1-score. Experimental results show that the DNN model achieved the highest classification performance, effectively distinguishing between normal and tumor cfDNA. In contrast, the DT and SVM models performed worse, particularly in identifying normal sequences. The study also addresses challenges such as class imbalance and the limitations of simple k-mer representations. These findings highlight the potential of deep learning approaches for cfDNA sequence analysis and open avenues for future research using more complex models, larger datasets, and feature engineering techniques to improve classification accuracy and clinical applicability.
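The 3-mer frequency representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `kmer_frequencies`, the lexicographic ordering of the 64 features, the normalization by the number of valid windows, and the skipping of windows that contain non-ACGT characters are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3, alphabet="ACGT"):
    """Return the relative frequency of every k-mer over `alphabet`.

    Hypothetical helper: the vector has len(alphabet)**k entries
    (64 for k = 3), ordered lexicographically (AAA, AAC, ..., TTT).
    """
    # Enumerate all possible k-mers in a fixed order so that every
    # sequence maps to a vector with the same layout.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]

    # Slide a window of width k over the sequence and count each k-mer.
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    # Assumption: windows containing ambiguous bases such as N are ignored,
    # so only k-mers over the alphabet contribute to the total.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


# Usage: one feature vector per sequence, ready to pass to a classifier.
features = kmer_frequencies("ACGTAC")
```

Under these assumptions, each sequence becomes a fixed-length vector of 64 values regardless of its length, which is the property that lets models such as DT, SVM, or a DNN consume sequences of varying length.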
License
Copyright (c) 2025 Afthar Kautsar

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.