Abstract: Heart disease is among the leading causes of death worldwide, and detecting it early can substantially change how patients are treated. Many machine learning models built for this purpose perform poorly, however, because they are trained on limited data and the datasets are often imbalanced, with far more healthy patients than sick ones. In this work, we address both problems by combining two ideas: synthetic data generation with CTGAN and a stacking ensemble classifier. We first used CTGAN to generate new, realistic patient records that mirror the patterns in the original data. We then trained a stacking model, with XGBoost, Random Forest, and Gradient Boosting as base learners and Logistic Regression as the meta-learner, on the combined real and synthetic dataset. In testing, the model reached 92% accuracy and outperformed a standalone XGBoost baseline on every metric we evaluated, including precision, recall, and AUC. These results indicate that combining synthetic data augmentation with a stacking ensemble noticeably improves cardiovascular risk prediction.
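
The two steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the `ctgan` and `xgboost` packages alongside scikit-learn, and every function name, hyperparameter, and the toy dataset are hypothetical choices made for the example. It also assumes CTGAN is fitted on the training split only, so that evaluation uses real held-out records.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split


def augment_with_ctgan(train_df, discrete_columns, n_synthetic, epochs=300):
    """Return the real training rows plus n_synthetic CTGAN-sampled rows.

    Assumption: train_df is a pandas DataFrame holding the training split
    only, so held-out test rows never influence the generator.
    """
    # Optional dependencies, imported lazily so the rest runs without them.
    import pandas as pd
    from ctgan import CTGAN

    generator = CTGAN(epochs=epochs)
    generator.fit(train_df, discrete_columns)
    synthetic = generator.sample(n_synthetic)
    return pd.concat([train_df, synthetic], ignore_index=True)


def build_stacking_model(include_xgboost=True, seed=0):
    """Stack tree ensembles under a logistic-regression meta-learner."""
    base_learners = [
        ("rf", RandomForestClassifier(n_estimators=100, random_state=seed)),
        ("gb", GradientBoostingClassifier(random_state=seed)),
    ]
    if include_xgboost:
        from xgboost import XGBClassifier  # optional dependency

        xgb = XGBClassifier(n_estimators=100, random_state=seed)
        base_learners.insert(0, ("xgb", xgb))
    # cv=5: the meta-learner is trained on out-of-fold base predictions.
    return StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    )


def evaluate(model, X_test, y_test):
    """Compute accuracy, precision, recall, and AUC on held-out data."""
    predicted = model.predict(X_test)
    positive_prob = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, predicted),
        "precision": precision_score(y_test, predicted, zero_division=0),
        "recall": recall_score(y_test, predicted, zero_division=0),
        "auc": roc_auc_score(y_test, positive_prob),
    }


def demo_on_toy_data(include_xgboost=True, seed=0):
    """Run the stacking step on generated toy data (not patient records)."""
    X, y = make_classification(
        n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=seed
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = build_stacking_model(include_xgboost=include_xgboost, seed=seed)
    model.fit(X_train, y_train)
    return model, evaluate(model, X_test, y_test)
```

Calling `demo_on_toy_data()` returns the fitted model and a dictionary of the four metrics. With real data, one would presumably pass the training DataFrame (features plus label column) through `augment_with_ctgan` first, then fit the stacking model on the augmented rows and evaluate on untouched real records.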

Keywords: Cardiovascular Disease, CTGAN, Synthetic Data Augmentation, Stacking Ensemble, Machine Learning


DOI: 10.17148/IARJSET.2026.133100

How to Cite:

[1] M. Manoj Kumar, Mrs. M. Santhikala, Dr. M. Kaliappan, Dr. E. Mariappan, "GENERATING SYNTHETIC PATIENT RECORDS WITH CTGAN TO IMPROVE CARDIOVASCULAR RISK PREDICTION," International Advanced Research Journal in Science, Engineering and Technology (IARJSET), DOI: 10.17148/IARJSET.2026.133100
