JPMPH : Journal of Preventive Medicine and Public Health

Original Article
Development of Machine Learning Models to Predict Health Insurance Claim Costs Among Older Indonesians: A Retrospective Predictive Modeling Study
Yeni Mahwati1, Dhihram Tenrisau2, Syarif Rahman Hasibuan3,4, Bhirau Wilaksono5, Yeni Indriyani6, Andi Afdal Abdullah7, Halik Malik7, Andi Alfian Zainuddin8
Journal of Preventive Medicine and Public Health 2026;59(2):132-142.
DOI: https://doi.org/10.3961/jpmph.25.350
Published online: January 6, 2026

1Sekolah Tinggi Ilmu Kesehatan Dharma Husada, Bandung, Indonesia

2Public Health Literature Club, Yogyakarta, Indonesia

3Faculty of Medicine, Universitas Pembangunan Nasional Veteran, Jakarta, Indonesia

4Center for Health Administration and Policy Studies, Faculty of Public Health, Universitas Indonesia, Jakarta, Indonesia

5Center for Longevity Research, Faculty of Medicine, Universitas Negeri Makassar, Makassar, Indonesia

6Department of Public Health, Faculty of Health Science, Universitas Muhammadiyah Surakarta, Surakarta, Indonesia

7Social Insurance Administration Organization, Jakarta, Indonesia

8Department of Public Health and Community, Faculty of Medicine, Universitas Hasanuddin, Makassar, Indonesia

Corresponding author: Yeni Mahwati, Sekolah Tinggi Ilmu Kesehatan Dharma Husada, Jl. Terusan Jakarta No.75, Cicaheum, Kiaracondong, Bandung 40282, Indonesia, E-mail: yenimahwati@stikesdhb.ac.id
• Received: May 3, 2025   • Revised: September 28, 2025   • Accepted: October 27, 2025

Copyright © 2026 The Korean Society for Preventive Medicine

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

  • Objectives
    The objective of this study was to develop machine learning models to predict health insurance claim costs among older adults in Indonesia.
  • Methods
    This study utilized secondary data from the Indonesian National Health Insurance program (Jaminan Kesehatan Nasional [JKN]) spanning 2017 to 2023. Three modeling techniques—linear regression, random forest, and XGBoost—were employed to predict individual claim costs. Model performance was assessed using the root mean square error (RMSE), coefficient of determination (R2), and mean absolute error (MAE). Additionally, variable importance analysis was conducted to identify key predictors.
  • Results
    XGBoost with 500 boosting rounds yielded the best performance, with an RMSE of 11 360 283, an R2 of 0.81, and an MAE of 4 485 917, outperforming both linear regression (RMSE, 13 710 035; R2=0.72) and random forest (RMSE, 12 434 238; R2=0.78). Notably, outpatient care was identified as the most consistent predictor across all models. Other significant predictors included length of stay (LOS), diagnosis type (International Classification of Diseases, 10th revision chapter), facility type, facility classification, and severity of illness, particularly for moderate cases. Although LOS and diagnosis type were important predictors, these findings should be interpreted in the context of Indonesia’s fixed Indonesian Case-Based Groups payment system.
  • Conclusions
    XGBoost provides reliable predictions of claim costs among older adults, capturing clinical, utilization, and structural drivers. These findings can inform targeted interventions, improve chronic disease management, optimize the referral system, and support integration of predictive tools into JKN to enhance responsiveness and promote sustainable, equitable financing.
INTRODUCTION
Indonesia’s population of older adults is growing: 11.75% are aged 60 years or older, the morbidity rate is 19.72%, and 41.49% report health issues, including chronic diseases and disabilities (Survei Sosial Ekonomi Nasional, Indonesian National Socioeconomic Survey [SUSENAS], March 2023) [1]. The high prevalence of multimorbidity and age-related diseases among older adults substantially increases healthcare costs [2,3]. These trends highlight the need for predictive tools that incorporate longevity-related considerations. Such tools can help anticipate aging-related costs, improve resource allocation, and support planning for care for older adults in Indonesia.
Indonesia’s national health insurance program, Jaminan Kesehatan Nasional (JKN), administered by Badan Penyelenggara Jaminan Sosial (BPJS), is central to financing healthcare for older adults and is funded by contributions and subsidies [4-6]. JKN provides essential services, including chronic disease management, hospitalization, and medications for older adults [7]. Data from SUSENAS show that 55.47% of older outpatients and 78.60% of inpatients rely on JKN coverage [8]. However, ensuring the financial sustainability of JKN is challenging due to rising costs.
Accurate healthcare cost prediction is essential for strategic planning. Traditional models, such as linear regression, often struggle to capture the complexity of healthcare data pertaining to older adults [9]. In contrast, machine learning (ML) methods can outperform traditional approaches in forecasting healthcare costs and supporting risk adjustment [10]. ML has also played a key role in supporting preventive health initiatives and optimizing resource allocation efforts [11].
The use of ML within JKN to predict claim costs for older adults remains limited [12]. A small number of studies have explored ML applications within Indonesia’s health system, such as artificial intelligence-based preeclampsia prediction using BPJS data [13], disease incidence prediction [14], blood pressure changes among patients with hypertension [15], and heart failure readmission risk [16]. However, no studies have focused on JKN claim cost prediction for older adults. In this study, we constructed ML models to predict claim costs for older adults with the aim of informing and strengthening JKN policy. Accordingly, this study addresses the gap in the literature by testing whether XGBoost outperforms other models and by examining whether outpatient utilization, length of stay (LOS), and chronic diagnoses are key predictors.
METHODS
Study Design and Data Source
This retrospective analysis used secondary data from BPJS Kesehatan, including anonymized participant and healthcare utilization records collected from 2017 to 2023. The dataset comprises a 1% representative sample of Indonesia’s National Health Insurance members, selected through stratified random sampling based on facility type and family category [17]. We analyzed cumulative individual claim costs using a cross-sectional design, focusing on prediction rather than cost trajectories over time. Participants with at least 1 healthcare claim during 2017–2023 were included, regardless of status (active, inactive, or deceased), because the outcome was cumulative cost. An illustration of the sampling process is provided in Figure 1.
The dataset initially contained 2 501 251 patient-visit records, which were transformed into an individual-level cohort dataset. After excluding 77 records with missing values, 102 728 older adults remained for analysis. Only records with complete data were used for modeling. Figure 2 presents the sample selection process and the ML workflow.
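The step from visit-level records to one row per participant can be sketched as follows. This is a minimal Python illustration, not the authors' R pipeline. The field names (`participant_id`, `service`, `cost`, `los`), the toy records, and the rule of dropping any participant with a missing value are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical visit-level records; field names are illustrative only.
visits = [
    {"participant_id": "A", "service": "outpatient", "cost": 200_000, "los": 0},
    {"participant_id": "A", "service": "inpatient", "cost": 5_000_000, "los": 4},
    {"participant_id": "B", "service": "outpatient", "cost": 150_000, "los": 0},
    {"participant_id": "B", "service": "outpatient", "cost": None, "los": 0},
]

def to_individual_level(records):
    """Collapse visit rows into one row per participant with cumulative totals."""
    cohort = defaultdict(lambda: {"outpatient": 0, "inpatient": 0,
                                  "total_cost": 0, "los": 0, "complete": True})
    for r in records:
        row = cohort[r["participant_id"]]
        if r["cost"] is None:
            # Flag the participant as incomplete (assumed handling of missing data).
            row["complete"] = False
            continue
        row[r["service"]] += 1          # count of visits by service type
        row["total_cost"] += r["cost"]  # cumulative claim cost (the target)
        row["los"] += r["los"]          # cumulative length of stay in days
    # Keep complete cases only.
    return {pid: row for pid, row in cohort.items() if row["complete"]}

cohort = to_individual_level(visits)
```

Here participant "A" is kept with one outpatient visit, one admission, and a cumulative cost, while participant "B" is excluded because one of their records has a missing cost.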
Description of Variables

Target variable

The target variable was total claim cost (Indonesian rupiah [IDR]) per participant over 2017–2023, including all expenses submitted by healthcare facilities for inpatient and outpatient care, procedures, medications, and diagnostic services.

Predictor variables

Predictors included demographic factors (age, sex, and marital status); utilization measures by severity level and International Classification of Diseases, 10th revision (ICD-10) diagnosis; service type; ward class; LOS; segmentation; and facility type/ownership. Outpatient care refers to the number of specialist clinic visits, and inpatient care refers to the number of hospital admissions. A detailed description and conceptual framework are provided in Supplemental Material 1.

Descriptive analysis

Continuous variables (age, cost, and LOS) exhibited positive skewness, high variance, and long-tailed distributions indicative of outliers (Supplemental Material 2).
Statistical Analysis
The data were processed using R version 4.4.2 (R Foundation for Statistical Computing, Vienna, Austria). Descriptive statistics were used to summarize participants’ demographic, clinical, cost, and utilization characteristics. Categorical variables were presented as frequencies and percentages, and numerical variables were reported as means and standard deviations (SDs). Pearson correlation was used to examine relationships between claim costs and other variables and to identify potential data issues. The data were then split into training (80%) and testing (20%) sets [18], and 5-fold cross-validation was applied to validate the models [19].
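One way to read the validation scheme is an 80/20 hold-out split followed by 5-fold cross-validation inside the training portion. The sketch below illustrates that reading in plain Python. Whether cross-validation was run within the training set only is an assumption, and the function names, seed, and sample size of 100 are hypothetical.

```python
import random

def train_test_split(n, test_fraction=0.2, seed=42):
    """Shuffle row indices 0..n-1 and hold out a fraction for testing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def k_fold(indices, k=5):
    """Yield (fit, validate) index lists for each of k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validate = folds[i]
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, validate

train_idx, test_idx = train_test_split(100)
folds = list(k_fold(train_idx, k=5))
```

Each training row is used for validation exactly once, and the held-out test rows are never seen during tuning.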

Recursive feature selection

Data quality and potential overfitting were addressed by emphasizing influential features. Recursive feature elimination (RFE) was applied for feature selection. RFE iteratively ranks features by importance and removes less informative variables. By prioritizing the most informative features, RFE can reduce noise, improve model accuracy, and mitigate overfitting. It also enhances the interpretability and efficiency of the model, particularly for high-dimensional data [20]. A detailed description of the feature selection process is provided in Supplemental Material 3.
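The elimination loop behind RFE can be sketched generically. The code below is an illustrative Python skeleton, not the caret routine used in the study: it removes one feature per iteration, and `toy_fit_and_rank` is a hypothetical stand-in for refitting a model and reading its importance scores.

```python
def recursive_feature_elimination(features, fit_and_rank, min_features=1):
    """Refit, drop the least important feature, and repeat.

    fit_and_rank(subset) must return (error, {feature: importance}).
    Returns the subset with the lowest error seen along the path.
    """
    current = list(features)
    best_subset, best_error = None, float("inf")
    while len(current) >= min_features:
        error, importance = fit_and_rank(current)
        if error < best_error:
            best_subset, best_error = list(current), error
        if len(current) == min_features:
            break
        weakest = min(current, key=lambda f: importance[f])
        current.remove(weakest)
    return best_subset, best_error

# Hypothetical "true" signal used only to drive the toy scorer.
SIGNAL = {"outpatient": 1.0, "los": 0.6}

def toy_fit_and_rank(subset):
    # Stand-in for a model refit: error rises sharply when a signal feature
    # is missing and slightly for every noise feature that is kept.
    missing = sum(1 for f in SIGNAL if f not in subset)
    noise = sum(1 for f in subset if f not in SIGNAL)
    error = 10.0 * missing + 1.0 * noise
    importance = {f: SIGNAL.get(f, 0.05) for f in subset}
    return error, importance

subset, error = recursive_feature_elimination(
    ["outpatient", "los", "noise_a", "noise_b"], toy_fit_and_rank)
```

In this toy run the two noise features are removed first, and the retained subset is the one with the lowest error along the elimination path.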

ML algorithms

The data were analyzed using 2 ML algorithms, random forest and XGBoost, with linear regression serving as the baseline model. Both ML algorithms were implemented via the Classification and Regression Training (caret) package, a comprehensive framework for building ML models in R [20]. These robust ensemble learning methods have previously shown strong performance for predicting medical expenses and total costs [21]. Hyperparameters for both models were optimized using grid search, with detailed configurations outlined in Supplemental Material 4.
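Grid search simply evaluates every combination of candidate hyperparameter values and keeps the best one. The Python sketch below shows the idea under stated assumptions: the grid uses the `nrounds` and `eta` values mentioned in this article, while `toy_cv_rmse` is a hypothetical placeholder for a cross-validated RMSE and does not reflect the study's actual results.

```python
from itertools import product

def grid_search(grid, evaluate):
    """Try every combination in `grid`; the lowest score wins."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Candidate values taken from the article; the full grid is in its supplement.
grid = {"nrounds": [100, 200, 500], "eta": [0.025, 0.05]}

def toy_cv_rmse(params):
    # Hypothetical stand-in for cross-validated RMSE (not real model output).
    return 1.0 / params["nrounds"] + abs(params["eta"] - 0.05)

best_params, best_score = grid_search(grid, toy_cv_rmse)
```

In practice `evaluate` would fit the model on each training fold and average the validation error, which is what makes grid search expensive for large grids.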

Model evaluation

Root mean square error (RMSE), the coefficient of determination (R2), and mean absolute error (MAE) were used to evaluate model performance on the testing dataset. RMSE reflects the typical magnitude of prediction error and penalizes large errors more heavily, MAE is the average absolute error, and R2 indicates the proportion of variance explained by the model. The model with the lowest RMSE and the highest R2 was selected as the best-performing model.
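For reference, the evaluation metrics can be written out in a few lines of Python. This is a generic sketch on toy numbers, including the MAE reported in the results. R2 is computed here as 1 minus the ratio of residual to total sum of squares, which is an assumption: some software instead reports the squared correlation between observed and predicted values, and the article does not state which definition was used.

```python
import math

def rmse(y, yhat):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, yhat)) / len(y))

def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y, yhat)) / len(y)

def r_squared(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Toy values, not study data.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 10.0]
scores = (rmse(actual, predicted), mae(actual, predicted), r_squared(actual, predicted))
```

Because RMSE squares each error before averaging, a few very expensive cases can dominate it, which is why MAE is a useful companion for skewed cost data.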
Ethics Statement
This study used de-identified secondary data from BPJS Kesehatan and did not involve human participants or identifiable personal information. Therefore, institutional review board approval was not required.
RESULTS
Descriptive and Exploratory Data Analysis
This study included 102 728 participants, with a mean±SD age of 68.90±6.75 years. Slightly more than half of the participants were male (51.8%), and most were married (72.3%). Table 1 provides a comprehensive summary of participants’ demographic and healthcare utilization characteristics. The mean and median total claim costs were IDR 12.7 million and IDR 5.9 million, respectively. The transformed exploratory data analysis (EDA) results are presented in Supplemental Material 2.
Outpatient care was the most frequently utilized service, with an average of 14.3 visits per participant. ICD-10 Chapter 21 (factors influencing health status) was the most common diagnostic category, followed by circulatory, eye, genitourinary, and digestive diseases. Segmentation indicated that most participants fell into the PBPU (informal self-employed) or PBI APBN (government-subsidized premium [national]) groups. Visits were concentrated in Class C and B hospitals. Correlation analysis revealed strong correlations (r=0.60–0.79) between outpatient services and severity levels. Additionally, robust correlations (r>0.79) were observed among outpatient visits, severity, and ICD-10 Chapter 21 diagnoses, suggesting potential multicollinearity and confounding (Supplemental Material 5).
Predictive Model Performance and Evaluation
Using RFE for feature selection reduced the number of predictors from 43 to 35. Additional information regarding the feature selection process and ranked variables is provided in Supplemental Material 3. RMSE, R2, and MAE were evaluated across varying numbers of features selected via RFE. Optimal performance, defined by the lowest RMSE and the highest R2, was achieved when 35 features were included.

Model comparison

Three modeling techniques were evaluated: XGBoost, random forest, and linear regression (used as the baseline model). Hyperparameter tuning parameters and model configurations are summarized in Supplemental Material 4.

Model evaluation

Model performance was assessed using RMSE, R2, and MAE (Table 2). XGBoost (500 rounds) achieved lower RMSE and MAE and a higher R2 than the other models. RMSE curves across varying nrounds and learning rates plateaued at higher iteration counts (200–500) and lower learning rates (eta=0.025, 0.050) (Supplemental Material 4C), indicating diminishing returns. Bonferroni-adjusted paired t-tests confirmed that XGBoost significantly outperformed linear regression (p<0.05), with smaller but non-significant improvements over random forest (Supplemental Material 6B). Residuals and additional performance details are provided in Supplemental Material 6A and B.
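As a reminder of what a Bonferroni-adjusted paired t-test computes, the standard formulas are given below. The notation is an assumption made for illustration: d_i is the difference in error between two models on resample i, k is the number of paired resamples, p is the unadjusted p-value from a t distribution with k-1 degrees of freedom, and m is the number of pairwise model comparisons. The article does not specify these details, which are described in its supplement.

```latex
d_i = e_i^{(A)} - e_i^{(B)}, \qquad
\bar{d} = \frac{1}{k}\sum_{i=1}^{k} d_i, \qquad
s_d = \sqrt{\frac{1}{k-1}\sum_{i=1}^{k}\left(d_i - \bar{d}\right)^2}

t = \frac{\bar{d}}{s_d/\sqrt{k}}, \qquad
p_{\mathrm{adj}} = \min\left(1,\; m \cdot p\right)
```

With few resamples the test has low power, so a non-significant difference between XGBoost and random forest does not by itself show that the two models perform equally.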
Unlike the linear patterns observed in the EDA, the ML models revealed nonlinear relationships among outpatient visits, severity, and diagnoses, suggesting potential multicollinearity or interaction effects (Supplemental Material 5A).
Single cross-validation and Bonferroni-adjusted tests revealed that XGBoost consistently outperformed random forest and linear regression across all metrics (MAE, RMSE, and R2), with significantly lower errors than linear regression (p<0.05), especially at 500 rounds. Performance gains for XGBoost variants plateaued beyond 100–200 trees, and the improvements from random forest over linear regression were modest. Overall, XGBoost provided the most reliable predictive performance, supporting its selection as the final model (Supplemental Materials 6B and 7). Visual evaluation indicated that linear regression substantially underpredicted high-cost cases; random forest improved mid-range predictions but underestimated extreme values, whereas XGBoost aligned most closely with the diagonal, indicating the best accuracy across cost ranges (Supplemental Material 6A).

Key predictors of claim costs

Variable importance analysis revealed that outpatient care was the most important predictor across all models (mean importance, 98.63; SD, 2.66). ICD diagnosis category also demonstrated moderate importance (mean, 4.79; SD, 12.44). In the best-performing model (XGBoost with 500 rounds), the top 3 predictors were outpatient care (importance: 100), LOS (59.8), and ICD-10 Chapter 14 (genitourinary diseases; 16.27). Figure 3 illustrates the variable importance across models, highlighting key utilization and diagnostic patterns.
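The importance values above appear to be on a relative 0 to 100 scale, with the top predictor at 100. One common way to obtain such a scale is min-max scaling of the raw importance scores, sketched below in Python. Whether the study's values were produced exactly this way is an assumption, and the raw scores in the example are hypothetical.

```python
def scale_importance(raw):
    """Min-max scale raw importance scores to the range 0-100."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        return {name: 100.0 for name in raw}
    return {name: 100.0 * (v - lo) / (hi - lo) for name, v in raw.items()}

# Hypothetical raw scores, not values from the study.
raw = {"outpatient_care": 0.50, "length_of_stay": 0.30, "age": 0.10}
scaled = scale_importance(raw)
```

On such a scale, scores are comparable within one model but not across models, since each model's strongest predictor is pinned to 100 regardless of how much it actually contributes.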
DISCUSSION
This study demonstrated that XGBoost outperformed both random forest and linear regression in predicting health insurance claim costs among Indonesia’s population of older adults. With superior RMSE, MAE, and R2 values, XGBoost effectively captured complex, non-linear patterns in healthcare data, supporting the utility of tree-based ensemble methods in modeling diverse cost drivers [22,23].
Model performance was further supported by stable RMSE trends across increasing boosting rounds and low learning rates, with diminishing gains beyond 200–500 iterations. These findings suggest that the selected configuration (500 rounds) represents a robust trade-off between predictive performance and overfitting risk, consistent with internal cross-validation results. Although linear regression served as a useful baseline, its capacity to account for interactions among clinical and institutional variables was limited.
Feature importance analysis highlighted 6 key domains influencing claim cost variation: types of services received, LOS, diagnosis type (ICD-10 chapter), facility type, facility classification, and severity of illness. LOS and diagnosis counts were positively skewed with outliers, which we retained because extreme costs often reflect real, high-impact events in care for older adults. XGBoost and other ensemble methods are robust to these characteristics [24,25].
This multifactorial model extends beyond single-disease analyses by capturing the intersection of clinical, utilization, and structural factors, thereby providing a strong foundation for operational and policy-level decision-making within the JKN system.
Outpatient Care as the Primary Cost Driver
Outpatient care was identified as the most important predictor across all models. This finding aligns with global trends showing that older adults with chronic and degenerative conditions, such as cardiovascular disease, diabetes, and cognitive decline, tend to use outpatient services more frequently, contributing to cumulative healthcare costs [26]. These patterns underscore the need to prioritize chronic disease prevention and early intervention strategies to reduce avoidable outpatient utilization and long-term costs. Predictive models that incorporate chronic and functional care trajectories may improve cost forecasting and help guide targeted policy interventions [27,28].
Crucially, within the JKN system, claim costs are primarily driven by utilization. Even within a fixed-tariff structure such as the Indonesian Case-Based Groups (INA-CBGs), an increase in outpatient visits or inpatient episodes naturally results in higher cumulative expenditures. This underscores that utilization is a key cost driver and may warrant its own dedicated predictive modeling.
Length of Stay in the Context of Jaminan Kesehatan Nasional and Patient Complexity
Ranking second in predictive importance, LOS is commonly regarded as an indicator of disease complexity, functional decline, and institutional efficiency. In older populations, prolonged hospitalization is often attributable to multimorbidity, functional impairment, and delayed discharge [29].
Consistent with this, previous studies have identified conditions such as stroke, cancer, hip fractures, and infections (urinary tract infections and pneumonia) as significant contributors to prolonged LOS among older adults [30]. These conditions often require complex interventions or rehabilitation, particularly when patients are discharged to settings other than their homes. Importantly, a substantial share of inpatient costs is concentrated among high-cost users with extended LOS, many of whom have chronic conditions such as malignancy, which further intensifies the concentration of expenditures [31].
Although LOS was a significant predictor of claim costs, it is essential to interpret this finding within the context of Indonesia’s case-based payment system, INA-CBGs [32], in which reimbursement is determined by diagnosis group. Consequently, patients with the same diagnosis codes but different LOS may generate identical claim costs, indicating that LOS alone may not fully capture expenditure variation under the JKN system.
If LOS consistently emerges as a strong predictor of costs, policymakers could explore integrating predicted LOS categories into quarterly risk stratification models, piloting XGBoost-based tools in selected BPJS regional offices, and aligning reimbursement protocols more closely with actual resource demands.
Diagnosis and Cost Burden: Clinical Patterns and Payment Implications
Diagnosis type, particularly ICD-10 Chapter 14 (genitourinary diseases), was a key predictor of high costs. Older adults commonly experience urologic conditions, such as prostate disorders (including cancer and benign prostatic hyperplasia) in men and urinary incontinence in women. Chronic kidney disease is also a frequent comorbidity that contributes to high healthcare costs due to its progressive course and potential need for long-term dialysis or kidney transplantation. These conditions can increase costs through complex care needs and prolonged hospitalizations [30,33].
ICD-10 Chapter 21 (factors influencing health status) also played a significant role, as it includes codes for long-term care (e.g., Z79), which may reflect chronic conditions that increase healthcare costs due to complex disease interactions, polypharmacy, and elevated risks of adverse health outcomes [34]. Other chapters (Chapter 9 [circulatory] and Chapter 13 [musculoskeletal]) cover conditions such as hypertension, stroke, ischemic heart disease, osteoarthritis, and osteoporosis, all of which can lead to pain, functional impairment, frailty, and sarcopenia. These conditions diminish quality of life and may require high-cost interventions and prolonged rehabilitation [35-37].
Despite standardized INA-CBG payments, costs can vary by contribution class and regional tariffs, meaning that diagnosis and severity remain key predictors.
This study categorized diagnoses using ICD-10 chapters and analyzed aggregate patterns. However, some high-cost conditions, such as cardiovascular disease, may involve relatively lower utilization but disproportionately higher costs per episode, unlike low-cost, high-frequency diagnoses such as fever. Therefore, the model may not fully capture skewed cost-utilization dynamics.
These findings suggest that diagnosis-driven costs, especially among chronic and complex cases, should inform BPJS policy. Predictive models that incorporate severity may support risk stratification, preventive care, and resource allocation aligned with true cost drivers.
Facility Type and Classification
Higher-tier facilities, particularly Type A and B hospitals, were associated with higher claim costs. Indonesia’s case-based payment system, INA-CBGs, assigns higher reimbursement tariffs to these facilities based on service complexity. These findings highlight the importance of optimizing referral systems and managing moderate-severity cases at more cost-effective facility levels, such as Class B or C hospitals, while ensuring appropriate quality and access [32,38-40]. Predictive tools such as XGBoost could support BPJS monitoring by identifying cases that may be appropriate for management at lower-tier hospitals, optimizing resource allocation, and informing reimbursement policy.
Severity of Illness as a Direct Cost Determinant
Severity of illness, particularly moderate cases, was another significant cost driver. Under INA-CBGs, inpatient severity is categorized into 3 levels: mild, moderate, and severe. Moderate cases often represent chronic disease, remaining manageable but still requiring substantial resources. Reimbursement is structured accordingly, with moderate-level cases incurring higher costs than mild cases [32]. These findings suggest that proactive management of moderate-severity patients through targeted preventive care and monitoring could substantially reduce long-term expenditures.
This study confirms that health insurance claim costs among Indonesia’s population of older adults are influenced by clinical, service, and institutional factors. XGBoost effectively captured these dynamics, providing a foundation for data-driven cost-containment strategies within the national health insurance system. The key predictors (outpatient utilization, LOS, diagnostic burden, and facility structure) are established drivers of high-cost care and were coherently captured by our integrated model.
Despite these contributions, this study has several limitations. First, we used cross-sectional secondary data and could not capture time-dependent outcomes, such as disease progression or treatment sequences. Although stratified sampling improved representativeness, some diagnoses or facilities may have been overrepresented, potentially introducing bias. Including inactive or deceased participants enabled capture of high-cost claims; however, exposure time was not modeled, possibly contributing additional bias.
Second, continuous variables (age, cost, and LOS) were positively skewed with high variance and outliers, potentially affecting the sensitivity and generalizability of non-tree-based models. Several categorical features (hospital type, specialty, and ICD chapters) also showed low variance and were dominated by a few categories. These imbalances may limit the models’ capacity to detect patterns in underrepresented groups, especially for linear regression. Third, the best-performing model (XGBoost with 500 rounds) consistently underpredicted expenditures among extremely high-cost users, as shown in the decile and top-5% analyses. This finding suggests that mean-based approaches may struggle to capture extreme values in skewed cost distributions (Supplemental Material 8).
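A decile check of the kind described above can be sketched in Python as follows. The function sorts cases by actual cost, cuts them into equal-size groups, and reports the mean of actual minus predicted cost per group, so positive values indicate underprediction. The data are toy values in which predictions are deliberately shrunk toward the middle. Nothing here reproduces the study's analysis, and the function name is hypothetical.

```python
def mean_residual_by_group(actual, predicted, groups=10):
    """Mean (actual - predicted) within equal-size groups ordered by actual cost."""
    pairs = sorted(zip(actual, predicted))
    n = len(pairs)
    out = []
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if chunk:
            out.append(sum(a - p for a, p in chunk) / len(chunk))
    return out

# Toy costs (arbitrary units) with predictions pulled toward the centre.
actual = [float(a) for a in range(1, 21)]
predicted = [0.8 * a + 2 for a in actual]
residuals = mean_residual_by_group(actual, predicted)
```

In this toy example the lowest decile is overpredicted and the highest decile is underpredicted, which is the pattern expected when a model trained on squared error meets a long right tail.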
These findings highlight the importance of translating predictive insights into actionable strategies within Indonesia’s health system. As JKN faces increasing financial pressure from demographic shifts and chronic disease burden, the use of advanced analytics such as XGBoost may improve resource planning and responsiveness. Importantly, this study extends beyond technical validation by presenting a policy-relevant framework that incorporates clinical complexity, utilization intensity, and facility-level dynamics. These insights may support the development of more equitable, efficient, and sustainable health financing models for Indonesia’s aging population.
Future research should explore longitudinal or time-to-event approaches to better capture care dynamics. In addition, modeling healthcare utilization (e.g., visit frequency or care intensity) as a related outcome may provide complementary insights for forecasting and policy planning within JKN.
CONCLUSION
This study shows that XGBoost is a valuable tool for predicting healthcare claim costs among older adults in Indonesia. The findings have important implications for developing risk-based cost-containment strategies, highlighting the need to focus on moderate-severity cases, optimize facility utilization, and improve chronic disease management. We recommend piloting ML tools, such as predictive risk stratification or outpatient alerts, within BPJS workflows to support data-driven decision-making in JKN. Integrating predictive tools into health policy may help strengthen the efficiency and sustainability of Indonesia’s national health insurance system.
The dataset used in this study was obtained from BPJS Kesehatan under a data use agreement and is not publicly accessible. BPJS Kesehatan may provide access to the data upon reasonable request and with appropriate institutional approval.
All R scripts and other documentation are available at https://github.com/Dhihram/ML_BPJS.
Supplemental materials are available at https://doi.org/10.3961/jpmph.25.350.

Conflict of Interest

The authors have no conflicts of interest associated with the material presented in this paper.

Funding

None.

Acknowledgements

The authors express their gratitude to BPJS Kesehatan for granting access to the anonymized dataset used in this study. The authors would also like to express their gratitude to Pierre Masselot, Alex Lewin, and Andree Valle Campos for sharing valuable insights from the “Machine Learning” and “Data Challenge” modules of the MSc Health Data Science program at the London School of Hygiene and Tropical Medicine (LSHTM).

Author Contributions

Conceptualization: Mahwati Y, Tenrisau D. Data curation: Hasibuan SR. Formal analysis: Tenrisau D. Funding acquisition: None. Methodology: Mahwati Y, Tenrisau D, Hasibuan SR. Project administration: Mahwati Y. Visualization: Tenrisau D. Writing – original draft: Mahwati Y, Tenrisau D, Hasibuan SR, Wilaksono B, Indriyani Y, Abdullah AA, Malik H, Zainuddin AA. Writing – review & editing: Mahwati Y, Tenrisau D, Hasibuan SR, Wilaksono B, Indriyani Y, Abdullah AA, Malik H, Zainuddin AA.

Figure 1
Illustration of the sampling process.
Figure 2
Sample selection process and machine learning workflow.
Figure 3
(A) Variable importance across machine learning models (B) XGBoost, (C) random forest, and (D) linear regression.
Table 1
Participant demographics and healthcare utilization characteristics (n=102 728)
Characteristics n (%) or mean±SD
Age (y) 68.90±6.75
Sex
 Male 53 165 (51.8)
 Female 49 563 (48.2)
Marital status
 Unmarried 6780 (6.6)
 Divorced 17 977 (17.5)
 Married 74 247 (72.3)
 Undefined 3724 (3.6)
Utilization by severity level
 Mild 0.83±1.35
 Moderate 0.33±0.82
 Severe 0.14±0.44
Outpatient care 14.30±32.50
Utilization by ICD-10 diagnosis
 ICD Chapter 1 0.16±0.55
 ICD Chapter 2 0.13±0.69
 ICD Chapter 3 0.04±0.38
 ICD Chapter 4 0.23±0.98
 ICD Chapter 5 0.02±0.33
 ICD Chapter 6 0.06±0.60
 ICD Chapter 7 0.50±1.60
 ICD Chapter 8 0.06±0.31
 ICD Chapter 9 0.63±1.70
 ICD Chapter 10 0.26±1.09
 ICD Chapter 11 0.28±0.88
 ICD Chapter 12 0.05±0.35
 ICD Chapter 13 0.26±1.95
 ICD Chapter 14 0.30±4.26
 ICD Chapter 15 0.00±0.03
 ICD Chapter 17 0.00±0.08
 ICD Chapter 18 0.19±0.78
 ICD Chapter 19 0.10±0.39
 ICD Chapter 21 12.30±30.50
Utilization by type of service received
 Outpatient 14.30±32.50
 Inpatient 1.29±1.88
Utilization by ward class
 Ward class I 6.11±25.00
 Ward class II 2.80±14.50
 Ward class III 6.68±20.20
Length of stay (day) 5.32±9.03
Utilization by participant segmentation
 Non-worker 25 054 (24.4)
 PBI APBD 9116 (8.9)
 PBI APBN 26 553 (25.8)
 PBPU 36 155 (35.2)
 PPU 5850 (5.7)
Utilization by facility ownership
 Government 6.55±19.80
 Private 9.03±25.40
Utilization by facility type
 Type A hospital 0.72±6.56
 Type B hospital 4.16±18.60
 Type C hospital 7.98±21.70
 Type D hospital 1.79±8.39
 Special hospital 0.87±5.26
 Other 0.07±2.19
Claim cost (IDR) 12 740 059±25 021 824

SD, standard deviation; ICD-10, International Classification of Diseases, 10th revision; PBI APBD, government-subsidized premium (local); PBI APBN, government-subsidized premium (national); PBPU, informal self-employed; PPU, formal worker participant; IDR, Indonesian rupiah.

Table 2
Model performance comparison using training data
Models               RMSE (IDR)    R2      MAE (IDR)
Linear regression    13 710 035    0.72    5 641 598
Random forest
 ntree 50            12 434 238    0.78    4 696 412
 ntree 100           12 438 436    0.78    4 672 509
 ntree 200           12 437 736    0.78    4 667 147
XGBoost
 nrounds 100         11 508 546    0.80    4 579 177
 nrounds 200         11 451 146    0.81    4 549 908
 nrounds 500         11 360 283    0.81    4 485 917

RMSE, root mean square error; MAE, mean absolute error.

  • 1. Statistics Indonesia. Statistics of aging population 2023. [cited 2025 Feb 5]. Available from: https://www.bps.go.id/en/publication/2023/12/29/5d308763ac29278dd5860fad/statistics-of-aging-population-2023.html (Indonesian)
  • 2. Okamoto S, Sata M, Rosenberg M, Nakagoshi N, Kamimura K, Komamura K, et al. Universal health coverage in the context of population ageing: catastrophic health expenditure and unmet need for healthcare. Health Econ Rev 2024;14(1):8. https://doi.org/10.1186/s13561-023-00475-2
  • 3. Bertolazzi A, Quaglia V, Bongelli R. Barriers and facilitators to health technology adoption by older adults with chronic diseases: an integrative systematic review. BMC Public Health 2024;24(1):506. https://doi.org/10.1186/s12889-024-18036-5
  • 4. Mahardhika JC. Socio-economical characteristics and determinants of Indonesian national health insurance subsidized by the government in Jakarta. Med Clin Update J 2023;2(1):1-7. https://doi.org/10.58376/mcu.v2i1.20
  • 5. Abnur A. Analysis on BPJS Kesehatan from various disciplines. Glob Rev Islam Econ Bus 2015;2(3):159-171. https://doi.org/10.14421/grieb.2015.023-01
  • 6. Muttaqien M, Setiyaningsih H, Aristianti V, Coleman HL, Hidayat MS, Dhanalvin E, et al. Why did informal sector workers stop paying for health insurance in Indonesia? Exploring enrollees’ ability and willingness to pay. PLoS One 2021;16(6):e0252708. https://doi.org/10.1371/journal.pone.0252708
  • 7. Rosida SR. Analysis of access quality and health services on the effectiveness of health insurance system. Miracle Get J 2024;1(3):17-25. https://doi.org/10.69855/mgj.v1i3.62
  • 8. Zainafree I, Maharani C, Syukria N, Patriajati MM, Putri DA, Tsuroyya SL, et al. Deficit and surplus of BPJS kesehatan in the national health insurance program. Kota Semarang Universitas; Negeri Semarang: 2024. p. 1-23 (Indonesian)
  • 9. Rose S. Machine learning for prediction in electronic health data. JAMA Netw Open 2018;1(4):e181404. https://doi.org/10.1001/jamanetworkopen.2018.1404ArticlePubMed
  • 10. Rose S. A machine learning framework for plan payment risk adjustment. Health Serv Res 2016;51(6):2358-2374. https://doi.org/10.1111/1475-6773.12464ArticlePubMedPMC
  • 11. Yang C, Delcher C, Shenkman E, Ranka S. Machine learning approaches for predicting high cost high need patient expenditures in health care. Biomed Eng Online 2018;17(Suppl 1):131. https://doi.org/10.1186/s12938-018-0568-3ArticlePubMedPMC
  • 12. Vimont A, Leleu H, Durand-Zaleski I. Machine learning versus regression modelling in predicting individual healthcare costs from a representative sample of the nationwide claims database in France. Eur J Health Econ 2022;23(2):211-223. https://doi.org/10.1007/s10198-021-01363-4ArticlePubMed
  • 13. Sufriyana H, Wu YW, Su EC. Artificial intelligence-assisted prediction of preeclampsia: development and external validation of a nationwide health insurance dataset of the BPJS Kesehatan in Indonesia. EBioMedicine 2020;54: 102710. https://doi.org/10.1016/j.ebiom.2020.102710ArticlePubMedPMC
  • 14. Wardhana RG, Wang G, Sibuea F. Application of machine learning in predicting disease case levels in Indonesia. J Inf Syst Manag 2023;5(1):40-45. (Indonesian)https://doi.org/10.24076/joism.2023v5i1.1136Article
  • 15. Nuryunarsih D, Herawati L, Badi’ah A, Donsu JD, Okatiranti . Predicting changes in systolic and diastolic blood pressure of hypertensive patients in Indonesia using machine learning. Curr Hypertens Rep 2023;25(11):377-383. https://doi.org/10.1007/s11906-023-01261-5ArticlePubMedPMC
  • 16. Indriany FE, Siregar KN, Purwowiyoto BS, Siswanto BB, Sutedja I, Wijaya HR. Predicting the risk of severity and readmission in patients with heart failure in Indonesia: a machine learning approach. Healthc Inform Res 2024;30(3):253-265. https://doi.org/10.4258/hir.2024.30.3.253ArticlePubMedPMC
  • 17. Wilimitis D, Walsh CG. Practical considerations and applied examples of cross-validation for model development and evaluation in health care: tutorial. JMIR AI 2023;2: e49023. https://doi.org/10.2196/49023ArticlePubMedPMC
  • 18. Kuhn M. Building predictive models in R using the caret package. J Stat Softw 2008;28(5):1-26. https://doi.org/10.18637/jss.v028.i05ArticlePubMedPMC
  • 19. Li Q, Yao X, Échevin D. How good is machine learning in predicting all-cause 30-day hospital readmission? Evidence from administrative data. Value Health 2020;23(10):1307-1315. https://doi.org/10.1016/j.jval.2020.06.009ArticlePubMed
  • 20. Darst BF, Malecki KC, Engelman CD. Using recursive feature elimination in random forest to account for correlated variables in high dimensional data. BMC Genet 2018;19(Suppl 1):65. https://doi.org/10.1186/s12863-018-0633-8ArticlePubMedPMC
  • 21. Choi Y, An J, Ryu S, Kim J. Development and evaluation of machine learning-based high-cost prediction model using health check-up data by the National Health Insurance Service of Korea. Int J Environ Res Public Health 2022;19(20):13672. https://doi.org/10.3390/ijerph192013672ArticlePubMedPMC
  • 22. Huang JC, Tsai YC, Wu PY, Lien YH, Chien CY, Kuo CF, et al. Predictive modeling of blood pressure during hemodialysis: a comparison of linear model, random forest, support vector regression, XGBoost, LASSO regression and ensemble method. Comput Methods Programs Biomed 2020;195: 105536. https://doi.org/10.1016/j.cmpb.2020.105536ArticlePubMed
  • 23. Thongpeth W, Lim A, Kraonual S, Wongpairin A, Thongpeth T. Determinants of hospital costs for management of chronic-disease patients in southern Thailand. J Health Sci Med Res 2021;39(4):313-320. https://doi.org/10.31584/jhsmr.2021787Article
  • 24. Haddadi SJ, Farshidvard A, dos Santos Silva F, dos Reis JC, da Silva Reis M. Customer churn prediction in imbalanced datasets with resampling methods: a comparative study. Expert Syst Appl 2024;246: 123086. https://doi.org/10.1016/j.eswa.2023.123086Article
  • 25. Amin A, Anwar S, Adnan A, Nawaz M, Howard N, Qadir J, et al. Comparing oversampling techniques to handle the class imbalance problem: a customer churn prediction case study. IEEE Access 2016;4: 7940-7957. https://doi.org/10.1109/ACCESS.2016.2619719Article
  • 26. Chen J, Zhao M, Zhou R, Ou W, Yao P. How heavy is the medical expense burden among the older adults and what are the contributing factors? A literature review and problem-based analysis. Front Public Health 2023;11: 1165381. https://doi.org/10.3389/fpubh.2023.1165381ArticlePubMedPMC
  • 27. Madyaningrum E, Chuang YC, Chuang KY. Factors associated with the use of outpatient services among the elderly in Indonesia. BMC Health Serv Res 2018;18(1):707. https://doi.org/10.1186/s12913-018-3512-0ArticlePubMedPMC
  • 28. Lebina L, Kawonga M, Oni T, Kim HY, Alaba OA. The cost and cost implications of implementing the integrated chronic disease management model in South Africa. PLoS One 2020;15(6):e0235429. https://doi.org/10.1371/journal.pone.0235429Article
  • 29. Vetrano DL, Landi F, De Buyser SL, Carfì A, Zuccalà G, Petrovic M, et al. Predictors of length of hospital stay among older adults admitted to acute care wards: a multicentre observational study. Eur J Intern Med 2014;25(1):56-62. https://doi.org/10.1016/j.ejim.2013.08.709ArticlePubMed
  • 30. Lisk R, Uddin M, Parbhoo A, Yeong K, Fluck D, Sharma P, et al. Predictive model of length of stay in hospital among older patients. Aging Clin Exp Res 2019;31(7):993-999. https://doi.org/10.1007/s40520-018-1033-7ArticlePubMedPMC
  • 31. Fan Q, Wang J, Nicholas S, Maitland E. High-cost users: drivers of inpatient healthcare expenditure concentration in urban China. BMC Health Serv Res 2022;22(1):1348. https://doi.org/10.1186/s12913-022-08775-9ArticlePubMedPMC
  • 32. Ministry of Health, Republic of Indonesia. Regulation of the Minister of Health No 27 of 2014 on technical guidelines for the Indonesian Case-Based Groups (INA-CBGs) system. Jakarta: Ministry of Health, Republic of Indonesia; 2014. (Indonesian)
  • 33. Balducci F, Di Rosa M, Roller-Wirnsberger R, Wirnsberger G, Mattace-Raso F, Tap L, et al. Healthcare costs in relation to kidney function among older people: the SCOPE study. Eur Geriatr Med 2025;16(1):135-148. https://doi.org/10.1007/s41999-024-01086-8ArticlePubMedPMC
  • 34. Zhao X, Wang Y, Li J, Liu W, Yang Y, Qiao Y, et al. A machine-learning-derived online prediction model for depression risk in COPD patients: a retrospective cohort study from CHARLS. J Affect Disord 2025;377: 284-293. https://doi.org/10.1016/j.jad.2025.02.063ArticlePubMed
  • 35. Rosenberg MA, Farrell PM. Predictive modeling of costs for a chronic disease with acute high-cost episodes. North Am Actuar J 2008;12(1):1-19. https://doi.org/10.1080/10920277.2008.10597497Article
  • 36. Idris H, Afni N. Inpatient care utilization among elderly in Indonesia: a crosssectional study from Indonesia Family Life Survey. Indones J Public Health 2023;18: 242-252. https://doi.org/10.20473/Ijph.v18i2.2023.242-252Article
  • 37. Nguyen AT, Aris IM, Snyder BD, Harris MB, Kang JD, Murray M, et al. Musculoskeletal health: an ecological study assessing disease burden and research funding. Lancet Reg Health Am 2024;29: 100661. https://doi.org/10.1016/j.lana.2023.100661ArticlePubMedPMC
  • 38. Ministry of Health, Republic of Indonesia. Regulation of the Minister of Health No 3 of 2020 on hospital classification and licensing. Jakarta: Ministry of Health, Republic of Indonesia; 2020. (Indonesian)
  • 39. President of the Republic of Indonesia. Presidential Regulation No. 64 of 2020 on the second amendment to Presidential Regulation No 82 of 2018 concerning national health insurance. Jakarta: President of the Republic of Indonesia; 2020. (Indonesian)
  • 40. Ministry of Health, Republic of Indonesia. Regulation of the Minister of Health No 52 of 2016 on standard tariffs for health services under the national health insurance program. Jakarta: Ministry of Health, Republic of Indonesia; 2016. (Indonesian)

Figure 1. Illustration of the sampling process.
Figure 2. Sample selection process and machine learning workflow.
Figure 3. (A) Variable importance across machine learning models (B) XGBoost, (C) random forest, and (D) linear regression.