INTRODUCTION
Diabetes mellitus is a metabolic disorder characterized by chronically elevated blood glucose levels due to insufficient insulin secretion or insulin resistance. According to the World Health Organization, inadequate long-term blood glucose control can lead to severe complications, affecting the nervous system, kidneys, and cardiovascular system [1]. In South Korea, diabetes is considered a significant public health issue. While the incidence of diabetes among individuals aged 40 years and older has decreased, with a drop of approximately 0.1% annually from 2006 to 2015, national cohort studies show a concerning rise in diabetes prevalence among young adults aged 20-39 years. Specifically, the rate increased from 0.5 to 0.7 per 1,000 individuals in the 20-29 age group and from 2.0 to 2.6 per 1,000 in the 30-39 age group, highlighting the need for increased societal attention [2].
The prevalence of prediabetes, a condition in which blood sugar levels are higher than normal but not yet high enough to be classified as diabetes, has also been rapidly increasing among young adults in South Korea [3]. Individuals with prediabetes are at high risk of progressing to diabetes. Early intervention during this stage, including lifestyle modifications, can effectively prevent the onset of diabetes. Therefore, accurately predicting prediabetes in individuals is crucial for public health strategies aimed at diabetes prevention [4].
Diabetes is one of the chronic diseases caused by lifestyle factors, and the primary reason for the increased risk of diabetes in young adults can be attributed to lifestyle changes that lead to obesity [5]. During the coronavirus disease 2019 pandemic, social distancing and the increase in remote work resulted in decreased physical activity, accompanied by mental health issues such as depression and anxiety. These factors contributed to unhealthy eating habits and the proliferation of irregular diets, which in turn increased the rate of obesity [6,7]. Therefore, assessing nutritional status can provide valuable information for predicting diabetes or prediabetes.
However, accurately assessing dietary intake is challenging, and methodologies such as the 24-hour dietary recall are useful for comprehensively evaluating nutrient intake [8]. The 24-hour dietary recall requires participants to meticulously record all foods and beverages consumed in the past 24 hours, making it a reliable tool for precisely assessing habitual dietary patterns and nutrient intake [9]. The method enables the collection and analysis of important data related to an individual's diet, including carbohydrates, sugars, fiber, and protein, to inform preventive interventions for diabetes [10]. Dietary intake data obtained through 24-hour dietary recalls can thus provide valuable information for predicting individuals with prediabetes.
Previous studies have primarily focused on causal analyses and meta-analyses that explore the relationship between specific nutrients and diabetes [11]. While these approaches are advantageous for evaluating the effects of single variables, they may yield lower prediction accuracy in complex data environments. By contrast, prediction modeling using machine learning accounts for the interactions among multiple variables and employs algorithms capable of efficiently addressing non-linearity and high dimensionality. This enables machine learning models to maintain high predictive accuracy even with complex datasets, capturing intricate patterns that might be overlooked in causal analysis to support better decision-making [12]. Since diabetes can be influenced by a combination of factors beyond nutrition, it is essential to use machine learning models that can integrate and analyze these diverse variables.
Therefore, integrating various dietary intake data with individual characteristics and applying machine learning algorithms can enable high-accuracy predictions for identifying individuals with prediabetes. This study aims to verify a model for the early prediction of prediabetes in young adults by utilizing dietary intake data, which is easier to collect than blood tests. This model will provide a basis for early interventions for improving the health management of young adults.
Purpose
This study aims to verify the effectiveness of machine learning models for predicting individuals with prediabetes using dietary intake data from young adults in Korea and to propose the most effective prediction model.
DISCUSSION
Based on the key results, the following discussion is provided.
In the total, training, and test datasets, the proportion of individuals with prediabetes or higher was consistently 14%-15%. This study excluded individuals who did not participate in the dietary intake survey, did not undergo blood tests, or did not respond, which somewhat limits the generalizability of the findings to the entire Korean population. Nevertheless, the Korean Diabetes Association has also reported that the prevalence of prediabetes is increasing among younger adults, highlighting the need for management as the duration of diabetes may be prolonged [3]. This indicates that various prediction methods are necessary for prevention and early intervention.
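A stable positive-class proportion across the total, training, and test datasets is what a stratified split would produce. The text does not state how the split was made, so the following is only a minimal sketch under that assumption; the function name `stratified_split`, the 30% test fraction, and the toy labels are hypothetical and are not taken from the study.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Split indices into train/test so each class keeps a similar share.

    Hypothetical helper for illustration; not the study's procedure.
    """
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Take the same fraction of every class for the test set.
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


def positive_rate(labels, idx):
    """Share of positive (label 1) cases among the given indices."""
    return sum(labels[i] for i in idx) / len(idx)


# Toy labels (assumed): 15% positives, 1 = prediabetes or higher.
labels = [1] * 150 + [0] * 850
train_idx, test_idx = stratified_split(labels)
print(positive_rate(labels, train_idx), positive_rate(labels, test_idx))
```

Because the same fraction is drawn from each class, the positive rate in both subsets stays close to that of the full dataset, which matters when the positive class is a minority.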
In this study, dietary intake survey data were applied to three machine learning models (logistic regression, kNN, and random forest) to predict individuals with prediabetes or higher. AUC is used to assess how well a model distinguishes between individuals with prediabetes and normal individuals, with values closer to 1 indicating better model performance. Precision represents the proportion of true positive results out of all positive predictions, while recall measures the proportion of actual positive cases that were correctly identified. These metrics help evaluate a model’s prediction accuracy and sensitivity [16,17]. Among the models, random forest achieved the highest performance, with a classification accuracy of 86%, indicating that the model can accurately distinguish between prediabetic and normal individuals. The precision of the random forest model was 85%, meaning it had a low rate of false positives, and the recall was 84%, indicating a well-balanced trade-off between precision and recall [16,17].
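The definitions of accuracy, precision, and recall given above can be made concrete with a short calculation. The following is a minimal sketch using only the Python standard library; the function name `classification_metrics` and the toy labels are illustrative assumptions and do not reproduce the study's data or results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Precision: share of positive predictions that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were identified.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Toy example (assumed data): 1 = prediabetes or higher, 0 = normal.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case one positive is missed (lowering recall) and one normal individual is flagged (lowering precision), while accuracy remains comparatively high because most cases are negative.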
The logistic regression model showed a classification accuracy of .83, slightly lower than that of random forest, with a precision of .77 and a recall of .71. This suggests that logistic regression has a higher rate of false positives and may miss some individuals with prediabetes. The kNN model demonstrated a classification accuracy of .84, with a precision of .77 and a recall of .84. While its recall is similar to that of random forest, the lower precision suggests that the kNN model may generate more false positives, potentially leading to unnecessary treatments or interventions [16,17]. These results indicate that random forest offers the best balance among the three models and is the most reliable tool for predicting prediabetes in young adults. By using this model, early intervention can be achieved, reducing unnecessary treatments and diagnostic errors.
In terms of AUC, which was employed to evaluate how well the model distinguishes between individuals with prediabetes and normal individuals [18], the logistic regression model demonstrated the highest value. An AUC value close to 1 indicates excellent classification performance, whereas a value near 0.5 indicates classification no better than chance [19]. Because the logistic regression model's AUC exceeded 0.5 and was the highest of the three, this suggests that it discriminates between the two groups more consistently than the other models. Although the logistic regression model showed somewhat lower precision and recall in the prediction validation, its overall accuracy was still acceptable. Taken together, the results across models suggest that the random forest model may handle the complexity of the data better, while both the logistic regression and kNN models maintain adequate predictive performance, suggesting their potential applicability in specific contexts.
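One way to read the AUC is as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name `roc_auc` and the example scores are hypothetical and are not taken from the study.

```python
def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0]
auc_perfect = roc_auc(y_true, [0.9, 0.8, 0.2, 0.1])  # positives always higher
auc_chance = roc_auc(y_true, [0.5, 0.5, 0.5, 0.5])   # scores carry no signal
auc_mixed = roc_auc(y_true, [0.9, 0.3, 0.4, 0.1])    # one pair misordered
print(auc_perfect, auc_chance, auc_mixed)
```

The second case shows why 0.5 is the reference point: a model whose scores carry no information about the outcome orders positive and negative cases correctly only half the time.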
Meanwhile, the random forest model has the advantage of handling complex interactions between various variables [20], making it suitable for managing complicated variables such as dietary intake data. However, to precisely understand the model's prediction patterns, it is necessary to refer to the variable importance determined by the Gini index [21]. Among such variables, age, followed by thiamine and water, contributed the most to predicting individuals with prediabetes or higher.
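Gini-based variable importance in tree ensembles is commonly built from the decrease in Gini impurity achieved each time a variable is used to split a node, accumulated over the trees. The following sketch shows that building block for a single split on binary labels; the function names, thresholds, and toy values are assumptions for illustration only and do not represent the study's fitted model.

```python
def gini_impurity(labels):
    """Gini impurity of a set of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # proportion of positive labels
    return 1.0 - p ** 2 - (1.0 - p) ** 2


def gini_decrease(values, labels, threshold):
    """Impurity decrease from splitting on `values <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    # Child impurities are weighted by the share of samples they receive.
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


# Toy data (assumed): labels 1 = prediabetes or higher.
labels = [0, 0, 0, 1, 1, 1]
age = [20, 22, 25, 30, 33, 35]   # separates the classes cleanly
other = [1, 2, 1, 2, 1, 2]       # only weakly related to the labels

dec_age = gini_decrease(age, labels, threshold=27)
dec_other = gini_decrease(other, labels, threshold=1)
print(dec_age, dec_other)
```

A variable that yields larger impurity decreases, as the toy `age` variable does here, receives a higher importance score. Note that such scores rank a variable's contribution to the splits and do not indicate the direction of its association with the outcome.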
Due to the nature of machine learning analysis, it is difficult to determine the precise directionality of the variables. However, based on previous studies, it can be inferred that as individuals advance in age, insulin resistance increases and metabolic rate decreases, increasing the likelihood of progressing to prediabetes [22]. In particular, adults aged 19-35 are in a phase of life in which multiple lifestyle changes often occur simultaneously [23], possibly leading to increased insulin resistance and difficulty in blood sugar regulation. These factors likely explain why age emerged as an important variable in the machine learning model. Additionally, thiamine plays a crucial role in carbohydrate metabolism by acting as a coenzyme in the conversion of glucose to energy. It supports insulin secretion and helps reduce oxidative stress and inflammation, which are exacerbated by hyperglycemia and contribute to diabetic complications. Research has shown that thiamine enhances hexokinase activity, increases insulin secretion, and restores normal enzyme function in the liver and kidneys of diabetic models, thereby alleviating pathological abnormalities associated with diabetes [24]. Therefore, a deficiency in thiamine may result in impaired insulin secretion, leading to problems in glucose metabolism and potentially accelerating the progression to prediabetes. This likely explains why thiamine intake emerged as an important variable in this study. Similarly, adequate water intake is known to help in maintaining stable blood sugar levels by diluting glucose concentrations in the bloodstream [25], which may account for its significance as a key variable in the machine learning model.
By considering the connection between physiological mechanisms and individual characteristics, as well as dietary intake patterns, blood sugar regulation can be approached from a physiological perspective based on the consumption of specific nutrients. Although the contribution scores of carbohydrate and sugar intake were low in this study, previous studies have shown that these nutrients rapidly increase blood glucose levels. However, various nutrients such as fats and dietary fibers interact in a complex manner to influence blood glucose regulation [26]. The random forest model, which employs an ensemble learning method with multiple decision trees to learn dietary patterns, is therefore expected to reflect these multidimensional interactions effectively and provide more accurate predictions of prediabetes. This study highlights the utility of machine learning models for predicting prediabetes in young adults using dietary intake data, which are relatively easy to collect. According to previous studies, while older adults may develop prediabetes due to age-related metabolic changes, younger adults are primarily influenced by lifestyle factors such as diet and physical activity [22,23]. The results of this study emphasize the importance of addressing these factors early in life to prevent the progression to diabetes. This finding has significant implications for public health policy development, as it enables early intervention for the prevention of chronic diseases among young adults, a group often overlooked in healthcare. Key variables identified in this study, such as age, thiamine, and water, play a critical role in predicting prediabetes and should be incorporated into future predictive models. By properly utilizing these variables, the predictive performance of the models can be further enhanced, allowing for more accurate predictions of prediabetes and thereby strengthening early intervention and prevention strategies.
Overall, this study contributes to the theoretical development of disease prevention and health management strategies by validating a model for predicting prediabetes. It also provides a foundation for future research by analyzing multidimensional variable interactions, offering new insights into diabetes prediction in young adults. Practically, this study offers foundational data that nurses can use as a tool for early intervention and personalized health education, helping to improve patient health outcomes effectively.
However, the relatively lower performance of the random forest model in terms of AUC indicates that caution is needed when considering its clinical application. Future research should aim to enhance the model's reliability and conduct further validation. While the 24-hour dietary recall method provides detailed nutritional data, it has the limitation of relying on participants' memory, which may reduce accuracy [27]. Therefore, future studies could improve the model by incorporating additional data, including psychological factors such as stress and depression, as well as physical activity and lifestyle factors that influence diabetes risk. Moreover, this study focused exclusively on young adults aged 19 to 35, limiting the generalizability of the findings to other age groups. As such, care must be taken when applying the results to the prediction of prediabetes across different age ranges.
CONCLUSION
In this study, we compared machine learning models to predict prediabetes in young adults in Korea using dietary intake data from the KNHANES. Among the models evaluated, the random forest model demonstrated the highest predictive accuracy and precision compared with logistic regression and kNN models. The key predictive variables were age, thiamine intake, and water intake, which contributed significantly to the model's performance. Based on these results, we confirmed the feasibility of predicting prediabetes in young adults and provided foundational data for early intervention in diabetes prevention. However, given the need for improvement in certain performance metrics, such as AUC, we recommend that future studies incorporate additional variables, such as physical activity, lifestyle factors, and psychological variables, to enhance the predictive power of the models.