Comparison of spatial prediction models from Machine Learning of cholangiocarcinoma incidence in Thailand

Open access
Published: 7 June 2025
BMC Public Health

Volume 25, Article number: 2137 (2025)

Abstract

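Background

Cholangiocarcinoma (CCA) poses a major public health challenge in Thailand, where its incidence is high.

This study aimed to compare the performance of spatial prediction models built with machine learning techniques for analyzing the occurrence of CCA across Thailand.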

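Methods

This retrospective cohort study analyzed CCA cases from four population-based cancer registries in Thailand that were diagnosed between January 1, 2012, and December 31, 2021. Machine learning models (linear regression, random forest, neural network, and extreme gradient boosting (XGBoost)) were used to predict the age-standardized rate (ASR) from spatial variables. Model performance was evaluated using root mean square error (RMSE) and R² with 70:30 train-test validation.

Results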

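The study included 6,379 CCA cases, predominantly male (4,075 cases; 63.9%), with a mean age of 66.2 years (standard deviation = 11.1 years).

The northeastern region accounted for most cases (3,898 cases; 61.1%). The overall ASR of CCA was 8.9 per 100,000 person-years (95% CI: 8.7 to 9.2), with the highest incidence in the northeastern region (ASR = 13.4 per 100,000 person-years; 95% CI: 12.9 to 13.8). For the whole dataset, the random forest model showed the best predictive performance in both the training (R² = 72.07%) and testing (R² = 71.66%) datasets. Model performance varied by region: random forest performed best in the northern and northeastern regions, while XGBoost performed well in the central and southern regions. The most important spatial predictors of CCA were elevation and distance from water sources.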

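Conclusions

The random forest model was the most effective at predicting CCA incidence in Thailand, although predictive performance varied across regions. Spatial factors effectively predicted the ASR of CCA, providing valuable insights for national-level disease surveillance and targeted public health interventions. These findings support the use of spatial epidemiology and machine learning techniques to develop region-specific approaches to CCA control.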


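Background

Cholangiocarcinoma (CCA), a malignant tumor arising from the biliary epithelium, represents a public health challenge in Thailand [1]. Although relatively rare globally, CCA has an exceptionally high incidence in Southeast Asia, with northeastern Thailand reporting the highest rate worldwide (85 per 100,000 person-years) [2]. This striking geographic disparity is attributed mainly to Opisthorchis viverrini (O. viverrini) infection, although other risk factors, such as hepatolithiasis, primary sclerosing cholangitis, praziquantel treatment for O. viverrini [3, 4, 5, 6], and hepatitis B and C [7, 8, 9], also contribute to the disease burden [8]. The distinctive distribution of CCA in Thailand underscores the need for sophisticated spatial analysis to better understand and address this critical health problem.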

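Spatial epidemiology plays a crucial role in elucidating disease patterns and their underlying causes [10]. In the context of CCA in Thailand, spatial prediction models offer valuable insights into the complex interplay of environmental factors (particularly proximity to water sources) [11, 12, 13] and biological factors that influence disease distribution. These models can substantially improve the allocation of resources for screening and treatment, enable targeted public health interventions, and deepen our understanding of risk factors [14].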

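Traditional statistical methods in spatial epidemiology, while valuable, often struggle to capture the complex non-linear relationships and interactions among the many environmental, demographic, and social factors that influence disease distribution. In recent years, machine learning has emerged as a powerful alternative for analyzing complex patterns in health data [15]. Unlike traditional statistical methods, machine learning algorithms can identify complex non-linear relationships without a pre-specified model structure, which makes them particularly well suited to spatial epidemiology studies, where relationships between variables can be complex and multifaceted [16, 17].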

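Machine learning models have demonstrated several advantages over traditional methods in spatial epidemiology. They offer superior predictive accuracy when analyzing complex, non-linear relationships in health data [18]. Algorithms such as random forests and neural networks can integrate diverse data sources, including satellite imagery, census data, and environmental measurements, to produce more comprehensive spatial predictions [19, 20]. These techniques also excel at handling large datasets with many variables and can identify patterns that traditional statistical methods may miss [21].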

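Our study aligns with Thailand's National Artificial Intelligence Strategy (NAIS) Action Plan for 2022–2027, which comprises five strategies aimed at national development through artificial intelligence (AI) applications. Strategy 4 focuses on using AI to advance intelligent technology systems in order to create novel methods of computational learning and reasoning. The NAIS Action Plan applies these intelligent systems across sectors and supports research on a national AI as a Service (AIaaS) platform [22]. Previous studies of CCA in Thailand have been limited to short-term retrospective analyses covering only selected provinces or regions, without a nationwide assessment.

In addition, most existing studies have relied on traditional statistical prediction methods rather than advanced machine learning techniques. This study therefore aimed to compare the performance of spatial prediction models using machine learning methods to analyze CCA occurrence across Thailand. By conducting a comprehensive spatial analysis focused on demographic, environmental, and climatic variables, we can identify high-risk areas and potential contributing factors. This mapping can inform local public health strategies and provide valuable recommendations for CCA management and prevention, while contributing to the broader field of spatial epidemiology and machine learning applications in public health.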

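Materials and methods

Data collection and study area

This retrospective cohort analytical study examined 554 sub-districts across four regions of Thailand (northern, central, northeastern, and southern). We collected data from two main sources: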

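CCA case data

Information came from four population-based cancer registries (PBCRs): northern (Lampang Cancer Hospital), central (Lop Buri Cancer Hospital), northeastern (Khon Kaen Provincial Cancer Registry), and southern (Surat Thani Cancer Hospital) [23]. All CCA cases were diagnosed between January 1, 2012, and December 31, 2021, based on the International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3), with the specific codes C22.1 (Intrahepatic bile duct), C24.0 (Extrahepatic bile duct), C24.8 (Overlapping lesion of biliary tract), and C24.9 (Biliary tract, NOS), excluding C24.1 (Ampulla of Vater) [24, 25]. Key variables included sex, age at diagnosis, date of birth, ICD-O-3 code, address, and basis of diagnosis. Population data from the Office of the National Economic and Social Development Board [26], broken down by sex and five-year age group, were used to calculate age-standardized rates (ASRs) for 2012 to 2021 (Table 1).

Table 1 Description of the dependent and independent variables used in the study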

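Spatial variables

First, environmental data (elevation, water source coordinates, and the size and extent of each area) were obtained from the Central Geo-Informatics System and Services Project, Department of Water Resources, Ministry of Natural Resources and Environment [27]. Second, meteorological data were obtained through the Meteorological Department's statistics data request system [28]. All spatial variables were aggregated at the sub-district level (Table 1).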

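Study area

The study covered four provinces representing the four main regions of Thailand, which differ in size and geographic coordinates (latitude and longitude) [23]:

(i) Lampang Province (northern): 12,533.96 km², 17.2° to 19.5°N, 98.9° to 100.2°E;
(ii) Lop Buri Province (central): 6,208.70 km², 14.6° to 15.8°N, 100.3° to 101.5°E;
(iii) Khon Kaen Province (northeastern): 10,885.99 km², 15.6° to 17.1°N, 101.6° to 103.3°E;
(iv) Surat Thani Province (southern): 12,891.4 km², 8.3° to 10.2°N, 98.5° to 100.2°E.

Variables and measurement

Abbreviations: ASR, age-standardized rate; CCA, cholangiocarcinoma; IACR, International Association of Cancer Registries.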

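Statistical analysis

Incidence of CCA

ASRs were calculated for each sex and standardized using the Segi world standard population [29]. International Association of Cancer Registries (IACR) guidelines [30] were used to calculate the ASR of CCA cases in each sub-district.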
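As an illustration only, direct age standardization can be sketched as follows. The function name, the three age bands, and the weights are hypothetical placeholders chosen to keep the arithmetic easy to follow; they are not the Segi weights and not the study data.

```python
def age_standardized_rate(cases, person_years, std_weights, per=100_000):
    """Direct standardization: weighted mean of age-specific rates.

    cases[i] and person_years[i] describe age group i in the study
    population; std_weights[i] is the standard population weight for
    the same age group.
    """
    if not (len(cases) == len(person_years) == len(std_weights)):
        raise ValueError("all inputs need one entry per age group")
    weighted_sum = 0.0
    for d, py, w in zip(cases, person_years, std_weights):
        if py <= 0:
            raise ValueError("person-years must be positive")
        weighted_sum += (d / py) * w  # age-specific rate times weight
    return weighted_sum / sum(std_weights) * per


# Hypothetical three-band example (illustrative numbers only).
cases = [2, 10, 30]
person_years = [50_000, 40_000, 20_000]
weights = [60, 30, 10]
asr = age_standardized_rate(cases, person_years, weights)
print(round(asr, 1))  # ASR per 100,000 person-years
```

In practice the weights would be the full set of five-year age-group weights from the chosen standard population, applied separately to each sex and sub-district.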

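Machine learning models

We implemented four different machine learning models to predict the incidence of CCA from spatial variables. During data management, the residential address code served as the key identifier linking the CCA case data to all spatial factors. Before analysis, the distribution of every variable was tested. Where the data showed a skewed distribution (left or right skew), the affected variables were log-transformed before being passed to the machine learning models. Each model represents a different approach to predictive modeling, providing a comprehensive comparison of techniques applicable to spatial epidemiology: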
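The skewness check and log transformation described above could look roughly like the sketch below. The skewness threshold of 1.0 and the use of `log1p` (which tolerates zeros) are assumptions made for illustration; the text does not state which test or cut-off was used.

```python
import math


def sample_skewness(values):
    """Moment-based skewness (third central moment / variance^1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def log_transform_if_skewed(values, threshold=1.0):
    """Return (values, applied); log1p is applied only when |skew| > threshold.

    The threshold is a hypothetical choice; negative inputs are left untouched
    because the logarithm is undefined for them.
    """
    if abs(sample_skewness(values)) > threshold and min(values) >= 0:
        return [math.log1p(v) for v in values], True
    return list(values), False


# Hypothetical population densities with one extreme sub-district.
density = [10, 12, 15, 20, 25, 30, 2000]
transformed, applied = log_transform_if_skewed(density)
print(applied, round(sample_skewness(density), 2),
      round(sample_skewness(transformed), 2))
```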

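Linear regression

A statistical model that examines the linear relationship between a dependent variable (ASR of CCA) and multiple independent variables (spatial factors). We chose this model as the baseline for comparison because it represents the traditional statistical approach and assumes linear relationships between variables.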

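Random forest

An ensemble learning method that builds multiple decision trees during training and outputs the average prediction of the individual trees. Random forests are well suited to spatial epidemiology because they can capture non-linear relationships, handle interactions between variables, and resist overfitting. The algorithm builds diverse decision trees from bootstrap samples of observations and variables, with each tree contributing a vote to the final prediction [31]. The random forest model was configured with the following specifications: number of trees = 500; variables randomly sampled at each split (mtry) = 2; minimum node size = 5; and the Gini criterion as the splitting criterion.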

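Neural network

A computational model inspired by the neural structure of the human brain, designed to recognize complex patterns through interconnected nodes (neurons). Neural networks process information through three main components: an input layer (receiving the spatial variables), hidden layers (processing information through weighted connections), and an output layer (producing the CCA incidence prediction). This structure allows neural networks to model highly complex, non-linear relationships between spatial factors and disease incidence [32, 33]. The neural network used a 5-15-10 architecture, with ReLU activation for the hidden layers and linear activation for the output layer. Training used the Adam optimizer with L2 regularization (weight decay = 0.0001), a batch size of 32, and 200 epochs, with early stopping to optimize performance.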

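Extreme gradient boosting (XGBoost)

An advanced implementation of gradient boosting that builds models sequentially, with each new model correcting the errors of the previous ones. XGBoost has three key components: (i) a loss function that evaluates model accuracy, (ii) weak learners (typically decision trees) that perform only slightly better than random guessing, and (iii) an additive model that combines the weak learners into a strong predictive system. XGBoost includes regularization techniques that prevent overfitting, which is potentially valuable for spatial prediction with limited data [34]. The XGBoost model used a learning rate of 0.05, a maximum tree depth of 6, and a minimum child weight of 3. The subsample and column sample ratios were both set to 0.8. For regularization, the alpha and lambda parameters were set to 0.2 and 0.1, respectively. The model was trained for 1,000 boosting rounds with an early stopping mechanism to prevent overfitting and optimize performance.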
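To make the sequential, error-correcting idea concrete, the sketch below implements plain gradient boosting for squared error with one-split decision stumps on a single predictor. It is a teaching example, not XGBoost: it omits the regularization terms, subsampling, deeper trees, and early stopping described above, and the data are hypothetical. Only the learning rate of 0.05 is borrowed from the text.

```python
def fit_stump(x, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm


def boost(x, y, n_rounds=200, learning_rate=0.05):
    """Additive model: mean + learning_rate * sum of stumps fit to residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # current errors
        stump = fit_stump(x, residuals)                   # weak learner
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


# Hypothetical non-linear relationship between one predictor and a rate.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [xi ** 2 for xi in x]
model = boost(x, y)
print(round(model(8), 1))  # fitted value at the largest predictor value
```

Each round fits a stump to what the current ensemble still gets wrong, so the training error shrinks gradually; the small learning rate is why many rounds are needed.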

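Model training and validation

For model development and evaluation, we randomly split the dataset into training (70%) and testing (30%) subsets. This ratio was chosen to balance the need for sufficient training data against the need, given our sample size constraints, for enough testing data to evaluate performance reliably. The 70:30 split is widely used in machine learning applications and offers a good compromise between these competing needs.

Although we considered alternative split ratios (80:20 and 90:10) by evaluating them with all of the models analyzed, our preliminary analysis indicated that the 70:30 split offered the best balance between model learning and validation for a dataset of this size. With 554 sub-districts in our study, this split provided 388 sub-districts (4,465 cases) for training and 166 sub-districts (1,914 cases) for testing, which is sufficient for robust model training and meaningful validation without overfitting.

Table 1 illustrates our complete research methodology from data collection to model evaluation. The process began with collecting CCA case data from the four regional cancer registries and spatial data from government databases. After preprocessing (including calculating ASR values and standardizing the spatial variables), we implemented a 70:30 random split stratified by region to maintain proportional representation. Each model was trained using the same training data and hyperparameter optimization techniques, then evaluated on a common test set using RMSE and R², together with visual assessment through scatter plots.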
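A region-stratified 70:30 split can be sketched as below. The function name, the seed, and the toy sub-district identifiers are hypothetical; the only details taken from the text are the 70% training fraction and stratification by region.

```python
import random
from collections import defaultdict


def stratified_split(units, region_of, train_fraction=0.7, seed=42):
    """Shuffle units within each region, then cut each region at 70%."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_region = defaultdict(list)
    for unit in units:
        by_region[region_of(unit)].append(unit)
    train, test = [], []
    for region in sorted(by_region):
        members = by_region[region][:]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Hypothetical sub-districts: 10 per region, labelled (id, region).
regions = ("northern", "central", "northeastern", "southern")
units = [(f"{region}-{i}", region) for region in regions for i in range(10)]
train, test = stratified_split(units, region_of=lambda u: u[1])
print(len(train), len(test))
```

Splitting within each region keeps every region represented in both subsets in roughly the same proportion as in the full dataset, which matters when one region contributes far fewer sub-districts than the others.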

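Model evaluation

We implemented a comprehensive evaluation framework using three complementary methods to ensure a robust assessment of model performance: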

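Root mean square error (RMSE)

RMSE quantifies prediction error in the same units as the dependent variable and gives greater weight to large errors, which matters in health applications where large errors can have serious consequences. The metric is the square root of the mean squared difference between predicted and actual CCA incidence values: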

$$ \mathrm{RMSE} = \sqrt{\frac{\sum (\text{predicted} - \text{actual})^{2}}{n}} $$

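Lower RMSE values indicate better model performance, with smaller prediction errors. We chose RMSE over alternative metrics such as mean absolute error (MAE) because squaring gives greater weight to large errors, which is particularly valuable in health applications where large prediction errors could substantially affect resource allocation and intervention planning. This sensitivity to outliers helps identify models that perform well on average but make large errors in certain regions or incidence ranges.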
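The difference between RMSE and MAE can be seen in a small worked example. The numbers are hypothetical: two sets of predictions have the same total absolute error, but one concentrates it in a single sub-district.

```python
import math


def rmse(predicted, actual):
    """Square root of the mean squared prediction error."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(actual))


def mae(predicted, actual):
    """Mean absolute prediction error, shown for comparison."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


actual = [10, 10, 10, 10]      # hypothetical observed ASRs
pred_even = [11, 11, 11, 11]   # every prediction is off by 1
pred_spiky = [10, 10, 10, 14]  # one prediction is off by 4

print(mae(pred_even, actual), mae(pred_spiky, actual))    # 1.0 1.0
print(rmse(pred_even, actual), rmse(pred_spiky, actual))  # 1.0 2.0
```

MAE cannot distinguish the two cases, whereas RMSE doubles for the model that makes one large error.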

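R-squared (R²)

This coefficient of determination measures the proportion of variance in the dependent variable (ASR of CCA) that is explained by the model's independent variables (spatial factors). By providing a straightforward scale from 0 to 1, it allows us to gauge the proportion of variance in CCA incidence that is explained and facilitates meaningful comparison with previous research findings:

$$ R^{2} = 1 - \frac{\text{sum of squared residuals}}{\text{total sum of squares}} $$

R² values range from 0 to 1, with values closer to 1 indicating that the model explains a larger proportion of the variance in CCA incidence and therefore predicts better. For each model, we calculated 95% confidence intervals for R² using bootstrap resampling with 1,000 iterations, to quantify the uncertainty in our performance estimates and allow more rigorous statistical comparison between models.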
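A minimal sketch of R² with a bootstrap confidence interval follows. The percentile method, the seed, and the toy data are assumptions; the text specifies only 1,000 bootstrap iterations and a 95% interval.

```python
import random


def r_squared(predicted, actual):
    """1 - (sum of squared residuals / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for p, a in zip(predicted, actual))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def bootstrap_r2_ci(predicted, actual, n_boot=1000, alpha=0.05, seed=1):
    """Percentile bootstrap interval: resample (prediction, actual) pairs."""
    rng = random.Random(seed)
    n = len(actual)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        a = [actual[i] for i in idx]
        p = [predicted[i] for i in idx]
        if len(set(a)) < 2:  # total sum of squares would be zero
            continue
        stats.append(r_squared(p, a))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


actual = [1, 2, 3, 4, 5]                # hypothetical observed values
predicted = [1.1, 1.9, 3.2, 3.8, 5.1]   # hypothetical predictions
r2 = r_squared(predicted, actual)
low, high = bootstrap_r2_ci(predicted, actual)
print(round(r2, 3))
```

Note that when R² is computed this way on held-out data, a model that predicts worse than the mean of the observations yields a negative value.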

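Scatter plots

We created scatter plots to visualize the relationship between predicted and actual CCA incidence values for each model. These visual representations serve several analytical purposes:

Identifying patterns in prediction accuracy across different incidence levels.
Revealing potential systematic bias (for example, consistent over-prediction in high-incidence areas).
Detecting heterogeneity in prediction errors.
Identifying regional clusters or outliers that may need special attention.

We enhanced these scatter plots with a 45-degree reference line representing perfect prediction, a regression line showing the actual trend, and color coding by region to enable deeper visual analysis of model performance.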

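After comprehensively comparing the models, we used the best-performing model (random forest) to conduct a variable importance analysis identifying the key spatial predictors of CCA incidence. This analysis quantifies the mean decrease in prediction accuracy when the information in each variable is removed from the model while all other variables are held constant. The method involved:

1. Training the best random forest model on the full dataset
2. Permuting each predictor variable one at a time (effectively removing its information while maintaining the same data structure)
3. Measuring the decrease in prediction accuracy
4. Ranking variables by their effect on model performance

This permutation-based approach has advantages over alternative variable importance methods because it directly measures the effect on the model's predictive performance rather than changes in node purity, providing more interpretable results that relate directly to our prediction objective.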
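The four steps above can be sketched for any fitted model that exposes a prediction function. In the example below the "model" is a hypothetical fixed formula standing in for a trained random forest, the data are randomly generated, and importance is expressed as the mean increase in RMSE after permutation, which is one of several ways to express a decrease in accuracy.

```python
import math
import random


def rmse(predicted, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in RMSE when one column at a time is shuffled."""
    rng = random.Random(seed)
    baseline = rmse([predict(row) for row in X], y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between column j and y
            X_perm = [row[:j] + [column[i]] + row[j + 1:]
                      for i, row in enumerate(X)]
            permuted = rmse([predict(row) for row in X_perm], y)
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical stand-in for a fitted model: strong effect of column 0,
# weak effect of column 1, no effect of column 2.
def predict(row):
    return 3.0 * row[0] + 0.1 * row[1]


data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(3)] for _ in range(50)]
y = [predict(row) for row in X]
importance = permutation_importance(predict, X, y)
ranking = sorted(range(3), key=lambda j: importance[j], reverse=True)
print(ranking)  # column indices, most important first
```

Because the model itself is never refitted, the procedure is inexpensive and can be applied unchanged to any of the four model types.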

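For model implementation, we used the random forest, neural network, XGBoost, and stats packages in R. We used R software version 4.2.1 (R Core Team) [35] with RStudio software version 1.4.1 [36]. Spatial data processing used the sf and raster packages, while visualization used ggplot2 with a custom theme for clarity. Statistical validation (including confidence interval calculation) was implemented using resampling.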

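Results

Demographic and spatial characteristics

Of all 6,379 CCA cases, most were male (4,075 cases; 63.9%), and the mean age was 66.2 years (standard deviation = 11.07 years). The northeastern region accounted for the majority of cases (3,898 cases; 61.1%), followed by the northern (1,695 cases; 26.6%), central (624 cases; 9.8%), and southern (162 cases; 2.5%) regions (Table 2).

Table 2 Demographic characteristics of CCA in Thailand between 2012 and 2021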

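CCA incidence and sex

The overall ASR of CCA was 8.9 per 100,000 person-years (95% CI: 8.7 to 9.2), with a considerably higher rate in males (12.5 per 100,000 person-years; 95% CI: 12.1 to 12.9) than in females (5.9 per 100,000 person-years; 95% CI: 5.6 to 6.1).

The northeastern region had the highest incidence for both sexes (ASR = 13.4 per 100,000 person-years; 95% CI: 12.9 to 13.8), followed by the northern (ASR = 11.2 per 100,000 person-years; 95% CI: 10.6 to 11.7), central (ASR = 4.8 per 100,000 person-years), and southern regions (ASR = 1.1 per 100,000 person-years; 95% CI: 0.9 to 1.3) (Table 3).

Table 3 Incidence of CCA by sex in each region of Thailand between 2012 and 2021

Table 4 and Fig. 1 (a to e) show the comparative performance of the four machine learning models across regions. For the whole dataset, the random forest model showed superior performance, with the highest R² values for the training (72.07%) and testing (71.66%) datasets and the lowest RMSE values (training = 8.991, testing = 9.022). The XGBoost model showed the second-best performance overall (training R² = 70.57%, RMSE = 9.719; testing R² = 68.30%, RMSE = 0.904), followed by the neural network (training R² = 57.25%, RMSE = 11.044; testing R² = 56.81%, RMSE = 11.076) and linear regression (training R² = 9.88%, RMSE = 16.034; testing R² = 8.52%, RMSE = 16.078).

Table 4 Machine learning models for predicting CCA in Thailand

Model performance varied considerably across regions. In the northern region, all models achieved higher R² values than in other regions, with random forest showing the best performance (testing R² = 87.30%).

The central region showed moderate performance across models, with random forest again performing best (testing R² = 77.17%). In the northeastern region, random forest maintained the highest performance (testing R² = 76.81%). The southern region showed greater variability and a distinct pattern: XGBoost achieved the highest testing R² (63.04%), markedly outperforming random forest (41.08%) (Table 4). This region also showed the largest gap between training and testing performance across all models, suggesting potential overfitting related to the small sample size in this region (Table 4, Fig. 1e).

Analysis of the scatter plots confirmed these quantitative findings, showing tighter clustering around the diagonal for the random forest and XGBoost models, particularly in the northern and northeastern regions. Linear regression consistently showed poor alignment with the diagonal, especially at higher ASR values, underscoring its inability to capture the non-linear relationships that characterize the spatial epidemiology of CCA.

Variable importance analysis

The variable importance analysis of the random forest model identified elevation as the most important predictor, with a mean decrease in accuracy of 32.4% when it was removed from the model. Distance from water sources ranked second in importance, followed by population density and average temperature. Average rainfall had the least effect on prediction accuracy (Fig. 2).

Fig. 1 Scatter plots comparing predicted and observed CCA rates in Thailand by machine learning model.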

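Discussion

This study compared the predictive performance of various machine learning methods for CCA incidence in Thailand. Our findings reveal complex patterns in model performance across regions and identify key environmental determinants of CCA distribution.

The random forest model showed the highest overall predictive ability (testing R² = 71.66%), consistently outperforming the other methods in most regions. This performance can be attributed to several strengths that make random forests particularly suited to spatial epidemiological data: the ability to capture non-linear relationships, robustness to outliers, the capacity to handle interactions between variables without explicit specification, and an effective balance between model complexity and generalizability. In regions with larger sample sizes, random forest maintained minimal differences between training and testing performance, indicating excellent generalizability. These findings are consistent with previous research by Tsilimigras et al. [16], who demonstrated that random forest models achieved 85% accuracy in predicting CCA phenotypes and patient outcomes. Similarly, Liu et al. [37] found that random forest achieved superior performance (AUC = 0.86) in spatial cancer risk prediction. Our results parallel those of Thongpeth et al. [38], who compared various modeling approaches for healthcare prediction in Thailand and found that random forest consistently outperformed other machine learning methods.

XGBoost performed well overall (testing R² = 68.30%) and performed best in the southern region (testing R² = 63.04%). However, it showed greater inconsistency between training and testing performance, particularly in regions with smaller sample sizes. This pattern points to a limitation of boosting methods, namely potential overfitting, when they are applied to smaller datasets. This finding differs from the results reported by Wu et al. [39], who found that XGBoost achieved the highest accuracy (AUC = 0.892) in predicting CCA outcomes, and by Chaudhary et al. [40], who reported that XGBoost outperformed traditional methods with 89.2% accuracy. The discrepancy may reflect differences between clinical prediction settings (with individual-level variables) and our spatial analysis (using environmental factors aggregated at the sub-district level).

The neural network showed moderate but consistent performance across regions (overall testing R² = 56.81%), with minimal gaps between training and testing metrics. This finding contrasts with Zhang et al. [41], who found that neural networks achieved excellent performance in cancer prediction. However, our results are consistent with the finding of Wang et al. [42] that tree-based models generally outperform neural networks in spatial disease prediction, particularly in settings with complex ecological interactions.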

The marked performance gap between the machine learning models and linear regression (testing R² = 8.52%) confirms that CCA's spatial distribution follows complex, non-linear patterns that cannot be adequately captured by traditional statistical approaches. This finding has important methodological implications for future spatial epidemiology studies, suggesting that machine learning approaches should be preferred for similar complex spatial health phenomena. The dramatic performance differential stems from fundamental advances that machine learning brings to spatial epidemiology: superior ability to capture geographic variations in disease-environment relationships, better handling of spatial dependencies that violate independence assumptions in traditional models, and effective integration of multi-scale spatial interactions without requiring explicit hierarchical modeling.

Our analysis revealed substantial regional variations in both CCA incidence and model performance. All models performed best in the northern region (testing R² up to 87.30%), followed by northeastern (76.81%) and central regions (77.52%), with more modest performance in the southern region (63.04%). These patterns align with findings from Kaewpitoon et al. [13], who observed varying prediction accuracies across different Thai regions using GIS-based analysis. The exceptional performance in the northern region suggests that environmental factors strongly and consistently influence CCA risk in this area. By contrast, the more moderate performance in the southern region, despite using identical predictors, indicates that different etiological factors may be at play or that environmental relationships are more complex in this region. This regional heterogeneity in model performance highlights the importance of region-specific approaches to both disease modeling and public health intervention.

Our variable importance analysis identified elevation as the most significant predictor of CCA incidence, followed by population density, distance from water sources, and average rainfall. Elevation likely serves as a proxy for multiple ecological factors: it influences water flow patterns and drainage characteristics critical for O. viverrini's lifecycle, affects agricultural practices (particularly rice cultivation associated with increased human-water contact), and historically shaped settlement patterns in ways that overlap with endemic zones. The importance of water-related variables aligns with the established understanding of O. viverrini's lifecycle, which requires freshwater environments for transmission through intermediate snail hosts and fish. These environmental determinants help explain the pronounced regional disparities in CCA incidence observed in our study. The northeastern region's high rates (13.4 per 100,000 person-years) [43] correspond with previous studies documenting high O. viverrini prevalence in this area, while the southern region's low rates (1.1 per 100,000 person-years) reflect minimal O. viverrini prevalence. The topographical and hydrological conditions of northeastern Thailand, characterized by low-elevation plateaus with numerous water bodies, create ideal conditions for parasite transmission, which our models effectively captured through environmental predictors [2].

The implementation of machine learning for spatial epidemiology of cancer aligns with similar initiatives in other countries. Qiao et al. [44] successfully implemented machine learning models for cancer prediction in China, achieving accuracy rates of 87.5% using ensemble methods similar to our random forest approach. Similarly, Kim et al. [45] demonstrated the effectiveness of spatial machine learning in predicting cancer patterns in South Korea, with random forest models showing high accuracy (AUC 0.89). The comparable performance of our models (particularly in the northern region with R² = 87.30%) suggests that our methodological approach represents current international best practices in spatial health modeling. However, our southern region results (maximum R² = 63.04%) highlight an important limitation: machine learning approaches remain sensitive to sample size constraints. The southern region's substantially lower CCA incidence resulted in fewer cases for model training, potentially limiting predictive accuracy.

The ability to predict CCA incidence with high accuracy has significant implications for public health planning in Thailand. Our findings can transform CCA control efforts in several crucial ways. Rather than implementing uniform screening programs, health authorities can use our predictive models to identify high-risk communities based on environmental factors. This approach would enable more efficient allocation of screening resources to areas with the highest predicted CCA risk, potentially improving early detection rates in a cost-effective manner. Understanding the relationship between environmental factors and CCA risk can guide targeted interventions addressing specific risk factors. For example, communities in low-elevation areas near water sources might benefit from enhanced water treatment initiatives, while education programs about proper fish cooking practices could be prioritized in areas with high predicted risk. The varying performance of models across regions suggests that a "one-size-fits-all" approach may not be optimal. In the northeastern and northern regions, where environmental factors strongly predict CCA risk, targeted interventions based on spatial risk factors are likely to be effective. The southern region, with its distinct epidemiological profile, may require different approaches.

Given the importance of elevation and water-related variables, climate change could potentially alter CCA risk patterns through changes in precipitation, temperature, and water body characteristics. Rising temperatures and changing precipitation patterns could shift the geographic distribution of suitable habitats for O. viverrini's intermediate hosts. Our predictive framework provides a baseline for modeling future scenarios under different climate projections. The translation of our findings into practical public health applications aligns with Thailand's National Artificial Intelligence Strategy (NAIS) Action Plan for 2022–2027 [22]. Our machine learning approach to disease prediction represents a concrete implementation of Strategy 4, which focuses on developing intelligent technologies to address national challenges. By demonstrating the superior performance of advanced analytical methods over traditional approaches, we provide evidence-based support for broader adoption of AI-driven approaches in public health planning across Thailand.

Our study features several methodological strengths that enhance the reliability and applicability of our findings. The comprehensive inclusion of 6,379 CCA cases from four population-based cancer registries provides a robust epidemiological foundation rarely achieved in spatial modeling studies. Our comparative evaluation of multiple machine learning approaches offers methodological insights beyond single-model studies, while the variable importance analysis provides a quantitative hierarchy of environmental determinants that advances epidemiological understanding beyond traditional association studies.

Nevertheless, several limitations warrant acknowledgment. Despite our large overall sample, the regional distribution was uneven, with relatively few cases in the southern region (162 cases; 2.5%). This imbalance likely contributed to the lower model performance in that region and highlights a common challenge in modeling rare diseases across heterogeneous geographies. Our analysis relied on spatial variables available at the sub-district level, potentially missing finer-scale variations that could influence local CCA risk patterns. While our models effectively captured spatial patterns, they did not incorporate temporal dynamics of CCA development, a significant consideration given the often decades-long lag between O. viverrini exposure and cancer development.

Future research should address these limitations through several approaches. Incorporating village-level socioeconomic indicators, local food consumption patterns, and sanitation infrastructure data could enhance prediction accuracy by capturing behavioral determinants of O. viverrini exposure. Studies by Songserm et al. [6] suggest these factors might explain 15–20% of the variance currently not captured by environmental variables alone. Developing models that incorporate both spatial patterns and temporal trends could provide insights into how CCA incidence evolves over time in response to environmental changes and public health interventions. Using more detailed environmental data, including high-resolution remote sensing of surface water characteristics and land use patterns, could improve prediction accuracy by better capturing habitat suitability for intermediate hosts. Building on our identified environmental predictors, future research should model how climate change might alter CCA risk distribution through changes in temperature, precipitation, and hydrological patterns. The methodological advances demonstrated in our study, particularly the superior performance of machine learning approaches compared to traditional statistical methods, should inform future spatial epidemiology research for other environmentally mediated diseases in Thailand and beyond.

Conclusions

This study found that CCA cases in Thailand were most numerous in the northeastern region, followed by the northern, central, and southern regions. In the comparison of predictive models for CCA incidence using R² and RMSE, the random forest model emerged as the most effective approach (testing R² = 71.66%), followed by the XGBoost model (68.30%) and the neural network model (56.81%). The best-performing machine learning model differed between regions, which highlights the complexity of cholangiocarcinoma epidemiology across different parts of Thailand. Furthermore, spatial factors demonstrated predictive capability for the ASR of CCA. This national analysis is an early description of the CCA distribution across Thailand and provides a spatial-based approach to support disease control. The research presented in this paper also points to opportunities for examining additional geographical variables in future studies.

Data availability

The population-based cancer registry data analyzed in this study, which include confidential personal information from four Thai regions, cannot be shared publicly due to privacy regulations. The data are available from the corresponding author upon reasonable request from interested researchers.

References

1. Banales JM, Marin JJG, Lamarca A, Rodrigues PM, Khan SA, Roberts LR, et al. Cholangiocarcinoma 2020: the next horizon in mechanisms and management. Nat Rev Gastroenterol Hepatol. 2020;17(9):557–88. https://doi.org/10.1038/s41575-020-0310-z.
2. Sriamporn S, Pisani P, Pipitgool V, Suwanrungruang K, Kamsa-ard S, Parkin DM. Prevalence of Opisthorchis viverrini infection and incidence of cholangiocarcinoma in Khon Kaen, Northeast Thailand. Trop Med Int Health. 2004;9(5):588–94. https://doi.org/10.1111/j.1365-3156.2004.01234.x.
3. Kamsa-Ard S, Luvira V, Pugkhem A, Luvira V, Thinkhamrop B, Suwanrungruang K, et al. Association between praziquantel treatment and cholangiocarcinoma: a hospital-based matched case-control study. BMC Cancer. 2015;15:776. https://doi.org/10.1186/s12885-015-1788-6.
4. Shin H-R, Oh J-K, Masuyer E, Curado M-P, Bouvard V, Fang Y-Y, et al. Epidemiology of cholangiocarcinoma: an update focusing on risk factors. Cancer Sci. 2010;101:579–85. https://doi.org/10.1111/j.1349-7006.2009.01458.x.
5. Honjo S, Srivatanakul P, Sriplung H, Kikukawa H, Hanai S, Uchida K, et al. Genetic and environmental determinants of risk for cholangiocarcinoma via Opisthorchis viverrini in a densely infested area in Nakhon Phanom, northeast Thailand. Int J Cancer. 2005;117(5):854–60. https://doi.org/10.1002/ijc.21146.
6. Songserm N, Promthet S, Sithithaworn P, Pientong C, Ekalaksananan T, Chopjitt P, et al. Risk factors for cholangiocarcinoma in high-risk area of Thailand: role of lifestyle, diet and methylenetetrahydrofolate reductase polymorphisms. Cancer Epidemiol. 2012;36(2):e89-94. https://doi.org/10.1016/j.canep.2011.11.007.
7. Sripa B, Pairojkul C. Cholangiocarcinoma: lessons from Thailand. Curr Opin Gastroenterol. 2008;24:349–56. https://doi.org/10.1097/MOG.0b013e3282fbf9b3.
8. Sripa B, Kaewkes S, Sithithaworn P, Mairiang E, Laha T, Smout M, et al. Liver Fluke Induces Cholangiocarcinoma. PLoS Med. 2007;4(7):e201. https://doi.org/10.1371/journal.pmed.0040201.
9. Promthet S, Kamsa-Ard S, Vatanasapt P, Wiangnon S, Suwanrungruang K, Poomphakwaen K. Risk factors for cancers: a cohort study in Khon Kaen, Northeast Thailand. Office of the Health Promotion Foundation under the health information system development plan. 2010.
10. Kirby RS, Delmelle E, Eberth JM. Advances in spatial epidemiology and geographic information systems. Ann Epidemiol. 2017;27(1):1–9. https://doi.org/10.1016/j.annepidem.2016.12.001.
11. Mungthanee T. The Thai-Laos Culture and The Solution of Liver Fluke and Cholangiocarcinoma: A Case Study in The Middle Songkhram River Basin. AJHSS BUU. 2019;27(55):60–82.
12. Tamngam P, Pamulila S, Sarakum N, Inpang S. Knowledge, Attitude, and Consumption Behavior Associated with Cholangiocarcinoma in a Sub-District, Warinchamrab District, Ubon Ratchathani Province. J Sci Tech UBU. 2019;21(3):74–85.
13. Kaewpitoon SJ, Rujirakul R, Joosiri A, Jantakate S, Sangkudloa A, Kaewthani S, et al. GIS Database and Google Map of the Population at Risk of Cholangiocarcinoma in Mueang Yang District, Nakhon Ratchasima Province of Thailand. Asian Pac J Cancer Prev. 2016;17(3):1293–7. https://doi.org/10.7314/apjcp.2016.17.3.1293.
14. Blangiardo M, Cameletti M, Baio G, Rue H. Spatial and spatio-temporal models with R-INLA. Spat Spatiotemporal Epidemiol. 2013;7:39–55. https://doi.org/10.1016/j.sste.2013.07.003.
15. Haghbin H, Aziz M. Artificial intelligence and cholangiocarcinoma: Updates and prospects. World J Clin Oncol. 2022;13(2):125–34. https://doi.org/10.5306/wjco.v13.i2.125.
16. Tsilimigras DI, Hyer JM, Paredes AZ, Diaz A, Moris D, Guglielmi A, et al. A novel classification of intrahepatic cholangiocarcinoma phenotypes using machine learning techniques: an international multi-institutional analysis. Ann Surg Oncol. 2020;27(13):5224–32. https://doi.org/10.1245/s10434-020-08646-9.
17. Tsilimigras DI, Mehta R, Pawlik TM. ASO author reflections: use of machine learning to identify patients with intrahepatic cholangiocarcinoma who could benefit more from neoadjuvant therapies. Ann Surg Oncol. 2020;27(4):1120–1. https://doi.org/10.1245/s10434-020-08263-6.
18. Wiens J, Shenoy ES. Machine learning for healthcare: on the verge of a major shift in healthcare epidemiology. Clin Infect Dis. 2018;66(1):149–53. https://doi.org/10.1093/cid/cix731.
19. Grebovic M, Filipovic L, Katnic I, Vukotic M, Popovic T. Machine learning models for statistical analysis. Int Arab J Inf Technol. 2023;20(3A):505–14. https://doi.org/10.34028/iajit/20/3A/17.
20. Dhillon SK, Ganggayah MD, Sinnadurai S, Lio P, Taib NA. Theory and practice of integrating machine learning and conventional statistics in medical data analysis. Diagnostics. 2022;12(10):2526. https://doi.org/10.3390/diagnostics12102526.
21. Saha S, Moorthi S, Wu X, Wang J, Nadiga S, Tripp P, et al. The NCEP climate forecast system version 2. J Clim. 2014;27(6):2185–208. https://doi.org/10.1175/JCLI-D-12-00823.1.
22. Ministry of Higher Education, Science, Research and Innovation and Ministry of Digital Economy and Society. The National Artificial Intelligence Action Plan with a vision for the development of Thailand 2022–2027 [Online]. 2022. Available from: https://ai.in.th/wp-content/uploads/2022/12/20220726-AI.pdf. [in Thai]. Cited 2024 Sep 12.
23. Rojanamatin J, Ukranum W, Supaattagorn P, Chiawiriyabunya I, Wongsena M, Chaiwerawattana A, et al. Cancer in Thailand Vol. X, 2016–2018. Bangkok: Medical Record and Databased Cancer Unit; 2021. [in Thai].
24. WHO. WHO | International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3). 2013. Available from: http://www.who.int/classifications/icd/adaptations/oncology/en/. Cited 2023 Aug 15.
25. World Health Organization. International classification of diseases for oncology (ICD-O), 3rd ed., 1st revision. 2013. Available from: https://www.who.int/standards/classifications/other-classifications/international-classification-of-diseases-for-oncology. Cited 2023 Aug 21.
26. Office of the National Economic and Social Development Board. Population Projections for Thailand 2010–2040. Bangkok, Thailand: Office of the National Economic and Social Development Board; 2013.
27. Department of Water Resources, Ministry of Natural Resources and Environment. Central Geo-Informatics System and Services Project. 2022. Available from: https://webgis.dwr.go.th/. Cited 2023 Sep 9.
28. Information Technology Center, Meteorological Department. Meteorological statistics data request submission system. 2023. Available from: https://data-service.tmd.go.th/index.php. Cited 2023 Sep 13.
29. Bray F, Colombet M, Aitken JF, Bardot A, Eser S, Galceran J, et al. Cancer Incidence in Five Continents, Vol. XII (IARC CancerBase No. 19). Lyon: International Agency for Research on Cancer. 2023. Available from: https://ci5.iarc.who.int. Cited 2024 Feb 28.
30. Boyle P, Parkin DM. Cancer registration: principles and methods. Statistical methods for registries. IARC Sci Publ. 1991;95:126–58.
31. Breiman L. Random forests. Mach Learn. 2001;45:5–32. https://doi.org/10.1023/A:1010933404324.
32. Krose B, Smagt PVD. An introduction to neural networks. The University of Amsterdam; 1996.
33. McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys. 1943;5:115–33. https://doi.org/10.1007/BF02478259.
34. Chen T, Guestrin C, editors. Xgboost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining; 2016. https://doi.org/10.1145/2939672.2939785.
35. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria; 2023. Available from: https://www.R-project.org/. Cited 2024 Jan 12.
36. Posit team. RStudio: Integrated Development Environment for R. Posit Software, PBC. Boston; 2023. Available from: http://www.posit.co/. Cited 2024 Jan 12.
37. Liu Y, Wu J, Liu M, Xu K, Guo Y, Wu J. Spatial epidemiology and machine learning methods for risk assessment of digestive tract cancers. Int J Environ Res Public Health. 2020;17(11):3828. https://doi.org/10.3390/ijerph17113828.
38. Thongpeth W, Lim A, Wongpairin A, Thongpeth T, Chaimontree S. Comparison of linear, penalized linear and machine learning models predicting hospital visit costs from chronic disease in Thailand. Inform Med Unlocked. 2021;26:100769.
39. Wu L, Zhou B, Yan C, Li M, Liu T, Zhu Q, et al. A deep learning model to predict survival outcomes in intrahepatic cholangiocarcinoma using histopathological images. Theranostics. 2021;11(15):7537–50. https://doi.org/10.7150/thno.59879.
40. Chaudhary K, Poirion OB, Lu L, Garmire LX. Deep learning-based multi-omics integration robustly predicts survival in liver cancer. Clin Cancer Res. 2018;24(6):1248–59. https://doi.org/10.1158/1078-0432.CCR-17-0853.
41. Zhang R, Xu J, Wang Y, Lu H, Miao Z, Han Z. Development and validation of a machine learning model for predicting illness trajectory and hospital resource utilization of patients with cholangiocarcinoma. JMIR Med Inform. 2021;9(4):e26586. https://doi.org/10.2196/26586.
42. Wang S, Yang DM, Rong R, Zhan X, Xiao G. Pathology image analysis using segmentation deep learning algorithms. Am J Pathol. 2019;189(9):1686–98. https://doi.org/10.1016/j.ajpath.2019.05.007.
43. Khuntikeo N, Loilome W, Thinkhamrop B, Chamadol N, Yongvanit P. A comprehensive public health conceptual framework and strategy to effectively combat Cholangiocarcinoma in Thailand. PLoS Negl Trop Dis. 2016;10(1):e0004293. https://doi.org/10.1371/journal.pntd.0004293.
44. Qiao Z, Sun N, Li X, Xia E, Li S, Chang Y. Using machine learning approaches for emergency room visit prediction based on electronic health record data. Stud Health Technol Inform. 2018;247:111–5. https://doi.org/10.3233/978-1-61499-852-5-111.
45. Kim BJ, Kim JH, Kim HS, Choi YH. Machine learning application for prediction of cholangiocarcinoma in patients with primary sclerosing cholangitis. J Hepatol. 2021;74(3):567–74. https://doi.org/10.1016/j.jhep.2020.10.038.

Acknowledgements

The authors thank all members of the four Population-based Cancer Registries, namely Lampang Cancer Hospital, Lop Buri Cancer Hospital, Khon Kaen Provincial Cancer Registry (Srinagarind Hospital, Faculty of Medicine, Khon Kaen University), and Surat Thani Cancer Hospital.

Ulster University is acknowledged for supporting the writing of this research paper during Oraya Sahat's visit to Belfast Campus from January 12 to February 12, 2025.

Author information

Authors and Affiliations

Student of Doctor of Public Health Program, Faculty of Public Health, Khon Kaen University, Khon Kaen, Thailand

Oraya Sahat
Department of Epidemiology and Biostatistics, Faculty of Public Health, Khon Kaen University, Khon Kaen, Thailand
Supot Kamsa-ard & Siriporn Kamsa-ard
Department of Mathematics and Computer Science, Faculty of Science and Technology, Prince of Songkla University, Pattani Campus, Pattani, Thailand
Apiradee Lim
School of Computing, Ulster University, Northern Ireland, Belfast Campus, Belfast, BT15 1AP, UK
Matias Garcia-Constantino & Idongesit Ekerete
Contributions

SK1 served as the corresponding author of this study.

OS was the principal author and collaborated with all authors to develop the study conception and design. AL conducted the data analysis with support from SK1 and SK2. MGC reviewed the methodology and results, and IE reviewed and improved the language of the article. All authors reviewed previous versions of the manuscript and approved the final version.

Corresponding author

Correspondence to Supot Kamsa-ard.

Ethics declarations

Ethics approval and consent to participate

This study utilized secondary data from four PBCRs, which did not involve the collection of individuals' identifying information. Therefore, individual informed consent was not required. This study received ethical approval from the Human Research Ethics Committees of all four data sources: Lampang Cancer Hospital (No. 10/2567), Lop Buri Cancer Hospital (No. LEC 6647), Khon Kaen University, where the consideration of human research ethics is in accordance with the Helsinki Declaration (No. HE671027), and Surat Thani Cancer Hospital (No. SCH_EC_01/2567).

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

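Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.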

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Sahat, O., Kamsa-ard, S., Lim, A. et al. Comparison of spatial prediction models from Machine Learning of cholangiocarcinoma incidence in Thailand. BMC Public Health 25, 2137 (2025). https://doi.org/10.1186/s12889-025-23119-y

Received: 11 February 2025
Accepted: 9 May 2025
Published: 7 June 2025
DOI: https://doi.org/10.1186/s12889-025-23119-y