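Our experimental setup began with the acquisition of multi-parametric preoperative scans from several modalities, including T1-weighted (T1W), contrast-enhanced T1-weighted (T1C), fluid-attenuated inversion recovery (FLAIR), and T2-weighted (T2W) images (Fig. 1). These scans come from the BRATS-19 dataset, provided by the Center for Biomedical Image Computing and Analytics (CBICA) at the University of Pennsylvania [42, 43, 44]. The dataset comprises MRI scans of patients diagnosed with GBM and low-grade glioma (LGG), accompanied by publicly accessible genomic and other clinical data available through platforms such as The Cancer Genome Atlas (TCGA) [35] and the Clinical Proteomic Tumor Analysis Consortium (CPTAC) [45]. CBICA provides a filename mapping that links the supplied scans to the patient identifiers used in the TCGA and CPTAC portals, which makes it possible to obtain genetic and clinical information from these alternative sources. The signaling pathways of the relevant patients were collected from cBioPortal. This platform offers interactive access to genomic profiles across different datasets and hosts the signaling pathway datasets associated with those profiles. Because no API is available for accessing the pathway datasets, the pathway data were extracted with a web scraper. The extraction covered the GBM TCGA PanCancer Atlas (study ID: gbm_tcga_pan_can_atlas_2018), GBM CPTAC (study ID: gbm_cptac_2021), and Brain Lower Grade Glioma TCGA PanCancer Atlas (study ID: lgg_tcga_pan_can_atlas_2018) datasets. Where certain pathways were unavailable in these datasets, they were obtained from the TCGA Firehose Legacy datasets (study IDs: gbm_tcga, lgg_tcga).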
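The linking step described above can be sketched as a simple lookup. This is a minimal illustration only: the scan names, patient identifiers, and pathway values are hypothetical placeholders, and the function name `link_scans_to_pathways` is not taken from the study.

```python
# All identifiers and values below are hypothetical placeholders.
name_mapping = {
    "BraTS19_scan_A": "TCGA-XX-0001",
    "BraTS19_scan_B": "TCGA-XX-0002",
    "BraTS19_scan_C": "CPTAC-YY-0003",
}

# Pathway alteration status per patient, as it might look after extraction
# (1 = altered, 0 = not altered).
pathway_status = {
    "TCGA-XX-0001": {"TP53": 1, "PI3K": 0},
    "TCGA-XX-0002": {"TP53": 0, "PI3K": 1},
}


def link_scans_to_pathways(mapping, status):
    """Return {scan_name: pathway_dict} plus the scans with no pathway record."""
    linked, missing = {}, []
    for scan, patient in mapping.items():
        if patient in status:
            linked[scan] = status[patient]
        else:
            missing.append(scan)
    return linked, missing


linked, missing = link_scans_to_pathways(name_mapping, pathway_status)
```

Scans that end up in `missing` are the ones for which a fallback source would have to be consulted.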
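A set of radiomic features was extracted from the four MRI modalities using the segmentation masks provided for each subject. The extraction used PyRadiomics [52], whose feature definitions comply with the Image Biomarker Standardisation Initiative (IBSI) [53]. IBSI standardizes feature definitions and provides reference values for validating radiomics software, which improves reproducibility and facilitates the clinical translation of radiomics research. The extracted feature set was standardized to approximate a normal distribution and normalized to the [0, 1] range. However, the large number of extracted features (> 1000) may lead to the curse of dimensionality [54]. Dimensionality reduction and selection of the most relevant features were therefore carried out through feature engineering on the resulting feature panel.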
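The two scaling steps mentioned above can be illustrated on a single feature column. This is a minimal sketch in plain Python; the feature values are hypothetical, and the study's actual implementation is not specified in the text.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score: subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical values of one radiomic feature across five subjects.
feature = [12.0, 15.0, 9.0, 30.0, 18.0]
z = standardize(feature)      # zero mean, unit variance
scaled = min_max(z)           # values in [0, 1]
```

Note that z-scoring changes the location and scale of a feature but not the shape of its distribution, and that min-max scaling applied after z-scoring gives the same result as min-max scaling applied to the raw values.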
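Five supervised ML classification models were selected and trained on the resulting feature sets to predict five signaling pathways. The selected algorithms were the logistic regression classifier (LRC), support vector machine (SVM), random forest classifier (RFC), AdaBoost classifier (ABC), and k-nearest neighbor classifier (KNN). To improve predictive accuracy, we applied grid-search hyperparameter tuning to identify the best hyperparameter configuration for each algorithm. The models underwent 5-fold cross-validation to obtain a more precise and less biased performance estimate. The models with the best hyperparameter settings were then evaluated on a separate, unseen test set. Several evaluation metrics were used to identify the best-performing algorithm for each signaling pathway.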
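The combination of grid search and 5-fold cross-validation described above can be sketched without external dependencies. The example below tunes a single hyperparameter (the number of neighbors of a toy k-nearest-neighbor classifier) on synthetic data. The data, the grid, and all function names are illustrative assumptions, not the study's configuration.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and deal them into n_folds nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0


def cv_accuracy(x, y, k, folds):
    """Mean accuracy over the folds, each fold serving once as validation set."""
    scores = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        hits = sum(knn_predict(train_x, train_y, x[i], k) == y[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / len(scores)


def grid_search(x, y, k_grid, n_folds=5):
    """Return (best_k, best_score) over the candidate values of k."""
    folds = k_fold_indices(len(x), n_folds)
    results = {k: cv_accuracy(x, y, k, folds) for k in k_grid}
    best_k = max(results, key=results.get)
    return best_k, results[best_k]


# Hypothetical two-feature toy data: class 0 near (0, 0), class 1 near (1, 1).
rng = random.Random(42)
x = [(rng.gauss(c, 0.2), rng.gauss(c, 0.2)) for c in (0, 1) for _ in range(20)]
y = [c for c in (0, 1) for _ in range(20)]

best_k, best_score = grid_search(x, y, k_grid=[1, 3, 5, 7])
```

Every candidate value is scored on the same five folds, so the comparison between configurations is not affected by differences in the split.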
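Multi-institutional preoperative MRI scans of GBM and LGG patients from TCGA (n = 167) and CPTAC (n = 19) are available in the BRATS-19 dataset released by CBICA. In addition to the T1, T2, contrast-enhanced T1, and FLAIR 3D MRI volumes, the dataset includes segmentation labels produced by experienced neuroradiologists. These labels annotate the enhancing tumor (ET), the necrotic and non-enhancing tumor (NET), and the peritumoral edema. The scans had already undergone several preprocessing steps, including skull stripping, co-registration, and interpolation to a resolution of 1 mm³.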
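The nine oncogenic signaling pathways listed in Table 1 were extracted from cBioPortal. However, two pathways, NRF2 and TGF-β, showed no alterations and were therefore excluded from further analysis. As shown in Fig. 2, the distribution of pathway alterations is imbalanced, with alterations being either very frequent or very scarce. When the majority and minority classes differ substantially in size, ML algorithms tend to bias their predictions toward the majority class, leading to skewed results [56]. Such an algorithm may reach high accuracy and still perform poorly on other metrics such as sensitivity (recall) or F1 score. The synthetic minority over-sampling technique (SMOTE) performs well in many settings but may not handle severe imbalance effectively. To address this challenge, the five oncogenic signaling pathways with the least imbalance (< 30%) were selected for model training. These are the PI3K, TP53, RTK-RAS, WNT, and NOTCH signaling pathways, which have been shown to affect GBM significantly [29, 30, 31, 32, 33, 34, 36, 37, 38, 39, 40, 41].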
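The core idea behind SMOTE, mentioned above, is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbors. The following is a simplified sketch of that idea in plain Python, not the implementation used in the study. The sample values, the neighbor count `k`, and the function name are assumptions.

```python
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class samples with two features each.
minority = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.10), (0.30, 0.30)]
new_points = smote(minority, n_synthetic=6)
```

Because every synthetic point lies on a segment between two existing minority samples, the method cannot add information beyond what those samples already span, which is one reason it struggles when the minority class is very small.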
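The RTK-RAS, PI3K, NOTCH, TP53, and WNT signaling pathways are interconnected through their interactions and cross-regulation, and they play an important role in the development and progression of GBM [29, 30, 31, 32, 33, 34, 36, 37, 38, 39, 40, 41]. Recognizing these signaling pathways and understanding the crosstalk among them is essential for managing existing treatments and advancing targeted therapies. The relationships among the signaling pathways across subjects are depicted in the UpSet plot in Fig. 3.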
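A comprehensive radiomic feature set of 1284 features, derived from 107 standard features that follow the IBSI standard, was obtained from the four imaging sequences (T1, T2, T1C, FLAIR) and their corresponding segmentation masks. The extracted features include first-order, volumetric, and intensity-based texture features, grouped into the first-order, shape, GLCM, GLDM, GLRLM, GLSZM, and NGTDM categories. All 107 features are listed in Table 2.
Table 2. Comprehensive list of features extracted from the MRI scans for the categories shape, first-order, GLCM, GLDM, GLRLM, GLSZM, and NGTDM.
Dimensionality reduction and feature selection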
The top 10 features selected for each pathway are shown in Table 3.
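The text does not state which ranking criterion produced the top 10 features. As one possible illustration, the sketch below ranks features by the absolute Pearson correlation with the binary pathway label and keeps the strongest k. The criterion, the feature names, and the values are all assumptions made for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def top_k_features(features, labels, k):
    """Rank features by |correlation| with the label and keep the k strongest."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], labels)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical feature columns (names are placeholders) for six subjects.
features = {
    "feature_a": [0.1, 0.2, 0.1, 0.9, 0.8, 0.9],  # tracks the label closely
    "feature_b": [0.5, 0.4, 0.6, 0.5, 0.4, 0.6],  # unrelated to the label
    "feature_c": [0.9, 0.7, 0.8, 0.3, 0.2, 0.1],  # inversely related
}
labels = [0, 0, 0, 1, 1, 1]

selected = top_k_features(features, labels, k=2)
```

Using the absolute value keeps features that are strongly but inversely related to the label, such as `feature_c` here.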
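After preprocessing, feature engineering, and data splitting, the final cohort consisted of 167 subjects from TCGA (GBM = 98, LGG = 69) and 19 subjects from CPTAC. Setting aside the 19 CPTAC subjects for validation would have been the natural choice. However, because of the severe imbalance of signaling pathways in the CPTAC dataset, the experiments were conducted on three separate datasets, created by distributing subjects from both cohorts across the training and validation sets (Fig. 4).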
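Given \(T\) and \(C\) as the two sets representing TCGA-GBM and CPTAC, and \(\delta\) as the function that identifies the class of a set, the validation set was generated using the following equation.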
$$
\mathrm{valSet} =
\begin{cases}
C \cup \left\{ t : t \in T_{maj} \text{ and } |t| = |C_{maj}| - |C_{min}| \right\}, & \text{if } \delta(T_{maj}) \ne \delta(C_{maj}) \\
C \cup \left\{ t : t \in T_{min} \text{ and } |t| = 3 \right\}, & \text{if } \delta(T_{maj}) = \delta(C_{maj}) \text{ and } \dfrac{|T_{min}|}{|T \cup C|} \le 0.11 \\
C \cup \left\{ t : t \in T_{min} \text{ and } |t| = 5 \right\}, & \text{if } \delta(T_{maj}) = \delta(C_{maj}) \text{ and } \dfrac{|T_{min}|}{|T \cup C|} > 0.11
\end{cases}
$$
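One reading of the selection rule above can be sketched as a small procedure. Several points are assumptions of this sketch rather than statements from the text: that the CPTAC subjects are part of the validation set in every case, that the first case applies when the majority classes of the two cohorts differ, that the ratio is taken over all subjects of both cohorts, and that the TCGA subjects are drawn at random. The cohort data and function names are hypothetical.

```python
import random
from collections import Counter


def split_by_class(cohort):
    """Split a {subject: label} dict into (majority_label, majority, minority)."""
    majority_label = Counter(cohort.values()).most_common(1)[0][0]
    majority = sorted(s for s, lab in cohort.items() if lab == majority_label)
    minority = sorted(s for s, lab in cohort.items() if lab != majority_label)
    return majority_label, majority, minority


def build_validation_set(tcga, cptac, threshold=0.11, seed=0):
    """Return the CPTAC subjects plus a random draw of TCGA subjects."""
    rng = random.Random(seed)
    t_label, t_maj, t_min = split_by_class(tcga)
    c_label, c_maj, c_min = split_by_class(cptac)
    if t_label != c_label:
        # Majority classes differ: top up CPTAC's minority class from TCGA's majority.
        extra = rng.sample(t_maj, len(c_maj) - len(c_min))
    else:
        # Same majority class: add a few TCGA minority subjects (3 or 5).
        share = len(t_min) / (len(tcga) + len(cptac))
        extra = rng.sample(t_min, 3 if share <= threshold else 5)
    return sorted(cptac) + extra


# Hypothetical cohorts: label 1 = pathway altered, 0 = not altered.
tcga = {f"T{i:02d}": (1 if i < 14 else 0) for i in range(20)}  # 14 altered, 6 not
cptac = {f"C{i:02d}": (0 if i < 5 else 1) for i in range(6)}   # 5 not altered, 1 altered

val_set = build_validation_set(tcga, cptac)
labels = {**tcga, **cptac}
altered = sum(labels[s] for s in val_set)
```

In this toy case the two cohorts have opposite majority classes, so four altered TCGA subjects are added and the validation set ends up with five subjects in each class.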
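Five ML algorithms were trained on a range of radiomic features extracted from the segmentation labels of GBM MRI scans to detect five key oncogenic signaling pathways. The models were trained with five-fold cross-validation to assess the generalizability of the approach. The mean 5-fold cross-validation results of each model on all three datasets are listed in Table 4 and visualized in Fig. 5. In addition to accuracy, the ROC-AUC score was chosen to give a more comprehensive assessment of model performance. The mean results reported here were obtained with the best parameters found by thorough hyperparameter tuning using grid search. The exact hyperparameter configurations behind these results are summarized in Table 5.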
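To show why ROC-AUC complements accuracy, the sketch below computes AUC directly from its pairwise definition: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities for eight subjects.
labels = [1, 1, 1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.5, 0.3]

auc = roc_auc(labels, scores)

# A model that gives every subject the same score still reaches 75% accuracy
# here by always predicting the majority class, but its AUC is only 0.5.
constant_auc = roc_auc(labels, [0.5] * len(labels))
majority_accuracy = sum(labels) / len(labels)
```

The pairwise form is quadratic in the number of samples, which is acceptable for cohorts of this size.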
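The best-performing model for each signaling pathway was further validated on the held-out test set to assess its generalizability. The metrics below were chosen to capture overall predictive ability, to show performance on imbalanced data, and to expose the trade-off between correct predictions and misclassifications. With moderate to severe class imbalance, accuracy may not be the most suitable metric for a classification problem. In such cases, precision, recall (sensitivity), and F1 score give a more precise assessment of model performance. A comparative chart of the algorithms based on accuracy and F1 score is shown in Fig. 6.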
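The metrics named above all follow from the four cells of the confusion matrix. The sketch below computes them and applies them to a degenerate classifier that labels every sample as positive. The nine-sample class split is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, specificity and F1 from binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    def ratio(num, den):
        # Report 0.0 when a metric is undefined (empty denominator).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical test set: seven positives, two negatives, all predicted positive.
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1] * 9

metrics = classification_metrics(y_true, y_pred)
```

In this case recall is 1.0 and accuracy is 7/9, yet specificity is 0.0 because no negative is ever identified, which is the kind of failure that accuracy alone hides.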
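Understanding the interconnectivity of signaling pathways such as RTK-RAS, PI3K, NOTCH, and TP53 is essential for understanding how they influence the development and progression of GBM. Detecting these pathways and their crosstalk is critical for managing existing treatments and advancing targeted therapies. Interesting associations can also be seen in our data and are shown in the UpSet plot in Fig. 3. Isolated alterations of PI3K, RTK-RAS, or TP53 occurred in no more than 3 cases each, whereas the NOTCH pathway was altered in isolation in 9 cases and the WNT pathway in 16 cases. In contrast, PI3K, RTK-RAS, and TP53 frequently co-occurred in various combinations. For example, the exclusive combination of RTK-RAS and TP53 was found in 31 cases, TP53 with PI3K only in 14 cases, and all three together in 30 cases. The role of these three pathways in gliomagenesis is well known [35, 57, 58]. Beyond the co-occurrence of these established pathways, pathways known to induce p53-mediated apoptosis, such as NOTCH, were found alongside the TP53 pathway in 35 cases. The interplay between these pathways has been extensively studied and documented [33, 59]. Notably, all four pathways were present across the total of 167 cases.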
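The exclusive intersection counts that an UpSet plot displays can be derived by counting subjects per exact combination of altered pathways. The sketch below shows this on hypothetical subjects; it does not reproduce the counts reported above.

```python
from collections import Counter

# Hypothetical subjects mapped to their altered pathways.
subject_pathways = {
    "subject_01": {"TP53", "RTK-RAS"},
    "subject_02": {"TP53", "RTK-RAS"},
    "subject_03": {"TP53", "PI3K", "RTK-RAS"},
    "subject_04": {"NOTCH"},
    "subject_05": {"WNT"},
    "subject_06": {"TP53", "PI3K"},
    "subject_07": set(),  # no altered pathway, not counted
}


def exclusive_intersections(pathways_by_subject):
    """Count subjects per exact combination of altered pathways."""
    return Counter(frozenset(p) for p in pathways_by_subject.values() if p)


counts = exclusive_intersections(subject_pathways)

# Subjects in which TP53 is altered, regardless of the other pathways.
tp53_total = sum(n for combo, n in counts.items() if "TP53" in combo)
```

Each subject contributes to exactly one combination, so the exclusive counts sum to the number of subjects with at least one alteration.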
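The differences in performance observed across ML algorithms reflect the distinct challenges posed by each pathway and by dataset size. Ensemble methods, particularly the random forest, performed consistently across scenarios, which suggests their potential as reliable baseline models.
The TP53 pathway, known for its function as a tumor suppressor, yielded intriguing findings. On the over_split dataset, the ABC algorithm showed notable accuracy, precision, and F1 score, indicating its efficacy in detecting this pathway. However, performance dropped sharply on the under_split and under_split_pure datasets, where most algorithms showed unusually low precision and recall, possibly because the test set was very small and comprised only three samples. Conversely, cross-validation accuracy on these two datasets remained consistently above 0.70, except for LRC.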
The RTK-RAS pathway, characterized by its intricate network of interactions, displayed diverse performance across the datasets. On the over_split dataset, the SVC exhibited balanced performance. However, all algorithms except KNN failed to identify any true negatives and classified all nine samples as positive, which yields 100% recall but zero specificity. This contrasts with the cross-validation outcomes on the training set and suggests that the models overfit and did not generalize successfully.
The PI3K pathway, which plays a crucial role in cell growth and survival, exhibited relatively consistent performance trends on the under_split_pure and over_split datasets. On over_split, the RFC achieved consistent results across all metrics, reflecting its robustness. On the under_split dataset, however, none of the algorithms detected any true negatives, most likely because of class imbalance. Interestingly, the RFC excelled on the under_split_pure dataset, with all 9 cases correctly detected.
The NOTCH signaling pathway is known for its significance in gliomagenesis. We observed consistently higher performance across the algorithms in predicting the NOTCH pathway. The LRC displayed exceptional precision and recall on the over_split and under_split datasets. In contrast, none of the algorithms performed well on the under_split_pure dataset.
The WNT signaling pathway, critical for cell differentiation in the central nervous system, presented diverse performance across algorithms. It had the highest class imbalance and the fewest alterations among the pathways, and the models performed relatively poorly on it across all datasets. The RFC showed variable performance across datasets, while the LRC achieved high precision and recall on the under_split dataset. The SVC demonstrated strong performance on the under_split_pure dataset, indicating that it can handle class imbalance effectively.
This research provides insights into using machine learning models with radiomic data to predict specific oncogenic signaling pathways. The findings underscore the impact of dataset size, class distribution, and feature complexity on model effectiveness. Taking these factors into account can improve the prediction algorithms and deepen our understanding of how AI in radiomics can elucidate the interactions among signaling pathways and their influence on tumor phenotypic traits.
While this study highlights the potential of using artificial intelligence and radiomics to predict oncogenic signaling pathways in glioblastoma, it relies on publicly available datasets, such as BRATS-19 and TCGA, which may not represent the full heterogeneity of glioblastoma cases. Limited sample sizes, particularly for certain signaling pathways, introduced class imbalance. Although techniques such as SMOTE were used to address this issue, synthetic data generation may not fully capture the complexity of real-world tumor biology, which could affect model performance on underrepresented pathways. Furthermore, while cross-validation was used to mitigate overfitting, external validation on independent datasets is still needed to confirm the models' reliability and applicability in diverse clinical settings.
Predicting oncogenic signaling pathways from radiomic features holds promise for making genomic diagnosis faster and more cost-effective. Invasive diagnostic procedures for brain tumors, such as brain biopsies, entail additional risks, which makes timely and accurate genetic profiling of specimens crucial for targeted therapeutic interventions in glioblastoma. This study offers a non-invasive approach to identifying oncogenic signaling pathways that can guide personalized therapeutic strategies. This is clinically significant because it could reduce reliance on invasive diagnostic procedures such as biopsies, thereby mitigating the associated risks. Our study deployed five machine-learning models to forecast five oncogenic signaling pathways using MRI scans from the TCGA-GBM dataset (Fig. 7). Our findings revealed a positive correlation between the radiomic features extracted from MRI scans and oncogenic signaling pathways in GBM. With adequate data, manual feature extraction could be bypassed, leading to a more generalized multi-label deep learning model capable of predicting additional signaling pathways. We intend to expand this research by developing such a model to predict a broader spectrum of signaling pathways. Future work could also extend the research beyond glioblastoma to other cancer types, which could help improve patient outcomes in diverse clinical contexts.
