T4Seeker: a hybrid model for type IV secretion effectors identification
BMC Biology volume 22, Article number: 259 (2024)
Abstract
Background
The type IV secretion system is widely present in various bacteria, such as Salmonella, Escherichia coli, and Helicobacter pylori. These bacteria use the type IV secretion system to secrete type IV secretion effectors, infect host cells, and disrupt or modulate the communication pathways. In this study, type III and type VI secretion effectors were used as negative samples to train a robust model.
Results
The area under the curve (AUC) values of T4Seeker on the validation and independent test sets were 0.947 and 0.970, respectively, demonstrating the strong predictive capacity and robustness of T4Seeker. In a comparison with classic and state-of-the-art T4SE identification models, T4Seeker, which combines traditional features with large language model features, showed higher predictive ability.
Conclusion
T4Seeker demonstrates superior performance in T4SE prediction. By integrating features at multiple levels, it achieves higher predictive accuracy and strong generalization capability, providing an effective tool for future T4SE research.
Background
Type IV secretion effectors (T4SEs) are proteins released by pathogenic bacteria into host cells via the type IV secretion system (T4SS), which plays a critical role in the interactions between pathogens and hosts [1, 2]. Once inside the host cells, T4SEs may alter host signaling pathways, suppress immune responses, and facilitate the invasion and survival of bacteria within the host cells. T4SEs can affect various biological processes in host cells, including signaling pathways, gene expression, and organelle function, thereby exploiting host cell resources to promote bacterial survival, reproduction, and transmission. Some bacteria, such as Helicobacter pylori, invade host cells and cause disease by releasing T4SE proteins [3]. In addition, T4SEs can disrupt host cell signaling pathways and immune responses, leading to disease [4]. T4SEs help bacteria evade detection by the host's immune system and attack it, thus enabling better survival [5, 6]. By disrupting these processes, T4SE proteins alter the physiological state of host cells. In severe cases, T4SEs may cause inflammatory responses in host cells [7]. In summary, T4SEs may have adverse effects on host cells and organisms, contributing to disease progression.
Accurate identification of T4SEs can deepen our understanding of the molecular mechanisms of host–pathogen interactions, thereby revealing the molecular basis of bacterial pathogenesis [8, 9]. Traditionally, T4SE data are generated and validated through laboratory experiments. Laboratory methods for determining whether a protein is a T4SE are more accurate but are time- and resource-intensive, and they require expensive reagents and equipment, resulting in relatively high costs. In contrast, machine learning-based T4SE identification tools can process and analyze large amounts of data in a relatively short time, enabling the rapid identification of T4SE proteins. Machine learning tools can also be automated, thereby reducing the need for extensive human resources and materials [10,11,12,13,14,15,16]. Importantly, machine learning methods are scalable, and as the available data continue to improve [17,18,19,20,21,22], models can be continually updated to achieve more accurate T4SE identification. More importantly, combining laboratory methods with machine learning approaches can expedite research on T4SEs: researchers can use machine learning tools to predict candidate T4SEs and then validate these candidates in the laboratory. Computational predictions can thus significantly narrow the scope of laboratory validation and save time and resources.
To date, the published T4SE identification tools include DeepT3_4 [23], Bastion4 [24], iT4SE-EP [25], OPT4e [26], DeepSecE [27], T4SE-XGB [28], T4SEfinder [29], T4SEpp [30], and TSE-ARF [31]. DeepT3_4 [23], which integrates recurrent and deep neural networks, classifies type III and type IV secreted effectors using amino acid character dictionaries and sequence-based features extracted from the effector proteins. Bastion4 [24] trained a T4SE predictor using six machine learning models and 10 selected features, with ensemble models enhancing the predictive performance. OPT4e [26] employs a statistical approach to select the best features for predicting T4SS effector proteins. DeepSecE [27] uses a pre-trained protein language model and a transformer, with the potential to identify disease-associated proteins across bacterial genomes. T4SE-XGB [28] uses the XGBoost algorithm to identify T4SEs based on protein sequence features, with feature interpretation performed using the SHAP method. T4SEfinder [29] uses a pre-trained language model of protein sequences to classify T4SEs. T4SEpp [30] employs full-length embedding features from six pre-trained protein language models to train classifiers for predicting T4SEs, and integrates three modules: a homolog search for known T4SEs, machine learning fine-tuning with signal sequence data, and the use of top-performing pre-trained protein language models. The TSE-ARF study [31] proposed two new feature descriptors, fused them with universal features to form a 290-dimensional feature vector, and employed the TSE-ARF model for classification using parameter adaptation for different secretion effectors. By integrating the data from these studies, we trained T4Seeker, which offers the following advantages.
1. The results demonstrated that T4Seeker exhibited robustness and generalization ability, with an area under the curve (AUC) of 0.947 for the cross-validation test set and 0.970 for the independent dataset.
2. Analysis of feature extraction at different levels revealed that distance-based residue (DR) [32], evolutionary scale modeling (ESM) [33], and long short-term memory (LSTM) features not only perform well individually but also complement one another. The fusion of DR, ESM, and LSTM features effectively identified T4SEs.
3. By employing type III secretion effectors (T3SEs) and type VI secretion effectors (T6SEs) as negative samples, a powerful T4SE identification model was trained.
Results
Model development
To construct an effective predictive model for T4SEs, we extracted a comprehensive array of features spanning multiple levels of protein sequence representation. These features fell into distinct groups: traditional features based on amino acid composition [32, 34,35,36], traditional features based on composition and distribution [34, 35], evolutionary scale modeling (ESM) features [33], and LSTM features. To ensure that only discriminative features conducive to robust model performance were selected, a meticulous screening process was employed. Considering the effectiveness of the Multi-Layer Perceptron [37] (MLP) in handling high-dimensional feature spaces and its capability to discern complex patterns within data, we chose the MLP framework to identify T4SEs [38, 39]. Feature subsets from different levels were individually evaluated using the MLP, with an emphasis on optimizing the AUC. DR [32], ESM-average [33], ESM-flatten-1024, and LSTM (each with an AUC exceeding 0.90) were subsequently retained for further analysis (Fig. 1).
Feature fusion is a form of ensemble learning that integrates information from different feature sets to obtain a more comprehensive and accurate representation for model training [40]. By combining multiple features, the model can exploit their complementarity to enhance predictive performance and generalization capability, and to capture the characteristics and patterns of the data more fully. In addition, using multiple features reduces the model's reliance on any single feature, thereby reducing the risk of overfitting. We therefore fused different features to train better-performing models. Specifically, we trained MLP models that combined DR and LSTM with either ESM-average or ESM-flatten-1024. The fusion of DR, ESM-average, and LSTM resulted in an average AUC of 0.938 on the fivefold cross-validation (cv) test, whereas the fusion of DR, ESM-flatten-1024, and LSTM resulted in an average AUC of 0.947. Therefore, we chose LSTM + ESM-flatten-1024 + DR as the final model, named T4Seeker.
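The fusion step can be illustrated with a minimal sketch. It assumes that fusion means simple concatenation of the per-protein feature vectors in a fixed order; the source does not state the fusion operator or the order, and the function name `fuse_features` and the toy vector sizes are hypothetical.

```python
from typing import Dict, List, Sequence

# Assumed fusion order; the source only names which features are fused.
FUSION_ORDER = ("LSTM", "ESM-flatten-1024", "DR")


def fuse_features(blocks: Dict[str, Sequence[float]],
                  order: Sequence[str] = FUSION_ORDER) -> List[float]:
    """Concatenate the named feature vectors of one protein into one vector."""
    fused: List[float] = []
    for name in order:
        if name not in blocks:
            raise KeyError(f"missing feature block: {name}")
        fused.extend(float(v) for v in blocks[name])
    return fused


# Toy vectors with made-up sizes (real feature blocks are much longer).
example = {
    "LSTM": [0.1, 0.2],
    "ESM-flatten-1024": [0.3, 0.4, 0.5],
    "DR": [0.6],
}
fused = fuse_features(example)
```

The fused vector would then serve as the input to the classifier, so its length must equal the classifier's input dimension.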
Model performance and validation
Performance of individual features
To evaluate the feature subsets within the amino acid composition feature group, we computed specificity (SP), precision, recall, accuracy (ACC), F1-score, Matthews correlation coefficient (MCC), and AUC. Averaged across all features in the amino acid composition feature group, the fivefold cv test yielded an SP of 0.807, precision of 0.802, recall of 0.818, ACC of 0.813, F1-score of 0.811, MCC of 0.626, and AUC of 0.868 (Table 1). Notably, the DR-based MLP model surpassed these group averages by 0.068, 0.067, 0.07, 0.068, 0.069, 0.019, and 0.079, respectively, on the fivefold cv test (Table 1). In the composition and transition groups, the AUCs for both CTDC and CTDD were below 0.90. Additionally, the ESM-average, ESM-flatten-1024, and LSTM features exhibited high average AUCs of 0.922, 0.908, and 0.906 on the fivefold cv test, respectively (Table 1). In summary, the single-feature models with an AUC exceeding 0.9 were those based on DR, ESM-average, ESM-flatten-1024, and LSTM.
Combining feature performance
We integrated DR and LSTM separately with ESM-average and ESM-flatten-1024 to train the models. For the fivefold cv test, the LSTM + ESM-average + DR model achieved an average AUC of 0.924. In comparison, the LSTM + ESM-flatten-1024 + DR model (T4Seeker) achieved an AUC of 0.941. The SP, precision, recall, ACC, F1-score, MCC, and AUC of T4Seeker were 0.005, 0.007, 0.016, 0.011, 0.014, 0.032, and 0.009 higher, respectively, than those of the LSTM + ESM-average + DR model on the fivefold cv test. To demonstrate the performance of T4Seeker further, Table 3 presents the results on the independent test set. On this set, T4Seeker achieved an SP, precision, recall, ACC, F1-score, MCC, and AUC of 0.944, 0.945, 0.92, 0.932, 0.932, 0.864, and 0.970, respectively. These values were 0.122, 0.103, 0.018, 0.069, 0.061, 0.137, and 0.029 higher than those of the LSTM + ESM-average + DR model, respectively. The best-performing single-feature model achieved an SP of 0.858, precision of 0.866, recall of 0.858, ACC of 0.858, F1-score of 0.858, MCC of 0.717, and AUC of 0.928 on the independent test set. These values were lower than those of T4Seeker by 0.086, 0.079, 0.063, 0.074, 0.075, 0.147, and 0.042, respectively. In conclusion, the performance of T4Seeker on both the validation and independent test sets demonstrated its strong predictive ability, robustness, and generalization capability, providing a solid foundation for biological research.
The proposed T4Seeker outperforms other methods
To highlight the superiority of T4Seeker in classifying T4SEs and non-T4SEs, we compared it with four published mainstream T4SE identification models (Bastion4 [24], T4SEpp [30], DeepSecEbd [27], and T4SEfinder [29]) on the independent test set. Among the four models, DeepSecEbd performed the best (Table 2). The SP, precision, ACC, F1-score, MCC, and AUC of DeepSecEbd were higher than those of the remaining three models by 0.212, 0.153, 0.089, 0.065, 0.172, and 0.091 on the independent test set. Although DeepSecEbd had a higher SP and precision than T4Seeker, the recall, ACC, F1-score, MCC, and AUC of T4Seeker were higher than those of DeepSecEbd by 0.072, 0.023, 0.028, 0.039, and 0.06, respectively.
In addition, Fig. 2A shows the t-test statistical comparison between T4Seeker and Bastion4, T4SEpp, DeepSecEbd, and T4SEfinder. T4Seeker exhibits statistically significant performance improvements over Bastion4, T4SEpp, and T4SEfinder, with p-values less than 0.05, indicating that the observed performance differences between T4Seeker and these models are unlikely to be due to chance. The comparison between T4Seeker and DeepSecEbd yielded a p-value greater than 0.05 (p-value = 0.1492), which suggests no statistically significant difference in performance; however, the T-statistic is positive (T-statistic = 1.65), indicating that, on average, T4Seeker's performance metrics are higher than those of DeepSecEbd. Taken together, T4Seeker significantly outperforms Bastion4, T4SEpp, and T4SEfinder and shows a trend toward better performance than DeepSecEbd. Because the purpose of a model is to predict new samples, the main advantage of T4Seeker lies in its generalization ability, that is, its performance on new, unseen data, which is an important indicator for evaluating models. This generalization ability helps T4Seeker remain effective across a variety of testing conditions and enhances its reliability in real-world applications where unpredictability is common.
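For illustration, a t-statistic of this kind can be computed as sketched below. The source does not specify which t-test variant was used; the sketch assumes a paired test over matched metric values of two models, and the input numbers are made up for the example rather than taken from the tables.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t-statistic for matched scores of model A versus model B.

    A positive value means that A scores higher than B on average.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 values")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Hypothetical metric values for two models (not results from this study).
model_a = [0.90, 0.91, 0.92, 0.93]
model_b = [0.89, 0.89, 0.89, 0.89]
t_stat = paired_t_statistic(model_a, model_b)
```

The statistic has n − 1 degrees of freedom, where n is the number of matched values; the p-value follows from the t distribution with that many degrees of freedom.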
Ablation studies
In this section, we present a comprehensive study to validate the effectiveness of different components of T4Seeker.
LSTM feature: First, the LSTM feature was removed, and the model was denoted as ESM-flatten-1024 + DR. Compared to T4Seeker, the ESM-flatten-1024 + DR model exhibited a decrease in the AUC of 0.025 and 0.026 in the fivefold cv test and independent test sets, respectively (Tables 2 and 3).
ESM-flatten-1024 feature: To assess the significance of the ESM-flatten-1024 feature in constructing the T4Seeker model, the ESM-flatten-1024 feature was removed (LSTM + DR). As shown in Table 2, compared with T4Seeker, the LSTM + DR model showed average decreases in SP, precision, recall, ACC, F1-score, MCC, and AUC on the fivefold cv test of 0.016, 0.01, 0.049, 0.034, 0.032, 0.031, and 0.026, respectively. On the independent test set, the SP, precision, recall, ACC, F1-score, MCC, and AUC of LSTM + DR were lower than those of T4Seeker by 0.037, 0.039, 0.063, 0.051, 0.051, 0.1, and 0.026, respectively.
DR feature: The DR feature was eliminated, and the model was denoted as LSTM + ESM-flatten-1024. The performance of LSTM + ESM-flatten-1024 on the fivefold cv test is presented in Table 2. The SP, precision, recall, ACC, F1-score, MCC, and AUC of LSTM + ESM-flatten-1024 were lower than those of T4Seeker by 0.004, 0.004, 0.095, 0.056, 0.006, 0.099, and 0.018, respectively. Similarly, for the independent test set (Table 3), the metrics of LSTM + ESM-flatten-1024 were lower than those of T4Seeker by 0.037, 0.038, 0.054, 0.046, 0.046, 0.091, and 0.009, respectively.
Evaluation of T4Seeker with additional baseline models
To further demonstrate the superiority of T4Seeker, we added baseline models, including one-vs-rest, k-nearest neighbors, multinomial naive Bayes, random forest, logistic regression, extra trees, and support vector machine. In the fivefold cross-validation test, T4Seeker showed higher mean values for SP, precision, recall, ACC, F1-score, MCC, and AUC compared to the baseline models, with differences of 0.057, 0.053, 0.037, 0.048, 0.048, 0.155, and 0.067 (please see Fig. 2B for more details). In the independent test set, these baseline models’ average SP, precision, recall, ACC, F1-score, MCC, and AUC were lower than T4Seeker’s by 0.104, 0.102, 0.104, 0.104, 0.103, 0.207, and 0.082 (Fig. 2C). This comparison highlights the superior performance of T4Seeker over a diverse set of baseline models, further substantiating the efficacy and robustness of T4Seeker.
In summary, the LSTM, ESM-flatten-1024, and DR features played positive roles in training the T4Seeker model. Their combination provides rich information, thereby assisting the model in making more accurate predictions and classifications.
Discussion
Our study represents a comprehensive effort to advance the classification of T4SEs through the integration of diverse feature sets. By leveraging multiple types of features, including amino acid composition, pseudo-amino acid composition, autocorrelation, and grouped amino acid composition, as well as by incorporating evolutionary information extracted using ESM and deep features using LSTM, we aimed to enhance the discriminatory power and generalization ability of T4SEs classification models. The fusion of the DR, ESM, and LSTM features effectively addresses the limitations of a single feature and significantly improves the predictive accuracy of the model. By integrating diverse feature sets and leveraging deep learning techniques, our study offers insight into exploration of T4SEs functions and the development of targeted interventions to combat infectious diseases. Furthermore, T4Seeker has surpassed existing classification models for T4SEs.
However, there are still some limitations. Despite integrating data from multiple sources, the dataset may not capture all the variability present in natural T4SEs, potentially limiting the model's generalizability. Future work will focus on expanding the dataset to include more diverse bacterial species and newly discovered T4SEs to enhance model robustness. By addressing these limitations, we aim to enhance the accuracy and applicability of T4Seeker. T4Seeker can be integrated into existing workflows as a tool for the preliminary screening of potential type IV secretion system effectors and for guiding experimental validation. It can aid in identifying potential virulence factors that are crucial for understanding pathogenic mechanisms and informing targeted therapeutic development. By focusing on T4SEs, T4Seeker can also assist in comparative studies, helping researchers explore the presence and variation of these effectors across different bacterial species. In addition, future studies should expand the scope of T4SE classification to include additional bacterial species and virulence factors, and should explore novel computational approaches for improved model interpretability and biological relevance.
Conclusions
T4Seeker improves prediction accuracy and generalization by integrating multi-level features, including amino acid composition, ESM evolutionary information, and deep LSTM features. T4Seeker can serve as a tool for the preliminary screening of T4SEs, enhancing our understanding of bacterial pathogenic mechanisms. The current dataset lacks diversity; including more bacterial species and newly discovered T4SEs in the future will improve T4Seeker's robustness and applicability.
Methods
Data description
We collected T4SE data from the available published literature, including DeepT3_4 [23], Bastion4 [24], iT4SE-EP [25], OPT4e [41], DeepSecE [27], T4SE-XGB [28], T4SEfinder [29], T4SEpp [30], and TSE-ARF [31]. Among them, the data of T4SEfinder, T4SEpp, and DeepSecE are all derived from the SecReT4 database [42]. The DeepT3_4 data come from the SecretEPDB database [43]. iT4SE-EP, T4SE-XGB, and TSE-ARF not only use T4SEs from the SecReT4 database but also integrate T4SEs from ten types of bacteria retrieved from the literature, including Agrobacterium, Anaplasma, Bartonella, Bordetella, Brucella, Coxiella, Ehrlichia, Helicobacter, Legionella, and Ochrobactrum [44]. The T4SEs in OPT4e [41] are known effectors of four Gram-negative bacterial pathogens in the classes Alphaproteobacteria and Gammaproteobacteria. The T4SEs in TSE-ARF come from the BastionHub database [45]. The T4SE sequences from these different sources were integrated, resulting in 5,473 samples. Considering the potential redundancy and sequence similarity among the samples, we performed CD-HIT [46,47,48] clustering to reduce redundancy and enhance dataset diversity (CD-HIT threshold = 80%). Following CD-HIT clustering, the number of T4SE samples was reduced to 730 (Fig. 3).
To train a high-performance model, we also collected data on T3SEs and T6SEs from datasets including DeepT3_4 [23], Bastion3 [49], DeepT3-Keras [50], TSE-ARF [31], SecReT6 [51], and Bastion6 [52]. Among them, the T3SEs in Bastion3 and DeepT3-Keras are from NCBI Protein (Tatusova, 2016) and UniProt (UniProt Consortium, 2019). The T6SEs in Bastion6 are from the SecretEPDB database [43]. The integration of T3SEs and T6SEs from different databases resulted in 2,301 T3SE samples and 670 T6SE samples. Subsequent CD-HIT clustering reduced the numbers of T3SE and T6SE samples to 730 and 309, respectively. We then combined the T3SE and T6SE datasets into a negative-sample dataset (Fig. 3). To obtain a balanced dataset, which is crucial for training robust models, we randomly selected negative samples from this negative-sample dataset. The random selection did not involve any additional filtering or criteria, which helps keep the model training unbiased. Subsequently, the dataset was divided into training, validation, and test sets, with 70% allocated for training, 15% for validation, and 15% for testing.
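The sampling and splitting procedure can be sketched as follows. This is a minimal illustration under stated assumptions: negatives are drawn without replacement to match the number of positives, the pooled data are shuffled once, and the split is not stratified. The function name, the seed, and the placeholder identifiers are hypothetical; only the sample counts (730 T4SEs, 730 + 309 negatives) and the 70/15/15 proportions come from the text.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[str, int]  # (sequence identifier, label); 1 = T4SE, 0 = non-T4SE


def balanced_split(positives: Sequence[str], negatives: Sequence[str],
                   seed: int = 0, train_frac: float = 0.70,
                   val_frac: float = 0.15
                   ) -> Tuple[List[Sample], List[Sample], List[Sample]]:
    """Sample as many negatives as positives, then split into train/val/test."""
    if len(negatives) < len(positives):
        raise ValueError("not enough negatives to balance the dataset")
    rng = random.Random(seed)
    # Purely random draw, with no extra filtering of the negative pool.
    chosen = rng.sample(list(negatives), k=len(positives))
    data: List[Sample] = [(s, 1) for s in positives] + [(s, 0) for s in chosen]
    rng.shuffle(data)
    n_train = round(len(data) * train_frac)
    n_val = round(len(data) * val_frac)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]


# Placeholder identifiers standing in for real protein sequences.
pos_ids = [f"T4SE_{i}" for i in range(730)]
neg_ids = [f"NEG_{i}" for i in range(730 + 309)]
train_set, val_set, test_set = balanced_split(pos_ids, neg_ids)
```

Fixing the seed keeps the split reproducible; a stratified split would additionally guarantee the same class ratio in each subset.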
Feature representation
Proteins are fundamental components of living organisms and perform a diverse range of functions that are critical to cellular processes [53]. The identification and characterization of T4SEs play a pivotal role in understanding host–pathogen interactions. In this study, we extracted features from four levels.
Long short-term memory features
Long short-term memory [54,55,56] networks are effectively used for feature extraction, especially in handling data that require sequence dependency. Through unique gating mechanisms, LSTM can process amino acid sequence data and remember important information while forgetting the irrelevant information. This allows the extraction of key features from the entire sequence, thereby capturing the temporal dynamics of the data suitable for a variety of downstream tasks. As shown in Fig. 1B, the base layer is an embedding layer that transforms each amino acid in the sequence (such as AA1, AA2, AA3, …, AAn) into an embedding vector. These embedding vectors are then fed into the upper LSTM units. In this study, a bidirectional LSTM was used. One direction of the LSTM (blue) processes the sequence from left to right, whereas the other (orange) processes it from right to left. This captures the dependencies in both directions of the sequence, thereby extracting richer feature information.
Evolutionary scale modeling features
Evolutionary scale modeling [33] features can not only elucidate the evolutionary dynamics of individual residues but also provide invaluable insights into the broader evolutionary context of proteins, empowering downstream tasks. Leveraging the ESM, each residue was represented by a comprehensive 320-dimensional feature vector. This methodological framework enables the extraction of nuanced evolutionary information. Initially, the ESM features were computed across the entire dataset to provide a comprehensive insight into the evolutionary patterns. We applied different pooling methods to the ESM features to obtain a better model, including average pooling over the sequence dimensions (ESM-average), max pooling over the sequence dimensions (ESM-max), retaining only fixed-length (33 × 320) residue information and flattening (ESM-flatten), and performing feature selection on ESM-flatten (ESM-flatten-1024).
ESM-average
The ESM generated a 320-dimensional vector for each residue, so an amino acid sequence of length N is represented by an N × 320 feature matrix. Average pooling is then performed along the sequence length dimension, that is, each of the 320 columns is averaged over all N residues. Thus, each amino acid sequence yields a 1 × 320 feature vector (Fig. 1B).
ESM-max
Similar to ESM-average, ESM-max uses max pooling along the column dimension to determine the maximum values (Fig. 1B).
ESM-flatten
Because the shortest sequence in the dataset contains 33 residues, only the features generated by the first 33 residues of each sequence were retained and flattened into a single vector of 33 × 320 = 10,560 dimensions (Fig. 1B).
ESM-flatten-1024
Feature selection was performed on ESM-flatten using random forests [57], reducing the feature dimensions to 1024 (Fig. 1B).
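The three pooling variants (ESM-average, ESM-max, and ESM-flatten) can be sketched with plain Python lists, as below. The sketch assumes that the per-residue embedding is already available as an N × D matrix; computing the embedding itself and the random-forest feature selection behind ESM-flatten-1024 are not shown. The function names and the toy 3 × 2 matrix are illustrative only.

```python
from typing import List

Matrix = List[List[float]]  # N residues (rows) x D embedding dimensions


def average_pool(emb: Matrix) -> List[float]:
    """ESM-average: mean of every embedding dimension over all residues."""
    if not emb:
        raise ValueError("empty embedding")
    return [sum(col) / len(emb) for col in zip(*emb)]


def max_pool(emb: Matrix) -> List[float]:
    """ESM-max: maximum of every embedding dimension over all residues."""
    if not emb:
        raise ValueError("empty embedding")
    return [max(col) for col in zip(*emb)]


def flatten_first(emb: Matrix, keep: int = 33) -> List[float]:
    """ESM-flatten: keep the first `keep` residues and flatten row by row."""
    if len(emb) < keep:
        raise ValueError("sequence is shorter than the retained length")
    return [value for row in emb[:keep] for value in row]


# Toy embedding: 3 residues x 2 dimensions (the paper uses D = 320).
toy = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0]]
avg_vec = average_pool(toy)
max_vec = max_pool(toy)
flat_vec = flatten_first(toy, keep=2)
```

With D = 320 and keep = 33, `flatten_first` returns a 10,560-dimensional vector, whereas the two pooling functions return 320-dimensional vectors regardless of sequence length.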
Traditional features based on amino acid composition
Amino acid composition quantifies the frequency of individual amino acid residues in a protein sequence, providing insights into its primary structure. To represent the composition of T4SEs, we utilized the following amino acid composition-based features. Amino acid composition (AAC) [34]: quantifies the frequency of individual amino acid residues in the protein sequence. Dipeptide composition (DPC) [34]: represents the distribution of dipeptides within the protein sequence. Distance-based residue (DR) [32]: captures the spatial relationships between amino acid residues within a protein structure. By calculating the distances between each pair of residues, this feature provides insight into the three-dimensional arrangement of the protein. The method generates a distance matrix in which each element represents the Euclidean distance between the alpha carbon atoms of two residues; key statistical measures of these distances, such as the mean, maximum, minimum, and standard deviation, are then extracted as features. K-mer (k = 2) [32, 58]: represents the frequency distribution of subsequences of length two within the protein sequence. Grouped amino acid composition (GAAC) [34]: represents the frequency distribution of amino acids grouped by their physicochemical properties. Grouped dipeptide composition (GDPC) [34]: represents the frequency distribution of grouped dipeptides in the protein sequence. Pseudo-position-specific amino acid composition general (PC-PseAAC-General) [32]: a generalized form of pseudo-amino acid composition with sequence-order effects. Split amino acid composition general (SC-PseAAC-General) [32]: a generalized pseudo-amino acid composition derived from the split amino acid composition (Fig. 1B).
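Two of these descriptors, AAC and DPC, are simple enough to sketch directly. The code below is an illustrative implementation, not the feature-extraction tool used in the study; it assumes the 20 standard amino acids in alphabetical one-letter order and counts overlapping dipeptides. Non-standard residues contribute to the sequence length but have no slot in the output vectors.

```python
from collections import Counter
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, fixed order


def aac(seq: str) -> List[float]:
    """Amino acid composition: frequency of each residue (20 values)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def dpc(seq: str) -> List[float]:
    """Dipeptide composition: frequency of each adjacent pair (400 values)."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return [pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)]


# Toy sequence: residues A, C, C, A with dipeptides AC, CC, CA.
aac_vec = aac("ACCA")
dpc_vec = dpc("ACCA")
```

For the toy sequence, alanine and cysteine each have a frequency of 0.5, and each of the three observed dipeptides has a frequency of 1/3.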
Traditional features based on composition and distribution
Composition (CTDC) [34]: CTDC quantifies the frequency distribution of amino acids in a protein sequence. Distribution (CTDD) [34]: CTDD captures the positional distribution of amino acids along the protein sequence. Transition (CTDT) [34]: CTDT quantifies amino acid transition patterns along the protein sequence (Fig. 1B).
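As an illustration of the composition and transition descriptors, the sketch below maps residues to three classes and then computes class frequencies and class-switch frequencies. The three-class hydrophobicity grouping is an assumption made for this example; the source does not list the physicochemical properties or groupings that were used, and the distribution descriptor (CTDD) is omitted for brevity.

```python
from typing import Dict, List

# Assumed grouping by hydrophobicity: 1 = polar, 2 = neutral, 3 = hydrophobic.
GROUPS: Dict[str, str] = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
RESIDUE_TO_CLASS = {aa: label
                    for label, members in GROUPS.items() for aa in members}


def encode(seq: str) -> str:
    """Replace every residue by the label of its class."""
    try:
        return "".join(RESIDUE_TO_CLASS[aa] for aa in seq.upper())
    except KeyError as err:
        raise ValueError(f"unsupported residue: {err.args[0]}") from None


def ctd_composition(seq: str) -> List[float]:
    """Fraction of residues that fall into each of the three classes."""
    classes = encode(seq)
    if not classes:
        raise ValueError("empty sequence")
    return [classes.count(label) / len(classes) for label in "123"]


def ctd_transition(seq: str) -> List[float]:
    """Fraction of adjacent pairs switching between classes 1-2, 1-3, 2-3."""
    classes = encode(seq)
    if len(classes) < 2:
        raise ValueError("sequence must contain at least two residues")
    pairs = [classes[i:i + 2] for i in range(len(classes) - 1)]
    result = []
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        switches = sum(1 for p in pairs if p in (a + b, b + a))
        result.append(switches / len(pairs))
    return result


# Toy sequence RKGAC is encoded as "11223".
comp_vec = ctd_composition("RKGAC")
trans_vec = ctd_transition("RKGAC")
```

In practice such descriptors are computed for several physicochemical properties and concatenated, so the real feature vectors are longer than the three values shown here.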
Feature analysis
During the feature analysis phase, the performance of the different feature groups was assessed using the MLP. Within the amino acid composition group, the DR [32] feature exhibited the highest AUC, indicating its discriminatory power for distinguishing T4SEs (Table 1). In the composition and transition groups, CTDC [34] and CTDT [34] emerged as the top-performing features with the highest AUC; however, their AUC remained below 0.9 (Table 1). DR therefore exhibited the highest AUC among all the traditional features. In addition, ESM [33] and LSTM showed promising potential for distinguishing T4SEs from non-T4SEs. The AUCs on the validation datasets for ESM-flatten-1024, ESM-average, and LSTM were 0.908, 0.922, and 0.906, respectively (Table 1 and Fig. 4).
Fig. 4 Performance of different models. A Performance of single-feature models. B–G Summary plots of SHAP values for DR, CTDD, TPC, kmer, DPC, and CTDT. For each feature, one point corresponds to a single sample. The SHAP value along the x-axis represents the impact that the feature had on the model's output for that specific sample. Features positioned higher in the plot are more important to the model
Additionally, we used SHAP to analyze the traditional features, as depicted in Fig. 4B–G. Due to space constraints, only the six traditional features with the highest total MASV are displayed. DR obtained the highest total MASV, with a score of 1.36, which is consistent with DR achieving the highest AUC among the traditional features. In other words, DR demonstrates a strong capability for distinguishing T4SEs.
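A score of this kind can be sketched as follows, under two assumptions that the source does not confirm: that MASV stands for the mean absolute SHAP value of a feature dimension across samples, and that the total MASV of a feature group is the sum of these means over the group's dimensions. The SHAP values in the example are invented; in practice they would come from a SHAP explainer applied to the trained model.

```python
from typing import List

ShapMatrix = List[List[float]]  # samples (rows) x feature dimensions (columns)


def mean_abs_shap(shap_values: ShapMatrix) -> List[float]:
    """Mean absolute SHAP value of every feature dimension across samples."""
    if not shap_values:
        raise ValueError("no samples")
    n = len(shap_values)
    return [sum(abs(v) for v in col) / n for col in zip(*shap_values)]


def total_masv(shap_values: ShapMatrix) -> float:
    """Assumed group score: sum of the per-dimension mean absolute values."""
    return sum(mean_abs_shap(shap_values))


# Invented SHAP values for 2 samples and 2 feature dimensions.
toy_shap = [[0.5, -0.1], [-0.5, 0.3]]
per_dimension = mean_abs_shap(toy_shap)
group_score = total_masv(toy_shap)
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling, so the score reflects the magnitude of a feature's influence rather than its direction.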
Hyperparameter settings
In this study, we employed specific hyperparameters and optimization strategies to train the model. We used the AdamW optimizer [59], a variant of Adam, with a learning rate of 1e-5 (0.00001) and a weight decay of 0.001. To dynamically adjust the learning rate, we utilized a Cyclic Learning Rate [60] (CLR) scheduler with a base learning rate of 0.00001, a maximum learning rate of 0.001, and an upward step size of 30. The model was trained for 50 epochs with a batch size of 64. The structure of the Multi-Layer Perceptron (MLP) was 2756-1280-2, that is, a 2756-dimensional input, one hidden layer of 1280 units, and two output classes. We chose the Cross-Entropy Loss function as the criterion, as it is suitable for classification tasks and effectively measures the difference between the predicted and actual class distributions. These configurations helped improve the training efficiency and performance of the model. We also conducted experiments with different learning rates, optimizers, schedulers, and MLP structures, as shown in Fig. 5. The learning rate of 0.00001, the AdamW optimizer, the MLP structure with a 1280-unit hidden layer, and the Cyclic Learning Rate scheduler yielded the best performance, achieving a high AUC. These consistent results indicate that the chosen hyperparameters and optimization strategies enhance the performance of T4Seeker.
Fig. 5 Comparative analysis of hyperparameter effects on model metrics. A The metrics of models with different learning rates on the independent test. B The metrics of models with different optimizers on the independent test. C The metrics of models with different hidden layers on the independent test. D The metrics of models with different schedulers on the independent test
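The behaviour of a cyclic learning rate can be illustrated with the triangular policy, sketched below in plain Python. The base rate, maximum rate, and step size follow the hyperparameter settings; the choice of the triangular policy with equally long rising and falling halves is an assumption, since the text does not state which cyclic mode was used, and the function name is hypothetical.

```python
import math


def triangular_clr(iteration: int, base_lr: float = 1e-5,
                   max_lr: float = 1e-3, step_size: int = 30) -> float:
    """Learning rate at a given iteration under the triangular cyclic policy.

    The rate rises linearly from base_lr to max_lr over `step_size`
    iterations, falls back over the next `step_size`, and then repeats.
    """
    if iteration < 0 or step_size <= 0:
        raise ValueError("iteration must be >= 0 and step_size must be > 0")
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Learning rates over the first full cycle (iterations 0 to 60).
schedule = [triangular_clr(i) for i in range(61)]
```

With these settings the rate starts at 0.00001, peaks at 0.001 at iteration 30, and returns to 0.00001 at iteration 60.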
Evaluation metrics
Specificity represents the ability of T4Seeker to correctly identify non-T4SEs. Precision is the proportion of true T4SEs among the samples predicted as T4SEs by T4Seeker. Recall is the proportion of actual T4SEs that T4Seeker predicts as T4SEs. Accuracy is the proportion of samples correctly predicted by T4Seeker among all samples. The F1-score is the harmonic mean of precision and recall. The AUC is the area under the receiver operating characteristic curve [61, 62]. An AUC value closer to 1 indicates better model performance.
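These metrics follow the standard confusion-matrix definitions, which are assumed here to be the ones used in this study:

```latex
\begin{aligned}
\mathrm{SP} &= \frac{TN}{TN + FP}, &
\mathrm{Precision} &= \frac{TP}{TP + FP}, \\
\mathrm{Recall} &= \frac{TP}{TP + FN}, &
\mathrm{ACC} &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\mathrm{F1} &= \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, &
\mathrm{MCC} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{aligned}
```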
where TP refers to the number of samples correctly predicted as T4SEs by T4Seeker, TN represents the number of samples correctly predicted as non-T4SEs, FP indicates the number of non-T4SEs incorrectly predicted as T4SEs, and FN is the number of T4SEs incorrectly predicted as non-T4SEs by T4Seeker.
Data availability
No datasets were generated or analysed during the current study.
Abbreviations
- AAC: Amino acid composition
- DPC: Dipeptide composition
- DR: Distance-based residue
- PC-PseAAC-General: Pseudo position-specific amino acid composition general
- SC-PseAAC-General: Split amino acid composition general
- CTDC: Composition
- CTDD: Distribution
- GAAC: Grouped amino acid composition
- GDPC: Grouped dipeptide composition
- LSTM: Long short-term memory
- ESM: Evolutionary scale modeling features
- ESM-average: Average pooling over the sequence dimension
- ESM-max: Max pooling over the sequence dimension
- ESM-flatten: Retaining only fixed-length (33*320) residue information and flattening
- ESM-flatten-1024: Performing feature selection on ESM-flatten
- MLP: Multilayer perceptron
- SP: Specificity
- ACC: Accuracy
- AUC: Area under the curve
- OVR: One-vs-rest
- KNB: K-nearest neighbors
- MNB: Multinomial naive Bayes
- RF: Random forest
- LR: Logistic regression
- ETAs: Extra trees
- SVM: Support vector machine
References
Dehio C. Infection-associated type IV secretion systems of Bartonella and their diverse roles in host cell interaction. Cell Microbiol. 2008;10(8):1591–8.
Voth DE, Broederdorf LJ, Graham JG. Bacterial Type IV secretion systems: versatile virulence machines. Future Microbiol. 2012;7(2):241–57.
Dielen AS, Badaoui S, Candresse T, German-Retana S. The ubiquitin/26S proteasome system in plant–pathogen interactions: a never-ending hide-and-seek game. Mol Plant Pathol. 2010;11(2):293–308.
Rajendhran J. Genomic insights into Brucella. Infect Genet Evol. 2021;87: 104635.
Finlay BB, McFadden G. Anti-immunology: evasion of the host immune system by bacterial and viral pathogens. Cell. 2006;124(4):767–82.
Hornef MW, Wick MJ, Rhen M, Normark S. Bacterial strategies for overcoming host innate and adaptive immune responses. Nat Immunol. 2002;3(11):1033–40.
Sankarasubramanian J, Vishnu US, Dinakaran V, Sridhar J, Gunasekaran P, Rajendhran J. Computational prediction of secretion systems and secretomes of Brucella: identification of novel type IV effectors and their interaction with the host. Mol BioSyst. 2016;12(1):178–90.
Agany DD, Pietri JE, Gnimpieba EZ. Assessment of vector-host-pathogen relationships using data mining and machine learning. Comput Struct Biotechnol J. 2020;18:1704–21.
Wei L, Zhou C, Chen H, Song J, Su R. ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics. 2018;34(23):4007–16.
Xing EP, Ho Q, Xie P, Wei D. Strategies and principles of distributed machine learning on big data. Engineering. 2016;2(2):179–95.
Wang Y, Zhai Y, Ding Y, Zou Q. SBSM-Pro: Support Bio-sequence Machine for Proteins. arXiv preprint arXiv:2308.10275. 2023.
Sinha D, Dasmandal T, Yeasin M, Mishra DC, Rai A, Archak S. EpiSemble: A Novel Ensemble-based Machine-learning Framework for Prediction of DNA N6-methyladenine Sites Using Hybrid Features Selection Approach for Crops. Curr Bioinform. 2023;18(7):587–97.
Li X, Ma S, Xu J, Tang J, He S, Guo F. TranSiam: Aggregating multi-modal visual features with locality for medical image segmentation. Expert Syst Appl. 2024;237:121574.
Wei L, Xing P, Shi G, Ji Z, Zou Q. Fast Prediction of Protein Methylation Sites Using a Sequence-Based Feature Selection Technique. IEEE/ACM Trans Comput Biol Bioinf. 2019;16(4):1264–73.
Li H, Pang Y, Liu B. BioSeq-BLM: a platform for analyzing DNA, RNA, and protein sequences based on biological language models. Nucleic Acids Res. 2021;49(22): e129.
Li H, Liu B. BioSeq-Diabolo: Biological sequence similarity analysis using Diabolo. PLoS Comput Biol. 2023;19(6): e1011214.
Kashyap H, Ahmed HA, Hoque N, Roy S, Bhattacharyya DK. Big data analytics in bioinformatics: A machine learning perspective. arXiv preprint arXiv:1506.05101. 2015.
Sparks ER, Talwalkar A, Haas D, Franklin MJ, Jordan MI, Kraska T. Automating model search for large scale machine learning. In: Proceedings of the Sixth ACM Symposium on Cloud Computing; 2015.
Wang L, Ding Y, Tiwari P, Xu J, Lu W, Muhammad K, et al. A deep multiple kernel learning-based higher-order fuzzy inference system for identifying DNA N4-methylcytosine sites. Inf Sci. 2023;630:40–52.
Guo X, Huang Z, Ju F, Zhao C, Yu L. Highly Accurate Estimation of Cell Type Abundance in Bulk Tissues Based on Single-Cell Reference and Domain Adaptive Matching. Advanced Science. 2024;11(7):2306329.
Jiang Y, Wang R, Feng J, Jin J, Liang S, Li Z, et al. Explainable deep hypergraph learning modeling the peptide secondary structure prediction. Advanced Science. 2023;10(11):2206151.
Liu B, Gao X, Zhang H. BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Res. 2019;47(20):e127.
Yu L, Liu F, Li Y, Luo J, Jing R. DeepT3_4: a hybrid deep neural network model for the distinction between bacterial type III and IV secreted effectors. figshare https://figshare.com/articles/dataset/Data_Sheet_1_DeepT3_4_A_Hybrid_Deep_Neural_Network_Model_for_the_Distinction_Between_Bacterial_Type_III_and_IV_Secreted_Effectors_docx/13619651?file=26139221 (2021).
Wang J, Yang B, An Y, Marquez-Lago T, Leier A, Wilksch J, et al. Systematic analysis and prediction of type IV secreted effector proteins by machine learning approaches. Brief Bioinform. 2019;20(3):931–51.
Han H, Ding C, Cheng X, Sang X, Liu T. iT4SE-EP: Accurate identification of bacterial type IV secreted effectors by exploring evolutionary features from two PSI-BLAST profiles. Molecules. 2021;26(9):2487.
Esna Ashari Z, Dasgupta N, Brayton KA, Broschat SL. An optimal set of features for predicting type IV secretion system effector proteins for a subset of species based on a multi-level feature selection approach. Figshare https://figshare.com/collections/An_optimal_set_of_features_for_predicting_type_IV_secretion_system_effector_proteins_for_a_subset_of_species_based_on_a_multi-level_feature_selection_approach/4094450 (2018).
Zhang Y, Guan J, Li C, et al. DeepSecE: a deep-learning-based Framework for multiclass Prediction of secreted Proteins in gram-negative bacteria. Figshare https://figshare.com/articles/software/DeepSecE/23489021?file=41197619 (2023).
Chen T, Wang X, Chu Y, et al. T4SE-XGB: interpretable sequence-based prediction of type IV secreted effectors using eXtreme gradient boosting algorithm. figshare https://figshare.com/collections/T4SE-XGB_Interpretable_Sequence-Based_Prediction_of_Type_IV_Secreted_Effectors_Using_eXtreme_Gradient_Boosting_Algorithm/5131205 (2020).
Zhang Y, Zhang Y, Xiong Y, Wang H, Deng Z, Song J, et al. T4SEfinder: a bioinformatics tool for genome-scale prediction of bacterial type IV secreted effectors using pre-trained protein language model. Brief Bioinform. 2022;23(1):bbab420.
Hu Y, Wang Y, Hu X, Chao H, Li S, Ni Q, et al. T4SEpp: A pipeline integrating protein language models to predict bacterial type IV secreted effectors. Comput Struct Biotechnol J. 2024;23:801–12.
Tang X, Luo L, Wang S. TSE-ARF: An adaptive prediction method of effectors across secretion system types. Anal Biochem. 2024;686: 115407.
Liu B, Wu H, Chou K-C. Pse-in-One 2.0: an improved package of web servers for generating various modes of pseudo components of DNA, RNA, and protein sequences. Natural science. 2017;9(04):67.
Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci. 2021;118(15): e2016239118.
Chen Z, Zhao P, Li F, Leier A, Marquez-Lago TT, Wang Y, et al. iFeature: a python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics. 2018;34(14):2499–502.
Chen Z, Zhao P, Li F, Marquez-Lago TT, Leier A, Revote J, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform. 2020;21(3):1047–57.
Song N, Dong R, Pu Y, Wang E, Xu J, Guo F. Pmf-cpi: assessing drug selectivity with a pretrained multi-functional model for compound-protein interactions. J Cheminf. 2023;15(1):97.
Popescu M-C, Balas VE, Perescu-Popescu L, Mastorakis N. Multilayer perceptron and neural networks. WSEAS Transactions on Circuits and Systems. 2009;8(7):579–88.
Jakkula V. Tutorial on support vector machine (svm). School of EECS, Washington State University. 2006;37(2.5):3.
Huang S, Cai N, Pacheco PP, Narrandes S, Wang Y, Xu W. Applications of support vector machine (SVM) learning in cancer genomics. Cancer Genomics Proteomics. 2018;15(1):41–51.
Yang X, Niu Z, Liu Y, Song B, Lu W, Zeng L, et al. Modality-DTA: Multimodality fusion strategy for drug–target affinity prediction. IEEE/ACM Trans Comput Biol Bioinf. 2023;20(2):1200–10.
Esna Ashari Z, Brayton KA, Broschat SL. Prediction of T4SS effector proteins for Anaplasma phagocytophilum using OPT4e, a new software tool. figshare https://figshare.com/articles/dataset/Data_Sheet_1_Prediction_of_T4SS_Effector_Proteins_for_Anaplasma_phagocytophilum_Using_OPT4e_A_New_Software_Tool_FASTA/8306882?file=15564524 (2019).
Bi D, Liu L, Tai C, Deng Z, Rajakumar K, Ou H-Y. SecReT4: a web-based bacterial type IV secretion system resource. Nucleic Acids Res. 2013;41(D1):D660–5.
An Y, Wang J, Li C, Revote J, Zhang Y, Naderer T, et al. SecretEPDB: a comprehensive web-based resource for secreted effector proteins of the bacterial types III, IV and VI secretion systems. Sci Rep. 2017;7(1):41031.
Wang Y, Wei X, Bao H, Liu S-L. Prediction of bacterial type IV secreted effectors by C-terminal features. BMC Genomics. 2014;15:1–14.
Wang J, Li J, Hou Y, Dai W, Xie R, Marquez-Lago TT, et al. BastionHub: a universal platform for integrating and analyzing substrates secreted by Gram-negative bacteria. Nucleic Acids Res. 2021;49(D1):D651–9.
Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150.
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9.
Zou Q, Lin G, Jiang XP, Liu XR, Zeng XX. Sequence clustering in bioinformatics: an empirical study. Brief Bioinform. 2020;21(1):1–10.
Wang J, Li J, Yang B, Xie R, Marquez-Lago TT, Leier A, et al. Bastion3: a two-layer ensemble predictor of type III secreted effectors. Bioinformatics. 2019;35(12):2017–28.
Xue L, Tang B, Chen W, Luo J. DeepT3: deep convolutional neural networks accurately identify Gram-negative bacterial type III secreted effectors using the N-terminal sequence. Bioinformatics. 2019;35(12):2051–7.
Li J, Yao Y, Xu HH, Hao L, Deng Z, Rajakumar K, et al. SecReT6: a web-based resource for type VI secretion systems found in bacteria. Environ Microbiol. 2015;17(7):2196–202.
Wang J, Yang B, Leier A, Marquez-Lago TT, Hayashida M, Rocker A, et al. Bastion6: a bioinformatics approach for accurate prediction of type VI secreted effectors. Bioinformatics. 2018;34(15):2546–55.
Zhu W, Yuan SS, Li J, Huang CB, Lin H, Liao B. A First Computational Frame for Recognizing Heparin-Binding Protein. Diagnostics (Basel). 2023;13(14):2465.
Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80.
Chen J, Zou Q, Li J. DeepM6ASeq-EL: Prediction of Human N6-Methyladenosine (m6A) Sites with LSTM and Ensemble Learning. Front Comp Sci. 2022;16(2): 162302.
Lv H, Dao FY, Guan ZX, Yang H, Li YW, Lin H. Deep-Kcr: accurate detection of lysine crotonylation sites using deep learning method. Brief Bioinform. 2021;22(4):bbaa255.
Hasan MAM, Nasser M, Ahmad S, Molla KI. Feature selection for intrusion detection using random forest. J Inf Secur. 2016;7(3):129–40.
Zou X, Ren L, Cai P, Zhang Y, Ding H, Deng K, et al. Accurately identifying hemagglutinin using sequence information and machine learning methods. Front Med (Lausanne). 2023;10:1281880.
Loshchilov I, Hutter F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. 2017.
Smith LN. Cyclical learning rates for training neural networks. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE; 2017.
Zhu H, Hao H, Yu L. Identifying disease-related microbes based on multi-scale variational graph autoencoder embedding Wasserstein distance. BMC Biol. 2023;21(1):294.
Zulfiqar H, Guo Z, Ahmad RM, Ahmed Z, Cai P, Chen X, et al. Deep-STP: a deep learning-based approach to predict snake toxin proteins by using word embeddings. Front Med. 2024;10:1291352.
Acknowledgements
We would like to express our gratitude to all participants involved in this study. Additionally, the authors sincerely thank the three anonymous reviewers for their valuable feedback, which greatly contributed to enhancing the quality and presentation of this paper.
Funding
This work was supported by the National Science and Technology Major Project (2022ZD0117700) and the National Natural Science Foundation of China (grants 62102063 and 62303355).
Author information
Contributions
In this work, Fengming Ni and Quan Zou are the initiators of this project and the main authors of the paper. Jing Li made significant contributions to the design, execution, and model training of the project, and is also a main author of the paper. Shida He made significant contributions during the training process of the model. Jian Zhang and Feng Zhang participated in the deployment of the work and provided additional insights.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Our data and models are available at https://github.com/lijingtju/T4Seeker.git.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, J., He, S., Zhang, J. et al. T4Seeker: a hybrid model for type IV secretion effectors identification. BMC Biol 22, 259 (2024). https://doi.org/10.1186/s12915-024-02064-z
DOI: https://doi.org/10.1186/s12915-024-02064-z