T4Seeker: a hybrid model for type IV secretion effectors identification

Li, Jing; He, Shida; Zhang, Jian; Zhang, Feng; Zou, Quan; Ni, Fengming

doi:10.1186/s12915-024-02064-z

Research
Open access
Published: 14 November 2024

BMC Biology volume 22, Article number: 259 (2024)

Abstract

Background

The type IV secretion system is widely present in various bacteria, such as Salmonella, Escherichia coli, and Helicobacter pylori. These bacteria use the type IV secretion system to secrete type IV secretion effectors, infect host cells, and disrupt or modulate the communication pathways. In this study, type III and type VI secretion effectors were used as negative samples to train a robust model.

Results

The area under the curve (AUC) values of T4Seeker on the validation and independent test sets were 0.947 and 0.970, respectively, demonstrating the strong predictive capacity and robustness of T4Seeker. After comparing T4Seeker with classic and state-of-the-art T4SE identification models, we found that T4Seeker, which combines traditional features with large language model features, had a higher predictive ability.

Conclusion

The T4Seeker model proposed in this study demonstrates superior performance in T4SE prediction. By integrating features at multiple levels, it achieves higher predictive accuracy and strong generalization capability, providing an effective tool for future T4SE research.

Background

Type IV secretion effectors (T4SEs) are proteins released by pathogenic bacteria into host cells via the type IV secretion system (T4SS), which plays a critical role in the interactions between pathogens and hosts [1, 2]. Once inside the host cells, T4SEs may alter host signaling pathways, suppress immune responses, and facilitate the invasion and survival of bacteria within the host cells. T4SEs can affect various biological processes in host cells, including signaling pathways, gene expression, and organelle function, thereby utilizing host cell resources to promote bacterial survival, reproduction, and transmission. Some bacteria, such as Helicobacter pylori, invade host cells and cause diseases by releasing T4SE proteins [3]. In addition, T4SEs can disrupt host cell signaling pathways and immune responses, leading to disease [4]. T4SEs help bacteria evade detection by the host’s immune system and attack it, thus enabling better survival [5, 6]. By disrupting these normal cellular functions, T4SE proteins alter the physiological state of host cells. In severe cases, T4SEs may cause inflammatory responses in host cells [7]. In summary, T4SEs may have adverse effects on host cells and organisms, contributing to disease progression.

Accurate identification of T4SEs can help deepen understanding of the molecular mechanisms of bacteria-host interactions, thereby revealing the molecular basis of bacterial pathogenesis [8, 9]. Traditionally, T4SE data are generated and validated through laboratory experiments. Laboratory methods for determining whether a protein is a T4SE are more accurate but are time- and resource-intensive, and they require expensive reagents and equipment, resulting in relatively high costs. In contrast, machine learning-based T4SE identification tools can process and analyze large amounts of data in a relatively short time, enabling the rapid identification of T4SE proteins. Machine learning tools can also be automated, thereby reducing the need for extensive human resources and materials [10,11,12,13,14,15,16]. Importantly, machine learning methods are scalable, and with the continuous improvement and optimization of data [17,18,19,20,21,22], models can be continually updated to achieve more accurate T4SE identification. More importantly, combining traditional laboratory methods with machine learning approaches can expedite research on T4SEs: researchers can use machine learning tools to predict potential T4SEs and then validate these candidates in the laboratory. Computational predictions can thus significantly narrow the scope of laboratory validation and save time and resources.

To date, the published T4SE identification tools include DeepT3_4 [23], Bastion4 [24], iT4SE-EP [25], OPT4e [26], DeepSecE [27], T4SE-XGB [28], T4SEfinder [29], T4SEpp [30], and TSE-ARF [31]. DeepT3_4 [23], which integrates recurrent and deep neural networks, accurately classifies type III and type IV secreted effectors by utilizing amino acid character dictionaries and sequence-based features extracted from the effector proteins. Bastion4 [24] trained a T4SE predictor using six machine learning models and 10 selected features, with ensemble models enhancing the predictive performance. OPT4e [26] employs a statistical approach to select the best features for predicting T4SS effector proteins. DeepSecE [27] uses a pre-trained protein language model and a transformer, with the potential to identify disease-associated proteins across bacterial genomes. T4SE-XGB [28] uses the XGBoost algorithm to identify T4SEs based on protein sequence features, with feature interpretation performed using the SHAP method. T4SEfinder [29] uses a pre-trained language model of protein sequences to classify T4SEs. T4SEpp [30] employs full-length embedding features from six pre-trained protein language models to train classifiers for predicting T4SEs, and integrates three modules: a homolog search for known T4SEs, machine learning fine-tuning with signal sequence data, and the utilization of top-performing pre-trained protein language models. The TSE-ARF study [31] proposed two new feature descriptors, fused them with universal features to form a 290-dimensional feature vector, and employed the TSE-ARF model for classification, adapting its parameters to different secretion effectors. By integrating the data from these studies, we trained T4Seeker, which offers several advantages.

1. The results demonstrated that T4Seeker exhibited robustness and generalization ability, with an area under the curve (AUC) of 0.947 on the cross-validation test set and 0.970 on the independent dataset.
2. Analysis of feature extraction at different levels revealed that distance-based residue (DR) [32], evolutionary scale modeling (ESM) [33], and long short-term memory (LSTM) features not only perform well individually but also complement one another. The fusion of DR, ESM, and LSTM features effectively identified T4SEs.
3. By employing type III secretion effectors (T3SEs) and type VI secretion effectors (T6SEs) as negative samples, a powerful T4SE identification model was trained.

Results

Model development

To construct an effective predictive model for T4SEs, we extracted a comprehensive array of features spanning multiple levels of protein sequence representation. These features fell into distinct groups: features based on amino acid composition [32, 34,35,36], features based on composition and distribution [34, 35], and evolutionary scale modeling features [33]. To ensure that the selected features were discriminative and conducive to robust model performance, a meticulous screening process was employed. Considering the effectiveness of the multi-layer perceptron (MLP) [37] in handling high-dimensional feature spaces and its capability to discern complex patterns within data, we chose the MLP framework to identify T4SEs [38, 39]. Feature subsets from different levels were individually evaluated using the MLP, with an emphasis on optimizing the AUC. DR [32], ESM-average [33], ESM-flatten-1024, and LSTM, each with an AUC exceeding 0.90, were subsequently retained for further analysis (Fig. 1).

Feature fusion is a form of ensemble learning that involves integrating information from different feature sets to obtain a more comprehensive and accurate representation for model training [40]. By combining multiple features, the model can use the complementarity of each feature to enhance predictive performance and generalization capability, and it captures the characteristics and patterns of the data from a more comprehensive perspective. In addition, using multiple features can reduce the overreliance of models on a single feature, thereby reducing the risk of overfitting. Therefore, different features were fused to train better-performing models. We trained the MLP models by combining DR and LSTM with either ESM-average or ESM-flatten-1024. The fusion of DR, ESM-average, and LSTM resulted in an average AUC of 0.938 on a fivefold cross-validation (cv) test, whereas the fusion of DR, ESM-flatten-1024, and LSTM resulted in an average AUC of 0.947. Therefore, we chose LSTM + ESM-flatten-1024 + DR as the final model, named T4Seeker.
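As a minimal sketch of the fusion step, assuming the feature blocks are simply concatenated per protein (Table 2 refers to "concatenated features"), the idea can be written as follows. The function name `fuse_features` and the toy vector sizes are hypothetical and chosen only for illustration; they are not the dimensions used by T4Seeker.

```python
from typing import Dict, List, Sequence


def fuse_features(blocks: Dict[str, Sequence[float]], order: List[str]) -> List[float]:
    """Concatenate per-protein feature blocks in a fixed, explicit order."""
    fused: List[float] = []
    for name in order:
        fused.extend(blocks[name])
    return fused


# Toy feature blocks for one protein (hypothetical sizes and values).
protein = {
    "LSTM": [0.1, 0.2, 0.3],
    "ESM-flatten-1024": [0.4, 0.5, 0.6, 0.7],
    "DR": [0.8, 0.9],
}
fused = fuse_features(protein, ["LSTM", "ESM-flatten-1024", "DR"])
print(len(fused))  # 3 + 4 + 2 = 9 values in the fused vector
```

Keeping the block order fixed matters because the downstream MLP associates each input position with one specific feature.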

Model performance and validation

Performance of individual features

To evaluate the feature subsets within the amino acid composition feature group, metrics including specificity (SP), precision, recall, accuracy (ACC), F1-score, Matthews correlation coefficient (MCC), and AUC were computed. Averaged across all features in the amino acid composition feature group, the models yielded an SP of 0.807, precision of 0.802, recall of 0.818, ACC of 0.813, F1-score of 0.811, MCC of 0.626, and AUC of 0.868 on the fivefold cv test (Table 1). Notably, the DR-based MLP model surpassed these group averages by 0.068, 0.067, 0.07, 0.068, 0.069, 0.019, and 0.079, respectively, on the fivefold cv test (Table 1). In the composition and transition groups, the AUC values for both CTDC and CTDD were below 0.90. Additionally, the ESM-average, ESM-flatten-1024, and LSTM features exhibited high average AUC values of 0.922, 0.908, and 0.906 on the fivefold cv test, respectively (Table 1). In summary, among the single-feature models, DR, ESM-average, ESM-flatten-1024, and LSTM achieved an AUC exceeding 0.9.

Table 1 Performance of single features of MLP on fivefold cross-validation test

Full size table

Combining feature performance

We integrated DR and LSTM separately with ESM-average and ESM-flatten-1024 to train the models. For the fivefold cv test, the LSTM + ESM-average + DR model achieved an average AUC of 0.924. In comparison, LSTM + ESM-flatten-1024 + DR (T4Seeker) achieved an AUC of 0.941 for the fivefold cv test. The SP, precision, recall, ACC, F1-score, MCC, and AUC of T4Seeker were 0.005, 0.007, 0.016, 0.011, 0.014, 0.032, and 0.009 higher than those of the LSTM + ESM-average + DR model on the fivefold cv test, respectively. To demonstrate the performance of T4Seeker further, Table 3 presents the results on the independent test set. On the independent test set, T4Seeker achieved an SP, precision, recall, ACC, F1-score, MCC, and AUC of 0.944, 0.945, 0.92, 0.932, 0.932, 0.864, and 0.970, respectively. These values were 0.122, 0.103, 0.018, 0.069, 0.061, 0.137, and 0.029 higher than those of the LSTM + ESM-average + DR model, respectively. The best-performing single-feature model achieved an SP of 0.858, precision of 0.866, recall of 0.858, ACC of 0.858, F1-score of 0.858, MCC of 0.717, and AUC of 0.928 on the independent test set. These values were lower than those of T4Seeker by 0.086, 0.079, 0.063, 0.074, 0.075, 0.147, and 0.042, respectively. In conclusion, the performance of T4Seeker on both the validation and independent test sets demonstrated its strong predictive ability, robustness, and generalization capability, providing a solid foundation for biological research.

The proposed T4Seeker outperforms other methods

To highlight the superiority of T4Seeker in classifying T4SEs and non-T4SEs, T4Seeker was compared with four published mainstream T4SE identification models (Bastion4 [24], T4SEpp [30], DeepSecEbd [27], and T4SEfinder [29]) on the independent test set. Among the four models, DeepSecEbd performed the best (Table 2). The SP, precision, ACC, F1-score, MCC, and AUC of DeepSecEbd were higher than those of the remaining three models by 0.212, 0.153, 0.089, 0.065, 0.172, and 0.091 on the independent test set. Furthermore, although DeepSecEbd had a higher SP and precision, the recall, ACC, F1-score, MCC, and AUC of T4Seeker were higher than those of DeepSecEbd by 0.072, 0.023, 0.028, 0.039, and 0.06, respectively.

Table 2 Performance of concatenated features and ablation studies on validation dataset

Full size table

In addition, Fig. 2A shows the t-test comparison between T4Seeker and Bastion4, T4SEpp, DeepSecEbd, and T4SEfinder. T4Seeker exhibits statistically significant performance improvements over Bastion4, T4SEpp, and T4SEfinder, with p-values less than 0.05, indicating that the observed performance differences between T4Seeker and these models are unlikely to be due to chance. The comparison between T4Seeker and DeepSecEbd yielded a p-value greater than 0.05 (p-value = 0.1492), which suggests no statistically significant difference in performance; however, the T-statistic is positive (T-statistic = 1.65), indicating that, on average, T4Seeker’s performance metrics are higher than those of DeepSecEbd. Therefore, while T4Seeker demonstrates clear performance superiority over Bastion4, T4SEpp, and T4SEfinder, it shows only a trend toward better performance compared with DeepSecEbd. Taken together, the overall performance of T4Seeker is higher than that of Bastion4, T4SEpp, and T4SEfinder and, on average, higher than that of DeepSecEbd. The advantages of T4Seeker are mainly reflected in its generalization ability, that is, its performance on new, unseen data, which is an important indicator for evaluating a model. This generalization helps T4Seeker remain effective across a variety of testing conditions and enhances its reliability in real-world applications where unpredictability is common.
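The text does not state which t-test variant was used. As a hedged illustration only, the sketch below assumes a paired t-test over matched metric values of two models and computes the T-statistic; a positive value means the first model is higher on average. The function name `paired_t_statistic` and the metric values are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Return t = mean(d) / (sd(d) / sqrt(n)) for paired differences d = a - b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Hypothetical metric values for two models (illustration only).
model_a = [0.90, 0.91, 0.88, 0.89, 0.90]
model_b = [0.88, 0.90, 0.85, 0.89, 0.87]
t = paired_t_statistic(model_a, model_b)
print(f"t = {t:.3f}, degrees of freedom = {len(model_a) - 1}")
```

The p-value would then be read from a t-distribution with n - 1 degrees of freedom, for example with a statistics package.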

Ablation studies

In this section, we present a comprehensive study to validate the effectiveness of different components of T4Seeker.

LSTM feature: First, the LSTM feature was removed, and the model was denoted as ESM-flatten-1024 + DR. Compared to T4Seeker, the ESM-flatten-1024 + DR model exhibited a decrease in the AUC of 0.025 and 0.026 in the fivefold cv test and independent test sets, respectively (Tables 2 and 3).

Table 3 Performance on an independent test set of different models

Full size table

ESM-flatten-1024 feature: To assess the significance of the ESM-flatten-1024 feature in constructing the T4Seeker model, the ESM-flatten-1024 feature was removed (LSTM + DR). As Table 2 shows, compared with T4Seeker, LSTM + DR exhibited decreases in the SP, precision, recall, ACC, F1-score, MCC, and AUC on the fivefold cv test of 0.016, 0.01, 0.049, 0.034, 0.032, 0.031, and 0.026, respectively. On the independent test set, the SP, precision, recall, ACC, F1-score, MCC, and AUC of LSTM + DR were lower than those of T4Seeker by 0.037, 0.039, 0.063, 0.051, 0.051, 0.1, and 0.026, respectively.

DR feature: The DR feature was eliminated, and the model was denoted as LSTM + ESM-flatten-1024. The performance of LSTM + ESM-flatten-1024 on the fivefold cv test is presented in Table 2. The SP, precision, recall, ACC, F1-score, MCC, and AUC of LSTM + ESM-flatten-1024 were lower than those of T4Seeker by 0.004, 0.004, 0.095, 0.056, 0.006, 0.099, and 0.018, respectively. Similarly, for the independent test set (Table 3), the metrics of LSTM + ESM-flatten-1024 were lower than those of T4Seeker by 0.037, 0.038, 0.054, 0.046, 0.046, 0.091, and 0.009, respectively.

Evaluation of T4Seeker with additional baseline models

To further demonstrate the superiority of T4Seeker, we added baseline models, including one-vs-rest, k-nearest neighbors, multinomial naive Bayes, random forest, logistic regression, extra trees, and support vector machine. In the fivefold cross-validation test, T4Seeker showed higher mean values for SP, precision, recall, ACC, F1-score, MCC, and AUC than the baseline models, with differences of 0.057, 0.053, 0.037, 0.048, 0.048, 0.155, and 0.067, respectively (Fig. 2B). On the independent test set, the baseline models’ average SP, precision, recall, ACC, F1-score, MCC, and AUC were lower than those of T4Seeker by 0.104, 0.102, 0.104, 0.104, 0.103, 0.207, and 0.082, respectively (Fig. 2C). This comparison highlights the superior performance of T4Seeker over a diverse set of baseline models, further substantiating its efficacy and robustness.

In summary, the LSTM, ESM-flatten-1024, and DR features played positive roles in training the T4Seeker model. Their combination provides rich information, thereby assisting the model in making more accurate predictions and classifications.

Discussion

Our study represents a comprehensive effort to advance the classification of T4SEs through the integration of diverse feature sets. By leveraging multiple types of features, including amino acid composition, pseudo-amino acid composition, autocorrelation, and grouped amino acid composition, as well as by incorporating evolutionary information extracted using ESM and deep features extracted using LSTM, we aimed to enhance the discriminatory power and generalization ability of T4SE classification models. The fusion of the DR, ESM, and LSTM features effectively addresses the limitations of a single feature and significantly improves the predictive accuracy of the model. By integrating diverse feature sets and leveraging deep learning techniques, our study offers insight into the exploration of T4SE functions and the development of targeted interventions to combat infectious diseases. Furthermore, T4Seeker has surpassed existing classification models for T4SEs.

However, there are still some limitations. Despite integrating data from multiple sources, the dataset may not capture all variability present in natural T4SEs, potentially limiting the model’s generalizability. Future work will focus on expanding the dataset to include more diverse bacterial species and newly discovered T4SEs to enhance model robustness. By addressing these limitations, we aim to enhance the accuracy and applicability of T4Seeker. T4Seeker can be integrated into existing workflows as a tool for preliminary screening of potential type IV secretion system effectors and for guiding experimental validation. T4Seeker can aid in identifying potential virulence factors crucial for understanding pathogenic mechanisms and informing targeted therapeutic development. By focusing on T4SEs, T4Seeker can also assist in comparative studies, helping researchers explore the presence and variation of these effectors across different bacterial species. In addition, future studies should focus on expanding the scope of T4SE classification to include additional bacterial species and virulence factors, as well as on exploring novel computational approaches for improved model interpretability and biological relevance.

Conclusions

T4Seeker improves model prediction accuracy and generalization by integrating multi-level features, including amino acid composition, ESM evolutionary information, and deep LSTM features. T4Seeker can serve as a tool for the preliminary screening of T4SEs, enhancing our understanding of pathogenic mechanisms. The current dataset lacks diversity. Including more bacterial species and newly discovered T4SEs in the future will improve T4Seeker’s robustness and applicability.

Methods

Data description

We collected T4SE data from the published and accessible literature, including DeepT3_4 [23], Bastion4 [24], iT4SE-EP [25], OPT4e [41], DeepSecE [27], T4SE-XGB [28], T4SEfinder [29], T4SEpp [30], and TSE-ARF [31]. Among them, T4SEfinder, T4SEpp, and DeepSecE are all derived from the SecReT4 database [42]. DeepT3_4 comes from the SecretEPDB database [43]. iT4SE-EP, T4SE-XGB, and T4SE-ARF not only use T4SEs from the SecReT4 database but also integrate T4SEs from ten types of bacteria retrieved from the literature, including Agrobacterium, Anaplasma, Bartonella, Bordetella, Brucella, Coxiella, Ehrlichia, Helicobacter, Legionella, and Ochrobactrum [44]. The T4SEs in OPT4e [41] are from known effectors of four Gram-negative bacterial pathogens in the classes Alphaproteobacteria and Gammaproteobacteria. The T4SEs in T4SE-ARF come from the BastionHub database [45]. The T4SE sequences from these different sources were integrated, resulting in 5,473 samples. Considering the potential redundancy and sequence similarity among the samples, we performed CD-HIT [46,47,48] clustering with a threshold of 80% to reduce redundancy and enhance dataset diversity. Following CD-HIT clustering, the number of T4SE samples was reduced to 730 (Fig. 3).

To train a high-performance model, we also collected data on T3SEs and T6SEs from datasets including DeepT3_4 [23], Bastion3 [49], DeepT3-Keras [50], TSE-ARF [31], SecReT6 [51], and Bastion6 [52]. Among them, the T3SEs in Bastion3 and DeepT3-Keras are from NCBI Protein (Tatusova et al., 2016) and UniProt (UniProt Consortium, 2019). The T6SEs in Bastion6 are from the SecretEPDB database [43]. The integration of T3SEs and T6SEs from different databases resulted in 2,301 T3SE samples and 670 T6SE samples. Subsequent CD-HIT clustering reduced the numbers of T3SE and T6SE samples to 730 and 309, respectively. We then combined the T3SE and T6SE datasets into a negative-sample dataset (Fig. 3). To ensure a balanced dataset, which is crucial for training robust models, we randomly selected negative samples from this negative-sample dataset. The random selection did not involve any additional filtering or criteria, which helps keep the model training unbiased. Subsequently, the dataset was divided into training, validation, and test sets, with 70% allocated for training, 15% for validation, and 15% for testing.
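The balancing and splitting procedure can be sketched as follows. This is a minimal illustration under stated assumptions: negatives are drawn uniformly at random until they match the number of positives, the combined set is shuffled once, and the 70/15/15 split is not stratified. The function name `balance_and_split`, the seed, and the placeholder identifiers are hypothetical; only the sample counts (730 T4SEs, 730 T3SEs, 309 T6SEs) come from the text.

```python
import random
from typing import List, Tuple

Sample = Tuple[str, int]  # (sequence identifier, label: 1 = T4SE, 0 = non-T4SE)


def balance_and_split(
    positives: List[str], negatives: List[str], seed: int = 0
) -> Tuple[List[Sample], List[Sample], List[Sample]]:
    """Randomly pick as many negatives as positives, then split 70/15/15."""
    rng = random.Random(seed)
    sampled_neg = rng.sample(negatives, k=len(positives))  # no extra filtering
    data = [(p, 1) for p in positives] + [(n, 0) for n in sampled_neg]
    rng.shuffle(data)
    n_train = len(data) * 70 // 100  # integer arithmetic avoids float rounding
    n_val = len(data) * 15 // 100
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    return train, val, test


# Placeholder identifiers standing in for the clustered sequences.
positives = [f"T4SE_{i}" for i in range(730)]
negatives = [f"T3SE_{i}" for i in range(730)] + [f"T6SE_{i}" for i in range(309)]
train, val, test = balance_and_split(positives, negatives)
print(len(train), len(val), len(test))  # 1022 219 219
```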

Feature representation

Proteins are fundamental components of living organisms and perform a diverse range of functions that are critical to cellular processes [53]. The identification and characterization of T4SEs play a pivotal role in understanding host–pathogen interactions. In this study, we extracted features from four levels.

Long short-term memory features

Long short-term memory [54,55,56] networks are effectively used for feature extraction, especially in handling data that require sequence dependency. Through unique gating mechanisms, LSTM can process amino acid sequence data and remember important information while forgetting the irrelevant information. This allows the extraction of key features from the entire sequence, thereby capturing the temporal dynamics of the data suitable for a variety of downstream tasks. As shown in Fig. 1B, the base layer is an embedding layer that transforms each amino acid in the sequence (such as AA₁, AA₂, AA₃, …, AA_n) into an embedding vector. These embedding vectors are then fed into the upper LSTM units. In this study, a bidirectional LSTM was used. One direction of the LSTM (blue) processes the sequence from left to right, whereas the other (orange) processes it from right to left. This captures the dependencies in both directions of the sequence, thereby extracting richer feature information.

Evolutionary scale modeling features

Evolutionary scale modeling [33] features can not only elucidate the evolutionary dynamics of individual residues but also provide invaluable insights into the broader evolutionary context of proteins, empowering downstream tasks. Leveraging the ESM, each residue was represented by a comprehensive 320-dimensional feature vector. This methodological framework enables the extraction of nuanced evolutionary information. Initially, the ESM features were computed across the entire dataset to provide a comprehensive insight into the evolutionary patterns. We applied different pooling methods to the ESM features to obtain a better model, including average pooling over the sequence dimensions (ESM-average), max pooling over the sequence dimensions (ESM-max), retaining only fixed-length (33 × 320) residue information and flattening (ESM-flatten), and performing feature selection on ESM-flatten (ESM-flatten-1024).

ESM-average

The ESM generated a 320-dimensional vector for each residue. For an amino acid sequence of length N, the resulting feature matrix therefore has dimensions N × 320. Subsequently, average pooling is performed along the sequence length dimension, meaning that each of the 320 columns is averaged over the N residues. Thus, each amino acid sequence produced a 1 × 320 feature vector (Fig. 1B).

ESM-max

Similar to ESM-average, ESM-max uses max pooling along the column dimension to determine the maximum values (Fig. 1B).

ESM-flatten

Because the shortest sequence in the dataset contains 33 residues, only the features generated by the first 33 residues of each sequence were retained and flattened, giving a fixed-length vector of 33 × 320 = 10,560 values (Fig. 1B).
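The three pooling variants described above can be sketched on a per-residue embedding matrix. This is a minimal illustration that uses random numbers in place of real ESM embeddings; the function names `esm_average`, `esm_max`, and `esm_flatten` are hypothetical, while the 320-dimensional embedding and the 33-residue cutoff follow the text.

```python
import random
from typing import List

Matrix = List[List[float]]  # N residues x D embedding dimensions


def esm_average(emb: Matrix) -> List[float]:
    """Average each embedding column over all residues (1 x D vector)."""
    n = len(emb)
    return [sum(col) / n for col in zip(*emb)]


def esm_max(emb: Matrix) -> List[float]:
    """Take the maximum of each embedding column over all residues."""
    return [max(col) for col in zip(*emb)]


def esm_flatten(emb: Matrix, keep: int = 33) -> List[float]:
    """Keep the first `keep` residues and flatten them row by row."""
    if len(emb) < keep:
        raise ValueError("sequence is shorter than the retained length")
    return [value for row in emb[:keep] for value in row]


# Random stand-in for the embedding of a 50-residue protein.
rng = random.Random(0)
N, D = 50, 320
emb = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(N)]
print(len(esm_average(emb)), len(esm_max(emb)), len(esm_flatten(emb)))  # 320 320 10560
```

Average and max pooling give a vector whose length is independent of N, whereas flattening preserves residue order at the cost of discarding everything after residue 33.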

ESM-flatten-1024

Feature selection was performed on ESM-flatten using random forests [57], reducing the feature dimensions to 1024 (Fig. 1B).
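The text does not describe how the random forest output was turned into a 1024-dimensional subset. One plausible reading, shown here as an assumption, is to rank the flattened features by an importance score and keep the top k. The sketch takes the importance scores as given (in practice they would come from a fitted random forest); the function names `top_k_indices` and `select_columns` and the toy scores are hypothetical.

```python
from typing import List, Sequence


def top_k_indices(importances: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k largest importance scores, in column order."""
    if not 0 < k <= len(importances):
        raise ValueError("k must be between 1 and the number of features")
    ranked = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    return sorted(ranked[:k])


def select_columns(vector: Sequence[float], indices: Sequence[int]) -> List[float]:
    """Reduce one feature vector to the selected columns."""
    return [vector[i] for i in indices]


# Toy example with 5 features reduced to 3 (the paper reduces 10,560 to 1024).
importances = [0.05, 0.30, 0.10, 0.40, 0.15]
keep = top_k_indices(importances, k=3)
reduced = select_columns([10.0, 11.0, 12.0, 13.0, 14.0], keep)
print(keep, reduced)  # [1, 3, 4] [11.0, 13.0, 14.0]
```

The selected indices should be computed on the training data only and then reused unchanged for the validation and test sets.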

Traditional features based on amino acid composition

Amino acid composition quantifies the frequency of individual amino acid residues in a protein sequence, providing insights into its primary structure. To represent the composition of T4SEs, we utilized several amino acid composition-based features. Amino acid composition (AAC) [34]: quantifies the frequency of individual amino acid residues in the protein sequence. Dipeptide composition (DPC) [34]: represents the distribution of dipeptides within a protein sequence. DR [32]: the distance-based residue feature captures the spatial relationships between amino acid residues within a protein structure. By calculating the distances between each pair of residues, this feature provides insight into the three-dimensional arrangement of the protein. This method involves generating a distance matrix, where each element represents the Euclidean distance between the alpha carbon atoms of two residues. Key statistical measures such as the mean, maximum, minimum, and standard deviation of these distances are then extracted to serve as features. K-mer (k = 2) [32, 58]: represents the frequency distribution of subsequences of length two within the protein sequence. Grouped amino acid composition (GAAC) [34]: represents the frequency distribution of grouped amino acids based on their physicochemical properties in the protein sequence. Grouped dipeptide composition (GDPC) [34]: represents the frequency distribution of grouped dipeptides in the protein sequence. Pseudo-position-specific amino acid composition general (PC-PseAAC-General) [32]: a generalized form of pseudo-amino acid composition with sequence-order effects. Split amino acid composition general (SC-PseAAC-General) [32]: a generalized pseudo-amino acid composition derived from the split amino acid composition (Fig. 1B).
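The two simplest descriptors in this group, AAC and DPC, can be sketched directly from their definitions. This is a minimal illustration over the 20 standard amino acids; residues outside that alphabet are ignored, which is an assumption, and the function names and example sequence are hypothetical.

```python
from itertools import product
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq: str) -> Dict[str, float]:
    """Amino acid composition: relative frequency of each residue (20 values)."""
    counts = {a: 0 for a in AMINO_ACIDS}
    for residue in seq.upper():
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    return {a: (c / total if total else 0.0) for a, c in counts.items()}


def dpc(seq: str) -> Dict[str, float]:
    """Dipeptide composition: relative frequency of each adjacent pair (400 values)."""
    seq = seq.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    return {p: (c / total if total else 0.0) for p, c in counts.items()}


seq = "MKKLLAMK"  # toy sequence, not a real effector
print(aac(seq)["K"])   # 3 of 8 residues are K -> 0.375
print(dpc(seq)["MK"])  # "MK" occurs in 2 of 7 adjacent pairs
```

With k = 2, the k-mer descriptor counts the same adjacent pairs as DPC.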

Traditional features based on composition and distribution

Composition (CTDC) [34]: CTDC quantifies the frequency distribution of amino acids in a protein sequence. Distribution (CTDD) [34]: CTDD captures the positional distribution of amino acids along the protein sequence. Transition (CTDT) [34]: CTDT quantifies amino acid transition patterns along the protein sequence (Fig. 1B).
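As a hedged sketch of the composition and transition descriptors, the code below maps each residue to one of three physicochemical classes and then counts class frequencies (CTDC) and class-to-class transitions between adjacent residues (CTDT). The three-class hydrophobicity grouping used here is one commonly used partition and is an assumption; the text does not state which properties or groupings were used, and CTDD is not covered.

```python
from typing import Dict

# Assumed three-class hydrophobicity grouping (polar, neutral, hydrophobic).
GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
RESIDUE_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def encode(seq: str) -> str:
    """Replace each standard residue by its class label; skip unknown residues."""
    return "".join(RESIDUE_TO_GROUP[r] for r in seq.upper() if r in RESIDUE_TO_GROUP)


def ctdc(seq: str) -> Dict[str, float]:
    """Composition: fraction of residues that fall into each class."""
    code = encode(seq)
    n = len(code)
    return {g: (code.count(g) / n if n else 0.0) for g in GROUPS}


def ctdt(seq: str) -> Dict[str, float]:
    """Transition: fraction of adjacent pairs that switch between two classes."""
    code = encode(seq)
    pairs = [code[i:i + 2] for i in range(len(code) - 1)]
    n = len(pairs)
    result = {}
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        hits = sum(1 for p in pairs if p in (a + b, b + a))
        result[a + b] = hits / n if n else 0.0
    return result


seq = "MKKLLAMK"  # toy sequence, encoded as "31133231"
print(ctdc(seq))
print(ctdt(seq))
```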

Feature analysis

During the feature analysis phase, the performances of the different feature groups were assessed using the MLP. Within the amino acid composition group, the DR [32] feature exhibited the highest AUC, indicating its discriminatory power for distinguishing T4SEs (Table 1). In the composition and transition groups, CTDC [34] and CTDT [34] emerged as the top-performing features with the highest AUC; however, their AUC remained below 0.9 (Table 1). DR therefore exhibited the highest AUC among all the traditional features. In addition, ESM [33] and LSTM showed promising potential for distinguishing T4SEs from non-T4SEs. The AUC values on the validation datasets for ESM-flatten-1024, ESM-average, and LSTM were 0.908, 0.922, and 0.906, respectively (Table 1 and Fig. 4).

Additionally, we used SHAP to analyze the traditional features, as depicted in Fig. 4B–G. Due to space constraints, only the six traditional features with the highest total mean absolute SHAP value (MASV) are displayed. The results revealed that DR obtained the highest total MASV, 1.36. This aligns with DR achieving the highest AUC among the traditional features. In other words, DR demonstrates a strong capability for distinguishing T4SEs.

Hyperparameter settings

In this study, we employed specific hyperparameters and optimization strategies to train the model. The AdamW optimizer [59], a variant of Adam, was used with a learning rate of 1e-05 (0.00001) and a weight decay of 0.001. To dynamically adjust the learning rate, we utilized a Cyclic Learning Rate (CLR) scheduler [60] with a base learning rate of 0.00001, a maximum learning rate of 0.001, and a step size of up to 30. Training ran for 50 epochs with a batch size of 64. The structure of the multi-layer perceptron (MLP) was 2756-1280-2, that is, a 2756-dimensional input, one hidden layer of 1280 units, and two output classes. We chose the cross-entropy loss function as the criterion, as it is suitable for classification tasks and effectively measures the difference between the predicted and actual class distributions. These configurations helped improve the training efficiency and performance of the model. We also conducted experiments with different learning rates, optimizers, schedulers, and MLP structures, as shown in Fig. 5. The learning rate of 0.00001, the AdamW optimizer, the MLP with a 1280-unit hidden layer, and the Cyclic Learning Rate scheduler yielded the best performance, achieving a high AUC. These results support the hyperparameters and optimization strategies chosen for T4Seeker.
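To make the scheduler settings concrete, the following sketch computes a cyclic learning rate from the stated base and maximum values. It assumes the basic triangular policy and that the step size of 30 is the number of scheduler steps in the rising half of each cycle; the text does not state the policy mode, so both points are assumptions, and the function name `triangular_clr` is hypothetical.

```python
def triangular_clr(
    iteration: int,
    base_lr: float = 1e-5,
    max_lr: float = 1e-3,
    step_size: int = 30,
) -> float:
    """Learning rate at a given step under a triangular cyclic schedule."""
    position = iteration % (2 * step_size)  # position inside the current cycle
    if position <= step_size:
        fraction = position / step_size  # rising half: base -> max
    else:
        fraction = (2 * step_size - position) / step_size  # falling half: max -> base
    return base_lr + (max_lr - base_lr) * fraction


# The rate climbs from 1e-5 to 1e-3 over 30 steps, then falls back over the next 30.
for step in (0, 15, 30, 45, 60):
    print(step, f"{triangular_clr(step):.6f}")
```

Under these assumptions one full cycle spans 60 scheduler steps, after which the pattern repeats.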

Evaluation metrics

Specificity (SP) represents the ability of T4Seeker to correctly identify non-T4SEs. Precision is the proportion of true T4SEs among the samples predicted as T4SEs by T4Seeker. Recall is the proportion of actual T4SEs that T4Seeker predicts as T4SEs. Accuracy (ACC) is the proportion of samples correctly predicted by T4Seeker among all samples. The F1-score is the harmonic mean of precision and recall. The AUC is the area under the receiver operating characteristic curve [61, 62]. An AUC value closer to 1 indicates better model performance.

$$\mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} \tag{1}$$

$$\mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \tag{2}$$

$$\mathrm{ACC}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}} \tag{3}$$

$$\mathrm{F1\text{-}score}=\frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} \tag{4}$$

where TP refers to the number of samples correctly predicted as T4SEs by T4Seeker, TN represents the number of samples correctly predicted as non-T4SEs by T4Seeker, FP indicates the number of non-T4SEs incorrectly predicted as T4SEs, and FN is the number of T4SEs incorrectly predicted as non-T4SEs by T4Seeker.
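The metrics can be computed from the four confusion-matrix counts as in the sketch below. Specificity and MCC are reported in the tables but not defined by an equation in the text, so their standard definitions are assumed here: SP = TN / (TN + FP) and MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The AUC is computed with the rank-based formulation, that is, the probability that a randomly chosen T4SE receives a higher score than a randomly chosen non-T4SE. All counts and scores are toy values, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def _ratio(num: float, den: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """SP, precision, recall, ACC, F1-score, and MCC from confusion-matrix counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SP": _ratio(tn, tn + fp),
        "precision": precision,
        "recall": recall,
        "ACC": _ratio(tp + tn, tp + tn + fp + fn),
        "F1": _ratio(2 * precision * recall, precision + recall),
        "MCC": _ratio(tp * tn - fp * fn, mcc_den),
    }


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)  # toy counts
print({name: round(value, 3) for name, value in metrics.items()})
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 3 of 4 pairs ranked correctly -> 0.75
```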

Data availability

No datasets were generated or analysed during the current study.

Abbreviations

AAC: Amino acid composition
DPC: Dipeptide composition
DR: Distance-based residue
PC-PseAAC-General: Pseudo position-specific amino acid composition general
SC-PseAAC-General: Split amino acid composition general
CTDC: Composition
CTDD: Distribution
GAAC: Grouped amino acid composition
GDPC: Grouped dipeptide composition
LSTM: Long short-term memory
ESM: Evolutionary scale modeling features
ESM-average: Average pooling over the sequence dimension
ESM-max: Max pooling over the sequence dimension
ESM-flatten: Retaining only fixed-length (33*320) residue information and flattening
ESM-flatten-1024: Performing feature selection on ESM-flatten
MLP: Multilayer perceptron
SP: Specificity
ACC: Accuracy
AUC: Area under the curve
OVR: One-vs-rest
KNB: K-nearest neighbors
MNB: Multinomial naive Bayes
RF: Random forest
LR: Logistic regression
ETAs: Extra trees
SVM: Support vector machine

References

Dehio C. Infection-associated type IV secretion systems of Bartonella and their diverse roles in host cell interaction. Cell Microbiol. 2008;10(8):1591–8.
Voth DE, Broederdorf LJ, Graham JG. Bacterial Type IV secretion systems: versatile virulence machines. Future Microbiol. 2012;7(2):241–57.
Dielen AS, Badaoui S, Candresse T, German-Retana S. The ubiquitin/26S proteasome system in plant–pathogen interactions: a never-ending hide-and-seek game. Mol Plant Pathol. 2010;11(2):293–308.
Rajendhran J. Genomic insights into Brucella. Infect Genet Evol. 2021;87: 104635.
Finlay BB, McFadden G. Anti-immunology: evasion of the host immune system by bacterial and viral pathogens. Cell. 2006;124(4):767–82.
Hornef MW, Wick MJ, Rhen M, Normark S. Bacterial strategies for overcoming host innate and adaptive immune responses. Nat Immunol. 2002;3(11):1033–40.
Sankarasubramanian J, Vishnu US, Dinakaran V, Sridhar J, Gunasekaran P, Rajendhran J. Computational prediction of secretion systems and secretomes of Brucella: identification of novel type IV effectors and their interaction with the host. Mol BioSyst. 2016;12(1):178–90.
Agany DD, Pietri JE, Gnimpieba EZ. Assessment of vector-host-pathogen relationships using data mining and machine learning. Comput Struct Biotechnol J. 2020;18:1704–21.
Wei L, Zhou C, Chen H, Song J, Su R. ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics. 2018;34(23):4007–16.
Xing EP, Ho Q, Xie P, Wei D. Strategies and principles of distributed machine learning on big data. Engineering. 2016;2(2):179–95.
Wang Y, Zhai Y, Ding Y, Zou Q. SBSM-Pro: Support Bio-sequence Machine for Proteins. arXiv preprint arXiv:2308.10275. 2023.
Sinha D, Dasmandal T, Yeasin M, Mishra DC, Rai A, Archak S. EpiSemble: A Novel Ensemble-based Machine-learning Framework for Prediction of DNA N6-methyladenine Sites Using Hybrid Features Selection Approach for Crops. Curr Bioinform. 2023;18(7):587–97.
Li X, Ma S, Xu J, Tang J, He S, Guo F. TranSiam: Aggregating multi-modal visual features with locality for medical image segmentation. Expert Systems Appl. 2024;237:121574.
Wei L, Xing P, Shi G, Ji Z, Zou Q. Fast Prediction of Protein Methylation Sites Using a Sequence-Based Feature Selection Technique. IEEE/ACM Trans Comput Biol Bioinf. 2019;16(4):1264–73.
Li H, Pang Y, Liu B. BioSeq-BLM: a platform for analyzing DNA, RNA, and protein sequences based on biological language models. Nucleic Acids Res. 2021;49(22): e129.
Li H, Liu B. BioSeq-Diabolo: Biological sequence similarity analysis using Diabolo. PLoS Comput Biol. 2023;19(6): e1011214.
Kashyap H, Ahmed HA, Hoque N, Roy S, Bhattacharyya DK. Big data analytics in bioinformatics: A machine learning perspective. arXiv preprint arXiv:1506.05101. 2015.
Sparks ER, Talwalkar A, Haas D, Franklin MJ, Jordan MI, Kraska T. Automating model search for large scale machine learning. In: Proceedings of the Sixth ACM Symposium on Cloud Computing; 2015.
Wang L, Ding Y, Tiwari P, Xu J, Lu W, Muhammad K, et al. A deep multiple kernel learning-based higher-order fuzzy inference system for identifying DNA N4-methylcytosine sites. Inf Sci. 2023;630:40–52.
Guo X, Huang Z, Ju F, Zhao C, Yu L. Highly Accurate Estimation of Cell Type Abundance in Bulk Tissues Based on Single-Cell Reference and Domain Adaptive Matching. Advanced Science. 2024;11(7):2306329.
Jiang Y, Wang R, Feng J, Jin J, Liang S, Li Z, et al. Explainable deep hypergraph learning modeling the peptide secondary structure prediction. Advanced Science. 2023;10(11):2206151.
Liu B, Gao X, Zhang H. BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Res. 2019;47(20):e127.
Yu L, Liu F, Li Y, Luo J, Jing R. DeepT3_4: a hybrid deep neural network model for the distinction between bacterial type III and IV secreted effectors. figshare https://figshare.com/articles/dataset/Data_Sheet_1_DeepT3_4_A_Hybrid_Deep_Neural_Network_Model_for_the_Distinction_Between_Bacterial_Type_III_and_IV_Secreted_Effectors_docx/13619651?file=26139221 (2021).
Wang J, Yang B, An Y, Marquez-Lago T, Leier A, Wilksch J, et al. Systematic analysis and prediction of type IV secreted effector proteins by machine learning approaches. Brief Bioinform. 2019;20(3):931–51.
Han H, Ding C, Cheng X, Sang X, Liu T. iT4SE-EP: Accurate identification of bacterial type IV secreted effectors by exploring evolutionary features from two PSI-BLAST profiles. Molecules. 2021;26(9):2487.
Esna Ashari Z, Dasgupta N, Brayton KA, Broschat SL. An optimal set of features for predicting type IV secretion system effector proteins for a subset of species based on a multi-level feature selection approach. Figshare https://figshare.com/collections/An_optimal_set_of_features_for_predicting_type_IV_secretion_system_effector_proteins_for_a_subset_of_species_based_on_a_multi-level_feature_selection_approach/4094450 (2018).
Zhang Y, Guan J, Li C, et al. DeepSecE: a deep-learning-based Framework for multiclass Prediction of secreted Proteins in gram-negative bacteria. Figshare https://figshare.com/articles/software/DeepSecE/23489021?file=41197619 (2023).
Chen T, Wang X, Chu Y, et al. T4SE-XGB: interpretable sequence-based prediction of type IV secreted effectors using eXtreme gradient boosting algorithm. figshare https://figshare.com/collections/T4SE-XGB_Interpretable_Sequence-Based_Prediction_of_Type_IV_Secreted_Effectors_Using_eXtreme_Gradient_Boosting_Algorithm/5131205 (2020).
Zhang Y, Zhang Y, Xiong Y, Wang H, Deng Z, Song J, et al. T4SEfinder: a bioinformatics tool for genome-scale prediction of bacterial type IV secreted effectors using pre-trained protein language model. Briefings in Bioinformatics. 2022;23(1):bbab420.
Hu Y, Wang Y, Hu X, Chao H, Li S, Ni Q, et al. T4SEpp: A pipeline integrating protein language models to predict bacterial type IV secreted effectors. Comput Struct Biotechnol J. 2024;23:801–12.
Tang X, Luo L, Wang S. TSE-ARF: An adaptive prediction method of effectors across secretion system types. Anal Biochem. 2024;686: 115407.
Liu B, Wu H, Chou K-C. Pse-in-One 2.0: an improved package of web servers for generating various modes of pseudo components of DNA, RNA, and protein sequences. Natural science. 2017;9(04):67.
Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci. 2021;118(15): e2016239118.
Chen Z, Zhao P, Li F, Leier A, Marquez-Lago TT, Wang Y, et al. iFeature: a python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics. 2018;34(14):2499–502.
Chen Z, Zhao P, Li F, Marquez-Lago TT, Leier A, Revote J, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform. 2020;21(3):1047–57.
Song N, Dong R, Pu Y, Wang E, Xu J, Guo F. Pmf-cpi: assessing drug selectivity with a pretrained multi-functional model for compound-protein interactions. J Cheminf. 2023;15(1):97.
Popescu M-C, Balas VE, Perescu-Popescu L, Mastorakis N. Multilayer perceptron and neural networks. WSEAS Transactions on Circuits and Systems. 2009;8(7):579–88.
Jakkula V. Tutorial on support vector machine (svm). School of EECS, Washington State University. 2006;37(2.5):3.
Huang S, Cai N, Pacheco PP, Narrandes S, Wang Y, Xu W. Applications of support vector machine (SVM) learning in cancer genomics. Cancer Genomics Proteomics. 2018;15(1):41–51.
Yang X, Niu Z, Liu Y, Song B, Lu W, Zeng L, et al. Modality-DTA: Multimodality fusion strategy for drug–target affinity prediction. IEEE/ACM Trans Comput Biol Bioinf. 2023;20(2):1200–10.
Esna Ashari Z, Brayton KA, Broschat SL. Prediction of T4SS effector proteins for Anaplasma phagocytophilum using OPT4e, a new software tool. figshare https://figshare.com/articles/dataset/Data_Sheet_1_Prediction_of_T4SS_Effector_Proteins_for_Anaplasma_phagocytophilum_Using_OPT4e_A_New_Software_Tool_FASTA/8306882?file=15564524 (2019).
Bi D, Liu L, Tai C, Deng Z, Rajakumar K, Ou H-Y. SecReT4: a web-based bacterial type IV secretion system resource. Nucleic Acids Res. 2013;41(D1):D660–5.
An Y, Wang J, Li C, Revote J, Zhang Y, Naderer T, et al. SecretEPDB: a comprehensive web-based resource for secreted effector proteins of the bacterial types III, IV and VI secretion systems. Sci Rep. 2017;7(1):41031.
Wang Y, Wei X, Bao H, Liu S-L. Prediction of bacterial type IV secreted effectors by C-terminal features. BMC Genomics. 2014;15:1–14.
Wang J, Li J, Hou Y, Dai W, Xie R, Marquez-Lago TT, et al. BastionHub: a universal platform for integrating and analyzing substrates secreted by Gram-negative bacteria. Nucleic Acids Res. 2021;49(D1):D651–9.
Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150.
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9.
Zou Q, Lin G, Jiang XP, Liu XR, Zeng XX. Sequence clustering in bioinformatics: an empirical study. Brief Bioinform. 2020;21(1):1–10.
Wang J, Li J, Yang B, Xie R, Marquez-Lago TT, Leier A, et al. Bastion3: a two-layer ensemble predictor of type III secreted effectors. Bioinformatics. 2019;35(12):2017–28.
Xue L, Tang B, Chen W, Luo J. DeepT3: deep convolutional neural networks accurately identify Gram-negative bacterial type III secreted effectors using the N-terminal sequence. Bioinformatics. 2019;35(12):2051–7.
Li J, Yao Y, Xu HH, Hao L, Deng Z, Rajakumar K, et al. SecReT6: a web-based resource for type VI secretion systems found in bacteria. Environ Microbiol. 2015;17(7):2196–202.
Wang J, Yang B, Leier A, Marquez-Lago TT, Hayashida M, Rocker A, et al. Bastion6: a bioinformatics approach for accurate prediction of type VI secreted effectors. Bioinformatics. 2018;34(15):2546–55.
Zhu W, Yuan SS, Li J, Huang CB, Lin H, Liao B. A First Computational Frame for Recognizing Heparin-Binding Protein. Diagnostics (Basel). 2023;13(14):2465.
Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80.
Chen J, Zou Q, Li J. DeepM6ASeq-EL: Prediction of Human N6-Methyladenosine (m6A) Sites with LSTM and Ensemble Learning. Front Comp Sci. 2022;16(2): 162302.
Lv H, Dao FY, Guan ZX, Yang H, Li YW, Lin H. Deep-Kcr: accurate detection of lysine crotonylation sites using deep learning method. Brief Bioinfor. 2021;22(4):bbaa255.
Hasan MAM, Nasser M, Ahmad S, Molla KI. Feature selection for intrusion detection using random forest. J Inf Secur. 2016;7(3):129–40.
Zou X, Ren L, Cai P, Zhang Y, Ding H, Deng K, et al. Accurately identifying hemagglutinin using sequence information and machine learning methods. Front Med (Lausanne). 2023;10:1281880.
Loshchilov I, Hutter F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. 2017.
Smith LN. Cyclical learning rates for training neural networks. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV); 2017. IEEE.
Zhu H, Hao H, Yu L. Identifying disease-related microbes based on multi-scale variational graph autoencoder embedding Wasserstein distance. BMC Biol. 2023;21(1):294.
Zulfiqar H, Guo Z, Ahmad RM, Ahmed Z, Cai P, Chen X, et al. Deep-STP: a deep learning-based approach to predict snake toxin proteins by using word embeddings. Front Med. 2024;10:1291352.

Acknowledgements

We would like to express our gratitude to all participants involved in this study. Additionally, the authors sincerely thank the three anonymous reviewers for their valuable feedback, which greatly contributed to enhancing the quality and presentation of this paper.

Funding

This work was supported by the National Science and Technology Major Project (2022ZD0117700) and National Natural Science Foundation of China (grants 62102063, 62303355).

Author information

Jing Li and Shida He contributed equally to this work.

Authors and Affiliations

Department of Microbiology, University of Hong Kong, Hong Kong, China
Jing Li
Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, 1 Chengdian Road, Quzhou, Zhejiang, China
Jing Li, Shida He, Jian Zhang & Quan Zou
School of Biomedical Sciences, University of Hong Kong, Hong Kong, China
Jing Li
The Joint Innovation Center for Engineering in Medicine, Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou People’s Hospital, Quzhou, 324000, China
Shida He & Feng Zhang
Department of Respiratory and Critical Care, Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou, 324000, China
Shida He & Feng Zhang
Department of Gastroenterology, The First Hospital of Jilin University, Changchun, 130021, China
Fengming Ni

Authors

Jing Li
Shida He
Jian Zhang
Feng Zhang
Quan Zou
Fengming Ni

Contributions

In this work, Fengming Ni and Quan Zou are the initiators of this project and the main authors of the paper. Jing Li made significant contributions to the design, execution, and model training of the project, and is also a main author of the paper. Shida He made significant contributions during the training process of the model. Jian Zhang and Feng Zhang participated in the deployment of the work and provided additional insights.

Corresponding author

Correspondence to Fengming Ni.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Our data and models can be accessed at https://github.com/lijingtju/T4Seeker.git.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Li, J., He, S., Zhang, J. et al. T4Seeker: a hybrid model for type IV secretion effectors identification. BMC Biol 22, 259 (2024). https://doi.org/10.1186/s12915-024-02064-z

Received: 26 June 2024
Accepted: 06 November 2024
Published: 14 November 2024
DOI: https://doi.org/10.1186/s12915-024-02064-z
