A hybrid machine learning model with attention mechanism and multidimensional multivariate feature coding for essential gene prediction

Yan, Wu; Yu, Fu; Tan, Li; Mengshan, Li; Xiaojun, Xie; Weihong, Zhou; Sheng, Sheng; Jun, Wang; Fu-an, Wu

doi:10.1186/s12915-025-02209-8

Research article
Open access
Published: 24 April 2025


Wu Yan^1,2,3,
Fu Yu⁴,
Li Tan¹,
Li Mengshan ORCID: orcid.org/0000-0003-4175-088X^1,4,
Xie Xiaojun¹,
Zhou Weihong^2,3,
Sheng Sheng^2,3,
Wang Jun^2,3 &
…
Wu Fu-an^2,3

BMC Biology volume 23, Article number: 108 (2025)


Abstract

Background

Essential genes are crucial for the development, inheritance, and survival of species. The exploration of these genes can unravel the complex mechanisms and fundamental life processes and identify potential therapeutic targets for various diseases. Therefore, the identification of essential genes is significant. Machine learning has become the mainstream approach for essential gene prediction. However, some key challenges in machine learning need to be addressed, such as the extraction of genetic features, the impact of imbalanced data, and the cross-species generalization ability.

Results

Here, we proposed a hybrid machine learning model based on graph convolutional neural networks (GCN) and bi-directional long short-term memory (Bi-LSTM) with attention mechanism and multidimensional multivariate feature coding for essential gene prediction, called EGP Hybrid-ML. In the model, GCN was used to extract feature encoding information from the visualized graphics of gene sequences and the attention mechanism was combined with Bi-LSTM to assess the importance of each feature in gene sequences and analyze the influences of different feature encoding methods and data imbalance. Additionally, the cross-species predictive performance of the model was evaluated through cross-validation. The results indicated that the sensitivity of the EGP Hybrid-ML model reached 0.9122.

Conclusions

This model demonstrated superior predictive performance and strong generalization capability compared with other models. The EGP Hybrid-ML model proposed in this paper has broad application prospects in bioinformatics, cheminformatics, and pharmaceutical informatics. The code, architectures, parameters, and datasets of the proposed model are freely available on GitHub (https://github.com/gnnumsli/EGP-Hybrid-ML).


Background

Essential genes largely determine the survival and development of a species, as they constitute the minimal gene set required for its survival. These genes also play a central role in supporting core functions and maintaining essential biological processes, growth, and development [1,2,3,4]. The study of essential genes can reveal the fundamental processes and key functions of life and guide the search for potential drug targets.

Research methods for essential genes encompass experimental and computational approaches. Experimental methods, which are considered the gold standard for identifying essential genes, include single-gene knockout, conditional gene knockout, transposon mutagenesis, and RNA interference [5,6,7,8,9]. However, these methods are time-consuming, resource-intensive, and technically difficult. Furthermore, experimental results may be affected by experimental noise, limited experimental conditions, and flaws in experimental design. Therefore, computational methods, including approaches based on sequences, structures, networks, and machine learning, have become a crucial auxiliary tool for predicting essential genes [10,11,12,13,14,15]. Each class has its own limitations. Sequence-based methods rely on gene or protein sequence information and may overlook other crucial biological context. Structure-based methods are constrained by the availability of structural data, which renders them ineffective for new genes or proteins whose structures have not yet been resolved. Network-based methods depend heavily on experimental data, and the noise and imbalance in such data reduce prediction accuracy.

Machine learning and deep learning methods are now widely applied to essential gene prediction [16]. Campos et al. [17, 18] developed multiple models for essential gene identification by combining machine learning with feature engineering techniques and achieved commendable results. Aromolaran et al. [19, 20] proposed a machine learning approach for essential gene identification based on sequence and functional features. Chen et al. [21] developed another model for essential gene identification based on the Z curve method. Other researchers have proposed various computational methods for essential gene identification and achieved satisfactory results. However, the predictive performance of machine learning and deep learning largely depends on several key factors, including high-quality training data, effective feature engineering, appropriate model architecture, and training strategies. Consequently, hybrid machine learning approaches that integrate multiple models have been used to improve generalization ability and address challenges such as overfitting [22,23,24,25]. The performance of hybrid machine learning models mainly relies on the quality and distribution of the input data as well as the selection of hyperparameters.

Currently, hybrid models for essential gene prediction face several challenges, including the influence of gene feature encoding methods, the data imbalance between essential and non-essential genes, and the difficulty of cross-species validation [26,27,28,29,30,31,32,33]. To explore the effects of these factors on prediction performance, this study proposes a hybrid machine learning model for essential gene prediction, called EGP Hybrid-ML, based on graph convolutional neural networks (GCN) and bidirectional long short-term memory networks (Bi-LSTM) with an attention mechanism and multidimensional multivariate feature coding [34,35,36,37,38]. EGP Hybrid-ML extracts feature encodings from graphs of gene sequences, which allows the influence of different encoding methods on prediction performance to be analyzed. We also investigated the effects of the attention mechanism and Bi-LSTM on the model in terms of gene features and data imbalance [39,40,41]. Furthermore, the predictive performance of the model was evaluated through cross-species cross-validation experiments to verify its generalization capability.

The contributions of the study are summarized as follows. Firstly, multidimensional multivariate feature coding integrated the temporal data and sequence information of genes, which facilitates a more comprehensive understanding of gene characteristics. Secondly, Bi-LSTM was combined with an attention mechanism to balance important gene features and timely responses, which enhanced feature extraction efficiency and improved the prediction performance of the model. Thirdly, EGP Hybrid-ML, as a hybrid machine learning model, has broad application prospects in bioinformatics, evolutionary biology, and genetics. The study offers decision support in various fields including engineering, computer science, chemistry, and biology.

Results

Data collection

The experimental data were obtained from DEG, a public Database of Essential Genes (http://tubic.tju.edu.cn/deg/) [42,43,44]. As a comprehensive essential gene database, DEG contains the majority of currently available essential gene information. The study involved 31 species from three major biological domains: Archaea, Bacteria, and Eukaryotes. To reduce data redundancy and eliminate homology bias, the experimental data were processed with the CD-HIT algorithm at a sequence identity threshold of 20%. A total of 87,782 essential and non-essential gene entries were collated to construct the dataset. For each species, the essential and non-essential genes were partitioned into a training set (70% of the data) and a testing set (30% of the data); detailed information on the experimental data is provided in the appendix.
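The 70%/30% partition described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout `(species, label, sequence)`, the function name `split_per_species`, and the fixed random seed are all assumptions, and the CD-HIT redundancy filtering is presumed to have been done beforehand.

```python
import random
from collections import defaultdict


def split_per_species(records, train_frac=0.7, seed=0):
    """Split records into train/test within each (species, label) group.

    `records` is a list of (species, label, sequence) tuples; this layout is a
    hypothetical one chosen for the sketch.
    """
    groups = defaultdict(list)
    for rec in records:
        species, label, _seq = rec
        groups[(species, label)].append(rec)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for key in sorted(groups):
        items = groups[key][:]
        rng.shuffle(items)
        cut = round(train_frac * len(items))  # 70% of each group goes to training
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test


# Toy records: species name and sequences are placeholders, not DEG data.
records = [("sp1", 1, f"E{i}") for i in range(10)]
records += [("sp1", 0, f"N{i}") for i in range(20)]
train, test = split_per_species(records)
print(len(train), len(test))
```

Splitting inside each species and class keeps the essential to non-essential ratio of every species roughly the same in both subsets, which matters when some species are strongly imbalanced.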

Results

The EGP Hybrid-ML model was built on a machine with 16 GB of RAM and an Intel(R) Core(TM) i7-12700F processor running the 64-bit Windows 10 operating system. The model was trained with the Adam optimizer at a learning rate of 0.001 for 1000 epochs. Experimental results were reported as the mean of six repeated trials to ensure statistical reliability. The experimental parameters, code, architectures, and datasets of the proposed model are freely available on GitHub (https://github.com/gnnumsli/EGP-Hybrid-ML).

The EGP Hybrid-ML model was trained with a training set containing 31 different species. The predictive results of EGP Hybrid-ML model obtained with the training and testing sets are illustrated in Fig. 1.

Figure 1a shows a wide distribution of sensitivity (SN), specificity (SP), and accuracy (ACC) in the training set. The wide distribution range reflects the diversity of the training data and also shows that the model can minimize errors during training. In the testing set, the distribution ranges of these metrics were narrower, indicating stability on new data. In particular, the distribution of SN highlighted the robustness of the model in identifying positive cases, and the concentrated distribution of SP indicated effective discrimination of negative cases (Fig. 1b). The Matthews correlation coefficient (MCC) and the area under the curve (AUC) indicate that EGP Hybrid-ML performs well on both the training and testing sets. This agreement between the training and testing sets supports the reliability and precision of the model across datasets and its adaptability to various datasets. The differences between the mean and median values of the evaluation metrics on both the training and testing sets were negligible (Fig. 1b and d). The trained model performed consistently on new testing data, which indicates stable generalization capability. The radar charts of the five evaluation metrics in the training and testing sets are shown in Fig. 2.

The radar charts also revealed that the EGP Hybrid-ML model performed better on the training set than on the testing set. This is expected: during training, the model adjusts its parameters to fit the training data and minimize the overall error, so its predictions on those same data improve. Conversely, the testing set, composed of new and unfamiliar data, serves as a measure of the model’s predictive capability. The average values of various evaluation metrics for the EGP Hybrid-ML model in the testing set are shown in Table 1.

Table 1 Predictive results of the EGP Hybrid-ML model obtained with the testing datasets


Across multiple species, the EGP Hybrid-ML model achieved an average ACC of 0.9, indicating strong predictive capability. The ACC peaked at nearly 0.98. Even where the MCC value was as low as 0.7976, the ACC remained close to 0.9. The limited variation in performance across species indicated that the model was robust on diverse biological datasets and generalized well, so it can be applied to diverse biological prediction tasks.

Discussion

Discussion of essential genes and non-essential genes

To evaluate the performance of EGP Hybrid-ML model in the study, the model was used to predict essential genes in 31 species and non-essential genes in 23 species. The performance in the testing set has been quantified in terms of trends and statistical distributions (Fig. 3).

In the testing set, the performance metrics of non-essential genes were generally higher than those of essential genes, indicating that the model had a more pronounced advantage in predicting non-essential genes (Fig. 3a and d). This phenomenon may be attributed to two factors: the nature of essential genes and the larger amount of non-essential gene data available for training. The average values of the performance metrics for essential and non-essential genes indicated that the predictions for non-essential genes were more concentrated and exhibited less variability, showing high consistency in recognizing their features (Fig. 3b to f).

Overall, the EGP Hybrid-ML model showcased notable accuracy and consistency in identifying non-essential genes, while also demonstrating satisfactory performance in predicting essential genes. These characteristics highlighted the exceptional robustness of the model in gene prediction tasks. The differences reflect the model’s specific adaptability to different gene categories.

Discussion of different feature encodings

Among various feature encoding strategies, the multidimensional multivariate feature coding method was adopted in the study. To explore the influence of gene feature encoding methods on the prediction performance of the model, a series of independent experiments were conducted with multiple encoding strategies, including six individual encoding methods and three combined encoding approaches. The experimental results verified the advantage of multidimensional multivariate feature encoding. The distributions and statistical analysis of the evaluation metrics obtained with different encoding methods are presented in Fig. 4.

With the Code 9 encoding sequence, the EGP Hybrid-ML model performed well and all performance metrics reached their highest levels (Fig. 4a and b). In comparison, the model that used only time series encoding performed poorly, whereas the combination of time series encoding and gene feature encoding performed better. The model’s exceptional performance with the Code 9 encoding sequence was primarily attributed to the integration of information from both time series and gene sequences, which gave the model a comprehensive feature representation and high gene prediction accuracy. Similarly, the model’s average predictive performance with the Code 9 encoding sequence significantly surpassed that obtained with the other encoding methods, further confirming the substantial impact of the encoding method on the predictive performance of the model (Fig. 4c to e).

Analysis of data imbalance

To verify the dependence of the network-based model on experimental data and further understand the impact of data distribution on its predictive performance, the effects of the proportions of the essential and non-essential genes of different species on the model were analyzed (Fig. 5). Notably, the datasets of eight species entirely consisted of essential genes, revealing the significant data imbalance.

To assess the effect of data imbalance on the performance of EGP Hybrid-ML model, the statistical analysis of the performance was performed under different proportions of essential and non-essential genes. Figures 6 and 7 show the predictive performance of the model for essential genes and non-essential genes.

As the proportion of essential genes increased, the performance metrics showed an increasing trend. When this proportion exceeded 20%, the performance metrics were generally above 0.85, reflecting the overall predictive performance of the model. When the proportion of essential genes was close to 50%, the performance metrics reached a high point. However, even though most metrics increased with the proportion of essential genes, the MCC value remained relatively low and occasionally declined. This reflects the particular sensitivity of MCC to data imbalance: in high-proportion scenarios, the model tended to predict essential genes and overlook the fewer non-essential genes. Although the other metrics showed positive trends, MCC is important because it indicates whether the model predicts both essential and non-essential genes well.
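The behaviour of MCC under imbalance can be illustrated with a small worked example. The sketch below assumes the standard confusion-matrix definitions of sensitivity (SN), specificity (SP), accuracy (ACC), and the Matthews correlation coefficient (MCC); the function name `confusion_metrics` and the counts are illustrative only and are not results from the study.

```python
import math


def confusion_metrics(tp, fn, fp, tn):
    """Compute SN, SP, ACC and MCC from confusion-matrix counts."""
    def safe_div(num, den):
        # Return 0.0 instead of raising when a denominator is zero.
        return num / den if den else 0.0

    sn = safe_div(tp, tp + fn)                   # true positive rate
    sp = safe_div(tn, tn + fp)                   # true negative rate
    acc = safe_div(tp + tn, tp + fn + fp + tn)   # overall fraction correct
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, denom)
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical imbalanced case: 90 essential genes, 10 non-essential genes,
# and a classifier that labels almost everything as essential.
m = confusion_metrics(tp=90, fn=0, fp=8, tn=2)
print({k: round(v, 4) for k, v in m.items()})
# {'SN': 1.0, 'SP': 0.2, 'ACC': 0.92, 'MCC': 0.4286}
```

In this toy case ACC and SN look strong while MCC stays low, because MCC also penalizes the eight misclassified negatives out of ten.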

The performance of the model varied with the proportion of non-essential genes in the dataset (Fig. 7a). In particular, when the proportion of non-essential genes exceeded 70%, all evaluation metrics remained above 0.89. When the proportion of non-essential genes was below 70%, the performance metrics generally remained around 0.87, indicating that despite the challenge of data imbalance, the model maintained considerable robustness and high accuracy. In the SN distribution (Fig. 7b), as the proportion of non-essential genes increased, the range of SN values became more concentrated, indicating an overall upward trend in the ability to identify true positives. When the proportion of non-essential genes was below 70%, ACC exhibited significant fluctuations (Fig. 7c). However, the overall AUC increased with the proportion of non-essential genes, indicating that the model performed better when the proportion of non-essential genes was high (Fig. 7d).

The results showed that the performance of the model depended on the proportions of both essential and non-essential genes. The EGP Hybrid-ML model performed better when the proportion of essential genes was around 20% or the proportion of non-essential genes was around 40%. When the dataset was approximately balanced at 50% for each of the two types of genes, the performance was particularly notable, indicating that the EGP Hybrid-ML model could cope with data imbalance and maintain stable predictive accuracy across different gene distributions. The model was robust in handling complex biological information and maintained good performance under varying degrees of data imbalance.

Analysis of cross-validation under cross-species

To explore the generalizability of the model, cross-species validation experiments across multiple species were conducted in this study. We constructed a validation dataset by randomly selecting 10 species out of the 31 species. During cross-validation, one of the 10 species was chosen for training the model each time and the remaining nine species served as the testing set. The heatmaps of various evaluation metrics for EGP Hybrid-ML model across different test species are shown in Fig. 8.
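The train-on-one, test-on-others protocol can be sketched as a small loop that fills a species-by-species score matrix, the kind of table a heatmap is drawn from. Everything here is an assumption made for illustration: the helper names `cross_species_matrix`, `fit_threshold`, and `accuracy`, the one-feature threshold classifier standing in for the real model, and the toy data. The sketch also scores each model on its own training species so that the diagonal is populated.

```python
def cross_species_matrix(datasets, fit, score):
    """Train on each species in turn and score the model on every species."""
    names = sorted(datasets)
    matrix = {}
    for train_name in names:
        model = fit(datasets[train_name])
        matrix[train_name] = {t: score(model, datasets[t]) for t in names}
    return matrix


def fit_threshold(data):
    """Toy stand-in model: midpoint between the two class means."""
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, data):
    """Fraction of samples whose thresholded feature matches the label."""
    return sum((x > threshold) == (y == 1) for x, y in data) / len(data)


# Placeholder data: (feature, label) pairs per hypothetical species.
datasets = {
    "species_A": [(0.9, 1), (0.8, 1), (0.2, 0), (0.1, 0)],
    "species_B": [(0.7, 1), (0.6, 1), (0.3, 0), (0.2, 0)],
    "species_C": [(0.4, 1), (0.35, 1), (0.1, 0), (0.05, 0)],
}
matrix = cross_species_matrix(datasets, fit_threshold, accuracy)
for train_name, row in matrix.items():
    print(train_name, row)
```

Passing `fit` and `score` as callables keeps the loop independent of the model, so the same structure would apply to any classifier and any of the five metrics.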

In the SN and ACC heatmaps, the main diagonal stands out clearly. The distinct blocks along the main diagonal indicate that performance was highest when data from the same species were used for both training and testing, with metrics above 0.9. Furthermore, cross-validation between certain pairs of species (Species 1 and Species 2; Species 3 and Species 6; Species 7 and Species 10) was also robust, with the majority of metrics surpassing 0.8. Performance was lower for the other species pairs. The distributions of evaluation metrics for the EGP Hybrid-ML model in the cross-validation experiments are shown in Fig. 9.

The EGP Hybrid-ML model performed best when it was trained and validated with gene data from the same species. The performance was moderate when the data were from different genes of homologous species and lowest when the data were from different genes of non-homologous species. However, the distribution of performance indicated that the EGP Hybrid-ML model also performed well in cross-species validation. In summary, these cross-species validation experiments demonstrated that the EGP Hybrid-ML model had good generalization capability.

Comparison with other benchmark models

To evaluate the performance of the EGP Hybrid-ML model in predicting essential genes relative to other models, we selected nine recent supervised and unsupervised models for the benchmark comparison. The benchmark models used for comparison are listed in Table 2.

Table 2 Benchmark models for comparison


To ensure the fair evaluation of comparison models, we extracted 100 data sets from 31 species to form a comparison dataset. Each comparison model was independently trained and tested with the dataset. We calculated the average testing results of each model. The performance metrics of comparison models are shown in Fig. 10.

All performance metrics of EGP Hybrid-ML model exceeded 0.9, indicating its superior performance. The radar chart also clearly showed the significant advantages of the model in terms of various evaluation metrics. The statistical and distribution details of the performance metrics for comparison models are shown in Fig. 11.

The EGP Hybrid-ML model showed superior predictive performance (Fig. 11a and b). In the validation across 31 species, the EGP Hybrid-ML model performed well. The SN values were around 0.93 for 20 species and close to 0.96 for 5 species. The AUC values were around 0.9 for 12 species and 0.93 for 14 species (Fig. 11c and d). Additionally, the SP, MCC, and ACC metrics also demonstrated the superior performance of the EGP Hybrid-ML model compared with other models (Fig. 11e to g).

The performance metrics revealed that the EGP Hybrid-ML model had the best performance, with significant advantages in all performance metrics. The CPU computation time during the experiments conducted with the data from 31 species is displayed in Fig. 12.

The computation time of the ten models did not exhibit a significant trend, and the computation time for each species was generally similar. The iEsGene-ZCPseKNC model had the shortest computation time, followed by NLP-SL, DeeplyEssential, and DeepHE. The computation time of the DeepLOF, Bingo, and PreEGSRF models was comparatively longer. The average computation time of the model proposed in this paper was 16.01 s, which was acceptable.

Ablation study

To confirm the working principle and mechanism of EGP Hybrid-ML, we performed a series of ablation experiments on EGP Hybrid-ML and its variants. We systematically removed the attention mechanism, GCN, and Bi-LSTM modules to evaluate their individual contributions to the performance. To keep the comparison models complete, we replaced each removed module with a similar alternative module, which ultimately yielded four variant models. The ablation experiments were conducted with the same testing data, and the model parameters were kept as similar as possible. The statistical results were the average of six repeated experiments (Table 3).

Table 3 Results of ablation study


After any module was removed, all performance indicators decreased significantly, indicating that each module contributed to the EGP Hybrid-ML model. The variant without the GCN module had the worst experimental results, indicating that the GCN module made the greatest contribution. The contributions of the modules ranked in the following order: GCN > Attention Mechanism > Bi-LSTM. In this study, we defined the contribution rate of a module as the ratio of its contribution to the total contribution of all modules, so that the contribution rates sum to 100%. Figure 13 shows the contribution rate of each module.
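The contribution rate defined above can be sketched in a few lines. The text does not state how the "contribution" of a module is measured, so this sketch assumes it is the drop in a chosen metric when that module is removed; the function name `contribution_rates` and all scores below are hypothetical and are not the values in Table 3.

```python
def contribution_rates(full_score, ablated_scores):
    """Share of the total performance drop attributable to each module.

    Assumption: a module's contribution is the metric of the full model minus
    the metric of the variant with that module removed.
    """
    drops = {name: full_score - score for name, score in ablated_scores.items()}
    total = sum(drops.values())
    if total == 0:
        raise ValueError("no module changes the score; rates are undefined")
    return {name: drop / total for name, drop in drops.items()}


# Hypothetical metric values, chosen only to illustrate the arithmetic.
rates = contribution_rates(
    full_score=0.90,
    ablated_scores={"GCN": 0.70, "Attention": 0.80, "Bi-LSTM": 0.82},
)
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

By construction the rates sum to 100%, which matches the normalization stated in the text.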

EGP Hybrid-ML outperformed its variants (Fig. 13). The contribution rate of each module was above 18%, and that of GCN was above 50%. The ablation experiments showed that all three modules were indispensable and jointly guaranteed the performance of EGP Hybrid-ML.

The experimental results clearly demonstrated that the proposed EGP Hybrid-ML model had significant advantages in accuracy and the other evaluation metrics. These advantages were primarily attributed to three key factors. Firstly, by combining time-series encoding with gene feature encoding, the model obtained rich feature information, and this hybrid encoding strategy provided reliable data support. Secondly, GCN showed significant capability in processing genetic data and allowed in-depth exploration and extraction of various gene features. Thirdly, the combination of Bi-LSTM with the attention mechanism effectively used the long- and short-term information in gene sequences and enhanced the ability to identify key features. The attention mechanism allowed the model to focus on the most important parts of the sequences.

Conclusions

The study introduced a novel hybrid machine learning model that integrated time series encoding with gene feature encoding and combined bidirectional long short-term memory networks (Bi-LSTM) with graph convolutional neural networks (GCN) and an attention mechanism. In experiments on the prediction of essential and non-essential genes in 31 species, the model demonstrated outstanding performance in distinguishing the two gene categories. The results indicated that the model provided not only a substantial computational tool for bioinformatics but also valuable insights for multiple disciplines, including computer science, chemistry, and medical pharmacology. The experimental results were satisfactory, but many issues require further discussion. It is necessary to explore the utilization of biological big data, the interpretability of machine learning models, and the relationship between feature extraction and the crucial research questions. We will continue to research advanced machine learning algorithms and apply them to the analysis of various biological data, striving to contribute to the progress of interdisciplinary fields in biology, even if these contributions may be only small steps.

Methods

Attention mechanism

The attention mechanism involves a model dividing a subject into several parts and then allocating different levels of attention to each part based on their weights, as illustrated in Fig. 14. The model consists of two main components: encoder and decoder [51,52,53]. In the attention model, the encoder’s function is to transform incoming data into a sequence of vectors and a weight is assigned to each vector to indicate its significance. The decoder utilizes these weighted vectors generated by the encoder to produce the final output. This mechanism can enhance the overall performance of the model.

$X=({\mathcal{x}}_{1},{\mathcal{x}}_{2},{\mathcal{x}}_{3},{\mathcal{x}}_{4})$ denotes the input and $Y=({\mathcal{y}}_{1},{\mathcal{y}}_{2},{\mathcal{y}}_{3})$ denotes the output. $X$ is segmented into words ${\mathcal{x}}_{1},{\mathcal{x}}_{2},{\mathcal{x}}_{3},{\mathcal{x}}_{4}$, which are then input into the encoder. Subsequently, through the non-linear transformation $\mathcal{C}=\mathcal{f}({\mathcal{x}}_{1},{\mathcal{x}}_{2},{\mathcal{x}}_{3},{\mathcal{x}}_{4})$, the vector sequence $\mathcal{C}=({\mathcal{c}}_{1},{\mathcal{c}}_{2},{\mathcal{c}}_{3})$ is generated and input into the decoder. The outputs are ${\mathcal{y}}_{1}=\mathcal{f}({\mathcal{c}}_{1})$, ${\mathcal{y}}_{2}=\mathcal{f}({\mathcal{c}}_{2},{\mathcal{y}}_{1})$, and ${\mathcal{y}}_{3}=\mathcal{f}({\mathcal{c}}_{3},{\mathcal{y}}_{1},{\mathcal{y}}_{2})$. Different weight coefficients are automatically assigned to different ${\mathcal{c}}_{\mathcal{i}}$.
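How weights are assigned to encoded vectors can be sketched with a softmax over similarity scores. The text does not specify the scoring function, so dot-product scoring against a query vector is an assumption here, as are the names `softmax` and `attend` and the toy vectors.

```python
import math


def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoded):
    """Return attention weights and the weighted sum of the encoded vectors.

    Scores are dot products between the query and each encoded vector
    (an assumed scoring rule, chosen for simplicity).
    """
    scores = [sum(q * h for q, h in zip(query, vec)) for vec in encoded]
    weights = softmax(scores)
    dim = len(encoded[0])
    context = [
        sum(w * vec[d] for w, vec in zip(weights, encoded)) for d in range(dim)
    ]
    return weights, context


# Three toy encoder outputs; the first aligns best with the query.
encoded = [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
weights, context = attend([1.0, 0.0], encoded)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

The vector that best matches the query receives the largest weight, so it dominates the context passed on to the decoder.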

Machine learning

Bi-LSTM

Bi-LSTM consists of a forward layer and a backward layer, as illustrated in Fig. 15. It enhances learning capabilities by integrating and processing input features from both layers, which are then linked for the prediction. Therefore, the determination of the network output is more precise and the model can capture temporal dependencies and contextual information contained in sequence data [54,55,56].

The hidden layer includes three gates: forget gate, input gate, and output gate. Within the network at time $\mathcal{t}$, $\mathcal{f}(\mathcal{t})$, $\mathcal{i}(\mathcal{t})$, and $\mathcal{o}(\mathcal{t})$ denote the activation states of the forget gate, input gate, and output gate, respectively. $\widetilde{\mathcal{c}}(\mathcal{t})$ represents the outcome of feature analysis based on $\mathcal{h}\left(\mathcal{t}-1\right)$ and $\mathcal{x}(\mathcal{t})$, where $\mathcal{h}(\mathcal{t}-1)$ encompasses previous information and $\mathcal{x}(\mathcal{t})$ corresponds to the current data input. Through its gating mechanisms, LSTM can selectively update, discard, or acquire information, and optimize the retention of essential information in the cell state.
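For reference, the gates described above are usually written as follows. This is the standard LSTM formulation, given as a sketch rather than as the exact parameterization used in EGP Hybrid-ML; the weight matrices $W$, the bias vectors $b$, the sigmoid function $\sigma$, the cell state $\mathcal{c}(\mathcal{t})$, and the element-wise product $\odot$ are notation assumed here for illustration.

$$\begin{aligned}
\mathcal{f}(\mathcal{t}) &= \sigma\left(W_{f}\left[\mathcal{h}(\mathcal{t}-1),\mathcal{x}(\mathcal{t})\right]+b_{f}\right),\\
\mathcal{i}(\mathcal{t}) &= \sigma\left(W_{i}\left[\mathcal{h}(\mathcal{t}-1),\mathcal{x}(\mathcal{t})\right]+b_{i}\right),\\
\mathcal{o}(\mathcal{t}) &= \sigma\left(W_{o}\left[\mathcal{h}(\mathcal{t}-1),\mathcal{x}(\mathcal{t})\right]+b_{o}\right),\\
\widetilde{\mathcal{c}}(\mathcal{t}) &= \tanh\left(W_{c}\left[\mathcal{h}(\mathcal{t}-1),\mathcal{x}(\mathcal{t})\right]+b_{c}\right),\\
\mathcal{c}(\mathcal{t}) &= \mathcal{f}(\mathcal{t})\odot \mathcal{c}(\mathcal{t}-1)+\mathcal{i}(\mathcal{t})\odot \widetilde{\mathcal{c}}(\mathcal{t}),\\
\mathcal{h}(\mathcal{t}) &= \mathcal{o}(\mathcal{t})\odot \tanh\left(\mathcal{c}(\mathcal{t})\right).
\end{aligned}$$

In a bidirectional network, one such recurrence runs over the sequence from start to end and another from end to start, and the two hidden states at each position are typically concatenated before prediction.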

Graph convolutional neural networks

A graph convolutional neural network (GCN) is a multi-layer interconnected network model that is commonly used to learn low-dimensional representations of nodes from graph-structured data. Each layer of a GCN aggregates information from neighboring nodes and then reconstructs embeddings through edge connections to serve as the input for the next layer. A GCN updates the features of a node by aggregating information from the node itself and its adjacent nodes. This aggregation approach allows GCNs to operate directly on irregular graphs and to use the graph’s structural information efficiently [57, 58]. The architecture of this process is shown in Fig. 16.

GCN involves two steps: information aggregation and information update. Initially, each node in GCN accumulates information from its adjacent nodes. Subsequently, the node’s features are updated based on the consideration of both the gathered information and the node’s inherent characteristics. This approach enables GCNs to effectively capture the complex structures and patterns within graphs for various predictive tasks.

Information aggregation

Each node aggregates information from its neighboring nodes. The aggregation step is typically achieved through functions like summation and averaging. The process is computed as:

$${\mathcal{a}}_{\mathcal{v}}=\text{AGGREGATE}\left(\left\{{\mathcal{h}}_{\mathcal{u}}:\mathcal{u}\in \text{Neighbors}\left(\mathcal{v}\right)\right\}\right),$$

(1)

where ${\mathcal{h}}_{\mathcal{u}}$ denotes the current feature of node u and ${\mathcal{a}}_{\mathcal{v}}$ denotes the aggregated feature.

Information update

Each node updates its features based on both its current characteristics and the aggregated features of its neighboring nodes. The update process is expressed as:

$${\mathcal{h}}_{\mathcal{v}}^{\prime}=\text{UPDATE}({\mathcal{h}}_{\mathcal{v}},{\mathcal{a}}_{\mathcal{v}}),$$

(2)

where ${\mathcal{h}}_{\mathcal{v}}^{\prime}$ denotes the updated feature of node $\mathcal{v}$. For an undirected graph $\text{G}=(\mathcal{V},\mathcal{E},\mathcal{A})$ with $\mathcal{N}$ nodes, $\mathcal{V}$ is the set of nodes and $\mathcal{E}$ is the set of edges. In graph $\text{G}$, $\mathcal{A}\in {\mathcal{R}}^{\mathcal{N}\times \mathcal{N}}$ denotes the adjacency matrix and $\mathcal{D}\in {\mathcal{R}}^{\mathcal{N}\times \mathcal{N}}$ is the degree matrix with ${\mathcal{D}}_{\mathcal{i}\mathcal{i}}={\sum }_{\mathcal{j}}{\mathcal{A}}_{\mathcal{i}\mathcal{j}}$.
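The aggregation and update steps in Eqs. (1) and (2), together with the degree matrix, can be sketched on a toy graph. The choices below are assumptions for illustration only: AGGREGATE is taken as the mean over neighbors, UPDATE as an equal-weight average of the node's own feature and the aggregate, and a real GCN layer would add learned weights and a non-linearity that are omitted here.

```python
def aggregate(features, adjacency, v):
    """Mean of the neighbor features of node v (one possible AGGREGATE)."""
    neighbors = [u for u, a in enumerate(adjacency[v]) if a]
    dim = len(features[v])
    if not neighbors:
        return [0.0] * dim
    return [sum(features[u][d] for u in neighbors) / len(neighbors)
            for d in range(dim)]


def update(h_v, a_v):
    """Equal-weight mix of own and aggregated features (one possible UPDATE)."""
    return [0.5 * h + 0.5 * a for h, a in zip(h_v, a_v)]


def propagate(features, adjacency):
    """Apply one aggregation-and-update step to every node."""
    return [update(features[v], aggregate(features, adjacency, v))
            for v in range(len(features))]


def degree_matrix(adjacency):
    """Diagonal matrix with D[i][i] equal to the row sum of A."""
    n = len(adjacency)
    return [[sum(adjacency[i]) if i == j else 0 for j in range(n)]
            for i in range(n)]


# Path graph 0 - 1 - 2 with one scalar feature per node.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0], [3.0], [5.0]]
print(propagate(features, adjacency))  # [[2.0], [3.0], [4.0]]
print(degree_matrix(adjacency))        # [[1, 0, 0], [0, 2, 0], [0, 0, 1]]
```

After one step the end nodes move toward the value of their single neighbor, which is the smoothing effect that neighbor aggregation produces.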

The Laplacian operator of a graph is defined as $\mathcal{L}=\mathcal{D}-\mathcal{A}$. By performing eigen decomposition on the Laplacian matrix $\mathcal{L}$, its eigenvalues and eigenvectors can be obtained and denoted as $\mathcal{L}=\mathcal{U}\Lambda {\mathcal{U}}^{\mathcal{T}}$, where $\mathcal{U}$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues. In the spectral (frequency) domain of a graph, the convolution operation is defined as the product of the filter function $\mathcal{g}$ and the signal $\mathcal{x}$:

$$\mathcal{g}\mathcal{*}\mathcal{x}=\mathcal{U}\mathcal{g}(\Lambda ){\mathcal{U}}^{\mathcal{T}}\mathcal{x},$$

(3)

where $\mathcal{g}$ is the convolution kernel; ${\mathcal{U}}^{\mathcal{T}}$ represents the transpose of $\mathcal{U}$; $\mathcal{g}(\Lambda )$ denotes the operation of the filter function on eigenvalues.
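The matrices entering Eq. 3 can be built directly from an adjacency matrix. The sketch below, in plain Python, constructs the degree matrix and the Laplacian $\mathcal{L}=\mathcal{D}-\mathcal{A}$ for a three-node path graph. The function names are hypothetical, and the eigendecomposition itself is left to a numerical library and is not shown.

```python
def degree_matrix(adj):
    """Diagonal matrix whose i-th entry is the row sum of the adjacency matrix."""
    n = len(adj)
    return [[sum(adj[i]) if i == j else 0 for j in range(n)] for i in range(n)]


def laplacian(adj):
    """Combinatorial graph Laplacian L = D - A."""
    deg = degree_matrix(adj)
    n = len(adj)
    return [[deg[i][j] - adj[i][j] for j in range(n)] for i in range(n)]


# Undirected path graph 0 - 1 - 2.
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
lap = laplacian(adj)
```

Each row of the Laplacian sums to zero, which is a quick sanity check on the construction.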

To enhance the efficiency of spectral convolution, Chebyshev polynomials are commonly used to approximate the filter function $\mathcal{g}$. These polynomials form a family of orthogonal polynomials on the interval [− 1, 1] and can be used to approximate a wide range of functions. The $\mathcal{i}$th-order Chebyshev polynomial satisfies the recurrence relation ${\mathcal{T}}_{\mathcal{i}}(\mathcal{X})=2\mathcal{X}{\mathcal{T}}_{\mathcal{i}-1}(\mathcal{X})-{\mathcal{T}}_{\mathcal{i}-2}(\mathcal{X})$ with the initial conditions ${\mathcal{T}}_{0}\left(\mathcal{X}\right)=1$ and ${\mathcal{T}}_{1}\left(\mathcal{X}\right)=\mathcal{X}$.
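The standard Chebyshev recurrence, T_i(x) = 2x T_{i-1}(x) − T_{i-2}(x) with T_0(x) = 1 and T_1(x) = x, can be evaluated iteratively. The following short Python sketch does this for a scalar argument; the function name `chebyshev` is hypothetical.

```python
def chebyshev(order, x):
    """Evaluate the Chebyshev polynomial T_order(x) by the three-term recurrence."""
    if order == 0:
        return 1.0
    if order == 1:
        return float(x)
    t_prev, t_curr = 1.0, float(x)  # T_0 and T_1
    for _ in range(2, order + 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr


# T_2(x) = 2x^2 - 1 and T_3(x) = 4x^3 - 3x, evaluated at x = 0.5.
t2 = chebyshev(2, 0.5)
t3 = chebyshev(3, 0.5)
```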

The filter $\mathcal{g}(\Lambda )$ is approximated by a linear combination of Chebyshev polynomials as follows:

$$\mathcal{g}\left(\Lambda \right)\approx {\sum }_{\mathcal{i}=0}^{\mathcal{K}}{\theta }_{\mathcal{i}}{\mathcal{T}}_{\mathcal{i}}(\widetilde{\Lambda }),$$

(4)

where ${\theta }_{\mathcal{i}}$ denotes the parameter to be learned; $\widetilde{\Lambda }$ is the normalized eigenvalue matrix; and $\mathcal{K}$ denotes the order of the approximation. The graph convolution operation is defined as:

$$\mathcal{y}=\mathcal{g}\mathcal{*}\mathcal{x}\approx {\sum }_{\mathcal{i}=0}^{\mathcal{K}}{\theta }_{\mathcal{i}}{\mathcal{T}}_{\mathcal{i}}(\widetilde{\mathcal{L}})\mathcal{x},$$

(5)

where $\widetilde{\mathcal{L}}$ denotes the normalized Laplacian matrix.
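Eq. 5 can be evaluated without any eigendecomposition by applying the Chebyshev recurrence directly to vectors. The sketch below is a plain-Python illustration under the assumption that the caller supplies an already normalized Laplacian `l_tilde` and at least one coefficient in `theta`; the helper names are hypothetical, and the coefficients would be learned in practice rather than fixed.

```python
def matvec(matrix, vector):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def cheb_graph_conv(l_tilde, x, theta):
    """Compute y = sum_i theta[i] * T_i(l_tilde) x using the recurrence
    T_i(L) x = 2 L T_{i-1}(L) x - T_{i-2}(L) x."""
    t_prev = list(x)  # T_0(L) x = x
    y = [theta[0] * t for t in t_prev]
    if len(theta) == 1:
        return y
    t_curr = matvec(l_tilde, x)  # T_1(L) x = L x
    y = [yi + theta[1] * t for yi, t in zip(y, t_curr)]
    for i in range(2, len(theta)):
        l_t = matvec(l_tilde, t_curr)
        t_next = [2.0 * a - b for a, b in zip(l_t, t_prev)]
        y = [yi + theta[i] * t for yi, t in zip(y, t_next)]
        t_prev, t_curr = t_curr, t_next
    return y


# Toy example: a 2 x 2 operator, a signal on two nodes, and order K = 2.
l_tilde = [[0.0, 1.0], [1.0, 0.0]]
signal = [1.0, 2.0]
y = cheb_graph_conv(l_tilde, signal, theta=[1.0, 1.0, 1.0])
```

Because only matrix-vector products are needed, the cost of one filter application grows with the order $\mathcal{K}$ and the number of edges rather than with the cost of a full spectral decomposition.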

Feature coding

A gene sequence fragment, Seq, comprising $\mathcal{N}$ nucleotides is denoted as $\text{Seq}={\text{S}}_{1},{\text{S}}_{2},{\text{S}}_{3},\cdots ,{\text{S}}_{\mathcal{N}}$, where ${\text{S}}_{\text{i}}\in \{\text{A},\text{C},\text{T},\text{G}\}$. In this study, three time series graphical encoding methods [59,60,61,62] and three gene sequence graphical encoding methods [63,64,65] were combined to develop a multidimensional multivariate feature coding approach.

Coding of spectral time sequences

Spectral time sequences, denoted as $[{\text{x}}^{(1)},\cdots {\text{x}}^{(\text{N})}]$, are utilized to characterize gene sequences. The corresponding spectral time sequences are derived as:

$${x}^{(i)}=\left\{\begin{array}{c} 1, { S}_{i}=A \\ 2, {S}_{i}=G\\ 3, {S}_{i}=C\\ 4, {S}_{i}=T\end{array}\right., i=\text{1,2},\dots ,n$$

(6)
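The mapping in Eq. 6 amounts to a simple lookup. A minimal Python sketch follows; the names `SPECTRAL_CODE` and `spectral_time_sequence` are hypothetical, and the sketch assumes the input contains only the four bases A, G, C, and T.

```python
# Numeric code of each nucleotide, as given in Eq. 6.
SPECTRAL_CODE = {"A": 1, "G": 2, "C": 3, "T": 4}


def spectral_time_sequence(seq):
    """Map a DNA string to its spectral time sequence x^(1), ..., x^(n)."""
    return [SPECTRAL_CODE[base] for base in seq.upper()]


encoded = spectral_time_sequence("AGCTTA")
```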

CGR time sequence

In this method, the four vertices of a square represent the four nucleotide types in a DNA sequence and are assigned as follows: A at (1, 1), T at (− 1, − 1), G at (− 1, 1), and C at (1, − 1). The DNA sequence is plotted in the following steps:

Step (1): Set the initial state of the vertices as A(1, 1), T(− 1, − 1), G(− 1, 1), and C(1, − 1);

Step (2): Begin at the center point (0, 0), which serves as the starting position;

Step (3): For the first nucleotide, plot a point at the midpoint between its corresponding vertex and the center (0, 0);

Step (4): For the second nucleotide, plot a point at the midpoint between its corresponding vertex and the point plotted for the previous nucleotide;

Step (5): Repeat step 4 for each subsequent nucleotide until the entire DNA sequence is represented. This procedure is expressed as:

$$\genfrac{}{}{0pt}{}{{x}^{\left(\mathcal{i}\right)}={CGR}_{\mathcal{i}}={CGR}_{\mathcal{i}-1}-\frac{{CGR}_{\mathcal{i}-1}-{g}_{\mathcal{i}}}{2} }{{g}_{\mathcal{i}}=\left\{\begin{array}{c}\left( 1, 1\right),{S}_{\mathcal{i}}=A\\ \left(-1, 1\right),{S}_{\mathcal{i}}=G\\ \left( 1,-1\right),{S}_{\mathcal{i}}=C\\ \left(-1,-1\right),{S}_{\mathcal{i}}=T\end{array}\right.}$$

(7)
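The midpoint rule of Eq. 7 translates into a short loop. The Python sketch below starts at the center (0, 0) and moves halfway toward the vertex of each nucleotide in turn; the names `CGR_VERTEX` and `cgr_time_sequence` are hypothetical.

```python
# Vertex assigned to each nucleotide, as listed in Step (1).
CGR_VERTEX = {
    "A": (1.0, 1.0),
    "G": (-1.0, 1.0),
    "C": (1.0, -1.0),
    "T": (-1.0, -1.0),
}


def cgr_time_sequence(seq):
    """Return the list of CGR points, one per nucleotide."""
    x, y = 0.0, 0.0  # starting position at the center
    points = []
    for base in seq.upper():
        gx, gy = CGR_VERTEX[base]
        # CGR_i = CGR_{i-1} - (CGR_{i-1} - g_i) / 2, i.e. the midpoint.
        x = x - (x - gx) / 2.0
        y = y - (y - gy) / 2.0
        points.append((x, y))
    return points


points = cgr_time_sequence("AT")
```

For the sequence AT, the first point lies halfway between the center and vertex A, and the second lies halfway between that point and vertex T.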

Z time sequence

In a given gene sequence fragment, the cumulative counts of A, C, G, and T from the start of the sequence to the $\mathcal{i}$th nucleotide are denoted as ${A}_{i},{C}_{i},{G}_{i},{T}_{i}$, respectively. The Z time sequence is defined as:

$$\genfrac{}{}{0pt}{}{{x}^{\left(\mathcal{i}\right)}=\sqrt{{X}_{\mathcal{i}}+{Y}_{\mathcal{i}}+{Z}_{\mathcal{i}}}}{\left\{\begin{array}{c}{X}_{\mathcal{i}}=\left({A}_{\mathcal{i}}+{G}_{\mathcal{i}}\right)-\left({C}_{\mathcal{i}}+{T}_{\mathcal{i}}\right)\\ {Y}_{\mathcal{i}}=\left({A}_{\mathcal{i}}+{C}_{\mathcal{i}}\right)-\left({G}_{\mathcal{i}}+{T}_{\mathcal{i}}\right)\\ {Z}_{\mathcal{i}}=\left({A}_{\mathcal{i}}+{T}_{\mathcal{i}}\right)-\left({C}_{\mathcal{i}}+{G}_{\mathcal{i}}\right)\end{array}\right.}$$

(8)
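The three components in Eq. 8 depend only on running nucleotide counts. The following Python sketch computes $({X}_{i},{Y}_{i},{Z}_{i})$ for every position; it stops at the components and does not perform the final scalar reduction to $x^{(i)}$. The function name `z_components` is hypothetical.

```python
def z_components(seq):
    """Return (X_i, Y_i, Z_i) for each prefix of the sequence."""
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    components = []
    for base in seq.upper():
        counts[base] += 1  # cumulative counts A_i, C_i, G_i, T_i
        a, c, g, t = counts["A"], counts["C"], counts["G"], counts["T"]
        components.append((
            (a + g) - (c + t),  # X_i
            (a + c) - (g + t),  # Y_i
            (a + t) - (c + g),  # Z_i
        ))
    return components


components = z_components("AGC")
```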

DN Curve 2D

The principle of double-nucleotide two-dimensional curve (DN Curve 2D) involves mapping 16 different nucleotide pairs onto a two-dimensional coordinate system. Starting from the origin (0,0), a curve is formed by connecting points corresponding to adjacent nucleotide pairs. In this way, the 2D double nucleotide curve is constructed. The mathematical model of the sequence is defined as:

$$\varphi ({\text{s}}_{\mathcal{i}}{\text{s}}_{\mathcal{i}+1})=\left\{\begin{array}{c}(i,1) if {\text{s}}_{\mathcal{i}}{\text{s}}_{\mathcal{i}+1}=AG\\ (i,2) if {\text{s}}_{\mathcal{i}}{\text{s}}_{\mathcal{i}+1}=GA\\ (i,3) if {\text{s}}_{\mathcal{i}}{\text{s}}_{\mathcal{i}+1}=CT\\ (i,4) if {\text{s}}_{\mathcal{i}}{\text{s}}_{\mathcal{i}+1}=TC\\ (i,5) if {\text{s}}_{\mathcal{i}}{\text{s}}_{\mathcal{i}+1}=AC\\ (i,6) if {\text{s}}_{\mathcal{i}}{\text{s}}_{\mathcal{i}+1}=CA\\ (i,7) if {\text{s}}_{\mathcal{i}}{\text{s}}_{\mathcal{i}+1}=GT\\ (i,8) if {\text{s}}_{\mathcal{i}}{\text{s}}_{\mathcal{i}+1}=TG\\ (i,9) if {\text{s}}_{\mathcal{i}}{\text{s}}_{\mathcal{i}+1}=AT\\ (i,10) if {\text{s}}_{\mathcal{i}}{\text{s}}_{\mathcal{i}+1}=TA\\ (i,11) if {\text{s}}_{\mathcal{i}}{\text{s}}_{\mathcal{i}+1}=CG\\ (i,12) if {\text{s}}_{\mathcal{i}}{\text{s}}_{\mathcal{i}+1}=GC\\ (i,13) if {\text{s}}_{\mathcal{i}}{\text{s}}_{\mathcal{i}+1}=AA\\ (i,14) if {\text{s}}_{\mathcal{i}}{\text{s}}_{\mathcal{i}+1}=CC\\ (i,15) if {\text{s}}_{\mathcal{i}}{\text{s}}_{\mathcal{i}+1}=GG\\ (i,16) if {\text{s}}_{\mathcal{i}}{\text{s}}_{\mathcal{i}+1}=TT\end{array}\right.,$$

(9)

where $\mathcal{i}=1,2,3,\cdots ,\mathcal{N}-1$ indicates the position of the nucleotides in the DNA sequence.
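Eq. 9 assigns each adjacent nucleotide pair a point whose first coordinate is the position and whose second coordinate is the index of the pair. A minimal Python sketch is given below, with the pair order copied from Eq. 9; the names `DN_ORDER`, `DN_INDEX`, and `dn_curve_2d` are hypothetical.

```python
# Pair order as listed in Eq. 9 (AG -> 1, GA -> 2, ..., TT -> 16).
DN_ORDER = [
    "AG", "GA", "CT", "TC", "AC", "CA", "GT", "TG",
    "AT", "TA", "CG", "GC", "AA", "CC", "GG", "TT",
]
DN_INDEX = {pair: k + 1 for k, pair in enumerate(DN_ORDER)}


def dn_curve_2d(seq):
    """Return the points (i, index of pair s_i s_{i+1}) for i = 1, ..., N - 1."""
    s = seq.upper()
    return [(i + 1, DN_INDEX[s[i:i + 2]]) for i in range(len(s) - 1)]


curve = dn_curve_2d("AGCT")
```

Connecting the returned points in order, starting from the origin, gives the two-dimensional curve described above.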

DN Curve 3D

The double-nucleotide three-dimensional curve (DN Curve 3D) is a method that maps nucleotide pairs into a three-dimensional space. The nucleotide pairs encompass 16 different combinations, namely, AG, GA, CT, TC, AC, CA, GT, TG, AT, TA, CG, GC, AA, CC, GG, and TT. Based on specific classification criteria, the 16 nucleotide pairs were divided into four categories, expressed in the form of a 4 × 4 matrix:

$$\left(\begin{array}{cccc}AC& CA& GT& TG\\ AG& GA& CT& TC\\ AT& TA& CG& GC\\ AA& CC& GG& TT\end{array}\right)$$

(10)

Each row of the matrix represents one set of nodes, giving four sets in total. The coordinates of the nucleotide pairs in each set are denoted as ${\mathcal{X}}^{\mathcal{t}},{\mathcal{Y}}^{\mathcal{t}},{\mathcal{Z}}^{\text{t}}$ for $\mathcal{t}=1,2,3,4$. The coordinates of each node are calculated first, and each set of nodes is then plotted onto the corresponding graph to form four distinct graph sets. The coordinates of the double nucleotides are given by:

$${\left\{\begin{array}{c}{{\varnothing }}_{1}({\mathcal{g}}_{i}{\mathcal{g}}_{\mathcal{i}+1})=\left\{\begin{array}{c}{\mathcal{x}}_{\mathcal{i}}={AC}_{\mathcal{i}}+{CA}_{\mathcal{i}}-({GT}_{\mathcal{i}}+{TG}_{\mathcal{i}})\\ {\mathcal{y}}_{\mathcal{i}}={AC}_{\mathcal{i}}+{GT}_{\mathcal{i}}-({CA}_{\mathcal{i}}+{TG}_{\mathcal{i}})\\ {\mathcal{z}}_{\mathcal{i}}={AC}_{\mathcal{i}}+{TG}_{\mathcal{i}}-({GT}_{\mathcal{i}}+{CA}_{\mathcal{i}})\end{array}\right.\\ {{\varnothing }}_{2}({\mathcal{g}}_{i}{\mathcal{g}}_{\mathcal{i}+1})=\left\{\begin{array}{c}{\mathcal{x}}_{\mathcal{i}}={AG}_{\mathcal{i}}+{GA}_{\mathcal{i}}-({CT}_{\mathcal{i}}+{TC}_{\mathcal{i}})\\ {\mathcal{y}}_{\mathcal{i}}={AG}_{\mathcal{i}}+{CT}_{\mathcal{i}}-({GA}_{\mathcal{i}}+{TC}_{\mathcal{i}})\\ {\mathcal{z}}_{\mathcal{i}}={AG}_{\mathcal{i}}+{TC}_{\mathcal{i}}-({CT}_{\mathcal{i}}+{GA}_{\mathcal{i}})\end{array}\right.\\ {{\varnothing }}_{3}({\mathcal{g}}_{i}{\mathcal{g}}_{\mathcal{i}+1})=\left\{\begin{array}{c}{\mathcal{x}}_{\mathcal{i}}={AT}_{\mathcal{i}}+{TA}_{\mathcal{i}}-({GC}_{\mathcal{i}}+{CG}_{\mathcal{i}})\\ {\mathcal{y}}_{\mathcal{i}}={AT}_{\mathcal{i}}+{CG}_{\mathcal{i}}-({TA}_{\mathcal{i}}+{GC}_{\mathcal{i}})\\ {\mathcal{z}}_{\mathcal{i}}={AT}_{\mathcal{i}}+{GC}_{\mathcal{i}}-({CG}_{\mathcal{i}}+{TA}_{\mathcal{i}})\end{array}\right.\\ {{\varnothing }}_{4}({\mathcal{g}}_{i}{\mathcal{g}}_{\mathcal{i}+1})=\left\{\begin{array}{c}{\mathcal{x}}_{\mathcal{i}}={AA}_{\mathcal{i}}+{GG}_{\mathcal{i}}-({CC}_{\mathcal{i}}+{TT}_{\mathcal{i}})\\ {\mathcal{y}}_{\mathcal{i}}={AA}_{\mathcal{i}}+{CC}_{\mathcal{i}}-\left({GG}_{\mathcal{i}}+{TT}_{\mathcal{i}}\right) \\ {\mathcal{z}}_{\mathcal{i}}={AA}_{\mathcal{i}}+{TT}_{\mathcal{i}}-({GG}_{\mathcal{i}}+{CC}_{\mathcal{i}})\end{array}\right.\end{array}\right.},$$

(11)

where $\mathcal{i}=0,1,2,3,\cdots ,\mathcal{N}-1$ and $\mathcal{N}$ denotes the length of the DNA sequence. The initial values of the double-nucleotide counts are all set to zero: ${AG}_{0}+{GA}_{0}+{CT}_{0}+{TC}_{0}+{AC}_{0}+{CA}_{0}+{GT}_{0}+{TG}_{0}+{AT}_{0}+{TA}_{0}+{CG}_{0}+{GC}_{0}+{AA}_{0}+{CC}_{0}+{GG}_{0}+{TT}_{0}=0$.
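To make the construction concrete, the Python sketch below computes the curve for the first set only (AC, CA, GT, TG). It assumes that ${AC}_{i}$, ${CA}_{i}$, ${GT}_{i}$, and ${TG}_{i}$ are cumulative counts of the corresponding pairs up to position $i$, by analogy with the cumulative nucleotide counts of the Z time sequence; this reading is an assumption, and the function name `dn_curve_3d_set1` is hypothetical.

```python
def dn_curve_3d_set1(seq):
    """Return (x_i, y_i, z_i) for the first pair set, starting from i = 0."""
    s = seq.upper()
    counts = {"AC": 0, "CA": 0, "GT": 0, "TG": 0}
    points = [(0, 0, 0)]  # i = 0: all cumulative counts are zero
    for i in range(len(s) - 1):
        pair = s[i:i + 2]
        if pair in counts:  # pairs from the other three sets leave this curve flat
            counts[pair] += 1
        ac, ca, gt, tg = counts["AC"], counts["CA"], counts["GT"], counts["TG"]
        points.append((
            ac + ca - (gt + tg),  # x_i
            ac + gt - (ca + tg),  # y_i
            ac + tg - (gt + ca),  # z_i
        ))
    return points


curve_set1 = dn_curve_3d_set1("ACAGT")
```

The other three sets would follow the same loop with their own four pairs and the coordinate formulas given for them in Eq. 11.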

C-Curve

C-Curve is a visualization method that maps codons into a three-dimensional space. In this method, the sequence $\text{S}={\text{s}}_{1},{\text{s}}_{2},{\text{s}}_{3},\cdots ,{\text{s}}_{\text{n}}$ is converted into a set of points, denoted as $\varphi(\text{S})=\varphi({\text{C}}_{1})\varphi({\text{C}}_{2})\varphi({\text{C}}_{3})\varphi({\text{C}}_{4})\cdots \varphi({\text{C}}_{\mathcal{N}})$. The coordinates of each point are defined as $\varphi ({\text{C}}_{\mathcal{i}})=({\mathcal{x}}_{\mathcal{i}},{\mathcal{y}}_{\mathcal{i}},\mathcal{i})$, where $\mathcal{i}$ denotes the positional order of the codon in the sequence. The values of (${\mathcal{x}}_{\mathcal{i}},{\mathcal{y}}_{\mathcal{i}}$) are presented in Table 4.

Table 4 Coordinate values (${\mathcal{x}}_{\mathcal{i}},{\mathcal{y}}_{\mathcal{i}}$) of codons


Multidimensional multivariate feature coding

Through three time series and three gene sequence graphical encoding methods, six single and three mixed encoding methods were obtained (nine methods for DNA sequence representation). The relevant statistical information is presented in Table 5.

Table 5 Graphical encoding representation of gene sequences


Time series encoding methods effectively capture the temporal relational information in DNA sequences, whereas gene feature encoding more comprehensively illustrates the inherent physicochemical and biological characteristics in DNA sequences. Code 4 represents the mixed encoding method for time series graphics. Code 8 indicates the mixed graphical encoding of gene sequences. Code 9 denotes the graphical mixed encoding method for both time series and gene sequences.

Modeling

Model construction

The EGP Hybrid-ML model proposed in this paper encompasses five components: data acquisition, feature encoding, feature extraction, feature fusion, and prediction output. The model framework is illustrated in Fig. 17.

The data acquisition phase involves cleaning and organizing the essential gene database to form a comprehensive dataset. To validate the performance of the established model, the experimental database contains various categories of gene data, including essential and non-essential genes. In the feature encoding phase, gene sequences are encoded according to gene features and time series. Each of these two encoding families yields three different encodings, which are then used singly or fused in combination (one, two, three, or multiple encodings) to obtain a total of nine gene encodings. In the feature extraction phase, features are extracted from the different gene encodings. The encoding methods involve different dimensions, such as two-dimensional and three-dimensional representations, and GCN extracts feature information from each dimension to obtain a feature information matrix. In the feature fusion phase, the extracted feature information is fused. To fully consider gene information from different dimensions and positions, an attention mechanism is used for feature fusion. The essence of feature fusion is to retain important feature information while reducing the dimensionality of the feature information as much as possible, which can enhance the performance of the model. Finally, in the modeling and output phase, the extracted and fused gene features are used as the inputs to establish a predictive model. To maintain the correlation among gene features, Bi-LSTM is used for modeling: the feature information is input into the Bi-LSTM model for training and testing, and the predictive results are obtained as the output [66,67,68].

The input for EGP Hybrid-ML consists of $\mathcal{N}$ gene sequences $[{x}^{(1)},\cdots {x}^{(\mathcal{N})}]$, which undergo encoding and fusion processes to yield an encoded matrix. This matrix is then subjected to convolution operations with GCN to yield a feature matrix:

$${\mathcal{C}}^{(k)}={\mathcal{W}}^{(k)}\times {\mathcal{X}}_{t-T:t-1},$$

(12)

where ${\mathcal{C}}^{(k)}$ is the convolution result, and ${\mathcal{W}}^{(k)}$ indicates the kth convolution kernel.

The feature matrix obtained from the GCN is input into Bi-LSTM model:

$$\begin{array}{l}\mathcal{f}\left(\mathcal{t}\right)=\upsigma ({\mathcal{W}}_{\mathcal{f}}{\mathcal{k}}_{\mathcal{t}-1}+{\mathcal{U}}_{\mathcal{f}}{\mathcal{x}}_{\mathcal{t}}+{\mathcal{b}}_{\mathcal{f}})\\ \mathcal{i}\left(\mathcal{t}\right)=\upsigma ({\mathcal{W}}_{\mathcal{i}}{\mathcal{k}}_{\mathcal{t}-1}+{\mathcal{U}}_{\mathcal{i}}{\mathcal{x}}_{\mathcal{t}}+{\mathcal{b}}_{\mathcal{i}})\\ \mathcal{a}\left(\mathcal{t}\right)=\text{tanh}({\mathcal{W}}_{\mathcal{a}}{\mathcal{k}}_{\mathcal{t}-1}+{\mathcal{U}}_{\mathcal{a}}{\mathcal{x}}_{\mathcal{t}}+{\mathcal{b}}_{\mathcal{a}})\\ \mathcal{o}\left(\mathcal{t}\right)=\upsigma ({\mathcal{W}}_{\mathcal{o}}{\mathcal{k}}_{\mathcal{t}-1}+{\mathcal{U}}_{\mathcal{o}}{\mathcal{x}}_{\mathcal{t}}+{\mathcal{b}}_{\mathcal{o}})\\ \text{tanh}\left(\mathcal{x}\right)=\frac{1-{\mathcal{e}}^{-2\mathcal{x}}}{1+{\mathcal{e}}^{-2\mathcal{x}}}\\ \upsigma \left(\mathcal{x}\right)=\frac{1}{1+{\mathcal{e}}^{-\mathcal{x}}}\end{array},$$

(13)

where ${\mathcal{x}}_{\mathcal{t}}$ is the input at time $\mathcal{t}$; ${\mathcal{k}}_{\mathcal{t}-1}$ is the hidden layer state at time $\mathcal{t}-1$; ${\mathcal{W}}_{\mathcal{f}}$, ${\mathcal{W}}_{\mathcal{i}}$, ${\mathcal{W}}_{\mathcal{o}}$, and ${\mathcal{W}}_{\mathcal{a}}$ are the weight coefficients applied to ${\mathcal{k}}_{\mathcal{t}-1}$ for the forget gate, input gate, output gate, and feature extraction process, respectively; ${\mathcal{U}}_{\mathcal{f}}$, ${\mathcal{U}}_{\mathcal{i}}$, ${\mathcal{U}}_{\mathcal{o}}$, and ${\mathcal{U}}_{\mathcal{a}}$ are the weight coefficients applied to ${\mathcal{x}}_{\mathcal{t}}$ for the forget gate, input gate, output gate, and feature extraction process, respectively; ${\mathcal{b}}_{\mathcal{f}}$, ${\mathcal{b}}_{\mathcal{i}}$, ${\mathcal{b}}_{\mathcal{o}}$, and ${\mathcal{b}}_{\mathcal{a}}$ are the corresponding bias values; $\text{tanh}$ is the hyperbolic tangent function; and $\upsigma$ is the sigmoid function.

The calculation results of forget gate and input gate are applied to $\mathcal{c}(\mathcal{t}-1)$ to obtain the state $\mathcal{c}(\mathcal{t})$ at time $\mathcal{t}$, which is expressed as:

$$\mathcal{c}\left(\mathcal{t}\right)=\mathcal{c}\left(\mathcal{t}-1\right)\odot\mathcal{f}\left(\mathcal{t}\right)+\mathcal{i}(\mathcal{t})\odot\mathcal{a}(\mathcal{t}),$$

(14)

where $\odot$ denotes the Hadamard product. The final hidden layer state $\mathcal{k}\left(\mathcal{t}\right)$ is computed from the output gate $\mathcal{o}\left(\mathcal{t}\right)$ and the current cell state $\mathcal{c}(\mathcal{t})$:

$$\mathcal{k}\left(\mathcal{t}\right)=\mathcal{o}\left(\mathcal{t}\right)\odot\text{ tanh}(\mathcal{c}\left(\mathcal{t}\right))$$

(15)
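The gate equations and the state updates in Eqs. 14 and 15 can be traced in a few lines of plain Python. The sketch below performs one forward step of a single LSTM direction; a Bi-LSTM would run a second such pass over the reversed sequence and combine the two hidden states. The candidate term $\mathcal{a}(\mathcal{t})$ is taken to be tanh of the affine map with parameters ${\mathcal{W}}_{\mathcal{a}}$, ${\mathcal{U}}_{\mathcal{a}}$, ${\mathcal{b}}_{\mathcal{a}}$, which is the standard LSTM form and is an assumption here. The names `lstm_step`, `affine`, and the `params` layout are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(w, k_prev, u, x_t, b):
    """Compute W k_{t-1} + U x_t + b for list-of-lists matrices."""
    return [
        sum(wr[j] * k_prev[j] for j in range(len(k_prev)))
        + sum(ur[j] * x_t[j] for j in range(len(x_t)))
        + bi
        for wr, ur, bi in zip(w, u, b)
    ]


def lstm_step(params, x_t, k_prev, c_prev):
    """One LSTM step. params maps 'f', 'i', 'o', 'a' to a (W, U, b) tuple."""
    gates = {}
    for name in ("f", "i", "o", "a"):
        w, u, b = params[name]
        pre = affine(w, k_prev, u, x_t, b)
        activation = math.tanh if name == "a" else sigmoid
        gates[name] = [activation(v) for v in pre]
    # Eq. 14: c(t) = c(t-1) * f(t) + i(t) * a(t), element-wise.
    c_t = [
        c * f + i * a
        for c, f, i, a in zip(c_prev, gates["f"], gates["i"], gates["a"])
    ]
    # Eq. 15: k(t) = o(t) * tanh(c(t)), element-wise.
    k_t = [o * math.tanh(c) for o, c in zip(gates["o"], c_t)]
    return k_t, c_t


# Hidden size 1, input size 1, all parameters zero: every sigmoid gate is 0.5.
zero_params = {name: ([[0.0]], [[0.0]], [0.0]) for name in ("f", "i", "o", "a")}
k_t, c_t = lstm_step(zero_params, x_t=[1.0], k_prev=[0.0], c_prev=[2.0])
```

With all-zero parameters the forget gate halves the previous cell state and the candidate term vanishes, which makes the arithmetic easy to check by hand.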

Evaluation metrics

Five commonly used metrics are adopted to evaluate the performance of the model: sensitivity (SN), specificity (SP), accuracy (ACC), Matthews correlation coefficient (MCC), and area under the curve (AUC). The five metrics are defined as:

$$\left\{\begin{array}{c}SN=\frac{\text{TP}}{\text{TP}+\text{FN}} \\ SP=\frac{\text{TN}}{\text{TN}+\text{FP}} \\ ACC=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{FN}+\text{TN}+\text{FP}} \\ MCC=\frac{\text{TP}\times \text{TN}-\text{FP}\times \text{FN}}{\sqrt{\left(\text{TP}+\text{FP}\right)\times \left(\text{TP}+\text{FN}\right)\times \left(\text{TN}+\text{FP}\right)\times (\text{TN}+\text{FN})}} \\ AUC=\frac{\sum_{i\in pos}{\text{rank}}_{i}-\frac{{num}_{pos}({num}_{pos}+1)}{2}}{{num}_{pos}{num}_{neg}}\end{array}\right.,$$

(16)

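where TP, TN, FP, and FN represent the numbers of samples whose prediction results are true positive, true negative, false positive, and false negative, respectively; ${\text{rank}}_{i}$ is the rank of positive sample $i$ when all samples are sorted by predicted score in ascending order; and ${num}_{pos}$ and ${num}_{neg}$ are the numbers of positive and negative samples, respectively.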
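The five metrics in Eq. 16 can be computed directly from confusion-matrix counts and predicted scores. The Python sketch below is a minimal illustration; the function names are hypothetical, the AUC routine assumes that no two samples share the same score (ties are not averaged), and MCC is returned as 0 when its denominator vanishes, which is a convention chosen for this sketch.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Return SN, SP, ACC, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


def auc_from_ranks(scores, labels):
    """Rank-based AUC; labels are 1 (positive) or 0 (negative)."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    rank = {idx: r for r, idx in enumerate(order, start=1)}  # ascending ranks
    positives = [k for k, y in enumerate(labels) if y == 1]
    n_pos = len(positives)
    n_neg = len(labels) - n_pos
    rank_sum = sum(rank[k] for k in positives)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


metrics = confusion_metrics(tp=8, tn=9, fp=1, fn=2)
auc = auc_from_ranks([0.9, 0.4, 0.5, 0.1], [1, 1, 0, 0])
```

In the AUC example, three of the four positive-negative pairs are ordered correctly, so the rank formula gives 0.75.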

Data availability

The supplementary materials, including the code, architecture, parameters, datasets, functions, usage, and output of the proposed model, are freely available on GitHub (https://github.com/gnnumsli/EGP-Hybrid-ML). Additionally, all data generated or analyzed during this study are included in this published article and its supplementary information files and are permanently archived in Zenodo (https://doi.org/10.5281/zenodo.14970323).

Abbreviations

Bi-LSTM: Bi-directional long short-term memory
GCN: Graph convolutional neural networks
SN: Sensitivity
SP: Specificity
MCC: Matthews correlation coefficient
ACC: Accuracy
AUC: Area under the ROC curve
DEG: A public Database of Essential Genes
ML: Machine learning
CNN: Convolutional neural networks
NLP: Natural language processing
SVM: Support vector machine

References

Wu PK, Sun RL, Fahira A, Chen YZ, Jiangzhou HT, Wang K, Yang QZ, Dai Y, Pan D, Shi YY, et al. DROEG: a method for cancer drug response prediction based on omics and essential genes integration. Brief Bioinform. 2023;24(2):bbad003.
Ward RD, Tran JS, Banta AB, Bacon EE, Rose WE, Peters JM. Essential gene knockdowns reveal genetic vulnerabilities and antibiotic sensitivities in Acinetobacter baumannii. mBio. 2023;15(2):e02051–23.
Huseby DL, Brandis G, Alzrigat LP, Hughes D. Antibiotic resistance by high-level intrinsic suppression of a frameshift mutation in an essential gene. Proc Natl Acad Sci. 2020;117(6):3185–91.
Green RA, Kao HL, Audhya A, Arur S, Mayers JR, Fridolfsson HN, Schulman M, Schloissnig S, Niessen S, Laband K, et al. A High-Resolution C. elegans essential gene network based on phenotypic profiling of a complex tissue. Cell. 2011;145(3):470–82.
Rivas-Marin E, Moyano-Palazuelo D, Henriques V, Merino E, Devos DP. Essential gene complement of Planctopirus limnophila from the bacterial phylum Planctomycetes. Nat Commun. 2023;14(1):7224.
Lehman TA, Rosas MA, Brew-Appiah RAT, Solanki S, York ZB, Dannay R, Wu Y, Roalson EH, Zheng P, Main D, et al. BUZZ: an essential gene for postinitiation root hair growth and a mediator of root architecture in Brachypodium distachyon. New Phytol. 2023;239(5):1723–39.
Boudehen YM, Tasrini Y, Aguilera-Correa JJ, Alcaraz M, Kremer L. Silencing essential gene expression in Mycobacterium abscessus during infection. Microbiol Spectr. 2023; p. e02836-e2923.
De Giorgi M, Hurley A, Doerfler AM, Furgurson MN, Chuecos MA, Hyde S, Chickering T, Lefebvre S, Qin J, Bissig KD, et al. In vivo expansion of gene-targeted hepatocytes through inhibition of an essential gene. Mol Ther. 2022;30(4):568–9.
Targa A, Larrimore KE, Wong CK, Chong YL, Fung R, Lee J, Choi H, Rancatil G. Non-genetic and genetic rewiring underlie adaptation to hypomorphic alleles of an essential gene. Embo J. 2021;40(21): e107839.
Hu J, Tang YX, Zhou Y, Li Z, Rao B, Zhang GJ. Improving DNA 6mA site prediction via integrating bidirectional long short-term memory, convolutional neural network, and self-attention mechanism. J Chem Inf Model. 2023;63(17):5689–700.
Hardo G, Noka M, Bakshi S. Synthetic Micrographs of Bacteria (SyMBac) allows accurate segmentation of bacterial cells using deep neural networks. BMC Biol. 2022;20(1):263.
de Castro GM, Hastenreiter Z, Monteiro TAS, da Silva TTM, Lobo FP. Cross-species prediction of essential genes in insects. Bioinformatics. 2022;38(6):1504–13.
Wein T, Wang YQ, Barz M, Stücker FT, Hammerschmidt K, Dagan T. Essential gene acquisition destabilizes plasmid inheritance. PLoS Genet. 2021;17(7): e1009656.
Zhang X, Xiao WX, Xiao WJ. DeepHE: Accurately predicting human essential genes based on deep learning. PLoS Comput Biol. 2020;16(9): e1008229.
Candek K, Candek UP, Kuntner M. Machine learning approaches identify male body size as the most accurate predictor of species richness. BMC Biol. 2020;18(1):105.
Singh AK, Carette X, Potluri LP, Sharp JD, Xu RF, Prisic S, Husson RN. Investigating essential gene function in Mycobacterium tuberculosis using an efficient CRISPR interference system. Nucleic Acids Res. 2016;44(18): e143.
Campos TL, Korhonen PK, Hofmann A, Gasser RB, Young ND. Harnessing model organism genomics to underpin the machine learning-based prediction of essential genes in eukaryotes-Biotechnological implications. Biotechnol Adv. 2022;54: 107822.
Campos TL, Korhonen PK, Sternberg PW, Gasser RB, Young ND. Predicting gene essentiality in Caenorhabditis elegans by feature engineering and machine-learning. Comp Struct Biotechnol J. 2020;18:1093–102.
Aromolaran O, Beder T, Oswald M, Oyelade J, Adebiyi E, Koenig R. Essential gene prediction in Drosophila melanogaster using machine learning approaches based on sequence and functional features. Comp Struct Biotechnol J. 2020;18:612–21.
Aromolaran OT, Isewon I, Adedeji E, Oswald M, Adebiyi E, Koenig R, Oyelade J. Heuristic-enabled active machine learning: A case study of predicting essential developmental stage and immune response genes in Drosophila melanogaster. PLoS ONE. 2023;18(8): e0288023.
Chen JH, Liu YM, Liao Q, Liu B. iEsGene-ZCPseKNC: Identify Essential Genes Based on Z Curve Pseudo k-Tuple Nucleotide Composition. IEEE Access. 2019;7(1): 165241.
Allen AG, Zuris JA. Selection by essential-gene exon knock-in for the generation of efficient cell therapies. Nat Biotechnol. 2023;42(3):388–9.
Mala U, Baral TK, Somasundaram K. Integrative analysis of cell adhesion molecules in glioblastoma identified prostaglandin F2 receptor inhibitor (PTGFRN) as an essential gene. BMC Cancer. 2022;22(1):642.
Xie J, Zhao C, Sun JM, Li JX, Yang FZ, Wang J, Nie Q. Prediction of Essential Genes in Comparison States Using Machine Learning. IEEE-ACM Trans Comput Biol Bioinform. 2021;18(5):1784–92.
Floro J, Dai AQ, Metzger A, Mora-Martin A, Ganem NJ, Cifuentes D, Wu CS, Dalal J, Lyons SM, Labadorf A, et al. SDE2 is an essential gene required for ribosome biogenesis and the regulation of alternative splicing. Nucleic Acids Res. 2021;49(16):9424–43.
Campos TL, Korhonen PK, Young ND. Cross-predicting essential genes between two model eukaryotic species using machine learning. Int J Mol Sci. 2021;22(10): 5056.
Lapolice TM, Huang YF. An unsupervised deep learning framework for predicting human essential genes from population and functional genomic data. BMC Bioinf. 2023;24(1):347.
Nandi S, Ganguli P, Sarkar RR. Essential gene prediction using limited gene essentiality information-An integrative semi-supervised machine learning strategy. PLoS ONE. 2020;15(11): e0242943.
Xu L, Guo ZR, Liu X. Prediction of essential genes in prokaryote based on artificial neural network. Genes Genom. 2020;42(1):97–106.
Amaral-Silva L, Santin JM. Synaptic modifications transform neural networks to function without oxygen. BMC Biol. 2023;21(1):54.
MacLeod N, Horwitz LK. Machine-learning strategies for testing patterns of morphological variation in small samples: sexual dimorphism in gray wolf Canis lupus crania. BMC Biol. 2020;18(1):113.
Otoupal PB, Eller KA, Erickson KE, Campos J, Aunins TR, Chatterjee A. Potentiating antibiotic efficacy via perturbation of non-essential gene expression. Commun Biol. 2021;4(1):1267.
Zhang YQ, Qiao SJ, Ji SJ, Li YZ. DeepSite: bidirectional LSTM and CNN models for predicting DNA-protein binding. Int J Mach Learn Cybern. 2020;11(4):841–51.
Lu WJ, Wang Y, Zhang MQ, Gu JW. Physics guided neural network: Remaining useful life prediction of rolling bearings using long short-term memory network through dynamic weighting of degradation process. Eng Appl Artif Intel. 2024;127: 107350.
Zhao BW, Xing HL, Wang XH, Song FH, Xiao ZW. Rethinking attention mechanism in time series classification. Inform Sciences. 2023;627:97–114.
Weber RZ, Mulders G, Kaiser J, Tackenberg C, Rust R. Deep learning-based behavioral profiling of rodent stroke recovery. BMC Biol. 2022;20(1):232.
Hou C, Li YX, Wang MY, Wu H, Li TT. Systematic prediction of degrons and E3 ubiquitin ligase binding via deep learning. BMC Biol. 2022;20(1):162.
Villemin JP, Lorenzi C, Cabrillac MS, Oldfield A, Ritchie W, Luco RF. A cell-to-patient machine learning transfer approach uncovers novel basal-like breast cancer prognostic markers amongst alternative splice variants. BMC Biol. 2021;19(1):70.
Zhang HF, Zhang F, Wang H, Ma C, Zhu PC. A novel privacy-preserving graph convolutional network via secure matrix multiplication. Inform Sciences. 2024;657: 119897.
Saberi-Bosari S, Flores KB, San-Miguel A. Deep learning-enabled analysis reveals distinct neuronal phenotypes induced by aging and cold-shock. BMC Biol. 2020;18(1):130.
Jarvela AMC, Trelstad CS, Pick L. Regulatory gene function handoff allows essential gene loss in mosquitoes. Commun Biol. 2020;3(1):540.
Ye YN, Hua ZG, Huang J, Rao N, Guo FB. CEG: a database of essential gene clusters. BMC Genomics. 2013;14: 769.
Luo H, Lin Y, Gao F, Zhang CT, Zhang R. DEG 10, an update of the database of essential genes that includes both protein-coding genes and noncoding genomic elements. Nucleic Acids Res. 2014;42(D1):D574–80.
Luo H, Lin Y, Liu T, Lai FL, Zhang CT, Gao F, Zhang R. DEG 15, an update of the Database of Essential Genes that includes built-in analysis tools. Nucleic Acids Res. 2021;49(D1):D677–86.
Hasan MA, Lonardi S. DeeplyEssential: a deep neural network for predicting essential genes in microbes. BMC Bioinf. 2020;21:367.
Le NQK, Do DT, Hung TNK, Lam LHT, Huynh TT, Nguyen NTK. A Computational framework based on ensemble deep neural networks for essential genes identification. Int J Mol Sci. 2020;21(23): 9070.
Chen J, Liu Y, Liao Q, Liu B. iEsGene-ZCPseKNC: Identify Essential Genes Based on Z Curve Pseudo k-Tuple Nucleotide Composition. IEEE Access. 2019;7:165241–7.
Shi H, Wu CJ, Bai T, Chen JH, Li Y, Wu H. Identify essential genes based on clustering based synthetic minority oversampling technique. Comput Biol Med. 2023;153: 106523.
Hu WX, Li MS, Xiao HY, Guan LX. Essential genes identification model based on sequence feature map and graph convolutional neural network. BMC Genomics. 2024;25(1):47.
Ma JN, Song JN, Young ND, Chang BCH, Korhonen PK, Campos TL, Liu H, Gasser RB. 'Bingo'-a large language model- and graph neural network-based workflow for the prediction of essential genes from protein data. Brief Bioinform. 2024;25(1):bbad472.
Xiao J, Yuan GH, He JH, Fang K, Wang ZR. Graph attention mechanism based reinforcement learning for multi-agent flocking control in communication-restricted environment. Inform Sciences. 2023;620:142–57.
Wu YJ, Zhou JT. A neighborhood-aware graph self-attention mechanism-based pre-training model for Knowledge Graph Reasoning. Inform Sciences. 2023;647: 119473.
Ranjan A, Fahad MS, Deepak A. 1-Scaled-attention: A novel fast attention mechanism for efficient modeling of protein sequences. Inform Sciences. 2022;609:1098–112.
Wu ZQ, Chen SY, Feng F, Qi JR, Feng LC, Tao N, Zhang CL. Automatic defect detection and three-dimensional reconstruction from pulsed thermography images based on a bidirectional long-short term memory network. Eng Appl Artif Intel. 2023;124: 106574.
Shahid M, Ilyas M, Hussain W, Khan YD. ORI-Deep: improving the accuracy for predicting origin of replication sites by using a blend of features and long short-term memory network. Brief Bioinform. 2022;23(2):bbac001.
Zhang YQ, Yan JR, Chen SY, Gong MQ, Gao DR, Zhu M, Gan W. Review of the Applications of Deep Learning in Bioinformatics. Curr Bioinform. 2020;15(8):898–911.
Kim S, Yun S, Lee J, Chang G, Roh W, Sohn DN, Lee JT, Park H. Self-supervised Multimodal Graph Convolutional Network for collaborative filtering. Inform Sciences. 2024;653: 119760.
Yin X, Zhang WY, Zhang S. Spatiotemporal dynamic graph convolutional network for traffic speed forecasting. Inform Sciences. 2023;641: 119056.
Lichtblau D, Stoean C. Chaos game representation for authorship attribution. Artif Intell. 2023;317: 103858.
Chan EYS, Corless RM. Chaos Game Representation. SIAM Rev. 2023;65(1):261–90.
Xu Y, Zhu FK. A new GJR-GARCH model for ℤ-valued time series. J Time Ser Anal. 2022;43(3):490–500.
Lochel HF, Eger D, Sperlea T, Heider D. Deep learning on chaos game representation for proteins. Bioinformatics. 2020;36(1):272–9.
Al Bazzal A, Hatami P, Abedini R, Etesami I, Ayanian Z, Ghandi N. A prospective comparative study of two regimens of diphenylcyclopropenone (DPCP) in the treatment of alopecia areata. Int Immunopharmacol. 2021;101: 108186.
Qin C, Chen XQ, Luo XY, Zhang XP, Sun XM. Perceptual image hashing via dual-cross pattern encoding and salient structure detection. Inform Sciences. 2018;423:284–302.
Article Google Scholar
Huang GH, Li JC. Feature Extractions for Computationally Predicting Protein Post-Translational Modifications. Curr Bioinform. 2018;13(4):387–95.
Article CAS Google Scholar
Zhang ZC, Zhang YH, Wang Y, Ma MY, Xu J. Complex exponential graph convolutional networks. Inform Sci. 2023;640: 119041.
Article Google Scholar
Zhang YM, Song Y, Wei GL. A feature-enhanced long short-term memory network combined with residual-driven v support vector regression for financial market prediction. Eng Appl Artif Intel. 2023;118: 105663.
Article Google Scholar
Wang Y, Zhang YM, Wang GG. Forecasting ENSO using convolutional LSTM network with improved attention mechanism and models recombined by genetic algorithm in CMIP5/6. Inform Sciences. 2023;642: 119106.
Article Google Scholar

Download references

Acknowledgements

We thank LetPub (www.letpub.com) for its linguistic assistance during the preparation of this manuscript.

Institutional review board statement

Not applicable.

Informed consent statement

Not applicable.

Funding

We gratefully acknowledge the support from the National Natural Science Foundation of China (Grant Numbers: 51663001, 52063002, and 42061067), the Development Projects (Modern Agriculture) of Jiangsu Province (BE2021344), the Key Research and Development Plan of Zhenjiang Science and Technology Innovation Fund (NY2020005), and the Degree and Postgraduate Education and Teaching Reform Research Projects of Jiangxi Province (Multi Thinking and “439” Scientific Research Ability Training Mode for Interdisciplinary Academic Degree Postgraduate Training).

National Natural Science Foundation of China: grants 51663001, 52063002, 42061067, and 61741202 (recipient for each: Li Mengshan).

Author information

Authors and Affiliations

Gannan Normal University, Ganzhou, Jiangxi, 341000, China
Wu Yan, Li Tan, Li Mengshan & Xie Xiaojun
Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, 212018, China
Wu Yan, Zhou Weihong, Sheng Sheng, Wang Jun & Wu Fu-an
Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, 212018, China
Wu Yan, Zhou Weihong, Sheng Sheng, Wang Jun & Wu Fu-an
Ganzhou Power Supply Branch of State Grid, Jiangxi Electric Power Co., Ltd, Ganzhou, Jiangxi, 341000, China
Fu Yu & Li Mengshan

Authors

Wu Yan
Fu Yu
Li Tan
Li Mengshan
Xie Xiaojun
Zhou Weihong
Sheng Sheng
Wang Jun
Wu Fu-an

Contributions

Conceptualization: WY, LT, and SS; methodology: WY, FY, XX, and LM; investigation: ZW, WJ, and WF; resources: WJ and WF; writing—original draft preparation: WY and LM; project administration: ZW, WJ, and WF; funding acquisition: FY, WJ, and WF. All authors have read and agreed to the published version of the manuscript.

Corresponding authors

Correspondence to Wu Yan, Li Mengshan or Wu Fu-an.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Yan, W., Yu, F., Tan, L. et al. A hybrid machine learning model with attention mechanism and multidimensional multivariate feature coding for essential gene prediction. BMC Biol 23, 108 (2025). https://doi.org/10.1186/s12915-025-02209-8

Received: 03 January 2024
Accepted: 07 April 2025
Published: 24 April 2025
DOI: https://doi.org/10.1186/s12915-025-02209-8
