An ensemble deep learning framework for multi-class LncRNA subcellular localization with innovative encoding strategy

Hu, Wenxing; Yue, Yan; Yan, Ruomei; Guan, Lixin; Li, Mengshan

doi:10.1186/s12915-025-02148-4

Research
Open access
Published: 21 February 2025

BMC Biology volume 23, Article number: 47 (2025)

Abstract

Background

Long non-coding RNAs (lncRNAs) play pivotal roles in various cellular processes, and elucidating their subcellular localization can offer crucial insights into their functions. Accurate prediction of lncRNA subcellular localization is therefore of considerable importance. Although numerous computational methods have been developed for this purpose, existing approaches still struggle with the complexity of data representation and with capturing nucleotide distribution information within sequences.

Results

In this study, we propose a novel deep learning-based model, termed MGBLncLoc, which incorporates a unique multi-class encoding technique known as generalized encoding based on the Distribution Density of Multi-Class Nucleotide Groups (MCD-ND). This encoding approach enables more precise reflection of nucleotide distributions, distinguishing between constant and discriminative regions within sequences, thereby enhancing prediction performance. Additionally, our deep learning model integrates advanced neural network modules, including Multi-Dconv Head Transposed Attention, Gated-Dconv Feed-forward Network, Convolutional Neural Network, and Bidirectional Gated Recurrent Unit, to comprehensively exploit sequence features of lncRNA.

Conclusions

Comparative analysis against commonly used sequence feature encoding methods and existing prediction models validates the effectiveness of MGBLncLoc, demonstrating superior performance. This research offers novel insights and effective solutions for predicting lncRNA subcellular localization, thereby providing valuable support for related biological investigations.

Background

Long non-coding RNAs (lncRNAs) are a class of RNA molecules longer than 200 nucleotides [1] that typically lack protein-coding capacity [2]. They perform various crucial biological functions within cells and are closely associated with gene expression regulation, cell cycle regulation, cell differentiation, and tumorigenesis [3, 4]. With the continuous development of bioinformatics and molecular biology research methods, our understanding of lncRNA has gradually deepened [5]. Compared to traditional RNAs, lncRNA exhibit more complex and diverse structures and functions. In addition to their involvement in gene expression regulation, alternative splicing, and nuclear organization, recent studies have revealed that lncRNA can function as signaling molecules and modulators of proteins, participating in cellular signal transduction and metabolic regulation [6, 7]. Abnormal expression of lncRNA is also closely associated with the occurrence and progression of various diseases, including many types of cancer and neurodegenerative diseases [8]. Therefore, lncRNA are not only vital components of cellular function and regulatory networks but also a current hotspot in biomedical research [9,10,11,12].

Although the importance of lncRNA in cellular processes is increasingly evident, research on their subcellular localization still faces challenges [13]. In recent years, mounting evidence indicates the significant importance of the subcellular localization of lncRNA for understanding their biological functions [14]. Currently, traditional wet laboratory techniques, especially single-molecule fluorescence in situ hybridization (smFISH) technology [15], although capable of accurately localizing RNA subcellular information, suffer from issues such as high cost, time consumption, and complex operations. Fluorescence in situ sequencing (FISSEQ) is another experimental method that combines in situ hybridization with high-throughput sequencing technology, enabling effective molecular counting at subcellular resolution [16]. However, it is restricted by numerous highly expressed lncRNA. These limitations not only increase research costs but also restrict the acquisition and analysis speed of large-scale data, thus hindering the progress of research on lncRNA subcellular localization [17]. Therefore, there is an urgent need to develop new computational methods to address these issues [18,19,20,21,22,23,24]. These methods aim to improve the accuracy and efficiency of predicting lncRNA subcellular localization, providing researchers with faster, more economical, and reliable tools. This will promote further development in the study of lncRNA subcellular localization, thereby deepening our understanding of the roles and mechanisms of lncRNA in cell biology.

In lncRNA subcellular localization studies, numerous methods have been proposed, categorized into three main types: Traditional Feature-based Methods, Deep Learning-based Methods, and Hybrid Methods. (i) Traditional Feature-based Methods: These rely on sequence features for prediction. In 2018, Cao et al. introduced lncLocator [25], which used 4-mer frequency features combined with stacked autoencoders, random forests, and SVMs. That year, Su et al. proposed iLoc-lncRNA [26], which applied 8-mer frequency features, binomial distribution for feature selection, and SVMs. Gudenas and Wang developed DeepLncRNA [27], which leveraged deep learning, 2–5-mer features, RNA binding motifs, and genomic locations. In 2020, Ahmad et al. presented Locate-R [28], which used local deep SVMs and selected 655 optimal k-mer features. That year, Fan et al. introduced lncLocPred [29], a logistic regression-based predictor using k-mer, PseKNC, and Triplet features. lncLocation (Feng et al.) integrated multiple features with autoencoders for feature extraction and hybrid selection [30]. Zhang et al. improved iLoc-lncRNA to version 2.0 [31], adding mutual information-based and incremental feature selection strategies. (ii) Deep Learning-based Methods: These methods use deep learning models, including CNNs and graph-based networks. Zeng et al. developed DeepLncLoc [32], based on a text CNN using subsequence embedding techniques. Jeon et al. introduced TACOS, a tree-based stacking classifier for predicting lncRNA localization in 10 cell types [33]. In 2023, Li et al. proposed GraphLncLoc [34], which converted lncRNA sequences into graphs and used graph convolutional networks. Zeng et al. also proposed LncLocFormer [35], which used 8 Transformer blocks to model long-range sequence dependencies with a localization-specific attention mechanism. In 2024, Li et al. developed SGCL-LncLoc [36], converting sequences into de Bruijn graphs, using Word2Vec for node representations, and refining them through graph convolutional networks. (iii) Hybrid Methods: These combine feature-based and deep learning approaches. In 2023, Yuan et al. proposed RNALight [37], which used k-mer features and LightGBM for mRNA and lncRNA localization prediction. Yang et al. introduced lncSLPre [38], which integrated sequence composition, physicochemical properties, and structural data, combining classifier outputs. Wang et al. fine-tuned a pre-trained multitask RNA binding protein model to develop DeepLocRNA for predicting subcellular localization of various RNAs [39]. In summary, an increasing number of studies have explored the application of machine learning methods to predict the subcellular localization of lncRNA, demonstrating promising performance and significant progress. However, existing methods have limitations in both prediction accuracy and universality. Most predictors encode the original lncRNA sequences using k-mer features. A purely numerical k-mer frequency representation cannot retain the sequential order information of the original lncRNA sequences and fails to capture the position-specific distribution of nucleotides. Effectively characterizing the nucleotides in the sequences is therefore crucial for accurately predicting subcellular localization categories.

Understanding the subcellular localization of long non-coding RNA (lncRNA) is crucial for deciphering their biological functions. Existing computational methods, including k-mer frequency-based models and deep learning architectures, have made significant progress in predicting lncRNA localization. However, these approaches still face two major limitations: (i) they often fail to capture position-specific nucleotide distribution patterns, which are essential for distinguishing localization signals; (ii) they lack an effective feature representation strategy that can generalize across multiple subcellular compartments. To address these challenges, we propose MGBLncLoc, a novel deep learning framework that integrates a newly designed encoding strategy—Multi-Class Nucleotide Distribution-based Generalized Encoding (MCD-ND). Unlike traditional k-mer frequency-based encodings, which treat sequences as simple frequency distributions, MCD-ND explicitly models nucleotide distribution density across different positions, thereby preserving both sequence composition and spatial organization. This encoding mechanism enhances feature differentiation between localization classes, improving model interpretability and prediction accuracy. To further leverage the distinctive features of MCD-ND, we design a deep learning architecture that incorporates multiple specialized neural network modules. Specifically, our model integrates Multi-Dconv Head Transposed Attention (MDTA) for capturing hierarchical sequence dependencies, Gated-Dconv Feed-forward Network (GDFN) for refining feature representations, Convolutional Neural Networks (CNNs) for extracting local sequence motifs, and Bidirectional Gated Recurrent Units (BiGRUs) for modeling long-range contextual dependencies. This synergistic combination allows MGBLncLoc to effectively learn both global and local sequence features, surpassing existing models in predictive performance. 
Through extensive comparative experiments, we demonstrate that MGBLncLoc significantly outperforms state-of-the-art lncRNA subcellular localization predictors, achieving superior accuracy, F1-score, and AUC. Our method provides a more precise and robust solution to multi-class localization prediction, offering valuable insights for future research on lncRNA functionality and regulatory mechanisms.

The novelty of this study lies in two key aspects: (i) The MGBLncLoc model incorporates the unique encoder MCD-ND, which enables a more precise representation of nucleotide regions within the sequence, distinguishing between conserved and discriminatory regions. The MCD-ND encoder not only considers structural characteristics among different nucleotide groups within the sequence but also emphasizes the analysis of nucleotide distribution across the sequence. (ii) The MGBLncLoc model integrates multiple advanced DNN modules, and through ablation experiments, it demonstrates the effectiveness of the combination and synergy of various modules.

Methods

Datasets

The development of machine learning models critically depends on high-quality datasets, as their quality directly impacts model performance and generalization ability. In current research on lncRNA subcellular localization, researchers have primarily constructed three benchmark datasets: one comprising data from five subcellular compartments derived by Zeng et al. from the RNALocate 1.0 database [40], another comprising data from two subcellular compartments built by Yang et al., and four subcellular compartment datasets constructed by other researchers. To obtain a reliable dataset, this study followed the methods of previous research and downloaded known lncRNA subcellular localization sequences from the RNALocate 2.0 database [41]. The RNALocate 2.0 database contains records of over 210,000 RNA-related subcellular localization entries and experimental evidence, encompassing more than 110,000 RNAs across 171 subcellular localizations of 104 species. Compared to version 1.0, version 2.0 has expanded data sources and species coverage. We downloaded 9256 subcellular localization sequences related to lncRNA, some of which belong to multiple subcellular localizations. Therefore, only data located in a single compartment were retained to objectively assess the predictive ability of the model. Sequences are often divided into multiple entries, which are merged if they have the same gene symbol. The resulting dataset contains lncRNA sequences distributed among 10 subcellular localizations. Due to the insufficient number of sequences for certain cell localizations such as endoplasmic reticulum, mitochondria, synapses, nucleoplasm, and exosomes, which were not enough to construct statistically meaningful benchmark datasets, sequences from these subcellular localizations were excluded. To avoid redundancy and reduce homology bias while preserving the original distribution, the CD-HIT program [42] was used to exclude sequences with more than 20% similarity. 
This approach minimized similarity among sequences, allowing for more accurate identification and prediction of sequences in different subcellular locations. Consequently, datasets were obtained for four compartments: nucleus, cytoplasm, ribosome, and cytosol. As illustrated in Fig. 1, these datasets display the number of sequences for each subcellular localization and the distribution of sequence lengths.

For experimental and evaluative purposes, each category is randomly partitioned from the dataset, with 20% allocated as an independent test set and the remaining portion as the training set. Throughout the training process, a validation set is extracted from the training set to fine-tune the model’s hyperparameters and detect potential overfitting. Following the completion of model training, the test set is employed to comprehensively assess the model’s performance. The benchmark dataset setup can be summarized as follows:

$${\mathbb{S}}={\mathbb{S}}^{Nucleus} \cup {\mathbb{S}}^{Cytoplasm}\cup {\mathbb{S}}^{Cytosol}\cup {\mathbb{S}}^{Ribosome}$$

(1)
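The per-category 80/20 partition described above can be illustrated with a minimal sketch. The function name `stratified_split`, the fixed random seed, and the toy label list are assumptions made for illustration; the source does not specify how the random partition was implemented.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), drawing test_fraction of every class."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels for the four compartments (class sizes are made up).
labels = ["Nucleus"] * 10 + ["Cytoplasm"] * 5 + ["Cytosol"] * 5 + ["Ribosome"] * 5
train_idx, test_idx = stratified_split(labels)
```

A validation subset could be carved out of `train_idx` in the same way by calling the function again on the training labels.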

Multi-class modified nucleotide position-aware encoding

Deep learning models typically cannot directly handle raw genetic sequences due to their inherent reliance on numerical inputs. To convert the four basic nucleotide compositions within sequences into numerical formats suitable for model learning, numerous sequence encoding methods have been proposed. Among the popular encodings are One-hot encoding [43,44,45], Nucleic acid Chemical Properties (NCP) encoding [45, 46], and Dinucleotide Physical and Chemical Properties (DPCP) encoding [45, 47]. These methods encode genetic sequences into numerical matrices based on manually defined rules, significantly enhancing the efficiency and performance of model learning. However, they still fail to comprehensively capture the positional features of nucleotides within the sequences. In order to generate more comprehensive numerical representations of sequences, extract the distribution patterns of nucleotides at different positions within lncRNA sequences for subcellular localization, and better address the multi-class problem of subcellular localization, we propose the Normalized Differential Position-aware K-mer Encoding method based on Multi-class modified nucleotide density (MCD-ND).

To convert DNA sequences into numerical representations, it is necessary to first segment them into k-mers using a fixed-size window of length k and a specified step size. Each of these segments, or k-mers, represents a distinct set of nucleotides. The size of these k-mers depends on the window size used for segmentation. Collecting all unique k-mers forms a vocabulary, the size of which is determined by the value of k, assuming a sufficient number of sequence samples. For instance, when k = 1, the vocabulary consists of ${4}^{1}=4$ possible k-mers, corresponding to the nucleotides A, C, G, and T; when k = 2, the vocabulary expands to ${4}^{2}=16$ distinct k-mers, including AA, AC, AG, AT, and so forth. The positions of generated k-mers within each sequence can be denoted as ${P}_{i}={P}_{1},{P}_{2}\cdots$. As the size of the k-mers increases, so does the size of the vocabulary, and with longer sequences, the number of positions also increases accordingly.
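As a small illustration of the segmentation step, the following sketch slides a window of length k over a sequence and enumerates the full k-mer vocabulary. The helper names `kmers` and `vocabulary` are hypothetical and are not taken from the source.

```python
from itertools import product

def kmers(sequence, k, step=1):
    """Slide a window of length k over the sequence with the given step."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]

def vocabulary(k, alphabet="ACGT"):
    """Enumerate all 4**k possible k-mers in lexicographic order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]

words = kmers("ACGTA", 2)   # k-mers at positions P1..P4
vocab = vocabulary(2)       # 16 dinucleotides
```

With a step of 1, a sequence of length L yields L - k + 1 k-mers, which matches the encoding length used later in this section.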

After segmenting the sequences into k-mers, the frequencies of the vocabulary at different positions within sequences of various categories are tallied individually. This procedure yields multiple matrices, labeled as ${A}_{class c}$, with each matrix having dimensions of z positions and n vocabulary terms, as illustrated in Eq. 2.

$${A}_{class\;1}=\left[\begin{array}{cccc}{f}_{1,1}& {f}_{1,2}& \cdots & {f}_{1,n}\\ \vdots & \vdots & \ddots & \vdots \\ {f}_{z,1}& {f}_{z,2}& \cdots & {f}_{z,n}\end{array}\right],\;{A}_{class\;2}=\left[\begin{array}{cccc}{f}_{1,1}& {f}_{1,2}& \cdots & {f}_{1,n}\\ \vdots & \vdots & \ddots & \vdots \\ {f}_{z,1}& {f}_{z,2}& \cdots & {f}_{z,n}\end{array}\right],\;\cdots,\;{A}_{class\;c}=\left[\begin{array}{cccc}{f}_{1,1}& {f}_{1,2}& \cdots & {f}_{1,n}\\ \vdots & \vdots & \ddots & \vdots \\ {f}_{z,1}& {f}_{z,2}& \cdots & {f}_{z,n}\end{array}\right]$$

(2)

Each element ${f}_{i,j}$ of the matrix represents the frequency of the jth word at the ith position in the sequence, where $c$ denotes the number of classes, indicating different types of gene sequences.

After the frequency distribution matrices for each sequence category have been obtained, the total number of k-mers in the sequences of each category, denoted ${NS}_{c}$, is tallied. Each frequency distribution matrix is then normalized by this total to obtain the density distribution matrix ${A}_{class c}^{den}$, as shown in Eq. 3.

$$\left\{\begin{array}{c}A_{class\;1}^{den}=\frac{A_{class\;1}}{{NS}_1},\;0\leq A_{class\;1}^{den}\leq1\\A_{class\;2}^{den}=\frac{A_{class\;2}}{{NS}_2},\;0\leq A_{class\;2}^{den}\leq1\\\dots\\A_{class\;c}^{den}=\frac{A_{class\;c}}{{NS}_c},\;0\leq A_{class\;c}^{den}\leq1\end{array}\right.$$

(3)
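The counting and normalization of Eqs. 2 and 3 can be sketched for a single category as follows. This is a minimal illustration, not the authors' implementation: the function name `density_matrix`, the sparse dictionary-per-position layout, and the toy sequences are assumptions.

```python
def density_matrix(sequences, k, n_positions):
    """Position-specific k-mer densities for the sequences of one category.

    Row i is a dict mapping each k-mer seen at position i to its count
    divided by NS, the total number of k-mers counted for the category.
    """
    counts = [dict() for _ in range(n_positions)]
    total = 0  # NS_c in Eq. 3
    for seq in sequences:
        for pos in range(min(n_positions, len(seq) - k + 1)):
            word = seq[pos:pos + k]
            counts[pos][word] = counts[pos].get(word, 0) + 1
            total += 1
    return [{w: c / total for w, c in row.items()} for row in counts]

# Two toy sequences of one category, k = 2, two positions.
dens = density_matrix(["ACG", "ACT"], k=2, n_positions=2)
```

Because every count is divided by the same total, each entry lies between 0 and 1 and all entries of one category sum to 1.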

To achieve a more comprehensive statistical representation of DNA sequences, we then utilize the density distribution matrices of each category to compute the Position-Specific Trinucleotide Propensity (PSTNPss) score for the jth k-mer at the ith position of the gene sequence. Equation 4 offers the general mathematical expression for calculating the $PSTNPss$ score.

$$PSTNPss=A_{class\;1}^{den}-A_{class\;2}^{den}-\dots-A_{class\;c}^{den}$$

(4)

The density distributions of k-mers at the same position across different categories often exhibit disparities, and in some cases, substantial discrepancies. To enhance the discriminative capacity of the PSTNPss score, we introduce category balancing factors, yielding the MCD-ND score matrix, as depicted in Eq. 5.

$$\text{MCD-ND}=\frac{PSTNPss}{\min\left(A_{class\;1}^{den},A_{class\;2}^{den},\cdots,A_{class\;c}^{den}\right)},\quad\min\left(A_{class\;1}^{den},A_{class\;2}^{den},\cdots,A_{class\;c}^{den}\right)>0$$

(5)

The gene sequence, of length L, is segmented into L-k + 1 k-mers. These k-mers are then encoded into a 1 × (L-k + 1) matrix in accordance with their sequential positions in the sequence, based on the MCD-ND score matrix. Subsequently, this matrix is inputted into the model for learning. The encoding procedure of gene sequences is depicted in Fig. 2 when class = 2 and k = 2.
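To make the scoring and lookup concrete, the sketch below evaluates Eqs. 4 and 5 for one k-mer at one position and then encodes a sequence position by position. The names `mcd_nd_score` and `encode` and the toy density values are hypothetical. Eq. 5 is only defined when the minimum density is positive; returning 0.0 otherwise is an assumption of this sketch, since the text does not state how such entries are handled.

```python
def mcd_nd_score(densities, pos, word):
    """MCD-ND score of one k-mer at one position.

    densities holds one density matrix per class, each a list of dicts
    (one dict per position) mapping k-mers to densities.
    """
    values = [d[pos].get(word, 0.0) for d in densities]
    pstnpss = values[0] - sum(values[1:])   # Eq. 4
    smallest = min(values)
    if smallest <= 0:
        return 0.0                          # assumption: Eq. 5 undefined here
    return pstnpss / smallest               # Eq. 5

def encode(sequence, k, densities):
    """Encode a sequence as a list of up to L - k + 1 MCD-ND scores."""
    n_positions = len(densities[0])
    return [mcd_nd_score(densities, pos, sequence[pos:pos + k])
            for pos in range(min(n_positions, len(sequence) - k + 1))]

# Toy density matrices for two classes (values are made up).
class_1 = [{"AC": 0.5}, {"CG": 0.25, "CT": 0.25}]
class_2 = [{"AC": 0.25, "GC": 0.25}, {"CG": 0.5}]
vector = encode("ACG", 2, [class_1, class_2])
```

In this toy case "AC" is twice as dense in class 1 at the first position, giving a positive score, while "CG" is denser in class 2 at the second position, giving a negative one.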

Model construction

After encoding the subcellularly localized lncRNA sequences into the numerical matrices required by machine learning models, we use several deep neural network (DNN) modules to construct multi-class predictors, aiming to uncover hidden information within the feature matrices containing nucleotide-specific positional distributions. Figure 3 illustrates the construction of the multi-class predictor for subcellular localization. The model, referred to as MGBLncLoc, comprises four parts. The first part is the feature enhancement module, which combines a Multi-Dconv Head Transposed Attention (MDTA) module and a Gated-Dconv Feed-forward Network (GDFN) module [48]. The second part is the multi-scale Convolutional Neural Network (CNN) module. The third part is the Bidirectional Gated Recurrent Unit (BiGRU) module [49], responsible for handling context-related sequences. Lastly, the classification module, consisting of fully connected layers, performs nonlinear processing of the information extracted by the preceding network layers.

To effectively fuse the feature information extracted from the encoded sequences, we devised a feature enhancement module based on channel attention. This module initially computes channel attention to further consolidate the feature information. Subsequently, the fused feature channels are reduced to a singular channel through a reconstruction layer, yielding the final enhanced features. This design facilitates the model in comprehensively capturing the feature relationships within the sequence data, thereby generating more expressive feature representations and enhancing the model’s performance. The module comprises two components: the MDTA module and the GDFN module. To mitigate the computational burden of the network, the MDTA calculates cross-covariance across channels. Initially, the input features ${M}_{i}$ undergo processing using pointwise convolution (PW Conv) and depthwise convolution (DW Conv) to yield $Q\in B\times N\times M$,$K\in B\times N\times M$,$V\in B\times N\times M$. PW Conv operates on channels for content encoding and integrates contextual information among channels, while DW Conv further encodes spatial context. Subsequently, reshaping operations are applied to obtain $\widehat{Q}\in B\times N\times M$,$\widehat{K}\in B\times N\times M$,$\widehat{V}\in B\times N\times M$. The dot product of $\widehat{Q}$ and $\widehat{K}$ is computed to generate the channel attention map $A$ of size $M\times M$, as depicted by Eq. 6.

$$A=Softmax( \widehat{K}\cdot \widehat{Q} )$$

(6)

After obtaining the attention map $A$, we perform a weighting operation on $V$, multiplying it by $A$ to obtain the enhanced feature representation. The specific operation is as follows:

$$V'=A\cdot V$$

(7)

Here, $V'$ represents the weighted feature map, enhancing inter-channel relationships. Subsequently, through reconstruction and additive residual connections, these feature maps are integrated into the initial features. For more precise residual information, we employ GDFN. Initially, input features undergo depthwise convolution to encode spatial context information. Following this, features are reconstructed, fused, and connected additively with the initial features to yield the final output feature matrix $X'$. GDFN enriches the feedforward network (FN) by incorporating a GELU activation branch and depthwise convolution. This enriches feature representation and, using spatial context information, enhances the recovery of local details. The GELU activation function is defined as:

$$GELU(x)=x\cdot {\varvec{\phi}}( x )$$

(8)

Here, ${\varvec{\phi}}( x )$ is the cumulative distribution function (CDF) of the standard normal distribution, typically computed as:

$${\varvec{\phi}}\left( x \right)=0.5\cdot\left(1+\text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$$

(9)
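Eqs. 8 and 9 translate directly into a few lines of code. The sketch below uses only the standard-library error function; the function name `gelu` is chosen for illustration.

```python
import math

def gelu(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF (Eqs. 8 and 9)."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

values = [gelu(v) for v in (-1.0, 0.0, 1.0)]
```

Unlike ReLU, GELU lets small negative inputs pass with a reduced weight instead of cutting them to zero.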

In this section, the MDTA module is primarily employed to capture local features within sequence data and emphasize relationships between different positions through an attention mechanism. Combining multi-head decomposable convolution and transpose attention mechanisms, it aids the model in better understanding the local structures within the input sequence. Meanwhile, the GDFN module integrates the structures of gated convolutions and feedforward neural networks. Gated convolutions are typically used to control the flow of information, while feedforward neural networks handle feature representations. The primary role of GDFN is to more effectively process sequence data. Through gate mechanisms, it regulates information flow and assists the model in capturing global features within the sequence data more effectively.
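The channel attention of Eqs. 6 and 7 can be sketched in plain Python as follows. This is a simplified illustration under several assumptions: features are stored channel-major (M channel rows, each with N positions) so that the product in Eq. 7 is well defined, and the pointwise and depthwise convolutions, the multi-head split, and any scaling factor are omitted. The names `softmax`, `matmul`, and `channel_attention` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def channel_attention(q, k, v):
    """q, k, v: M channel rows of N positions each (layout is an assumption)."""
    k_t = [list(col) for col in zip(*k)]             # N x M
    attn = [softmax(row) for row in matmul(q, k_t)]  # M x M channel map, Eq. 6
    return attn, matmul(attn, v)                     # V' = A . V, Eq. 7

features = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]        # M = 2 channels, N = 3
attn, enhanced = channel_attention(features, features, features)
```

Because the attention map is M × M rather than N × N, its cost grows with the number of channels instead of the sequence length, which is the computational saving the text attributes to MDTA.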

In the CNN module, the convolutional layers perform an operation akin to using a sliding window to extract motifs from sequences with high activation feature information. Therefore, MGBLncLoc utilizes three 1D convolutional layers, which conduct convolution in parallel with multiple convolutional blocks on the feature matrix. The feature vectors are derived through multi-scale convolutional layers with Rectified Linear Units (ReLU) as the activation function. Following the acquisition of the convolutional feature matrix, max-pooling is applied to mitigate overfitting by reducing the number of features. The convolutional layers can be mathematically represented as shown in Eq. 10.

$${Conv({X'})}_{i,j}=ReLU({\sum }_{s=0}^{S-1}{\sum }_{n=0}^{N-1}{W}_{s,n}^{j} {{X'}}_{i+s,n})$$

(10)

In this equation, $X'$ represents the enhanced feature matrix obtained after encoding the gene sequence, where $i$ is the index of the output position and $j$ is the index of the filter. Each convolutional filter ${W}^{j}$ is an $S\times N$ matrix, where $\text{S}$ is the filter size determined by hyperparameter optimization, and $\text{N}$ is the number of input channels. For the first convolutional layer, $\text{N}$ is the input dimension of the feature matrix after sequence encoding. ReLU is represented as:

$$ReLU(x)=\left\{\begin{array}{c}x,\;if\;x\geq0\\0,\;if\;x<0\end{array}\right.$$

(11)
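The convolution of Eq. 10 followed by the ReLU of Eq. 11 can be written out explicitly. The sketch below is a direct, unoptimized rendering; the function names, the absence of padding and bias terms, and the toy inputs are assumptions made for illustration.

```python
def relu(x):
    """Eq. 11."""
    return x if x >= 0 else 0.0

def conv1d_relu(x, filters):
    """Eq. 10: x has L rows of N channels; each filter is an S x N matrix.

    Returns L - S + 1 rows with one activation per filter (no padding).
    """
    size = len(filters[0])
    n_channels = len(x[0])
    out = []
    for i in range(len(x) - size + 1):
        out.append([relu(sum(w[s][n] * x[i + s][n]
                             for s in range(size) for n in range(n_channels)))
                    for w in filters])
    return out

signal = [[1.0], [2.0], [-3.0], [4.0]]      # L = 4 positions, N = 1 channel
filters = [[[1.0], [1.0]], [[1.0], [-1.0]]]  # two filters with S = 2
activations = conv1d_relu(signal, filters)
```

Running several such layers with different filter sizes S in parallel gives the multi-scale behaviour described for the CNN module.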

Nowadays, the processing of most gene sequence data relies on recurrent neural network architectures such as LSTM, GRU, etc. [50, 51], among which BiGRU has demonstrated remarkable success in classification tasks. BiGRU can capture time-series features in both the forward and backward directions of the sequence, resulting in more robust and information-rich feature representations compared to individual GRUs. In the MGBLncLoc model, BiGRUs leverage their internal states to process sequence vectors, fully exploiting sequence context information in both directions. Gated Recurrent Units (GRUs) constitute the main component of BiGRUs, which are utilized to dynamically remember or forget sequence information. The primary operations in the GRU layer include update gates and reset gates. Update gates control the degree to which new input information is fused with the previous hidden state, while reset gates regulate how the previous hidden state is used to compute candidate activation values. The computation of these gates involves a linear combination of the weight matrix with the inputs and the previous hidden states, activated by a Sigmoid function. Through the integration of these gates, BiGRU effectively controls the flow and retention of information. Specifically, the update gate ${z}_{j}^{t}$ and the reset gate ${r}_{j}^{t}$ for the $jth$ hidden unit at time step t are computed as shown in Eq. 12.

$$\left\{\begin{array}{c}{z}_{j}^{t}=\sigma {({W}_{z}{X}_{t}+{U}_{z}{h}_{t-1})}^{j}\\ {r}_{j}^{t}=\sigma {({W}_{r}{X}_{t}+{U}_{r}{h}_{t-1})}^{j}\end{array}\right.$$

(12)

Here, $\sigma$ denotes the logistic sigmoid function, ${W}_{z}$, ${U}_{z}$, ${W}_{r}$, and ${U}_{r}$ are learned weight matrices, and ${h}_{t-1}$ denotes the previous hidden state. ${X}_{t}$ is the input at time step $t$; in the first unit, the output of the multi-scale convolutional network layer serves as the input.

After the update and reset gates have been computed, the activation value ${h}_{j}^{t}$ of the $j$th hidden unit at time step $t$ is obtained by interpolating, under the control of the update gate, between the previous activation ${h}_{j}^{t-1}$ and the candidate activation ${\widetilde{h}}_{j}^{t}$. The candidate activation is determined by applying the hyperbolic tangent function tanh to the combination of the input data and the previous hidden state, the latter modulated by the reset gate.

BiGRU dynamically updates hidden states in this manner on sequence data to better capture important features and relationships within the sequence. This bidirectional processing enables the model to more comprehensively understand sequence data, thereby enhancing its performance across various sequence tasks.
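A single GRU step and its bidirectional use can be sketched as follows. The gate computations follow Eq. 12; the candidate activation and the interpolation between the previous and candidate states use the commonly cited GRU formulation, which is an assumption here because the text does not give those equations. Bias terms are omitted, and the parameter names `Wh` and `Uh` as well as the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def gru_step(x, h_prev, p):
    """One GRU step; p maps names such as 'Wz' and 'Uz' to weight matrices."""
    def layer(w, u, act, h):
        return [act(a + b) for a, b in zip(matvec(p[w], x), matvec(p[u], h))]
    z = layer("Wz", "Uz", sigmoid, h_prev)            # update gate, Eq. 12
    r = layer("Wr", "Ur", sigmoid, h_prev)            # reset gate, Eq. 12
    h_reset = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = layer("Wh", "Uh", math.tanh, h_reset)    # candidate activation
    # Assumed convention: z weights the previous state, 1 - z the candidate.
    return [zi * hp + (1 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]

def bigru(xs, p_fwd, p_bwd, hidden):
    """Concatenate forward and backward hidden states at every time step."""
    def run(seq, p):
        h, states = [0.0] * hidden, []
        for x in seq:
            h = gru_step(x, h, p)
            states.append(h)
        return states
    fwd = run(xs, p_fwd)
    bwd = run(xs[::-1], p_bwd)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

zeros = {name: [[0.0]] for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
half = gru_step([1.0], [2.0], zeros)
```

With all-zero weights both gates equal 0.5 and the candidate is 0, so the step simply halves the previous state, which makes the role of the update gate easy to check by hand.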

In the classification module, the information extracted by the preceding network layers undergoes nonlinear processing through fully connected layers. The output vector of these fully connected layers serves as input to the Sigmoid activation function. In the context of the multi-class subcellular localization problem, the probabilities of each node are independent, resulting in scores ranging from 0 to 1 for each node. Ultimately, MGBLncLoc seeks to establish the following mapping relationship:

$$\widehat{Y}=\arg\max f\left(\text{MCD-ND}_{n}\left(i\right);W\right)$$

(13)

where $\widehat{Y}$ represents the predicted scores of the neural network for the subcellular localization of lncRNA sequences; ${MCD-ND}_{n}$ denotes the feature matrix of the sequences encoded through MCD-ND; $W$ represents the parameters of the neural network; and $f$ is the mapping function sought by the neural network.

To establish this mapping relationship, it is essential to define a loss function that quantifies the disparity between the predicted labels and the ground truth labels. We employ the commonly used cross-entropy loss function [52], as depicted in Eq. 14.

$$L=-\frac{1}{N}{\sum }_{n=1}^{N}\left({y}^{(n)}\log{p}_{i}^{(n)}+(1-{y}^{\left(n\right)})\log(1-{p}_{i}^{(n)})\right)$$

(14)

where $N$ represents the number of samples, ${y}^{(n)}$ is a binary variable indicating whether the nth sample belongs to the ith subcellular localization, and ${p}_{i}^{(n)}$ is the probability predicted by the neural network that the nth sample belongs to the ith subcellular localization.
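For reference, the loss in Eq. 14 can be computed for one localization class as in the sketch below. The function name and the clipping constant `eps`, which guards against taking the logarithm of zero, are assumptions and not part of the source.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Eq. 14 for one class: mean negative log-likelihood of binary labels."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # clipping is an assumption
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])   # equals log 2
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
```

The loss shrinks as the predicted probabilities move toward the true labels, so `confident` is smaller than `uninformed`.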

Model performance evaluation

To assess the classification performance of the machine learning model, we employed several widely used evaluation metrics for multi-class classification problems, following the approach used in previous studies [25, 26]. These evaluation metrics encompass Macro-Precision, Macro-Recall, Macro-F1-Score, and Macro-Accuracy, alongside the Area Under the ROC Curve (AUC) for comprehensive assessment. Precision, recall, F1-score, and accuracy were first computed for each subcellular localization category and then macro-averaged. The computation of each evaluation metric is given in Eq. 15.

$$\left\{\begin{array}{c}{Precision}_{i}=\frac{{TP}_{i}}{{TP}_{i}+{FP}_{i}} \\ {Recall}_{i}=\frac{{TP}_{i}}{{TP}_{i}+{FN}_{i}} \\ {F1\text{-}Score}_{i}=\frac{2\times {Precision}_{i}\times {Recall}_{i}}{{Precision}_{i}+{Recall}_{i}}\\ {Accuracy}_{i}=\frac{{TP}_{i}+{TN}_{i}}{{TP}_{i}+{FN}_{i}+{TN}_{i}+{FP}_{i}} \\ Macro\ Precision=\frac{1}{N}\sum_{i=1}^{N}{Precision}_{i}\\ Macro\ Recall=\frac{1}{N}\sum_{i=1}^{N}{Recall}_{i}\\ Macro \ F1\text{-}Score=\frac{1}{N}\sum_{i=1}^{N}{F1\text{-}Score}_{i} \\ Macro \ Accuracy=\frac{1}{N}\sum_{i=1}^{N}{Accuracy}_{i}\end{array}\right.$$

(15)

where N represents the total number of classes, TP_i denotes the number of true positives for class i, TN_i stands for the number of true negatives for class i, FP_i represents the number of false positives for class i, and FN_i indicates the number of false negatives for class i. AUC is defined as the area enclosed by the ROC curve and the coordinate axes [53]. A higher AUC value, closer to 1.0, indicates better model performance.
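The macro-averaged metrics of Eq. 15 can be computed from true and predicted labels with a one-vs-rest count per class, as in the sketch below. The function name, the returned dictionary keys, and the choice to report 0.0 when a denominator is zero are assumptions made for illustration.

```python
def macro_metrics(y_true, y_pred, classes):
    """Macro precision, recall, F1 and accuracy as defined in Eq. 15."""
    n = len(y_true)
    per_class = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        tn = n - tp - fp - fn
        precision = tp / (tp + fp) if tp + fp else 0.0   # assumed fallback
        recall = tp / (tp + fn) if tp + fn else 0.0      # assumed fallback
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        accuracy = (tp + tn) / n
        per_class.append((precision, recall, f1, accuracy))
    names = ("precision", "recall", "f1", "accuracy")
    return {name: sum(row[i] for row in per_class) / len(classes)
            for i, name in enumerate(names)}

# Toy example with two classes (labels are made up).
scores = macro_metrics(["N", "N", "C", "C"], ["N", "C", "C", "C"], ["N", "C"])
```

Because every class contributes equally to the average, the macro scores are not dominated by the largest category, which matters here since the nucleus class contains the most sequences.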

Results and discussion

Comparison of encoder parameters on model performance

The performance of the model varies depending on the parameters of the encoder. To optimize the performance of MGBLncLoc, we conducted experiments on the encoder’s parameters. The encoding method used, MCD-ND, primarily depends on the number and length of k-mers in the sequence. The number of k-mer positions in the sequence determines the rows in the MCD-ND score matrix, which is influenced by the sequence length. The lengths of lncRNA sequence data for subcellular localization downloaded from RNALocate v2.0 vary, as illustrated in Fig. 1. The longest sequence, found in the cell nucleus, spans 551120 bp, while the shortest is 254 bp, with similar lengths observed for sequences in other locations. To determine the optimal number of k-mer positions in the MCD-ND encoding, original data entries were divided into six lengths: 101 bp, 141 bp, 181 bp, 221 bp, 261 bp, and 301 bp, respectively, to standardize the data sequence lengths. This division significantly expands the number of original sequence entries and enables more efficient model training on relatively shorter sequences, thereby better capturing sequence features and achieving improved training outcomes. Considering computing resource limitations, longer sequence lengths may entail higher computational costs. Opting for a shorter range helps better balance performance and efficiency within the constraints of computing resources. Shorter sequences may not sufficiently capture the critical biological features of lncRNA sequences. Choosing these lengths ensures that enough biological information is contained within the limited length, thereby enhancing the model’s predictive power. Table 1 presents the information of the divided data.

Table 1 Number of the different sequence lengths in the datasets


When adjusting the size of k-mers, the encoder computes different MCD-ND score matrices, directly impacting the feature representation of subcellular localization lncRNA sequences, thereby influencing the model’s recognition performance. Sequence features of 1-mer and 2-mer exhibit relatively high and similar frequencies across sequences of different categories, hence classified as frequent features. Conversely, 3-mer, 4-mer, and 5-mer, as k increases, display significant differences in distribution among sequences of different categories, thus not being common features. Therefore, experiments were conducted to analyze whether different k-mers could generate better features for the model. We established models based on varying sequence lengths and combinations of different k-mers. To achieve a model with robust generalization capabilities, its hyperparameters were optimized, and the models established were assessed through ten-fold cross-validation. The evaluation results of the trained models are depicted in Fig. 4.
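To make the relationship between k, the number of k-mer positions, and the number of possible k-mers concrete, a short sketch is given below. The helper names are hypothetical, and a four-letter nucleotide alphabet is assumed.

```python
from typing import List, Tuple


def sliding_kmers(seq: str, k: int) -> List[str]:
    """All overlapping k-mers of seq, one per start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_space(seq_len: int, k: int) -> Tuple[int, int]:
    """(number of k-mer positions in a sequence, number of distinct possible k-mers)."""
    return seq_len - k + 1, 4 ** k


if __name__ == "__main__":
    print(sliding_kmers("ACGUA", 3))
    for k in range(1, 6):
        positions, vocabulary = kmer_space(221, k)
        print(f"k={k}: {positions} positions, {vocabulary} possible k-mers")
```

With only 4 or 16 possible k-mers, 1-mers and 2-mers recur at high frequency in every class, whereas 3-mers to 5-mers (64 to 1024 possibilities) leave more room for class-specific positional differences, in line with the distinction drawn above.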

In Fig. 4a, the Macro-Precision of the 1-mer model is generally low across the six sequence lengths, reaching its highest value at 221 bp. This suggests that the MCD-ND score matrix generated from 1-mers fails to adequately differentiate the features of subcellular localization sequences, thereby reducing the classification performance of the model. For sequences of length 221 bp, the Macro-Precision of the 3-mer model reaches 68.3%. Macro-Precision averages the per-category precision over the subcellular localization categories, so that every category contributes equally regardless of how many samples it contains. The results in Fig. 4d are similar: for the various k-mer encoders, the Macro-Accuracy is lowest for sequences of length 101 bp and highest at 221 bp, where it reaches 68.8%. However, models built on different k-mer features perform differently. For sequences of length 221 bp, the performance of the k-mer models in identifying the four subcellular localizations differs, as shown in Fig. 4e–h; the 3-mer model learns the best feature representation, and its Accuracy is highest for the cell nucleus, reaching 78.1%. This may be because the nucleus has the most lncRNA sequences, which leads to better training and thereby improves classification performance. In Fig. 4i–l, the F1-Scores for the four subcellular localizations are also higher at 221 bp than at other lengths.
In summary, the encoder parameters affect the model’s recognition performance for subcellular localization sequences, and the MCD-ND score matrix performs best when generated from 3-mers at a sequence length of 221 bp.

Effect of different feature encoding methods on model performance

The method used to encode sequence features also directly impacts the model’s performance. Different feature encoding methods transform sequences into various feature representations, aiding machine learning models in handling gene sequences. Various encoding methods are commonly employed in bioinformatics research. These include One-hot encoding, NCP encoding based on nucleotide chemical properties, DPCP encoding based on dinucleotide chemical properties, encoding methods based on K-mer frequency, and natural language processing encoding techniques like Word2Vec. These methods have gained popularity in recent years for their effectiveness in representing biological sequences. Similar to MCD-ND encoding, these methods involve manually defined rules that represent sequences as different features. Therefore, we regenerated these features as inputs for the model and compared them with the 3-mer MCD-ND encoding to demonstrate the effectiveness of MCD-ND encoding. To obtain more reliable experimental results, each group of models was trained ten times, and the average values of each metric were calculated and listed in Table 2. The evaluation results obtained are shown in Fig. 5.
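For orientation, two of the baseline encodings named above, One-hot and K-mer frequency, can be sketched in a few lines. This is an illustrative sketch rather than the exact feature pipeline used in the comparison: the alphabet order, the handling of ambiguous bases, and the function names are assumptions.

```python
from itertools import product
from typing import List

ALPHABET = "ACGT"  # assumed order; use "ACGU" for RNA-alphabet input


def one_hot(seq: str) -> List[List[int]]:
    """One row per nucleotide; bases outside ALPHABET (e.g. N) become all-zero rows."""
    return [[1 if base == b else 0 for b in ALPHABET] for base in seq]


def kmer_frequency(seq: str, k: int) -> List[float]:
    """Relative frequency of each of the 4**k possible k-mers, in lexicographic order."""
    vocab = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(vocab, 0)
    n_positions = len(seq) - k + 1
    for i in range(n_positions):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous bases
            counts[kmer] += 1
    return [counts[v] / n_positions if n_positions > 0 else 0.0 for v in vocab]


if __name__ == "__main__":
    print(one_hot("ACGT"))
    print(kmer_frequency("AACGTT", 2))
```

Note that the k-mer frequency vector discards position entirely, and the one-hot matrix records position only at single-nucleotide resolution.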

Table 2 Mean values for model identification of subcellular localization


Fig. 5b shows that when the model learns from features produced by the One-hot, NCP, DPCP, and K-mer frequency encodings, the macro-average recall for identifying subcellular localization hovers around 50%, indicating moderate performance. Averaged over ten training runs, these values are consistently lower than those of the MCD-ND encoding, whose macro-average recall reaches 67.3%. Recall measures the proportion of actual positive samples that are correctly predicted, so the macro-average reflects the model’s recognition capability for each category. As shown in Fig. 5d and Table 2, the model trained on MCD-ND features achieves a macro-average accuracy of 67.2%, surpassing Word2Vec encoding by 8.1%. This indicates that MCD-ND extracts more distinctive features from lncRNA sequence data and effectively aids the model in localization recognition. The recognition results for the individual subcellular localizations show similar trends. In Fig. 5f, g, the F1 scores for identifying the nucleus and cytoplasm exceed 70%, higher than with the other feature encodings. The MCD-ND encoding captures the specific positional distributions of nucleotides for different k-mers, with PSTNPss assigning appropriate scores to different levels of k-mers, which enhances their discriminative power. Moreover, sequence encoders based on physical–chemical properties rarely consider the positional information of k-mers, while One-hot encoding captures only the distribution of 1-mers. These results indicate that existing encoders fail to effectively capture the position-dependent distribution of k-mers.
In contrast, the MCD-ND encoder can effectively capture the comprehensive discriminative patterns of nucleotides in subcellular localization lncRNA sequences, thereby significantly aiding the model in subcellular localization tasks.
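The idea of scoring a sequence by the positional distribution of its k-mers can be illustrated with a deliberately simplified, two-class sketch. This is not the MCD-ND encoding defined in the Methods, which handles multiple classes and k-mer levels; it only shows how position-specific k-mer frequencies from two groups of equal-length sequences can be turned into one score per position. All names and the toy sequences are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def positional_kmer_freq(seqs: List[str], k: int) -> List[Dict[str, float]]:
    """freq[pos][kmer] = fraction of sequences carrying `kmer` at start position `pos`."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    n_pos = length - k + 1
    counts = [Counter() for _ in range(n_pos)]
    for s in seqs:
        for pos in range(n_pos):
            counts[pos][s[pos:pos + k]] += 1
    return [{kmer: c / len(seqs) for kmer, c in pc.items()} for pc in counts]


def positional_difference_scores(
    seq: str,
    freq_a: List[Dict[str, float]],
    freq_b: List[Dict[str, float]],
    k: int,
) -> List[float]:
    """Per position: frequency of the observed k-mer in group A minus that in group B.

    Assumes seq has the same length as the sequences used to build the tables.
    """
    return [
        freq_a[p].get(seq[p:p + k], 0.0) - freq_b[p].get(seq[p:p + k], 0.0)
        for p in range(len(seq) - k + 1)
    ]


if __name__ == "__main__":
    group_a = ["AAAC", "AAAG"]  # toy sequences, not real lncRNA data
    group_b = ["CCCA", "AAAC"]
    fa = positional_kmer_freq(group_a, 3)
    fb = positional_kmer_freq(group_b, 3)
    print(positional_difference_scores("AAAC", fa, fb, 3))
```

A position where both groups use the same k-mer equally often scores near zero (a constant region), while a position where one group strongly prefers the observed k-mer scores far from zero (a discriminative region).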

Ablation experiment

To explore the impact of different modules and network layers on the overall performance of the model, we conducted ablation experiments. The deep learning model designed for the multi-class subcellular localization problem was decomposed into five variants: the base CNN model (Base), the MDTA module combined with CNN (MDTA-Base), CNN combined with BiGRU (BiGRU-Base), the MDTA and GDFN modules combined with CNN (MDTA-GDFN-Base), and the full MGBLncLoc model (MDTA-GDFN-BiGRU-Base). The aim was to clarify the specific effect of each module on MGBLncLoc and thereby determine the best-performing model structure. The performance of the different models on the test set is shown in Fig. 6.

Ablation experiments give a deeper understanding of the role of each module in the model and of how the depth of each network layer affects feature extraction and learning. Fig. 6d shows that the Macro-Accuracy of the model gradually improves as modules are added, indicating a progressive enhancement in the model’s ability to recognize the subcellular localization of lncRNA sequences. Figure 6f–m depict the F1-scores and accuracy of each model in recognizing the four subcellular localization sequences. Compared to the base model built solely on CNN, the BiGRU-Base model exhibits a significant improvement. This suggests that models integrating the BiGRU mechanism not only acquire convolutional features but also dynamically retain or discard information flow. With only two gate mechanisms (update gate and reset gate), BiGRU demonstrates a shorter convergence time and better problem-solving ability compared to traditional RNN networks. On the test set, the MGBLncLoc model outperformed the others in all five overall evaluation metrics, indicating improved classification performance and a capacity to distinguish subcellular localizations more accurately. We therefore adopt the MDTA-GDFN-CNN-BiGRU configuration to construct the multi-class subcellular recognition model, as it gave the best predictive performance among the variants tested.
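The two gates mentioned above can be made concrete with a toy sketch of a GRU step and its bidirectional use. For readability it assumes a scalar input and a scalar hidden state with hand-set parameters; the actual model uses vector-valued states and learned weights inside a deep learning framework. The parameter names are hypothetical, and the final interpolation follows one common convention (some libraries swap the roles of z and 1 - z).

```python
import math
from typing import Dict, List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x: float, h: float, p: Dict[str, float]) -> float:
    """One GRU update for scalar input x and scalar previous state h."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])  # update gate: how much to overwrite
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])  # reset gate: how much past to use
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])  # candidate state
    return (1.0 - z) * h + z * h_cand  # blend of retained and new information


def bigru(xs: List[float], p_fwd: Dict[str, float],
          p_bwd: Dict[str, float]) -> List[Tuple[float, float]]:
    """Run one GRU left-to-right and one right-to-left; pair the states per position."""
    h, fwd = 0.0, []
    for x in xs:
        h = gru_step(x, h, p_fwd)
        fwd.append(h)
    h, bwd = 0.0, []
    for x in reversed(xs):
        h = gru_step(x, h, p_bwd)
        bwd.append(h)
    bwd.reverse()  # align backward states with input positions
    return list(zip(fwd, bwd))


if __name__ == "__main__":
    keys = ["wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh"]
    params = dict.fromkeys(keys, 1.0)  # arbitrary illustrative values
    print(bigru([0.5, -1.0, 2.0], params, params))
```

Each position thus receives a summary of the sequence to its left and another of the sequence to its right, which is what allows a BiGRU layer placed after the convolutional features to use context from both directions.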

Comparison with existing predictive models for subcellular localization

To further demonstrate the generalization and superiority of the proposed method in predicting lncRNA subcellular localization, we compared MGBLncLoc with existing state-of-the-art subcellular localization predictors. The MGBLncLoc model utilizes 3-mer for encoding, and a detailed comparison of the models is presented in Table 3.

Table 3 Existing advanced subcellular localization predictors


Different datasets can lead to variations in model recognition results. Models like lncLocator, iLocLncRNA, Locate-R, DeepLncLoc, and GraphLncLoc were all trained and tested using the RNALocate v1.0 database [40]. To ensure a fair comparison and demonstrate the superiority of the MGBLncLoc model, we also retrained it using RNALocate v1.0 for comparison purposes. In contrast to RNALocate v2.0, the number of lncRNA sequences located in the cytoplasm slightly exceeds those in the nucleus in the v1.0 dataset, indicating distinct distribution patterns between the two datasets. The evaluation results obtained from this comparison are depicted in Fig. 7.

In Fig. 7a, MGBLncLoc outperformed the strongest of the existing models, LncLocFormer, lncSLPre, SGCL-LncLoc, and DeepLocRNA, in terms of macro-average recall. Figure 7b shows that MGBLncLoc achieved the highest macro-average F1 score, reaching 61.1%. The F1 score, the harmonic mean of precision and recall, reflects the model’s classification accuracy and recognition capability for each subcellular localization category. Moreover, MGBLncLoc attained the highest macro-average accuracy, 63.5%, surpassing the other subcellular prediction models. These findings underscore the stability and effectiveness of the proposed model in subcellular localization classification of lncRNA sequences.

Motif analysis

To unveil common patterns and conserved regions among lncRNA sequences for subcellular localization, we utilized the MUSCLE tool to compare sequences from the nucleus, cytoplasm, cytosol, and ribosomes, as depicted in Fig. 8. MUSCLE is employed to assess the similarity among multiple biological sequences, aiding in subsequent functional annotation, evolutionary analysis, and structure prediction studies. By juxtaposing multiple sequences, our aim is to unearth shared features, structures, and functions among them, thus providing deeper insights into their biological significance.

To investigate whether the patterns of lncRNA sequences for different subcellular localization categories are conserved, we conducted further analysis using the probability-based motif visualization tool, kplogo, focusing on two crucial subcellular localization categories: the nucleus and cytoplasm, as depicted in Fig. 9. To demonstrate the consistency of lncRNA sequence patterns, 3-mer logos were generated using kplogo, illustrating the frequency of different k-mers at each position in the nucleus and cytoplasm sequences. In the k-mer logo, the height at each position represents the frequency of the k-mer at that position, while the color of the k-mer indicates its composition. Additionally, the generated Probability logos display the relative frequencies of different bases at each position in the sequence, aiding in understanding the base composition within the sequence and its distribution across different subcellular localizations, thereby delving deeper into the characteristics and differences of sequences across various subcellular localizations.

In Fig. 8, color intensity represents the similarity between sequences: darker shades indicate higher similarity and lighter shades indicate lower similarity. The comparative plot reveals conserved regions and common patterns that are highly similar among lncRNA sequences with different subcellular localizations and that may represent functional domains or important structural features within the sequences. Notably, the nucleus sequences contain two AGCCC motifs that differ significantly from the other subcellular localizations, consistent with Zhang et al.’s findings that the AGCCC motif serves as a universal nuclear localization signal. Fig. 9 shows that in lncRNA sequences of the nucleus the most frequent 3-mers are ANA and TNT (N representing any of the four nucleotides), whereas the cytoplasmic k-mer logo shows no significant differences in 3-mer frequencies. This indicates that 3-mer frequency features extracted from subcellular lncRNA sequences can effectively distinguish between subcellular localizations, consistent with the results obtained in Experiment 3.1, where the MGBLncLoc model using the 3-mer MCD-ND encoder performed better. It also suggests that the MCD-ND feature encoding extracts the positional distribution features of 3-mers in subcellular localization sequences, which strengthens the discriminative nucleotide patterns available to the model and improves performance on subcellular localization recognition tasks.

Model feature interpretation

To further explore the knowledge acquired by the MGBLncLoc model, we can use model interpretation methods such as attention scores [54], SHAP (SHapley Additive exPlanations) values [55, 56], and saliency maps [57]. Specifically, we chose SHAP values to interpret the normalized differential position-aware K-mer encoding features based on multi-category modified nucleotide density. This approach allows us to assess how much each MCD-ND feature influences the model’s predictions, rather than directly determining their biological role in localization. SHAP is a game-theoretic method that attributes the model’s prediction to individual input features, providing insights into the model’s decision-making process and enhancing its interpretability. We utilize functions provided by the SHAP library to compute SHAP values for MCD-ND encoding features generated from different k-mers and sequence lengths. By calculating the marginal contribution of each feature to the prediction output, SHAP values reveal the relative importance of different encoding features in model predictions, as depicted in Fig. 10.
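The notion of a marginal contribution can be illustrated with an exact Shapley computation on a toy model. This is a sketch of the underlying game-theoretic definition, not of the SHAP library routines used in this study: it enumerates every feature ordering, which is feasible only for a handful of features, and it represents an "absent" feature by a baseline value, which is one common choice among several. The function and variable names are hypothetical.

```python
from itertools import permutations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    model: Callable[[List[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley values of model(x) relative to model(baseline).

    For every ordering of the features, switch them from baseline to x one at a
    time and credit each feature with the change in output it causes; then
    average these marginal contributions over all orderings.
    """
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = x[i]
            new = model(current)
            phi[i] += new - prev
            prev = new
    n_orderings = factorial(n)
    return [value / n_orderings for value in phi]


if __name__ == "__main__":
    # Toy model with an interaction between the first two features.
    def toy_model(v: List[float]) -> float:
        return 2.0 * v[0] * v[1] + 3.0 * v[2]

    phi = shapley_values(toy_model, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
    print(phi, sum(phi))
```

The attributions always sum to the difference between the prediction for the sample and the prediction for the baseline, which is why SHAP values can be read as a decomposition of an individual prediction; averaging their absolute values over samples gives a global importance ranking of the kind described here.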

Figure 10a presents a scatter plot of SHAP values and feature density for each feature in the MGBLncLoc model, showcasing the importance of features and the distribution of feature values. The features are sorted based on the mean absolute SHAP value, with the most important features ranked first. The highest contributions are observed for the 3-mer and 221 bp MCD-ND features, followed by the 3-mer and 181 bp features. The color of the points in Fig. 10a represents the magnitude of the feature values, with red indicating higher values and blue indicating lower values. By observing these colors, we can intuitively understand the range and distribution of each feature in the data. Notably, the 3-mer and 221 bp MCD-ND features exhibit positive contributions to the model, indicating that an increase in feature values positively impacts the model’s prediction results. This suggests that features extracted from these parameters play a crucial role in the model’s recognition of subcellular localization. In the importance ranking of various MCD-ND features in Fig. 10b, the 3-mer and 221 bp features also contribute the most positively, reaching 47%. Figure 10c is a heatmap of feature distribution under sample clustering, helping us understand the relationship between sample clustering results and features. It can be observed that within the sample count range of 1500–2500, samples are greatly influenced by the positive impact of the 3-mer and 221 bp features, indicating that an increase in lncRNA sample sequences leads to an increase in the model’s prediction of positive results, thus exerting a positive influence on the samples. Figure 10d–k display the dependency analysis of features, focusing on the interaction dependency effects of the top-contributing features on the most significant 3-mer and 221 bp features. It can be seen that in Fig. 10d, the self-impact of the 3-mer and 221 bp features is positively influenced at a high level, while for the other features, the 3-mer and 221 bp features at high levels have a negative impact, as shown in Fig. 10e–g. This indicates that the higher-level 3-mer and 221 bp features possess stronger discriminative power and effectiveness. In summary, the MGBLncLoc model can more effectively recognize subcellular localizations using the 3-mer and 221 bp MCD-ND features extracted from lncRNA sequences, facilitating further research on subcellular localization of lncRNA sequences.

Conclusions

In this study, we proposed MGBLncLoc, a novel deep learning framework for lncRNA subcellular localization, integrating Multi-Class Nucleotide Distribution-based Generalized Encoding with an advanced deep neural architecture. The key contributions of this work are twofold: (i) MCD-ND encoding, which effectively captures the positional distribution of nucleotides, addressing the limitations of conventional k-mer frequency-based methods, and (ii) a hybrid deep learning model, incorporating MDTA, GDFN, CNN, and BiGRU, which enables both local and global sequence feature extraction, significantly improving predictive performance. Comparative experimental analyses demonstrate that MGBLncLoc outperforms state-of-the-art models in terms of accuracy, F1-score, and AUC, providing a robust and interpretable approach for multi-class lncRNA localization prediction. Despite these advancements, several limitations remain: (i) The current study is based on lncRNA sequences with a single subcellular localization, due to the limited availability of experimentally verified multi-compartmental localization data. However, in reality, many lncRNAs localize to multiple cellular compartments, and ignoring this aspect may introduce biases. Future work will focus on curating a more comprehensive dataset that includes multi-localized lncRNAs and exploring multi-label classification approaches to enhance the model’s applicability. (ii) The RNALocate dataset exhibits uneven sample distribution, with most sequences concentrated in a few subcellular compartments. This imbalance may lead to model bias and reduced predictive performance for underrepresented localizations. While techniques such as data augmentation and weighted loss functions can mitigate this issue, the collection of larger, more balanced datasets remains crucial for improving model robustness and generalization across different cellular contexts. 
(iii) Currently, MGBLncLoc is designed for lncRNA subcellular localization and has not been evaluated on other RNA types such as mRNA, circRNA, or miRNA. Given that different RNA molecules exhibit distinct localization patterns and regulatory mechanisms, future research should explore whether the proposed framework can be generalized to predict the localization of diverse RNA species. This could involve fine-tuning the model on heterogeneous RNA datasets or developing transfer learning approaches to improve adaptability. (iv) Although the proposed encoding strategy improves feature extraction and classification performance, the biological relevance of the learned features remains to be fully explored. Future studies should integrate functional enrichment analyses [58] and experimental validation to determine whether the model captures biologically meaningful sequence motifs associated with RNA localization. Moving forward, we aim to expand the dataset, refine the model architecture, and investigate cross-species applicability to develop a more versatile and biologically interpretable RNA localization predictor. These efforts will contribute to a deeper understanding of RNA localization mechanisms and their functional implications in cellular processes.

Data availability

The datasets generated and/or analysed during the current study are available in the RNALocate Database repository (https://rnalocate.org/). The codes, architecture, parameters, dataset, functions, usage and output of the proposed model are available free of charge at GitHub (https://github.com/xing1999/MGBLncLoc). Additionally, all data generated or analyzed during this study are included in this published article, its supplementary information files, and permanently archived in Zenodo, DOI: 10.5281/zenodo.14776765.

Abbreviations

lncRNA: Long non-coding RNA
MCD-ND: Multi-Class Nucleotide Distribution-based Generalized Encoding
MDTA: Multi-Dconv Head Transposed Attention
GDFN: Gated-Dconv Feed-forward Network
BiGRU: Bidirectional Gated Recurrent Unit
CNN: Convolutional Neural Network
NCP: Nucleic acid Chemical Properties
DPCP: Dinucleotide physical and chemical properties
DNN: Deep learning neural network
ReLU: Rectified Linear Unit
ACC: Accuracy
AUC: Area Under the ROC Curve
smFISH: Single-molecule fluorescence in situ hybridization
FISSEQ: Fluorescence in situ sequencing
SVM: Support vector machine
SHAP: SHapley Additive exPlanations

References

Taft RJ, Pang KC, Mercer TR, Dinger M, Mattick JS. Non-coding RNAs: regulators of disease. The Journal of Pathology: A Journal of the Pathological Society of Great Britain and Ireland. 2010;220(2):126–39.
Zhang Y, Lei X, Fang Z, Pan Y. CircRNA-disease associations prediction based on metapath2vec++ and matrix factorization. Big Data Mining and Analytics. 2020;3(4):280–91.
Feng Z, Sun H, Han X. Regulation of proliferation, differentiation and apoptosis of bone-related cells by long-stranded non-coding RNA. Chinese Journal of Tissue Engineering Research. 2022;26(1):112.
Gupta RA, Shah N, Wang KC, Kim J, Horlings HM, Wong DJ, et al. Long non-coding RNA HOTAIR reprograms chromatin state to promote cancer metastasis. Nature. 2010;464(7291):1071–6.
Lu C, Yang M, Luo F, Wu F-X, Li M, Pan Y, et al. Prediction of lncRNA–disease associations based on inductive matrix completion. Bioinformatics. 2018;34(19):3357–64.
Moran VA, Perera RJ, Khalil AM. Emerging functional and mechanistic paradigms of mammalian long non-coding RNAs. Nucleic Acids Res. 2012;40(14):6391–400.
Esteller M. Non-coding RNAs in human disease. Nat Rev Genet. 2011;12(12):861–74.
Lu Q, Ren S, Lu M, Zhang Y, Zhu D, Zhang X, et al. Computational prediction of associations between long non-coding RNAs and proteins. BMC Genomics. 2013;14:1–10.
Kretz M, Siprashvili Z, Chu C, Webster DE, Zehnder A, Qu K, et al. Control of somatic tissue differentiation by the long non-coding RNA TINCR. Nature. 2013;493(7431):231–5.
Wu Z, Liu X, Liu L, Deng H, Zhang J, Xu Q, et al. Regulation of lncRNA expression. Cell Mol Biol Lett. 2014;19:561–75.
Martianov I, Ramadass A, Serra Barros A, Chow N, Akoulitchev A. Repression of the human dihydrofolate reductase gene by a non-coding interfering transcript. Nature. 2007;445(7128):666–70.
Zhou B, Yang H, Yang C, Bao Y-l, Yang S-m, Liu J, et al. Translation of noncoding RNAs and cancer. Cancer letters. 2021;497:89–99.
Hansji H, Leung EY, Baguley BC, Finlay GJ, Cameron-Smith D, Figueiredo VC, et al. ZFAS1: a long noncoding RNA associated with ribosomes in breast cancer cells. Biol Direct. 2016;11:1–25.
Cabili MN, Dunagin MC, McClanahan PD, Biaesch A, Padovan-Merhar O, Regev A, et al. Localization and abundance analysis of human lncRNAs at single-cell and single-molecule resolution. Genome Biol. 2015;16:1–16.
Ji N, Van Oudenaarden A. Single molecule fluorescent in situ hybridization (smFISH) of C. elegans worms and embryos. WormBook. 2012:1–16.
Lee JH, Daugharthy ER, Scheiman J, Kalhor R, Ferrante TC, Terry R, et al. Fluorescent in situ sequencing (FISSEQ) of RNA for gene expression profiling in intact cells and tissues. Nat Protoc. 2015;10(3):442–58.
Yao R-W, Wang Y, Chen L-L. Cellular functions of long noncoding RNAs. Nat Cell Biol. 2019;21(5):542–51.
Almagro Armenteros JJ, Sonderby CK, Sonderby SK, Nielsen H, Winther O. DeepLoc: prediction of protein subcellular localization using deep learning. Bioinformatics. 2017;33(21):3387–95.
Chen L-L. Linking long noncoding RNA localization and function. Trends Biochem Sci. 2016;41(9):761–72.
Zhao Y, Wang X, Che T, Bao G, Li S. Multi-task deep learning for medical image computing and analysis: a review. Comput Biol Med. 2023;153: 106496.
Gu X, Ding Y, Xiao P. MLapRVFL: protein sequence prediction based on multi-laplacian regularized random vector functional link. Comput Biol Med. 2023;167: 107618.
Zhou T, Cheng Q, Lu H, Li Q, Zhang X, Qiu S. Deep learning methods for medical image fusion: a review. Comput Biol Med. 2023;160:106959.
Zhang Y, Qiao S, Ji S, Li Y. DeepSite: bidirectional LSTM and CNN models for predicting DNA–protein binding. Int J Mach Learn Cybern. 2020;11:841–51.
Zhang Y, Yan J, Chen S, Gong M, Gao D, Zhu M, et al. Review of the applications of deep learning in bioinformatics. Curr Bioinform. 2020;15(8):898–911.
Cao Z, Pan X, Yang Y, Huang Y, Shen HB. The lncLocator: a subcellular localization predictor for long non-coding RNAs based on a stacked ensemble classifier. Bioinformatics. 2018;34(13):2185–94.
Su Z-D, Huang Y, Zhang Z-Y, Zhao Y-W, Wang D, Chen W, et al. iLoc-lncRNA: predict the subcellular location of lncRNAs by incorporating octamer composition into general PseKNC. Bioinformatics. 2018;34(24):4196–204.
Gudenas BL, Wang L. Prediction of LncRNA subcellular localization with deep learning from sequence features. Sci Rep. 2018;8(1):16385.
Ahmad A, Lin H, Shatabda S. Locate-R: subcellular localization of long non-coding RNAs using nucleotide compositions. Genomics. 2020;112(3):2583–9.
Fan Y, Chen M, Zhu Q. lncLocPred: predicting LncRNA subcellular localization using multiple sequence feature information. IEEE Access. 2020;8:124702–11.
Feng S, Liang Y, Du W, Lv W, Li Y. LncLocation: efficient subcellular location prediction of long non-coding RNA-based multi-source heterogeneous feature fusion. Int J Mol Sci. 2020;21(19): 7271.
Zhang Z-Y, Sun Z-J, Yang Y-H, Lin H. Towards a better prediction of subcellular location of long non-coding RNA. Front Comp Sci. 2022;16:1–7.
Zeng M, Wu Y, Lu C, Zhang F, Wu FX, Li M. DeepLncLoc: a deep learning framework for long non-coding RNA subcellular localization prediction based on subsequence embedding. Briefings in Bioinformatics. 2022;23(1):bbab360.
Jeon Y-J, Hasan MM, Park HW, Lee KW, Manavalan B. TACOS: a novel approach for accurate prediction of cell-specific long noncoding RNAs subcellular localization. Briefings in bioinformatics. 2022;23(4):bbac243.
Li M, Zhao B, Yin R, Lu C, Guo F, Zeng M. GraphLncLoc: long non-coding RNA subcellular localization prediction using graph convolutional networks based on sequence to graph transformation. Briefings in Bioinformatics. 2023;24(1):bbac565.
Zeng M, Wu Y, Li Y, Yin R, Lu C, Duan J, et al. LncLocFormer: a transformer-based deep learning model for multi-label lncRNA subcellular localization prediction by using localization-specific attention mechanism. Bioinformatics. 2023;39(12):752.
Li M, Zhao B, Li Y, Ding P, Yin R, Kan S, et al. SGCL-LncLoc: an interpretable deep learning model for improving lncRNA subcellular localization prediction with supervised graph contrastive learning. Big Data Mining Anal. 2024;7(3):765–80.
Yuan GH, Wang Y, Wang GZ, Yang L. RNAlight: a machine learning model to identify nucleotide features determining RNA subcellular localization. Briefings in bioinformatics. 2023;24(1):bbac509.
Yang R, Gao S, Fu Y, Zhang L. lncSLPre: an ensemble method with multi-source sequence descriptors to predict lncRNA subcellular localizations from imbalanced data. Available at SSRN: https://ssrn.com/abstract=4515036.
Wang J, Horlacher M, Cheng L, Winther O. DeepLocRNA: an interpretable deep learning model for predicting RNA subcellular localisation with domain-specific transfer-learning. Bioinformatics. 2024;40(2):btae065.
Zhang T, Tan P, Wang L, Jin N, Li Y, Zhang L, et al. RNALocate: a resource for RNA subcellular localizations. Nucleic Acids Res. 2017;45(D1):D135–8.
Cui T, Dou Y, Tan P, Ni Z, Liu T, Wang D, et al. RNALocate v2.0: an updated resource for RNA subcellular localization with increased coverage and annotation. Nucleic Acids Res. 2022;50(D1):D333–9.
Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2.
Lv H, Dao FY, Zhang D, Guan ZX, Yang H, Su W, et al. iDNA-MS: an integrated computational tool for detecting DNA modification sites in multiple genomes. Iscience. 2020;23(4):100991.
Wei L, He W, Malik A, Su R, Cui L, Manavalan B. Computational prediction and interpretation of cell-specific replication origin sites from multiple eukaryotes by exploiting stacking framework. Briefings in Bioinformatics. 2021;22(4):bbaa275.
Zhang T, Tang Q, Nie F, Zhao Q, Chen W. DeepLncPro: an interpretable convolutional neural network model for identifying long non-coding RNA promoters. Briefings in Bioinformatics. 2022;23(6):bbac447.
Chen W, Yang H, Feng P, Ding H, Lin H. iDNA4mC: identifying DNA N4-methylcytosine sites based on nucleotide chemical properties. Bioinformatics. 2017;33(22):3518–23.
Goñi JR, Pérez A, Torrents D, Orozco M. Determining promoter location based on DNA structure first-principles calculations. Genome Biol. 2007;8(12): R263.
Jang S-I, Pan T, Li Y, Heidari P, Chen J, Li Q, et al. Spach transformer: spatial and channel-wise transformer based on local and global self-attentions for PET image denoising. IEEE Transact Med Imaging. 2023;43(6):2036–49.
Dey R, Salem FM. Gate-variants of gated recurrent unit (GRU) neural networks. In: 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS); 2017 Aug 6-9; Boston: IEEE; 2017. p. 1597–1600.
Yu Y, Si X, Hu C, Zhang J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019;31(7):1235–70.
Qin D, Jiao L, Wang R, Zhao Y, Hao Y, Liang G. Prediction of antioxidant peptides using a quantitative structure-activity relationship predictor (AnOxPP) based on bidirectional long short-term memory neural network and interpretable amino acid descriptors. Comput Biol Med. 2023;154:106591.
Zhang Z, Sabuncu M. Generalized cross entropy loss for training deep neural networks with noisy labels. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, editors. Advances in neural information processing systems, vol. 31. Curran Associates, Inc.; 2018. Available from: https://proceedings.neurips.cc/paper_files/paper/2018/file/f2925f97bc13ad2852a7a551802feea0-Paper.pdf.
Lobo JM, Jiménez-Valverde A, Real R. AUC: a misleading measure of the performance of predictive distribution models. Glob Ecol Biogeogr. 2008;17(2):145–51.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, editors. Advances in neural information processing systems, vol. 30. Curran Associates, Inc.; 2017. Available from: https://proceedings.neurips.cc/paper_files/paper/2017/file/ea4e6ef6b6537fd1b46b7e2499e9dbf0-Paper.pdf.
Mangalathu S, Hwang S-H, Jeon J-S. Failure mode and effects analysis of RC members based on machine-learning-based SHapley Additive exPlanations (SHAP) approach. Eng Struct. 2020;219:110927.
Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2(1):56–67.
Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. 2013. https://doi.org/10.48550/arXiv.1312.6034.
Sherman BT, Hao M, Qiu J, Jiao X, Baseler MW, Lane HC, et al. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res. 2022;50(W1):W216–21.

Acknowledgements

The authors gratefully acknowledge the support from the National Natural Science Foundation of China (Grant Numbers: 51663001, 52063002, 42061067 and 61741202), the science and technology research project of the education department of Jiangxi province (Grant Number: 190745), and the Degree and postgraduate education and teaching reform research project of Jiangxi province (Multi thinking and "439" scientific research ability training mode for interdisciplinary academic degree postgraduate training). We thank LetPub (www.letpub.com) for its linguistic assistance during the preparation of this manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant Numbers: 51663001, 52063002, 42061067 and 61741202).

Author information

Authors and Affiliations

College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
Wenxing Hu, Yan Yue, Ruomei Yan, Lixin Guan & Mengshan Li

Contributions

W.H: Methodology, Software, Formal analysis, Writing - Original Draft, Writing - Review & Editing, Visualization. M.L: Conceptualization, Resources, Writing - Review & Editing, Supervision, Project administration, Funding acquisition. Y.Y: Validation, Investigation, Data Curation. R.Y: Validation, Data Curation. L.G: Supervision, Funding acquisition. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Mengshan Li.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Hu, W., Yue, Y., Yan, R. et al. An ensemble deep learning framework for multi-class LncRNA subcellular localization with innovative encoding strategy. BMC Biol 23, 47 (2025). https://doi.org/10.1186/s12915-025-02148-4

Received: 08 May 2024
Accepted: 03 February 2025
Published: 21 February 2025
DOI: https://doi.org/10.1186/s12915-025-02148-4
