1. About This Work

Lysine malonylation is a reaction, catalyzed by one or more enzymes, that transfers malonyl groups from malonyl-CoA to lysine residues, and it plays a key role in regulating protein function (Xie et al. 2012). In addition, recent work measuring lysine malonylation in the liver tissues of db/db and ob/ob mice provided initial evidence that the enrichment of malonylated proteins influences metabolic pathways, especially those involved in glucose and fatty acid metabolism (Du et al. 2015).

In this survey, we first systematically analyzed, benchmarked, and compared 11 feature encoding methods, grouped into three major categories, each designed to capture a different aspect of the features useful for predicting lysine malonylation sites. Using benchmark datasets for three species (E. coli, M. musculus, and H. sapiens), we further explored the performance of combinations of different machine learning methods and obtained the optimal feature set for each species through feature selection. Second, using the optimal feature sets, we trained prediction models with five machine learning methods (both classic and recently proposed) and compared them rigorously using both 10 repetitions of 10-fold cross-validation and independent tests. Third, we explored integrating these single-method models and showed that an ensemble strategy could further improve the prediction performance and robustness of the model. The optimal ensemble models outperformed current state-of-the-art predictors of lysine malonylation sites on the independent datasets for all three species. Last, we developed a user-friendly web server, KMAL-SP, based on the optimal ensemble models for the wider research community. We anticipate that our findings, the proposed computational framework, the integration and ensemble strategy, and the online web server could give new momentum to bioinformatics studies of lysine malonylation sites and other functionally important PTM types, and inspire users to develop new computational methods. Such efforts will continue to improve our understanding of the important determinants of protein PTMs and facilitate their discovery.

2. Corresponding Authors

Yanju Zhang, Bioinformatics group, Guilin University of Electronic Technology, Guilin 541004, China.
Email: yanjuzhang@guet.edu.cn

In this study, we used an experimentally verified malonyllysine dataset obtained from a large, previously published study (L. N. Wang et al. 2017). A peptide was defined as positive when its central K was malonylated; otherwise it was defined as negative. In total, we collected 1553, 2609, and 3885 positive peptides and 7830, 26655, and 52027 negative peptides for E. coli, M. musculus, and H. sapiens, respectively. The non-redundant datasets were randomly split into benchmark datasets and independent datasets.

1. KMAL-SP

We developed an online bioinformatics server, KMAL-SP (lysine malonylation site prediction system), to provide a user-friendly KMAL site prediction service. To the best of our knowledge, the machine learning models provided by this server predict malonylation sites more accurately than existing predictors. We envisage that this server will be widely used to facilitate the discovery of novel KMAL sites.

2. Using KMAL-SP

KMAL-SP is an online server with a user-friendly interface and few parameters, so it is easy to use. You only need to fill in the input box and select a species. The prediction job is then placed in a queue, and the KMAL-SP server runs the queued jobs one after another. When your job finishes, you will receive an e-mail containing a URL to your job result.

2.1 Input Formats

KMAL-SP accepts two types of input: sequences entered in FASTA format, or a FASTA file.

Sequences in FASTA format can be entered as follows:

>P00350_320
AQPAGDKAEFIEKVRRALYLGKIVS
>P00370_394
LEMAQNAARLGWKAEKVDARLHHIM
>sp|P0ABD5|
MSLNFLDFEQPIAELEAKIDSLTAVSRQDEKLDINIDEEVHRLREKSVELTRKIFADLGAWQIAQLARHPQRPYTLDYVRLAFDEFDELAGDRAYADDKAIVGGIARLDGRPVMIIGHQKGRETKEKIRRNFGMPAPEGYRKALRLMQMAERFKMPIITFIDTPGAYPGVGAEERGQSEAIARNLREMSRLGVPVVCTVIGEGGSGGALAIGVGDKVNMLQYSTYSVISPEGCASILWKSADKAPLAAEAMGIIAPRLKELKLIDSIIPEPLGGAHRNPEAMAASLKAQLLADLADLDVLSTEDLKNRRYQRLMSYGYA
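
As a way to check input before submitting it, FASTA text like the example above can be read into (name, sequence) pairs with a few lines of Python. This is only a minimal sketch: the function name parse_fasta is hypothetical, and the code does not reflect how the KMAL-SP server itself parses input.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples.

    Hypothetical helper for illustration; not part of KMAL-SP.
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            # Sequence lines may be wrapped; join them into one string.
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


fasta_text = """>P00350_320
AQPAGDKAEFIEKVRRALYLGKIVS
>P00370_394
LEMAQNAARLGWKAEKVDARLHHIM
"""
records = parse_fasta(fasta_text)
```

Here records holds one tuple per header line, with wrapped sequence lines joined into a single string.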

2.2 Model used to build the KMAL-SP ensemble predictor

Integrating the five machine learning methods into an ensemble model further improved prediction performance. Following this approach, we built the optimal ensemble model for each of the three species.
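
This page does not state how the five per-method scores are combined. One common way to integrate several models is to take a (weighted) average of their predicted probabilities, and the sketch below illustrates that idea only. The function name ensemble_score, the equal default weights, and the example scores are all assumptions, not the KMAL-SP implementation.

```python
def ensemble_score(model_scores, weights=None):
    """Combine per-model probabilities into one score by weighted averaging.

    Assumption for illustration: KMAL-SP may combine its models differently.
    """
    if weights is None:
        weights = [1.0] * len(model_scores)  # equal weights by default
    if len(weights) != len(model_scores):
        raise ValueError("need exactly one weight per model score")
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, model_scores)) / total


# Made-up scores from five hypothetical models for a single lysine site.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
combined = ensemble_score(scores)
```

With equal weights this reduces to the plain mean of the five scores; unequal weights would let better-performing models count for more.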

2.3 Input limits

a. The length of each submitted sequence should be between 13 and 5000 residues.

b. Because KMAL-SP prediction is time-consuming, no more than 50 sequences should be submitted at a time.
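
The two limits above can be checked locally before submitting a job. The following is a minimal sketch; the function and constant names are hypothetical, and it assumes that both length bounds are inclusive, which this page does not state explicitly.

```python
MIN_LEN = 13    # shortest accepted sequence (assumed inclusive)
MAX_LEN = 5000  # longest accepted sequence (assumed inclusive)
MAX_SEQS = 50   # maximum number of sequences per submission


def check_submission(records):
    """Return a list of problems for (name, sequence) records; empty if none."""
    problems = []
    if len(records) > MAX_SEQS:
        problems.append(
            f"{len(records)} sequences submitted; the limit is {MAX_SEQS}"
        )
    for name, seq in records:
        if not (MIN_LEN <= len(seq) <= MAX_LEN):
            problems.append(
                f"{name}: length {len(seq)} is outside {MIN_LEN}-{MAX_LEN}"
            )
    return problems


problems = check_submission([
    ("P00350_320", "AQPAGDKAEFIEKVRRALYLGKIVS"),  # 25 residues, accepted
    ("too_short", "AQPAGDKAEFIE"),                # 12 residues, rejected
])
```

An empty list means the submission satisfies both limits as interpreted here.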

3. KMAL-SP Prediction Result Instructions

Each input sequence is cut into 25-residue segments with a lysine (K) at the center (where the sequence does not extend far enough on either side, the missing amino acids are represented by the character O). KMAL-SP then predicts, for every lysine (K) in the sequence, the probability that it is malonylated.
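
The segmentation step can be sketched as follows: pad the sequence with 12 O characters on each side, then take the 25 characters centered on each K. The function name lysine_windows and its parameters are hypothetical, and 1-based position reporting is an assumption; the server's own code may differ.

```python
def lysine_windows(sequence, flank=12, pad="O"):
    """Return (position, segment) for every K in the sequence.

    position is 1-based (an assumption); segment has 2 * flank + 1 residues
    with the K in the middle and `pad` filling positions past either end.
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i sits at index i + flank in the padded string, so the
            # window centered on it is padded[i : i + 2 * flank + 1].
            windows.append((i + 1, padded[i : i + 2 * flank + 1]))
    return windows


windows = lysine_windows("AQPAGDKAEFIEKVRRALYLGKIVS")
for position, segment in windows:
    print(position, segment)
```

For the first example sequence from section 2.1, this yields one segment per lysine, each 25 characters long with K at the center.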

For each predicted protein, the result includes the sequence number, the sequence name, the position of the K in the sequence, the 25-residue segment extracted around that K, and the score of the optimal fitted model (shown in the following figure).