Browsing by Author "Abhigyan Nath"

An insight into the molecular basis for convergent evolution in fish antifreeze proteins
(2013) Abhigyan Nath; Radha Chaube; Karthikeyan Subbiah
Antifreeze proteins (AFPs) prevent the growth of ice crystals, enabling certain organisms to survive in sub-zero surroundings. These AFPs have evolved from different types of proteins without any significant structural or sequence similarity among them. However, all AFPs perform the same antifreeze function and are a classic example of convergent evolution. We have analyzed fish AFPs at the sequence level, the residue level and the physicochemical property group composition to discover the molecular basis for this convergent evolution. Our study of amino acid distribution does not reveal any distinctive feature among AFPs, but a comparative study of the AFPs with their close non-AFP homologs based on the physicochemical property group residues revealed some useful information. In particular, (a) there is a similar pattern of avoidance and preference of amino acids in fish AFP subtypes II, III and IV: aromatic residues are avoided whereas small residues are preferred; (b) like other psychrophilic proteins, AFPs have a similar pattern of preference/avoidance for most of the residues except for Ile, Leu and Arg; and (c) most of the computed amino acids in the preferred list are the key functional residues obtained in the previously predicted model of Doxey et al. For the first time, this study revealed common patterns of avoidance/preference in fish AFP subtypes II, III and IV. These avoidance/preference lists can further facilitate the identification of key functional residues and can shed more light on the mechanism of antifreeze function. © 2013 Elsevier Ltd.
Comparative study on machine learning techniques in predicting the QoS-values for web-services recommendations
(Institute of Electrical and Electronics Engineers Inc., 2015) Sunil Kumar; Manish Kumar Pandey; Abhigyan Nath; Karthikeyan Subbiah; Manoj Kumar Singh
This is an era of Internet computing, and computing delivered as a service over the Internet is called cloud computing. Three main services, SaaS (applications), PaaS, and IaaS, are accessed over the Internet on demand, on a pay-per-usage basis. Quality of Service (QoS) is the main issue in Internet-based computing for service providers, and it covers both user-dependent and user-independent QoS parameters. In the current work we compared different machine learning algorithms for predicting the response time and throughput QoS values using past usage data. Bagging and support vector machines were found to perform better than the other learning algorithms. © 2015 IEEE.
Covering assisted intuitionistic fuzzy bi-selection technique for data reduction and its applications
(Nature Research, 2024) Rajat Saini; Anoop Kumar Tiwari; Abhigyan Nath; Phool Singh; S.P. Maurya; Mohd Asif Shah
The dimension and size of data are growing rapidly with the extensive applications of computer science and lab-based engineering in daily life. The presence of vagueness, later uncertainty, redundancy, irrelevancy, and noise raises concerns in building effective learning models. Fuzzy rough sets and their extensions have been applied to deal with these issues through various data reduction approaches. However, constructing a model that can cope with all these issues simultaneously is always a challenging task. No study to date has addressed all these issues simultaneously. This paper investigates a method based on the notions of intuitionistic fuzzy (IF) and rough sets to avoid these obstacles simultaneously by putting forward an interesting data reduction technique. To accomplish this task, firstly, a novel IF similarity relation is introduced. Secondly, we establish an IF rough set model on the basis of this similarity relation. Thirdly, an IF granular structure is presented by using the established similarity relation and the lower approximation. Next, mathematical theorems are used to validate the proposed notions. Then, the importance-degree of the IF granules is employed for redundant size elimination. Further, significance-degree-preserved dimensionality reduction is discussed. Hence, simultaneous instance and feature selection for large volumes of high-dimensional data can be performed to eliminate redundancy and irrelevancy in both dimension and size, where vagueness and later uncertainty are handled with rough and IF sets respectively, whilst noise is tackled with the IF granular structure. Thereafter, a comprehensive experiment is carried out over benchmark datasets to demonstrate the effectiveness of the simultaneous feature and data point selection methods. Finally, a framework aided by our proposed methodology is discussed to enhance the regression performance for IC50 of antiviral peptides. © The Author(s) 2024.
Discrimination of psychrophilic and mesophilic proteins using random forest algorithm
(2012) Abhigyan Nath; Radha Chaube; Subbiah Karthikeyan
Psychrophilic organisms are those that thrive at very low temperatures. To carry out normal physiological and biochemical functions, these organisms produce psychrophilic proteins that have evolved through a vast number of physicochemical adaptations at the sequence and structural levels. Our study is focused on selecting a suitable classification algorithm and appropriate input features for better discrimination of psychrophilic protein sequences from mesophilic protein sequences. We used amino acid composition and hydrophobic residue patterns as input features and found that the Random Forest algorithm, a recently developed ensemble machine learning technique, discriminates better between mesophilic and psychrophilic proteins. A balanced dataset with 6000 mesophilic and 6000 psychrophilic sequences for training, and a dataset with 8432 psychrophilic and 3169 mesophilic sequences for testing, were created and used for experiments. Discrimination using only the statistically significant amino acids taken from previous literature was also tested. For the first time, a testing accuracy of 70.3% is reported, with 71.3% of psychrophilic and 67.7% of mesophilic proteins correctly predicted. © 2012 IEEE.
Effect of varying degree of resampling on prediction accuracy for observed peptide count in protein mass spectrometry data
(IEEE Computer Society, 2016) Anoop Kumar Tiwari; Abhigyan Nath; Karthikeyan Subbiah; Kaushal Kumar Shukla
Class imbalance affects the learning of classifiers and is almost ubiquitous in biological data sets. Resampling is one of the common methods for balancing imbalanced data sets. SMOTE (Synthetic Minority Oversampling Technique) is one of the intelligent methods of oversampling. This study examines the learning performance of machine learning algorithms at different balancing ratios of positive and negative samples in the training set, consisting of the observed peptides and absent peptides in an MS experiment. Using SMOTE at different rates, we achieved the best result with optimal balancing on boosted random forest, which resulted in a sensitivity of 92.1%, specificity of 94.7%, overall accuracy of 93.4%, MCC of 0.869 and AUC of 0.982, better than previously reported results. From the results of the current experiments, it can be inferred that the performance of machine learning algorithms on classification tasks can be enhanced by suitably modifying the class distribution. © 2015 IEEE.
Enhanced identification of β-lactamases and its classes using sequence, physicochemical and evolutionary information with sequence feature characterization of the classes
(Elsevier Ltd, 2017) Abhigyan Nath; S. Karthikeyan
β-lactamases provide one of the most successful means by which many gram-positive and gram-negative bacteria evade the therapeutic effects of the β-lactam class of antibiotics. On the basis of sequence identity, β-lactamases have been classified into four distinct classes: A, B, C and D. Classes A, C and D are the serine β-lactamases and class B is the metallo-β-lactamase. In the present study, we developed a two-stage cascade classification system. The first stage separates β-lactamases from non-β-lactamases and the second stage further classifies β-lactamases into the four β-lactamase classes. In the first-stage binary classification, we obtained an accuracy of 97.3% with a sensitivity of 89.1% and specificity of 98.0%, and for the second-stage multi-class classification, we obtained an accuracy of 87.3% for class A, 91.0% for class B, 96.3% for class C and 96.4% for class D. A systematic statistical analysis was carried out on the sieved-out, correctly predicted instances from the second-stage classifier, which revealed some interesting patterns. We analyzed the different classes of β-lactamases on the basis of sequence and physicochemical property differences between them. In amino acid composition, H, W, Y and V showed significant differences between the β-lactamase classes. Differences in average physicochemical properties are observed for isoelectric point, volume, flexibility, hydrophobicity, bulkiness and charge in one or more β-lactamase classes. The key differences in physicochemical property groups can be observed in the small and aromatic groups. Among amino acid property group n-grams, all except charged n-grams are significant in one or more classes. Statistically significant differences in dipeptide counts among the β-lactamase classes are also reported. © 2017 Elsevier Ltd
Enhanced Prediction and Characterization of CDK Inhibitors Using Optimal Class Distribution
(International Association of Scientists in the Interdisciplinary Areas, 2017) Abhigyan Nath; S. Karthikeyan
Cyclin-dependent kinase inhibitors (CDKIs) govern the regulation of cyclin-dependent kinases, which are responsible for controlling cell cycle progression. The members of the CDKI protein family play important roles in many processes like tumor suppression, apoptosis, transcriptional regulation. The sequence similarity-based search methods to annotate putative CDKIs do not yield optimal performance due to sequence diversity of CDKIs. As a consequence, machine learning-based models have become viable choices for predicting CDKI. In this work, we have developed a framework for handling the class imbalance factor (which is encountered very frequently in biological datasets) in order to enhance the prediction of CDKI through machine learning approaches. We have designed our experiments to achieve the optimal performance of machine learning-based methods in predicting CDKI by investigating the dataset-related prediction enhancement issues, like: (1) What should be the optimal class distribution ratio in the training set? (2) Should we oversample or undersample? (3) At what ratio, positive and negative samples should be oversampled or undersampled? and (4) How to select the best-performing classifier? We have addressed these issues through comparing the results from an imbalanced training set with training sets which are created at different resampling rates by using synthetic minority over-sampling technique and undersampling technique to have varied class distributions. The proposed framework resulted in 100 % sensitivity, 93.7 % specificity, 96.4 % accuracy, 0.929 MCC with 0.981 AUC using simple sequence-based features on a leave-one-out cross-validation test. The generalization ability of the trained model was further tested on four separate blind testing sets. Our work supports the fact that the performance of the algorithms can be enhanced by creating an optimal class distribution in the training set besides fine-tuning of the parameters of the algorithms. 
This optimal ratio of positive and negative samples in the training set is an important learning enhancement parameter for prediction models based on machine learning algorithms. © 2016, International Association of Scientists in the Interdisciplinary Areas and Springer-Verlag Berlin Heidelberg.
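The performance measures quoted in this abstract (sensitivity, specificity, accuracy and MCC) all derive from a binary confusion matrix. The following is a minimal sketch of the standard definitions, not the authors' code; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Compute standard binary metrics from confusion-matrix counts.

    tp/fn are positives predicted correctly/incorrectly,
    tn/fp are negatives predicted correctly/incorrectly.
    """
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fn=10, tn=45, fp=5)
```

Because MCC uses all four cells of the confusion matrix, it is commonly reported alongside accuracy when class sizes differ, as in the imbalanced settings these abstracts describe.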
Enhanced Prediction for Observed Peptide Count in Protein Mass Spectrometry Data by Optimally Balancing the Training Dataset
(World Scientific Publishing Co. Pte Ltd, 2017) Anoop Kumar Tiwari; Abhigyan Nath; Karthikeyan Subbiah; Kaushal Kumar Shukla
Imbalanced dataset affects the learning of classifiers. This imbalance problem is almost ubiquitous in biological datasets. Resampling is one of the common methods to deal with the imbalanced dataset problem. In this study, we explore the learning performance by varying the balancing ratios of training datasets, consisting of the observed peptides and absent peptides in the Mass Spectrometry experiment on the different machine learning algorithms. It has been observed that the ideal balancing ratio has yielded better performance than the imbalanced dataset, but it was not the best as compared to some intermediate ratio. By experimenting using Synthetic Minority Oversampling Technique (SMOTE) at different balancing ratios, we obtained the best results by achieving sensitivity of 92.1%, specificity value of 94.7%, overall accuracy of 93.4%, MCC of 0.869, and AUC of 0.982 with boosted random forest algorithm. This study also identifies the most discriminating features by applying the feature ranking algorithm. From the results of current experiments, it can be inferred that the performance of machine learning algorithms for the classification tasks can be enhanced by selecting optimally balanced training dataset, which can be obtained by suitably modifying the class distribution. © 2017 World Scientific Publishing Company.
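The idea of oversampling the minority class with SMOTE up to a chosen balancing ratio can be sketched as follows. This is a simplified, standard-library illustration of SMOTE-style interpolation, not the implementation used in the study; the names `smote_oversample` and `samples_needed`, the neighbour count `k`, and the `target_ratio` parameter are assumptions made for this example.

```python
import math
import random


def samples_needed(n_minority, n_majority, target_ratio):
    """Synthetic samples required so minority/majority reaches target_ratio."""
    return max(0, round(target_ratio * n_majority) - n_minority)


def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority class in two dimensions (hypothetical data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
extra = smote_oversample(minority, samples_needed(4, 20, 0.5), k=2)
```

Sweeping `target_ratio` from the natural class ratio up to 1.0 and retraining at each step is one way to look for an intermediate ratio of the kind the abstract reports as performing best.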
Enhanced prediction of recombination hotspots using input features extracted by class specific autoencoders
(Academic Press, 2018) Abhigyan Nath; S. Karthikeyan
In yeast and in some mammals, recombination frequencies are high in some genomic locations, known as recombination hotspots, while locations where recombination is below average are known as coldspots. Knowledge of the hotspot regions gives clues for understanding the meiotic process and the possible effects of sequence variation in these regions. Moreover, accurate information about the hotspot and coldspot regions can reveal insights into genome evolution. In the present work, we have used class-specific autoencoders for feature extraction and reduction. Subsequently, the deep features extracted from the autoencoders were used to train three different classifiers, namely gradient boosting machines, random forest and deep learning neural networks, for predicting the hotspot and coldspot regions. A comparative performance analysis was carried out by experimenting on deep features extracted from different sets of the training data using autoencoders, in order to select the best set of deep features. It was observed that learning algorithms trained on features extracted from the combined class-specific autoencoder outperformed the same algorithms trained with other sets of deep features. Thus, combined class-specific autoencoder based feature extraction can be applied to a growing range of biological problems to achieve superior prediction performance. © 2018 Elsevier Ltd
Evaluation and use of in silico structure based epitope prediction for listeriolysin O of Listeria monocytogenes
(National Institute of Science Communication and Information Resources (NISCAIR), 2015) Dharmendra Kumar Soni; Abhigyan Nath; Suresh Kumar Dubey
Listeria infection is a major health problem, causing listeriosis that manifests as abortion, stillbirth, septicemia, meningitis and meningoencephalitis. Listeriolysin O is the cholesterol-dependent cytolysin toxin involved in the escape of L. monocytogenes from primary and secondary intracellular vacuoles and can therefore serve as a vital target for vaccine development. Consequently, the present study aimed to design an epitope-based vaccine against Listeria. LLO, ILO, and SLO proteins from L. monocytogenes, L. ivanovii and L. seeligeri, respectively, were analyzed using various bioinformatics and immunoinformatics tools, including sequence- and structure-based ones. A total of 11 antigenic B-cell epitopes, and 4 and 3 allelic classes for MHC class I and MHC class II binding peptides, respectively, were predicted for the LLO protein. The unique peptide 363LGDLRD368 was identified in the LLO protein. Further, we also observed that IgG class of B-cells were predominant in these proteins. The study revealed potential B-cell and T-cell epitopes that can raise the desired immune response against these proteins. The present study would, therefore, be helpful in designing and predicting novel vaccine candidates, which in the near future might offer a basis for eradicating listeriosis.
Identification of human drug targets using machine-learning algorithms
(Elsevier Ltd, 2015) Priyanka Kumari; Abhigyan Nath; Radha Chaube
Identification of potential drug targets is a crucial task in the drug-discovery pipeline. Successful identification of candidate drug targets in entire genomes is very useful, and computational prediction methods can speed up this process. In the current work we have developed a sequence-based prediction method for the successful identification and discrimination of human drug target proteins, from human non-drug target proteins. The training features include sequence-based features, such as amino acid composition, amino acid property group composition, and dipeptide composition for generating predictive models. The classification of human drug target proteins presents a classic example of class imbalance. We have addressed this issue by using SMOTE (Synthetic Minority Over-sampling Technique) as a preprocessing step, for balancing the training data with a ratio of 1:1 between drug targets (minority samples) and non-drug targets (majority samples). Using ensemble classification learning method-Rotation Forest and ReliefF feature-selection technique for selecting the optimal subset of salient features, the best model with selected features can achieve 87.1% sensitivity, 83.6% specificity, and 85.3% accuracy, with 0.71 Matthews correlation coefficient (mcc) on a tenfold stratified cross-validation test. The subset of identified optimal features may help in assessing the compositional patterns in human drug targets. For further validation, using a rigorous leave-one-out cross-validation test, the model achieved 88.1% sensitivity, 83.0% specificity, 85.5% accuracy, and 0.712 mcc. The proposed method was tested on a second dataset, for which the current pipeline gave promising results. We suggest that the present approach can be applied successfully as a complementary tool to existing methods for novel drug target prediction. 
• Identification of candidate drug targets.
• Development of sequence based prediction method.
• SMOTE as a pre-processing step for balancing the training data.
• Complementary tool for novel drug target prediction.
© 2014 Elsevier Ltd.
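The sequence-based features named in this abstract, amino acid composition and dipeptide composition, can be computed directly from a protein sequence. The sketch below shows the usual definitions as residue and residue-pair frequencies; it is an illustration rather than the authors' pipeline, and the function names and the choice to skip non-standard residues are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    seq = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    total = len(seq)
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: seq.count(aa) / total for aa in AMINO_ACIDS}


def dipeptide_composition(sequence):
    """Return the fraction of each of the 400 ordered adjacent residue pairs."""
    seq = sequence.upper()
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=2)}
    valid_pairs = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        # Assumption: pairs containing a non-standard residue are ignored.
        if pair in counts:
            counts[pair] += 1
            valid_pairs += 1
    if valid_pairs == 0:
        return {pair: 0.0 for pair in counts}
    return {pair: n / valid_pairs for pair, n in counts.items()}


# Toy sequence, for illustration only.
aac = amino_acid_composition("AAGC")
dpc = dipeptide_composition("AAGC")
```

Concatenating the 20 composition values with the 400 dipeptide values gives a fixed-length vector regardless of sequence length, which is what makes these features convenient inputs for classifiers and feature-selection methods.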
Inferring biological basis about psychrophilicity by interpreting the rules generated from the correctly classified input instances by a classifier
(Elsevier Ltd, 2014) Abhigyan Nath; Karthikeyan Subbiah
Organisms thriving in extremely cold surroundings are called psychrophiles, and they present a wealth of knowledge about the sequence adjustments in proteins that occurred during adaptation to low temperatures. In this paper, we propose a new cascading model to investigate the basis for psychrophilicity. In this model, a superior classifier was used to discriminate psychrophilic from mesophilic protein sequences, and then the PART rule generating algorithm was applied to the input instances correctly classified by the classifier to generate human-interpretable rules. These derived rules were further validated on a structural dataset and finally analyzed to discover the underlying biological basis of psychrophilicity. In this study, we used as input features the global patterns of amino acid composition, one of the key features of psychrophilic proteins accountable for remaining functional in extremely cold surroundings. The rotation forest classifier outperformed all the other classifiers with a maximum accuracy of 70.5% and maximum AUC of 0.78. The effect of sequence length on the classification accuracy was also investigated. The analysis of the derived rules and interpretation of the results revealed some interesting phenomena, such as that the amino acids A, D, G, F, and S are over-represented and T is under-represented in psychrophilic proteins. These findings augment the existing domain knowledge of psychrophilic sequence features. © 2014 Elsevier Ltd.
Insights into the molecular basis of piezophilic adaptation: Extraction of piezophilic signatures
(Academic Press, 2016) Abhigyan Nath; Karthikeyan Subbiah
Piezophiles are the organisms which can successfully survive at extreme pressure conditions. However, the molecular basis of piezophilic adaptation is still poorly understood. Analysis of the protein sequence adjustments that had taken place during evolution can help to reveal the sequence adaptation parameters responsible for protein functional and structural adaptation at such high pressure conditions. In this current work we have used SVM classifier for filtering strong instances and generated human interpretable rules from these strong instances by using the PART algorithm. These generated rules were analyzed for getting insights into the molecular signature patterns present in the piezophilic proteins. The experiments were performed on three different temperature ranges piezophilic groups, namely psychrophilic-piezophilic, mesophilic-piezophilic, and thermophilic-piezophilic for the detailed comparative study. The best classification results were obtained as we move up the temperature range from psychrophilic-piezophilic to thermophilic-piezophilic. Based on the physicochemical classification of amino acids and using feature ranking algorithms, hydrophilic and polar amino acid groups have higher discriminative ability for psychrophilic-piezophilic and mesophilic-piezophilic groups along with hydrophobic and nonpolar amino acids for the thermophilic-piezophilic groups. We also observed an overrepresentation of polar, hydrophilic and small amino acid groups in the discriminatory rules of all the three temperature range piezophiles along with aliphatic, nonpolar and hydrophobic groups in the mesophilic-piezophilic and thermophilic-piezophilic groups. © 2015 Elsevier Ltd.
Insights into the sequence parameters for halophilic adaptation
(Springer-Verlag Wien, 2016) Abhigyan Nath
The sequence parameters for halophilic adaptation are still not fully understood. To understand the molecular basis of protein hypersaline adaptation, a detailed analysis is carried out to investigate the likely association of protein sequence attributes with halophilic adaptation. A two-stage strategy is implemented. In the first stage, a supervised machine learning classifier is built, giving an overall accuracy of 86% on stratified tenfold cross-validation and 90% on a blind testing set, which are better than the previously reported results. The second stage consists of statistical analysis of sequence features and possible extraction of halophilic molecular signatures. The results of this study showed that halophilic proteins are characterized by lower average charge, lower K content, and lower S content. A statistically significant preference/avoidance list of sequence parameters is also reported, giving insights into the molecular basis of halophilic adaptation. D, Q, E, H, P, T, V are significantly preferred while N, C, I, K, M, F, S are significantly avoided. Among amino acid physicochemical groups, small, polar, charged, acidic and hydrophilic groups are preferred over other groups. The halophilic proteins also showed a preference for higher average flexibility and higher average polarity, and an avoidance of higher average positive charge, average bulkiness and average hydrophobicity. Some interesting trends observed in dipeptide counts are also reported. Further, a systematic statistical comparison is undertaken to gain insights into the sequence feature distribution in different residue structural states. The current analysis may make the mechanism of halophilic adaptation clearer, which can be further used for the rational design of halophilic proteins. © 2015 Springer-Verlag.
Maximizing lipocalin prediction through balanced and diversified training set and decision fusion
(Elsevier Ltd, 2015) Abhigyan Nath; Karthikeyan Subbiah
Lipocalins are short in sequence length and perform several important biological functions. These proteins have less than 20% sequence similarity among paralogs. Experimentally identifying them is an expensive and time-consuming process. Computational methods based on sequence similarity for allocating putative members to this family also remain elusive because of the low sequence similarity among its members. Consequently, machine learning methods become a viable alternative for their prediction, using the underlying sequence- or structure-derived features as the input. Ideally, any machine learning based prediction method must be trained with all possible variations in the input feature vector (all the sub-class input patterns) to achieve perfect learning. Near-perfect learning can be achieved by training the model with diverse types of input instances belonging to different regions of the entire input space. Furthermore, prediction performance can be improved by balancing the training set, as imbalanced data sets tend to bias prediction towards the majority class and its sub-classes. This paper aims to achieve (i) high generalization ability without any classification bias through diversified and balanced training sets and (ii) enhanced prediction accuracy by combining the results of individual classifiers with an appropriate fusion scheme. Instead of creating the training set randomly, we first used the unsupervised Kmeans clustering algorithm to create diversified clusters of input patterns and then created the diversified and balanced training set by selecting an equal number of patterns from each of these clusters.
Finally, probability based classifier fusion scheme was applied on boosted random forest algorithm (which produced greater sensitivity) and K nearest neighbour algorithm (which produced greater specificity) to achieve the enhanced predictive performance than that of individual base classifiers. The performance of the learned models trained on Kmeans preprocessed training set is far better than the randomly generated training sets. The proposed method achieved a sensitivity of 90.6%, specificity of 91.4% and accuracy of 91.0% on the first test set and sensitivity of 92.9%, specificity of 96.2% and accuracy of 94.7% on the second blind test set. These results have established that diversifying training set improves the performance of predictive models through superior generalization ability and balancing the training set improves prediction accuracy. For smaller data sets, unsupervised Kmeans based sampling can be an effective technique to increase generalization than that of the usual random splitting method. © 2015 Elsevier Ltd. All rights reserved.
Mining Chemogenomic Spaces for Prediction of Drug–Target Interactions
(Humana Press Inc., 2024) Abhigyan Nath; Radha Chaube
The pipeline of drug discovery consists of a number of processes; drug–target interaction determination is one of the salient steps among them. Computational prediction of drug–target interactions can facilitate in reducing the search space of experimental wet lab-based verifications steps, thus considerably reducing time and other resources dedicated to the drug discovery pipeline. While machine learning-based methods are more widespread for drug–target interaction prediction, network-centric methods are also evolving. In this chapter, we focus on the process of the drug–target interaction prediction from the perspective of using machine learning algorithms and the various stages involved for developing an accurate predictor. © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2024.
Missing QoS-values predictions using neural networks for cloud computing environments
(Institute of Electrical and Electronics Engineers Inc., 2016) Sunil Kumar; Manish Kumar Pandey; Abhigyan Nath; Karthikeyan Subbiah
The cloud computing environment is influenced by user-dependent quality of service (QoS) parameters, apart from other factors, in evaluating the performance of Web services. Among the performance QoS parameters, mainly response time and throughput can be modulated to provide very efficient services for cloud users. Recommending appropriate Web services to end users, with proper QoS satisfaction and according to each user's requirements, is one of the critical issues for service providers. This can be recommended to end users in the Service Level Agreement (SLA) under Web Service Modeling Ontology (WSMO) of WS-Policy. Generally, the matrix of collected QoS parameter values is sparse, and the accurate prediction of the missing QoS values is important for the recommendation of appropriate web services to the end users. To address this issue, we worked out an artificial neural network (ANN) model for the prediction of missing QoS values using past QoS performance parameter data. In the current work, the performances of different ANN learning algorithms are analyzed for enhanced prediction of QoS performance values. The ANN model with Bayesian regularization was found to perform better than the other learning algorithms. © 2015 IEEE.
Performance Analysis of Ensemble Supervised Machine Learning Algorithms for Missing Value Imputation
(Institute of Electrical and Electronics Engineers Inc., 2016) Sunil Kumar; Manish Kumar Pandey; Abhigyan Nath; Karthikeyan Subbiah
In this era of cloud computing, web-services-based solutions are gaining popularity. Applications running in distributed environments seek new parameters that let them perform efficiently and satisfy end users' requirements. Finding these parameters has become an active topic among researchers. The non-functional performance of a web service is described through user-dependent QoS properties. These QoS parameters are generally described in WS-Policy in the Service Level Agreement (SLA). Web service QoS datasets usually have missing QoS values, which makes missing value imputation an important task when working with cloud web services. In the current work we compared the prediction accuracy of two groups of supervised machine learning ensemble-based meta learners, bagging and additive regression (boosting), with a fusion of the seven base learners in both. Random forest was found to perform better than the other learning algorithms in both meta learners, bagging and boosting. © 2016 IEEE.
Prediction of human drug targets and their interactions using machine learning methods: Current and future perspectives
(Humana Press Inc., 2018) Abhigyan Nath; Priyanka Kumari; Radha Chaube
Identification of drug targets and drug target interactions are important steps in the drug-discovery pipeline. Successful computational prediction methods can reduce the cost and time demanded by the experimental methods. Knowledge of putative drug targets and their interactions can be very useful for drug repurposing. Supervised machine learning methods have been very useful in drug target prediction and in prediction of drug target interactions. Here, we describe the details for developing prediction models using supervised learning techniques for human drug target prediction and their interactions. © 2018, Springer Science+Business Media, LLC, part of Springer Nature.
Probing an optimal class distribution for enhancing prediction and feature characterization of plant virus-encoded RNA-silencing suppressors
(Springer Verlag, 2016) Abhigyan Nath; Karthikeyan Subbiah
To counter the host RNA silencing defense mechanism, many plant viruses encode RNA silencing suppressor proteins. These groups of proteins share very low sequence and structural similarities among them, which consequently hamper their annotation using sequence similarity-based search methods. Alternatively the machine learning-based methods can become a suitable choice, but the optimal performance through machine learning-based methods is being affected by various factors such as class imbalance, incomplete learning, selection of inappropriate features, etc. In this paper, we have proposed a novel approach to deal with the class imbalance problem by finding the optimal class distribution for enhancing the prediction accuracy for the RNA silencing suppressors. The optimal class distribution was obtained using different resampling techniques with varying degrees of class distribution starting from natural distribution to ideal distribution, i.e., equal distribution. The experimental results support the fact that optimal class distribution plays an important role to achieve near perfect learning. The best prediction results are obtained with Sequential Minimal Optimization (SMO) learning algorithm. We could achieve a sensitivity of 98.5 %, specificity of 92.6 % with an overall accuracy of 95.3 % on a tenfold cross validation and is further validated using leave one out cross validation test. It was also observed that the machine learning models trained on oversampled training sets using synthetic minority oversampling technique (SMOTE) have relatively performed better than on both randomly undersampled and imbalanced training data sets. Further, we have characterized the important discriminatory sequence features of RNA-silencing suppressors which distinguish these groups of proteins from other protein families. © 2016, The Author(s).