Protein encodings
Protein encoding for deep learning models
By Niels van der Velden in Deep Learning, Machine Learning, Protein
June 15, 2022
Introduction
There are countless examples in the literature where deep learning is used to make predictions from protein sequences. In order to use protein sequences as inputs for deep learning models, they need to be transformed into numerical descriptors. This can be done in several different ways depending on the underlying data and the purpose of the model. In this notebook I highlight four different ways to generate such numerical features from protein sequences: 1) the amino acid composition, 2) N-grams, 3) word embeddings and 4) BLOSUM matrices. To illustrate how to use these descriptors I use them to train deep learning models that predict protein classifications. The models themselves were kept as basic as possible and, although they reach good accuracy, were not fully optimized.
Libraries
The below libraries were used to process the data and build the models.
# import the necessary packages
from collections import Counter
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split, KFold
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, LabelBinarizer, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, Embedding, Flatten, Concatenate, Input, Conv1D, MaxPooling1D, LeakyReLU, LSTM, BatchNormalization
from tensorflow.keras.optimizers import Adam
from keras.losses import mean_squared_error
from sklearn.metrics import r2_score, accuracy_score
from keras import backend as K
from keras.utils.np_utils import to_categorical
from sklearn import linear_model
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import cv2
np.random.seed(1337) # for reproducibility
Data
The dataset was compiled from the Protein Data Bank (PDB) and posted by SHAHIR on Kaggle for people to use to test different modeling approaches. The PDB database is used to store and visualize three-dimensional structural data of proteins. Each structure comes with a comprehensive list of the experimental details that were used to generate the structure and the characteristics of the protein. It also classifies each protein into a family based on structural and sequence similarity with other proteins.
Processing
Before using the data, several steps were performed to make it more suitable for our deep learning model. These steps are outlined below.
The data consists of two separate files: one containing all the details about the structure and a second containing the details on the protein sequence. Both files are loaded and merged based on structure ID. The data also contains structures of DNA and RNA; these are removed as we are only interested in the protein structures. Since we only need the structure IDs, classifications and sequences, we drop all other columns.
pdb_data_no_dups = pd.read_csv("Data/pdb_data_no_dups.csv")
pdb_data_seq = pd.read_csv("Data/pdb_data_seq.csv")
#Merge data on structureId and select only protein structures
data_merged = pdb_data_no_dups.merge(pdb_data_seq, how='inner', on='structureId').query('macromoleculeType_x == "Protein"')
#Select columns we are interested in and remove duplicates
data_merged = data_merged[["structureId", "classification", "sequence"]]
data_merged = data_merged.drop_duplicates(["structureId", "classification", "sequence"])
data_merged = data_merged.reset_index(drop=True)
#Print data
'The data has {} rows and {} columns'.format(data_merged.shape[0], data_merged.shape[1])
## 'The data has 166681 rows and 3 columns'
pd.set_option('display.max_columns', None)
data_merged
## structureId classification \
## 0 101M OXYGEN TRANSPORT
## 1 102L HYDROLASE(O-GLYCOSYL)
## 2 102M OXYGEN TRANSPORT
## 3 103L HYDROLASE(O-GLYCOSYL)
## 4 103M OXYGEN TRANSPORT
## ... ... ...
## 166676 9RSA HYDROLASE (PHOSPHORIC DIESTER)
## 166677 9RUB LYASE(CARBON-CARBON)
## 166678 9WGA LECTIN (AGGLUTININ)
## 166679 9XIA ISOMERASE(INTRAMOLECULAR OXIDOREDUCTASE)
## 166680 9XIM ISOMERASE(INTRAMOLECULAR OXIDOREDUCTASE)
##
## sequence
## 0 MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...
## 1 MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSE...
## 2 MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...
## 3 MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAK...
## 4 MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...
## ... ...
## 166676 KETAAAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCKPVNTF...
## 166677 MDQSSRYVNLALKEEDLIAGGEHVLCAYIMKPKAGYGYVATAAHFA...
## 166678 ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...
## 166679 MNYQPTPEDRFTFGLWTVGWQGRDPFGDATRRALDPVESVQRLAEL...
## 166680 SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...
##
## [166681 rows x 3 columns]
Rows with missing values are removed.
data_merged.isna().sum()
## structureId 0
## classification 1
## sequence 3
## dtype: int64
data_merged = data_merged.dropna()
Having a closer look at the data, one can see that some of the sequences contain characters other than the 20 standard AAs (B|O|U|X|Z).
#Count all different AAs
sequences = data_merged["sequence"].dropna()
charCounter = CountVectorizer(analyzer="char", ngram_range=(1,1))
charVec = charCounter.fit_transform(sequences.to_numpy())
AA = charCounter.get_feature_names_out()
AAcounts = np.asarray(charVec.sum(axis=0))[0]
print(dict(zip(AA, AAcounts)))
## {'a': 3378009, 'b': 19, 'c': 642515, 'd': 2484519, 'e': 2835673, 'f': 1746745, 'g': 3211970, 'h': 1176659, 'i': 2389749, 'k': 2517051, 'l': 3940616, 'm': 1001217, 'n': 1857711, 'o': 2, 'p': 2048893, 'q': 1659623, 'r': 2136127, 's': 2837198, 't': 2460573, 'u': 50, 'v': 2990099, 'w': 628670, 'x': 177570, 'y': 1546907, 'z': 28}
For the sake of simplicity I decided to remove all those sequences.
#Remove sequences that contain non standard AAs
AA_mask = sequences.str.contains('^[RHKDESTNQCGPAILMFWYV]+$')
data_merged = data_merged.loc[AA_mask]
'{} rows were removed'.format((AA_mask == False).sum())
## '4377 rows were removed'
If we look at the size distribution of the sequences we can see that quite a large proportion consists of sequences shorter than 20 AAs.
seqLengths = data_merged["sequence"].str.len()
lengths = Counter(seqLengths.to_list())
plt.clf()
# visualize
_ = plt.hist(seqLengths.to_list(), bins=120, edgecolor='black')
_ = plt.xlim(0, 1200)
plt.show()
These are most likely proteins for which only part of the structure has been solved, and it will be hard to predict an accurate classification for them. To get more accurate models I removed all sequences with a length of less than 20 AAs.
#Filter out sequences that are shorter than 20 AAs
AA_length_mask = data_merged["sequence"].str.len() >= 20
data_merged = data_merged.loc[AA_length_mask]
There are >4000 different classifications defined, most of them with far too few sequences to train a model on.
data_merged.groupby(["classification"])["classification"] \
.count() \
.sort_values(ascending = False)
## classification
## HYDROLASE 23465
## TRANSFERASE 16557
## OXIDOREDUCTASE 14288
## IMMUNE SYSTEM 7920
## HYDROLASE/HYDROLASE INHIBITOR 5661
## ...
## NOVEL SEQUENCE 1
## NONSTRUCTURAL PROTEIN 1
## NON-HEME IRON PROTEIN 1
## NK CELL 1
## virus/inhibitor 1
## Name: classification, Length: 4383, dtype: int64
To build the models I focus on the top 10 classifications.
#Filter for top10 classifications
classes = 10
data_merged['count'] = data_merged.groupby(["classification"])["classification"].transform('count')
top_class = data_merged["count"].sort_values(ascending=False).unique()[classes]
data_top_classes = data_merged[data_merged["count"] > top_class].sort_values(by=["count"], ascending=False)
data_top_classes = data_top_classes.reset_index(drop=True)
#Print results
data_top_classes.groupby(["classification"])["classification"] \
.count() \
.sort_values(ascending = False)
## classification
## HYDROLASE 23465
## TRANSFERASE 16557
## OXIDOREDUCTASE 14288
## IMMUNE SYSTEM 7920
## HYDROLASE/HYDROLASE INHIBITOR 5661
## LYASE 4476
## TRANSCRIPTION 4288
## TRANSPORT PROTEIN 3751
## SIGNALING PROTEIN 3284
## VIRAL PROTEIN 2761
## Name: classification, dtype: int64
Transformation
To build the models we binarize the classes. See: link for an explanation on choosing the right encoding method.
lb = LabelBinarizer()
y = lb.fit_transform(data_top_classes["classification"])
y[1:10]
## array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
## [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
## [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
## [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
## [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
## [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
## [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
## [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
## [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
The sequences are transformed to a numpy array.
X = data_top_classes["sequence"].to_numpy()
Amino Acid composition model
As a first, most basic model we can use the amino acid (AA) composition of each sequence to train our model. To do this we count the number of each AA in every sequence.
vectorizer = CountVectorizer(analyzer="char")
X_vec = vectorizer.fit_transform(X)
X_AAC = X_vec.toarray()
feature_names = vectorizer.get_feature_names_out()
pd.DataFrame(X_AAC, columns = feature_names)
## a c d e f g h i k l m n p q r s t v \
## 0 22 1 23 20 11 16 6 27 29 18 2 16 19 7 6 13 19 14
## 1 19 4 20 15 8 18 11 17 18 24 8 9 13 4 11 18 17 10
## 2 23 3 21 23 19 29 7 26 17 23 8 24 22 11 18 31 26 12
## 3 23 3 21 22 19 29 7 26 17 23 8 24 22 12 18 31 26 12
## 4 40 0 22 16 9 26 13 16 9 23 5 11 20 13 15 13 28 28
## ... .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
## 86446 13 4 4 3 4 18 0 5 5 11 2 4 6 8 5 16 8 9
## 86447 20 5 12 13 18 25 6 10 7 30 4 21 29 17 12 20 30 22
## 86448 12 2 5 5 5 15 1 2 4 7 3 4 3 8 8 13 9 7
## 86449 16 5 20 9 20 25 7 10 6 28 3 19 28 20 14 22 20 22
## 86450 53 3 28 28 17 32 6 30 35 45 7 28 24 19 13 19 24 25
##
## w y
## 0 6 11
## 1 1 9
## 2 11 26
## 3 11 26
## 4 4 12
## ... .. ..
## 86446 2 8
## 86447 5 9
## 86448 3 8
## 86449 4 10
## 86450 10 17
##
## [86451 rows x 20 columns]
AAC Model
For the AAC model I use three dense layers, with the last layer having 10 units, equal to the number of classes we try to predict. I use early stopping based on the accuracy on the test data to avoid overfitting. To limit the training time I increased the batch size to 128. The model is wrapped in a function so that I can reuse it for cross-validation.
def Model_AAC(X_train, y_train, X_test, y_test, classes = 10):
    #Define model
    model = Sequential()
    model.add(Dense(16, input_dim=X_train.shape[1], activation="relu"))
    model.add(Dense(8, activation="relu"))
    model.add(Dense(classes, activation="softmax"))
    #Define optimizer
    opt = Adam(learning_rate=0.01)
    early_stopping = EarlyStopping(min_delta=0.01,
                                   patience=5,
                                   monitor='val_accuracy',
                                   verbose=0,
                                   mode="max",
                                   restore_best_weights=True
                                   )
    #Compile model
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    model.fit(x=X_train, y=y_train,
              validation_data=(X_test, y_test),
              epochs=100,
              callbacks=[early_stopping],
              verbose=0,
              batch_size=128)
    y_pred = model.predict(X_test)
    test_acc = accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1))
    return test_acc
Validation
To test the model we check its performance on the test data that we set apart, as well as with 5-fold cross-validation.
Training data
To assess the model, the data is split into a train set, with 20% of the data held back to evaluate our model. To avoid bias I normalized the data using the MinMaxScaler() (see: link for an explanation on normalization of data).
#Generate train and test data
X_train, X_test, y_train, y_test = train_test_split(X_AAC, y, test_size=.2, random_state=42, shuffle=True)
#Scale data
scaler = MinMaxScaler()
X_train = scaler.fit_transform(pd.DataFrame(X_train, columns = feature_names)) #Fit and transform data on training data
X_test = scaler.transform(pd.DataFrame(X_test, columns = feature_names)) #Transform data using the training parameters for fitting.
#Fit data to Model
train_acc = Model_AAC(X_train, y_train, X_test, y_test)
print("Accuracy-train = {:.3f}".format(train_acc))
## Accuracy-train = 0.452
Cross Validation
Estimate accuracy using 5-fold cross validation.
#Generate X, y data
ACC = []
kf = KFold(n_splits=5, random_state=42, shuffle=True)
i = 1
for train_index, test_index in kf.split(X_AAC):
    X_train, X_test = X_AAC[train_index], X_AAC[test_index]
    y_train, y_test = y[train_index], y[test_index]
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(pd.DataFrame(X_train))
    X_test = scaler.transform(pd.DataFrame(X_test))
    pred = Model_AAC(X_train, y_train, X_test, y_test)
    print("Accuracy-fold-{} = {:.3f}".format(i, pred))
    i = i+1
    ACC += [pred]
## Accuracy-fold-1 = 0.435
## Accuracy-fold-2 = 0.448
## Accuracy-fold-3 = 0.428
## Accuracy-fold-4 = 0.443
## Accuracy-fold-5 = 0.429
print("Accuracy-mean = {:.3f}".format(np.mean(ACC, dtype=object)))
## Accuracy-mean = 0.437
Conclusion
Using the AA composition alone we get an accuracy of around 0.44. This is not too bad considering we are trying to predict 10 different classes. This type of encoding can work well for shorter sequences like peptides. However, with this type of encoding any information on the order of the AAs is lost. Especially for longer sequences this is important, as many classes are defined by specific domains within the sequence.
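To make the loss of order information concrete, here is a tiny illustration (my own toy example, not part of the original analysis): two made-up sequences that contain the same residues in a different order get identical composition vectors, so this encoding cannot tell them apart.
#Two toy sequences with the same residues in a different order
from sklearn.feature_extraction.text import CountVectorizer
toy_seqs = ["MKVLAAG", "GAALVKM"]
toy_vec = CountVectorizer(analyzer="char")
print(toy_vec.fit_transform(toy_seqs).toarray())
## [[2 1 1 1 1 1]
##  [2 1 1 1 1 1]]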
N-grams model
To preserve some information on the order of AAs in the sequences, we can count the occurrences of the most common AA combinations across all the sequences. In the example below we look at the 20 most common three-letter AA combinations and then count the occurrence of each in every sequence.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3,3), max_features = 20)
X_train_vec = vectorizer.fit_transform(X)
feature_names = vectorizer.get_feature_names_out()
#Print results
pd.DataFrame(X_train_vec.toarray(), columns = feature_names)
## aaa aag aal ala ale all avl eal ell gla gsg gss hhh laa \
## 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0
## 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## 4 0 0 1 0 0 0 0 0 0 0 0 0 4 0
## ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
## 86446 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## 86447 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## 86448 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 86449 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 86450 2 0 2 1 0 0 0 1 0 1 0 0 0 1
##
## lke lla lle lll sgs ssg
## 0 0 0 0 0 1 0
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 1 0 0 0 0
## ... ... ... ... ... ... ...
## 86446 0 0 0 0 1 0
## 86447 0 0 0 0 0 1
## 86448 0 0 0 0 0 0
## 86449 0 0 0 0 0 0
## 86450 0 1 0 0 0 0
##
## [86451 rows x 20 columns]
We use the above principle to count the occurrence of AA combinations with a length of 1 to 3 AAs and use the top 8000 features as input for our model. Note that we split the data into a train and a test set first and generate all the N-grams based on the training data only. This makes sure that we don't have data leakage when we estimate the accuracy on the test data. The occurrences are then counted for both the train and test data. As the final step we scale the values (see the AA composition model).
#Generate train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42, shuffle=True)
#Generate N-grams and fit on train and test data
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1,3), max_features = 8000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
feature_names = vectorizer.get_feature_names_out()
#Scale values
scaler = MinMaxScaler()
X_train_AAN_scaled = scaler.fit_transform(pd.DataFrame(X_train_vec.toarray(), columns = feature_names))
X_test_AAN_scaled = scaler.transform(pd.DataFrame(X_test_vec.toarray(), columns = feature_names)) #Use X_train parameters for fitting.
#Print results
pd.set_option('display.max_columns', 10)
pd.DataFrame(X_train_AAN_scaled, columns = feature_names)
## a aa aaa aac aad ... yys yyt yyv yyw \
## 0 0.036585 0.000000 0.000000 0.0 0.0 ... 0.0 0.0 0.000000 0.0
## 1 0.004878 0.000000 0.000000 0.0 0.0 ... 0.0 0.5 0.000000 0.0
## 2 0.029268 0.013245 0.000000 0.0 0.2 ... 0.0 0.0 0.000000 0.0
## 3 0.151220 0.119205 0.040000 0.0 0.0 ... 0.0 0.0 0.000000 0.0
## 4 0.007317 0.000000 0.000000 0.0 0.0 ... 0.0 0.0 0.000000 0.0
## ... ... ... ... ... ... ... ... ... ... ...
## 69155 0.080488 0.033113 0.006667 0.0 0.0 ... 0.0 0.5 0.000000 0.0
## 69156 0.007317 0.000000 0.000000 0.0 0.0 ... 0.0 0.0 0.333333 0.0
## 69157 0.119512 0.039735 0.000000 0.0 0.2 ... 0.0 0.0 0.000000 0.0
## 69158 0.007317 0.000000 0.000000 0.0 0.0 ... 0.0 0.0 0.000000 0.0
## 69159 0.078049 0.013245 0.000000 0.0 0.0 ... 0.0 0.0 0.000000 0.0
##
## yyy
## 0 0.0
## 1 0.0
## 2 0.0
## 3 0.0
## 4 0.0
## ... ...
## 69155 0.0
## 69156 0.0
## 69157 0.0
## 69158 0.0
## 69159 0.0
##
## [69160 rows x 8000 columns]
Model
For comparison we use the same set-up as for the amino acid composition model. However, since we now have far more features, more layers and nodes would most likely improve the accuracy.
def Model_AAN(X_train, y_train, X_test, y_test, classes = 10):
    #Define model
    model = Sequential()
    model.add(Dense(16, input_dim=X_train.shape[1], activation="relu"))
    model.add(Dense(8, activation="relu"))
    model.add(Dense(classes, activation="softmax"))
    #Define optimizer
    opt = Adam(learning_rate=0.01)
    early_stopping = EarlyStopping(min_delta=0.01,
                                   patience=5,
                                   monitor='val_accuracy',
                                   verbose=0,
                                   mode="max",
                                   restore_best_weights=True
                                   )
    #Compile model
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    model.fit(x=X_train, y=y_train,
              validation_data=(X_test, y_test),
              epochs=100,
              callbacks=[early_stopping],
              verbose=0,
              batch_size=128)
    y_pred = model.predict(X_test)
    test_acc = accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1))
    return test_acc
Validation
Training data
As with the AA composition model, the values are scaled and fit to the model.
#Fit data to Model
train_acc = Model_AAN(X_train_AAN_scaled, y_train, X_test_AAN_scaled, y_test)
print("Accuracy-train = {:.3f}".format(train_acc))
## Accuracy-train = 0.806
Cross Validation
#Generate X, y data
ACC = []
kf = KFold(n_splits=5, random_state=42, shuffle=True)
i = 1
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    #Generate N-grams
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(1,3), max_features = 8000)
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)
    feature_names = vectorizer.get_feature_names_out()
    #Scale values
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(pd.DataFrame(X_train_vec.toarray(), columns = feature_names))
    X_test_scaled = scaler.transform(pd.DataFrame(X_test_vec.toarray(), columns = feature_names)) #Use X_train parameters for fitting.
    pred = Model_AAN(X_train_scaled, y_train, X_test_scaled, y_test)
    print("Accuracy-fold-{} = {:.3f}".format(i, pred))
    i = i+1
    ACC += [pred]
## Accuracy-fold-1 = 0.807
## Accuracy-fold-2 = 0.802
## Accuracy-fold-3 = 0.811
## Accuracy-fold-4 = 0.794
## Accuracy-fold-5 = 0.804
print("Accuracy-mean = {:.3f}".format(np.mean(ACC)))
## Accuracy-mean = 0.804
Conclusion
Using N-grams we get an accuracy of around 0.8. This is a very big improvement over using only the AA composition. The increase in accuracy is mostly because this kind of encoding captures the AA motifs that are most common among the sequences. We might be able to increase the accuracy even further by playing around with the length of the words and by increasing the complexity of the model with more layers and nodes; a possible starting point is sketched below.
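Note that the (1,4) n-gram range, the 16000-feature cap, the layer sizes and the dropout rate in this sketch are my own unverified choices, so whether they actually help would need to be tested.
#Sketch of a wider/deeper variant of the N-grams model (settings not validated)
vectorizer_wide = CountVectorizer(analyzer="char", ngram_range=(1,4), max_features = 16000)

def Model_AAN_deep(X_train, y_train, X_test, y_test, classes = 10):
    model = Sequential()
    model.add(Dense(128, input_dim=X_train.shape[1], activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(64, activation="relu"))
    model.add(Dense(classes, activation="softmax"))
    early_stopping = EarlyStopping(min_delta=0.01, patience=5, monitor='val_accuracy',
                                   mode="max", restore_best_weights=True)
    model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])
    model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test),
              epochs=100, callbacks=[early_stopping], verbose=0, batch_size=128)
    y_pred = model.predict(X_test)
    return accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1))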
Embedding model
For this encoding the sequences are first split into AAs and each AA is assigned a number (tokenized).
#Tokenize sequences
tk = Tokenizer(char_level=True)
tk.fit_on_texts(X)
seq_tok_all = tk.texts_to_sequences(X)
seq_tok_all[1]
## [18, 15, 10, 6, 1, 4, 12, 20, 1, 9, 7, 13, 16, 2, 16, 10, 1, 17, 7, 6, 7, 8, 3, 8, 4, 3, 4, 4, 7, 12, 5, 6, 2, 6, 12, 10, 10, 7, 5, 1, 9, 11, 5, 3, 11, 13, 1, 8, 16, 10, 1, 13, 8, 17, 17, 17, 16, 7, 17, 8, 3, 3, 13, 1, 6, 1, 9, 7, 11, 16, 3, 2, 9, 4, 10, 3, 5, 2, 18, 7, 9, 7, 11, 10, 12, 3, 10, 7, 18, 2, 1, 9, 7, 3, 7, 9, 19, 18, 14, 2, 3, 17, 6, 4, 17, 4, 18, 7, 8, 12, 3, 17, 8, 9, 3, 17, 10, 5, 1, 16, 14, 12, 3, 5, 11, 2, 10, 14, 8, 3, 7, 8, 18, 14, 5, 1, 5, 20, 3, 9, 1, 14, 6, 3, 8, 12, 9, 15, 18, 1, 2, 5, 1, 15, 9, 10, 8, 5, 1, 12, 7, 7, 8, 5, 10, 16, 20, 3, 17, 6, 16, 8, 1, 5, 13, 5, 9, 14, 2, 1, 5, 1, 6, 12, 13, 13, 6, 4, 1, 15, 5, 16, 2, 2, 17, 4, 2, 6, 1, 11, 5, 9, 9, 1, 12, 8, 10, 12, 8, 8, 4, 9, 18, 6, 9, 2, 20, 13, 12, 14, 1, 11, 5, 5, 13, 8, 7, 10, 11, 11, 2, 1, 11, 10, 12, 6, 2, 2, 7, 6, 2, 6, 2, 1, 3, 10, 10, 11, 9, 2, 9, 7, 7, 14]
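To see which number corresponds to which amino acid, the fitted tokenizer's index can be inspected (the exact numbering depends on the character frequencies in X):
#Map each token id back to its amino acid
print(dict(sorted(tk.word_index.items(), key=lambda kv: kv[1])))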
The protein sequences have variable lengths, while the deep learning model requires equal-length inputs. For this reason shorter sequences need to be padded. The longest sequence in the data is around 5000 AAs long, so by default all other sequences would be padded to that length. This would result in a very large dataset that takes a very long time to train. For this reason I set a maximum length of 256 AAs: sequences shorter than 256 AAs are padded and sequences longer than 256 AAs are trimmed.
max_length = 256
#max_length = len(max(X, key=len))
X_vec = pad_sequences(seq_tok_all, maxlen=max_length, padding='post') #padding='post'
Model
For this model a convolutional layer (Conv1D) is used, as this type seemed to work better than a dense layer with this kind of sequence representation. Before the data is fed into the model, an Embedding layer is used which transforms each tokenized AA into a continuous numeric vector of length 8.
def Model_embed(X_train, y_train, X_test, y_test, max_length=max_length):
    #Define model
    model = Sequential()
    model.add(Embedding(input_dim=len(tk.word_index)+1, output_dim=8, input_length=max_length, trainable=True))
    model.add(Conv1D(filters=64, kernel_size=6, padding="same", activation="relu"))
    model.add(Flatten())
    model.add(Dense(10, activation="softmax"))
    opt = Adam(learning_rate=0.01)
    early_stopping = EarlyStopping(min_delta=0.01,
                                   patience=5,
                                   monitor='val_accuracy',
                                   verbose=0,
                                   mode="max",
                                   restore_best_weights=True
                                   )
    #Compile model
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    model.fit(x=X_train, y=y_train,
              validation_data=(X_test, y_test),
              epochs=100,
              callbacks=[early_stopping],
              verbose=0,
              batch_size=128)
    y_pred = model.predict(X_test)
    test_acc = accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1))
    return test_acc
Training data
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=.2, random_state=42, shuffle=True)
train_acc_embed = Model_embed(X_train, y_train, X_test, y_test, max_length=max_length)
print("Accuracy-train = {:.3f}".format(train_acc_embed))
## Accuracy-train = 0.762
Cross Validation
#Generate X, y data
ACC = []
kf = KFold(n_splits=5, random_state=42, shuffle=True)
i = 1
for train_index, test_index in kf.split(X_vec):
    X_train, X_test = X_vec[train_index], X_vec[test_index]
    y_train, y_test = y[train_index], y[test_index]
    pred = Model_embed(X_train, y_train, X_test, y_test, max_length=max_length)
    print("Accuracy-fold-{} = {:.3f}".format(i, pred))
    i = i+1
    ACC += [pred]
## Accuracy-fold-1 = 0.749
## Accuracy-fold-2 = 0.756
## Accuracy-fold-3 = 0.757
## Accuracy-fold-4 = 0.764
## Accuracy-fold-5 = 0.758
print("Accuracy-mean = {:.3f}".format(np.mean(ACC)))
## Accuracy-mean = 0.757
Conclusion
The embedding model reaches an accuracy of around 0.75. This is much better than the AA composition model but still lower than the N-grams model. This is most likely because we lose some of the sequence information due to the padding and trimming that is required to get equal-length vectors for our deep learning model.
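To get a feel for how much sequence information the trimming discards, one can check what fraction of the sequences exceeds the 256 AA limit (a quick check, not shown in the original post):
#Fraction of sequences that are trimmed at max_length AAs
frac_trimmed = (data_top_classes["sequence"].str.len() > max_length).mean()
print("{:.1%} of the sequences are longer than {} AAs".format(frac_trimmed, max_length))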
BLOSUM encoding
BLOSUM matrices are widely used in bioinformatics to generate sequence alignments. We can also use this encoding as an input for deep learning. BLOSUM encoding of the sequences was performed with the protr R package using the code below.
library(protr)
#Function for BLOSUM encoding of the AA sequences.
BLOSUM_df <- function(sequences, submat="AABLOSUM62", k=5, lag=7, silent=TRUE){
  df <- t(sapply(sequences, function(i) protr::extractBLOSUM(i, submat = submat, k=k, lag=lag, silent=silent)))
  df <- data.frame(df)
  rownames(df) <- c()
  return(df)
}
#Import data
sequences <- py$data_top_classes["sequence"]
X_BLOSUM <- BLOSUM_df(sequences$sequence, submat="AABLOSUM62", k=9, lag=18, silent=TRUE) #9 / 18
#write_csv(X_BLOSUM, "BLOSUM_k9_lag19.csv")
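For readers who prefer to stay in Python, a much simpler per-residue BLOSUM62 encoding can be sketched with Biopython as shown below. This is my own illustration and is not equivalent to protr::extractBLOSUM, which derives scale descriptors from the substitution matrix rather than using the raw score rows.
#Per-residue BLOSUM62 encoding (illustrative sketch, requires Biopython)
import numpy as np
from Bio.Align import substitution_matrices
blosum62 = substitution_matrices.load("BLOSUM62")
STANDARD_AA = "ARNDCQEGHIKLMFPSTWYV"

def blosum_encode(seq, max_length=256):
    #Each residue becomes its row of BLOSUM62 scores against the 20 standard AAs
    rows = [[blosum62[aa, other] for other in STANDARD_AA] for aa in seq[:max_length]]
    #Zero-pad shorter sequences so every matrix has shape (max_length, 20)
    rows += [[0.0] * len(STANDARD_AA)] * (max_length - len(rows))
    return np.asarray(rows)

print(blosum_encode("MKVLAAG").shape)
## (256, 20)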
The BLOSUM sequence matrices generated in R were saved to a .csv file and loaded back into Python.
BLOSUM_seq_matrixes = pd.read_csv("Data/BLOSUM_k9_lag19.csv")
pd.set_option('display.max_columns', 6)
BLOSUM_seq_matrixes
## scl1.lag1 scl2.lag1 scl3.lag1 ... scl8.7.lag18 scl9.7.lag18 \
## 0 -0.002615 0.000283 -0.002937 ... -0.003019 0.000230
## 1 -0.000948 0.000633 0.003090 ... 0.005362 -0.001396
## 2 0.005699 0.004398 0.003010 ... -0.002283 -0.003664
## 3 0.005689 0.004419 0.002920 ... -0.002295 -0.003665
## 4 0.002448 0.008009 0.000728 ... 0.000387 0.002231
## ... ... ... ... ... ... ...
## 86446 -0.004528 0.012028 0.008579 ... -0.002316 -0.006078
## 86447 -0.002841 0.004932 -0.007188 ... 0.000123 0.001226
## 86448 0.004192 0.007186 0.023998 ... -0.008741 -0.006935
## 86449 -0.002852 0.005154 -0.003875 ... -0.000257 0.000208
## 86450 -0.001217 0.003631 -0.001252 ... -0.000895 0.000950
##
## scl9.8.lag18
## 0 0.000997
## 1 0.001682
## 2 0.003491
## 3 0.003506
## 4 0.004065
## ... ...
## 86446 0.000662
## 86447 0.002007
## 86448 0.002186
## 86449 0.002532
## 86450 0.000996
##
## [86451 rows x 1458 columns]
X_BLOSUM = BLOSUM_seq_matrixes.to_numpy()
Model
def Model_BLOSUM(X_train, y_train, X_test, y_test, classes = 10):
    #Define model
    model = Sequential()
    model.add(Dense(16, input_dim=X_train.shape[1], activation="relu"))
    model.add(Dense(8, activation="relu"))
    model.add(Dense(classes, activation="softmax"))
    #Define optimizer
    opt = Adam(learning_rate=0.01)
    early_stopping = EarlyStopping(min_delta=0.01,
                                   patience=5,
                                   monitor='val_accuracy',
                                   verbose=0,
                                   mode="max",
                                   restore_best_weights=True
                                   )
    #Compile model
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    model.fit(x=X_train, y=y_train,
              validation_data=(X_test, y_test),
              epochs=100,
              callbacks=[early_stopping],
              verbose=0,
              batch_size=128)
    y_pred = model.predict(X_test)
    test_acc = accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1))
    return test_acc
Validation
Training data
X_train, X_test, y_train, y_test = train_test_split(X_BLOSUM, y, test_size=.2, random_state=42, shuffle=True)
train_acc_BLOSUM = Model_BLOSUM(X_train, y_train, X_test, y_test)
print("Accuracy-train = {:.3f}".format(train_acc_BLOSUM))
## Accuracy-train = 0.753
Cross Validation
ACC = []
kf = KFold(n_splits=5, random_state=42, shuffle=True)
i = 1
for train_index, test_index in kf.split(X_BLOSUM):
    X_train, X_test = X_BLOSUM[train_index], X_BLOSUM[test_index]
    y_train, y_test = y[train_index], y[test_index]
    pred = Model_BLOSUM(X_train, y_train, X_test, y_test)
    print("Accuracy-fold-{} = {:.3f}".format(i, pred))
    i = i+1
    ACC += [pred]
## Accuracy-fold-1 = 0.777
## Accuracy-fold-2 = 0.760
## Accuracy-fold-3 = 0.776
## Accuracy-fold-4 = 0.757
## Accuracy-fold-5 = 0.766
print("Accuracy-mean = {:.3f}".format(np.mean(ACC)))
## Accuracy-mean = 0.767
Conclusion
The BLOSUM model reaches an accuracy of around 0.76. This is comparable to the embedding model but still below the N-grams model. However, generating the BLOSUM matrices took much longer than generating the N-grams. Because so many features are generated, a higher accuracy might be reached by adding more layers and nodes.
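For a quick side-by-side view, the mean 5-fold cross-validation accuracies reported in the sections above can be collected into one small overview:
#Overview of the mean CV accuracies reported above
summary = pd.DataFrame({
    "encoding": ["AA composition", "N-grams", "Embedding", "BLOSUM"],
    "mean_cv_accuracy": [0.437, 0.804, 0.757, 0.767],
})
print(summary)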