In this post, we will be looking at using Keras to build a multiclass text
classification model using deep learning.
What is multiclass classification?¶
Multiclass classification is the more general task of classifying training samples into one of three or more categories. The stricter, two-class form is the one you have probably already heard of: binary
classification (Spam/Not Spam or Fraud/No Fraud).
For our example, we will be using the Stack Overflow dataset and assigning tags to posts. You can find the dataset here.
I have grabbed around 2k samples for 4 tags: iPhone, java, javascript and python.
We will be building a deep learning model using Keras. Why Keras?
It’s easy and I am more comfortable with it.
As a rule of thumb, we should always look at our data before we start building any model.
import keras
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model, Sequential
from keras.layers import Input, Dense, Dropout, Embedding, LSTM, Flatten
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
plt.style.use('ggplot')
%matplotlib inline
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
data = pd.read_csv('stackoverflow.csv')
data.head()
Check class distributions¶
data.tags.value_counts()
Convert tags to integers, as most machine learning models deal with integers or floats¶
An alternative way would be to use LabelEncoder and fit it on the tags column, as sketched below.
data['target'] = data.tags.astype('category').cat.codes
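For reference, here is a minimal sketch of the LabelEncoder alternative mentioned above; it produces equivalent integer codes (shown here without overwriting the target column we just created):
from sklearn.preprocessing import LabelEncoder
# Fit the encoder on the tags column; transform returns one integer per tag.
label_encoder = LabelEncoder()
encoded_tags = label_encoder.fit_transform(data.tags)
# label_encoder.classes_ holds the tag name behind each integer code.
label_encoder.classes_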
Calculate the number of words in each post¶
We would like to look at the word distribution across all posts. This information will be key later when we pass the data to the Keras deep learning model.
data['num_words'] = data.post.apply(lambda x : len(x.split()))
Binning the posts by word count.¶
Ideally we would want to know how many posts are short, medium and long. Binning is an efficient way to do that.
bins = [0, 100, 300, 500, 800, np.inf]
labels = ['0-100', '100-300', '300-500', '500-800', '>800']
data['bins'] = pd.cut(data.num_words, bins=bins, labels=labels)
word_distribution = data.groupby('bins').size().reset_index().rename(columns={0:'counts'})
word_distribution.head()
sns.barplot(x='bins', y='counts', data=word_distribution).set_title("Word distribution per bin")
After these transformations, this is how our pandas DataFrame looks:
data.head()
Set number of classes and target variable¶
num_class = len(np.unique(data.tags.values))
y = data['target'].values
Tokenize the input¶
For a deep learning model we need to know what the input sequence length for our model should be. The distribution graph above shows us that we have fewer than 200 posts with more than 500 words.
Given the above information we could set the input sequence length to max(words per post), but that would waste a lot of resources, so we make a tradeoff and set the input sequence length to 500.
MAX_LENGTH = 500
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data.post.values)
post_seq = tokenizer.texts_to_sequences(data.post.values)
post_seq_padded = pad_sequences(post_seq, maxlen=MAX_LENGTH)
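To make the tokenization and padding steps concrete, here is a small illustration on a hypothetical post (the sentence below is made up, not from the dataset):
# Illustrative only: tokenize a made-up post with the fitted tokenizer and
# pad/truncate it to MAX_LENGTH, exactly as is done for the real posts above.
sample = ["how do I parse json in python"]
sample_seq = tokenizer.texts_to_sequences(sample)
sample_padded = pad_sequences(sample_seq, maxlen=MAX_LENGTH)
sample_padded.shape  # (1, 500): zeros padded on the left, word indices at the end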
X_train, X_test, y_train, y_test = train_test_split(post_seq_padded, y, test_size=0.05)
vocab_size = len(tokenizer.word_index) + 1
Deep Learning Model: Simple¶
Let's start with a simple model: an embedding layer and a dense layer, followed by our prediction layer.
inputs = Input(shape=(MAX_LENGTH, ))
embedding_layer = Embedding(vocab_size,
                            128,
                            input_length=MAX_LENGTH)(inputs)
x = Flatten()(embedding_layer)
x = Dense(32, activation='relu')(x)
predictions = Dense(num_class, activation='softmax')(x)
model = Model(inputs=[inputs], outputs=predictions)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['acc'])
model.summary()
filepath="weights-simple.hdf5"
checkpointer = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
history = model.fit(X_train, to_categorical(y_train), batch_size=64, epochs=5,
                    validation_split=0.25, shuffle=True, verbose=1, callbacks=[checkpointer])
Understanding the model fit¶
Once we run model.fit, we can see that by around the 5th epoch the training accuracy reaches 100% whereas the validation accuracy is only around 93.56%, which suggests that we are overfitting the data and not able to generalize the model.
df = pd.DataFrame({'epochs':history.epoch, 'accuracy': history.history['acc'], 'validation_accuracy': history.history['val_acc']})
g = sns.pointplot(x="epochs", y="accuracy", data=df, fit_reg=False)
g = sns.pointplot(x="epochs", y="validation_accuracy", data=df, fit_reg=False, color='green')
Let's look at the accuracy¶
predicted = model.predict(X_test)
predicted = np.argmax(predicted, axis=1)
accuracy_score(y_test, predicted)
With a simple model we were able to get around 94.5% accuracy on the test set.
Recurrent Neural Networks¶
A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed graph along a sequence. This allows it to exhibit dynamic temporal behavior for a time sequence.
For our example we will use LSTMs to capture the notion of sequence in our posts: we tend to see certain words before or after some other context word X,
and we would want to capture that.
inputs = Input(shape=(MAX_LENGTH, ))
embedding_layer = Embedding(vocab_size,
                            128,
                            input_length=MAX_LENGTH)(inputs)
x = LSTM(64)(embedding_layer)
x = Dense(32, activation='relu')(x)
predictions = Dense(num_class, activation='softmax')(x)
model = Model(inputs=[inputs], outputs=predictions)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['acc'])
model.summary()
filepath="weights.hdf5"
checkpointer = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
history = model.fit(X_train, to_categorical(y_train), batch_size=64, epochs=10,
                    validation_split=0.25, shuffle=True, verbose=1, callbacks=[checkpointer])
df = pd.DataFrame({'epochs':history.epoch, 'accuracy': history.history['acc'], 'validation_accuracy': history.history['val_acc']})
g = sns.pointplot(x="epochs", y="accuracy", data=df, fit_reg=False)
g = sns.pointplot(x="epochs", y="validation_accuracy", data=df, fit_reg=False, color='green')
model.load_weights('weights.hdf5')
predicted = model.predict(X_test)
predicted
Understanding Softmax¶
If you look at the last layer of the neural network, you can see that we set the output size equal to the number of classes, which means the model gives us the probability that the input belongs to each particular class. Hence, to get the predicted class, we need to use argmax to find the class with the highest probability.
predicted = np.argmax(predicted, axis=1)
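As a quick illustration, the integer predictions can be mapped back to tag names, assuming the same category ordering that was used to create the target column earlier:
# Recover the tag name behind each integer code (same ordering as cat.codes above).
tag_names = data.tags.astype('category').cat.categories
predicted_tags = [tag_names[i] for i in predicted]
predicted_tags[:10]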
Let's look at the accuracy¶
accuracy_score(y_test, predicted)
We were able to achieve an accuracy score of 95.25%, which is pretty good and a solid improvement over our simple model. On a side note, I also found the book Deep Learning with Python to be super helpful.