Deep Learning with Keras and Python for Multiclass Classification
In this post, we will be looking at using Keras to build a multiclass classification model with deep learning.
What is multiclass classification?
Multiclass classification is the more general form of classification, where training samples are assigned to one of more than two categories. The strict two-class form is what you have probably already heard of as binary classification (Spam/Not Spam or Fraud/No Fraud).
For our example, we will be using the Stack Overflow dataset and assigning tags to posts. You can find the dataset here.
I have grabbed around 2k samples each for 4 tags: iphone, java, javascript, and python.
We will be building a deep learning model using Keras. Why Keras?
It's easy to use and I am more comfortable with it.
As a rule of thumb, we should always look at our data before we start building any model.
In [92]:
import keras
import numpy as np
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, Model
from keras.layers import Input, Dense, Dropout, Embedding, LSTM, Flatten
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
plt.style.use('ggplot')
%matplotlib inline
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
In [84]:
data = pd.read_csv('stackoverflow.csv')
In [85]:
data.head()
Out[85]:
| | post | tags |
| --- | --- | --- |
| 0 | conventions of importing python main programs ... | python |
| 1 | python write to file based on offset i want t... | python |
| 2 | enable a textbox on the selection of no from t... | javascript |
| 3 | sending mms and email from within app how doe... | iphone |
| 4 | why aren t java weak references counted as ref... | java |
## Check class distributions
In [86]:
data.tags.value_counts()
Out[86]:
python 2000
iphone 2000
java 2000
javascript 2000
Name: tags, dtype: int64
Convert tags to integers
Most machine learning models deal with integers or floats, and since our tags are strings, we need a way to convert the categories into numbers. An alternative would be to use sklearn's LabelEncoder and fit it on the tags column, as sketched below.
In [87]:
data['target'] = data.tags.astype('category').cat.codes
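For reference, here is a sketch of the LabelEncoder alternative mentioned above. It produces an equivalent integer encoding; for these four tags both approaches assign codes alphabetically (iphone=0, java=1, javascript=2, python=3), though in general the two mappings are not guaranteed to match.
from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the tags column and transform the strings to integer codes
label_encoder = LabelEncoder()
data['target'] = label_encoder.fit_transform(data.tags)

# label_encoder.classes_ stores the tag name corresponding to each integer code
print(label_encoder.classes_)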
Calculate the number of words in each post
We would like to look at the word distribution across all posts. This information will be key later when we pass the data to the Keras deep learning model.
In [88]:
data['num_words'] = data.post.apply(lambda x : len(x.split()))
Binning the posts by word count
Ideally we would want to know how many posts are short, medium, and long. Binning is an efficient mechanism to do that.
In [7]:
bins = [0, 100, 300, 500, 800, np.inf]
data['bins'] = pd.cut(data.num_words, bins=bins, labels=['0-100', '100-300', '300-500', '500-800', '>800'])
In [8]:
word_distribution = data.groupby('bins').size().reset_index().rename(columns={0:'counts'})
In [9]:
word_distribution.head()
Out[9]:
| | bins | counts |
| --- | --- | --- |
| 0 | 0-100 | 3922 |
| 1 | 100-300 | 3521 |
| 2 | 300-500 | 396 |
| 3 | 500-800 | 105 |
| 4 | >800 | 56 |
In [10]:
sns.barplot(x='bins', y='counts', data=word_distribution).set_title("Word distribution per bin")
Out[10]:
<matplotlib.text.Text at 0x11cdd64d0>
After these transformations, this is how our pandas dataframe looks:
In [11]:
data.head()
Out[11]:
| | post | tags | target | num_words | bins |
| --- | --- | --- | --- | --- | --- |
| 0 | conventions of importing python main programs ... | python | 3 | 207 | 100-300 |
| 1 | python write to file based on offset i want t... | python | 3 | 306 | 300-500 |
| 2 | enable a textbox on the selection of no from t... | javascript | 2 | 60 | 0-100 |
| 3 | sending mms and email from within app how doe... | iphone | 0 | 59 | 0-100 |
| 4 | why aren t java weak references counted as ref... | java | 1 | 26 | 0-100 |
Set number of classes and target variable
In [12]:
num_class = len(np.unique(data.tags.values))
y = data['target'].values
Tokenize the input
For a deep learning model we need to know what the input sequence length should be. The distribution graph above shows that fewer than 200 posts have more than 500 words.
We could set the input sequence length to max(words per post), but that would waste a lot of resources on padding, so we make a tradeoff and set the input sequence length to 500.
In [13]:
MAX_LENGTH = 500
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data.post.values)
post_seq = tokenizer.texts_to_sequences(data.post.values)
post_seq_padded = pad_sequences(post_seq, maxlen=MAX_LENGTH)
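To make the tokenization step concrete, here is a toy illustration on two made-up sentences (not from the dataset): the tokenizer maps each word to an integer index, and pad_sequences left-pads every sequence with zeros to a fixed length.
toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(["python write to file", "read a file in python"])

# Each word gets an integer index; more frequent words get smaller indices
toy_seq = toy_tokenizer.texts_to_sequences(["python write to file"])
print(toy_seq)                           # e.g. [[1, 3, 4, 2]]
print(pad_sequences(toy_seq, maxlen=6))  # e.g. [[0 0 1 3 4 2]]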
In [14]:
X_train, X_test, y_train, y_test = train_test_split(post_seq_padded, y, test_size=0.05)
In [15]:
vocab_size = len(tokenizer.word_index) + 1
## Deep Learning Model [Simple]
Let's start with a simple model: an embedding layer, followed by a Dense layer and our prediction layer.
inputs = Input(shape=(MAX_LENGTH, ))
embedding_layer = Embedding(vocab_size,
128,
input_length=MAX_LENGTH)(inputs)
x = Flatten()(embedding_layer)
x = Dense(32, activation='relu')(x)
predictions = Dense(num_class, activation='softmax')(x)
model = Model(inputs=[inputs], outputs=predictions)
model.compile(optimizer='adam',
loss='categorical_crossentropy',
metrics=['acc'])
model.summary()
filepath="weights-simple.hdf5"
checkpointer = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
history = model.fit([X_train], batch_size=64, y=to_categorical(y_train), verbose=1, validation_split=0.25,
shuffle=True, epochs=5, callbacks=[checkpointer])
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_18 (InputLayer) (None, 500) 0
_________________________________________________________________
embedding_18 (Embedding) (None, 500, 128) 5868800
_________________________________________________________________
flatten_6 (Flatten) (None, 64000) 0
_________________________________________________________________
dense_37 (Dense) (None, 32) 2048032
_________________________________________________________________
dense_38 (Dense) (None, 4) 132
=================================================================
Total params: 7,916,964
Trainable params: 7,916,964
Non-trainable params: 0
_________________________________________________________________
Train on 5700 samples, validate on 1900 samples
Epoch 1/5
5696/5700 [============================>.] - ETA: 0s - loss: 1.0156 - acc: 0.5621Epoch 00001: val_acc improved from -inf to 0.80474, saving model to weights-simple.hdf5
5700/5700 [==============================] - 12s 2ms/step - loss: 1.0151 - acc: 0.5625 - val_loss: 0.4896 - val_acc: 0.8047
Epoch 2/5
5696/5700 [============================>.] - ETA: 0s - loss: 0.1775 - acc: 0.9477Epoch 00002: val_acc improved from 0.80474 to 0.93316, saving model to weights-simple.hdf5
5700/5700 [==============================] - 11s 2ms/step - loss: 0.1775 - acc: 0.9477 - val_loss: 0.2152 - val_acc: 0.9332
Epoch 3/5
5696/5700 [============================>.] - ETA: 0s - loss: 0.0272 - acc: 0.9961Epoch 00003: val_acc did not improve
5700/5700 [==============================] - 11s 2ms/step - loss: 0.0272 - acc: 0.9961 - val_loss: 0.2079 - val_acc: 0.9326
Epoch 4/5
5696/5700 [============================>.] - ETA: 0s - loss: 0.0074 - acc: 0.9998Epoch 00004: val_acc did not improve
5700/5700 [==============================] - 10s 2ms/step - loss: 0.0074 - acc: 0.9998 - val_loss: 0.2035 - val_acc: 0.9321
Epoch 5/5
5696/5700 [============================>.] - ETA: 0s - loss: 0.0035 - acc: 1.0000Epoch 00005: val_acc did not improve
5700/5700 [==============================] - 11s 2ms/step - loss: 0.0035 - acc: 1.0000 - val_loss: 0.2083 - val_acc: 0.9332
**Understanding the model fit**
Once we run model fit, we can see that by around the 5th epoch the training accuracy reaches 100% whereas the validation accuracy stays around 93%, which suggests that we are overfitting the data and the model is not generalizing well.
In [79]:
df = pd.DataFrame({'epochs': history.epoch, 'accuracy': history.history['acc'], 'validation_accuracy': history.history['val_acc']})
g = sns.pointplot(x="epochs", y="accuracy", data=df)
g = sns.pointplot(x="epochs", y="validation_accuracy", data=df, color='green')
**Let's look at the accuracy**
In [81]:
predicted = model.predict(X_test)
predicted = np.argmax(predicted, axis=1)
accuracy_score(y_test, predicted)
Out[81]:
0.945
With a simple model we were able to get around 94.5% accuracy on the test set.
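Since this is a multiclass problem, it can also be useful to look at per-class precision and recall rather than a single accuracy number. Here is a minimal sketch using sklearn's classification_report, where the target_names assume the alphabetical category coding from earlier (iphone=0, java=1, javascript=2, python=3):
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the held-out test set
print(classification_report(y_test, predicted,
                            target_names=['iphone', 'java', 'javascript', 'python']))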
## Recurrent Neural Networks
A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed graph along a sequence. This allows it to exhibit dynamic temporal behavior for a time sequence.
For our example we will use LSTMs to capture the notion of word order in our posts: we tend to see certain words before or after some other context word X, and we would like the model to capture that.
In [68]:
inputs = Input(shape=(MAX_LENGTH, ))
embedding_layer = Embedding(vocab_size,
128,
input_length=MAX_LENGTH)(inputs)
x = LSTM(64)(embedding_layer)
x = Dense(32, activation='relu')(x)
predictions = Dense(num_class, activation='softmax')(x)
model = Model(inputs=[inputs], outputs=predictions)
model.compile(optimizer='adam',
loss='categorical_crossentropy',
metrics=['acc'])
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_16 (InputLayer) (None, 500) 0
_________________________________________________________________
embedding_16 (Embedding) (None, 500, 128) 5868800
_________________________________________________________________
lstm_12 (LSTM) (None, 64) 49408
_________________________________________________________________
dense_33 (Dense) (None, 32) 2080
_________________________________________________________________
dense_34 (Dense) (None, 4) 132
=================================================================
Total params: 5,920,420
Trainable params: 5,920,420
Non-trainable params: 0
_________________________________________________________________
In [70]:
filepath="weights.hdf5"
checkpointer = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
history = model.fit([X_train], batch_size=64, y=to_categorical(y_train), verbose=1, validation_split=0.25,
shuffle=True, epochs=10, callbacks=[checkpointer])
Train on 5700 samples, validate on 1900 samples
Epoch 1/10
5696/5700 [============================>.] - ETA: 0s - loss: 1.1493 - acc: 0.5081Epoch 00001: val_acc improved from -inf to 0.65000, saving model to weights.hdf5
5700/5700 [==============================] - 61s 11ms/step - loss: 1.1488 - acc: 0.5082 - val_loss: 0.8437 - val_acc: 0.6500
Epoch 2/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.5963 - acc: 0.7468Epoch 00002: val_acc improved from 0.65000 to 0.75158, saving model to weights.hdf5
5700/5700 [==============================] - 59s 10ms/step - loss: 0.5962 - acc: 0.7470 - val_loss: 0.5326 - val_acc: 0.7516
Epoch 3/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.2558 - acc: 0.9152Epoch 00003: val_acc improved from 0.75158 to 0.90421, saving model to weights.hdf5
5700/5700 [==============================] - 59s 10ms/step - loss: 0.2556 - acc: 0.9153 - val_loss: 0.3438 - val_acc: 0.9042
Epoch 4/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.1269 - acc: 0.9630Epoch 00004: val_acc improved from 0.90421 to 0.91737, saving model to weights.hdf5
5700/5700 [==============================] - 59s 10ms/step - loss: 0.1268 - acc: 0.9630 - val_loss: 0.2830 - val_acc: 0.9174
Epoch 5/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.0551 - acc: 0.9854Epoch 00005: val_acc did not improve
5700/5700 [==============================] - 61s 11ms/step - loss: 0.0551 - acc: 0.9854 - val_loss: 0.3600 - val_acc: 0.8926
Epoch 6/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.0530 - acc: 0.9874Epoch 00006: val_acc did not improve
5700/5700 [==============================] - 60s 10ms/step - loss: 0.0530 - acc: 0.9874 - val_loss: 0.4529 - val_acc: 0.8989
Epoch 7/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.0191 - acc: 0.9961Epoch 00007: val_acc improved from 0.91737 to 0.93211, saving model to weights.hdf5
5700/5700 [==============================] - 61s 11ms/step - loss: 0.0191 - acc: 0.9961 - val_loss: 0.3708 - val_acc: 0.9321
Epoch 8/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.0469 - acc: 0.9907Epoch 00008: val_acc did not improve
5700/5700 [==============================] - 64s 11ms/step - loss: 0.0469 - acc: 0.9907 - val_loss: 0.3062 - val_acc: 0.9316
Epoch 9/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.0336 - acc: 0.9919Epoch 00009: val_acc did not improve
5700/5700 [==============================] - 67s 12ms/step - loss: 0.0336 - acc: 0.9919 - val_loss: 0.3125 - val_acc: 0.9305
Epoch 10/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.1496 - acc: 0.9556Epoch 00010: val_acc did not improve
5700/5700 [==============================] - 63s 11ms/step - loss: 0.1495 - acc: 0.9556 - val_loss: 0.3176 - val_acc: 0.9184
In [72]:
df = pd.DataFrame({'epochs': history.epoch, 'accuracy': history.history['acc'], 'validation_accuracy': history.history['val_acc']})
g = sns.pointplot(x="epochs", y="accuracy", data=df)
g = sns.pointplot(x="epochs", y="validation_accuracy", data=df, color='green')
In [73]:
model.load_weights('weights.hdf5')
predicted = model.predict(X_test)
In [74]:
predicted
Out[74]:
array([[3.7966363e-04, 9.9932277e-01, 1.9026371e-05, 2.7851801e-04],
[3.3336953e-05, 1.6323056e-03, 1.9113455e-03, 9.9642295e-01],
[1.1827729e-04, 7.1883442e-05, 9.9874747e-01, 1.0624601e-03],
...,
[9.9962401e-01, 2.4647295e-04, 7.8240126e-05, 5.1257037e-05],
[9.9995482e-01, 3.1498657e-05, 9.2394075e-06, 4.3481641e-06],
[9.0285734e-04, 4.3323930e-04, 9.9486190e-01, 3.8019547e-03]],
dtype=float32)
**Understanding softmax**
If you look at the last layer of the neural network, you can see that we set the output size equal to the number of classes, which means the model gives us the probability that the input belongs to each class. Hence, to get the predicted class we use argmax to find the one with the highest probability.
In [75]:
predicted = np.argmax(predicted, axis=1)
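To turn the predicted integer codes back into tag names, we can reuse the pandas categorical mapping that produced the target column. A small sketch, assuming the same data.tags.astype('category') ordering as before:
# cat.categories lists the tags in the same order as the integer codes (0..3)
tag_names = data.tags.astype('category').cat.categories
predicted_tags = [tag_names[i] for i in predicted]
predicted_tags[:3]  # e.g. ['java', 'python', 'javascript'] for the rows shown above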
Let's look at the accuracy
In [76]:
accuracy_score(y_test, predicted)
Out[76]:
0.9525
We were able to achieve an accuracy score of 95.25%, which is pretty good and an improvement over our simple model.