
Deep learning with Keras and python for Multiclass Classification

In this post, we will look at using Keras to build a multiclass classification model with deep learning.

What is multiclass classification?

Multiclass classification is the more general form of classifying training samples into one of several categories. The stricter form, which you have probably already heard of, is binary classification (Spam/Not Spam or Fraud/No Fraud).

For our example, we will be using the stack overflow dataset and assigning tags to posts. You can find the dataset here.

I have grabbed around 2k samples each for 4 tags: iphone, java, javascript, and python.

We will be building a deep learning model using Keras. Why Keras? It's easy, and I am more comfortable with it.

As a rule of thumb, we should always look at our data before we start building any model.

In [92]:
import keras
import numpy as np
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, Model
from keras.layers import Input, Dense, Dropout, Embedding, LSTM, Flatten
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
plt.style.use('ggplot')
%matplotlib inline
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
In [84]:
data = pd.read_csv('stackoverflow.csv')
In [85]:
data.head()
Out[85]:
      post                                               tags
0     conventions of importing python main programs …   python
1     python write to file based on offset i want t…    python
2     enable a textbox on the selection of no from t…   javascript
3     sending mms and email from within app how doe…    iphone
4     why aren t java weak references counted as ref…   java

Check class distributions

In [86]:
data.tags.value_counts()
Out[86]:
python        2000
iphone        2000
java          2000
javascript    2000
Name: tags, dtype: int64

Convert tags to integers, as most machine learning models deal with integers or floats

An alternative would be to use sklearn's LabelEncoder and fit it on the tags column, as sketched below.

In [87]:
data['target'] = data.tags.astype('category').cat.codes
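For reference, here is a minimal sketch of the LabelEncoder alternative mentioned above. It is not used in the rest of this post; it produces the same integer codes as cat.codes, assigned in alphabetical order of the tag names.

from sklearn.preprocessing import LabelEncoder

# Sketch only: LabelEncoder alternative to the cat.codes approach used above.
label_encoder = LabelEncoder()
data['target'] = label_encoder.fit_transform(data.tags)
# label_encoder.classes_ holds the mapping from integer code back to tag name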

Calculate the number of words in each post

We would like to look at the word distribution across all posts. This information will be key later when we pass the data to the Keras model.

In [88]:
data['num_words'] = data.post.apply(lambda x : len(x.split()))

Binning the posts by word count

Ideally we would want to know how many posts are short, medium, and long. Binning is an efficient way to do that.

In [7]:
data['bins'] = pd.cut(data.num_words, bins=[0, 100, 300, 500, 800, np.inf],
                      labels=['0-100', '100-300', '300-500', '500-800', '>800'])
In [8]:
word_distribution = data.groupby('bins').size().reset_index().rename(columns={0:'counts'})
In [9]:
word_distribution.head()
Out[9]:
      bins      counts
0     0-100     3922
1     100-300   3521
2     300-500   396
3     500-800   105
4     >800      56
In [10]:
sns.barplot(x='bins', y='counts', data=word_distribution).set_title("Word distribution per bin")
Out[10]:
<matplotlib.text.Text at 0x11cdd64d0>

After these transformations, this is how our pandas DataFrame looks:

In [11]:
data.head()
Out[11]:
      post                                               tags        target  num_words  bins
0     conventions of importing python main programs …   python      3       207        100-300
1     python write to file based on offset i want t…    python      3       306        300-500
2     enable a textbox on the selection of no from t…   javascript  2       60         0-100
3     sending mms and email from within app how doe…    iphone      0       59         0-100
4     why aren t java weak references counted as ref…   java        1       26         0-100

Set number of classes and target variable

In [12]:
num_class = len(np.unique(data.tags.values))
y = data['target'].values
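Later we will pass to_categorical(y_train) to model.fit; as a quick, hedged illustration of what that conversion does (the example labels below are made up):

# to_categorical turns integer class codes into one-hot vectors, which is the label
# format our softmax model will be trained against. The labels here are made up.
example_labels = np.array([0, 2, 1, 3])
print(to_categorical(example_labels, num_classes=num_class))
# [[1. 0. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 0. 1.]]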

Tokenize the input

For a deep learning model we need to know what the input sequence length should be. The distribution graph above shows that we have fewer than 200 posts with more than 500 words.

Given that, we could set the input sequence length to max(words per post), but that would waste a lot of resources, so we make a tradeoff and set the input sequence length to 500.

In [13]:
MAX_LENGTH = 500
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data.post.values)
post_seq = tokenizer.texts_to_sequences(data.post.values)
post_seq_padded = pad_sequences(post_seq, maxlen=MAX_LENGTH)
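To get a feel for what the tokenizer and padding produce, here is a small sketch on made-up sentences (the sentences and the resulting integer ids are illustrative, not taken from the real dataset):

# Illustrative sketch: texts_to_sequences maps each word to an integer id, and
# pad_sequences left-pads with zeros up to the requested length.
toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(["python write to file", "java weak references"])
toy_seq = toy_tokenizer.texts_to_sequences(["python write to file"])
print(toy_seq)                           # something like [[1, 2, 3, 4]]
print(pad_sequences(toy_seq, maxlen=8))  # e.g. [[0 0 0 0 1 2 3 4]]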
In [14]:
X_train, X_test, y_train, y_test = train_test_split(post_seq_padded, y, test_size=0.05)
In [15]:
vocab_size = len(tokenizer.word_index) + 1

Deep Learning Model: Simple

Let's start with a simple model: an embedding layer, followed by a Dense layer and our prediction layer.

In [78]:
inputs = Input(shape=(MAX_LENGTH, ))
embedding_layer = Embedding(vocab_size,
                            128,
                            input_length=MAX_LENGTH)(inputs)
x = Flatten()(embedding_layer)
x = Dense(32, activation='relu')(x)

predictions = Dense(num_class, activation='softmax')(x)
model = Model(inputs=[inputs], outputs=predictions)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['acc'])

model.summary()
filepath="weights-simple.hdf5"
checkpointer = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
history = model.fit([X_train], batch_size=64, y=to_categorical(y_train), verbose=1, validation_split=0.25, 
          shuffle=True, epochs=5, callbacks=[checkpointer])
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_18 (InputLayer)        (None, 500)               0         
_________________________________________________________________
embedding_18 (Embedding)     (None, 500, 128)          5868800   
_________________________________________________________________
flatten_6 (Flatten)          (None, 64000)             0         
_________________________________________________________________
dense_37 (Dense)             (None, 32)                2048032   
_________________________________________________________________
dense_38 (Dense)             (None, 4)                 132       
=================================================================
Total params: 7,916,964
Trainable params: 7,916,964
Non-trainable params: 0
_________________________________________________________________
Train on 5700 samples, validate on 1900 samples
Epoch 1/5
5696/5700 [============================>.] - ETA: 0s - loss: 1.0156 - acc: 0.5621Epoch 00001: val_acc improved from -inf to 0.80474, saving model to weights-simple.hdf5
5700/5700 [==============================] - 12s 2ms/step - loss: 1.0151 - acc: 0.5625 - val_loss: 0.4896 - val_acc: 0.8047
Epoch 2/5
5696/5700 [============================>.] - ETA: 0s - loss: 0.1775 - acc: 0.9477Epoch 00002: val_acc improved from 0.80474 to 0.93316, saving model to weights-simple.hdf5
5700/5700 [==============================] - 11s 2ms/step - loss: 0.1775 - acc: 0.9477 - val_loss: 0.2152 - val_acc: 0.9332
Epoch 3/5
5696/5700 [============================>.] - ETA: 0s - loss: 0.0272 - acc: 0.9961Epoch 00003: val_acc did not improve
5700/5700 [==============================] - 11s 2ms/step - loss: 0.0272 - acc: 0.9961 - val_loss: 0.2079 - val_acc: 0.9326
Epoch 4/5
5696/5700 [============================>.] - ETA: 0s - loss: 0.0074 - acc: 0.9998Epoch 00004: val_acc did not improve
5700/5700 [==============================] - 10s 2ms/step - loss: 0.0074 - acc: 0.9998 - val_loss: 0.2035 - val_acc: 0.9321
Epoch 5/5
5696/5700 [============================>.] - ETA: 0s - loss: 0.0035 - acc: 1.0000Epoch 00005: val_acc did not improve
5700/5700 [==============================] - 11s 2ms/step - loss: 0.0035 - acc: 1.0000 - val_loss: 0.2083 - val_acc: 0.9332

Understanding the model fit

Once we run model.fit we can see that by around the 5th epoch the training accuracy reaches 100%, whereas the validation accuracy stays around 93%, which suggests that we are overfitting and the model is not generalizing well.

In [79]:
df = pd.DataFrame({'epochs':history.epoch, 'accuracy': history.history['acc'], 'validation_accuracy': history.history['val_acc']})
g = sns.pointplot(x="epochs", y="accuracy", data=df, fit_reg=False)
g = sns.pointplot(x="epochs", y="validation_accuracy", data=df, fit_reg=False, color='green')

Let's look at the accuracy on the test set

In [81]:
predicted = model.predict(X_test)
predicted = np.argmax(predicted, axis=1)
accuracy_score(y_test, predicted)
Out[81]:
0.945

With a simple model we were able to get around 94.5% accuracy on the test set.
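Since the model predicts integer codes, it can be handy to map them back to the original tag names. Here is a small sketch using the same categorical encoding we created earlier (cat.codes assigns codes in alphabetical order of the categories):

# Sketch: map predicted integer codes back to tag names.
tag_names = data.tags.astype('category').cat.categories   # iphone, java, javascript, python
predicted_tags = [tag_names[p] for p in predicted]
print(predicted_tags[:5])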

Recurrent Neural Networks

A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed graph along a sequence. This allows it to exhibit dynamic temporal behavior for a time sequence.

For our example we will use LSTMs to capture the notion of sequence in our posts: we tend to see certain words before or after other context words, and we want to capture that.

In [68]:
inputs = Input(shape=(MAX_LENGTH, ))
embedding_layer = Embedding(vocab_size,
                            128,
                            input_length=MAX_LENGTH)(inputs)

x = LSTM(64)(embedding_layer)
x = Dense(32, activation='relu')(x)
predictions = Dense(num_class, activation='softmax')(x)
model = Model(inputs=[inputs], outputs=predictions)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['acc'])

model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_16 (InputLayer)        (None, 500)               0         
_________________________________________________________________
embedding_16 (Embedding)     (None, 500, 128)          5868800   
_________________________________________________________________
lstm_12 (LSTM)               (None, 64)                49408     
_________________________________________________________________
dense_33 (Dense)             (None, 32)                2080      
_________________________________________________________________
dense_34 (Dense)             (None, 4)                 132       
=================================================================
Total params: 5,920,420
Trainable params: 5,920,420
Non-trainable params: 0
_________________________________________________________________
In [70]:
filepath="weights.hdf5"
checkpointer = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
history = model.fit([X_train], batch_size=64, y=to_categorical(y_train), verbose=1, validation_split=0.25, 
          shuffle=True, epochs=10, callbacks=[checkpointer])
Train on 5700 samples, validate on 1900 samples
Epoch 1/10
5696/5700 [============================>.] - ETA: 0s - loss: 1.1493 - acc: 0.5081Epoch 00001: val_acc improved from -inf to 0.65000, saving model to weights.hdf5
5700/5700 [==============================] - 61s 11ms/step - loss: 1.1488 - acc: 0.5082 - val_loss: 0.8437 - val_acc: 0.6500
Epoch 2/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.5963 - acc: 0.7468Epoch 00002: val_acc improved from 0.65000 to 0.75158, saving model to weights.hdf5
5700/5700 [==============================] - 59s 10ms/step - loss: 0.5962 - acc: 0.7470 - val_loss: 0.5326 - val_acc: 0.7516
Epoch 3/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.2558 - acc: 0.9152Epoch 00003: val_acc improved from 0.75158 to 0.90421, saving model to weights.hdf5
5700/5700 [==============================] - 59s 10ms/step - loss: 0.2556 - acc: 0.9153 - val_loss: 0.3438 - val_acc: 0.9042
Epoch 4/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.1269 - acc: 0.9630Epoch 00004: val_acc improved from 0.90421 to 0.91737, saving model to weights.hdf5
5700/5700 [==============================] - 59s 10ms/step - loss: 0.1268 - acc: 0.9630 - val_loss: 0.2830 - val_acc: 0.9174
Epoch 5/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.0551 - acc: 0.9854Epoch 00005: val_acc did not improve
5700/5700 [==============================] - 61s 11ms/step - loss: 0.0551 - acc: 0.9854 - val_loss: 0.3600 - val_acc: 0.8926
Epoch 6/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.0530 - acc: 0.9874Epoch 00006: val_acc did not improve
5700/5700 [==============================] - 60s 10ms/step - loss: 0.0530 - acc: 0.9874 - val_loss: 0.4529 - val_acc: 0.8989
Epoch 7/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.0191 - acc: 0.9961Epoch 00007: val_acc improved from 0.91737 to 0.93211, saving model to weights.hdf5
5700/5700 [==============================] - 61s 11ms/step - loss: 0.0191 - acc: 0.9961 - val_loss: 0.3708 - val_acc: 0.9321
Epoch 8/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.0469 - acc: 0.9907Epoch 00008: val_acc did not improve
5700/5700 [==============================] - 64s 11ms/step - loss: 0.0469 - acc: 0.9907 - val_loss: 0.3062 - val_acc: 0.9316
Epoch 9/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.0336 - acc: 0.9919Epoch 00009: val_acc did not improve
5700/5700 [==============================] - 67s 12ms/step - loss: 0.0336 - acc: 0.9919 - val_loss: 0.3125 - val_acc: 0.9305
Epoch 10/10
5696/5700 [============================>.] - ETA: 0s - loss: 0.1496 - acc: 0.9556Epoch 00010: val_acc did not improve
5700/5700 [==============================] - 63s 11ms/step - loss: 0.1495 - acc: 0.9556 - val_loss: 0.3176 - val_acc: 0.9184
In [72]:
df = pd.DataFrame({'epochs':history.epoch, 'accuracy': history.history['acc'], 'validation_accuracy': history.history['val_acc']})
g = sns.pointplot(x="epochs", y="accuracy", data=df, fit_reg=False)
g = sns.pointplot(x="epochs", y="validation_accuracy", data=df, fit_reg=False, color='green')
In [73]:
model.load_weights('weights.hdf5')
predicted = model.predict(X_test)
In [74]:
predicted
Out[74]:
array([[3.7966363e-04, 9.9932277e-01, 1.9026371e-05, 2.7851801e-04],
       [3.3336953e-05, 1.6323056e-03, 1.9113455e-03, 9.9642295e-01],
       [1.1827729e-04, 7.1883442e-05, 9.9874747e-01, 1.0624601e-03],
       ...,
       [9.9962401e-01, 2.4647295e-04, 7.8240126e-05, 5.1257037e-05],
       [9.9995482e-01, 3.1498657e-05, 9.2394075e-06, 4.3481641e-06],
       [9.0285734e-04, 4.3323930e-04, 9.9486190e-01, 3.8019547e-03]],
      dtype=float32)

Understanding Softmax

If you look at the last layer of the neural network, you can see that we set the output size equal to the number of classes, which means the model gives us the probability of the input belonging to each class. Hence, to get the prediction we use argmax to find the class with the highest probability.
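As a tiny illustration with made-up probabilities (the numbers below are not real model output):

# Two made-up rows of softmax output over the 4 classes (iphone, java, javascript, python).
example_probs = np.array([[0.01, 0.02, 0.95, 0.02],
                          [0.80, 0.10, 0.05, 0.05]])
print(np.argmax(example_probs, axis=1))  # -> [2 0], the highest-probability class per row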

In [75]:
predicted = np.argmax(predicted, axis=1)

Let's look at the accuracy

In [76]:
accuracy_score(y_test, predicted)
Out[76]:
0.9525

We were able to achieve an accuracy score of 95.25%, which is pretty good and an improvement over our simple model. On a side note, I also found the book Deep Learning with Python to be super helpful.
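To tie it together, here is a hedged sketch of how one might classify a new, unseen post with the trained LSTM model; the example text is made up, and we reuse the tokenizer fitted on the training data:

# Sketch: predict the tag for a new (made-up) post.
new_post = ["how do i parse json in java using gson"]
new_seq = tokenizer.texts_to_sequences(new_post)
new_seq_padded = pad_sequences(new_seq, maxlen=MAX_LENGTH)
probs = model.predict(new_seq_padded)
tag_names = data.tags.astype('category').cat.categories
print(tag_names[np.argmax(probs, axis=1)[0]])   # predicted tag name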


About the author

Shrikar

Backend/Infrastructure Engineer by Day. iOS Developer for the rest of the time.

  • Abdul Malik

    Where can I get stackoverflow.csv file?

  • Rajesh R Rajamani

    Hi,

    What is your view on setting the parameter padding to 'post'? I'm getting different results when I switch from the default padding to explicitly setting it to 'post'. How critical is this parameter?

    pad_sequences(test_desc_seq,padding=’post’, maxlen=MAX_LENGTH)

    • https://shrikar.com shrikar

      It depends on your use case. What I have seen is that key words tend to be at the end, so in that case pre-padding helps.

  • Rajesh R Rajamani

    I'm not using the file suggested by you. My test case is a file of one-liners that I'm trying to classify, and I want to predict new ones.

  • Rajesh R Rajamani

    Hi,

    Could you explain the following line in a bit more detail? I'm unable to understand the significance of 128. Is it the batch size? Does it mean that the input will be taken in batches of 128?

    embedding_layer = Embedding(vocab_size,
    128,
    input_length=MAX_LENGTH)(inputs)

  • Harem Yousif

    Could you please kindly share the data? Your post is really useful, but it is not clear how to download the data from this link: https://www.kaggle.com/stackoverflow/stackoverflow

  • willis

    Great post, clear, simple and instructive.
    Could you please explain how you preprocessed the code inside posts? Did you tokenize it as if it were simple words?
