Building Data Science portfolio projects
A couple of years back I started learning iOS development and found that I learned a lot by building many apps. So I thought I would use the same approach to learn more about machine learning and data science by building a series of portfolio projects.
In this post, we will use machine learning to predict whether a Kickstarter project will succeed. Let's look at why this kind of model would be useful in the real world. As project creators, we always want our projects to be successful, and a tool like this would help us tune the name, description, goal amount, and keywords so that we reach the right audience.
What would be really useful is extracting features from the uploaded images to improve the accuracy of the model. I will try to look into that after the v1 version of the model. You can get the dataset on Kaggle.
Almost any machine learning problem starts with exploring the data and understanding it better, so let's get started on that.
Here are the columns:
- project_id: Unique identifier for the project
- name: Name of the Kickstarter project
- desc: Description of the project
- keywords: Keywords describing the project
- disable_communication: Whether communication with the creator is disabled
- country: Country of the project
- currency: Currency of the project
- deadline: Deadline for reaching the goal
- created_at: When the project was created
- launched: When the project was made visible
- backers_count: Number of backers
- final_status: Whether the project succeeded (the target variable)
For building the machine learning model we will engineer a few more features, like duration and cleaned_text:
- all_text: name + desc + keywords (with hyphens removed)
- duration: Total duration of the project
- days_status_changed: Number of days between the status change and the end date
- cleaned_text: all_text with punctuation removed and the text cleaned up
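As a sketch of that feature engineering, here is how the derived columns could be built with pandas. This assumes the timestamp columns are unix epoch seconds and uses a simple regex for cleaning; the exact cleaning rules and the toy rows are illustrative assumptions, not the real dataset.

```python
import pandas as pd

# Toy rows standing in for the Kaggle dataset
df = pd.DataFrame({
    "name": ["Cool Gadget"],
    "desc": ["A gadget, truly cool!"],
    "keywords": ["cool-gadget"],
    "launched": [1_400_000_000],   # assumed unix epoch seconds
    "deadline": [1_402_592_000],   # 30 days later
})

# all_text: name + desc + keywords with hyphens removed
df["all_text"] = (
    df["name"].fillna("") + " "
    + df["desc"].fillna("") + " "
    + df["keywords"].fillna("").str.replace("-", " ")
)

# duration: project length in days
df["duration"] = (df["deadline"] - df["launched"]) / (60 * 60 * 24)

# cleaned_text: lowercase and strip punctuation (assumed cleaning rule)
df["cleaned_text"] = df["all_text"].str.lower().str.replace(r"[^a-z0-9\s]", " ", regex=True)

print(df[["all_text", "duration", "cleaned_text"]].iloc[0].tolist())
```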
Exploratory Data Analysis
Scikit-learn provides a really nice abstraction for building models: the Pipeline. In our case we have both text features and numerical values, so we need to transform the text and the numerical values differently.
Let's look at how we can use a FeatureUnion to merge the features before passing them along to the machine learning algorithm. To use FeatureUnion, we first need to build a couple of transformers. These classes allow us to extract certain columns from the dataframe and pass them through different transformations.
```python
from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Selects text columns from a dataframe."""
    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        # Flatten to a plain list of strings so TfidfVectorizer can consume it
        return data_dict[self.keys].astype(str).values.ravel().tolist()


class IntItemSelector(BaseEstimator, TransformerMixin):
    """Selects numeric columns and returns them as a float array."""
    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.keys].astype(float).values
```
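As a quick, self-contained sanity check, here is what each selector feeds downstream on a toy dataframe (classes repeated so the snippet runs on its own; the flat list of strings is what TfidfVectorizer expects):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys):
        self.keys = keys
    def fit(self, x, y=None):
        return self
    def transform(self, data_dict):
        # Flatten to a plain list of strings for TfidfVectorizer
        return data_dict[self.keys].astype(str).values.ravel().tolist()

class IntItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys):
        self.keys = keys
    def fit(self, x, y=None):
        return self
    def transform(self, data_dict):
        return data_dict[self.keys].astype(float).values

df = pd.DataFrame({
    "cleaned_text": ["fund my robot", "save the bees"],
    "duration": [30, 45],
    "backers_count": [12, 7],
})

print(ItemSelector(["cleaned_text"]).fit_transform(df))
print(IntItemSelector(["duration", "backers_count"]).fit_transform(df))
```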
Let's see how we can build the model and fit the pipeline.
```python
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn import metrics

# int_features is the list of numeric column names (e.g. ['duration', 'goal'])
pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for TF-IDF features from the cleaned text
            ('pipeline', Pipeline([
                ('selector', ItemSelector(['cleaned_text'])),
                ('vect', TfidfVectorizer(stop_words='english', min_df=5, max_df=50)),
            ])),
            # Pipeline for the numeric features
            ('integer_features', Pipeline([
                ('fts', IntItemSelector(int_features)),
            ])),
        ]
    )),
    # Other classifiers we tried on the combined features:
    # ('clf', SVC(random_state=12, kernel='linear', probability=True, class_weight='balanced')),
    # ('clf', LogisticRegression()),
    ('clf', RandomForestClassifier(n_estimators=100)),
])

pipeline.fit(X_train, y_train)

predicted = pipeline.predict(X_test)
print(np.mean(predicted == y_test))
print(confusion_matrix(y_test, predicted))
print(metrics.classification_report(y_test, predicted))
```
Here are the results:

```
0.856582695224
[[16589  1826]
 [ 2051  6567]]
             precision    recall  f1-score   support

          0       0.89      0.90      0.90     18415
          1       0.78      0.76      0.77      8618

avg / total       0.86      0.86      0.86     27033
```
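As a sanity check, the accuracy, precision, and recall in the report can be recomputed directly from the confusion matrix (rows are true classes, columns are predicted classes):

```python
import numpy as np

cm = np.array([[16589, 1826],
               [2051,  6567]])  # rows: true 0/1, cols: predicted 0/1

accuracy = np.trace(cm) / cm.sum()        # (16589 + 6567) / 27033
precision_1 = cm[1, 1] / cm[:, 1].sum()   # of projects predicted successful, how many were
recall_1 = cm[1, 1] / cm[1, :].sum()      # of truly successful projects, how many we caught

print(round(accuracy, 3), round(precision_1, 2), round(recall_1, 2))  # 0.857 0.78 0.76
```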
The target class is not balanced in this case. Let's see what I mean by that:

```
df.final_status.value_counts()

0    73568
1    34561
Name: final_status, dtype: int64
```
So if we had always predicted 0 as the output, we would be correct 73568/(73568+34561) ≈ 68% of the time. Since our current model reaches 86% accuracy, we can be reasonably confident that it is learning something. When there is class imbalance, a better evaluation metric is the area under the ROC curve (AUC). Please let me know if you have any feedback.
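A minimal sketch of both numbers: the majority-class baseline from the counts above, and the `roc_auc_score` call on toy probabilities (for the real model you would pass `pipeline.predict_proba(X_test)[:, 1]` as the scores):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Majority-class baseline: always predict 0
baseline = 73568 / (73568 + 34561)
print(round(baseline, 3))  # ~0.680

# Toy labels and predicted probabilities just to show the AUC call
y_true = np.array([0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.70])
auc = roc_auc_score(y_true, y_score)
print(auc)  # 1.0 here: every positive outranks every negative
```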
Also, let me know if you have any interesting problems you want solved with machine learning/AI (add details here).
Code: Github Repo
Correction: There is a flaw in the approach above: using backers_count is a data leak, since we would not know the final backer count before the project ends, so it should not be used as a feature (as pointed out by reddit user somkoala).
That said, I think we can still use backers_count as a feature if we are predicting the probability of success on a given day after the project goes live. We can think of it as a probability curve that varies by day and backer count.
Removing backers_count brings the accuracy down to around 68%, which is roughly the same as always predicting 0. Do let me know if you get anything better than that without backers_count as a feature.
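A hedged sketch of that idea on synthetic data (the rule generating `y` here is made up purely for illustration): train on (days_live, backers_count), then sweep the backer count at a fixed day to trace the probability curve.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(12)

# Synthetic training data: columns are days_live and backers_count
X = rng.uniform([0, 0], [30, 500], size=(200, 2))
y = (X[:, 1] > 150).astype(int)  # made-up rule: enough backers -> success

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Probability curve at day 10, sweeping the backer count
grid = np.column_stack([np.full(5, 10.0), [0, 100, 200, 300, 400]])
probs = clf.predict_proba(grid)[:, 1]
print(np.round(probs, 2))  # success probability rises with backer count
```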