Data Science, Machine Learning

Predicting Kickstarter project success

Building data science portfolio projects

A couple of years back I started learning iOS development and found that I learned a lot by building many apps. So I thought I would use the same approach to learn more about Machine Learning and Data Science by building machine learning portfolio projects.

In this post, we will use machine learning to predict the success of a Kickstarter project. Let's look at why this kind of model would be useful in the real world. As project creators, we always want our projects to be successful, and a tool like this would help us tune the name, description, goal amount, and keywords so that we reach the appropriate audience.

What would be really useful is using features extracted from the uploaded images to enhance the accuracy of the model. I will try to look into that after v1 of the model. You can get the dataset on Kaggle.

Almost any problem in Machine Learning starts with exploring the data and understanding more about it. Let's get started.

Here are the columns:

  1. Project_id : Unique identifier for the project
  2. name : Kickstarter project name
  3. desc : Description of the project
  4. keywords : Keywords describing the project
  5. disable_communication : Whether communication with the creator is disabled
  6. country : Country of the Kickstarter project
  7. currency : Currency of the Kickstarter project
  8. deadline : Deadline for reaching the goal
  9. state_changed_at : When the project state last changed
  10. created_at : Project created date
  11. launched : When the project is made visible
  12. backers_count : Number of backers
  13. final_status : Whether the project was successful (target variable)

Feature Engineering

For building the machine learning model, we will add a few more features such as duration and cleaned_text.

  • all_text : name + desc + keywords (with hyphens removed)
  • duration : Total duration of the project
  • days_status_changed: Number of days between status changed and the end date
  • cleaned_text: Remove punctuations and clean text
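The feature engineering above can be sketched with pandas. This is a minimal sketch on toy rows: the column names follow the dataset list above, but the exact cleaning rules (hyphen handling, punctuation regex) are my assumptions, not necessarily the original code.

```python
import pandas as pd

# Toy rows standing in for the Kaggle dataset
df = pd.DataFrame({
    "name": ["Cool Gadget"],
    "desc": ["A gadget, that does things!"],
    "keywords": ["cool-gadget"],
    "launched": pd.to_datetime(["2015-01-01"]),
    "deadline": pd.to_datetime(["2015-02-01"]),
    "state_changed_at": pd.to_datetime(["2015-02-01"]),
})

# all_text: name + desc + keywords with hyphens removed
df["all_text"] = (df["name"] + " " + df["desc"] + " "
                  + df["keywords"].str.replace("-", " ", regex=False))

# duration: total duration of the project in days
df["duration"] = (df["deadline"] - df["launched"]).dt.days

# days_status_changed: days between the status change and the end date
df["days_status_changed"] = (df["deadline"] - df["state_changed_at"]).dt.days

# cleaned_text: lowercase and strip punctuation
df["cleaned_text"] = (df["all_text"].str.lower()
                      .str.replace(r"[^\w\s]", " ", regex=True))
```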

Exploratory Data Analysis

[Exploratory data analysis plots (screenshots)]
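The kind of exploration behind the plots above can be sketched as follows. Toy data is used here so the sketch runs standalone; with the real dataset you would load the Kaggle CSV into `df` first.

```python
import pandas as pd

# Toy stand-in for the Kaggle data (columns assumed from the list above)
df = pd.DataFrame({
    "final_status": [0, 0, 0, 1, 1],
    "goal": [1000, 5000, 20000, 1500, 3000],
})

# Class balance of the target variable
print(df["final_status"].value_counts())

# Compare a numeric feature across outcomes
print(df.groupby("final_status")["goal"].mean())
```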

Scikit-learn provides a really nice feature for building models: the Pipeline. In our case, we have both text features and numerical values, so we need to transform the text and the numerical values differently.

Let’s look at how we can use FeatureUnion to merge the features before passing them along to the machine learning algorithm. To use FeatureUnion, we first need to build a couple of transformers. These transformers will allow us to extract certain columns from the dataframe and pass them through different transformations.

from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a single text column and return it as a list of strings."""
    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        # keys is a one-element list, so unwrap each single-value row
        return [x[0] for x in data_dict[self.keys].values.tolist()]

class IntItemSelector(BaseEstimator, TransformerMixin):
    """Select numeric columns and return them as a float array."""
    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.keys].astype(float).values

Let’s see how we can build the model and fit the pipeline.

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np

# int_features is the list of numeric column names, e.g. ['duration', 'backers_count']
pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for vectorizing the cleaned project text
            ('pipeline', Pipeline([
                ('selector', ItemSelector(['cleaned_text'])),
                ('vect', TfidfVectorizer(stop_words='english', min_df=5, max_df=50))
            ])),

            # Pipeline for pulling the numeric features
            ('integer_features', Pipeline([('fts', IntItemSelector(int_features))])),
        ]
    )),

    # Use a classifier on the combined features
    # ('svc', SVC(random_state=12, kernel='linear', probability=True, class_weight='balanced')),
    # ('clf', LogisticRegression())
    ('clf', RandomForestClassifier(n_estimators=100))
])

pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
print(np.mean(predicted == y_test))
print(confusion_matrix(y_test, predicted))
print(classification_report(y_test, predicted))

Here are the results:

0.856582695224
[[16589  1826]
 [ 2051  6567]]
             precision    recall  f1-score   support

          0       0.89      0.90      0.90     18415
          1       0.78      0.76      0.77      8618

avg / total       0.86      0.86      0.86     27033
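As a sanity check, the class-1 row of the report can be recomputed directly from the confusion matrix above (reading it as [[tn, fp], [fn, tp]]):

```python
# Counts taken from the confusion matrix printed above
tn, fp, fn, tp = 16589, 1826, 2051, 6567

precision = tp / (tp + fp)   # 6567 / (6567 + 1826)
recall = tp / (tp + fn)      # 6567 / (6567 + 2051)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.78 0.76 0.77
```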

The target class is not balanced in this case. Let’s see what I mean by that:

df.final_status.value_counts()
0    73568
1    34561
Name: final_status, dtype: int64

So if we had always predicted 0 as the output, we would be correct 73568/(73568+34561) ≈ 68% of the time. We can be confident that the model is learning something, given our current 86% accuracy. A better evaluation measure when we have class imbalance is the area under the ROC curve (AUC). Please let me know if you have any feedback.
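A minimal sketch of computing that with scikit-learn's roc_auc_score. The labels and scores here are toy values; with the real model you would pass y_test and the positive-class probabilities, e.g. pipeline.predict_proba(X_test)[:, 1].

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground-truth labels and predicted scores
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2])

# AUC is the probability that a random positive outranks a random negative
print(roc_auc_score(y_true, scores))
```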

Also, let me know if you have any interesting problems you want solved with Machine Learning/AI.

Code: GitHub Repo

Correction: There is a flaw in the approach mentioned above: using backers_count is a data leak, and it should not be used as a feature (as pointed out by reddit user somkoala).

But I think we can still use backers_count as a feature if we are predicting the probability of success on a given day after the project goes live. We can think of it as a probability curve that varies by day and backer count.

Removing backers_count brings the accuracy down to around 68%, which is about the same as always predicting 0. Do let me know if you get anything better without backers_count as a feature.

About the author

Shrikar

Backend/Infrastructure Engineer by Day. iOS Developer for the rest of the time.

  • Prasanna

    This may be due to class imbalance. Try learning about oversampling/undersampling to see if you can get better results even with the same features. Also check the education level of the founder and see if it correlates. Any prior ventures that were successful would also count.
