Continuing from the last story, IMDb Data exploration, here I'm going to predict a movie's gross from other parameters such as budget and cast_total_facebook_like. I will use a couple of machine learning models to make the prediction, specifically SVM, Random Forest, and Decision Tree.
First, I need to load the packages required for those models, for example 'randomForest' for Random Forest and 'e1071' for SVM.
library(randomForest)
library(e1071)
library(rpart)
Then I load the dataset, sample it, and exclude some parameters that I don't want to use for prediction.
I split the dataset into training data (85%) and testing data (15%), with 14 variables left for prediction.
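For reference, a minimal sketch of this step could look like the following (the file name, the dropped columns, and the seed are illustrative placeholders, not necessarily the exact ones I used):
# load the cleaned IMDb data (placeholder file name)
movie <- read.csv("movie_metadata.csv", stringsAsFactors = FALSE)
# drop text-heavy columns I don't want as predictors (illustrative choice)
drop_cols <- c("movie_title", "plot_keywords", "movie_imdb_link")
movie <- movie[, !(names(movie) %in% drop_cols)]
movie <- na.omit(movie)
# 85% / 15% train-test split
set.seed(123)
train_idx <- sample(seq_len(nrow(movie)), size = floor(0.85 * nrow(movie)))
train <- movie[train_idx, ]
test  <- movie[-train_idx, ]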
Now I fit the SVM model on the training data and measure the goodness of the prediction with RMSE.
RMSE stands for root-mean-square error. The idea of RMSE is to measure the difference between the actual data and the predicted data: the lower the RMSE, the better the prediction. The RMSE for the training data in the SVM model is 39,010,497.
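In code, the SVM fit and the RMSE calculation could look roughly like this (a sketch assuming the 'train' data frame from the split above, with 'gross' as the target):
# root-mean-square error helper
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
# regression SVM from e1071
svm_model <- svm(gross ~ ., data = train)
svm_train_pred <- predict(svm_model, train)
rmse(train$gross, svm_train_pred)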
To help visualize how good the prediction is, I plot the first 20 data points to compare the actual gross and the predicted gross.
We can see that some of the predicted values fit the actual gross fairly well, while others are off. However, this is just the training data. Now I'm going to apply the trained model to the test data.
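One way to draw this comparison (the plotting details here are illustrative, not my exact code):
# compare the first 20 actual vs. predicted gross values
plot(train$gross[1:20], type = "b", col = "blue", pch = 19,
     xlab = "Movie index", ylab = "Gross",
     main = "Actual vs. predicted gross (first 20 training points)")
lines(svm_train_pred[1:20], type = "b", col = "red", pch = 17)
legend("topright", legend = c("Actual", "Predicted"),
       col = c("blue", "red"), pch = c(19, 17))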
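Applying the trained SVM to the held-out test set is a one-liner (same assumptions as above):
svm_test_pred <- predict(svm_model, test)
rmse(test$gross, svm_test_pred)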
RMSE for training data in SVM: 39,010,497
RMSE for testing data in SVM: 42,378,464
The RMSE for the test data has increased from 39,010,497 to 42,378,464, but it is still in an acceptable range.
Now, let's see how good the prediction is with random forest.
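The random forest fit follows the same pattern (a sketch; ntree = 500 is the randomForest default, not necessarily the value I used):
rf_model <- randomForest(gross ~ ., data = train, ntree = 500, importance = TRUE)
rmse(train$gross, predict(rf_model, train))   # training RMSE
rmse(test$gross, predict(rf_model, test))     # testing RMSE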
RMSE for training data in random forest: 16,515,437
RMSE for testing data in random forest: 38,043,885
The RMSE for the training data in random forest is much lower than in SVM. Meanwhile, the RMSE for the testing data is also lower than in SVM. However, it is much higher than the RMSE for the training data, which suggests an overfitting issue. This could be addressed by tuning the model, but I'll leave that for the future.
Finally, let's see how the decision tree does.
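If I were to tune it, one possible starting point would be searching over mtry with tuneRF from the randomForest package (just a sketch of a future direction, not part of this analysis):
# search for a better mtry value; parameters here are illustrative
tuned <- tuneRF(x = train[, setdiff(names(train), "gross")], y = train$gross,
                ntreeTry = 500, stepFactor = 1.5, improve = 0.01)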
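The tree is fitted with rpart (a sketch, same assumptions as above):
dt_model <- rpart(gross ~ ., data = train, method = "anova")  # regression tree
rmse(train$gross, predict(dt_model, train))
rmse(test$gross, predict(dt_model, test))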
RMSE for training data in decision tree: 43,985,433
RMSE for testing data in decision tree: 45,508,212
We can also take a peek at what the decision tree looks like by plotting the model. To do this, I load the 'rpart.plot' package for a nicer graph. (The base plot function didn't give me an insightful graph.)
It looks like the first split, on 'num_voted_user', sends the majority of the movies (77%), those with smaller numbers of voting users, to the lower-gross side. This seems like a reasonable result, since fewer voting users and lower gross can both result from fewer people watching those movies.
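For example (a minimal call; rpart.plot offers many styling options I'm not showing here):
library(rpart.plot)
rpart.plot(dt_model)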
In conclusion, among these three models, random forest has the best performance, although it may have an overfitting issue that needs to be fixed in this case. SVM has the second-best performance in terms of RMSE. While the decision tree comes last, its splitting rules are the most readable to us.
The full code is available here.