Today I am going to use Kickstarter project data to work out how to launch a successful campaign. The dataset has 17 attributes. Here is a quick look at the data; only some of the attributes are shown.
'data.frame': 45957 obs. of 17 variables:
$ project.id : int 39409 126581 138119 237090 246101 316217 325034 407836 436325 610918 ...
$ name : chr "WHILE THE TREES SLEEP" "Educational Online Trading Card Game" "STRUM" "GETTING OVER - One son's search to finally know his father." ...
$ category : chr "Film & Video" "Games" "Film & Video" "Film & Video" ...
$ subcategory : chr "Short Film" "Board & Card Games" ...
$ status : chr "successful" "failed" "live" "successful"
$ goal : num 10500 4000 20000 6000 3500 3500 ...
$ pledged : int 11545 20 56 6535 0 3582 280 2180 1125
$ funded.percentage: num 1.0995 0.005 0.0028 1.0892 0 ...
$ backers : int 66 2 3 100 0 39 8 46 30 255 ...
$ funded.date : chr "Fri, 19 Aug 2011 19:28:17 -0000" ...
$ duration : num 30 47.2 28 32.2 30 ...
Next, since I set ‘stringsAsFactors’ = FALSE, the string attributes are stored as chr rather than factor. However, there are some attributes that I want as factors for the analysis, namely category, subcategory and status, so I’ll convert those separately. In addition, to make the analysis easier, I will keep only the complete cases, dropping rows that contain NA. (The other reason for using complete cases is that this is a relatively large dataset, so I don’t mind sacrificing some data for easier processing.)
After keeping only the complete cases, the number of entries decreased from 45957 to 45945.
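The two cleaning steps above can be sketched in base R as follows. This is only a minimal illustration: the data frame name `kick` and its toy values are made up, and only the column names come from the data preview.

```r
# Toy stand-in for the Kickstarter data; the values are made up for illustration.
kick <- data.frame(
  category    = c("Games", "Music", "Music", NA),
  subcategory = c("Board & Card Games", "Rock", "Jazz", "Pop"),
  status      = c("failed", "successful", "live", "successful"),
  goal        = c(4000, 2000, NA, 1500),
  stringsAsFactors = FALSE
)

# Convert the chosen character columns to factors.
factor_cols <- c("category", "subcategory", "status")
kick[factor_cols] <- lapply(kick[factor_cols], factor)

# Keep only the rows that have no NA in any column.
kick_clean <- kick[complete.cases(kick), ]

nrow(kick)        # number of rows before cleaning
nrow(kick_clean)  # number of rows after cleaning
```

Here `complete.cases()` returns one logical value per row, so it can be used directly as a row index.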
There are some simple questions to ask in this analysis.
1. What is the average amount pledged?
2. Is the distribution of backers skewed?
3. Is duration normally distributed?
The average pledged is 4980.75. The distribution of backers is clearly positively skewed. Finally, duration is not normally distributed.
Now, let’s take a more strategic approach.
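One way to check these three kinds of question numerically is sketched below. It runs on simulated data rather than the real columns, and the `skewness` helper is my own small definition, not a base R function. Note that `shapiro.test()` accepts at most 5000 values, so on the full dataset a random sample would be needed.

```r
set.seed(1)
# Simulated right-skewed values standing in for the backers column.
backers <- rexp(1000, rate = 1 / 50)

# Sample skewness: the third standardized moment.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

mean(backers)      # the average
skewness(backers)  # a positive value indicates a positive (right) skew

# Shapiro-Wilk normality test; a small p-value argues against normality.
p_value <- shapiro.test(backers)$p.value
p_value
```

The same calls could be applied to pledged or duration in place of the simulated vector.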
How to launch a successful campaign? To find the answer, let’s break it down to several questions.
First, what is the best length for a campaign? To define “best”, there are a couple of measurements to choose from, and I chose funded percentage. Funded percentage is pledged divided by goal, so a funded percentage of 1 or greater means the campaign has reached its goal.
I didn’t set a limit on funded percentage, so there are some outliers that stretch the graph all the way up.
To fix this problem, I am going to keep only the data within the interquartile range (IQR) to avoid the outliers, which also makes sense for finding the best campaign settings in a typical situation. Because funded percentage is derived from goal and pledged, I checked the distribution of both first. It turns out that both are positively skewed. So I kept the IQR of both, meaning I excluded rows where either goal or pledged is above its 75th percentile or below its 25th percentile. Then I checked the distribution of both again.
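As a quick worked example of the definition, here are the first four goal and pledged values from the data preview above. The variable names are only for illustration.

```r
# First four goal and pledged values from the data preview.
goal    <- c(10500, 4000, 20000, 6000)
pledged <- c(11545, 20, 56, 6535)

# Funded percentage is pledged divided by goal.
funded_percentage <- pledged / goal
round(funded_percentage, 4)

# A campaign has reached its goal when pledged is at least the goal.
reached_goal <- funded_percentage >= 1
reached_goal
```

The rounded results match the funded.percentage column in the preview, and the two campaigns that reached their goal are the ones listed as successful.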
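The IQR filter described above could look roughly like this. It is a sketch on simulated skewed data: `kick` and the `in_iqr` helper are hypothetical names, and only `data_IQR` is the name used in this post.

```r
set.seed(42)
# Simulated positively skewed goal and pledged values (not the real data).
kick <- data.frame(
  goal    = rlnorm(200, meanlog = 8, sdlog = 1),
  pledged = rlnorm(200, meanlog = 7, sdlog = 1.5)
)

# TRUE where x lies between its own 25th and 75th percentiles (inclusive).
in_iqr <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  x >= q[1] & x <= q[2]
}

# Keep only the rows where both goal and pledged fall inside their IQR.
data_IQR <- kick[in_iqr(kick$goal) & in_iqr(kick$pledged), ]

nrow(kick)
nrow(data_IQR)
summary(data_IQR)
```

Because both conditions must hold at once, the filtered data is noticeably smaller than half of the original rows.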
After taking the IQR:
Now it looks better. Let’s check duration vs funded percentage again.
Though the graph looks kind of messy, we can still see that funded percentage is higher when duration is 30 than at other durations. Durations from 60 to 90 are clearly not good for a campaign, and fewer campaigns are set to those durations.
On the other hand, the graph is basically divided into two parts, funded percentage above 1.0 and below 1.0, which correspond to successful projects and all other projects (failed, live, etc.). The distributions of these two parts are quite similar.
Now let’s check what the best goal is for a successful campaign. I will keep using data_IQR.
Surprisingly, no project with a goal over 4000 is successful. It is also pretty clear that funded percentage is highest when the goal is at 2000 and decreases afterward. Again, since funded percentage is derived from goal and pledged, I checked pledged to see exactly how much these projects receive. It shows that the maximum pledged is about 4000, and that when the goal is lower than 4000, goal and pledged are clearly positively related. Together, these two points explain why no project with a goal over 4000 is successful.
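To put a number on how success changes with the goal, one option is to bin the goal and compute the share of campaigns in each bin that reached their goal. The sketch below uses made-up values and a bin width of 1000 that I picked for illustration; neither comes from the analysis above.

```r
# Made-up goal and pledged values for illustration only.
kick <- data.frame(
  goal    = c(500, 1500, 1800, 2500, 3500, 3900),
  pledged = c(600, 1700, 900, 2600, 1000, 500)
)
kick$funded.percentage <- kick$pledged / kick$goal

# Group the goals into bins of width 1000: (0,1000], (1000,2000], ...
kick$goal_bin <- cut(kick$goal, breaks = seq(0, 4000, by = 1000))

# Share of campaigns per bin whose funded percentage is at least 1.
success_rate <- tapply(kick$funded.percentage >= 1, kick$goal_bin, mean)
success_rate
```

Taking the mean of a logical vector gives the proportion of TRUE values, which is why `mean` works as the summary function here.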
The next question is: What type of projects would be most successful at getting funded?
To answer this one, I’ll change my measurement from funded percentage to pledged, since the question is about getting funded. Hence, I changed the data to take the IQR of pledged only, rather than of both pledged and goal. Then I checked the distribution of pledged in every category.
In terms of average, Dance, Music and Theater are almost the same, and higher than the other categories. But it’s clear that Music is mostly higher than the other two. Therefore, Music is the most successful category.
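A per-category comparison like this one can be sketched with `aggregate()`. The numbers below are invented purely to make the example run and do not reflect the real category averages.

```r
# Invented pledged values per category, for illustration only.
kick <- data.frame(
  category = factor(c("Music", "Music", "Dance", "Dance", "Games", "Games")),
  pledged  = c(1500, 2500, 1800, 2000, 300, 900)
)

# Mean pledged per category, sorted from highest to lowest.
by_category <- aggregate(pledged ~ category, data = kick, FUN = mean)
by_category <- by_category[order(-by_category$pledged), ]
by_category

# To compare whole distributions rather than means, a box plot per
# category is one option:
# boxplot(pledged ~ category, data = kick)
```

Looking at the full distribution matters here because categories with nearly equal averages can still differ in their spread.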
Finally, the last question is: what is the best month/day/time to launch a campaign?
This time I am going to use data_IQR and another attribute, funded date, to get the answer. To get the month and day from funded date, I need to use the strptime function to convert funded date from chr to a time type. Then I use months and weekdays to get the month and day values and add them as two new attributes. Besides getting month and day, I also reorder the factor levels.
Extracting the time from funded date is a little more complex than extracting month and day, since there is no function I can use to do so directly. There is a package, ‘chron’, for extracting time, but I couldn’t install it. So I used another package, ‘lubridate’, to work around this. The idea is to use lubridate to get the hour and minute separately as chr values and paste the two values together with the paste function. The result is still of chr type, so I need to convert it back to a time type.
Finally, let’s see what these graphs tell us.
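The month and day extraction can be sketched like this. The first date string is the one shown in the data preview and the second is made up. The format string is my reading of that layout, and the sketch assumes an English locale, since `months()` and `weekdays()` return locale-dependent names.

```r
# First value is from the data preview; the second is made up.
kick <- data.frame(
  funded.date = c("Fri, 19 Aug 2011 19:28:17 -0000",
                  "Sun, 02 Jan 2011 08:15:00 -0000"),
  stringsAsFactors = FALSE
)

# Parse the chr column into a date-time (stored as POSIXct in UTC).
when <- as.POSIXct(
  strptime(kick$funded.date, format = "%a, %d %b %Y %H:%M:%S %z", tz = "UTC")
)

# Extract month and weekday names, with levels in calendar order
# instead of the default alphabetical order.
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
kick$month <- factor(months(when), levels = month.name)
kick$day   <- factor(weekdays(when), levels = day_levels)

kick[, c("month", "day")]
```

Setting the levels explicitly is what keeps the months and days in calendar order on a plot axis.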
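A rough sketch of that workaround is below, assuming the lubridate package is installed. The zero-padding with `sprintf()` and the final conversion through `strptime()` are my own choices for the illustration. The first timestamp matches the data preview and the second is made up.

```r
library(lubridate)  # assumed to be installed

# Already-parsed launch times in UTC.
when <- as.POSIXct(c("2011-08-19 19:28:17", "2011-01-02 08:05:00"), tz = "UTC")

# Get hour and minute separately and format them as two-digit strings.
hh <- sprintf("%02d", hour(when))
mm <- sprintf("%02d", minute(when))

# Paste them together; the result is still of chr type.
hhmm <- paste(hh, mm, sep = ":")
hhmm

# Convert back to a time type. strptime() fills in the current date,
# so only the time-of-day part is meaningful afterwards.
time_of_day <- as.POSIXct(strptime(hhmm, format = "%H:%M", tz = "UTC"))
format(time_of_day, "%H:%M")
```

Since every value ends up on the same date, the times can be compared or plotted on a common axis.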
- Don’t launch a campaign in June or July, period. Other months are fine.
- No day except Sunday has much influence on funded percentage. Just don’t launch a campaign on Sunday and it will be fine.
- Any launch time outside the period from 08:00 to 12:00 is fine, but 04:00 might be the best time of all.
Conclusion:
- A duration of 30 gives the highest probability of a successful campaign. Better not to set the duration from 40 to 60.
- Don't set the goal over 4000. The probability of a successful campaign decreases as the goal goes from 2000 to 4000.
- Music, Dance and Theater are likely to get more pledged. Music is the best among them.
- Launching a campaign in June or July might not be a good idea, and neither is launching on Sunday. Better not to launch a campaign in the period from 08:00 to 12:00.