By Nathan Adlam
As a Data Analyst, I have been tasked with exploring how annual members and casual riders use Cyclistic bikes differently.
To go about this, I will produce a report with the following deliverables:
1. A clear statement of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. A summary of my analysis
5. Supporting visualizations and key findings
6. My top three recommendations based on my analysis
How do annual members and casual riders use Cyclistic bikes differently?
I started by downloading the most recent full 12 months of Cyclistic trip data. This data was downloaded and stored locally on my PC. This is public data that is available for use under this license. This data was provided by the company.
To begin processing the data, the data sets were loaded into Power Query in Excel. This allowed for easier processing and handling of the data. All of the datasets were aggregated into one query. The total number of rows was 5,738,612.
The next step was to clean the data.
The ride_ID column did not offer any useful information so it was deleted.
A couple of basic calculations could be made to see what kind of range of data existed. The max ride length was discovered to be 69 days, which seemed unrealistic. The following view showed some blank cells, so these rows were filtered out.
After filtering out the rows with empty cells in end_station_name and end_station_ID, I decided to do the same with start_station_name and start_station_id. This process filtered out 1.4 million rows, which seemed quite significant, considering the total number of original rows in the data was only 5.7 million.
After a second look, I found that these rows with blank cells in the end_station_name column were rides with electric bikes, typically from members. There were also a significant number of blank cells in the start_station_name that were rides with electric bikes, typically from members.
However, in order to answer the question of how casual riders and members use the bikes differently, it is important to know where they start/end their rides. One clarification that could be critical for the analysis is the meaning of these empty cells and if there is any disincentive to not start/end at an official station, and how this can be avoided. This would be a great next step for further analysis of the project.
So after filtering out the rows with blank cells, the new max ride length was around 8 days. It seemed reasonable that someone could rent a bike for a week-long trip, so this was kept as is .
A new column ride_length was created by subtracting started_at from ended_at. When trying to calculate the average ride length, it was found that some of these cells resulted in an error, because in some cases the ride_length turned out to be 0. So to fix this problem, and also remove some of the ride lengths on the very low end, every ride length under 30 seconds was removed. I couldn't see any reasonable use cases for using a bike for under 30 seconds. This would also be a great next step to clarify with Cyclistic.
I also created a new column that indicated what day of the week the ride was on. This could be useful in understanding when/how casual riders and members use the bikes differently.
Lastly, I created a column that only contained the time the ride started at. This was already included in a column with the start date, but for ease of handing within a pivot table, I separated it.
The next step was to analyze the processed data. It's important to note that some of the process/analyze steps were not done chronologically, because the analysis uncovered some other issues with the data that needed to go back for processing.
One example of this was when calculating the maximum ride length. This was done after the initial round of processing, but the result of a 69 day maximum ride-length led me to go back and clean up some data that I initially missed. I also noticed some ride lengths that were unrealistically low (less than 10 seconds), so I decided to eliminate the rows where the ride length was under 30 seconds. This final filtering left me with 4.2 million rows, after starting with 5.7 million rows.
So after the cleaning was finally finished, it was time to really analyze the data.
The following charts were created to give us a better idea of how casual riders and members use the system differently.
I found that the average ride length for casual riders was 23 minutes, 34 seconds.
For members, it was 12 minutes, 31 seconds.
This data shows when most casual riders ride, and when to focus marketing efforts. Rides per month for both members and casual riders trends up in the warmer months and down in the colder months.
This data shows that casual riders ride length increases on the weekends, while members stays relatively flatter.
From this chart, we can see that there are 2 stations that have a significantly higher number of starting rides than others. Marketing could be focused in this area.
From this chart, we can see a more balanced spread between the top stations. The location of these stations can better suggest the reason for their bike usage (commute to work, popular bike route, etc.)
On this map, the red points indicate the top 5 stations where members start rides, while the purple points indicate the top 5 stations where casual riders start their rides. It appears that the most popular casual rider starting locations are located in more touristic areas and the most popular member starting locations are more in the city.
This chart overlays when both casual riders and members start their rides. From this chart, we can see a bigger spike in usage around 8/9 AM and 5/6 PM, suggesting they use bikes to commute to/from work. Casual riders see a similar but not-quite-so-pronounced change in their trend at these times.
These are just a few basic charts that offer some insights into how casual riders and members use this bike-sharing program differently.
My business task was to determine how members and casual riders use Cyclistic bikes differently. My top 3 recommendations are as follows:
Divide casual riders into tourists and recurring riders
This can make a big difference in terms of who to market to and where to allocate resources. Marketing to tourists will be much less effective in converting them to members than finding casual users who use the system on a recurring basis. An incentive could be offered for someone to log in so their rides are tracked on an ongoing basis, for example.
Use the data found to focus marketing efforts
Based on the data, most likely sometime between May-September, between 1-7 PM, in the locations where the casual riders start most of their rides (if applicable). This could be in the form of an incentive for becoming a member, for example.
Find out why members became members and what's holding casual riders back from becoming members
A simple survey could answer some questions about why casual riders don't become members and why members became members in the first place. It could be discovered that there is some unseen frustration with the system. There are also a significant number of low ride lengths, so maybe there is something difficult with the app or it's too easy to make an error.
Thanks so much for your time and attention, I hope you enjoyed this analysis.