### Background

The City of Chicago makes available a dataset of taxi rides dating back to 2013, totaling roughly 105 million rides. These are only public taxi rides – Uber and Lyft are not included – but the dataset is believed to cover the majority of public taxi trips. Since dealing with 105 million rows is overkill, only 2016 will be analyzed, which accounts for just under 20 million taxi trips.

### Data Gathering / Cleaning

The city already did some data cleaning, removing gross outliers. I did some further cleaning by restricting the data to trips under $200. Some trips in the dataset had fares in the thousands of dollars; for my purposes, these figures were implausible, so I chose to draw the line at $200. That cutoff sits around 5 to 6 standard deviations above the mean, so it captures the vast majority of the data – only ~4,000 trips were excluded.
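As a minimal sketch of that filter in pandas – the toy DataFrame and its size are illustrative only, though the `trip_total` column name matches the fare column in the SQL queries below:

```python
import pandas as pd

# Toy stand-in for the full dataset; `trip_total` is the fare column.
trips_df = pd.DataFrame({"trip_total": [5.25, 12.50, 40.00, 250.00, 1800.00]})

# Keep only plausible fares: positive and at most $200.
cleaned_df = trips_df[(trips_df.trip_total > 0) & (trips_df.trip_total <= 200.00)]
excluded = len(trips_df) - len(cleaned_df)
```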

For anonymity purposes, trip timestamps are rounded to the nearest 15-minute interval. Getting a count of the number of trips is simple enough – the data was loaded into SQL, so all you need is `SELECT trip_start, COUNT(trip_start) FROM db GROUP BY trip_start`
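The same count can be sketched in pandas for anyone working outside SQL (the timestamps here are made up for illustration):

```python
import pandas as pd

# Timestamps already rounded to 15-minute intervals, as in the source data.
df = pd.DataFrame({
    "trip_start": pd.to_datetime([
        "2016-01-01 00:00", "2016-01-01 00:00", "2016-01-01 00:15",
    ])
})

# Equivalent of SELECT trip_start, COUNT(trip_start) ... GROUP BY trip_start
trip_counts = df.groupby("trip_start").size()
```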

I wanted some way of estimating how many taxis are on the road. It’s reasonable to assume that in Chicago, the third-largest city in the U.S., a cab driver should be able to find at least one customer an hour, so my measure of taxis on the road is the count of unique taxis that began a trip in the past hour. I also computed the hourly rate taxis were earning (the sum of all revenue earned in the past hour divided by the number of unique taxis). The SQL to accomplish all of the above:

```sql
SELECT
    trip_start,
    COUNT(trip_start) AS trips_taken,
    (
        SELECT COUNT(DISTINCT taxi_id)
        FROM taxi_trip_fare_db
        WHERE trip_start BETWEEN t.trip_start - INTERVAL '60 MINUTES' AND t.trip_start
            AND trip_total IS NOT NULL
            AND trip_total <= 200.00
    ) AS lag_60_taxi_count,
    (
        SELECT SUM(trip_total) / COUNT(DISTINCT taxi_id)
        FROM taxi_trip_fare_db
        WHERE trip_start BETWEEN t.trip_start - INTERVAL '60 MINUTES' AND t.trip_start
            AND trip_total IS NOT NULL
            AND trip_total <= 200.00
    ) AS rolling_hourly_rate
FROM taxi_trip_fare_db t
WHERE DATE_TRUNC('day', trip_start) = {0}
    AND trip_total IS NOT NULL
    AND trip_total <= 200.00
    AND trip_total > 0.00
GROUP BY trip_start
ORDER BY trip_start
```

Import the output from above into a Pandas DataFrame and you get a dataset that looks like this:
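That import step can be sketched with `pandas.read_sql`. The original query runs against Postgres, which won't work verbatim here, so an in-memory SQLite table with assumed column names stands in:

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the original Postgres database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (trip_start TEXT, trips_taken INTEGER)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?)",
    [("2016-01-01 00:00:00", 12), ("2016-01-01 00:15:00", 9)],
)

# parse_dates turns the text column into a DatetimeIndex, which the
# resample() call in the next section relies on.
rollingcount_df = pd.read_sql(
    "SELECT * FROM results", conn,
    parse_dates=["trip_start"], index_col="trip_start",
)
```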

### People take taxis exactly when you would expect

This really nicely summarizes what’s going on with taxi usage in Chicago – and what might be assumed of taxi usage in general:

- There’s a small rush on weekday mornings – people going to work
- Usage remains relatively constant throughout the day
- Usage picks up on weekday evenings – rush hour – people coming home from work
- Usage is heaviest Friday and Saturday nights – the typical weekend nightlife crowd

Below is the code I used to accomplish this. Most of it is formatting – there was a lot of small tweaking to get the output exactly as I wanted it:

```python
# first use the Pandas resample function to roll up the 15-minute intervals into hour intervals
mean_trip_count_rs_df = pd.DataFrame(rollingcount_df.trip_count.resample('1H').sum())

# extract the weekday name and hour of day from the timestamp index
mean_trip_count_rs_df['weekday_name'] = mean_trip_count_rs_df.index.weekday_name
mean_trip_count_rs_df['hour_of_day'] = mean_trip_count_rs_df.index.hour

# convert the hour of day to a string, then format it into hour:minute
mean_trip_count_rs_df.hour_of_day = mean_trip_count_rs_df.hour_of_day.astype(str)
mean_trip_count_rs_df.hour_of_day = mean_trip_count_rs_df.hour_of_day.map(lambda t: datetime.strptime(t, '%H'))
mean_trip_count_rs_df.hour_of_day = mean_trip_count_rs_df.hour_of_day.map(lambda t: datetime.strftime(t, '%H:%M'))

# create the pivot table used to build the heatmap
mean_trip_count_rs_pivot = mean_trip_count_rs_df.pivot_table(
    index='weekday_name', columns='hour_of_day', values='trip_count', aggfunc='mean')

# to order the days correctly, create a dummy column to sort on, then delete it
mean_trip_count_rs_pivot['dayorder'] = mean_trip_count_rs_pivot.index.map(lambda d: weekday_dict[d.strip()])
mean_trip_count_rs_pivot.sort_values(by='dayorder', inplace=True)
mean_trip_count_rs_pivot.drop('dayorder', inplace=True, axis=1)

# format the hour:minute timestamp into a 12-hour timestamp
mean_trip_count_rs_pivot.columns = mean_trip_count_rs_pivot.columns.map(lambda t: datetime.strptime(t, '%H:%M'))
mean_trip_count_rs_pivot.columns = mean_trip_count_rs_pivot.columns.map(lambda t: datetime.strftime(t, '%I:%M %p'))

# create the heatmap
meantripcount_plot = sns.heatmap(mean_trip_count_rs_pivot, annot=False, linewidths=0.5, cmap='Blues')
plt.title("Friday and Saturday nights are the most popular times for taxi rides")
plt.xlabel("Time of Day")
plt.xticks(rotation=45, horizontalalignment='right')
plt.ylabel("Day of Week")
plt.yticks(rotation=0)
plt.tight_layout()
```

### Taxi drivers rise to meet the demand of taxi rides

Taxis are generally driving when people are looking for taxi rides.

### The most profitable time to drive is Sunday night?

There’s not much rhyme or reason to the most profitable times to drive a taxi. Monday morning at 6 AM is one of the best times to be on the road, according to the data.

In general, drivers are seeing an increased hourly wage:

- Late weekday nights, around 9-10 PM
- Thursday evenings from 4 PM to 12 AM
- Sunday nights from 4 PM to 1 AM

One hypothesis for the increased hourly wages on Thursday and Sunday nights is more trips to and from the airports by business travelers, since the majority of them will be traveling into or out of the city on those nights.

### Taxis are busiest early Saturday/Sunday mornings and weekday mornings

Taxi utilization is a metric I created: the number of trips in an hour divided by the number of taxis on the road. It gives an idea of how busy taxis are on a per-taxi basis – how many fares each taxi picks up per hour. By this metric, taxis are busiest during the morning rush hour and very late on Friday/Saturday nights (bar-closing hours).
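A minimal sketch of the metric, assuming the hourly rollup columns produced by the SQL query earlier (the numbers here are made up):

```python
import pandas as pd

# Hypothetical hourly rollup: trips taken and unique taxis on the road.
hourly_df = pd.DataFrame({
    "trips_taken": [120, 300],
    "lag_60_taxi_count": [60, 100],
})

# Utilization = fares per taxi per hour.
hourly_df["utilization"] = hourly_df.trips_taken / hourly_df.lag_60_taxi_count
```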

Full code and more available on my GitHub

(bear with me, as of the first publishing of this post, I need to clean up my GitHub)