Seeing how crucial public transport is to the people who don’t drive and for protecting the environment, public transport needs to be in its best shape to move passengers conveniently, especially with the numerous possibilities big data brings to public transport performance.
One aspect of public transport performance is occupancy data, which is essentially people counting data. Occupancy data has a huge role to play in informing passengers about the congestion levels in trains and buses to help them avoid crowded vehicles.
To get started with using people counting data, we had to familiarise ourselves with the data set we received by exploring it. Here’s how we explored people counting data in public transport.
Note: Line names have been changed to random numbers for anonymity.
Using Python, we imported Pandas and Numpy for analysing and computing our people counting data sets, as well as Matplotlib and Seaborn for visualising the computed data.
We then assigned the variable of the chosen date-time dataset of 28th February 2020 as “begin_of_corona”. It goes without saying that the date-time of 28th February 2020 was roughly when the Covid-19 pandemic started making its entrance into our lives.
# Python Configuration %matplotlib inline import pandas as pd import matplotlib.pyplot as plt import seaborn as sns import numpy as np from datetime import datetime begin_of_corona = datetime(2020, 2, 28)
The ratio of counted trips to all trips offered a clue as to whether routes were regularly counted.
The calculation for this ratio used a comma-separated values (CSV) file about the trip schedule from the Pandas library, so the file was assigned as “trip_schedule”, which was an export of the MaBinso software.
The CSV file was the basis for a pie chart we baked made to show the proportions of the following elements:
trip_schedule = pd.read_csv('trip_schedule.csv',';',low_memory=False) labels = '0 - uncounted', '1 - counted', '2 - extrapolation', '3 - replacement value' sizes = [866868/len(trip_schedule), 434581/len(trip_schedule), 569414/len(trip_schedule), 177620/len(trip_schedule)] # only "explode" the 2nd slice (i.e. 'counted') explode = (0, 0.1, 0, 0) fig1, ax1 = plt.subplots() ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90, radius=1.9) plt.show()
We designed the pie chart by using Matplotlib’s Pyplot “plt.subplots()” function which was assigned to “ax1” and “.pie” to plot the pie chart while incorporating:
With the design code assigned, we then used Matplotlib’s Pyplot “plt.show()” function to plot the final result for our interpretation.
Looking at the pie chart, it sadly didn’t help that only 21.2% of all trips in the data were counted. Why did the counted data make up such a small proportion of the overall data?
One possibility for the lower amount of counted data was that the data cleaning process may have gone beyond removing duplicates. Maybe the data engineer got carried away with the decluttering or made some human errors.
Another possibility for this issue was that since manual counting was involved, the manual counts were stopped after a certain point. Simultaneously, there were not enough automatic people counting systems to make up for this. The shortage of manual and automatic counts was more likely to influence the results than the overcleaning of data.
But a more likely possibility to consider was that the data was gathered around the time Covid-19 began making headlines and full lockdowns were enforced worldwide shortly after. This could simply mean that no data was gathered during the first full lockdown.
Was there something that could have been done better? Perhaps, public transport data collectors should establish that the show must go on, rain or sunshine. In more direct words, public transport data collection should be done consistently regardless of whether passengers will be absent or present.
Without doing so, we may have missed out on seeing the actual implication of the lockdown on the occupancy level. Who knows? There could have been a few frontliners who still needed to use the train or bus unless the public transport operators were instructed to put their services on hold during the lockdown.
This ratio was also a ratio of counted trips to all trips but for each line. So, for this calculation, we focused on just 2 columns by setting them as “valid_columns”: APC and Line.
APC stands for “automated people counting“.
We indexed the “valid_columns” using square brackets in the trip_schedule CSV file that we used for our pie chart earlier. Along with this indexing, we used a few functions:
We then assigned this whole thing to “lines”.
valid_columns = ["apc", "line"] lines = trip_schedule[valid_columns].groupby(by=["line", "apc"]).line.count().to_frame('c').reset_index().sort_values(by='c',ascending=False) plt.figure(figsize=(23, 9)) sns.set_style("white") sns.set_context("notebook", font_scale=1, rc={"lines.linewidth": 2.5}) ax = sns.barplot(data=lines[:120], x='line', y='c', hue='apc') _ = plt.xticks(rotation=90) plt.legend(loc='upper right')
We used “plt.figure(figsize=(width, height))” to create the figure object and tweak its size to the desired width and height to fit all the lines on the x-axis of the barplot. We then used Seaborn accordingly:
Since the lines were arranged from the highest “c” values to the lowest “c” values earlier, the barplot appeared like a descending mountain of sticks.
From interpreting the barplot, there are many things to question about this analysis such as:
Such questions can help determine the next steps of the data exploration.
We also visualised the timeline over which the lines were counted using a scatterplot. The aim of the scatterplot was to show when the lines were counted and not counted.
plt.figure(figsize=(23, 9)) _ = plt.title(f'Lines to Counted Trips') sns.scatterplot(data=learning_data, x="timestamp", y="line") <AxesSubplot:title={'center':'Lines to Counted Trips'}, xlabel='timestamp', ylabel='line'>
We again created the figure object, tweaked its size and used Seaborn to plot the scatterplot with the timestamps on the x-axis and the lines on the y-axis.
The “<AxesSubplot…>” part was for positioning the title “Lines to Counted Trips” in the top centre while setting the x-axis label as “timestamp” and the y-axis label as “line” (as you can see in the image below).
While the coding approach used the entire time may have room for discussion, I don’t see how the code matters more than the quality of the data set itself, which can be seen in the scatterplot.
The gaps shown in this scatterplot brings us back to the issue of inconsistent data collection (but in more detail), which is the foundation of this data exploration. So, there’s no point in talking about the rest of the steps taken in any data exploration project if the data set we rely on is poor.
For some of the lines, the Covid-19 pandemic onset could be the best possible explanation. But this scatterplot has also revealed that most of the lines didn’t have people counting data before mid-2019. There are two possible factors:
Another strange thing to note from this scatterplot is that Line 7 only showed a year’s worth of people counting data, between early 2018 and early 2019. What happened to Line 7 after that?
Not to forget, the whole timeline started from early 2018, which leads me to wonder if people counting data has been collected before 2018 or not.
We started by importing the relevant libraries for us to analyse, compute and visualise the people counting data. We then proceeded to calculate the ratios of counted trips to all trips and the same ratio for each line, which have been visualised by a pie chart and a barplot respectively. A timeline of the latter ratio was also visualised using a scatterplot.
In the process, we learnt (or rather, were reminded of) the importance of data quality in any data investigation and that public transport data collection should be done consistently regardless of whether passengers will be absent or present.
At the same time, the questions we generated from interpreting the barplot suggested other things to explore about the data. These interpretations inform our next steps for this data exploration, which will be explained via heatmaps in the next article.
Everything you need to know about NoSQL, a type of database design that offers more… Read More
ZooKeeper is an open-source Apache project that provides a centralized service for providing configuration over… Read More
There are many types of data structures out there that are meant to store data… Read More
An improved architecture and enthusiastic user base are driving uptake of the open-source web tool. Read More
This article will offer you a clear grasp of the drilling process, how to install… Read More
Introduction The term solar energy should be quite common; it represents the direct energy produced… Read More