New York MTA Challenge: A Data Science Story

Michael Green
Jan 8, 2021 · 17 min read


The Challenge

Suppose that you are a non-profit based in New York City, like The New York Women’s Foundation or The Salvation Army, and you would like to get the word out about an event you have planned. Like many non-profits, you have limited resources, so you would like to know where in the city to place people whose mission is outreach: handing out fliers and brochures, giving pop-up presentations, and otherwise marketing this future event.

You have the fliers, presentations, and other marketing materials ready. But a few questions remain:

  • Given that we will be reaching out at different locations within the New York MTA system, which of the MTA turnstiles will let us reach the most people?
  • Even if we knew which MTA turnstiles would enable us to reach the most people, knowing that people don’t just stand still at an MTA turnstile, when would be the best time to place our outreach personnel to have the best chance of reaching the most people?

These two questions can’t be answered by guesstimation alone. The non-profit needs to perform an exercise that will give it a reliable answer based on facts on the ground. The organization only has the resources to send people out once per year (let’s say).

A Way Forward: Data Science

For the purposes of this post, let’s assume the non-profit organization hires you to figure out the where and the when that will enable it to maximize outreach for its future event.

TLDR

Data Science is an inter-disciplinary field that uses science, math, domain knowledge, and software to extract actionable knowledge from observations.

What I am proposing is that we help our non-profit achieve its goal to maximize outreach by finding the following:

  • The precise MTA turnstiles at which to place personnel to achieve maximum outreach.
  • The exact day and four-hour period in the entire year to place personnel to achieve maximum outreach.
  • The ability to find the right time and place (MTA turnstile, exact day, and 4-hour period) to do outreach for an event scheduled at any time during the year.

The Data Source

For the purposes of this study, I will be using New York MTA turnstile data.

The turnstile data contains cumulative entry and exit counts for each turnstile, which are sampled at periodic audit events for most 7-day periods throughout the year.

File Naming Convention

The MTA data, as you can see from the link above, are contained within text files, where each text file contains information for a specific period of time.

The files are named so that you can quickly determine which 7-day period they cover, for any week going back to May 5, 2010 (at the time of this writing).

Below is the naming convention of turnstile traffic files:

http://web.mta.info/developers/data/nyct/turnstile/turnstile_<YY><MM><DD>.txt
  • Here <YY> is the two-digit offset from the year 2000, so 2000 + <YY> is the year in which the audit data was collected. If <YY> == 19, then the year is 2019, for example.
  • The field <MM> is the two-digit ordinal identifier for the month in which the data was taken, with <MM> == 01 being January (the first month of the year) and <MM> == 12 being December (the last).
  • The field <DD> is the day of the month on which the trailing 7-day period of audits reported in the file ends.

So here’s a specific example from the analysis I performed:

http://web.mta.info/developers/data/nyct/turnstile/turnstile_190921.txt

This represents all of the cumulative entry and exit counts taken every four hours from September 14, 2019 to September 21, 2019.

Contents of The turnstile_YYMMDD.txt Files

Each of the text files is a comma-separated values (CSV) file containing the following fields[1]:

  • C/A: Control Area.
  • UNIT: Remote Unit for a station.
  • SCP: Subunit Channel Position, which represents a specific address for a device.
  • STATION: Represents the station name at which the device is located.
  • LINENAME: Represents all train lines that can be boarded at this station. Normally lines are represented by one character.
  • DIVISION: Represents the division to which the station originally belonged: BMT, IRT, or IND.
  • DATE: Represents the date (MM-DD-YY).
  • TIME: Represents the time (hh:mm:ss) for a scheduled audit event.
  • DESC: Represents the “REGULAR” scheduled audit event (normally occurs every 4 hours).
    1. Audits may occur more than 4 hours apart due to planning or troubleshooting activities.
    2. Additionally, there may be a “RECOVR AUD” entry: This refers to a missed audit that was recovered.
  • ENTRIES: The cumulative entry register value for a device.
  • EXITS: The cumulative exit register value for a device.

Here is a snippet of a few lines of one of the CSV files used in the analysis:

Snippet of a MTA Turnstile Traffic File

Why do all of this upfront (seemingly pedantic) examination of the data?

  • Understanding the contents of the data set is key to correctly interpreting and analyzing the data.
  • Now that I know the convention used to name the text files, I could in a future revision write more code to automate downloading any specific day or time period.
  • Understanding both the data and the naming convention could enable me to abstract away these turnstile files, making them easier to handle and enabling more systematic ways of storing them, such as in a database.

Challenge 1: Accessing the Files for Analysis

The first step in performing an analysis is reading in the data. There are a few possible choices:

  • Just download the CSVs
  • Obtain an API key and access it via API
  • Use URLs

The first choice is a good idea when you have a few files that are relatively small that you only need to download once.

The second choice is the best plan of action if you plan on writing a software application (“app” for short) that will make high-frequency accesses to the data, download large amounts of data, or access specific types of data that the MTA (or some other organization) deems valuable and worthy of being metered by an API [2].

The third choice is a happy medium and it is what I use for this project. Below is the code snippet I used:
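In outline, the idea is to build each weekly file’s URL from the naming convention above and read it straight into pandas. Here is a sketch of that approach (BASE_URL, week_ending_dates, and the assumption that the 2019 files run from the first Saturday in January through the last Saturday in December are mine, for illustration):

```python
import pandas as pd

# Build the URL for a given week-ending date using the naming convention above.
BASE_URL = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"

# Saturdays in 2019 on which the MTA published weekly files (assumed range).
week_ending_dates = pd.date_range("2019-01-05", "2019-12-28", freq="7D")

# Read each weekly CSV directly from its URL and stack them into one data frame.
weekly_dfs = [
    pd.read_csv(BASE_URL.format(date.strftime("%y%m%d")))
    for date in week_ending_dates
]
concat_df = pd.concat(weekly_dfs, ignore_index=True)
```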

How Much Data Should We Download?

For the purposes of this analysis I downloaded all of the weekly turnstile data for the year 2019 for the following reasons:

  • I wanted a full year so that I could potentially find all of the ideal locations and specific points in time throughout the year at which the non-profit should place personnel. This would give the non-profit the flexibility of choosing from multiple opportune times and places if that enables them to optimize some other constraint that I did not consider as part of my analysis.
  • By downloading an entire year’s worth of data I get a more complete picture of what a typical year could look like, which will lend to a more robust analysis.

I omitted what I considered exceptional years (2020 due to the global pandemic and 2016 due to the election) so as to not include potential outliers in the analysis. I could have done a more robust analysis of multiple years, perhaps with something like an IQR analysis to identify these outliers, but that was outside the scope of this project given the time constraints.

Challenge 2: Clean Up the Data

So now the pandas data frame concat_df contains traffic data for each turnstile in the NY MTA system for 2019.

Here’s a snapshot of the first few rows of this data frame:

First 5 Rows of the Data Frame

But this data set is still in its raw form. As raw data, it is not ready to be analyzed to arrive at the insights we seek for our non-profit organization. Here are some issues we should check for and, if they exist, clean up:

  • Improperly formatted data which will make it hard or impossible to do calculations required to further analyze the dataset.
  • Data that is not correct or consistent, which would yield incorrect or misleading analysis results.

The first thing I do is check the names of the columns:
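Something along these lines:

```python
# Inspect the raw column names pulled from the CSV header.
print(concat_df.columns.tolist())
```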

Something Is Wrong with the Column Names

As you can see, the column name EXITS has a lot of extra spaces. I’ll remove those from each of the columns in the data frame using this code:
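One way to do that is to strip every column name:

```python
# Strip leading/trailing whitespace from every column name
# (the raw header pads "EXITS" with trailing spaces).
concat_df.columns = [col.strip() for col in concat_df.columns]
```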

And here’s what it looks like after that:

Column Name “EXITS” is Fixed

Next, I need to make sure that the DATE column is formatted as a datetime object and not as text, because:

Representing the dates as datetime objects will allow me to programmatically sort rows by date, calculate the amount of time between two dates, and do other operations that would not be easy using the dates in their raw form.

So here is the current data type for the entries in the DATE column:

Currently DATE is a string

After running this code, I convert each of the strings in the DATE column to a datetime object (specifically a pandas._libs.tslibs.timestamps.Timestamp, which is a datetime object):
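A sketch of that conversion, assuming pandas can infer the date and time formats from the raw strings:

```python
# Build a full timestamp from the DATE and TIME strings, and a date-only
# datetime for grouping by day.
concat_df["DATE_TIME"] = pd.to_datetime(concat_df["DATE"] + " " + concat_df["TIME"])
concat_df["DATETIME_DATE"] = pd.to_datetime(concat_df["DATE"])
```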

Note that I create two new columns:

  • DATE_TIME, which contains the datetime object representing the exact date and hour at which the entry count was audited.
  • DATETIME_DATE, which contains just the date of the audit as another datetime object.

As mentioned earlier, it’s possible that incorrect or inconsistent data can reside within raw data that is collected in the field. There are many reasons why this could happen but here are two common data collection error causes:

  • Human error: The person could have recorded the information incorrectly for a variety of reasons.
  • Machine error: The device, instrument, or machine that is being used to collect the information could have experienced an error event that was not handled by the machine itself, and/or was not reported or resolved by the operator.

One specific thing to look for is the existence of duplicate rows. According to how this information is defined, there should be only a single audit per turnstile for every four-hour period.

The code below checks for that:
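A sketch of that check (this is also the chained groupby call discussed a little further below):

```python
# Count how many audit rows share the same turnstile and timestamp.
# A count above 1 means the same audit was recorded more than once.
(concat_df
 .groupby(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"])
 .ENTRIES.count()
 .reset_index()
 .sort_values("ENTRIES", ascending=False)
 .head(5))
```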

Which yields this output:

Duplicate Rows Exist

What we did (enabled by our string-to-object conversion above) was collect all ENTRIES counts which share the same (C/A, UNIT, SCP, STATION, DATE_TIME) values and then count the number of times these ENTRIES counts are found. Ideally, we should only see 1.

Just as a quick diversion: the power of object-oriented programming languages (and the frameworks built on them) is at play in the code above, namely the concept of function call chaining.

Because of function call chaining I am able to make multiple calls on the same object in a single line of code. I don’t always do this (for example, if my intent is to be more explicit, if there’s something I’d like to do between the calls, or if, as in one case I personally experienced, the compiler or interpreter didn’t support it [said big tech company whose compiler had this problem shall remain nameless ;-) ]). But in instances like this, where we would otherwise have to make a bunch of separate calls on the same object, call chaining is very appropriate (IMHO).

More info on the groupby() method can be found here.

So we have established that there are duplicate rows in our data set. Below are the two lines of code that I used to remove them:
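Something like these two lines:

```python
# Sort chronologically within each turnstile, then keep only one row per
# turnstile-and-timestamp combination.
concat_df.sort_values(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"], inplace=True)
concat_df.drop_duplicates(subset=["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"], inplace=True)
```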

So with the first line above, I sort the data frame by the columns C/A, UNIT, SCP, STATION, and DATE_TIME. With the second line I then drop all but one of any rows that share the same values for these columns, which would indicate two audit samples recorded at the exact same time for the very same turnstile.

So now when I re-run the code two code-blocks ago I get this:

Duplicate Rows Removed

Note, I use ascending=False so that the duplicate counts are sorted in descending order. So what you see are the rows with the highest instance counts, which in this case is 1, the desired number.

Challenge 3: Getting Net Daily Entries From the Data Set

Remember the end goals we set out above? If not, here they are in short:

We want to inform the non-profit of two things to maximize outreach for its next event:

  • Where within the MTA system to place personnel to do the outreach.
  • When to place the personnel to do the outreach.

I will now postulate that the non-profit will maximize outreach when personnel are located at the right place and time, where the number of entries through a turnstile is very high. This seems to be a good first approximation: the more eyeballs that see your personnel handing out fliers, the greater the chance you hand a flier to a person who will respond. In the future, hypothesis testing via an A/B test, or finding a comparable test already conducted by someone else, would give some more concrete clarity on this notion.

So how do we get the daily entries? Let’s look at a few rows of data for a single turnstile on a single day:
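For example (the turnstile identifiers and date here are just an illustration):

```python
# All audit rows for one example turnstile on one day.
single_day = concat_df[
    (concat_df["C/A"] == "A002")
    & (concat_df["UNIT"] == "R051")
    & (concat_df["SCP"] == "02-00-00")
    & (concat_df["STATION"] == "59 ST")
    & (concat_df["DATETIME_DATE"] == "2019-09-14")
]
single_day[["DATE_TIME", "ENTRIES"]]
```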

Which generates this result:

A Single Turnstile On a Single Day

What you see is the ENTRIES count for each 4-hour period on this day. The counts increase, as they are cumulative. But what we want to know is the net number of people who enter a turnstile on a given day.

So what I propose is the following:

  1. Let the first entry count for a given day be the entry count for that day. Let’s call that day N.
  2. Find the first entry count for the immediately preceding day (day N-1) and add that to the row for the turnstile for the current day (day N).
  3. We can then subtract the number of entries on day N-1 from the number of entries on day N to get the net number of people who entered a turnstile on day N.

So here’s the code that does that:
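A sketch of that logic in pandas (my reconstruction of the steps above, not necessarily the original code):

```python
# Step 1: take the first cumulative ENTRIES reading of each day for every
# turnstile (the frame is already sorted chronologically within turnstiles).
concat_df_daily_entries = (concat_df
    .groupby(["C/A", "UNIT", "SCP", "STATION", "DATETIME_DATE"])
    .ENTRIES.first()
    .reset_index()
    .rename(columns={"ENTRIES": "DAILY_ENTRIES"}))

# Step 2: attach the previous audited day's first reading to each row by
# shifting within each turnstile's group.
grouped = concat_df_daily_entries.groupby(["C/A", "UNIT", "SCP", "STATION"])
concat_df_daily_entries["PREV_DATETIME_DATE"] = grouped["DATETIME_DATE"].shift(1)
concat_df_daily_entries["PREV_DAILY_ENTRIES"] = grouped["DAILY_ENTRIES"].shift(1)
```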

This is what concat_df_daily_entries looks like once this is all said and done:

The Row for Day N Includes the Cumulative Entry Counts for Day N and Day N-1

So, now we’re ready to rock and roll, Chuck Berry style!


So what we have per row in concat_df_daily_entries are the following tuples:

  • (DATETIME_DATE, DAILY_ENTRIES): The number of cumulative entries for a turnstile on a given day.
  • (PREV_DATETIME_DATE, PREV_DAILY_ENTRIES): The number of cumulative entries for that turnstile on the immediately preceding day.

We should be able to now get the net daily entries per day, right?

NOT YET!

We need to do more cleaning. We need to ask ourselves critical questions about this data that, when answered, will reveal whether or not this data has integrity.

So here’s an assertion: the cumulative entries always count up, which makes sense based on the definition of the ENTRIES column we read about above.

So here’s the code that checks it. In this case we should get nothing back when we run this query:
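A sketch of that check, comparing each day’s cumulative reading against the previous day’s:

```python
# Cumulative counters should never decrease from one day to the next,
# so this query should ideally return no rows.
backwards = concat_df_daily_entries[
    concat_df_daily_entries["DAILY_ENTRIES"]
    < concat_df_daily_entries["PREV_DAILY_ENTRIES"]
]
backwards.head(10)
```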

Here’s what we get:

Instances of Backwards Counting Turnstiles

So these are a few examples of the turnstile counting backwards from day N-1 to day N.

I see that among the first 10 instances of day-to-day backwards counting, the “57 ST-7 AV” station appears many times. So on a hunch I start looking at this station in more detail:
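For example, filtering the raw audits down to that station (station_57 is an assumed name):

```python
# Look at the raw four-hour audits for the station with the most
# backwards-counting instances.
station_57 = concat_df[concat_df["STATION"] == "57 ST-7 AV"]
station_57.sort_values("DATE_TIME")[["SCP", "DATE_TIME", "ENTRIES"]].head(20)
```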

This is the output:

We do see stretches where the entry counts don’t accumulate for several hours, and that’s not strange. The MTA is not busy 24x7x365.25.

But what is unexpected and needs further treatment is what we see above, i.e., the turnstile counter counting backwards. This would require further analysis (maybe with some monitoring of backwards-counting turnstiles) to find out why, but these are the likely reasons:

  • Turnstile malfunction
  • Someone or something forces numbers to roll back
  • The hardware counter could be rolling over. This can happen [3] in embedded systems in general, and given that a turnstile counter is probably built from the same stuff as other electronic systems, it can happen here.

So the thing to do when you are handed bad data is to clean it up. In this case we apply a heuristic, as demonstrated in this code:
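A sketch of that heuristic, following the four steps described just below (NET_DAILY_ENTRIES is an assumed intermediate column name):

```python
BIG_NUMBER = 1_000_000  # heuristic cap on plausible daily entries through one turnstile

def get_daily_count(row, big_number=BIG_NUMBER):
    # 1. Day-over-day difference of the cumulative counters.
    count = row["DAILY_ENTRIES"] - row["PREV_DAILY_ENTRIES"]
    # 2. A negative difference means the counter ran backwards; flip the sign.
    if count < 0:
        count = -count
    # 3. If the difference is implausibly large, fall back to the smaller
    #    of the two cumulative readings.
    if count > big_number:
        count = min(row["DAILY_ENTRIES"], row["PREV_DAILY_ENTRIES"])
    # 4. If even that is implausibly large, clear it out.
    if count > big_number:
        return 0
    return count

# Drop the first day of each turnstile, which has no previous-day reading,
# then apply the heuristic row by row.
concat_df_daily_entries = concat_df_daily_entries.dropna(subset=["PREV_DAILY_ENTRIES"])
concat_df_daily_entries["NET_DAILY_ENTRIES"] = concat_df_daily_entries.apply(
    get_daily_count, axis=1)
```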

So here’s what is being done on each row of concat_df_daily_entries:

  1. We calculate DAILY_ENTRIES - PREV_DAILY_ENTRIES.
  2. If it’s negative we flip the sign to positive.
  3. If the result is larger than what we heuristically believe to be a possible net number of people that could enter any turnstile on any given day, we assume that the real net number of people who entered on that day is the smaller of the two numbers (DAILY_ENTRIES or PREV_DAILY_ENTRIES).
  4. If that number is still larger than what we could realistically believe to be the net number of entries, we set it to zero (essentially clearing it out).

For the purposes of the calculation, we call this heuristically large number BIG_NUMBER. It’s an argument we can pass to get_daily_count() so that it can be adjusted by the programmer in case the results still appear to be wrong.

For the purposes of this challenge I assumed BIG_NUMBER is 1,000,000. I should spend time thinking about how I could arrive at a more realistic number. Maybe pointing a camera at a few turnstiles and developing a CNN [4] that could recognize people entering said turnstiles and count the actual number of people.

After performing the apply() above, I need to rename columns. In fact, looking back at the code, I should have named DAILY_ENTRIES CUMULATIVE_DAILY_ENTRIES and PREV_DAILY_ENTRIES CUMULATIVE_PREV_DAILY_ENTRIES. In software (and in engineering in general), variable names matter, especially if you ever want to hand your code to someone else to maintain with minimal aggravation for you and the person receiving the code.

So I do this:
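A sketch of that rename (NET_DAILY_ENTRIES is the assumed intermediate name from the earlier sketch):

```python
# Give the cumulative counters clearer names and let DAILY_ENTRIES mean the
# cleaned-up net count from here on out.
concat_df_daily_entries = concat_df_daily_entries.rename(columns={
    "DAILY_ENTRIES": "CUMULATIVE_DAILY_ENTRIES",
    "PREV_DAILY_ENTRIES": "CUMULATIVE_PREV_DAILY_ENTRIES",
    "NET_DAILY_ENTRIES": "DAILY_ENTRIES",
})
```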

And from here on out I redefine DAILY_ENTRIES to mean net daily entries.

Challenge 4: Visualizations Key To Insights

We now want to get some big-picture insights from our data. But what would be the best way to do this?

  • We could read each row and find the rows with the largest DAILY_ENTRIES.
  • We could calculate the mean, variance, and standard deviation for the DAILY_ENTRIES.
  • We could (finally) do some outlier detection and remove outliers.

But one thing we could do is visualize the data we have. By that I mean we could create a representation of the data that abstracts away the details we don’t care about, so that we can more readily see the aspects of the data set that we do care about, or may serendipitously find interesting.

Visualization may help us more quickly determine which of the above calculations we may want to do first, and may help us decide whether any of them are required at all.

To start off, what do the net daily entries for a single turnstile look like over the time period for which we have data?

First I split off a single turnstile from the large data set:
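A sketch of that filter, reusing the example turnstile identifiers from earlier:

```python
# One turnstile's net daily entries for the whole year.
single_turnstile = concat_df_daily_entries[
    (concat_df_daily_entries["C/A"] == "A002")
    & (concat_df_daily_entries["UNIT"] == "R051")
    & (concat_df_daily_entries["SCP"] == "02-00-00")
    & (concat_df_daily_entries["STATION"] == "59 ST")
]
```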

I then plot this smaller data set:
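And a minimal matplotlib plot of it:

```python
import matplotlib.pyplot as plt

# Net daily entries across the year for the single turnstile.
fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(single_turnstile["DATETIME_DATE"], single_turnstile["DAILY_ENTRIES"])
ax.set_xlabel("Date")
ax.set_ylabel("Net daily entries")
ax.set_title("Net daily entries for a single turnstile, 2019")
plt.show()
```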

Which yields this graph:

A Year’s Worth of Net Daily Entries For a Single Turnstile

Here are some things that jump out to me looking at this graph:

  • What happened on July 20th?
  • This is still a lot of information.
  • This plot reminds me of a sine wave.

July 20th, 2019

Looks like there was a large influx of people on that day due to a computer outage:

Computer failure cripples several New York City subway lines

The Other Observation: It’s a Sine Wave

So instead of spending time calculating statistics, I decided to plot the data I have.

  • I immediately found an outlier without typing any more code.
  • I identified a latent property of the data set, i.e., that it’s periodic.

Challenge 5: A Bigger Insight

So now let’s say that we show this plot to the non-profit. And let’s say that the non-profit then makes this request:

It’s nice that you can drill down to individual turnstiles. But this is too much detail for us. We would rather see net daily entries for each station. We also understand and agree that the data seems to have some periodicity to it. We don’t have the resources to place our personnel Sunday through Saturday at stations within the MTA to do our outreach. Can you tell us which days of the week are the best times to place our personnel at any given station?

Whoa! It seems as though they are asking for a lot. But this is a good sign. When the customer/client/colleague/friend/enemy sees what you have and asks for more, that means they’re engaged. What you don’t want is a customer that looks at your stuff and then you never hear from them again.

Feedback and engagement are golden:

So here’s what we’re going to do:

  1. We are going to group the daily entries of the turnstiles by station, which in this case means by (C/A, UNIT, STATION, DATETIME_DATE) tuples.
  2. We are going to sum the net daily entries for all turnstiles at the same station and make this the net daily entries for each station.
  3. We are going to visualize the net daily entries on a day-by-day basis for any week (and any number of weeks) in our data set.

Step 1
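Grouping and summing can be done in one pandas chain; here is a sketch (station_daily is an assumed name for the result):

```python
# Sum the net daily entries of every turnstile (SCP) that belongs to the
# same station booth and day.
station_daily = (concat_df_daily_entries
    .groupby(["C/A", "UNIT", "STATION", "DATETIME_DATE"])
    .DAILY_ENTRIES.sum()
    .reset_index())
```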

That code yields this:

I need to know the actual name of the day of the week, per the request of the non-profit. So I extract the name of the day for each date in the data set using this code:
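A sketch of that step, using pandas’ built-in day_name() accessor:

```python
# Weekday name for each date, so results can be reported per day of the week.
station_daily["DAY_OF_WEEK"] = station_daily["DATETIME_DATE"].dt.day_name()
```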

Step 2 and Step 3
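A sketch of those two steps, using the START_DATE, NUMBER_OF_WEEKS, and STATION_NAME parameters mentioned below (the week-splitting and plotting details here are my assumptions):

```python
import matplotlib.pyplot as plt

# Parameters the reader can change to look at any station and span of weeks.
STATION_NAME = "191 ST"
START_DATE = pd.Timestamp("2019-06-01")
NUMBER_OF_WEEKS = 3
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Select the station and span of weeks, then label each row with the week
# it falls into relative to START_DATE.
subset = station_daily[
    (station_daily["STATION"] == STATION_NAME)
    & (station_daily["DATETIME_DATE"] >= START_DATE)
    & (station_daily["DATETIME_DATE"]
       < START_DATE + pd.Timedelta(weeks=NUMBER_OF_WEEKS))
].copy()
subset["WEEK"] = (subset["DATETIME_DATE"] - START_DATE).dt.days // 7

# One line per week: total net daily entries for the station on each weekday.
fig, ax = plt.subplots(figsize=(12, 5))
for week, week_df in subset.groupby("WEEK"):
    by_day = week_df.groupby("DAY_OF_WEEK").DAILY_ENTRIES.sum().reindex(WEEKDAYS)
    ax.plot(WEEKDAYS, by_day.values, marker="o", label=f"Week {week + 1}")
ax.set_ylabel("Net daily entries")
ax.set_title(f"Net daily entries per day of the week, {STATION_NAME}")
ax.legend()
plt.show()
```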

Which yields this diagram:

Net Daily Entries Per Day of the Week For 3 Weeks In June 2019

So if the non-profit, based on other constraints, was thinking of placing outreach personnel at the 191 ST station, and was thinking of doing so during the first few weeks of June, then Monday and Wednesday are good days to choose, and Tuesday and Friday are bad days to choose.

Using the code above, a person could change START_DATE, NUMBER_OF_WEEKS, and STATION_NAME and get a similar graph for any station and for any number of 7-day periods in the data set.

Finally

That’s it. But there are many more things that could be calculated from this data set, or analyses done in combination with other data sets:

  • Find the stations with the largest total net entries for the year 2019.
  • Find the set of stations closest to each other that will, on a single day, have the maximum daily entries, via a multi-constraint analysis that includes the geographic proximity of the various stations.

And many more! I look forward to reading your feedback, especially about what other analyses could be done to help the non-profit find the optimal time and place to situate outreach people within the MTA system.

The code here was written by me but is based on another Jupyter notebook I developed with other data scientists as part of the project experience I gained in the Metis Data Science Bootcamp, from which we recently graduated. Here is a link to the complete notebook referred to in this blog post.

¹: Full definition of the fields is here

²: More info on how to obtain an API key from the NY MTA can be found here.

³: An embedded SW developer wondering how to deal with it.

⁴: BREAKING NEWS: I was only kidding! I meant convolutional neural network.


Michael Green

Data Scientist and Computer Architect. My views are my own.