Navigating the ever-expanding sprawl of New York, a city that’s home to 8.4 million people, can be nightmarish to tourists and natives alike. Ben Wellington, a professor at Pratt, has found a way to make some sense of the chaos that surrounds us by using open data. By mining the city’s data in his blog I Quant NY, he’s shown New Yorkers what balances they need on their MetroCards to zero out, exposed an imbalance between payment systems in cabs, and found the best place to watch illegal fireworks. As New York’s de facto data expert and open data pioneer, Wellington recently gave a TEDx Talk to inspire city dwellers to take advantage of the data available to them (and to push their local governments for more of it). While the professor has revealed that there are remarkable things we can do with the data, he has also shown that there are major limitations on it. He recently spoke to ANIMAL about some of those limitations and how the city can improve.

How did you start the blog?
My wife is an urban planner, and I was a computer scientist and I ended up, actually, pitching a course to Pratt about teaching statistics and the idea was to make statistics more accessible by using open data. So the students would be learning about applied statistics instead of just theoretical. So instead of using a textbook and learning about Jane and Tom, taking a test, you’re learning about CitiBike and stop-and-frisk, and learning about statistics on the way. I started that course a few years ago, and to make that course interesting, the homework had to be real, they had to find interesting facts. So I went about finding fun things in the data and this year I ended up publishing some of them and I ended up being really surprised by the interest in that.

What were the most interesting things you found in your class?
One of my favorites was the Health Inspection grade skewing. That one shows that if you plot the scores health inspectors give to restaurants, which determine the letter grade, it turns out that the scores cluster right at the edges between the letters. So, there’s a whole lot of 13s given out, which is the lowest A you can get. But there are relatively few 14s, which is the highest B you can get. It shows that these inspectors are reluctant to put these restaurants over the edge, so there’s a lot of psychological movement in the numbers.
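The clustering described above can be checked with nothing more than counts. The following is a minimal sketch, not Wellington's actual analysis: the function name and the sample scores are made up for illustration, and only the 13/14 boundary comes from the interview.

```python
from collections import Counter

def boundary_ratio(scores, last_a=13, first_b=14):
    """Return how many times more often the last A score appears
    than the first B score, or None if the B score never appears."""
    counts = Counter(scores)
    if counts[first_b] == 0:
        return None
    return counts[last_a] / counts[first_b]

# Hypothetical inspection scores (illustrative only, not real city data).
sample_scores = [9, 10, 11, 12, 12, 13, 13, 13, 13, 13, 13, 14, 15, 16, 17]

print(boundary_ratio(sample_scores))
```

A ratio far above 1 on real data would be consistent with scores piling up just inside the A range, though it would not by itself explain why.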

One of the things you do is raise questions. Do you try to find answers?
[Laughs] At this point, it’s the media. The city hasn’t really set up a good communications channel between people and the city as far as data goes. So there are no available channels to communicate with the city, which is really unfortunate. The owners of the data sets aren’t particularly responsive if you try to reach out, and it’s not kept up to date, so there is no “in,” other than for me to get people talking. I try to put it out there and if reporters are interested they often get statements from agencies. So I think I’ve had five or six agency responses this year.

In general, what has the response been like?
They’re usually very vague. The Department of Health put out a statement…basically avoiding what was said. The DOT said they appreciated the work that went into the analysis, and were giving it a read, which also doesn’t really get us anywhere. The MTA has said they were going to look into the uneven MetroCard problem when they increase fares this year [see Wellington’s update here]. And the NYPD has been silent, sadly, on one of my stories. The media had reached out to them, but they just refused to comment. That was on the Twitter post that said they were increasing bike enforcement because of an increase in bicycle collisions, but the data showed the opposite.

If there is, at best, a mediocre response, then what do you really think we can expect to change even if we have all the city’s data?
My goal is to get people talking about data and realizing that it can be a tool to help them improve operations. The FDNY was due at the end of last year to release the Fire Call data, so they’ve missed that deadline. But when that comes out, it’s a great example of data that people can come and try to build. I’ll try to build a mathematical model, which will be fun, to try to predict fires. Not every agency has the resources to do data analytics or data science. So by releasing data, they’re going to leverage the collective intellect of people who are interested in this stuff.

In general, I think that my goal is to show there are a lot of things that could be optimized and fixed if the data is released and if people have access to it. And while these fixes aren’t coming as quickly as with the hydrant, I believe that deep down, as people think forward on these programs, they are thinking about how data can play a role.

Do you think that this could open up room for start-ups and private companies?
Absolutely. There are already data companies that have sprung up to do things like normalize data: to sit in the middle, clean up all these messes, and offer enterprise customers access to a cleaner version of the data set. One company that sits there would be someone like Enigma. I think Mike Flowers, who used to be in the Office of Analytics under the Bloomberg administration, is working there now, and their job is to organize this public data in a way that is more usable.

How versed in mathematics does someone have to be to analyze data?
It’s a great question. There are all sorts of ways we can approach things. Some can be mathematical and some can be really simple. My entire class at Pratt is for urban planners who are basically using Excel. Many of the students aren’t coming in with strong math backgrounds, but once you can work with data a little bit and you appreciate the fact that it’s there, you can actually do a lot with very simple math. A lot of the work that I do is just sort of sums and counts.

How do you approach a data set if you don’t know what to look for?
I don’t go incredibly deep just because the class is not focused on statistics, but a lot of the time, the most interesting stuff can be found in outliers. So, if you count the number of tickets given per precinct and you see one is 10 times more than the other — that might be an interesting story to tell. Or, just comparing averages between things, and sometimes a story just falls out by looking at it. But it is sometimes hard to figure out what questions to ask, that is one of the challenges.
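The per-precinct comparison mentioned above comes down to a count and a threshold. Here is a minimal sketch under stated assumptions: the function name, the choice of the median as the "typical" precinct, and the sample precinct labels are all hypothetical, and the factor of 10 simply echoes the example in the answer.

```python
from collections import Counter
from statistics import median

def flag_outlier_precincts(ticket_precincts, factor=10):
    """Count tickets per precinct and return the precincts whose count
    is at least `factor` times the median precinct's count."""
    counts = Counter(ticket_precincts)
    typical = median(counts.values())
    return {p: n for p, n in counts.items() if n >= factor * typical}

# Hypothetical data: one entry per ticket, labeled by issuing precinct.
tickets = ["A"] * 3 + ["B"] * 2 + ["C"] * 40 + ["D"] * 4

print(flag_outlier_precincts(tickets))
```

A flagged precinct is only a starting point for a question, such as whether it covers more streets or more traffic, and not a conclusion.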

And are you worried about finding something that is statistically invalid?
That’s a great question. I guess the difference between what I am doing and what you’ll see a lot of outside what I’m doing…my goal is to dip my toes in the water and get people who are studying different things to talk about the data sets in their area. But I’m not doing a really in depth, statistical, rigorous analysis, which could take months and years to do, fully vetted and properly. However, I hope that I’m inspiring others to take some of this data and then take it to the next level. My goal is rapid, short research cycles. The advantage is that you can cover lots of ground with data sets from lots of agencies. But the downside is, yeah, any individual post is in fact just a blog post. It is not a research report that’s been peer reviewed to the fullest extent. In general, I try to list the caveats in my work.

What do you think of data-driven journalism, like FiveThirtyEight, and how is what you’re doing similar to or different from them?
It wasn’t until people started writing about me as “data journalism” that it’d ever even occurred to me. I have no journalism experience. It’s a steep learning curve. I’m still learning about urban planning and visualization. I would say that one of the differences is that, more often in data journalism, the people behind the story are taking it to another level than I am. So, from what I see (not always, but often), they’ll follow up and actually go and talk to the people at all the relevant places and find out what people are thinking about this and who’s been affected. So they might use the data to get started, but then push through a story, whereas I’m stopping at the data because it’s mostly a hobby at the moment.

Do you think you might end up doing advocacy or crossing into journalism with this work?
I would put myself in the advocacy area. I just feel like I’m doing it through education and getting people talking about this stuff as opposed to direct. I’ve certainly had several government agencies and city agencies and city nonprofits reach out to me through my work to talk about best practices. I’m following up on a lot of those, and I’m starting to have more conversations directly with some agencies. That’s exciting.

What are the best places to find data?
The best starting point is the New York City Open Data Portal, which has over 1,000 data sets. You can find things like parking tickets and 311 data there. I would say that’s a large part of the data, but sometimes there’s data that exists only on independent agencies’ websites, so you have to go and find those. It’s unfortunate the way it’s spread out and not centralized.

Other times, people have filed Freedom of Information Law requests and then have posted the data themselves for the public. That’s always bothersome because instead of the information eventually making its way to the public through FOIL, I’d like the government to step in and do that part themselves. It’s inefficient to have all of these people putting in a request for the same data over and over and over again.

Then of course, there’s the state open data portal, which is a really great resource. And Data.gov. I’m just starting to browse around there because I’ve mostly been New York-focused, but there’s a lot of great stuff there as well.

How could NYC improve its data sets?
First and foremost, we’re missing some key data sets. The top of the list, in my mind, is NYPD data. Cities like Seattle actually have 911 data available on the portal, so you can see 911 calls and their statistics and where they are going. Cities like Seattle, LA, Baltimore, and Chicago all have real-time, or at least fairly up-to-date, crime data, which New York lacks. Our police department data has actually taken a backseat behind many other cities, which is really unfortunate.

We’re also missing data on NYCHA (New York City Housing Authority). There’s very little information coming out, given the number of people in the city that live in public housing. It’s pretty unbelievable how closed off it is. There’s very little data about apartments and issues they have there.

Overall, I think that making sure people can find data in a centralized way is probably a really important step as well. Doing that through the current open data portal would be a great step in the right direction.

We’ve seen a lot of figures around the NYPD slowdown, but it’s the same few numbers. Politicians have said they were statistically insignificant, while others say it’s evidence that Broken Windows doesn’t work. Do you think anything can be gleaned from the data?
Yeah. So this is the exact reason that we need better open data. Data comes to us filtered through newspapers and through the media, and we’re left with a huge number of questions, as you’re referencing. What’s the raw data look like? Which neighborhoods are most affected? There’s all this data, but we can only rely on what news agencies think are the most interesting things. That sometimes misses a lot of the picture. All of this information that we receive is from reporters going down to the precincts and looking at numbers manually and asking questions. None of the things you’re reading about are coming out through released data sets. So unless you as an individual or a citizen have time to go down and find connections with the NYPD and have your questions answered, you’re kind of left in the dark. So, to answer your question: We don’t know, because it’s so much of a black hole. In a world where this data was kept up to date, like it is in Seattle, in Chicago, where we had that same access, then we could be doing a lot more to understand the slowdown. Instead, we’re left funneling the bits of information that we get. That’s really unfortunate.

The only real time NYPD data that is available in the entire system is through 311…I did study response times to 911 calls from the NYPD during that period of time, and I didn’t see much of a slowdown. You know, obviously that doesn’t tell us the whole story.

Eventually, we will be getting the parking ticket data. I look forward to that. The summons data, which you can get in other cities, they do it by precinct but they certainly don’t do it at the level you would need it to fully understand it.

What’s the oddest thing that you’ve come across since sorting through data?
I would say the oddest thing was the taxi example. I was pretty surprised to learn that for at least the last five years, you’ve been paying different tips in different cabs.

I was also struck to learn that there’s literally no way to get a zero balance on your MetroCard. It doesn’t matter how many times you buy or ride: if you hit only their buttons, you literally cannot zero it out. That surprised me as well. I mean, I know people feel that because they can never seem to get it to zero, but it was definitely surprising to me…I get so excited when I’m at the subway station and I see someone punch in $19.05.
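The arithmetic behind the $19.05 figure can be sketched in a few lines. The fare and bonus values below are assumptions for illustration and are not stated in the interview: a $2.50 base fare and a 5 percent bonus on purchases of $5 or more, with the bonus rounded to the nearest cent. All amounts are in cents to avoid floating-point error.

```python
FARE_CENTS = 250        # assumed base fare, not given in the interview
BONUS_PERCENT = 5       # assumed purchase bonus, not given in the interview
BONUS_MINIMUM = 500     # assumed minimum purchase that earns the bonus

def card_value(purchase_cents):
    """Value added to the card: purchase plus bonus, rounded to the cent."""
    if purchase_cents < BONUS_MINIMUM:
        return purchase_cents
    bonus = (purchase_cents * BONUS_PERCENT + 50) // 100
    return purchase_cents + bonus

def zeroes_out(purchase_cents):
    """True if the added value is an exact number of rides."""
    return card_value(purchase_cents) % FARE_CENTS == 0

print(card_value(1905), zeroes_out(1905))
print(card_value(1900), zeroes_out(1900))
```

Under these assumed values, a $19.05 purchase adds exactly eight rides, while a round $19.00 leaves change on the card.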

What are some of the things that you’re considering investigating going forward?
Yeah, people ask that, but I think that because these are small little studies, I tend not to have much of a roadmap. I certainly am going to be digging more into taxis…The fact that you can take a taxi and either use the meter, or you can negotiate a fare with them. The negotiated fare is supposed to only bring you to areas outside of the city, but the data shows that that’s not necessarily true, so I want to learn more about what’s going on there and whether there are some oddities. I’m also really excited about the FDNY data when it comes out.