Preprocessing and Exploratory Data Analysis of Large-Scale Taxi GPS Traces

1. Preprocessing and Exploratory Data Analysis of Large-Scale Taxi GPS Traces

1.1. Introduction

The dataset records millions of taxi trips in Manhattan, New York, in a given year. It has been used extensively to study the dynamics of urban taxi flows. For example, researchers at MIT have used it to evaluate the ride-sharing potential of the city (Santi et al., 2014) and to estimate the minimum taxi fleet needed to serve all travel demand in the city (Vazifeh et al., 2018).

You will be asked to preprocess the dataset, play with it, and derive meaningful statistics through exploratory data analysis. To start, you are provided with the following two files:

− taxi_id.csv.bz2

− intersections.csv

The first compressed file (taxi_id.csv.bz2) records the origin and destination of the taxi trips along with the timestamps. For simplicity, the origin and destination of the actual trips have been matched to the nearest road intersections. The format of this file is as follows:

taxi_id, pick_up_time, drop_off_time, pick_up_intersection, drop_off_intersection

The taxi_id is a numerical value that uniquely identifies each taxi. pick_up_time and drop_off_time are expressed in Unix epoch time, and pick_up_intersection, drop_off_intersection are the indices of the intersections (numbers from 1 to 4091).
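The file format above can be read directly from the compressed archive. The following is a minimal sketch using only the Python standard library. The names Trip and read_trips are illustrative, not part of the assignment. It assumes that every field is an integer, that timestamps are whole seconds, and that the file has no header row unless has_header=True is passed. Check a few lines of the real file before relying on any of these assumptions.

```python
import bz2
import csv
from typing import Iterator, NamedTuple


class Trip(NamedTuple):
    """One row of the trips file (field order taken from the format line above)."""
    taxi_id: int
    pick_up_time: int           # Unix epoch, assumed to be whole seconds
    drop_off_time: int          # Unix epoch, assumed to be whole seconds
    pick_up_intersection: int   # intersection index, 1 to 4091
    drop_off_intersection: int  # intersection index, 1 to 4091


def read_trips(path: str, has_header: bool = False) -> Iterator[Trip]:
    """Stream trips from a bz2-compressed CSV without unpacking it to disk."""
    with bz2.open(path, mode="rt", newline="") as handle:
        reader = csv.reader(handle, skipinitialspace=True)
        if has_header:
            next(reader, None)  # skip the column-name row (assumption: at most one)
        for row in reader:
            if not row:
                continue  # tolerate blank lines
            yield Trip(*(int(field) for field in row[:5]))
```

Because read_trips is a generator, the number of trips and the set of distinct taxi_id values can be accumulated in a single pass without holding the whole file in memory.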

The second file (intersections.csv) lists the street intersections to which pick-up and drop-off points were snapped. The format of the file is:

id, latitude, longitude

where id is a progressive identifier from 1 to 4091 and latitude and longitude are the GPS coordinates of the intersection. Below are two screenshots of these road intersections:

1.2. Tasks

(1) How many unique taxis are there in this dataset, and how many trips are recorded?

(2) What is the distribution of the number of trips per taxi? Who are the top performers?

(3) How does the daily trip count (i.e., number of trips per day) change throughout the year? Any rhythm or seasonality?

(4) What is the distribution of the number of departure trips at different locations (i.e., intersections)? What about the distribution of arrival trips? What will you conclude from these two distributions?

(5) How does the number of trips change over time in a day? (You will be given three dates randomly selected from the dataset and asked to plot the hourly variation of trips in local time.)

(6) What is the probability distribution of the trip distance (measured as straight-line distance)? How about travel time (i.e., trip duration)? What will you conclude from these two distributions?
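For task (6), one common way to approximate the straight-line distance between two intersections is the haversine (great-circle) formula. The sketch below assumes a spherical Earth with a mean radius of 6371 km. The helper names and the coords dictionary (intersection id mapped to a latitude and longitude pair) are illustrative assumptions, not part of the assignment. Trip duration needs no formula: it is simply drop_off_time minus pick_up_time.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical-Earth assumption


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def trip_distance_km(coords, origin_id, destination_id):
    """Distance between two intersections; coords maps id -> (latitude, longitude)."""
    return haversine_km(*coords[origin_id], *coords[destination_id])
```

At the scale of Manhattan the difference between this and a projected planar distance is small, so either choice should be acceptable as long as it is stated in the report.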

You are required to provide figures along with your answers. Note that some of the above questions are open ended, and the answers could vary among students.
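Tasks (3) and (5) both require converting Unix epoch timestamps to local calendar dates and hours. The sketch below shows one way to do this with the standard library. The function names are illustrative, and the choice of pick-up time (rather than drop-off time) to assign a trip to a day or hour is an assumption. The time zone is passed in as an argument. For New York, zoneinfo.ZoneInfo("America/New_York") (Python 3.9+) handles daylight saving time, whereas a fixed UTC offset would not.

```python
from collections import Counter
from datetime import datetime


def daily_counts(pick_up_times, tz):
    """Map each local calendar date to the number of pick-ups on that date."""
    return Counter(datetime.fromtimestamp(t, tz=tz).date() for t in pick_up_times)


def hourly_counts(pick_up_times, day, tz):
    """Return a 24-element list: pick-ups per local hour (0-23) on the given date."""
    counts = Counter()
    for t in pick_up_times:
        local = datetime.fromtimestamp(t, tz=tz)
        if local.date() == day:
            counts[local.hour] += 1
    return [counts[hour] for hour in range(24)]

# Example call (requires time zone data to be installed on the system):
#   from zoneinfo import ZoneInfo
#   hourly_counts(times, some_date, ZoneInfo("America/New_York"))
```

The 24-element list from hourly_counts can be plotted directly as a bar or line chart for each of the three dates.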

1.3. What to submit

− A Word document or PDF file with answers to (1) – (6)

− The computer code used. If particular software is used, please explain how it was used to derive the answers.

References

Santi, P., Resta, G., Szell, M., Sobolevsky, S., Strogatz, S. H., & Ratti, C. (2014). Quantifying the benefits of vehicle pooling with shareability networks. Proceedings of the National Academy of Sciences, 111(37), 13290-13294.

Vazifeh, M. M., Santi, P., Resta, G., Strogatz, S. H., & Ratti, C. (2018). Addressing the minimum fleet problem in on-demand urban mobility. Nature, 557(7706), 534.

2. Derive Community Structures from Taxi Flow Network

2.1. Introduction

You have performed an exploratory data analysis to derive meaningful statistics of the taxi GPS dataset. Some of the questions allow us to generate insights into the spatial or temporal distribution of travel demand in NYC. However, none of the questions require us to couple trip origins and destinations together. In other words, the interactions among different locations in the study area remain unexplored.

You will be asked to perform a network analysis, namely community detection, to uncover the hidden structures in the flow network derived from the taxi GPS dataset. According to Wikipedia:

“In the study of complex networks, a network is said to have community structure if the nodes of the network can be easily grouped into (potentially overlapping) sets of nodes such that each set of nodes is densely connected internally. In the particular case of non-overlapping community finding, this implies that the network divides naturally into groups of nodes with dense connections internally and sparser connections between groups. But overlapping communities are also allowed. The more general definition is based on the principle that pairs of nodes are more likely to be connected if they are both members of the same community(ies), and less likely to be connected if they do not share communities. A related but different problem is community search, here the goal is to find a community that a certain vertex belongs to.”

Thus, the community detection algorithm(s) can be applied to our taxi GPS dataset to derive community structures such that the taxi flows (e.g., origin-destination trips) within the communities are denser while inter-community flows are sparser. The results could generate insights into the spatial interactions among different locations in the city.
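A standard way to quantify "denser within, sparser between" is modularity, the quantity that the method of Blondel et al. (2008) seeks to maximise. For a weighted undirected network with total edge weight m, one common form is Q = sum over communities c of [ w_in(c) / m − (s(c) / (2m))^2 ], where w_in(c) is the weight of the edges inside c and s(c) is the summed weighted degree of its nodes. The sketch below computes this value in plain Python. The function name and input layout are illustrative assumptions, and the code is meant as an aid to understanding, not as a replacement for a library implementation.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Weighted modularity Q of an undirected graph.

    edges: iterable of (u, v, weight), each undirected edge listed once.
    community_of: dict mapping every node to its community label.
    """
    total_weight = 0.0             # m: sum of all edge weights
    strength = defaultdict(float)  # weighted degree of each node
    internal = defaultdict(float)  # edge weight lying entirely inside each community
    for u, v, w in edges:
        total_weight += w
        strength[u] += w
        strength[v] += w           # a self-loop therefore adds 2w to its node
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    if total_weight == 0:
        return 0.0
    community_strength = defaultdict(float)
    for node, s in strength.items():
        community_strength[community_of[node]] += s
    return sum(
        internal[c] / total_weight - (community_strength[c] / (2 * total_weight)) ** 2
        for c in community_strength
    )
```

As a sanity check, two triangles joined by a single edge and split into their two natural communities give Q = 5/14, while placing all six nodes in one community gives Q = 0.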

2.2. About Community Detection

2.2.1. Derive taxi flow network from the GPS dataset

The taxi GPS dataset makes it possible to derive origin-destination (OD) trips, which contain rich information about the interactions among locations in the Manhattan area. Usually, a community detection algorithm is run over a network in which the nodes represent particular entities and the links (and their weights) denote the interactions among these nodes.

In this assignment, you will first derive a flow network in which the nodes represent locations (or places) in the study area and each link between two nodes is weighted by the total number of taxi trips between the corresponding locations.

Instead of representing the nodes using road intersections, we use taxi zones in this analysis to represent different places (i.e., nodes) in the flow network. The reason is that using taxi zones significantly reduces the number of nodes and edges in the network, which makes the computation time of the community detection more reasonable.

To accomplish this task, you are provided with another file:

− intersection_to_zone

This file maps each road intersection onto a particular taxi zone in Manhattan. Two columns, namely inter_id and zone_id, denote the id of the road intersection and the taxi zone, respectively.

What you need to do is to analyze the original taxi GPS dataset, derive the OD trips at the level of road intersections, and transform the network onto taxi zones. To make it clear, the final network used for the community detection consists of 63 nodes that denote the taxi zones in Manhattan, with the weight of the edges as the number of taxi trips between the zones.
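The aggregation from intersections to zones can be sketched as follows. The function name is illustrative, and the inputs are assumed to be already loaded: trips as pairs of intersection ids, and inter_to_zone as a dictionary built from the inter_id and zone_id columns. Skipping trips whose endpoints have no zone is an assumption made here for robustness. It is worth counting how many trips are dropped this way.

```python
from collections import Counter


def zone_flows(trips, inter_to_zone):
    """Aggregate intersection-level OD pairs into directed zone-to-zone trip counts.

    trips: iterable of (pick_up_intersection, drop_off_intersection).
    inter_to_zone: dict mapping inter_id -> zone_id.
    Returns a Counter keyed by (origin_zone, destination_zone).
    """
    flows = Counter()
    for origin, destination in trips:
        try:
            key = (inter_to_zone[origin], inter_to_zone[destination])
        except KeyError:
            continue  # assumption: ignore trips with an unmapped endpoint
        flows[key] += 1
    return flows
```

The result keeps direction and intra-zone trips. Whether to keep, merge, or drop them before community detection is a modelling choice that should be stated in the report.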

2.2.2. Perform the community detection algorithm

There are many community detection algorithms (https://en.wikipedia.org/wiki/Community_structure), each with its own pros and cons. In this assignment, you are asked to apply the algorithm proposed by Blondel et al. (2008). The implementation is described in detail in that article.

However, to facilitate the analysis, you are encouraged to use existing libraries and APIs. In particular, igraph (http://igraph.org/), a collection of network analysis tools, lets you run this algorithm from several programming languages, such as Python, R, and C. The Python API for this algorithm can be found at https://igraph.org/python/api/develop/igraph.community.html.

A few things for your attention:

• You first have to figure out how to install the package (http://igraph.org/python/);

• And then follow the tutorial and learn how to establish a graph, i.e., network (http://igraph.org/python/doc/tutorial/tutorial.html);

• Then, apply the so-called multilevel community detection algorithm proposed by Blondel et al.
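The steps above can be sketched with python-igraph as follows. The input is assumed to be a dictionary of directed zone-to-zone trip counts, keyed by (origin_zone, destination_zone). This is a hypothetical format, as are the function names. Folding the two directions into one undirected weight is a simplifying assumption of this sketch, not a requirement of the assignment. Graph.TupleList and Graph.community_multilevel are existing python-igraph calls, but check the documentation for the version you install.

```python
from collections import Counter


def symmetrize(directed_flows):
    """Fold directed (origin, destination) -> count into undirected weighted edges."""
    undirected = Counter()
    for (a, b), count in directed_flows.items():
        undirected[(min(a, b), max(a, b))] += count
    return [(a, b, w) for (a, b), w in sorted(undirected.items())]


def detect_communities(weighted_edges):
    """Run igraph's multilevel method; returns {zone_id: community index}."""
    import igraph as ig  # imported lazily so symmetrize() works without igraph

    # The third element of each tuple becomes the "weight" edge attribute.
    graph = ig.Graph.TupleList(weighted_edges, directed=False, weights=True)
    clustering = graph.community_multilevel(weights="weight")
    return dict(zip(graph.vs["name"], clustering.membership))
```

Calling detect_communities(symmetrize(flows)) yields one community index per zone, which can then be joined to a zone map for plotting.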

2.3. Tasks

(1) Derive the flow network at the level of taxi zones.

(2) Understand the multilevel community detection algorithm and perform it over the flow network. The output of your analysis would be a collection of clusters or communities, with each community including a list of taxi zones with frequent interactions.

(3) You are asked to form the flow network during the following time periods, and derive the corresponding communities:

• Using taxi trips that occurred during 07:00 – 09:00 throughout the whole year

• Using taxi trips that occurred during 16:00 – 18:00 throughout the whole year

• Using taxi trips of each of the twelve months.
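One way to organise the 14 scenarios is to tag each trip with the scenarios it belongs to and then build one flow network per tag. The sketch below makes several assumptions that the assignment leaves open: membership is decided by pick-up time, times are interpreted in the local time zone passed as tz, and the time windows are half-open (07:00 inclusive to 09:00 exclusive). The label strings and the function name are illustrative.

```python
from datetime import datetime


def scenario_labels(pick_up_epoch, tz):
    """Return the scenario names (out of 14) that a trip contributes to."""
    local = datetime.fromtimestamp(pick_up_epoch, tz=tz)
    labels = [f"month_{local.month:02d}"]  # every trip belongs to exactly one month
    if 7 <= local.hour < 9:
        labels.append("am_07_09")          # morning window, whole year
    elif 16 <= local.hour < 18:
        labels.append("pm_16_18")          # afternoon window, whole year
    return labels
```

Each trip therefore contributes to one monthly network and to at most one of the two time-window networks.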

2.4. What to Submit

− A Word document or PDF file with 1-2 paragraphs of your understanding of the multilevel community detection algorithm and the key concepts (e.g., modularity).

− The results of the communities based on (3), i.e., results for 14 different scenarios.

− If merging with the first part of the submission, please clearly label this part with its title (e.g., Derive Community Structures from Taxi Flow Network).

References

Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008.
