After reading the chapter by Capri (2015) on manual data collection. Answer the following questions: What were the traditional methods of data collection in the transit system

After reading the chapter by Capri (2015) on manual data collection.  Answer the following questions:

  1. What were the traditional methods of data collection in the transit system?
  2. Why are the traditional methods insufficient in satisfying the requirement of data collection?
  3. Give a synopsis of the case study and your thoughts on the transit system optimization and performance measurement requirements, and on how the expensive and labor-intensive nature of manual data collection affects them.

Answer all of the questions above in APA 7 format, with a heading for each question. Support your work with at least two peer-reviewed sources. The paper should contain at least 2 pages of content (not including the cover page or reference page).

In: Data Mining ISBN: 978-1-63463-738-1

Editor: Harold L. Capri © 2015 Nova Science Publishers, Inc.

Chapter 1



Xiaolei Ma1, Ph.D. and Yinhai Wang 2 , Ph.D.

1 School of Transportation Science and Engineering,

Beihang University, Beijing, China 2 Department of Civil and Environmental Engineering,

University of Washington, Seattle, WA, US


To improve customer satisfaction and reduce operation costs, transit

authorities have been striving to monitor their transit service quality and

identify the key factors to attract the transit riders. Traditional manual

data collection methods are unable to satisfy the transit system

optimization and performance measurement requirement due to their

expensive and labor-intensive nature. The recent advent of passive data

collection techniques (e.g., Automated Fare Collection and Automated

Vehicle Location) has shifted a data-poor environment to a data-rich

environment, and offered the opportunities for transit agencies to conduct

comprehensive transit system performance measures. Although it is

possible to collect highly valuable information from ubiquitous transit

data, data usability and accessibility are still difficult. Most Automatic

Fare Collection (AFC) systems are not designed for transit performance

monitoring, and additional passenger trip information cannot be directly

Copyright 2014. Nova Science Publishers, Inc. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.

EBSCO Publishing : eBook Collection (EBSCOhost) – printed on 10/28/2022 9:45 AM via UNIVERSITY OF THE CUMBERLANDS AN: 956104 ; Ma, Xiaolei, Capri, Harold L..; Data Mining: Principles, Applications and Emerging Challenges Account: s8501869.main.ehost

Xiaolei Ma and Yinhai Wang 2

retrieved. Interoperating and mining heterogeneous datasets would

enhance both the depth and breadth of transit-related studies. This study

proposed a series of data mining algorithms to extract individual transit

rider’s origin using transit smart card and GPS data. The primary data

source of this study comes from the AFC system in Beijing, where a

passenger’s boarding stop (origin) and alighting stop (destination) on a

flat-rate bus are not recorded on the check-in and check-out scan. The bus

arrival time at each stop can be inferred from GPS data, and individual

passenger’s boarding stop is then estimated by fusing the identified bus

arrival time with smart card data. In addition, a Markov chain based

Bayesian decision tree algorithm is proposed to mine the passengers’

origin information when GPS data are absent. Both passenger origin

mining algorithms are validated based on either on-board transit survey

data or personal GPS logger data. The results demonstrate the

effectiveness and efficiency of the proposed algorithms on extracting

passenger origin information. The estimated passenger origin data are

highly valuable for transit system planning and route optimization.

Keywords: Automated fare collection system, transit GPS, passenger origin

inference, Bayesian decision tree, Markov chain


According to the 2000 United States Census, approximately 76% of people

chose privately owned vehicles to commute to work (ICF Consulting, 2003).

Recent results from the 2009 American

Community Survey indicate 79.5% of home-based workers drive alone for

commuting (McKenzie and Rapino, 2009). Many developing countries, e.g.,

China, also rely on privately owned vehicles to commute. For example, more

than 34% of the Beijing residents chose cars as their primary travel mode

while only 28.2% chose transit in 2010 (Beijing Transportation Research

Center, 2012). Public transit has been considered as an effective

countermeasure to reduce congestion, air pollution, and energy consumption

(Federal Highway Administration, 2002). According to the 2005 Urban Mobility

Report by the Texas Transportation Institute (2005), travel delay in 2003

would have increased by 27 percent without public transit. In the most

congested metropolitan cities of the U.S., public transit services saved

more than 1.1 billion hours of travel time. Moreover, public transit can help

enhance business and reduce urban sprawl through transit-oriented development

(TOD). During certain emergency scenarios, public transit can even act as a

Transit Passenger Origin Inference Using Smart Card Data … 3

safe and efficient transportation mode for evacuation (Federal Highway

Administration, 2002). Based on the aforementioned reasons, it is of critical

importance to improve the efficiency of public transit system, and promote

more roadway users to utilize public transit. To fulfill these objectives, transit

agencies need to understand the areas where improvements can be further

made, and whether community goals are being met, etc. A well-developed

performance measure system will facilitate decision making for transit

agencies. Transit agencies can evaluate the transit ridership trends with fare

policy changes and identify where and when better transit service should be

provided. In addition, transit agencies are also required to summarize transit

performance statistics for reporting to either the National Transit Database

(Kittelson & Associates et al., 2003), or the general public, who are interested

in knowing how well transit service is being provided. Nevertheless, developing

a set of structured performance measures often requires a large amount of data

and the corresponding domain knowledge to process and analyze these data.

These obstacles make it difficult for transit agencies to commit the necessary

time and effort. Traditionally, transit agencies have relied heavily on manual data

collection methods to gather transit operation and planning data (Ma et al.,

2012). However, traditional data collection methods (e.g., travel diaries and surveys)

are fairly costly and difficult to implement at a multiday level due to their

low response rates and accuracy. Transit agencies have spent tremendous

manpower and resources on manual data collection, and a significant

amount of energy and time on post-processing the raw data. With

advances in information technologies in intelligent transportation systems

(ITS), the availability of public transit data has been increasing in the past

decades, which has gradually shifted public transit system into a data-rich

paradigm. The Automatic Fare Collection (AFC) system and the Automatic Vehicle

Location (AVL) system are two common passive data collection methods. The AFC

system, also known as the smart card system, records and processes fare-related

information using either a contactless or a contact card to complete the

financial transaction (Chu, 2010). There are two typical types of AFC

systems: entry-only AFC system and distance-based AFC system. In the entry-

only AFC system, passengers are only required to swipe their smart cards over

the card reader during boarding, while passengers need to check in and check

out during both their boarding and alighting procedures for the distance-based

AFC system. AVL and AFC technologies hold substantial promise for transit

performance analysis and management at a relatively low cost. However,

historically, neither AVL nor AFC data have been used to their full

potential. Many AVL and AFC systems do not archive data in a readily

usable form (Furth, 2006). The AFC system was initially designed to reduce

the workload of tedious manual fare collection, not to support transit operation and

planning; as a result, certain critical information, such as the specific

spatial location of each transaction, may not be directly captured. The AVL

system tracks transit vehicles’ geospatial locations with the Global Positioning

System (GPS) at either a constant or a varying time interval. GPS accuracy

occasionally suffers from signal loss caused by tall buildings in

urban areas (Ma et al., 2011). Both the AFC system and the AVL system

have inherent drawbacks in monitoring transit system performance, and

require analytical approaches to eliminate erroneous data, fill in

missing values, and mine unseen and indirect information.

The remainder of this paper is organized as follows. Transit smart card data

and GPS data are described in Section 2. Based on these data sets, a data

fusion method that integrates roadway geospatial data is first proposed

to estimate transit vehicle arrival information. Then, a Bayesian decision

tree algorithm is presented to estimate each passenger’s boarding stop when

GPS data are unavailable. Considering the heavy computational burden of

decision tree algorithms, the Markov chain property is taken into account to

reduce the algorithm’s complexity. On-board survey and GPS data from the

Beijing transit system are used to test and verify the proposed algorithms.

Conclusions and future research efforts are summarized at the end of this paper.


Data from the AFC system and the AVL system are the two primary sources in

this study. Beijing Transit Incorporated began to issue smart cards on May 10,

2006. The smart card can be used in both the Beijing bus and subway systems.

Due to the discounted fares (up to 60% off) provided by the smart card, more than

90% of transit riders paid for their transit trips with smart cards in

2010 (Beijing Transportation Research Center, 2010). Two types of AFC

systems exist in Beijing transit: flat fare and distance-based fare. Transit riders

pay at a fixed rate for those flat fare buses when entering by tapping their

smart cards on the card reader. Thus, only check-in scans are necessary. For

the distance-based AFC system, transit riders need to swipe their smart cards

during both check-in and check-out processes. Transit riders need to hold their

smart cards near the card reader device to complete transactions when entering

or exiting buses. The smart card can be used in the Beijing subway system as well,

where passengers need to tap their smart cards on top of fare gates when

entering and exiting subway stations. Both boarding and alighting

information (time and location) are recorded by the fare gates. Although the transit

smart card is superior in convenience and efficiency, the

following issues still prevent transit agencies from taking full advantage of

smart card data for operational purposes:

• Passenger boarding and alighting information missing

Due to a design deficiency in the smart card scan system, the AFC system

on flat fare buses does not save any boarding location information, whereas

the AFC system on distance-based fare buses stores boarding and alighting

locations but not boarding time. Key information stored in the

database includes smart card ID, route number, driver ID, transaction time,

remaining balance, transaction amount, boarding stop (only available for

distance-based fare buses), and alighting stop (only available for distance-

based fare buses).

• Massive data sets

More than 16 million smart card transaction records are generated per day.

Among these transactions, 52% are from flat-rate bus riders. These smart card

transactions are scattered across a large-scale transit network with 52,386 links and

43,432 nodes, as presented in Figure 1:

Figure 1. Beijing Transit GIS Network.

• Limited external data with poor quality

Only approximately 50% of transit vehicles in Beijing are equipped with

GPS devices for tracking. GPS data are sent to the central server

at a pre-determined interval of 30 seconds. However, the collected GPS data

suffer from two major data quality issues: (1) vehicle direction information is

missing; and (2) GPS points fluctuate (Lou et al., 2009). Map matching

algorithms are needed to align the inaccurate GPS spatial records with the road

network. In addition, most transit routes are not designed to have fixed

schedules because of high ridership demands, and only certain routes with a

long distance or headway follow schedules at each stop (Chen, 2009). The

above characteristics of the Beijing AFC and AVL systems create more

challenges to process and mine useful information.
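The smart card record fields listed under the first issue above can be written down as a small data structure. This is only a sketch: the class name, field names, types, and sample values are hypothetical and do not come from the Beijing database schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class SmartCardTransaction:
    """One AFC record; field names and types are assumptions for illustration."""
    card_id: str
    route_number: str
    driver_id: str
    transaction_time: datetime
    remaining_balance: float
    transaction_amount: float
    # Only populated on distance-based fare buses, per the field list above.
    boarding_stop: Optional[str] = None
    alighting_stop: Optional[str] = None

    @property
    def needs_origin_inference(self) -> bool:
        # Flat fare records carry no boarding stop, so the origin must be inferred.
        return self.boarding_stop is None


# Hypothetical flat fare record: no stop information is available.
flat_fare = SmartCardTransaction(
    "C001", "00022", "D17", datetime(2010, 4, 7, 9, 30, 35), 18.6, 0.4
)
# Hypothetical distance-based record: both stops are known.
distance_based = SmartCardTransaction(
    "C002", "00917", "D42", datetime(2010, 4, 7, 9, 31, 2), 25.0, 1.2,
    boarding_stop="S05", alighting_stop="S11",
)
```

Making the two stop fields optional keeps a single record type for both fare systems and makes it explicit which records need the inference described in the following sections.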

It is noteworthy that the AFC system used in Beijing is not a unique case.

Most cities in China, such as Chongqing City (Gao and Wu, 2011),

Nanning City (Chen, 2009), and Kunming City (Zhou et al., 2007), employ a similar

AFC system in which passengers’ origin information is absent. In other

developing countries, such as Brazil, the AFC system does not record any

boarding location information either (Farzin, 2008). Therefore, a solution for

extracting passenger boarding and alighting information is beneficial to

transit agencies with imperfect smart card data internationally.


Because smart card readers on flat-rate buses do not record passengers’

boarding stops, individual boarding locations must be inferred from smart

card transaction data. In this section, two primary approaches are presented to

achieve this goal. Approximately 50% of transit vehicles in the Beijing

entry-only AFC system are equipped with GPS devices. Therefore, a data fusion method

combining GPS data, smart card data, and GIS data is first developed to estimate

each bus’s arrival time at each stop and infer each passenger’s boarding

stop. Then, for buses without GPS devices, a Bayesian decision tree

algorithm is proposed that uses smart card transaction times and applies

Bayesian inference theory to estimate the likelihood of each possible boarding

stop. To extend the usability of the proposed Bayesian decision tree

algorithm to large-scale datasets, Markov chain optimization is used to reduce

the algorithm’s computational complexity. Both transit passenger origin

inference algorithms are validated using external data (e.g., on-board survey

data and GPS data).

Passenger Origin Inference with GPS Data

In the first step, a GPS-based arrival information inference algorithm is

presented to estimate the arrival time at each transit stop. The

inferred stop-level arrival times are then matched with the timestamps recorded in the

AFC system, and each smart card transaction record is assigned the ID of the stop

whose arrival time is temporally closest. The logic flow chart is shown in

Figure 2. The major data processing procedures are detailed below.

Figure 2. Flow Chart for Passenger Origin Inference with GPS Data.

Bus Arrival Time Extraction

Three primary data sources are involved in the passenger information

extraction: vehicle GPS data, transit stop spatial location data, and flat-fare-based

smart card transaction data. A transit GIS network contains the

geospatial location of each stop on every transit route. The GPS device

mounted on each bus records the bus’s location and timestamp every 30

seconds, but the quality of the collected GPS records is unsatisfactory: no

directional information is recorded in the Beijing AVL system, and GPS points fall off

the roadway network due to the satellite signal fluctuation. Data preprocessing

is required prior to bus arrival time estimation. A program is written to parse

and import raw GPS data into a database in an automatic manner. Key fields

of a GPS record are shown in Table 1.

Table 1. Examples of GPS raw data

Vehicle ID  Date time            Latitude  Longitude  Spot speed  Route ID
00034603    2010-04-07 09:28:57  39.73875  116.1355   9.07        00022
00034603    2010-04-07 09:29:27  39.73710  116.1358   14.26       00022
00034603    2010-04-07 09:29:58  39.73592  116.1357   19.63       00022
00034603    2010-04-07 09:30:28  39.73479  116.1357   0           00022
00034603    2010-04-07 09:30:58  39.73420  116.1357   3.52        00022
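The parsing and import step mentioned before Table 1 might look roughly like the sketch below. It assumes the raw GPS feed is whitespace-separated text with the six fields of Table 1 and uses an in-memory SQLite database; the table name, column names, and function names are hypothetical, and the chapter does not say which database or file format was actually used.

```python
import sqlite3
from datetime import datetime

# Sample rows copied from Table 1; the whitespace-separated layout is an assumption.
RAW_GPS = """\
00034603 2010-04-07 09:28:57 39.73875 116.1355 9.07 00022
00034603 2010-04-07 09:29:27 39.73710 116.1358 14.26 00022
00034603 2010-04-07 09:29:58 39.73592 116.1357 19.63 00022
00034603 2010-04-07 09:30:28 39.73479 116.1357 0 00022
00034603 2010-04-07 09:30:58 39.73420 116.1357 3.52 00022
"""


def parse_gps_line(line):
    """Split one raw line into typed fields, validating the timestamp format."""
    vehicle_id, date, time, lat, lon, speed, route_id = line.split()
    timestamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    return (vehicle_id, timestamp.isoformat(sep=" "),
            float(lat), float(lon), float(speed), route_id)


def load_gps(conn, raw_text):
    """Create the (hypothetical) gps table if needed and insert all parsed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gps ("
        "vehicle_id TEXT, recorded_at TEXT, latitude REAL, "
        "longitude REAL, spot_speed REAL, route_id TEXT)"
    )
    rows = [parse_gps_line(line) for line in raw_text.splitlines() if line.strip()]
    conn.executemany("INSERT INTO gps VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
n_loaded = load_gps(conn, RAW_GPS)
```

Vehicle and route IDs are kept as text so that their leading zeros survive the import.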

The first step is to estimate the bus arrival time for each stop by joining

GPS data and the stop-level geo-location data. A buffer area can be created

around each particular stop for a certain transit route using the GIS software.

Within this area, several GPS records are likely to be captured. However,

identifying the GPS record geospatially closest to each particular stop is

challenging because the buffer zone may contain GPS records whose travel

direction is unknown. In the GIS network, each link (i.e., polyline) on which

a transit stop is located is defined by a start node and an end node,

so the direction of each GPS record can be inferred by comparing the

link direction with the direction of travel between two consecutive GPS records.

With the identified direction, the distance from each GPS point to the

stop can be calculated, and the timestamp of the point with the minimum

distance is regarded as the bus arrival time at that stop. Figure 3

illustrates this procedure. The inbound stop represents

the physical location of a particular transit stop, and this stop is snapped to a

transit link, whose direction is defined by a start node and an end node.

By comparing the driving direction from GPS records with the link direction,

the GPS record nearest to this stop can be identified; it is marked by

the red five-pointed star on the map. The timestamp associated with this

five-pointed star is taken as the arrival time for this inbound stop. The

merit of the bus arrival time estimation algorithm lies in its efficiency. Rather

than searching all the GPS data to identify the traveling direction for each stop,

the proposed algorithm narrows the search area and filters out

unlikely GPS records. This greatly reduces the computational burden

and is relatively easy to implement on large-scale datasets, which is

particularly critical for processing the tremendous amount of data within an

acceptable time period.
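The arrival time procedure described above (buffer around the stop, direction check against the link, minimum distance) can be sketched in plain Python. This is not the authors' GIS implementation: the 100 m buffer radius, the stop and link node coordinates, the equirectangular projection, and the function names are all assumptions made for illustration. The GPS points are the rows of Table 1.

```python
import math
from datetime import datetime

EARTH_RADIUS_M = 6371000.0


def to_xy(lat, lon, ref_lat):
    """Project latitude/longitude to local metres (equirectangular approximation)."""
    x = math.radians(lon) * EARTH_RADIUS_M * math.cos(math.radians(ref_lat))
    y = math.radians(lat) * EARTH_RADIUS_M
    return x, y


def estimate_arrival(gps, stop, link_start, link_end, buffer_m=100.0):
    """Return the timestamp of the GPS point nearest to the stop among points
    that lie inside the buffer and travel in the link direction, else None.

    gps: list of (timestamp, lat, lon) for one vehicle, sorted by time.
    stop, link_start, link_end: (lat, lon) tuples.
    """
    ref_lat = stop[0]
    sx, sy = to_xy(stop[0], stop[1], ref_lat)
    ax, ay = to_xy(link_start[0], link_start[1], ref_lat)
    bx, by = to_xy(link_end[0], link_end[1], ref_lat)
    link_dx, link_dy = bx - ax, by - ay

    best = None
    # The first point has no predecessor, so its heading is unknown and it is skipped.
    for (_, lat0, lon0), (t1, lat1, lon1) in zip(gps, gps[1:]):
        x0, y0 = to_xy(lat0, lon0, ref_lat)
        x1, y1 = to_xy(lat1, lon1, ref_lat)
        dist = math.hypot(x1 - sx, y1 - sy)
        if dist > buffer_m:
            continue  # outside the buffer zone around the stop
        # Heading from two consecutive points must agree with the link direction.
        if (x1 - x0) * link_dx + (y1 - y0) * link_dy <= 0:
            continue
        if best is None or dist < best[0]:
            best = (dist, t1)
    return None if best is None else best[1]


# GPS points from Table 1 (the bus is heading south).
gps = [
    (datetime(2010, 4, 7, 9, 28, 57), 39.73875, 116.1355),
    (datetime(2010, 4, 7, 9, 29, 27), 39.73710, 116.1358),
    (datetime(2010, 4, 7, 9, 29, 58), 39.73592, 116.1357),
    (datetime(2010, 4, 7, 9, 30, 28), 39.73479, 116.1357),
    (datetime(2010, 4, 7, 9, 30, 58), 39.73420, 116.1357),
]
stop = (39.73480, 116.1357)    # hypothetical stop location
north = (39.74000, 116.1357)   # hypothetical link node
south = (39.73000, 116.1357)   # hypothetical link node

southbound_arrival = estimate_arrival(gps, stop, north, south)
northbound_arrival = estimate_arrival(gps, stop, south, north)
```

With these assumed coordinates, the southbound link matches the bus heading and yields an arrival time, while the reversed link rejects every point, which is how the direction check separates the two sides of the road.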

Figure 3. Boarding Time Estimation with GPS Data and Transit Stop Location Data.

Passenger Boarding Location Identification with Smart Card Data

For each smart card data transaction record, the boarding stop can be

estimated by matching the recorded timestamp and the identified bus arrival

time. As presented in Figure 4, for each smart card transaction record, the

transaction time is compared with the inferred bus arrival time at each stop.

The record is assigned to the stop whose bus arrival time is

temporally closest to its transaction time. Because passengers

board the bus within a relatively short time interval, this data fusion method is able

to capture almost all missing boarding stops.
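The matching rule above amounts to a nearest-neighbour search in time. A minimal sketch, assuming the arrival times for one bus trip are already available as a non-empty, time-sorted list; the stop IDs, arrival times, and function name are hypothetical:

```python
from bisect import bisect_left
from datetime import datetime


def assign_boarding_stop(transaction_time, arrivals):
    """Return the stop ID whose arrival time is closest to the transaction time.

    arrivals: non-empty list of (arrival_time, stop_id), sorted by arrival_time.
    """
    times = [t for t, _ in arrivals]
    i = bisect_left(times, transaction_time)
    # Only the arrivals just before and just after the transaction can be closest.
    candidates = arrivals[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda a: abs(a[0] - transaction_time))[1]


# Hypothetical inferred arrival times for one bus trip.
arrivals = [
    (datetime(2010, 4, 7, 9, 28, 40), "stop_A"),
    (datetime(2010, 4, 7, 9, 30, 28), "stop_B"),
    (datetime(2010, 4, 7, 9, 33, 5), "stop_C"),
]

first = assign_boarding_stop(datetime(2010, 4, 7, 9, 30, 35), arrivals)
second = assign_boarding_stop(datetime(2010, 4, 7, 9, 29, 20), arrivals)
third = assign_boarding_stop(datetime(2010, 4, 7, 9, 40, 0), arrivals)
```

Binary search keeps each lookup logarithmic in the number of stops, which matters when millions of transactions are matched per day.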

Figure 4. Boarding Stop Identification with Bus Arrival Time.

In addition, because the arrival times at all stops of a particular transit

route can be estimated, the average travel time between two adjacent stops can

be calculated as well. This statistic is not only critical for transit

performance measures, but also provides prior information for passenger

origin inference when GPS data are absent.
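Averaging the travel time between adjacent stops over several trips could be done as in the sketch below. The trip data, stop IDs, and function name are hypothetical, and the chapter does not specify how trips are grouped or averaged.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def mean_segment_times(trips):
    """Average travel time in seconds between each pair of adjacent stops.

    trips: list of trips, each a list of (stop_id, arrival_time) in stop order.
    """
    samples = defaultdict(list)
    for trip in trips:
        for (s0, t0), (s1, t1) in zip(trip, trip[1:]):
            samples[(s0, s1)].append((t1 - t0).total_seconds())
    return {segment: mean(values) for segment, values in samples.items()}


# Two hypothetical trips on the same route.
trips = [
    [("A", datetime(2010, 4, 7, 9, 0, 0)),
     ("B", datetime(2010, 4, 7, 9, 1, 40)),
     ("C", datetime(2010, 4, 7, 9, 4, 10))],
    [("A", datetime(2010, 4, 7, 9, 20, 0)),
     ("B", datetime(2010, 4, 7, 9, 22, 0)),
     ("C", datetime(2010, 4, 7, 9, 24, 50))],
]

segment_times = mean_segment_times(trips)
```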


Compared with bus arrival time, door opening time can be more

accurately matched with smart card transaction time. This is because each bus

may not stop exactly at each transit stop for passenger boarding. The inferred

bus arrival time is therefore subject to error when it is matched with smart

card data. To validate the accuracy of the proposed data fusion algorithm for

passenger origin inference, an on-board transit survey was undertaken to collect

the bus door opening time and arrival location at each stop of route 651 on

January 13, 2013. Handheld GPS devices were used to track the

geospatial location of moving buses every 15 seconds. The survey ran

from 8:00 AM to 1:00 PM, and a total of 75 bus door opening times were

manually recorded. These bus door opening time records were then matched

with smart card transactions from 417 passengers, and the resulting stops

can be considered as the ground-truth data. By comparing the ground-truth

data with the results from the proposed GPS data fusion approach, 406

boarding stops were accurately inferred and 11 boarding stops differed from the

ground-truth data by no more than one stop. The proposed algorithm thus

achieves an accuracy of 97.4%.
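As a quick check, the reported accuracy follows directly from the counts stated above, with 406 exact matches and 11 near misses among the 417 surveyed passengers:

```latex
406 + 11 = 417, \qquad
\text{accuracy} = \frac{406}{417} \approx 0.974 = 97.4\%
```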

Passenger Origin Inference with Smart Card Data

A fair number of buses still lack GPS devices, and for these buses the

bus arrival time at each transit stop is not directly measured. However, most

passengers scan their cards immediately when boarding, and almost all

passengers should complete the check-in scan before arriving at the next stop.

This indicates that the first passenger’s transaction time can be safely taken

as the boarding time of the group of passengers at the same stop. The challenge is

then to identify the bus location at the moment of the smart card (SC) transaction so that we

can infer the boarding stop for that passenger. However, this is not easy

because the SC system for the flat-rate bus does not record bus location. We

know the time each transaction occurred on a bus of a particular route under

the operation of a particular driver, but nothing else is known from the SC

transaction database. Nonetheless, we are able to extract boarding volume

changes with time and passengers who made transfers. By mining these data

and combining transit route maps, we may be able to accomplish our goal.

Therefore, a two-step approach is designed for passenger origin data

extraction: smart card data clustering and transit stop recognition. To

implement the proposed algorithm in an efficient manner, a Markov Chain

based optimization approach is applied to reduce the computational


Smart Card Data Clustering
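The observation made earlier, that passengers boarding at the same stop scan within a short time of one another and that the first transaction time can stand in for the group's boarding time, suggests a simple grouping by time gaps. The sketch below illustrates that idea only. The fixed 60-second gap threshold, the sample times, and the function names are assumptions, and this is not necessarily the clustering procedure used in the chapter.

```python
from datetime import datetime, timedelta


def cluster_transactions(times, max_gap=timedelta(seconds=60)):
    """Group transaction times so that consecutive times within a cluster
    are at most max_gap apart (max_gap is an assumed threshold)."""
    clusters = []
    for t in sorted(times):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)   # same boarding group
        else:
            clusters.append([t])     # a new group starts at a new stop
    return clusters


def boarding_times(clusters):
    """Take the first transaction of each cluster as the group's boarding time."""
    return [cluster[0] for cluster in clusters]


# Hypothetical transaction times on one flat-rate bus.
day = (2010, 4, 7)
times = [
    datetime(*day, 8, 0, 5), datetime(*day, 8, 0, 9), datetime(*day, 8, 0, 20),
    datetime(*day, 8, 3, 40), datetime(*day, 8, 3, 44),
    datetime(*day, 8, 7, 10),
]

clusters = cluster_transactions(times)
group_boarding_times = boarding_times(clusters)
```

Each resulting cluster would then still need to be mapped to a physical stop, which is the second step (transit stop recognition) named above.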
