
Read the Kelleher Textbook: Chapter 1

  1. Read the Kelleher Textbook: Chapter 1 and all the postings on Topic 1 – Epistemology I
  2. Your topic is: Big Data
    1. What is big data? Give the history.
    2. How is big data used today in decision science?
    3. Give big data examples.
    4. What are the big data types?
    5. What is big data technology?
    6. Describe big data analytics.
    7. How big is big data?
    8. Integrate your faith, the textbook, your values, your future career plans, and your past employment history as applicable.

DATA SCIENCE

The MIT Press Essential Knowledge Series

Auctions, Timothy P. Hubbard and Harry J. Paarsch
The Book, Amaranth Borsuk
Cloud Computing, Nayan Ruparelia
Computing: A Concise History, Paul E. Ceruzzi
The Conscious Mind, Zoltan L. Torey
Crowdsourcing, Daren C. Brabham
Data Science, John D. Kelleher and Brendan Tierney
Free Will, Mark Balaguer
The Future, Nick Montfort
Information and Society, Michael Buckland
Information and the Modern Corporation, James W. Cortada
Intellectual Property Strategy, John Palfrey
The Internet of Things, Samuel Greengard
Machine Learning: The New AI, Ethem Alpaydin
Machine Translation, Thierry Poibeau
Memes in Digital Culture, Limor Shifman
Metadata, Jeffrey Pomerantz
The Mind–Body Problem, Jonathan Westphal
MOOCs, Jonathan Haber
Neuroplasticity, Moheb Costandi
Open Access, Peter Suber
Paradox, Margaret Cuonzo
Post-Truth, Lee McIntyre
Robots, John Jordan
Self-Tracking, Gina Neff and Dawn Nafus
Sustainability, Kent E. Portney
Synesthesia, Richard E. Cytowic
The Technological Singularity, Murray Shanahan
Understanding Beliefs, Nils J. Nilsson
Waves, Frederic Raichlen

The MIT Press | Cambridge, Massachusetts | London, England

DATA SCIENCE

John D. Kelleher and Brendan Tierney

© 2018 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

This book was set in Chaparral Pro by Toppan Best-set Premedia Limited. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data

Names: Kelleher, John D., 1974- author. | Tierney, Brendan, 1970- author.
Title: Data science / John D. Kelleher and Brendan Tierney.
Description: Cambridge, MA: The MIT Press, [2018] | Series: The MIT Press essential knowledge series | Includes bibliographical references and index.
Identifiers: LCCN 2017043665 | ISBN 9780262535434 (pbk.: alk. paper)
Subjects: LCSH: Big data. | Machine learning. | Data mining. | Quantitative research.
Classification: LCC QA76.9.B45 K45 2018 | DDC 005.7–dc23
LC record available at https://lccn.loc.gov/2017043665


CONTENTS

Series Foreword
Preface
Acknowledgments

1 What Is Data Science?
2 What Are Data, and What Is a Data Set?
3 A Data Science Ecosystem
4 Machine Learning 101
5 Standard Data Science Tasks
6 Privacy and Ethics
7 Future Trends and Principles of Success

Glossary
Notes
Further Readings
References
Index

SERIES FOREWORD

The MIT Press Essential Knowledge series offers accessible, concise, beautifully produced pocket-size books on topics of current interest. Written by leading thinkers, the books in this series deliver expert overviews of subjects that range from the cultural and the historical to the scientific and the technical.

In today’s era of instant information gratification, we have ready access to opinions, rationalizations, and superficial descriptions. Much harder to come by is the foundational knowledge that informs a principled understanding of the world. Essential Knowledge books fill that need. Synthesizing specialized subject matter for nonspecialists and engaging critical topics through fundamentals, each of these compact volumes offers readers a point of access to complex ideas.

Bruce Tidor
Professor of Biological Engineering and Computer Science
Massachusetts Institute of Technology

PREFACE

The goal of data science is to improve decision making by basing decisions on insights extracted from large data sets. As a field of activity, data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from large data sets. It is closely related to the fields of data mining and machine learning, but it is broader in scope. Today, data science drives decision making in nearly all parts of modern societies. Some of the ways that data science may affect your daily life include determining which advertisements are presented to you online; which movies, books, and friend connections are recommended to you; which emails are filtered into your spam folder; what offers you receive when you renew your cell phone service; the cost of your health insurance premium; the sequencing and timing of traffic lights in your area; how the drugs you may need were designed; and which locations in your city the police are targeting.

The growth in use of data science across our societies is driven by the emergence of big data and social media, the speedup in computing power, the massive reduction in the cost of computer memory, and the development of more powerful methods for data analysis and modeling, such as deep learning. Together these factors mean that it has never been easier for organizations to gather, store, and process data. At the same time, these technical innovations and the broader application of data science mean that the ethical challenges related to the use of data and individual privacy have never been more pressing. The aim of this book is to provide an introduction to data science that covers the essential elements of the field at a depth that provides a principled understanding of it.

Chapter 1 introduces the field of data science and provides a brief history of how it has developed and evolved. It also examines why data science is important today and some of the factors that are driving its adoption. The chapter finishes by reviewing and debunking some of the myths associated with data science. Chapter 2 introduces fundamental concepts relating to data. It also describes the standard stages in a data science project: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Chapter 3 focuses on data infrastructure and the challenges posed by big data and the integration of data from multiple sources. One aspect of a typical data infrastructure that can be challenging is that data in databases and data warehouses often reside on servers different from the servers used for data analysis. As a consequence, when large data sets are handled, a surprisingly large amount of time can be spent moving data between the servers on which a database or data warehouse lives and the servers used for data analysis and machine learning. Chapter 3 begins by describing a typical data science infrastructure for an organization and some of the emerging solutions to the challenge of moving large data sets within a data infrastructure, which include the use of in-database machine learning, the use of Hadoop for data storage and processing, and the development of hybrid database systems that seamlessly combine traditional database software and Hadoop-like solutions. The chapter concludes by highlighting some of the challenges in integrating data from across an organization into a unified representation that is suitable for machine learning. Chapter 4 introduces the field of machine learning and explains some of the most popular machine-learning algorithms and models, including neural networks, deep learning, and decision-tree models. Chapter 5 focuses on linking machine-learning expertise with real-world problems by reviewing a range of standard business problems and describing how they can be solved by machine-learning solutions. Chapter 6 reviews the ethical implications of data science, recent developments in data regulation, and some of the new computational approaches to preserving the privacy of individuals within the data science process. Finally, chapter 7 describes some of the areas where data science will have a significant impact in the near future and sets out some of the principles that are important in determining whether a data science project will succeed.

ACKNOWLEDGMENTS

John and Brendan thank Paul McElroy and Brian Leahy for reading and commenting on early drafts. They also thank the two anonymous reviewers who provided detailed and helpful feedback on the manuscript and the staff at the MIT Press for their support and guidance.

John thanks his family and friends for their support and encouragement during the preparation of this book and dedicates this book to his father, John Bernard Kelleher, in recognition of his love and friendship.

Brendan thanks Grace, Daniel, and Eleanor for their constant support while he was writing yet another book (his fourth), juggling the day jobs, and traveling.

1 WHAT IS DATA SCIENCE?

Data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from large data sets. Many of the elements of data science have been developed in related fields such as machine learning and data mining. In fact, the terms data science, machine learning, and data mining are often used interchangeably. The commonality across these disciplines is a focus on improving decision making through the analysis of data. However, although data science borrows from these other fields, it is broader in scope. Machine learning (ML) focuses on the design and evaluation of algorithms for extracting patterns from data. Data mining generally deals with the analysis of structured data and often implies an emphasis on commercial applications. Data science takes all of these considerations into account but also takes up other challenges, such as the capturing, cleaning, and transforming of unstructured social media and web data; the use of big-data technologies to store and process big, unstructured data sets; and questions related to data ethics and regulation.

Using data science, we can extract different types of patterns. For example, we might want to extract patterns that help us to identify groups of customers exhibiting similar behavior and tastes. In business jargon, this task is known as customer segmentation, and in data science terminology it is called clustering. Alternatively, we might want to extract a pattern that identifies products that are frequently bought together, a process called association-rule mining. Or we might want to extract patterns that identify strange or abnormal events, such as fraudulent insurance claims, a process known as anomaly or outlier detection. Finally, we might want to identify patterns that help us to classify things. For example, the following rule illustrates what a classification pattern extracted from an email data set might look like: If an email contains the phrase “Make money easily,” it is likely to be a spam email. Identifying these types of classification rules is known as prediction. The word prediction might seem an odd choice because the rule doesn’t predict what will happen in the future: the email already is or isn’t a spam email. So it is best to think of prediction patterns as predicting the missing value of an attribute rather than as predicting the future. In this example, we are predicting whether the email classification attribute should have the value “spam” or not.
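To see just how simple such a classification rule is, it can be written as a single check. The sketch below is ours, not the authors'; the function name and test strings are invented for illustration.

```python
def looks_like_spam(email_text: str) -> bool:
    """Toy classification rule: flag an email containing one suspicious phrase.

    This mirrors the single-attribute rule described in the text; a real spam
    filter would combine many such features rather than rely on one phrase.
    """
    return "make money easily" in email_text.lower()


print(looks_like_spam("Make money easily from home!"))  # True
print(looks_like_spam("Minutes of Monday's meeting"))   # False
```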

Although we can use data science to extract different types of patterns, we always want the patterns to be both nonobvious and useful. The example email classification rule given in the previous paragraph is so simple and obvious that if it were the only rule extracted by a data science process, we would be disappointed. For example, this email classification rule checks only one attribute of an email: Does the email contain the phrase “make money easily”? If a human expert can easily create a pattern in his or her own mind, it is generally not worth the time and effort of using data science to “discover” it. In general, data science becomes useful when we have a large number of data examples and when the patterns are too complex for humans to discover and extract manually. As a lower bound, we can take a large number of data examples to be defined as more than a human expert can check easily. With regard to the complexity of the patterns, again, we can define it relative to human abilities. We humans are reasonably good at defining rules that check one, two, or even three attributes (also commonly referred to as features or variables), but when we go higher than three attributes, we can start to struggle to handle the interactions between them. By contrast, data science is often applied in contexts where we want to look for patterns among tens, hundreds, thousands, and, in extreme cases, millions of attributes.

The patterns that we extract using data science are useful only if they give us insight into the problem that enables us to do something to help solve the problem. The phrase actionable insight is sometimes used in this context to describe what we want the extracted patterns to give us. The term insight highlights that the pattern should give us relevant information about the problem that isn’t obvious. The term actionable highlights that the insight we get should also be something that we have the capacity to use in some way. For example, imagine we are working for a cell phone company that is trying to solve a customer churn problem—that is, too many customers are switching to other companies. One way data science might be used to address this problem is to extract patterns from the data about previous customers that allow us to identify current customers who are churn risks and then contact these customers and try to persuade them to stay with us. A pattern that enables us to identify likely churn customers is useful to us only if (a) the patterns identify the customers early enough that we have enough time to contact them before they churn and (b) our company is able to put a team in place to contact them. Both of these things are required in order for the company to be able to act on the insight the patterns give us.


A Brief History of Data Science

The term data science has a specific history dating back to the 1990s. However, the fields that it draws upon have a much longer history. One thread in this longer history is the history of data collection; another is the history of data analysis. In this section, we review the main developments in these threads and describe how and why they converged into the field of data science. Of necessity, this review introduces new terminology as we describe and name the important technical innovations as they arose. For each new term, we provide a brief explanation of its meaning; we return to many of these terms later in the book and provide a more detailed explanation of them. We begin with a history of data collection, then give a history of data analysis, and, finally, cover the development of data science.

A History of Data Gathering

The earliest methods for recording data may have been notches on sticks to mark the passing of the days or poles stuck in the ground to mark sunrise on the solstices. With the development of writing, however, our ability to record our experiences and the events in our world vastly increased the amount of data we collected. The earliest form of writing developed in Mesopotamia around 3200 BC and was used for commercial record keeping. This type of record keeping captures what is known as transactional data. Transactional data include event information such as the sale of an item, the issuing of an invoice, the delivery of goods, credit card payment, insurance claims, and so on. Nontransactional data, such as demographic data, also have a long history. The earliest-known censuses took place in pharaonic Egypt around 3000 BC. The reason that early states put so much effort and resources into large data-collection operations was that these states needed to raise taxes and armies, thus proving Benjamin Franklin’s claim that there are only two things certain in life: death and taxes.

In the past 150 years, the development of the electronic sensor, the digitization of data, and the invention of the computer have contributed to a massive increase in the amount of data that are collected and stored. A milestone in data collection and storage occurred in 1970 when Edgar F. Codd published a paper explaining the relational data model, which was revolutionary in terms of setting out how data were (at the time) stored, indexed, and retrieved from databases. The relational data model enabled users to extract data from a database using simple queries that defined what data the user wanted without requiring the user to worry about the underlying structure of the data or where they were physically stored. Codd’s paper provided the foundation for modern databases and the development of structured query language (SQL), an international standard for defining database queries. Relational databases store data in tables with a structure of one row per instance and one column per attribute. This structure is ideal for storing data because it can be decomposed into natural attributes.
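A minimal sketch of this idea, assuming Python's built-in sqlite3 module (the table and column names are invented for illustration): the query states what rows are wanted, not how or where they are physically stored.

```python
import sqlite3

# An in-memory relational table: one row per sale, one column per attribute.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, quantity INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("book", 2, 14.99), ("pen", 10, 1.20), ("book", 1, 14.99)],
)

# A declarative SQL query: we describe *what* we want, not *how* to fetch it.
for row in conn.execute("SELECT item, SUM(quantity) FROM sales GROUP BY item"):
    print(row)  # ('book', 3) and ('pen', 10)
```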

Databases are the natural technology to use for storing and retrieving structured transactional or operational data (i.e., the type of data generated by a company’s day-to-day operations). However, as companies have become larger and more automated, the amount and variety of data generated by different parts of these companies have dramatically increased. In the 1990s, companies realized that although they were accumulating tremendous amounts of data, they were repeatedly running into difficulties in analyzing those data. Part of the problem was that the data were often stored in numerous separate databases within the one organization. Another difficulty was that databases were optimized for storage and retrieval of data, activities characterized by high volumes of simple operations, such as SELECT, INSERT, UPDATE, and DELETE. In order to analyze their data, these companies needed technology that was able to bring together and reconcile the data from disparate databases and that facilitated more complex analytical data operations. This business challenge led to the development of data warehouses. In a data warehouse, data are taken from across the organization and integrated, thereby providing a more comprehensive data set for analysis.

Over the past couple of decades, our devices have become mobile and networked, and many of us now spend many hours online every day using social technologies, computer games, media platforms, and web search engines. These changes in technology and how we live have had a dramatic impact on the amount of data collected. It is estimated that the amount of data collected over the five millennia since the invention of writing up to 2003 is about 5 exabytes. Since 2013, humans generate and store this same amount of data every day. However, it is not only the amount of data collected that has grown dramatically but also the variety of data. Just consider the following list of online data sources: emails, blogs, photos, tweets, likes, shares, web searches, video uploads, online purchases, podcasts. And if we consider the metadata (data describing the structure and properties of the raw data) of these events, we can begin to understand the meaning of the term big data. Big data are often defined in terms of the three Vs: the extreme volume of data, the variety of the data types, and the velocity at which the data must be processed.

The advent of big data has driven the development of a range of new database technologies. This new generation of databases is often referred to as “NoSQL databases.” They typically have a simpler data model than traditional relational databases. A NoSQL database stores data as objects with attributes, using an object notation language such as the JavaScript Object Notation (JSON). The advantage of using an object representation of data (in contrast to a relational table-based model) is that the set of attributes for each object is encapsulated within the object, which results in a flexible representation. For example, it may be that one of the objects in the database, compared to other objects, has only a subset of attributes. By contrast, in the standard tabular data structure used by a relational database, all the data points should have the same set of attributes (i.e., columns). This flexibility in object representation is important in contexts where the data cannot (due to variety or type) naturally be decomposed into a set of structured attributes. For example, it can be difficult to define the set of attributes that should be used to represent free text (such as tweets) or images. However, although this representational flexibility allows us to capture and store data in a variety of formats, these data still have to be extracted into a structured format before any analysis can be performed on them.
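The contrast between the two representations can be sketched with ordinary JSON-style objects (the records and fields below are invented for illustration): each object carries its own set of attributes, so one record can have fields the others lack, which a fixed relational schema would only accommodate by adding columns full of missing values.

```python
import json

# Object (document) representation: each record encapsulates its own attributes.
posts = [
    {"id": 1, "text": "Loving the new phone!", "hashtags": ["#tech"]},
    {"id": 2, "text": "Sunset over the bay", "image_url": "https://example.com/p/2.jpg"},
    {"id": 3, "text": "Quarterly results are out"},  # no hashtags, no image
]

print(json.dumps(posts[1], indent=2))

# A relational table would force every row into one fixed set of columns,
# with NULLs (or extra columns) for attributes that only some records have.
```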

The existence of big data has also led to the development of new data-processing frameworks. When you are dealing with large volumes of data at high speeds, it can be useful from a computational and speed perspective to distribute the data across multiple servers, process queries by calculating partial results of a query on each server, and then merge these results to generate the response to the query. This is the approach taken by the MapReduce framework on Hadoop. In the MapReduce framework, the data and queries are mapped onto (or distributed across) multiple servers, and the partial results calculated on each server are then reduced (merged) together.
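The map-then-reduce idea can be sketched in a few lines of plain Python (our sketch, with no Hadoop involved; the "servers" are just list partitions): each partition computes a partial word count, and the partial results are then merged into the final answer.

```python
from collections import Counter
from functools import reduce

documents = ["big data", "data science", "big data science"]

# "Map": each simulated server computes a partial result on its own partition.
partitions = [documents[0:1], documents[1:2], documents[2:3]]
partial_counts = [
    Counter(word for doc in partition for word in doc.split())
    for partition in partitions
]

# "Reduce": merge the partial results to produce the response to the query.
total = reduce(lambda a, b: a + b, partial_counts, Counter())
print(total)  # Counter({'data': 3, 'big': 2, 'science': 2})
```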

A History of Data Analysis

Statistics is the branch of science that deals with the collection and analysis of data. The term statistics originally referred to the collection and analysis of data about the state, such as demographics data or economic data. However, over time the type of data that statistical analysis was applied to broadened so that today statistics is used to analyze all types of data. The simplest form of statistical analysis of data is the summarization of a data set in terms of summary (descriptive) statistics (including measures of central tendency, such as the arithmetic mean, or measures of variation, such as the range). However, in the seventeenth and eighteenth centuries the work of people such as Gerolamo Cardano, Blaise Pascal, Jakob Bernoulli, Abraham de Moivre, Thomas Bayes, and Richard Price laid the foundations of probability theory, and through the nineteenth century many statisticians began to use probability distributions as part of their analytic tool kit. These new developments in mathematics enabled statisticians to move beyond descriptive statistics and to start doing statistical learning. Pierre Simon de Laplace and Carl Friedrich Gauss are two of the most important and famous nineteenth-century mathematicians, and both made important contributions to statistical learning and modern data science. Laplace took the intuitions of Thomas Bayes and Richard Price and developed them into the first version of what we now call Bayes’ Rule. Gauss, in his search for the missing dwarf planet Ceres, developed the method of least squares, which finds the model that best fits a data set by minimizing the sum of squared differences between the data points and the model’s predictions. The method of least squares provided the foundation for statistical learning methods such as linear regression and logistic regression as well as the development of artificial neural network models in artificial intelligence (we will return to least squares, regression analysis, and neural networks in chapter 4).
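As a small worked sketch (ours, using NumPy, with made-up data points): fitting a line by least squares chooses the slope and intercept that minimize the sum of squared differences between the observed values and the line's predictions.

```python
import numpy as np

# Noisy observations of an underlying linear relationship (invented data).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Least squares: find the slope m and intercept c minimizing sum((y - (m*x + c))**2).
m, c = np.polyfit(x, y, deg=1)
sum_squared_error = np.sum((y - (m * x + c)) ** 2)

print(f"slope={m:.2f}, intercept={c:.2f}, sum of squared errors={sum_squared_error:.3f}")
```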

Between 1780 and 1820, around the same time that Laplace and Gauss were making their contributions to statistical learning, a Scottish engineer named William Playfair was inventing statistical graphics and laying the foundations for modern data visualization and exploratory data analysis. Playfair invented the line chart and area chart for time-series data, the bar chart to illustrate comparisons between quantities of different categories, and the pie chart to illustrate proportions within a set. The advantage of visualizing quantitative data is that it allows us to use our powerful visual abilities to summarize, compare, and interpret data. Admittedly, it is difficult to visualize large (many data points) or complex (many attributes) data sets, but data visualization is still an important part of data science. In particular, it is useful in helping data scientists explore and understand the data they are working with. Visualizations can also be useful to communicate the results of a data science project. Since Playfair’s time, the variety of data-visualization graphics has steadily grown, and today there is ongoing research into the development of novel approaches to visualize large, multidimensional data sets. A recent development is the t-distributed stochastic neighbor embedding (t-SNE) algorithm, which is a useful technique for reducing high-dimensional data down to two or three dimensions, thereby facilitating the visualization of those data.
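A minimal sketch of the kind of use described here, assuming scikit-learn is available (the data are random and purely illustrative): t-SNE maps a high-dimensional data set down to two dimensions that can then be plotted.

```python
import numpy as np
from sklearn.manifold import TSNE

# 200 made-up points in 50 dimensions, standing in for a real high-dimensional data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Reduce to 2 dimensions for visualization; perplexity must be smaller than the sample count.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)  # (200, 2) -- one 2-D point per original data point
```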

The developments in probability theory and statistics continued into the twentieth century. Karl Pearson developed modern hypothesis testing, and R. A. Fisher developed statistical methods for multivariate analysis and introduced the idea of maximum likelihood estimation into statistical inference as a method to draw conclusions based on the relative probability of events. The work of Alan Turing in the Second World War led to the invention of the electronic computer, which had a dramatic impact on statistics because it enabled much more complex statistical calculations. Throughout the 1940s and subsequent decades, a number of important computational models were developed that are still widely used in data science. In 1943, Warren McCulloch and Walter Pitts proposed the first mathematical model of a neural network. In 1948, Claude Shannon published “A Mathematical Theory of Communication” and by doing so founded information theory. In 1951, Evelyn Fix and Joseph Hodges proposed a model for discriminatory analysis (what would now be called a classification or pattern-recognition problem) that became the basis for modern nearest-neighbor models. These postwar developments culminated in 1956 in the establishment of the field of artificial intelligence at a workshop at Dartmouth College. Even at this early stage in the development of artificial intelligence, the term machine learning was beginning to be used to describe programs that gave a computer the ability to learn from data. In the mid-1960s, three important contributions to ML were made. In 1965, Nils Nilsson’s book titled Learning Machines showed how neural networks could be used to learn linear models for classification. The following year, 1966, Earl B. Hunt, Janet Marin, and Philip J. Stone developed the concept-learning system framework, which was the progenitor of an important family of ML algorithms that induced decision-tree models from data in a top-down fashion. Around the same time, a number of independent researchers developed and published early versions of the k-means clustering algorithm, now the standard algorithm used for data (customer) segmentation.
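A minimal sketch of k-means as it might be used for the customer segmentation mentioned here, assuming scikit-learn (the two features and the data are invented): the algorithm groups the points into k clusters around learned centroids.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented customer data: [annual spend, number of purchases].
customers = np.array([
    [200, 3], [220, 4], [250, 5],       # low-spend customers
    [900, 20], [950, 22], [1000, 25],   # high-spend customers
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)

print(segments)                 # cluster label per customer, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # one centroid per segment
```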

The field of ML is at the core of modern data science because it provides algorithms that are able to automatically analyze large data sets to extract potentially interesting and useful patterns. Machine learning has continued to develop and innovate right up to the present day. Some of the most important developments include ensemble models, where predictions are made using a set (or committee) of models, with each model voting on each query, and deep-learning neural networks, which have multiple (i.e., more than three) layers of neurons. These deeper layers in the network are able to discover and learn complex attribute representations (composed of multiple, interacting input attributes that have been processed by earlier layers), which in turn enable the network to learn patterns that generalize across the input data. Because of their ability to learn complex attributes, deep-learning networks are particularly suitable to high-dimensional data and so have revolutionized a number of fields, including machine vision and natural-language processing.
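A hedged sketch of the committee idea behind ensemble models (ours, with scikit-learn and toy data): several different models are trained on the same data, each casts a vote on a new query, and the ensemble's prediction is the majority vote.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two features, two classes (invented for illustration).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 2], [3, 3]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A small committee of different model types, each trained on the same data.
models = [
    DecisionTreeClassifier(random_state=0).fit(X, y),
    LogisticRegression().fit(X, y),
    KNeighborsClassifier(n_neighbors=3).fit(X, y),
]

# Each model votes on the query point; the ensemble takes the majority vote.
query = np.array([[2.5, 2.5]])
votes = [int(model.predict(query)[0]) for model in models]
prediction = max(set(votes), key=votes.count)
print(votes, "->", prediction)
```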

As we discussed in our review of database history, the early 1970s marked the beginning of …
