Spark is at the heart of today’s Big Data revolution, helping data professionals supercharge efficiency and performance in a wide range of data processing and analytics tasks. In this guide, Big Data expert Jeffrey Aven covers all you need to know to leverage Spark, together with its extensions, subprojects, and wider ecosystem.
Aven combines a language-agnostic introduction to foundational Spark concepts with extensive programming examples utilizing the popular and intuitive PySpark development environment. This guide’s focus on Python makes it accessible to a wide audience of data professionals, analysts, and developers, even those with little Hadoop or Spark experience.
Aven’s broad coverage ranges from basic to advanced Spark programming, and Spark SQL to machine learning. You’ll learn how to efficiently manage all forms of data with Spark: streaming, structured, semi-structured, and unstructured. Throughout, concise topic overviews quickly get you up to speed, and extensive hands-on exercises prepare you to solve real problems.
Coverage includes:
• Understand Spark’s evolving role in the Big Data and Hadoop ecosystems
• Create Spark clusters using various deployment modes
• Control and optimize the operation of Spark clusters and applications
• Master Spark Core RDD API programming techniques
• Extend, accelerate, and optimize Spark routines with advanced API platform constructs, including shared variables, RDD storage, and partitioning
• Efficiently integrate Spark with both SQL and nonrelational data stores
• Perform stream processing and messaging with Spark Streaming and Apache Kafka
• Implement predictive modeling with SparkR and Spark MLlib
Preface
Introduction
PART I: SPARK FOUNDATIONS
Chapter 1: Introducing Big Data, Hadoop, and Spark
    Introduction to Big Data, Distributed Computing, and Hadoop
    A Brief History of Big Data and Hadoop
    Hadoop Explained
    Introduction to Apache Spark
    Apache Spark Background
    Uses for Spark
    Programming Interfaces to Spark
    Submission Types for Spark Programs
    Input/Output Types for Spark Applications
    The Spark RDD
    Spark and Hadoop
    Functional Programming Using Python
    Data Structures Used in Functional Python Programming
    Python Object Serialization
    Python Functional Programming Basics
    Summary
Chapter 2: Deploying Spark
    Spark Deployment Modes
    Local Mode
    Spark Standalone
    Spark on YARN
    Spark on Mesos
    Preparing to Install Spark
    Getting Spark
    Installing Spark on Linux or Mac OS X
    Installing Spark on Windows
    Exploring the Spark Installation
    Deploying a Multi-Node Spark Standalone Cluster
    Deploying Spark in the Cloud
    Amazon Web Services (AWS)
    Google Cloud Platform (GCP)
    Databricks
    Summary
Chapter 3: Understanding the Spark Cluster Architecture
    Anatomy of a Spark Application
    Spark Driver
    Spark Workers and Executors
    The Spark Master and Cluster Manager
    Spark Applications Using the Standalone Scheduler
    Spark Applications Running on YARN
    Deployment Modes for Spark Applications Running on YARN
    Client Mode
    Cluster Mode
    Local Mode Revisited
    Summary
Chapter 4: Learning Spark Programming Basics
    Introduction to RDDs
    Loading Data into RDDs
    Creating an RDD from a File or Files
    Methods for Creating RDDs from a Text File or Files
    Creating an RDD from an Object File
    Creating an RDD from a Data Source
    Creating RDDs from JSON Files
    Creating an RDD Programmatically
    Operations on RDDs
    Key RDD Concepts
    Basic RDD Transformations
    Basic RDD Actions
    Transformations on PairRDDs
    MapReduce and Word Count Exercise
    Join Transformations
    Joining Datasets in Spark
    Transformations on Sets
    Transformations on Numeric RDDs
    Summary
PART II: BEYOND THE BASICS
Chapter 5: Advanced Programming Using the Spark Core API
    Shared Variables in Spark
    Broadcast Variables
    Accumulators
    Exercise: Using Broadcast Variables and Accumulators
    Partitioning Data in Spark
    Partitioning Overview
    Controlling Partitions
    Repartitioning Functions
    Partition-Specific or Partition-Aware API Methods
    RDD Storage Options
    RDD Lineage Revisited
    RDD Storage Options
    RDD Caching
    Persisting RDDs
    Choosing When to Persist or Cache RDDs
    Checkpointing RDDs
    Exercise: Checkpointing RDDs
    Processing RDDs with External Programs
    Data Sampling with Spark
    Understanding Spark Application and Cluster Configuration
    Spark Environment Variables
    Spark Configuration Properties
    Optimizing Spark
    Filter Early, Filter Often
    Optimizing Associative Operations
    Understanding the Impact of Functions and Closures
    Considerations for Collecting Data
    Configuration Parameters for Tuning and Optimizing Applications
    Avoiding Inefficient Partitioning
    Diagnosing Application Performance Issues
    Summary
Chapter 6: SQL and NoSQL Programming with Spark
    Introduction to Spark SQL
    Introduction to Hive
    Spark SQL Architecture
    Getting Started with DataFrames
    Using DataFrames
    Caching, Persisting, and Repartitioning DataFrames
    Saving DataFrame Output
    Accessing Spark SQL
    Exercise: Using Spark SQL
    Using Spark with NoSQL Systems
    Introduction to NoSQL
    Using Spark with HBase
    Exercise: Using Spark with HBase
    Using Spark with Cassandra
    Using Spark with DynamoDB
    Other NoSQL Platforms
    Summary
Chapter 7: Stream Processing and Messaging Using Spark
    Introducing Spark Streaming
    Spark Streaming Architecture
    Introduction to DStreams
    Exercise: Getting Started with Spark Streaming
    State Operations
    Sliding Window Operations
    Structured Streaming
    Structured Streaming Data Sources
    Structured Streaming Data Sinks
    Output Modes
    Structured Streaming Operations
    Using Spark with Messaging Platforms
    Apache Kafka
    Exercise: Using Spark with Kafka
    Amazon Kinesis
    Summary
Chapter 8: Introduction to Data Science and Machine Learning Using Spark
    Spark and R
    Introduction to R
    Using Spark with R
    Exercise: Using RStudio with SparkR
    Machine Learning with Spark
    Machine Learning Primer
    Machine Learning Using Spark MLlib
    Exercise: Implementing a Recommender Using Spark MLlib
    Machine Learning Using Spark ML
    Using Notebooks with Spark
    Using Jupyter (IPython) Notebooks with Spark
    Using Apache Zeppelin Notebooks with Spark
    Summary
Index