PySpark on Google Colab 101

A Beginner’s Hands-on Guide to PySpark with Google Colab

Garvit Arya
Towards Data Science


Photo by Chris Ried on Unsplash

Apache Spark is a lightning-fast framework for processing large-scale data sets. It can distribute data processing tasks across multiple machines, either on its own or in tandem with other distributed computing tools.

PySpark is the interface that gives access to Spark from the Python programming language. It is an API that lets you write Spark applications in Python style, although the underlying execution model is the same across all of Spark's language APIs.

Colab by Google is an incredibly powerful tool based on Jupyter Notebook. Since it runs on Google's servers, we don't need to install anything on our local system, be it Spark or a deep learning framework.

In this article, we will see how we can run PySpark in a Google Colaboratory notebook. We will also perform some basic data exploratory tasks common to most data science problems. So, let’s get cracking!

Note — I am assuming you are already familiar with the basics of Python, Spark, and Google Colab.

Setting up PySpark in Colab

Spark is written in the Scala programming language and requires the Java Virtual Machine (JVM) to run. Therefore, our first task is to install Java.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null

Next, we will download Apache Spark with Hadoop 2.7 and extract it.

Note — For this article, I am downloading Spark 3.1.2, which is currently the latest stable version. If this step fails, a newer release has probably replaced it, so check the latest version on the Apache Spark downloads page and use that instead.

!wget -q https://www-us.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz
!tar xf spark-3.1.2-bin-hadoop2.7.tgz

Now, it's time to set the environment variables so that Colab knows where Java and Spark are installed.

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop2.7"

Then we need to install and import the findspark library, which locates Spark on the system and makes it importable like a regular library.

!pip install -q findspark
import findspark
findspark.init()

Now, we can import SparkSession from pyspark.sql and create a SparkSession, which is the entry point to Spark.

from pyspark.sql import SparkSession
spark = SparkSession.builder\
.master("local")\
.appName("Colab")\
.config('spark.ui.port', '4050')\
.getOrCreate()
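If you want to double-check that the session is up, you can print its version, a quick optional sanity check:

# Optional sanity check: print the Spark version the session is running on
print(spark.version)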

That’s it! Now let’s get started with PySpark!

Photo by Mike van den Bos on Unsplash

Loading data into PySpark

Spark has a variety of modules to read data in different formats. It can also infer the data type of each column automatically, although it needs to pass over the data once to do so.
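For instance, if your data lived in a CSV file instead of JSON, a minimal sketch of the read call could look like this (the file path and the df_csv name are purely illustrative, not part of this tutorial's dataset):

# Illustrative only: read a hypothetical CSV file
# header=True treats the first row as column names;
# inferSchema=True makes Spark scan the data once to guess each column's type
df_csv = spark.read.csv("/tmp/example.csv", header=True, inferSchema=True)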

For this article, I have created a sample JSON dataset on GitHub. You can download the file directly into Colab using the wget command like this:

!wget --continue https://raw.githubusercontent.com/GarvitArya/pyspark-demo/main/sample_books.json -O /tmp/sample_books.json

Now read this file into a Spark dataframe using the read module.

df = spark.read.json("/tmp/sample_books.json")

It is now time to use the PySpark dataframe functions to explore our data.

Exploratory Data Analysis with PySpark

Let’s check out its Schema:

Before doing any slicing and dicing of the dataset, we should first be aware of all the columns it has and their data types.

df.printSchema()
Sample Output:
root
|-- author: string (nullable = true)
|-- edition: string (nullable = true)
|-- price: double (nullable = true)
|-- title: string (nullable = true)
|-- year_written: long (nullable = true)
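If the inferred schema ever needs adjusting, you can cast a column explicitly. A small sketch, not required for this dataset (the df_casted name is just for illustration):

from pyspark.sql.functions import col

# Illustrative only: cast year_written from long to integer
df_casted = df.withColumn("year_written", col("year_written").cast("int"))
df_casted.printSchema()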

Show me some Samples:

df.show(4, False)
Sample Output:
+----------------+---------------+-----+--------------+------------+
|author |edition |price|title |year_written|
+----------------+---------------+-----+--------------+------------+
|Tolstoy, Leo |Penguin |12.7 |War and Peace |1865 |
|Tolstoy, Leo |Penguin |13.5 |Anna Karenina |1875 |
|Woolf, Virginia |Harcourt Brace |25.0 |Mrs. Dalloway |1925 |
|Dickens, Charles|Random House |5.75 |Bleak House |1870 |
+----------------+---------------+-----+--------------+------------+

How big is the Dataset:

df.count()
Sample Output:
13

Select a few columns of interest:

df.select("title", "price", "year_written").show(5)
Sample Output:
+----------------+-----+------------+
| title|price|year_written|
+----------------+-----+------------+
|Northanger Abbey| 18.2| 1814|
| War and Peace| 12.7| 1865|
| Anna Karenina| 13.5| 1875|
| Mrs. Dalloway| 25.0| 1925|
| The Hours|12.35| 1999|
+----------------+-----+------------+
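The same selection can also be written with column objects, which lets you rename columns on the fly with alias(). A small variation on the query above (the price_usd name is just for illustration):

from pyspark.sql.functions import col

# Same selection, renaming price to price_usd for readability
df.select(col("title"), col("price").alias("price_usd"), "year_written").show(5)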

Filter the Dataset:

# Get books that are written after 1950 & cost greater than $10
df_filtered = df.filter("year_written > 1950 AND price > 10 AND title IS NOT NULL")
df_filtered.select("title", "price", "year_written").show(50, False)
Sample Output:
+-----------------------------+-----+------------+
|title |price|year_written|
+-----------------------------+-----+------------+
|The Hours |12.35|1999 |
|Harry Potter |19.95|2000 |
|One Hundred Years of Solitude|14.0 |1967 |
+-----------------------------+-----+------------+
# Get books that have Harry Potter in their title
df_filtered.select("title", "year_written").filter("title LIKE '%Harry Potter%'").distinct().show(20, False)
Sample Output:
+------------+------------+
|title |year_written|
+------------+------------+
|Harry Potter|2000 |
+------------+------------+
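If you prefer column expressions over SQL strings, the first filter above can be rewritten like this (df_filtered_alt is just an illustrative name; the result is the same):

from pyspark.sql.functions import col

# Same filter as above, expressed with column objects instead of a SQL string
df_filtered_alt = df.filter(
    (col("year_written") > 1950) & (col("price") > 10) & col("title").isNotNull()
)
df_filtered_alt.select("title", "price", "year_written").show(50, False)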

Using PySpark SQL functions:

from pyspark.sql.functions import max

# Find the costliest book
maxValue = df_filtered.agg(max("price")).collect()[0][0]
print("maxValue: ",maxValue)
df_filtered.select("title", "price").filter(df.price == maxValue).show(20, False)
Sample Output:
maxValue: 29.0
+-----------------------------+------+
|title |price |
+-----------------------------+------+
|A Room of One's Own |29.0 |
+-----------------------------+------+
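As an aside, an alternative to collecting the max value and filtering on it is to sort by price in descending order and take the top row, which finds the same book in a single step:

from pyspark.sql.functions import desc

# Alternative approach: sort by price descending and show the most expensive book
df_filtered.select("title", "price").orderBy(desc("price")).show(1, False)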

End Notes

I hope you enjoyed working with PySpark in Colab as much as I did in writing this article! You can find the complete working Colab notebook in my GitHub repository at https://github.com/GarvitArya/pyspark-demo.

▶️ Please feel free to ask any questions/doubts or share any suggestions in the comments below.

▶️ If you like this article then please consider following me & sharing it with your friends too :)

▶️ You can reach out to me on LinkedIn | Twitter | GitHub | Instagram | Facebook (Practically everywhere :P)

Photo by Courtney Hedger on Unsplash


I am a Data Sherpa: I convert data into insights by day and spend my nights exploring & learning new technologies!