Analyzing Banking Data with PySpark

Aditya Daria
5 min read · Jan 27, 2023

PySpark is the alchemist of big data, turning vast amounts of information into actionable insights and knowledge.

PySpark’s ability to handle large datasets makes it a valuable tool for data processing and analysis in the banking industry. In this project, we will utilize PySpark to analyze banking transactions and gain insights into customer behavior. From data cleaning to feature engineering and model building, we will explore the various capabilities of PySpark and how it can be applied to real-world problems in the banking sector.

Introduction

In this project, we will be using PySpark, a powerful Python library for big data processing, to analyze a dataset from a bank. The goal of this project is to gain insights into the banking industry by cleaning, transforming, and analyzing the data using PySpark.

Data

The dataset used in this project is a real-world dataset from a bank. It contains information such as customer demographics, account balances, and transaction history. The dataset is stored in a CSV file and will be loaded into a PySpark DataFrame for analysis.
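As a quick illustration, the CSV can be loaded and inspected as follows; the file path is a placeholder for the actual dataset location:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BankingProject").getOrCreate()

# Load the CSV into a DataFrame, inferring column types from the data
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Inspect the inferred schema and a few rows before cleaning
df.printSchema()
df.show(5)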

Data Cleaning

The first step in this project is to clean the data. This includes handling missing values, removing duplicate records, and ensuring that the data is in a format that can be easily analyzed. We will use PySpark’s built-in functions for data cleaning, such as dropna(), dropDuplicates(), and cast().
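A minimal sketch of this step is shown below; the column names (age, balance) are assumptions and would depend on the actual dataset:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("BankingProject").getOrCreate()
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Drop rows with missing values in the columns the analysis relies on
df = df.dropna(subset=["age", "balance"])

# Remove exact duplicate records
df = df.dropDuplicates()

# Ensure numeric columns have the expected types
df = df.withColumn("age", col("age").cast("integer")) \
       .withColumn("balance", col("balance").cast("double"))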

Feature Engineering

Once the data is cleaned, we will move on to feature engineering. This step involves creating new features from the existing data that will be useful for our analysis. We will create new features such as average account balance, number of transactions, and customer age.
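The combined example later in the article derives avg_balance and customer_age directly from existing columns. If the transaction history is stored one row per transaction, per-customer aggregates such as the number of transactions can be built with groupBy() and joined back onto the customer table. The tiny DataFrames and column names below (customer_id, amount, balance, birth_year) are purely illustrative:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("BankingProject").getOrCreate()

# Tiny illustrative DataFrames; real data would come from the bank's CSV files
customers = spark.createDataFrame(
    [(1, 1985), (2, 1990)], ["customer_id", "birth_year"]
)
transactions = spark.createDataFrame(
    [(1, 120.0, 500.0), (1, 40.0, 460.0), (2, 75.0, 900.0)],
    ["customer_id", "amount", "balance"],
)

# Aggregate the transaction history into per-customer features
per_customer = transactions.groupBy("customer_id").agg(
    F.count("*").alias("num_transactions"),  # number of transactions
    F.avg("balance").alias("avg_balance"),   # average account balance
)

# Derive customer age (reference year hardcoded here for illustration)
# and join the aggregated features back onto the customer table
features = (
    customers
    .withColumn("customer_age", F.lit(2023) - F.col("birth_year"))
    .join(per_customer, on="customer_id", how="left")
)
features.show()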

Exploratory Data Analysis

After the data is cleaned and feature engineered, we will perform exploratory data analysis (EDA) to gain a better understanding of the data. We will use PySpark’s built-in functions for EDA, such as describe(), groupBy(), and count(). We will also use matplotlib, a Python plotting library (not part of PySpark itself), to visualize aggregated results, as sketched below.
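Since matplotlib is a separate Python library, the usual pattern is to aggregate in Spark and convert the small result to pandas before plotting. The DataFrame below is only a stand-in for the cleaned banking data, so the column names are illustrative:

from pyspark.sql import SparkSession
import matplotlib.pyplot as plt

spark = SparkSession.builder.appName("BankingProject").getOrCreate()

# Illustrative stand-in for the cleaned banking DataFrame
df = spark.createDataFrame(
    [(1, 500.0, 0), (2, 900.0, 1), (3, 300.0, 0)],
    ["customer_id", "avg_balance", "churn"],
)

# Statistical summary of the numeric columns
df.describe().show()

# Aggregate in Spark, then bring the small result to pandas for plotting
churn_counts = df.groupBy("churn").count().toPandas()
churn_counts.plot(kind="bar", x="churn", y="count", legend=False)
plt.title("Customers by churn status")
plt.xlabel("churn")
plt.ylabel("count")
plt.show()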

Model Building

Finally, we will use the cleaned and feature-engineered data to build a model. We will train and evaluate models using PySpark’s machine learning library, pyspark.ml, to predict customer churn. We will also use cross-validation and grid search to find the best parameters for our model.

# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create a SparkSession
spark = SparkSession.builder.appName("BankingProject").getOrCreate()

# Load the data into a DataFrame
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Data cleaning
df = df.dropna()
df = df.dropDuplicates()
df = df.withColumn("age", col("age").cast("integer"))
# The label must be numeric for LogisticRegression (assumes churn is stored as 0/1)
df = df.withColumn("churn", col("churn").cast("double"))

# Feature engineering
df = df.withColumn("avg_balance", (col("balance") / col("transactions")))
df = df.withColumn("customer_age", (col("year") - col("birth_year")))

# Exploratory Data Analysis
df.describe().show()
df.groupBy("churn").count().show()
df.select("avg_balance").show()

# Prepare data for model building
assembler = VectorAssembler(inputCols=["avg_balance", "customer_age"], outputCol="features")
data = assembler.transform(df)

# Split data into training and test sets (70/30), seeded for reproducibility
train, test = data.randomSplit([0.7, 0.3], seed=42)

# Build the model
lr = LogisticRegression(labelCol="churn")

# Set up the parameter grid for cross-validation
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.elasticNetParam, [0.5, 0.8]) \
    .build()

# Set up the evaluator
evaluator = BinaryClassificationEvaluator(labelCol="churn")

# Set up the cross-validator
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator)

# Train the model
model = cv.fit(train)

# Make predictions on the test set
predictions = model.transform(test)

# Evaluate the model (BinaryClassificationEvaluator reports area under the ROC curve by default)
print("Area under ROC: ", evaluator.evaluate(predictions))

The code above is a sample implementation of a PySpark project that aims to predict customer churn in a banking dataset using machine learning.

The first step is to import the necessary libraries: SparkSession, the SQL functions, VectorAssembler, LogisticRegression, BinaryClassificationEvaluator, ParamGridBuilder, and CrossValidator.

Then a SparkSession is created; this is the entry point for creating DataFrames and Datasets and working with the DataFrame API.

The data is loaded into a DataFrame from a CSV file using the spark.read.csv() method, with both the inferSchema and header options set to True.

Next, the data is cleaned by dropping missing values and duplicates, and the age and churn columns are cast to numeric types.

Then feature engineering is applied to the dataset: two new columns, avg_balance and customer_age, are added.

After that, exploratory data analysis (EDA) is performed to understand the data better: describe() gives a statistical summary, groupBy("churn").count().show() groups the data by churn, and select("avg_balance").show() inspects the avg_balance column.

Once EDA is done, the data is prepared for model building by creating a VectorAssembler object, which combines the input columns into a single feature vector column.

The data is then split into training and test sets in the ratio of 70:30 using the randomSplit() method.

The next step is building the model: a LogisticRegression object is created with labelCol set to "churn".

A parameter grid is created with ParamGridBuilder; it contains different combinations of the regularization parameter and the elastic net mixing parameter.

A BinaryClassificationEvaluator object is created to evaluate the model’s performance.

A CrossValidator object is created that takes the estimator, parameter grid, and evaluator as arguments; it is used to train the model and select the best parameter combination.

The model is trained using the fit() method on the training data.

Predictions are made on the test set using the transform() method, and the model’s area under the ROC curve is printed using the evaluate() method of the BinaryClassificationEvaluator.
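Continuing from the code above (model and predictions are the variables defined there), the best hyperparameters can be read off the fitted CrossValidatorModel, and plain accuracy, as opposed to the area under the ROC curve, can be computed with a MulticlassClassificationEvaluator:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# The CrossValidator keeps the best-performing fitted model
best_lr = model.bestModel
print("Best regParam:", best_lr.getRegParam())
print("Best elasticNetParam:", best_lr.getElasticNetParam())

# BinaryClassificationEvaluator reports area under the ROC curve by default;
# for plain accuracy, use a MulticlassClassificationEvaluator instead
accuracy_evaluator = MulticlassClassificationEvaluator(
    labelCol="churn", predictionCol="prediction", metricName="accuracy"
)
print("Accuracy:", accuracy_evaluator.evaluate(predictions))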

In conclusion, this project demonstrates how PySpark can be used to analyze banking transactions and predict customer churn. Using PySpark’s data processing capabilities and its machine learning library, we cleaned and prepared the data, performed exploratory data analysis, and trained a model to predict customer churn. It shows the potential of PySpark in the banking industry and how it can be used to extract valuable insights from large datasets, and it can serve as a reference for anyone who would like to apply PySpark to a banking dataset.

I hope that you will find this article insightful and informative. If you enjoyed it, please consider sharing the link with your friends, family, and colleagues. If you have any suggestions or feedback, please feel free to leave a comment. And if you’d like to stay updated on my future content, please consider following and subscribing using the provided link. Thank you for your support!
