P&G Data Mart

P&G Data Mart

Problem Statement:

Build a highly scalable data mart (Terabyte scale) for one of the CPG giant for various analytics use cases related to loyalty management programs (Dashboards and Statistical models).

info-vs-oil

Summary:

Pull data from multiple diverse data sources, including

  1. The client’s own Loyalty Management Platform- Reward data which is Azure SQLServer database
  2. Customer Portal database shared as weekly extracts over SFTP
  3. App transaction data over SFTP
  4. Advertisement data from Amazon S3 bucket
  5. User activity from Google Analytics

and co-located in Amazon Redshift database.

Solution:

Our 2-step approach for this ETL pipeline is:

  1. Create two schema on Amazon Redshift
    1. Staging – to store data pulled out from various sources.
    2. Datamart – to store the processed data from staging schema
  2. Create Spark Job
    1. Implement Spark connectors to pull data from all the above sources and load it into the staging schema
    2. Implement Spark job to transform the above data as per the business logic
    3. Load the data into data mart (Amazon Redshift)

Techniques Used:Spark MapReduce, Spark SQL, Amazon EMR, Amazon Redshift, Amazon Lambda, Scala

Role:

  1. Designed and optimized the data mart schema, a star schema, using Amazon Redshift.
  2. Setup the Big Data cluster on Amazon cloud – EMR cluster.
  3. Architected and developed the data pipeline using Big Data tech stack – Spark MapReduce, Spark SQL, Scala and AWS Lambda script.
  4. Validating the data mart against the source data sets and business rules.

This program provides a strong foundation in data analytics by bringing a diverse body of knowledge starting from exploratory data analysis (EDA), applied statistics, applied mathematics, computer science, optimization, consumer behavior and decision theory.