Responsibilities
- Build a centralized data lake on GCP data services by integrating diverse data sources from across the enterprise
- Develop, maintain, and optimize Spark-powered batch and streaming data processing pipelines (see the pipeline sketch after this list). Leverage GCP data services for complex data engineering tasks and ensure smooth integration with other platform components
- Design and implement data validation and quality checks to ensure the accuracy, completeness, and consistency of data as it flows through the pipelines
- Work with the Data Science and Machine Learning teams to support advanced analytics initiatives
- Collaborate with cross-functional teams, including data analysts, business users, and operations and marketing teams, to extract insights and value from data
- Collaborate with the product team to design, implement, and maintain the data models for analytical use cases
- Design, develop, and maintain data dashboards for various teams using Looker Studio
- Engage in technology exploration, research and development, and proofs of concept (POCs), and conduct deep investigations and troubleshooting
- Design and manage ETL/ELT processes, ensuring data integrity, availability, and performance
- Troubleshoot data issues and conduct root cause analysis when reporting data is in question
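Purely as an illustration of the Spark pipeline work described above (not an additional requirement of the role), below is a minimal PySpark Structured Streaming sketch. It assumes newline-delimited JSON events landing in a Cloud Storage bucket and writes windowed aggregates to BigQuery via the spark-bigquery connector; all bucket, dataset, and column names are illustrative placeholders.

    # Minimal sketch of a PySpark Structured Streaming pipeline on GCP.
    # Bucket, dataset, and schema names below are illustrative placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   TimestampType, DoubleType)

    spark = SparkSession.builder.appName("events-stream").getOrCreate()

    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_ts", TimestampType()),
    ])

    # Stream newline-delimited JSON files as they land in Cloud Storage.
    events = (spark.readStream
              .schema(event_schema)
              .json("gs://example-landing-bucket/events/"))

    # Windowed aggregation per event type, with a watermark for late data.
    agg = (events
           .withWatermark("event_ts", "10 minutes")
           .groupBy(F.window("event_ts", "5 minutes"), "event_type")
           .agg(F.count("*").alias("event_count"),
                F.sum("amount").alias("total_amount")))

    # Write each micro-batch to BigQuery using the spark-bigquery connector.
    query = (agg.writeStream
             .outputMode("update")
             .foreachBatch(lambda batch_df, batch_id: batch_df.write
                           .format("bigquery")
                           .option("table", "analytics.event_aggregates")
                           .option("temporaryGcsBucket", "example-temp-bucket")
                           .mode("append")
                           .save())
             .option("checkpointLocation", "gs://example-landing-bucket/checkpoints/events/")
             .start())

    query.awaitTermination()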
Required Technical Skills
- PySpark (batch and streaming)
- GCP: Dataproc, Dataflow, Datastream, Dataplex, Pub/Sub, BigQuery, and Cloud Storage
- NoSQL databases (preferably MongoDB)
- Programming languages: Scala/Python
- Great Expectations or a similar data quality (DQ) framework (see the validation sketch after this list)
- Familiarity with workflow management tools such as Airflow, Prefect, or Luigi
- Understanding of Data Governance, Data Warehousing and Data Modelling
- Good SQL knowledge
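To illustrate the kind of data validation and quality checks referenced above, here is a minimal hand-rolled sketch in plain PySpark (a framework such as Great Expectations would express similar expectations declaratively); the table and column names are illustrative placeholders.

    # Minimal sketch of data quality checks in plain PySpark.
    # A DQ framework such as Great Expectations would express similar
    # expectations declaratively; names below are illustrative placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dq-checks").getOrCreate()

    orders = spark.read.format("bigquery").option("table", "analytics.orders").load()
    total = orders.count()

    checks = {
        # Completeness: the primary key must never be null.
        "order_id_not_null": orders.filter(F.col("order_id").isNull()).count() == 0,
        # Uniqueness: exactly one row per order_id.
        "order_id_unique": orders.select("order_id").distinct().count() == total,
        # Validity: amounts must be non-negative.
        "amount_non_negative": orders.filter(F.col("amount") < 0).count() == 0,
        # Consistency: status values must come from a known set.
        "status_in_domain": orders.filter(
            ~F.col("status").isin("NEW", "PAID", "SHIPPED", "CANCELLED")).count() == 0,
    }

    failed = [name for name, passed in checks.items() if not passed]
    if failed:
        # Fail the pipeline run so bad data does not propagate downstream.
        raise ValueError(f"Data quality checks failed: {failed}")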