E-commerce Platform - Data Pipelines with Debezium and monitoring

What are these Data Pipelines

  • The e-commerce platform's analytics tasks rely on a Postgres database, so we run data pipelines that sync data from MongoDB and MySQL into Postgres.
  • The MongoDB-to-Postgres pipeline was built with MongoDB Change Streams.
  • The MySQL-to-Postgres pipeline was built with Debezium.

Why we did this

  • We needed to ensure that these data pipelines stay in sync and functional.
  • We needed a change data capture (CDC) solution to stream MySQL changes to Postgres.

My contributions to this project

  • Implemented a custom Prometheus exporter in Node.js that collects the latest timestamp from a specific MongoDB collection and the latest timestamp of the records written to Postgres.

  • Built a dashboard and alerts for the data pipeline with Prometheus and Grafana. An alert fires once the data in Postgres falls more than 5 minutes behind MongoDB.

  • Deployed Debezium on EKS to meet the MySQL CDC requirement.

  • Developed a custom Helm chart that makes it easier to add ServiceMonitor resources.

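    {{- /*
    Renders one ServiceMonitor per entry in .Values.serviceMonitors.
    Each entry may set: labels (optional map), namespaceSelector (list of
    namespace names), selector.matchLabels (map), and endpoints (list).
    */ -}}
    {{- range $serviceMonitorName, $ref := .Values.serviceMonitors }}
    ---
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: {{ $serviceMonitorName }}
      labels:
        release: prometheus-operator
      {{- if hasKey $ref "labels" }}
        {{- range $key, $value := $ref.labels }}
        {{ $key }}: {{ $value | quote }}
        {{- end }}
      {{- end }}
    spec:
      # Namespaces in which to look for matching Services
      namespaceSelector:
        matchNames:
        {{- range $namespace := $ref.namespaceSelector }}
          - {{ $namespace }}
        {{- end }}
      # Labels a Service must carry to be scraped
      selector:
        matchLabels:
          {{- range $key, $value := $ref.selector.matchLabels }}
          {{ $key }}: {{ $value | quote }}
          {{- end }}
      endpoints: {{- toYaml $ref.endpoints | nindent 4 }}
    {{- end }}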
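
The custom exporter described in the contributions above boils down to comparing two timestamps and publishing the difference as a gauge. The following is a minimal Node.js sketch of that idea, not the original implementation: the metric name `pipeline_replication_lag_seconds`, the `pipeline` label, and the function names are hypothetical, and fixed timestamps stand in for the MongoDB and Postgres queries.

```javascript
// Hypothetical metric name; the original exporter's metric names are not given.
const METRIC_NAME = 'pipeline_replication_lag_seconds';

/**
 * Returns how many seconds the sink (Postgres) trails the source (MongoDB).
 * Clamped at zero so clock skew never produces a negative lag.
 */
function computeLagSeconds(sourceLatest, sinkLatest) {
  const lagMs = sourceLatest.getTime() - sinkLatest.getTime();
  return Math.max(0, lagMs / 1000);
}

/**
 * Renders a single gauge sample in the Prometheus text exposition format.
 */
function renderMetrics(pipeline, lagSeconds) {
  return [
    `# HELP ${METRIC_NAME} Seconds the Postgres copy trails the source.`,
    `# TYPE ${METRIC_NAME} gauge`,
    `${METRIC_NAME}{pipeline="${pipeline}"} ${lagSeconds}`,
    '',
  ].join('\n');
}

// Fixed timestamps stand in for "latest record" queries against each database.
const sourceLatest = new Date('2024-01-01T00:10:00Z');
const sinkLatest = new Date('2024-01-01T00:03:30Z');
const lag = computeLagSeconds(sourceLatest, sinkLatest); // 390 seconds
console.log(renderMetrics('mongodb_to_postgres', lag));
```

Served on a `/metrics` endpoint, a gauge like this can be scraped by Prometheus, and an alert on values above 300 seconds would correspond to the 5-minute threshold mentioned above.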
    

Architecture

[Figure: Data-Pipeline-and-Monitoring]

A high-level view of the solution.

Result

[Figure: alert]

We can monitor the delay of the data pipeline in near real time.