CDC as a magical driver for new use cases

Vladimir Elvov
7 min read · Nov 9, 2019

In the first post of this series on how to build a data platform, Change Data Capture (CDC) was introduced as a core part of the platform’s architecture. CDC provides a non-intrusive way of retrieving data in real time from the databases of operational systems without overloading them. The idea is not to read the data directly from the tables via JDBC, but to read it from the write-ahead log of the system’s database. Several vendors offer log-based CDC: Oracle GoldenGate, Attunity, StreamSets or Striim. The solutions described here use Striim. Striim retrieves data from CRM and billing systems and sends it to Kafka topics, from where it gets transferred to a staging zone on Kudu via Spark Streaming jobs. This last post focuses on why exactly this mechanism is so important to various use cases, including GDPR compliance and Targeted Ads.
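To make the Kafka-to-Kudu leg of this pipeline a bit more concrete, here is a minimal sketch using Spark Structured Streaming. It is only an illustration under assumptions: broker, topic, Kudu master and table names as well as the CDC payload schema are made up, and the Striim side feeding Kafka is omitted entirely.

```python
# Sketch: consume CDC events from Kafka and append them to a Kudu staging table.
# All connection details, names and the payload schema are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("cdc-to-kudu-staging").getOrCreate()

# Assumed shape of the CDC event as produced into Kafka
cdc_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("operation", StringType()),    # e.g. INSERT / UPDATE / DELETE
    StructField("payload", StringType()),      # changed row serialized as JSON
    StructField("change_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker-1:9092")
       .option("subscribe", "crm.customer.changes")
       .option("startingOffsets", "latest")
       .load())

events = (raw
          .select(from_json(col("value").cast("string"), cdc_schema).alias("e"))
          .select("e.*"))

def write_to_kudu(batch_df, batch_id):
    # The Kudu connector writes per micro-batch; the target table must already exist.
    (batch_df.write
     .format("org.apache.kudu.spark.kudu")
     .option("kudu.master", "kudu-master-1:7051")
     .option("kudu.table", "staging.crm_customer_changes")
     .mode("append")
     .save())

query = (events.writeStream
         .foreachBatch(write_to_kudu)
         .option("checkpointLocation", "hdfs:///checkpoints/cdc_crm_customers")
         .start())
query.awaitTermination()
```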

GDPR Compliance

In May 2018 the General Data Protection Regulation (GDPR) came into force. Since then, companies across the EU have been obliged to comply with a number of regulations when processing personal data. One of these requirements is to delete a customer’s data from an enterprise’s systems if the customer ends the relationship with the company, e.g. cancels a contract or subscription.

However, in most cases the data cannot simply be deleted. First it has to be established whether the customer and the company still have obligations towards each other, for example unsettled bills. Determining such obligations requires complex analytical queries across the various tables containing this data. Executing these queries directly on the systems of record is a certain way to kill these systems, because it combines transactional and analytical workloads on the same database. Utilizing a data platform that offers the analytical capabilities required by this transactional use case is a much better approach.
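As a rough illustration of what such an obligation check could look like once the CDC pipeline has staged the relevant tables in Kudu, here is a minimal Spark sketch. Table and column names are made up and the real rules are far richer, as described next.

```python
# Sketch: find customers who requested cancellation but still have unsettled invoices.
# Kudu master, table and column names are assumptions for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("gdpr-obligation-check").getOrCreate()

def read_kudu(table):
    return (spark.read
            .format("org.apache.kudu.spark.kudu")
            .option("kudu.master", "kudu-master-1:7051")
            .option("kudu.table", table)
            .load())

customers = read_kudu("staging.crm_customers")
invoices = read_kudu("staging.billing_invoices")

open_obligations = (customers
    .filter(col("cancellation_requested") == True)
    .join(invoices, "customer_id", "left")
    .filter(col("invoice_status") != "PAID")
    .groupBy("customer_id")
    .count()
    .withColumnRenamed("count", "open_invoices"))

# Customers appearing here cannot be deleted yet; in the real flow the result
# would be sent back to the systems of record to drive the GDPR process.
open_obligations.show()
```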

For calculating these obligations, a set of business rules is defined. These rules are stored in Elasticsearch and drive the obligation and status calculations, which are executed via Spark jobs. The results of these calculations are sent back to the systems of record, so that the Data Hub acts as a proper hub orchestrating the entire GDPR process across the various systems of record. One question remains: how does the data land on the data platform for the calculation jobs to work on? The answer is the CDC pipeline that continuously pumps data from the systems of record into a staging layer in Kudu. This is a simplified, high-level overview of the architecture for this use case:

It is important to mention that the status calculation jobs are executed in a scheduled batch mode; they are not running every few seconds. So why use streaming and CDC here? Because it is easier to handle a continuous stream of change events than to move batches of data and calculate deltas. Also, existing components that serve other use cases can be reused. But before looking at further CDC-powered use cases, let’s have a look at another aspect of GDPR compliance.

Some customer data is required to be archived for 10 years after deletion from the systems of record, with only seldom queries for audits. So, once the status calculation jobs have notified the systems of record that all obligations are settled, deletion can start there and some data now needs to be archived. But where to put it for archival so that it is not accessible except for the audits?

A possible solution is to use HDFS. In a cloud-native architecture, AWS Glacier or Google’s Coldline storage could offer the same features. On-prem, HDFS offers the following advantages (a small archival sketch follows the list):

· It is cheap storage even for large amounts of data.

· It can be queried using Hive tables.

· With encryption enabled, no one except authorized persons, not even Hadoop admins, can access the data. Thus, all compliance requirements are met.
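A minimal sketch of such an archival flow is shown below. It assumes an HDFS encryption zone has already been created by an admin, and all paths, key, database, table and column names are illustrative.

```python
# Sketch: archive data that must be kept for audits into an HDFS encryption zone
# and expose it as an external Hive table for the rare audit queries.
# Assumed to have been done once by an HDFS admin:
#   hdfs crypto -createZone -keyName gdprArchiveKey -path /archive/gdpr
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gdpr-archive")
         .enableHiveSupport()
         .getOrCreate())

# Assumed staging table holding the rows selected for archival
to_archive = (spark.table("staging.customers_to_archive")
              .select("customer_id", "deleted_at", "archived_payload"))

# Write into the encrypted path; HDFS transparently encrypts the files there
(to_archive.write
 .mode("append")
 .parquet("hdfs:///archive/gdpr/customers"))

# Register the archived data as an external Hive table for audit queries
spark.sql("CREATE DATABASE IF NOT EXISTS archive")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS archive.customers_deleted
    (customer_id STRING, deleted_at TIMESTAMP, archived_payload STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///archive/gdpr/customers'
""")
```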

To draw a bottom line, this use case shows how various data-driven capabilities can be combined to implement a transactional compliance use case on a data platform.

Targeted Ads

Not only GDPR compliance requires troves of customer data and analytical capabilities; offering targeted ads does too. But before diving into technical details, why should a company from outside the advertising industry do targeted ads at all? This use case is a perfect example of how an entertainment company can leverage data-driven capabilities to create an entirely new revenue model. The idea is to offer targeted ads to enterprise customers based on segments into which customers can be divided based on their data. The segments are created by combining data from multiple sources: not only conventional CRM data, but also viewership data and external data. Using these sources, it is possible to understand customers’ interests and preferences. Any company can also use the very same approach to build its own system for running marketing campaigns that up- and cross-sell to its existing customers.

What does a conventional approach to creating customer segments look like? Normally the required data is retrieved from BI (which holds the data from the systems of record) and provisioned to the segmentation engine, where advertisement campaigns are prepared. The data flow looks like this:

Now the idea is to be able to match a customer to the correct segment based on their data. The problem is that a delta has to be calculated for each customer in order to find out which attributes have changed and which segments the customer belongs to now. This is a resource-heavy batch flow that requires full scans of the segmentation engine’s database for the entire customer base overnight. Such a solution is complex, does not scale and is slow. So how can this be done better?

The first idea is to use streaming instead of batch flows in order to eliminate the delta calculation. Instead, customer data changes are propagated via CDC in real time to a staging area on the data platform in Kudu tables, from where they are sent on to the segmentation engine. This keeps data provisioning for the segmentation engine lightweight. For the segmentation engine itself, Elasticsearch can be used as the database for the segments and Zoomdata as the visualization tool for segmenting customers. In addition, this solution does not require heavy scans on the segmentation engine’s database (Elasticsearch). Instead, a percolator is used. The percolator is part of Elasticsearch and assigns customers to the correct segments: it runs a set of predefined analytical queries to determine whether a customer still belongs to the current segment or now belongs to another one. These queries, however, are executed only for customers whose data has actually changed, which significantly reduces resource usage. The percolator knows which customers were affected by a change because this information is propagated via CDC to the Kudu staging area and from there to Elasticsearch via Spark jobs. Here’s a high-level overview of the architecture for Targeted Ads:

The CDC pipeline mentioned in the diagram is exactly the same as the one used for GDPR compliance. As this shows, a CDC pipeline that can move data in real time from the systems of record to the data platform acts as a fertilizer: once this data becomes available on the data platform, use cases start growing on it like mushrooms. It quickly becomes a cornerstone for a large number of critical, business-value-creating use cases.
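To make the percolation step described above more tangible, here is a minimal sketch using the Elasticsearch percolator. Index name, segment rules and customer attributes are made up; the index mapping is assumed to declare the query field with type "percolator" and to map the customer attributes used in the segment queries.

```python
# Sketch: segment definitions are stored as percolator queries, and an updated
# customer document (arriving via CDC) is matched against them.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-node-1:9200"])

# Register a segment once as a stored percolator query (illustrative rule)
es.index(index="customer-segments", id="heavy-sports-viewers", body={
    "query": {
        "bool": {
            "must": [
                {"term": {"favourite_genre": "sports"}},
                {"range": {"watch_hours_per_week": {"gte": 5}}}
            ]
        }
    }
})

# When CDC propagates a change for a customer, percolate the updated document
# to find out which segments the customer now belongs to.
changed_customer = {
    "customer_id": "42",
    "favourite_genre": "sports",
    "watch_hours_per_week": 7,
}

result = es.search(index="customer-segments", body={
    "query": {"percolate": {"field": "query", "document": changed_customer}}
})

matching_segments = [hit["_id"] for hit in result["hits"]["hits"]]
print(matching_segments)  # e.g. ['heavy-sports-viewers']
```

Only the customers whose data actually changed are percolated, which is what keeps this approach so much cheaper than nightly full scans.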

What’s next?

We hope that you liked this series of articles on how exactly to build a data platform and how to solve the architecture challenges of the use cases featured in it. In the future we will publish some insights on how to fix incidents with Kafka Connect and Spark Streaming, as well as an approach for migrating an analytical toolbox for Machine Learning from on-prem to the Google Cloud. Above all, we will describe how to build a modern CRM system and how to integrate it with the rest of your application landscape without breaking everything.

About the authors: Mark and Vladimir Elvov work in different functions in data architecture. New technologies are our core passion; using architecture to drive their adoption is our daily business. In this blog we write about everything related to Big Data and Cloud technologies: from high-level strategies via use cases down to what we’ve seen in the front-line trenches of production and daily operations work.
