Taking Data Analytics to the Next Level
Today everyone expects their needs to be served immediately. What is true in e-commerce and B2C also holds for business people who want to receive their reports on time. In established enterprises the reality is rather different. As businesses grow, they accumulate more data, both from the additional products they offer and from the new customers they win over. In most enterprises, however, the data warehouse infrastructure does not adapt to these developments. In most cases the data warehouse is based on relational databases running in on-prem datacentres. A huge number of ETL jobs and SQL scripts get executed against the data in these relational databases, and ad-hoc queries from data analysts add even more load to the already overstretched resources. Often this ends with exasperated business owners waiting for ages to receive their reports because ETL jobs are delayed by a lack of resources and non-existent scalability. Especially when something special boosts the business, such as Black Friday in e-commerce or a major sports event in the entertainment industry, data volumes surge and cause serious delays for reports. But even monthly billing cycles can bring a conventional data warehouse to its limits. This is a huge problem for any company that wants to be data-driven: at the crucial moments when data is needed for decision making, it simply is not available. Real-time analytics and reporting is as far away as a distant galaxy for companies that have to rely on a conventional data warehouse.
Cloud vendors and other software companies like Snowflake have realised how vast the business opportunity for a scalable data warehouse solution is. By design, the cloud offers features such as elastic scalability, by simply adding instances on demand, and high availability thanks to availability zones and multiple datacentres. Besides Snowflake, the three major cloud vendors AWS (Redshift), Microsoft (Azure SQL Data Warehouse) and Google (BigQuery) offer solutions. Each has its pros and cons, but BigQuery is one of the best cloud data warehouses since it comes as a fully managed service that scales out for every peak and back down afterwards. Customers pay for exactly the amount of data they store and the queries they run. With prices being rather low, BigQuery is a cost-effective solution, as there are no more hardware, maintenance or infrastructure operations costs. Both IT and business teams can now focus on what actually matters: making their business data-driven. So in this article we will focus on how to build a solution using BigQuery as the data warehouse.
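To give a feel for the pay-per-query model, here is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset and table names are hypothetical placeholders. A dry run first estimates how many bytes a query would scan, which is what BigQuery bills for.

```python
# pip install google-cloud-bigquery
from google.cloud import bigquery

# The client picks up credentials and the default project from the environment
# (e.g. GOOGLE_APPLICATION_CREDENTIALS); the project id here is a placeholder.
client = bigquery.Client(project="my-analytics-project")

query = """
    SELECT order_date, SUM(revenue) AS daily_revenue
    FROM `my-analytics-project.sales.orders`
    GROUP BY order_date
"""

# A dry run estimates the bytes that would be billed without executing the query.
dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
dry_run = client.query(query, job_config=dry_run_config)
print(f"This query would process {dry_run.total_bytes_processed / 1e9:.2f} GB")

# Running the same query for real returns the rows; billing is based on bytes scanned.
for row in client.query(query).result():
    print(row.order_date, row.daily_revenue)
```

There is no cluster to size or warm up: the query simply runs against whatever amount of data is stored, and costs follow usage.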
However, entering this brave new world is not easy. In most cases businesses need their old data in the new cloud data warehouse. Moreover, not all applications generating the raw data are migrated to the cloud from day one; in fact, some never will be. So, as usual when introducing new technologies, a migration is required. In addition, continuous data loading might be needed to integrate on-prem sources with the cloud data warehouse. The result is a hybrid cloud architecture. So how do we solve this architectural challenge?
Before looking at the answer to this question, let’s look at another group of people whose patience is put to the test by a traditional data analytics architecture: data scientists. To work their magic, i.e. unleash complex algorithms on troves of data, they of course require data. But relational databases do not scale, and real-time ML model execution is not really feasible if your data is at least 24 hours old. Without a distributed system, it can sometimes take two days just to prepare the data sets for ML model training.
Here, too, cloud vendors offer salvation by providing data scientists with the things they need: notebooks, fast query engines and distributed computing capabilities. Notebooks and fast query engines are of course also available on-prem from vendors like Cloudera, MapR or Databricks and are often easy to use. But when it comes to distributed computing, complex data transformations or real-time ML model scoring, data scientists often need to write additional code using a specific framework like Apache Spark. Google Cloud offers a solution called AI Platform, where a data scientist gets the power of distributed computing without having to know such a framework. AI Platform makes it possible to train and execute ML models in a distributed manner without writing any additional orchestration code: under the hood it provisions and manages the required compute, and data preparation can run as Apache Beam jobs on Dataflow. Data scientists only need to specify their model and the data it will use (e.g. from BigQuery) when configuring the execution of the job. This leads to lower time-to-market for ML projects and lets data scientists focus on data science instead of data engineering.
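As an illustration, here is a minimal sketch of submitting a training job to the AI Platform Training API from Python. The project id, bucket, trainer package and job arguments are hypothetical; the point is only that the data scientist describes the job and AI Platform takes care of provisioning and scaling the workers.

```python
# pip install google-api-python-client
from googleapiclient import discovery

project_id = "my-ml-project"  # placeholder project

job_spec = {
    "jobId": "churn_model_training_001",
    "trainingInput": {
        # AI Platform provisions and scales the workers; no cluster management needed.
        "scaleTier": "STANDARD_1",
        # Hypothetical trainer package uploaded to Cloud Storage.
        "packageUris": ["gs://my-ml-bucket/packages/trainer-0.1.tar.gz"],
        "pythonModule": "trainer.task",
        "region": "europe-west1",
        "runtimeVersion": "1.15",
        "pythonVersion": "3.7",
        # Arguments passed to the trainer, e.g. which BigQuery table to read from.
        "args": ["--training-table", "my-ml-project.analytics.training_data"],
    },
}

# Submit the job via the AI Platform Training and Prediction API.
ml = discovery.build("ml", "v1")
request = ml.projects().jobs().create(parent=f"projects/{project_id}", body=job_spec)
response = request.execute()
print(response)
```

Everything about the infrastructure is reduced to a scale tier and a region; the data scientist never touches Spark, YARN or any other distributed framework directly.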
In our case, the Data Science team wanted to leverage the power of GCP’s machine learning offerings, including AI Platform, and use BigQuery for preparing the data to train the ML models. So we needed to migrate our machine learning platform, which used Cloudera Impala as its query engine. In the old solution, data was first loaded from the Oracle data warehouse into Impala, where ETLs were applied. After these ETLs the data was ready for training in the Cloudera Data Science Workbench. Our high-level architecture looks as follows:
But how do we solve the integration between on-prem Cloudera and the cloud? Here we face the same question as everyone who wants to migrate their on-prem data warehouse to the cloud. Generally speaking, there are two options: custom-coding applications that move the data, or using a commercial data integration tool.
We were already using a data integration tool for Change Data Capture (CDC): Striim. Striim also offers connectors for retrieving data from Cloudera (e.g. Kudu, HBase or HDFS) and ingesting it into various GCP storage systems, including BigQuery. So we decided to use Striim for both the initial load and the continuous data integration pipeline. One advantage of this approach was that the initial load and the continuous integration could be handled in the same way, using the same data pipeline. Another was that although we currently only needed to provide daily data updates, Striim would also allow us to deliver data in real time should that requirement arise in the future. In fact, it is possible to retrieve data in real time from any relational on-prem database via Striim’s CDC connectors and load it in real time into BigQuery. Finally, Striim’s BigQuery connector is not only able to append data to a BigQuery table; it can also create an exact copy of the source table from the Oracle data warehouse by running BigQuery load jobs under the hood. All that is required is to configure the BigQuery connector accordingly, which takes just a few clicks.
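To show what such a load job looks like underneath, here is a minimal sketch using the google-cloud-bigquery Python client to load a CSV file from Cloud Storage into a table. The bucket, dataset and table names are hypothetical, and this is not how Striim itself is configured; it only illustrates the mechanism the connector relies on.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # placeholder project

# Hypothetical target table and staged source file.
table_id = "my-analytics-project.dwh_replica.customers"
source_uri = "gs://my-staging-bucket/exports/customers.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # let BigQuery infer the schema from the file
    # Replace the table content, i.e. keep an exact copy of the source table.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}")
```

Load jobs are free of charge in BigQuery and are well suited to bulk copies, which is why having the connector drive them for us was attractive.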
So Striim would retrieve data via an HDFS reader from the source on the Cloudera side. The data would be available in the form of CSV files. For Striim to be able to load the data into BigQuery, it first has to parse the CSV file. Afterwards, the content of the CSV file is mapped to the actual table schema in BigQuery. This happens in the ContinuousQuery, by writing a small query in a SQL-like language. Finally, the output of this ContinuousQuery is sent to the BigQueryWriter, where the target table in BigQuery is specified. This is what the flow inside Striim looks like:
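To make the parse-and-map step in this flow more concrete, here is a minimal Python sketch of what the ContinuousQuery accomplishes conceptually. The file name, column list and target table are hypothetical; in Striim the mapping is expressed in the SQL-like ContinuousQuery language rather than in code.

```python
import csv
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # placeholder project
table_id = "my-analytics-project.dwh_replica.orders"      # hypothetical target table

# Hypothetical mapping from CSV column position to BigQuery column name.
columns = ["order_id", "customer_id", "order_date", "revenue"]

rows = []
with open("orders_export.csv", newline="") as f:  # file delivered from HDFS
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for record in reader:
        # Map positional CSV fields to the named columns of the table schema.
        row = dict(zip(columns, record))
        # Cast values to the types the (hypothetical) schema expects.
        row["revenue"] = float(row["revenue"])
        rows.append(row)

# Write the mapped rows into BigQuery; the returned list is empty on success.
errors = client.insert_rows_json(table_id, rows)
print("insert errors:", errors)
```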
Now this looks pretty simple, doesn’t it? In fact, this approach works perfectly fine. Things get very messy, however, when one has to deal with about 50 tables, most of which have more than 50 columns and some of which reach 300 columns. Dealing with the schema for such tables is a real headache. If the data came straight from a database, Striim would require no further mappings. In our use case, however, the data comes from a file that has to be parsed and its content mapped to the BigQuery table schema, which causes messy schema issues. To find out how to avoid them, check out our next story!
About the Authors: Mark and Vladimir Elvov work in different functions in Data Architecture. New technologies are our core passion, and using architecture to drive their adoption is our daily business. In this blog we write about everything related to Big Data and Cloud technologies: from high-level strategies and use cases down to what we’ve seen in the front-line trenches of production and daily operations work.