Big Data is much more than simply a matter of size — it presents an opportunity to discover key insights and emerging trends in data, makes businesses more agile, board room decisions better informed, and answer questions that have previously been considered unanswerable. With all the hype around big data, insightful data is eventually most important to business.
At Dream11, with a growing user base of over 18 million users & ~3x YoY growth, running deep analytics & building useful insights from the data, started seeing scalability bottlenecks. The need for having a central & scalable data platform became unavoidable. This post is about how we overcame the scalability and architecture challenges.
1. Evolving/Changing Needs
The fast pace at which we were growing and the rising deep analytics requirements soon outdated our existing systems. Amazon Aurora is our choice of backend relational database and all our transactional data resides there. With a rapidly growing, current data volume of a few TBs, running deep data analytics was not possible anymore.
2. Fragmented Analytics
All our transactional data & user-activities event data is being maintained in 2 distinct platforms. While one is in-house, the other one is a store maintained by a third party tool, Segment. With 2 disconnected data platforms, there is no way to run joint analytics. For instance: we need to run a simple A/B test on our platform wherein we want to experiment with the colour of “Join Contest” button. The success metric is how many of our users end up doing a transaction after they click on the particular button. This needs access to transactional data and clickstream data that are on different and disjointed data platforms.
3. Real-time analytics at scale
The current data pipeline lacked streamlining. There was a huge opportunity here to enable & deliver real-time analytics.
Step 1: Identifying the Data Lake
To begin with, we needed to identify a Data Lake, where we could keep all type of data from all possible sources, in its raw and native format. This unstructured/semi-structured/structured data could then be refined and sourced into other data-stores based on the use-cases as well as business requirements.
Primary features we needed from the Data Lake:
We narrowed down our choice to S3.
Why S3 as Data Lake?
This is the broad platform overview we came up with:
Step 2: Identifying the Data Warehouse
Next, we needed a data-warehouse that would store hot/warm data, with at least these features
Redshift worked for us perfectly. Five key performance enablers for Redshift: MPP engine, columnar data storage, data compression, query optimisation, and compiled code. Columnar storage optimises query costs by increasing the effective storage capacity of the data nodes and improves performance by reducing I/O needed to process SQL requests. Redshift’s execution engine also transforms each query into compiled code, which then runs in parallel on each compute node.
Some other side benefits included:
An important benefit of our choices of Data Lake and Data Warehouse was that they worked really well with each other. With S3, data ingestion in Redshift can be massively parallelised, with each compute node directly working on data load, rather than being bottlenecked on leader node’s network I/O.
This is the Data Platform we have in place right now:
While the user-events data still needs to be handled and pipelined into the system, the transactional data becomes available in the Data Warehouse within 2 hours of occurrence.
Some important pointers:
Step 3: Identifying data pipeline building blocks
Aurora to S3: Confluent connectors
Confluent is a platform that improves Apache Kafka by expanding its integration capabilities, adding tools to optimise and manage Kafka clusters. Confluent Platform makes Kafka easier to build and easier to operate. For sourcing data from our Aurora cluster to S3 via Kafka, we are using Confluent Connectors.
Why Confluent Connector:
S3 to Redshift
For this part of the pipeline, we needed the ability of handling duplicates, as many transactional datasets are mutable in nature. We followed the route suggested by Redshift for this process:
a) Load all incremental data in a staging table
b. De-dupe that data
c) Delete all corresponding entries from the main table
d) Insert new records.
We have a multi-threaded script that incrementally loads multiple datasets, following the above process, using the Redshift Copy command for initial load.
Right now, we have a 1-to-1 mapping between our transactional tables & the ones in Redshift. In the long term, we plan to have denormalized view of these datasets, which makes more sense to the business & all end-users. Apart from this, we’re also evaluating having insert-only fact tables as our first layer.
Data pipelines are sensitive to schema changes & even issues as small as source/sink network volatility can bring them down. Planning & building a good monitoring system is as important as building the platform.
Schema changes are inevitable. Such scenarios can easily break our data pipeline and the dependent ETLs. Monitoring pipelines with Confluent Connectors just narrows down to playing with their standard REST APIs. Since Kafka Connect is tightly coupled with data schemas, any incompatible schema change breaks the pipeline. We have a monitoring system keeping a check on its “Status API” and generating alerts whenever a “failure” occurs. Once the consumer processes are changed for schema compatibility, the pipeline is restarted.
We’ve come a long way, but there’s still plenty to do. Watch this space as we keep sharing!