Making advanced analytics accessible to more companies

Márton Kodok
Google Developer Expert & Senior Software Architect @ REEA

PROGRAMMING

The great change that has been taking place in the business landscape and, in fact, in every company, no matter how far from tech, is the fact that everything is becoming a software company, and, by extension, a data company. Companies often don't need a complex architecture and an engineering team right away, in order to start making advanced analytics for their business intelligence (BI).
When you are unsure about how to proceed when running some advanced analytics, without spending additional effort on strategy, architecture, engineering team, storage costs, infrastructure - you are convinced that your organization needs a new prospect. This is where Middlewares and cloud data warehouse solutions, such as Google BigQuery, come into play.

In our work with dozens of companies, we have found that three principles are required to explore and run advanced analytics, and have BI insights. First, companies must be able to identify, combine, and manage multiple data sources. Second, they need the capability to build advanced analytics using concepts they are familiar with. Third, one of the most critical aspects, they need to deploy the right technology architecture that matches their capabilities.

The Challenge:

On the one hand, we want to run a backend that will scale to terabytes, support deep analytics like large, multi-source, complex, unstructured data, with minimal engineering efforts. On other hand, the system should be cost effective and have a simple query language (SQL or Javascript prefered). We will discuss how Google BigQuery solves this problem and helps us reach the ultimate goal: going from small data to Bigdata, and storing everything accessible via SQL.

BigQuery is a data warehouse service, managed on the Google Cloud platform. It is designed for Online Analytical Processing (OLAP), where we can ask sophisticated queries. It has great query performance, low maintenance and it is easy to wire into existing platforms. It suits all businesses from small to emerging startups, as it doesn't need complex architecture. Moreover, it matches the everyday developer skills: SQL, REST, ODBC, streaming. It is event-based, infinitely scalable into petabytes, it runs on Google's infrastructure, it requires no provisioning, and it never runs out of resources.

From small data to BigData:

Nowadays, in the era of abundant storage, we can afford keeping the same data in multiple places. Moreover, software engineers decouple services in smaller pieces, called microservices. The hardest part about microservices is data, as these systems own their copy of data, they are not ready for cross queries, and we are not able to run combined analytics.

Our first principle tells that we first need to be able to identify, combine, and manage multiple sources of data.

When we talk about identifying the proper data, we often just look at our persistent database schema, and we don't realize the potential lying in all the numbers and data we use at runtime. As a result, we never capture and store the underlying insight. The data might be available at script-level, for the decision-making process, but it might not be available for analytics just yet.
We should combine all the persistent data with the metrics we obtain from runtime logic in a form of event-based stream. Then, the result should be shipped to the backend for later inspection. This gives us BigData, a product characterized by high velocity, volume, and variety. Suddenly, we realize we need a backend capable to store large amounts of data, a big stream of data, and other unstructured forms, to achieve business intelligence.

We earlier mentioned we are considering Google BigQuery for the backend, as it is a managed service, and we don't need to consider equipment strategy. There are serious budget consequences, if we need to run in-house equipment and purchase our own hardware to store terabytes of data. Managing our own hosted solution, such as the ElasticSearch/MongoDB cluster, has its own engineering efforts, as it needs sysadmins on standby for administrative maintenance, as well as keeping the cluster up and running for the increased demand. From our own experience, we know that nodes on EL/MongoDb cluster have finite capacity.

Architecting for the Cloud:

When having data in multiple subsystems, we need a unified data collector. FluentD and Kafka Stream Processing are open-source components, which ship data to various output endpoints. Both have lots of connector libraries. We will use FluentD as a middleware. It is easy to install, it has simple configuration file, and it is easy to get started, as we just need to send data to input-defined entities such as a TCP endpoint. The event router takes the data and multiples it to different outputs. Having such a component in our architecture allows us to copy data outside our stack as well, to platforms such as BigQuery, or to cold archives like Google Storage or Amazon S3.

We need to make a small change in our application to send JSON events to FluentD, which adds a CURL call with the right JSON structure. The data that will be further routed to various outputs depends on the template configuration.

On Google BigQuery, we can insert data in two ways, either by a load job using a large file of up to 5TB, or by a streaming insert. The streaming insert endpoint standard limit is 100.000 rows/second. Just by looking at the above two numbers we realize BigQuery is suitable for a petabyte scale solution and for enterprise data warehouse as well, so if our startup outperforms our projected growth, we will have a long term analytics backend for our data.
FluentD has a BigQuery plugin that we can use and our data will be streamed to a BigQuery table. BigQuery has a familiar DB structure: tables, views, records, nested, JSON. Query language is based on the SQL 2011 standard, so it has all the fancy modern SQL functions: window functions, statements etc., which means that even a small company will find talent that can work with BigQuery using SQL, without a BigData engineering team.

FluentD configuration

We include here a sample of the FluentD router to showcase 3 actions it can do. The first block is about a module record_transformer where we can transform our input by using Ruby.

<filter frontend.user.*>
  @type record_transformer
  enable_ruby
  remove_keys host
  <record>
    bq {"insert_id":"${uid}","host":"${host}",
     "created":"${time.to_i}"}
  </record>
</filter>

In our example, we remove `host` from the JSON, and then we add a `bq` key to our JSON, having a new structure. We can fully leverage the simplicity and flexibility of this module to transform our records. We often want our data modified to fulfill the business or database schema logic, and we do not want this to go in the application. We may want to transform some datetime, some numbers, string operations etc. We used this transformer on all events matching *frontend.user.**.

<match frontend.user.*>
  @type copy
  <store>
  @type forest
  subtype file
  <template>
      path /tank/storage/${tag}.*.log
      time_slice_format %Y%m%d
      time_slice_wait 10m
    </template>
   </store>
  <store>
    @type bigquery
    method insert
    auth_method json_key
    json_key /etc/td-agent/keys/key-31da042be48c.json
    project project_id
    dataset dataset_name
    time_field timestamp
    time_slice_format %Y%m%d
    table user$%{time_slice}

    ignore_unknown_values
    schema_path /etc/td-agent/schema/user_login.json
  </store>
</match>

The next two blocks demonstrate a use case where the same data is being routed to multiple places. The first one is a logging process into a file. Parameters allow us to choose the format of the file name and even the slicing period.
Next, the same data is inserted into a BigQuery table. We configured a service account on GCP and setup the key to authenticate the BigQuery API for FluentD.

On BigQuery we are able to create day partitioned tables, which means our data is being inserted into a partition defined at day level. Day partition is the recommended way to store data in BigQuery, as you are able to write queries that only touch the last 7 days. These are faster than doing a full table scan, which results in great query performance and reduced costs.

Having FluentD routing a copy of the event to a file as well, we added a feature to our platform. We have all the data, as it happened, archived in a file form. We can say it's a backup of our transaction log. Storing this in the Cloud is cheap nowadays when we have near or cold storage types. Any associated cost is not the subject of this article, but costs for 1 TB are at around twenty dollars monthly.

The benefit is that if we introduce a new type of system or database in our architecture later, we may need to replay the data. Thinking Bigdata and being prepared for the next decade is very important as well. Nowadays, more and more tools that help the Bigdata ecosystem are developed. Later on, more and more tools will be launched on the market. Therefore, having our granular data, as a stream at hand can come handy in the future.

Time for Insight

Now that we have our architecture sorted out, and we know we have our data in BigQuery, which is infinitely scalable, let us see how it adds value to our business.

On a legacy system, where we have only one database and some web service for creating the analytics, the ability to run real-time reports is quite a performance hit, as we need to have a large system to handle the stress.

We have cron jobs, which may kick in minutes, and we may have batch processing (for our data) which runs for hours, and which is often executed during midnight when servers are not subjected to daily load. All these give us days to insight. The data that we analyze could be days old, and we could react or take the wrong decision.

In the BigData world a true real time is indeed desired. Experts in the field often define near real time, as there is a few seconds delay. Why is this important?

If we have reports that analyze real-time or streamed data, we are able to detect opportunities and business intelligence immediately. This stands in stark contrast to legacy systems where we only get the aggregated data and where we need to wait for hours or days until the report is finished.

Use Case 1: a website user registration flow with multiple steps. Imagine at least 10 steps. Imagine that, for each step, an event gets raised and that it ends up in our BigQuery database, as the user fills in their profile. With an append-only pattern these are no updates to a row, these are new rows in the BigQuery table. This means that, for the same user, we have multiple lines of entry, that each line contains a timestamp and that new column(s) are filled in. Insight could be generated under the form of a report to show why step3 takes longer than other steps and to understand what should be improved. This can be done using simple AVG() SQL statements. We will find that step3 has to do with choosing a country, and that it takes longer because the user needs to scroll through a list of countries which takes several seconds. We can improve the page, and recommend a pre-select operation of the country by virtue of the user's current IP address. Alternatively, at the very beginning of the drop-down, we could list the 5 most used countries by our platform end users. Having users quickly choose from the options means they complete the step and spend more time on key business pages.

This is just a use case, but on a traditional system we couldn't afford keeping all events, so we wouldn't have the data or the report to tell us something like our conclusions above.

Use Case 2: tracking our users email open and click habits. We may record every open and click of an email, and we may end up having an image of our users' habits. As a result, we could target them with the right message, at the right time. Using SQL analytics on our data, in the BigQuery database, we could find out that Alice usually opens emails at 9am, and that she usually purchases on our website, at 11am. Having this information, we could schedule an email 5 minutes before 9am, so that this email would be the first one the user receives in their Inbox. If Alice doesn't do the purchase, we could send an offer reminder later, at 11am. We assume that by then, the user would have probably taken a decision about it.

The two use cases described above can be done using simple SQL analytics on the data we stored in the most granular way possible.

As the data is stored when the event happened, analysts are able to run queries on BigQuery even without using third-party GUI tools that integrate with BigQuery, such as Tableau, Talend. BigQuery integrates with Google Docs as well, so the Excel-like experience for some small companies comes handy. Those who know SQL can directly write the queries on BigQuery and get immediate insight. They don't need to raise any request with the developers who implement the reports days later. Since BigQuery is managed by Google and runs on Google's infrastructure. Therefore, running out of resources is not an issue, as both the handling of petabytes of data and fast responses are present.

Conclusions

For a small company, it is important to have access to modern BigData tools without running a dedicated team for it. You only need to raise data events, to use a Middleware like FluentD to route data to multiple backends, and to have a backend like BigQuery for OLAP processing. You will run analytics immediately. BigQuery is a complementary database and does not replace traditional solutions. True benefits are no provisioning, no deploy, no maintenance, no more focus on large scale execution plan, no more running out of resources. What is most important is the fact that you will no longer throw away or aggregate old data, but, instead, you will focus on running the right analytics.