Autoscaling expandable data pipelines with no-code controls


I've designed and built a cloud architecture for collecting, storing, transforming, querying and visualizing large amounts of data. This architecture has the following properties:

  • Per-pipeline autoscaling from 0 to any available number of CPUs (limited by Project quota)
  • Linear performance growth
  • No-code controls allowing non-techies to use any pipeline individually or chain them together.
  • As little custom code as possible (<300 lines)
  • Unidirectional dataflow
  • Anyone can expand it without knowing a thing about cloud computing, event-driven systems or autoscaling.
  • Capable of handling huge spikes in workload as well as real-time processing with decent latency
  • Fault-tolerant. If it fails to process an input, it retries it. If it fails several times in a row, it restarts the failing compute instance. If it still fails with the same input, it marks that input as invalid and stores it in the data warehouse.


My client had a bunch of data collection and transformation pipelines written by different developers in different languages with different CPU/memory requirements. Some of these pipelines were running on-premise, others on various VPS providers. My job was to add new pipelines to the mix and orchestrate all these dispersed systems.

So, we have an arbitrary number of pipelines now and an arbitrary number of new pipelines being developed. Non-techies need to use them individually or chain them together, schedule execution, and process a highly volatile number of data points per day, ranging from zero to a billion.

Challenge #1: Input concurrency


Data pipelines must be scalable. There are two kinds of scaling: vertical and horizontal. Vertical means increasing the capabilities of a single VM; horizontal means creating many VMs and splitting the workload between them. Obviously, horizontal scaling is way more... you know... scalable, because you can create hundreds or thousands of VMs. So the input method must be able to feed all those VMs without duplicating the input. Otherwise we end up with a bunch of VMs doing the same job in parallel instead of splitting the workload.
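This is the classic competing-consumers pattern, which Pub/Sub gives us for free: many workers pull from one queue, and each message is delivered to only one of them. A minimal sketch of the idea, simulated with Python threads and a thread-safe queue standing in for Pub/Sub (all names are illustrative):

```python
import queue
import threading

def run_workers(items, num_workers):
    """Split `items` across `num_workers` without duplicating any item."""
    work = queue.Queue()
    for item in items:
        work.put(item)

    processed = []           # shared result list
    lock = threading.Lock()  # protects `processed`

    def worker():
        while True:
            try:
                # each item is handed to exactly one worker
                item = work.get_nowait()
            except queue.Empty:
                return
            with lock:
                processed.append(item)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed
```

Every item comes out exactly once, no matter how many workers compete for the queue; that is the property the real input method must preserve at VM scale.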

Challenge #2: Input latency


Whatever a pipeline does, the CPU and memory required for transforming data are higher than the CPU and memory required for just pulling the input. So if a pipeline sits idle waiting for input data, a significant part of its resources is wasted.
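One way to keep the expensive transform stage busy is to prefetch input in a background thread into a bounded buffer. A sketch of that overlap, under the assumption that fetching is I/O-bound and transforming is CPU-bound (function names are illustrative):

```python
import queue
import threading

def pipeline(fetch_batches, transform, buffer_size=4):
    """Prefetch input in a background thread so the transform stage
    (the expensive part) never idles waiting on I/O."""
    buf = queue.Queue(maxsize=buffer_size)
    SENTINEL = object()  # marks end of input

    def producer():
        for batch in fetch_batches():
            buf.put(batch)  # blocks if the transform falls behind
        buf.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while True:
        batch = buf.get()
        if batch is SENTINEL:
            return results
        results.append(transform(batch))
```

The bounded buffer also gives natural backpressure: if the transform can't keep up, the fetcher simply blocks instead of filling memory.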

Challenge #3: Fault tolerance


If an instance of a pipeline fails, the input it was processing must not be lost. It must be returned to the queue so another instance can try processing it. If the same input fails several times, it must be removed from the queue and marked accordingly for an operator or engineer to investigate.
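The retry-then-dead-letter logic described above can be sketched in a few lines (a simplified, single-process sketch; in the real system "retry" means the message goes back to the Pub/Sub queue and "dead-letter" means it lands in the warehouse marked invalid):

```python
def process_with_retries(message, handler, max_attempts=3):
    """Retry a failing message a few times; after max_attempts,
    route it to a dead-letter store instead of blocking the queue.
    Returns ("ok", result) or ("dead-letter", message)."""
    for attempt in range(max_attempts):
        try:
            return ("ok", handler(message))
        except Exception:
            continue  # in the real system the message returns to the queue
    return ("dead-letter", message)

def always_fail(msg):
    """Example of a poison message handler outcome."""
    raise ValueError("cannot parse " + str(msg))
```

The key property: a poison message is retried a bounded number of times, then parked where a human can inspect it, so it never wedges the whole pipeline.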

Challenge #4: Idle cost


If the pipeline runs out of input data, it should automatically scale down to 0 VM instances and stop wasting your money.
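The scaling rule behind this is simple arithmetic; here is a sketch of the target-size calculation, with the messages-per-instance and cap values as illustrative assumptions (GCE managed instance group autoscaling on a per-instance metric assignment behaves roughly like this):

```python
import math

def target_instances(queue_size, messages_per_instance=100, max_instances=50):
    """Scale-to-zero target: an empty queue means zero instances,
    otherwise enough VMs to drain the backlog at roughly
    `messages_per_instance` messages per VM, capped by quota."""
    if queue_size <= 0:
        return 0
    return min(max_instances, math.ceil(queue_size / messages_per_instance))
```

With an empty queue the group costs nothing; with a million-message spike it jumps straight to the cap and drains linearly, which is the "linear performance growth" property from the list above.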

Challenge #5: Self-healing


If one of the VM instances on a pipeline breaks down, it should be automatically repaired. That's obvious; people usually handle this with all kinds of error-handling mechanisms. But what if an error-handling mechanism fails? Or the VM runs out of memory and the script gets terminated? Or it just goes silent for whatever reason? On a larger scale we need to be damn sure that each one of our VMs is actually doing its job, and if not, we need a solid mechanism that repairs that VM from the outside. And of course, this has to happen automatically.
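"From the outside" is the important part: the watchdog must not live inside the process it watches. A minimal sketch of the heartbeat check such a watchdog runs (VM names, timestamps and the timeout are illustrative; in practice this is what a managed instance group health check does for you):

```python
def check_heartbeats(last_heartbeat, now, timeout=60):
    """External watchdog: any VM that hasn't reported progress within
    `timeout` seconds is considered stuck and scheduled for recreation,
    even if its own error handling never fired.

    last_heartbeat: dict of VM name -> timestamp of last progress report.
    Returns the list of VM names to recreate."""
    return [vm for vm, ts in last_heartbeat.items() if now - ts > timeout]
```

A VM that OOMs, deadlocks, or simply goes silent all look the same from out here: the heartbeat stops, and the instance gets recreated.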

Challenge #6: Usability

I needed a solid set of controls and monitoring tools over the entire system, and it should be as no-code as possible for the ops team. Input enqueuing, scheduling and monitoring: all of this must be wrapped in a convenient ready-to-use interface. The docs must be short and the onboarding threshold low. Building a custom GUI would be hard to expand and would send both the development and maintenance budgets through the roof.

Challenge #7: Extendability

If you happen to have solutions to all of the above challenges, that's impressive! Try this one. That first pipeline is probably not the end of the story. You'll need to plug in new pipelines to transform the output of the first pipeline and then the output of the second pipeline and so on. Plugging in new pipelines should be easy and should not send your system complexity through the roof.

Challenge #8: Developer lock-in

I avoid locking my clients into solutions that are hard for other developers to maintain, because that leads to unhealthy relationships and me being morally obligated to keep helping past clients. It's much better to solve a client's problem and move on with a good review. So I have to do all of the above in a concise manner. My rule of thumb: the system is simple enough if I can draw its diagram in my imagination without hurting my brain.

The solution


BigQuery serves as the main data warehouse, Compute Engine instance groups host the individual pipelines, and Pub/Sub glues it all together.

BigQuery cannot interact with Pub/Sub directly. The recommended way to connect them is Dataflow, which looks like overkill to me. Instead, I decided to use a lesser-known feature of BigQuery: its ability to export query results directly to Cloud Storage.

It also splits the output into multiple files, so the exported data can be processed in parallel.
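The export itself is a single SQL statement. A sketch with hypothetical project, dataset and bucket names; the `*` wildcard in the URI is what makes BigQuery shard the result into multiple files:

```sql
EXPORT DATA OPTIONS(
  uri = 'gs://my-bucket/exports/my_pipeline-*.csv',  -- wildcard => many files
  format = 'CSV',
  overwrite = true,
  header = false
) AS
SELECT id, payload
FROM `my_project.my_dataset.my_table`;
```

Because the statement is plain SQL, it can be saved, scheduled, and edited by the ops team like any other query.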

Cloud Storage, in turn, can trigger a Cloud Function on each new file upload. The function reads the file line by line and populates a Pub/Sub queue named after the file.
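A sketch of that function follows. The line-splitting logic is real; the GCP calls are left as comments because they need the google-cloud-storage / google-cloud-pubsub client libraries and credentials, and all names (bucket, topic, project) are illustrative:

```python
def split_into_messages(file_text):
    """One Pub/Sub message per non-empty line of the exported file."""
    return [line for line in file_text.splitlines() if line.strip()]

def on_file_uploaded(event, context):
    """Cloud Storage trigger: `event` carries the bucket and file name."""
    bucket, name = event["bucket"], event["name"]
    # Hypothetical wiring, roughly:
    # text = storage_client.bucket(bucket).blob(name).download_as_text()
    # topic = publisher.topic_path(PROJECT, topic_name_from(name))
    # for msg in split_into_messages(text):
    #     publisher.publish(topic, msg.encode("utf-8"))
```

Deriving the topic name from the file name is what lets one generic function serve every pipeline.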

This approach also lets us bypass BigQuery, if we want to, and feed data from external sources directly into any pipeline, simply by uploading a CSV file to Cloud Storage.

BigQuery has a "Saved Queries" feature. I use it to save the export query and other queries for the ops team to use. Neat 🙂

And with BigQuery scheduled queries we can run the export queries on any schedule we want, populating the various queues on a regular basis.

All custom pipelines are deployed as Compute Engine instance groups on preemptible VMs, with autoscaling based on the size of the respective Pub/Sub queue. To make the process fast and predictable, I wrap each custom pipeline in a Docker container.

Don't forget to use multi-stage Docker builds to reduce image size; image size directly affects each instance's boot-up time.
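For example, a two-stage build for a compiled pipeline might look like this (the Go toolchain and paths are illustrative; the point is that the final image ships only the binary, not the build toolchain):

```dockerfile
# Stage 1: build with the full toolchain
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /pipeline ./cmd/pipeline

# Stage 2: ship only the binary; the image shrinks from hundreds of MB
# to a few MB, which directly cuts instance boot-up time
FROM gcr.io/distroless/static
COPY --from=build /pipeline /pipeline
ENTRYPOINT ["/pipeline"]
```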

Pub/Sub will feed all our VMs, however many we have. The output goes to another Pub/Sub queue, which can either push messages to another Cloud Function (relatively expensive) or feed a data sink built on preemptible VMs in a Compute Engine instance group (higher latency, but cheaper).


Thank you for reading this to the end! If you have any comments, please feel free to shoot me an email.