Streaming data from PostgreSQL to BigQuery with Datastream
GitHub Repo: https://github.com/data-max-hq/terraform-datastream-postgres-bq
In this blog post, we are going to use Datastream to create a stream from a Cloud SQL PostgreSQL database to BigQuery.
What is Datastream?
Serverless and easy-to-use change data capture and replication service.
Access to streaming, low-latency data from MySQL, PostgreSQL, AlloyDB, and Oracle databases.
Near real-time analytics in BigQuery with Datastream for BigQuery.
Easy-to-use setup with built-in secure connectivity for faster time to value.
Serverless platform that automatically scales, with no resources to provision or manage.
Secure foundation for database synchronization and event-driven architectures.
We are going to use Terraform to create a Cloud SQL Postgres instance with a private network then connect this database to datastream using VPC peering to create a stream to BigQuery. As of the time of this writing, October 22, Datastream is not yet supported by Terraform thus some of the steps have to be done manually through the Google Cloud Console.
1. Create a Cloud SQL Postgres instance
The following code uses Terraform to create a private Postgres instance, we use the google_sql_database_instance recourse from Terraform to create the private IP instance as shown in the official documentation provided by Terraform.
Cloud SQL private IP PostgreSQL instance
2. Create a Datastream private connection
The following Terraform code is used to create a private connectivity configuration for Datastream by using the google_datastream_private_connection recourse which creates a private connection between Datastream and a customer's network.
Each network has its own firewall controlling access to and from the instances. All traffic to instances, even from other instances, is blocked by the firewall unless firewall rules are created to allow it, thus we use google_compute_firewall to allow connection to our Postgres database.
Datastream private connection
3. Create a Cloud SQL Auth proxy
When you configure a Cloud SQL instance to use private IP addresses, you use a VPC peering connection between your VPC network and the VPC network of the underlying Google services where your Cloud SQL instance resides.
Because Datastream's network can't peer directly with Cloud SQL's private services network, and because VPC peering isn't transitive, a Cloud SQL Auth proxy is required to bridge the connection from Datastream to your Cloud SQL instance. The following Terraform code creates a VM instance by using google_compute_instance recourse, where we create a Cloud SQL Auth proxy.
4. Complete database prerequisites for the stream to run successfully.
Datastream requires that the source meets these prerequisites for a stream to run successfully for each source you will be given a few steps to complete before starting the stream as shown in the photos below.
For our Postgres instance, we have already completed the first step in Terraform by setting the value of cloudsql.logical_decoding to on. The other steps have to be done manually.
5. Create a stream
The last step is to create the stream through the Datastream UI, in the second step under Hostname IP you have to put the Internal IP of the proxy created and the Username and password of the Datastream user previously created. Under the Define Connectivity method select VPC peering and the private-connection that we created. After that, you can create the stream keeping the default configuration.
After completing all steps select CREATE & START to start the stream between CloudSQL and BigQuery.
In this blog post, we demonstrated a step-by-step guide on how to create an incremental load of data from a CloudSQL Postgres database to BigQuery using Datastream. Changes in your database instance are replicated in real-time to BigQuery. This addresses a long-standing challenge for incremental loads for Postgres in CloudSQL. Happy coding!