<Guy Hummel>Welcome to the introduction to
Google Cloud Dataflow course.
My name's Guy Hummel and I'll be showing you how
to process huge amounts of data in the Cloud.
I'm the Google Cloud Content Lead at Cloud Academy,
and I'm a Google Certified Professional Cloud Architect
and Data Engineer.
If you have any questions, feel free to connect with me
on LinkedIn and send me a message or send an email
to support@cloudacademy.com.
This course is intended for data professionals,
especially those who need to design and build
big data processing systems.
This is an important course to take if you're studying
for the Google Professional Data Engineer exam.
To get the most from this course, you should have
experience with Java because I'll be showing you lots
of examples of code written in Java.
I'll also show you how to run these examples
on the Cloud Dataflow service.
So if you don't already have a Google Cloud account,
I recommend signing up for a free trial.
It's good for a year and it lets you run up
to $300 worth of services.
Cloud Dataflow executes data processing pipelines.
A pipeline is a sequence of steps that reads data,
transforms it in some way, and writes it out.
Since Dataflow is designed to process very large data sets,
it distributes these processing tasks to a number
of virtual machines in a cluster so they can process
different chunks of the data in parallel.
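Just to make that concrete, here's a minimal sketch
of what a pipeline looks like in Java with Apache Beam.
The file paths are placeholders, and we'll build
real pipelines step by step later in the course.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class MinimalPipeline {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.read().from("input.txt"))   // read data
         .apply("ToUpperCase", MapElements                      // transform it
             .into(TypeDescriptors.strings())
             .via((String line) -> line.toUpperCase()))
         .apply("WriteLines", TextIO.write().to("output"));     // write it out

        p.run().waitUntilFinish();
      }
    }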
Cloud Dataflow is certainly not the first
big data processing engine.
It's not even the only one available
on Google Cloud Platform.
For example, one alternative is to run Apache Spark
on Google's Dataproc service.
So why would you choose Dataflow?
There are a few reasons.
First, it's essentially serverless.
That is, you don't have to manage
the compute resources yourself.
Dataflow will automatically spin up and down
clusters of virtual machines when you run processing jobs.
You can just focus on writing the code
instead of building clusters.
Apache Spark, on the other hand, requires more configuration,
even if you run it on Cloud Dataproc.
Second, Google has separated the processing code
from the environment where it runs.
In 2016, Google open-sourced the Dataflow
software development kit as Apache Beam.
Now you can write Beam programs and run them
on your own systems, or on the Cloud Dataflow service.
In fact, if you look at Google's Dataflow documentation,
you'll see that it tells you to go to
the Apache Beam website for the latest version
of the software development kit.
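To give you a feel for what that separation means
in practice, here's a rough sketch of how you'd point
the same pipeline code at the Cloud Dataflow service
just by changing its options. The project ID and
bucket here are placeholders.

    import org.apache.beam.runners.dataflow.DataflowRunner;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    // The pipeline code itself stays the same; only the options change.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);         // target Cloud Dataflow
    options.setProject("my-project-id");             // placeholder project ID
    options.setRegion("us-central1");
    options.setTempLocation("gs://my-bucket/temp");  // placeholder bucket
    Pipeline p = Pipeline.create(options);

If you leave the runner unset, Beam defaults to the
DirectRunner, which executes the pipeline on your
local machine.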
Third, it was designed to process data in both batch
and streaming modes with the same programming model.
This is a big deal.
Other big data SDKs typically require that you use
different code depending on whether the data comes
in batch or streaming form.
Competitors like Spark are addressing this,
but they're not quite there yet.
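As a rough illustration of that unified model,
the counting logic in this sketch is identical whether
the input collection is bounded, like lines from a file,
or unbounded, like messages from Pub/Sub. For streaming,
you simply add a window.

    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // 'words' can come from a bounded source like TextIO or an unbounded
    // one like PubsubIO; the counting code is the same either way.
    static PCollection<KV<String, Long>> countPerMinute(PCollection<String> words) {
      return words
          .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
          .apply(Count.perElement());
    }

We'll dig into windows and triggers in detail
later in the course.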
We'll start with how to build and execute
a simple pipeline locally.
Then I'll show you how to run it on Cloud Dataflow.
Next, we'll look at how to build more complex pipelines
using custom and composite transforms.
Finally, I'll show you how to deal with time
using windows and triggers.
You'll also see how to integrate a pipeline
with Google BigQuery.
By the end of this course, you should be able to
write a data processing program in Java using Apache Beam,
use different Beam transforms to map and aggregate data,
use windows, timestamps and triggers to process
streaming data, deploy a Beam pipeline both locally
and on Cloud Dataflow, and output data from Cloud Dataflow
to Google BigQuery.
We would love to get your feedback on this course,
so please let us know what you think on the comments
tab below or by emailing support@cloudacademy.com.
Now, if you're ready to learn and get the most
out of Dataflow, then let's get started.