Being a data engineer in 2018

March 1, 2018

Thanks to the disturbances in the Arctic vortex, yesterday was one of the coldest days of the winter so far... What better day to head to Helsinki to Predict What’s Next with Google?

The location of the training was Helsinki Congress Paasitorni, a venue that can accommodate several hundred participants and is easily accessible from the main railway station. The snacks and lunch kept us inside, while the two dozen or so organizing staff were present everywhere, assisting the participants at all times. All this attention to detail just shows that the battles in the clouds are hard won on the ground by winning the hearts of the developers (and the pockets of their organizations).

The content of the training was built around a Big Data and Machine Learning theme, although in practice it covered a wide range of topics, starting with an overview of the Google Cloud Platform (GCP) infrastructure and how to create compute instances there. This is quite understandable for such a large audience, many of whom might not be familiar with the platform and its tools.

The training continued with the tools available to import and persist data, with an emphasis on choosing the storage service suitable for your data volume and usage pattern. Once the basics were in place, with the data stored in Cloud SQL, we moved on to transforming the data using the Hadoop services provided as part of Cloud Dataproc. The interactive experiments performed in Datalab notebooks were a nice surprise here, and they should feel very familiar to Jupyter users.

The afternoon part of the training was dedicated to machine learning. We used BigQuery to create a dataset for a taxi demand forecasting system and then used that dataset to train a model with TensorFlow. The latter part of the training was an overview of the pre-trained models that are offered as Machine Learning APIs. The very end of the day (and a very brief part of it) was allocated to an overview of data processing architecture, including the managed message queue Cloud Pub/Sub and stream and batch processing using Cloud Dataflow.

During the event, the trainer even mentioned that Google believes it has the greatest machine learning infrastructure in the world. Point taken, considering dedicated machine learning hardware such as Cloud TPU, although AWS or Azure cannot be far behind...

Although a lot of content was covered for one day, the trainer did a good job of emphasizing the key concepts behind each service and how to use them. This made it easy to map them to the equivalent technologies and services popular outside GCP. In addition, as the training followed two Qwiklabs labs, it would be pretty straightforward to go deeper and get your hands dirty with the expanded set of seven labs that were made available after the training. Again, hats off to Google for organizing this event; it was a day well spent.

I'll conclude the post with a personal perspective... As someone who builds data pipelines for analysing biomedical signals, I can say that with services like these everyone can be a data engineer today. The open question is how long until everyone can be a data scientist, but there is promise there too...