This past Tuesday a group from Pandera Labs attended Snowflake’s Unite the Data Nation conference in Chicago. The conference started at 1:00pm and lasted until about 5:45pm. Among those presenting were keynote speakers from Intelligent Solutions, Snowflake, Talend, and Vibes (one of our close partners and one of the earliest companies on the Chicago tech scene).
The most interesting talks we heard focused on either macro trends in the data engineering space or on some pretty neat features from the Snowflake platform. Overall, we were impressed and excited to share our thoughts!
Interesting Trends and Practices
The trends we heard about are not anything particularly new, but they’re points we feel are important and were happy to hear them discussed in such a large, progressive forum.
The monolith has been fractured
This idea has been echoing around for years — the days of maintaining, integrating with, and building around big monolithic data architectures are over, right?
Well, maybe not completely… but in many new and emerging use cases (especially where IoT is in the mix) we see this concept of cloud-based “microservices” prevailing. Without going into too much depth on the pros and cons of the monolith vs distributed microservices, there are certainly times when each of these approaches is preferable to the other.
One of the great things about the Snowflake platform is that it’s built with both of these cases in mind: you can use it as a replacement for your data warehouse or as a replacement for your analytical/archive database.
IoT adoption is growing at a record pace
One point that resonated with me, personally, was made by Dr. Claudia Imhoff of Intelligent Solutions. The figure below is a snapshot of her slide showing the number of connected devices, globally, as of 2013 (a little out of date, but illuminating nevertheless).
It’s clear in the graph that IoT is not only responsible for the greatest number of connected devices at this point in time, but that it’s also the fastest growing category. While this doesn’t necessarily imply that IoT is responsible for the most data generated, it’s still an interesting indicator — I expect to see many more use cases pop up where IoT devices are catered to as the primary source of data generated for decision-making.
Separating storage from compute is the future
This is something we’ve been told by cloud providers since the inception of the cloud, but why? What’s the real benefit of decoupling storage and compute?
Ultimately, the answer to this question comes down to flexibility — the flexibility to choose between cost savings and enhanced performance as it’s needed. Thanks to the “pay as you go” model provided by cloud providers (Snowflake included), admins don’t need to spend nearly as much time on capacity planning, only to discover that their static estimates were wrong because system load is often made up of peaks and valleys.
The image below from Snowflake’s Kent Graziano illustrates how planning for a static load leads to times when either SLAs are not met (demand exceeds capacity) or companies are stuck paying for systems that are too big (capacity exceeds demand).
My first question when learning about systems that “scale with demand” is how quickly the platform can scale up or down. Our friends at Vibes, who have experienced some pretty dramatic swings during peak seasons, confirmed that the platform responds very quickly once the scaling trigger fires!
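As a rough sketch of what this elasticity looks like in practice (the warehouse name and parameter values below are our own illustrative choices, not anything shown at the conference), Snowflake lets you define a virtual warehouse that suspends itself when idle, resumes automatically when queries arrive, and scales out under concurrent load:

```sql
-- Hypothetical warehouse definition: suspends after 5 idle minutes,
-- resumes on the next query, and scales out to up to 4 clusters
-- when concurrent demand spikes.
CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  AUTO_SUSPEND      = 300    -- seconds of inactivity before suspending
  AUTO_RESUME       = TRUE   -- wake up automatically on incoming queries
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4;     -- multi-cluster scale-out for peak load
```

Because storage lives separately from compute, suspending or resizing the warehouse doesn’t touch the data at all — you only pay for compute while it’s running.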
What makes us at Pandera Labs most excited about Snowflake is that it’s a database product that was built specifically for operating in a cloud environment where compute and storage are already inherently separated. Listed below are some features that we felt really stood out.
Data Security and Compliance
While this certainly isn’t the coolest feature we heard about, it is one that puts Snowflake right at the top of the list for us to offer as a potential solution to our customers. These days, a database that doesn’t support encryption out of the box is essentially worthless to large enterprises. For more detail on Snowflake’s security framework, check out their whitepaper here.
Flat File Ingestion with Snowpipe
One feature we see as a massive potential timesaver is one they call Snowpipe; with this feature, a service points to a pre-defined S3 bucket and automates the ingestion of flat files into your Snowflake database. Below is a description of the service that we borrowed from Snowflake’s site.
I admit that when I first saw this feature I didn’t think too much of it since we already do this type of thing with AWS Lambda and Firehose; but at the end of the day, AWS has tons of products and we’re constantly switching between them as we develop new products and features — it would be great to have one less thing to manage manually.
Companies that currently do flat file ingestion at scale or work with IoT/telemetry data should be very excited about this one!
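Based on our reading of the announcement, the setup boils down to a stage pointing at the S3 bucket and a pipe that copies new files into a target table as they land. A minimal sketch (bucket, stage, pipe, and table names are all illustrative, and credentials are omitted):

```sql
-- Hypothetical stage pointing at the pre-defined S3 bucket.
CREATE STAGE event_stage
  URL = 's3://my-bucket/events/';   -- auth configuration omitted

-- Hypothetical pipe: continuously copies newly arrived flat files
-- from the stage into the target table.
CREATE PIPE event_pipe
  AUTO_INGEST = TRUE                -- fire on S3 event notifications
AS
  COPY INTO raw_events
  FROM @event_stage
  FILE_FORMAT = (TYPE = 'JSON');
```

Compared with wiring up AWS Lambda and Firehose ourselves, this keeps the ingestion definition inside the database, which is exactly the “one less thing to manage” appeal.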
Handling Semi-Structured Data
Snowflake’s ability to handle semi-structured data performantly will probably only speak to half of our readers. Most traditional data warehouse gurus tend to dismiss these features because storing nested fields in a single database column is simply not the “right” way to model data, but try convincing app developers to model out all of their data the same way you would in a warehouse.
We believe databases are powerful tools and each department (app development, data science, BI, etc.) should manage their data architectures in a way that best enables their work. This means that there will be different patterns of usage and you’ll almost certainly encounter a situation where you need to efficiently query semi-structured (JSON, Avro, XML) data out of your database.
Snowflake makes this super easy — when semi-structured data is loaded into a column of type VARIANT, the database engine automatically discovers what attributes exist, looks for repeated attributes, and then organizes and stores those repeated attributes separately. According to Snowflake, this enables much better compression and fast access to those data assets. Read more about Snowflake’s approach to semi-structured data in this whitepaper.
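To make that concrete, here is a small sketch of the VARIANT workflow (table, column, and attribute names are hypothetical — the syntax shown is Snowflake’s path-and-cast notation for querying nested fields):

```sql
-- Hypothetical table holding raw JSON from IoT devices.
CREATE TABLE device_readings (payload VARIANT);

-- Query nested attributes directly, casting them to typed columns.
SELECT
  payload:device_id::STRING    AS device_id,
  payload:reading.temp::FLOAT  AS temperature
FROM device_readings
WHERE payload:reading.temp::FLOAT > 30;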
Live Data Sharing
One final feature that really got us thinking is called Live Data Sharing. This feature works by creating a virtual database with fine-tuned security settings to grant external users read-only access to the assets you specify. Since Snowflake doesn’t actually copy over all of your data, the virtual database is spun up instantly and in a very cost-effective way.
We think this feature is going to be hugely beneficial to organizations where groups operate their own data marts and wish to share only certain partitions of those data marts with other groups.
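For a sense of the mechanics, here is a hedged sketch of how sharing is set up on each side (share, database, schema, table, and account names are all illustrative): the provider creates a share and grants read-only access to specific objects, and the consumer mounts it as a database without any data being copied.

```sql
-- Provider side: create a share and expose only selected objects.
CREATE SHARE sales_share;
GRANT USAGE  ON DATABASE sales_db              TO SHARE sales_share;
GRANT USAGE  ON SCHEMA   sales_db.marts        TO SHARE sales_share;
GRANT SELECT ON TABLE    sales_db.marts.orders TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = partner_acct;

-- Consumer side: the share appears as a read-only virtual database.
CREATE DATABASE shared_sales FROM SHARE provider_acct.sales_share;
```

The grants are what make the “only certain partitions” story work: anything not explicitly granted to the share stays invisible to the consuming account.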
It’s a great sign when people leave your conference with new ideas of how to run their businesses. When we left the Snowflake event we felt very excited about the current suite of features and even more optimistic about the future of the platform! Many thanks to the organizers and kudos again to the presenters for great discussions.
Highlights from Snowflake’s ‘Unite the Data Nation’ Conference was originally published in Pandera Labs on Medium.