With Snowflake six months past a record-breaking IPO and coming off a quarter of higher-than-expected revenues, Wall Street initially took a breather as the company projected growth settling to a mere 84% over the next quarter. Such are the overhyped expectations for Snowflake, whose name has practically become a verb. Although not a scientific observation, we’re getting more frequent pitches from third parties positioning Snowflake as a default destination for running data science and analytic workloads in the cloud.
Initially known as a cloud data warehouse, Snowflake now characterizes itself as delivering a “Data Cloud” that provides a variety of services and experiences to different constituencies, from business analysts to data engineers and data scientists, and uses the pooled storage of the cloud to consolidate persistent enterprise data silos.
Snowflake is best known for battling David vs. Goliath with cloud providers, literally on their turf. A classic case of coopetition, Snowflake depends on AWS, Azure, and Google Cloud as it competes head-on with them. It’s part of several larger narratives: cloud vendor independence and best of breed.
While Snowflake runs on all major public clouds, cloud provider data warehousing and analytics platforms have traditionally been confined to their own clouds. The flip side is that, while Snowflake has largely stayed in its lane delivering a data platform for analytics, cloud providers tout their broader portfolios extending from streaming to managing data pipelines, data transformation/ELT, cataloging, AI and machine learning services, and self-service visualization.
Indeed, we believe that a major theme for cloud providers going forward is not about expanding their portfolios per se, but knitting them together. Those efforts are still works in progress. By contrast, from the get-go, Snowflake has largely relied on best-of-breed partners like Alation, Collibra, Fivetran, Informatica, H2O.ai, Qlik, Tableau and others to deliver these adjacent services, and so third-party integration has always been a priority.
A number of the innovations that Snowflake helped pioneer have now become checkbox items for rival cloud services. Today, not all of them separate storage from compute, but many do, and the same goes for capabilities such as multimodel support, access to data in cloud object storage, and elastic compute. Ironically, while AWS long promoted elastic computing, when it came to elastic data warehousing, Snowflake beat AWS at its own game.
The competitive lines are blurring. Take multicloud, or cloud vendor independence. As cloud providers build software-defined hybrid cloud strategies, they could plant their services behind enemy lines. For instance, through its Anthos Kubernetes server, Google is now previewing support of BigQuery Omni on AWS. We wouldn’t be surprised to see Microsoft pull the same feat with its Arc hybrid cloud platform, but at present, it only supports transaction data services, not Azure Synapse Analytics. Our take? Microsoft just added Azure Machine Learning to Arc, so when it comes to analytics, never say never.
STAYING IN ITS LANE, BUT THE LANE IS GETTING WIDER
Unlike services such as Azure Synapse Analytics, SAP Data Warehouse Cloud, and, we expect, eventually Oracle Autonomous Data Warehouse, Snowflake is staying in its lane. It is not building an end-to-end data warehouse that covers every step of the data lifecycle, as Microsoft and SAP are; it is not incorporating its own tooling for everything from building data transformation pipelines and cataloging data to building and running ML models and self-service data visualization. Instead, that will be the role of Snowflake partners; Snowflake’s job is building the host environment where those services can run in-database.
While Snowflake is sticking to its lane, that lane is getting wider in several ways.
Data sharing is one pillar of that, and it’s a strategy that’s quite doable when data is in a cloud storage pool, rather than directly attached to local clusters. Snowflake’s guiding notion is that, when you have a critical mass of crowdsourced data, it will become a de facto data destination. Maybe some of you recall Data Sharehouse? Snowflake no longer uses that term, but it planted the seed for what is now the Snowflake Data Marketplace, which over the past year has tripled to over 100 sources of third-party data sets. Snowflake is managing this with a light touch in that it is not curating the sets per se and, for now, has no plans to monetize the marketplace. Instead, it’s about building community.
The other pillar is making the platform more extensible. With its origins as a data warehouse, SQL was the lingua franca of Snowflake. With Snowpark, the developer base grows wider with APIs that allow developers working in their language of choice to develop data pipelines, machine learning models, or other programmatic approaches to analytics. Snowflake is clearly not the first to get there; the ability to use languages like Java, Scala, or Python has become a checkbox item for cloud data warehouses.
Beyond multi-language support, Snowpark has another important element. It’s part of a broader trend to collapse data pipelines and ML modeling to run in-database, which eliminates the cost and overhead of moving data and the need for middle-tier app servers, and therefore improves performance and manageability. Cloud-native architectures enable this, thanks to the economics of cloud storage, the ability to put all compute on its own instances, and the ability to turn compute on or off. Snowflake is not alone in embracing in-database processing; many of its cloud rivals now support in-database running of ML models and ELT data pipelines.
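To make the in-database idea concrete, here is a minimal sketch in Python. It uses the standard-library SQLite engine purely as a stand-in for a warehouse; it is not Snowflake or Snowpark code, and the `orders` table and both function names are invented for illustration. It contrasts pulling every row into the application tier with pushing the aggregation down so that only the summary leaves the database.

```python
import sqlite3

# SQLite stands in for the warehouse; this is NOT Snowflake or Snowpark code.
# The table and its contents are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 30.0), ("west", 45.5)],
)


def totals_client_side(conn):
    """Pull every row out of the database, then aggregate in the app tier."""
    totals = {}
    rows_moved = 0
    for region, amount in conn.execute("SELECT region, amount FROM orders"):
        totals[region] = totals.get(region, 0.0) + amount
        rows_moved += 1
    return totals, rows_moved


def totals_in_database(conn):
    """Push the aggregation down; only the summarized rows leave the database."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)


client_totals, client_rows = totals_client_side(conn)
db_totals, db_rows = totals_in_database(conn)

# Same answer either way, but fewer rows cross the wire when pushed down.
print(client_totals == db_totals, client_rows, db_rows)
```

In this toy case the difference is four rows versus two; the point is only that the volume of data moved scales with the result rather than with the table when the work runs where the data lives.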
Snowpark pushes down processing into the database, but it takes a different approach from other cloud services that strive to be end-to-end. For instance, compared to Azure Synapse Analytics, which incorporates data pipelining capabilities from Azure Data Factory and ML modeling capabilities from Azure Machine Learning, Snowflake’s approach is to let third parties perform the work. In this case, you would use a partner tool like Fivetran to develop the pipeline, and tools from partners like Dataiku, DataRobot, or H2O.ai to develop, train, and manage the lifecycle of ML models. After developing pipelines and models in those partner tools, you would run them directly inside Snowflake through the Snowpark API.
We do have one thing on our wish list for Snowpark. As compute demand for workloads like data pipelines or ML model training and inferencing can be far more variable than for queries, we would like to see Snowflake introduce a serverless option for Snowpark execution.
DOUBLING DOWN ON GOVERNANCE AND COST CONTROL
Snowflake’s trajectory is redolent of the early days of SQL Server. Both were initially embraced at department level before gaining traction at the enterprise. With more history behind it, SQL Server has built the features to make it enterprise-grade; Snowflake is on the way to getting there in areas such as data governance and unified account management.
Let’s start with account management. It’s all too easy to ramp up compute resources, forget to shut them off, and then get sticker shock at the end of the month. Is it the fault of customers who fail to adequately monitor their consumption, or Snowflake for not providing adequate controls?
Snowflake has upped its game with monitoring and account usage trending tools that can look across all active accounts within an enterprise. It offers capabilities such as enforcing hard limits on consumption and instituting automatic suspend policies for idle workloads. But the resource consumption characteristics of different workloads vary. Snowflake does provide the capability to automatically shut off workloads such as ELT that have hard stops. But for now, there is no way to prioritize workloads or workload types. Of course, that doesn’t let customers off the hook. They must make the hard choices about whether specific queries or data pipeline jobs command greater priority. They can look at the Snowflake dashboard of all jobs or accounts to make those decisions.
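As a rough illustration of how a hard consumption limit and an idle auto-suspend policy interact, consider the following Python simulation. Everything in it is hypothetical: the `Warehouse` and `CreditMonitor` classes, the `tick` function, and the credit rates are assumptions made for the sketch and do not represent Snowflake’s actual controls or API.

```python
from dataclasses import dataclass


@dataclass
class Warehouse:
    """Hypothetical stand-in for a compute cluster billed in credits per hour."""
    name: str
    credits_per_hour: float
    auto_suspend_minutes: int  # suspend after this many consecutive idle minutes
    running: bool = True
    idle_minutes: int = 0
    credits_used: float = 0.0


@dataclass
class CreditMonitor:
    """Hypothetical account-level monitor enforcing a hard credit quota."""
    quota: float

    def over_quota(self, warehouses):
        return sum(w.credits_used for w in warehouses) >= self.quota


def tick(warehouses, monitor, busy):
    """Advance one minute; `busy` holds the names of warehouses with active work."""
    for w in warehouses:
        if not w.running:
            continue  # suspended warehouses accrue no credits
        w.credits_used += w.credits_per_hour / 60
        w.idle_minutes = 0 if w.name in busy else w.idle_minutes + 1
        if w.idle_minutes >= w.auto_suspend_minutes:
            w.running = False  # idle policy: suspend this warehouse only
    if monitor.over_quota(warehouses):
        for w in warehouses:
            w.running = False  # hard limit: suspend everything, no priorities


elt = Warehouse("elt", credits_per_hour=60.0, auto_suspend_minutes=5)
bi = Warehouse("bi", credits_per_hour=60.0, auto_suspend_minutes=5)
monitor = CreditMonitor(quota=30.0)

# One simulated hour in which only the ELT warehouse has work to do.
for _ in range(60):
    tick([elt, bi], monitor, busy={"elt"})

print(elt.credits_used, bi.credits_used, elt.running, bi.running)
```

In this run the idle `bi` warehouse is suspended after five minutes, while the busy `elt` warehouse keeps running until the shared quota is exhausted and is then cut off regardless of how important its job is. That blunt cutoff is the gap the paragraph above describes: the limit applies to everything equally, with no notion of workload priority.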
So, we have a couple of things on our wish list here. We would like to see Snowflake step up policy-driven features for ramping specific categories of workloads up or down, with customers having the ability to define the categories. There could even be a role for machine learning that could understand the demand patterns of an organization and provide guided alerts or assists when it comes to prioritizing or making exceptions.
Snowflake is in the early days of building data governance, but then again, so are its rivals in the cloud. While each cloud provider offers coarse-grained identity and access management, most currently lack fine-grained controls that can drill down to table, column, or row level, and typically leave that to third-party tools. And at this point, only two of Snowflake’s rivals have actually taken the plunge into data governance: Microsoft, which only announced the preview of Azure Purview last fall, and Cloudera, which is further along with SDX.
For data and privacy protection, Snowflake, like most (not all) of its cloud rivals, encrypts data from start to finish, and it has recently added dynamic data masking. Snowflake is also offering a private preview of row-based access policies. In some cases, partners also have a role; for instance, rather than reinvent the wheel, Snowflake uses external agents from Protegrity for tokenizing data.
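For readers unfamiliar with the two mechanisms, the following Python sketch shows the general idea behind dynamic masking and row-based access: the stored data is left untouched, and policies are applied at read time according to who is asking. The role names, the `User` class, the masking rule, and the sample records are all invented for illustration; this is not how Snowflake implements or exposes either feature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    role: str    # hypothetical roles: "analyst" or "compliance"
    region: str


def mask_email(value: str) -> str:
    """Hide the local part of an email address, keeping the domain."""
    _, _, domain = value.partition("@")
    return "****@" + domain


# Column-level masking policy: roles listed here see the raw value.
UNMASKED_ROLES = {"email": {"compliance"}}
MASKERS = {"email": mask_email}


def row_visible(user: User, row: dict) -> bool:
    """Row-level policy: compliance sees all rows, others only their region."""
    return user.role == "compliance" or row["region"] == user.region


def query(user: User, rows: list) -> list:
    """Filter rows, then mask columns, at read time; stored rows are not modified."""
    result = []
    for row in rows:
        if not row_visible(user, row):
            continue
        out = {}
        for column, value in row.items():
            masker = MASKERS.get(column)
            if masker and user.role not in UNMASKED_ROLES[column]:
                value = masker(value)
            out[column] = value
        result.append(out)
    return result


CUSTOMERS = [
    {"id": 1, "email": "ana@example.com", "region": "emea"},
    {"id": 2, "email": "bo@example.org", "region": "apac"},
]

analyst = User("pat", "analyst", "emea")
auditor = User("sam", "compliance", "emea")

print(query(analyst, CUSTOMERS))  # one row, email masked
print(query(auditor, CUSTOMERS))  # both rows, emails in the clear
```

The same query returns different results for the two users even though the underlying records never change, which is what distinguishes dynamic masking from static approaches that rewrite the data itself.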
Last summer, Snowflake acquired CryptoNumerics, a startup that provides technology for managing data with privacy or residency issues; the acquired company has technology for mitigating or preventing the de-anonymization of anonymized data. While Snowflake has not disclosed which capabilities from the acquisition it will incorporate, or how, one thing is clear: Snowflake needs a capability that can autodetect PII data. It could build that itself or work with third parties.
For broader data governance, Snowflake is now laying the foundations. It starts with tagging data objects, to which governance attributes can then be attached; it’s a mechanism also used by Azure Purview and Cloudera SDX. The difference is that Azure and Cloudera are relying on an open source technology, Apache Atlas, for tagging, whereas Snowflake is going its own way with a proprietary system.
It’s way too early to talk about interoperability for data governance, as the few unified tools that exist are still filling out their functionality. When it comes to data lineage, for instance, there are many versions of the truth proliferating out there, and that’s just one example of the hurdles ahead for unified governance. Nonetheless, it’s still time to be looking forward, and by going its own way on tagging, Snowflake risks isolating itself on its own island of governance.
Because Snowflake initially gained traction at department level, it was likened to a data mart. But Snowflake is now closing more million-dollar enterprise deals, with a dozen exceeding the $5 million mark over the past year. This is no longer your father’s cloud data mart or, more to the point, your father’s cloud data mart company.
Some large million-dollar Snowflake customers may be building their own consolidated cloud data warehouses, while others may still have lots of separate departmental accounts. In most organizations, departmental systems persist, and in the cloud, those departmental systems – for lack of a better term, we’ll still call them data marts – are getting a lot bigger and broader.
Let’s take that traditional single-domain data mart. On-premises, it was limited by the cost of hardware and the resources of IT staff to manage it; the data sets therefore tended to be modestly sized. Enter the cloud, and in a managed service like Snowflake, the operational complexity and capital budgeting barriers go away. The storage is cheap, and by providing a “data cloud,” Snowflake is opening the path to extending the scope of that traditional departmental data mart. Instead of gigabytes of relational data, you might be dealing with terabytes or conceivably petabytes of heterogeneous data. That modest departmental data mart has practically become a data lake. Size no longer matters.
That’s where Snowflake – and admittedly, its cloud rivals – are striving to fill the gaps. So, what will differentiate Snowflake as cloud analytic stores extend the bounds of the traditional data mart or data warehouse? While many capabilities, such as support of heterogeneous (multimodel) data; access to cloud storage; and in-database processing of ELT and machine learning are becoming increasingly commonplace, the approaches are surprisingly diverse. Some are bridging the gap with “Data Lakehouses” that place, for instance, SQL and Spark processing under the same hood. As noted earlier, others are building end-to-end experiences, covering much if not all of the data lifecycle.
Snowflake is focusing on being an endpoint for data and emphasizing its third-party ecosystem in an update of the classic best-of-breed strategy. It is not trying to be all things to all people. It will continue building hooks to enable third-party tools to work with or inside the Snowflake platform. But we are waiting for the next shoe to drop, which is to use Snowpark to integrate real-time streaming into the platform. As for its aspirations to become a destination for data, over the past year, more than 4000 customers have taken advantage of the Snowflake Data Marketplace. But if Snowflake really wants to get serious about being the “Data Cloud,” it should pull the trigger and also add a commercial, curated tier.