Data is the new oil. Like oil, data is valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity. so, must data be broken down, analysed for it to have value. Clive Humby.

We all feel our company’s data is precious. We gather it, safeguard from external threats, make investments to generate it faster, easier and (hopefully) cheaper. But as Clive Humby said in the quote above – data has no value if it is not refined. At Phronesis Path we’ve seen a lot of companies that started to analyze their data in order to learn from the past and analyze it’s financial standing against the forecasts. But that’s just a fraction of what data can do, if refined correctly – it will predict the future and act accordingly on its own.

“But how do we get there Phronesis?” you might ask. If data is the new oil, then we need a refinery. You could think that Data Warehouse is your refinery but in reality Data Warehouse is only the building itself and a bunch of unconnected machines – it will allow you to create and store your data but you need to figure out the production process. You could, in theory, do the whole process with just manpower, connecting the various machines and moving some of the subproducts manually or semi-manually.

That would very much resemble a team of engineers facing an empty Data Warehouse and preparing each process from scratch, of course they need to be experience in oil (data) processing so this team needs to be very experienced to achieve results. Still the work will be done mostly manually by coding each structure, creating best practices on the way, onboarding new engineers and evolving their methods.

But this is not how oil is refined. Refineries are mostly automated with flow control, equipment monitoring, quality monitoring in order to make the process as efficient as possible. There are multiple automation tools that you can use in order to build your perfect refinery. They are suitable for different scenarios.

Align your drilling strategy with your goals

Automation tools have different licensing models ranging from size-based calculations, hybrid models where you pay for developer seat and builds to seat-only models. Therefore you should consider ahead what kind of implementation you are looking for: do you want to invest heavy at start and then go into sustain mode? Do you intend to build near-real-time solutions? Or perhaps you want to build enterprise-size single-point-of-truth solutions for your business?

Size-based

Generally speaking, size-based models calculate the objects you have in your installation then convert this to tokens to calculate your bill at the end of the month. This can be very cheap at the start when you are just building your Data Warehouse but with bigger solutions they tend to scale poorly – in other words the cost is low at the start but the grows indefinitely. Also the tokens are usually bought in bulk with a higher fee if you “exceed” the expectations.

For some tools this is ridiculously complicated to calculate, with advanced equations and at least a few different object types that “consume” the tokens. There is no chance you will be able to predict this over the course of a few years.

Hybrid models

Hybrid models combine compute-driven and seat-based license. Usually the developer seat is cheaper but there is a minimum “platform fee” with a set number of seats you need to buy, there is also an additional fee for “builds”. Buids, similar to size-based tokens are bought in bulk and have an overconsumption fee. A build is usually one refresh of an object (table, view) in one job run. So for example if you have 40 tables in your model and refresh it twice a day then after 30 days you would use 2,400 tokens.

This is a bit easier to predict but especially for Data Vault 2.0 installations the “build” cost grows rapidly. In a large installations we observed the build cost to essentially double the developer-seat cost. The development licenses is lower cost for a reason however remember that Data Vault 2.0 multiplies the number of objects to build, some say even 4-5 times! So the hybrid licensing model for Data Vault 2.0 will rapidly grow if builds are calculated.

Developer seat-only model

This is the other end of the spectrum – you pay only for the developers, agian with initial platform fee that includes 3-5 developer licenses in it. You can imagine this is pretty predictable as you know exactly how much you will pay in a year based on the size of your engineering team – if your team shrinks, you pay less over the next year if it grows you pay more.

This is the model Coalesce uses, for us this is the most predictable one but the developer seat is also more expensive than in hybrid model (which hides some of the cost in build packages). If you plan for rapid development the initial cost may be higher, but then next year when you slow down – you can also reduce the license cost.

How these models fit different scenarios?

Depending on your goals and the kind of the infrastructure you are building you should consider some factors:

  • Do you intend to build Data Vault 2.0 model? This creates much more objects than in other modelling techniques. As a consequence seat-only model are more favorable here and significantly so.
  • Do you intend to refresh your data multiple times a day? If you need to refresh every hour then you will quickly run out of build-tokens, then the better option would be size-based license or developer seats for predictability
  • Do you expect rapid expansion in the first year or two but then want to dial down to long term sustain mode with little changes? Generally speaking the only way here to lower the license cost would be lower number of developer seats, so you should look into hybrid or seat-based models.
  • Is scaling up and down expected? If you see “hot” periods of investment that you want to follow with long term “sustain” mode then you need to remember that:
    • size-based model does not scale down unless you delete some models,
    • the hybrid model scales down only in regards to developer license so the cost is not as responsive due to “build-fee”
    • seat-based model scales down easiest but you need to take into account there is a rock-bottom platform fee that you cannot go under (same with hybrid model).

Refining data with Coalesce

Coalesce (https://coalesce.io/) is a unique tool on the automation market. It was built specifically for Snowflake® (currently it also support Databricks and Microsoft Fabric), which means that unlike “universal” tools it supports Snowflake-specific function (e.g. zero-copy cloning, integration with Snowflake Cortex®). This makes it a perfect choice for SnowflakeData Cloud installations to make sure you can utilize all the specialized functions.

Design your oil flow

Coalesce is gui-driven, meaning that most of the code generation is done via “nodes” which are pre-built packages of code. These significantly reduce the time of data pipeline building and also lower the skill bar for engineers. We’ve seen successful implementations driven by Business Intelligence consultants without prior knowledge of Snowflake and even Business users that have basic knowledge of data warehouses. The lineage is visualized meaning that you can clearly see where the data originates from and where it is heading.

The packages open up another unique feature of coalesce – the marketplace. You can expand your template library for free with community-driven templates which give you access to advanced features like Streams or Cortex functions. It also supports Apache Iceberg tables read and write!

One package we hold especially dear is Data Vault by Scalefree.

Data Vault 2.0

Scalefree is a company founded by the fathers of Data Vault 2.0 specification. This advanced way of designing Data Warehouse with the usage of Raw Data Vault, Business Data Vault and objects like Satellites, Links, Hubs is the crown jewel of Data Warehouse modelling. However to built it correctly there is a need to build a very sophisticated code. With the packages however this is all simplified to basically dragging the node into Coalesce workspace and selecting some option from the UI. The Data Vault is built without writing a single line of code… just what your refinery needs.

We can tell you that this is one of the few tools on the market that can do Data Vault 2.0 via graphical user interface and the only one which will let you easily prepare and link it directly in one lineage to Snowflake AI tools.

Keep a sharp eye on the purity of your oil

What would be a refinery point if it produces low-quality products? This is why Coalesce also let’s you build tests along with your pipeline. Each step can have simple tests added from GUI but also let’s you write custom tests on your own. This is very powerful as it basically allows for any kind of testing, run after or before the pipeline step is executed. You can also define the behavior in case of failure.

This is a crucial part of Data Warehouse automation as you never can be fully sure that your source data retains it’s integrity over time.

Make your refinery AI-driven

We mentioned the pre-built packages for nodes and since those are built specifically for Snowflake, they have the ability to connect to Cortex. Cortex gives you access to Large Language Models in Snowflake and Coalesce gives you ability to connect your data to it directly in the GUI. This opens up a whole new set of capabilities – you do not need to bring any other Machine Learning or LLM tool outside of Snowflake ecosystem.

This simplifies administration and cost-optimization significantly as you do not need to worry about multiple components orchestration and all the connectivity nonsense you usually encounter when you have multiple vendors.

Summary

Refining your data is crucial to take the most out of it, speed of delivering data pipelines is crucial for success so using automation tools is necessary. If you are looking to also utilize advanced AI functions, you should look for a tool that is tailored to your Data Warehouse of choice. Once your refinery is built you will have endless possibilities to use it.

Phronesis Path wishes you many, smooth drills!

The SaaS is Not Enough

Contact Us

Ready to take the first step towards unlocking opportunities, realizing goals, and embracing innovation? We're here and eager to connect.

Contact Phronesis Path