What does it mean to have a data stack? To understand it, we need to start from something a little more familiar.
By industrial chemical plant we mean, leaving out a lot, a site that has storage for one or more raw materials, a reactor, and storage for the finished product with an attached distribution system.
A data stack, likewise, is a physical and digital infrastructure that includes data storage (possibly fed from various sources), a reactor (the computational and statistical processing), and the delivery of the finished product, which can happen via a data dashboard or, more generally, an accessible link to the statistical service provided.
Running an operation of this kind requires different skills, like an entire chemistry department: the analyst, the engineer, technicians, operators. The same goes for a data plant, unless you outsource it, for example to Enterprise Statistics. This whole system is what the English word “stack” describes: a pile, a heap or, if you prefer, LEGO bricks snapped together.
Data Storage
Digital infrastructure
Unfortunately, the best-known one is Excel. But it was not born as a database; Microsoft had Access for that function, but legend has it that cloud solutions make more money.
Another improper database is Google Sheets, which can easily be connected to various data visualization services, such as Looker.
Airtable is also heavily misused for this purpose. Among other things, I find it more limited than Google Sheets.
The most widely used databases, though not equally well known, are:
PostgreSQL
MySQL: the HubSpot CRM uses this solution in the back end, which is why I describe a CRM as a particular database for prospect and customer data
SQLite
MongoDB: usually used for unstructured data such as documents
etc.
Literally any web service you use relies on at least one of these solutions behind the scenes; we usually interact only with the counter, i.e. a GUI or a website. A minimal sketch of what a “real” database looks like follows below.
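As a concrete taste of one of these “real” databases, here is a minimal sketch using SQLite from Python's standard library; the table, columns and values are invented for illustration.

```python
import sqlite3

# Connect to a local database file (created if it does not exist yet)
conn = sqlite3.connect("customers.db")

# A tiny, hypothetical table of prospects/customers
conn.execute(
    "CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT, revenue REAL)"
)
conn.execute("INSERT INTO customers (name, revenue) VALUES (?, ?)", ("ACME", 1200.0))
conn.commit()

# Query it back, just as a CRM would do behind the scenes
for row in conn.execute("SELECT name, revenue FROM customers"):
    print(row)

conn.close()
```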
Google Sheets does have some advantages over the four “real” databases above: it saves automatically with retrievable history (versioning), it takes care of replicating the file (redundancy), and its formulas for descriptively querying the data read almost like natural language.
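As a minimal sketch, assuming a sheet shared as “anyone with the link can view”, its contents can be pulled straight into Python via the CSV export URL; the spreadsheet ID below is a placeholder.

```python
import pandas as pd

# Placeholder spreadsheet ID: replace with that of a publicly shared sheet
SHEET_ID = "YOUR_SPREADSHEET_ID"
url = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/export?format=csv"

# Read the sheet as a regular table and take a first descriptive look
df = pd.read_csv(url)
print(df.describe())
```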
Physical infrastructure (even if in the cloud)
A computer or virtual machine with no less than 1 CPU (understood as one core), 512MB of RAM and 5GB of storage.
Data conveyor belt
Data must be connected, just as chemical syntheses need multiple ingredients (molecules).
Digital infrastructure
I think virtually all readers of this blog have heard of Zapier, or Make. But they can quickly become limiting or too expensive. That is why there are solutions you can self-host, which in the medium term save you money and build very important skills: n8n, Activepieces, Airbyte, Automatisch, etc. A minimal hand-rolled pipeline is sketched below.
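To make the conveyor belt concrete, here is a minimal sketch of a hand-rolled pipeline that extracts records from a hypothetical API and loads them into the storage layer; the endpoint and field names are invented, and in practice tools like n8n or Airbyte also handle scheduling, retries and credentials for you.

```python
import sqlite3
import requests

# Hypothetical source API; replace with a real endpoint and authentication
resp = requests.get("https://api.example.com/orders", timeout=30)
resp.raise_for_status()
orders = resp.json()  # assumed to be a list of {"id": ..., "amount": ...}

# Load the raw records into the storage layer (here a local SQLite file)
conn = sqlite3.connect("customers.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT OR REPLACE INTO orders (id, amount) VALUES (:id, :amount)", orders
)
conn.commit()
conn.close()
```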
Physical infrastructure (even if in the cloud)
This varies greatly depending on workload. Bare minimum: 1 CPU, 1GB of RAM, 5GB of storage. For Airbyte that is absolutely not enough, so I would say 4 CPUs, 8GB of RAM, 20GB of storage.
Data Reactor
I am leaving out the data refining and cleaning part here.
Digital infrastructure
If we are dealing with deeper data explorations or a statistical model, we need working environments for programming languages: RStudio in the case of the R language; Spyder or VS Code in the case of Python. There are also online working environments called notebooks, such as Google Colab, Jupyter, etc.
Working environments with a graphical interface, and therefore more accessible, such as JASP or Orange, require installing the programs.
For other kinds of data exploration, which also double as a delivery system, we have data dashboard solutions, or Business Intelligence, such as Google Looker, Power BI, Tableau, Apache Superset, etc. A minimal example of what runs inside the reactor follows below.
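As a minimal sketch of what happens inside the reactor, here is a small model that could run in any of the environments above; the data file and column names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical cleaned dataset produced by the conveyor belt
df = pd.read_csv("orders_clean.csv")  # columns assumed: amount, ad_spend

# A simple linear model: how does advertising spend relate to order amount?
model = smf.ols("amount ~ ad_spend", data=df).fit()
print(model.summary())
```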
Physical infrastructure (even if in the cloud)
This is an even more extreme and variable case than the previous one. Minimum: 1 CPU, 1GB of RAM, 10GB of storage. To be safe: 4 CPUs, 16GB of RAM, 150GB of storage.
Knowledge distribution and delivery
This can happen via a data dashboard or, more generally, an accessible link (URL) to the statistical service delivered.
Digital infrastructure
- Reports, slides: Quarto, etc. Delivery can also happen via GitHub or GitLab.
- Dynamic, custom data dashboards with interactions: Shiny (R, Python), Streamlit (Python), etc.
- Data waiters (APIs): Plumber, Django, Flask, etc. Used, for example, to get the prediction of a statistical model via a link; see the sketch after this list.
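To make the last point concrete, here is a minimal sketch of a prediction API with Flask; the model is a hypothetical placeholder, and in practice you would load a fitted model (for example the one from the reactor section) from disk instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder "model": in reality you would load a fitted model from disk
def predict_amount(ad_spend: float) -> float:
    return 50.0 + 3.2 * ad_spend  # hypothetical coefficients

@app.route("/predict")
def predict():
    ad_spend = float(request.args.get("ad_spend", 0))
    return jsonify({"predicted_amount": predict_amount(ad_spend)})

if __name__ == "__main__":
    app.run(port=8000)
```

Once running, a URL such as http://localhost:8000/predict?ad_spend=100 returns the prediction as JSON, which is exactly the kind of accessible link mentioned above.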
Physical infrastructure (even if in the cloud)
This varies greatly depending on workload. Minimum: 0.5 CPU, 256MB of RAM, 1GB of storage.
Obviously there are other components, for example regarding security.
In some cases you don’t necessarily need your own physical infrastructure, e.g. for Google Sheets or GitHub.
Clearly you don’t need a separate computer for each macro-category, but neither should all the services be lumped together: creating a single block (called a monolith) can lead to unpleasant surprises.
Several of the solutions mentioned also come in a “frozen meal” (managed, ready-to-use) version.
With this article, which undoubtedly carries some complexity since it involves various disciplines, you may have discovered that you already have a data plant of your own, even if it is not complete. If you are interested in exploring the points that you think could improve some business performance, get in touch for a free initial call.