Data stack with Apache

There are various data stack solutions, and the right one depends on how much control you want over the underlying technologies. If you are looking at an Apache-based stack, though, you probably have very specific needs.

For example, if your data stack is Excel Online, Microsoft SQL Server and Power BI, you are using commercial technologies. You may prefer open source technologies, which have no licence cost but have the disadvantage of taking more of your time. Among these are data dashboard tools, which are one part of a stack: Shiny is one example, but there are more complex solutions, such as Apache Superset. I happened to read this request:

We are looking for an experienced developer with skills in Apache Superset, PostgreSQL and Apache Airflow to design, build and maintain data pipelines, dashboards and reporting solutions.

 

Responsibilities:

Develop and manage Superset dashboards for data visualization and insights.

Design and optimize complex queries and schemas in PostgreSQL.

Create, schedule, and monitor workflows using Apache Airflow.

[…]

 

Why doesn’t this company use a data visualization and pipeline solution from one of the main cloud providers? Did they listen to my podcast and decide to do everything in-house? I highly doubt it. Maybe it’s a startup whose medium-term goal is to contain the cost of third-party solutions, such as the cloud, and which has decided to build its own infrastructure by combining open source software with specialized profiles. Do I recommend this approach for small and medium-sized businesses? In the vast majority of cases, no. Generally speaking, an Apache stack makes sense for medium-sized businesses and above.

Such a stack also has a significant “at rest” cost in terms of computation on a single server, a cost that is justified only for medium-sized companies and above.
