In this article you will understand the importance of exploratory statistics.
In Formula One, teams run tests before the race. The drivers study the track: how to enter the corners, and where the possible pitfalls and difficulties lie.
When a location is thought to hold oil, the site is first explored using geophysical and non-geophysical techniques, and then field surveys are carried out using probes and non-probe methods.
Before crossing the Rub’al-Khali (the second largest sandy desert on Earth), Max Calderan read that about 40 species of plants and 5 species of animals live in that ecosystem. The animals in particular define the risk profile of such a project: bites can carry pathogens into the blood, while plants can replace some active ingredients of medicines or work as emergency hydration.
In statistics, one can be a driver of KPIs (important business metrics), an explorer of statistical phenomena (going “where no one has gone before” with statistics…) or an archaeologist of bits who reconstructs the company’s or organization’s jar of gold coins. But to do any of that you have to go through exploration first, unless you want to crash, get lost, or destroy artifacts.
Types of explorations
First of all, explorations are part of descriptive statistics: describing by means of sums, counts, etc.; representing by means of graphs; and summarizing data by means of indicators, which are themselves built from sums and counts. It just so happens that some of these appear in the STATiCalmo logo, which I elaborated on in the introduction of the podcast. Descriptive statistics also uses, if you will, non-statistical tools, just as oil exploration involves other disciplines such as geology, geophysics, and seismology. These tools range from computer science to graphics and user interfaces (UIs) for data representation.
Descriptive statistics comes before inferential statistics. Only in very few cases can strategic decisions be made with descriptive statistics alone.
The type of exploration also depends on the environment: quantitative variables such as turnover must be explored differently from qualitative variables such as the customer / potential customer (or converted / non-converted) label.
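As a minimal sketch of this difference, assuming plain Python lists as columns and `None` as the marker for a missing value, the function below summarizes a quantitative column with position indicators and a qualitative column with shares per label. The function name `explore_column` and the sample data are hypothetical and only serve as an illustration.

```python
from collections import Counter
from statistics import mean, median, quantiles


def explore_column(values):
    """Summarize a column differently depending on its type (illustrative)."""
    present = [v for v in values if v is not None]  # drop missing values
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        # Quantitative variable: quartiles, mean, extremes.
        q1, _, q3 = quantiles(present, n=4)
        return {
            "kind": "quantitative",
            "min": min(present),
            "q1": q1,
            "median": median(present),
            "mean": mean(present),
            "q3": q3,
            "max": max(present),
        }
    # Qualitative variable: share of each label, most frequent first.
    counts = Counter(present)
    total = len(present)
    return {
        "kind": "qualitative",
        "shares": {label: n / total for label, n in counts.most_common()},
    }


# Hypothetical sample data.
turnover = [120.0, 95.5, 130.2, 88.0, 410.0]
label = ["converted", "non-converted", "non-converted", "converted", "non-converted"]

print(explore_column(turnover))
print(explore_column(label))
```

The same call works on both columns, but the output has a different shape for each, which is the point: the terrain decides the tools.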
In the Enterprise Statistics videos you often see a recurring pattern of explorations, which can be automated if you wish, and which can be summarized as follows:
- Distributions or frequency tables: not knowing this feature of the terrain almost certainly makes everything that follows useless, and running the later calculations without this check can even become dangerous. Without it, one does not know how to dress, for example, for the desert of Oman or the Arctic Circle, or even whether it makes sense to leave at all.
- Column-by-column summary of the data, to spot missing values or to understand the distribution of each variable without going through graphs.
- Relationships, correlations. If the target and the explanatory variable are both quantitative (e.g., turnover and the PMI index), then one can calculate Pearson’s linear correlation and check the associated statistical significance (which is part of statistical inference). Linear correlation shows only one type of relationship between two variables. There are many others, and mathematics tells us which ones (e.g., trigonometry for temporal data). If one variable is quantitative and the other qualitative (e.g., turnover per weather event), then it is convenient to use another type of correlation. If both are qualitative (e.g., conversion and customer nationality), one cannot use correlations but must use measures of association instead. I generally show both the ordinary (“dirty”) correlations and the partial (“cleaner”) correlations. However, in some cases this approach can be limiting. Metaphorically speaking, relationships and correlations are part of the behavior of the environment, which we can associate with the land’s weather, flora and fauna.
- There are also particular correlations of a variable with itself. For example, a phenomenon that repeats cyclically, that is, a recurring temporal pattern. Or a cyclical link between a phenomenon and a behavior, for example in economics or physiology. In other words, certain phenomena depend on their own past (hence the “auto” prefix of autocorrelation), or variables affect a target variable at time t+1, t+2, etc. (with t = hour, day, month, etc.) rather than immediately, that is, with a delay. An example of a cyclical phenomenon in a given area: the monsoons in the Maldives.
- Outliers can also be seen in correlation graphs, within a grid of plots showing the correlations, which are usually nonlinear. In the desert, such an event can be associated with a sandstorm; in Mykonos with an earthquake; in Tenerife with an eruption of Teide; the same goes for Bali. As you can guess, outliers have a low frequency but a very high damage potential. That is why they can already be seen at the first point (the distributions), and why skipping that point can become a disaster.
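The pattern above can be sketched in a few lines of Python using only the standard library. This is a simplified illustration under stated assumptions, not the procedure used in the videos: columns are plain lists, `None` marks a missing value, outliers are flagged with Tukey's interquartile rule (a technique chosen here for the example), and autocorrelation is computed as the Pearson correlation between a series and a shifted copy of itself. All function names and sample data are hypothetical. Statistical significance, partial correlations and association measures for qualitative pairs are left out.

```python
from collections import Counter
from math import sqrt
from statistics import mean, quantiles


def frequency_table(values):
    """Counts per distinct value, most frequent first (missing values skipped)."""
    return Counter(v for v in values if v is not None).most_common()


def missing_per_column(table):
    """Number of missing (None) entries in each column."""
    return {name: sum(v is None for v in col) for name, col in table.items()}


def pearson(x, y):
    """Pearson linear correlation of two equally long numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def autocorrelation(series, lag):
    """Correlation of a series with itself shifted by `lag` >= 1 steps."""
    return pearson(series[:-lag], series[lag:])


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


# Hypothetical sample data: one row per period.
table = {
    "turnover": [100, 120, 90, 110, 130, 95, 115, 135, 105, 900],
    "pmi": [50, 52, 49, 51, 53, 49, 51, 54, 50, None],
    "weather": ["sun", "rain", "sun", "sun", "storm",
                "rain", "sun", "sun", "rain", "storm"],
}
seasonal_sales = [10, 20, 30, 20] * 3  # repeats every 4 periods

# 1. Distributions / frequency tables.
print(frequency_table(table["weather"]))
# 2. Column summary: missing values.
print(missing_per_column(table))
# 3. Pearson correlation on rows where both values are present.
pairs = [(t, p) for t, p in zip(table["turnover"], table["pmi"])
         if t is not None and p is not None]
r = pearson([t for t, _ in pairs], [p for _, p in pairs])
print(round(r, 3))
# 4. Autocorrelation at the seasonal lag and at half of it.
print(autocorrelation(seasonal_sales, 4), autocorrelation(seasonal_sales, 2))
# 5. Outliers.
print(iqr_outliers(table["turnover"]))
```

Running the steps in this order mirrors the list: the frequency table and the missing-value count come first, because a correlation computed on a column full of gaps or dominated by one extreme value says little about the terrain.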
If you want to start exploring because you want to win the company championship, or to discover your company’s treasure hidden under meters of sand or a thick blanket of greenery, we can have a pre-exploration meeting (a briefing) of about 30 minutes to see whether we are compatible as fellow travelers.