Confusing words from the world of data science

Unfortunately, some of the words that follow carry a meaning not only in statistics (or data science), but also in database administration (now data engineering) and in the galaxy of statistical language models (LLMs, or “AI”). The result is what I consider a linguistic plague: one word takes on several meanings depending on the context.

 

Inference in statistics: generalizing the characteristics of a subset of the population, the sample, to the whole population. Every sample has its own characteristics, and if you don’t control for them, the generalization does damage. An example of inference that is personal rather than statistical: “I was robbed in city X, therefore all citizens of X are crooks.”

Part of the confusion around this word comes from its Latin origin, inferre, which means to bring in, to bring about, to conclude. Too bad that a statistician should know that drawing conclusions by inference alone is unscientific. We could say “inductive statistics” instead, but in my opinion that would not improve accessibility much, because we would then have to introduce notions of analytic philosophy (e.g., Russell).
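To make the statistical meaning concrete, here is a minimal sketch of generalizing from a sample to a population. Everything in it is an assumption for illustration: the simulated population of purchase amounts, the sample size of 200, and the normal-approximation interval, which is only one of several ways to quantify the uncertainty of the generalization.

```python
import random
import statistics
from statistics import NormalDist

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: purchase amounts of 100,000 customers (simulated).
population = [random.gauss(50, 12) for _ in range(100_000)]

# In practice we only observe a random sample, here 200 customers.
sample = random.sample(population, 200)


def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(values)
    mean = statistics.fmean(values)
    std_error = statistics.stdev(values) / n ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_error, mean + z * std_error


low, high = mean_confidence_interval(sample)
print(f"sample mean: {statistics.fmean(sample):.2f}")
print(f"95% interval for the population mean: [{low:.2f}, {high:.2f}]")
```

The interval is the honest part of the generalization: it says how far the population value may plausibly be from what the sample shows. The “city X” reasoning above skips that step entirely.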

 

Inference in statistical language models (GPT, Mistral, Llama, etc.): the process of running new, live data through an already trained artificial intelligence model to make a prediction or solve a task. Example: after training an LLM on a set of PDFs, I ask it a question and expect an answer based on the information contained in those training documents.

In non-linguistic statistics, what is here called “inference” is simply the model testing phase or, in some cases, validation.
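The split between training and inference can be sketched with a toy language model. This is nowhere near how GPT, Mistral, or Llama work internally; it is a hypothetical next-word counter, and the corpus and the function names `train` and `infer` are invented for illustration. It only shows the two phases: first the model is built from data, then the frozen model is asked about new input.

```python
from collections import Counter, defaultdict


def train(corpus):
    """Training phase: count which word follows which in the corpus."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def infer(model, word):
    """Inference phase: use the frozen counts to predict the next word."""
    candidates = model.get(word.lower())
    if not candidates:
        return None  # the model never saw this word during training
    return candidates.most_common(1)[0][0]


# Hypothetical training documents.
corpus = [
    "the invoice is due in thirty days",
    "the invoice is paid",
    "the contract is due in sixty days",
]
model = train(corpus)

print(infer(model, "invoice"))  # is
print(infer(model, "due"))      # in
print(infer(model, "refund"))   # None
```

Nothing is learned during `infer`: the model only replays what the training data contained, which is why a word it never saw gets no answer.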

 

Data modeling: the design of databases, that is, defining how data are linked and stored so that they can be retrieved efficiently. The need for this approach shows up in medium-sized companies and larger. In statistics, instead, modeling data means modeling a variable of interest, which brings us to the next point.
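On the database side, data modeling could look like the following sketch, which uses Python's built-in `sqlite3` module with an in-memory database. The tables `customers` and `orders`, their columns, and the sample rows are hypothetical; the point is only that the links between data (a foreign key) and the retrieval path (an index) are decided up front.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# The data model: two linked tables plus an index for fast lookups.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
);
CREATE INDEX idx_orders_customer ON orders(customer_id);
""")

# Invented rows, for illustration only.
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Blaise")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 20.0), (2, 1, 35.5), (3, 2, 12.0)])

# Retrieval follows the link defined in the model.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY c.customer_id
""").fetchall()
print(rows)  # [('Ada', 2, 55.5), ('Blaise', 1, 12.0)]
```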

 

Statistical model: not a person with a degree in statistics and a shared, measurable beauty (because one can observe the golden ratio in their face), but a mathematical expression, or equation, that attempts to explain a variable of interest, y, through explanatory variables, x. In some families of models it also includes an error term with specific properties.
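As a minimal sketch of such an equation, take the simplest case, a straight line y = b0 + b1 * x + error, estimated by ordinary least squares. The data (advertising spend as x, sales as y) are invented, and the linear form is an assumption chosen for illustration, not the only kind of statistical model.

```python
import statistics

# Hypothetical data: x = advertising spend, y = sales (invented numbers).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(x, y):
    """Ordinary least squares estimates for y = b0 + b1 * x + error."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx           # slope
    b0 = y_bar - b1 * x_bar  # intercept
    return b0, b1


b0, b1 = fit_line(x, y)

# The residuals are the estimated error term: what the line does not explain.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(f"y = {b0:.2f} + {b1:.2f} * x")
print("residuals:", [round(r, 2) for r in residuals])
```

One of the “certain properties” of the error term is visible here: with an intercept in the model, the least squares residuals average out to zero.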

 

Correlation and connection. In common, or natural, language they are used as synonyms. In statistics, the former measures the relationship between two quantitative variables (e.g., customer satisfaction at the first purchase and number of purchases); the latter measures the relationship between qualitative variables, and therefore works with the frequencies of events rather than their values (e.g., whether a user visits a certain page of a site and whether a purchase occurs). In English-language statistics, “connection” is more often called “association.”
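The difference shows up in what each measure takes as input. The sketch below assumes Pearson's coefficient for the quantitative case and the chi-square statistic for the qualitative case; these are common choices, not the only ones, and all the numbers are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two quantitative variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def chi_square(table):
    """Chi-square statistic for a contingency table of event frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Quantitative: satisfaction score at first purchase vs. number of purchases.
satisfaction = [3, 5, 4, 2, 5, 1]
purchases = [2, 6, 4, 1, 5, 1]
r = pearson(satisfaction, purchases)

# Qualitative: rows = visited the page (yes, no), columns = purchased (yes, no).
visits_vs_purchase = [[30, 20],
                      [10, 40]]
chi2 = chi_square(visits_vs_purchase)

print(f"correlation: {r:.3f}")
print(f"chi-square:  {chi2:.3f}")
```

The first function works on the values themselves, the second only on counts of how often each combination of categories occurred.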

 

The p-value. In frequentist inference, it is the measure of statistical significance: the probability of observing a result at least as extreme as yours if there were no real effect. You have surely had blood tests. When you find asterisks (one to three) to the right of a value in the report, you should be concerned, because they mean that your values differ significantly from a reference value in the population. Abnormal values can go hand in hand with unpleasant symptoms and therefore point to a disease to be diagnosed. With business data, in general, one rejoices more readily at the sight of asterisks.
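Where the asterisks come from can be sketched as follows. The example assumes a one-sample z-test with a known population standard deviation, and the thresholds 0.05, 0.01, and 0.001 for one to three asterisks are a widespread convention in statistical software rather than a rule. The measurements and the reference values are invented.

```python
from statistics import NormalDist, fmean


def two_sided_p_value(sample, population_mean, population_sd):
    """p-value of a one-sample z-test (population sd assumed known)."""
    n = len(sample)
    z = (fmean(sample) - population_mean) / (population_sd / n ** 0.5)
    # Probability of a result at least this extreme if there is no difference.
    return 2 * (1 - NormalDist().cdf(abs(z)))


def stars(p):
    """Conventional asterisks; the thresholds are an assumption."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# Hypothetical repeated measurements vs. a hypothetical reference population.
measurements = [5.9, 6.1, 6.4, 6.0, 6.3, 6.2]
p = two_sided_p_value(measurements, population_mean=5.5, population_sd=0.5)

print(f"p-value: {p:.4f} {stars(p)}")
```

The smaller the p-value, the harder it is to explain the difference by chance alone, and the more asterisks appear.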

 

Let’s find together the magic word in statistics lingo that can turn your business around through statistical consulting. Would you like to book a virtual meeting of about thirty minutes?
