How to recognize a computer scientist doing statistics and why you need to be careful

Introduction

One of the objectives of statistics is to explain the variability of a phenomenon of interest through a mathematical expression: an equation, called a model, made up of a target (dependent) variable and explanatory (independent) variables. If the target variable has no variability, then we do not have a variable but a constant, and we don’t need statistics.

In statistics there are basic models, called inferential models, and classification models. The former can be used to build the latter, and also to get an idea of baseline performance. The name of the basic model depends on the type of target variable: linear regression for quantitative targets (e.g. sales), the logistic model for dichotomous targets (e.g. customer/non-customer, recurring purchase/occasional purchase, etc.) or categorical ones (expense categories, etc.).
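To make the dichotomous case concrete, here is a minimal sketch of fitting a logistic model, on hypothetical data with invented variable names, using Python's statsmodels:

```python
# Minimal sketch, hypothetical data: a dichotomous target
# (recurring buyer yes/no) explained by two invented variables.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
visits_per_month = rng.poisson(3, n)
avg_basket = rng.normal(30, 10, n)

# simulate the target so that it really depends on both variables
linear_part = -3 + 0.6 * visits_per_month + 0.03 * avg_basket
recurring_buyer = rng.binomial(1, 1 / (1 + np.exp(-linear_part)))

X = sm.add_constant(np.column_stack([visits_per_month, avg_basket]))
fit = sm.Logit(recurring_buyer, X).fit(disp=0)
print(fit.summary())  # coefficients with standard errors and p-values
```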

An example of a statistical model:

ice cream sales = coefficient*day temperature + coefficient*number of tourists + error

Compared to the models you saw in high school, such as those in geometry or physics, statistical models have an error term and “dirtier” coefficients, i.e. non-round numbers.
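A minimal sketch of what estimating such a model could look like, again on hypothetical data with invented variable names, using Python's statsmodels:

```python
# Minimal sketch, hypothetical data for the ice cream example:
# sales explained by daily temperature and number of tourists.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
temperature = rng.normal(25, 5, n)   # daily temperature
tourists = rng.poisson(300, n)       # number of tourists
sales = 12.3 * temperature + 0.8 * tourists + rng.normal(0, 40, n)

X = sm.add_constant(np.column_stack([temperature, tourists]))
fit = sm.OLS(sales, X).fit()
print(fit.params)      # the estimated, non-round coefficients
print(fit.resid[:5])   # the error term: what the model does not explain
```

The estimated coefficients will not be the round numbers used in the simulation, and the residuals are precisely the error term in the equation above.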

Computer scientists vs statisticians

– they tend to compute only linear correlations, without checking significance. Unless the dataset has many rows, where significance can be judged by eye from the size of the correlation, inserting non-significant variables into a model produces an unreliable and unstable model, and therefore one that is useless for explaining phenomena, making predictions or classifying (see the sketch after this list).

– they dive straight into classification models and do not start from basic models. This risks producing rigid (overfitted) models, which show apparently excellent performance during testing. It also involves a dangerous and potentially costly automatism in the choice of variables, especially if you use solutions where the model is trained on third-party resources (the cloud).

– they use variable importance rather than classical or Bayesian significance. The former is a purely descriptive metric; the latter two are inferential. This means building many more models, because in the absence of a method for selecting variables we are left with trial and error, risking what was described in the previous point (again, see the sketch after this list). That purely descriptive metric can make sense when classifying certain images, where the variables are the pixels of an image (a full-HD image has 1080 rows and 1920 columns of them), or in natural language processing (NLP). Computer vision has less to do with statistics than other areas of machine learning, and requires less critical thinking about what you are doing or want to do, since it also has several automated steps.
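A minimal sketch of the difference, once more on hypothetical data, using Python's scipy, statsmodels and scikit-learn:

```python
# Minimal sketch, hypothetical data: classical significance (p-values)
# versus purely descriptive variable importance.
import numpy as np
import statsmodels.api as sm
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 1000
useful = rng.normal(size=n)   # genuinely related to the target
noise = rng.normal(size=n)    # unrelated to the target
y = (0.9 * useful + rng.normal(size=n) > 0).astype(int)
X = np.column_stack([useful, noise])

# A correlation is only half the story without its significance.
r, p = pearsonr(noise, y)
print(r, p)                   # small correlation, typically non-significant

# Inferential: a basic (logistic) model reports a p-value per coefficient,
# so a non-significant variable can be identified and dropped.
logit_fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(logit_fit.pvalues)      # typically a high p-value for the noise variable

# Descriptive: a random forest assigns some importance to every variable,
# with no notion of significance attached to it.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(rf.feature_importances_)
```

The exact numbers do not matter; the point is the type of output: a p-value is an inferential statement about a variable, while an importance is a descriptive ranking that every variable receives, relevant or not.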

In general you will notice that the computer scientist's approach compensates for gaps in statistical theory with computational brute force, an expression familiar from the password-cracking field. In other words, they fire a lot of bullets but few hit the mark. Statisticians, on the other hand, have weaker IT skills, which risks making them less autonomous, but their knowledge of theory makes them more effective in certain practical activities. Then again, a computer scientist who writes statistics code is, more than anything else, useful to large companies.

 
