Are millions of rows enough for useful statistical analysis in a company?

One of the most widespread beliefs among owners, entrepreneurs, and managers goes like this: I have a lot of company data, measured as the number of rows in spreadsheets (even across multiple files), so surely a statistician will know how to extract value from it. Clinging to this expectation can easily lead to disappointment, because quantity does not always equal quality.

Here is a strikingly similar case:

We are looking for a Machine Learning expert to develop a cutting-edge product recommendation engine for our email marketing campaigns.

Here’s the challenge:

We have a huge database of 4 million email subscribers. For each subscriber we have a large amount of data, including:

– Subscription date: When they subscribed to our email list.

– Engagement data: Did they open or click on our emails? What content did they interact with?

– Purchase data: What products did they purchase from us?

Many sectors have conditioned us to believe that more = better, here meaning more data. Indeed, a common saying in machine learning holds that the number of rows makes the quality of the model, and that sophisticated models are beaten by simpler models trained on much more data, for example when we want to explain a phenomenon of business interest.

Regarding the case above, the three kinds of data they mention are so-called “internal” data: usually easy to obtain, but of limited usefulness. For example, when we download invoices from the IRS website, several columns are useless for analysis, e.g. Invoice Type (the one specifying “between private individuals”), Document Type, Invoice Number/Document, etc. The most useful column is Customer VAT Number, because it uniquely identifies the customer, whereas the customer name does not. From the VAT number we can do enrichment, that is, create other columns such as the SIC code, region, etc. This gives us information about our ideal customer. In other words, informative rows and columns (quality data, to use a more overused term) are enough.
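The enrichment step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the invoice rows, the VAT numbers, and the `registry` lookup table (standing in for whatever business register or third-party service supplies the SIC code and region) are all made up, and the function names `enrich` and `revenue_by` are hypothetical.

```python
# Hypothetical invoice rows; all values are invented for illustration.
# Note that the same customer appears under two different spellings of its name.
invoices = [
    {"customer_vat": "VAT-001", "customer_name": "Rossi Srl", "amount": 1200.0},
    {"customer_vat": "VAT-002", "customer_name": "Bianchi Spa", "amount": 300.0},
    {"customer_vat": "VAT-001", "customer_name": "ROSSI S.R.L.", "amount": 800.0},
    {"customer_vat": "VAT-003", "customer_name": "Verdi Snc", "amount": 450.0},
]

# Assumed lookup table keyed by VAT number. In practice this would come from
# a business register or a paid enrichment service.
registry = {
    "VAT-001": {"sic_code": "5812", "region": "North"},
    "VAT-002": {"sic_code": "7372", "region": "South"},
}


def enrich(rows, registry):
    """Return copies of the rows with sic_code and region columns added.

    Rows whose VAT number is missing from the registry get None in both columns.
    """
    enriched = []
    for row in rows:
        extra = registry.get(row["customer_vat"], {})
        enriched.append({
            **row,
            "sic_code": extra.get("sic_code"),
            "region": extra.get("region"),
        })
    return enriched


def revenue_by(rows, key):
    """Sum the invoice amounts for each distinct value of the given column."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0.0) + row["amount"]
    return totals


enriched_invoices = enrich(invoices, registry)
print(revenue_by(enriched_invoices, "region"))
print(revenue_by(enriched_invoices, "sic_code"))
```

Joining on the VAT number rather than on the customer name is what lets the two differently spelled "Rossi" invoices count as one customer. The totals per region or per SIC code are a first, rough picture of the ideal customer.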

Various automations exist to derive new columns of data from existing ones, and they may also involve third-party services. However, there are also very meticulous salespeople who write notes on each prospect or customer, and those notes sometimes turn out to be gold.

Some particularly rich internal data sources, e.g. the database of prospects and customers (CRM), in some cases let you create new columns from existing ones or enrich columns automatically (sometimes for a fee).
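As a minimal sketch of what "creating new columns from existing columns" can mean, take the three kinds of data in the job posting above: subscription date, engagement, and purchases. The field names, the sample subscribers, and the reference date below are assumptions made for illustration, not anything taken from a real CRM.

```python
from datetime import date

# Hypothetical subscriber rows; field names and values are invented.
subscribers = [
    {"email": "a@example.com", "subscribed_on": date(2023, 1, 15),
     "emails_sent": 40, "emails_opened": 10, "purchases": 2},
    {"email": "b@example.com", "subscribed_on": date(2024, 6, 1),
     "emails_sent": 0, "emails_opened": 0, "purchases": 0},
]


def derive_columns(row, today):
    """Return a copy of the row with three columns derived from existing ones."""
    sent = row["emails_sent"]
    return {
        **row,
        # How long the person has been on the list, in days.
        "tenure_days": (today - row["subscribed_on"]).days,
        # Share of emails opened; None when no email was ever sent.
        "open_rate": row["emails_opened"] / sent if sent else None,
        # Whether the subscriber has bought at least once.
        "is_buyer": row["purchases"] > 0,
    }


reference_day = date(2024, 7, 1)  # assumed analysis date
derived = [derive_columns(row, reference_day) for row in subscribers]
for row in derived:
    print(row["email"], row["tenure_days"], row["open_rate"], row["is_buyer"])
```

None of these columns requires new data collection. They only restate the existing rows in a form that is easier to relate to a business question.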

Do you have thousands and thousands of rows of data and aren’t confident that you are making the most of them? Let’s discuss this in a free call.
