When does data science become big data?

"Big data" is one of those buzzwords that got inflated over the previous decade. There are several answers to this question, depending on the tool you use for data analysis and on how informal you want to be.

  1. When spreadsheets crash, that is, they close unexpectedly or become too slow to use. Although both Excel and Google Sheets state generous row and column limits per sheet, trouble arrives much sooner.
  2. When queries on databases (MySQL, PostgreSQL, etc.) take more than six minutes.
  3. When fitting the simplest statistical models, linear or logistic regression, takes more than six minutes (see the timing sketch just after this list).
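
To make the third threshold concrete, here is a minimal sketch in Python that times a model fit; scikit-learn and the synthetic data sizes are my assumptions for illustration, not a prescription.

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for your table: one million rows, 20 columns.
rng = np.random.default_rng(seed=42)
X = rng.normal(size=(1_000_000, 20))
y = (X[:, 0] + rng.normal(size=1_000_000) > 0).astype(int)

start = time.perf_counter()
LogisticRegression(max_iter=1000).fit(X, y)
elapsed = time.perf_counter() - start

print(f"Fit took {elapsed:.1f} s")
if elapsed > 6 * 60:  # the six-minute threshold above
    print("You may be crossing into big data territory.")
```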

The first two points concern tools used mostly for descriptive statistics on structured data (tables): aggregations, counts, sums, averages, and so on.
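
As a concrete picture of those descriptive statistics, here is a short pandas sketch; the table and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical sales table; in practice it would come from a
# spreadsheet export or a database, e.g. pd.read_csv("sales.csv").
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [120.0, 80.0, 200.0, 50.0],
})

# Counts, sums, and averages per group: the kind of work
# spreadsheets and SQL databases are usually asked to do.
summary = df.groupby("region")["amount"].agg(["count", "sum", "mean"])
print(summary)
```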

For each of these points there are different software and hardware solutions, since performance degradation and slowdowns come from at least one of those two sides. As an entrepreneur you will rarely deal with these directly; that is more likely for managers.

  1. In the case of Google Sheets, I took the largest file among my clients' spreadsheets; it weighs about 2.6MB. To track its cost in resources on a Windows PC, I opened the Task Manager with CTRL+ALT+Delete:
    1. The main bottleneck turns out to be the CPU (processor), followed by RAM (volatile memory), so the problem can be eased by upgrading those two resources.
  2. Likewise, buying a faster CPU or more RAM for the server hosting the database solves the problem if you are not using a remote (cloud) server; this is called vertical scaling. It cannot always be applied, and that is where data warehouses, which host databases, come in. Example: Google BigQuery.
    1. You can also act by optimizing the query, and therefore the code, for example through normalization, though much ink has been spilled on that strategy alone (a simpler cousin, indexing, is sketched after this list).
  3. There are solutions on the code side, or simply in changing the programming language:
    1. code: all modern CPUs have multiple cores, and you can explicitly run your code on all of them, or even move it to the graphics card (GPU), especially for certain types of statistical models (see the multiprocessing sketch after this list). Code can also run on several machines via distributed computing; the analysis of radio signals in the search for extraterrestrial life works this way.
    2. languages and frameworks exist that target specific tasks, in this case processing large amounts of data: the Scala language and the Spark framework (see the Spark sketch below).
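
On point 2.1: normalization is a schema redesign that does not fit in a few lines, so as a simpler stand-in for query optimization, here is a sketch of adding an index using Python's built-in sqlite3 module; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(100_000)],
)

# Without an index, the filter below scans the whole table; with it,
# the database jumps straight to the matching rows.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE customer_id = ?", (42,)
).fetchone()[0]
print(total)
```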
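
On point 3.1, here is a minimal sketch of using all CPU cores with Python's standard multiprocessing module; the per-chunk function is a placeholder for whatever computation you actually run.

```python
from multiprocessing import Pool, cpu_count

def process_chunk(chunk):
    # Placeholder for real per-chunk work, e.g. cleaning or scoring rows.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n = cpu_count()  # one worker per available core
    chunks = [data[i::n] for i in range(n)]

    with Pool(processes=n) as pool:
        partial_sums = pool.map(process_chunk, chunks)

    print(sum(partial_sums))
```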
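
And on point 3.2: Spark is often driven from Scala, but it also has a Python API (PySpark), which keeps every example here in one language. A minimal aggregation sketch, assuming a hypothetical sales.csv with region and amount columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-threshold").getOrCreate()

# Spark splits the file into partitions and processes them in parallel,
# on one machine or across a cluster, with the same code.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

df.groupBy("region").agg(
    F.count("*").alias("orders"),
    F.sum("amount").alias("total"),
    F.avg("amount").alias("average"),
).show()

spark.stop()
```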

In any case, the problem described mostly affects medium-sized companies, or SMEs that have existed for at least five years.

If you would like to talk through whether you are about to cross the big data threshold, with all the difficulties that come with it, we can set up a free call where I will start helping you get back into the realm of manageable data.
