How big is Big Data?*

Spanish version: http://wp.me/paT7tX-3k

According to an article in Methodology (Hox, 2017), big data is related to computational social science and has three characteristics. The first is that it involves an amount of data too large for conventional databases to handle. The second is that specialized techniques for handling this data are becoming increasingly important. The last is agent-based simulation, an approach that is very popular in the social sciences.

Agent-based simulation is an innovative way to explore social phenomena. It is a research method that makes it easier to deal with the complexity, emergence, and non-linearity of those phenomena. The specialized techniques, especially resampling and cross-validation, are useful because they let the researcher evaluate a statistical analysis and check that its results do not depend on one particular partition between training data and test data.
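
To make the idea of cross-validation concrete, here is a minimal sketch in Python. It is not taken from the article: the function names, the simulated data, and the deliberately simple "predict the training mean" model are all assumptions chosen for illustration. The data is split into k folds, each fold takes a turn as test data, and the error is reported per fold so that no single training/test partition decides the result.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(values, k=5, seed=0):
    """Return one mean squared error per fold for a 'predict the training mean' model."""
    errors = []
    for test_idx in k_fold_indices(len(values), k, seed):
        held_out = set(test_idx)
        # Everything outside the current fold is training data.
        train = [v for i, v in enumerate(values) if i not in held_out]
        test = [values[i] for i in test_idx]
        prediction = statistics.fmean(train)
        errors.append(statistics.fmean([(v - prediction) ** 2 for v in test]))
    return errors


# Hypothetical data: 100 simulated measurements, not real survey data.
rng = random.Random(42)
data = [rng.gauss(10.0, 2.0) for _ in range(100)]

fold_errors = cross_validate(data, k=5)
print("error per fold:", fold_errors)
print("average error:", statistics.fmean(fold_errors))
```

If the per-fold errors are close to one another, the evaluation does not hinge on how the data happened to be split. If they differ widely, a single training/test split would have been misleading.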

To clarify the concept, the author asks how big big data actually is. The answers can be subjective. For example, John Tukey defined big data as anything that does not fit on one device. This definition is a moving target, however, because storage devices have changed enormously as technology has developed, from the magnetic tape of the 1950s, which held only a few megabytes, to today's 2-terabyte USB drives.

Although big data has no exact size threshold, we can measure the volume of data being collected. The article gives an example: “traffic loop tracking data, also collected by Statistics Netherlands, produces 80 million records per day. One year of data would be about 3 TB and it would only fit on a large hard disk.” This gives us an idea of the size of big data.
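
The figures in that quote can be checked with a little arithmetic. The short Python sketch below rests on two assumptions that the article does not state: a year of 365 days and a decimal terabyte (10^12 bytes). Under those assumptions, each traffic record works out to roughly 100 bytes.

```python
RECORDS_PER_DAY = 80_000_000   # figure quoted in the article
DAYS_PER_YEAR = 365            # assumption: a non-leap year
YEARLY_BYTES = 3 * 10**12      # "about 3 TB", assuming decimal terabytes

records_per_year = RECORDS_PER_DAY * DAYS_PER_YEAR
bytes_per_record = YEARLY_BYTES / records_per_year

print(f"{records_per_year:,} records per year")
print(f"about {bytes_per_record:.0f} bytes per record")
```

So the 3 TB does not come from large individual records. It comes from about 29 billion very small ones.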

The social sciences use big data because society is leaving an ever-larger digital trail, which is later analyzed to make inferences about people's behavior. This trail can include economic data, Facebook or Twitter messages, Internet discussion lists, and mobile phone records such as location and calls. All this data is collected to be analyzed.

References

Donoho, D. (2017). 50 Years of Data Science. Journal of Computational and Graphical Statistics, 26(4), 745-766. https://doi.org/10.1080/10618600.2017.1384734

Hox, J. J. (2017). Computational Social Science Methodology, Anyone? Methodology, 13(Supplement 1), 3-12. https://doi.org/10.1027/1614-2241/a000127