
Post #2: History of Big Data

Big Data has been described by some Data Management pundits (with a bit of a snicker) as “huge, overwhelming, and uncontrollable amounts of information.” In 1663, John Graunt dealt with “overwhelming amounts of information” as well, while studying the bubonic plague, which was then ravaging Europe. Graunt is credited with being the first person to use statistical data analysis. In the early 1800s, the field of statistics expanded to include collecting and analysing data.
The evolution of Big Data rests on a number of preliminary steps, and while looking back to 1663 isn’t necessary to explain the growth of data volumes today, the point remains that “Big Data” is a relative term depending on who is discussing it. Big Data to Amazon or Google is very different from Big Data to a medium-sized insurance organisation, but no less “Big” in the minds of those contending with it.
The foundational steps toward the modern conception of Big Data include the development of computers, smart phones, the internet, and sensor-equipped (Internet of Things) devices that provide data. Credit cards also played a role by providing increasingly large amounts of data, and social media has certainly changed the nature of data volumes in novel and still-developing ways. The evolution of modern technology is interwoven with the evolution of Big Data.
The Foundations of Big Data
Data became a problem for the U.S. Census Bureau in 1880. They estimated it would take eight years to handle and process the data collected during the 1880 census, and predicted the data from the 1890 census would take more than 10 years to process. Fortunately, in 1881, a young man working for the bureau, named Herman Hollerith, created the Hollerith Tabulating Machine. His invention was based on the punch cards designed for controlling the patterns woven by mechanical looms. His tabulating machine reduced ten years of labour to three months.
In 1927, Fritz Pfleumer, an Austrian-German engineer, developed a means of storing information magnetically on tape. Pfleumer had devised a method for adhering metal stripes to cigarette papers (to keep a smoker’s lips from being stained by the rolling papers available at the time), and decided he could use this technique to create a magnetic strip, which could then be used to replace wire recording technology. After experiments with a variety of materials, he settled on a very thin paper, striped with iron oxide powder and coated with lacquer, for his patent in 1928.
During World War II (more specifically 1943), the British, desperate to crack Nazi codes, invented a machine that scanned for patterns in messages intercepted from the Germans. The machine was called Colossus, and scanned 5,000 characters a second, reducing the workload from weeks to merely hours. Colossus was the first data processor. Two years later, in 1945, John von Neumann published a paper on the Electronic Discrete Variable Automatic Computer (EDVAC). It was the first “documented” discussion of program storage, and laid the foundation of computer architecture today.
It is said these combined events prompted the “formal” creation of the United States’ NSA (National Security Agency), by President Truman, in 1952. Staff at the NSA were assigned the task of decrypting messages intercepted during the Cold War. Computers of this time had evolved to the point where they could collect and process data, operating independently and automatically.
The Internet Effect and Personal Computers
ARPANET began on Oct 29, 1969, when a message was sent from UCLA’s host computer to Stanford’s host computer. It received funding from the Advanced Research Projects Agency (ARPA), a subdivision of the Department of Defense. Generally speaking, the public was not aware of ARPANET. In 1973, it connected with a transatlantic satellite, linking it to the Norwegian Seismic Array. However, by 1989, the infrastructure of ARPANET had started to age. The system wasn’t as efficient or as fast as newer networks. Organisations using ARPANET started moving to other networks, such as NSFNET, to improve basic efficiency and speed. In 1990, the ARPANET project was shut down, due to a combination of age and obsolescence. The creation of ARPANET led directly to the Internet.
In 1965, the U.S. government built the first data center, with the intention of storing millions of fingerprint sets and tax returns. Each record was to be transferred to magnetic tape and stored in a central location. Conspiracy theorists expressed their fears, and the project was closed. However, in spite of its closure, this initiative is generally considered the first effort at large-scale data storage.
Personal computers came on the market in 1977, when microcomputers were introduced, and became a major stepping stone in the evolution of the internet, and subsequently, Big Data. A personal computer could be used by a single individual, as opposed to mainframe computers, which required an operating staff, or some kind of time-sharing system, with one large processor being shared by multiple individuals. After the introduction of the microprocessor, prices for personal computers fell significantly, and they came to be described as “an affordable consumer good.” Many of the early personal computers were sold as electronic kits, designed to be built by hobbyists and technicians. Eventually, personal computers would provide people worldwide with access to the internet.
In 1989, a British computer scientist named Tim Berners-Lee came up with the concept of the World Wide Web. The Web is an information space where web resources are identified using URLs, interlinked by hypertext links, and accessible via the Internet. His goal was to share information on the Internet using a hypertext system. His system also allowed for the transfer of audio, video, and pictures. By the fall of 1990, Tim Berners-Lee, working for CERN, had written the three basic technologies that are the foundation of today’s web:
  • HTML: HyperText Markup Language. The formatting language of the web.
  • URL: Uniform Resource Locator. A unique “address” used to identify each resource on the web. A URL is one kind of URI (Uniform Resource Identifier).
  • HTTP: Hypertext Transfer Protocol. Used for retrieving linked resources from all across the web.
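
How a URL and HTTP fit together can be sketched in a few lines of Python. This is only an illustration, not how any particular browser works: the address `http://example.org/docs/index.html` and the function names `describe_url` and `build_get_request` are made up for the example. The sketch splits a URL into its parts and builds the text of a simple HTTP GET request, without contacting any server.

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Split a URL into the parts a browser needs to locate a resource."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,   # which protocol to speak, e.g. "http"
        "host": parts.hostname,   # which server holds the resource
        "path": parts.path,       # which resource on that server
    }


def build_get_request(url: str) -> str:
    """Build the text of a minimal HTTP/1.1 GET request for the URL."""
    info = describe_url(url)
    path = info["path"] or "/"    # an empty path means the site root
    return f"GET {path} HTTP/1.1\r\nHost: {info['host']}\r\n\r\n"


# Hypothetical address, used only for illustration.
info = describe_url("http://example.org/docs/index.html")
request_text = build_get_request("http://example.org/docs/index.html")
```

The server's reply to such a request would typically be an HTML document, which is where the third piece, the formatting language, comes in.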

In 1993, CERN announced the World Wide Web would be free for everyone to develop and use. The free part was a key factor in the effect the Web would have on the people of the world. (It’s the companies providing the “internet connection” that charge us a fee.)
The Internet of Things (IoT)
The concept of the Internet of Things was assigned its official name in 1999. By 2013, the IoT had evolved to include multiple technologies, using the Internet, wireless communications, micro-electromechanical systems, and embedded systems. All of these transmit data about the person using them. Automation (including building and home automation), GPS, and other technologies also support the IoT.
The Internet of Things, unfortunately, can make computer systems vulnerable to hacking. In October of 2016, hackers crippled major portions of the Internet using the IoT. The early response has been to develop Machine Learning and Artificial Intelligence focused on security issues.
Computing Power and Internet Growth
There was an incredible amount of internet growth in the 1990s, and personal computers became steadily more powerful and more flexible. Internet growth was based on Tim Berners-Lee’s efforts, CERN’s free access, and access to individual personal computers.
In 2005, Big Data, which had been used without a name, was labelled by Roger Mougalas. He was referring to a large set of data that, at the time, was almost impossible to manage and process using the traditional business intelligence tools available. Additionally, Hadoop, which could handle Big Data, was created in 2005. Hadoop was based on an open-source software framework called Nutch, and was merged with Google’s MapReduce. It is itself open source, and can process structured and unstructured data from almost all digital sources. Because of this flexibility, Hadoop (and its sibling frameworks) can process Big Data.
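
The MapReduce idea mentioned above can be illustrated with a small word count. The following is a minimal single-machine sketch in plain Python, not Hadoop's actual API (Hadoop is a distributed system, and its native API is in Java). The function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine the values for each key into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return reduce_phase(shuffle(pairs))


counts = word_count(["big data is big", "data is everywhere"])
```

In a real cluster the map and reduce steps would run in parallel on many machines, each handling a slice of the data, which is what lets this simple pattern scale to very large data sets.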
