Sunday, June 29, 2014

Big Data - Fundamentals

Big data is a term that has been circulating in the media for the past four years or so.

Big data basically refers to data sets that are too large for traditional software to capture and process in a reasonable amount of time.



The size of these data sets is continually growing, and they are now measured in petabytes (PB).

Emeet O Ryen writes on Dice.com:

" Among the challenges of implementing solutions: capturing, storing, searching, sharing, analyzing and visualizing such data sets.
As a long time computer scientist in the area of information management, I see the term “Big Data” as a marketing label. It attempts to quantify all that is done in the Distributed Computing or High Performance Computing space, and make the results available to the Data Scientist or Information Analyst so that answers to questions can be formulated in timely manner."

So Big Data boils down to the following steps (a toy code sketch follows the list):

Capturing data from traditional and non-traditional data sources, then joining, transforming, and processing it so that correlations between the different types yield a single larger data set.
Reducing that larger set to a representative sample of the whole.
Extracting the relevant information and applying statistical analysis to it.
Visualizing the data and the resulting information, so that trends can be determined and decisions made.
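
As a rough, self-contained illustration of those four steps, here is a minimal Python sketch over toy in-memory data; the sources, field names, and metric are all invented for the example, and a real pipeline would run these steps over distributed storage rather than two small lists:

```python
import statistics

# Step 1: capture - toy records from a "traditional" source (a sales table)
# and a "non-traditional" one (a web clickstream); fields are invented.
sales = [{"product": "A", "units": 120}, {"product": "B", "units": 45}]
clicks = [{"product": "A", "views": 900}, {"product": "B", "views": 700}]

# Step 2: join/transform - correlate the two sources into one larger set.
joined = [{**s, **c} for s in sales for c in clicks
          if s["product"] == c["product"]]

# Step 3: reduce/extract - derive a representative metric and analyze it.
rates = [r["units"] / r["views"] for r in joined]   # conversion rate
mean_rate = statistics.mean(rates)

# Step 4: report - a plain-text stand-in for real visualization.
for r, rate in zip(joined, rates):
    flag = "above" if rate > mean_rate else "below"
    print(f"{r['product']}: {rate:.2%} ({flag} the {mean_rate:.2%} mean)")
```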

A Competitive Advantage

Many businesses and organizations use Big Data to gain a competitive edge from the resulting business analytics — provided that they ask the right questions, have data sources that provide the right information and, of course, use the right statistical and analytical tools to interpret the answers they get.
These same businesses and organizations use Big Data to make timely strategic and tactical decisions. Instead of just relying on process or instinct, Big Data tools provide them with additional information to consider in making informed decisions.

The Technology

So is Big Data new? No, not really. Distributed Computing, or HPC, has been around for well over 20 years. What is new is applying HPC techniques and infrastructure to these different types of data sets, and using them to answer business-domain questions. HPC has traditionally been the domain of scientists doing physics, biological, and chemical analysis, simulation, and experimentation, where massive amounts of data are available.
The technology and infrastructure here differ somewhat from the HPC of even ten years ago. Today, low-cost commodity servers or cloud computing infrastructures can be used as Big Data hardware platforms. As for software, the keys to a Big Data environment are Apache Hadoop/MapReduce and its associated tools, NoSQL databases, some means of doing basic analytics on the data to be processed, and a means of visualizing the results of those analytics.
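
To make the Hadoop/MapReduce model concrete, the canonical example is a word count: a mapper emits (word, 1) pairs, Hadoop sorts them by key, and a reducer sums the counts per word. The two scripts below are a minimal sketch written for Hadoop Streaming, which pipes records through stdin/stdout; treat them as illustrative rather than production code.

```python
#!/usr/bin/env python3
# mapper.py - emit a (word, 1) pair for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - Hadoop delivers the mapper output sorted by key, so
# runs of the same word are consecutive and can be summed in one pass.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```

Outside Hadoop, the same pair can be smoke-tested on one machine with `cat input.txt | ./mapper.py | sort | ./reducer.py`, where the plain `sort` mimics Hadoop's shuffle-and-sort phase.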
There is now a wide variety of NoSQL databases available, such as MongoDB, Cassandra, and others. As for basic analytics, quite often Hadoop/MapReduce's associated tools can do the trick. When it comes to advanced analytics, tools like the R Project, SAS, and other statistical packages can be used, though quite often an integration effort is needed to make them work.
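
As a small taste of the NoSQL side, here is a sketch using PyMongo against a MongoDB server assumed to be running locally on the default port; the database, collection, and field names are all invented for the example:

```python
from pymongo import MongoClient

# Connect to a local MongoDB (assumed running on the default port).
client = MongoClient("mongodb://localhost:27017/")
events = client["demo_db"]["page_events"]   # names are illustrative

# Schema-less inserts: documents in one collection need not share fields.
events.insert_many([
    {"page": "/home", "ms": 120, "user": "u1"},
    {"page": "/cart", "ms": 340, "user": "u2", "referrer": "/home"},
])

# Basic analytics inside the database: average load time per page,
# computed with the aggregation pipeline.
for row in events.aggregate([{"$group": {"_id": "$page",
                                         "avg_ms": {"$avg": "$ms"}}}]):
    print(row["_id"], row["avg_ms"])
```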
Visualization of the ingested data and the resulting analytics appears to be a wide-open area, with many vendors claiming dominance in one niche or another.
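
Even so, a minimal chart is often enough to surface a trend. The sketch below uses matplotlib (assumed installed) to plot made-up monthly aggregates of the kind a Big Data job might emit:

```python
import matplotlib.pyplot as plt

# Made-up monthly totals standing in for the output of an analytics job.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
events = [1.2, 1.9, 2.4, 3.8, 5.1, 7.3]   # millions of events

plt.plot(months, events, marker="o")
plt.ylabel("Events (millions)")
plt.title("Event volume trend")
plt.savefig("trend.png")   # write a file rather than rely on a GUI backend
```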
