Wednesday, January 7, 2015

Big Data Analytics

What is Big Data:

In today's IT industry, "Big Data" is the buzzword for data sets so large or unstructured that conventional approaches are inefficient at dealing with them. The inefficiency of traditional storage and manipulation tools lies in their architecture: roughly 80% of today's big data is unstructured, or more specifically non-RDBMS data that crosses the boundaries of a single system. Yet this unstructured data is very useful, and the use of commodity hardware and plenty of open-source tools has made big data analytics a feasible task.

For example, the day-to-day data generated by social sites is NoSQL in nature, yet it is worth storing and manipulating for faster analysis of customer trends, or of the people a company is targeting for marketing. Twitter trend analysis is one of the most commonly cited examples of big data analytics.
The most common big data categories are medical data, telecom data (also known as Telco Big Data), log data generated by retail chains, bar-code data from the aviation industry, and many more.

Big data analytics has given new dimensions to data visualization and machine learning. Data visualization is the practice of representing values graphically, and it is very useful in decision making. The most promising use cases are weather forecasting and exit-poll surveys, which process large amounts of unstructured data and generate actionable results. From these you can appreciate how important the data is.

You can see my video lecture on Big Data Analytics here

Properties of Big Data:

Three most common properties of Big Data are:
  • Volume
  • Velocity
  • Variety
Technologies to deal with Big Data:

There are various tools and technologies to deal with big data, of which Hadoop is the most commonly used. Hadoop essentially stands for HDFS + MapReduce. HDFS is the reliable Hadoop Distributed File System, and MapReduce is a parallel processing framework that works on key-value pairs. HDFS is responsible for data storage with reliability and availability, while MapReduce is responsible for data processing. The main components responsible for storage in HDFS are the NameNode and DataNodes; correspondingly, the main components handling data processing in MapReduce are the JobTracker and TaskTrackers. If the NameNode is running on your local host, you can check the status of your Hadoop cluster at localhost:50070 in your web browser, as shown in the picture.
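The key-value flow of MapReduce can be sketched in plain Python. This is a toy simulation of the map, shuffle, and reduce phases for a word count, not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(document):
    # Mapper: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle/sort: group all values by key, as the framework does
    # between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data is big", "hadoop processes big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(pairs))
# counts -> {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```

On a real cluster the mappers and reducers run in parallel on different DataNodes, but the key-value contract between the phases is exactly this.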



The cluster configuration of Hadoop is specified in core-site.xml, hdfs-site.xml, and mapred-site.xml. You can customize your cluster by making appropriate changes in these files; Hadoop needs a restart for configuration changes to take effect. MapReduce1 suffered from a single point of failure (the JobTracker). The MapReduce2 (YARN) architecture was enhanced to handle parallel processing suited to both OLAP and OLTP applications and to avoid that single point of failure.
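As an illustration, a minimal core-site.xml pointing clients at the NameNode might look like the following (the host and port here are placeholders; use your own cluster's values):

```xml
<!-- core-site.xml: tells Hadoop clients where the NameNode lives -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```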
Hadoop is an open-source technology designed to deal with distributed databases holding unstructured data. It is not designed for the fastest possible processing; rather, it is specifically designed for failure-tolerant distributed data processing. Hadoop provides partial-failure support, data recovery, component recovery, consistency, and scalability.
The Hadoop installation package includes various benchmark tests (e.g., TestDFSIO) to check the performance of a Hadoop cluster.

For user convenience and faster development, the Hadoop ecosystem supports various higher-level languages such as Pig, Hive, Jaql, and many more, discussed in more detail below.

Pig (Pig Latin):
Pig is a simple platform whose language, popularly known as Pig Latin, can be used for manipulating and querying data. Pig is a high-level data-flow language developed at Yahoo!. Unlike SQL, Pig does not require data to have a schema: if you don't specify a datatype, every field defaults to bytearray. In Pig, relation, field, and function names are case-sensitive, while keywords are not.
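A short Pig Latin sketch shows both points: the LOAD has no types, so every field is a bytearray until it is cast (the file name and field names here are illustrative):

```pig
-- Load tab-delimited data with no schema; fields default to bytearray.
logs   = LOAD 'access_log' AS (user, url, bytes);
-- Cast only where a typed operation is needed.
big    = FILTER logs BY (int)bytes > 1024;
grpd   = GROUP big BY user;
counts = FOREACH grpd GENERATE group AS user, COUNT(big) AS hits;
DUMP counts;
```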

Hive (Hive QL): 
Apache Hive, first created at Facebook, is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and analysis of large datasets stored in Hadoop-compatible file systems. Hive organizes data into databases, tables, partitions, and buckets. Its supported storage file formats are TEXTFILE, SEQUENCEFILE, and RCFILE (Record Columnar File). Hive uses temporary directories on both the Hive client and HDFS; the client cleans up this temporary data when a query completes.
Hive Query Language was developed at Facebook and later contributed to the open-source community. Facebook currently uses Hive for reporting dashboards and ad-hoc analysis.
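A small HiveQL sketch ties these pieces together, creating a partitioned RCFILE table and running an ad-hoc summarization over it (table and column names are made up for illustration):

```sql
-- Partitioned table stored in the RCFILE format.
CREATE TABLE page_views (user_id STRING, url STRING)
PARTITIONED BY (view_date STRING)
STORED AS RCFILE;

-- Ad-hoc summarization over a single partition.
SELECT url, COUNT(*) AS hits
FROM page_views
WHERE view_date = '2015-01-07'
GROUP BY url;
```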

Spark:
Apache Spark is a fast execution engine. It can work independently or on top of Hadoop, using HDFS for storage. As a standalone solution, Spark is used as an extremely fast in-memory processing framework.
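Part of Spark's speed comes from its lazy model: transformations only build a plan, and nothing executes until an action asks for results. The idea can be mimicked with Python generators (a loose analogy only, not Spark's API):

```python
def transform(records):
    # "Transformations": generators are lazy, so nothing runs yet.
    mapped = (line.split(",") for line in records)
    filtered = (fields for fields in mapped if int(fields[1]) > 100)
    return filtered  # still just a plan, no work has been done

log = ["a,50", "b,200", "c,300"]
plan = transform(log)          # builds the pipeline only
result = [f[0] for f in plan]  # the "action": triggers the computation
# result -> ['b', 'c']
```

In real Spark the plan is an RDD lineage distributed across a cluster, which also gives fault tolerance: a lost partition is recomputed from the plan.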

Jaql:
Jaql is primarily a query language for JavaScript Object Notation (JSON) files, but it supports more than just JSON: it allows you to process both structured and unstructured data. It was developed by IBM and later donated to the open-source community. Jaql lets you select, join, group, and filter data stored in HDFS, much like a blend of Pig and Hive. Its query language was inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig. Jaql is a functional, declarative query language designed to process large data sets; for parallelism, it rewrites high-level queries, when appropriate, into low-level queries consisting of MapReduce jobs.
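The kind of select-and-filter operation Jaql expresses over JSON records looks roughly like this in plain Python (the records and field names are invented for illustration; Jaql's own syntax differs):

```python
# A small collection of JSON-style records.
records = [
    {"user": "amit", "site": "twitter", "posts": 120},
    {"user": "neha", "site": "facebook", "posts": 45},
    {"user": "raj",  "site": "twitter", "posts": 300},
]

# Filter on a predicate, then project a subset of fields --
# the filter/transform pattern Jaql applies over JSON data.
active_twitter = [
    {"user": r["user"], "posts": r["posts"]}
    for r in records
    if r["site"] == "twitter" and r["posts"] > 100
]
# active_twitter keeps only the two active Twitter users
```

Jaql's contribution is that it compiles this style of query into MapReduce jobs automatically when the data lives in HDFS.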

To install Hadoop, please follow these steps; to run your first program on your Hadoop cluster, follow the steps given in the following link (Steps to run first MapReduce program: wordcount).

NoSQL Databases:

NoSQL databases are also an integral part of the big data analytics domain. A few names are MongoDB, CouchDB, Couchbase, Cassandra, etc. These databases are designed to store data in a non-relational structure, generally as JSON, and most of them are categorized as document-oriented databases. They are effective tools for big data analytics as well as elastic search.
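The document model can be illustrated with a toy in-memory store: documents are schema-free dicts, and queries match on field values, similar in spirit to MongoDB's `find()`. This is a sketch for intuition, not a real database client:

```python
class DocumentStore:
    """Toy in-memory document store; real systems add indexing,
    sharding, replication, and persistence on top of this idea."""

    def __init__(self):
        self.docs = []

    def insert(self, doc):
        # Schema-free: any JSON-like dict is accepted, fields may differ.
        self.docs.append(doc)

    def find(self, query):
        # Return documents whose fields equal every key/value in the query.
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in query.items())]

store = DocumentStore()
store.insert({"name": "ravi", "city": "pune"})
store.insert({"name": "asha", "city": "delhi", "tags": ["bigdata"]})
matches = store.find({"city": "delhi"})
# matches -> the single Delhi document
```

Note how the second document carries a `tags` field the first one lacks; no schema change was needed, which is exactly what makes these stores convenient for unstructured big data.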

Kafka

Kafka is a distributed messaging system that works in near real time. Clients (consumers) get access to the desired information from the corresponding servers (producers) based on a selected topic.
A solution architecture built on Spark, Cassandra, and Kafka is being used as an SCM solution at prestigious retail chains, reportedly reducing SCM operations cost by 30-40%.
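The producer/consumer/topic model can be sketched as a toy broker: producers append messages to a topic's log, and each consumer reads from its own offset. This mimics Kafka's model only; it is not the Kafka protocol or client API:

```python
from collections import defaultdict

class MiniBroker:
    """Toy topic-based broker: producers append to an ordered log,
    consumers track their own read position (offset) per topic."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic -> ordered message log
        self.offsets = defaultdict(int)   # (consumer, topic) -> next offset

    def produce(self, topic, message):
        # Producers only ever append; the log is immutable history.
        self.topics[topic].append(message)

    def consume(self, consumer, topic):
        # Deliver everything this consumer has not yet seen.
        log = self.topics[topic]
        pos = self.offsets[(consumer, topic)]
        new_messages = log[pos:]
        self.offsets[(consumer, topic)] = len(log)
        return new_messages

broker = MiniBroker()
broker.produce("trends", "#bigdata")
broker.produce("trends", "#hadoop")
first = broker.consume("c1", "trends")   # both messages
second = broker.consume("c1", "trends")  # nothing new yet
```

Because offsets belong to consumers rather than the broker, many independent consumers can read the same topic at their own pace, which is what makes this model suit near-real-time analytics pipelines.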

For more updates on big data analytics, you can like CoE Big Data. If you are really a big data enthusiast and want to learn this technology stack from a practical standpoint, visit DataioticsHub and join our Dataioticshub-meetup group for hands-on sessions.

Please find the list of projects in the field of big data here
