Hadoop : What is BIG DATA?

 

Introduction:

Big Data is a current industry problem. It has two parts:

1) Storage Problem (TB, PB, EB…)

2) Processing Problem.

The solutions to these problems are called “Big Data Storage Solutions” and “Big Data Processing Solutions”. The Big Data problem cannot be solved with traditional RDBMS systems, so the industry is adopting Big Data solutions instead.

 

Solutions introduced in the market to solve the Big Data problem:

 

1) NoSQL: Cassandra, HBase, MongoDB, Bigtable…

2) Hadoop

HDFS (Distributed Storage)

+

MapReduce (Distributed Processing)

+

YARN (Resource Management, introduced in 2013)

3) Hadoop Ecosystem (Processing Solutions)

  • HIVE
  • PIG
  • SQOOP
  • FLUME
  • OOZIE
  • SPARK
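To make the "MapReduce (Distributed Processing)" idea above more concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps in plain Python. It is not Hadoop's actual API (Hadoop jobs are normally written against its Java classes and run across many machines); the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map -> shuffle -> reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs storage", "big data needs processing"]
    print(word_count(sample))
```

In a real cluster each map call would run on the node that holds that piece of the input, and the shuffle would move data between nodes; here everything runs in one process just to show the shape of the model.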

 

Drawbacks of Traditional Filesystems:

1) Storing large amounts of data.

2) Processing large amounts of data.

3) Data loss during:

- Power failure

- Network failure

- Hardware and software failure (data can be recovered only if we have a backup)

4) Metadata: A traditional filesystem has file-level metadata but not content-level metadata, so we cannot know the actual contents of a file by looking at its metadata.

Content-level metadata provides both content-level and file-level information, which makes searching easy and fast.

E.g. Google uses content-level metadata.
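The difference between file-level and content-level metadata can be sketched with a small inverted index in Python. This is only an illustration under stated assumptions: the "files" are in-memory strings, the helper names (`file_level_metadata`, `build_content_index`, `search`) are hypothetical, and it is not a description of how any real search engine is implemented.

```python
from collections import defaultdict


def file_level_metadata(name, text):
    """File-level metadata: describes the file, not what is inside it."""
    return {"name": name, "size_bytes": len(text.encode("utf-8"))}


def build_content_index(documents):
    """Content-level metadata: map each word to the files that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def search(index, word):
    """Look up a word in the index without opening any file again."""
    return sorted(index.get(word.lower(), set()))


if __name__ == "__main__":
    # Hypothetical sample documents standing in for files on disk.
    docs = {
        "a.txt": "Hadoop stores big data",
        "b.txt": "Spark processes big data",
    }
    print(file_level_metadata("a.txt", docs["a.txt"]))
    index = build_content_index(docs)
    print(search(index, "big"))
    print(search(index, "hadoop"))
```

With only the file-level record, finding every file that mentions "big" means reading every file; with the content index, it is a single dictionary lookup.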

History of Hadoop

 

- In 1998, Google set out to build a scalable search engine, which led to its internal development of the Google File System (GFS) and MapReduce.

- In 2002, Doug Cutting and Mike Cafarella started development of the Nutch project.

Nutch (A Web Crawler):

A Web crawler, sometimes called a spider, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).

Web search engines and some other sites use Web crawling or spidering software to update their own web content or their indices of other sites’ web content. Web crawlers can copy all the pages they visit for later processing by a search engine, which indexes the downloaded pages so that users can search much more efficiently.

But the data collected by the Nutch project caused storage and processing problems. Because Nutch was not scalable, it did not succeed in this form.

- In 2003, Google published its paper on GFS as a storage solution, followed in 2004 by its paper on MapReduce as a processing solution.

- In 2004, the Nutch team implemented Google's solution as NDFS (Nutch Distributed File System) to overcome the drawbacks of Nutch.

- In 2006, Yahoo hired Doug Cutting and his team to build a solution for Yahoo's storage and processing problems, and that is how Hadoop actually got started.

- In 2007, Apache Hadoop (HDFS, MapReduce) was made open source by Yahoo. It was then continuously modified, and its performance improved steadily.

- In 2008, Hadoop sorted 1 TB of data in approximately 3.5 minutes, and in 2009 it sorted the same amount of data in only 62 seconds.

- Development of the Hadoop Ecosystem tools started in 2008. Hadoop supports more than 15 filesystems.

Apache Hadoop (framework):

In 2011, Hadoop 1.x (HDFS and MapReduce) was released.

In 2013, Hadoop 2.x (HDFS, MapReduce and YARN) was released.

In 2016, the first alpha of Hadoop 3.x (HDFS, MapReduce and YARN) was released; it became generally available in 2017.

At the time of writing, the industry is mostly using Hadoop 2.x.

Distributors of Hadoop:

Some vendors addressed the drawbacks of Hadoop with their own modifications and packaged the result as commercial solutions. These vendors are called “Distributors of Hadoop”:

1) Cloudera

2) Hortonworks

3) MapR

4) IBM BigInsights

5) Pivotal HD

6) Amazon EMR (Elastic MapReduce)

 

Note: Please test scripts in Non Prod before trying in Production.