Big Data/Hadoop

Big data refers to very large collections of data sets, which can be structured, semi-structured, or unstructured, and which our traditional systems are unable to handle. For example, when we use browsers, listen to music online, pay with credit cards, or shop online, our activity is recorded and saved as data. This big data needs to be collected, stored, and processed for predictive analysis, and other methods need to be applied to extract useful information from it.

Big data can be characterised as:

Volume (data quantity): 500 TB of new data is created every day by Facebook alone. In addition, the data created by smartphones (video, location, sensor readings) and PCs, typically billions of constantly updated data feeds, adds up to a huge volume each day.

Velocity (data speed): Data is exchanged between billions of devices, and online processes produce many inputs every second. All of this has to be handled at high speed.

Variety (data type): Big data is a collection of many different kinds of data, including dates, numbers, figures, strings, video, text, and log files.

Why Big Data Technologies?

Better management: Big data technologies handle large amounts of data that traditional software cannot, and use data visualization and BI (Business Intelligence) tools to present data as easy-to-read charts and graphs. These intelligence tools let users extract the data they need for a specific purpose without going through a complex process.

Better speed and decisions: Big data technologies provide the storage capacity and processing power needed to analyze data quickly, which supports faster decisions.

Cost reduction: Big data technologies such as Hadoop and cloud-based analytics provide significant cost advantages when it comes to handling and analyzing large amounts of data and producing results efficiently. Hadoop is an open-source framework used to store, handle, and process big data within a distributed computing environment.

How Are Big Data Sets Stored?

Because of the velocity and veracity of big data, traditional data warehouses cannot deliver efficient results for it. Hadoop is a framework used for storing structured and unstructured data and for mining that data to provide the best results.

Big data files can range from terabytes to zettabytes in size, and Hadoop stores them in its distributed file system, HDFS (Hadoop Distributed File System). Data analytics is then done using Hadoop's implementation of Google's MapReduce programming model, which runs on top of HDFS. MapReduce divides an application into small blocks of work. HDFS creates multiple replicas of data blocks for reliability and places them on compute nodes around the cluster. MapReduce then processes the data where it is located.

MapReduce: It is the core component of Hadoop and performs two essential functions: it parcels out work to the various nodes of a cluster (map) and reduces the results into cohesive information (reduce). Data in MapReduce must be in the form of key/value (k/v) pairs, i.e. sets of two linked data items.
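
The map and reduce steps described above can be sketched in plain Python. This is only an illustration of the key/value flow, not Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample sentences are assumptions made for this example, and a simple loop stands in for the cluster nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) key/value pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # In Hadoop, each input split would be mapped on a different node;
    # here the "nodes" are simulated by an ordinary loop.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs big storage", "hadoop stores big data"]
    print(word_count(sample))
```

Every record that moves between the phases is a key/value pair, which is why the input has to be expressed in that form before MapReduce can work on it.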

HDFS: The Hadoop Distributed File System is designed to be deployed on low-cost hardware and is suited to large data sets. It is capable of quick fault detection and automatic recovery, which makes HDFS suitable for big data.

Big data analytics: This involves examining large amounts of data to uncover relevant information. Identifying hidden patterns supports strategic and operational business decisions, which leads to more effective marketing and better customer satisfaction.
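
To show why replicated blocks give fault tolerance, here is a minimal sketch of splitting a file into blocks and spreading the replicas over nodes. It is a simplification under stated assumptions: the 128 MB block size and replication factor of 3 are common HDFS defaults but are configurable, the round-robin placement is a stand-in for HDFS's real rack-aware policy, and the function and node names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size (a common HDFS default, configurable)
REPLICATION = 3      # assumed replication factor (a common HDFS default)


def place_blocks(file_size_mb, nodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [node, ...]} using simple round-robin placement.

    Real HDFS placement is rack-aware; round-robin is only a stand-in here.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Each replica of a block is stored on a different node.
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement


def readable_after_failure(placement, failed_node):
    """The file stays readable if every block keeps at least one live replica."""
    return all(any(node != failed_node for node in replicas)
               for replicas in placement.values())


if __name__ == "__main__":
    cluster = ["node1", "node2", "node3", "node4"]
    layout = place_blocks(500, cluster)  # a 500 MB file becomes 4 blocks
    for block, replicas in layout.items():
        print(block, replicas)
    print(readable_after_failure(layout, "node2"))
```

Because every block lives on several nodes, losing one low-cost machine does not lose any data, and the missing replicas can be copied again from the surviving ones.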

Different organizations using Big Data:

Many companies use big data to target and retarget customers by solving their problems, and to build products and services according to their needs. Many organizations have adopted big data analytics:

  • A9.com, Amazon: The product and visual search organisation uses Hadoop to build Amazon’s product search indices, and analysis is performed on millions of sessions daily.
  • Yahoo!: More than 100,000 CPUs in 20,000 computers run Hadoop, which supports research for ad systems and web search and is used for scaling tests.
  • Facebook: It uses Hadoop to store copies of internal log and dimension data sources, and uses it as a source for reporting, analytics, and machine learning.
  • Twitter: Hadoop has been used since 2010 to store and process tweets.
  • eBay: eBay has one of the largest Hadoop clusters in the industry, which runs predominantly MapReduce jobs. Hadoop is used at eBay for search optimization and research.
  • Accenture: The IT consulting company uses it for client projects in finance, telecom, and retail.

Challenges of Big Data:

  • Big data holds a vast amount of information, but the challenge is to extract the relevant information from a dataset.
  • Hadoop is used to handle massive volumes of data, but the technology is new and many data professionals are not familiar with how to use it.
  • It is also difficult to predict where the data should be allocated.

Future of Big Data: Big data analytics has become an emerging trend in today’s business. The availability of big data and low-cost commodity hardware sets it apart from traditional systems. McKinsey Global Institute estimates that data volume is growing 40% per year and will grow 44x between 2009 and 2020. Data is increasing at such a fast rate that the focus is now on how to improve business performance through data analytics. Adopting best practices and paying close attention to changes in the way we think about big data will, however, be important to every business.
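
As a quick consistency check on the two growth figures quoted above, the short calculation below assumes the 40% figure is an annual rate that compounds over the 11 years from 2009 to 2020. The function names are illustrative only.

```python
def implied_annual_growth(total_factor, years):
    """Annual rate r such that (1 + r) ** years equals total_factor."""
    return total_factor ** (1 / years) - 1


def compound(rate, years):
    """Total growth factor after compounding `rate` for `years` years."""
    return (1 + rate) ** years


if __name__ == "__main__":
    years = 2020 - 2009
    # Annual rate implied by 44x growth over the period.
    print(f"{implied_annual_growth(44, years):.1%}")
    # Total growth produced by 40% per year over the same period.
    print(f"{compound(0.40, years):.1f}x")
```

Growing 44x in 11 years works out to roughly 41% per year, and 40% per year compounds to roughly 40x, so the two estimates agree to within rounding.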

Big data analytics is used in business to gain a competitive advantage. Over many years, it has emerged as a technology for improving business analytics. Webtunix is a machine learning and AI based company, and weather forecasting is one of the applications of machine learning that we have implemented using the Hadoop architecture. We have done many projects, such as processing 20 years of data to count the years with the most extreme summers and winters, working with clusters that follow a master-slave model, and changing the Hadoop backend to provide a scheduler. These applications uniquely bring computer science, data science, and domain science together to provide real-time analytics, which enables businesses to become more efficient and profitable. Our team applies machine learning algorithms to find trends and patterns for transformational solutions. Together, these lead to better-quality decision making.