What is Data?

Data can be qualitative or quantitative:

1) Qualitative data is descriptive information (it describes something)

2) Quantitative data is numerical information (it counts or measures something)

What is Big Data?

Big Data is data of enormous size: the term describes collections of data that are huge in volume and still growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.

Why is Big Data Important?

>Cost Savings

How Amazon uses big data:

Amazon has thrived by adopting an “everything under one roof” model. However, when faced with such a huge range of options, customers can often feel overwhelmed. They effectively become data-rich, with tons of options, but insight-poor, with little idea about what would be the best purchasing decision for them.

Now, let’s try to understand some of the major drawbacks of using the traditional approach to storing and processing Big Data.

The first drawback is cost: it is an expensive system that requires a lot of investment to implement or upgrade, so small and mid-sized companies cannot afford it.

Challenges Associated with Big Data

These are the main challenges people face with Big Data:

1. Input/output processing:

Data processing starts with data in its raw form and converts it into a more readable format (graphs, documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized by employees throughout an organization. A typical data processing cycle has six stages:

  1. Data collection
  2. Data preparation
  3. Data input
  4. Processing
  5. Data output/interpretation
  6. Data storage
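The six stages above can be sketched as a tiny toy pipeline. This is only an illustration; all of the function names and the sample records are invented for the example:

```python
import json
import statistics

def collect():            # 1. Data collection: gather raw records
    return ["12", "7", "  19 ", "n/a", "4"]

def prepare(raw):         # 2. Data preparation: clean and drop invalid entries
    return [r.strip() for r in raw if r.strip().isdigit()]

def to_input(clean):      # 3. Data input: convert to a machine-usable form
    return [int(x) for x in clean]

def process(values):      # 4. Processing: compute summary statistics
    return {"count": len(values), "mean": statistics.mean(values)}

def output(result):       # 5. Data output/interpretation: a readable report
    return json.dumps(result)

def store(report, db):    # 6. Data storage: persist for later use
    db.append(report)

db = []
store(output(process(to_input(prepare(collect())))), db)
print(db[0])              # → {"count": 4, "mean": 10.5}
```

Each stage hands its result to the next, which is exactly the shape real pipelines have, just at a vastly larger scale.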

2. Volume:

To store this much data we need a lot of storage. Imagine that the biggest storage device in the world holds 10 GB, but your data is 20 GB: how would you fit it? According to popular storage vendors like Dell EMC and IBM, building enormous storage devices is not a big deal. But if we store the data on one or a few very large devices, we face two more issues: the first is cost and the second is velocity. And if one of those huge devices gets corrupted, that is a disaster for the company. These are only a few of the key challenges of Big Data; don’t think they are the only ones.

3. Velocity: have you ever wondered why Google is so fast? The simple answer is velocity.

When we store data in RAM, we notice that RAM is super fast; hard disks and SSDs are comparatively much slower. So why not just store everything in RAM? The problem is RAM’s architecture: RAM is ephemeral storage, so as soon as a program closes, its data vanishes. We cannot store data permanently in RAM. So we need solutions that can read and write data very fast while still keeping it persistent.
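The trade-off can be seen with a rough timing comparison. This is only a sketch: the payload size is arbitrary and the measured numbers vary wildly by machine, so no particular speedup is claimed here.

```python
import io
import os
import tempfile
import time

payload = b"x" * (16 * 1024 * 1024)   # 16 MB of sample data

# Write to RAM: an in-memory buffer that disappears when the process exits.
t0 = time.perf_counter()
buf = io.BytesIO()
buf.write(payload)
ram_seconds = time.perf_counter() - t0

# Write to disk: a temporary file, flushed and synced so the bytes really
# reach persistent storage and survive the process.
t0 = time.perf_counter()
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    f.flush()
    os.fsync(f.fileno())      # force the OS to persist the data
disk_seconds = time.perf_counter() - t0
os.unlink(f.name)             # clean up the temporary file

print(f"RAM write:  {ram_seconds:.6f} s")
print(f"Disk write: {disk_seconds:.6f} s")
```

The in-memory write is typically far faster, but the buffer vanishes with the process, while the file survives: that is exactly the speed-versus-persistence tension described above.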

4. Costing: this can also become a challenge, depending on the company or business.

As we all know, the price of storage devices grows steeply with capacity, so companies need to think about how to lower their costs. Business runs on revenue, and if revenue starts shrinking because of storage purchases, a company might die. So companies cannot simply buy huge storage devices outright to hold all their data. When I discuss the solutions after the next challenge, you will see how they keep their costs under control.

The Solution to the Big Data Problem:

Commodity hardware, using the concept of DISTRIBUTED STORAGE:

What is Distributed Storage?

Let us understand it with a simple example:

  • The master receives the data and distributes it among the slaves. That means we no longer have to worry about the volume problem: no matter how big the data is, we can distribute it across the slaves, and we don’t need to purchase bigger storage devices.
  • Since we are not purchasing bigger storage devices, our cost also decreases. We can buy many small storage servers and attach them to the master. If the data grows in the future, we simply purchase more storage servers and keep attaching them to the master.
  • Finally, speed. Suppose one storage server takes 10 minutes to store 10 GB of data. With multiple storage servers working in parallel, storing 40 GB across 4 servers (10 GB each) still takes only 10 minutes. And it’s not only about storing the data; it’s also about how fast you can read it back: one server reading 40 GB alone would take over 40 minutes. These are simplified examples; in industry these architectures are much bigger, with many components attached to each other.

>This master-slave setup is also called a TOPOLOGY. The entire setup working as a team is called a CLUSTER.
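The master-slave arithmetic above can be modeled in a few lines. This is a toy model, not a real storage system: the write times are computed from the example’s numbers (10 minutes per 10 GB), not measured, and the class names are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rate taken from the example above: 10 GB in 10 minutes.
WRITE_MINUTES_PER_GB = 1.0

class StorageNode:
    """A slave node that holds one stripe of the data."""
    def __init__(self, name):
        self.name = name
        self.blocks = []

    def store(self, block_gb):
        self.blocks.append(block_gb)
        return block_gb * WRITE_MINUTES_PER_GB   # minutes this node is busy

class Master:
    """Distributes incoming data evenly across its slave nodes."""
    def __init__(self, nodes):
        self.nodes = nodes

    def distribute(self, total_gb):
        stripe = total_gb / len(self.nodes)
        # All nodes write their stripe at the same time, so the elapsed
        # time is the slowest node's time, not the sum over all nodes.
        with ThreadPoolExecutor() as pool:
            times = list(pool.map(lambda n: n.store(stripe), self.nodes))
        return max(times)

cluster = Master([StorageNode(f"slave-{i}") for i in range(4)])
elapsed = cluster.distribute(40)              # 40 GB across 4 slaves
single = 40 * WRITE_MINUTES_PER_GB            # one server doing it alone
print(f"Parallel write time: {elapsed:.0f} minutes")   # → 10 minutes
print(f"Single-server time:  {single:.0f} minutes")    # → 40 minutes
```

Adding more slave nodes shrinks the stripe each node handles, which is why the cluster scales by attaching more cheap servers instead of buying one giant device.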

There are a few technologies/products used to solve the Big Data problem, e.g. Hadoop, Cassandra, MongoDB, Apache Hive, etc.

>One of the products used to implement distributed storage is Hadoop, which uses HDFS (Hadoop Distributed File System).
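As a back-of-the-envelope sketch of how HDFS lays data out: it splits each file into fixed-size blocks and replicates every block across several nodes. A 128 MB block size and a replication factor of 3 are Hadoop’s usual defaults (both are configurable), and the helper below is illustrative, not part of any Hadoop API:

```python
import math

BLOCK_MB = 128      # default HDFS block size (configurable)
REPLICATION = 3     # default HDFS replication factor (configurable)

def hdfs_footprint(file_mb):
    """Return (number of HDFS blocks, total raw storage used in MB)."""
    blocks = math.ceil(file_mb / BLOCK_MB)
    raw_mb = file_mb * REPLICATION      # every block stored 3 times
    return blocks, raw_mb

blocks, raw_mb = hdfs_footprint(1000)   # a 1,000 MB (~1 GB) file
print(blocks, raw_mb)                   # → 8 3000
```

So a roughly 1 GB file becomes 8 blocks scattered over the cluster, consuming about 3 GB of raw disk; the replication is what lets the cluster survive a corrupted or failed node, the disaster scenario mentioned under the Volume challenge.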

The Wrap-Up

I have done my best to explain the concepts surrounding Big Data in the simplest way I can.

Abhayagarwal
