What is Data?

11 min read · Sep 17, 2020

Data is a collection of facts, such as numbers, words, measurements, observations or just descriptions of things.

Put another way, data as a general concept refers to existing information or knowledge represented or coded in some form suitable for use or processing.

For example:

The images, videos and text we upload to any social media platform are one kind of data.

The SMS messages and emails we send are another.

So, have you ever wondered how much data your mobile phone generates in the form of texts, phone calls, emails, photos, videos, searches and music?

Approximately 40 exabytes (EB) of data is generated every month by a single smartphone user. Amazed? It's true. Now imagine that number multiplied by roughly 5 billion smartphone users: 40 EB × 5,000,000,000 = 200,000,000,000 EB. That is a truly enormous amount of data. In fact, it is far more than traditional systems can handle, and this massive amount of data is what we term BIG DATA.
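The multiplication above can be sanity-checked in a couple of lines of Python (taking the article's per-user figure at face value):

```python
# Rough sanity check of the figures above (1 EB = 10**18 bytes).
per_user_eb = 40                # claimed EB generated per user per month
users = 5_000_000_000           # roughly 5 billion smartphone users

total_eb = per_user_eb * users
print(total_eb)                 # 200000000000 EB, matching the text
print(total_eb / 1000)          # the same figure in zettabytes (1 ZB = 1000 EB)
```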

Data generated on internet per day:

-> 2.1 million snaps on Snapchat

-> 3.8 million search queries on Google

-> 1.5 million people log on to Facebook

-> 4.5 million videos are watched on YouTube

-> 188 million emails are sent

and many more across other social media platforms.

In short, almost everything we use in our daily lives produces some type of data, so there is a lot of data out there.

Data can be qualitative or quantitative:

1)Qualitative data is descriptive information (it describes something)

2)Quantitative data is numerical information (numbers)

Everyone uses data in their day-to-day life, but 90% of people do not know where this huge amount of data is actually stored or how it can be maintained.

As you have noticed, the data is huge, and in technical terms this huge amount of data causes many problems; taken together, these problems are what we call the Big Data problem. Big Data is not a technology; it is just the name of a problem.

What is Big Data?

Big Data is also data, but of a huge size. Big Data is a term used to describe a collection of data that is huge in volume and growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.

Why Big Data?

So first, what launched the big data era?

An influential 2013 report by the consulting firm McKinsey claimed that data science would be the number one catalyst for economic growth. McKinsey identified one new opportunity that contributed to the launch of the big data era: a growing torrent of data.

This refers to the idea that data seems to arrive continuously and at a fast rate. Think about it: today you can buy a hard drive that stores all the music in the world for only $600. That is an amazing storage capability compared to any previous form of music storage.

In 2010 there were 5 billion mobile phones in use. You can be sure there are more today, and as you will understand, these phones and the apps we install on them are a big source of big data, contributing to the torrent all the time, every day.

And Facebook, which recently set a record of one billion people logging in on a single day, has more than 30 billion pieces of content shared every month. That number is from 2013, so it is surely much higher now.

Does it make you wonder how many Facebook shares you made last month? All this leads to projections of serious growth: 40% annual growth in global data, against just 5% growth in global IT spending. This much data has pushed the data science field to reinvent itself, along with the business world of today.

But there is something else contributing to the catalyzing power of data science: cloud computing, also known as on-demand computing. Cloud computing is one of the ways in which computing has become something we can do anytime and anywhere.

You may be surprised to know that some of your favorite apps come from businesses run out of coffee shops. This new ability, combined with the torrent of data, gives us the opportunity to perform novel, dynamic and scalable data analysis that tells us new things about our world and ourselves.

To summarize, a new torrent of big data combined with computing capability anytime, anywhere has been at the core of the launch of the big data era.

Why is Big Data Important?

>Cost Savings

>Time Reductions

>Understand the market conditions

>Control online reputation

>Using Big Data Analytics to Boost Customer Acquisition and Retention

>Using Big Data Analytics to Solve Advertisers Problem and Offer Marketing Insights

>Big Data Analytics As a Driver of Innovations and Product Development.

Let's take a deeper look at how some of the biggest companies in the world (Microsoft, Apple, Amazon, Alphabet, Facebook and others) use and manage Big Data.

How Amazon uses big data:

Amazon has thrived by adopting an “everything under one roof” model. However, when faced with such a huge range of options, customers can often feel overwhelmed. They effectively become data-rich, with tons of options, but insight-poor, with little idea about what would be the best purchasing decision for them.

To combat this, Amazon uses Big Data gathered from customers while they browse to build and fine-tune its recommendation engine. The more Amazon knows about you, the better it can predict what you want to buy. And, once the retailer knows what you might want, it can streamline the process of persuading you to buy it — for example, by recommending various products instead of making you search through the whole catalogue.

Amazon’s recommendation technology is based on collaborative filtering, which means it decides what it thinks you want by building up a picture of who you are, then offering you products that people with similar profiles have purchased.

Amazon gathers data on every one of its customers while they use the site. As well as what you buy, the company monitors what you look at, your shipping address (Amazon can take a surprisingly good guess at your income level based on where you live), and whether you leave reviews/feedback.

This mountain of data is used to build up a “360-degree view” of you as an individual customer. Amazon can then find other people who fit into the same precise customer niche (employed males between 18 and 45, living in a rented house with an income of over $30,000 who enjoy foreign films, for example) and make recommendations based on what those other customers like.
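The collaborative-filtering idea described above can be sketched in a few lines. This is a toy illustration only, not Amazon's actual engine; the shoppers, products and the Jaccard similarity measure here are invented for the example:

```python
# Toy user-based collaborative filtering: recommend items that the most
# similar shopper has bought but the target shopper has not.
purchases = {
    "alice": {"camera", "tripod", "sd_card"},
    "bob":   {"camera", "tripod", "lens"},
    "carol": {"novel", "cookbook"},
}

def jaccard(a, b):
    """Similarity of two purchase sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

def recommend(user):
    others = [u for u in purchases if u != user]
    # Pick the most similar other shopper...
    nearest = max(others, key=lambda u: jaccard(purchases[user], purchases[u]))
    # ...and suggest what they bought that `user` hasn't.
    return purchases[nearest] - purchases[user]

print(recommend("alice"))  # {'lens'}
```

Real systems build these "profiles" from far richer signals (views, shipping address, reviews), but the core step is the same: find similar customers, then recommend the difference.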

Let’s now see the Traditional Approach of Storing and Processing Big Data:

In a traditional approach, the data generated by organizations such as banks, stock markets or hospitals is fed into an ETL (Extract, Transform and Load) system.

An ETL system extracts this data, transforms it (that is, converts it into a proper format) and finally loads it into a database.

Once this process is completed, the end users would be able to perform various operations, such as generate reports and perform analytics by querying this data.
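The extract → transform → load flow just described can be sketched in a few lines of Python. The transaction fields and the in-memory SQLite "database" here are illustrative stand-ins, not any real organization's schema:

```python
import csv, io, sqlite3

# Extract: read raw records (an in-memory CSV stands in for the data feed).
raw = io.StringIO("account,amount\nA1, 100 \nA2, 250 \n")
rows = list(csv.DictReader(raw))

# Transform: normalise types and strip stray whitespace.
clean = [(r["account"].strip(), int(r["amount"])) for r in rows]

# Load: insert into a database that end users can query for reports.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE txns (account TEXT, amount INTEGER)")
db.executemany("INSERT INTO txns VALUES (?, ?)", clean)

# An end user can now run analytics queries against the loaded data.
total, = db.execute("SELECT SUM(amount) FROM txns").fetchone()
print(total)  # 350
```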

But as this data grows, it becomes a challenging task to manage and process this data using this traditional approach.

This is one of the reasons for not using the traditional approach for storing and processing the Big Data.

Now, let’s try to understand, some of the major drawbacks associated with using the traditional approach for storing and processing the Big Data.

The first drawback is that it is an expensive system requiring a lot of investment to implement or upgrade, so small and mid-sized companies cannot afford it.

The second drawback is scalability. As the data grows, expanding this system becomes a challenging task.

And the last drawback is that it is time-consuming. It takes a lot of time to process the data and extract valuable information from it, because the system is designed and built on legacy computing systems.

I hope this makes it clear why the traditional approach and legacy computing systems are not used to store and process Big Data.

Challenges Associated with Big Data

These are the main challenges people face with Big Data:

1. Input/output processing:

Data processing starts with data in its raw form and converts it into a more readable format (graphs, documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized by employees throughout an organization.

It includes:

  1. Data collection
  2. Data preparation
  3. Data input
  4. Processing
  5. Data output/interpretation
  6. Data storage
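As a rough sketch, the six steps above can be chained as simple functions. The readings and stage bodies are made up purely for illustration:

```python
# Each stage is a plain function; the pipeline is just their composition.
def collect():            return ["  7 ", "3", "  5"]      # raw readings
def prepare(raw):         return [s.strip() for s in raw]  # clean up
def to_input(prepped):    return [int(s) for s in prepped] # machine-readable
def process(values):      return sum(values) / len(values) # compute
def output(result):       return f"average = {result}"     # human-readable

stored_results = []                    # stands in for durable storage
def store(report):        stored_results.append(report)

report = output(process(to_input(prepare(collect()))))
store(report)
print(report)  # average = 5.0
```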

2. Volume:

To store this much data we need a lot of storage. Imagine that the biggest storage device in the world holds 10 GB but your data is 20 GB: how would you fit it? According to popular storage companies such as Dell EMC and IBM, building huge storage devices is not a big deal. But if we store all the data in one or a few giant devices, we face two more issues: cost and velocity. Also, if one of those huge devices gets corrupted, it is a disaster for the company. I am only pointing out a few key challenges under Big Data here; don't think these are the only ones.

3. Velocity: Have you ever wondered why Google is so fast? The simple answer is velocity.

When we store data in RAM, we notice that RAM is super fast, but when we store data on hard disks or SSDs it is comparatively much slower. You might say: then just store everything in RAM. Why do we need SSDs or hard disks at all? The problem lies in the architecture of RAM: it is ephemeral storage, so as soon as you close a program, its data vanishes from RAM. That means we cannot store data permanently in RAM, so we need solutions that still read and write data very fast.
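You can get a rough feel for the RAM-versus-disk gap on your own machine with the sketch below. Exact timings vary widely by hardware and OS caching, so no expected output is shown:

```python
import os, tempfile, time

payload = b"x" * (50 * 1024 * 1024)   # 50 MB of data

# In-memory "write": just copying bytes around in RAM.
t0 = time.perf_counter()
in_ram = bytes(payload)
ram_s = time.perf_counter() - t0

# On-disk write: the same bytes persisted to a temp file and flushed
# all the way to the device with fsync.
t0 = time.perf_counter()
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    f.flush()
    os.fsync(f.fileno())
    path = f.name
disk_s = time.perf_counter() - t0
os.unlink(path)

print(f"RAM copy: {ram_s:.4f}s, disk write: {disk_s:.4f}s")
```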

4. Costing: This can also become a challenge, varying from company to company and business to business.

As we all know, the price of bigger storage devices rises steeply, so companies need to think about how to lower their costs: business runs on revenue, and if revenue starts shrinking because of storage purchases, a company might die. So companies can't simply buy huge storage outright to hold all their data. When I discuss the solutions after the next challenge, you will see how they keep their costs down.

The solution to the Big Data problem is:

Commodity hardware, implemented through the concept of DISTRIBUTED STORAGE.

What is Distributed Storage?

Let us understand it with a simple example.

Think of it this way: you have 4 laptops or 4 storage servers, typically known as slave nodes or data nodes. Every one of them is connected over the network to one main machine, typically known as the master node or name node. Now suppose each server has 10 GB of storage; if 40 GB of data arrives, we can't store it on one server, and this is where distributed storage comes into play.

  • The master receives the data and distributes it among the slaves. That means we no longer have to worry about the volume problem: no matter how big the data is, we can spread it across the slaves, and we don't need to purchase bigger storage devices.
  • Since we are not purchasing bigger storage devices, our costs also decrease. We can buy many small storage servers and attach them to the master; if the data grows in future, we simply buy more servers and keep attaching them.
  • Finally, speed: suppose one storage server takes 10 minutes to store 10 GB. With multiple servers working in parallel, storing the same 40 GB across 4 devices (10 GB each) still takes only 10 minutes, whereas a single device would take over 40 minutes. And it's not only about writing the data; reading it gets faster in the same way. These are simple examples; in real industry, these architectures are much bigger, with many components attached to each other.

>This master-slave setup is also called a TOPOLOGY. The entire setup working as a team is called a CLUSTER.
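Under the assumptions in the example above (data split evenly across the slave nodes, all nodes writing in parallel at the same rate), the speedup can be sanity-checked like this:

```python
# Time to store `data_gb` at a given per-node rate, on one node
# versus striped evenly across several nodes working in parallel.
def store_time_minutes(data_gb, nodes, gb_per_min_per_node=1):
    per_node = data_gb / nodes             # each slave gets an equal share
    return per_node / gb_per_min_per_node  # slaves write simultaneously

print(store_time_minutes(40, nodes=1))  # 40.0 min on a single server
print(store_time_minutes(40, nodes=4))  # 10.0 min with 4 slave nodes
```

In practice replication, network overhead and stragglers eat into this ideal linear speedup, but the basic intuition holds.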

There are a few technologies/products used to solve the Big Data problem, such as Hadoop, Cassandra, MongoDB and Apache Hive.

>One technology used to implement distributed storage is Hadoop, which uses HDFS (the Hadoop Distributed File System).

The Wrap-Up

I have done my best to explain the concepts surrounding Big Data in the simplest way I can.

As we all know, the world runs on data. Because data is huge and companies can't delete it, storing it is a very big challenge, and that leads us to the world of Big Data.

In the coming days I am going to publish many articles on Big Data tools and technologies, so definitely follow me on Medium.

Here is my LinkedIn profile. If you have any queries, comment below or DM me on LinkedIn.





I am a tech enthusiast fascinated by technology and its various disciplines, including Big Data, Hadoop, web development, competitive programming, ML, etc.