What is Big Data?

Introduction

Have you ever stopped to think about the amount and variety of data that we generate and store each day? Banks, insurance companies, airlines, telecom operators, online search services, social networks, and retailers are just a few of the many companies that deal with large volumes of information day to day. But merely having data is not enough: it is important to know how to use it. This is where the concept of Big Data enters the scene.

In this text, you will see what Big Data is, understand why the name is ever more present in the vocabulary of Information Technology (IT) environments, and learn how the concept can contribute to the daily life of companies, governments, and other institutions.

The concept of Big Data

In principle, we can define Big Data as data sets so extremely large that they require specially prepared tools to handle their volume, so that any and all information in these collections can be found, analyzed, and used in a timely fashion.

Put more simply, the idea can also be understood as the analysis of large amounts of data to generate important results that would hardly be achievable with smaller volumes.

It is not difficult to understand the scenario in which the concept applies: we exchange millions of emails per day; thousands of banking transactions happen around the world every second; sophisticated solutions manage the supply chains of several factories at this very moment; operators continuously record calls and traffic data from a growing number of mobile lines worldwide; and ERP systems coordinate the departments of many companies. In short, there is no lack of examples; if asked, you could surely point to others without effort.

Information is power: if a company knows how to use the data it has in hand, it can understand how to improve a product, how to create a more efficient marketing strategy, how to cut expenses, how to produce more in less time, how to avoid wasting resources, how to overcome a competitor, how to serve a key customer satisfactorily, and so on.

Note that we are talking about factors that can be decisive for the future of a company. But Big Data is a relatively recent name (or, at least, one that began to appear in the media only recently). Does this mean that companies discovered the need to make better use of their large databases only in the last few years?

You can be sure they did not. IT departments have long included applications for Data Mining, Business Intelligence, and CRM (Customer Relationship Management), for example, precisely to handle data analysis, decision-making, and other business-related tasks.

What a Big Data solution proposes is a comprehensive approach to the increasingly “chaotic” state of data, in order to make such applications, and all others, more efficient and accurate. To that end, the concept considers not only the large amounts of data, but also the speed of analysis, the availability of the data, and the relationships with and between the volumes.

Why is Big Data so important?

We have dealt with data since the dawn of mankind. What happens is that, in current times, advances in computing allow us to save, organize, and analyze data far more easily and far more frequently.

This panorama is far from ceasing to grow. Just imagine, for example, that multiple devices in our homes (refrigerators, TVs, washing machines, coffee makers, and so on) will be connected to the Internet in a not too distant future. This forecast falls within what is known as the Internet of Things.

If we look at what we have now, we see great changes compared with previous decades: taking the Internet alone as a basis, think about the amount of data generated every day on social networks alone; note the immense number of sites on the Web; notice that you can shop online even from your mobile device, when, in a not very distant past, the height of store computerization was isolated systems for managing physical inventory.

Current technologies have allowed us (and will continue to allow us) to increase the amount of information in the world exponentially, and now companies, governments, and other institutions need to know how to deal with this “explosion” of data. Big Data proposes to help in this task, since the computational tools used until now for data management can no longer do it satisfactorily on their own.

The amount of data generated and stored daily has reached such a point that a centralized data-processing structure no longer makes sense for the vast majority of large entities. Google, for example, has several data centers to handle its operations, but manages them in an integrated way. This “structural partitioning”, it is worth highlighting, is not a barrier to Big Data: in times of cloud computing, nothing could be more trivial.

 

The ‘Vs’ of Big Data: volume, velocity, variety, veracity and value

To make the idea of Big Data clearer, some experts began to summarize the subject in a way that satisfactorily describes the basis of the concept: the five ‘Vs’ (volume, velocity, and variety, with veracity and value added later).

The volume aspect you already know: we are talking about really large amounts of data, which grow exponentially and which, not rarely, are underutilized precisely because they are in that condition.

Velocity is another point you have already assimilated. To handle certain problems, the treatment of data (acquisition, recording, updating, and so on) must be done in a timely manner, often in real time. If the size of the database becomes a limiting factor, the business may be impaired: imagine, for example, the trouble a credit card company would have (and would cause) if it took hours to approve a customer's transaction because its security system could not quickly analyze all the data that might indicate fraud.

Variety is another important aspect. The volume of data we have today is also a result of the diversity of information. We have structured data, that is, data stored in databases such as PostgreSQL and Oracle, and unstructured data coming from many sources, such as documents, images, audio, video, and so forth. It is necessary to treat variety as part of a whole: a given piece of data may be useless if it is not associated with the others.
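
To make the contrast concrete, here is a minimal illustration in Python; all field names and values are invented for the example:

```python
# The same kind of business information in two shapes. Values are invented.

# Structured: fixed columns, as in a row of a relational table.
order_row = ("order-1042", "2012-11-30", 59.90)  # (id, date, total)

# Unstructured / semi-structured: free text and irregular, nested fields,
# as found in documents, logs, or social media posts.
customer_feedback = {
    "text": "Arrived fast, but the box was damaged.",
    "attachments": ["photo1.jpg"],
    "reply": None,
}
```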

Veracity can also be considered, because there is not much point in dealing with the “volume + velocity + variety” combo if the data is not reliable. Processes are needed that guarantee, as far as possible, the consistency of the data. Returning to the credit card example, imagine the problem the company would have if its system blocked a genuine transaction by analyzing data inconsistent with reality.
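
As a toy illustration of the kind of consistency checks involved (the fields and rules below are invented for the example):

```python
# Toy veracity filter: drop records that violate basic consistency rules
# before they reach the analysis stage. Fields and rules are invented.
transactions = [
    {"id": 1, "amount": 25.0, "currency": "USD"},
    {"id": 2, "amount": -999.0, "currency": "USD"},  # impossible amount
    {"id": 3, "amount": 12.5, "currency": ""},       # missing currency
]

def is_consistent(record):
    return record["amount"] > 0 and bool(record["currency"])

reliable = [t for t in transactions if is_consistent(t)]
print(reliable)  # only the first record survives the checks
```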

Information is not only power; information is also an asset. The combination of “volume + velocity + variety + veracity”, along with any other aspect that characterizes a Big Data solution, proves unworkable if the result does not bring significant benefits that offset the investment. This is the point of view of value.

Of course, these five aspects do not need to be taken as the definitive framing. There are those who believe, for example, that the combination “volume + velocity + variety” is sufficient to convey an acceptable notion of Big Data. Under this approach, the aspects of veracity and value would be unnecessary, because they are already implicit in the business: any serious entity knows it needs consistent data; no entity makes decisions and invests without an expectation of return.

Highlighting these two points may indeed be unnecessary, as it refers to what seems obvious. On the other hand, accounting for them can be relevant because it reinforces the care these aspects demand: a company might analyze social networks to evaluate the image customers have of its products, but is that information trustworthy enough that more judicious procedures are not required? Isn't a deeper study needed to reduce the risks of an investment before making it?

In any case, the first three ‘Vs’ (volume, velocity, and variety) may not offer the best definition of the concept, but they are not far from doing so. It is often assumed that Big Data is only about massive amounts of data, yet a volume that is not very large can still fit the context because of the velocity and variety factors.

Big Data solutions

In addition to dealing with extremely large volumes of data of the most varied types, Big Data solutions also need to work with distributed processing and elasticity, that is, they must support applications whose data volumes grow substantially in a short time.

The problem is that “traditional” databases, especially those based on the relational model, such as MySQL, PostgreSQL, and Oracle, are not well suited to these requirements, since they tend to be less flexible.

This happens because relational databases are usually based on four properties that make their use safe and efficient, which is precisely why solutions of this type are so popular. The combination is known as ACID, an acronym for Atomicity, Consistency, Isolation, and Durability. Here is a brief description of each, followed by a short sketch of a transaction:

  • Atomicity: each transaction must be atomic, that is, it can only be considered effective if executed completely;
  • Consistency: all the rules applied to the database must be followed;
  • Isolation: no transaction may interfere with another that is in progress at the same time;
  • Durability: once a transaction is completed, the resulting data cannot be lost.
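
As a minimal sketch of these properties in action, the example below uses Python's built-in sqlite3 module; the table, accounts, and amounts are invented for the illustration:

```python
# Minimal sketch of an ACID transaction with Python's built-in sqlite3.
# The schema and account values are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        # Both updates succeed together or not at all (atomicity).
        conn.execute("UPDATE accounts SET balance = balance - 40 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 40 WHERE id = 2")
except sqlite3.Error:
    # On failure the rollback leaves the database consistent.
    pass

print(conn.execute("SELECT id, balance FROM accounts").fetchall())
# [(1, 60), (2, 40)]
```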

The problem is that this set of properties is too restrictive for a Big Data solution. Elasticity, for example, can be hindered by atomicity and consistency. This is where the concept of NoSQL enters the scene, a name many attribute to the English expression “Not only SQL”. (SQL, short for Structured Query Language, is, in a few words, a language for working with relational databases.)

NoSQL refers to database solutions that allow data to be stored in various forms, not limited to the traditional relational model. Databases of this type are more flexible, and many are compatible with a set of assumptions that “competes” with the ACID properties: BASE (Basically Available, Soft state, Eventually consistent).
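
The “eventually consistent” part can be pictured with a toy example, using two in-memory dictionaries to stand in for replicas; nothing here reflects a real database's API:

```python
# Toy illustration of eventual consistency (the "E" in BASE).
# Two replicas accept reads independently and synchronize later;
# all names and values are invented for the example.
replica_a = {"stock": 10}
replica_b = {"stock": 10}

replica_a["stock"] = 9          # a write accepted by one replica only
print(replica_b["stock"])       # 10: a stale read is possible meanwhile

replica_b.update(replica_a)     # background sync propagates the write
print(replica_b["stock"])       # 9: the replicas eventually converge
```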

This does not mean relational databases have become outdated; they are, and will remain for a long time, useful for a wide range of applications. What happens is that, generally, the larger a relational database becomes, the more costly and laborious it is to run: it must be optimized, new servers must be added, more specialists must be employed in its maintenance, and so on.

As a rule of thumb, scaling (making bigger) a NoSQL database is easier and less costly. This is possible because, in addition to having more flexible properties, databases of this type are already optimized for parallel processing, global distribution (multiple data centers), immediate increases in capacity, and so forth.
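
One ingredient behind this cheaper scaling is horizontal partitioning (sharding), sketched below; the node names are invented, and real systems typically use consistent hashing rather than this naive modulo scheme:

```python
# Toy sketch of horizontal partitioning (sharding): each key is routed
# to a node by hashing, so data and load spread across cheap machines.
import hashlib

nodes = ["node-1", "node-2", "node-3"]  # illustrative node names

def node_for(key: str) -> str:
    # Hash the key and map it onto one of the available nodes.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

for key in ["user:17", "user:18", "user:19"]:
    print(key, "->", node_for(key))
```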

In addition, there is more than one category of NoSQL database, so these solutions can handle the wide variety of data that exists, both structured and unstructured: document-oriented databases, key/value databases, graph databases, and so on.

Examples of NoSQL databases are Cassandra, MongoDB, HBase, CouchDB, and Redis. But when the subject is Big Data, a database alone is not enough. It is also necessary to count on tools that allow these volumes to be processed. At this point, Hadoop is, by far, the main reference.
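
As an illustration of the document-oriented category, here is a brief sketch using MongoDB through the pymongo driver, assuming a local MongoDB instance is running; the database, collection, and field names are invented for the example:

```python
# Minimal sketch of the document-oriented model, assuming a local MongoDB
# instance and the pymongo package; all names below are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]

# Unlike rows in a relational table, two documents in the same
# collection do not need to share the same schema.
products.insert_one({"name": "phone", "price": 399, "specs": {"ram_gb": 4}})
products.insert_one({"name": "ebook", "price": 12, "formats": ["epub", "pdf"]})

print(products.find_one({"name": "phone"}))
```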

What is Hadoop?

Hadoop is an open source platform developed especially for processing and analyzing large volumes of data, whether structured or unstructured. The project is maintained by the Apache Foundation, with the collaboration of several companies, such as Yahoo!, Facebook, Google, and IBM.

One can say that the project started in mid-2003, when Google created a programming model that distributes processing across various computers, helping its search engine become faster without depending on ever more powerful (and more expensive) servers. This technology was named MapReduce.
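
The essence of the model can be sketched in a few lines of single-machine Python; a framework like Hadoop runs exactly these map, shuffle, and reduce phases in parallel across a cluster, and the data below is invented for the example:

```python
# Minimal single-process sketch of the MapReduce idea (word count).
# A real framework distributes these phases across many machines;
# here everything runs locally on toy data.
from itertools import groupby
from operator import itemgetter

documents = ["big data is big", "data about data"]

# Map phase: emit a (word, 1) pair for every word in every document.
pairs = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: bring together all pairs that share the same key.
pairs.sort(key=itemgetter(0))

# Reduce phase: sum the values emitted for each key.
counts = {word: sum(count for _, count in group)
          for word, group in groupby(pairs, key=itemgetter(0))}

print(counts)  # {'about': 1, 'big': 2, 'data': 3, 'is': 1}
```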

A few months later, Google introduced the Google File System (GFS)*, a file system specially prepared to deal with distributed processing and, as one would expect from a company of that size, with large volumes of data (on the order of terabytes or even petabytes).

*In a few words, a file system is the set of instructions that determines how data should be stored, accessed, copied, modified, named, deleted, and so on.

In 2004, an open source implementation of GFS was built into Nutch, a Web search engine project. Nutch faced scale problems (it could not deal with a large volume of pages), and the GFS-inspired file system, named the Nutch Distributed Filesystem (NDFS), showed itself to be a solution. The following year, Nutch also gained an implementation of MapReduce.

In fact, Nutch was part of a larger project: a page-indexing library called Lucene. Those responsible for this work soon saw that what they had in hand could also be used in applications other than Web search. This perception motivated the creation of another project encompassing characteristics of Nutch and Lucene: Hadoop, whose file system implementation was named the Hadoop Distributed File System (HDFS).

Hadoop is seen as a suitable solution for Big Data for several reasons:

– It is an open source project, as already mentioned, which allows it to be modified for customization purposes and makes it open to constant improvement thanks to its collaboration network. Because of this characteristic, several derivative or complementary projects have been (and continue to be) created;

– It delivers cost savings, since it does not require the payment of licenses and supports conventional hardware, allowing projects to be built with considerably cheaper machines;

– Hadoop comes, by default, with fault-tolerance features, such as data replication;

– Hadoop is scalable: if more processing power is needed to support a larger amount of data, computers can be added without complex system reconfigurations.

Naturally, Hadoop can be used in conjunction with NoSQL databases. The Apache Foundation itself maintains a solution of this type as a kind of Hadoop subproject: the already mentioned HBase database, which works on top of HDFS.

Hadoop, it is important to highlight, is the most prominent option, but not the only one. It is possible to find other solutions that are NoSQL-compatible or that are based on Massively Parallel Processing (MPP), for example.

Conclusion

Big Data solutions should not be regarded as a perfect computational arsenal: systems of this type are complex, still unknown to many managers and IT professionals, and their very definition is still open to debate.

The fact is that the idea of Big Data reflects a real scenario: there are, increasingly, gigantic volumes of data that therefore require an approach capable of leveraging them to the fullest. Just to give a sense of this challenge, IBM stated at the end of 2012 that, according to its estimates, 90% of the data then available in the world had been generated in the previous two years alone. By the end of 2015, this total volume was expected to have at least doubled. From this point of view, it is somewhat hasty to dismiss the expression “Big Data” as a mere fad.
