Big data from the data lake perspective

As the field of data science grows and becomes central to how we address new challenges and questions, big data requires more than the hierarchical effort put in place by the data warehousing methodology.

The data lake is the new kid on the block: it stores raw data in its native form. It keeps that data secure and gives organisations and researchers access to it, still in its native form, whenever the need arises.

So what are the implications of creating such a raw data store, and what promise does the data lake hold for companies and organisations trying to exploit big data for their benefit?

To begin with, we first present this caveat put forth by Gartner research recently: “Data lake or data lakes are purposefully built to solve two key problems i.e. it stores disparate data into a raw format and allows enterprise wide access to this data lake and secondly, it aims to solve the problem of information silos which are created at the price of lost time and sometimes miscellaneously marked storage as the I.T doesn’t understand its usage at the present moment”. Gartner research goes on to say that although it is great to have a raw data lake accessible to all the nodes within the organisation, the problems get complicated when it comes to enterprise-wide integration and usage.

Without descriptive metadata and metadata tags, a data lake is bound to see little or no usage, as users cannot tell which tools to apply to exploit it. This severely undermines data quality, and the end result is a failure to uphold the uniform data standards that big data science depends on.
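To make the point about metadata tags concrete, here is a minimal sketch of a catalog that refuses untagged data and lets users find raw objects by tag. Everything in it is an assumption made for illustration: the `CatalogEntry` and `Catalog` names, the field names and the file paths are hypothetical, and a real data lake would rely on a dedicated catalog service rather than an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Descriptive metadata for one raw object in the lake (hypothetical fields)."""
    path: str
    source: str
    data_format: str
    tags: set = field(default_factory=set)


class Catalog:
    """In-memory stand-in for a metadata catalog (illustration only)."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Refuse untagged data so nothing lands in the lake undescribed.
        if not entry.tags:
            raise ValueError(f"{entry.path} has no metadata tags")
        self._entries[entry.path] = entry

    def find(self, tag):
        # Return the paths of all entries carrying the tag, in sorted order.
        return sorted(p for p, e in self._entries.items() if tag in e.tags)


catalog = Catalog()
catalog.register(
    CatalogEntry("raw/clicks/2024-01.json", "web", "json", {"clickstream", "marketing"})
)
catalog.register(
    CatalogEntry("raw/sensors/plant-a.csv", "iot", "csv", {"telemetry"})
)

# A user looking for marketing data finds it by tag rather than by guessing paths.
marketing_paths = catalog.find("marketing")
```

The design choice worth noting is that the tag check happens at registration time, so the lake cannot quietly accumulate data that nobody can later identify.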

The other problem pointed out by Gartner research is the security risk that data lakes pose to the security infrastructure. Data lakes often start off as ungoverned data sources, and the onus of securing them and placing security lockdowns on them lies with the I.T department. It is therefore key to prevent unregulated access to such data lakes.

With these cautions in mind, we will now look at the advantages brought forth by the data lake. Firstly, there are certain regulatory considerations that organisations have to weigh when introducing a data lake. The key here is regulatory policies and tag management, enforced strictly by data scientists within the organisation. Simply put, there is an innate need to uphold some amount of data management in order to use a data lake correctly within your organisation.

This translates into better data management and warehousing practices, which require a certain degree of skill within the organisation.

A data lake is a method of consolidating large amounts of raw data, hence its storage should follow cloud or edge computing principles. This is because nodal access to the data lake should never be left unchecked, and yet it should be seamless from every corner of the globe where your organisation and its users are spread.

Think of this as elastic storage and access, which is what truly leverages the advantages brought in by the data lake.

This also allows users to access large amounts of unstructured data and delivers better agility in data access. The need here is to let users access the data lake when and where the need arises, instead of the old approach where structured data was accessed only once its purpose had been determined and established. This truly saves time, but it also presents us with a dilemma between mature data (data warehouses) and maturing data (data lakes).

Lastly, do not think of data lakes as data warehouses 2.0. The principal difference lies in how data is stored: a data warehouse takes a ‘schema on write’ approach, while a data lake proposes a ‘schema on read’ approach.
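The difference between the two approaches can be sketched in a few lines. This is only an illustration under stated assumptions: the sample events, the field names and the two helper functions are hypothetical, and plain Python lists stand in for real warehouse and lake storage. With schema on write, records are validated and shaped before they are stored, as in a data warehouse. With schema on read, the raw records are kept untouched and each query decides which fields it wants, as in a data lake.

```python
import json

# Hypothetical raw events, stored exactly as they arrived.
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.90", "currency": "EUR"}',
    '{"user": "raj", "amount": "5", "note": "refund pending"}',
    '{"user": "li"}',
]


def write_with_schema(raw_lines):
    """Schema on write: validate and shape each record before storing it."""
    table = []
    for line in raw_lines:
        record = json.loads(line)
        if "user" not in record or "amount" not in record:
            continue  # rejected at load time; never reaches the table
        table.append({"user": record["user"], "amount": float(record["amount"])})
    return table


def read_with_schema(raw_lines, fields):
    """Schema on read: keep raw lines as-is and apply a schema per query."""
    for line in raw_lines:
        record = json.loads(line)
        # Missing fields come back as None instead of causing a rejection.
        yield {name: record.get(name) for name in fields}


# Warehouse style: only records that fit the fixed schema survive.
warehouse_rows = write_with_schema(RAW_EVENTS)

# Lake style: a later question about notes can still be asked of the raw data.
lake_rows = list(read_with_schema(RAW_EVENTS, ["user", "note"]))
```

In this sketch the warehouse-style table has already discarded the `note` field and the incomplete third record, while the lake-style read can still reach both, at the cost of handling missing values at query time.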

In its true sense, the data lake is a concept that needs to be explored and understood further when it comes to the end user. Until then, big data science organisations and companies can take a long, hard look at where their position lies on the data lake map.
