How to Utilize Data Lakes Effectively to Gain Insights


While the big data industry has lacked a consistent and well-understood definition of the data lake since its entry into the hype cycle, clear use cases and best practices are now emerging.

Many companies are currently adopting data lakes for data discovery, data science, and big data projects. Meanwhile, data governance, security, and integration have all been identified as essential ingredients.

To educate enterprises on the importance of having a data lake and what to do with it, Reiner Kappenberger, global product manager for HPE Security - Data Security; Kevin Petrie, senior director and technology evangelist at Attunity; and Sumit Sarkar, product marketing manager at Progress DataDirect, recently participated in a roundtable discussion as part of a DBTA webinar.

According to Kappenberger, securing the data lake is the most important initial piece. “One of the important things we see is securing Hadoop itself is a very difficult proposition,” Kappenberger said. Hadoop is tricky to secure because the environment is constantly evolving, with new releases being adopted every few months, and because multiple data sources and data types are consolidated inside the data lake repository, he explained.

To protect that information, Kappenberger said, a data-centric security approach works best, using a variety of end-to-end encryption tools to secure the data itself rather than just the Hadoop environment around it.
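As a rough illustration of what a data-centric approach can look like in practice, the sketch below encrypts sensitive fields in each record before the record ever lands in the lake, so downstream Hadoop jobs see only ciphertext for those columns. It is a minimal example assuming the open source Python cryptography package, hypothetical field names, and an in-memory key; it is not meant to represent HPE's products, and in production the key would come from a key-management service.

```python
# Minimal sketch of data-centric protection: encrypt sensitive fields
# before records are written to the data lake, so downstream jobs only
# ever see ciphertext for those columns.
# Assumes the third-party "cryptography" package (pip install cryptography);
# the field names and key handling below are illustrative only.
import json

from cryptography.fernet import Fernet

SENSITIVE_FIELDS = {"ssn", "credit_card"}  # hypothetical column names

key = Fernet.generate_key()  # in production, fetch this from a key-management service
cipher = Fernet(key)


def protect(record: dict) -> dict:
    """Return a copy of the record with its sensitive fields encrypted."""
    out = dict(record)
    for field in SENSITIVE_FIELDS & record.keys():
        out[field] = cipher.encrypt(str(record[field]).encode()).decode()
    return out


record = {"customer_id": 42, "ssn": "123-45-6789", "region": "EMEA"}
print(json.dumps(protect(record)))  # the SSN is replaced by ciphertext
```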

Extracting data from Hadoop is the next important piece to invest in as part of a data lake strategy, Petrie said, because there is a surplus of data but a shortage of insights. Automation is key to getting the most from analytics within the data lake, he noted: “There’s a lot of manually intensive work.”

Leveraging software-as-a-service (SaaS) data sources is the final piece of the data lake strategy, Sarkar explained.

Apache Sqoop, a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores, can help move data between Hadoop and relational databases; and, by using a standard interface, data can also move successfully from one SaaS application to another, Sarkar said.

“Once you pull that into your lake, that’s where your other analytics begin,” Sarkar said.
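As a concrete sketch, a bulk Sqoop import of one relational table into HDFS might look like the following, wrapped here in a small Python script so it can be scheduled alongside other pipeline steps. The JDBC URL, credentials, table name, and target directory are all placeholders, and Sqoop is assumed to be installed and on the PATH of the machine running the script.

```python
# Illustrative bulk import of a relational table into HDFS with Apache Sqoop.
# Every connection detail below is a placeholder; Sqoop and the matching JDBC
# driver must be installed for the command to run.
import subprocess

sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://example-db:3306/sales",  # hypothetical source database
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",        # keeps the password off the command line
    "--table", "orders",                                 # source table to copy
    "--target-dir", "/data/lake/raw/orders",             # HDFS landing directory in the lake
    "--num-mappers", "4",                                 # parallel map tasks for the transfer
]

# Raise an exception if the import fails, so a scheduler can catch and retry it.
subprocess.run(sqoop_import, check=True)
```

Once the files land in the target directory, the analytics Sarkar describes can begin on top of them.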

To view a replay of this DBTA webinar, go here.



