In recent years, data lake and data management platform (DMP) projects have multiplied in companies, where they are sometimes used alongside each other.
But what is a data lake, what is it used for, and how is it different from a DMP?
Pierre Harand, Managing Director France, and Jean-François Wassong, Global Technology Director at fifty-five, provide their explanations and insights.
What is a data lake used for?
Pierre Harand: To put it simply, a data lake is a huge database into which a company’s various data streams are channelled. These flows stream into the data lake like so many rivers coming from the various departments of the company.
A data lake aims to give its users access to exhaustive data that they can extract automatically and in a personalised way.
Its primary function is analytical. In a way, it can be seen as an experimental playground for data scientists, who can play around with all sorts of data without always knowing beforehand what the data will reveal.
More agile than a data warehouse, it allows the company to extract value out of raw data without first having to standardise or map its own data.
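The idea of extracting value from raw data without standardising it first is often called "schema on read". The sketch below illustrates it under stated assumptions: the events, field names and the `read_with_schema` helper are hypothetical, and a plain Python list stands in for the data lake's storage.

```python
import json

# Hypothetical raw events, stored exactly as each source system emitted them.
RAW_EVENTS = [
    '{"source": "web", "visitor": "v1", "action": "add_to_cart", "sku": "A12"}',
    '{"source": "store", "customer_id": 42, "sku": "A12", "qty": 2}',
    '{"source": "web", "visitor": "v2", "action": "view", "sku": "B07"}',
]


def read_with_schema(raw_lines, fields):
    """Parse raw records and project them onto the requested fields.

    Fields missing from a record become None instead of raising an error,
    so heterogeneous sources can live side by side in the same store.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}


# Two analyses read the same raw data with two different schemas.
web_view = [
    r for r in read_with_schema(RAW_EVENTS, ["visitor", "action", "sku"])
    if r["action"] is not None
]
sales_view = [
    r for r in read_with_schema(RAW_EVENTS, ["customer_id", "sku", "qty"])
    if r["qty"] is not None
]
```

Each analysis decides which fields it needs at query time, whereas a data warehouse would typically require the mapping to be agreed before the data is loaded.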
For example, we set up a data lake for a major fashion retailer, which was then able to determine how best to order the product lists on its e-commerce website based on visitors’ behavioural data, the company’s product databases, inventory and margin. In the end, the analysis from the data lake, combined with the new product ordering, led to an 8% increase in the add-to-cart rate and a 4% increase in annual turnover (performance comparison based on A/B testing).
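To make the retailer example more concrete, here is a minimal sketch of how behavioural data, inventory and margin could be combined into a single ordering. The interview does not describe the actual method used, so the `Product` fields, the scoring formula and the sample figures are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Product:
    sku: str
    add_to_cart_rate: float  # from visitors' behavioural data, between 0 and 1
    stock: int               # units available in the inventory
    margin: float            # margin per unit, in euros


def score(p: Product) -> float:
    """Hypothetical score: expected margin per product view.

    Out-of-stock products score zero so they sink to the bottom of the list.
    """
    if p.stock <= 0:
        return 0.0
    return p.add_to_cart_rate * p.margin


def order_products(products):
    # Highest score first; sorted() is stable, so ties keep their input order.
    return sorted(products, key=score, reverse=True)


catalogue = [
    Product("dress", add_to_cart_rate=0.10, stock=25, margin=12.0),
    Product("scarf", add_to_cart_rate=0.20, stock=0, margin=8.0),
    Product("coat", add_to_cart_rate=0.04, stock=10, margin=40.0),
]
ranked = [p.sku for p in order_products(catalogue)]
```

With these made-up figures the coat comes first, because its high margin outweighs its lower add-to-cart rate, and the out-of-stock scarf comes last. In practice, any such ordering would be validated against the existing one through A/B testing, as in the example above.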
What is the difference between a data lake and a DMP?
Jean-François Wassong: A data lake is characterised by the longevity and comprehensiveness of its data. It also allows companies to collect personally identifiable information (PII), unlike DMPs. In this sense, we can say that a data lake has higher asset value than a DMP, which is centred on cookies and media audience activation.
A data lake is also more open to the various company departments – it is a free space.
Conversely, in a data management platform, everything is processed in anticipation of media activation, especially by combining first-party data with third-party data (i.e. additional data from external partners). Data lakes, which are reserved for internal use, do not include third-party data.
What are the advantages of both approaches?
Pierre Harand: In a company, the fields of application of a data lake are broader than those of a DMP, and they extend beyond advertising. They involve several company departments and functions, such as pricing, merchandising, logistics, production and inventory. A data lake is more of a tool for exploring and processing “cold” data.
A DMP is primarily an activation-oriented marketing tool, and it is mainly digital. The advantage of the DMP is that it is usually an off-the-shelf tool with connectors to the various market solutions. It also includes features that allow marketers to quickly create and launch campaigns targeting specific user segments. There is a “real-time” and “hot data” aspect to the DMP, which allows for the use of engagement and personalisation tools as well as fast processing of audience engagement signals.
Are data lakes and DMPs compatible?
Jean-François Wassong: Yes, the two approaches are compatible, and a data lake often constitutes a good preliminary step to a DMP. Indeed, it is often possible to use data from the data lake to expand the knowledge base of the DMP.
In short, a DMP establishes connections between several external data providers, and the data lake then supplements it with new internal data.
We generally advise companies that enjoy significant traffic on their digital assets to start off by setting up a data lake. Conversely, if the number of visitors is low, as is the case for companies selling fast-moving consumer goods (FMCG), we recommend starting with a DMP instead.
As far as data lakes are concerned, what are the known technologies?
Jean-François Wassong: A data lake is generally a collection of multiple components. The Hadoop ecosystem, which is currently the most widespread, is made up of at least three components:
- data storage, which is generally handled by HDFS
- distributed processing, for which there is a wide range of solutions (MapReduce, YARN, Spark, etc.)
- the query engine (Hive, Pig, Drill, etc.)
These make it possible to implement a lambda architecture, i.e. an architecture that combines batch and real-time processing to transform raw data into actionable data.
These data can then be used by the various departments through fast query tools (Elasticsearch, HBase, Impala, Cassandra…).
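As a rough illustration of the lambda architecture mentioned above, the sketch below separates a batch layer, a speed layer and a serving layer. It is a toy model under stated assumptions: the events and function names are hypothetical, and in-memory Python structures stand in for what HDFS, a distributed processing engine and a fast query store would do in a real deployment.

```python
from collections import Counter

# Hypothetical raw events: (product SKU, event type).
HISTORICAL_EVENTS = [
    ("A12", "add_to_cart"),
    ("B07", "view"),
    ("A12", "add_to_cart"),
]
RECENT_EVENTS = [
    ("B07", "add_to_cart"),
    ("A12", "add_to_cart"),
]


def count_add_to_cart(events):
    """Aggregate raw events into an actionable view: add-to-cart count per SKU."""
    return Counter(sku for sku, kind in events if kind == "add_to_cart")


# Batch layer: recomputed periodically over the full history.
batch_view = count_add_to_cart(HISTORICAL_EVENTS)

# Speed layer: covers only events that arrived since the last batch run.
speed_view = count_add_to_cart(RECENT_EVENTS)


def query(sku):
    """Serving layer: merge both views to answer with up-to-date figures."""
    return batch_view[sku] + speed_view[sku]
```

Here `query("A12")` combines two historical add-to-cart events with one recent event. The point of the split is that the batch view can be rebuilt from the raw data at any time, while the speed view keeps answers fresh between two batch runs.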
The main distributions of Hadoop (Hortonworks, Cloudera, MapR) have packaged the components of the Hadoop ecosystem, thus facilitating the implementation of lambda architectures.
Over the past few years, several major players such as Amazon, Microsoft and Google have marketed cloud offerings that simplify the implementation of the Hadoop stack even further. They also offer alternative solutions for some of the components, such as Amazon Redshift, Google BigQuery or Microsoft Azure DocumentDB.
How much does a data lake cost?
Pierre Harand: Data lakes are based on very low-cost storage and processing technologies, which means that the entry cost for a data lake, in the tens of thousands of euros, is lower than for a DMP, which can cost several hundred thousand euros, or even exceed one million euros in the case of large companies.
However, the cost of the governance design and exploration phases that precede the setup of a data lake should not be overlooked. These phases allow companies to set up data collection frameworks and ensure their reliability.
Generally speaking, we advise companies not to rush into using the tool without having first defined what they wish to extract from it.
The advantage of a data lake is that it allows companies to start small with low-stakes analysis and performance measurement projects. It often gives them the opportunity to mobilise several departments that do not usually work together and to get them thinking about the customer experience they wish to provide (for example, maximising store sales per region based on customer behaviour, inventory and logistics).
What fundamental questions should companies ask themselves before setting up a data lake?
Jean-François Wassong: The key issues have to do with governance, security and data reliability. We therefore recommend that companies involve both the legal and IT departments beforehand, and answer the following questions:
- What does each department need a data lake for, and what are the expected purpose and benefits of the data lake?
- How will we feed the data lake?
- What are the legal and technical requirements (for instance regarding privacy and personal data storage)?
- What prior consents do we need to establish?
- How can we transfer our data to the cloud?
- In what geographical locations will the data be stored and processed?
- What are the security standards to be observed?
Data lake projects also touch upon significant issues regarding the security of the information system, the relevance of switching over to a cloud service, personal data management and change management. In this regard, a substantial educational effort must be made in order to gain the support of the various departments, including the IT and legal departments. It is therefore essential to involve all of these players in the project as early as possible.
This interview was originally published on Viuz, and translated from the original French by Marion Beaujard.