As data has gotten bigger and broader over the past twenty years, a new engineering discipline has emerged to handle the unique challenges and opportunities that storing, transforming, and moving data present. Data engineers build and connect the technologies that deliver data when, where, and how it is needed in the organization. In this section, I will cut through the jargon and define the important concepts you need to know.
Warehouses, Lakes, and Lake Houses: Data warehouses, data lakes, and data lake houses are all storage technologies that fall under a class of database systems called online analytical processing (OLAP). That is, each of these is a way to store data so that it is optimized for aggregation and analysis. This differentiates warehouses and their kin from online transactional processing (OLTP), which is optimized for recording large volumes of individual transactions (e.g., calls to a call center).
Using our terminology from above, OLTP systems are typically your systems of record, and they feed the OLAP systems (warehouses, etc.) that your data engineers build. Analysts can then pull from the OLAP systems and save a great deal of time and effort in data wrangling. Whether your data engineer builds a data warehouse, lake, or lake house will depend largely on the type of data in your underlying systems of record.
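To make the OLTP/OLAP distinction concrete, here is a minimal sketch in Python using an in-memory SQLite database. The call-center table, its columns, and the sample rows are all invented for illustration; real systems of record and warehouses are, of course, far larger and run on dedicated platforms.

```python
import sqlite3

# Hypothetical system of record: a call-center log (OLTP-style table).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (call_id INTEGER, agent TEXT, duration_min REAL)"
)

# OLTP workload: many small writes, recording one transaction (call) at a time.
calls = [(1, "Ana", 4.5), (2, "Ben", 12.0), (3, "Ana", 7.5)]
conn.executemany("INSERT INTO calls VALUES (?, ?, ?)", calls)

# OLAP-style workload: a read-heavy aggregation across many rows,
# the kind of query an analyst would run against a warehouse.
rows = conn.execute(
    "SELECT agent, COUNT(*), AVG(duration_min) "
    "FROM calls GROUP BY agent ORDER BY agent"
).fetchall()
print(rows)  # [('Ana', 2, 6.0), ('Ben', 1, 12.0)]
conn.close()
```

The point is not the SQL itself but the shape of the workloads: the insert pattern is optimized for capturing events as they happen, while the grouped aggregation is optimized for answering analytical questions, and data engineers build the pipelines that move data from the first kind of system to the second.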
Data warehouses are the oldest of these architectures and are well suited for structured data (data that is typically housed in rows and columns). Given that most of the data produced by the legal industry is structured, data warehouses are still a highly effective way of storing data for analytics for law firms and departments. However, semi-structured and unstructured data (e.g., text, videos, images, documents) are quickly proliferating, even in legal. Data lakes were developed in response to this need for greater storage flexibility, as they can handle structured, semi-structured, and unstructured data. Finally, data lake houses were developed recently to combine the best features of both warehouses and lakes, while adding governance enforcement features.
Regardless of the architecture you choose, the goal of each is to provide analytical data to business intelligence, data science, and AI platforms.
Final Stop: Consumption
The data governance and engineering activities outlined above are all in service of the ultimate consumption of data by the business. While you may be familiar with longstanding or headline-making ways we utilize data (e.g., dashboards or generative AI, respectively), data management enables consumption in myriad other ways. To end this article, I will focus on just two terms you may have heard but still can't quite define.
Metadata: In its most reductive sense, metadata is data about data. In context, metadata are those all-important features of a piece of data that tell us how we can use it, where we can find it, and how much we can trust it. Metadata could be a definition, an effective date, or a system of record for a piece of data. When managed and shared with the organization, metadata becomes both a powerful governance tool and an enabler of enterprise analytics, data science, and AI.
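As a sketch of what a single managed metadata record might look like, here is a minimal Python example. The class, its field names, and the sample values are all hypothetical; they simply mirror the three examples in the text (a definition, an effective date, and a system of record).

```python
from dataclasses import dataclass, asdict
from datetime import date

# Hypothetical metadata record for one data element.
# Field names are invented to mirror the examples in the text.
@dataclass
class MetadataRecord:
    element: str           # the piece of data being described
    definition: str        # how we can use it
    system_of_record: str  # where we can find it
    effective_date: date   # how current, and thus how trustworthy, it is

record = MetadataRecord(
    element="matter_open_date",
    definition="Date the firm formally opened the matter",
    system_of_record="Practice management system",
    effective_date=date(2024, 1, 1),
)

# A metadata catalog is, at heart, a shared collection of records like this.
print(asdict(record)["system_of_record"])  # Practice management system
```

Real metadata management platforms add search, lineage, and access controls on top, but the core idea is this simple: a structured, shared description of each piece of data.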
Data Literacy: This term describes the ability to understand and communicate with data. As businesses and their employees generate and consume ever-increasing amounts of data, data literacy has become as important today as computer literacy was 20 years ago. Helpful data literacy topics include: making sound inferences from data, interpreting charts and graphs, and spotting data quality issues.
Data is one of the most important assets of this century, not just to organizations, but to individuals as well. We all deserve to be a part of the conversation around data. I hope this article helped you see beyond the buzzwords to the holistic, and exciting, discipline of data management, and prepared you to take part in what's next.