• Data Warehouses, Lakes, and Marts

    Businesses with lots of data need to prioritize proper data architecture. On a system with complex data management, it's very important that data is being transferred and stored in the right way. This way, those that need the data will know where to find it and how to access it. 

    Data architecture has two main components - the data connectors that transmit information between systems, and the tools that businesses store their data in. Data connectors are important, but they get discussed to death in BI marketing. If you're confused about them, it's pretty easy to find out more (we even have a guide on them here and here)

    However, it's pretty difficult to find much valuable information on data storage solutions. We have an article that goes over some strategies, but if you're a real beginner, you'll want something even more basic. 

    What is a data warehouse? Why do I need a data warehouse? What are my other options? These are some common beginner questions. Since these terms so often go unexplained, those looking to implement their first BI solution don't always know what they're getting. To prevent that, we'll go over some of the most common terms. 

    Data warehouses

    Data warehouses are the most basic method that businesses use to store their data. They're systems that can store a large amount of structured data, and then easily recall that data in a structured way when users need it. 

    When BI experts talk about structured data, they're talking about data that has metadata attached to it that sorts the data into rows and columns. Think of data in a spreadsheet. A data point's location in its row and column define how useful that data point is. Without structure, this kind of numerical or text data is largely useless. 

    Structured data also has the advantage of being comparatively easy to process and query, relatively speaking. Most data storage solutions use a programming language called SQL to encode their data. With some knowledge of this language, users can easily query their database to find the data they want. 

    Data warehouses can handle a far larger scale of data than other, less powerful storage solutions. Unlike the onboard storage in a standalone business tool, a data warehouse is designed to handle the entire data load of an organization, not just a few departments.  

    Usually, businesses that use a standalone data warehouse use that data warehouse as a central repository for all of their structured data, from any source. If a user wants to access a certain data stream, they do so by querying their data warehouse, instead of going to the tool that collects that data. 

    This data architecture centralizes all of a business's data. In more decentralized systems, it's very easy for unconnected tools and databases to become siloed from other sections of the system. This means they can't be accessed easily, and data gets trapped within them. 

    This data gets trapped, not for any data governance reason, but just because it can't easily be accessed in a consistent way by other parts of the data network. Siloes are a major issue with decentralized data architecture, and data warehouses make them largely a thing of the past. 

    Businesses with large data requirements are usually best served by a data warehouse, but they do have some limitations that make them a bad fit for some. For one, they're often more expensive than using other solutions. Many businesses don't have the amount of data that makes them cost efficient. 

    BI tool as data warehouse

    For smaller businesses, it's rarely economical to invest in a standalone data warehouse. They don't have the scale of data that makes those sorts of tools 'worth it' to larger companies. They still need a uniform data management solution, though, so that they can access and store their data in a consistent way. 

    Businesses in this situation often end up using their business intelligence tool as a de-facto data warehouse. This makes a lot of sense, since their BI tool is already connected to all of their other business software. 

    BI tools can easily act as a data warehouse. Often, they're much easier to query than a standalone data warehousing tool, which makes them a good choice for those without much technical expertise. They're usually easier to navigate to find needed files. 

    However, they do come with some downsides. The biggest issue is the scale of data that they can handle. While many BI tools can handle massive amounts of data, it can quickly get very expensive to store that much data in the tool. BI vendors aren't in the business of storing data, they're in the BI business, and so they charge a premium for data storage. 

    For smaller businesses, it'll be more economical to pay the comparatively smaller fees for storing data in their BI tool, than it would be to pay for a standalone data warehouse. For larger companies, though, this math is different. It becomes a better choice to use something like Snowflake to store their data at scale.  

    Data lakes

    Some of the largest, most tech-focused businesses have started to move away from data warehouses, and towards what they're calling a 'data lake' model. Data lakes are a new kind of storage solution that attempts to solve some of the problems with data warehouses. 

    Data lakes act as a central repository for data, just like a data warehouse would. However, data lakes can store any kind of data, not just structured data. This makes them a better choice for businesses that want to store large amounts of unstructured data in a centralized way. 

    Unstructured data is any sort of data that doesn't have relational relationships attached to it through metadata. It's not arranged into rows and columns, so it often lacks context. It's rarely used for general numerical data, but it's a good fit for things like raw text, pictures and video, and Internet of Things data. 

    Most data lakes can store both structured and unstructured data. This way, businesses that have both kinds of data can store it all in one place. Businesses that have really complex data demands often choose data lakes over data warehouses. 

    Data lakes do have a few flaws, though. Since their data isn't structured, it's much harder to query a data lake to find needed information. Often, data can get lost or forgotten. Some experts also believe they encourage businesses to collect and store data that they don't need, since they don't have to do anything to the data to store it properly. 

    Regardless of their issues, data lakes are quickly becoming the data storage strategy of choice for large corporations and businesses that need to store a large amount of unstructured data. 

    Data marts

    Data marts are another level of data architecture between the data warehouse and end user. Many businesses, especially ones that have large amounts of centralized data, use the data mart model to streamline the data access process. 

    In the data mart model, data storage solutions called data marts act as a middleman between the end user and their data. Instead of end users querying their data warehouse directly to access their data, they query a data mart, which communicates with the data warehouse and should have all the relevant data. 

    Most businesses that use this sort of architecture arrange their data marts around topics or departments. For example, there might be a sales mart, a marketing mart, a development mart, and so on. Users only need to access the marts that relate to their work, and don't need to access others. 

    Data marts are read-only to the end user; they can only be updated with data from the data warehouse. They also rarely update their data on-demand; usually, they're updated on a set schedule that's less demanding on the data warehouse. 

    One key thing to remember is that data marts don't replace a business's data warehouse; they act as a second layer of data management between the data warehouse and the end user. 

    Using a data mart model can have some important benefits. First, it lowers the operating demands of the data warehouse. At a large business, a data warehouse may have dozens or hundreds of people accessing it at once. Depending on the software, this can slow things down or lead to glitches. Data marts can lower the demand on a data warehouse. 

    Second, data marts are important for data governance. In most schemas, the system administrator has control over what data should go into each data mart. Users can't access other data marts, and they can't access the data warehouse itself. They can only access their own data mart and the data contained inside it. 

    This structure is a powerful data governance tool. It heavily limits what sort of data users can and can't see, and makes sure that users only have access to data that's relevant to them. Not only does this keep users from seeing sensitive data, it also helps them to navigate to data that can help them do their jobs. 

    Conclusion

    Businesses have a few options for data storage. Data warehousing is one of the most common solutions. With a data warehouse, businesses can store large amounts of structured data, and query it on demand. This is preferable to other systems, which are harder to scale and harder to query. 

    Smaller businesses don't necessarily need to pay for a large-scale data warehousing solution, though. For businesses with lower data demands, a BI tool can work as a data warehouse. It connects and stores data just like a data warehouse would, but it has the additional advantage of being able to transform that data and perform analysis on it. 

    In recent years, there have been a few major innovations on the data warehousing model. While these innovations aren't necessary for every business, they do make life easier on large businesses with complex data demands. 

    Data lakes are a similar concept to data warehouses; they're a central repository of business data. However, data lakes can handle both structured and unstructured data, which makes them valuable for storing things like pictures or raw data. They can be harder to actually use than traditional data warehouses, though. 

    Data marts act as a second layer of communication between the end user and the data warehouse. They limit what sort of data users can access, so that they just need to access their relevant data mart instead of querying the data warehouse every time. 

    Data management is an important part of any business intelligence implementation. Businesses need to put down consistent procedures for storing and transmitting information, so that data is treated in a consistent way and everything is easy to find. 

Related News