Data transformation is an essential part of the BI process. Businesses need to turn their raw, unedited business data into actionable insights. They can't do that without transforming their data in new ways.
However, many businesses don't want to invest in data transformation. They think it's overly complicated, and not worth the time. They don't want their employees working on data transformations when they could be doing real data analysis.
Data transformation is worth the time, but it is true that it can be overly complicated. A single BI tool might offer two dozen different transformation techniques, without providing guidance on how to use any of them.
Just looking at a data transformation feature can be daunting. While a lot of work has been done to make data transformation easier to perform, employees can easily get overwhelmed by all of the choices that they have.
The core problem here is that BI users don't know what different data transformation functions do, and how they can be used to make raw data more actionable. This glossary is here to help with that.
We've listed some of the most common data transformations, and explained what they're for. This way, you can transform your data in a more effective way.
Add Constants - This transformation can add a new column to your data set that contains a constant value. That constant value can be a number, or it can be some kind of text.
This transformation is useful for keeping data from multiple sources straight. For example, if you're joining data from different branches, you can add the name of the branch as a constant to each individual data set, so you know where each data point came from.
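As a rough illustration of the branch example, here is a minimal Python sketch. It assumes each data set is a list of dictionaries; the `add_constant` helper, the column names, and the sample rows are all hypothetical.

```python
# Hypothetical per-branch data sets; column names are illustrative only.
north = [{"order_id": 1, "total": 120.0}, {"order_id": 2, "total": 80.0}]
south = [{"order_id": 3, "total": 45.5}]

def add_constant(rows, column, value):
    """Return copies of the rows with the same constant value in a new column."""
    return [{**row, column: value} for row in rows]

# Tag each data set with its branch before combining them.
north_tagged = add_constant(north, "branch", "North")
south_tagged = add_constant(south, "branch", "South")
combined = north_tagged + south_tagged
```

After combining, every row still carries the name of the branch it came from.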
Alter Columns - This transformation is used to change the data type of a column. Sometimes, a data integration will incorrectly transfer over numbers as text columns, or not effectively recognize dates or monetary information.
Using this transformation, you can tell a BI tool what kind of data is in a given column. This way, it can run content-specific analytics on that column, like ordering dates from most to least recent.
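To make that concrete, here is a small sketch in plain Python, assuming dates arrived as MM/DD/YYYY text and amounts arrived as text. The `alter_columns` helper and the column names are made up for the example.

```python
from datetime import datetime

# Hypothetical rows where every value was imported as text.
raw_rows = [
    {"order_date": "03/15/2024", "amount": "19.99"},
    {"order_date": "11/02/2023", "amount": "5.00"},
]

def alter_columns(rows, converters):
    """Apply a type converter to each listed column; other columns pass through."""
    converted = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            convert = converters.get(column)
            new_row[column] = convert(value) if convert else value
        converted.append(new_row)
    return converted

typed_rows = alter_columns(raw_rows, {
    "order_date": lambda text: datetime.strptime(text, "%m/%d/%Y").date(),
    "amount": float,
})

# With real date values, sorting by recency works as expected.
newest_first = sorted(typed_rows, key=lambda r: r["order_date"], reverse=True)
```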
Append Rows - An append is one of the major techniques for combining data sets. In an append, the data to be appended is added to the bottom of the original data set. For best results, the columns of both data sets need to match.
It can also be used in cases where columns don't match exactly, but that'll leave the data set with a lot of null values. It's best for updating old data sets with newer data from the same source.
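The sketch below shows one way an append could behave when columns don't match exactly. It is an assumption-laden illustration, not how any particular BI tool works: the `append_rows` helper and the sample data are hypothetical, and missing values are filled with `None`.

```python
def append_rows(original, new_rows):
    """Stack new_rows under original, filling columns a row lacks with None."""
    all_rows = original + new_rows
    columns = []
    for row in all_rows:
        for column in row:
            if column not in columns:
                columns.append(column)
    return [{column: row.get(column) for column in columns} for row in all_rows]

# Hypothetical monthly extracts; February has an extra "region" column.
january = [{"date": "2024-01-31", "sales": 100}]
february = [{"date": "2024-02-29", "sales": 130, "region": "West"}]

appended = append_rows(january, february)
```

The January row ends up with a null `region`, which is the kind of gap the paragraph above warns about.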
Calculated Field - This function can add a calculated result to the data set as a column. It uses a formula to power this result. For instance, you could add a column that's a multiplication of the data from another column, or divide one column by another to get a new value.
This is useful for situations where end users will need a derived value that's not in the original data set, but can be added easily with a basic formula. Using SQL queries, these formulas can get more complicated.
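Here is a minimal sketch of a calculated field in Python, assuming rows are dictionaries. The `add_calculated_field` helper, the column names, and the formulas are illustrative assumptions.

```python
# Hypothetical order rows.
orders = [
    {"units": 3, "unit_price": 4.0, "cost": 9.0},
    {"units": 10, "unit_price": 2.5, "cost": 20.0},
]

def add_calculated_field(rows, name, formula):
    """Add a column whose value is computed from each row by the formula."""
    return [{**row, name: formula(row)} for row in rows]

# Multiply two columns to get revenue.
with_revenue = add_calculated_field(
    orders, "revenue", lambda r: r["units"] * r["unit_price"]
)
# Derive a second value from the first by subtracting and dividing columns.
with_margin = add_calculated_field(
    with_revenue, "margin", lambda r: (r["revenue"] - r["cost"]) / r["revenue"]
)
```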
Combine Columns - This simple transformation merges two or more columns into one unified column. It takes all of the content from the original columns and adds it to a new, combined column.
A user might want to combine two columns to group content in a more effective way. For example, they might combine 'First Name' and 'Last Name' columns into a more useful 'Name' column.
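The name example could look something like this in Python. The `combine_columns` helper is hypothetical, and this sketch assumes the source columns are dropped once they have been merged.

```python
def combine_columns(rows, sources, target, separator=" "):
    """Merge the source columns into one target column and drop the sources."""
    combined = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in sources}
        new_row[target] = separator.join(str(row[column]) for column in sources)
        combined.append(new_row)
    return combined

# Hypothetical contact rows.
people = [{"id": 1, "First Name": "Ada", "Last Name": "Lovelace"}]
named = combine_columns(people, ["First Name", "Last Name"], "Name")
```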
Date Operations - This transformation is similar to a calculated field, except it works on date data instead of numbers. With this function, users can derive new date values from their existing date fields.
This is useful if important date data was left out of the original data set. For instance, you could add a month column or a year column to a data set that just had raw MM/DD/YYYY dates.
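As a sketch of that idea, the Python below derives month and year columns from MM/DD/YYYY text using the standard `datetime` module. The `add_date_parts` helper and the column names are assumptions made for the example.

```python
from datetime import datetime

def add_date_parts(rows, column, date_format="%m/%d/%Y"):
    """Add 'month' and 'year' columns derived from a text date column."""
    result = []
    for row in rows:
        parsed = datetime.strptime(row[column], date_format)
        result.append({**row, "month": parsed.month, "year": parsed.year})
    return result

# Hypothetical rows with raw MM/DD/YYYY dates.
orders = [{"order_date": "03/15/2024"}, {"order_date": "11/02/2023"}]
with_parts = add_date_parts(orders, "order_date")
```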
Deduplicate - Deduplication is a process that removes duplicate rows from a data set. This function allows users to run that process on their data set automatically.
While deduplication can be useful for removing errors from a raw data set, it's really useful for removing duplicates from combined data. A join or append can easily result in duplicate rows, but this function can automatically clean that up.
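A simple way to picture deduplication is the sketch below, which keeps the first copy of each fully identical row. The `deduplicate` helper and the data are hypothetical, and the approach assumes every value in a row is hashable.

```python
def deduplicate(rows):
    """Keep the first occurrence of each fully identical row."""
    seen = set()
    unique = []
    for row in rows:
        # Sort by column name so column order doesn't affect the comparison.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical result of an append that pulled in the same sale twice.
sales = [
    {"sale_id": 1, "amount": 50},
    {"sale_id": 2, "amount": 75},
    {"amount": 50, "sale_id": 1},
]
clean_sales = deduplicate(sales)
```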
Filter Rows - This operation allows users to define filter rules for their data set, and then filter out any row that has data that does or doesn't meet those rules. Users can filter out specific dates, text strings, or numbers.
It's particularly helpful for filtering out incomplete data. Often, data sets will include rows that don't have all the data that they need to be effective. These null data points can throw off analytics if they're not filtered out.
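The sketch below shows both uses in Python: dropping incomplete rows, then applying a numeric rule. The `filter_rows` helper, the column names, and the threshold of 100 are illustrative assumptions.

```python
def filter_rows(rows, rule):
    """Keep only the rows for which the rule returns True."""
    return [row for row in rows if rule(row)]

# Hypothetical rows, two of which are missing data.
invoices = [
    {"customer": "Acme", "amount": 250},
    {"customer": None, "amount": 90},
    {"customer": "Globex", "amount": None},
    {"customer": "Initech", "amount": 40},
]

# Filter out any row that contains a null value.
complete = filter_rows(invoices, lambda r: all(v is not None for v in r.values()))
# Then keep only rows that meet a numeric rule.
large = filter_rows(complete, lambda r: r["amount"] >= 100)
```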
Group By - This function allows for aggregation of data based on a single shared characteristic. For example, if you have a data set with multiple entries for the same dates, you can group by date to aggregate all that data under a single entry.
Businesses can use this to aggregate large data sets into smaller, less detailed data sets that provide broader views. It can also help to aggregate data sets so that they can be joined with other data sets more effectively.
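Here is a minimal sketch of the group-by-date example, assuming the aggregation you want is a sum. The `group_by_sum` helper and the sample rows are hypothetical; a real BI tool would usually offer other aggregations too, such as counts or averages.

```python
from collections import defaultdict

def group_by_sum(rows, key, value):
    """Collapse rows that share a key into one row with the summed value."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return [{key: group, value: total} for group, total in totals.items()]

# Hypothetical sales rows with multiple entries for the same date.
sales = [
    {"date": "2024-03-01", "sales": 100},
    {"date": "2024-03-01", "sales": 50},
    {"date": "2024-03-02", "sales": 70},
]
daily_totals = group_by_sum(sales, "date", "sales")
```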
Join - A join is one of the most common techniques for combining data sets. In a join, the columns from one data set are added to another data set to create one large dataset. To do this, the data points in at least one column of each data set need to match, so that it's clear where each data entry should go.
There are a few types of join, depending on which rows need to be kept and which should be discarded. For example, a full outer join includes all the rows from all the data sets, while an inner join only keeps the rows that have a match in both data sets.
Joins are extremely effective for combining data from different data sources. Users can look for shared data points across their data sources, like dates and sales IDs, and then join related data points to get a wider view of their operations.
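To show the difference between an inner join and a full outer join, here is a small Python sketch. The `join` helper, the `how` argument, and the sample data are assumptions for illustration; it covers only these two join types and fills unmatched columns with `None`.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dictionaries on a shared key column."""
    left_columns = {c for row in left for c in row}
    right_columns = {c for row in right for c in row}
    right_index = {}
    for row in right:
        right_index.setdefault(row[key], []).append(row)

    result, matched = [], set()
    for left_row in left:
        matches = right_index.get(left_row[key], [])
        if matches:
            matched.add(left_row[key])
            result.extend({**left_row, **right_row} for right_row in matches)
        elif how == "outer":
            # Keep the unmatched left row, with nulls for the right columns.
            result.append({**dict.fromkeys(right_columns), **left_row})
    if how == "outer":
        for value, right_rows in right_index.items():
            if value not in matched:
                # Keep unmatched right rows, with nulls for the left columns.
                result.extend(
                    {**dict.fromkeys(left_columns), **r} for r in right_rows
                )
    return result

# Hypothetical data sets that share a sales ID.
sales = [{"sale_id": 1, "amount": 50}, {"sale_id": 2, "amount": 75}]
returns = [{"sale_id": 2, "reason": "damaged"}, {"sale_id": 3, "reason": "late"}]

inner = join(sales, returns, "sale_id")
outer = join(sales, returns, "sale_id", how="outer")
```

The inner join keeps only sale 2, which appears in both data sets, while the full outer join keeps all three sales and leaves nulls where one side had no data.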
Pivot - A pivot transformation turns rows in a data set into columns. Values that were stacked vertically across many rows are 'pivoted' into a horizontal set of columns. This allows users to widen their data sets or fix data sets that were badly integrated.
Pivots can help users to arrange their data in more useful ways. With pivots, users can highlight the metrics that are really important without getting a new data set or changing their integration. Sometimes, pivots can even help in aggregating data.
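One way to picture a pivot is the sketch below, which widens a 'long' data set so that each metric gets its own column. The `pivot` helper and its argument names are hypothetical, and it assumes each index and column pair appears only once.

```python
def pivot(rows, index, columns, values):
    """Turn the distinct entries of one column into new columns."""
    wide = {}
    for row in rows:
        entry = wide.setdefault(row[index], {index: row[index]})
        entry[row[columns]] = row[values]
    return list(wide.values())

# Hypothetical long-format data: one row per month and metric.
long_rows = [
    {"month": "Jan", "metric": "revenue", "value": 100},
    {"month": "Jan", "metric": "cost", "value": 60},
    {"month": "Feb", "metric": "revenue", "value": 120},
    {"month": "Feb", "metric": "cost", "value": 70},
]
wide_rows = pivot(long_rows, "month", "metric", "value")
```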
Rank - A rank transformation evaluates the data in a given data column, organizes it in some consistent way, and then filters out all rows that fall above or below a certain threshold.
For example, a sales manager could rank all of their salespeople by their total revenue, and then filter out all but their top 50 best-performing. This way, they can run data analytics to figure out what makes a salesperson successful.
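The salesperson example might be sketched like this, using a top 2 instead of a top 50 to keep the data small. The `top_n` helper, the `rank` column, and the sample names are hypothetical.

```python
def top_n(rows, column, n):
    """Rank rows by a column, highest first, and keep only the top n."""
    ranked = sorted(rows, key=lambda r: r[column], reverse=True)
    return [{**row, "rank": position} for position, row in enumerate(ranked[:n], start=1)]

# Hypothetical revenue totals per salesperson.
salespeople = [
    {"name": "Ana", "revenue": 300},
    {"name": "Ben", "revenue": 500},
    {"name": "Cy", "revenue": 400},
]
best = top_n(salespeople, "revenue", 2)
```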
Replace Data - This transformation allows users to replace all occurrences of a specific text string with another text string. It's like the find-and-replace tool that you might find in a word processing program.
Sometimes, a data source may express a specific data point in an ineffective way. For example, it might abbreviate the names of the months, which throws off visualizations that look for full month names.
With a replace, you can replace abbreviations with full text, or replace full text with abbreviations. You can edit any text string and replace it with more effective data.
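Here is a minimal sketch of the month example using Python's built-in `str.replace`. The `replace_text` helper and the `period` column are assumptions. Note that plain substring replacement would also change 'Jan' inside 'January', so this sketch assumes the column only holds abbreviations.

```python
def replace_text(rows, column, old, new):
    """Replace every occurrence of old with new in one text column."""
    return [{**row, column: row[column].replace(old, new)} for row in rows]

# Hypothetical rows with abbreviated month names.
periods = [{"period": "Jan 2024"}, {"period": "Feb 2024"}]

fixed = replace_text(periods, "period", "Jan", "January")
fixed = replace_text(fixed, "period", "Feb", "February")
```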
Select Columns - Using this transformation, a BI user can delete, reorder, and rename the columns in their data set. This way, they can configure their data set to be just the way they want it.
This function is also essential for properly joining data. Using this function, you can make sure your matching columns have the same name, so that it's easier to join them later on. You can also remove columns that you know won't be useful.
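As a rough sketch, selecting columns can be expressed as a mapping from the original names you want to keep to their new names. The `select_columns` helper and the column names here are hypothetical.

```python
def select_columns(rows, columns):
    """Keep, reorder, and rename columns.

    columns maps each original name to its new name, in output order.
    Any column not listed is dropped.
    """
    return [{new: row[old] for old, new in columns.items()} for row in rows]

# Hypothetical rows with terse names and a column nobody needs.
raw = [{"cust_id": 7, "amt": 19.5, "internal_note": "x"}]
selected = select_columns(raw, {"amt": "amount", "cust_id": "customer_id"})
```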
Split Column - This transformation can split one column into multiple columns, distributing that column's content across the new ones. To do this, it looks for a delimiter, like a comma or a period, and sorts everything between the delimiters into new columns.
For example, you might have an 'Address' column that you want to make more granular. Using the split column transformation, you could split it up into 'Street Address', 'City', 'State', and 'Zip Code' columns, as long as commas or other delimiters separate those data points from each other.
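The address example could be sketched as follows, assuming commas separate the parts. The `split_column` helper and the sample address are made up, and if a value has more or fewer parts than there are new columns, this sketch simply ignores the extras.

```python
def split_column(rows, column, new_columns, delimiter=","):
    """Split one text column on a delimiter and spread the parts over new columns."""
    result = []
    for row in rows:
        parts = [part.strip() for part in row[column].split(delimiter)]
        new_row = {k: v for k, v in row.items() if k != column}
        new_row.update(zip(new_columns, parts))
        result.append(new_row)
    return result

# Hypothetical row with a single combined address field.
customers = [{"id": 1, "Address": "123 Main St, Springfield, IL, 62701"}]
split = split_column(
    customers, "Address", ["Street Address", "City", "State", "Zip Code"]
)
```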
String Operation - This somewhat technical operation allows users to pull specific text strings out of their text-based data entries. By defining some simple rules, you can cut out irrelevant content from your text entries or add and remove spaces.
For example, a business may use a 15-digit SKU to label its products, but only the last 5 digits actually mean anything; the rest are zeroes. With a string operation, they can cut out all of the useless zeroes to use their SKUs in a more effective way.
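For the SKU example, a sketch in Python might keep only the last 5 characters of each value. The `keep_last` helper and the sample SKUs are hypothetical. Slicing is used here rather than stripping leading zeroes, so a meaningful code that itself starts with a zero stays intact.

```python
def keep_last(rows, column, length):
    """Keep only the last `length` characters of a text column."""
    return [{**row, column: row[column][-length:]} for row in rows]

# Hypothetical SKUs padded with leading zeroes.
products = [{"sku": "000000000012345"}, {"sku": "000000000000042"}]
short_skus = keep_last(products, "sku", 5)
```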
Text Formatting - With the text formatting transformation, businesses can fix minor formatting problems with their text entries. This transformation handles the smallest, mostly cosmetic errors; larger errors need more powerful transformations.
Using this transformation, a business can change the case on their text entries, switching them into all uppercase, all lowercase, or using title case. They can also remove numbers from their text data, or only show numbers and remove the text.
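These formatting fixes map closely onto Python's built-in string methods, as the sketch below suggests. The `format_text` helper, its arguments, and the sample data are assumptions for illustration.

```python
def format_text(rows, column, case="title", strip_digits=False):
    """Change the case of a text column, optionally removing digits first."""
    formatters = {"upper": str.upper, "lower": str.lower, "title": str.title}
    result = []
    for row in rows:
        text = row[column]
        if strip_digits:
            text = "".join(ch for ch in text if not ch.isdigit())
        result.append({**row, column: formatters[case](text.strip())})
    return result

# Hypothetical rows with inconsistent case and a stray number.
cities = [{"city": "nEW yORK"}, {"city": "boston 2"}]

title_case = format_text(cities, "city", case="title", strip_digits=True)
upper_case = format_text(cities, "city", case="upper")
```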
Unpivot - The unpivot function is the opposite of the pivot function. Instead of turning rows into columns, it turns columns into rows. This is useful for narrowing data sets or for formatting them in a more effective way.
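As a sketch, an unpivot takes a wide data set and produces one row per original column value. The `unpivot` helper, its argument names, and the `metric` and `value` column names are hypothetical.

```python
def unpivot(rows, id_column, value_columns, name_column="metric", value_column="value"):
    """Turn the listed columns into rows: one output row per column per input row."""
    return [
        {id_column: row[id_column], name_column: column, value_column: row[column]}
        for row in rows
        for column in value_columns
    ]

# Hypothetical wide-format data: one column per metric.
wide_rows = [
    {"month": "Jan", "revenue": 100, "cost": 60},
    {"month": "Feb", "revenue": 120, "cost": 70},
]
long_rows = unpivot(wide_rows, "month", ["revenue", "cost"])
```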
Value Mapping - This transformation is similar to the replace function, but it can act on non-text values like numbers and dates. With this function, users can replace values in a column with other values in a programmatic way.
This function can be helpful in dealing with null values or empty entries. With a value mapper, you can fill null values with a zero, so that the null values don't throw off your data analytics.
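Filling nulls with zero could look like the sketch below, which swaps values according to a lookup table. The `map_values` helper and the sample rows are hypothetical, and the approach assumes the values being mapped are hashable.

```python
def map_values(rows, column, mapping):
    """Replace values in a column using a lookup; unmapped values are kept."""
    return [
        {**row, column: mapping.get(row[column], row[column])}
        for row in rows
    ]

# Hypothetical rows where a missing count arrived as a null.
regions = [
    {"region": "West", "returns": None},
    {"region": "East", "returns": 4},
]
filled = map_values(regions, "returns", {None: 0})
```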
Windowing - The window function allows users to execute simple formulas on a specific window of their data, instead of the data as a whole. In conjunction with the ranking function, this can allow businesses to filter their data to a high degree.
For example, a sales manager may rank all of their salespeople by revenue, and then window their data to focus on the middle of the pack: the salespeople who fall between the top 25% and the top 75%. This way, they can get a more accurate view of how their 'average' salespeople are doing.
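One rough way to sketch this in Python is to rank the rows, slice out the middle of the ranking, and run a simple formula on that slice only. The `middle_window` helper, the cutoffs, and the sample data are assumptions; real BI tools may define windows differently.

```python
def middle_window(rows, column, lower=0.25, upper=0.75):
    """Rank rows highest first, then keep those between the two cutoffs."""
    ranked = sorted(rows, key=lambda r: r[column], reverse=True)
    start = int(len(ranked) * lower)
    end = int(len(ranked) * upper)
    return ranked[start:end]

# Hypothetical revenue totals for eight salespeople.
salespeople = [
    {"name": name, "revenue": revenue}
    for name, revenue in [
        ("A", 900), ("B", 700), ("C", 650), ("D", 600),
        ("E", 550), ("F", 500), ("G", 300), ("H", 100),
    ]
]

middle = middle_window(salespeople, "revenue")
# Run a simple formula on the window only, not on the whole data set.
average_revenue = sum(r["revenue"] for r in middle) / len(middle)
```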