How Data Integration Works

By: Jonathan Strickland  | 
Information moving through internet searches
Data integration
John Bloor, iStockphoto

For the average Joe, information technology (IT) can be a mysterious universe filled with indecipherable programming languages and expensive hardware. However, despite some IT jargon sounding like a foreign language, it can be critically important for decision-makers in businesses and organizations to understand the world of IT. One of the most important IT concepts is data integration.

On the surface, data integration may sound like a simple enough idea. Because many organizations store information on multiple databases, they need a way to retrieve data from different sources and assemble it in a unified way.

Advertisement

In reality, data integration solutions are complicated. There isn't a universal approach to data integration, and many of the techniques IT experts use are still evolving. One data integration tool might work better than another for an organization, depending upon that organization's needs.

So, what are the basics of data integration? Let's get into the details!

Advertisement

How Does Data Integration Work?

Results from a sales database
This simple table shows customer purchases.
HowStuffWorks.com

Data integration focuses mainly on databases. A database is an organized collection of data. It's similar to a file system, which is an organizational structure for files so they're easy to find, access and manipulate.

There are different ways to categorize databases. Some people prefer to classify them according to the kind of data the databases store. For example, you might classify a database as a media database if all the information stored there is contained in video or sound files.

Advertisement

Another classification method looks at how the databases organize data. A database's organizational arrangement is called a schema. A common organizational technique is to use tables to show the relationship between different data points. Tables are like spreadsheets. Columns define categories of data, while rows are records. A database using this approach is a relational database.

Object-oriented programming (OOP) databases take a different approach to organizing data. The OOP language is a departure from traditional approaches to programming, which follow the pattern of inserting data into a set of instructions and then producing output. The OOP language focuses instead on defining data as objects and then determining how different objects relate and interact with one another.

To create an OOP database, first you'd define all the objects you plan on storing in the database. Then, you'd define the way each object relates to every other object within the database. After you identify an object, you put it into a class — or set of objects. To define a class, you have to determine what data each object within that class must have and which logic sequences, called methods, will affect those objects. The objects within a system can communicate with you or other objects using interfaces called messages.

It's easier to understand with an example. Let's say you're building a database containing information about American sports. You decide to start by defining baseball teams. Once you've created the definition of a baseball team, you can generalize it as a class within the database. The Atlanta Braves would be a specific instance of that class, also known as an object. The class of baseball teams belongs to a superclass of American sports teams, which would also include other classes like football and soccer teams.

To access information within a database (no matter how it organizes data), you use a query. A query is just a request for information. People and applications can submit queries to databases. A database responds to queries by sending data that meets the original request's parameters. Queries rely on special computer languages such as Structured Query Language (SQL). If you've ever used an Internet search engine, you've submitted a query — your search terms.

The database responds to queries by creating a view of data. A view is a specific way of displaying data. In a data integration system, the returned view shows only the data directly related to the original query. In our table example, if you submitted a query asking for all the customers who bought more than $100 worth of products, you'd get this result:­

Results of which customers purchased more than $100 in products
This view shows only the data relevant to the query "customers who purchased more than $100 in products."
HowStuffWorks.com

This view shows only the data relevant to the query "customers who purchased more than $100 in products." Notice that it doesn't show what kind of products were purchased, nor does it display customers who purchased less than $100 of products.

What are the different approaches to data integration? We'll get into that next.

Advertisement

Data Integration Tools

Based on the previous section, you might think that databases are fairly complex. That's a fair assumption, and it helps explain why data integration is still a developing discipline even though it's been around for decades. The goal of data integration is to gather data from different sources, combine it and present it in such a way that it appears to be a unified whole. However, the success of this process depends heavily on the data quality, as poor data can lead to inaccurate conclusions or insights.

Let's say you're about to leave on a trip and you want to see what traffic is like before you decide which route to take out of town. Here's how the different approaches to data integration would handle your query.

Advertisement

The manual integration approach would leave all the work to you. First, you'd have to know where to look for your data. You would need to know the physical location for both the traffic report and the map for your town. You would need to retrieve the traffic report and the map data directly from their respective databases, then compare the two sets of data against each other to figure out what's the best route out of town.

If you used a common user interface approach, you'd have to do a little less work. You'd use an interface such as the internet to make a query. The query results would appear as a view on the interface. You'd still have to compare the traffic report to the map to determine the best route, but at least the interface would take care of locating and retrieving the data.

Some integration approaches rely on applications to do all the work for you. The applications, often referred to as data integration tools, are specialized programs designed to locate, retrieve, and integrate the information for you. Data scientists often develop these applications to ensure that data integration processes run smoothly and deliver accurate results.

During the integration process, the applications must manipulate the data so that the information from one source is compatible with the information from the other source. In our example, that would mean you'd submit a query to an application and it would present a view that combined a map of your town with data from traffic reports. The problem with this approach is that applications become complex and difficult to program as the number of data sources and formats increases.

Then there's the common data storage method, also known as data warehousing. Using this method, all the data from the various databases you intend to integrate are extracted, transformed and loaded. That means that the data warehouse first pulls all the data from the various data sources. Then, the data warehouse converts all the data into a common format so that one set of data is compatible with another. Then it loads this new data into its own database. When you submit your query, the data warehouse locates the data, retrieves it and presents it to you in an integrated view.

Using our example, the data warehouse would locate the latest information it has on traffic reports and maps of your town. Then it would integrate the two together and send the view back to you. There are several advantages and drawbacks to this system, which we'll look into in the next section.

Most data integration system designers assume that the end goal is to create as little work for the end user as possible, so they tend to focus on applications and data warehousing techniques.

What is it that data warehouses do, exactly? Find out next!

Advertisement

The Data Warehouse

As we saw earlier, a data warehouse is a database that stores information from other databases using a common format. That's about as specific as you can get when describing data warehouses. There's no unified definition that dictates what data warehouses are or how designers should build them. As a result, there are several different ways to create data warehouses, and one data warehouse might look and behave very differently from another.

In general, queries to a data warehouse take very little time to resolve. That's because the data warehouse has already done the major work of extracting, converting and combining data. The user's side of a data warehouse is called the front end, so from a front-end standpoint, data warehousing is an efficient way to get integrated data.

Advertisement

From the back-end perspective, it's a different story. Database managers must put a lot of thought into a data warehouse system to make it effective and efficient. Converting the data gathered from different sources into a common format can be particularly difficult. The system requires a consistent approach to describing and encoding the data.

The warehouse must have a database large enough to store data gathered from multiple sources. Some data warehouses include an additional step called a data mart. The data warehouse takes over the duties of aggregating data, while the data mart responds to user queries by retrieving and combining the appropriate data from the warehouse.

One problem with data warehouses is that the information in them isn't always current. That's because of the way data warehouses work — they pull information from other databases periodically. If the data in those databases changes between extractions, queries to the data warehouse won't result in the most current and accurate views. If the data in a system rarely changes, this isn't a big deal. For other applications, though, it's problematic.

Going back to our example from before with the traffic report and map, you can see how this would be a problem. While the town's map might not require frequent updates, traffic conditions can change dramatically in a relatively short amount of time. A data warehouse might not extract data very frequently, which means time-sensitive information may not be reliable. For those sorts of applications, it can be better to take a different data integration approach.

What's the alternative to data warehousing? Let's take a look!

Advertisement

Networked Databases

For data integration systems that rely on information that changes frequently, a data warehouse approach isn't ideal. In these cases, data virtualization may offer a more flexible approach by allowing data from different sources to be accessed without requiring physical integration. Other alternatives, such as streaming data integration or real-time data processing, also offer solutions for organizations that need to manage rapidly changing information.

One way that IT experts try to address the issue of frequently changing information is to design systems that pull data directly from individual data sources. Since there's no centralized database dedicated to analyzing, categorizing and integrating the data in preparation for user queries, those responsibilities fall to other parts of the system.

Advertisement

IT experts define data integration systems in terms of schemata. The unified view produced from a processed query is the global schema. The structure of the various data sources and the way they relate to one another is the source schema. The way the global and source schemata interrelate is called mapping. Think of the source schema as a blueprint for all the data within the system, while the global schema is a blueprint for the view presented in response to a query.

There are two main approaches to resolving queries in a data integration system: global-as-view and local-as-view. Each approach focuses on a particular part of the overall system and has its advantages and disadvantages.

In a global-as-view approach, the focus is on the global schema. As long as the data sources remain consistent, the global-as-view approach works well. It's easy to change the set-up of the global schema. That means it's not difficult to analyze the same overall set of data in different ways. However, adding or removing data sources to the system is problematic because it affects data across the system as a whole.

The local-as-view technique takes the opposite approach. It focuses on the data sources. As long as the global schema remains constant, it's easy to add or remove data sources to the system. The schema looks for the same kinds of data and relationships within the new data sources. In this approach, changing the parameters of the global schema is difficult. If you want to analyze the data sources in a new way, you'll have to redefine the entire system.

So, that's the story on data integration. The next time you look at a weather map or call up a filtered selection of data, you'll now be more aware of the complex series of processes going on in the background making it all possible.

We updated this article in conjunction with AI technology, then made sure it was fact-checked and edited by a HowStuffWorks editor.

Advertisement

Frequently Answered Questions

What is the process of data integration?
Data integration is the process of combining data from multiple sources to create a unified view. It typically involves extracting data from various sources, transforming the data into a common format, and loading it into a destination, such as a data warehouse or integrated system for reporting and analysis.
What is the purpose of data integration?
The purpose of data integration is to create a unified, consolidated view of data from multiple sources. This view can be used for reporting, business intelligence, and analytics, helping organizations make informed decisions and improve operational efficiency.

Advertisement

Loading...