For the average Joe, information technology (IT) is a mysterious universe filled with indecipherable programming languages and expensive hardware. Eavesdropping on IT technicians is almost like overhearing a conversation spoken in a foreign language. But despite this seemingly impenetrable language barrier, it can be critically important for decision-makers in businesses and organizations to understand the world of IT. One of the most important IT concepts is data integration.
On the surface, data integration sounds like a simple idea. Because many organizations store information on multiple databases, they need a way to retrieve data from different sources and assemble it in a unified way. For example, let's imagine that an electronics company is preparing to roll out a new mobile device. The marketing department might want to retrieve customer information from a sales department database and compare it to information from the product department to create a targeted sales list. A good data integration system would let the marketing department view information from both sources in a unified way, leaving out any information that didn't apply to the search.
In reality, data integration is a complicated discipline. There isn't a universal approach to data integration, and many of the techniques IT experts use are still evolving. Some data integration approaches might work better than others for an organization, depending upon that organization's needs. We'll take a close look at some of the general strategies IT experts use to integrate multiple sources of data and enter the world of database management.
What are the basics of data integration? Find out in the next section.
Data integration focuses mainly on databases. A database is an organized collection of data. It's similar to a file system, which is an organizational structure for files so they're easy to find, access and manipulate.
There are different ways to categorize databases. Some people prefer to classify them according to the kind of data the databases store. For example, you might classify a database as a media database if all the information stored there is contained in video or sound files.
Another classification method looks at how the databases organize data. A database's organizational arrangement is called a schema. A common organizational technique is to use tables to show the relationship between different data points. Tables are like spreadsheets. Columns define categories of data, while rows are records. A database using this approach is a relational database.
Object-oriented programming (OOP) databases take a different approach to organizing data. The OOP language is a departure from traditional approaches to programming, which follow the pattern of inserting data into a set of instructions and then producing output. The OOP language focuses instead on defining data as objects and then determining how different objects relate and interact with one another.
To create an OOP database, first you'd define all the objects you plan on storing in the database. Then, you'd define the way each object relates to every other object within the database. After you identify an object, you put it into a class, or set of objects. To define a class you have to determine what data each object within that class must have and which logic sequences, called methods, will affect those objects. The objects within a system can communicate with you or other objects using interfaces called messages.
It's easier to understand with an example. Let's say you're building a database containing information about American sports. You decide to start by defining baseball teams. Once you've created the definition of a baseball team, you can generalize it as a class within the database. The Atlanta Braves would be a specific instance of that class, also known as an object. The class of baseball teams belongs to a superclass of American sports teams, which would also include other classes like football and soccer teams.
To access information within a database (no matter how it organizes data), you use a query. A query is just a request for information. People and applications can submit queries to databases. A database responds to queries by sending data that meets the original request's parameters. Queries rely on special computer languages such as Structured Query Language (SQL). If you've ever used an Internet search engine, you've submitted a query -- your search terms.
The database responds to queries by creating a view of data. A view is a specific way of displaying data. In a data integration system, the returned view shows only the data directly related to the original query. In our table example, if you submitted a query asking for all the customers who bought more than $100 worth of products, you'd get this result:
This view shows only the data relevant to the query "customers who purchased more than $100 in products." Notice that it doesn't show what kind of products were purchased, nor does it display customers who purchased less than $100 of products.
What are the different approaches to data integration? Find out in the next section.
What is data?
Data can be any kind of information at all. It can be the contents of a cell on a spreadsheet. It can be a sound file or a video. It can be a string of words in a document. It can be the raw information created as output from a computer program. Or it can be the information used to describe a file. Data integration focuses on information, not files.
Data Integration Approaches
Based on the previous section, you might think that databases are fairly complex. That's a fair assumption, and it helps explain why data integration is still a developing discipline even though it's more than 30 years old. The goal of data integration is to gather data from different sources, combine it and present it in such a way that it appears to be a unified whole.
Let's say you're about to leave on a trip and you want to see what traffic is like before you decide which route to take out of town. Here's how the different approaches to data integration would handle your query.
The manual integration approach would leave all the work to you. First, you'd have to know where to look for your data. You would need to know the physical location for both the traffic report and the map for your town. You would need to retrieve the traffic report and the map data directly from their respective databases, then compare the two sets of data against each other to figure out what's the best route out of town.
If you used a common user interface approach, you'd have to do a little less work. You'd use an interface such as the World Wide Web to make a query. The query results would appear as a view on the interface. You'd still have to compare the traffic report to the map to determine the best route, but at least the interface would take care of locating and retrieving the data.
Some integration approaches rely on applications to do all the work for you. The applications, which are specialized computer programs, would locate, retrieve and integrate the information for you. During the integration process, the applications must manipulate the data so that the information from one source is compatible with the information from the other source. In our example, that would mean you'd submit a query to an application and it would present a view that combined a map of your town with data from traffic reports. The problem with this approach is that applications become complex and difficult to program as the number of data sources and formats increase.
Then there's the common data storage method, also known as data warehousing. Using this method, all the data from the various databases you intend to integrate are extracted, transformed and loaded. That means that the data warehouse first pulls all the data from the various data sources. Then, the data warehouse converts all the data into a common format so that one set of data is compatible with another. Then it loads this new data into its own database. When you submit your query, the data warehouse locates the data, retrieves it and presents it to you in an integrated view. Using our example, the data warehouse would locate the latest information it has on traffic reports and maps of your town. Then it would integrate the two together and send the view back to you. There are several advantages and drawbacks to this system, which we'll look into in the next section.
Most data integration system designers assume that the end goal is to create as little work for the end user as possible, so they tend to focus on applications and data warehousing techniques.
What is it that data warehouses do, exactly? Find out in the next section.
This Was a Triumph
Portal pages like iGoogle or My Yahoo are examples of common user interfaces. Portals pull information from multiple sources, but don't integrate the data into a unified view.
As we saw earlier, a data warehouse is a database that stores information from other databases using a common format. That's about as specific as you can get when describing data warehouses. There's no unified definition that dictates what data warehouses are or how designers should build them. As a result, there are several different ways to create data warehouses, and one data warehouse might look and behave very differently from another.
In general, queries to a data warehouse take very little time to resolve. That's because the data warehouse has already done the major work of extracting, converting and combining data. The user's side of a data warehouse is called the front end, so from a front-end standpoint, data warehousing is an efficient way to get integrated data.
From the back-end perspective, it's a different story. Database managers must put a lot of thought into a data warehouse system to make it effective and efficient. Converting the data gathered from different sources into a common format can be particularly difficult. The system requires a consistent approach to describing and encoding the data.
The warehouse must have a database large enough to store data gathered from multiple sources. Some data warehouses include an additional step called a data mart. The data warehouse takes over the duties of aggregating data, while the data mart responds to user queries by retrieving and combining the appropriate data from the warehouse.
One problem with data warehouses is that the information in them isn't always current. That's because of the way data warehouses work -- they pull information from other databases periodically. If the data in those databases changes between extractions, queries to the data warehouse won't result in the most current and accurate views. If the data in a system rarely changes, this isn't a big deal. For other applications, though, it's problematic.
Going back to our example from before with the traffic report and map, you can see how this would be a problem. While the town's map might not require frequent updates, traffic conditions can change dramatically in a relatively short amount of time. A data warehouse might not extract data very frequently, which means time-sensitive information may not be reliable. For those sort of applications, it's better to take a different data integration approach.
What's the alternative to data warehousing? Find out in the next section.
Meta? I hardly knew her!
Descriptions of data are called metadata. Metadata is useful for naming and defining data as well as describing the relationship of one set of data to other sets. Data integration systems use metadata to locate the information relevant to queries.
For data integration systems that rely on information that changes frequently, a data warehouse approach isn't ideal. One way that IT experts try to address this issue is to design systems that pull data directly from individual data sources. Since there's no centralized database dedicated to analyzing, categorizing and integrating the data in preparation for user queries, those responsibilities fall to other parts of the system.
IT experts define data integration systems in terms of schemata. The unified view produced from a processed query is the global schema. The structure of the various data sources and the way they relate to one another is the source schema. The way the global and source schemata interrelate is called mapping. Think of the source schema as a blueprint for all the data within the system, while the global schema is a blueprint for the view presented in response to a query.
There are two main approaches to resolving queries in a data integrated system: global-as-view and local-as-view. Each approach focuses on a particular part of the overall system and has its advantages and disadvantages.
In a global-as-view approach, the focus is on the global schema. As long as the data sources remain consistent, the global-as-view approach works well. It's easy to change the set-up of the global schema. That means it's not difficult to analyze the same overall set of data in different ways. However, adding or removing data sources to the system is problematic, because it affects data across the system as a whole.
The local-as-view technique takes the opposite approach. It focuses on the data sources. As long as the global schema remains constant, it's easy to add or remove data sources to the system. The schema looks for the same kinds of data and relationships within the new data sources. In this approach, changing the parameters of the global schema is difficult. If you want to analyze the data sources in a new way, you'll have to redefine the entire system.
So that's the story on data integration. The next time you look at a weather map or call up a filtered selection of data, you'll be aware of the complex series of processes going on in the background making it all possible.
To learn more about data integration, migrate on over to the links on the following page.
Federated database systems (FDBMS) are collections of networked, autonomous databases. These systems handle several daunting tasks:
Accepting user queries
Dividing them into several sub-queries
Using special tags called wrappers to define the sub-queries in a way that each respective database can understand
Sending those wrapped sub-queries to the appropriate databases
Accepting the data sent back from those databases
Integrating all the data into a unified view, and
Presenting that view to the user.
They're difficult to create and maintain, mainly due to the complex nature of databases.
Frequently Answered Questions
What is the process of data integration?
Data integration is the process of combining data from multiple sources into a single database or data warehouse. The process typically involves extracting data from multiple sources, transforming the data into a common format, and loading the data into a destination database.
What is the purpose of data integration?
Data integration is the process of combining data from multiple sources into a single repository. The purpose of data integration is to provide a single, consolidated view of data that can be used for reporting and analysis.
Baldwin, James R. "The Data Warehouse: An Overview." Spring 1997. http://varietysoftworks.com/jbaldwin/Education/CS615-DataWarehouse.html
Haas, Laura and Lin, Eileen. "IBM Federated Database Technology." IBM. March 1, 2002. http://www-128.ibm.com/developerworks/db2/library/techarticle/0203haas/0203haas.html
Halevy, Alon Y. et al. "Enterprise Information Integration: Successes, Challenges and Controversies." International Conference on Management of Data. 2005.
Koch, Christoph. "Data Integration against Multiple Evolving Autonomous Schemata." PhD Thesis. Technical University, Vienna. May 16, 2001. http://www.csd.uoc.gr/~hy562/Papers/thesis_final.pdf
Lenzerini, Maurizio. "Data Integration: A Theoretical Perspective." University of Rome. ACM PODS. 2002. http://delivery.acm.org/10.1145/550000/543644/p233-lenzerini.pdf?key1=543644&key2=8368337021&coll=&dl=&CFID=15151515&CFTOKEN=6184618
Poje, Richard J. "Treasury and IT integration in plain English." Sep. 1, 2003. http://www.allbusiness.com/management/channel-management/659425-1.html