Monday, April 13, 2009

Dimensional Data Modeling


Business Intelligence

Business intelligence is a term commonly associated with data warehousing. In fact, many of the tool vendors position their products as business intelligence software rather than data warehousing software. There are other occasions where the two terms are used interchangeably. So, exactly what is business intelligence?

Business intelligence usually refers to the information that is available for the enterprise to make decisions on. A data warehousing (or data mart) system is the backend, or infrastructural, component for achieving business intelligence. Business intelligence also includes the insight gained from doing data mining analysis, as well as from unstructured data (thus the need for content management systems). For our purposes here, we will discuss business intelligence in the context of using a data warehouse infrastructure.

The most common tools used for business intelligence are described below. They are listed in order of increasing cost, increasing functionality, increasing business intelligence complexity, and decreasing number of total users.

Excel

Take a guess: what is the most common business intelligence tool? You might be surprised to find out that it is Microsoft Excel. There are several reasons for this:

1. It's relatively cheap.

2. It's commonly used. You can easily send an Excel sheet to another person without worrying whether the recipient knows how to read the numbers.

3. It has most of the functionalities users need to display data.

In fact, it is still so popular that all third-party reporting / OLAP tools have an "export to Excel" functionality. Even for home-built solutions, the ability to export numbers to Excel usually needs to be built.

Excel is best used for business operations reporting and goals tracking.

Reporting tool

In this discussion, I am including both custom-built reporting tools and commercial reporting tools. They provide some flexibility by allowing each user to create, schedule, and run their own reports. The Reporting Tool Selection section discusses how one should select a reporting tool.

Business operations reporting and dashboards are the most common applications for a reporting tool.

OLAP tool

OLAP tools are usually used by advanced users. They make it easy for users to look at the data from multiple dimensions. The OLAP Tool Selection section discusses how one should select an OLAP tool.

OLAP tools are used for multidimensional analysis.

Data mining tool

Data mining tools are usually used only by very specialized users; even in large organizations, there are typically only a handful of people using data mining tools.

Data mining tools are used for finding correlation among different factors.

Reporting Tool Selection

There is a wide variety of reporting requirements, and whether to buy or build a reporting tool for your business intelligence needs is also heavily dependent on the type of requirements. Typically, the determination is based on the following:

  • Number of reports: The higher the number of reports, the more likely that buying a reporting tool is a good idea. This is not only because reporting tools typically make creating new reports easier (by offering re-usable components), but they also already have report management systems to make maintenance and support functions easier.
  • Desired Report Distribution Mode: If the reports will only be distributed in a single mode (for example, email only, or over the browser only), we should then strongly consider the possibility of building the reporting tool from scratch. However, if users will access the reports through a variety of different channels, it would make sense to invest in a third-party reporting tool that already comes packaged with these distribution modes.
  • Ad Hoc Report Creation: Will the users be able to create their own ad hoc reports? If so, it is a good idea to purchase a reporting tool. These tool vendors have accumulated extensive experience and know the features that are important to users who are creating ad hoc reports. A second reason is that the ability to allow for ad hoc report creation necessarily relies on a strong metadata layer, and it is simply difficult to come up with a metadata model when building a reporting tool from scratch.

Reporting Tool Functionalities

Data is useless if all it does is sit in the data warehouse. As a result, the presentation layer is of very high importance.

Most of the OLAP vendors already have a front-end presentation layer that allows users to call up pre-defined reports or create ad hoc reports. There are also several report tool vendors. Either way, pay attention to the following points when evaluating reporting tools:

· Data source connection capabilities

In general there are two types of data sources: one is the relational database, the other is the OLAP multidimensional data source. Nowadays, chances are good that you will want both. Many tool vendors will tell you that they offer both options, but upon closer inspection, it is possible that the tool is especially good for one type, while connecting to the other type of data source becomes a difficult exercise in programming.

· Scheduling and distribution capabilities

In a realistic data warehousing usage scenario, all that senior executives have time for is to come in on Monday morning, look at the most important weekly numbers from the previous week (say, the sales numbers), and that is how they satisfy their business intelligence needs. All the fancy ad hoc and drilling capabilities will not interest them, because they never touch those features.

Based on the above scenario, the reporting tool must have scheduling and distribution capabilities. Weekly reports are scheduled to run on Monday morning, and the resulting reports are distributed to the senior executives either by email or by web publishing. Various vendors claim that they can distribute reports through a variety of interfaces, but based on my experience, the only ones that really matter are delivery via email and publishing over the intranet.
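To make the scheduling and distribution requirement concrete, here is a minimal Python sketch of the email delivery piece. The SMTP host, sender, and recipient addresses are hypothetical, and in practice a scheduler such as cron would trigger the script every Monday morning after the weekly report has been generated.

    # Minimal sketch of weekly report delivery by email (hypothetical host and addresses).
    # A scheduler such as cron would run this early every Monday morning.
    import smtplib
    from email.message import EmailMessage

    def send_weekly_report(report_path):
        msg = EmailMessage()
        msg["Subject"] = "Weekly Sales Report"
        msg["From"] = "reports@example.com"        # hypothetical sender
        msg["To"] = "executives@example.com"       # hypothetical distribution list
        msg.set_content("Attached are last week's sales numbers.")

        # Attach the pre-generated report file (for example, an Excel workbook).
        with open(report_path, "rb") as f:
            msg.add_attachment(f.read(), maintype="application",
                               subtype="octet-stream", filename="weekly_sales.xlsx")

        with smtplib.SMTP("mail.example.com") as server:   # hypothetical SMTP host
            server.send_message(msg)

    send_weekly_report("weekly_sales.xlsx")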

· Security Features: Because reporting tools, similar to OLAP tools, are geared towards a large number of users, making sure people see only what they are supposed to see is important. Security can reside at the report level, folder level, column level, row level, or even the individual cell level. By and large, all established reporting tools have these capabilities. Furthermore, they have a security layer that can interact with the common corporate login protocols. There are, however, cases where large corporations have developed their own user authentication mechanism and have a "single sign-on" policy. For these cases, having a seamless integration between the tool and the in-house authentication can require some work. I would recommend that you have the tool vendor team come in and make sure that the two are compatible.
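As a rough illustration of the row-level case only (the established tools handle all of these levels far more completely), the sketch below filters report rows by a per-user entitlement table. The user names and region entitlements are made up for the example.

    # Rough sketch of row-level security: each user sees only the rows
    # (here, sales regions) they are entitled to. The entitlement table is hypothetical.
    ENTITLEMENTS = {
        "alice": {"East", "West"},
        "bob": {"East"},
    }

    def filter_report_rows(rows, user):
        """Keep only the rows whose region the user is allowed to see."""
        allowed = ENTITLEMENTS.get(user, set())
        return [row for row in rows if row["region"] in allowed]

    report = [
        {"region": "East", "sales": 120000},
        {"region": "West", "sales": 95000},
    ]
    print(filter_report_rows(report, "bob"))   # bob sees only the East row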

· Customization

Every one of us has experienced the frustration of spending an inordinate amount of time tinkering with some office productivity tool just to make a report or presentation look good. This is definitely a waste of time, but unfortunately it is a necessary evil. In fact, a lot of the time, analysts will wish to take a report directly out of the reporting tool and place it in their presentations or reports to their bosses. If the reporting tool offers them an easy way to pre-set the reports to look exactly the way that adheres to the corporate standard, it makes the analysts' jobs much easier, and the time savings are tremendous.

· Export capabilities

The most common export needs are to Excel, to a flat file, and to PDF, and a good report tool must be able to export to all three formats. For Excel, if the situation warrants it, you will want to verify that the reporting format, not just the data itself, will be exported out to Excel. This can often be a time-saver.
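As a small illustration of the first two targets, the sketch below assumes pandas is available (with openpyxl installed for .xlsx output); the data is made up, and PDF output would need an additional library, so it is only noted in a comment.

    # Sketch of the Excel and flat-file export targets, assuming pandas is installed
    # (plus openpyxl for .xlsx output). PDF export would require an additional library.
    import pandas as pd

    report = pd.DataFrame({"store": ["Downtown", "Airport"],
                           "sales": [120000, 95000]})

    report.to_excel("weekly_sales.xlsx", index=False)   # Excel
    report.to_csv("weekly_sales.csv", index=False)      # flat file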

· Integration with the Microsoft Office environment

Most people are used to working with Microsoft Office products, especially Excel, for manipulating data. In the past, people would export the reports into Excel and then perform additional formatting / calculation tasks. Some reporting tools now offer a Microsoft Office-like editing environment for users, so all formatting can be done within the reporting tool itself, with no need to export the report into Excel. This is a nice convenience for the users.

Popular Tools

  • Business Objects (Crystal Reports)
  • Cognos
  • Actuate

OLAP Tool Selection

OLAP tools are geared towards slicing and dicing of the data. As such, they require a strong metadata layer, as well as front-end flexibility. Those are typically difficult features for home-built systems to achieve. Therefore, my recommendation is that if OLAP analysis is part of your charter for building a data warehouse, it is best to purchase an existing OLAP tool rather than to build one from scratch.

OLAP Tool Functionalities

Before we speak about OLAP tool selection criteria, we must first distinguish between the two types of OLAP tools: MOLAP (Multidimensional OLAP) and ROLAP (Relational OLAP).

1. MOLAP: In this type of OLAP, a cube is aggregated from the relational data source (data warehouse). When a user generates a report request, the MOLAP tool can generate the report quickly because all data is already pre-aggregated within the cube.

2. ROLAP: In this type of OLAP, instead of pre-aggregating everything into a cube, the ROLAP engine essentially acts as a smart SQL generator. The ROLAP tool typically comes with a 'Designer' piece, where the data warehouse administrator can specify the relationship between the relational tables, as well as how dimensions, attributes, and hierarchies map to the underlying database tables.
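To illustrate the "smart SQL generator" idea, here is a toy sketch: a dimensional request (dimensions, measures, filters) is translated into a GROUP BY query. The table and column names are hypothetical, and a real ROLAP engine would resolve joins and aggregations from the designer's metadata rather than from a single fact table.

    # Toy sketch of ROLAP-style SQL generation from a dimensional request.
    # Table and column names are hypothetical.
    def generate_sql(fact_table, dimensions, measures, filters=None):
        select_cols = dimensions + ["SUM({0}) AS {0}".format(m) for m in measures]
        sql = "SELECT " + ", ".join(select_cols) + " FROM " + fact_table
        if filters:
            sql += " WHERE " + " AND ".join(filters)
        sql += " GROUP BY " + ", ".join(dimensions)
        return sql

    print(generate_sql("sales_fact",
                       dimensions=["store_key", "date_key"],
                       measures=["sales_amount"],
                       filters=["date_key >= 20090101"]))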

Right now, there is a convergence between the traditional ROLAP and MOLAP vendors. ROLAP vendors recognize that users want their reports fast, so they are implementing MOLAP functionalities in their tools; MOLAP vendors recognize that it is often necessary to drill down to the most detailed level of information, levels that traditional cubes do not reach for performance and size reasons.

So what are the criteria for evaluating OLAP vendors? Here they are:

· Ability to leverage parallelism supplied by RDBMS and hardware: This greatly increases the tool's performance, and helps load the data into the cubes as quickly as possible.

· Performance: In addition to leveraging parallelism, the tool itself should be quick both in terms of loading the data into the cube and reading the data from the cube.

· Customization efforts: More and more, OLAP tools are used as advanced reporting tools, especially in ROLAP implementations. In such cases, the ease of front-end customization becomes an important factor in the tool selection process.

· Security Features: Because OLAP tools are geared towards a number of users, making sure people see only what they are supposed to see is important. By and large, all established OLAP tools have a security layer that can interact with the common corporate login protocols. There are, however, cases where large corporations have developed their own user authentication mechanism and have a "single sign-on" policy. For these cases, having a seamless integration between the tool and the in-house authentication can require some work. I would recommend that you have the tool vendor team come in and make sure that the two are compatible.

· Metadata support: Because OLAP tools aggregate the data into cubes and sometimes serve as the front-end tool, it is essential that they work with the metadata strategy/tool you have selected.

Popular Tools

  • Business Objects
  • Cognos
  • Hyperion
  • Microsoft Analysis Services
  • MicroStrategy

According to OLAP Report, the top 5 OLAP vendors for 2006, 2005, 2004, and 2003 were as follows:

Vendor              2006 Rank / Share    2005 Rank / Share    2004 Rank / Share    2003 Rank / Share
Microsoft           1 / 31.6%            1 / 27.9%            1 / 27.3%            1 / 26.1%
Hyperion            2 / 18.9%            2 / 19.2%            2 / 20.6%            2 / 21.9%
Cognos              3 / 12.9%            3 / 14.0%            3 / 14.1%            3 / 14.1%
Business Objects    4 / 7.3%             4 / 7.4%             4 / 7.2%             4 / 7.7%
MicroStrategy       5 / 7.3%             5 / 7.2%             5 / 7.1%             5 / 6.2%

Data Warehouse & Business Intelligence

The data warehousing market consists of tools, technologies, and methodologies that allow for the construction, usage, management, and maintenance of the hardware and software used for a data warehouse, as well as the actual data itself. Surveys indicate Data Warehousing will be the single largest IT initiative after completion of Y2K efforts. Data warehousing is currently a $28 Billion market (Source: Data Warehousing Institute) and we estimate 20% growth per annum through at least 2002. Two of the pioneers in the field were Ralph Kimball and Bill Inmon. Biographies of these two individuals have been provided, since many of the terms discussed in this paper were coined and concepts defined by them.

Data Warehouse:

The term Data Warehouse was coined by Bill Inmon in 1990; he defined it in the following way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process". He defined the terms in the sentence as follows:

Subject Oriented:

Data that gives information about a particular subject instead of about a company's ongoing operations.

Integrated:

Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.

Time-variant:

All data in the data warehouse is identified with a particular time period.

Non-volatile:

Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.

(Source: "What is a Data Warehouse?" W.H. Inmon, Prism, Volume 1, Number 1, 1995).

This definition remains reasonably accurate almost ten years later. However, a single-subject data warehouse is typically referred to as a data mart, while data warehouses are generally enterprise in scope. Also, data warehouses can be volatile. Due to the large amount of storage required for a data warehouse (multi-terabyte data warehouses are not uncommon), only a certain number of periods of history are kept in the warehouse. For instance, if three years of data are decided on and loaded into the warehouse, every month the oldest month will be "rolled off" the database, and the newest month added.
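A minimal sketch of that monthly "roll off" follows, assuming a three-year retention window; sqlite3 stands in for the warehouse database, and the table and column names are made up.

    # Minimal sketch of rolling the oldest month off the warehouse.
    # sqlite3 stands in for the warehouse RDBMS; names are hypothetical.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales_fact (sale_date TEXT, store TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                     [("2003-04-30", "Downtown", 100.0),
                      ("2006-04-30", "Downtown", 150.0)])

    # Keep only the most recent 36 months: delete rows older than the cutoff,
    # then load the newest month (the load step is omitted here).
    cutoff = "2003-05-01"
    conn.execute("DELETE FROM sales_fact WHERE sale_date < ?", (cutoff,))
    conn.commit()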

Ralph Kimball provided a much simpler definition of a data warehouse. As stated in his book, "The Data Warehouse Toolkit", on page 310, a data warehouse is "a copy of transaction data specifically structured for query and analysis". This definition provides less insight and depth than Mr. Inmon's, but is no less accurate.

Data warehousing is essentially what you need to do in order to create a data warehouse, and what you do with it. It is the process of creating, populating, and then querying a data warehouse and can involve a number of discrete technologies such as:

Source System Identification:

In order to build the data warehouse, the appropriate data must be located. Typically, this will involve both the current OLTP (On-Line Transaction Processing) system where the "day-to-day" information about the business resides, and historical data for prior periods, which may be contained in some form of "legacy" system. Often these legacy systems are not relational databases, so much effort is required to extract the appropriate data.

Data Warehouse Design and Creation:

This describes the process of designing the warehouse, with care taken to ensure that the design supports the types of queries the warehouse will be used for. This is an involved effort that requires both an understanding of the database schema to be created, and a great deal of interaction with the user community. The design is often an iterative process and it must be modified a number of times before the model can be stabilized. Great care must be taken at this stage, because once the model is populated with large amounts of data, some of which may be very difficult to recreate, the model can not easily be changed.

Data Acquisition:

This is the process of moving company data from the source systems into the warehouse. It is often the most time-consuming and costly effort in the data warehousing project, and is performed with software products known as ETL (Extract/Transform/Load) tools. There are currently over 50 ETL tools on the market. The data acquisition phase can cost millions of dollars and take months or even years to complete. Data acquisition is then an ongoing, scheduled process, which is executed to keep the warehouse current to a pre-determined period in time (e.g., the warehouse is refreshed monthly).
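A bare-bones sketch of the extract/transform/load flow is shown below; sqlite3 stands in for both the source system and the warehouse, and all names and the cents-to-dollars transformation are made up for the example. Commercial ETL tools do this at scale, with scheduling, logging, and restartability.

    # Bare-bones ETL sketch: extract from a source system, transform, and load into
    # the warehouse. sqlite3 stands in for both databases; names are hypothetical.
    import sqlite3

    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (order_date TEXT, store TEXT, amount_cents INTEGER)")
    source.execute("INSERT INTO orders VALUES ('2009-04-13', 'Downtown', 12550)")

    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE sales_fact (sale_date TEXT, store TEXT, amount REAL)")

    # Extract...
    rows = source.execute("SELECT order_date, store, amount_cents FROM orders").fetchall()
    # ...transform (cents to dollars)...
    transformed = [(d, s, cents / 100.0) for (d, s, cents) in rows]
    # ...and load.
    warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", transformed)
    warehouse.commit()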

Changed Data Capture:

The periodic update of the warehouse from the transactional system(s) is complicated by the difficulty of identifying which records in the source have changed since the last update. This effort is referred to as "changed data capture". Changed data capture is a field of endeavor in itself, and many products are on the market to address it. Some of the technologies that are used in this area are Replication servers, Publish/Subscribe, Triggers and Stored Procedures, and Database Log Analysis.
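Of the techniques listed above, the simplest to sketch is a timestamp comparison: pull only the source rows modified since the previous refresh. The table, columns, and dates below are made up, and production CDC usually relies on the more robust mechanisms mentioned (triggers, replication, or log analysis).

    # Simple timestamp-based changed data capture sketch (hypothetical names).
    import sqlite3

    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE customers (id INTEGER, name TEXT, last_modified TEXT)")
    source.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                       [(1, "IBM", "2009-03-01"), (2, "Acme", "2009-04-10")])

    last_refresh = "2009-04-01"   # remembered from the previous warehouse refresh
    changed = source.execute(
        "SELECT id, name, last_modified FROM customers WHERE last_modified > ?",
        (last_refresh,)).fetchall()
    print(changed)   # only the rows changed since the last refresh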

Data Cleansing:

This is typically performed in conjunction with data acquisition (it can be part of the "T" in "ETL"). A data warehouse that contains incorrect data is not only useless, but also very dangerous. The whole idea behind a data warehouse is to enable decision-making. If a high level decision is made based on incorrect data in the warehouse, the company could suffer severe consequences, or even complete failure. Data cleansing is a complicated process that validates and, if necessary, corrects the data before it is inserted into the warehouse. For example, the company could have three "Customer Name" entries in its various source systems, one entered as "IBM", one as "I.B.M.", and one as "International Business Machines". Obviously, these are all the same customer. Someone in the organization must make a decision as to which is correct, and then the data cleansing tool will change the others to match the rule. This process is also referred to as "data scrubbing" or "data quality assurance". It can be an extremely complex process, especially if some of the warehouse inputs are from older mainframe file systems (commonly referred to as "flat files" or "sequential files").
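The customer-name example above can be sketched as a simple lookup rule: once someone decides that "IBM" is the correct form, the other variants are rewritten to match. Real data cleansing tools go far beyond exact-match rules like this.

    # Sketch of the customer-name cleansing rule described above.
    CANONICAL_NAMES = {
        "I.B.M.": "IBM",
        "International Business Machines": "IBM",
    }

    def cleanse_customer_name(name):
        return CANONICAL_NAMES.get(name.strip(), name.strip())

    for raw in ["IBM", "I.B.M.", "International Business Machines"]:
        print(raw, "->", cleanse_customer_name(raw))   # all three map to "IBM"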

Data Aggregation:

This process is often performed during the "T" phase of ETL, if it is performed at all. Data warehouses can be designed to store data at the detail level (each individual transaction), at some aggregate level (summary data), or a combination of both. The advantage of summarized data is that typical queries against the warehouse run faster. The disadvantage is that information that may be needed to answer a query is lost during aggregation. The tradeoff must be carefully weighed, because the decision cannot be undone without rebuilding and repopulating the warehouse. The safest decision is to build the warehouse with a high level of detail, but the cost in storage can be extreme.
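A small sketch of the aggregation step, assuming pandas is available and using made-up detail rows: individual transactions are summarized by day and store, after which the individual transactions can no longer be recovered from the summary.

    # Aggregate detail-level transactions into a daily summary (pandas assumed).
    import pandas as pd

    detail = pd.DataFrame({
        "sale_date": ["2009-04-13", "2009-04-13", "2009-04-14"],
        "store":     ["Downtown",   "Downtown",   "Airport"],
        "amount":    [125.50,       80.00,        42.25],
    })

    daily_summary = (detail
                     .groupby(["sale_date", "store"], as_index=False)["amount"]
                     .sum())
    print(daily_summary)   # one row per store per day; the detail rows are gone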

Now that the warehouse has been built and populated, it becomes possible to extract meaningful information from it that will provide a competitive advantage and a return on investment. This is done with tools that fall within the general rubric of "Business Intelligence".

Business Intelligence (BI)

A very broad field indeed, it contains technologies such as Decision Support Systems (DSS), Executive Information Systems (EIS), On-Line Analytical Processing (OLAP), Relational OLAP (ROLAP), Multi-Dimensional OLAP (MOLAP), Hybrid OLAP (HOLAP, a combination of MOLAP and ROLAP), and more. BI can be broken down into four broad fields:

Multi-dimensional Analysis Tools:

Tools that allow the user to look at the data from a number of different "angles". These tools often use a multi-dimensional database referred to as a "cube".

Query tools:

Tools that allow the user to issue SQL (Structured Query Language) queries against the warehouse and get a result set back.

Data Mining Tools:

Tools that automatically search for patterns in data. These tools are usually driven by complex statistical formulas. The easiest way to distinguish data mining from the various forms of OLAP is that OLAP can only answer questions you know to ask, while data mining answers questions you did not necessarily know to ask.
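As the simplest possible flavor of this, the sketch below computes pairwise correlations across a few made-up measures with pandas; real data mining tools automate far richer pattern searches than a correlation matrix.

    # Toy illustration of finding correlation among different factors (pandas assumed).
    import pandas as pd

    data = pd.DataFrame({
        "ad_spend":      [10, 20, 30, 40, 50],
        "store_traffic": [110, 150, 260, 300, 420],
        "sales":         [1000, 1400, 2600, 2900, 4100],
    })
    print(data.corr())   # pairwise correlation matrix across the three factors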

Data Visualization Tools:

Tools that show graphical representations of data, including complex three-dimensional data pictures. The theory is that the user can "see" trends more effectively in this manner than when looking at complex statistical graphs. Some vendors are making progress in this area using the Virtual Reality Modeling Language (VRML).

Metadata Management

Throughout the entire process of identifying, acquiring, and querying the data, metadata management takes place. Metadata is defined as "data about data". An example is a column in a table. The datatype (for instance a string or integer) of the column is one piece of metadata. The name of the column is another. The actual value in the column for a particular row is not metadata - it is data. Metadata is stored in a Metadata Repository and provides extremely useful information to all of the tools mentioned previously. Metadata management has developed into an exacting science that can provide huge returns to an organization. It can assist companies in analyzing the impact of changes to database tables, tracking owners of individual data elements ("data stewards"), and much more. It is also required to build the warehouse, since the ETL tool needs to know the metadata attributes of the sources and targets in order to "map" the data properly. The BI tools need the metadata for similar reasons.
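As a tiny illustration of "data about data", the sketch below reads column names and datatypes for a table; sqlite3's PRAGMA output stands in for what a real metadata repository would hold, and the table is made up.

    # Read column-level metadata (name and datatype) for a hypothetical table.
    # sqlite3's PRAGMA stands in for a real metadata repository here.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer_dim (customer_id INTEGER, customer_name TEXT)")

    for cid, name, coltype, notnull, default, pk in conn.execute("PRAGMA table_info(customer_dim)"):
        print(name, coltype)   # metadata: the column name and datatype, not the data itself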

Summary:

Data Warehousing is a complex field, with many vendors vying for market awareness. The complexity of the technology, the interactions between the various tools, and the high price points of the products require companies to perform a careful technology evaluation before embarking on a warehousing project. However, the potential for enormous returns on investment and competitive advantage make data warehousing difficult to ignore.

Biographical Information

Bill Inmon

Bill Inmon is universally recognized as the "father of the data warehouse." He has over 26 years of database technology management experience and data warehouse design expertise, and has published 36 books and more than 350 articles in major computer journals. His books have been translated into nine languages. He is known globally for his seminars on developing data warehouses and has been a keynote speaker for every major computing association. Before founding Pine Cone Systems, Bill was a co-founder of Prism Solutions, Inc.

Ralph Kimball

Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems, and founder and CEO of Red Brick Systems. He has a Ph.D. from Stanford in electrical engineering, specializing in man-machine systems. Ralph is a leading proponent of the dimensional approach to designing large data warehouses. He currently teaches data warehousing design skills to IT groups, and helps selected clients with specific data warehouse designs. Ralph is a columnist for Intelligent Enterprise magazine and has a relationship with Sagent Technology, Inc., a data warehouse tool vendor. His book "The Data Warehouse Toolkit" is widely recognized as the seminal work on the subject.


Data Warehousing Concepts:

Dimensional Data Model:

The dimensional data model is most often used in data warehousing systems. This is different from the 3rd normal form, commonly used for transactional (OLTP) type systems. As you can imagine, the same data would then be stored differently in a dimensional model than in a 3rd normal form model.

To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:

Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.

Hierarchy: The specification of levels that represents the relationships between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.

Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: A date column, a store column, and a sales amount column.

Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specify how that particular quarter is represented on a report (for example, the first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
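Putting the definitions together, here is a minimal sketch of the sales-by-store-by-day example: a fact table keyed to Date and Store lookup tables. sqlite3 is used only for illustration and the names are made up.

    # Minimal dimensional model sketch: one fact table plus two lookup tables.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE date_lookup  (date_key INTEGER PRIMARY KEY, calendar_date TEXT, quarter TEXT);
    CREATE TABLE store_lookup (store_key INTEGER PRIMARY KEY, store_name TEXT);
    CREATE TABLE sales_fact (
        date_key     INTEGER REFERENCES date_lookup(date_key),
        store_key    INTEGER REFERENCES store_lookup(store_key),
        sales_amount REAL
    );
    """)
    conn.execute("INSERT INTO date_lookup VALUES (20010105, '2001-01-05', 'Q1 2001')")
    conn.execute("INSERT INTO store_lookup VALUES (1, 'Downtown')")
    conn.execute("INSERT INTO sales_fact VALUES (20010105, 1, 1250.00)")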

In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.

Star Schema: In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.

Snowflake Schema: The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.
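To contrast the two schema types, the sketch below defines the same hypothetical Product dimension twice: once as a single wide star dimension table, and once snowflaked into smaller normalized lookup tables. sqlite3 is used only for illustration.

    # Star versus snowflake sketch for a hypothetical Product dimension.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Star schema: one wide, denormalized dimension table joined directly to the fact table.
    CREATE TABLE product_dim_star (
        product_key     INTEGER PRIMARY KEY,
        product_name    TEXT,
        category_name   TEXT,
        department_name TEXT
    );

    -- Snowflake schema: the category and department attributes split into their own
    -- smaller lookup tables, each joined through a key.
    CREATE TABLE department_lookup (
        department_key  INTEGER PRIMARY KEY,
        department_name TEXT
    );
    CREATE TABLE category_lookup (
        category_key   INTEGER PRIMARY KEY,
        category_name  TEXT,
        department_key INTEGER REFERENCES department_lookup(department_key)
    );
    CREATE TABLE product_dim_snow (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category_key INTEGER REFERENCES category_lookup(category_key)
    );
    """)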