Data Warehouse Concepts
What is a Data Warehouse?
According to Bill Inmon, author of several data warehouse books, "A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision making process."
Example:
Consider a bank that has several branches in several countries, millions of customers, and lines of business such as savings and loans. Over the years, the application designers in each branch have made their own decisions about how an application and its database should be built, so the source systems differ in naming conventions, units of measurement, encoding structures, and physical attributes of data. The following example explains how the data is integrated from source systems to target systems.
Example of Source Data
System Name | Attribute Name | Column Name | Datatype | Values
Source System 1 | Customer Application Date | CUSTOMER_APPLICATION_DATE | NUMERIC(8,0) | 11012005
Source System 2 | Customer Application Date | CUST_APPLICATION_DATE | DATE | 11012005
Source System 3 | Application Date | APPLICATION_DATE | DATE | 01NOV2005
In the example above, the attribute name, column name, datatype and values differ entirely from one source system to another. This inconsistency in data can be avoided by integrating the data into a data warehouse with good standards.
Example of Target Data (Data Warehouse)
Target System | Attribute Name | Column Name | Datatype | Values
Record #1 | Customer Application Date | CUSTOMER_APPLICATION_DATE | DATE | 01112005
Record #2 | Customer Application Date | CUSTOMER_APPLICATION_DATE | DATE | 01112005
Record #3 | Customer Application Date | CUSTOMER_APPLICATION_DATE | DATE | 01112005
In the above example of target data, attribute names, column names, and datatypes are consistent throughout the target system. This is how data from various source systems is integrated and accurately stored into the data warehouse; a minimal SQL sketch of this kind of standardization follows.
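As an illustration only, the following Oracle-style SQL sketch shows how such a load could standardize the application date arriving in different source formats. The staging and warehouse table names (stg_source_system1, stg_source_system3, dw_customer) and the exact formats are assumptions, not part of the original example.

-- Standardize the application date from two hypothetical source systems into one DATE column.
INSERT INTO dw_customer (customer_id, customer_application_date)
SELECT customer_id,
       -- Source System 1 stores the date as NUMERIC(8,0) in MMDDYYYY form (e.g. 11012005).
       TO_DATE(TO_CHAR(customer_application_date), 'MMDDYYYY')
FROM   stg_source_system1
UNION ALL
SELECT customer_id,
       -- Source System 3 delivers the date as text like '01NOV2005' (treated as text here for illustration).
       TO_DATE(application_date, 'DDMONYYYY')
FROM   stg_source_system3;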
See Figure below for Data Warehouse Architecture Diagram.
Data Warehouse & Data Mart
A data warehouse is a relational/multidimensional database that is designed for query and analysis rather than transaction processing. A data warehouse usually contains historical data that is derived from transaction data. It separates analysis workload from transaction workload and enables a business to consolidate data from several sources.
In
addition to a relational/multidimensional database, a data warehouse
environment often consists of an ETL solution, an OLAP engine, client
analysis tools, and other applications that manage the process of
gathering data and delivering it to business users.
There are three types of data warehouses:
1. Enterprise Data Warehouse - An enterprise data warehouse provides a central database for decision support throughout the enterprise.
2. ODS (Operational Data Store) - This has a broad enterprise-wide scope, but unlike the real enterprise data warehouse, data is refreshed in near real time and used for routine business activity.
3. Data Mart - Datamart is a subset of data warehouse and it supports a particular region, business unit or business function.
Data
warehouses and data marts are built on dimensional data modeling where
fact tables are connected with dimension tables. This is most useful for
users to access data since a database can be visualized as a cube of
several dimensions. A data warehouse provides an opportunity for slicing
and dicing that cube along each of its dimensions.
Data Mart:
A data mart is a subset of data warehouse that is designed for a
particular line of business, such as sales, marketing, or finance. In a
dependent data mart, data can be derived from an enterprise-wide data
warehouse. In an independent data mart, data can be collected directly
from sources.
What is Star Schema?
Star Schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema that contains one or more dimensions and fact tables. It is called a star schema because the entity-relationship diagram between dimensions and fact tables resembles a star where one fact table is connected to multiple dimensions. The center of the star schema consists of a large fact table, and it points towards the dimension tables. The advantages of a star schema are simpler slicing of data, better query performance and easier understanding of data.
Steps in designing Star Schema
· Identify a business process for analysis (like sales).
· Identify measures or facts (sales dollar).
· Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).
· List the columns that describe each dimension (region name, branch name, employee name).
· Determine the lowest level of summary in a fact table (sales dollar). A minimal DDL sketch of such a star schema is shown below.
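To make these steps concrete, here is a minimal DDL sketch of a star schema for the sales example. It is Oracle-style SQL, the table and column names are assumptions made for illustration, and the organization dimension is omitted for brevity.

-- Dimension tables: one surrogate key each plus descriptive attributes; no parent tables.
CREATE TABLE product_dim (
    product_dim_id   NUMBER PRIMARY KEY,
    product_category VARCHAR2(50),
    product_name     VARCHAR2(100)
);

CREATE TABLE location_dim (
    location_dim_id  NUMBER PRIMARY KEY,
    country_name     VARCHAR2(50),
    state_name       VARCHAR2(50),
    city_name        VARCHAR2(50)
);

CREATE TABLE time_dim (
    time_dim_id      NUMBER PRIMARY KEY,
    cal_date         DATE,
    month_name       VARCHAR2(20),
    year_no          NUMBER(4)
);

-- Fact table: foreign keys to each dimension plus the measure (sales dollar).
-- The primary key is a composite of all the foreign keys.
CREATE TABLE sales_fact (
    product_dim_id   NUMBER REFERENCES product_dim,
    location_dim_id  NUMBER REFERENCES location_dim,
    time_dim_id      NUMBER REFERENCES time_dim,
    sales_dollar     NUMBER(12,2),
    PRIMARY KEY (product_dim_id, location_dim_id, time_dim_id)
);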
Important aspects of Star Schema & Snow Flake Schema
· In a star schema every dimension will have a primary key.
· In a star schema, a dimension table will not have any parent table.
· Whereas in a snow flake schema, a dimension table will have one or more parent tables.
· Hierarchies for the dimensions are stored in the dimensional table itself in star schema.
· Whereas hierarchies are broken into separate tables in a snow flake schema. These hierarchies help to drill down the data from the topmost level to the lowermost levels.
Glossary:
Hierarchy
A logical structure that uses ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation; for example, in a time dimension, a hierarchy might be used to aggregate data from the Month level to the Quarter level, from the Quarter level to the Year level. A hierarchy can also be used to define a navigational drill path, regardless of whether the levels in the hierarchy represent aggregated totals or not.
Level
A position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the Month, Quarter, and Year levels.
Fact Table
A table in a star schema that contains facts and is connected to dimensions. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
A
fact table might contain either detail level facts or facts that have
been aggregated (fact tables that contain aggregated facts are often
instead called summary tables). A fact table usually contains facts with
the same level of aggregation.
Example of Star Schema:
In the example figure, the sales fact table is connected to the location, product, time and organization dimensions. It shows that data can be sliced across all dimensions and can also be aggregated across multiple dimensions. "Sales Dollar" in the sales fact table can be calculated across all dimensions independently or in a combined manner, as explained below.
· Sales Dollar value for a particular product
· Sales Dollar value for a product in a location
· Sales Dollar value for a product in a year within a location
· Sales Dollar value for a product in a year within a location sold or serviced by an employee
Snow Flake Schema
A snowflake schema describes a star schema structure that has been normalized through the use of outrigger tables, i.e. dimension table hierarchies are broken out into simpler tables. In the star schema example we had 4 dimensions (location, product, time, organization) and a fact table (sales).
In the snow flake schema example diagram shown below, there are 4 dimension tables, 4 lookup tables and 1 fact table, because the hierarchies (category, branch, state, and month) are broken out of the dimension tables (PRODUCT, ORGANIZATION, LOCATION, and TIME) respectively and stored separately. In OLAP, this snow flake approach increases the number of joins and can slow the retrieval of data. A few organizations normalize the dimension tables this way to save space, but since dimension tables usually hold comparatively little data, the snow flake approach is often avoided. A minimal DDL sketch of such a normalized dimension is shown below.
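As a sketch (hypothetical names again), the product dimension from the earlier star schema DDL could be snowflaked by moving the category level into its own outrigger table:

-- Outrigger table holding the category level of the product hierarchy.
CREATE TABLE product_category_lkp (
    product_category_id   NUMBER PRIMARY KEY,
    product_category_name VARCHAR2(50)
);

-- Snowflaked product dimension: instead of storing the category name directly,
-- it holds a foreign key to the outrigger table, which costs one extra join at query time.
CREATE TABLE product_dim_sf (
    product_dim_id        NUMBER PRIMARY KEY,
    product_name          VARCHAR2(100),
    product_category_id   NUMBER REFERENCES product_category_lkp
);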
Example of Snow Flake Schema:
Fact Table
The centralized table in a star schema is called the FACT table. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
In
the example fig 1.6 "Sales Dollar" is a fact(measure) and it can be
added across several dimensions. Fact tables store different types of
measures like additive, non additive and semi additive measures.
Measure Types
· Additive - Measures that can be added across all dimensions.
· Non Additive - Measures that cannot be added across all dimensions.
· Semi Additive - Measures that can be added across few dimensions and not with others.
A
fact table might contain either detail level facts or facts that have
been aggregated (fact tables that contain aggregated facts are often
instead called summary tables).
In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called Factless Fact tables.
Steps in designing Fact Table
· Identify a business process for analysis(like sales).
· Identify measures or facts (sales dollar).
· Identify dimensions for facts(product dimension, location dimension, time dimension, organization dimension).
· List the columns that describe each dimension (region name, branch name, employee name).
· Determine the lowest level of summary in a fact table(sales dollar).
Example of a Fact Table with an Additive Measure in Star Schema: Figure 1.6
In the example figure 1.6, the sales fact table is connected to the location, product, time and organization dimensions. The measure "Sales Dollar" in the sales fact table can be added across all dimensions independently or in a combined manner, as explained below; a sample query is sketched after the list.
· Sales Dollar value for a particular product
· Sales Dollar value for a product in a location
· Sales Dollar value for a product in a year within a location
· Sales Dollar value for a product in a year within a location sold or serviced by an employee
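Against the hypothetical star schema DDL sketched earlier, each of these roll-ups is just a different GROUP BY over the same additive measure. For example, Sales Dollar for a product in a year within a location (state used as the location level here) could be obtained as:

-- Sum the additive measure across the product, time and location dimensions.
SELECT p.product_name,
       t.year_no,
       l.state_name,
       SUM(f.sales_dollar) AS total_sales_dollar
FROM   sales_fact   f
JOIN   product_dim  p ON p.product_dim_id  = f.product_dim_id
JOIN   time_dim     t ON t.time_dim_id     = f.time_dim_id
JOIN   location_dim l ON l.location_dim_id = f.location_dim_id
GROUP  BY p.product_name, t.year_no, l.state_name;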
Dimension Table
A dimension table is one that describes the business entities of an enterprise, represented as hierarchical, categorical information such as time, departments, locations, and products. Dimension tables are sometimes called lookup or reference tables.
Location Dimension
In relational data modeling, for normalization purposes, country lookup, state lookup, county lookup, and city lookup are not merged into a single table. In dimensional data modeling (star schema), these tables would be merged into a single table called LOCATION DIMENSION for performance and data slicing requirements. This location dimension helps to compare the sales in one region with another. We may see good sales profit in one region and a loss in another; if it is a loss, the reasons may be a new competitor in that area, a failure of our marketing strategy, etc.
Example of Location Dimension: Figure
Country Lookup
Country Code | Country Name | DateTimeStamp
USA | United States Of America | 1/1/2005 11:23:31 AM
State Lookup
State Code | State Name | DateTimeStamp
NY | New York | 1/1/2005 11:23:31 AM
FL | Florida | 1/1/2005 11:23:31 AM
CA | California | 1/1/2005 11:23:31 AM
NJ | New Jersey | 1/1/2005 11:23:31 AM
County Lookup
County Code | County Name | DateTimeStamp
NYSH | Shelby | 1/1/2005 11:23:31 AM
FLJE | Jefferson | 1/1/2005 11:23:31 AM
CAMO | Montgomery | 1/1/2005 11:23:31 AM
NJHU | Hudson | 1/1/2005 11:23:31 AM
City Lookup
City Code | City Name | DateTimeStamp
NYSHMA | Manhattan | 1/1/2005 11:23:31 AM
FLJEPC | Panama City | 1/1/2005 11:23:31 AM
CAMOSH | San Hose | 1/1/2005 11:23:31 AM
NJHUJC | Jersey City | 1/1/2005 11:23:31 AM
Location Dimension
Location Dimension Id | Country Name | State Name | County Name | City Name | DateTimeStamp
1 | USA | New York | Shelby | Manhattan | 1/1/2005 11:23:31 AM
2 | USA | Florida | Jefferson | Panama City | 1/1/2005 11:23:31 AM
3 | USA | California | Montgomery | San Hose | 1/1/2005 11:23:31 AM
4 | USA | New Jersey | Hudson | Jersey City | 1/1/2005 11:23:31 AM
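A sketch of how such a denormalized location dimension could be populated from the lookup tables above. The lookup table names, the parent-key columns joining them, and the location_dim_seq sequence are assumptions for illustration; the figures above do not show the foreign keys between lookups, and the dimension is assumed to carry one column per lookup attribute plus a load timestamp, as in the figure.

-- Flatten the country/state/county/city hierarchy into one dimension row per city.
INSERT INTO location_dim
    (location_dim_id, country_name, state_name, county_name, city_name, datetime_stamp)
SELECT location_dim_seq.NEXTVAL,
       co.country_name,
       st.state_name,
       cn.county_name,
       ci.city_name,
       SYSDATE
FROM   city_lookup    ci
JOIN   county_lookup  cn ON cn.county_code  = ci.county_code
JOIN   state_lookup   st ON st.state_code   = cn.state_code
JOIN   country_lookup co ON co.country_code = st.country_code;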
Product Dimension
In a relational data model, for normalization purposes, product category lookup, product sub-category lookup, product lookup, and product feature lookup are not merged into a single table. In dimensional data modeling (star schema), these tables would be merged into a single table called PRODUCT DIMENSION for performance and data slicing requirements.
Example of Product Dimension: Figure
Product Category Lookup
Product Category Code | Product Category Name | DateTimeStamp
1 | Apparel | 1/1/2005 11:23:31 AM
2 | Shoe | 1/1/2005 11:23:31 AM
Product Sub-Category Lookup
Product Sub-Category Code | Product Sub-Category Name | DateTimeStamp
11 | Shirt | 1/1/2005 11:23:31 AM
12 | Trouser | 1/1/2005 11:23:31 AM
13 | Casual | 1/1/2005 11:23:31 AM
14 | Formal | 1/1/2005 11:23:31 AM
Product Lookup
Product Code | Product Name | DateTimeStamp
1001 | Van Heusen | 1/1/2005 11:23:31 AM
1002 | Arrow | 1/1/2005 11:23:31 AM
1003 | Nike | 1/1/2005 11:23:31 AM
1004 | Adidas | 1/1/2005 11:23:31 AM
Product Feature Lookup
Product Feature Code | Product Feature Description | DateTimeStamp
10001 | Van-M | 1/1/2005 11:23:31 AM
10002 | Van-L | 1/1/2005 11:23:31 AM
10003 | Arr-XL | 1/1/2005 11:23:31 AM
10004 | Arr-XXL | 1/1/2005 11:23:31 AM
10005 | Nike-8 | 1/1/2005 11:23:31 AM
10006 | Nike-9 | 1/1/2005 11:23:31 AM
10007 | Adidas-10 | 1/1/2005 11:23:31 AM
10008 | Adidas-11 | 1/1/2005 11:23:31 AM
Product Dimension
Product Dimension Id | Product Category Name | Product Sub-Category Name | Product Name | Product Feature Desc | DateTimeStamp
100001 | Apparel | Shirt | Van Heusen | Van-M | 1/1/2005 11:23:31 AM
100002 | Apparel | Shirt | Van Heusen | Van-L | 1/1/2005 11:23:31 AM
100003 | Apparel | Shirt | Arrow | Arr-XL | 1/1/2005 11:23:31 AM
100004 | Apparel | Shirt | Arrow | Arr-XXL | 1/1/2005 11:23:31 AM
100005 | Shoe | Casual | Nike | Nike-8 | 1/1/2005 11:23:31 AM
100006 | Shoe | Casual | Nike | Nike-9 | 1/1/2005 11:23:31 AM
100007 | Shoe | Casual | Adidas | Adidas-10 | 1/1/2005 11:23:31 AM
100008 | Shoe | Casual | Adidas | Adidas-11 | 1/1/2005 11:23:31 AM
Organization Dimension
In a relational data model, for normalization purposes, corporate office lookup, region lookup, branch lookup, and employee lookup are not merged into a single table. In dimensional data modeling (star schema), these tables would be merged into a single table called ORGANIZATION DIMENSION for performance and data slicing requirements.
This dimension helps us to find the products sold or serviced within the organization and by which employees. In any industry, we can calculate sales on a region, branch, and employee basis; based on that performance, an organization can provide incentives to employees and subsidies to branches to further increase sales.
Example of Organization Dimension: Figure
Corporate Lookup
Corporate Code | Corporate Name | DateTimeStamp
CO | American Bank | 1/1/2005 11:23:31 AM
Region Lookup
Region Code | Region Name | DateTimeStamp
SE | South East | 1/1/2005 11:23:31 AM
MW | Mid West | 1/1/2005 11:23:31 AM
Branch Lookup
Branch Code | Branch Name | DateTimeStamp
FLTM | Florida-Tampa | 1/1/2005 11:23:31 AM
ILCH | Illinois-Chicago | 1/1/2005 11:23:31 AM
Employee Lookup
Employee Code | Employee Name | DateTimeStamp
E1 | Paul Young | 1/1/2005 11:23:31 AM
E2 | Chris Davis | 1/1/2005 11:23:31 AM
Organization Dimension
Organization Dimension Id | Corporate Name | Region Name | Branch Name | Employee Name | DateTimeStamp
1 | American Bank | South East | Florida-Tampa | Paul Young | 1/1/2005 11:23:31 AM
2 | American Bank | Mid West | Illinois-Chicago | Chris Davis | 1/1/2005 11:23:31 AM
Time Dimension
In a relational data model, for normalization purposes, year lookup, quarter lookup, month lookup, and week lookup are not merged into a single table. In dimensional data modeling (star schema), these tables would be merged into a single table called TIME DIMENSION for performance and data slicing requirements.
This dimension helps to find the sales done on a daily, weekly, monthly and yearly basis. We can do trend analysis by comparing this year's sales with the previous year's, or this week's sales with the previous week's; a sample trend query is sketched after the time dimension example below.
Example of Time Dimension: Figure
Year Lookup
Year Id | Year Number | DateTimeStamp
1 | 2004 | 1/1/2005 11:23:31 AM
2 | 2005 | 1/1/2005 11:23:31 AM
Quarter Lookup
Quarter Number | Quarter Name | DateTimeStamp
1 | Q1 | 1/1/2005 11:23:31 AM
2 | Q2 | 1/1/2005 11:23:31 AM
3 | Q3 | 1/1/2005 11:23:31 AM
4 | Q4 | 1/1/2005 11:23:31 AM
Month Lookup
Month Number | Month Name | DateTimeStamp
1 | January | 1/1/2005 11:23:31 AM
2 | February | 1/1/2005 11:23:31 AM
3 | March | 1/1/2005 11:23:31 AM
4 | April | 1/1/2005 11:23:31 AM
5 | May | 1/1/2005 11:23:31 AM
6 | June | 1/1/2005 11:23:31 AM
7 | July | 1/1/2005 11:23:31 AM
8 | August | 1/1/2005 11:23:31 AM
9 | September | 1/1/2005 11:23:31 AM
10 | October | 1/1/2005 11:23:31 AM
11 | November | 1/1/2005 11:23:31 AM
12 | December | 1/1/2005 11:23:31 AM
Week Lookup
Week Number | Day of Week | DateTimeStamp
1 | Sunday | 1/1/2005 11:23:31 AM
1 | Monday | 1/1/2005 11:23:31 AM
1 | Tuesday | 1/1/2005 11:23:31 AM
1 | Wednesday | 1/1/2005 11:23:31 AM
1 | Thursday | 1/1/2005 11:23:31 AM
1 | Friday | 1/1/2005 11:23:31 AM
1 | Saturday | 1/1/2005 11:23:31 AM
2 | Sunday | 1/1/2005 11:23:31 AM
2 | Monday | 1/1/2005 11:23:31 AM
2 | Tuesday | 1/1/2005 11:23:31 AM
2 | Wednesday | 1/1/2005 11:23:31 AM
2 | Thursday | 1/1/2005 11:23:31 AM
2 | Friday | 1/1/2005 11:23:31 AM
2 | Saturday | 1/1/2005 11:23:31 AM
Time Dimension
Time Dim Id | Year No | Day Of Year | Quarter No | Month No | Month Name | Month Day No | Week No | Day of Week | Cal Date | DateTimeStamp
1 | 2004 | 1 | Q1 | 1 | January | 1 | 1 | 5 | 1/1/2004 | 1/1/2005 11:23:31 AM
2 | 2004 | 32 | Q1 | 2 | February | 1 | 5 | 1 | 2/1/2004 | 1/1/2005 11:23:31 AM
3 | 2005 | 1 | Q1 | 1 | January | 1 | 1 | 7 | 1/1/2005 | 1/1/2005 11:23:31 AM
4 | 2005 | 32 | Q1 | 2 | February | 1 | 5 | 3 | 2/1/2005 | 1/1/2005 11:23:31 AM
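As a sketch of the trend analysis mentioned above, using the hypothetical sales_fact and time_dim tables from the earlier star schema DDL, yearly sales can be compared by grouping on the time dimension:

-- Compare total Sales Dollar by year to see the year-over-year trend.
SELECT t.year_no,
       SUM(f.sales_dollar) AS total_sales_dollar
FROM   sales_fact f
JOIN   time_dim   t ON t.time_dim_id = f.time_dim_id
WHERE  t.year_no IN (2004, 2005)
GROUP  BY t.year_no
ORDER  BY t.year_no;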
Slowly Changing Dimensions
Dimensions that change over time are called Slowly Changing Dimensions. For instance, a product price changes over time; people change their names for some reason; country and state names may change over time. These are a few examples of Slowly Changing Dimensions, since changes happen to them over a period of time.
Slowly Changing Dimensions are often categorized into three types, namely Type 1, Type 2 and Type 3. The following section deals with how to capture and handle these changes over time.
The "Product" table mentioned below contains a product named Product1, with Product ID as the primary key. In the year 2004 the price of Product1 was $150, and over time Product1's price changes from $150 to $350. With this information, let us explain the three types of Slowly Changing Dimensions.
Product Price in 2004:
Product ID(PK) | Year | Product Name | Product Price
1 | 2004 | Product1 | $150
Type 1: Overwriting the old values.
In the year 2005, if the price of the product changes to $250, then the old values of the columns "Year" and "Product Price" are updated and replaced with the new values. With Type 1 there is no way to find out the old 2004 value for Product1, since the table now contains only the new price and year information. A sketch of this overwrite follows the table below.
Product
Product ID(PK) | Year | Product Name | Product Price
1 | 2005 | Product1 | $250
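A minimal sketch of the Type 1 change against a hypothetical Product table (Oracle-style SQL): the existing row is simply overwritten, so no history survives.

-- Type 1: overwrite the old year and price in place; the 2004 values are lost.
UPDATE product
SET    year          = 2005,
       product_price = 250
WHERE  product_id    = 1;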
Type 2: Creating an additional record.
In Type 2, the old values are not replaced; instead, a new row containing the new values is added to the product table. So at any point in time, the difference between the old and new values can be retrieved and easily compared, which is very useful for reporting purposes.
Product
Product ID(PK) | Year | Product Name | Product Price
1 | 2004 | Product1 | $150
1 | 2005 | Product1 | $250
The problem with the above data structure is that "Product ID" cannot store duplicate values for Product1, since "Product ID" is the primary key. The structure also does not clearly specify the effective date and expiry date of each Product1 row, i.e. when the price change happened. So it is better to change the data structure to overcome the primary key violation.
Product
Product ID(PK) | Effective DateTime(PK) | Year | Product Name | Product Price | Expiry DateTime
1 | 01-01-2004 12.00PM | 2004 | Product1 | $150 | 12-31-2004 11.59PM
1 | 01-01-2005 12.00PM | 2005 | Product1 | $250 |
In the changed Product table structure, "Product ID" and "Effective DateTime" form a composite primary key, so there is no violation of the primary key constraint. The new columns "Effective DateTime" and "Expiry DateTime" tell us when each version of the product was effective and when it expired, which adds clarity and enhances the scope of this table. The Type 2 approach may need additional space in the database, since an additional row has to be stored for every changed record; but since dimensions are usually not that big in the real world, the additional space is negligible. A sketch of the corresponding change-handling SQL follows.
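A minimal sketch of applying the Type 2 price change to the hypothetical Product table above (Oracle-style SQL): first close out the currently active row, then insert a new row carrying the new price.

-- Step 1: expire the currently active row for the product.
UPDATE product
SET    expiry_datetime = TO_DATE('12-31-2004 23:59', 'MM-DD-YYYY HH24:MI')
WHERE  product_id = 1
AND    expiry_datetime IS NULL;

-- Step 2: insert a new row effective from 2005 with the new price and an open expiry date.
INSERT INTO product
    (product_id, effective_datetime, year, product_name, product_price, expiry_datetime)
VALUES
    (1, TO_DATE('01-01-2005 00:00', 'MM-DD-YYYY HH24:MI'), 2005, 'Product1', 250, NULL);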
Type 3: Creating new fields.
In Type 3, only the latest change to the values can be seen. The example below illustrates how to add new columns and keep track of the change; from it, we are able to see the current price and the previous price of the product Product1.
Product
Product ID(PK) | Current Year | Product Name | Current Product Price | Old Product Price | Old Year
1 | 2005 | Product1 | $250 | $150 | 2004
The problem with the Type 3 approach is that, if the product price keeps changing over the years, the complete history is not stored; only the latest change is kept. For example, in the year 2006, if Product1's price changes to $350, we can no longer see the 2004 price, because the old columns have been overwritten with the 2005 product information.
Product
Product ID(PK) | Year | Product Name | Product Price | Old Product Price | Old Year
1 | 2006 | Product1 | $350 | $250 | 2005
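A minimal sketch of applying a Type 3 change to the same hypothetical Product table: the current values are shifted into the "old" columns before being overwritten, so only one previous version survives.

-- Type 3: move the current price/year into the "old" columns, then overwrite
-- the current columns with the 2006 values.
UPDATE product
SET    old_product_price     = current_product_price,
       old_year              = current_year,
       current_product_price = 350,
       current_year          = 2006
WHERE  product_id = 1;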
What are ETL Tools?
ETL tools are meant to extract, transform and load data into a data warehouse for decision making. Before the evolution of ETL tools, the ETL process described above was done manually, using SQL code written by programmers. This task was tedious and cumbersome in many cases, since it involved many resources, complex coding and many work hours. On top of that, maintaining the code posed a great challenge for programmers.
These difficulties are eliminated by ETL tools, since they are very powerful and, compared to the old method, they offer many advantages in all stages of the ETL process, from extraction, data cleansing, data profiling, transformation and debugging through to loading into the data warehouse.
There are a number of ETL tools available in the market to carry out the ETL process according to business/technical requirements. The following are some of them.
Popular ETL Tools
Tool Name | Company Name
Informatica | Informatica Corporation
DT/Studio | Embarcadero Technologies
DataStage | IBM
Ab Initio | Ab Initio Software Corporation
Data Junction | Pervasive Software
Oracle Warehouse Builder | Oracle Corporation
Microsoft SQL Server Integration Services | Microsoft
ETL Concepts
Extraction, transformation, and loading. ETL refers to the methods involved in accessing and manipulating source data and loading it into the target database.
The first step in the ETL process is mapping the data between the source systems and the target database (data warehouse or data mart). The second step is cleansing the source data in the staging area. The third step is transforming the cleansed source data and then loading it into the target system; a minimal hand-coded sketch of these steps is shown below.
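Before dedicated ETL tools, these steps were typically hand-coded in SQL. A minimal sketch of the cleanse-transform-load flow, with hypothetical staging (stg_customer) and warehouse (dw_customer) tables, might look like this:

-- Cleansing in the staging area: standardize names and drop obviously bad rows.
UPDATE stg_customer
SET    customer_name = UPPER(TRIM(customer_name));

DELETE FROM stg_customer
WHERE  application_date_text IS NULL;

-- Transform and load: convert the source date format and insert into the target table.
INSERT INTO dw_customer (customer_id, customer_name, customer_application_date)
SELECT customer_id,
       customer_name,
       TO_DATE(application_date_text, 'DDMONYYYY')
FROM   stg_customer;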
Note
that ETT (extraction, transformation, transportation) and ETM
(extraction, transformation, move) are sometimes used instead of ETL.
Glossary of ETL (Reference:www.Oracle.com)
Source System
A database, application, file, or other storage facility from which the data in a data warehouse is derived.
Mapping
The definition of the relationship and data flow between source and target objects.
Metadata
Data that describes data and other structures, such as objects, business rules, and processes. For example, the schema design of a data warehouse is typically stored in a repository as metadata, which is used to generate scripts used to build and populate the data warehouse. A repository contains metadata.
Staging Area
A place where data is processed before entering the warehouse.
Cleansing
The process of resolving inconsistencies and fixing the anomalies in source data, typically as part of the ETL process.
Transformation
The process of manipulating data. Any manipulation beyond copying is a transformation. Examples include cleansing, aggregating, and integrating data from multiple sources.
Transportation
The process of moving copied or transformed data from a source to a data warehouse.
Target System
A database, application, file, or other storage facility to which the "transformed source data" is loaded in a data warehouse.
ETL Tools - Informatica
Informatica is a powerful ETL tool from Informatica Corporation, a leading provider of enterprise data integration software and ETL products.
The important Informatica Components are:
· Power Exchange
· Power Center
· Power Center Connect
· Power Channel
· Metadata Exchange
· Power Analyzer
· Super Glue
In Informatica, all the metadata information about source systems, target systems and transformations is stored in the Informatica repository. Informatica's Power Center Client and Repository Server access this repository to store and retrieve metadata.
Source and Target:
Consider a bank that has many branches throughout the world. In each branch, data may be stored in different source systems like Oracle, SQL Server, Teradata, etc. When the bank decides to integrate its data from several sources for its management decisions, it may choose one or more systems like Oracle, SQL Server or Teradata as its data warehouse target. Many organizations prefer Informatica for that ETL process, because Informatica is powerful in designing and building data warehouses. It can connect to several sources and targets to extract metadata from sources and targets, and to transform and load the data into target systems.
Guidelines to work with Informatica Power Center
· Repository: This
is where all the metadata information is stored in the Informatica
suite. The Power Center Client and the Repository Server would access
this repository to retrieve, store and manage metadata.
· Power Center Client: The Informatica client is used for managing users, identifying source and target system definitions, creating mappings and mapplets, creating sessions and running workflows, etc.
· Repository Server: This repository server takes care of all the connections between the repository and the Power Center Client.
· Power Center Server: The Power Center Server does the extraction from the sources and then loads the data into the targets.
· Designer: Source Analyzer, Mapping Designer and Warehouse Designer are tools that reside within the Designer wizard. Source Analyzer is used for extracting metadata from source systems.
Mapping Designer is used to create mapping between sources and targets. Mapping is a pictorial representation about the flow of data from source to target.
Warehouse Designer is used for extracting metadata from target systems or metadata can be created in the Designer itself.
· Data Cleansing: PowerCenter's data cleansing technology improves data quality by validating, correctly naming and standardizing address data. A person's address may not be the same in all source systems because of typos, and the postal code or city name may not match the address. These errors can be corrected using the data cleansing process, and the standardized data can then be loaded into the target systems (data warehouse).
· Transformation: Transformations
help to transform the source data according to the requirements of
target system. Sorting, Filtering, Aggregation, Joining are some of the
examples of transformation. Transformations ensure the quality of the
data being loaded into target and this is done during the mapping
process from source to target.
· Workflow Manager: A workflow helps to load the data from source to target in a sequential manner. For example, if the fact tables are loaded before the lookup tables, the target system will raise an error since the fact table violates foreign key validation. To avoid this, workflows can be created to ensure the correct flow of data from source to target.
· Workflow Monitor: This monitor is helpful in monitoring and tracking the workflows created in each Power Center Server.
· Power Center Connect: This
component helps to extract data and metadata from ERP systems like
IBM's MQSeries, Peoplesoft, SAP, Siebel etc. and other third party
applications.
ETL Tools - Informatica
Power Exchange:
Informatica Power Exchange, as a standalone service or along with Power Center, helps organizations leverage data by avoiding the manual coding of data extraction programs. Power Exchange supports batch, real-time and changed data capture options for mainframe (DB2, VSAM, IMS, etc.), midrange (AS/400 DB2, etc.) and relational databases (Oracle, SQL Server, DB2, etc.), and for flat files on UNIX, Linux and Windows systems.
Power Channel:
This helps to transfer large amounts of encrypted and compressed data over LAN and WAN, through firewalls, and to transfer files over FTP, etc.
Meta Data Exchange:
Metadata Exchange enables organizations to take advantage of the time and effort already invested in defining data structures within their IT environment when used with Power Center. For example, an organization may be using data modeling tools such as Erwin, Embarcadero, Oracle Designer or Sybase PowerDesigner for developing data models. Functional and technical teams typically spend much time and effort creating the data model's data structures (tables, columns, data types, procedures, functions, triggers, etc.). By using Metadata Exchange, these data structures can be imported into Power Center to identify source and target mappings, which saves that time and effort. There is no need for an Informatica developer to create these data structures once again.
Power Analyzer:
Power Analyzer provides organizations with reporting facilities. PowerAnalyzer makes accessing, analyzing, and sharing enterprise data simple and easily available to decision makers, enabling them to gain insight into business processes and develop business intelligence.
With PowerAnalyzer, an organization can extract, filter, format, and analyze corporate information from data stored in a data warehouse, data mart, operational data store, or other data storage model. PowerAnalyzer works best with a dimensional data warehouse in a relational database, but it can also run reports on data in any relational table that does not conform to the dimensional model.
Super Glue:
SuperGlue is used for loading metadata into a centralized place from several sources. Reports can be run against SuperGlue to analyze metadata.
Power Mart:
Power Mart is a departmental version of Informatica for building, deploying, and managing data warehouses and data marts. Power Center is used for corporate enterprise data warehouses, and Power Mart is used for departmental data warehouses such as data marts. Power Center supports global and networked repositories and can be connected to several sources; Power Mart supports a single repository and can be connected to fewer sources than Power Center. Power Mart can grow into an enterprise implementation, and its codeless environment aids developer productivity.
Note: This is not a complete tutorial on Informatica. We will add more tips and guidelines on Informatica in the near future; please check back soon. To know more about Informatica, visit its official website www.informatica.com.
Informatica - Transformations
In Informatica, transformations help to transform the source data according to the requirements of the target system, and they ensure the quality of the data being loaded into the target.
Transformations are of two types: Active and Passive.
Active Transformation
An active transformation can change the number of rows that pass through it from source to target i.e it eliminates rows that do not meet the condition in transformation.
Passive Transformation
A passive transformation does not change the number of rows that pass through it i.e it passes all rows through the transformation.
Transformations can be Connected or UnConnected.
Connected Transformation
Connected transformation is connected to other transformations or directly to target table in the mapping.
UnConnected Transformation
An unconnected transformation is not connected to other transformations in the mapping. It is called within another transformation, and returns a value to that transformation.
The following is the list of transformations available in Informatica:
· Aggregator Transformation
· Expression Transformation
· Filter Transformation
· Joiner Transformation
· Lookup Transformation
· Normalizer Transformation
· Rank Transformation
· Router Transformation
· Sequence Generator Transformation
· Stored Procedure Transformation
· Sorter Transformation
· Update Strategy Transformation
· XML Source Qualifier Transformation
· Advanced External Procedure Transformation
· External Procedure Transformation
In
the following pages, we will explain all the above Informatica
Transformations and their significances in the ETL process in detail.
Aggregator Transformation
Aggregator transformation is an Active and Connected transformation. This transformation is useful to perform calculations such as averages and sums (mainly to perform calculations on multiple rows or groups), for example to calculate the total of daily sales or the average of monthly or yearly sales. Aggregate functions such as AVG, FIRST, COUNT, PERCENTILE, MAX, SUM, etc. can be used in the Aggregator transformation. A rough SQL analogy is sketched below.
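As a rough analogy only (this is plain SQL, not Informatica syntax), an Aggregator transformation that groups sales by year and month corresponds to an aggregate query like the following, using the hypothetical sales_fact and time_dim tables from the star schema sketch:

-- Roughly what an Aggregator does: group incoming rows and compute aggregates per group.
SELECT t.year_no,
       t.month_name,
       SUM(f.sales_dollar) AS total_sales,
       AVG(f.sales_dollar) AS avg_sales
FROM   sales_fact f
JOIN   time_dim   t ON t.time_dim_id = f.time_dim_id
GROUP  BY t.year_no, t.month_name;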
Expression Transformation
Expression transformation is a Passive and Connected transformation. This can be used to calculate values in a single row before writing to the target. For example, to calculate discount of each product or to concatenate first and last names or to convert date to a string field.
Filter Transformation
Filter transformation is an Active and Connected transformation. It can be used to filter rows in a mapping that do not meet the condition, for example to find all the employees who are working in Department 10, or to find the products whose price falls between $500 and $1000.
Joiner Transformation
Joiner transformation is an Active and Connected transformation. It can be used to join two sources coming from two different locations or from the same location, for example to join a flat file and a relational source, to join two flat files, or to join a relational source and an XML source.
In order to join two sources, there must be at least one matching port. While joining two sources, it is mandatory to specify one source as master and the other as detail.
The Joiner transformation supports the following types of joins:
· Normal
· Master Outer
· Detail Outer
· Full Outer
Normal join discards all the rows of data from the master and detail source that do not match, based on the condition.
Master outer join discards all the unmatched rows from the master source and keeps all the rows from the detail source and the matching rows from the master source.
Detail outer join keeps all rows of data from the master source and the matching rows from the detail source. It discards the unmatched rows from the detail source.
Full outer join keeps all rows of data from both the master and detail sources.
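As a rough SQL analogy only (not Informatica syntax), with hypothetical detail_src and master_src tables joined on key_col, the four join types correspond approximately to the standard SQL joins below:

-- Normal join: keep only matching rows from both sources.
SELECT d.*, m.*
FROM   detail_src d
INNER  JOIN master_src m ON m.key_col = d.key_col;

-- Master outer join: keep all detail rows, only the matching master rows.
SELECT d.*, m.*
FROM   detail_src d
LEFT   JOIN master_src m ON m.key_col = d.key_col;

-- Detail outer join: keep all master rows, only the matching detail rows.
SELECT d.*, m.*
FROM   detail_src d
RIGHT  JOIN master_src m ON m.key_col = d.key_col;

-- Full outer join: keep all rows from both sources.
SELECT d.*, m.*
FROM   detail_src d
FULL   OUTER JOIN master_src m ON m.key_col = d.key_col;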
Lookup Transformation
Lookup transformation is Passive and it can be both Connected and UnConnected as well. It is used to look up data in a relational table, view, or synonym. Lookup definition can be imported either from source or from target tables.
For example, suppose we want to retrieve all the sales of a product with ID 10, and the sales data resides in another table. Instead of adding the sales table as one more source, we can use a Lookup transformation to look up the data for the product with ID 10 in the sales table.
Difference between Connected and UnConnected Lookup Transformation:
Connected lookup receives input values directly from the mapping pipeline, whereas UnConnected lookup receives values from a :LKP expression in another transformation.
Connected lookup returns multiple columns from the same row whereas UnConnected lookup has one return port and returns one column from each row.
Connected lookup supports user-defined default values whereas UnConnected lookup does not support user defined values.
Normalizer Transformation
Normalizer Transformation is an Active and Connected transformation. It is used mainly with COBOL sources where most of the time data is stored in de-normalized format. Also, Normalizer transformation can be used to create multiple rows from a single row of data.
Rank Transformation
Rank transformation is an Active and Connected transformation. It is used to select the top or bottom rank of data. For example, to select top 10 Regions where the sales volume was very high or to select 10 lowest priced products.
Router Transformation
Router is an Active and Connected transformation. It is similar to the Filter transformation; the only difference is that the Filter transformation drops the data that does not meet the condition, whereas the Router has an option to capture the data that does not meet the condition. It is useful for testing multiple conditions, and it has input, output and default groups. For example, if we want to split data by State=Michigan, State=California, State=New York and all other states, the Router makes it easy to route the data to different tables.
Sequence Generator Transformation
Sequence Generator transformation is a Passive and Connected transformation. It is used to create unique primary key values or cycle through a sequential range of numbers or to replace missing keys.
It has two output ports to connect to transformations. By default it has two fields, CURRVAL and NEXTVAL (you cannot add ports to this transformation). The NEXTVAL port generates a sequence of numbers when connected to a transformation or target, and CURRVAL is NEXTVAL plus the Increment By value.
Stored Procedure Transformation
Stored Procedure transformation is a Passive and Connected & UnConnected transformation. It is useful to automate time-consuming tasks and it is also used in error handling, to drop and recreate indexes and to determine the space in database, a specialized calculation etc.
The
stored procedure must exist in the database before creating a Stored
Procedure transformation, and the stored procedure can exist in a
source, target, or any database with a valid connection to the
Informatica Server. Stored Procedure is an executable script with SQL
statements and control statements, user-defined variables and
conditional statements.
Sorter Transformation
Sorter transformation is a Connected and Active transformation. It allows sorting data in either ascending or descending order according to a specified field. It can also be configured for case-sensitive sorting and to specify whether the output rows should be distinct.
Source Qualifier Transformation
Source Qualifier transformation is an Active and Connected transformation. When adding a relational or flat file source definition to a mapping, it is necessary to connect it to a Source Qualifier transformation. The Source Qualifier performs various tasks such as overriding the default SQL query, filtering records, and joining data from two or more tables.
Update Strategy Transformation
Update Strategy transformation is an Active and Connected transformation. It is used to update data in the target table, either to maintain a history of data or the most recent changes. You can specify how to treat source rows in the table: insert, update, delete or data driven.
XML Source Qualifier Transformation
XML Source Qualifier is a Passive and Connected transformation. XML Source Qualifier is used only with an XML source definition. It represents the data elements that the Informatica Server reads when it executes a session with XML sources.
Advanced External Procedure Transformation
Advanced External Procedure transformation is an Active and Connected transformation. It operates in conjunction with procedures, which are created outside of the Designer interface to extend PowerCenter/PowerMart functionality. It is useful in creating external transformation applications, such as sorting and aggregation, which require all input rows to be processed before emitting any output rows.
External Procedure Transformation
External Procedure transformation is a Passive transformation and can be Connected or Unconnected. Sometimes the standard transformations, such as the Expression transformation, may not provide the functionality you want. In such cases an External Procedure is useful for developing complex functions within a dynamic link library (DLL) or UNIX shared library, instead of creating the necessary Expression transformations in a mapping.
Differences between Advanced External Procedure and External Procedure Transformations:
External Procedure returns a single value, whereas Advanced External Procedure returns multiple values.
External Procedure supports COM and Informatica procedures, whereas AEP supports only Informatica procedures.
Note: This is not a complete tutorial on Informatica. To know more about Informatica, visit its official website, www.informatica.com.
Database - RDBMS
There are a number of relational databases available to store data. A relational database contains normalized data stored in tables. Tables contain records (rows) and columns. An RDBMS makes it easy to work with individual records; each row contains a unique instance of data for the categories defined by the columns.
RDBMSs are used very frequently in OLTP applications (e.g., ATM transactions), and sometimes a data warehouse may also use a relational database. Please refer to the Relational Data Modeling section for details on how data from a source system is normalized and stored in RDBMS databases.
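For concreteness, here is a tiny, self-contained illustration of rows and columns in a relational table, using SQLite from Python; the table and data are made up. Fetching an individual record is a simple keyed lookup.

# One table, a few rows: each row is one customer record.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL,
        branch_code   TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Alice", "NY01"), (2, "Bob", "TX02")],
)
# Working with an individual record is a keyed lookup.
print(conn.execute(
    "SELECT customer_name FROM customer WHERE customer_id = ?", (2,)
).fetchone())   # ('Bob',)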
Popular RDBMS Databases
RDBMS Name | Company Name
Oracle | Oracle Corporation
IBM DB2 UDB | IBM Corporation
IBM Informix | IBM Corporation
Microsoft SQL Server | Microsoft
Sybase | Sybase Corporation
Teradata | NCR
Business Intelligence:
Business Intelligence is a technology based on customer- and profit-oriented models that reduces operating costs and increases profitability by improving productivity, sales, and service, and that supports timely decision making. Business Intelligence models are based on multidimensional analysis and the key performance indicators (KPIs) of an enterprise.
Business Intelligence applications based on these models are created with Business Intelligence software, which provides aggregated details about suppliers, customers, internal activities, and business-to-business transactions to managers, or to whoever needs them, in order to make better corporate decisions.
Many business questions or situations need to be analyzed in order to achieve the targets of an enterprise, with the help of managers or executives at each level. Below are some sample questions.
Business Intelligence: Finance:
What are the net income, expenses, gross profit, and net profit for this quarter and for the year?
Business Intelligence: Accounts:
What is the sales amount this month, and what payments are still outstanding?
Business Intelligence: Purchase:
Which vendor should be contacted to purchase products?
Business Intelligence: Production:
How many products are manufactured in each production unit daily, weekly, and monthly?
Business Intelligence: Sales:
How many products have been sold in each area daily, weekly, and monthly?
Business Intelligence: Quality:
How many products have been defective daily, weekly, monthly, quarterly, and yearly?
Business Intelligence: Service:
Are the customers satisfied with the quality?
These business intelligence questions are concerned with why, what, how, and when, and business intelligence reports (OLAP reports) help answer them by means of reporting, scorecards, and balanced scorecards that support managerial decisions.
Business Intelligence Tools
Business Intelligence tools help to gather, store, access, and analyze corporate data to aid in decision-making. Generally, these systems provide business intelligence in the areas of customer profiling, customer support, market research, market segmentation, product profitability, statistical analysis, and inventory and distribution analysis.
With Business Intelligence tools, data related to customers, products, sales, time, location, employees, and so on is gathered and analyzed; based on this analysis, important strategies and rules are formed and goals are set. These decisions are very effective in promoting an organisation's growth.
Since the collected data can be sliced across almost all dimensions, such as time, location, product, and promotion, valuable statistics such as the sales profit in one region for the current year can be calculated and compared with the previous year's statistics.
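As a toy illustration of that year-over-year comparison (all figures below are invented), the following Python snippet aggregates sales by region and year and compares the current year with the previous one.

# Aggregate sales by (region, year), then compare 2005 against 2004.
from collections import defaultdict

sales = [  # (region, year, amount)
    ("East", 2004, 1200), ("East", 2005, 1500),
    ("West", 2004, 900),  ("West", 2005, 800),
]

totals = defaultdict(float)
for region, year, amount in sales:
    totals[(region, year)] += amount

for region in ("East", "West"):
    current, previous = totals[(region, 2005)], totals[(region, 2004)]
    change = (current - previous) / previous * 100
    print(f"{region}: {previous} -> {current} ({change:+.1f}%)")
# East: 1200.0 -> 1500.0 (+25.0%)
# West: 900.0 -> 800.0 (-11.1%)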
Popular Business Intelligence Tools
Tool Name | Company Name
Business Objects | Business Objects
Cognos | Cognos
Hyperion | Hyperion
Microstrategy | Microstrategy
Microsoft Reporting Services | Microsoft
Crystal | Business Objects
OLAP & its Hybrids
OLAP, an acronym for Online Analytical Processing, is an approach that helps an organization take advantage of its data. Popular OLAP tools are Cognos, Business Objects, MicroStrategy, etc. OLAP cubes provide insight into data and help the topmost executives of an organization make decisions efficiently.
Technically, an OLAP cube allows one to analyze data across multiple dimensions by providing a multidimensional view of aggregated, grouped data. With OLAP reports, the major categories related to a product, such as fiscal periods, sales regions, products, employees, and promotions, can be analyzed very efficiently, effectively, and responsively. OLAP applications include sales and customer analysis, budgeting, marketing analysis, production analysis, profitability analysis, forecasting, etc.
ROLAP
ROLAP stands for Relational Online Analytical Processing; it provides multidimensional analysis of data stored in a relational database (RDBMS).
MOLAP
MOLAP (Multidimensional OLAP) provides analysis of data stored in a multidimensional data cube.
HOLAP
HOLAP (Hybrid OLAP), a combination of ROLAP and MOLAP, can provide multidimensional analysis of data stored simultaneously in a multidimensional database and in a relational database (RDBMS).
DOLAP
DOLAP (Desktop OLAP or Database OLAP) provides multidimensional analysis locally on the client machine, using data collected from relational or multidimensional database servers.
OLAP Analysis
Imagine an organization that manufactures and sells goods in several states of the USA and employs hundreds of people in its manufacturing, sales, and marketing divisions. In order to manufacture and sell this product profitably, executives need to analyze (OLAP analysis) the data on the product and think about the various possibilities and causes of a particular event, such as a loss in sales, lower productivity, or an increase in sales over a particular period of the year.
During the OLAP analysis, the top executives may seek answers for the following:
1. Number of products manufactured.
2. Number of products manufactured in a location.
3. Number of products manufactured on time basis within a location.
4. Number of products manufactured in the current year when compared to the previous year.
5. Sales Dollar value for a particular product.
6. Sales Dollar value for a product in a location.
7. Sales Dollar value for a product in a year within a location.
8. Sales Dollar value for a product in a year within a location sold or serviced by an employee.
OLAP tools help executives find answers not only to the measures listed above but also to far more complex queries, by allowing them to slice and dice, drill down from higher-level to lower-level summarized data, rank, sort, and so on.
Example of OLAP Analysis Report
Time Dimension Id | Location Dimension Id | Product Dimension Id | Organization Dimension Id | Sales Dollar | DateTimeStamp
1 | 1 | 100001 | 1 | 1000 | 1/1/2005 11:23:31 AM
3 | 1 | 100001 | 1 | 750 | 1/1/2005 11:23:31 AM
1 | 1 | 100001 | 2 | 1000 | 1/1/2005 11:23:31 AM
3 | 1 | 100001 | 2 | 750 | 1/1/2005 11:23:31 AM
In the above example of OLAP analysis, the data can be sliced and diced, drilled up and drilled down along hierarchies such as the time, location, product, and organization dimensions. This enables the topmost executives to make decisions about product performance in a given location, time period, or organization. In OLAP reports, trend analysis can also be performed by comparing the sales value of a particular product over several years or quarters.
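To make the slicing concrete, here is a small Python sketch that uses the fact rows from the example report above (with a simplified column layout of our own) and aggregates Sales Dollar along whichever dimension an analyst picks.

# "Slice and dice": the same fact rows, summed by the chosen dimension column.
from collections import defaultdict

# (time_id, location_id, product_id, org_id, sales_dollar)
facts = [
    (1, 1, 100001, 1, 1000),
    (3, 1, 100001, 1, 750),
    (1, 1, 100001, 2, 1000),
    (3, 1, 100001, 2, 750),
]

def aggregate(facts, dim_index):
    """Sum Sales Dollar grouped by the chosen dimension column."""
    totals = defaultdict(int)
    for row in facts:
        totals[row[dim_index]] += row[4]
    return dict(totals)

print(aggregate(facts, dim_index=0))  # by time:         {1: 2000, 3: 1500}
print(aggregate(facts, dim_index=3))  # by organization: {1: 1750, 2: 1750}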
OLAP Database – Multidimensional
This is a type of database that is optimized for data warehouse, data mart and online analytical processing (OLAP) applications. The main advantage of this database is query performance.
Relational databases make it easy to work with individual records, whereas multidimensional databases are designed for analyzing large groups of records. A relational database is typically accessed using Structured Query Language (SQL) queries. A multidimensional database allows a user to ask questions like "How many mortgages have been sold in New Jersey?" and "How many credit cards have been purchased in a particular county?".
Popular Multidimensional Databases
Database Name | Company Name
Crystal Holos | Business Objects
Hyperion Essbase | Hyperion
Oracle Express | Oracle Corporation
Oracle OLAP Option AW | Oracle Corporation
Microsoft Analysis Services | Microsoft
PowerPlay Enterprise | Cognos