Data Mining & Data Warehouse(PU)
Purbanchal University
3.SOFTWARE & HARDWARE DESIGN
- Introduction
- Multidimensional structure
- Fact table
- Dimension table
- Difference between fact table and dimension table
- Data Warehouse Schema
- Star schema
- Snowflake schema
- Fact Constellation
- Concept Hierarchy
- Starnet Query model
- Overview of hardware and I/O considerations
- Index
- Materialized view
Introduction:
- Multidimensional Structure
Data warehouses and OLAP tools are based on a multidimensional data model. This model views data in the form of a data cube.
What is a data cube?
A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
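The idea of viewing one measure along several dimensions can be sketched in a few lines of Python. This is only an illustration: the sales rows, the dimension names (quarter, item, city) and the helper `cuboid` are made up for the example and are not part of any OLAP tool.

```python
from collections import defaultdict

# Hypothetical facts: (quarter, item, city, units_sold).
SALES = [
    ("Q1", "phone", "Vancouver", 10),
    ("Q1", "phone", "Chicago", 5),
    ("Q1", "laptop", "Vancouver", 3),
    ("Q2", "phone", "Vancouver", 7),
    ("Q2", "laptop", "Chicago", 4),
]
DIMENSIONS = ("quarter", "item", "city")


def cuboid(facts, keep):
    """Sum units_sold grouped by the dimensions named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[p] for p in positions)
        totals[key] += row[-1]  # the last column is the measure
    return dict(totals)


by_quarter = cuboid(SALES, ["quarter"])          # one-dimensional view
by_item_city = cuboid(SALES, ["item", "city"])   # two-dimensional view
grand_total = cuboid(SALES, [])                  # all dimensions aggregated away
```

Each choice of dimensions gives a different view of the same facts, which is what the cube metaphor expresses.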
Fact Table
- The large central table, containing the bulk of the data with no redundancy.
- The fact table in a schema is usually in third normal form (3NF).
- A fact table can contain facts at the detail level or at an aggregate level.
- The fact table contains business facts (or measures) and foreign keys that refer to candidate keys (normally primary keys) in the dimension tables.
- Fig: fact table.
Dimension table:
Contrary to fact tables, dimension tables contain descriptive attributes (or fields) that are typically textual fields (or discrete numbers that behave like text).
- A dimension table contains the dimensions of a fact.
- It is joined to the fact table via a foreign key.
- Dimension tables are de-normalized tables.
- The dimension attributes are the columns of a dimension table.
- Dimensions offer descriptive characteristics of the facts with the help of their attributes.
- There is no set limit on the number of dimensions.
- A dimension can also contain one or more hierarchical relationships.
- Dimension tables are generally smaller than the fact table.
- Dimensions categorize and describe data warehouse facts and measures in ways that support meaningful answers to business questions.
For example, a dimension table for item may contain:
item_name
brand
type
- A dimension table can be specified by users or experts.
- It is composed of one or more hierarchies that categorize data. If a dimension has no hierarchies or levels, it is called a flat dimension or list.
Difference between fact table and dimension table
Parameters | Fact Table | Dimension Table
Definition | Measurements or facts about a business process. | Companion table to the fact table; contains descriptive attributes used as query constraints.
Characteristic | Located at the center of a star or snowflake schema and surrounded by dimensions. | Connected to the fact table and located at the edges of the star or snowflake schema.
Design | Defined by its grain, that is, its most atomic level. | Should be wordy, descriptive, complete, and quality assured.
Type of Data | Contains measures such as sales, recorded against a set of dimensions such as Product and Date. | Contains attributes that describe the details of the dimension. E.g., a Product dimension can contain Product ID, Product Category, etc.
Key | Holds foreign keys that reference the dimension tables; its primary key is usually a composite of these foreign keys. | Has a primary key that the fact table references through a foreign key.
Hierarchy | Does not contain hierarchies. | Contains hierarchies. For example, Location could contain country, state, city, pin code, etc.
- Data warehouse Schema
A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also needs a schema. A database uses the relational model, while a data warehouse uses
- Star Schema,
- Snowflake, and
- Fact Constellation schema.
- Star Schema
The star schema architecture is the simplest data warehouse schema. It is called a star schema because its diagram resembles a star, with points radiating from the center.
- Each dimension in a star schema is represented by only one dimension table.
- This dimension table contains the set of attributes.
- There is a fact table at the center. It contains the keys to each of the four dimensions.
- The fact table also contains the measures, namely dollars sold and units sold.
- The large central table (called the fact table) contains the bulk of the data with no redundancy.
- The fact table in a star schema is usually in third normal form (3NF).
- A fact table typically has two types of columns: foreign keys to the dimension tables, and measures that contain numeric facts.
- A fact table can contain facts at the detail level or at an aggregate level.
- The following diagram shows the sales data of a company with respect to four dimensions, namely time, item, branch, and location.
fig: star schema
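The star layout described above can be tried out with Python's built-in `sqlite3` module. The sketch below is only an illustration: the table and column names (`sales_fact`, `time_dim`, `item_dim`, `branch_dim`, `location_dim` and their columns) and the sample rows are assumptions made for the example, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One de-normalized table per dimension (names are illustrative).
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, item_type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Central fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);

INSERT INTO time_dim VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO item_dim VALUES (1, 'Phone X', 'Acme', 'phone'), (2, 'Laptop Y', 'Acme', 'laptop');
INSERT INTO branch_dim VALUES (1, 'Main');
INSERT INTO location_dim VALUES (1, 'Vancouver', 'Canada'), (2, 'Chicago', 'USA');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 1, 500.0, 10),
    (1, 2, 1, 2, 900.0, 3),
    (2, 1, 1, 2, 350.0, 7);
""")

# A typical star query: join the fact table to one dimension and aggregate.
rows = conn.execute("""
    SELECT l.country, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON l.location_key = f.location_key
    GROUP BY l.country
    ORDER BY l.country
""").fetchall()
```

Every dimension is reached from the fact table with a single join, which is the main practical attraction of the star layout.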
- Snow flake schema
A snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake shape. The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions. "Snowflaking" is a method of normalizing the dimension tables in a star schema. When the schema is completely normalized along all the dimension tables, the resulting structure resembles a snowflake with the fact table in the middle. The principle behind snowflaking is normalization of the dimension tables by removing low-cardinality attributes and forming separate tables.
The snowflake schema is similar to the star schema. However, in the snowflake schema, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are de-normalized, with each dimension represented by a single table. A complex snowflake shape emerges when the dimensions of a snowflake schema are elaborate, having multiple levels of relationships, and the child tables have multiple parent tables ("forks in the road").
The snowflake schema is a variant of the star schema model in which some of the dimension tables are normalized, thereby further splitting the data into additional tables.
Diagram of snowflake schema
Disadvantages:
- The primary disadvantage of the snowflake schema is that the additional levels of attribute normalization add complexity to source query joins when compared to the star schema.
- Snowflake schemas, in contrast to flat single-table dimensions, have been heavily criticized. Their goal is assumed to be efficient and compact storage of normalized data, but this comes at the significant cost of poor performance when browsing the joins required in the dimension. This disadvantage may have diminished in the years since it was first recognized, owing to better query performance within browsing tools.
- When compared to a highly normalized transactional schema, the snowflake schema's denormalization removes the data integrity assurances provided by normalized schemas. Data loads into the snowflake schema must be highly controlled and managed to avoid update and insert anomalies.
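The extra join cost mentioned in the first disadvantage can be seen in a small `sqlite3` sketch. It assumes a hypothetical item dimension whose supplier attribute has been moved into its own `supplier_dim` table; all table names, column names and rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The supplier attribute is normalized out of the item dimension.
CREATE TABLE supplier_dim (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item_dim (
    item_key     INTEGER PRIMARY KEY,
    item_name    TEXT,
    supplier_key INTEGER REFERENCES supplier_dim(supplier_key)
);
CREATE TABLE sales_fact (
    item_key   INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER
);

INSERT INTO supplier_dim VALUES (1, 'wholesale'), (2, 'retail');
INSERT INTO item_dim VALUES (1, 'Phone X', 1), (2, 'Laptop Y', 1), (3, 'Cable Z', 2);
INSERT INTO sales_fact VALUES (1, 10), (2, 3), (3, 20), (1, 7);
""")

# Reaching supplier_type now takes two joins instead of one.
units_by_supplier_type = conn.execute("""
    SELECT s.supplier_type, SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN item_dim     AS i ON i.item_key = f.item_key
    JOIN supplier_dim AS s ON s.supplier_key = i.supplier_key
    GROUP BY s.supplier_type
    ORDER BY s.supplier_type
""").fetchall()
```

In a star schema the supplier type would sit directly in the item dimension table, so the same question would need only the first join.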
- Fact Constellation
- The fact constellation architecture contains multiple fact tables that share many dimension tables.
- This schema is also called a galaxy schema (a collection of stars).
- This schema is more complex than the star or snowflake schema architecture because it contains multiple fact tables.
- This schema is flexible; however, it may be hard to manage and support.
Advantage of Fact Constellation Schema Data Warehouses
- Provides a flexible schema
- Different fact tables are explicitly assigned to the dimensions
Disadvantage of Fact Constellation Schema Data Warehouses
- A fact constellation solution is difficult to maintain.
- The schema is complex because of the number of aggregations involved.
Diagram of fact constellation (galaxy)
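A minimal sketch of a fact constellation, again using `sqlite3`: two fact tables share one dimension table. The names `sales_fact`, `shipping_fact` and `item_dim`, and the sample rows, are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One shared dimension, two fact tables (a very small "galaxy").
CREATE TABLE item_dim      (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE sales_fact    (item_key INTEGER REFERENCES item_dim(item_key), units_sold INTEGER);
CREATE TABLE shipping_fact (item_key INTEGER REFERENCES item_dim(item_key), units_shipped INTEGER);

INSERT INTO item_dim VALUES (1, 'Phone X'), (2, 'Laptop Y');
INSERT INTO sales_fact VALUES (1, 10), (1, 7), (2, 3);
INSERT INTO shipping_fact VALUES (1, 12), (2, 3), (2, 1);
""")

# Each fact table is aggregated on its own, then lined up on the shared dimension.
report = conn.execute("""
    SELECT i.item_name,
           (SELECT SUM(s.units_sold)    FROM sales_fact    AS s WHERE s.item_key = i.item_key),
           (SELECT SUM(h.units_shipped) FROM shipping_fact AS h WHERE h.item_key = i.item_key)
    FROM item_dim AS i
    ORDER BY i.item_key
""").fetchall()
```

Aggregating each fact table separately avoids the double counting that would result from joining the two fact tables to each other row by row.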
- Concept Hierarchy
Concept hierarchies organize data or concepts in hierarchical form or in a certain partial order.
They are used for expressing knowledge in concise, high-level form and for facilitating the mining of knowledge at multiple levels of abstraction.
A concept hierarchy facilitates drilling down and rolling up in a data warehouse to view data at multiple granularities.
It defines a sequence of mappings from a set of low-level concepts to higher-level concepts.
Consider a concept hierarchy for the dimension location. City values for location include Vancouver, Toronto, New York, and Chicago. Each city is mapped to the province or state to which it belongs. Each province or state is in turn mapped to the country to which it belongs, such as Canada or the USA.
The concept hierarchy can be illustrated by the following figure.
Fig: A concept hierarchy for the dimension location.
These mappings form a concept hierarchy for the dimension location, mapping a set of low-level concepts (e.g. cities) to higher-level concepts (e.g. countries). The attributes can also be related by a total order, forming a concept hierarchy such as street < city < province or state < country.
Country > Province or state > City > Street
Fig: hierarchy for location.
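The sequence of mappings in the location hierarchy can be written down as plain dictionaries. The sketch below rolls hypothetical unit counts up from city to province or state and then to country; the counts and the helper `roll_up` are made up for the illustration.

```python
# Level mappings for the location dimension (city -> province/state -> country).
CITY_TO_PROVINCE_OR_STATE = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "New York": "New York",
    "Chicago": "Illinois",
}
PROVINCE_OR_STATE_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "New York": "USA",
    "Illinois": "USA",
}


def roll_up(totals, mapping):
    """Re-aggregate totals from one hierarchy level to its parent level."""
    parent_totals = {}
    for child, value in totals.items():
        parent = mapping[child]
        parent_totals[parent] = parent_totals.get(parent, 0) + value
    return parent_totals


# Hypothetical units sold per city.
units_by_city = {"Vancouver": 10, "Toronto": 4, "New York": 6, "Chicago": 5}
units_by_state = roll_up(units_by_city, CITY_TO_PROVINCE_OR_STATE)
units_by_country = roll_up(units_by_state, PROVINCE_OR_STATE_TO_COUNTRY)
```

Drilling down is the reverse direction and needs the detailed data to be kept, since a country total cannot be split back into cities on its own.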
Lattice
The attributes of a dimension may also be organized in a partial order, forming a lattice. For example, a partial order for the time dimension based on the attributes day, week, month, quarter, and year is day < {month < quarter; week} < year.
- Starnet Query Model
The starnet query model is a model for querying multidimensional databases. It consists of radial lines emanating from a central point. Each line represents a concept hierarchy for a dimension. Each abstraction level in the hierarchy is called a footprint. These footprints represent the granularities available for use by OLAP operations such as drill-down and roll-up.
Fig: Modeling business queries: a starnet model.
The figure has four radial lines. They represent the concept hierarchies of the dimensions location, customer, item, and time. Each line consists of footprints representing the abstraction levels of the dimension.
E.g. the footprints of the time dimension are "day", "month", "quarter", and "year". A concept hierarchy may include a single attribute or several attributes.
In order to examine item sales, the user can roll up along the "time" dimension from month to quarter, or
drill down along the "location" dimension from country to city.
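The footprints on a radial line can be modeled as an ordered list, with roll-up and drill-down as moves along that list. This is a rough sketch: only the time and location lines are included, the level names for location follow the hierarchy given earlier, and the functions `roll_up` and `footprint_path` are hypothetical helpers.

```python
# Footprints per radial line, ordered from most detailed to most general.
STARNET = {
    "time": ["day", "month", "quarter", "year"],
    "location": ["street", "city", "province_or_state", "country"],
}


def roll_up(dimension, level):
    """Return the next more general footprint on the dimension's line."""
    levels = STARNET[dimension]
    position = levels.index(level)
    if position == len(levels) - 1:
        raise ValueError(f"{level!r} is already the most general level")
    return levels[position + 1]


def footprint_path(dimension, start, target):
    """List the footprints visited when moving from `start` to `target`."""
    levels = STARNET[dimension]
    i, j = levels.index(start), levels.index(target)
    step = 1 if j >= i else -1
    return [levels[k] for k in range(i, j + step, step)]


next_time_level = roll_up("time", "month")
drill_path = footprint_path("location", "country", "city")
```

Rolling up from month lands on quarter, and drilling down from country to city passes through the province or state footprint on the way.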
- Overview of hardware and I/O considerations
I/O performance should always be a key consideration for data warehouse designers and administrators.
The typical workload in a data warehouse is especially I/O intensive, with operations such as large data loads and index builds, creation of materialized views, and queries over large volumes of data. The underlying I/O system for the data warehouse should be designed to meet these heavy requirements.
High-level guidelines for data warehouse I/O configuration
The I/O configuration used by a data warehouse will depend on the characteristics of the specific storage and server capabilities, so the material in this chapter is only intended to provide guidelines for designing and tuning an I/O system.
Configure I/O for bandwidth, not capacity
- Storage configurations for the data warehouse should be chosen based on the I/O bandwidth they can provide, and not necessarily on their overall storage capacity.
- Buying storage based solely on capacity has the potential for making a mistake, especially for systems less than 500 GB in total size.
- The capacity of individual disk drives is growing faster than the I/O throughput rates provided by those disks, leading to a situation in which a small number of disks can store a large volume of data. For example, consider a 200 GB data mart using 72 GB drives. This data mart could be built with as few as six drives in a fully mirrored environment. However, six drives might not provide enough I/O bandwidth to handle even a minimal number of concurrent users on a 4-CPU server. Thus, even though six drives provide sufficient storage, a larger number of drives may be required to provide acceptable performance for this system.
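The drive count in the example can be checked with a little arithmetic. The capacity figures (200 GB, 72 GB drives, full mirroring, 4 CPUs) come from the text; the throughput figures of 200 MB/s needed per CPU and 50 MB/s delivered per drive are assumptions chosen only to illustrate the comparison, and real values depend on the hardware.

```python
import math


def drives_for_capacity(data_gb, drive_gb, mirrored=True):
    """Drives needed to hold the data, doubled when fully mirrored."""
    drives = math.ceil(data_gb / drive_gb)
    return 2 * drives if mirrored else drives


def drives_for_bandwidth(required_mb_s, per_drive_mb_s):
    """Drives needed to deliver the required aggregate throughput."""
    return math.ceil(required_mb_s / per_drive_mb_s)


# From the text: 200 GB data mart, 72 GB drives, fully mirrored.
capacity_drives = drives_for_capacity(200, 72)

# Assumed, illustrative throughput figures (not from the text).
ASSUMED_MB_S_PER_CPU = 200
ASSUMED_MB_S_PER_DRIVE = 50
required_mb_s = 4 * ASSUMED_MB_S_PER_CPU
bandwidth_drives = drives_for_bandwidth(required_mb_s, ASSUMED_MB_S_PER_DRIVE)

# The configuration has to satisfy both constraints.
drives_needed = max(capacity_drives, bandwidth_drives)
```

Three drives hold 200 GB and mirroring doubles that to six, matching the text. Under the assumed throughput figures, bandwidth rather than capacity decides the final drive count.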
Use redundancy
Because data warehouses are often the largest database systems in a company, they have the most disks and thus are also the most susceptible to the failure of a single disk. Therefore, disk redundancy is a requirement for data warehouses to protect against hardware failure. Like disk striping, redundancy can be achieved in many ways using software or hardware.
Test the I/O system before building the database
The most important time to examine and tune the I/O system is before the database is even created. Once the database files are created, it is more difficult to reconfigure them. Some logical volume managers may support dynamic reconfiguration of files, while other storage configurations may require files to be rebuilt in order to reconfigure their I/O layout. In both cases, considerable system resources must be devoted to this reconfiguration.
Plan for growth
The data warehouse designer should plan for future growth of the data warehouse. There are many approaches to handling growth in the system, and the key consideration is to be able to grow the I/O system without compromising on I/O bandwidth.
- Index
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure.
Indexes are used to quickly locate data without having to search every row in a database table every time the table is accessed.
Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.
An index is a copy of selected columns of data from a table that can be searched very efficiently. It also includes a low-level disk block address or a direct link to the complete row of data it was copied from.
Types of Index:
Bitmap index
A bitmap index is a special kind of indexing that stores the bulk of its data as bit arrays (bitmaps) and answers most queries by performing bitwise logical operations on these bitmaps. The most commonly used indexes, such as B+ trees, are most efficient if the values they index do not repeat or repeat a small number of times. In contrast, the bitmap index is designed for cases where the values of a variable repeat very frequently. For example, the sex field in a customer database usually contains at most three distinct values: male, female or unknown (not recorded). For such variables, the bitmap index can have a significant performance advantage over the commonly used trees.
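A bitmap index can be imitated in Python by using one integer per distinct value, where bit i is set when row i holds that value. This is a toy sketch, not how any particular database stores bitmaps; the columns `sex` and `region`, their rows, and the helpers `build_bitmap_index` and `rows_of` are invented for the example.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i has it."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap):
    """Return the row numbers whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two low-cardinality columns of a hypothetical customer table.
sex = ["male", "female", "female", "unknown", "male", "female"]
region = ["east", "east", "west", "west", "west", "east"]

sex_index = build_bitmap_index(sex)
region_index = build_bitmap_index(region)

# Queries become bitwise operations on the bitmaps.
female_and_east = rows_of(sex_index["female"] & region_index["east"])
male_or_unknown = rows_of(sex_index["male"] | sex_index["unknown"])
```

With only three distinct values, the `sex` column needs just three bitmaps no matter how many rows the table has.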
Dense index
A dense index in databases is a file with pairs of keys and pointers for every record in the data file. Every key in this file is associated with a particular pointer to a record in the sorted data file. In clustered indices with duplicate keys, the dense index points to the first record with that key.
Sparse index
A sparse index in databases is a file with pairs of keys and pointers for every block in the data file. Every key in this file is associated with a particular pointer to the block in the sorted data file. In clustered indices with duplicate keys, the sparse index points to the lowest search key in each block.
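The difference between the two can be sketched over a small sorted "data file" held in memory. The block contents and the functions `lookup_dense` and `lookup_sparse` are hypothetical; a real index would store disk addresses rather than list positions.

```python
from bisect import bisect_right

# Sorted data file split into blocks of three (key, payload) records.
BLOCKS = [
    [(10, "a"), (20, "b"), (30, "c")],
    [(40, "d"), (50, "e"), (60, "f")],
    [(70, "g"), (80, "h"), (90, "i")],
]

# Dense index: one entry per record, pointing at (block number, slot).
dense = {
    key: (b, s)
    for b, block in enumerate(BLOCKS)
    for s, (key, _) in enumerate(block)
}

# Sparse index: one entry per block, holding the lowest key in that block.
sparse_keys = [block[0][0] for block in BLOCKS]


def lookup_dense(key):
    """Follow the pointer straight to the record."""
    position = dense.get(key)
    if position is None:
        return None
    b, s = position
    return BLOCKS[b][s][1]


def lookup_sparse(key):
    """Find the block that could hold the key, then scan inside it."""
    b = bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    for k, payload in BLOCKS[b]:
        if k == key:
            return payload
    return None
```

The dense index here has nine entries and the sparse index three, so the sparse index is smaller but pays for it with a short scan inside the block.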
Reverse index
A reverse key index reverses the key value before entering it in the index. E.g., the value 24538 becomes 83542 in the index. Reversing the key value is particularly useful for indexing data such as sequence numbers, where new key values monotonically increase.
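The digit reversal from the example can be sketched as below. It treats the key as a decimal string, as the example does; how a particular database reverses keys internally may differ, and `reverse_key` is a hypothetical helper.

```python
def reverse_key(key: int) -> str:
    """Reverse the decimal digits of a key, keeping any leading zeros."""
    return str(key)[::-1]


# Consecutive sequence numbers share a long common prefix...
sequence = [24538, 24539, 24540]

# ...but their reversed forms start with different digits.
reversed_keys = [reverse_key(k) for k in sequence]
```

Because the reversed keys no longer share a prefix, consecutive inserts are spread across the index instead of all landing at its right-hand edge.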
Student Note:
In most cases, an index is used to quickly locate the data record(s) from which the required data is read. In other words, the index is only used to locate data records in the table and not to return data.
Index architecture / indexing methods
Non-clustered:
The physical order of the rows is not the same as the index order.
The indexed columns are typically non-primary-key columns used in JOIN, WHERE, and ORDER BY clauses.
There can be more than one non-clustered index on a database table.
Clustered:
Clustering arranges the data blocks in a certain distinct order to match the index, resulting in the row data being stored in order. Therefore, only one clustered index can be created on a given database table. Clustered indices can greatly increase overall speed of retrieval, but usually only where the data is accessed sequentially in the same or reverse order of the clustered index, or when a range of items is selected.
Cluster:
When multiple databases and multiple tables are joined, it is referred to as a cluster. The records of the tables sharing the value of a cluster key are stored together in the same or nearby data blocks. This may improve joins of these tables on the cluster key, since the matching records are stored together and less I/O is required to locate them.
A cluster can be keyed with a B-tree index or a hash table. The data block where a table record is stored is determined by the value of the cluster key.
- Materialized view
Typically, data flows from one or more OLTP databases into a data warehouse on a monthly, weekly, or daily basis. The data is normally processed in a staging file before being added to the data warehouse. Data warehouses commonly range in size from 10 GB to a few terabytes. Usually, the vast majority of the data is stored in a few very large fact tables.
One technique employed in data warehouses to improve performance is the creation of summaries. Summaries are special kinds of aggregate views that improve query execution time by precalculating expensive joins and aggregation operations prior to execution and storing the results in a table in the database. For example, we can create a table to contain the sum of sales by region and by product.
The summaries or aggregates that are referred to in books and literature on data warehousing are created in Oracle using a schema object called a materialized view.
Materialized views can perform a number of roles, such as improving query performance or providing replicated data.
In data warehouses, we can use materialized views to precompute and store aggregated data such as the sum of sales. Materialized views in these environments are often referred to as summaries, because they store summarized data.
They can also be used to precompute joins with or without aggregation.
The need for materialized views
- To increase the speed of queries on very large databases. Queries to large databases often involve joins between tables, aggregations such as SUM, or both. These operations are expensive in terms of time and processing power.
The type of materialized view we create determines how the materialized view is refreshed and used by query rewrite.
- We can use almost identical syntax to perform a number of roles.
E.g. a materialized view can replicate data, a process formerly achieved by using the CREATE SNAPSHOT statement. CREATE MATERIALIZED VIEW is now a synonym for CREATE SNAPSHOT.
- Materialized views improve query performance by precalculating expensive join and aggregation operations on the database prior to execution and storing the results in the database.
Types of materialized views
- Materialized view with aggregates.
- Materialized view containing only joins
- Nested materialized view.
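SQLite has no materialized views, so the sketch below only imitates a materialized view with aggregates by storing a summary table and recomputing it on demand. It is not Oracle syntax. The `sales` table, the `sales_summary` table and the function `refresh_sales_summary` are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
INSERT INTO sales VALUES
    ('east', 'phone', 100.0), ('east', 'phone', 50.0),
    ('east', 'laptop', 900.0), ('west', 'phone', 75.0);
""")


def refresh_sales_summary(connection):
    """Complete refresh: recompute the stored summary from the detail table."""
    connection.executescript("""
    DROP TABLE IF EXISTS sales_summary;
    CREATE TABLE sales_summary AS
        SELECT region, product, SUM(amount) AS total_amount
        FROM sales
        GROUP BY region, product;
    """)


refresh_sales_summary(conn)
summary = conn.execute(
    "SELECT region, product, total_amount FROM sales_summary ORDER BY region, product"
).fetchall()

# New detail rows do not appear in the stored summary until it is refreshed.
conn.execute("INSERT INTO sales VALUES ('west', 'phone', 25.0)")
stale = conn.execute(
    "SELECT total_amount FROM sales_summary WHERE region = 'west'"
).fetchone()[0]

refresh_sales_summary(conn)
fresh = conn.execute(
    "SELECT total_amount FROM sales_summary WHERE region = 'west'"
).fetchone()[0]
```

Queries against the summary table avoid repeating the aggregation, but the stored result goes stale when the detail data changes. This is why the refresh method matters when choosing a type of materialized view.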