DATA WAREHOUSE BASICS

William H. Inmon (born 1945) is an American computer scientist, recognized by many as the father of the data warehouse. Bill Inmon created the accepted definition of what a data warehouse is - a subject oriented, nonvolatile, integrated, time variant collection of data in support of management's decisions. Compared with the approach of the other pioneering architect of data warehousing, Ralph Kimball, Inmon's approach is often characterized as a top-down approach.
Ralph Kimball (Born 1944) is an author on the subject of data warehousing and business intelligence. He is widely regarded as one of the original architects of data warehousing and is known for long-term convictions that data warehouses must be designed to be understandable and fast. His methodology, also known as dimensional modeling or the Kimball methodology, has become the de facto standard in the area of decision support.
DATA WAREHOUSE CONCEPTS
A fundamental concept of a data warehouse is the distinction between data and informationData is composed of observable and recordable facts that are often found in operational or transactional systems. At Rutgers, these systems include the registrar’s data on students (widely known as the SRDB), human resource and payroll databases, course scheduling data, and data on financial aid. In a data warehouse environment, data only comes to have value to end-users when it is organized and presented as informationInformation is an integrated collection of facts and is used as the basis for decisionmaking. For example, an academic unit needs to have diachronic information about its extent of instructional output of its different faculty members to gauge if it is becoming more or less reliant on part-time faculty.
DATA WAREHOUSE DEFINITIONS
The data warehouse is that portion of an overall Architected Data Environment that serves as the single integrated source of data for processing information. The data warehouse has specific characteristics that include the following:
Subject-Oriented: Information is presented according to specific subjects or areas of interest, not simply as computer files. Data is manipulated to provide information about a particular subject. For example, the SRDB is not simply made accessible to end-users, but is provided structure and organized according to the specific needs.
Integrated: A single source of information for and about understanding multiple areas of interest. The data warehouse provides one-stop shopping and contains information about a variety of subjects. Thus the OIRAP data warehouse has information on students, faculty and staff, instructional workload, and student outcomes.
Non-VolatileStable information that doesn’t change each time an operational process is executed.Information is consistent regardless of when the warehouse is accessed.
Time-Variant: Containing a history of the subject, as well as current information. Historical information is an important component of a data warehouse.
Accessible: The primary purpose of a data warehouse is to provide readily accessible information to end-users.
Process-Oriented: It is important to view data warehousing as a process for delivery of information. The maintenance of a data warehouse is ongoing and iterative in nature.
Other Definitions
Data Warehouse: A data structure that is optimized for distribution. It collects and stores integrated sets of historical data from multiple operational systems and feeds them to one or more data marts. It may also provide end-user access to support enterprise views of data.
Data Mart: A data structure that is optimized for access. It is designed to facilitate end-user analysis of data. It typically supports a single, analytic application used by a distinct set of workers.
Staging Area: Any data store that is designed primarily to receive data into a warehousing environment.
Operational Data Store: A collection of data that addresses operational needs of various operational units. It is not a component of a data warehousing architecture, but a solution to operational needs.
OLAP (On-Line Analytical Processing): A method by which multidimensional analysis occurs.
Multidimensional Analysis: The ability to manipulate information by a variety of relevant categories or “dimensions” to facilitate analysis and understanding of the underlying data. It is also sometimes referred to as “drilling-down”, “drilling-across” and “slicing and dicing”
Star Schema: A means of aggregating data based on a set of known dimensions. It stores data multidimensionally in a two dimensional Relational Database Management System (RDBMS), such as Oracle.
Snowflake Schema: An extension of the star schema by means of applying additional dimensions to the dimensions of a star schema in a relational environment.
What is datawarehouse?
Datawarehouse is a
  • Subject-oriented-Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
  • Non-volatile-Once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.
  • Integrated-Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
  • Time-varient-In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.
database system.
What are the steps involved in loading a  datawarehouse?
There are 3 steps in datawarehousing,
  • data-extraction from the OLTP system
  • Apply the required transformations on the data
  • Load the data into datawarehouse
What is fact table?
Fact table is  a table, which is having the multiple foreign keys of the dimension tables and measures.Fact table is also called as child table having dimension codes and facts/measures.
What is dimension table?
Dimension table is a table,which is having the descriptive information of an entity.This is also called as master table having the primary key.Always the data flow from dimension table to fact table.

What is data-modelling?
The process of storing and retrieving the data from datawarehouse is called as data-modelling.

What is star schema and snow-flake schema?What is the difference between those?
Star is one fact and many dimensions where as Snowflake design is one fact and many dimensions and dimensions can have additional dim relations.No.of joins in the snow-flake schema is more compare to star schema.
Difference between OLTP and OLAP systems?
OLTP(Online transactional process)
Dynamic data
volatile
Normalised data
Efficient for data storing
OLAP(Online Analytical process)
90%static,10%Dynamic data
Non-volatile
De-normalised data
Efficient for data retrieving
What are different types of dimensions?
  • confirmed dimension
  • slowly changing dimension
  • rapidly changing dimension
  • De-generated dimension

What are different types of fact tables?
  • Additive fact table
  • semi-additive fact table
  • Non-additive or fact-less fact table
What is Federated Query?
A query is fired across multiple sources is called as federated query.
like
select cust_first_name,revenue from customers,sales
customers->SQL,sales->DB2 data sources
Hope it Helps!

1 comment:

  1. Nice and good article. It is very useful for me to learn and understand easily. Thanks for sharing your valuable information and time. Please keep updatingmicrostrategy online training

    ReplyDelete