Traditionally, data warehouses take too long to build and are too hard to change. WhereScape RED is an Integrated Development Environment that supports the building and management of data warehouses.
It has the flexibility to enable you to build a variety of architectures including:
- enterprise data warehouses
- dimensional data warehouses
- data marts
- user facing views, aggregates and summaries
In all cases the core values of WhereScape RED are twofold: its rapid building capabilities that enable better data warehouses to be built, faster, and its integrated environment that simplifies management.
As a data warehouse specific tool, WhereScape RED embodies a simple, pragmatic approach to building data warehouses. With WhereScape RED you specify what you want to achieve by dragging and dropping objects to create a meta view, and then let WhereScape RED do the heavy lifting of creating the necessary tables and procedures etc. Data warehouse wizards prompt for additional information at critical points to provide the maximum value from the generated objects.
WhereScape RED supports these concepts to facilitate very rapid delivery of data warehouses. WhereScape RED controls the flow of data from the source systems through transforming and modeling layers to analysis areas.
Different styles of data warehousing (EDW 3NF, dimensional etc) are supported and utilize different objects, but all follow the same basic flow.
Data Flow - Enterprise Models:
- Source (OLTP) System
- load tables
- stage tables
- data store tables
- model tables, dimension tables, or detailed (transactional) fact tables
- roll up fact table(s)
- aggregate and/or KPI fact table(s)
- views
- export objects
- Microsoft Analysis Services cubes
The objects and the information flow are described below:
Data Flow:
Data is moved from source tables to load tables via scripts, database links and ODBC links. These load tables are created by dragging and dropping from a connection object. Load tables are generally based on source system tables. Their main purpose is to be a destination for moving data as simply and quickly as possible from the source system. Load tables will generally hold a single unit of data (e.g. last night or last month), and will be truncated at the start of each extract. Transformations can be performed on the columns during the load process if required.
Load tables feed stage tables, which in turn feed data store, model or dimension tables. Data from multiple load tables can be combined at this level.
First tier transactional tables (fact or model) are created and updated from stage tables. Second tier tables (model, summary rollup, aggregate, KPI, etc.) are created and updated from lower level tables.
Cubes can be created from transactional tables or views.
Procedural code:
WhereScape RED generates procedural code in the target database's native language (e.g. PL/SQL for Oracle) at each stage in the data warehouse build process. The generated code is, in nearly all cases, sufficient to create a rapid prototype of the data warehouse.
While the generation of code is often seen as a key benefit of WhereScape RED, the ability to control and manage custom code is also critical to the long term management of the data warehouse environment.
In most cases 85-100% of the generated code will be taken through to production with no customization required.
WhereScape RED and Traditional ETL Tools:
WhereScape RED's core strength is the rapid building of data warehouse structures. Organizations that have already purchased traditional ETL tools can use WhereScape RED as a pure-play data warehouse toolset. WhereScape RED can be used to iteratively build data marts or presentation layer objects that need to be constantly updated to stay relevant for end users. In most cases, customers will find that WhereScape RED has enough ETL capability to build the entire data warehouse, using the database rather than a proprietary engine to perform ETL processing.
The crossover in functionality between ETL tools and WhereScape RED is not large. WhereScape RED is tightly integrated into the data warehouse database and has an embedded data warehouse building approach. For WhereScape, data movement is the start of the process, from source system to load tables. The key benefits of the product, development productivity and an integrated environment to manage and maintain your data warehouse, come after the data movement stage. Where a traditional ETL tool is already in use, the output of the ETL process is a WhereScape RED Load, Stage, Dimension, Fact or Model table, from which WhereScape RED builds more advanced data warehouse structures.
Data Vault:
The Data Vault system is an alternative approach to modelling an enterprise data warehouse that has been gaining popularity among organizations.
The Data Vault data warehouse architecture was invented by Dan Linstedt to provide an alternative to the traditional data warehouse modelling approach that includes developing 3rd Normal Form (3NF) type models or dimensional star schema models. The data vault methodology seeks to improve the efficiency of data ingestion and the flexibility of structure changes.
WhereScape RED has been enhanced to expand its current Data Vault functionality and provide improved automation for creating and managing Data Vault objects in WhereScape RED managed Data Warehouses. The enhancement includes the following:
- New DSS columns for Load tables
- New Wizard for Hash key generation
- New Wizard for building Hub, Link and Satellite tables
- New Templates for Procedure generation
If the Hub, Link and Satellite tables are based on standard Load or Stage tables (that do not include the hash key column type flags), then WhereScape RED reverts to its previous behavior and the resulting procedures are generated by internal WhereScape RED automation rather than via templates.
In this post I describe how to load data from an SAP HANA system into a Hadoop environment using the WhereScape RED tool.
Hadoop Cluster Connection:
This connection is used to connect to the Hadoop cluster so that tables can be created, loaded and verified. Before loading into a Hadoop table, we verify that the schema has been created correctly. Data can be loaded into the Hadoop table in various formats, depending on the requirements. The data itself is loaded by Sqoop scripts.
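As a rough illustration, a generated Sqoop load script is essentially a JDBC import from HANA into the Hive load table. The sketch below is not the script RED generates; the host, schema, table and credential paths are hypothetical placeholders, and the HANA JDBC driver jar is assumed to be available to Sqoop.

```bash
#!/bin/bash
# Sketch of a Sqoop import from SAP HANA into a Hive load table (placeholder names).
sqoop import \
  --connect "jdbc:sap://hana-host:30015/?currentschema=SALES" \
  --driver com.sap.db.jdbc.Driver \
  --username "$HANA_USER" \
  --password-file /user/etl/.hana_pwd \
  --table SALES_ORDERS \
  --hive-import \
  --hive-table load_sales_orders \
  --hive-overwrite \
  --num-mappers 4
```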
Steps to be performed:
Step 1:
Drag the source table across and create it as a Load table.
Once the table is created, open its Properties and choose the following settings to create the script:
- Load type: script-based load
- Connection type: this should always be the Linux connection
- Select the Sqoop script template
- Choose the sqoop template
The screenshot below shows these settings in detail.
Clicking the Generate tab creates the Sqoop script along with the source table columns and descriptions. Once the code has been generated, create the table using a Hive DDL template selected from the drop-down list (a sketch of such a DDL is shown at the end of this step). Make sure the table has been created in the Hive Metastore, then start the load to consume the data from HANA.
Once the properties are set, click Generate Scripts in the Properties tab. After the script has been generated, review it for confirmation. Right-click the table and execute the Load function. Once the load completes, move on to the Stage table.
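To make the Hive DDL step concrete, the DDL generated from such a template typically amounts to a plain CREATE TABLE holding the source columns plus the DSS audit columns. This is only a minimal sketch; load_sales_orders and its columns are hypothetical names, not objects from this environment.

```bash
# Minimal sketch of a Hive DDL for the load table (all names are placeholders).
hive <<'EOF'
CREATE TABLE IF NOT EXISTS load_sales_orders (
  order_id          STRING,
  customer_id       STRING,
  product_id        STRING,
  order_amount      DECIMAL(18,2),
  dss_record_source STRING,      -- DSS audit column: source system identifier
  dss_load_date     TIMESTAMP    -- DSS audit column: load timestamp
)
STORED AS ORC;
EOF
```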
Step 2:
Now we move the data into the stage layer.
Stage tables are used to transform the data to a star schema or third normal form model. A stage table can be a fact or EDW 3NF table that only contains change data, or a work table. In star schema data warehouses, the stage table brings all the dimensional joins together in preparation for publishing into the fact table.
A stage table is built from the Data Warehouse connection. Unless you are retrofitting an existing system, stage tables are typically built from one or more load or stage tables. They can utilize the surrogate keys from a number of dimension tables.
The purpose of this table in the load is to create the Hub Key and Change Key so that no data is lost when it is moved on to the subsequent steps. Before running the update script, verify the table against the source, and before generating the key columns, confirm which keys are being brought across from the source table.
- Click the Rebuild option to select the template options.
- Once the template is selected, click Regenerate in the Properties tab. This should prompt you to select the Hub Key and Change Key. The Hub Key should hold the hash of the primary key, and the Change Key should hold the hash of the remaining keys concatenated together (see the sketch after this list). Click OK.
- Once the properties are set, click Generate Scripts in the Properties tab. After the script has been generated, review it for confirmation. Right-click the table and execute the Run Update Script function. Once the update completes, start the Hub table loading.
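To make the Hub Key and Change Key concrete, the generated stage update logic essentially hashes the primary business key for the Hub Key and the concatenation of the remaining keys for the Change Key. The sketch below assumes MD5 hashing and hypothetical table and column names; the actual template may use a different hash function and delimiter or null handling.

```bash
# Sketch of the hash key derivation in the stage update script (hypothetical names).
hive <<'EOF'
INSERT OVERWRITE TABLE stage_sales_orders
SELECT
  upper(md5(cast(l.order_id AS STRING)))      AS hub_sales_order_key,  -- Hub Key: hash of the primary key
  upper(md5(concat_ws('||',
      cast(l.customer_id  AS STRING),
      cast(l.product_id   AS STRING),
      cast(l.order_amount AS STRING))))       AS change_key,           -- Change Key: hash of the remaining keys
  l.*
FROM load_sales_orders l;
EOF
```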
Fact Table:
A Fact table is normally defined, for our purposes, as a table with facts (measures) and dimensional keys that allow the linking of multiple dimensions. It is normally illustrated in the form of a Star Schema with the central Fact table and the outlying dimensions.
The ultimate goal of the Fact table is to provide business information to the end user community. In many cases, different types of Fact tables are required to address different end user requirements. For simplicity, the different types of fact table are grouped together; the Data Vault structures used in this walkthrough (Hub, Link and Satellite tables) are described under the following headings:
Hub Table:
Hubs contain a list of unique business keys with low propensity to change. Hubs also contain a surrogate key for each Hub item and metadata describing the origin of the business key. The descriptive attributes for the information on the Hub (such as the description for the key, possibly in multiple languages) are stored in structures called Satellite tables which will be discussed below.
Links:
Associations or transactions between business keys (relating for instance the hubs for customer and product with each other through the purchase transaction) are modeled using link tables. These tables are basically many-to-many join tables, with some metadata.
Links can link to other links, to deal with changes in granularity (for instance, adding a new key to a database table would change the grain of the database table).
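Although this walkthrough does not build a Link table, its shape follows directly from the description above: a link hash key, the hash keys of the hubs it relates, and load metadata. A minimal Hive sketch, using hypothetical table and column names:

```bash
# Sketch of a Link table relating customer and product hubs (hypothetical names).
hive <<'EOF'
CREATE TABLE IF NOT EXISTS link_customer_product (
  link_customer_product_key STRING,    -- hash of the related business keys
  hub_customer_key          STRING,    -- hash key of the customer hub
  hub_product_key           STRING,    -- hash key of the product hub
  dss_record_source         STRING,
  dss_load_date             TIMESTAMP
)
STORED AS ORC;
EOF
```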
Satellites:
The hubs and links form the structure of the model, but have no temporal attributes and hold no descriptive attributes. These are stored in separate tables called satellites. These consist of metadata linking them to their parent hub or link, metadata describing the origin of the association and attributes, as well as a timeline with start and end dates for the attribute. Where the hubs and links provide the structure of the model, the satellites provide the "meat" of the model, the context for the business processes that are captured in hubs and links.
Step 3:
Hub Table:
This table holds Business keys and generated hub keys.
Create a Hub table from the Stage table created above (either by creating a new Hub table and then mapping it to the stage columns, or by dragging the source table into the Column View pane).
- In the Table Column View pane, select the Primary Business Key and the Hash Key.
- Create the MapR-DB table manually in the required MapR path.
- Right-click the newly created Hub table and select Properties.
- Use the Storage and Extended properties (as shown in the figure above) to select the DDL template; make sure to set the appropriate MapR-DB table location.
- Click Rebuild to select the required template and, once the template is selected, click Regenerate to generate the script.
- After the script is generated, review it for confirmation. Right-click the table and click the Create (Recreate) option to create the HIVE_MAPRDB interface table (a DDL sketch follows this list).
- Once the table is created, right-click it and select Execute Update Script to load the data into the HIVE_MAPRDB interface table.
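For orientation, a Hub table DDL generated against MapR-DB typically maps the hash key to the row id via the MapR-DB JSON storage handler for Hive. The handler class and table properties below are from the MapR Hive integration as I understand it, and the table name, path and columns are hypothetical, so treat this as a sketch rather than the template's exact output.

```bash
# Sketch of a Hive DDL for the Hub table backed by MapR-DB JSON (hypothetical names and paths).
hive <<'EOF'
CREATE TABLE IF NOT EXISTS hub_sales_order (
  hub_sales_order_key STRING,    -- hash of the business key, used as the row id
  order_id            STRING,    -- business key
  dss_record_source   STRING,
  dss_load_date       TIMESTAMP
)
STORED BY 'org.apache.hadoop.hive.maprdb.json.MapRDBJsonStorageHandler'
TBLPROPERTIES (
  "maprdb.table.name" = "/apps/datavault/hub_sales_order",
  "maprdb.column.id"  = "hub_sales_order_key"
);
EOF
```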
Step 4:
Satellite table:
Satellites have metadata linking them to their parent hub, metadata describing the origin of the association and attributes, as well as a timeline with start and end dates for the attribute. Where the hubs and links provide the structure of the model, the satellites provide the "meat" of the model, the context for the business processes that are captured in hubs and links.
- Create a Satellite table from the Stage table created above (either by creating a new Satellite table and then mapping it to the stage columns, or by dragging the source table into the Column View pane).
- In the Table Column View pane, select the Hash Key and the Change Key.
- Add a new column holding updated_date, i.e. the time the record was inserted. This, along with the Change Key, acts as the unique key for the satellite table.
- Create the MapR-FS table in the required MapR-FS path.
- Right-click the newly created Satellite table and select Properties.
- Use the Storage and Extended properties (as shown in the figure above) to select the DDL template; make sure to set the appropriate MapR-DB table location.
- Click Rebuild to select the required template and, once the template is selected, click Regenerate to generate the script.
- After the script is generated, review it for confirmation. Right-click the table and click the Create (Recreate) option to create the HIVE_MAPRDB interface table (a DDL sketch follows this list).
- Once the table is created, right-click it and select Execute Update Script to load the data into the HIVE_MAPRDB interface table.
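As with the Hub table, the Satellite DDL generated from the template boils down to a table keyed on the parent hub's hash key plus the load timestamp, carrying the change key and the descriptive attributes. The sketch below uses hypothetical names and a plain ORC table.

```bash
# Sketch of a Hive DDL for the Satellite table (hypothetical names).
hive <<'EOF'
CREATE TABLE IF NOT EXISTS sat_sales_order (
  hub_sales_order_key STRING,      -- hash key of the parent hub
  change_key          STRING,      -- hash of the descriptive attributes
  updated_date        TIMESTAMP,   -- when the record was inserted
  customer_id         STRING,
  product_id          STRING,
  order_amount        DECIMAL(18,2),
  dss_record_source   STRING,
  dss_load_date       TIMESTAMP
)
STORED AS ORC;
EOF
```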
That's the end of this blog post. Thanks for reading!