Stage: Analyze
High Level Design Document (HLD)
IPC & Security Reference Manager (SRM) – Data Migration
Document Version: 1.0
17/02/2012
| Project Name | Client | Process Name |
| Index Platform Consolidation | Russell Investments | |

| Release Date | Version No. | Summary of Changes | Prepared / Modified By | Reviewed By |
| 14-Feb-2012 | 0.1 | Initial version | Vaithi | |
| 16-Feb-2012 | 0.2 | Updated Detailed Architecture for IPC and SRM in Sections 3.3, 3.5 and 3.7 | Shyam Sundar | |
| 21-Feb-2012 | 0.3 | Updated ETL Framework and Execution Components in Section 4 | Shyam Sundar | |
1 Executive Summary
Russell Indexes launched the Platform Consolidation Program with the primary goal, as identified by Russell's Board of Directors and the Indexes business unit, of reducing operational risk in the Indexes business unit: the risk of loss resulting from inadequate or failed internal processes, people, systems, or external events.
The Neptune Project, within the Platform Consolidation Program, tackles consolidation of the platform supporting Index Operations' business processes. The objective of this project is to consolidate the Global and US Index operations onto a single platform. This consolidation will result in a single, standardized index-production unit using uniform global-production procedures, executed in multiple locations. To achieve this, the sub-project "Data Migration and Integration" within the Index Platform Consolidation program will migrate data from the existing source systems to the IPC applications in a phased manner.
· Release 0 is planned to migrate static data and securities for the Security Reference Manager (SRM). This release will integrate securities data from multiple source systems and then provide a unique security definition and a unique internal security identifier for use by all existing and future systems within the Russell Indexes division.
· Release 1 is planned to migrate the static data required to create securities and indexes, and all Russell universe securities, for the IPC applications.
· Release 2 is planned to migrate the complete data from the Global systems (UK).
· Release 3 is planned to migrate the complete data from the Eagle Pace systems (US).
1.1 Introduction
This document provides the high-level design and overall design considerations for the Data Migration and Integration phase for Release 0 (SRM) and Release 1 (IPC).
1.2 Purpose
This document explains the Data Migration and Integration process for IPC and SRM from the following applications, which will be the sources of data for migration and integration into the target database:
· US Market – Production System – Eagle Pace database
· Global Market – Production System – Sybase "Global Research" database
· R&R – Global Research system – SQL Server "Global Index" database
The document provides the following information related to the design and operation of the ETL processes required to load current registration cycle-point information:
· An overview of the high-level Extract, Transformation, and Load (ETL) architecture to support loading source data into the target system.
· The detailed ETL architecture that identifies data sources, transformations and procedures.
· The operational process flow.
1.3 Scope
The scope of this document is to define the strategy for the various phases of data migration. The phases in this data migration project are as follows:
o Design
o Construction
o Testing
o Pre-Implementation / Dry Runs
o Implementation / Production data migration
1.4 System Description
The scope of the data migration is to migrate the data from the existing multiple source systems to the SQL Server database. The system architecture related to these systems is described below.
1.4.1 Source System Description
| S.No | Operating System | Software Platform | Database |
| 1 | Windows 2008 Server | Eagle Pace – US | SQL Server 2005 |
| 2 | Unix | Portico Application – UK | Sybase 11.x |
| 3 | Windows 2008 Server | Research – RnR | SQL Server 2005 |
1.5 Target System Description
| S.No | Operating System | Software Platform | Database |
| 1 | Windows 2008 Server | Security Reference Manager (SRM) | SQL Server 2008 |
| 2 | Windows 2008 Server | Index Platform Consolidation (IPC) Application | SQL Server 2008 |
1.6 Methodology and Tools
| Environment | Platform / Tool |
| Database Server | Microsoft Windows 2008 Server Enterprise Edition; SQL Server 2008 R2 |
| Development | SQL Server 2008 Business Intelligence Development Studio; 32-bit ODBC driver to support Sybase connectivity |
1.7 Definitions, Acronyms and Abbreviations
| ETL | Extraction – Transformation – Loading |
| OLTP | Online Transaction Processing |
| SSIS | SQL Server Integration Services |
| HEX or Hexaware | Hexaware Technologies Ltd. |
| IPC | Index Platform Consolidation |
| ASIS / Source | Source systems currently being used |
| To Be / Target | Target or proposed system |
| AMD | Apply Market Data |
| Target System | IPC – Eagle Pace |
| DM | Data Migration |
| SRM | Security Reference Manager |
| FS | Functional Specification |
1.8 Assumptions
· The target data model will be developed iteratively and so may undergo several changes; source data analysis therefore has to be performed against the evolving target data model. Once the target data model is baselined, unmapped fields in the source will be analyzed further to confirm whether they can actually be ignored.
· Russell will define the strategy, analysis, design and construction of scripts for data cleansing. Hexaware will support and complement this.
· The production cut-over window for implementation is expected to be 48 hours over a weekend. This could change based on the volume of records and the relationships between tables, which define the order of migration.
· The current strategy is to extract the data from the source systems using SSIS, and to extract, transform and load it into the SQL Server target database.
· Existing master data will be extracted into Excel sheets, and the new target data is mapped one to one manually.
· Data to be migrated is frozen before the start of the migration.
· There will not be any explicit lock on the data to be migrated by any of the applications accessing the data during the outage window.
· The current existing model is baselined and assumed to be 100% complete.
· The scope of the data migration project is to migrate only the data that will be accessed by the target application system.
2 Standards, Conventions and Procedures
2.1 Programming Standards
The following SharePoint link points to the SSIS Development Guide:
http://project-sharepoint/PWA/Program%2000001%20Enterprise%20Data%20Management/201089%20EDM%20Program%20and%20Data%20Factory/Project%20Documents/Forms/AllItems.aspx
The following SharePoint link points to the Data Modeling and Architecture Standards:
http://project-sharepoint/PWA/Program%2000001%20Enterprise%20Data%20Management/201089%20EDM%20Program%20and%20Data%20Factory/Project%20Documents/Forms/AllItems.aspx
3 Design Illustration
3.1 Introduction
A High Level Design (HLD) document provides an overview of a solution, platform, system, or process. It usually includes a high-level architecture diagram depicting the components, interfaces and networks that need to be further specified or developed in the Low Level Design. The document may also depict, or otherwise refer to, workflows and/or data flows between component systems.
The objective of this document is to elaborate and provide insight on all stages of the Data Migration and Integration life cycle.
The High Level Design Document content includes:
· Overall High Level Design Architecture
· Key Technical Requirements
· Key Technical Design Decisions and Criteria
· Solution Overview
· List of dependencies and prerequisites
· Project Risks and any assumptions made
3.2 High Level Architecture for IPC and SRM
The following diagram depicts the overall high-level architecture for the IPC and SRM Data Migration and Integration layers.
- Source Layer:
o All the source systems, Eagle Pace – US, Global – UK (Sybase DB) and Research (RnR), are available in the source area.
- Landing Area:
o All the source tables identified from US-Eagle, UK-Global and RnR for migration into SRM will be imported as landing tables. The table structure will be the same as in the source system and the complete data will be extracted (see the sketch after this list).
- Staging Area 1:
o Source data includes Company/Security/Cross Reference information with all the date columns, status flags and legacy info source.
o Security history data is not extracted.
o Only securities data with the latest effective/updated date is extracted.
o Based on the data analysis, the fields that need to be cleansed are identified. Data cleansing is required to ensure that only accurate, consistent and complete data is loaded into the target database. Data cleansing will be required for:
§ Junk characters
§ Invalid values
§ Values not within the range of the field
§ Format consolidation (e.g., dates, amount fields)
§ Referential integrity (e.g., an affiliate RUT in any transaction table should also be present in the affiliate master)
- Staging Area 2:
o The table structure will be the same as in the target system.
o Data from Staging 1 is mapped one to one and converted from the source data format into the destination format.
o Data transformation applied at this stage includes code conversion, reformatting, consolidation, summation, etc.
- Target System:
o Data is finally moved from Staging Area 2, with the proper data element mappings, to the target system.
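As an illustration of the Landing Area pattern above, the sketch below pulls one source table in full into a landing table whose structure mirrors the source. The schema, table and server names (Landing.SECURITY, SourceServer.EaglePace) are assumptions; in the actual solution the extract is performed by SSIS data flow tasks rather than hand-written SQL.

-- Minimal sketch: full extract of one source table into the landing area.
-- All object names are illustrative; the real load is done by SSIS packages.
IF OBJECT_ID('Landing.SECURITY', 'U') IS NOT NULL
    DROP TABLE Landing.SECURITY;

SELECT *
INTO   Landing.SECURITY                       -- landing table keeps the source structure as-is
FROM   SourceServer.EaglePace.dbo.SECURITY;   -- linked-server reference to the US Eagle Pace source (assumed)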
3.3 Detailed Architecture for IPC and SRM
The following diagram depicts the overall SRM detailed ETL architecture.
All the security-related data will be loaded through the ETL process to the staging areas, and from there to the target, as follows:
- All the source tables identified from US-Eagle, UK-Global and RnR for migration into SRM will be imported as landing tables by ETL jobs. The table structure is the same as in the source system and the complete data will be extracted.
- Once the data is loaded into the landing zone, it is moved to Staging Area 1. Between the landing area and Staging Area 1, data cleansing and validation rules are applied (see the sketch after this list), such as:
§ Date values must be within 75 years
§ Checks for null data
§ Format consolidation (e.g., dates, amount fields)
- Once the data is loaded into Staging Area 1, it is moved to Staging Area 2. Between Staging Area 1 and Staging Area 2, transformation rules are applied and the mappings are checked against the target mappings; here the table structure is the same as the target database table structure. Transformation rules include:
§ Checking the mappings
§ Checking the data types
- Once the data is loaded into Staging Area 2, it is loaded into the target database by the wrapper procedure.
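The cleansing and validation rules listed above can be expressed as set-based checks when rows move from the landing tables into Staging Area 1. The sketch below is illustrative only; the table and column names (Landing.SECURITY, Staging1.SECURITY, IssueDate, SecName) are assumptions, and in the actual solution these rules run inside SSIS data flow and Execute SQL tasks.

-- Illustrative validation pass from the landing area into Staging Area 1.
-- Rows that fail a rule are kept out of staging and can be routed to a reject table instead.
INSERT INTO Staging1.SECURITY (SecurityId, SecName, IssueDate, Currency)
SELECT  s.SecurityId,
        LTRIM(RTRIM(s.SecName)),                                      -- basic cleansing of stray whitespace
        CONVERT(DATE, s.IssueDate),                                   -- format consolidation for date fields
        UPPER(s.Currency)
FROM    Landing.SECURITY AS s
WHERE   s.SecurityId IS NOT NULL                                      -- check for null data in key fields
  AND   ISDATE(s.IssueDate) = 1                                       -- reject unparseable dates
  AND   CONVERT(DATE, s.IssueDate) >= DATEADD(YEAR, -75, GETDATE());  -- date values must fall within 75 years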
3.4 High Level Architecture for Static Data
The following diagram depicts the overall SRM static data ETL architecture.
These data and data types tend to be created once, are maintained over a long time frame, and are shared and used by a number of business activities. This is the information the system needs for proper operation. It is the single access point for the creation, update and deletion of content relevant for SRM in performing its functions, such as companies, securities, and others. The required data is validated across the three different source systems, US-Eagle Pace, UK-Global and Research, and then the data is mastered.
The following static data content is identified as related to securities information:
· Country
· Region
· Exchanges
· Security Classification
· GICS, ICB, RGS Classification
· Industry Group, Industry, Sub-Industry, Sector
· Currencies
· Caps Style
· Asset Classification
· Asset Sub-Classification
· Static data content such as Currencies, Countries, Exchanges, etc., will be extracted from each of the source systems US-Eagle, UK-Global and RnR.
· Data extracted from each source will be checked against the mastering data.
· Using Excel sheets, the data is arranged into the structure required to load the target system.
· From the Excel sources, the data is loaded into the staging area and then loaded to the target system.
3.5 Detailed Architecture for Static Data
The following diagram depicts the overall SRM detailed static data ETL architecture.
The static data will be loaded to the target through the following staging areas. The static SRM database is mainly used for reference of securities.
- All the source tables identified from US-Eagle, UK-Global and RnR for static data migration into the SRM database will be imported into the landing area (Excel source) by a manual process; these Excel sheets maintain unique records for each database.
- Once the data is loaded into the landing zone, it is moved to the staging area. Between the landing and staging areas, mapping and transformation rules are applied:
§ Checking the mappings
- Once the data is loaded into the staging area, it is loaded into the target database by the wrapper procedure.
3.6 High Level Architecture for Data Synchronization
The following diagram depicts the SRM data synchronization.
The purpose of data synchronization is to maintain the consistency of data contained in multiple applications, databases or systems. The need for data synchronization can be either permanent (data synchronization between operational systems) or temporary (for example, data synchronization for certain periods). Data synchronization can be either mono-directional or bi-directional.
In the database delta approach, the system looks at each database table that can be synchronized and evaluates whether a record has been changed, inserted or deleted. Changed records are the easiest to identify and update. New records need to take into account identity fields, particularly if the ID field is referenced in other places. It then becomes the responsibility of the data synchronization controller to ensure that the new ID, assigned by the server, cascades to the referencing tables before the data is synchronized.
For each of the source systems US-Eagle and UK-Global, securities-related records that have been changed or inserted can be detected using the Binary Checksum algorithm provided in SQL Server (see the sketch after this list). Data synchronization will be mono-directional, i.e. from the source systems.
· Source data related to securities will be extracted and then either updated or inserted into the landing area.
· In the landing area tables, a binary checksum is maintained for each row along with its index ID. This is compared with the previously stored value to detect changed and new records.
· Changed values and new records are extracted to Staging 1 and then transformed to Staging 2.
· Source record changes are updated in the SRM database and then moved to the RnR database through a flat file.
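A minimal sketch of the checksum-based delta detection described above is shown below. It assumes a table holding the previously stored checksum per row (Landing.SECURITY_CHECKSUM with a RowChecksum column) and compares it with a freshly computed BINARY_CHECKSUM over the incoming source columns; all table and column names are illustrative.

-- Illustrative delta detection using SQL Server's BINARY_CHECKSUM.
-- New rows and rows whose checksum differs from the stored value are picked up for Staging 1.
SELECT  src.SecurityId,
        src.SecName,
        src.Currency,
        BINARY_CHECKSUM(src.SecName, src.Currency) AS NewChecksum
FROM    Landing.SECURITY AS src
LEFT JOIN Landing.SECURITY_CHECKSUM AS prev
       ON prev.SecurityId = src.SecurityId
WHERE   prev.SecurityId IS NULL                                          -- newly inserted record
   OR   prev.RowChecksum <> BINARY_CHECKSUM(src.SecName, src.Currency);  -- changed record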
3.7 Detailed Architecture for Data Synchronization
· For each of the source systems US-Eagle and UK-Global, securities-related records that have been changed or inserted are detected using the Binary Checksum algorithm provided in SQL Server. Data synchronization will be mono-directional, i.e. from the source systems.
· Source data related to securities will be extracted and then either updated or inserted into the landing area.
· After the data load into the landing zone is complete, the landing zone data is checked against the reconciliation database.
· Metadata is maintained at this stage, covering details such as logging, configuration and auditing.
· Landing data is moved to Staging Area 1 by the ETL process. At this stage data cleansing, data validation and the binary checksum algorithm are applied. Cleansing and validation rules include:
§ Date values must be within 75 years
§ Checks for null data
§ Format consolidation (e.g., dates, amount fields)
· Rejected records are maintained at this stage in the rejected database.
· After the data load into Staging Area 1 is complete, the Staging Area 1 data is checked against the reconciliation database.
· Metadata is maintained at this stage, covering details such as logging, configuration and auditing.
· Staging Area 1 data is moved to Staging Area 2 by the ETL process. At this stage data transformation rules are applied, including:
§ Checking the mappings
§ Checking the data types
· Rejected records are maintained at this stage in the rejected database.
· After the data load into Staging Area 2 is complete, the Staging Area 2 data is checked against the reconciliation database.
· Metadata is maintained at this stage, covering details such as logging, configuration and auditing.
· Staging Area 2 data is moved to the SRM database by the ETL process, using the wrapper procedure.
· Rejected records are maintained at this stage in the rejected database.
· After the data load into the SRM database is complete, the SRM data is checked against the reconciliation database, and the SRM data is then moved to the RnR database through a flat file.
3.8 Sequence of Execution Flow
· The static data migration is performed by taking the data from the three source systems (US, Global and Research) and loading it into an Excel sheet using ETL jobs.
· This Excel sheet now holds the consolidated information of the three source systems. Using ETL package transformations, the data is loaded into the staging database.
· Finally, the staging data is passed through ETL packages and loaded into the target database. This loads the static data from the source systems into the SRM target database.
· In the SRM data migration approach, data from the three systems (US, Global and Research) is loaded into the SRM landing database with the help of ETL jobs.
· The data from the SRM landing database is cleansed and loaded into the Staging 1 database using ETL packages. The data from Staging 1 is then passed on to the Staging 2 database after the corresponding transformation rules are applied.
· Finally, the transformed data from the three sources is moved to the SRM target database using ETL packages.
4 ETL Framework
The concept of the ETL Framework is to provide a set of terminology and a focus by which the interface requirements and common processes can be identified, and reusable designs and components can be provided to streamline design, build, deployment and ongoing production running.
In a Data Migration and Integration project that uses SSIS as the ETL tool, the first step is to build an ETL framework. This framework handles error checking, alert notification, task failures, logging, execution history and batch control.
4.1 ETL Definition
'Extract, Transform, Load' (ETL) is the technology specialized in:
· Extraction of data from one or more sources;
· Transformation of the data (e.g. cleansing, reformatting, standardization, aggregation, or the application of any number of business rules); and
· Loading of the resulting data set into specified target systems or file formats.
4.2 ETL Framework Architecture
Execution Components
- Master Package – ETL batch processing along with the configuration details.
- Execution Package – Provides the workflow for activities.
- Container – Provides the ETL Framework for activities.
- Data Flow – Responsible for all ETL processing.
- Metadata Information – Contains all the metadata information such as log details, audits and configuration details.
ETL Framework Capabilities
- This functionality is supported by the ETL Framework and is implemented by a combination of the Master Package, Execution Packages and Containers. These ETL Framework capabilities provide support for logging, exception handling, email notifications and reconciliation reports.
Metadata Information
The metadata information such as configuration details, logs and audit data is stored in the database and is used for ETL package execution.
- Configuration – Contains both runtime and system configuration information.
- Log – Populated with all Master Package, Activity and Error information from ETL batch runs.
- Audit – Populated with pre-update and deleted records from ETL processing.
Utilities
- Excel Reporting – Provides detailed information on ETL batch execution.
4.3 ETL Master Package
· The diagram below provides an overview of capabilities common to most ETL frameworks. It covers the configuration and logging features. In addition, most frameworks also include a collection of SSIS templates that interface with the ETL Framework and demonstrate core ETL patterns. The execution package groups all the activities and helps to extract data, validate it, apply cleansing rules and perform the data mapping operation.
The SRM Master Package creates the package template and schedules it dynamically as per the vendor SLA.
o Logging – Contains all custom logging activity from master and child package execution details.
o Configurations – All required database connection string information; consists of the technical metadata tables used to drive ETL package execution, including configuration tables and execution package sequencing.
o Reports / Audit Trail – The logging activity is presented to users through a predefined email notification. The logging activity can also be customized at any point in time.
o The execution package helps to extract data, validate it, cleanse it, apply transformation rules and store the result in the target table.
The following diagram depicts the overall SRM DM ETL components.
The diagram below depicts the steps of the ETL execution package.
o ETL Execution Package – The SRM execution package performs file validation, creates the staging area, validates the data logs, and provides the workflow for one or more data flows. In addition, this ETL Framework's execution package template supports efficient set-based updates that execute after a data flow completes.
o The SRM execution package has three main phases:
o Landing Area
o Staging 1 Area
o Staging 2 Area
o The landing zone gets its input data from the three sources: the Global Sybase, US Eagle Pace and Research databases. Data is extracted from the source systems and loaded into the landing environment.
o The data from the landing database is cleansed, validated, transformed and finally loaded into the Staging 1 database.
o The Staging 1 environment now holds the cleansed and validated data. The transformation rules are applied to this data and the result is loaded into Staging 2.
o The data from Staging 2 is then moved to the target SRM database with the help of the wrapper procedure.
4.4 ETL Execution Components
4.4.1 Extraction Stage (Landing)
The data is extracted from the three source systems: the US Eagle Pace, Global UK and Research databases.
This extracted data is mapped to the landing database tables using one-to-one mapping. The landing database then holds the raw data from all three sources.
4.4.1.1 Entry Criteria
The entry criteria for the landing area are the source data from all three systems: US Eagle Pace, Global UK and the Research databases.
4.4.1.2 Approach
The ETL (SSIS) approach is used here to load data from the source systems into the landing area. The data flow tasks direct the flow to the OLE DB source task, which gets the input from the US, Global and Research databases. This source data is then loaded into the landing database using one-to-one mapping.
4.4.1.3 Extraction Process
The data is extracted from the source systems using ETL SSIS tasks and containers. Once the data is extracted, it is loaded into the landing database.
4.4.1.4 Reconciliation
A reconciliation report is generated at the end of this process, providing the record counts of the source and landing tables. This report helps us to validate the data.
4.4.1.5 Error Handling
The data is loaded from the source systems into the landing tables. If the process fails in between, the ETL error handling operation is performed and the cause of the error is tracked.
4.4.1.6 Exit Criteria
The raw data from all three systems is now available in the landing tables.
4.4.2 Staging 1 (Data Cleansing & Validation)
4.4.2.1 Entry Criteria
The landing database is now the source system; it holds the consolidated data from all three sources: US, Global and Research.
4.4.2.2 Approach
The data from the landing database is first validated with the help of SSIS transformation tasks. Once the validation is complete, the cleansing rules are applied and the data is loaded into the first staging area.
4.4.2.3 Extraction Process
4.4.2.4 Reconciliation
A reconciliation report is generated at the end of this process, providing the record counts of the landing and staging tables. This report helps us to validate the data between the source and staging areas.
4.4.2.5 Error Handling
Error handling takes place when the data is migrated from the landing tables to the staging tables. The ETL handles the errors by default and logs the failure details to the relevant tables.
4.4.2.6 Exit Criteria
The staging table now contains the validated and cleansed data. This is passed on to Staging Area 2 for processing.
4.4.3 Staging 2 (Data Transformation & Mapping)
4.4.3.1 Entry Criteria
The Staging 1 database is the source system; it holds the validated and cleansed data. This data is mapped to the corresponding data elements and loaded into the Staging 2 environment.
4.4.3.2 Approach
The data from the Staging 1 database is mapped to the corresponding data elements using the ETL transformations. The transformation rules are applied to this data and the result is loaded into the Staging 2 database.
4.4.3.3 Extraction Process
4.4.3.4 Reconciliation
A reconciliation report is generated at the end of this process, providing the record counts of the Staging 1 and Staging 2 tables. This report helps us to validate the data between the staging areas.
4.4.3.5 Error Handling
Error handling takes place when the data is migrated between the two staging areas. The ETL handles the errors by default and logs the failure details to the relevant tables.
4.4.3.6 Exit Criteria
The Staging 2 database now holds the transformed and validated records. This data will be manipulated and passed on to the target SRM database.
4.4.4 Target (Data Transformation & Mapping)
4.4.4.1 Entry Criteria
The Staging 2 database is the source system; it holds the transformed data. This data is passed through the wrapper procedure to load the target SRM database.
4.4.4.2 Approach
The wrapper procedure is used to load data from the staging environment into the target SRM database (a minimal sketch follows below).
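A hedged sketch of what such a wrapper procedure could look like is shown below. The procedure, schema and table names (usp_LoadSrmFromStaging2, Staging2.SECURITY, SRM.dbo.SECURITY) are assumptions for illustration; the actual wrapper procedure is defined as part of the SRM solution.

-- Illustrative wrapper procedure moving data from Staging Area 2 into the SRM target.
-- Uses an "update else insert" MERGE so that a re-run after a failure does not create duplicates.
CREATE PROCEDURE dbo.usp_LoadSrmFromStaging2
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRY
        BEGIN TRANSACTION;

        MERGE SRM.dbo.SECURITY AS tgt
        USING Staging2.SECURITY AS src
            ON tgt.SecurityId = src.SecurityId
        WHEN MATCHED THEN
            UPDATE SET tgt.SecName  = src.SecName,
                       tgt.Currency = src.Currency
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (SecurityId, SecName, Currency)
            VALUES (src.SecurityId, src.SecName, src.Currency);

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
        DECLARE @ErrMsg NVARCHAR(2048) = ERROR_MESSAGE();
        RAISERROR(@ErrMsg, 16, 1);   -- surface the failure so the calling ETL package can log it
    END CATCH
END;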
4.4.4.3 Extraction Process
4.4.4.4 Reconciliation
A reconciliation report is generated at the end of this process, providing the record counts of the staging and target systems. This report helps us to validate the data between the staging and SRM databases.
4.4.4.5 Error Handling
Error handling takes place when the data is migrated from the staging to the target databases. The ETL handles the errors by default and logs the failure details to the relevant tables.
4.4.4.6 Exit Criteria
The SRM / target database now holds the validated data from the three systems: US, Global and Research.
4.5 ETL Framework DBs (Metadata Information)
4.5.1 Configuration
Package configurations are used to update the values of properties at run time. A configuration is a property/value pair that you add to a completed package. Typically, you create a package and set properties on the package objects during package development, and then add the configuration to the package. When the package runs, it gets the new values of the properties from the configuration (a sketch of a typical configuration table follows the list below).
Package configurations provide the following benefits:
· Configurations make it easier to move packages from a development environment to a production environment.
· Configurations are useful when you deploy packages to many different servers.
· Configurations make packages more flexible.
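When the SQL Server configuration type is used, SSIS stores these property/value pairs in a configuration table. The sketch below shows the layout the Package Configuration Wizard typically generates, plus a sample row; the filter name, server and database values are assumptions, and the actual table used by the framework may differ.

-- Table layout typically generated by the SSIS Package Configuration Wizard for SQL Server configurations.
CREATE TABLE dbo.[SSIS Configurations]
(
    ConfigurationFilter  NVARCHAR(255) NOT NULL,
    ConfiguredValue      NVARCHAR(255) NULL,
    PackagePath          NVARCHAR(255) NOT NULL,
    ConfiguredValueType  NVARCHAR(20)  NOT NULL
);

-- Example row (illustrative values): pointing a connection manager at the landing database.
INSERT INTO dbo.[SSIS Configurations]
    (ConfigurationFilter, ConfiguredValue, PackagePath, ConfiguredValueType)
VALUES
    (N'SRM_DM',
     N'Data Source=DBSERVER;Initial Catalog=SRM_Landing;Provider=SQLNCLI10.1;Integrated Security=SSPI;',
     N'\Package.Connections[Landing].Properties[ConnectionString]',
     N'String');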
4.5.2 Log
The execution log table provides a concise summary of package execution results (e.g., start time, end time, overall status) in a single row, along with other useful information.
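A minimal sketch of such an execution log table is shown below. The table and column names are assumptions; the framework's actual log tables (see Section 4.6.1) carry additional package, task, row-count and error detail.

-- Minimal illustration of a one-row-per-execution log table (names are assumptions).
CREATE TABLE dbo.ETL_ExecutionLog
(
    ExecutionLogId  INT IDENTITY(1, 1) PRIMARY KEY,
    PackageName     NVARCHAR(255)  NOT NULL,
    BatchId         INT            NULL,
    StartTime       DATETIME       NOT NULL DEFAULT (GETDATE()),
    EndTime         DATETIME       NULL,
    OverallStatus   VARCHAR(20)    NOT NULL DEFAULT ('Running'),  -- Running / Succeeded / Failed
    RowsLoaded      INT            NULL,
    ErrorMessage    NVARCHAR(2048) NULL
);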
4.5.3 Audit
A key component of the extract, transformation, and load (ETL) process is the cleanup, or transformation, of data while in transition from one place to another. In most cases there are restrictions, regulatory or otherwise, that require any data that is modified in the ETL pipeline to be audited as such. It is a wise practice to build an audit mechanism for any data you update or exclude as part of the ETL process. Years down the road, if you or your successor is quizzed about the reasoning behind certain transformations, an auditing file or table will be an invaluable tool in answering those questions. In addition, some users will occasionally request access to the unmodified data as it was originally extracted from the source system, and an audit table can be useful in providing that information as well.
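One way to populate such an audit table with pre-update records, as described for the Audit component in Section 4.2, is T-SQL's OUTPUT clause. The sketch below is illustrative; the audit table layout and all object names are assumptions.

-- Illustrative capture of pre-update images into an audit table using the OUTPUT clause.
UPDATE  tgt
SET     tgt.SecName  = src.SecName,
        tgt.Currency = src.Currency
OUTPUT  deleted.SecurityId,                 -- "deleted" exposes the pre-update values
        deleted.SecName,
        deleted.Currency,
        GETDATE(),
        SUSER_SNAME()
INTO    Audit.SECURITY_Audit (SecurityId, SecName, Currency, AuditDate, AuditUser)
FROM    SRM.dbo.SECURITY AS tgt
JOIN    Staging2.SECURITY AS src
    ON  src.SecurityId = tgt.SecurityId;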
4.6 ETL Framework Capabilities
4.6.1 Logging
The framework logs SSIS package run information into five tables: ETL Job, ETL Package, ETL_Rowcount, ETL Task and ETL_Error (see Figure 4 – TGIEH_SSIS_ETL_Framework Database Diagram). Therefore, you no longer need to use SSIS logging and the dbo.sysssislog table.
4.6.2 Exception Handling
The ETL handles errors using the logging mechanism available in the SSIS package. This logging mechanism helps to capture all the success and failure notifications of the SSIS package. Package failures will mainly be captured for the following events:
· OnError
· OnPostExecute
· OnPreExecute
Package failures are tracked in separate tables, and each failure / success event is recorded there. When there is a failure in the package, a new row is inserted into the log table.
This table also maintains the start time and end time of the package, along with a description of the errors encountered and the package details.
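As an illustration, the OnError event handler could run an Execute SQL Task that writes such a failure row, with SSIS system variables (System::PackageName, System::SourceName, System::ErrorDescription) mapped to the statement's parameters. The log table name and columns below are assumptions.

-- Illustrative statement executed by an Execute SQL Task inside the OnError event handler.
-- The ? placeholders are mapped to System::PackageName, System::SourceName and
-- System::ErrorDescription in the task's parameter mapping (OLE DB connection).
INSERT INTO dbo.ETL_Error (PackageName, TaskName, ErrorDescription, ErrorTime)
VALUES (?, ?, ?, GETDATE());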
4.6.3 Reconciliation
Data reconciliation is a necessary process of comparing the data in the source and target systems. The main purpose of data reconciliation is to ensure that the reference and transaction data has been converted and migrated correctly. At the completion of a data migration effort we want to understand whether the validity of the entire data set is still intact. To verify correctness and completeness, reconciliation will be done at each stage.
The number of records transformed into the Stage 1 tables will be compared against the landing table counts. Similarly, the amount sums or record counts of the transformed records in Stage 2 will be compared against the Stage 1 tables' column sums. Reports will be generated for both the record count and column sum comparisons (see the sketch below).
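A minimal sketch of these two reconciliation checks is shown below; the table and column names (Landing.SECURITY, Staging1.SECURITY, Staging2.SECURITY, MarketValue) are assumptions used for illustration.

-- Record count reconciliation between the landing and Stage 1 tables.
SELECT  (SELECT COUNT(*) FROM Landing.SECURITY)  AS LandingRowCount,
        (SELECT COUNT(*) FROM Staging1.SECURITY) AS Staging1RowCount;

-- Column sum reconciliation of a business-important amount field between Stage 1 and Stage 2.
SELECT  (SELECT SUM(MarketValue) FROM Staging1.SECURITY) AS Staging1Sum,
        (SELECT SUM(MarketValue) FROM Staging2.SECURITY) AS Staging2Sum;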
4.6.4 Email Notifications
Sending an email is a frequent requirement, to notify a user of the occurrence of certain events, especially when an unexpected event happens (for example, sending a notification on a failure that could be either logical or physical). SSIS provides a built-in "Send Mail Task" to send email in these circumstances.
4.7 Configuration and Initialization
The ETL Framework stores its configuration information in a database and queries this database at run time to configure the solution. This is the objective for every ETL solution targeted at a production environment. Achieving it requires, first, using variables and expressions to dynamically configure all connections and workflow variables and, second, using SSIS and custom configurations to populate these variables at runtime. To handle any unexpected event at the package level, an e-mail connection is provided.
5 Database Management
5.1 SRM Data Logical Process Flow
All the security-related data will be loaded through the ETL process to the staging areas, and from there to the target, as follows:
· All the source tables identified for migration into SRM will be imported into the landing zone by the ETL process. The table structure is the same as in the source system and the complete data will be extracted. After the data load into the landing zone is complete, the landing zone data is checked against the reconciliation database by the reconciliation process.
· Landing zone data is moved to Staging Area 1 by the ETL process. At this stage data cleansing and validation rules are applied, such as:
§ Date values must be within 75 years
§ Checks for null data
§ Format consolidation (e.g., dates, amount fields)
If any errors or improper data are found, error logs are maintained. After the data load into Staging Area 1 is complete, the data is checked against the reconciliation database by the reconciliation process.
· Staging Area 1 data is moved to Staging Area 2 by the ETL process. At this stage data transformation rules and data mappings are applied:
§ Checking the mappings
§ Checking the data types
If any errors are found, error logs are maintained. After the data load into Staging Area 2 is complete, the data is checked against the reconciliation database by the reconciliation process.
· Once the data is loaded into Staging Area 2, it is loaded into the target database by the wrapper procedure. If any database errors are found at this time, error logs are maintained. After the data load into the target database is complete, the data is checked against the reconciliation database.
5.2 Static Data Logical Process Flow
All the security-related static data will be loaded through the ETL process to the staging area, and from there to the target, as follows:
· All the static data tables identified for static migration into SRM will be imported into the landing zone by a manual process. Landing zone data is maintained in Excel sources. After the data load into the landing zone is complete, the landing zone data is checked against the reconciliation database by the reconciliation process.
· Landing zone data is moved to the staging area. At this stage data transformation rules and data mappings are applied:
§ Checking the mappings
§ Checking the data types
If any errors are found, error logs are maintained. After the data load into the staging area is complete, the data is checked against the reconciliation database by the reconciliation process.
· Once the data is loaded into the staging area, it is loaded into the target database by the wrapper procedure. If any database errors are found at this time, error logs are maintained. After the data load into the target database is complete, the data is checked against the reconciliation database by the reconciliation process.
5.3 Landing Database
5.3.1 Landing Data Validation
Landing zone data is moved to Staging Area 1 by the ETL process. At this stage data cleansing and validation rules are applied, such as:
§ Date values must be within 75 years
§ Checks for null data
§ Format consolidation (e.g., dates, amount fields)
If any errors or improper data are found, error logs are maintained. After the data load into Staging Area 1 is complete, the data is checked against the reconciliation database by the reconciliation process.
5.4 Staging 1 Database
5.4.1 Staging 1 Data Validation
Staging Area 1 data is moved to Staging Area 2 by the ETL process. At this stage data transformation rules and data mappings are applied:
§ Checking the mappings
§ Checking the data types
If any errors are found, error logs are maintained. After the data load into Staging Area 2 is complete, the data is checked against the reconciliation database by the reconciliation process.
5.5 Staging 2 Database
5.5.1 Staging 2 Data Validation
Once the data is loaded into Staging Area 2, it is loaded into the target database by the wrapper procedure. If any database errors are found at this time, error logs are maintained. After the data load into the target database is complete, the data is checked against the reconciliation database.
6 Deployment Strategy
7 Testing Strategy
This phase comprises testing the data migration suite for each iteration. Testing will check all the transformations / mappings / workflows / cleansing / audit and validations. Individual test cases need to be prepared for testing the various functionalities.
7.1 Strategy
The following matrix illustrates the broad areas that the test cases will pertain to.
| Attributes | Measurement Plan | Remarks |
| Business important fields for checksum | Identify all business-important fields that can be used for summation checks for data extracts and in target tables. Perform summations on the identified fields in incoming data files and match the sums. | Business-important fields that can be used for checksums need to be requested from Russell users and should be included in the extracts. |
| Integrity checks | Identify all integrity constraints. All data must pass through the associated integrity constraints (e.g., there can be no detail records in the absence of a master). | Integrity constraints are verified and validated. |
| Outlier conditions | Identify the minimum, maximum and default values for data attributes. All data attributes should contain a valid value. Raise an alert when invalid values are detected. | Min, max and default values should be provided, verified and validated. |
| Alert mechanism | Identify all steps which need to generate an alert (e.g., invalid incoming data, failed integrity checks, outliers, load). Raise alerts. | Any specific alert requirements should be specified in the ETL strategy so they can be incorporated in development. |
| Correctness of calculations | Identify fields involving complex calculations. Recalculate once loading is complete. Match with previously calculated values. | Russell users to specify critical fields involving complex calculations, and the same should be incorporated. |
| Audit trail | Identify the data to be captured in the audit trail (e.g., table name, number of records on file, records inserted from table). Capture audit attributes during the load process and store them in the audit table. | Any specific audit requirements should be specified in the ETL specs and will be incorporated. |
| Incoming data summary | Identify summary information for input data to be sent in an additional file (file name, number of records, date, etc.). Perform checks on incoming data (e.g., match the record count in the control file against the actual number of files received). Raise an alert in case of mismatch. | Incoming control summary file specification to be provided, and the same should be incorporated in the extract. |
| Business test cases | Test cases will be written to check the business scenarios for audit. Business test cases could be SQL queries that get the data from the target and verify it using existing mainframe data. The choice of business criteria can be identified from the legacy reports or may be provided by Russell. | Information on critical reports to be provided by Russell. |
7.2 Validation
The following validations will be performed to ensure the correctness of the data migrated.
| S.No | Category | Source | Destination | Criteria |
| 1 | Number of physical records | All entities | All entities | Exact match or deviation justified |
| 2 | Sum | Field-1, Table-1; Field-2, Table-2; Field-1, Table-1 | Field-1, Table-1; Field-2, Table-2; Field-1, Table-3 | Exact match or deviation justified |
| 3 | Sum against a branch | Field-1, Table-1 | Field-1, Table-1 | Exact match or deviation justified |
| 4 | Total number of active affiliates | Field-1, Table-1 | Field-1, Table-1 | Exact match or deviation justified |
| 5 | Total number of deceased affiliates | Field-1, Table-1 | Field-1, Table-1 | Exact match or deviation justified |
| 6 | Totals | Field-1, Table-1 | Field-1, Table-1 | Exact match or deviation justified |
| 7 | Status fields | Group By count | Group By count | Exact match or deviation justified |
| 8 | Null fields | Count Field-1; Count Field-2 | Count Field-1; Count Field-2 | Exact match |
| 9 | Blank fields | Count Field-1; Count Field-2 | Count Field-1; Count Field-2 | Exact match |
| 10 | Not null fields | Count Field-1 | Count Field-1 | Exact match |
| 11 | Duplicate rows | Table-1; Table-2 | Table-1; Table-2 | Exact match |
| 12 | Deleted rows | | | Justify |
| 13 | Key fields (RUT, Folio Number) | Group by range | Group by range | Exact match |
| 14 | Name fields | Compare by key | Compare by key | Exact match |
| 16 | Round off | Verify correct decimal places | Verify correct decimal places | Exact match |
| 17 | Truncation error on identified field | Correct truncation | Correct truncation | Exact match |
| 18 | Exceptions | Defined | Defined | Validate |
| 19 | Bad records | Identify | Defined | Validate |
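For illustration, some of these checks translate directly into simple queries; the table and field names below are placeholders, matching the Field-1 / Table-1 notation used above.

-- Null field check (category 8): null counts per field compared between source and destination.
SELECT  (SELECT COUNT(*) FROM Source_Table1 WHERE Field1 IS NULL) AS SourceNullCount,
        (SELECT COUNT(*) FROM Target_Table1 WHERE Field1 IS NULL) AS TargetNullCount;

-- Duplicate row check (category 11): key values that appear more than once in the destination.
SELECT  Field1, COUNT(*) AS Occurrences
FROM    Target_Table1
GROUP BY Field1
HAVING  COUNT(*) > 1;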
7.3 Audit
Audit rules are expected to be defined by the Russell team in the following format. The auditing should be done based on reliable reports from the business. Business reports to be used for auditing will be provided by Russell.
| No | Category | Source | Destination | Criteria |
| | | | | |
Assumptions
· Data migration suite available (extraction, transformation and load routines)
· Audit and validation routines available
· Source data for migration is available
8 Pre-Implementation Dry Runs
A pre-implementation or dry run is a simulation of the production implementation in the test environment. The objective is to understand the complexities during implementation, in terms of the window for data migration and the infrastructure requirements, and to fine-tune the programs and implementation procedures if required. The data migration implementation is planned in two phases.
Assumptions
o A tested data migration suite is available for the current implementation phase
o A test environment simulated from production is available
o Source data for the pre-implementation dry run is available
9 Implementation
The implementation phase comprises the activities for performing the actual production data migration.
The implementation of the data migration depends mainly on the implementation window, the volume of data to be migrated and the type of data. The implementation strategy will be finalized after further analysis of the data and further discussions.
Points to be considered in adopting this approach:
1. All the backup data can be extracted in 48 hours
2. All the reference data can be extracted in 48 hours
3. All the transaction, master and catch-up reference data can be extracted, cleansed, transformed and loaded in 48 hours
4. Additional effort is involved in doing catch-up for reference data
5. Testing of the data will be done during the parallel run time
6. Incremental migration may be required for reference data
10 Technical Constraints
10.1 Performance and Volume
As part of the initial load, 30 million rows will be extracted from the security history tables. The SSIS throughput should be between 800 and 1,000 rows per second. The objective is to maintain high ETL performance by sizing the database server capacity in terms of the number of processors and memory.
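As a rough sizing check, at the stated throughput the 30 million row initial load alone would take roughly 30,000,000 / 1,000 ≈ 30,000 seconds (about 8.3 hours) at 1,000 rows per second, or about 10.4 hours at 800 rows per second, which is why the throughput target and server sizing matter for the 48-hour cut-over window assumed in Section 1.8.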
10.2 ETL Job Failure
In case of any job failure during the data loads, the ETL load will restart from the beginning with "Update else Insert" logic. This logic will be implemented within the ETL tasks.
The objective of this methodology is to avoid the insertion of duplicate data when failed jobs are re-run (a minimal sketch of this pattern follows the examples below).
Some examples of failures caused by unexpected errors during the ETL process are:
· Primary key constraint violations
· Bad data in the source
· Database disconnections during the process
· Data type conversion errors
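A minimal sketch of the "Update else Insert" pattern on SQL Server 2008 is shown below: an UPDATE of rows that already exist in the target, followed by an INSERT of rows that do not, so that re-running the load after a failure does not duplicate data. All table and column names are assumptions.

-- "Update else Insert" re-run logic (illustrative names).
-- 1) Update rows that already reached the target on a previous (failed) run.
UPDATE  tgt
SET     tgt.SecName  = src.SecName,
        tgt.Currency = src.Currency
FROM    SRM.dbo.SECURITY AS tgt
JOIN    Staging2.SECURITY AS src
    ON  src.SecurityId = tgt.SecurityId;

-- 2) Insert only the rows that are not yet present in the target.
INSERT INTO SRM.dbo.SECURITY (SecurityId, SecName, Currency)
SELECT  src.SecurityId, src.SecName, src.Currency
FROM    Staging2.SECURITY AS src
WHERE   NOT EXISTS (SELECT 1
                    FROM SRM.dbo.SECURITY AS tgt
                    WHERE tgt.SecurityId = src.SecurityId);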
10.3 ETL Job Scheduler
11 Appendix
11.1 Reference (Links)
| Reference | Description |
| Functional Spec DM – IPC & SRM | Functional Specifications document for IPC & SRM Data Migration |
11.2 Conceptual Data Model
[Conceptual data model diagram]
11.3 ERD Data Model
ERD diagrams for the tables in scope for the three systems:
[ERD data model diagrams]