|
|
|
Pharmaceutical Safety Data |
Data Warehouse Data
Verification by RTTS
|
|
Pharmaceutical |
|
A pharmaceutical firm organizes its safety data by pulling
information from multiple source systems into a data
warehouse. At the time of this writing, the data warehouse
holds in the range of 1,500,000 cases. An average case size
is roughly about 67 KB of data, which, with case versioning,
brings the size of the warehouse to about 500 Gigabyte (GB).
The goal of the software quality effort is to verify the
data integrity of each new build of the ETL (Extract,
Transform, Load) and database code by sampling the data in
the source and target systems, with a target sampling for
each run of 50% of the case data. |
|
How can a data warehouse software quality effort monitor
data integrity over multiple builds when the volume of live
case data is in the range of about 100 GB?
|
|
Implement a high volume data sampling effort that allows
upwards of 50% of the data to be verified for each build. |
-
RTTS used its
proprietary Data
Warehouse Test Framework
(DWTF) to set up
SQL-based data
comparisons between
source systems and the
data warehouse.
-
Approximately 1000 SQL
queries were written
against the combined
systems to compare case
data from the source
systems to the analogous
data in the data
warehouse after ETL.
-
Approximately 45% of the
case data was sampled in
each run
-
The DWTF was used to vet
the data quality of the
system after ETL for
each build of the
system.
During this development
cycle, 14 builds were deployed by the
development team and vetted by the software
quality team. Data integrity runs were
performed on each build in which new ETL
code was deployed, sampling ~45% of the
total case data (ca. 700,000 cases) on each
build.
Data throughput on the system is increased
simply by increasing the number of DWTF
agents used in a run. For the runs described
here, 11 clients were used and a complete
run finishes in ~72 hours. This permitted
the development team to deploy ca. 2 builds
per week, which the software quality team
could provide solid “build health” numbers
for.
During the development cycle, approximately
115 defects were discovered and remediated.
As is expected, discovery of defects leveled
off toward the end of the development cycle.
During the build cycles, the defect rate per build was monitored as
new functionality was delivered to the software quality team. Because
of the complexity of the application, the delivery of new functionality
often caused defects to appear in existing functionality. Build 3 was
clearly a highly problematic build. Around build 9, the team hit the
“coalescence point” where the defect rate was reduced with each build.
The final builds brought the number of defects to 0 and the release
candidate was deployed.
|
-
The automated framework
increased testing
coverage by
approximately 10-fold
and decreased testing
time by a factor of 3.
-
The software quality
team was able to vet
each build at the same
high level, so build
quality could be
compared and the
development trend was
available to the whole
team.
-
Automated data
verification of the
source-to-data warehouse
leg for each source
permits rapid
localization of defects
to specific blocks of
ETL code.
-
Using a large sample
size leads to
high-confidence-interval
results.
|
|
|
|
|
|
| |