Test data quality during data migration
Last modified on 29-May-24
Use this guide to install and set up Soda to test the quality of data in a data migration project. Test data quality on both source and target, both before and after migration, to prevent data quality issues from polluting a new data source.
(Not quite ready for this big gulp of Soda? 🥤 Try taking a sip, first.)
About this guide
Prepare for data migration
Install and set up Soda
Migrate data in staging
Reconcile data and migrate in production
Go further
About this guide
The instructions below offer Data Engineers an example of how to set up Soda and use reconciliation checks to compare data quality between data sources before and after migrating data.
For context, this guide presents an example of how you could use Soda to prepare to migrate data from one data source, such as PostgreSQL, to another, such as Snowflake. It makes suggestions about how to prepare for a data migration and how to use a production environment to assess data quality before migrating data in production.
This example uses a self-operated deployment model which uses Soda Library and Soda Cloud, though you could just as easily use a self-hosted agent model (Soda Agent and Soda Cloud) instead.
Prepare for data migration
This example imagines moving data from PostgreSQL to Snowflake. The following outlines the high-level steps involved in preparing for and executing such a project.
- Confirm your access to the source data in a PostgreSQL data source; you have the authorization and access credentials to query the data.
- Set up or confirm that you have a Snowflake account and the authorization and credentials to set up and query a new data source.
- Confirm that you have a data orchestration tool such as Airflow to extract data from PostgreSQL, perform any transformations, then load the data into Snowflake. Refer to Migrating data using Airflow for an Airflow setup example.
- Install and set up Soda to run regular tests for data quality on the source data. Use this opportunity to make sure that the quality of the data you are about to migrate is in a good state. Ideally, you perform this step in a production environment, before replicating the source data source in a staging environment, to ensure that you begin the project with good-quality data.
- Back up the existing data in the PostgreSQL source data source, and create a staging environment that replicates the production PostgreSQL data source.
- Use Airflow to execute the data migration from PostgreSQL to Snowflake in a staging environment.
- In the staging environment, use Soda to run reconciliation checks on both the source and target data sources to validate that the data has been transformed and loaded as expected, and that the quality of data in the target is sound.
- Adjust your data transformations as needed in order to address any issues surfaced by Soda. Repeat the data migration in staging, checking the quality after each run, until you are satisfied with the outcome and the data that loads into the target Snowflake data source.
- Prepare an Airflow DAG to execute the data migration in production. Execute the data migration in production, then use Soda to scan for data quality on the target data source as a final validation.
- (Optional) For regular migration events, consider invoking Soda scans for data quality after extract or transformation step(s) in the DAG.
Install and set up Soda
What follows is an abridged version of installing and configuring Soda for PostgreSQL. Refer to the full installation guide for details.
- In a browser, navigate to cloud.soda.io/signup to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.
- Navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate new API keys. Copy+paste the API key values to a temporary, secure place in your local environment.
- With Python 3.8 or greater and Pip 21.0 or greater, use the command-line to install Soda locally in a new virtual environment.
python3 -m venv .venv
source .venv/bin/activate
pip install -i https://pypi.cloud.soda.io soda-postgres
- In a code editor, create a new file called configuration.yml, then copy+paste the following config details into the file. Provide your own values for the fields, using the API key and secret values you created in Soda Cloud. Replace the value of my_database_name with the name of your PostgreSQL data source.
data_source my_database_name:
  type: postgres
  host:
  port:
  username:
  password:
  database:
  schema:
soda_cloud:
  # For US region, use cloud.us.soda.io
  # For EU region, use cloud.soda.io
  host: cloud.soda.io
  api_key_id:
  api_key_secret:
- Save the file. From the command-line, in the same directory in which you created the configuration.yml, run the following command to test Soda's connection to your data source. Replace the value of my_datasource with the name of your own PostgreSQL data source.
soda test-connection -d my_datasource -c configuration.yml
- To create some basic checks for data quality, run the following command to launch check suggestions, which auto-generates checks using the Soda Checks Language (SodaCL), a domain-specific language for data quality testing.
  - Identify a dataset in your data source to use as the value for the -ds option in the command below.
  - Replace the value of my_datasource with the name of your own PostgreSQL data source.
  - Answer the questions in the command-line and, at the end, select y to perform a scan using the suggested checks.
soda suggest -d my_datasource -c configuration.yml -ds your_dataset_name
- In a browser, log in to your Soda Cloud account, then navigate to the Checks dashboard. Here, you can review the results of the checks that Soda executed in the first scan for data quality. After a scan, each check results in one of three default states:
  - pass: the values in the dataset match or fall within the thresholds you specified
  - fail: the values in the dataset do not match or fall within the thresholds you specified
  - error: the syntax of the check is invalid, or there are runtime or credential errors
- Based on the check results of the first scan, address any data quality issues that Soda surfaced so that your data migration project begins with good-quality data. Refer to Run a scan and review results for more detail.
- If you wish, open the checks.yml file that check suggestions saved locally for you and add more checks for data quality, then use the following command to run the scan again. Refer to SodaCL reference for exhaustive details on all types of checks.
soda scan -d my_datasource -c configuration.yml checks.yml
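For instance, hand-written additions to the generated checks.yml might look like the following sketch. The dataset and column names here (dim_product, product_id) are illustrative placeholders; substitute names from your own data source.
# Hypothetical additions to the generated checks.yml
checks for dim_product:
  - row_count > 0                    # dataset must not be empty
  - missing_count(product_id) = 0    # identifier column is never NULL
  - duplicate_count(product_id) = 0  # identifier column is unique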
Migrate data in staging
- Having tested data quality on the PostgreSQL data source, best practice dictates that you back up the existing data in the PostgreSQL data source, then replicate both the PostgreSQL and an empty Snowflake data source in a staging environment.
- As in the example that follows, add two more configurations to your configuration.yml for:
  - the PostgreSQL staging data source
  - the Snowflake staging data source
data_source fulfillment_apac_prod:
  type: postgres
  host: 127.0.0.1
  port: '5432'
  username: ${POSTGRES_USER}
  password: ${POSTGRES_PASSWORD}
  database: postgres
  schema: public
data_source fulfillment_apac_staging:
  type: postgres
  host: localhost
  port: '5432'
  username: ${POSTGRES_USER}
  password: ${POSTGRES_PASSWORD}
  database: postgres
  schema: public
data_source fulfillment_apac1_staging:
  type: snowflake
  username: ${SNOWFLAKE_USER}
  password: ${SNOWFLAKE_PASSWORD}
  account: my_account
  database: snowflake_database
  warehouse: snowflake_warehouse
  connection_timeout: 240
  role: PUBLIC
  client_session_keep_alive: true
  authenticator: externalbrowser
  session_parameters:
    QUERY_TAG: soda-queries
    QUOTED_IDENTIFIERS_IGNORE_CASE: false
  schema: public
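The ${POSTGRES_USER}-style values above are placeholders that Soda resolves from environment variables at scan time, which keeps credentials out of the YAML file. As a rough illustration of the pattern only (not Soda's actual implementation), the substitution works like this:

```python
import os
import re

def resolve_env_placeholders(text: str) -> str:
    """Replace ${VAR} placeholders with values from environment variables."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), text)

# Illustrative value; in practice the variable is exported in your shell or CI.
os.environ["POSTGRES_USER"] = "soda_etl"
print(resolve_env_placeholders("username: ${POSTGRES_USER}"))  # username: soda_etl
```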
- Run the following commands to test the connection to each new data source in the staging environment.
soda test-connection -d fulfillment_apac_staging -c configuration.yml
soda test-connection -d fulfillment_apac1_staging -c configuration.yml
- Using an orchestrator such as Airflow, migrate your data in the staging environment from PostgreSQL to Snowflake, making any necessary transformations to the data to populate the new data source. Refer to Migrating data using Airflow for an Airflow setup example.
Reconcile data and migrate in production
- With both source and target data sources, you can use SodaCL reconciliation checks to compare the data in the target to the source to validate that it is as expected and free of data quality concerns.
Begin by using a code editor to prepare a recon.yml file in the same directory in which you installed Soda, as per the following example which identifies the source and target datasets to compare, and defines basic checks to compare schemas and row counts.
reconciliation OrdersAPAC:
  label: "Recon APAC orders"
  datasets:
    source:
      dataset: orders_apac
      datasource: fulfillment_apac_staging
    target:
      dataset: orders_apac
      datasource: fulfillment_apac1_staging
  checks:
    - schema
    - row_count diff = 0
- Referencing the checks that check suggestions created, add corresponding metric reconciliation checks to the file to surface any discrepancies between the metrics Soda measures for the source and the measurements it collects for the target. Refer to the list of metrics and checks that are compatible as reconciliation checks.
Examples of checks.yml and recon.yml files follow.
# checks.yml prepared by check suggestions
filter dim_product [daily]:
  where: start_date > TIMESTAMP '${NOW}' - interval '1d'
checks for dim_product [daily]:
  - schema:
      name: Any schema changes
      fail:
        when schema changes:
          - column delete
          - column add
          - column index change
          - column type change
  - row_count > 0
  - anomaly detection for row_count
  - freshness(start_date) < 398d
  - missing_count(weight_unit_measure_code) = 0
  - missing_count(color) = 0
  - duplicate_count(safety_stock_level) = 0
# recon.yml
reconciliation OrdersAPAC:
  label: "Recon datasets"
  ...
  checks:
    - schema
    - row_count diff = 0
    - freshness(start_date) diff = 0
    - missing_count(weight_unit_measure_code) diff = 0
    - missing_count(color) diff = 0
    - duplicate_count(safety_stock_level):
        fail: when diff > 10
        warn: when diff between 5 and 9
- Run a scan to execute the checks in the recon.yml file. When you run a scan against either the source or target data source, the Scan summary in the output indicates the check value, which is the calculated delta between measurements, and the measurement value of each metric or check on both the source and target datasets, along with the diff value and percentage. Review the results Soda Library produces in the command-line and/or in the Checks dashboard in Soda Cloud.
soda scan -d fulfillment_apac_staging -c configuration.yml recon.yml
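To interpret those diff values: a reconciliation check compares the measurement of a metric on the source dataset against the same metric on the target. A minimal sketch of the arithmetic, as an illustration only (not Soda's internal code):

```python
def recon_delta(source: float, target: float) -> tuple[float, float]:
    """Return the absolute difference between two measurements and that
    difference expressed as a percentage of the source measurement."""
    diff = abs(target - source)
    pct = (diff / source * 100) if source else float("inf")
    return diff, pct

# e.g. source row_count = 1200, target row_count = 1188
print(recon_delta(1200, 1188))  # (12, 1.0)
```

A check like row_count diff = 0 fails whenever that difference is non-zero; threshold-style checks (fail: when diff > 10) compare the difference against the stated bound instead.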
- Based on the scan results, make adjustments to the transformations in your orchestrated flow and repeat the scan, adding more metric reconciliation checks as needed.
- Compare more source and target datasets by adding more reconciliation blocks to the recon.yml file. Tip: You can run check suggestions against new datasets and use those checks as a baseline for writing metric reconciliation checks for other datasets in your data source.
reconciliation OrdersAPAC:
  label: "Recon APAC orders"
  datasets:
    source:
      dataset: orders_apac
      datasource: fulfillment_apac_staging
    target:
      dataset: orders_apac
      datasource: fulfillment_apac1_staging
  checks:
    - schema
    - row_count diff = 0
reconciliation DiscountAPAC:
  label: "Recon APAC discount"
  datasets:
    source:
      dataset: discount_apac
      datasource: fulfillment_apac_staging
    target:
      dataset: discount_apac
      datasource: fulfillment_apac1_staging
  checks:
    - schema
    - row_count diff = 0
- After reconciling metrics between multiple datasets, consider preparing more granular record reconciliation checks for the most critical data, as in the example below. As these checks execute a row-by-row comparison of data in a dataset, they are resource-heavy relative to metric and schema reconciliation checks. However, for the datasets that matter most, this resource usage is warranted to ensure that the data you migrate remains intact and as expected in the target data source.
reconciliation CommissionAPAC:
  label: "Recon APAC commission"
  datasets:
    source:
      dataset: commission_apac
      datasource: fulfillment_apac_staging
    target:
      dataset: commission_apac
      datasource: fulfillment_apac1_staging
  checks:
    - rows diff = 0
- After reviewing multiple scan summaries and correcting any reconciliation issues between source and target datasets, you can execute the migration in production.
After the migration, use the same recon.yml file to execute a scan on the migrated data in production to confirm that the data in the target is as expected. Adjust the soda scan command to run against your production data source instead of the staging data source.
soda scan -d fulfillment_apac1_prod -c configuration.yml recon.yml
- (Optional) If you intend to execute the migration of data between data sources frequently, you may wish to invoke Soda scans with the reconciliation checks programmatically within your pipeline orchestration, such as in your Airflow DAG. To access an example of how to include Soda scans in a DAG, see Test data in production.
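As a sketch of what such a task could look like (hypothetical code, assuming Soda Library is installed in the worker's environment), you could wrap the CLI invocation in a function that an Airflow PythonOperator calls. Raising on a non-zero exit code makes the DAG task fail when any check fails.

```python
import subprocess

def build_scan_command(data_source: str,
                       config: str = "configuration.yml",
                       checks: str = "recon.yml") -> list[str]:
    """Assemble the soda scan CLI invocation for the given data source."""
    return ["soda", "scan", "-d", data_source, "-c", config, checks]

def run_recon_scan(data_source: str) -> None:
    """Run the scan; soda scan exits non-zero when checks fail or error."""
    result = subprocess.run(build_scan_command(data_source))
    if result.returncode != 0:
        raise RuntimeError(f"Soda scan failed for {data_source}")

# In a DAG (sketch):
# PythonOperator(task_id="soda_recon",
#                python_callable=lambda: run_recon_scan("fulfillment_apac1_prod"))
```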
Go further
- Need help? Join the Soda community on Slack.
- Learn more about reconciliation checks in general.
- Write reconciliation checks that produce failed row samples in Soda Cloud to help you investigate the root cause of data quality issues.