Data Pipeline Case Study

Streamlining a healthcare-focused cloud provider’s data pipeline

Scenario: A healthcare-focused cloud services company regularly ingests data from its customers, and this data pipeline is a key part of its process for onboarding new customers. Data quality was a constant challenge, and new issues appeared frequently. Their customer data came in a variety of formats, with various inconsistencies. Sometimes customer data was missing fields, or the fields were incorrectly formatted. Some customers manually entered their data using different standards. Duplicate records existed. The company handled this challenge by hiring additional staff to process customer data using various tools such as Microsoft Excel. However, this did not fix their problem. Despite increasing costs, data quality issues remained. Even with additional staff, the data pipeline held up new customer onboarding. And, as a result, slowed their business growth.

What is a “Data Pipeline”?

A data pipeline is a workflow for collecting, validating, processing and merging data for use in analysis and presentation. A data pipeline can involve many steps. A large amount of the work in a data pipeline generally resides in the “data prep” phase. Data prep refers to tasks such as data validation, reformatting, merging and identifying and fixing data quality issues. Data prep must be completed before analysis and presentation, or, at a minimum, data prep will improve the accuracy of data analysis and visualization.

Impact on Company Data Pipeline

“The program took 4 weeks development time to build and has reduced customer’s data prep time by 75% in the 6 months since it was delivered… It condensed months of work into 2 weeks.”
— Company Executive in charge of Customer Service

Gaining Insight

The ServiceLaunch team conducted several brief working sessions with company staff to review their data workflow, data quality issues, and the ingest process for sample data. Specific security requirements applied to this data, and it needed to be processed promptly. Sending this data to an external offshore team was not a viable option. Our team began development, using an Agile release process to rapidly incorporate user feedback. This process speeded up deployment, as the user community already knew how to use the solution.

Our Solution – Custom Lightweight Automation

The program automates data processing, validation, and fixing of data quality issues. The final output is a fully cleansed data file that the program automatically loads into a database to enable downstream business processes.

The program reads a set of Excel and delimited text files, supplied by customers, and determines their structure.
Concurrently, it analyzes the data and assigns default business rules for each column, when possible. For example, apply an “email” rule if most data in a column looks like an email address.
An Excel worksheet stores business rules. New rules can be added as needed, and multiple rules applied to individual columns.
After defining the rules, the program reads the source data.
The program fixes common data quality errors it recognizes.
Next, it produces Excel files with cells color-coded to indicate unrecognized data quality problems, allowing users to easily find and fix them, or create new rules.
The program can run one or more times over the same data set until the data quality is satisfactory. Users can adjust data processing rules as needed throughout the process.
After resolving data quality issues, the program produces a set of structured CSV files for loading into the application database.

Sample data output with color coding to automatically identify unknown data quality issues.

Solution Value

Our solution is easy to use. There is no need to learn to use a complicated new tool.
Users work entirely in Excel, which is familiar and intuitive to them. There is no need to interact with any program code.
No need to install or configure any software. It’s provided as a packaged executable.
No requirement for maintenance, subscription fees, or accounts.
Fast to develop and deploy.
It identifies the structure of source files for you.
Custom-built for your business and your use case. You don’t pay for extra functionality you don’t need.
Flexibility. Users can easily adjust data rules directly within Excel.

Common Data Pipeline Tasks

Removing extra whitespace and invalid characters.
Standardizing input formats for email addresses, phone numbers, dates, and addresses.
Validate data types match definitions.
Validate the uniqueness of individual records (no duplicates).
Identification of missing records (or incomplete records).
Validation of data against a standard format such as Social Security Number (SSN) or Vehicle Identification Number (VIN).
Deduplication of records based on a key field.
Finding and fixing invalid reference values, such as US states or department codes, based on standard and/or custom-defined data dictionaries.