Scraper CSV validation in CI
Problem
When working on scraper overrides or relation files, it is important to receive understandable feedback quickly. Currently, the error messages from the scraper pipeline are difficult to interpret and do not point to the exact position of the problematic data. Some issues are not caught until someone notices them in production.
To improve this, we want to add validation for all manually editable CSV files. It should be possible to define a set of data constraints for each CSV file and column. We are therefore looking for a library or framework that allows defining and enforcing such constraints.
Constraints
- Basic validators should be easy to define (strings, numbers, enums, regex, arrays...)
- Custom validators must be supported (e.g. checking whether the unique_name of a relation refers to a plant we know about)
- Must support custom error messages with filename and line numbers
- Should be easy to understand and extend
- Should not abort validation at the first error
Assumptions
- Biologists can read error messages from CI jobs
- CSV file sizes will not grow substantially (i.e. not into the gigabyte range)
- Validation of the CSV format itself (encoding, csv separator, line endings, ...) can be a separate CI step / solution
Solutions
CSV Schema
CSV Schema is a schema language for validating CSV files, with a reference implementation written in Java. The language itself is readable without programming knowledge.
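For illustration, a schema in this language could look roughly like the following fragment. The column names are hypothetical, and the exact rule names and syntax should be checked against the CSV Schema 1.1 specification:

```
version 1.1
@totalColumns 3
unique_name: notEmpty
spread: range(0, 1000)
rank: is("genus") or is("species")
```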
Pros:
- Supports many validators
- Easy to define simple rules
- Can be read without having programming knowledge
- Includes support for CSV format validation
Cons:
- Does not support custom validation functions; rules are limited to combinations of the built-in expressions
- Introduces a new language (albeit schema language) to the project
Validation in PostgreSQL only
One approach would be to implement all validation as checks and triggers in the database.
Pros:
- Ensures data integrity
- Avoids possible duplication of constraints. With other solutions we might want to add important constraints to the database too.
Cons:
- No validation of CSV format itself
- Errors would be found quite late in the scraper pipeline, at which point previous steps could have already failed because of the problematic data
- Since validation happens late, it is also much harder to generate helpful error messages. To reference which CSV lines caused a failed validation, that information would have to be carried all the way through the scraper. Because many CSV lines can modify a single plant, we would need to track the complete history of each plant's attributes.
JavaScript Validation Library Zod
Zod is a library that validates JavaScript objects based on a schema which is also defined in JavaScript. It has no functions to handle CSV files, so we would have to write a small wrapper that reads each file, validates every row, and prints any error messages.
Pros:
- Easy to use and extend
- Many validators included
- Widely used and good documentation
- Also used in the permaplant frontend
Cons:
- No validation of CSV format itself
- Only per-object constraints out of the box (i.e. per CSV row)
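The per-row limitation matters for constraints that span a whole file, such as uniqueness of unique_name; those need an extra pass outside the row schema. A minimal sketch of such a pass, assuming rows have already been parsed into objects (the column name and sample data are illustrative):

```javascript
// Cross-row check sketch: a row schema validates one record at a time,
// so file-level constraints such as uniqueness need a separate pass.
function findDuplicates(rows, column) {
  const seen = new Map(); // value -> line number of first occurrence
  const errors = [];
  rows.forEach((record, i) => {
    const value = record[column];
    const line = i + 2; // line 1 is the CSV header
    if (seen.has(value)) {
      errors.push(
        `line ${line}: duplicate ${column} "${value}" (first seen on line ${seen.get(value)})`
      );
    } else {
      seen.set(value, line);
    }
  });
  return errors; // collect everything instead of aborting on the first hit
}

const rows = [
  { unique_name: "Allium cepa" },
  { unique_name: "Daucus carota" },
  { unique_name: "Allium cepa" },
];
console.log(findDuplicates(rows, "unique_name"));
```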
Python Data Validation Framework Pandera / Pandas
Similar to Zod, Pandera allows the definition of a schema and then validates dataframes loaded with Pandas. The ease of use and extensibility is similar to Zod, but Pandas also includes functions to work with CSV files.
Pros:
- Easy to use and extend
- Many validators included
- Stronger CSV integration than Zod
Cons:
- Uses Python, while the rest of the scraper uses JavaScript
- Only per-object constraints out of the box (i.e. per CSV row)
JavaScript Test Framework
We could build our own CSV validation on top of a testing framework like Vitest. Without an additional data validation library, however, this would mean a lot of extra effort.
Pros:
- Output compatible with CI
- Most customizable
- No new dependencies beyond the test framework
Cons:
- Most effort if not using an additional library for data validation
- No validation of CSV format itself
- Hard to define the scope of a test case in this context: a CSV file? Then a test would abort on the first error. A row in a CSV file? Then there would be a lot of unnecessary output.
Decision
Our current preferred solution is Zod for validating CSV files in CI.
Rationale
- Zod is already used in the PermaplanT frontend
- The scraper is also written in JavaScript, keeping everything in the same language
- Supports custom validators
- Trivial to write a wrapper script that reads each CSV file, parses each row, and collects errors and line numbers to print at the end
- Easy to extend and well documented
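For the custom validators, Zod's `.refine()` mechanism fits the unique_name check mentioned in the constraints. The sketch below mimics the shape of such a check without the dependency, so it runs standalone; the plant names and message are made up for illustration:

```javascript
// With Zod, this check would be roughly:
//   z.string().refine((n) => knownPlants.has(n), { message: "unknown plant" })
// Here the same idea is written as a plain function.
// The plant names below are illustrative, not real project data.
const knownPlants = new Set(["Allium cepa", "Daucus carota"]);

function validUniqueName(name) {
  return knownPlants.has(name)
    ? { success: true }
    : { success: false, message: `"${name}" is not a known plant` };
}

console.log(validUniqueName("Solanum lycopersicum").message);
```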
Implications
- In addition to the validators, a wrapper script must be written to handle CSV reading, row iteration, error collection and reporting
- CSV format validation (encoding, separator, line endings) remains a separate step
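The wrapper script could be sketched as follows. To keep the sketch runnable without the dependency, it uses a stub schema that mimics the return shape of Zod's safeParse() ({ success } on success, error.issues with path and message on failure), so a real z.object(...) schema could be substituted unchanged. File name, column names and messages are illustrative:

```javascript
// Validate every row of a CSV string against a schema, collecting all
// errors with file name and line number instead of aborting on the first.
// `schema` is anything with Zod's safeParse() shape.
function validateCsv(filename, text, schema) {
  // Naive split; quoting/encoding issues are a separate CI step.
  const [headerLine, ...rows] = text.trim().split("\n");
  const header = headerLine.split(",");
  const errors = [];
  rows.forEach((row, i) => {
    const record = Object.fromEntries(
      row.split(",").map((value, j) => [header[j], value])
    );
    const result = schema.safeParse(record);
    if (!result.success) {
      for (const issue of result.error.issues) {
        // The header is line 1, so data rows start at line 2.
        errors.push(
          `${filename}:${i + 2} column "${issue.path.join(".")}": ${issue.message}`
        );
      }
    }
  });
  return errors;
}

// Stand-in for a Zod schema so the sketch runs without the dependency:
const stubSchema = {
  safeParse(record) {
    const issues = [];
    if (!record.unique_name)
      issues.push({ path: ["unique_name"], message: "must not be empty" });
    if (!/^\d+$/.test(record.spread))
      issues.push({ path: ["spread"], message: "must be an integer" });
    return issues.length
      ? { success: false, error: { issues } }
      : { success: true, data: record };
  },
};

const csv = "unique_name,spread\nAllium cepa,30\n,abc\n";
console.log(validateCsv("plants.csv", csv, stubSchema).join("\n"));
```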