Scraper CSV validation in CI
Problem
When working on scraper overrides or relation files, it is important to receive understandable feedback quickly. Currently, the error messages from the scraper pipeline are difficult to interpret and do not point to the exact position of the problematic data. Some issues are not caught until someone notices them in production.
To improve this, we want to add validation for all manually editable CSV files. It should be possible to define a set of data constraints for each CSV file and column. We are therefore looking for a library or framework that allows defining and enforcing such constraints.
Constraints
- Basic validators should be easy to define (strings, numbers, enums, regex, arrays...)
- Custom validators must be supported (e.g. checking whether the unique_name of a relation refers to a plant we know about)
- Must support custom error messages with filename and line numbers
- Should be easy to understand and extend
- Should not abort validation at the first error
Assumptions
- Biologists can read error messages from CI jobs
- CSV file sizes will not grow substantially (i.e. not into the gigabyte range)
- Validation of the CSV format itself (encoding, csv separator, line endings, ...) can be a separate CI step / solution
Solutions
CSV Schema
CSV Schema is a schema language for validating CSV files, with a reference implementation written in Java. The language itself is readable without programming knowledge.
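For illustration, a schema in this language could look roughly like the following fragment. The column names are hypothetical, and the exact rule names and syntax should be checked against the CSV Schema 1.1 specification:

```
version 1.1
@totalColumns 3
unique_name: notEmpty
spread: range(0, 1000)
rank: is("genus") or is("species")
```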
Pros:
- Supports many validators
- Easy to define simple rules
- Can be read without having programming knowledge
- Includes support for CSV format validation
Cons:
- Does not support custom validation functions; rules are limited to combinations of the built-in expressions
- Introduces a new language (albeit schema language) to the project
Validation in PostgreSQL only
One approach would be to implement all validation as checks and triggers in the database.
Pros:
- Ensures data integrity
- Avoids possible duplication of constraints. With other solutions we might want to add important constraints to the database too.
Cons:
- No validation of CSV format itself
- Errors would be found quite late in the scraper pipeline, at which point previous steps could have already failed because of the problematic data
- Since validation happens late, it is also much harder to generate helpful error messages. To reference which CSV lines caused a failed validation, that information would have to be carried all the way through the scraper. Because many CSV lines can modify a single plant, we would need to track the complete history of each plant's attributes.
JavaScript Validation Library Zod
Zod is a library that validates JavaScript objects based on a schema which is also defined in JavaScript. It has no functions to handle CSV files, so we would have to write a small wrapper that reads each file, validates every row, and prints any error messages.
Pros:
- Easy to use and extend
- Many validators included
- Widely used and good documentation
- Also used in the permaplant frontend
Cons:
- No validation of CSV format itself
- Only per-object constraints out of the box (i.e. per CSV row)
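The per-row limitation matters for constraints that span a whole file, such as uniqueness of unique_name; those need an extra pass outside the row schema. A minimal sketch of such a pass, assuming rows have already been parsed into objects (the column name and sample data are illustrative):

```javascript
// Cross-row check sketch: a row schema validates one record at a time,
// so file-level constraints such as uniqueness need a separate pass.
function findDuplicates(rows, column) {
  const seen = new Map(); // value -> line number of first occurrence
  const errors = [];
  rows.forEach((record, i) => {
    const value = record[column];
    const line = i + 2; // line 1 is the CSV header
    if (seen.has(value)) {
      errors.push(
        `line ${line}: duplicate ${column} "${value}" (first seen on line ${seen.get(value)})`
      );
    } else {
      seen.set(value, line);
    }
  });
  return errors; // collect everything instead of aborting on the first hit
}

const rows = [
  { unique_name: "Allium cepa" },
  { unique_name: "Daucus carota" },
  { unique_name: "Allium cepa" },
];
console.log(findDuplicates(rows, "unique_name"));
```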
Python Data Validation Framework Pandera / Pandas
Similar to Zod, Pandera allows the definition of a schema and then validates dataframes loaded with Pandas. The ease of use and extensibility is similar to Zod, but Pandas also includes functions to work with CSV files.
Pros:
- Easy to use and extend
- Many validators included
- Stronger CSV integration than Zod
Cons:
- Uses Python, while the rest of the scraper uses JavaScript
- Only per-object constraints out of the box (i.e. per CSV row)
JavaScript Test Framework
We could build our own CSV validation on top of a testing framework like Vitest. Without an additional data validation library, however, this would mean a lot of extra effort.
Pros:
- Output compatible with CI
- Most customizable
- No new dependencies beyond the test framework
Cons:
- Most effort if not using an additional library for data validation
- No validation of CSV format itself
- Hard to define the scope of a test case in this context: a CSV file? Then a test would abort on the first error. A row in a CSV file? Then there would be a lot of unnecessary output.
Decision
Our current preferred solution is Zod for validating CSV files in CI.
Rationale
- Zod is already used in the PermaplanT frontend
- The scraper is also written in JavaScript, keeping everything in the same language
- Supports custom validators
- Trivial to write a wrapper script that reads each CSV file, parses each row, and collects errors and line numbers to print at the end
- Easy to extend and well documented
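For the custom validators, Zod's `.refine()` mechanism fits the unique_name check mentioned in the constraints. The sketch below mimics the shape of such a check without the dependency, so it runs standalone; the plant names and message are made up for illustration:

```javascript
// With Zod, this check would be roughly:
//   z.string().refine((n) => knownPlants.has(n), { message: "unknown plant" })
// Here the same idea is written as a plain function.
// The plant names below are illustrative, not real project data.
const knownPlants = new Set(["Allium cepa", "Daucus carota"]);

function validUniqueName(name) {
  return knownPlants.has(name)
    ? { success: true }
    : { success: false, message: `"${name}" is not a known plant` };
}

console.log(validUniqueName("Solanum lycopersicum").message);
```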
Implications
- In addition to the validators, a wrapper script must be written to handle CSV reading, row iteration, error collection and reporting
- CSV format validation (encoding, separator, line endings) remains a separate step
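The wrapper script could be sketched as follows. To keep the sketch runnable without the dependency, it uses a stub schema that mimics the return shape of Zod's safeParse() ({ success } on success, error.issues with path and message on failure), so a real z.object(...) schema could be substituted unchanged. File name, column names and messages are illustrative:

```javascript
// Validate every row of a CSV string against a schema, collecting all
// errors with file name and line number instead of aborting on the first.
// `schema` is anything with Zod's safeParse() shape.
function validateCsv(filename, text, schema) {
  // Naive split; quoting/encoding issues are a separate CI step.
  const [headerLine, ...rows] = text.trim().split("\n");
  const header = headerLine.split(",");
  const errors = [];
  rows.forEach((row, i) => {
    const record = Object.fromEntries(
      row.split(",").map((value, j) => [header[j], value])
    );
    const result = schema.safeParse(record);
    if (!result.success) {
      for (const issue of result.error.issues) {
        // The header is line 1, so data rows start at line 2.
        errors.push(
          `${filename}:${i + 2} column "${issue.path.join(".")}": ${issue.message}`
        );
      }
    }
  });
  return errors;
}

// Stand-in for a Zod schema so the sketch runs without the dependency:
const stubSchema = {
  safeParse(record) {
    const issues = [];
    if (!record.unique_name)
      issues.push({ path: ["unique_name"], message: "must not be empty" });
    if (!/^\d+$/.test(record.spread))
      issues.push({ path: ["spread"], message: "must be an integer" });
    return issues.length
      ? { success: false, error: { issues } }
      : { success: true, data: record };
  },
};

const csv = "unique_name,spread\nAllium cepa,30\n,abc\n";
console.log(validateCsv("plants.csv", csv, stubSchema).join("\n"));
```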