/*!
A tutorial for handling CSV data in Rust.

This tutorial will cover basic CSV reading and writing, automatic
(de)serialization with Serde, CSV transformations and performance.

This tutorial is targeted at beginner Rust programmers. Experienced Rust
programmers may find this tutorial to be too verbose, but skimming may be
useful. There is also a
[cookbook](../cookbook/index.html)
of examples for those who prefer more information density.

For an introduction to Rust, please see the
[official book](https://doc.rust-lang.org/book/second-edition/).
If you haven't written any Rust code yet but have written code in another
language, then this tutorial might be accessible to you without needing to read
the book first.

# Table of contents

1. [Setup](#setup)
1. [Basic error handling](#basic-error-handling)
    * [Switch to recoverable errors](#switch-to-recoverable-errors)
1. [Reading CSV](#reading-csv)
    * [Reading headers](#reading-headers)
    * [Delimiters, quotes and variable length records](#delimiters-quotes-and-variable-length-records)
    * [Reading with Serde](#reading-with-serde)
    * [Handling invalid data with Serde](#handling-invalid-data-with-serde)
1. [Writing CSV](#writing-csv)
    * [Writing tab separated values](#writing-tab-separated-values)
    * [Writing with Serde](#writing-with-serde)
1. [Pipelining](#pipelining)
    * [Filter by search](#filter-by-search)
    * [Filter by population count](#filter-by-population-count)
1. [Performance](#performance)
    * [Amortizing allocations](#amortizing-allocations)
    * [Serde and zero allocation](#serde-and-zero-allocation)
    * [CSV parsing without the standard library](#csv-parsing-without-the-standard-library)
1. [Closing thoughts](#closing-thoughts)

# Setup

In this section, we'll get you set up with a simple program that reads CSV data
and prints a "debug" version of each record. This assumes that you have the
[Rust toolchain installed](https://www.rust-lang.org/install.html),
which includes both Rust and Cargo.

We'll start by creating a new Cargo project:

```text
$ cargo new --bin csvtutor
$ cd csvtutor
```

Once inside `csvtutor`, open `Cargo.toml` in your favorite text editor and add
`csv = "1.1"` to your `[dependencies]` section. At this point, your
`Cargo.toml` should look something like this:

```text
[package]
name = "csvtutor"
version = "0.1.0"
authors = ["Your Name"]

[dependencies]
csv = "1.1"
```

Next, let's build your project. Since you added the `csv` crate as a
dependency, Cargo will automatically download it and compile it for you. To
build your project, use Cargo:

```text
$ cargo build
```

This will produce a new binary, `csvtutor`, in your `target/debug` directory.
It won't do much at this point, but you can run it:

```text
$ ./target/debug/csvtutor
Hello, world!
```

Let's make our program do something useful. Our program will read CSV data on
stdin and print debug output for each record on stdout. To write this program,
open `src/main.rs` in your favorite text editor and replace its contents with
this:

```no_run
//tutorial-setup-01.rs
// Import the standard library's I/O module so we can read from stdin.
use std::io;

// The `main` function is where your program starts executing.
fn main() {
    // Create a CSV parser that reads data from stdin.
    let mut rdr = csv::Reader::from_reader(io::stdin());
    // Loop over each record.
    for result in rdr.records() {
        // An error may occur, so abort the program in an unfriendly way.
        // We will make this more friendly later!
        let record = result.expect("a CSV record");
        // Print a debug version of the record.
        println!("{:?}", record);
    }
}
```

Don't worry too much about what this code means; we'll dissect it in the next
section. For now, try rebuilding your project:

```text
$ cargo build
```

Assuming that succeeds, let's try running our program. But first, we will need
some CSV data to play with! For that, we will use a random selection of 100
US cities, along with their population size and geographical coordinates. (We
will use this same CSV data throughout the entire tutorial.) To get the data,
download it from GitHub:

```text
$ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop.csv'
```

And now finally, run your program on `uspop.csv`:

```text
$ ./target/debug/csvtutor < uspop.csv
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"])
# ... and much more
```

# Basic error handling

Since reading CSV data can result in errors, error handling is pervasive
throughout the examples in this tutorial. Therefore, we're going to spend a
little bit of time going over basic error handling, and in particular, fix
our previous example to show errors in a more friendly way. **If you're already
comfortable with things like `Result` and `try!`/`?` in Rust, then you can
safely skip this section.**

Note that
[The Rust Programming Language Book](https://doc.rust-lang.org/book/second-edition/)
contains an
[introduction to general error handling](https://doc.rust-lang.org/book/second-edition/ch09-00-error-handling.html).
For a deeper dive, see
[my blog post on error handling in Rust](http://blog.burntsushi.net/rust-error-handling/).
The blog post is especially important if you plan on building Rust libraries.

With that out of the way, error handling in Rust comes in two different forms:
unrecoverable errors and recoverable errors.

Unrecoverable errors generally correspond to things like bugs in your program,
which might occur when an invariant or contract is broken. At that point, the
state of your program is unpredictable, and there's typically little recourse
other than *panicking*. In Rust, a panic is similar to simply aborting your
program, but it will unwind the stack and clean up resources before your
program exits.
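To make the unwinding behavior concrete, here is a minimal sketch that uses only the standard library. The `Guard` type, the `CLEANED_UP` flag and the `break_invariant` function are hypothetical names invented for this illustration, and the sketch assumes the default `panic = "unwind"` strategy:

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};

// Records whether the guard's destructor ran.
static CLEANED_UP: AtomicBool = AtomicBool::new(false);

// Hypothetical resource whose cleanup we want to observe.
struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        CLEANED_UP.store(true, Ordering::SeqCst);
    }
}

// Simulates a bug: an invariant is broken while a resource is held.
fn break_invariant() {
    let _guard = Guard;
    panic!("invariant broken");
}

fn main() {
    // `catch_unwind` is used here only to observe the panic.
    let result = panic::catch_unwind(break_invariant);
    assert!(result.is_err());
    // The destructor ran while the stack was unwinding.
    assert!(CLEANED_UP.load(Ordering::SeqCst));
}
```

Note that `catch_unwind` appears here purely for demonstration; it is not a substitute for handling predictable errors with `Result`.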

On the other hand, recoverable errors generally correspond to predictable
errors. A non-existent file or invalid CSV data are examples of recoverable
errors. In Rust, recoverable errors are handled via `Result`. A `Result`
represents the state of a computation that has either succeeded or failed.
It is defined like so:

```
enum Result<T, E> {
    Ok(T),
    Err(E),
}
```

That is, a `Result` either contains a value of type `T` when the computation
succeeds, or it contains a value of type `E` when the computation fails.
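As a small illustration that uses only the standard library, `str::parse` returns a `Result`, and a `match` can handle both variants. The `describe` helper below is a hypothetical name invented for this sketch:

```rust
use std::num::ParseIntError;

/// Hypothetical helper: parse a population count and describe the outcome.
fn describe(input: &str) -> String {
    // `str::parse` returns `Result<u64, ParseIntError>` here.
    let result: Result<u64, ParseIntError> = input.parse();
    match result {
        // The computation succeeded, so we have a value of type `u64`.
        Ok(n) => format!("parsed {}", n),
        // The computation failed, so we have a value of type `ParseIntError`.
        Err(err) => format!("failed: {}", err),
    }
}

fn main() {
    println!("{}", describe("7610"));
    println!("{}", describe("abc"));
}
```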

The relationship between unrecoverable errors and recoverable errors is
important. In particular, it is **strongly discouraged** to treat recoverable
errors as if they were unrecoverable. For example, panicking when a file could
not be found, or if some CSV data is invalid, is considered bad practice.
Instead, predictable errors should be handled using Rust's `Result` type.

With our newfound knowledge, let's re-examine our previous example and dissect
its error handling.

```no_run
//tutorial-error-01.rs
use std::io;

fn main() {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        let record = result.expect("a CSV record");
        println!("{:?}", record);
    }
}
```

There are two places where an error can occur in this program. The first is
if there was a problem reading a record from stdin. The second is if there is
a problem writing to stdout. In general, we will ignore the latter problem in
this tutorial, although robust command line applications should probably try
to handle it (e.g., when a broken pipe occurs). The former however is worth
looking into in more detail. For example, if a user of this program provides
invalid CSV data, then the program will panic:

```text
$ cat invalid
header1,header2
foo,bar
quux,baz,foobar
$ ./target/debug/csvtutor < invalid
StringRecord(["foo", "bar"])
thread 'main' panicked at 'a CSV record: Error(UnequalLengths { pos: Some(Position { byte: 24, line: 3, record: 2 }), expected_len: 2, len: 3 })', src/main.rs:13:29
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

What happened here? First and foremost, we should talk about why the CSV data
is invalid. The CSV data consists of three records: a header and two data
records. The header and first data record have two fields, but the second
data record has three fields. By default, the csv crate will treat inconsistent
record lengths as an error.
(This behavior can be toggled using the
[`ReaderBuilder::flexible`](../struct.ReaderBuilder.html#method.flexible)
config knob.) This explains why the first data record is printed in this
example, since it has the same number of fields as the header record. That is,
we don't actually hit an error until we parse the second data record.

(Note that the CSV reader automatically interprets the first record as a
header. This can be toggled with the
[`ReaderBuilder::has_headers`](../struct.ReaderBuilder.html#method.has_headers)
config knob.)

So what actually causes the panic to happen in our program? That would be the
first line in our loop:

```ignore
for result in rdr.records() {
    let record = result.expect("a CSV record"); // this panics
    println!("{:?}", record);
}
```

The key thing to understand here is that `rdr.records()` returns an iterator
that yields `Result` values. That is, instead of yielding records, it yields
a `Result` that contains either a record or an error. The `expect` method,
which is defined on `Result`, *unwraps* the success value inside the `Result`.
Since the `Result` might contain an error instead, `expect` will *panic* when
it does contain an error.

It might help to look at the implementation of `expect`:

```ignore
use std::fmt;

// This says, "for all types T and E, where E can be turned into a human
// readable debug message, define the `expect` method."
impl<T, E: fmt::Debug> Result<T, E> {
    fn expect(self, msg: &str) -> T {
        match self {
            Ok(t) => t,
            Err(e) => panic!("{}: {:?}", msg, e),
        }
    }
}
```

Since this causes a panic if the CSV data is invalid, and invalid CSV data is
a perfectly predictable error, we've turned what should be a *recoverable*
error into an *unrecoverable* error. We did this because it is expedient to
use unrecoverable errors. Since this is bad practice, we will endeavor to avoid
unrecoverable errors throughout the rest of the tutorial.

## Switch to recoverable errors

We'll convert our unrecoverable error to a recoverable error in three steps.
First, let's get rid of the panic and print an error message manually:

```no_run
//tutorial-error-02.rs
use std::{io, process};

fn main() {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        // Examine our Result.
        // If there was no problem, print the record.
        // Otherwise, print the error message and quit the program.
        match result {
            Ok(record) => println!("{:?}", record),
            Err(err) => {
                println!("error reading CSV from <stdin>: {}", err);
                process::exit(1);
            }
        }
    }
}
```

If we run our program again, we'll still see an error message, but it is no
longer a panic message:

```text
$ cat invalid
header1,header2
foo,bar
quux,baz,foobar
$ ./target/debug/csvtutor < invalid
StringRecord(["foo", "bar"])
error reading CSV from <stdin>: CSV error: record 2 (line: 3, byte: 24): found record with 3 fields, but the previous record has 2 fields
```

The second step for moving to recoverable errors is to put our CSV record loop
into a separate function. This function then has the option of *returning* an
error, which our `main` function can then inspect and decide what to do with.

```no_run
//tutorial-error-03.rs
use std::{error::Error, io, process};

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        // Examine our Result.
        // If there was no problem, print the record.
        // Otherwise, convert our error to a Box<dyn Error> and return it.
        match result {
            Err(err) => return Err(From::from(err)),
            Ok(record) => {
                println!("{:?}", record);
            }
        }
    }
    Ok(())
}
```

Our new function, `run`, has a return type of `Result<(), Box<dyn Error>>`. In
simple terms, this says that `run` either returns nothing when successful, or
if an error occurred, it returns a `Box<dyn Error>`, which stands for "any kind
of error." A `Box<dyn Error>` is hard to inspect if we care about the specific
error that occurred. But for our purposes, all we need to do is gracefully
print an error message and exit the program.
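To see why `Box<dyn Error>` can stand for "any kind of error," consider this minimal sketch that uses only the standard library. The `parse_latitude` helper and its range check are hypothetical and invented for this illustration; it returns two unrelated kinds of errors through the same return type:

```rust
use std::error::Error;

/// Hypothetical helper: parse a latitude and check that it is in range.
fn parse_latitude(field: &str) -> Result<f64, Box<dyn Error>> {
    let lat = match field.parse::<f64>() {
        Ok(lat) => lat,
        // A `ParseFloatError` is converted into a `Box<dyn Error>`.
        Err(err) => return Err(From::from(err)),
    };
    if !(-90.0..=90.0).contains(&lat) {
        // A plain `String` can also be converted into a `Box<dyn Error>`.
        return Err(From::from(format!("latitude out of range: {}", lat)));
    }
    Ok(lat)
}

fn main() {
    for field in ["60.5544444", "north", "123.0"] {
        match parse_latitude(field) {
            Ok(lat) => println!("ok: {}", lat),
            Err(err) => println!("error: {}", err),
        }
    }
}
```

The caller can print either error with `{}`, but it cannot easily tell which of the two failures occurred, which is the trade-off described above.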

The third and final step is to replace our explicit `match` expression with a
special Rust language feature: the question mark.

```no_run
//tutorial-error-04.rs
use std::{error::Error, io, process};

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        // This is effectively the same code as our `match` in the
        // previous example. In other words, `?` is syntactic sugar.
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
```

This last step shows how we can use the `?` to automatically forward errors
to our caller without having to do explicit case analysis with `match`
ourselves. We will use the `?` heavily throughout this tutorial, and it's
important to note that it can **only be used in functions that return
`Result`** (or another type that supports `?`, such as `Option`).
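The equivalence between `match` and `?` can be checked on a small scale with the standard library alone. In this sketch, the two function names are hypothetical; both forward a `ParseIntError` to the caller, one with an explicit `match` and one with `?`:

```rust
use std::num::ParseIntError;

// Forward the error to the caller with an explicit `match`.
fn double_with_match(s: &str) -> Result<u64, ParseIntError> {
    let n = match s.parse::<u64>() {
        Ok(n) => n,
        Err(err) => return Err(From::from(err)),
    };
    Ok(n * 2)
}

// Forward the error to the caller with `?`.
fn double_with_question_mark(s: &str) -> Result<u64, ParseIntError> {
    let n = s.parse::<u64>()?;
    Ok(n * 2)
}

fn main() {
    // Both functions agree on success and on failure.
    assert_eq!(double_with_match("21"), double_with_question_mark("21"));
    assert_eq!(double_with_match("x"), double_with_question_mark("x"));
    println!("{:?}", double_with_question_mark("21"));
}
```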

We'll end this section with a word of caution: using `Box<dyn Error>` as our error
type is the minimally acceptable thing we can do here. Namely, while it allows
our program to gracefully handle errors, it makes it hard for callers to
inspect the specific error condition that occurred. However, since this is a
tutorial on writing command line programs that do CSV parsing, we will consider
ourselves satisfied. If you'd like to know more, or are interested in writing
a library that handles CSV data, then you should check out my
[blog post on error handling](http://blog.burntsushi.net/rust-error-handling/).

With all that said, if all you're doing is writing a one-off program to do
CSV transformations, then using methods like `expect` and panicking when an
error occurs is a perfectly reasonable thing to do. Nevertheless, this tutorial
will endeavor to show idiomatic code.

# Reading CSV

Now that we've got you set up and covered basic error handling, it's time to do
what we came here to do: handle CSV data. We've already seen how to read
CSV data from `stdin`, but this section will cover how to read CSV data from
files and how to configure our CSV reader to read data formatted with different
delimiters and quoting strategies.

First up, let's adapt the example we've been working with to accept a file
path argument instead of stdin.

```no_run
//tutorial-read-01.rs
use std::{
    env,
    error::Error,
    ffi::OsString,
    fs::File,
    process,
};

fn run() -> Result<(), Box<dyn Error>> {
    let file_path = get_first_arg()?;
    let file = File::open(file_path)?;
    let mut rdr = csv::Reader::from_reader(file);
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}

/// Returns the first positional argument sent to this process. If there are no
/// positional arguments, then this returns an error.
fn get_first_arg() -> Result<OsString, Box<dyn Error>> {
    match env::args_os().nth(1) {
        None => Err(From::from("expected 1 argument, but got none")),
        Some(file_path) => Ok(file_path),
    }
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}
```

If you replace the contents of your `src/main.rs` file with the above code,
then you should be able to rebuild your project and try it out:

```text
$ cargo build
$ ./target/debug/csvtutor uspop.csv
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"])
# ... and much more
```

This example contains two new pieces of code:

1. Code for querying the positional arguments of your program. We put this code
   into its own function called `get_first_arg`. Our program expects a file
   path in the first position (which is indexed at `1`; the argument at index
   `0` is the executable name), so if one doesn't exist, then `get_first_arg`
   returns an error.
2. Code for opening a file. In `run`, we open a file using `File::open`. If
   there was a problem opening the file, we forward the error to the caller of
   `run` (which is `main` in this program). Note that we do *not* wrap the
   `File` in a buffer. The CSV reader does buffering internally, so there's
   no need for the caller to do it.
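The argument indexing described in the first item can be tried without a real command line. The sketch below is a hypothetical variant of `get_first_arg`, here called `first_arg_from`, that accepts any iterator of arguments instead of reading `env::args_os()` directly:

```rust
use std::error::Error;
use std::ffi::OsString;

/// Hypothetical variant of `get_first_arg` that accepts any iterator of
/// arguments, so it can be exercised without a real command line.
fn first_arg_from<I>(mut args: I) -> Result<OsString, Box<dyn Error>>
where
    I: Iterator<Item = OsString>,
{
    // Index 0 is the executable name, so the first positional argument
    // is at index 1.
    match args.nth(1) {
        None => Err(From::from("expected 1 argument, but got none")),
        Some(path) => Ok(path),
    }
}

fn main() {
    let args = vec![OsString::from("csvtutor"), OsString::from("uspop.csv")];
    println!("{:?}", first_arg_from(args.into_iter()));

    // Only the executable name is present, so this is an error.
    let args = vec![OsString::from("csvtutor")];
    println!("{:?}", first_arg_from(args.into_iter()));
}
```

In a real program, calling `first_arg_from(std::env::args_os())` should behave like `get_first_arg` above.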

Now is a good time to introduce an alternate CSV reader constructor, which
makes it slightly more convenient to open CSV data from a file. That is,
instead of:

```ignore
let file_path = get_first_arg()?;
let file = File::open(file_path)?;
let mut rdr = csv::Reader::from_reader(file);
```

you can use:

```ignore
let file_path = get_first_arg()?;
let mut rdr = csv::Reader::from_path(file_path)?;
```

`csv::Reader::from_path` will open the file for you and return an error if
the file could not be opened.

## Reading headers

If you had a chance to look at the data inside `uspop.csv`, you would notice
that there is a header record that looks like this:

```text
City,State,Population,Latitude,Longitude
```

Now, if you look back at the output of the commands you've run so far, you'll
notice that the header record is never printed. Why is that? By default, the
CSV reader will interpret the first record in CSV data as a header, which
is typically distinct from the actual data in the records that follow.
Therefore, the header record is always skipped whenever you try to read or
iterate over the records in CSV data.

The CSV reader does not try to be smart about the header record and does
**not** employ any heuristics for automatically detecting whether the first
record is a header or not. Instead, if you don't want to treat the first record
as a header, you'll need to tell the CSV reader that there are no headers.

To configure a CSV reader to do this, we'll need to use a
[`ReaderBuilder`](../struct.ReaderBuilder.html)
to build a CSV reader with our desired configuration. Here's an example that
does just that. (Note that we've moved back to reading from `stdin`, since it
produces terser examples.)

```no_run
//tutorial-read-headers-01.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .from_reader(io::stdin());
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

If you compile and run this program with our `uspop.csv` data, then you'll see
that the header record is now printed:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop.csv
StringRecord(["City", "State", "Population", "Latitude", "Longitude"])
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"])
```

If you ever need to access the header record directly, then you can use the
[`Reader::headers`](../struct.Reader.html#method.headers)
method like so:

```no_run
//tutorial-read-headers-02.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    let headers = rdr.headers()?;
    println!("{:?}", headers);
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    // We can ask for the headers at any time.
    let headers = rdr.headers()?;
    println!("{:?}", headers);
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

## Delimiters, quotes and variable length records

In this section we'll temporarily depart from our `uspop.csv` data set and
show how to read some CSV data that is a little less clean. This CSV data
uses `;` as a delimiter, escapes quotes with `\"` (instead of `""`) and has
records of varying length. Here's the data, which contains a list of WWE
wrestlers and the year they started, if it's known:

```text
$ cat strange.csv
"\"Hacksaw\" Jim Duggan";1987
"Bret \"Hit Man\" Hart";1984
# We're not sure when Rafael started, so omit the year.
Rafael Halperin
"\"Big Cat\" Ernie Ladd";1964
"\"Macho Man\" Randy Savage";1985
"Jake \"The Snake\" Roberts";1986
```

To read this CSV data, we'll want to do the following:

1. Disable headers, since this data has none.
2. Change the delimiter from `,` to `;`.
3. Change the quote strategy from doubled (e.g., `""`) to escaped (e.g., `\"`).
4. Permit flexible length records, since some omit the year.
5. Ignore lines beginning with a `#`.

All of this (and more!) can be configured with a
[`ReaderBuilder`](../struct.ReaderBuilder.html),
as seen in the following example:

```no_run
//tutorial-read-delimiter-01.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .delimiter(b';')
        .double_quote(false)
        .escape(Some(b'\\'))
        .flexible(true)
        .comment(Some(b'#'))
        .from_reader(io::stdin());
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Now re-compile your project and try running the program on `strange.csv`:

```text
$ cargo build
$ ./target/debug/csvtutor < strange.csv
StringRecord(["\"Hacksaw\" Jim Duggan", "1987"])
StringRecord(["Bret \"Hit Man\" Hart", "1984"])
StringRecord(["Rafael Halperin"])
StringRecord(["\"Big Cat\" Ernie Ladd", "1964"])
StringRecord(["\"Macho Man\" Randy Savage", "1985"])
StringRecord(["Jake \"The Snake\" Roberts", "1986"])
```

You should feel encouraged to play around with the settings. Some interesting
things you might try:

1. If you remove the `escape` setting, notice that no CSV errors are reported.
   Instead, records are still parsed. This is a feature of the CSV parser. Even
   though it gets the data slightly wrong, it still provides a parse that you
   might be able to work with. This is a useful property given the messiness
   of real world CSV data.
2. If you remove the `delimiter` setting, parsing still succeeds, although
   every record has exactly one field.
3. If you remove the `flexible` setting, the reader will print the first two
   records (since they both have the same number of fields), but will return a
   parse error on the third record, since it has only one field.
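To build some intuition for what the delimiter and escape settings mean, here is a toy field splitter written with only the standard library. It is **not** how the csv crate parses data; `split_fields` is a hypothetical function invented for this sketch, and it handles only a single line with backslash-escaped quotes:

```rust
/// Toy field splitter (NOT the csv crate's parser): splits one line on
/// `delimiter`, honoring quoted fields in which `\"` is an escaped quote.
fn split_fields(line: &str, delimiter: char) -> Vec<String> {
    let mut fields = Vec::new();
    let mut field = String::new();
    let mut in_quotes = false;
    let mut chars = line.chars();
    while let Some(c) = chars.next() {
        match c {
            // Inside quotes, a backslash makes the next character literal.
            '\\' if in_quotes => {
                if let Some(next) = chars.next() {
                    field.push(next);
                }
            }
            // An unescaped quote opens or closes a quoted region.
            '"' => in_quotes = !in_quotes,
            // A delimiter outside of quotes ends the current field.
            c if c == delimiter && !in_quotes => {
                fields.push(std::mem::take(&mut field));
            }
            c => field.push(c),
        }
    }
    fields.push(field);
    fields
}

fn main() {
    let line = r#""Bret \"Hit Man\" Hart";1984"#;
    // With `;` as the delimiter, the line splits into two fields.
    println!("{:?}", split_fields(line, ';'));
    // With `,` as the delimiter, the whole line is a single field.
    println!("{:?}", split_fields(line, ','));
}
```

The second call mirrors the observation in item 2 above: with the wrong delimiter, splitting still succeeds, but every line yields exactly one field.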

This covers most of the things you might want to configure on your CSV reader,
although there are a few other knobs. For example, you can change the record
terminator from a new line to any other character. (By default, the terminator
is `CRLF`, which treats each of `\r\n`, `\r` and `\n` as single record
terminators.) For more details, see the documentation and examples for each of
the methods on
[`ReaderBuilder`](../struct.ReaderBuilder.html).

## Reading with Serde

One of the most convenient features of this crate is its support for
[Serde](https://serde.rs/).
Serde is a framework for automatically serializing and deserializing data into
Rust types. In simpler terms, that means instead of iterating over records
as an array of string fields, we can iterate over records of a specific type
of our choosing.

For example, let's take a look at some data from our `uspop.csv` file:

```text
City,State,Population,Latitude,Longitude
Davidsons Landing,AK,,65.2419444,-165.2716667
Kenai,AK,7610,60.5544444,-151.2583333
```

While some of these fields make sense as strings (`City`, `State`), other
fields look more like numbers. For example, `Population` looks like it contains
integers while `Latitude` and `Longitude` appear to contain decimals. If we
wanted to convert these fields to their "proper" types, then we need to do
a lot of manual work. This next example shows how.

```no_run
//tutorial-read-serde-01.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        let record = result?;

        let city = &record[0];
        let state = &record[1];
        // Some records are missing population counts, so if we can't
        // parse a number, treat the population count as missing instead
        // of returning an error.
        let pop: Option<u64> = record[2].parse().ok();
        // Lucky us! Latitudes and longitudes are available for every record.
        // Therefore, if one couldn't be parsed, return an error.
        let latitude: f64 = record[3].parse()?;
        let longitude: f64 = record[4].parse()?;

        println!(
            "city: {:?}, state: {:?}, \
             pop: {:?}, latitude: {:?}, longitude: {:?}",
            city, state, pop, latitude, longitude);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```
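As an aside, the `parse().ok()` idiom used for the population field can be tried in isolation with only the standard library. The `parse_population` helper is a hypothetical name invented for this sketch:

```rust
/// Hypothetical helper mirroring how the example treats the population
/// field: anything that is not a valid integer becomes `None`.
fn parse_population(field: &str) -> Option<u64> {
    // `parse` returns a `Result`; `ok` discards the error, if any.
    field.parse().ok()
}

fn main() {
    println!("{:?}", parse_population("7610"));
    // An empty field is a missing population count.
    println!("{:?}", parse_population(""));
}
```

One consequence of this choice is that a malformed count such as `n/a` is silently treated the same as a missing one.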

The problem here is that we need to parse each individual field manually, which
can be labor intensive and repetitive. Serde, however, makes this process
automatic. For example, we can ask to deserialize every record into a tuple
type: `(String, String, Option<u64>, f64, f64)`.

```no_run
//tutorial-read-serde-02.rs
# use std::{error::Error, io, process};
#
// This introduces a type alias so that we can conveniently reference our
// record type.
type Record = (String, String, Option<u64>, f64, f64);

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    // Instead of creating an iterator with the `records` method, we create
    // an iterator with the `deserialize` method.
    for result in rdr.deserialize() {
        // We must tell Serde what type we want to deserialize into.
        let record: Record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Running this code should show output similar to previous examples:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop.csv
("Davidsons Landing", "AK", None, 65.2419444, -165.2716667)
("Kenai", "AK", Some(7610), 60.5544444, -151.2583333)
("Oakman", "AL", None, 33.7133333, -87.3886111)
# ... and much more
```

One of the downsides of using Serde this way is that the type you use must
match the order of fields as they appear in each record. This can be a pain
if your CSV data has a header record, since you might tend to think about each
field as a value of a particular named field rather than as a numbered field.
One way to access fields by name instead is to deserialize our record into a
map type like
[`HashMap`](https://doc.rust-lang.org/std/collections/struct.HashMap.html)
or
[`BTreeMap`](https://doc.rust-lang.org/std/collections/struct.BTreeMap.html).
The next example shows how, and in particular, notice that the only thing that
changed from the last example is the definition of the `Record` type alias and
a new `use` statement that imports `HashMap` from the standard library:

```no_run
//tutorial-read-serde-03.rs
use std::collections::HashMap;
# use std::{error::Error, io, process};

// This introduces a type alias so that we can conveniently reference our
// record type.
type Record = HashMap<String, String>;

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.deserialize() {
        let record: Record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Running this program shows results similar to before, but each record is
printed as a map:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop.csv
{"City": "Davidsons Landing", "Latitude": "65.2419444", "State": "AK", "Population": "", "Longitude": "-165.2716667"}
{"City": "Kenai", "Population": "7610", "State": "AK", "Longitude": "-151.2583333", "Latitude": "60.5544444"}
{"State": "AL", "City": "Oakman", "Longitude": "-87.3886111", "Population": "", "Latitude": "33.7133333"}
```
832
833This method works especially well if you need to read CSV data with header
834records, but whose exact structure isn't known until your program runs.
835However, in our case, we know the structure of the data in `uspop.csv`. In
836particular, with the `HashMap` approach, we've lost the specific types we had
837for each field in the previous example when we deserialized each record into a
838`(String, String, Option<u64>, f64, f64)`. Is there a way to identify fields
839by their corresponding header name *and* assign each field its own unique
840type? The answer is yes, but we'll need to bring in Serde's `derive` feature
841first. You can do that by adding this to the `[dependencies]` section of your
842`Cargo.toml` file:
843
844```text
845serde = { version = "1", features = ["derive"] }
846```
847
With this dependency added to our project, we can now define our own custom
struct that represents our record. We then ask Serde to automatically write the
glue code required to populate our struct from a CSV record. The next example
shows how. Don't miss the new Serde import!
852
853```no_run
854//tutorial-read-serde-04.rs
855# #![allow(dead_code)]
856# use std::{error::Error, io, process};
857
858// This lets us write `#[derive(Deserialize)]`.
859use serde::Deserialize;
860
861// We don't need to derive `Debug` (which doesn't require Serde), but it's a
862// good habit to do it for all your types.
863//
864// Notice that the field names in this struct are NOT in the same order as
865// the fields in the CSV data!
866#[derive(Debug, Deserialize)]
867#[serde(rename_all = "PascalCase")]
868struct Record {
869    latitude: f64,
870    longitude: f64,
871    population: Option<u64>,
872    city: String,
873    state: String,
874}
875
876fn run() -> Result<(), Box<dyn Error>> {
877    let mut rdr = csv::Reader::from_reader(io::stdin());
878    for result in rdr.deserialize() {
879        let record: Record = result?;
880        println!("{:?}", record);
881        // Try this if you don't like each record smushed on one line:
882        // println!("{:#?}", record);
883    }
884    Ok(())
885}
886
887fn main() {
888    if let Err(err) = run() {
889        println!("{}", err);
890        process::exit(1);
891    }
892}
893```
894
895Compile and run this program to see similar output as before:
896
897```text
898$ cargo build
899$ ./target/debug/csvtutor < uspop.csv
900Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
901Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
902Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" }
903```
904
905Once again, we didn't need to change our `run` function at all: we're still
906iterating over records using the `deserialize` iterator that we started with
907in the beginning of this section. The only thing that changed in this example
908was the definition of the `Record` type and a new `use` statement. Our `Record`
909type is now a custom struct that we defined instead of a type alias, and as a
result, Serde doesn't know how to deserialize it by default. However, Serde's
`derive` feature provides a procedural macro that reads your struct definition
at compile time and generates code that deserializes a CSV record into a
`Record` value. To see what happens if you leave out the automatic derive,
change `#[derive(Debug, Deserialize)]` to `#[derive(Debug)]`.
915
916One other thing worth mentioning in this example is the use of
917`#[serde(rename_all = "PascalCase")]`. This directive helps Serde map your
918struct's field names to the header names in the CSV data. If you recall, our
919header record is:
920
921```text
922City,State,Population,Latitude,Longitude
923```
924
Notice that each name is capitalized, but the fields in our struct are not. The
`#[serde(rename_all = "PascalCase")]` directive fixes that by converting each
field name to `PascalCase`, where the first letter of each word is capitalized.
If we didn't tell Serde about the name remapping, then the program would quit
with an error:
930
931```text
932$ ./target/debug/csvtutor < uspop.csv
933CSV deserialize error: record 1 (line: 2, byte: 41): missing field `latitude`
934```
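
To get a feel for the renaming that `rename_all = "PascalCase"` asks for, here
is a rough standard-library approximation. The function `to_pascal_case` is a
hypothetical helper written for this tutorial. It is not Serde's
implementation, and it assumes field names are plain ASCII `snake_case`.

```rust
// Hypothetical helper: convert a snake_case field name to PascalCase by
// dropping underscores and upper-casing the letter that follows each one
// (and the very first letter).
fn to_pascal_case(field: &str) -> String {
    let mut out = String::with_capacity(field.len());
    let mut capitalize = true;
    for ch in field.chars() {
        if ch == '_' {
            capitalize = true;
        } else if capitalize {
            out.push(ch.to_ascii_uppercase());
            capitalize = false;
        } else {
            out.push(ch);
        }
    }
    out
}
```

With this helper, `latitude` becomes `Latitude`, which matches our header
record, and a multi-word name like `zip_code` becomes `ZipCode`.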
935
936We could have fixed this through other means. For example, we could have used
937capital letters in our field names:
938
939```ignore
940#[derive(Debug, Deserialize)]
941struct Record {
942    Latitude: f64,
943    Longitude: f64,
944    Population: Option<u64>,
945    City: String,
946    State: String,
947}
948```
949
950However, this violates Rust naming style. (In fact, the Rust compiler
951will even warn you that the names do not follow convention!)
952
953Another way to fix this is to ask Serde to rename each field individually. This
954is useful when there is no consistent name mapping from fields to header names:
955
956```ignore
957#[derive(Debug, Deserialize)]
958struct Record {
959    #[serde(rename = "Latitude")]
960    latitude: f64,
961    #[serde(rename = "Longitude")]
962    longitude: f64,
963    #[serde(rename = "Population")]
964    population: Option<u64>,
965    #[serde(rename = "City")]
966    city: String,
967    #[serde(rename = "State")]
968    state: String,
969}
970```
971
972To read more about renaming fields and about other Serde directives, please
973consult the
974[Serde documentation on attributes](https://serde.rs/attributes.html).
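
To make the glue code that Serde writes for us a little less mysterious, here
is a hand-written sketch of the same idea using only the standard library. It
looks fields up by header name, so the column order doesn't matter, and it
parses each field into its own type. The names `field`, `parse_f64` and
`record_from_row` are hypothetical, errors are plain `String`s for brevity, and
the code Serde actually generates is considerably more general than this.

```rust
#[derive(Debug, PartialEq)]
struct Record {
    latitude: f64,
    longitude: f64,
    population: Option<u64>,
    city: String,
    state: String,
}

// Hypothetical helper: find the field stored under the header `name`.
fn field<'a>(headers: &[&str], fields: &[&'a str], name: &str) -> Result<&'a str, String> {
    headers
        .iter()
        .position(|header| *header == name)
        .and_then(|index| fields.get(index).copied())
        .ok_or_else(|| format!("missing field `{}`", name))
}

// Hypothetical helper: parse a float, turning the error into a `String`.
fn parse_f64(text: &str) -> Result<f64, String> {
    text.parse::<f64>().map_err(|err| err.to_string())
}

// Hypothetical hand-written counterpart to `#[derive(Deserialize)]` for
// our `Record` type, using the capitalized header names from `uspop.csv`.
fn record_from_row(headers: &[&str], fields: &[&str]) -> Result<Record, String> {
    let population = match field(headers, fields, "Population")? {
        // An empty field means that no count is available.
        "" => None,
        text => Some(text.parse::<u64>().map_err(|err| err.to_string())?),
    };
    Ok(Record {
        latitude: parse_f64(field(headers, fields, "Latitude")?)?,
        longitude: parse_f64(field(headers, fields, "Longitude")?)?,
        population,
        city: field(headers, fields, "City")?.to_string(),
        state: field(headers, fields, "State")?.to_string(),
    })
}
```

Writing this by hand for every record type gets tedious quickly, which is the
main reason to let Serde's derive do it.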
975
976## Handling invalid data with Serde
977
978In this section we will see a brief example of how to deal with data that isn't
979clean. To do this exercise, we'll work with a slightly tweaked version of the
980US population data we've been using throughout this tutorial. This version of
981the data is slightly messier than what we've been using. You can get it like
982so:
983
984```text
985$ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop-null.csv'
986```
987
988Let's start by running our program from the previous section:
989
990```no_run
991//tutorial-read-serde-invalid-01.rs
992# #![allow(dead_code)]
993# use std::{error::Error, io, process};
994#
995# use serde::Deserialize;
996#
997#[derive(Debug, Deserialize)]
998#[serde(rename_all = "PascalCase")]
999struct Record {
1000    latitude: f64,
1001    longitude: f64,
1002    population: Option<u64>,
1003    city: String,
1004    state: String,
1005}
1006
1007fn run() -> Result<(), Box<dyn Error>> {
1008    let mut rdr = csv::Reader::from_reader(io::stdin());
1009    for result in rdr.deserialize() {
1010        let record: Record = result?;
1011        println!("{:?}", record);
1012    }
1013    Ok(())
1014}
1015#
1016# fn main() {
1017#     if let Err(err) = run() {
1018#         println!("{}", err);
1019#         process::exit(1);
1020#     }
1021# }
1022```
1023
1024Compile and run it on our messier data:
1025
1026```text
1027$ cargo build
1028$ ./target/debug/csvtutor < uspop-null.csv
1029Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
1030Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
1031Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" }
1032# ... more records
1033CSV deserialize error: record 42 (line: 43, byte: 1710): field 2: invalid digit found in string
1034```
1035
1036Oops! What happened? The program printed several records, but stopped when it
1037tripped over a deserialization problem. The error message says that it found
1038an invalid digit in the field at index `2` (which is the `Population` field)
1039on line 43. What does line 43 look like?
1040
1041```text
1042$ head -n 43 uspop-null.csv | tail -n1
1043Flint Springs,KY,NULL,37.3433333,-86.7136111
1044```
1045
1046Ah! The third field (index `2`) is supposed to either be empty or contain a
1047population count. However, in this data, it seems that `NULL` sometimes appears
1048as a value, presumably to indicate that there is no count available.
1049
The problem with our current program is that it fails to read this record
because it doesn't know how to deserialize a `NULL` string into an
`Option<u64>`. That is, an `Option<u64>` corresponds to either an empty field
or an integer.
1054
1055To fix this, we tell Serde to convert any deserialization errors on this field
1056to a `None` value, as shown in this next example:
1057
1058```no_run
1059//tutorial-read-serde-invalid-02.rs
1060# #![allow(dead_code)]
1061# use std::{error::Error, io, process};
1062#
1063# use serde::Deserialize;
1064#[derive(Debug, Deserialize)]
1065#[serde(rename_all = "PascalCase")]
1066struct Record {
1067    latitude: f64,
1068    longitude: f64,
1069    #[serde(deserialize_with = "csv::invalid_option")]
1070    population: Option<u64>,
1071    city: String,
1072    state: String,
1073}
1074
1075fn run() -> Result<(), Box<dyn Error>> {
1076    let mut rdr = csv::Reader::from_reader(io::stdin());
1077    for result in rdr.deserialize() {
1078        let record: Record = result?;
1079        println!("{:?}", record);
1080    }
1081    Ok(())
1082}
1083#
1084# fn main() {
1085#     if let Err(err) = run() {
1086#         println!("{}", err);
1087#         process::exit(1);
1088#     }
1089# }
1090```
1091
1092If you compile and run this example, then it should run to completion just
1093like the other examples:
1094
1095```text
1096$ cargo build
1097$ ./target/debug/csvtutor < uspop-null.csv
1098Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
1099Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
1100Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" }
1101# ... and more
1102```
1103
1104The only change in this example was adding this attribute to the `population`
1105field in our `Record` type:
1106
1107```ignore
1108#[serde(deserialize_with = "csv::invalid_option")]
1109```
1110
1111The
1112[`invalid_option`](../fn.invalid_option.html)
1113function is a generic helper function that does one very simple thing: when
1114applied to `Option` fields, it will convert any deserialization error into a
1115`None` value. This is useful when you need to work with messy CSV data.
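
The difference between the strict and the lenient behavior can be sketched
with the standard library alone. The two functions below are hypothetical
stand-ins written for this tutorial. They are not how `csv::invalid_option` is
implemented, but they follow the same rule described above: strict parsing
rejects `NULL`, while lenient parsing turns any failure into `None`.

```rust
use std::num::ParseIntError;

// Hypothetical strict parser: an empty field is `None`, and anything else
// must be a valid integer.
fn parse_population_strict(field: &str) -> Result<Option<u64>, ParseIntError> {
    if field.is_empty() {
        Ok(None)
    } else {
        field.parse::<u64>().map(Some)
    }
}

// Hypothetical lenient parser: any parse error is converted to `None`.
fn parse_population_lenient(field: &str) -> Option<u64> {
    parse_population_strict(field).unwrap_or(None)
}
```

Keep in mind that the lenient approach can't distinguish a missing count from
a corrupted one, so it's best reserved for data you already know to be messy.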
1116
1117# Writing CSV
1118
In this section we'll show a few examples that write CSV data. Writing CSV data
tends to be a bit more straightforward than reading CSV data, since you get to
control the output format.
1122
1123Let's start with the most basic example: writing a few CSV records to `stdout`.
1124
1125```no_run
1126//tutorial-write-01.rs
1127use std::{error::Error, io, process};
1128
1129fn run() -> Result<(), Box<dyn Error>> {
1130    let mut wtr = csv::Writer::from_writer(io::stdout());
1131    // Since we're writing records manually, we must explicitly write our
1132    // header record. A header record is written the same way that other
1133    // records are written.
1134    wtr.write_record(["City", "State", "Population", "Latitude", "Longitude"])?;
1135    wtr.write_record(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?;
1136    wtr.write_record(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?;
1137    wtr.write_record(["Oakman", "AL", "", "33.7133333", "-87.3886111"])?;
1138
1139    // A CSV writer maintains an internal buffer, so it's important
1140    // to flush the buffer when you're done.
1141    wtr.flush()?;
1142    Ok(())
1143}
1144
1145fn main() {
1146    if let Err(err) = run() {
1147        println!("{}", err);
1148        process::exit(1);
1149    }
1150}
1151```
1152
1153Compiling and running this example results in CSV data being printed:
1154
1155```text
1156$ cargo build
1157$ ./target/debug/csvtutor
1158City,State,Population,Latitude,Longitude
1159Davidsons Landing,AK,,65.2419444,-165.2716667
1160Kenai,AK,7610,60.5544444,-151.2583333
1161Oakman,AL,,33.7133333,-87.3886111
1162```
1163
1164Before moving on, it's worth taking a closer look at the `write_record`
1165method. In this example, it looks rather simple, but if you're new to Rust then
1166its type signature might look a little daunting:
1167
1168```ignore
1169pub fn write_record<I, T>(&mut self, record: I) -> csv::Result<()>
1170    where I: IntoIterator<Item=T>, T: AsRef<[u8]>
1171{
1172    // implementation elided
1173}
1174```
1175
1176To understand the type signature, we can break it down piece by piece.
1177
1. The method takes two parameters: `self` and `record`.
2. `self` is a special parameter that corresponds to the `Writer` itself.
3. `record` is the CSV record we'd like to write. Its type is `I`, which is
   a generic type.
4. In the method's `where` clause, the `I` type is constrained by the
   `IntoIterator<Item=T>` bound. What that means is that `I` must satisfy the
   `IntoIterator` trait. If you look at the documentation of the
   [`IntoIterator` trait](https://doc.rust-lang.org/std/iter/trait.IntoIterator.html),
   then you can see that it describes types that can build iterators. In this
   case, we want an iterator that yields *another* generic type `T`, where
   `T` is the type of each field we want to write.
5. `T` also appears in the method's `where` clause, but its constraint is the
   `AsRef<[u8]>` bound. The `AsRef` trait is a way to describe zero-cost
   conversions between types in Rust. In this case, the `[u8]` in `AsRef<[u8]>`
   means that we want to be able to *borrow* a slice of bytes from `T`.
   The CSV writer will take these bytes and write them as a single field.
   The `AsRef<[u8]>` bound is useful because types like `String`, `&str`,
   `Vec<u8>` and `&[u8]` all satisfy it.
6. Finally, the method returns a `csv::Result<()>`, which is shorthand for
   `Result<(), csv::Error>`. That means `write_record` either returns nothing
   on success or returns a `csv::Error` on failure.
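
To see the two bounds at work outside of the `csv` crate, here is a toy
function with the same `where` clause. The name `join_fields` is hypothetical.
It simply joins fields with commas and does none of the quoting that a real
CSV writer performs, so it only illustrates which argument types the
signature accepts.

```rust
// Hypothetical toy function with the same bounds as `write_record`.
// It accepts anything that can be iterated over (`I`), as long as each
// item (`T`) can lend us a slice of bytes.
fn join_fields<I, T>(record: I) -> Vec<u8>
where
    I: IntoIterator<Item = T>,
    T: AsRef<[u8]>,
{
    let mut out = Vec::new();
    for (i, field) in record.into_iter().enumerate() {
        if i > 0 {
            out.push(b',');
        }
        // `as_ref` borrows the bytes of the field without copying the field.
        out.extend_from_slice(field.as_ref());
    }
    out
}
```

Arrays of string literals, vectors of `String` and borrowed arrays of byte
strings can all be passed to `join_fields`, because each one satisfies both
bounds.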
1199
Now, let's apply our newfound understanding of the type signature of
`write_record`. If you recall, in our previous example, we used it like so:
1202
1203```ignore
1204wtr.write_record(["field 1", "field 2", "etc"])?;
1205```
1206
So how do the types match up? Well, the type of each of our fields in this
code is `&'static str` (which is the type of a string literal in Rust). Since
we put them in an array literal, the type of our parameter is
`[&'static str; 3]`, or more succinctly written as `[&str; 3]` without the
lifetime annotation. Since arrays satisfy the `IntoIterator` bound and
strings satisfy the `AsRef<[u8]>` bound, this ends up being a legal call.
1213
1214Here are a few more examples of ways you can call `write_record`:
1215
1216```no_run
1217# use csv;
1218# let mut wtr = csv::Writer::from_writer(vec![]);
1219// A slice of byte strings.
1220wtr.write_record(&[b"a", b"b", b"c"]);
1221// An array of byte strings.
1222wtr.write_record([b"a", b"b", b"c"]);
1223// A vector.
1224wtr.write_record(vec!["a", "b", "c"]);
1225// A string record.
1226wtr.write_record(&csv::StringRecord::from(vec!["a", "b", "c"]));
1227// A byte record.
1228wtr.write_record(&csv::ByteRecord::from(vec!["a", "b", "c"]));
1229```
1230
1231Finally, the example above can be easily adapted to write to a file instead
1232of `stdout`:
1233
1234```no_run
1235//tutorial-write-02.rs
1236use std::{
1237    env,
1238    error::Error,
1239    ffi::OsString,
1240    process,
1241};
1242
1243fn run() -> Result<(), Box<dyn Error>> {
1244    let file_path = get_first_arg()?;
1245    let mut wtr = csv::Writer::from_path(file_path)?;
1246
1247    wtr.write_record(["City", "State", "Population", "Latitude", "Longitude"])?;
1248    wtr.write_record(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?;
1249    wtr.write_record(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?;
1250    wtr.write_record(["Oakman", "AL", "", "33.7133333", "-87.3886111"])?;
1251
1252    wtr.flush()?;
1253    Ok(())
1254}
1255
1256/// Returns the first positional argument sent to this process. If there are no
1257/// positional arguments, then this returns an error.
1258fn get_first_arg() -> Result<OsString, Box<dyn Error>> {
1259    match env::args_os().nth(1) {
1260        None => Err(From::from("expected 1 argument, but got none")),
1261        Some(file_path) => Ok(file_path),
1262    }
1263}
1264
1265fn main() {
1266    if let Err(err) = run() {
1267        println!("{}", err);
1268        process::exit(1);
1269    }
1270}
1271```
1272
1273## Writing tab separated values
1274
1275In the previous section, we saw how to write some simple CSV data to `stdout`
1276that looked like this:
1277
1278```text
1279City,State,Population,Latitude,Longitude
1280Davidsons Landing,AK,,65.2419444,-165.2716667
1281Kenai,AK,7610,60.5544444,-151.2583333
1282Oakman,AL,,33.7133333,-87.3886111
1283```
1284
1285You might wonder to yourself: what's the point of using a CSV writer if the
1286data is so simple? Well, the benefit of a CSV writer is that it can handle all
1287types of data without sacrificing the integrity of your data. That is, it knows
1288when to quote fields that contain special CSV characters (like commas or new
1289lines) or escape literal quotes that appear in your data. The CSV writer can
1290also be easily configured to use different delimiters or quoting strategies.
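
As a rough illustration of the decision a writer has to make for every field,
here is a simplified standard-library sketch. The function `quote_if_needed`
is hypothetical and is not the `csv` crate's implementation. It assumes `"` is
the quote character and that literal quotes are escaped by doubling them.

```rust
// Hypothetical helper: quote a field only when it contains the delimiter,
// a quote or a line break. Literal quotes inside the field are doubled.
fn quote_if_needed(field: &str, delimiter: char) -> String {
    let needs_quotes = field
        .chars()
        .any(|c| c == delimiter || c == '"' || c == '\n' || c == '\r');
    if needs_quotes {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}
```

A real writer handles more cases than this (for example, configurable quote
and escape characters), so treat the sketch as an illustration of the idea
only.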
1291
In this section, we'll take a look at how to tweak some of the settings
on a CSV writer. In particular, we'll write TSV ("tab separated values")
instead of CSV, and we'll ask the CSV writer to quote all non-numeric fields.
Here's an example:
1296
1297```no_run
1298//tutorial-write-delimiter-01.rs
1299# use std::{error::Error, io, process};
1300#
1301fn run() -> Result<(), Box<dyn Error>> {
1302    let mut wtr = csv::WriterBuilder::new()
1303        .delimiter(b'\t')
1304        .quote_style(csv::QuoteStyle::NonNumeric)
1305        .from_writer(io::stdout());
1306
1307    wtr.write_record(["City", "State", "Population", "Latitude", "Longitude"])?;
1308    wtr.write_record(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?;
1309    wtr.write_record(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?;
1310    wtr.write_record(["Oakman", "AL", "", "33.7133333", "-87.3886111"])?;
1311
1312    wtr.flush()?;
1313    Ok(())
1314}
1315#
1316# fn main() {
1317#     if let Err(err) = run() {
1318#         println!("{}", err);
1319#         process::exit(1);
1320#     }
1321# }
1322```
1323
1324Compiling and running this example gives:
1325
1326```text
1327$ cargo build
1328$ ./target/debug/csvtutor
1329"City"  "State" "Population"    "Latitude"      "Longitude"
1330"Davidsons Landing"     "AK"    ""      65.2419444      -165.2716667
1331"Kenai" "AK"    7610    60.5544444      -151.2583333
1332"Oakman"        "AL"    ""      33.7133333      -87.3886111
1333```
1334
1335In this example, we used a new type
1336[`QuoteStyle`](../enum.QuoteStyle.html).
1337The `QuoteStyle` type represents the different quoting strategies available
1338to you. The default is to add quotes to fields only when necessary. This
1339probably works for most use cases, but you can also ask for quotes to always
1340be put around fields, to never be put around fields or to always be put around
1341non-numeric fields.
1342
1343## Writing with Serde
1344
1345Just like the CSV reader supports automatic deserialization into Rust types
1346with Serde, the CSV writer supports automatic serialization from Rust types
1347into CSV records using Serde. In this section, we'll learn how to use it.
1348
1349As with reading, let's start by seeing how we can serialize a Rust tuple.
1350
1351```no_run
1352//tutorial-write-serde-01.rs
1353# use std::{error::Error, io, process};
1354#
1355fn run() -> Result<(), Box<dyn Error>> {
1356    let mut wtr = csv::Writer::from_writer(io::stdout());
1357
1358    // We still need to write headers manually.
1359    wtr.write_record(["City", "State", "Population", "Latitude", "Longitude"])?;
1360
1361    // But now we can write records by providing a normal Rust value.
1362    //
1363    // Note that the odd `None::<u64>` syntax is required because `None` on
1364    // its own doesn't have a concrete type, but Serde needs a concrete type
1365    // in order to serialize it. That is, `None` has type `Option<T>` but
1366    // `None::<u64>` has type `Option<u64>`.
1367    wtr.serialize(("Davidsons Landing", "AK", None::<u64>, 65.2419444, -165.2716667))?;
1368    wtr.serialize(("Kenai", "AK", Some(7610), 60.5544444, -151.2583333))?;
1369    wtr.serialize(("Oakman", "AL", None::<u64>, 33.7133333, -87.3886111))?;
1370
1371    wtr.flush()?;
1372    Ok(())
1373}
1374#
1375# fn main() {
1376#     if let Err(err) = run() {
1377#         println!("{}", err);
1378#         process::exit(1);
1379#     }
1380# }
1381```
1382
1383Compiling and running this program gives the expected output:
1384
1385```text
1386$ cargo build
1387$ ./target/debug/csvtutor
1388City,State,Population,Latitude,Longitude
1389Davidsons Landing,AK,,65.2419444,-165.2716667
1390Kenai,AK,7610,60.5544444,-151.2583333
1391Oakman,AL,,33.7133333,-87.3886111
1392```
1393
1394The key thing to note in the above example is the use of `serialize` instead
1395of `write_record` to write our data. In particular, `write_record` is used
1396when writing a simple record that contains string-like data only. On the other
hand, `serialize` is used when your data consists of more complex values like
integers, floats or optional values. Of course, you could always convert the
1399complex values to strings and then use `write_record`, but Serde can do it for
1400you automatically.
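
For comparison, here is roughly what doing that conversion by hand might look
like. The function `to_fields` is a hypothetical helper that uses only the
standard library. It turns one typed row into five strings, writing a missing
population as an empty field.

```rust
// Hypothetical helper: convert a typed row into string fields by hand.
fn to_fields(
    city: &str,
    state: &str,
    population: Option<u64>,
    latitude: f64,
    longitude: f64,
) -> [String; 5] {
    [
        city.to_string(),
        state.to_string(),
        // `None` becomes an empty string, `Some(n)` becomes its digits.
        population.map(|n| n.to_string()).unwrap_or_default(),
        latitude.to_string(),
        longitude.to_string(),
    ]
}
```

Since `String` satisfies the `AsRef<[u8]>` bound, the resulting array could be
passed straight to `write_record`. The cost is that every new record shape
needs its own conversion function.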
1401
1402As with reading, we can also serialize custom structs as CSV records. As a
1403bonus, the fields in a struct will automatically be written as a header
1404record!
1405
To write custom structs as CSV records, we'll need to make use of Serde's
automatic `derive` feature again. As in the
[previous section on reading with Serde](#reading-with-serde),
we'll need to add Serde to the `[dependencies]` section of our
`Cargo.toml` (if it isn't already there):
1411
1412```text
1413serde = { version = "1", features = ["derive"] }
1414```
1415
1416And we'll also need to add a new `use` statement to our code, for Serde, as
1417shown in the example:
1418
1419```no_run
1420//tutorial-write-serde-02.rs
1421use std::{error::Error, io, process};
1422
1423use serde::Serialize;
1424
1425// Note that structs can derive both Serialize and Deserialize!
1426#[derive(Debug, Serialize)]
1427#[serde(rename_all = "PascalCase")]
1428struct Record<'a> {
1429    city: &'a str,
1430    state: &'a str,
1431    population: Option<u64>,
1432    latitude: f64,
1433    longitude: f64,
1434}
1435
1436fn run() -> Result<(), Box<dyn Error>> {
1437    let mut wtr = csv::Writer::from_writer(io::stdout());
1438
1439    wtr.serialize(Record {
1440        city: "Davidsons Landing",
1441        state: "AK",
1442        population: None,
1443        latitude: 65.2419444,
1444        longitude: -165.2716667,
1445    })?;
1446    wtr.serialize(Record {
1447        city: "Kenai",
1448        state: "AK",
1449        population: Some(7610),
1450        latitude: 60.5544444,
1451        longitude: -151.2583333,
1452    })?;
1453    wtr.serialize(Record {
1454        city: "Oakman",
1455        state: "AL",
1456        population: None,
1457        latitude: 33.7133333,
1458        longitude: -87.3886111,
1459    })?;
1460
1461    wtr.flush()?;
1462    Ok(())
1463}
1464
1465fn main() {
1466    if let Err(err) = run() {
1467        println!("{}", err);
1468        process::exit(1);
1469    }
1470}
1471```
1472
1473Compiling and running this example has the same output as last time, even
1474though we didn't explicitly write a header record:
1475
1476```text
1477$ cargo build
1478$ ./target/debug/csvtutor
1479City,State,Population,Latitude,Longitude
1480Davidsons Landing,AK,,65.2419444,-165.2716667
1481Kenai,AK,7610,60.5544444,-151.2583333
1482Oakman,AL,,33.7133333,-87.3886111
1483```
1484
1485In this case, the `serialize` method noticed that we were writing a struct
1486with field names. When this happens, `serialize` will automatically write a
1487header record (only if no other records have been written) that consists of
1488the fields in the struct in the order in which they are defined. Note that
1489this behavior can be disabled with the
1490[`WriterBuilder::has_headers`](../struct.WriterBuilder.html#method.has_headers)
1491method.
1492
1493It's also worth pointing out the use of a *lifetime parameter* in our `Record`
1494struct:
1495
1496```ignore
1497struct Record<'a> {
1498    city: &'a str,
1499    state: &'a str,
1500    population: Option<u64>,
1501    latitude: f64,
1502    longitude: f64,
1503}
1504```
1505
1506The `'a` lifetime parameter corresponds to the lifetime of the `city` and
1507`state` string slices. This says that the `Record` struct contains *borrowed*
1508data. We could have written our struct without borrowing any data, and
1509therefore, without any lifetime parameters:
1510
1511```ignore
1512struct Record {
1513    city: String,
1514    state: String,
1515    population: Option<u64>,
1516    latitude: f64,
1517    longitude: f64,
1518}
1519```
1520
1521However, since we had to replace our borrowed `&str` types with owned `String`
1522types, we're now forced to allocate a new `String` value for both of `city`
1523and `state` for every record that we write. There's no intrinsic problem with
1524doing that, but it might be a bit wasteful.
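
The difference between borrowing and owning can be made concrete with a small
standard-library sketch. The types `BorrowedRecord` and `OwnedRecord` and the
functions `borrow_record` and `own_record` are hypothetical, trimmed down to
two fields, and use a naive split on the first comma rather than real CSV
parsing.

```rust
// Both fields point into the original line. Nothing is copied.
#[derive(Debug, PartialEq)]
struct BorrowedRecord<'a> {
    city: &'a str,
    state: &'a str,
}

// Both fields are separate heap allocations.
#[derive(Debug, PartialEq)]
struct OwnedRecord {
    city: String,
    state: String,
}

// Hypothetical helper: split `line` at its first comma and borrow both
// halves. The returned record can't outlive `line`.
fn borrow_record(line: &str) -> Option<BorrowedRecord<'_>> {
    let (city, state) = line.split_once(',')?;
    Some(BorrowedRecord { city, state })
}

// Hypothetical helper: same split, but each half is copied into a new
// `String`, so the record is independent of `line`.
fn own_record(line: &str) -> Option<OwnedRecord> {
    let record = borrow_record(line)?;
    Some(OwnedRecord {
        city: record.city.to_string(),
        state: record.state.to_string(),
    })
}
```

The borrowed version is cheaper to build, but the compiler will insist that
the source line stays alive for as long as the record is in use.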
1525
1526For more examples and more details on the rules for serialization, please see
1527the
1528[`Writer::serialize`](../struct.Writer.html#method.serialize)
1529method.
1530
1531# Pipelining
1532
1533In this section, we're going to cover a few examples that demonstrate programs
1534that take CSV data as input, and produce possibly transformed or filtered CSV
1535data as output. This shows how to write a complete program that efficiently
1536reads and writes CSV data. Rust is well positioned to perform this task, since
1537you'll get great performance with the convenience of a high level CSV library.
1538
1539## Filter by search
1540
1541The first example of CSV pipelining we'll look at is a simple filter. It takes
1542as input some CSV data on stdin and a single string query as its only
1543positional argument, and it will produce as output CSV data that only contains
1544rows with a field that matches the query.
1545
1546```no_run
1547//tutorial-pipeline-search-01.rs
1548use std::{env, error::Error, io, process};
1549
1550fn run() -> Result<(), Box<dyn Error>> {
1551    // Get the query from the positional arguments.
1552    // If one doesn't exist, return an error.
1553    let query = match env::args().nth(1) {
1554        None => return Err(From::from("expected 1 argument, but got none")),
1555        Some(query) => query,
1556    };
1557
1558    // Build CSV readers and writers to stdin and stdout, respectively.
1559    let mut rdr = csv::Reader::from_reader(io::stdin());
1560    let mut wtr = csv::Writer::from_writer(io::stdout());
1561
1562    // Before reading our data records, we should write the header record.
1563    wtr.write_record(rdr.headers()?)?;
1564
1565    // Iterate over all the records in `rdr`, and write only records containing
1566    // `query` to `wtr`.
1567    for result in rdr.records() {
1568        let record = result?;
1569        if record.iter().any(|field| field == query) {
1570            wtr.write_record(&record)?;
1571        }
1572    }
1573
1574    // CSV writers use an internal buffer, so we should always flush when done.
1575    wtr.flush()?;
1576    Ok(())
1577}
1578
1579fn main() {
1580    if let Err(err) = run() {
1581        println!("{}", err);
1582        process::exit(1);
1583    }
1584}
1585```
1586
1587If we compile and run this program with a query of `MA` on `uspop.csv`, we'll
1588see that only one record matches:
1589
1590```text
1591$ cargo build
$ ./target/debug/csvtutor MA < uspop.csv
1593City,State,Population,Latitude,Longitude
1594Reading,MA,23441,42.5255556,-71.0958333
1595```
1596
1597This example doesn't actually introduce anything new. It merely combines what
1598you've already learned about CSV readers and writers from previous sections.
1599
1600Let's add a twist to this example. In the real world, you're often faced with
1601messy CSV data that might not be encoded correctly. One example you might come
1602across is CSV data encoded in
1603[Latin-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1).
1604Unfortunately, for the examples we've seen so far, our CSV reader assumes that
1605all of the data is UTF-8. Since all of the data we've worked on has been
1606ASCII---which is a subset of both Latin-1 and UTF-8---we haven't had any
1607problems. But let's introduce a slightly tweaked version of our `uspop.csv`
1608file that contains an encoding of a Latin-1 character that is invalid UTF-8.
1609You can get the data like so:
1610
1611```text
1612$ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop-latin1.csv'
1613```
1614
Even though I've already given away the problem, let's see what happens when
we try to run our previous example on this new data:
1617
1618```text
$ ./target/debug/csvtutor MA < uspop-latin1.csv
1620City,State,Population,Latitude,Longitude
1621CSV parse error: record 3 (line 4, field: 0, byte: 125): invalid utf-8: invalid UTF-8 in field 0 near byte index 0
1622```
1623
1624The error message tells us exactly what's wrong. Let's take a look at line 4
1625to see what we're dealing with:
1626
1627```text
1628$ head -n4 uspop-latin1.csv | tail -n1
1629Õakman,AL,,33.7133333,-87.3886111
1630```
1631
1632In this case, the very first character is the Latin-1 `Õ`, which is encoded as
1633the byte `0xD5`, which is in turn invalid UTF-8. So what do we do now that our
1634CSV parser has choked on our data? You have two choices. The first is to go in
1635and fix up your CSV data so that it's valid UTF-8. This is probably a good
1636idea anyway, and tools like `iconv` can help with the task of transcoding.
1637But if you can't or don't want to do that, then you can instead read CSV data
1638in a way that is mostly encoding agnostic (so long as ASCII is still a valid
1639subset). The trick is to use *byte records* instead of *string records*.
1640
Thus far, we haven't actually talked much about the types of records in this
library, but now is a good time to introduce them. There are two of them,
[`StringRecord`](../struct.StringRecord.html)
and
[`ByteRecord`](../struct.ByteRecord.html).
Each of them represents a single record in CSV data, where a record is a
sequence of an arbitrary number of fields. The only difference between
`StringRecord` and `ByteRecord` is that `StringRecord` is guaranteed to be
valid UTF-8, whereas `ByteRecord` contains arbitrary bytes.
1650
1651Armed with that knowledge, we can now begin to understand why we saw an error
1652when we ran the last example on data that wasn't UTF-8. Namely, when we call
1653`records`, we get back an iterator of `StringRecord`. Since `StringRecord` is
1654guaranteed to be valid UTF-8, trying to build a `StringRecord` with invalid
1655UTF-8 will result in the error that we see.
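
The same distinction shows up in the standard library, where `&str` must be
valid UTF-8 but `&[u8]` can hold anything. The sketch below uses hypothetical
helper names and no `csv` types. It checks whether a field is valid UTF-8,
compares fields as raw bytes, and decodes Latin-1 by hand.

```rust
// Hypothetical helper: would this field be acceptable as a `&str`?
fn is_valid_utf8(field: &[u8]) -> bool {
    std::str::from_utf8(field).is_ok()
}

// Hypothetical helper: compare fields as raw bytes, which works no matter
// how the fields are encoded.
fn any_field_equals(fields: &[&[u8]], query: &str) -> bool {
    fields.iter().any(|field| *field == query.as_bytes())
}

// Hypothetical helper: decode Latin-1 by mapping each byte to the Unicode
// code point with the same value.
fn latin1_to_string(field: &[u8]) -> String {
    field.iter().map(|&byte| byte as char).collect()
}
```

Here, `is_valid_utf8(b"\xD5akman")` is `false`, yet `any_field_equals` can
still search a record that contains that field. `latin1_to_string` relies on
the fact that Latin-1 byte values coincide with the first 256 Unicode code
points.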
1656
1657All we need to do to make our example work is to switch from a `StringRecord`
1658to a `ByteRecord`. This means using `byte_records` to create our iterator
1659instead of `records`, and similarly using `byte_headers` instead of `headers`
1660if we think our header data might contain invalid UTF-8 as well. Here's the
1661change:
1662
```no_run
//tutorial-pipeline-search-02.rs
# use std::{env, error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let query = match env::args().nth(1) {
        None => return Err(From::from("expected 1 argument, but got none")),
        Some(query) => query,
    };

    let mut rdr = csv::Reader::from_reader(io::stdin());
    let mut wtr = csv::Writer::from_writer(io::stdout());

    wtr.write_record(rdr.byte_headers()?)?;

    for result in rdr.byte_records() {
        let record = result?;
        // `query` is a `String` while `field` is now a `&[u8]`, so we'll
        // need to convert `query` to `&[u8]` before doing a comparison.
        if record.iter().any(|field| field == query.as_bytes()) {
            wtr.write_record(&record)?;
        }
    }

    wtr.flush()?;
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```
1698
1699Compiling and running this now yields the same results as our first example,
1700but this time it works on data that isn't valid UTF-8.
1701
```text
$ cargo build
$ ./target/debug/csvtutor MA < uspop-latin1.csv
City,State,Population,Latitude,Longitude
Reading,MA,23441,42.5255556,-71.0958333
```
1708
1709## Filter by population count
1710
1711In this section, we will show another example program that both reads and
1712writes CSV data, but instead of dealing with arbitrary records, we will use
1713Serde to deserialize and serialize records with specific types.
1714
1715For this program, we'd like to be able to filter records in our population data
1716by population count. Specifically, we'd like to see which records meet a
1717certain population threshold. In addition to using a simple inequality, we must
1718also account for records that have a missing population count. This is where
1719types like `Option<T>` come in handy, because the compiler will force us to
1720consider the case when the population count is missing.
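
As a small sketch of what "the compiler will force us" means, the
hypothetical helper below (not part of the `csv` crate) spells the check out
with a `match`. Leaving out the `None` arm would be a compile error, so the
missing-population case can't be forgotten by accident.

```rust
/// Hypothetical helper: does a possibly missing population meet the minimum?
fn meets_minimum(population: Option<u64>, minimum_pop: u64) -> bool {
    match population {
        Some(pop) => pop >= minimum_pop,
        // A missing count never meets the threshold.
        None => false,
    }
}

fn main() {
    assert!(meets_minimum(Some(169160), 100000));
    assert!(!meets_minimum(Some(23441), 100000));
    // Even a threshold of zero excludes records with no count at all.
    assert!(!meets_minimum(None, 0));
}
```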
1721
1722Since we're using Serde in this example, don't forget to add the Serde
1723dependencies to your `Cargo.toml` in your `[dependencies]` section if they
1724aren't already there:
1725
```text
serde = { version = "1", features = ["derive"] }
```
1729
1730Now here's the code:
1731
```no_run
//tutorial-pipeline-pop-01.rs
# use std::{env, error::Error, io, process};

use serde::{Deserialize, Serialize};

// Unlike previous examples, we derive both Deserialize and Serialize. This
// means we'll be able to automatically deserialize and serialize this type.
#[derive(Debug, Deserialize, Serialize)]
#[serde(rename_all = "PascalCase")]
struct Record {
    city: String,
    state: String,
    population: Option<u64>,
    latitude: f64,
    longitude: f64,
}

fn run() -> Result<(), Box<dyn Error>> {
    // Get the minimum population from the positional arguments.
    // If one doesn't exist or isn't an integer, return an error.
    let minimum_pop: u64 = match env::args().nth(1) {
        None => return Err(From::from("expected 1 argument, but got none")),
        Some(arg) => arg.parse()?,
    };

    // Build CSV readers and writers to stdin and stdout, respectively.
    // Note that we don't need to write headers explicitly. Since we're
    // serializing a custom struct, that's done for us automatically.
    let mut rdr = csv::Reader::from_reader(io::stdin());
    let mut wtr = csv::Writer::from_writer(io::stdout());

    // Iterate over all the records in `rdr`, and write only records containing
    // a population that is greater than or equal to `minimum_pop`.
    for result in rdr.deserialize() {
        // Remember that when deserializing, we must use a type hint to
        // indicate which type we want to deserialize our record into.
        let record: Record = result?;

        // `is_some_and` is a combinator on `Option`. It takes a closure that
        // returns `bool` when the `Option` is `Some`. When the `Option` is
        // `None`, `false` is always returned. In this case, we test it against
        // our minimum population count that we got from the command line.
        if record.population.is_some_and(|pop| pop >= minimum_pop) {
            wtr.serialize(record)?;
        }
    }

    // CSV writers use an internal buffer, so we should always flush when done.
    wtr.flush()?;
    Ok(())
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}
```
1792
1793If we compile and run our program with a minimum threshold of `100000`, we
1794should see three matching records. Notice that the headers were added even
1795though we never explicitly wrote them!
1796
```text
$ cargo build
$ ./target/debug/csvtutor 100000 < uspop.csv
City,State,Population,Latitude,Longitude
Fontana,CA,169160,34.0922222,-117.4341667
Bridgeport,CT,139090,41.1669444,-73.2052778
Indianapolis,IN,773283,39.7683333,-86.1580556
```
1805
1806# Performance
1807
1808In this section, we'll go over how to squeeze the most juice out of our CSV
1809reader. As it happens, most of the APIs we've seen so far were designed with
1810high level convenience in mind, and that often comes with some costs. For the
1811most part, those costs revolve around unnecessary allocations. Therefore, most
1812of the section will show how to do CSV parsing with as little allocation as
1813possible.
1814
1815There are two critical preliminaries we must cover.
1816
1817Firstly, when you care about performance, you should compile your code
1818with `cargo build --release` instead of `cargo build`. The `--release`
1819flag instructs the compiler to spend more time optimizing your code. When
1820compiling with the `--release` flag, you'll find your compiled program at
1821`target/release/csvtutor` instead of `target/debug/csvtutor`. Throughout this
1822tutorial, we've used `cargo build` because our dataset was small and we weren't
1823focused on speed. The downside of `cargo build --release` is that it will take
1824longer than `cargo build`.
1825
1826Secondly, the dataset we've used throughout this tutorial only has 100 records.
1827We'd have to try really hard to cause our program to run slowly on 100 records,
1828even when we compile without the `--release` flag. Therefore, in order to
1829actually witness a performance difference, we need a bigger dataset. To get
1830such a dataset, we'll use the original source of `uspop.csv`. **Warning: the
1831download is 41MB compressed and decompresses to 145MB.**
1832
```text
$ curl -LO http://burntsushi.net/stuff/worldcitiespop.csv.gz
$ gunzip worldcitiespop.csv.gz
$ wc worldcitiespop.csv
  3173959   5681543 151492068 worldcitiespop.csv
$ md5sum worldcitiespop.csv
6198bd180b6d6586626ecbf044c1cca5  worldcitiespop.csv
```
1841
1842Finally, it's worth pointing out that this section is not attempting to
1843present a rigorous set of benchmarks. We will stay away from rigorous analysis
1844and instead rely a bit more on wall clock times and intuition.
1845
1846## Amortizing allocations
1847
1848In order to measure performance, we must be careful about what it is we're
1849measuring. We must also be careful to not change the thing we're measuring as
1850we make improvements to the code. For this reason, we will focus on measuring
1851how long it takes to count the number of records corresponding to city
1852population counts in Massachusetts. This represents a very small amount of work
1853that requires us to visit every record, and therefore represents a decent way
1854to measure how long it takes to do CSV parsing.
1855
Before diving into our first optimization, let's start with a baseline by
adapting a previous example to count the number of Massachusetts records in
`worldcitiespop.csv`:
1859
```no_run
//tutorial-perf-alloc-01.rs
use std::{error::Error, io, process};

fn run() -> Result<u64, Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());

    let mut count = 0;
    for result in rdr.records() {
        let record = result?;
        if &record[0] == "us" && &record[3] == "MA" {
            count += 1;
        }
    }
    Ok(count)
}

fn main() {
    match run() {
        Ok(count) => {
            println!("{}", count);
        }
        Err(err) => {
            println!("{}", err);
            process::exit(1);
        }
    }
}
```
1889
1890Now let's compile and run it and see what kind of timing we get. Don't forget
1891to compile with the `--release` flag. (For grins, try compiling without the
1892`--release` flag and see how long it takes to run the program!)
1893
```text
$ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
2176

real    0m0.645s
user    0m0.627s
sys     0m0.017s
```
1903
1904All right, so what's the first thing we can do to make this faster? This
1905section promised to speed things up by amortizing allocation, but we can do
1906something even simpler first: iterate over
1907[`ByteRecord`](../struct.ByteRecord.html)s
1908instead of
1909[`StringRecord`](../struct.StringRecord.html)s.
1910If you recall from a previous section, a `StringRecord` is guaranteed to be
valid UTF-8, and therefore must validate that its contents are actually UTF-8.
1912(If validation fails, then the CSV reader will return an error.) If we remove
1913that validation from our program, then we can realize a nice speed boost as
1914shown in the next example:
1915
```no_run
//tutorial-perf-alloc-02.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<u64, Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());

    let mut count = 0;
    for result in rdr.byte_records() {
        let record = result?;
        if &record[0] == b"us" && &record[3] == b"MA" {
            count += 1;
        }
    }
    Ok(count)
}
#
# fn main() {
#     match run() {
#         Ok(count) => {
#             println!("{}", count);
#         }
#         Err(err) => {
#             println!("{}", err);
#             process::exit(1);
#         }
#     }
# }
```
1945
1946And now compile and run:
1947
```text
$ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
2176

real    0m0.429s
user    0m0.403s
sys     0m0.023s
```
1957
1958Our program is now approximately 30% faster, all because we removed UTF-8
1959validation. But was it actually okay to remove UTF-8 validation? What have we
1960lost? In this case, it is perfectly acceptable to drop UTF-8 validation and use
1961`ByteRecord` instead because all we're doing with the data in the record is
1962comparing two of its fields to raw bytes:
1963
```ignore
if &record[0] == b"us" && &record[3] == b"MA" {
    count += 1;
}
```
1969
1970In particular, it doesn't matter whether `record` is valid UTF-8 or not, since
1971we're checking for equality on the raw bytes themselves.
1972
1973UTF-8 validation via `StringRecord` is useful because it provides access to
fields as `&str` types, whereas `ByteRecord` provides fields as `&[u8]` types.
1975`&str` is the type of a borrowed string in Rust, which provides convenient
1976access to string APIs like substring search. Strings are also frequently used
1977in other areas, so they tend to be a useful thing to have. Therefore, sticking
1978with `StringRecord` is a good default, but if you need the extra speed and can
1979deal with arbitrary bytes, then switching to `ByteRecord` might be a good idea.
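
To illustrate the convenience gap, here is a small standard-library-only
sketch. Both helper names are made up for this example. Substring search on
`&str` is a single method call, while for `&[u8]` this sketch falls back to a
naive hand-written window scan.

```rust
/// Substring search on text: `str::contains` is built in.
fn text_contains(field: &str, needle: &str) -> bool {
    field.contains(needle)
}

/// Substring search on raw bytes: slices have no `contains` for
/// sub-slices, so this naive sketch checks every window by hand.
fn bytes_contain(field: &[u8], needle: &[u8]) -> bool {
    // `windows(0)` would panic, so treat an empty needle as always found.
    needle.is_empty() || field.windows(needle.len()).any(|w| w == needle)
}

fn main() {
    assert!(text_contains("Reading", "ead"));
    assert!(bytes_contain(b"Reading", b"ead"));
    assert!(!bytes_contain(b"Reading", b"xyz"));
    // Bytes that are not valid UTF-8 can still be searched.
    assert!(bytes_contain(b"Orl\xE9ans", b"ans"));
}
```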
1980
Moving on, let's try to get another speed boost by amortizing allocation.
Amortizing allocation is a technique that creates an allocation once (or
very rarely), and then attempts to reuse it instead of creating additional
allocations. In the case of the previous examples, we used iterators created
by the `records` and `byte_records` methods on a CSV reader. These iterators
allocate a new record for every item that they yield, which in turn
corresponds to a new allocation. They do this because iterators cannot yield
items that borrow from the iterator itself, and because creating new
allocations tends to be a lot more convenient.
1990
1991If we're willing to forgo use of iterators, then we can amortize allocations
1992by creating a *single* `ByteRecord` and asking the CSV reader to read into it.
1993We do this by using the
1994[`Reader::read_byte_record`](../struct.Reader.html#method.read_byte_record)
1995method.
1996
```no_run
//tutorial-perf-alloc-03.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<u64, Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    let mut record = csv::ByteRecord::new();

    let mut count = 0;
    while rdr.read_byte_record(&mut record)? {
        if &record[0] == b"us" && &record[3] == b"MA" {
            count += 1;
        }
    }
    Ok(count)
}
#
# fn main() {
#     match run() {
#         Ok(count) => {
#             println!("{}", count);
#         }
#         Err(err) => {
#             println!("{}", err);
#             process::exit(1);
#         }
#     }
# }
```
2026
2027Compile and run:
2028
```text
$ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
2176

real    0m0.308s
user    0m0.283s
sys     0m0.023s
```
2038
2039Woohoo! This represents *another* 30% boost over the previous example, which is
2040a 50% boost over the first example.
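
Since these percentages are rounded, it can help to see how they fall out of
the `real` times reported in this section. The sketch below, with a helper
name chosen purely for illustration, computes the reduction in wall clock
time between each pair of runs. Your own timings will differ.

```rust
/// Percentage reduction in wall clock time going from `old` to `new` seconds.
fn percent_faster(old: f64, new: f64) -> f64 {
    (old - new) / old * 100.0
}

fn main() {
    // `real` times from the three runs in this section, in seconds.
    let (string_records, byte_records, reused_record) = (0.645, 0.429, 0.308);

    println!(
        "StringRecord iterator -> ByteRecord iterator: {:.0}%",
        percent_faster(string_records, byte_records)
    );
    println!(
        "ByteRecord iterator -> reused ByteRecord: {:.0}%",
        percent_faster(byte_records, reused_record)
    );
    println!(
        "StringRecord iterator -> reused ByteRecord: {:.0}%",
        percent_faster(string_records, reused_record)
    );
}
```

With these timings, the three reductions come out to roughly 33%, 28% and 52%.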
2041
2042Let's dissect this code by taking a look at the type signature of the
2043`read_byte_record` method:
2044
```ignore
fn read_byte_record(&mut self, record: &mut ByteRecord) -> csv::Result<bool>;
```
2048
2049This method takes as input a CSV reader (the `self` parameter) and a *mutable
2050borrow* of a `ByteRecord`, and returns a `csv::Result<bool>`. (The
2051`csv::Result<bool>` is equivalent to `Result<bool, csv::Error>`.) The return
2052value is `true` if and only if a record was read. When it's `false`, that means
2053the reader has exhausted its input. This method works by copying the contents
2054of the next record into the provided `ByteRecord`. Since the same `ByteRecord`
2055is used to read every record, it will already have space allocated for data.
2056When `read_byte_record` runs, it will overwrite the contents that were there
2057with the new record, which means that it can reuse the space that was
2058allocated. Thus, we have *amortized allocation*.
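
The same pattern shows up in the standard library, which makes for a compact
analogy. The sketch below doesn't use the `csv` crate at all, and both
function names are made up. `BufRead::lines` allocates a fresh `String` for
every line, much like the `records` iterator does for every record, while
`BufRead::read_line` fills a buffer that the caller owns and can reuse.

```rust
use std::io::{self, BufRead};

/// Count lines starting with `prefix`, allocating a fresh `String` per line.
fn count_allocating<R: BufRead>(rdr: R, prefix: &str) -> io::Result<u64> {
    let mut count = 0;
    for line in rdr.lines() {
        if line?.starts_with(prefix) {
            count += 1;
        }
    }
    Ok(count)
}

/// Same count, but every line is read into one reused buffer.
fn count_amortized<R: BufRead>(mut rdr: R, prefix: &str) -> io::Result<u64> {
    let mut count = 0;
    let mut line = String::new();
    // `read_line` appends and returns the number of bytes read; 0 means EOF.
    while rdr.read_line(&mut line)? != 0 {
        if line.starts_with(prefix) {
            count += 1;
        }
        // Clearing keeps the capacity, so the allocation is reused.
        line.clear();
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    let data = "us,MA\nca,ON\nus,NY\n";
    assert_eq!(count_allocating(data.as_bytes(), "us,")?, 2);
    assert_eq!(count_amortized(data.as_bytes(), "us,")?, 2);
    Ok(())
}
```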
2059
2060An exercise you might consider doing is to use a `StringRecord` instead of a
2061`ByteRecord`, and therefore
2062[`Reader::read_record`](../struct.Reader.html#method.read_record)
2063instead of `read_byte_record`. This will give you easy access to Rust strings
2064at the cost of UTF-8 validation but *without* the cost of allocating a new
2065`StringRecord` for every record.
2066
2067## Serde and zero allocation
2068
2069In this section, we are going to briefly examine how we use Serde and what we
2070can do to speed it up. The key optimization we'll want to make is to---you
2071guessed it---amortize allocation.
2072
2073As with the previous section, let's start with a simple baseline based off an
2074example using Serde in a previous section:
2075
```no_run
//tutorial-perf-serde-01.rs
# #![allow(dead_code)]
use std::{error::Error, io, process};

use serde::Deserialize;

#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct Record {
    country: String,
    city: String,
    accent_city: String,
    region: String,
    population: Option<u64>,
    latitude: f64,
    longitude: f64,
}

fn run() -> Result<u64, Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());

    let mut count = 0;
    for result in rdr.deserialize() {
        let record: Record = result?;
        if record.country == "us" && record.region == "MA" {
            count += 1;
        }
    }
    Ok(count)
}

fn main() {
    match run() {
        Ok(count) => {
            println!("{}", count);
        }
        Err(err) => {
            println!("{}", err);
            process::exit(1);
        }
    }
}
```
2120
2121Now compile and run this program:
2122
```text
$ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
2176

real    0m1.381s
user    0m1.367s
sys     0m0.013s
```
2132
2133The first thing you might notice is that this is quite a bit slower than our
2134programs in the previous section. This is because deserializing each record
2135has a certain amount of overhead to it. In particular, some of the fields need
2136to be parsed as integers or floating point numbers, which isn't free. However,
2137there is hope yet, because we can speed up this program!
2138
2139Our first attempt to speed up the program will be to amortize allocation. Doing
2140this with Serde is a bit trickier than before, because we need to change our
2141`Record` type and use the manual deserialization API. Let's see what that looks
2142like:
2143
```no_run
//tutorial-perf-serde-02.rs
# #![allow(dead_code)]
# use std::{error::Error, io, process};
# use serde::Deserialize;
#
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct Record<'a> {
    country: &'a str,
    city: &'a str,
    accent_city: &'a str,
    region: &'a str,
    population: Option<u64>,
    latitude: f64,
    longitude: f64,
}

fn run() -> Result<u64, Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    let mut raw_record = csv::StringRecord::new();
    let headers = rdr.headers()?.clone();

    let mut count = 0;
    while rdr.read_record(&mut raw_record)? {
        let record: Record = raw_record.deserialize(Some(&headers))?;
        if record.country == "us" && record.region == "MA" {
            count += 1;
        }
    }
    Ok(count)
}
#
# fn main() {
#     match run() {
#         Ok(count) => {
#             println!("{}", count);
#         }
#         Err(err) => {
#             println!("{}", err);
#             process::exit(1);
#         }
#     }
# }
```
2189
2190Compile and run:
2191
```text
$ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
2176

real    0m1.055s
user    0m1.040s
sys     0m0.013s
```
2201
2202This corresponds to an approximately 24% increase in performance. To achieve
2203this, we had to make two important changes.
2204
The first was to make our `Record` type contain `&str` fields instead of
`String` fields. If you recall from a previous section, `&str` is a *borrowed*
string whereas a `String` is an *owned* string. A borrowed string points to
an already existing allocation, whereas a `String` always implies a new
allocation. In this case, our `&str` is borrowing from the CSV record itself.
2210
2211The second change we had to make was to stop using the
2212[`Reader::deserialize`](../struct.Reader.html#method.deserialize)
2213iterator, and instead deserialize our record into a `StringRecord` explicitly
2214and then use the
2215[`StringRecord::deserialize`](../struct.StringRecord.html#method.deserialize)
2216method to deserialize a single record.
2217
2218The second change is a bit tricky, because in order for it to work, our
2219`Record` type needs to borrow from the data inside the `StringRecord`. That
2220means that our `Record` value cannot outlive the `StringRecord` that it was
2221created from. Since we overwrite the same `StringRecord` on each iteration
2222(in order to amortize allocation), that means our `Record` value must evaporate
2223before the next iteration of the loop. Indeed, the compiler will enforce this!
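
The borrowing rule can be seen in miniature without Serde or the `csv` crate.
In the sketch below, `Borrowed`, `parse` and `count_ma` are all hypothetical
names, and the parser is a toy that splits on commas with no quoting support.
A single `String` buffer plays the role of the reused `StringRecord`, and each
parsed record borrows from it for one loop iteration only.

```rust
/// A hypothetical two-field record that borrows from the line it came from.
#[derive(Debug, PartialEq)]
struct Borrowed<'a> {
    country: &'a str,
    region: &'a str,
}

/// Toy parser: splits on commas with no quoting support (unlike real CSV).
fn parse(line: &str) -> Option<Borrowed<'_>> {
    let mut fields = line.trim_end().split(',');
    Some(Borrowed { country: fields.next()?, region: fields.next()? })
}

fn count_ma(lines: &[&str]) -> u64 {
    let mut buf = String::new();
    let mut count = 0;
    for line in lines {
        // Overwrite the single buffer instead of allocating a new one.
        buf.clear();
        buf.push_str(line);
        // `record` borrows from `buf`, so it must be gone before the next
        // `buf.clear()`. Holding it across iterations would not compile.
        if let Some(record) = parse(&buf) {
            if record.country == "us" && record.region == "MA" {
                count += 1;
            }
        }
    }
    count
}

fn main() {
    assert_eq!(count_ma(&["us,MA", "ca,ON", "us,MA\n"]), 2);
    assert_eq!(parse("us,MA"), Some(Borrowed { country: "us", region: "MA" }));
}
```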
2224
2225There is one more optimization we can make: remove UTF-8 validation. In
2226general, this means using `&[u8]` instead of `&str` and `ByteRecord` instead
2227of `StringRecord`:
2228
```no_run
//tutorial-perf-serde-03.rs
# #![allow(dead_code)]
# use std::{error::Error, io, process};
#
# use serde::Deserialize;
#
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct Record<'a> {
    country: &'a [u8],
    city: &'a [u8],
    accent_city: &'a [u8],
    region: &'a [u8],
    population: Option<u64>,
    latitude: f64,
    longitude: f64,
}

fn run() -> Result<u64, Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    let mut raw_record = csv::ByteRecord::new();
    let headers = rdr.byte_headers()?.clone();

    let mut count = 0;
    while rdr.read_byte_record(&mut raw_record)? {
        let record: Record = raw_record.deserialize(Some(&headers))?;
        if record.country == b"us" && record.region == b"MA" {
            count += 1;
        }
    }
    Ok(count)
}
#
# fn main() {
#     match run() {
#         Ok(count) => {
#             println!("{}", count);
#         }
#         Err(err) => {
#             println!("{}", err);
#             process::exit(1);
#         }
#     }
# }
```
2275
2276Compile and run:
2277
```text
$ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
2176

real    0m0.873s
user    0m0.850s
sys     0m0.023s
```
2287
2288This corresponds to a 17% increase over the previous example and a 37% increase
2289over the first example.
2290
2291In sum, Serde parsing is still quite fast, but will generally not be the
2292fastest way to parse CSV since it necessarily needs to do more work.
2293
2294## CSV parsing without the standard library
2295
2296In this section, we will explore a niche use case: parsing CSV without the
2297standard library. While the `csv` crate itself requires the standard library,
2298the underlying parser is actually part of the
2299[`csv-core`](https://docs.rs/csv-core)
2300crate, which does not depend on the standard library. The downside of not
2301depending on the standard library is that CSV parsing becomes a lot more
2302inconvenient.
2303
2304The `csv-core` crate is structured similarly to the `csv` crate. There is a
2305[`Reader`](../../csv_core/struct.Reader.html)
2306and a
2307[`Writer`](../../csv_core/struct.Writer.html),
2308as well as corresponding builders
2309[`ReaderBuilder`](../../csv_core/struct.ReaderBuilder.html)
2310and
2311[`WriterBuilder`](../../csv_core/struct.WriterBuilder.html).
2312The `csv-core` crate has no record types or iterators. Instead, CSV data
2313can either be read one field at a time or one record at a time. In this
2314section, we'll focus on reading a field at a time since it is simpler, but it
2315is generally faster to read a record at a time since it does more work per
2316function call.
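
To get a feel for what a reader without record types or iterators asks of its
caller, here is a toy sketch written against the standard library only. It is
*not* the `csv-core` API or implementation: the `Step` enum, `toy_read_field`
and `count_records` are invented for illustration, and the toy ignores quoting
entirely. What it does share with a push-style parser is the contract. Each
call reports how many input bytes were consumed and how many bytes were
written to a caller-provided buffer, and the caller does the bookkeeping.

```rust
/// Outcome of one call to the toy reader.
#[derive(Debug)]
enum Step {
    /// A complete field was copied to the output buffer.
    Field { record_end: bool },
    /// The output buffer is too small for the field.
    OutputFull,
    /// No input is left.
    End,
}

/// Toy field reader: no quoting, `,` separates fields, `\n` ends a record.
/// Returns the step, the input bytes consumed and the output bytes written.
fn toy_read_field(input: &[u8], out: &mut [u8]) -> (Step, usize, usize) {
    if input.is_empty() {
        return (Step::End, 0, 0);
    }
    let mut nwrite = 0;
    for (i, &b) in input.iter().enumerate() {
        match b {
            b',' => return (Step::Field { record_end: false }, i + 1, nwrite),
            b'\n' => return (Step::Field { record_end: true }, i + 1, nwrite),
            _ if nwrite == out.len() => return (Step::OutputFull, i, nwrite),
            _ => {
                out[nwrite] = b;
                nwrite += 1;
            }
        }
    }
    // Input ended without a trailing newline: treat it as the last field.
    (Step::Field { record_end: true }, input.len(), nwrite)
}

/// Count records by driving the toy reader one field at a time.
fn count_records(mut data: &[u8]) -> Option<u64> {
    let mut out = [0u8; 16];
    let mut count = 0;
    loop {
        let (step, nread, _nwrite) = toy_read_field(data, &mut out);
        // Never feed already consumed bytes to the reader again.
        data = &data[nread..];
        match step {
            Step::Field { record_end: true } => count += 1,
            Step::Field { record_end: false } => {}
            Step::OutputFull => return None,
            Step::End => return Some(count),
        }
    }
}

fn main() {
    assert_eq!(count_records(b"us,MA\nca,ON\n"), Some(2));
    // A field longer than the 16-byte buffer makes the sketch give up.
    assert_eq!(count_records(b"a-field-that-is-far-too-long\n"), None);
}
```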
2317
2318In keeping with this section on performance, let's write a program using only
2319`csv-core` that counts the number of records in the state of Massachusetts.
2320
2321(Note that we unfortunately use the standard library in this example even
2322though `csv-core` doesn't technically require it. We do this for convenient
2323access to I/O, which would be harder without the standard library.)
2324
```no_run
//tutorial-perf-core-01.rs
use std::io::{self, Read};
use std::process;

use csv_core::{Reader, ReadFieldResult};

fn run(mut data: &[u8]) -> Option<u64> {
    let mut rdr = Reader::new();

    // Count the number of records in Massachusetts.
    let mut count = 0;
    // Indicates the current field index. Reset to 0 at start of each record.
    let mut fieldidx = 0;
    // True when the current record is in the United States.
    let mut inus = false;
    // Buffer for field data. Must be big enough to hold the largest field.
    let mut field = [0; 1024];
    loop {
        // Attempt to incrementally read the next CSV field.
        let (result, nread, nwrite) = rdr.read_field(data, &mut field);
        // nread is the number of bytes read from our input. We should never
        // pass those bytes to read_field again.
        data = &data[nread..];
        // nwrite is the number of bytes written to the output buffer `field`.
        // The contents of the buffer after this point is unspecified.
        let field = &field[..nwrite];

        match result {
            // We don't need to handle this case because we read all of the
            // data up front. If we were reading data incrementally, then this
            // would be a signal to read more.
            ReadFieldResult::InputEmpty => {}
            // If we get this case, then we found a field that contains more
            // than 1024 bytes. We keep this example simple and just fail.
            ReadFieldResult::OutputFull => {
                return None;
            }
            // This case happens when we've successfully read a field. If the
            // field is the last field in a record, then `record_end` is true.
            ReadFieldResult::Field { record_end } => {
                if fieldidx == 0 && field == b"us" {
                    inus = true;
                } else if inus && fieldidx == 3 && field == b"MA" {
                    count += 1;
                }
                if record_end {
                    fieldidx = 0;
                    inus = false;
                } else {
                    fieldidx += 1;
                }
            }
            // This case happens when the CSV reader has successfully exhausted
            // all input.
            ReadFieldResult::End => {
                break;
            }
        }
    }
    Some(count)
}

fn main() {
    // Read the entire contents of stdin up front.
    let mut data = vec![];
    if let Err(err) = io::stdin().read_to_end(&mut data) {
        println!("{}", err);
        process::exit(1);
    }
    match run(&data) {
        None => {
            println!("error: could not count records, buffer too small");
            process::exit(1);
        }
        Some(count) => {
            println!("{}", count);
        }
    }
}
```
2406
2407And compile and run it:
2408
```text
$ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
2176

real    0m0.572s
user    0m0.513s
sys     0m0.057s
```
2418
2419This isn't as fast as some of our previous examples where we used the `csv`
2420crate to read into a `StringRecord` or a `ByteRecord`. This is mostly because
2421this example reads a field at a time, which incurs more overhead than reading a
2422record at a time. To fix this, you would want to use the
2423[`Reader::read_record`](../../csv_core/struct.Reader.html#method.read_record)
2424method instead, which is defined on `csv_core::Reader`.
2425
The other thing to notice here is that the example is considerably longer than
the other examples. This is because we need to do more bookkeeping to keep
track of which field we're reading and how much data we've already fed to the
reader. There are basically two reasons to use the `csv-core` crate:

1. If you're in an environment where the standard library is not usable.
2. If you wanted to build your own csv-like library, you could build it on top
   of `csv-core`.
2434
2435# Closing thoughts
2436
2437Congratulations on making it to the end! It seems incredible that one could
2438write so many words on something as basic as CSV parsing. I wanted this
2439guide to be accessible not only to Rust beginners, but to inexperienced
2440programmers as well. My hope is that the large number of examples will help
2441push you in the right direction.
2442
2443With that said, here are a few more things you might want to look at:
2444
2445* The [API documentation for the `csv` crate](../index.html) documents all
2446  facets of the library, and is itself littered with even more examples.
2447* The [`csv-index` crate](https://docs.rs/csv-index) provides data structures
2448  that can index CSV data that are amenable to writing to disk. (This library
2449  is still a work in progress.)
2450* The [`xsv` command line tool](https://github.com/BurntSushi/xsv) is a high
2451  performance CSV swiss army knife. It can slice, select, search, sort, join,
2452  concatenate, index, format and compute statistics on arbitrary CSV data. Give
2453  it a try!
2454
2455*/