regex_automata/nfa/thompson/
range_trie.rs

1/*
2I've called the primary data structure in this module a "range trie." As far
3as I can tell, there is no prior art on a data structure like this, however,
4it's likely someone somewhere has built something like it. Searching for
5"range trie" turns up the paper "Range Tries for Scalable Address Lookup,"
6but it does not appear relevant.
7
8The range trie is just like a trie in that it is a special case of a
9deterministic finite state machine. It has states and each state has a set
10of transitions to other states. It is acyclic, and, like a normal trie,
11it makes no attempt to reuse common suffixes among its elements. The key
12difference between a normal trie and a range trie below is that a range trie
13operates on *contiguous sequences* of bytes instead of singleton bytes.
14One could say say that our alphabet is ranges of bytes instead of bytes
15themselves, except a key part of range trie construction is splitting ranges
16apart to ensure there is at most one transition that can be taken for any
17byte in a given state.
18
19I've tried to explain the details of how the range trie works below, so
20for now, we are left with trying to understand what problem we're trying to
21solve. Which is itself fairly involved!
22
23At the highest level, here's what we want to do. We want to convert a
24sequence of Unicode codepoints into a finite state machine whose transitions
25are over *bytes* and *not* Unicode codepoints. We want this because it makes
26said finite state machines much smaller and much faster to execute. As a
27simple example, consider a byte oriented automaton for all Unicode scalar
28values (0x00 through 0x10FFFF, not including surrogate codepoints):
29
30    [00-7F]
31    [C2-DF][80-BF]
32    [E0-E0][A0-BF][80-BF]
33    [E1-EC][80-BF][80-BF]
34    [ED-ED][80-9F][80-BF]
35    [EE-EF][80-BF][80-BF]
36    [F0-F0][90-BF][80-BF][80-BF]
37    [F1-F3][80-BF][80-BF][80-BF]
38    [F4-F4][80-8F][80-BF][80-BF]
39
40(These byte ranges are generated via the regex-syntax::utf8 module, which
41was based on Russ Cox's code in RE2, which was in turn based on Ken
42Thompson's implementation of the same idea in his Plan9 implementation of
43grep.)
44
45It should be fairly straight-forward to see how one could compile this into
46a DFA. The sequences are sorted and non-overlapping. Essentially, you could
47build a trie from this fairly easy. The problem comes when your initial
48range (in this case, 0x00-0x10FFFF) isn't so nice. For example, the class
49represented by '\w' contains only a tenth of the codepoints that
500x00-0x10FFFF contains, but if we were to write out the byte based ranges
51as we did above, the list would stretch to 892 entries! This turns into
52quite a large NFA with a few thousand states. Turning this beast into a DFA
53takes quite a bit of time. We are thus left with trying to trim down the
54number of states we produce as early as possible.
55
56One approach (used by RE2 and still by the regex crate, at time of writing)
57is to try to find common suffixes while building NFA states for the above
58and reuse them. This is very cheap to do and one can control precisely how
59much extra memory you want to use for the cache.
60
61Another approach, however, is to reuse an algorithm for constructing a
62*minimal* DFA from a sorted sequence of inputs. I don't want to go into
63the full details here, but I explain it in more depth in my blog post on
64FSTs[1]. Note that the algorithm was not invented by me, but was published
65in paper by Daciuk et al. in 2000 called "Incremental Construction of
66MinimalAcyclic Finite-State Automata." Like the suffix cache approach above,
67it is also possible to control the amount of extra memory one uses, although
68this usually comes with the cost of sacrificing true minimality. (But it's
69typically close enough with a reasonably sized cache of states.)
70
71The catch is that Daciuk's algorithm only works if you add your keys in
72lexicographic ascending order. In our case, since we're dealing with ranges,
73we also need the additional requirement that ranges are either equivalent
74or do not overlap at all. For example, if one were given the following byte
75ranges:
76
77    [BC-BF][80-BF]
78    [BC-BF][90-BF]
79
80Then Daciuk's algorithm would not work, since there is nothing to handle the
81fact that the ranges overlap. They would need to be split apart. Thankfully,
82Thompson's algorithm for producing byte ranges for Unicode codepoint ranges
83meets both of our requirements. (A proof for this eludes me, but it appears
84true.)
85
86... however, we would also like to be able to compile UTF-8 automata in
87reverse. We want this because in order to find the starting location of a
88match using a DFA, we need to run a second DFA---a reversed version of the
89forward DFA---backwards to discover the match location. Unfortunately, if
90we reverse our byte sequences for 0x00-0x10FFFF, we get sequences that are
91can overlap, even if they are sorted:
92
93    [00-7F]
94    [80-BF][80-9F][ED-ED]
95    [80-BF][80-BF][80-8F][F4-F4]
96    [80-BF][80-BF][80-BF][F1-F3]
97    [80-BF][80-BF][90-BF][F0-F0]
98    [80-BF][80-BF][E1-EC]
99    [80-BF][80-BF][EE-EF]
100    [80-BF][A0-BF][E0-E0]
101    [80-BF][C2-DF]
102
103For example, '[80-BF][80-BF][EE-EF]' and '[80-BF][A0-BF][E0-E0]' have
104overlapping ranges between '[80-BF]' and '[A0-BF]'. Thus, there is no
105simple way to apply Daciuk's algorithm.
106
107And thus, the range trie was born. The range trie's only purpose is to take
108sequences of byte ranges like the ones above, collect them into a trie and then
109spit them out in a sorted fashion with no overlapping ranges. For example,
1100x00-0x10FFFF gets translated to:
111
112    [0-7F]
113    [80-BF][80-9F][80-8F][F1-F3]
114    [80-BF][80-9F][80-8F][F4]
115    [80-BF][80-9F][90-BF][F0]
116    [80-BF][80-9F][90-BF][F1-F3]
117    [80-BF][80-9F][E1-EC]
118    [80-BF][80-9F][ED]
119    [80-BF][80-9F][EE-EF]
120    [80-BF][A0-BF][80-8F][F1-F3]
121    [80-BF][A0-BF][80-8F][F4]
122    [80-BF][A0-BF][90-BF][F0]
123    [80-BF][A0-BF][90-BF][F1-F3]
124    [80-BF][A0-BF][E0]
125    [80-BF][A0-BF][E1-EC]
126    [80-BF][A0-BF][EE-EF]
127    [80-BF][C2-DF]
128
129We've thus satisfied our requirements for running Daciuk's algorithm. All
130sequences of ranges are sorted, and any corresponding ranges are either
131exactly equivalent or non-overlapping.
132
133In effect, a range trie is building a DFA from a sequence of arbitrary byte
134ranges. But it uses an algorithm custom tailored to its input, so it is not as
135costly as traditional DFA construction. While it is still quite a bit more
136costly than the forward case (which only needs Daciuk's algorithm), it winds
137up saving a substantial amount of time if one is doing a full DFA powerset
138construction later by virtue of producing a much much smaller NFA.
139
140[1] - https://blog.burntsushi.net/transducers/
141[2] - https://www.mitpressjournals.org/doi/pdfplus/10.1162/089120100561601
142*/
143
144use core::{cell::RefCell, fmt, mem, ops::RangeInclusive};
145
146use alloc::{format, string::String, vec, vec::Vec};
147
148use regex_syntax::utf8::Utf8Range;
149
150use crate::util::primitives::StateID;
151
152/// There is only one final state in this trie. Every sequence of byte ranges
153/// added shares the same final state.
154const FINAL: StateID = StateID::ZERO;
155
156/// The root state of the trie.
157const ROOT: StateID = StateID::new_unchecked(1);
158
159/// A range trie represents an ordered set of sequences of bytes.
160///
161/// A range trie accepts as input a sequence of byte ranges and merges
162/// them into the existing set such that the trie can produce a sorted
163/// non-overlapping sequence of byte ranges. The sequence emitted corresponds
164/// precisely to the sequence of bytes matched by the given keys, although the
165/// byte ranges themselves may be split at different boundaries.
166///
167/// The order complexity of this data structure seems difficult to analyze.
168/// If the size of a byte is held as a constant, then insertion is clearly
169/// O(n) where n is the number of byte ranges in the input key. However, if
170/// k=256 is our alphabet size, then insertion could be O(k^2 * n). In
171/// particular it seems possible for pathological inputs to cause insertion
172/// to do a lot of work. However, for what we use this data structure for,
173/// there should be no pathological inputs since the ultimate source is always
174/// a sorted set of Unicode scalar value ranges.
175///
176/// Internally, this trie is setup like a finite state machine. Note though
177/// that it is acyclic.
178#[derive(Clone)]
179pub struct RangeTrie {
180    /// The states in this trie. The first is always the shared final state.
181    /// The second is always the root state. Otherwise, there is no
182    /// particular order.
183    states: Vec<State>,
184    /// A free-list of states. When a range trie is cleared, all of its states
185    /// are added to this list. Creating a new state reuses states from this
186    /// list before allocating a new one.
187    free: Vec<State>,
188    /// A stack for traversing this trie to yield sequences of byte ranges in
189    /// lexicographic order.
190    iter_stack: RefCell<Vec<NextIter>>,
191    /// A buffer that stores the current sequence during iteration.
192    iter_ranges: RefCell<Vec<Utf8Range>>,
193    /// A stack used for traversing the trie in order to (deeply) duplicate
194    /// a state. States are recursively duplicated when ranges are split.
195    dupe_stack: Vec<NextDupe>,
196    /// A stack used for traversing the trie during insertion of a new
197    /// sequence of byte ranges.
198    insert_stack: Vec<NextInsert>,
199}
200
201/// A single state in this trie.
202#[derive(Clone)]
203struct State {
204    /// A sorted sequence of non-overlapping transitions to other states. Each
205    /// transition corresponds to a single range of bytes.
206    transitions: Vec<Transition>,
207}
208
209/// A transition is a single range of bytes. If a particular byte is in this
210/// range, then the corresponding machine may transition to the state pointed
211/// to by `next_id`.
212#[derive(Clone)]
213struct Transition {
214    /// The byte range.
215    range: Utf8Range,
216    /// The next state to transition to.
217    next_id: StateID,
218}
219
220impl RangeTrie {
221    /// Create a new empty range trie.
222    pub fn new() -> RangeTrie {
223        let mut trie = RangeTrie {
224            states: vec![],
225            free: vec![],
226            iter_stack: RefCell::new(vec![]),
227            iter_ranges: RefCell::new(vec![]),
228            dupe_stack: vec![],
229            insert_stack: vec![],
230        };
231        trie.clear();
232        trie
233    }
234
235    /// Clear this range trie such that it is empty. Clearing a range trie
236    /// and reusing it can beneficial because this may reuse allocations.
237    pub fn clear(&mut self) {
238        self.free.append(&mut self.states);
239        self.add_empty(); // final
240        self.add_empty(); // root
241    }
242
243    /// Iterate over all of the sequences of byte ranges in this trie, and
244    /// call the provided function for each sequence. Iteration occurs in
245    /// lexicographic order.
246    pub fn iter<E, F: FnMut(&[Utf8Range]) -> Result<(), E>>(
247        &self,
248        mut f: F,
249    ) -> Result<(), E> {
250        let mut stack = self.iter_stack.borrow_mut();
251        stack.clear();
252        let mut ranges = self.iter_ranges.borrow_mut();
253        ranges.clear();
254
255        // We do iteration in a way that permits us to use a single buffer
256        // for our keys. We iterate in a depth first fashion, while being
257        // careful to expand our frontier as we move deeper in the trie.
258        stack.push(NextIter { state_id: ROOT, tidx: 0 });
259        while let Some(NextIter { mut state_id, mut tidx }) = stack.pop() {
260            // This could be implemented more simply without an inner loop
261            // here, but at the cost of more stack pushes.
262            loop {
263                let state = self.state(state_id);
264                // If we've visited all transitions in this state, then pop
265                // back to the parent state.
266                if tidx >= state.transitions.len() {
267                    ranges.pop();
268                    break;
269                }
270
271                let t = &state.transitions[tidx];
272                ranges.push(t.range);
273                if t.next_id == FINAL {
274                    f(&ranges)?;
275                    ranges.pop();
276                    tidx += 1;
277                } else {
278                    // Expand our frontier. Once we come back to this state
279                    // via the stack, start in on the next transition.
280                    stack.push(NextIter { state_id, tidx: tidx + 1 });
281                    // Otherwise, move to the first transition of the next
282                    // state.
283                    state_id = t.next_id;
284                    tidx = 0;
285                }
286            }
287        }
288        Ok(())
289    }
290
291    /// Inserts a new sequence of ranges into this trie.
292    ///
293    /// The sequence given must be non-empty and must not have a length
294    /// exceeding 4.
295    pub fn insert(&mut self, ranges: &[Utf8Range]) {
296        assert!(!ranges.is_empty());
297        assert!(ranges.len() <= 4);
298
299        let mut stack = core::mem::replace(&mut self.insert_stack, vec![]);
300        stack.clear();
301
302        stack.push(NextInsert::new(ROOT, ranges));
303        while let Some(next) = stack.pop() {
304            let (state_id, ranges) = (next.state_id(), next.ranges());
305            assert!(!ranges.is_empty());
306
307            let (mut new, rest) = (ranges[0], &ranges[1..]);
308
309            // i corresponds to the position of the existing transition on
310            // which we are operating. Typically, the result is to remove the
311            // transition and replace it with two or more new transitions
312            // corresponding to the partitions generated by splitting the
313            // 'new' with the ith transition's range.
314            let mut i = self.state(state_id).find(new);
315
316            // In this case, there is no overlap *and* the new range is greater
317            // than all existing ranges. So we can just add it to the end.
318            if i == self.state(state_id).transitions.len() {
319                let next_id = NextInsert::push(self, &mut stack, rest);
320                self.add_transition(state_id, new, next_id);
321                continue;
322            }
323
324            // The need for this loop is a bit subtle, buf basically, after
325            // we've handled the partitions from our initial split, it's
326            // possible that there will be a partition leftover that overlaps
327            // with a subsequent transition. If so, then we have to repeat
328            // the split process again with the leftovers and that subsequent
329            // transition.
330            'OUTER: loop {
331                let old = self.state(state_id).transitions[i].clone();
332                let split = match Split::new(old.range, new) {
333                    Some(split) => split,
334                    None => {
335                        let next_id = NextInsert::push(self, &mut stack, rest);
336                        self.add_transition_at(i, state_id, new, next_id);
337                        continue;
338                    }
339                };
340                let splits = split.as_slice();
341                // If we only have one partition, then the ranges must be
342                // equivalent. There's nothing to do here for this state, so
343                // just move on to the next one.
344                if splits.len() == 1 {
345                    // ... but only if we have anything left to do.
346                    if !rest.is_empty() {
347                        stack.push(NextInsert::new(old.next_id, rest));
348                    }
349                    break;
350                }
351                // At this point, we know that 'split' is non-empty and there
352                // must be some overlap AND that the two ranges are not
353                // equivalent. Therefore, the existing range MUST be removed
354                // and split up somehow. Instead of actually doing the removal
355                // and then a subsequent insertion---with all the memory
356                // shuffling that entails---we simply overwrite the transition
357                // at position `i` for the first new transition we want to
358                // insert. After that, we're forced to do expensive inserts.
359                let mut first = true;
360                let mut add_trans =
361                    |trie: &mut RangeTrie, pos, from, range, to| {
362                        if first {
363                            trie.set_transition_at(pos, from, range, to);
364                            first = false;
365                        } else {
366                            trie.add_transition_at(pos, from, range, to);
367                        }
368                    };
369                for (j, &srange) in splits.iter().enumerate() {
370                    match srange {
371                        SplitRange::Old(r) => {
372                            // Deep clone the state pointed to by the ith
373                            // transition. This is always necessary since 'old'
374                            // is always coupled with at least a 'both'
375                            // partition. We don't want any new changes made
376                            // via the 'both' partition to impact the part of
377                            // the transition that doesn't overlap with the
378                            // new range.
379                            let dup_id = self.duplicate(old.next_id);
380                            add_trans(self, i, state_id, r, dup_id);
381                        }
382                        SplitRange::New(r) => {
383                            // This is a bit subtle, but if this happens to be
384                            // the last partition in our split, it is possible
385                            // that this overlaps with a subsequent transition.
386                            // If it does, then we must repeat the whole
387                            // splitting process over again with `r` and the
388                            // subsequent transition.
389                            {
390                                let trans = &self.state(state_id).transitions;
391                                if j + 1 == splits.len()
392                                    && i < trans.len()
393                                    && intersects(r, trans[i].range)
394                                {
395                                    new = r;
396                                    continue 'OUTER;
397                                }
398                            }
399
400                            // ... otherwise, setup exploration for a new
401                            // empty state and add a brand new transition for
402                            // this new range.
403                            let next_id =
404                                NextInsert::push(self, &mut stack, rest);
405                            add_trans(self, i, state_id, r, next_id);
406                        }
407                        SplitRange::Both(r) => {
408                            // Continue adding the remaining ranges on this
409                            // path and update the transition with the new
410                            // range.
411                            if !rest.is_empty() {
412                                stack.push(NextInsert::new(old.next_id, rest));
413                            }
414                            add_trans(self, i, state_id, r, old.next_id);
415                        }
416                    }
417                    i += 1;
418                }
419                // If we've reached this point, then we know that there are
420                // no subsequent transitions with any overlap. Therefore, we
421                // can stop processing this range and move on to the next one.
422                break;
423            }
424        }
425        self.insert_stack = stack;
426    }
427
428    pub fn add_empty(&mut self) -> StateID {
429        let id = match StateID::try_from(self.states.len()) {
430            Ok(id) => id,
431            Err(_) => {
432                // This generally should not happen since a range trie is
433                // only ever used to compile a single sequence of Unicode
434                // scalar values. If we ever got to this point, we would, at
435                // *minimum*, be using 96GB in just the range trie alone.
436                panic!("too many sequences added to range trie");
437            }
438        };
439        // If we have some free states available, then use them to avoid
440        // more allocations.
441        if let Some(mut state) = self.free.pop() {
442            state.clear();
443            self.states.push(state);
444        } else {
445            self.states.push(State { transitions: vec![] });
446        }
447        id
448    }
449
450    /// Performs a deep clone of the given state and returns the duplicate's
451    /// state ID.
452    ///
453    /// A "deep clone" in this context means that the state given along with
454    /// recursively all states that it points to are copied. Once complete,
455    /// the given state ID and the returned state ID share nothing.
456    ///
457    /// This is useful during range trie insertion when a new range overlaps
458    /// with an existing range that is bigger than the new one. The part
459    /// of the existing range that does *not* overlap with the new one is
460    /// duplicated so that adding the new range to the overlap doesn't disturb
461    /// the non-overlapping portion.
462    ///
463    /// There's one exception: if old_id is the final state, then it is not
464    /// duplicated and the same final state is returned. This is because all
465    /// final states in this trie are equivalent.
466    fn duplicate(&mut self, old_id: StateID) -> StateID {
467        if old_id == FINAL {
468            return FINAL;
469        }
470
471        let mut stack = mem::replace(&mut self.dupe_stack, vec![]);
472        stack.clear();
473
474        let new_id = self.add_empty();
475        // old_id is the state we're cloning and new_id is the ID of the
476        // duplicated state for old_id.
477        stack.push(NextDupe { old_id, new_id });
478        while let Some(NextDupe { old_id, new_id }) = stack.pop() {
479            for i in 0..self.state(old_id).transitions.len() {
480                let t = self.state(old_id).transitions[i].clone();
481                if t.next_id == FINAL {
482                    // All final states are the same, so there's no need to
483                    // duplicate it.
484                    self.add_transition(new_id, t.range, FINAL);
485                    continue;
486                }
487
488                let new_child_id = self.add_empty();
489                self.add_transition(new_id, t.range, new_child_id);
490                stack.push(NextDupe {
491                    old_id: t.next_id,
492                    new_id: new_child_id,
493                });
494            }
495        }
496        self.dupe_stack = stack;
497        new_id
498    }
499
500    /// Adds the given transition to the given state.
501    ///
502    /// Callers must ensure that all previous transitions in this state
503    /// are lexicographically smaller than the given range.
504    fn add_transition(
505        &mut self,
506        from_id: StateID,
507        range: Utf8Range,
508        next_id: StateID,
509    ) {
510        self.state_mut(from_id)
511            .transitions
512            .push(Transition { range, next_id });
513    }
514
515    /// Like `add_transition`, except this inserts the transition just before
516    /// the ith transition.
517    fn add_transition_at(
518        &mut self,
519        i: usize,
520        from_id: StateID,
521        range: Utf8Range,
522        next_id: StateID,
523    ) {
524        self.state_mut(from_id)
525            .transitions
526            .insert(i, Transition { range, next_id });
527    }
528
529    /// Overwrites the transition at position i with the given transition.
530    fn set_transition_at(
531        &mut self,
532        i: usize,
533        from_id: StateID,
534        range: Utf8Range,
535        next_id: StateID,
536    ) {
537        self.state_mut(from_id).transitions[i] = Transition { range, next_id };
538    }
539
540    /// Return an immutable borrow for the state with the given ID.
541    fn state(&self, id: StateID) -> &State {
542        &self.states[id]
543    }
544
545    /// Return a mutable borrow for the state with the given ID.
546    fn state_mut(&mut self, id: StateID) -> &mut State {
547        &mut self.states[id]
548    }
549}
550
551impl State {
552    /// Find the position at which the given range should be inserted in this
553    /// state.
554    ///
555    /// The position returned is always in the inclusive range
556    /// [0, transitions.len()]. If 'transitions.len()' is returned, then the
557    /// given range overlaps with no other range in this state *and* is greater
558    /// than all of them.
559    ///
560    /// For all other possible positions, the given range either overlaps
561    /// with the transition at that position or is otherwise less than it
562    /// with no overlap (and is greater than the previous transition). In the
563    /// former case, careful attention must be paid to inserting this range
564    /// as a new transition. In the latter case, the range can be inserted as
565    /// a new transition at the given position without disrupting any other
566    /// transitions.
567    fn find(&self, range: Utf8Range) -> usize {
568        /// Returns the position `i` at which `pred(xs[i])` first returns true
569        /// such that for all `j >= i`, `pred(xs[j]) == true`. If `pred` never
570        /// returns true, then `xs.len()` is returned.
571        ///
572        /// We roll our own binary search because it doesn't seem like the
573        /// standard library's binary search can be used here. Namely, if
574        /// there is an overlapping range, then we want to find the first such
575        /// occurrence, but there may be many. Or at least, it's not quite
576        /// clear to me how to do it.
577        fn binary_search<T, F>(xs: &[T], mut pred: F) -> usize
578        where
579            F: FnMut(&T) -> bool,
580        {
581            let (mut left, mut right) = (0, xs.len());
582            while left < right {
583                // Overflow is impossible because xs.len() <= 256.
584                let mid = (left + right) / 2;
585                if pred(&xs[mid]) {
586                    right = mid;
587                } else {
588                    left = mid + 1;
589                }
590            }
591            left
592        }
593
594        // Benchmarks suggest that binary search is just a bit faster than
595        // straight linear search. Specifically when using the debug tool:
596        //
597        //   hyperfine "regex-cli debug thompson -qr --captures none '\w{90} ecurB'"
598        binary_search(&self.transitions, |t| range.start <= t.range.end)
599    }
600
601    /// Clear this state such that it has zero transitions.
602    fn clear(&mut self) {
603        self.transitions.clear();
604    }
605}
606
607/// The next state to process during duplication.
608#[derive(Clone, Debug)]
609struct NextDupe {
610    /// The state we want to duplicate.
611    old_id: StateID,
612    /// The ID of the new state that is a duplicate of old_id.
613    new_id: StateID,
614}
615
616/// The next state (and its corresponding transition) that we want to visit
617/// during iteration in lexicographic order.
618#[derive(Clone, Debug)]
619struct NextIter {
620    state_id: StateID,
621    tidx: usize,
622}
623
624/// The next state to process during insertion and any remaining ranges that we
625/// want to add for a particular sequence of ranges. The first such instance
626/// is always the root state along with all ranges given.
627#[derive(Clone, Debug)]
628struct NextInsert {
629    /// The next state to begin inserting ranges. This state should be the
630    /// state at which `ranges[0]` should be inserted.
631    state_id: StateID,
632    /// The ranges to insert. We used a fixed-size array here to avoid an
633    /// allocation.
634    ranges: [Utf8Range; 4],
635    /// The number of valid ranges in the above array.
636    len: u8,
637}
638
639impl NextInsert {
640    /// Create the next item to visit. The given state ID should correspond
641    /// to the state at which the first range in the given slice should be
642    /// inserted. The slice given must not be empty and it must be no longer
643    /// than 4.
644    fn new(state_id: StateID, ranges: &[Utf8Range]) -> NextInsert {
645        let len = ranges.len();
646        assert!(len > 0);
647        assert!(len <= 4);
648
649        let mut tmp = [Utf8Range { start: 0, end: 0 }; 4];
650        tmp[..len].copy_from_slice(ranges);
651        NextInsert { state_id, ranges: tmp, len: u8::try_from(len).unwrap() }
652    }
653
654    /// Push a new empty state to visit along with any remaining ranges that
655    /// still need to be inserted. The ID of the new empty state is returned.
656    ///
657    /// If ranges is empty, then no new state is created and FINAL is returned.
658    fn push(
659        trie: &mut RangeTrie,
660        stack: &mut Vec<NextInsert>,
661        ranges: &[Utf8Range],
662    ) -> StateID {
663        if ranges.is_empty() {
664            FINAL
665        } else {
666            let next_id = trie.add_empty();
667            stack.push(NextInsert::new(next_id, ranges));
668            next_id
669        }
670    }
671
672    /// Return the ID of the state to visit.
673    fn state_id(&self) -> StateID {
674        self.state_id
675    }
676
677    /// Return the remaining ranges to insert.
678    fn ranges(&self) -> &[Utf8Range] {
679        &self.ranges[..usize::try_from(self.len).unwrap()]
680    }
681}
682
683/// Split represents a partitioning of two ranges into one or more ranges. This
684/// is the secret sauce that makes a range trie work, as it's what tells us
685/// how to deal with two overlapping but unequal ranges during insertion.
686///
687/// Essentially, either two ranges overlap or they don't. If they don't, then
688/// handling insertion is easy: just insert the new range into its
689/// lexicographically correct position. Since it does not overlap with anything
690/// else, no other transitions are impacted by the new range.
691///
692/// If they do overlap though, there are generally three possible cases to
693/// handle:
694///
695/// 1. The part where the two ranges actually overlap. i.e., The intersection.
696/// 2. The part of the existing range that is not in the new range.
697/// 3. The part of the new range that is not in the old range.
698///
699/// (1) is guaranteed to always occur since all overlapping ranges have a
700/// non-empty intersection. If the two ranges are not equivalent, then at
701/// least one of (2) or (3) is guaranteed to occur as well. In some cases,
702/// e.g., `[0-4]` and `[4-9]`, all three cases will occur.
703///
704/// This `Split` type is responsible for providing (1), (2) and (3) for any
705/// possible pair of byte ranges.
706///
707/// As for insertion, for the overlap in (1), the remaining ranges to insert
708/// should be added by following the corresponding transition. However, this
709/// should only be done for the overlapping parts of the range. If there was
710/// a part of the existing range that was not in the new range, then that
711/// existing part must be split off from the transition and duplicated. The
712/// remaining parts of the overlap can then be added to using the new ranges
713/// without disturbing the existing range.
714///
715/// Handling the case for the part of a new range that is not in an existing
716/// range is seemingly easy. Just treat it as if it were a non-overlapping
717/// range. The problem here is that if this new non-overlapping range occurs
718/// after both (1) and (2), then it's possible that it can overlap with the
719/// next transition in the current state. If it does, then the whole process
720/// must be repeated!
721///
722/// # Details of the 3 cases
723///
724/// The following details the various cases that are implemented in code
725/// below. It's plausible that the number of cases is not actually minimal,
726/// but it's important for this code to remain at least somewhat readable.
727///
728/// Given [a,b] and [x,y], where a <= b, x <= y, b < 256 and y < 256, we define
729/// the follow distinct relationships where at least one must apply. The order
730/// of these matters, since multiple can match. The first to match applies.
731///
732///   1. b < x <=> [a,b] < [x,y]
733///   2. y < a <=> [x,y] < [a,b]
734///
735/// In the case of (1) and (2), these are the only cases where there is no
736/// overlap. Or otherwise, the intersection of [a,b] and [x,y] is empty. In
737/// order to compute the intersection, one can do [max(a,x), min(b,y)]. The
738/// intersection in all of the following cases is non-empty.
739///
740///    3. a = x && b = y <=> [a,b] == [x,y]
741///    4. a = x && b < y <=> [x,y] right-extends [a,b]
742///    5. b = y && a > x <=> [x,y] left-extends [a,b]
743///    6. x = a && y < b <=> [a,b] right-extends [x,y]
744///    7. y = b && x > a <=> [a,b] left-extends [x,y]
745///    8. a > x && b < y <=> [x,y] covers [a,b]
746///    9. x > a && y < b <=> [a,b] covers [x,y]
747///   10. b = x && a < y <=> [a,b] is left-adjacent to [x,y]
748///   11. y = a && x < b <=> [x,y] is left-adjacent to [a,b]
749///   12. b > x && b < y <=> [a,b] left-overlaps [x,y]
750///   13. y > a && y < b <=> [x,y] left-overlaps [a,b]
751///
752/// In cases 3-13, we can form rules that partition the ranges into a
753/// non-overlapping ordered sequence of ranges:
754///
755///    3. [a,b]
756///    4. [a,b], [b+1,y]
757///    5. [x,a-1], [a,b]
758///    6. [x,y], [y+1,b]
759///    7. [a,x-1], [x,y]
760///    8. [x,a-1], [a,b], [b+1,y]
761///    9. [a,x-1], [x,y], [y+1,b]
762///   10. [a,b-1], [b,b], [b+1,y]
763///   11. [x,y-1], [y,y], [y+1,b]
764///   12. [a,x-1], [x,b], [b+1,y]
765///   13. [x,a-1], [a,y], [y+1,b]
766///
767/// In the code below, we go a step further and identify each of the above
768/// outputs as belonging either to the overlap of the two ranges or to one
769/// of [a,b] or [x,y] exclusively.
770#[derive(Clone, Debug, Eq, PartialEq)]
771struct Split {
772    partitions: [SplitRange; 3],
773    len: usize,
774}
775
776/// A tagged range indicating how it was derived from a pair of ranges.
777#[derive(Clone, Copy, Debug, Eq, PartialEq)]
778enum SplitRange {
779    Old(Utf8Range),
780    New(Utf8Range),
781    Both(Utf8Range),
782}
783
784impl Split {
785    /// Create a partitioning of the given ranges.
786    ///
787    /// If the given ranges have an empty intersection, then None is returned.
788    fn new(o: Utf8Range, n: Utf8Range) -> Option<Split> {
789        let range = |r: RangeInclusive<u8>| Utf8Range {
790            start: *r.start(),
791            end: *r.end(),
792        };
793        let old = |r| SplitRange::Old(range(r));
794        let new = |r| SplitRange::New(range(r));
795        let both = |r| SplitRange::Both(range(r));
796
797        // Use same names as the comment above to make it easier to compare.
798        let (a, b, x, y) = (o.start, o.end, n.start, n.end);
799
800        if b < x || y < a {
801            // case 1, case 2
802            None
803        } else if a == x && b == y {
804            // case 3
805            Some(Split::parts1(both(a..=b)))
806        } else if a == x && b < y {
807            // case 4
808            Some(Split::parts2(both(a..=b), new(b + 1..=y)))
809        } else if b == y && a > x {
810            // case 5
811            Some(Split::parts2(new(x..=a - 1), both(a..=b)))
812        } else if x == a && y < b {
813            // case 6
814            Some(Split::parts2(both(x..=y), old(y + 1..=b)))
815        } else if y == b && x > a {
816            // case 7
817            Some(Split::parts2(old(a..=x - 1), both(x..=y)))
818        } else if a > x && b < y {
819            // case 8
820            Some(Split::parts3(new(x..=a - 1), both(a..=b), new(b + 1..=y)))
821        } else if x > a && y < b {
822            // case 9
823            Some(Split::parts3(old(a..=x - 1), both(x..=y), old(y + 1..=b)))
824        } else if b == x && a < y {
825            // case 10
826            Some(Split::parts3(old(a..=b - 1), both(b..=b), new(b + 1..=y)))
827        } else if y == a && x < b {
828            // case 11
829            Some(Split::parts3(new(x..=y - 1), both(y..=y), old(y + 1..=b)))
830        } else if b > x && b < y {
831            // case 12
832            Some(Split::parts3(old(a..=x - 1), both(x..=b), new(b + 1..=y)))
833        } else if y > a && y < b {
834            // case 13
835            Some(Split::parts3(new(x..=a - 1), both(a..=y), old(y + 1..=b)))
836        } else {
837            unreachable!()
838        }
839    }
840
841    /// Create a new split with a single partition. This only occurs when two
842    /// ranges are equivalent.
843    fn parts1(r1: SplitRange) -> Split {
844        // This value doesn't matter since it is never accessed.
845        let nada = SplitRange::Old(Utf8Range { start: 0, end: 0 });
846        Split { partitions: [r1, nada, nada], len: 1 }
847    }
848
849    /// Create a new split with two partitions.
850    fn parts2(r1: SplitRange, r2: SplitRange) -> Split {
851        // This value doesn't matter since it is never accessed.
852        let nada = SplitRange::Old(Utf8Range { start: 0, end: 0 });
853        Split { partitions: [r1, r2, nada], len: 2 }
854    }
855
856    /// Create a new split with three partitions.
857    fn parts3(r1: SplitRange, r2: SplitRange, r3: SplitRange) -> Split {
858        Split { partitions: [r1, r2, r3], len: 3 }
859    }
860
861    /// Return the partitions in this split as a slice.
862    fn as_slice(&self) -> &[SplitRange] {
863        &self.partitions[..self.len]
864    }
865}
866
867impl fmt::Debug for RangeTrie {
868    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
869        writeln!(f)?;
870        for (i, state) in self.states.iter().enumerate() {
871            let status = if i == FINAL.as_usize() { '*' } else { ' ' };
872            writeln!(f, "{}{:06}: {:?}", status, i, state)?;
873        }
874        Ok(())
875    }
876}
877
878impl fmt::Debug for State {
879    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
880        let rs = self
881            .transitions
882            .iter()
883            .map(|t| format!("{:?}", t))
884            .collect::<Vec<String>>()
885            .join(", ");
886        write!(f, "{}", rs)
887    }
888}
889
890impl fmt::Debug for Transition {
891    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
892        if self.range.start == self.range.end {
893            write!(
894                f,
895                "{:02X} => {:02X}",
896                self.range.start,
897                self.next_id.as_usize(),
898            )
899        } else {
900            write!(
901                f,
902                "{:02X}-{:02X} => {:02X}",
903                self.range.start,
904                self.range.end,
905                self.next_id.as_usize(),
906            )
907        }
908    }
909}
910
911/// Returns true if and only if the given ranges intersect.
912fn intersects(r1: Utf8Range, r2: Utf8Range) -> bool {
913    !(r1.end < r2.start || r2.end < r1.start)
914}
915
916#[cfg(test)]
917mod tests {
918    use super::*;
919
920    fn r(range: RangeInclusive<u8>) -> Utf8Range {
921        Utf8Range { start: *range.start(), end: *range.end() }
922    }
923
924    fn split_maybe(
925        old: RangeInclusive<u8>,
926        new: RangeInclusive<u8>,
927    ) -> Option<Split> {
928        Split::new(r(old), r(new))
929    }
930
931    fn split(
932        old: RangeInclusive<u8>,
933        new: RangeInclusive<u8>,
934    ) -> Vec<SplitRange> {
935        split_maybe(old, new).unwrap().as_slice().to_vec()
936    }
937
938    #[test]
939    fn no_splits() {
940        // case 1
941        assert_eq!(None, split_maybe(0..=1, 2..=3));
942        // case 2
943        assert_eq!(None, split_maybe(2..=3, 0..=1));
944    }
945
946    #[test]
947    fn splits() {
948        let range = |r: RangeInclusive<u8>| Utf8Range {
949            start: *r.start(),
950            end: *r.end(),
951        };
952        let old = |r| SplitRange::Old(range(r));
953        let new = |r| SplitRange::New(range(r));
954        let both = |r| SplitRange::Both(range(r));
955
956        // case 3
957        assert_eq!(split(0..=0, 0..=0), vec![both(0..=0)]);
958        assert_eq!(split(9..=9, 9..=9), vec![both(9..=9)]);
959
960        // case 4
961        assert_eq!(split(0..=5, 0..=6), vec![both(0..=5), new(6..=6)]);
962        assert_eq!(split(0..=5, 0..=8), vec![both(0..=5), new(6..=8)]);
963        assert_eq!(split(5..=5, 5..=8), vec![both(5..=5), new(6..=8)]);
964
965        // case 5
966        assert_eq!(split(1..=5, 0..=5), vec![new(0..=0), both(1..=5)]);
967        assert_eq!(split(3..=5, 0..=5), vec![new(0..=2), both(3..=5)]);
968        assert_eq!(split(5..=5, 0..=5), vec![new(0..=4), both(5..=5)]);
969
970        // case 6
971        assert_eq!(split(0..=6, 0..=5), vec![both(0..=5), old(6..=6)]);
972        assert_eq!(split(0..=8, 0..=5), vec![both(0..=5), old(6..=8)]);
973        assert_eq!(split(5..=8, 5..=5), vec![both(5..=5), old(6..=8)]);
974
975        // case 7
976        assert_eq!(split(0..=5, 1..=5), vec![old(0..=0), both(1..=5)]);
977        assert_eq!(split(0..=5, 3..=5), vec![old(0..=2), both(3..=5)]);
978        assert_eq!(split(0..=5, 5..=5), vec![old(0..=4), both(5..=5)]);
979
980        // case 8
981        assert_eq!(
982            split(3..=6, 2..=7),
983            vec![new(2..=2), both(3..=6), new(7..=7)],
984        );
985        assert_eq!(
986            split(3..=6, 1..=8),
987            vec![new(1..=2), both(3..=6), new(7..=8)],
988        );
989
990        // case 9
991        assert_eq!(
992            split(2..=7, 3..=6),
993            vec![old(2..=2), both(3..=6), old(7..=7)],
994        );
995        assert_eq!(
996            split(1..=8, 3..=6),
997            vec![old(1..=2), both(3..=6), old(7..=8)],
998        );
999
1000        // case 10
1001        assert_eq!(
1002            split(3..=6, 6..=7),
1003            vec![old(3..=5), both(6..=6), new(7..=7)],
1004        );
1005        assert_eq!(
1006            split(3..=6, 6..=8),
1007            vec![old(3..=5), both(6..=6), new(7..=8)],
1008        );
1009        assert_eq!(
1010            split(5..=6, 6..=7),
1011            vec![old(5..=5), both(6..=6), new(7..=7)],
1012        );
1013
1014        // case 11
1015        assert_eq!(
1016            split(6..=7, 3..=6),
1017            vec![new(3..=5), both(6..=6), old(7..=7)],
1018        );
1019        assert_eq!(
1020            split(6..=8, 3..=6),
1021            vec![new(3..=5), both(6..=6), old(7..=8)],
1022        );
1023        assert_eq!(
1024            split(6..=7, 5..=6),
1025            vec![new(5..=5), both(6..=6), old(7..=7)],
1026        );
1027
1028        // case 12
1029        assert_eq!(
1030            split(3..=7, 5..=9),
1031            vec![old(3..=4), both(5..=7), new(8..=9)],
1032        );
1033        assert_eq!(
1034            split(3..=5, 4..=6),
1035            vec![old(3..=3), both(4..=5), new(6..=6)],
1036        );
1037
1038        // case 13
1039        assert_eq!(
1040            split(5..=9, 3..=7),
1041            vec![new(3..=4), both(5..=7), old(8..=9)],
1042        );
1043        assert_eq!(
1044            split(4..=6, 3..=5),
1045            vec![new(3..=3), both(4..=5), old(6..=6)],
1046        );
1047    }
1048
1049    // Arguably there should be more tests here, but in practice, this data
1050    // structure is well covered by the huge number of regex tests.
1051}
regex_automata/nfa/thompson/range_trie.rs

regex_automata/nfa/thompson/
range_trie.rs