Struct regex::bytes::RegexSet

pub struct RegexSet(/* private fields */);

Match multiple (possibly overlapping) regular expressions in a single scan.

A regex set corresponds to the union of two or more regular expressions. That is, a regex set will match text where at least one of its constituent regular expressions matches. A regex set as it is formulated here provides a touch more power: it will also report which regular expressions in the set match. Indeed, this is the key difference between regex sets and a single Regex with many alternates, since only one alternate can match at a time.

For example, consider regular expressions to match email addresses and domains: [a-z]+@[a-z]+\.(com|org|net) and [a-z]+\.(com|org|net). If a regex set is constructed from those regexes, then searching the text foo@example.com will report both regexes as matching. Of course, one could accomplish this by compiling each regex on its own and doing two searches over the text. The key advantage of using a regex set is that it will report the matching regexes using a single pass through the text. If one has hundreds or thousands of regexes to match repeatedly (like a URL router for a complex web application or a user agent matcher), then a regex set can realize huge performance gains.

Example

This shows how the above two regexes (for matching email addresses and domains) might work:

let set = RegexSet::new(&[
    r"[a-z]+@[a-z]+\.(com|org|net)",
    r"[a-z]+\.(com|org|net)",
]).unwrap();

// Ask whether any regexes in the set match.
assert!(set.is_match(b"foo@example.com"));

// Identify which regexes in the set match.
let matches: Vec<_> = set.matches(b"foo@example.com").into_iter().collect();
assert_eq!(vec![0, 1], matches);

// Try again, but with text that only matches one of the regexes.
let matches: Vec<_> = set.matches(b"example.com").into_iter().collect();
assert_eq!(vec![1], matches);

// Try again, but with text that doesn't match any regex in the set.
let matches: Vec<_> = set.matches(b"example").into_iter().collect();
assert!(matches.is_empty());

Note that it would be possible to adapt the above example to use a single Regex with an expression like:

(?P<email>[a-z]+@(?P<email_domain>[a-z]+[.](com|org|net)))|(?P<domain>[a-z]+[.](com|org|net))

After a match, one could then inspect the capture groups to figure out which alternates matched. The problem is that it is hard to make this approach scale when there are many regexes since the overlap between each alternate isn’t always obvious to reason about.

Limitations

Regex sets are limited to answering the following two questions:

  1. Does any regex in the set match?
  2. If so, which regexes in the set match?

As with the main Regex type, it is cheaper to ask (1) instead of (2) since the matching engines can stop after the first match is found.

You cannot directly extract Match or Captures objects from a regex set. If you need these operations, the recommended approach is to compile each pattern in the set independently and scan the exact same input a second time with those independently compiled patterns:

use regex::bytes::{Regex, RegexSet};

let patterns = ["foo", "bar"];
// Both patterns will match different ranges of this byte string.
let text = b"barfoo";

// Compile a set matching any of our patterns.
let set = RegexSet::new(&patterns).unwrap();
// Compile each pattern independently.
let regexes: Vec<_> = set.patterns().iter()
    .map(|pat| Regex::new(pat).unwrap())
    .collect();

// Match against the whole set first and identify the individual
// matching patterns.
let matches: Vec<&[u8]> = set.matches(text).into_iter()
    // Use the match index to look up the corresponding
    // compiled pattern.
    .map(|match_idx| &regexes[match_idx])
    // To get match locations or any other info, we then have to search
    // the exact same text again, using our separately-compiled pattern.
    .map(|pat| pat.find(text).unwrap().as_bytes())
    .collect();

// Matches arrive in the order the constituent patterns were declared,
// not the order they appear in the input.
assert_eq!(vec![&b"foo"[..], &b"bar"[..]], matches);

Performance

A RegexSet has the same performance characteristics as Regex. Namely, search takes O(mn) time, where m is proportional to the size of the regex set and n is proportional to the length of the search text.

Implementations

impl RegexSet

pub fn new<I, S>(exprs: I) -> Result<RegexSet, Error> where S: AsRef<str>, I: IntoIterator<Item = S>,

Create a new regex set with the given regular expressions.

This takes an iterator of S, where S is something that can produce a &str. If any of the strings in the iterator are not valid regular expressions, then an error is returned.

Example

Create a new regex set from an iterator of strings:

let set = RegexSet::new(&[r"\w+", r"\d+"]).unwrap();
assert!(set.is_match(b"foo"));

pub fn empty() -> RegexSet

Create a new empty regex set.

Example

let set = RegexSet::empty();
assert!(set.is_empty());

pub fn is_match(&self, text: &[u8]) -> bool

Returns true if and only if at least one of the regexes in this set matches the text given.

This method should be preferred if you only need to test whether any of the regexes in the set match, but don’t care about which regexes matched. This is because the underlying matching engine will quit immediately after seeing the first match instead of continuing to find all matches.

Note that as with searches using Regex, the expression is unanchored by default. That is, if the regex does not start with ^ or \A, or end with $ or \z, then it is permitted to match anywhere in the text.

Example

Tests whether a set matches some text:

let set = RegexSet::new(&[r"\w+", r"\d+"]).unwrap();
assert!(set.is_match(b"foo"));
assert!(!set.is_match("☃".as_bytes()));

pub fn matches(&self, text: &[u8]) -> SetMatches

Returns the set of regular expressions that match in the given text.

The set returned contains the index of each regular expression that matches in the given text. Each index corresponds to the position of that regular expression in the list given to RegexSet’s constructor.

The set can also be used to iterate over the matched indices.

Note that as with searches using Regex, the expression is unanchored by default. That is, if the regex does not start with ^ or \A, or end with $ or \z, then it is permitted to match anywhere in the text.

Example

Tests which regular expressions match the given text:

let set = RegexSet::new(&[
    r"\w+",
    r"\d+",
    r"\pL+",
    r"foo",
    r"bar",
    r"barfoo",
    r"foobar",
]).unwrap();
let matches: Vec<_> = set.matches(b"foobar").into_iter().collect();
assert_eq!(matches, vec![0, 2, 3, 4, 6]);

// You can also test whether a particular regex matched:
let matches = set.matches(b"foobar");
assert!(!matches.matched(5));
assert!(matches.matched(6));

pub fn len(&self) -> usize

Returns the total number of regular expressions in this set.


pub fn is_empty(&self) -> bool

Returns true if this set contains no regular expressions.


pub fn patterns(&self) -> &[String]

Returns the patterns that this set will match on.

This function can be used to determine the pattern for a match. The slice returned has exactly as many patterns as were given to this regex set, and the order of the slice is the same as the order of the patterns provided to the set.

Example

let set = RegexSet::new(&[
    r"\w+",
    r"\d+",
    r"\pL+",
    r"foo",
    r"bar",
    r"barfoo",
    r"foobar",
]).unwrap();
let matches: Vec<_> = set
    .matches(b"foobar")
    .into_iter()
    .map(|match_idx| &set.patterns()[match_idx])
    .collect();
assert_eq!(matches, vec![r"\w+", r"\pL+", r"foo", r"bar", r"foobar"]);

Trait Implementations

impl Clone for RegexSet

fn clone(&self) -> RegexSet

Returns a copy of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for RegexSet

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

Auto Trait Implementations

Blanket Implementations

impl<T> Any for T where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> ToOwned for T where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.