summaryrefslogtreecommitdiffstats
path: root/vendor/regex/src/bytes.rs
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-06-19 09:25:53 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-06-19 09:25:53 +0000
commit73e0a5b7696ea019ba35b89f38fc8e7b285d99cb (patch)
tree0d2e175af6f114cb50a675bec0bc76e12e1bceb4 /vendor/regex/src/bytes.rs
parentAdding upstream version 1.75.0+dfsg1. (diff)
downloadrustc-upstream.tar.xz
rustc-upstream.zip
Adding upstream version 1.76.0+dfsg1.upstream/1.76.0+dfsg1upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'vendor/regex/src/bytes.rs')
-rw-r--r--vendor/regex/src/bytes.rs91
1 files changed, 91 insertions, 0 deletions
diff --git a/vendor/regex/src/bytes.rs b/vendor/regex/src/bytes.rs
new file mode 100644
index 000000000..383ac4a5b
--- /dev/null
+++ b/vendor/regex/src/bytes.rs
@@ -0,0 +1,91 @@
+/*!
+Search for regex matches in `&[u8]` haystacks.
+
+This module provides a nearly identical API via [`Regex`] to the one found in
+the top-level of this crate. There are two important differences:
+
+1. Matching is done on `&[u8]` instead of `&str`. Additionally, `Vec<u8>`
+is used where `String` would have been used in the top-level API.
+2. Unicode support can be disabled even when disabling it would result in
+matching invalid UTF-8 bytes.
+
+# Example: match null terminated string
+
+This shows how to find all null-terminated strings in a slice of bytes. This
+works even if a C string contains invalid UTF-8.
+
+```rust
+use regex::bytes::Regex;
+
+let re = Regex::new(r"(?-u)(?<cstr>[^\x00]+)\x00").unwrap();
+let hay = b"foo\x00qu\xFFux\x00baz\x00";
+
+// Extract all of the strings without the NUL terminator from each match.
+// The unwrap is OK here since a match requires the `cstr` capture to match.
+let cstrs: Vec<&[u8]> =
+ re.captures_iter(hay)
+ .map(|c| c.name("cstr").unwrap().as_bytes())
+ .collect();
+assert_eq!(cstrs, vec![&b"foo"[..], &b"qu\xFFux"[..], &b"baz"[..]]);
+```
+
+# Example: selectively enable Unicode support
+
+This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded
+string (e.g., to extract a title from a Matroska file):
+
+```rust
+use regex::bytes::Regex;
+
+let re = Regex::new(
+ r"(?-u)\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))"
+).unwrap();
+let hay = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65";
+
+// Notice that despite the `.*` at the end, it will only match valid UTF-8
+// because Unicode mode was enabled with the `u` flag. Without the `u` flag,
+// the `.*` would match the rest of the bytes regardless of whether they were
+// valid UTF-8.
+let (_, [title]) = re.captures(hay).unwrap().extract();
+assert_eq!(title, b"\xE2\x98\x83");
+// We can UTF-8 decode the title now. And the unwrap here
+// is correct because the existence of a match guarantees
+// that `title` is valid UTF-8.
+let title = std::str::from_utf8(title).unwrap();
+assert_eq!(title, "☃");
+```
+
+In general, if the Unicode flag is enabled in a capture group and that capture
+is part of the overall match, then the capture is *guaranteed* to be valid
+UTF-8.
+
+# Syntax
+
+The supported syntax is pretty much the same as the syntax for Unicode
+regular expressions with a few changes that make sense for matching arbitrary
+bytes:
+
+1. The `u` flag can be disabled even when disabling it might cause the regex to
+match invalid UTF-8. When the `u` flag is disabled, the regex is said to be in
+"ASCII compatible" mode.
+2. In ASCII compatible mode, Unicode character classes are not allowed. Literal
+Unicode scalar values outside of character classes are allowed.
+3. In ASCII compatible mode, Perl character classes (`\w`, `\d` and `\s`)
+revert to their typical ASCII definition. `\w` maps to `[[:word:]]`, `\d` maps
+to `[[:digit:]]` and `\s` maps to `[[:space:]]`.
+4. In ASCII compatible mode, word boundaries use the ASCII compatible `\w` to
+determine whether a byte is a word byte or not.
+5. Hexadecimal notation can be used to specify arbitrary bytes instead of
+Unicode codepoints. For example, in ASCII compatible mode, `\xFF` matches the
+literal byte `\xFF`, while in Unicode mode, `\xFF` is the Unicode codepoint
+`U+00FF` that matches its UTF-8 encoding of `\xC3\xBF`. Similarly for octal
+notation when enabled.
+6. In ASCII compatible mode, `.` matches any *byte* except for `\n`. When the
+`s` flag is additionally enabled, `.` matches any byte.
+
+# Performance
+
+In general, one should expect performance on `&[u8]` to be roughly similar to
+performance on `&str`.
+*/
+pub use crate::{builders::bytes::*, regex::bytes::*, regexset::bytes::*};