encoding_rs

Struct Encoding

pub struct Encoding { /* private fields */ }

An encoding as defined in the Encoding Standard.

An encoding defines a mapping from a u8 sequence to a char sequence and, in most cases, vice versa. Each encoding has a name, an output encoding, and one or more labels.

Labels are ASCII-case-insensitive strings that are used to identify an encoding in formats and protocols. The name of the encoding is the preferred label in the case appropriate for returning from the characterSet property of the Document DOM interface.

The output encoding is the encoding used for form submission and URL parsing on Web pages in the encoding. This is UTF-8 for the replacement, UTF-16LE and UTF-16BE encodings and the encoding itself for other encodings.

§Streaming vs. Non-Streaming

When you have the entire input in a single buffer, you can use the methods decode(), decode_with_bom_removal(), decode_without_bom_handling(), decode_without_bom_handling_and_without_replacement() and encode(). (These methods are available to Rust callers only and are not available in the C API.) Unlike the rest of the API available to Rust, these methods perform heap allocations. You should use the Decoder and Encoder objects when your input is split into multiple buffers or when you want to control the allocation of the output buffers.

§Instances

All instances of Encoding are statically allocated and have the 'static lifetime. There is precisely one unique Encoding instance for each encoding defined in the Encoding Standard.

To obtain a reference to a particular encoding whose identity you know at compile time, use a static that refers to that encoding. There is a static for each encoding. The statics are named in all caps with hyphens replaced with underscores (and in C/C++ have _ENCODING appended to the name). For example, if you know at compile time that you will want to decode using the UTF-8 encoding, use the UTF_8 static (UTF_8_ENCODING in C/C++).

Additionally, there are non-reference-typed forms ending with _INIT to work around the problem that statics of the type &'static Encoding cannot be used to initialize items of an array whose type is [&'static Encoding; N].

If you don’t know what encoding you need at compile time and need to dynamically get an encoding by label, use Encoding::for_label(label).

Instances of Encoding can be compared with == (in both Rust and in C/C++).

Implementations§

impl Encoding

pub fn for_label(label: &[u8]) -> Option<&'static Encoding>

Implements the get an encoding algorithm.

If, after ASCII-lowercasing and removing leading and trailing whitespace, the argument matches a label defined in the Encoding Standard, Some(&'static Encoding) representing the corresponding encoding is returned. If there is no match, None is returned.
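
To make the normalization step concrete, here is a standard-library-only sketch of it. It is not the crate's implementation, and normalize_label is a hypothetical name; it assumes that "whitespace" means the ASCII whitespace set that u8::is_ascii_whitespace() recognizes.

```rust
/// Hypothetical helper: strips leading and trailing ASCII whitespace,
/// then ASCII-lowercases what is left.
fn normalize_label(label: &[u8]) -> Vec<u8> {
    // Index of the first non-whitespace byte (or the length if there is none).
    let start = label
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(label.len());
    // One past the last non-whitespace byte (or `start` if there is none).
    let end = label
        .iter()
        .rposition(|b| !b.is_ascii_whitespace())
        .map_or(start, |i| i + 1);
    label[start..end].to_ascii_lowercase()
}
```

Under this reading, a label such as b" \tUTF-8\r\n" is looked up as b"utf-8".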

This is the right method to use if the action upon the method returning None is to use a fallback encoding (e.g. WINDOWS_1252) instead. When the action upon the method returning None is not to proceed with a fallback but to refuse processing, for_label_no_replacement() is more appropriate.

The argument is of type &[u8] instead of &str to save callers that are extracting the label from a non-UTF-8 protocol the trouble of conversion to UTF-8. (If you have a &str, just call .as_bytes() on it.)

Available via the C wrapper.

§Example
use encoding_rs::Encoding;

assert_eq!(Some(encoding_rs::UTF_8), Encoding::for_label(b"utf-8"));
assert_eq!(Some(encoding_rs::UTF_8), Encoding::for_label(b"unicode11utf8"));

assert_eq!(Some(encoding_rs::ISO_8859_2), Encoding::for_label(b"latin2"));

assert_eq!(Some(encoding_rs::UTF_16BE), Encoding::for_label(b"utf-16be"));

assert_eq!(None, Encoding::for_label(b"unrecognized label"));

pub fn for_label_no_replacement(label: &[u8]) -> Option<&'static Encoding>

This method behaves the same as for_label(), except when for_label() would return Some(REPLACEMENT), this method returns None instead.

This method is useful in scenarios where a fatal error is required upon invalid label, because in those cases the caller typically wishes to treat the labels that map to the replacement encoding as fatal errors, too.

It is not OK to use this method when the action upon the method returning None is to use a fallback encoding (e.g. WINDOWS_1252). In such a case, the for_label() method should be used instead in order to avoid unsafe fallback for labels that for_label() maps to Some(REPLACEMENT).

Available via the C wrapper.

pub fn for_bom(buffer: &[u8]) -> Option<(&'static Encoding, usize)>

Performs non-incremental BOM sniffing.

The argument must either be a buffer representing the entire input stream (non-streaming case) or a buffer representing at least the first three bytes of the input stream (streaming case).

Returns Some((UTF_8, 3)), Some((UTF_16LE, 2)) or Some((UTF_16BE, 2)) if the argument starts with the UTF-8, UTF-16LE or UTF-16BE BOM or None otherwise.
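
The return values can be illustrated with a standard-library-only sketch of the same check. This is not the crate's code: sniff_bom is a hypothetical name, and it returns the encoding name as a string where for_bom() returns a &'static Encoding.

```rust
/// Hypothetical stand-in for the check: returns the encoding name and the
/// length of the BOM in bytes, or None if the buffer starts with no BOM.
fn sniff_bom(buffer: &[u8]) -> Option<(&'static str, usize)> {
    if buffer.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some(("UTF-8", 3))
    } else if buffer.starts_with(&[0xFF, 0xFE]) {
        Some(("UTF-16LE", 2))
    } else if buffer.starts_with(&[0xFE, 0xFF]) {
        Some(("UTF-16BE", 2))
    } else {
        None
    }
}
```

The second tuple item is the number of bytes a caller would skip before handing the rest of the buffer to a decoder that does no BOM handling of its own.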

Available via the C wrapper.

pub fn name(&'static self) -> &'static str

Returns the name of this encoding.

This name is appropriate to return as-is from the DOM document.characterSet property.

Available via the C wrapper.

pub fn can_encode_everything(&'static self) -> bool

Checks whether the output encoding of this encoding can encode every char. (Only true if the output encoding is UTF-8.)

Available via the C wrapper.

pub fn is_ascii_compatible(&'static self) -> bool

Checks whether the bytes 0x00…0x7F map exclusively to the characters U+0000…U+007F and vice versa.

Available via the C wrapper.

pub fn is_single_byte(&'static self) -> bool

Checks whether this encoding maps one byte to one Basic Multilingual Plane code point (i.e. byte length equals decoded UTF-16 length) and vice versa (for mappable characters).

true iff this encoding is on the list of Legacy single-byte encodings in the spec or x-user-defined.

Available via the C wrapper.

pub fn output_encoding(&'static self) -> &'static Encoding

Returns the output encoding of this encoding. This is UTF-8 for UTF-16BE, UTF-16LE, and replacement and the encoding itself otherwise.

Note: The output encoding concept is needed for form submission and error handling in the query strings of URLs in the Web Platform.

Available via the C wrapper.

pub fn decode<'a>( &'static self, bytes: &'a [u8], ) -> (Cow<'a, str>, &'static Encoding, bool)

Decode complete input to Cow<'a, str> with BOM sniffing and with malformed sequences replaced with the REPLACEMENT CHARACTER when the entire input is available as a single buffer (i.e. the end of the buffer marks the end of the stream).

The BOM, if any, does not appear in the output.

This method implements the (non-streaming version of the) decode spec concept.

The second item in the returned tuple is the encoding that was actually used (which may differ from this encoding thanks to BOM sniffing).

The third item in the returned tuple indicates whether there were malformed sequences (that were replaced with the REPLACEMENT CHARACTER).

Note: It is wrong to use this when the input buffer represents only a segment of the input instead of the whole input. Use new_decoder() when decoding segmented input.

This method performs one or two heap allocations for the backing buffer of the String when unable to borrow. (One allocation if there are no errors and potentially another one in the presence of errors.) The first allocation assumes jemalloc and may not be optimal with allocators that do not use power-of-two buckets. A borrow is performed if decoding UTF-8 and the input is valid UTF-8, if decoding an ASCII-compatible encoding and the input is ASCII-only, or when decoding ISO-2022-JP and the input is entirely in the ASCII state without state transitions.

§Panics

If the size calculation for a heap-allocated backing buffer overflows usize.

Available to Rust only and only with the alloc feature enabled (enabled by default).

pub fn decode_with_bom_removal<'a>( &'static self, bytes: &'a [u8], ) -> (Cow<'a, str>, bool)

Decode complete input to Cow<'a, str> with BOM removal and with malformed sequences replaced with the REPLACEMENT CHARACTER when the entire input is available as a single buffer (i.e. the end of the buffer marks the end of the stream).

Only an initial byte sequence that is a BOM for this encoding is removed.

When invoked on UTF_8, this method implements the (non-streaming version of the) UTF-8 decode spec concept.

The second item in the returned pair indicates whether there were malformed sequences (that were replaced with the REPLACEMENT CHARACTER).

Note: It is wrong to use this when the input buffer represents only a segment of the input instead of the whole input. Use new_decoder_with_bom_removal() when decoding segmented input.

This method performs one or two heap allocations for the backing buffer of the String when unable to borrow. (One allocation if there are no errors and potentially another one in the presence of errors.) The first allocation assumes jemalloc and may not be optimal with allocators that do not use power-of-two buckets. A borrow is performed if decoding UTF-8 and the input is valid UTF-8, if decoding an ASCII-compatible encoding and the input is ASCII-only, or when decoding ISO-2022-JP and the input is entirely in the ASCII state without state transitions.

§Panics

If the size calculation for a heap-allocated backing buffer overflows usize.

Available to Rust only and only with the alloc feature enabled (enabled by default).

pub fn decode_without_bom_handling<'a>( &'static self, bytes: &'a [u8], ) -> (Cow<'a, str>, bool)

Decode complete input to Cow<'a, str> without BOM handling and with malformed sequences replaced with the REPLACEMENT CHARACTER when the entire input is available as a single buffer (i.e. the end of the buffer marks the end of the stream).

When invoked on UTF_8, this method implements the (non-streaming version of the) UTF-8 decode without BOM spec concept.

The second item in the returned pair indicates whether there were malformed sequences (that were replaced with the REPLACEMENT CHARACTER).

Note: It is wrong to use this when the input buffer represents only a segment of the input instead of the whole input. Use new_decoder_without_bom_handling() when decoding segmented input.

This method performs one or two heap allocations for the backing buffer of the String when unable to borrow. (One allocation if there are no errors and potentially another one in the presence of errors.) The first allocation assumes jemalloc and may not be optimal with allocators that do not use power-of-two buckets. A borrow is performed if decoding UTF-8 and the input is valid UTF-8, if decoding an ASCII-compatible encoding and the input is ASCII-only, or when decoding ISO-2022-JP and the input is entirely in the ASCII state without state transitions.

§Panics

If the size calculation for a heap-allocated backing buffer overflows usize.

Available to Rust only and only with the alloc feature enabled (enabled by default).

pub fn decode_without_bom_handling_and_without_replacement<'a>( &'static self, bytes: &'a [u8], ) -> Option<Cow<'a, str>>

Decode complete input to Cow<'a, str> without BOM handling and with malformed sequences treated as fatal when the entire input is available as a single buffer (i.e. the end of the buffer marks the end of the stream).

When invoked on UTF_8, this method implements the (non-streaming version of the) UTF-8 decode without BOM or fail spec concept.

Returns None if a malformed sequence was encountered and the result of the decode as Some(Cow<'a, str>) otherwise.

Note: It is wrong to use this when the input buffer represents only a segment of the input instead of the whole input. Use new_decoder_without_bom_handling() when decoding segmented input.

This method performs a single heap allocation for the backing buffer of the String when unable to borrow. A borrow is performed if decoding UTF-8 and the input is valid UTF-8, if decoding an ASCII-compatible encoding and the input is ASCII-only, or when decoding ISO-2022-JP and the input is entirely in the ASCII state without state transitions.
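
For the UTF-8 case only, the shape of the result (None on malformed input, a borrow on valid input) can be mimicked with the standard library. The sketch below is a stand-in for illustration rather than the crate's implementation, and decode_utf8_strict is a hypothetical name.

```rust
use std::borrow::Cow;

/// Hypothetical std-only stand-in for the UTF-8 case: None if the input is
/// malformed, otherwise a borrow of the input with no allocation.
fn decode_utf8_strict(bytes: &[u8]) -> Option<Cow<'_, str>> {
    std::str::from_utf8(bytes).ok().map(Cow::Borrowed)
}
```

Other encodings cannot be borrowed this way unless the input is ASCII-only, which is why the method returns a Cow rather than a plain &str.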

§Panics

If the size calculation for a heap-allocated backing buffer overflows usize.

Available to Rust only and only with the alloc feature enabled (enabled by default).

pub fn encode<'a>( &'static self, string: &'a str, ) -> (Cow<'a, [u8]>, &'static Encoding, bool)

Encode complete input to Cow<'a, [u8]> using the output encoding of this encoding with unmappable characters replaced with decimal numeric character references when the entire input is available as a single buffer (i.e. the end of the buffer marks the end of the stream).

This method implements the (non-streaming version of the) encode spec concept. For the UTF-8 encode spec concept, it is slightly more efficient to use string.as_bytes() instead of invoking this method on UTF_8.

The second item in the returned tuple is the encoding that was actually used (which may differ from this encoding thanks to some encodings having UTF-8 as their output encoding).

The third item in the returned tuple indicates whether there were unmappable characters (that were replaced with HTML numeric character references).

Note: It is wrong to use this when the input buffer represents only a segment of the input instead of the whole input. Use new_encoder() when encoding segmented output.

When encoding to UTF-8 or when encoding an ASCII-only input to an ASCII-compatible encoding, this method returns a borrow of the input without a heap allocation. Otherwise, this method performs a single heap allocation for the backing buffer of the Vec<u8> if there are no unmappable characters and potentially multiple heap allocations if there are. These allocations are tuned for jemalloc and may not be optimal when using a different allocator that doesn’t use power-of-two buckets.

§Panics

If the size calculation for a heap-allocated backing buffer overflows usize.

Available to Rust only and only with the alloc feature enabled (enabled by default).

pub fn new_decoder(&'static self) -> Decoder

Instantiates a new decoder for this encoding with BOM sniffing enabled.

BOM sniffing may cause the returned decoder to morph into a decoder for UTF-8, UTF-16LE or UTF-16BE instead of this encoding. The BOM does not appear in the output.

Available via the C wrapper.

pub fn new_decoder_with_bom_removal(&'static self) -> Decoder

Instantiates a new decoder for this encoding with BOM removal.

If the input starts with bytes that are the BOM for this encoding, those bytes are removed. However, the decoder never morphs into a decoder for another encoding: A BOM for another encoding is treated as (potentially malformed) input to the decoding algorithm for this encoding.

Available via the C wrapper.

pub fn new_decoder_without_bom_handling(&'static self) -> Decoder

Instantiates a new decoder for this encoding with BOM handling disabled.

If the input starts with bytes that look like a BOM, those bytes are not treated as a BOM. (Hence, the decoder never morphs into a decoder for another encoding.)

Note: If the caller has performed BOM sniffing on its own but has not removed the BOM, the caller should use new_decoder_with_bom_removal() instead of this method to cause the BOM to be removed.

Available via the C wrapper.

pub fn new_encoder(&'static self) -> Encoder

Instantiates a new encoder for the output encoding of this encoding.

Note: The output encoding of UTF-16BE, UTF-16LE, and replacement is UTF-8. There is no encoder for UTF-16BE, UTF-16LE, and replacement themselves.

Available via the C wrapper.

pub fn utf8_valid_up_to(bytes: &[u8]) -> usize

Validates UTF-8.

Returns the index of the first byte that makes the input malformed as UTF-8 or the length of the slice if the slice is entirely valid.

This is currently faster than the corresponding standard library functionality. If this implementation gets upstreamed to the standard library, this method may be removed in the future.
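
For comparison, the same contract can be expressed with the standard library alone. The function name below is hypothetical, and the sketch says nothing about relative performance.

```rust
/// Hypothetical std-only equivalent of the contract: the length of the
/// longest valid UTF-8 prefix of `bytes`.
fn utf8_valid_up_to_std(bytes: &[u8]) -> usize {
    match std::str::from_utf8(bytes) {
        Ok(_) => bytes.len(),
        Err(e) => e.valid_up_to(),
    }
}
```

A truncated multi-byte sequence at the end of the slice counts as malformed here, so the returned index points at its first byte.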

Available via the C wrapper.

pub fn ascii_valid_up_to(bytes: &[u8]) -> usize

Validates ASCII.

Returns the index of the first byte that makes the input malformed as ASCII or the length of the slice if the slice is entirely valid.

Available via the C wrapper.

pub fn iso_2022_jp_ascii_valid_up_to(bytes: &[u8]) -> usize

Validates ISO-2022-JP ASCII-state data.

Returns the index of the first byte that makes the input not representable in the ASCII state of ISO-2022-JP or the length of the slice if the slice is entirely representable in the ASCII state of ISO-2022-JP.
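
A standard-library-only sketch of such a check follows. It assumes, based on the ISO-2022-JP decoder rules in the Encoding Standard, that the bytes outside the ASCII state are 0x0E, 0x0F, 0x1B (the escape byte) and anything at or above 0x80. The function name is hypothetical, and this is not presented as the crate's code.

```rust
/// Hypothetical std-only sketch: index of the first byte that cannot be
/// decoded in the ASCII state of ISO-2022-JP, or the slice length.
fn iso_2022_jp_ascii_prefix_len(bytes: &[u8]) -> usize {
    bytes
        .iter()
        // Assumed set of disqualifying bytes; see the lead-in above.
        .position(|&b| b >= 0x80 || b == 0x1B || b == 0x0E || b == 0x0F)
        .unwrap_or(bytes.len())
}
```

The escape byte matters because it starts a state transition, which is exactly what prevents decode() from borrowing ISO-2022-JP input.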

Available via the C wrapper.

Trait Implementations§

impl Debug for Encoding

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Hash for Encoding

fn hash<H: Hasher>(&self, state: &mut H)

Feeds this value into the given Hasher.
1.3.0 · source§

fn hash_slice<H>(data: &[Self], state: &mut H)
where H: Hasher, Self: Sized,

Feeds a slice of this type into the given Hasher.

impl PartialEq for Encoding

fn eq(&self, other: &Encoding) -> bool

Tests for self and other values to be equal, and is used by ==.
1.0.0 · source§

fn ne(&self, other: &Rhs) -> bool

Tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.

impl Eq for Encoding

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.