Miscellaneous functions for manipulating text
Collection of text functions that don’t fit in another category.
Changed in version kitchen: 1.2.0, API: kitchen.text 2.2.0
Added isbasestring(), isbytestring(), and isunicodestring() to help tell which string type is which on python2 and python3
-
kitchen.text.misc.byte_string_valid_encoding(byte_string, encoding='utf-8')
Detect if a byte str is valid in a specific encoding
Parameters:
- byte_string – Byte str to test for bytes not valid in this encoding
- encoding – encoding to test against. Defaults to UTF-8.
Returns: True if there are no characters that are invalid in the given encoding. False if an invalid character is detected.
Note
This function checks whether the byte str is valid in the specified encoding. It does not detect whether the byte str actually was encoded in that encoding. If you want that sort of functionality, you probably want to use guess_encoding() instead.
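The validity check described above can be approximated with a strict decode. The following is a minimal Python 3 sketch, not kitchen's own code; the helper name is_valid_in_encoding is hypothetical. It also illustrates the point of the note: bytes can be valid in an encoding they were never encoded with.

```python
def is_valid_in_encoding(byte_string, encoding='utf-8'):
    """Hypothetical helper: True if byte_string decodes cleanly in encoding."""
    try:
        # 'strict' raises UnicodeDecodeError on the first invalid byte sequence
        byte_string.decode(encoding, 'strict')
    except UnicodeError:
        return False
    return True


utf8_bytes = 'caf\u00e9'.encode('utf-8')
latin1_bytes = 'caf\u00e9'.encode('latin-1')

print(is_valid_in_encoding(utf8_bytes))             # valid UTF-8
print(is_valid_in_encoding(latin1_bytes))           # a lone 0xe9 byte is not valid UTF-8
print(is_valid_in_encoding(utf8_bytes, 'latin-1'))  # valid latin-1, although it was encoded as UTF-8
```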
-
kitchen.text.misc.byte_string_valid_xml(byte_string, encoding='utf-8')
Check that a byte str would be valid in xml
Parameters:
- byte_string – Byte str to check
- encoding – Encoding of the xml file. Default: UTF-8
Returns: True if the string is valid. False if it would be invalid in the xml file
In some cases you’ll have a whole bunch of byte strings and rather than transforming them to unicode and back to byte str for output to xml, you will just want to make sure they work with the xml file you’re constructing. This function will help you do that. Example:

    ARRAY_OF_MOSTLY_UTF8_STRINGS = [...]
    processed_array = []
    for string in ARRAY_OF_MOSTLY_UTF8_STRINGS:
        if byte_string_valid_xml(string, 'utf-8'):
            processed_array.append(string)
        else:
            processed_array.append(guess_bytes_to_xml(string, encoding='utf-8'))
    output_xml(processed_array)
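One way to picture what such a check involves is a strict decode followed by a scan for control characters. The sketch below is written for Python 3 and is only an illustration: the helper name looks_valid_for_xml is hypothetical, and the set of rejected characters (C0 controls other than tab, newline and carriage return, plus the C1 controls) is an assumption that may not match kitchen's exact set.

```python
# Assumption: these are the characters treated as invalid for xml output.
# kitchen's real set may differ.
_ASSUMED_CONTROL_CHARS = frozenset(
    chr(code)
    for code in list(range(0, 9)) + [11, 12] + list(range(14, 32)) + list(range(128, 160))
)


def looks_valid_for_xml(byte_string, encoding='utf-8'):
    """Hypothetical helper: decodes cleanly and contains no assumed control characters."""
    try:
        text = byte_string.decode(encoding, 'strict')
    except UnicodeError:
        # Not valid in the encoding of the xml file
        return False
    # Any overlap with the control character set makes the string unusable
    return not _ASSUMED_CONTROL_CHARS.intersection(text)


print(looks_valid_for_xml(b'plain text\n'))
print(looks_valid_for_xml(b'bell \x07 inside'))
print(looks_valid_for_xml(b'caf\xe9'))
```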
-
kitchen.text.misc.guess_encoding(byte_string, disable_chardet=False)
Try to guess the encoding of a byte str
Parameters:
- byte_string – byte str to guess the encoding of
- disable_chardet – If this is True, we never attempt to use chardet to guess the encoding. This is useful if you need to have reproducibility whether chardet is installed or not. Default: False.
Raises: TypeError – if byte_string is not a byte str type
Returns: string containing a guess at the encoding of byte_string. This is appropriate to pass as the encoding argument when encoding and decoding unicode strings.
We start by attempting to decode the byte str as UTF-8. If this succeeds we tell the world it’s UTF-8 text. If it doesn’t and chardet is installed on the system and disable_chardet is False this function will use it to try detecting the encoding of byte_string. If it is not installed or chardet cannot determine the encoding with a high enough confidence then we rather arbitrarily claim that it is latin-1. Since every byte value is valid in latin-1, decoding from latin-1 to unicode will not cause a UnicodeError, although the output might be mangled.
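The fallback chain described above (UTF-8 first, then chardet if available, then latin-1) can be sketched as follows. This is an illustration for Python 3 rather than kitchen's implementation: the function name guess_encoding_sketch and the min_confidence parameter are hypothetical, and the threshold value of 0.6 is an assumption, since the text only says "high enough confidence".

```python
def guess_encoding_sketch(byte_string, disable_chardet=False, min_confidence=0.6):
    """Hypothetical sketch of the UTF-8 -> chardet -> latin-1 fallback chain."""
    if not isinstance(byte_string, (bytes, bytearray)):
        raise TypeError('byte_string must be a byte str')

    # Step 1: if the bytes decode strictly as UTF-8, report UTF-8
    try:
        bytes(byte_string).decode('utf-8', 'strict')
        return 'utf-8'
    except UnicodeError:
        pass

    # Step 2: ask chardet, but only if it is allowed and installed
    if not disable_chardet:
        try:
            import chardet
        except ImportError:
            chardet = None
        if chardet is not None:
            result = chardet.detect(bytes(byte_string))
            # min_confidence is an assumed threshold, not a documented value
            if result['encoding'] and result['confidence'] >= min_confidence:
                return result['encoding']

    # Step 3: latin-1 can decode any byte, so it is a safe last resort
    return 'latin-1'


print(guess_encoding_sketch('caf\u00e9'.encode('utf-8')))
print(guess_encoding_sketch(b'caf\xe9', disable_chardet=True))
```

Passing disable_chardet=True makes the result depend only on the input bytes, which is the reproducibility case the parameter description mentions.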
-
kitchen.text.misc.html_entities_unescape(string)
Substitute unicode characters for HTML entities
Parameters: string – unicode string to substitute out html entities
Raises: TypeError – if something other than a unicode string is given
Return type: unicode string
Returns: The plain text without html entities
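For readers on Python 3, entity substitution of this kind can be illustrated with the standard library's html.unescape(). The wrapper below is a hypothetical sketch (the name unescape_entities is not part of kitchen) and only mirrors the documented type check and entity replacement; any further processing kitchen's function performs is not reproduced here.

```python
import html


def unescape_entities(string):
    """Hypothetical sketch: replace named and numeric HTML entities with characters."""
    if not isinstance(string, str):
        # Mirrors the documented behaviour of rejecting non-unicode input
        raise TypeError('expected a unicode string')
    return html.unescape(string)


print(unescape_entities('caf&eacute; &amp; cr&#232;me'))
```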
-
kitchen.text.misc.isbasestring(obj)
Determine if obj is a byte str or unicode string
In python2 this is equivalent to isinstance(obj, basestring). In python3 it checks whether the object is an instance of str, bytes, or bytearray. This is an aid to porting code that needed to test whether an object was derived from basestring in python2 (commonly used in unicode-bytes conversion functions).
Parameters: obj – Object to test
Returns: True if the object is a basestring. Otherwise False.
New in version kitchen: 1.2.0, API: kitchen.text 2.2.0
-
kitchen.text.misc.isbytestring(obj)
Determine if obj is a byte str
In python2 this is equivalent to isinstance(obj, str). In python3 it checks whether the object is an instance of bytes or bytearray.
Parameters: obj – Object to test
Returns: True if the object is a byte str. Otherwise, False.
New in version kitchen: 1.2.0, API: kitchen.text 2.2.0
-
kitchen.text.misc.isunicodestring(obj)
Determine if obj is a unicode string
In python2 this is equivalent to isinstance(obj, unicode). In python3 it checks whether the object is an instance of str.
Parameters: obj – Object to test
Returns: True if the object is a unicode string. Otherwise, False.
New in version kitchen: 1.2.0, API: kitchen.text 2.2.0
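The python3 behaviour described for isbasestring(), isbytestring() and isunicodestring() amounts to three isinstance() checks. The stand-ins below use hypothetical names and cover only the python3 side; they are a sketch of the documented behaviour, not kitchen's code.

```python
def is_byte_string(obj):
    """Python 3 stand-in for the documented isbytestring() behaviour."""
    return isinstance(obj, (bytes, bytearray))


def is_unicode_string(obj):
    """Python 3 stand-in for the documented isunicodestring() behaviour."""
    return isinstance(obj, str)


def is_base_string(obj):
    """Python 3 stand-in for the documented isbasestring() behaviour."""
    return isinstance(obj, (str, bytes, bytearray))


for value in ('text', b'raw', bytearray(b'raw'), 42):
    print(repr(value), is_base_string(value), is_byte_string(value), is_unicode_string(value))
```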
-
kitchen.text.misc.process_control_chars(string, strategy='replace')
Look for and transform control characters in a string
Parameters:
- string – string to search for and transform control characters within
- strategy – XML does not allow ASCII control characters. When we encounter those we need to know what to do. Valid options are:
  - replace: (default) Replace the control characters with "?"
  - ignore: Remove the characters altogether from the output
  - strict: Raise a ControlCharError when we encounter a control character
Raises:
- TypeError – if string is not a unicode string.
- ValueError – if the strategy is not one of replace, ignore, or strict.
- kitchen.text.exceptions.ControlCharError – if the strategy is strict and a control character is present in the string
Returns: unicode string with no control characters in it.
Changed in version kitchen: 1.2.0, API: kitchen.text 2.2.0
Strip out the C1 control characters in addition to the C0 control characters.
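The three strategies can be sketched with a regular expression. This is a Python 3 illustration under stated assumptions: the function name process_control_chars_sketch is hypothetical, the local ControlCharError class is a stand-in for kitchen.text.exceptions.ControlCharError, and the character class (C0 controls other than tab, newline and carriage return, plus the C1 controls) is an assumption about which characters are affected.

```python
import re


class ControlCharError(Exception):
    """Stand-in for kitchen.text.exceptions.ControlCharError (hypothetical here)."""


# Assumption: tab, newline and carriage return are left alone
_CONTROL_RE = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x80-\x9f]')


def process_control_chars_sketch(string, strategy='replace'):
    if not isinstance(string, str):
        raise TypeError('string must be a unicode string')
    if strategy == 'replace':
        # Swap each control character for a question mark
        return _CONTROL_RE.sub('?', string)
    if strategy == 'ignore':
        # Drop control characters from the output
        return _CONTROL_RE.sub('', string)
    if strategy == 'strict':
        if _CONTROL_RE.search(string):
            raise ControlCharError('control character found in string')
        return string
    raise ValueError('strategy must be one of replace, ignore, or strict')


print(process_control_chars_sketch('ring\x07bell'))
print(process_control_chars_sketch('ring\x07bell', strategy='ignore'))
```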
-
kitchen.text.misc.str_eq(str1, str2, encoding='utf-8', errors='replace')
Compare two strings, converting to byte str if one is unicode
Parameters:
- str1 – First string to compare
- str2 – Second string to compare
- encoding – If we need to convert one string into a byte str to compare, the encoding to use. Default is utf-8.
- errors – What to do if we encounter errors when encoding the string. See the kitchen.text.converters.to_bytes() documentation for possible values. The default is replace.
This function prevents UnicodeError (python-2.4 or less) and UnicodeWarning (python 2.5 and higher) when we compare a unicode string to a byte str. The errors normally arise because the conversion is done to ASCII. This function lets you convert to utf-8 or another encoding instead.
Note
When we need to convert one of the strings from unicode in order to compare them we convert the unicode string into a byte str. That means that strings can compare differently if you use different encodings for each.
Note that str1 == str2 is faster than this function if you can accept the following limitations:
- Limited to python-2.5+ (otherwise a UnicodeDecodeError may be thrown)
- Will generate a UnicodeWarning if non-ASCII byte str is compared to unicode string.
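The comparison rule can be illustrated with a short Python 3 sketch: when exactly one argument is a unicode string, it is encoded to bytes before comparing. The name str_eq_sketch is hypothetical and this is not kitchen's implementation. Note that in Python 3 a str never compares equal to bytes under ==, so the explicit encoding step is what makes a mixed comparison meaningful there.

```python
def str_eq_sketch(str1, str2, encoding='utf-8', errors='replace'):
    """Hypothetical sketch: compare two strings, encoding the unicode one if the types differ."""
    if isinstance(str1, str) == isinstance(str2, str):
        # Both unicode or both bytes: a plain comparison is enough
        return str1 == str2
    # Exactly one side is unicode: convert it to a byte str first
    if isinstance(str1, str):
        str1 = str1.encode(encoding, errors)
    else:
        str2 = str2.encode(encoding, errors)
    return str1 == str2


text = 'caf\u00e9'
print(str_eq_sketch(text, text.encode('utf-8')))                        # same bytes once encoded as UTF-8
print(str_eq_sketch(text, text.encode('latin-1')))                      # differs under the default encoding
print(str_eq_sketch(text, text.encode('latin-1'), encoding='latin-1'))  # matches with the right encoding
```

The last two calls show the effect mentioned in the note: the result depends on which encoding is used for the conversion.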