slide

slide.exceptions

exception slide.exceptions.SlideCastError[source]

Bases: slide.exceptions.SlideException

Type casting exception

exception slide.exceptions.SlideException[source]

Bases: Exception

General Slide level exception

exception slide.exceptions.SlideIndexIncompatibleError[source]

Bases: slide.exceptions.SlideException

Dataframe index incompatible exception

exception slide.exceptions.SlideInvalidOperation[source]

Bases: slide.exceptions.SlideException

Invalid operation exception

slide.utils

class slide.utils.SlideUtils(*args, **kwds)[source]

Bases: Generic[slide.utils.TDf, slide.utils.TCol]

A collection of utilities for general pandas like dataframes

as_array(df, schema, columns=None, type_safe=False)[source]

Convert a pandas like dataframe to a list of rows, each row being a list.

Parameters
  • df (slide.utils.TDf) – pandas like dataframe

  • schema (pyarrow.lib.Schema) – schema of the input

  • columns (Optional[List[str]]) – columns to output, None for all columns

  • type_safe (bool) – whether to enforce the types in schema, if False, it will return the original values from the dataframe

Returns

list of rows, each row is a list

Return type

List[List[Any]]

as_array_iterable(df, schema, columns=None, type_safe=False)[source]

Convert a pandas like dataframe to an iterable of rows, each row being a list.

Parameters
  • df (slide.utils.TDf) – pandas like dataframe

  • schema (pyarrow.lib.Schema) – schema of the input

  • columns (Optional[List[str]]) – columns to output, None for all columns

  • type_safe (bool) – whether to enforce the types in schema, if False, it will return the original values from the dataframe

Returns

iterable of rows, each row is a list

Return type

Iterable[List[Any]]

If there are nested types in schema, the conversion can be slower
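
A minimal usage sketch (assuming a concrete pandas-backed SlideUtils implementation, hypothetically named PandasUtils here; it is not defined in this document):

    import pandas as pd
    import pyarrow as pa

    utils = PandasUtils()  # hypothetical concrete SlideUtils subclass
    df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
    schema = pa.schema([("a", pa.int64()), ("b", pa.string())])
    # with type_safe=True, values are coerced to the types in schema
    for row in utils.as_array_iterable(df, schema, type_safe=True):
        print(row)  # [1, 'x'] then [2, 'y']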

as_arrow(df, schema, type_safe=True)[source]

Convert the dataframe to pyarrow table

Parameters
  • df (slide.utils.TDf) – pandas like dataframe

  • schema (pyarrow.lib.Schema) – the schema used to construct the pyarrow table

  • type_safe (bool) – check for overflows or other unsafe conversions

Returns

pyarrow table

Return type

pyarrow.lib.Table

as_pandas(df)[source]

Convert the dataframe to pandas dataframe

Parameters

df (slide.utils.TDf) – pandas like dataframe

Returns

the pandas dataframe

Return type

pandas.core.frame.DataFrame

binary_arithmetic_op(col1, col2, op)[source]

Binary arithmetic operations +, -, *, /

Parameters
  • col1 (Any) – the first column (series or constant)

  • col2 (Any) – the second column (series or constant)

  • op (str) – +, -, *, /

Returns

the result after the operation (series or constant)

Raises

NotImplementedError – if op is not supported

Return type

Any

All behaviors should be consistent with the corresponding SQL operations.
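
A sketch of mixing series and constants, reusing the hypothetical PandasUtils instance from the earlier example:

    s1 = pd.Series([1.0, 2.0])
    s2 = pd.Series([10.0, 20.0])
    total = utils.binary_arithmetic_op(s1, s2, "+")  # series + series
    scaled = utils.binary_arithmetic_op(s1, 2, "*")  # series * constant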

binary_logical_op(col1, col2, op)[source]

Binary logical operations and, or

Parameters
  • col1 (Any) – the first column (series or constant)

  • col2 (Any) – the second column (series or constant)

  • op (str) – and, or

Returns

the result after the operation (series or constant)

Raises

NotImplementedError – if op is not supported

Return type

Any

All behaviors should be consistent with the corresponding SQL operations.

case_when(*pairs, default=None)[source]

SQL CASE WHEN

Parameters
  • pairs (Tuple[Any, Any]) – condition and value pairs, both can be either a series or a constant

  • default (Optional[Any]) – default value if none of the conditions is satisfied, defaults to None

Returns

the final series or constant

Return type

Any

This behavior should be consistent with SQL CASE WHEN
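
A sketch with the hypothetical PandasUtils instance, building the condition series with other methods from this class:

    age = pd.Series([5, 15, 40])
    is_child = utils.comparison_op(age, 10, "<")
    is_teen = utils.is_between(age, 10, 19, positive=True)
    # like SQL: CASE WHEN age < 10 THEN 'child' WHEN ... ELSE 'adult' END
    label = utils.case_when((is_child, "child"), (is_teen, "teen"), default="adult")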

cast(col, type_obj, input_type=None)[source]

Cast col to a new type. type_obj must be able to be converted by to_safe_pa_type().

Parameters
  • col (Any) – a series or a constant

  • type_obj (Any) – an object that can be accepted by to_safe_pa_type()

  • input_type (Optional[Any]) – an object that is either None or acceptable to to_safe_pa_type(), defaults to None.

Returns

the new column or constant

Return type

Any

If input_type is not None, then it can be used to determine the casting behavior. This can be useful when the input is boolean with nulls or strings, where the pandas dtype may not provide the accurate type information.

cast_df(df, schema, input_schema=None)[source]

Cast a dataframe to comply with schema.

Parameters
  • df (slide.utils.TDf) – pandas like dataframe

  • schema (pyarrow.lib.Schema) – pyarrow schema to convert to

  • input_schema (Optional[pyarrow.lib.Schema]) – the known input pyarrow schema, defaults to None

Returns

converted dataframe

Return type

slide.utils.TDf

input_schema is important because sometimes the column types can differ from what is expected. For example, if a boolean series contains Nones, its dtype will be object; without an input type hint, the function can't do the conversion correctly.
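
A sketch of the boolean-with-nulls case described above, with the hypothetical PandasUtils instance:

    df = pd.DataFrame({"flag": [True, None, False]})  # dtype becomes object
    schema = pa.schema([("flag", pa.bool_())])
    # the input_schema hint tells cast_df the object column is really boolean
    out = utils.cast_df(df, schema, input_schema=schema)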

coalesce(cols)[source]

Coalesce multiple series and constants

Parameters

cols (List[Any]) – the collection of series and constants in order

Returns

the coalesced series or constant

Return type

Any

This behavior should be consistent with SQL COALESCE
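
A sketch with the hypothetical PandasUtils instance; series and constants can be mixed:

    a = pd.Series([None, 2, None])
    b = pd.Series([1, None, None])
    # the first non-null value wins, like SQL COALESCE(a, b, 0): 1, 2, 0
    res = utils.coalesce([a, b, 0])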

cols_to_df(cols, names=None)[source]

Construct the dataframe from a list of columns (series)

Parameters
  • cols (List[Any]) – the collection of series or constants, at least one value must be a series

  • names (Optional[List[str]]) – the corresponding column names, defaults to None

Returns

the dataframe

Return type

slide.utils.TDf

If names is not provided, then every series in cols must be named. Otherwise, names must align with cols. Whether names contain duplications or invalid characters is not verified by this method
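
A sketch with the hypothetical PandasUtils instance; the constant is broadcast to the length of the series:

    s = pd.Series([1, 2], name="a")
    df = utils.cols_to_df([s, "x"], names=["a", "b"])  # column b is 'x', 'x'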

comparison_op(col1, col2, op)[source]

Binary comparison <, <=, ==, >, >=

Parameters
  • col1 (Any) – the first column (series or constant)

  • col2 (Any) – the second column (series or constant)

  • op (str) – <, <=, ==, >, >=

Returns

the result after the operation (series or constant)

Raises

NotImplementedError – if op is not supported

Return type

Any

All behaviors should be consistent with the corresponding SQL operations.

drop_duplicates(df)[source]

Select distinct rows from dataframe

Parameters

df (slide.utils.TDf) – pandas like dataframe

Raises

SlideIndexIncompatibleError – if the pandas like dataframe index has a name

Returns

the result with only distinct rows

Return type

slide.utils.TDf

empty(df)[source]

Check if the dataframe is empty

Parameters

df (slide.utils.TDf) – pandas like dataframe

Returns

if it is empty

Return type

bool

ensure_compatible(df)[source]

Check whether the dataframe is compatible with the operations inside this utils collection; if not, a ValueError will be raised

Parameters

df (slide.utils.TDf) – pandas like dataframe

Raises

ValueError – if not compatible

Return type

None

except_df(df1, df2, unique, anti_indicator_col='__anti_indicator__')[source]

Exclude df2 from df1

Parameters
  • df1 (slide.utils.TDf) – the first dataframe

  • df2 (slide.utils.TDf) – the second dataframe

  • unique (bool) – whether to return only unique rows

  • anti_indicator_col (str) – temporary column name used internally for the anti join

Returns

df1 - df2

Return type

slide.utils.TDf

The behavior is not well defined when unique is False

filter_df(df, cond)[source]

Filter dataframe by a boolean series or a constant

Parameters
  • df (slide.utils.TDf) – the dataframe

  • cond (Any) – a boolean series or a constant

Returns

the filtered dataframe

Return type

slide.utils.TDf

Filtering behavior should be consistent with SQL.
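
A sketch with the hypothetical PandasUtils instance; as in SQL WHERE, rows where the condition is null should be excluded:

    df = pd.DataFrame({"a": [1, 2, 3]})
    cond = utils.comparison_op(df["a"], 2, ">=")
    res = utils.filter_df(df, cond)  # keeps rows with a >= 2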

get_col_pa_type(col)[source]

Get column or constant pyarrow data type

Parameters

col (Any) – the column or the constant

Returns

pyarrow data type

Return type

pyarrow.lib.DataType

intersect(df1, df2, unique)[source]

Intersect two dataframes

Parameters
  • df1 (slide.utils.TDf) – the first dataframe

  • df2 (slide.utils.TDf) – the second dataframe

  • unique (bool) – whether to return only unique rows

Returns

intersected dataframe

Return type

slide.utils.TDf

is_between(col, lower, upper, positive)[source]

Check if a series or a constant is >= lower and <= upper

Parameters
  • col (Any) – the series or the constant

  • lower (Any) – the lower bound, which can be series or a constant

  • upper (Any) – the upper bound, which can be series or a constant

  • positive (bool) – is between or is not between

Returns

the corresponding boolean series or constant

Return type

Any

This behavior should be consistent with SQL BETWEEN and NOT BETWEEN. The return values can be True, False and None
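
A sketch with the hypothetical PandasUtils instance, showing the three-valued result:

    x = pd.Series([0, 1, 3, None])
    res = utils.is_between(x, 1, 2, positive=True)
    # SQL BETWEEN semantics: False, True, False, None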

is_compatile_index(df)[source]

Check whether the dataframe is compatible with the operations inside this utils collection

Parameters

df (slide.utils.TDf) – pandas like dataframe

Returns

if it is compatible

Return type

bool

is_in(col, values, positive)[source]

Check if a series or a constant is in values

Parameters
  • col (Any) – the series or the constant

  • values (List[Any]) – a list of constants and series (can mix)

  • positive (bool) – is in or is not in

Returns

the corresponding boolean series or constant

Return type

Any

This behavior should be consistent with SQL IN and NOT IN. The return values can be True, False and None
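
A sketch with the hypothetical PandasUtils instance:

    x = pd.Series(["a", "b", None])
    res = utils.is_in(x, ["a", "c"], positive=True)
    # SQL IN semantics: True, False, None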

is_series(obj)[source]

Check whether the object is a series type

Parameters

obj (Any) – the object

Returns

whether it is a series

Return type

bool

is_value(col, value, positive=True)[source]

Check if the series or constant is value

Parameters
  • col (Any) – the series or constant

  • value (Any) – None, True or False

  • positive (bool) – check is value or is not value, defaults to True (is value)

Raises

NotImplementedError – if value is not supported

Returns

a bool value or a series

Return type

Any

join(ndf1, ndf2, join_type, on, anti_indicator_col='__anti_indicator__', cross_indicator_col='__corss_indicator__')[source]

Join two dataframes.

Parameters
  • ndf1 (slide.utils.TDf) – the first dataframe

  • ndf2 (slide.utils.TDf) – the second dataframe

  • join_type (str) – see parse_join_type()

  • on (List[str]) – join keys for pandas like merge to use

  • anti_indicator_col (str) – temporary column name for anti join, defaults to _ANTI_INDICATOR

  • cross_indicator_col (str) – temporary column name for cross join, defaults to _CROSS_INDICATOR

Raises

NotImplementedError – if join type is not supported

Returns

the joined dataframe

Return type

slide.utils.TDf

All join behaviors should be consistent with the corresponding SQL joins.
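
A sketch with the hypothetical PandasUtils instance; join type strings follow parse_join_type():

    left = pd.DataFrame({"k": [1, 2], "v1": ["a", "b"]})
    right = pd.DataFrame({"k": [2, 3], "v2": ["c", "d"]})
    inner = utils.join(left, right, join_type="inner", on=["k"])
    # left_anti keeps the left rows whose keys do not appear in right
    anti = utils.join(left, right, join_type="left_anti", on=["k"])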

like(col, expr, ignore_case=False, positive=True)[source]

SQL LIKE

Parameters
  • col (Any) – a series or a constant

  • expr (Any) – a pattern expression

  • ignore_case (bool) – whether to ignore case, defaults to False

  • positive (bool) – LIKE or NOT LIKE, defaults to True

Returns

the corresponding boolean series or constant

Return type

Any

This behavior should be consistent with SQL LIKE

logical_not(col)[source]

Logical NOT

All behaviors should be consistent with the corresponding SQL operations.

Parameters

col (Any) –

Return type

Any

series_to_array(col)[source]

Convert a series to numpy array

Parameters

col (slide.utils.TCol) – the series

Returns

the numpy array

Return type

List[Any]

sql_groupby_apply(df, cols, func, output_schema=None, **kwargs)[source]

Safe groupby apply operation on pandas like dataframes. In pandas like groupby apply, if any key is null, the whole group is dropped. This method makes sure those groups are included.

Parameters
  • df (slide.utils.TDf) – pandas like dataframe

  • cols (List[str]) – columns to group on, can be empty

  • func (Callable[[slide.utils.TDf], slide.utils.TDf]) – apply function, df in, df out

  • output_schema (Optional[pyarrow.lib.Schema]) – output schema hint for the apply

  • kwargs (Any) –

Returns

output dataframe

Return type

slide.utils.TDf

The dataframe must be either empty, or have an unnamed index of type pd.RangeIndex, pd.Int64Index or pd.UInt64Index; otherwise, ValueError will be raised.
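
A sketch with the hypothetical PandasUtils instance; note the None key group is preserved, unlike plain pandas groupby-apply:

    def dedup(sub):
        # df-in, df-out transformation applied to each group
        return sub.head(1)

    df = pd.DataFrame({"k": [1, 1, None], "v": [1, 2, 3]})
    res = utils.sql_groupby_apply(df, ["k"], dedup)  # groups: 1 and None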

to_constant_series(constant, from_series, dtype=None, name=None)[source]

Convert a constant to a series with the same index of from_series

Parameters
  • constant (Any) – the constant

  • from_series (slide.utils.TCol) – the reference series for index

  • dtype (Optional[Any]) – default data type, defaults to None

  • name (Optional[str]) – name of the series, defaults to None

Returns

the series

Return type

slide.utils.TCol

to_safe_pa_type(tp)[source]

Convert a type-like object to the corresponding pyarrow DataType

Parameters

tp (Any) – the object to convert

Return type

pyarrow.lib.DataType

to_schema(df)[source]

Extract a pandas dataframe's schema as a pyarrow schema. This is a replacement for pyarrow.Schema.from_pandas, and it can correctly handle string types and empty dataframes

Parameters

df (slide.utils.TDf) – pandas dataframe

Raises

ValueError – if the pandas dataframe does not have a named schema

Returns

pyarrow.Schema

Return type

pyarrow.lib.Schema

The dataframe must be either empty, or have an unnamed index of type pd.RangeIndex, pd.Int64Index or pd.UInt64Index; otherwise, ValueError will be raised.
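
A sketch with the hypothetical PandasUtils instance, showing that empty dataframes are handled:

    empty = pd.DataFrame({"a": pd.Series(dtype="int64"), "b": pd.Series(dtype="object")})
    schema = utils.to_schema(empty)  # a valid pyarrow schema despite no rows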

to_series(obj, name=None)[source]

Convert an object to series

Parameters
  • obj (Any) – the object

  • name (Optional[str]) – name of the series, defaults to None

Returns

the series

Return type

slide.utils.TCol

unary_arithmetic_op(col, op)[source]

Unary arithmetic operator on series/constants

Parameters
  • col (Any) – a series or a constant

  • op (str) – can be + or -

Returns

the transformed series or constant

Raises

NotImplementedError – if op is not supported

Return type

Any

All behaviors should be consistent with the corresponding SQL operations.

union(df1, df2, unique)[source]

Union two dataframes

Parameters
  • df1 (slide.utils.TDf) – the first dataframe

  • df2 (slide.utils.TDf) – the second dataframe

  • unique (bool) – whether to return only unique rows

Returns

unioned dataframe

Return type

slide.utils.TDf
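
A sketch with the hypothetical PandasUtils instance:

    df1 = pd.DataFrame({"a": [1, 2]})
    df2 = pd.DataFrame({"a": [2, 3]})
    all_rows = utils.union(df1, df2, unique=False)  # like SQL UNION ALL
    distinct = utils.union(df1, df2, unique=True)   # like SQL UNION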

slide.utils.parse_join_type(join_type)[source]

Parse and normalize a join type string. The normalization lowercases the string, removes all spaces and underscores, and then maps the result to a limited set of options.

Here are the options after normalization: inner, cross, left_semi, left_anti, left_outer, right_outer, full_outer.

Parameters

join_type (str) – the raw join type string

Raises

NotImplementedError – if not supported

Returns

the normalized join type string

Return type

str
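
For example, based on the normalization rules above (outputs shown as expected values):

    from slide.utils import parse_join_type

    parse_join_type("Left SEMI")   # expected: 'left_semi'
    parse_join_type("full outer")  # expected: 'full_outer'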