Friesian Feature API¶

friesian.feature.table¶

class zoo.friesian.feature.table.Table(df)[source]¶

Bases: object

compute()[source]¶

Trigger computation of the Table.

to_spark_df()[source]¶

Convert the current Table to a Spark DataFrame.

Returns: The converted Spark DataFrame.

size()[source]¶

Returns the number of rows in this Table.

Returns: The number of rows in the current Table.

broadcast()[source]¶

Marks the Table as small enough for use in a broadcast join.

select(*cols)[source]¶

Select specific columns.

Parameters: cols – a string or a list of strings that specifies column names. If it is ‘*’, select all the columns.
Returns: A new Table that contains the specified columns.

drop(*cols)[source]¶

Returns a new Table that drops the specified column(s). This is a no-op if the schema doesn’t contain the given column name(s).

Parameters: cols – a string name of the column to drop, or a list of string names of the columns to drop.
Returns: A new Table that drops the specified column(s).

fillna(value, columns)[source]¶

Replace null values.

Parameters

value – int, long, float, string, or boolean. Value to replace null values with.
columns – list of str, the target columns to be filled. If columns=None and value is int, all columns of integer type will be filled. If columns=None and value is long, float, string or boolean, all columns will be filled.

Returns

A new Table with null values replaced by the specified value.

dropna(columns, how='any', thresh=None)[source]¶

Drops the rows containing null values in the specified columns.

Parameters

columns – a string or a list of strings that specifies column names. If it is None, it will operate on all columns.
how – If how is “any”, then drop rows containing any null values in columns. If how is “all”, then drop rows only if every column in columns is null for that row.
thresh – int, if specified, drop rows that have fewer than thresh non-null values. Default is None.

Returns

A new Table that drops the rows containing null values in the specified columns.
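To make the interaction of how and thresh concrete, here is a minimal plain-Python sketch over a list of dicts. It is not the Friesian implementation, and the function below is hypothetical. It assumes that thresh, when given, takes precedence over how, as in Spark's DataFrame.dropna.

```python
def dropna(rows, columns=None, how="any", thresh=None):
    """Plain-Python sketch of the documented dropna semantics.

    rows is a list of dicts; None stands in for a null value.
    Assumption: thresh, when given, overrides how (as in Spark).
    """
    kept = []
    for row in rows:
        # Resolve the columns to inspect for this row.
        if columns is None:
            cols = list(row)
        elif isinstance(columns, str):
            cols = [columns]
        else:
            cols = list(columns)
        non_null = sum(row[c] is not None for c in cols)
        if thresh is not None:
            keep = non_null >= thresh          # need at least thresh non-null values
        elif how == "any":
            keep = non_null == len(cols)       # drop if any value is null
        elif how == "all":
            keep = non_null > 0                # drop only if every value is null
        else:
            raise ValueError("how must be 'any' or 'all'")
        if keep:
            kept.append(row)
    return kept


rows = [
    {"a": 1, "b": None},
    {"a": None, "b": None},
    {"a": 1, "b": 2},
]
any_rows = dropna(rows, how="any")
all_rows = dropna(rows, how="all")
```

With how="any" only the fully populated row survives, while how="all" removes only the row in which both columns are null.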

distinct()[source]¶

Select the distinct rows of the Table.

Returns: A new Table that only contains distinct rows.

filter(condition)[source]¶

Filters the rows that satisfy condition. For instance, filter(“col_1 == 1”) keeps the rows that have value 1 in column col_1.

Parameters: condition – a string that gives the condition for filtering.
Returns: A new Table with filtered rows.

clip(columns, min=None, max=None)[source]¶

Clips continuous values so that they are within the range [min, max]. For instance, by setting the min value to 0, all negative values in columns will be replaced with 0.

Parameters

columns – str or list of str, the target columns to be clipped.
min – numeric, the minimum value to clip values to. Values less than this will be replaced with this value.
max – numeric, the maximum value to clip values to. Values greater than this will be replaced with this value.

Returns

A new Table in which values less than min are replaced with min and values greater than max are replaced with max.
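The clipping rule can be illustrated on a plain list of numbers. This is a sketch of the documented behavior only, not the library code. The helper name clip_values and the parameter names min_value and max_value are hypothetical, chosen to avoid shadowing Python's built-in min and max.

```python
def clip_values(values, min_value=None, max_value=None):
    """Clip each number into [min_value, max_value]; a None bound is not applied."""
    clipped = []
    for v in values:
        if min_value is not None and v < min_value:
            v = min_value
        if max_value is not None and v > max_value:
            v = max_value
        clipped.append(v)
    return clipped


lower_only = clip_values([-3, 0.5, 7], min_value=0)
both_bounds = clip_values([-3, 0.5, 7], min_value=0, max_value=1)
```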

log(columns, clipping=True)[source]¶

Calculates the log of continuous columns.

Parameters

columns – str or list of str, the target columns whose log is calculated.
clipping – boolean, if clipping=True, the negative values in columns will be clipped to 0 and log(x+1) will be calculated. If clipping=False, log(x) will be calculated.

Returns

A new Table in which the values in columns are replaced with their logs.
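The effect of the clipping flag can be sketched on a plain list. This is an illustration of the rule described above, not the Friesian implementation. The helper name log_values is hypothetical, and the natural logarithm is an assumption, since the base is not stated here.

```python
import math


def log_values(values, clipping=True):
    """Sketch of the documented rule (natural log assumed).

    clipping=True:  negatives are clipped to 0, then log(x + 1) is taken.
    clipping=False: plain log(x), which fails for x <= 0.
    """
    if clipping:
        return [math.log(max(v, 0) + 1) for v in values]
    return [math.log(v) for v in values]


clipped_logs = log_values([-5, 0, 9])
plain_logs = log_values([1], clipping=False)
```

With clipping enabled, negative inputs and zero both map to 0.0, so the transform is defined for every numeric input.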

fill_median(columns)[source]¶

Replaces null values with the median in the specified numeric columns. Any column to be filled should not contain only null values.

Parameters: columns – a string or a list of strings that specifies column names. If it is None, it will operate on all numeric columns.
Returns: A new Table that replaces null values with the median in the specified numeric columns.

median(columns)[source]¶

Returns a new Table that has two columns, column and median, containing the column names and the medians of the specified numeric columns.

Parameters: columns – a string or a list of strings that specifies column names. If it is None, it will operate on all numeric columns.
Returns: A new Table that contains the medians of the specified columns.
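The relationship between median and fill_median can be sketched with the standard library. This is a plain-Python illustration over a list of dicts, not the library code. The helper names are hypothetical, and it computes an exact median, which is an assumption; the library may compute the median differently.

```python
from statistics import median


def column_medians(rows, columns):
    """Return one {'column', 'median'} record per column, ignoring nulls (None)."""
    result = []
    for c in columns:
        values = [r[c] for r in rows if r[c] is not None]
        # statistics.median raises StatisticsError if a column holds only nulls.
        result.append({"column": c, "median": median(values)})
    return result


def fill_median(rows, columns):
    """Replace nulls in the given columns with that column's median."""
    medians = {m["column"]: m["median"] for m in column_medians(rows, columns)}
    return [
        {k: (medians[k] if k in medians and v is None else v) for k, v in r.items()}
        for r in rows
    ]


rows = [{"x": 1, "y": None}, {"x": None, "y": 4}, {"x": 3, "y": 6}]
medians = column_medians(rows, ["x", "y"])
filled = fill_median(rows, ["x"])
```

The error raised for an all-null column mirrors the requirement that a column to be filled should not contain only null values.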

merge_cols(columns, target)[source]¶

Merge the values of several columns into a list stored in a new column.

Parameters

columns – list of str, the target columns to be merged.
target – str, the new column name of the merged column.

Returns

A new Table in which columns are replaced by a new target column holding the merged list values.
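As a rough illustration of the described result, the following plain-Python sketch merges columns of a list of dicts. It is not the library implementation, and the standalone function is hypothetical. It follows the Returns description in that the source columns are removed.

```python
def merge_cols(rows, columns, target):
    """Replace the given columns with one target column holding their values as a list."""
    merged = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k not in columns}
        new_row[target] = [r[c] for c in columns]  # keep the order given in columns
        merged.append(new_row)
    return merged


merged = merge_cols([{"a": 1, "b": 2, "c": 3}], ["a", "b"], "ab")
```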

rename(columns)[source]¶

Rename columns with new column names.

Parameters: columns – dict. Name pairs. For instance, {‘old_name1’: ‘new_name1’, ‘old_name2’: ‘new_name2’}.
Returns: A new Table with new column names.

show(n=20, truncate=True)[source]¶

Prints the first n rows to the console.

Parameters

n – int, the number of rows to show.
truncate – If set to True, truncate strings longer than 20 chars by default. If set to a number greater than one, truncate long strings to length truncate and align cells right.

write_parquet(path, mode='overwrite')[source]¶

cast(columns, type)[source]¶

Cast columns to the specified type.

Parameters

columns – a string or a list of strings that specifies column names. If it is None, then cast all of the columns.
type – a string (“string”, “boolean”, “int”, “long”, “short”, “float”, “double”) that specifies the type.

Returns

A new Table that casts all of the specified columns to the specified type.

col(name)[source]¶

class zoo.friesian.feature.table.FeatureTable(df)[source]¶

Bases: zoo.friesian.feature.table.Table

classmethod read_parquet(paths)[source]¶

Loads Parquet files as a FeatureTable.

Parameters: paths – str or a list of str. The path/paths to Parquet file(s).
Returns: A FeatureTable for recommendation data.

classmethod read_json(paths, cols=None)[source]¶

encode_string(columns, indices)[source]¶

Encode columns with the provided list of StringIndex.

Parameters

columns – str or a list of str, target columns to be encoded.
indices – StringIndex or a list of StringIndex, StringIndexes of target columns. The StringIndex should at least have two columns: id and the corresponding categorical column. Or it can be a dict or a list of dicts. In this case, the keys of the dict should be within the categorical column and the values are the target ids to be encoded.

Returns

A new FeatureTable which transforms categorical features into unique integer values with provided StringIndexes.

gen_string_idx(columns, freq_limit)[source]¶

Generate unique index values for categorical features.

Parameters

columns – str or a list of str, target columns to generate StringIndex.
freq_limit – int, dict or None. Categories with a count/frequency below freq_limit will be omitted from the encoding. It can be an integer, a dict, or None. For instance, 15, {‘col_4’: 10, ‘col_5’: 2} etc. None means all the categories that appear will be encoded.

Returns

A list of StringIndex.
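How freq_limit shapes the index, and how the index is then applied by encode_string, can be sketched in plain Python. This is an illustration for a single column with an integer freq_limit, not the Friesian implementation. Several details are assumptions: ids start at 1, more frequent categories get smaller ids, and categories missing from the index are encoded as 0.

```python
from collections import Counter


def gen_string_idx(values, freq_limit=None):
    """Map each sufficiently frequent category to a unique integer id.

    Assumptions (not stated in this reference): ids start at 1 and are
    assigned by descending frequency, ties broken alphabetically.
    """
    counts = Counter(v for v in values if v is not None)
    kept = [
        (cat, n) for cat, n in counts.items()
        if freq_limit is None or n >= freq_limit  # below freq_limit: omitted
    ]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return {cat: i for i, (cat, _) in enumerate(kept, start=1)}


def encode_string(values, index, default=0):
    """Replace each category by its id; unknown categories get default (assumed 0)."""
    return [index.get(v, default) for v in values]


values = ["a", "b", "a", "c", "a", "b"]
full_index = gen_string_idx(values)
limited_index = gen_string_idx(values, freq_limit=2)
encoded = encode_string(["a", "c", "b"], limited_index)
```

Here "c" appears only once, so with freq_limit=2 it is left out of the index and falls back to the default id when encoded.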

gen_ind2ind(cols, indices)[source]¶

Generate a mapping between indices.

Parameters

cols – a list of str, target columns to generate StringIndex.
indices – list of StringIndex

Returns

FeatureTable

cross_columns(crossed_columns, bucket_sizes)[source]¶

Cross columns and hash the result to the specified bucket sizes.

Parameters

crossed_columns – list of column name pairs to be crossed, e.g. [[‘a’, ‘b’], [‘c’, ‘d’]].
bucket_sizes – hash bucket sizes for the crossed pairs, e.g. [1000, 300].

Returns

A FeatureTable that includes the crossed columns (e.g. ‘a_b’, ‘c_d’).
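The idea of crossing columns into hash buckets can be sketched in plain Python over a list of dicts. This is not the library implementation. The hash function (zlib.crc32) and the way values are joined before hashing are assumptions chosen so that the example is deterministic; only the ‘a_b’ naming comes from the description above.

```python
import zlib


def cross_columns(rows, crossed_columns, bucket_sizes):
    """Add one hashed column per crossed group, named by joining the column names."""
    crossed = []
    for r in rows:
        new_row = dict(r)
        for cols, size in zip(crossed_columns, bucket_sizes):
            # Assumption: join the string forms of the values, then hash.
            key = "_".join(str(r[c]) for c in cols)
            new_row["_".join(cols)] = zlib.crc32(key.encode("utf-8")) % size
        crossed.append(new_row)
    return crossed


rows = [{"a": "x", "b": "y", "c": 1, "d": 2}]
crossed = cross_columns(rows, [["a", "b"], ["c", "d"]], [1000, 300])
```

Each crossed value lands in the range [0, bucket_size), so a smaller bucket size trades more hash collisions for a smaller feature space.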

normalize(columns)[source]¶

Normalize numeric columns.

Parameters: columns – list of column names.
Returns: FeatureTable

add_negative_samples(item_size, item_col='item', label_col='label', neg_num=1)[source]¶

Generate negative item visits for each positive item visit.

Parameters

item_size – integer, the maximum item value.
item_col – string, name of the item column.
label_col – string, name of the label column.
neg_num – integer, for each positive record, add neg_num negative samples.

Returns

FeatureTable
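A minimal plain-Python sketch of negative sampling follows. It is not the library implementation. It assumes that item ids are integers in the range 1 to item_size, that positives are labeled 1 and negatives 0, and that a negative item is drawn uniformly at random among items other than the positive one. The seed parameter is hypothetical and only makes the example repeatable.

```python
import random


def add_negative_samples(rows, item_size, item_col="item",
                         label_col="label", neg_num=1, seed=0):
    """For each positive row, emit it with label 1 plus neg_num rows with label 0."""
    if item_size < 2:
        raise ValueError("item_size must be at least 2 to draw a different item")
    rng = random.Random(seed)
    sampled = []
    for r in rows:
        sampled.append({**r, label_col: 1})
        for _ in range(neg_num):
            negative = rng.randint(1, item_size)
            while negative == r[item_col]:       # never reuse the positive item
                negative = rng.randint(1, item_size)
            sampled.append({**r, item_col: negative, label_col: 0})
    return sampled


samples = add_negative_samples([{"user": 1, "item": 3}], item_size=5, neg_num=2)
```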

add_hist_seq(user_col, cols, sort_col='time', min_len=1, max_len=100)[source]¶

Generate a list of item visits in history.

Parameters

user_col – string, user column.
cols – list of string, columns to be aggregated
sort_col – string, sort by sort_col
min_len – int, minimal length of a history list
max_len – int, maximal length of a history list

Returns

FeatureTable
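One way to read these parameters is sketched below in plain Python. It is not the library implementation. It assumes that, for each visit, the history is the user's earlier visits ordered by sort_col, truncated to the most recent max_len entries, and that visits with fewer than min_len earlier entries are dropped. The output column suffix "_hist_seq" is also an assumption.

```python
from collections import defaultdict


def add_hist_seq(rows, user_col, cols, sort_col="time", min_len=1, max_len=100):
    """Attach, per visit, the list of that user's earlier values for each column."""
    by_user = defaultdict(list)
    for r in rows:
        by_user[r[user_col]].append(r)

    result = []
    for visits in by_user.values():
        visits.sort(key=lambda r: r[sort_col])
        for i, r in enumerate(visits):
            if i < min_len:                      # history too short, skip this visit
                continue
            start = max(0, i - max_len)          # keep only the latest max_len entries
            new_row = dict(r)
            for c in cols:
                new_row[c + "_hist_seq"] = [v[c] for v in visits[start:i]]
            result.append(new_row)
    return result


rows = [
    {"user": 1, "item": 10, "time": 3},
    {"user": 1, "item": 11, "time": 1},
    {"user": 1, "item": 12, "time": 2},
]
history = add_hist_seq(rows, "user", ["item"])
short_history = add_hist_seq(rows, "user", ["item"], max_len=1)
```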

add_neg_hist_seq(item_size, item_history_col, neg_num)[source]¶

Generate a list of negative samples for each item in item_history_col.

Parameters

item_size – int, the maximum item value.
item2cat – FeatureTable with a dataframe of item to category mapping
item_history_col – string, this column should be a list of visits in history
neg_num – int, for each positive record, add neg_num negative samples

Returns

FeatureTable

pad(padding_cols, seq_len=100)[source]¶

Post-pad the padding columns.

Parameters

padding_cols – list of string, columns to be padded with 0s.
seq_len – int, length of padded column

Returns

FeatureTable

mask(mask_cols, seq_len=100)[source]¶

Mask the mask_cols columns.

Parameters

mask_cols – list of string, columns to be masked with 1s and 0s.
seq_len – int, length of masked column

Returns

FeatureTable

add_length(col_name)[source]¶

Generate the length of a column.

Parameters: col_name – string.
Returns: FeatureTable

mask_pad(padding_cols, mask_cols, seq_len=100)[source]¶

Mask and pad columns.

Parameters

padding_cols – list of string, columns to be padded with 0s.
mask_cols – list of string, columns to be masked with 1s and 0s.
seq_len – int, length of masked column

Returns

FeatureTable
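Post padding and masking of a single sequence can be sketched as follows. This is a plain-Python illustration, not the library code. The helper names are hypothetical, and truncating sequences longer than seq_len is an assumption, as is marking real positions with 1 and padded positions with 0.

```python
def pad_sequence(seq, seq_len=100):
    """Post-pad with 0s up to seq_len (assumption: longer sequences are truncated)."""
    seq = list(seq)[:seq_len]
    return seq + [0] * (seq_len - len(seq))


def mask_sequence(seq, seq_len=100):
    """Return 1 for each real position and 0 for each padded position."""
    real = min(len(seq), seq_len)
    return [1] * real + [0] * (seq_len - real)


padded = pad_sequence([5, 6], seq_len=4)
mask = mask_sequence([5, 6], seq_len=4)
```

The mask has the same length as the padded sequence, so a downstream model can tell padding zeros apart from real values.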

transform_python_udf(in_col, out_col, udf_func)[source]¶

Transform a FeatureTable using a Python UDF.

Parameters

in_col – string, name of the column to be transformed.
out_col – string, output column.
udf_func – user defined python function

Returns

FeatureTable

join(table, on=None, how=None)[source]¶

Join a FeatureTable with another FeatureTable. It is a wrapper of the Spark DataFrame join.

Parameters

table – FeatureTable
on – string, join on this column
how – string

Returns

FeatureTable

add_feature(item_cols, feature_tbl, default_value)[source]¶

Get the category or other field from another map-like FeatureTable.

Parameters

item_cols – list[string]
feature_tbl – FeatureTable with two columns [category, item]
default_value – default value for category if key does not exist

Returns

FeatureTable

class zoo.friesian.feature.table.StringIndex(df, col_name)[source]¶

Bases: zoo.friesian.feature.table.Table

classmethod read_parquet(paths, col_name=None)[source]¶

Loads Parquet files as a StringIndex.

Parameters

paths – str or a list of str. The path/paths to Parquet file(s).
col_name – str. The column name of the corresponding categorical column. If col_name is None, the file name will be used as col_name.

Returns

A StringIndex.

classmethod from_dict(indices, col_name)[source]¶

Create the StringIndex from a dict of indices.

Parameters

indices – dict. The key is the categorical column, the value is the corresponding index. We assume that the key is a str and the value is an int.
col_name – str. The column name of the categorical column.

Returns

A StringIndex.

write_parquet(path, mode='overwrite')[source]¶

Write StringIndex to Parquet file.

Parameters

path – str. The path to the folder of the Parquet file. Note that the col_name will be used as basename of the Parquet file.
mode – str. append, overwrite, error or ignore. append: Append contents of this StringIndex to existing data. overwrite: Overwrite existing data. error: Throw an exception if data already exists. ignore: Silently ignore this operation if data already exists.