Friesian Feature API

friesian.feature.table

class zoo.friesian.feature.table.Table(df)[source]

Bases: object

compute()[source]

Trigger computation of the Table.

to_spark_df()[source]

Convert the current Table to a Spark DataFrame.

Returns

The converted Spark DataFrame.

size()[source]

Returns the number of rows in this Table.

Returns

The number of rows in the current Table.

broadcast()[source]

Marks the Table as small enough for use in a broadcast join.

select(*cols)[source]

Select specific columns.

Parameters

cols – a string or a list of strings that specifies column names. If it is ‘*’, select all the columns.

Returns

A new Table that contains the specified columns.

drop(*cols)[source]

Returns a new Table that drops the specified column(s). This is a no-op if the schema doesn’t contain the given column name(s).

Parameters

cols – a string name of the column to drop, or a list of string names of the columns to drop.

Returns

A new Table that drops the specified column(s).

fillna(value, columns)[source]

Replace null values.

Parameters
  • value – int, long, float, string, or boolean. Value to replace null values with.

  • columns – list of str, the target columns to be filled. If columns=None and value is int, all columns of integer type will be filled. If columns=None and value is long, float, string or boolean, all columns will be filled.

Returns

A new Table in which null values are replaced with the specified value.

dropna(columns, how='any', thresh=None)[source]

Drops the rows containing null values in the specified columns.

Parameters
  • columns – a string or a list of strings that specifies column names. If it is None, it will operate on all columns.

  • how – If how is “any”, then drop rows containing any null values in columns. If how is “all”, then drop rows only if every column in columns is null for that row.

  • thresh – int, if specified, drop rows that have fewer than thresh non-null values. Default is None.

Returns

A new Table that drops the rows containing null values in the specified columns.
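
The interaction of how and thresh can be illustrated with a plain-Python sketch. This is not the library's implementation: rows are modelled as dicts, the helper name dropna_rows is hypothetical, and letting thresh take precedence over how is an assumption that mirrors the behaviour of Spark's DataFrame.dropna.

```python
def dropna_rows(rows, columns=None, how="any", thresh=None):
    """Illustrative stand-in for the documented dropna rules (hypothetical helper)."""
    if isinstance(columns, str):
        columns = [columns]
    kept = []
    for row in rows:
        cols = list(row) if columns is None else columns
        non_null = sum(row[c] is not None for c in cols)
        if thresh is not None:
            # Assumption: thresh overrides how, as in Spark.
            keep = non_null >= thresh
        elif how == "any":
            # Drop the row if any inspected column is null.
            keep = non_null == len(cols)
        elif how == "all":
            # Drop the row only if every inspected column is null.
            keep = non_null > 0
        else:
            raise ValueError("how must be 'any' or 'all'")
        if keep:
            kept.append(row)
    return kept


rows = [
    {"a": 1, "b": None},
    {"a": None, "b": None},
    {"a": 1, "b": 2},
]
any_kept = dropna_rows(rows, how="any")
all_kept = dropna_rows(rows, how="all")
thresh_kept = dropna_rows(rows, thresh=2)
```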

distinct()[source]

Select the distinct rows of the Table.

Returns

A new Table that only contains distinct rows.

filter(condition)[source]

Filters the rows that satisfy condition. For instance, filter(“col_1 == 1”) keeps the rows that have value 1 in column col_1.

Parameters

condition – a string that gives the condition for filtering.

Returns

A new Table with filtered rows.

clip(columns, min=None, max=None)[source]

Clips continuous values so that they are within the range [min, max]. For instance, by setting the min value to 0, all negative values in columns will be replaced with 0.

Parameters
  • columns – str or list of str, the target columns to be clipped.

  • min – numeric, the minimum value to clip values to. Values less than this will be replaced with this value.

  • max – numeric, the maximum value to clip values to. Values greater than this will be replaced with this value.

Returns

A new Table in which values less than min are replaced with min and values greater than max are replaced with max.
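
The clipping rule can be sketched per value in plain Python. The helper name clip_value and its lo/hi arguments are hypothetical stand-ins for min and max (renamed to avoid shadowing the Python built-ins); this is an illustration of the documented behaviour, not the library's code.

```python
def clip_value(x, lo=None, hi=None):
    """Clamp x into [lo, hi]; a bound of None leaves that side open."""
    if lo is not None and x < lo:
        return lo
    if hi is not None and x > hi:
        return hi
    return x


values = [-5, 0, 3, 12]
clipped = [clip_value(v, lo=0, hi=10) for v in values]
only_lower = [clip_value(v, lo=0) for v in values]
```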

log(columns, clipping=True)[source]

Calculates the log of continuous columns.

Parameters
  • columns – str or list of str, the target columns to calculate log.

  • clipping – boolean, if clipping=True, the negative values in columns will be clipped to 0 and log(x+1) will be calculated. If clipping=False, log(x) will be calculated.

Returns

A new Table in which the values in columns are replaced with their logs.
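
A per-value sketch of the two modes follows. The helper name log_value is hypothetical, and the use of the natural logarithm is an assumption, since the description above does not state the base.

```python
import math


def log_value(x, clipping=True):
    """Illustrative per-value version of the documented log transform."""
    if clipping:
        # Negative inputs are clipped to 0 first, then log(x + 1) is taken.
        return math.log(max(x, 0) + 1)
    # Without clipping, x must be positive or math.log raises ValueError.
    return math.log(x)


clipped_logs = [log_value(v) for v in [-3, 0, 9]]
plain_log = log_value(1, clipping=False)
```

With clipping=True, both -3 and 0 map to log(1) = 0, so the transform is defined for every numeric input.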

fill_median(columns)[source]

Replaces null values with the median in the specified numeric columns. Any column to be filled should not contain only null values.

Parameters

columns – a string or a list of strings that specifies column names. If it is None, it will operate on all numeric columns.

Returns

A new Table that replaces null values with the median in the specified numeric columns.
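
The effect on a single column can be sketched in plain Python. The helper name fill_median_column is hypothetical, and computing an exact median with the standard library is an assumption; the library may compute the median differently (for example, approximately).

```python
from statistics import median


def fill_median_column(values):
    """Replace None entries with the median of the non-null entries."""
    present = [v for v in values if v is not None]
    if not present:
        # Mirrors the documented restriction: the column must not be all null.
        raise ValueError("column contains only null values")
    m = median(present)
    return [m if v is None else v for v in values]


filled = fill_median_column([1, None, 3, 8])
```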

median(columns)[source]

Returns a new Table that has two columns, column and median, containing the column names and the medians of the specified numeric columns.

Parameters

columns – a string or a list of strings that specifies column names. If it is None, it will operate on all numeric columns.

Returns

A new Table that contains the medians of the specified columns.

merge_cols(columns, target)[source]

Merge the values of several columns into a list stored in a new column.

Parameters
  • columns – list of str, the target columns to be merged.

  • target – str, the new column name of the merged column.

Returns

A new Table in which columns are replaced with a new target column holding the merged list values.
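
For a single row, the transformation can be sketched as follows. Rows are modelled as dicts and the helper name merge_cols_row is hypothetical; this illustrates the documented result, not the library's implementation.

```python
def merge_cols_row(row, columns, target):
    """Replace the given columns of one row with a single list-valued column."""
    merged = {k: v for k, v in row.items() if k not in columns}
    merged[target] = [row[c] for c in columns]
    return merged


row = {"user": 7, "f1": 0.5, "f2": 1.5}
merged_row = merge_cols_row(row, ["f1", "f2"], "features")
```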

rename(columns)[source]

Rename columns with new column names.

Parameters

columns – dict. Name pairs. For instance, {‘old_name1’: ‘new_name1’, ‘old_name2’: ‘new_name2’}.

Returns

A new Table with new column names.

show(n=20, truncate=True)[source]

Prints the first n rows to the console.

Parameters
  • n – int, the number of rows to show.

  • truncate – If set to True, strings longer than 20 characters are truncated. If set to a number greater than one, long strings are truncated to length truncate and cells are right-aligned.

write_parquet(path, mode='overwrite')[source]

cast(columns, type)[source]

Cast columns to the specified type.

Parameters
  • columns – a string or a list of strings that specifies column names. If it is None, then cast all of the columns.

  • type – a string (“string”, “boolean”, “int”, “long”, “short”, “float”, “double”) that specifies the type.

Returns

A new Table that casts all of the specified columns to the specified type.

col(name)[source]

class zoo.friesian.feature.table.FeatureTable(df)[source]

Bases: zoo.friesian.feature.table.Table

classmethod read_parquet(paths)[source]

Loads Parquet files as a FeatureTable.

Parameters

paths – str or a list of str. The path/paths to Parquet file(s).

Returns

A FeatureTable for recommendation data.

classmethod read_json(paths, cols=None)[source]

encode_string(columns, indices)[source]

Encode columns with provided list of StringIndex.

Parameters
  • columns – str or a list of str, target columns to be encoded.

  • indices – StringIndex or a list of StringIndex, StringIndexes of target columns. The StringIndex should at least have two columns: id and the corresponding categorical column. Or it can be a dict or a list of dicts. In this case, the keys of the dict should be within the categorical column and the values are the target ids to be encoded.

Returns

A new FeatureTable which transforms categorical features into unique integer values with provided StringIndexes.

gen_string_idx(columns, freq_limit)[source]

Generate unique index value of categorical features.

Parameters
  • columns – str or a list of str, target columns to generate StringIndex.

  • freq_limit – int, dict or None. Categories with a count/frequency below freq_limit will be omitted from the encoding. It can be an integer, a dict, or None, for instance 15 or {‘col_4’: 10, ‘col_5’: 2}. None means all the categories that appear will be encoded.

Returns

A list of StringIndex.
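
The effect of freq_limit, and the subsequent dict-based encoding described under encode_string, can be sketched in plain Python. The helper names build_index and encode are hypothetical. Assigning ids from 1 in order of descending frequency and mapping unseen categories to None are assumptions made for the illustration; the library's id assignment and handling of unseen categories are not specified above. Only an integer or None freq_limit is handled here.

```python
from collections import Counter


def build_index(values, freq_limit=None):
    """Map each sufficiently frequent category to a unique integer id."""
    counts = Counter(v for v in values if v is not None)
    # Sort by descending count, then by name so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [cat for cat, n in ordered if freq_limit is None or n >= freq_limit]
    return {cat: i for i, cat in enumerate(kept, start=1)}


def encode(values, index):
    """Replace each category with its id; unseen categories become None."""
    return [index.get(v) for v in values]


values = ["a", "b", "a", "c", "a", "b"]
index = build_index(values, freq_limit=2)   # "c" appears once and is omitted
encoded = encode(values, index)
```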

gen_ind2ind(cols, indices)[source]

Generate a mapping between indices.

Parameters
  • cols – a list of str, target columns to generate StringIndex.

  • indices – list of StringIndex

Returns

FeatureTable

cross_columns(crossed_columns, bucket_sizes)[source]

Cross columns and hash the crossed values into the specified number of buckets.

Parameters
  • crossed_columns – list of column name pairs to be crossed, e.g. [[‘a’, ‘b’], [‘c’, ‘d’]].

  • bucket_sizes – list of hash bucket sizes for the crossed pairs, e.g. [1000, 300].

Returns

A FeatureTable that includes the crossed columns (e.g. ‘a_b’, ‘c_d’).
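
The idea of crossing columns into hash buckets can be sketched in plain Python. The helper name cross_bucket is hypothetical, and the MD5-based hash is an assumption chosen only because it is deterministic across runs; the hash function the library actually uses is not stated above.

```python
import hashlib


def cross_bucket(values, bucket_size):
    """Hash the joined values of one crossed pair into [0, bucket_size)."""
    key = "_".join(str(v) for v in values)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % bucket_size


row = {"a": "red", "b": "large", "c": 3, "d": 7}
crossed_columns = [["a", "b"], ["c", "d"]]
bucket_sizes = [1000, 300]

crossed = dict(row)
for cols, size in zip(crossed_columns, bucket_sizes):
    # The new column is named after the crossed pair, e.g. 'a_b'.
    crossed["_".join(cols)] = cross_bucket([row[c] for c in cols], size)
```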

normalize(columns)[source]

Normalize numeric columns.

Parameters

columns – list of column names.

Returns

FeatureTable

add_negative_samples(item_size, item_col='item', label_col='label', neg_num=1)[source]

Generate negative item visits for each positive item visit.

Parameters
  • item_size – integer, max of item.

  • item_col – string, name of item column

  • label_col – string, name of label column

  • neg_num – integer, for each positive record, add neg_num of negative samples

Returns

FeatureTable
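
Negative sampling of this kind can be sketched in plain Python. The helper name with_negative_samples and its seed argument are hypothetical. Drawing item ids uniformly from 1 to item_size, labelling positives 1 and negatives 0, and requiring a negative to differ from the positive item are all assumptions made for the illustration, not confirmed details of the library.

```python
import random


def with_negative_samples(rows, item_size, item_col="item",
                          label_col="label", neg_num=1, seed=0):
    """Emit each positive row followed by neg_num randomly drawn negatives."""
    if item_size < 2:
        raise ValueError("item_size must be at least 2 to draw a different item")
    rng = random.Random(seed)
    out = []
    for row in rows:
        out.append({**row, label_col: 1})
        for _ in range(neg_num):
            neg = rng.randint(1, item_size)
            while neg == row[item_col]:
                # Redraw until the negative differs from the visited item.
                neg = rng.randint(1, item_size)
            out.append({**row, item_col: neg, label_col: 0})
    return out


rows = [{"user": 1, "item": 2}, {"user": 2, "item": 5}]
samples = with_negative_samples(rows, item_size=10, neg_num=2)
```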

add_hist_seq(user_col, cols, sort_col='time', min_len=1, max_len=100)[source]

Generate a list of item visits in history

Parameters
  • user_col – string, user column.

  • cols – list of string, columns that need to be aggregated

  • sort_col – string, sort by sort_col

  • min_len – int, minimal length of a history list

  • max_len – int, maximal length of a history list

Returns

FeatureTable
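
One plausible reading of these parameters is sketched below in plain Python: for each user, visits are sorted by sort_col and every visit receives the list of that user's earlier values. The helper name hist_seq_rows is hypothetical. The "_hist_seq" column suffix, dropping visits with fewer than min_len earlier visits, and keeping only the most recent max_len entries are assumptions made for the illustration.

```python
from collections import defaultdict


def hist_seq_rows(rows, user_col, cols, sort_col="time", min_len=1, max_len=100):
    """Attach to each visit the user's preceding values of the given columns."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[user_col]].append(row)
    out = []
    for visits in by_user.values():
        visits.sort(key=lambda r: r[sort_col])
        for i, row in enumerate(visits):
            if i < min_len:
                # Fewer than min_len earlier visits: no history row is produced.
                continue
            start = max(0, i - max_len)
            new_row = dict(row)
            for c in cols:
                new_row[c + "_hist_seq"] = [v[c] for v in visits[start:i]]
            out.append(new_row)
    return out


rows = [
    {"user": 1, "item": 30, "time": 3},
    {"user": 1, "item": 10, "time": 1},
    {"user": 1, "item": 20, "time": 2},
    {"user": 2, "item": 99, "time": 1},
]
history = hist_seq_rows(rows, "user", ["item"])
short_history = hist_seq_rows(rows, "user", ["item"], max_len=1)
```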

add_neg_hist_seq(item_size, item_history_col, neg_num)[source]

Generate a list of negative samples for each item in item_history_col.

Parameters
  • item_size – int, max of item.

  • item2cat – FeatureTable with a dataframe of item to category mapping

  • item_history_col – string, this column should be a list of visits in history

  • neg_num – int, for each positive record, add neg_num of negative samples

Returns

FeatureTable

pad(padding_cols, seq_len=100)[source]

Apply post-padding to the columns in padding_cols.

Parameters
  • padding_cols – list of string, columns that need to be padded with 0s.

  • seq_len – int, length of padded column

Returns

FeatureTable

mask(mask_cols, seq_len=100)[source]

Mask mask_cols columns

Parameters
  • mask_cols – list of string, columns that need to be masked with 1s and 0s.

  • seq_len – int, length of masked column

Returns

FeatureTable

add_length(col_name)[source]

Generate the length of a column.

Parameters

col_name – string.

Returns

FeatureTable

mask_pad(padding_cols, mask_cols, seq_len=100)[source]

Mask and pad columns

Parameters
  • padding_cols – list of string, columns that need to be padded with 0s.

  • mask_cols – list of string, columns that need to be masked with 1s and 0s.

  • seq_len – int, length of masked column

Returns

FeatureTable
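
Post-padding and masking of a single sequence can be sketched in plain Python. The helper names pad_seq and mask_seq are hypothetical. Truncating sequences longer than seq_len to their first seq_len elements, and using 1 for real positions and 0 for padded positions, are assumptions made for the illustration.

```python
def pad_seq(seq, seq_len=100):
    """Append 0s after the sequence until it has exactly seq_len elements."""
    return (list(seq) + [0] * seq_len)[:seq_len]


def mask_seq(seq, seq_len=100):
    """Return 1 for each real position and 0 for each padded position."""
    n = min(len(seq), seq_len)
    return [1] * n + [0] * (seq_len - n)


history = [3, 4]
padded = pad_seq(history, seq_len=5)
masked = mask_seq(history, seq_len=5)
truncated = pad_seq([1, 2, 3, 4, 5, 6], seq_len=4)
```

The mask lets a downstream model distinguish real history entries from the 0s introduced by padding.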

transform_python_udf(in_col, out_col, udf_func)[source]

Transform a FeatureTable using a Python UDF.

Parameters
  • in_col – string, name of the column to be transformed.

  • out_col – string, output column.

  • udf_func – user-defined Python function

Returns

FeatureTable

join(table, on=None, how=None)[source]

Join a FeatureTable with another FeatureTable. It is a wrapper of the Spark DataFrame join.

Parameters
  • table – FeatureTable

  • on – string, join on this column

  • how – string, the join type passed to the Spark DataFrame join

Returns

FeatureTable

add_feature(item_cols, feature_tbl, default_value)[source]

Get the category or another field from another map-like FeatureTable.

Parameters
  • item_cols – list[string]

  • feature_tbl – FeatureTable with two columns [category, item]

  • default_value – default value for the category if the key does not exist

Returns

FeatureTable

class zoo.friesian.feature.table.StringIndex(df, col_name)[source]

Bases: zoo.friesian.feature.table.Table

classmethod read_parquet(paths, col_name=None)[source]

Loads Parquet files as a StringIndex.

Parameters
  • paths – str or a list of str. The path/paths to Parquet file(s).

  • col_name – str. The column name of the corresponding categorical column. If col_name is None, the file name will be used as col_name.

Returns

A StringIndex.

classmethod from_dict(indices, col_name)[source]

Create the StringIndex from a dict of indices.

Parameters
  • indices – dict. The key is the categorical column, the value is the corresponding index. We assume that the key is a str and the value is an int.

  • col_name – str. The column name of the categorical column.

Returns

A StringIndex.

write_parquet(path, mode='overwrite')[source]

Write StringIndex to Parquet file.

Parameters
  • path – str. The path to the folder of the Parquet file. Note that the col_name will be used as basename of the Parquet file.

  • mode – str. append, overwrite, error or ignore. append: Append contents of this StringIndex to existing data. overwrite: Overwrite existing data. error: Throw an exception if data already exists. ignore: Silently ignore this operation if data already exists.