Friesian Feature API¶
friesian.feature.table¶
- class zoo.friesian.feature.table.Table(df)[source]¶
Bases: object
- to_spark_df()[source]¶
Convert the current Table to a Spark DataFrame.
- Returns
The converted Spark DataFrame.
- size()[source]¶
Returns the number of rows in this Table.
- Returns
The number of rows in the current Table.
- select(*cols)[source]¶
Select specific columns.
- Parameters
cols – a string or a list of strings that specifies column names. If it is ‘*’, select all the columns.
- Returns
A new Table that contains the specified columns.
- drop(*cols)[source]¶
Returns a new Table that drops the specified columns. This is a no-op if the schema doesn’t contain the given column name(s).
- Parameters
cols – a string name of the column to drop, or a list of string names of the columns to drop.
- Returns
A new Table without the specified columns.
- fillna(value, columns)[source]¶
Replace null values.
- Parameters
value – int, long, float, string, or boolean. Value to replace null values with.
columns – list of str, the target columns to be filled. If columns=None and value is int, all columns of integer type will be filled. If columns=None and value is long, float, string or boolean, all columns will be filled.
- Returns
A new Table in which null values are replaced with the specified value.
- dropna(columns, how='any', thresh=None)[source]¶
Drops the rows containing null values in the specified columns.
- Parameters
columns – a string or a list of strings that specifies column names. If it is None, it will operate on all columns.
how – If how is “any”, then drop rows containing any null values in columns. If how is “all”, then drop rows only if every column in columns is null for that row.
thresh – int, if specified, drop rows that have fewer than thresh non-null values. Default is None.
- Returns
A new Table that drops the rows containing null values in the specified columns.
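The interaction of how and thresh can be illustrated in plain Python on a list of dict rows. This is only a sketch of the semantics described above, not the library's implementation: the helper name `drop_null_rows` is hypothetical, and it assumes (as in Spark's `DataFrame.dropna`) that a given thresh takes precedence over how.

```python
def drop_null_rows(rows, columns=None, how="any", thresh=None):
    """Plain-Python sketch of the dropna semantics on a list of dict rows."""
    if isinstance(columns, str):
        columns = [columns]
    kept = []
    for row in rows:
        cols = list(row) if columns is None else columns
        non_null = sum(row[c] is not None for c in cols)
        if thresh is not None:
            keep = non_null >= thresh      # assumption: thresh overrides how
        elif how == "any":
            keep = non_null == len(cols)   # drop if any value is null
        elif how == "all":
            keep = non_null > 0            # drop only if every value is null
        else:
            raise ValueError("how must be 'any' or 'all'")
        if keep:
            kept.append(row)
    return kept


rows = [
    {"a": 1, "b": 2},
    {"a": None, "b": 3},
    {"a": None, "b": None},
]
any_result = drop_null_rows(rows, how="any")    # keeps only the first row
all_result = drop_null_rows(rows, how="all")    # drops only the last row
thresh_result = drop_null_rows(rows, thresh=1)  # needs at least 1 non-null value
```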
- distinct()[source]¶
Select the distinct rows of the Table.
- Returns
A new Table that only contains distinct rows.
- filter(condition)[source]¶
Keeps the rows that satisfy condition. For instance, filter("col_1 == 1") keeps the rows that have value 1 in column col_1.
- Parameters
condition – a string that gives the condition for filtering.
- Returns
A new Table with filtered rows.
- clip(columns, min=None, max=None)[source]¶
Clips continuous values so that they are within the range [min, max]. For instance, by setting the min value to 0, all negative values in columns will be replaced with 0.
- Parameters
columns – str or list of str, the target columns to be clipped.
min – numeric, the minimum value to clip values to. Values less than this will be replaced with this value.
max – numeric, the maximum value to clip values to. Values greater than this will be replaced with this value.
- Returns
A new Table in which values less than min are replaced with min and values greater than max are replaced with max.
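The clipping rule can be sketched in plain Python on a single column of values. The function `clip_values` and its parameter names are hypothetical and exist only for this illustration; leaving null values untouched is an assumption.

```python
def clip_values(values, min_value=None, max_value=None):
    """Clamp each value into [min_value, max_value]; None bounds are ignored."""
    clipped = []
    for v in values:
        if v is not None:  # assumption: nulls pass through unchanged
            if min_value is not None and v < min_value:
                v = min_value
            if max_value is not None and v > max_value:
                v = max_value
        clipped.append(v)
    return clipped


clipped = clip_values([-5, 0, 3, 12], min_value=0, max_value=10)
only_min = clip_values([-5, 0, 3, 12], min_value=0)
```

With only a lower bound, as in `only_min`, large values are left as they are.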
- log(columns, clipping=True)[source]¶
Calculates the log of continuous columns.
- Parameters
columns – str or list of str, the target columns to calculate log.
clipping – boolean. If clipping=True, negative values in columns are first clipped to 0 and then log(x+1) is calculated. If clipping=False, log(x) is calculated.
- Returns
A new Table in which the values in columns are replaced with their logarithms.
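The effect of the clipping flag can be sketched on a plain list of numbers. The helper `log_transform` is hypothetical, and using the natural logarithm is an assumption, since the base is not stated above.

```python
import math


def log_transform(values, clipping=True):
    """Sketch of the two modes: log(max(x, 0) + 1) or plain log(x)."""
    if clipping:
        # Negative inputs are clipped to 0 first, so they map to log(1) = 0.
        return [math.log(max(v, 0) + 1) for v in values]
    # Without clipping the caller must ensure every value is positive.
    return [math.log(v) for v in values]


logged = log_transform([-3, 0, math.e - 1])
plain = log_transform([1.0, math.e], clipping=False)
```

With clipping enabled, zero and negative inputs are safe; with clipping=False, a non-positive input raises an error in this sketch.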
- fill_median(columns)[source]¶
Replaces null values with the median in the specified numeric columns. Any column to be filled should not contain only null values.
- Parameters
columns – a string or a list of strings that specifies column names. If it is None, it will operate on all numeric columns.
- Returns
A new Table that replaces null values with the median in the specified numeric columns.
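As an illustration of the behaviour described above, the following plain-Python sketch fills nulls with the per-column median. The helper `fill_median_rows` is hypothetical; it computes an exact median with the standard library, whereas a distributed implementation may compute the median differently.

```python
from statistics import median


def fill_median_rows(rows, columns):
    """Return new rows where nulls in `columns` are replaced by the column median."""
    medians = {}
    for c in columns:
        present = [r[c] for r in rows if r[c] is not None]
        if not present:
            # Mirrors the note above: a column must not contain only nulls.
            raise ValueError(f"column {c!r} contains only null values")
        medians[c] = median(present)
    return [
        {k: (medians[k] if k in medians and v is None else v) for k, v in r.items()}
        for r in rows
    ]


rows = [{"x": 1.0, "y": "a"}, {"x": None, "y": "b"}, {"x": 3.0, "y": None}]
filled = fill_median_rows(rows, ["x"])
```

Only the listed columns are touched, and the input rows are left unmodified, matching the "new Table" wording above.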
- median(columns)[source]¶
Returns a new Table that has two columns, column and median, containing the column names and the medians of the specified numeric columns.
- Parameters
columns – a string or a list of strings that specifies column names. If it is None, it will operate on all numeric columns.
- Returns
A new Table that contains the medians of the specified columns.
- merge_cols(columns, target)[source]¶
Merge the values of several columns into a list stored in a new column.
- Parameters
columns – list of str, the target columns to be merged.
target – str, the name of the new merged column.
- Returns
A new Table in which columns are replaced by a single target column holding the merged list.
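A minimal sketch of the merge on dict rows, assuming the source columns are removed and the list preserves the order given in columns. The helper name `merge_cols_rows` is hypothetical.

```python
def merge_cols_rows(rows, columns, target):
    """Replace `columns` with one `target` column holding their values as a list."""
    merged = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in columns}
        new_row[target] = [row[c] for c in columns]
        merged.append(new_row)
    return merged


merged = merge_cols_rows([{"a": 1, "b": 2, "c": 3}], ["a", "b"], "ab")
```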
- rename(columns)[source]¶
Rename columns with new column names.
- Parameters
columns – dict of name pairs. For instance, {‘old_name1’: ‘new_name1’, ‘old_name2’: ‘new_name2’}.
- Returns
A new Table with new column names.
- show(n=20, truncate=True)[source]¶
Prints the first n rows to the console.
- Parameters
n – int, the number of rows to show.
truncate – If set to True, truncate strings longer than 20 chars by default. If set to a number greater than one, truncates long strings to length truncate and align cells right.
- cast(columns, type)[source]¶
Cast columns to the specified type.
- Parameters
columns – a string or a list of strings that specifies column names. If it is None, then cast all of the columns.
type – a string (“string”, “boolean”, “int”, “long”, “short”, “float”, “double”) that specifies the type.
- Returns
A new Table that casts all of the specified columns to the specified type.
- class zoo.friesian.feature.table.FeatureTable(df)[source]¶
Bases: zoo.friesian.feature.table.Table
- classmethod read_parquet(paths)[source]¶
Loads Parquet files as a FeatureTable.
- Parameters
paths – str or a list of str. The path/paths to Parquet file(s).
- Returns
A FeatureTable for recommendation data.
- encode_string(columns, indices)[source]¶
Encode columns with the provided StringIndex or list of StringIndex.
- Parameters
columns – str or a list of str, target columns to be encoded.
indices – StringIndex or a list of StringIndex, the StringIndexes of the target columns. Each StringIndex should have at least two columns: id and the corresponding categorical column. Alternatively, a dict or a list of dicts, in which case the keys of each dict should be values of the categorical column and the dict values are the target ids to encode them as.
- Returns
A new FeatureTable which transforms categorical features into unique integer values with provided StringIndexes.
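The dict form of indices amounts to a lookup from category to id. The sketch below illustrates that idea on dict rows; `encode_with_mapping` is a hypothetical helper, and mapping unseen categories to None is an assumption, since the behaviour for missing keys is not stated above.

```python
def encode_with_mapping(rows, column, mapping):
    """Replace the categorical value in `column` with its id from `mapping`."""
    # Assumption: categories missing from the mapping become None.
    return [{**row, column: mapping.get(row[column])} for row in rows]


rows = [{"city": "NYC"}, {"city": "SF"}, {"city": "LA"}]
encoded = encode_with_mapping(rows, "city", {"NYC": 1, "SF": 2})
```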
- gen_string_idx(columns, freq_limit)[source]¶
Generate unique index values for categorical features.
- Parameters
columns – str or a list of str, target columns to generate StringIndex.
freq_limit – int, dict or None. Categories with a count/frequency below freq_limit will be omitted from the encoding. For instance, 15 or {‘col_4’: 10, ‘col_5’: 2}. None means all the categories that appear will be encoded.
- Returns
A list of StringIndex.
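The role of freq_limit can be sketched for one column with `collections.Counter`. This is an illustration only: `build_string_index` is a hypothetical helper, and both the id assignment (starting at 1) and the ordering (most frequent first) are assumptions not confirmed by the text above.

```python
from collections import Counter


def build_string_index(values, freq_limit=None):
    """Map each sufficiently frequent category to an integer id."""
    counts = Counter(v for v in values if v is not None)
    kept = [
        (cat, n) for cat, n in counts.items()
        if freq_limit is None or n >= freq_limit  # below the limit: omitted
    ]
    # Assumption: most frequent first, ties broken by name, ids start at 1.
    kept.sort(key=lambda item: (-item[1], str(item[0])))
    return {cat: idx for idx, (cat, _) in enumerate(kept, start=1)}


values = ["a", "b", "a", "c", "a", "b"]
full_index = build_string_index(values)
limited_index = build_string_index(values, freq_limit=2)
```

Here "c" appears once, so it is dropped from `limited_index` but kept in `full_index`.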
- gen_ind2ind(cols, indices)[source]¶
Generate a mapping between indices.
- Parameters
cols – a list of str, target columns to generate StringIndex.
indices – list of StringIndex
- Returns
FeatureTable
- cross_columns(crossed_columns, bucket_sizes)[source]¶
Cross columns and hash the result into the specified number of buckets.
- Parameters
crossed_columns – list of column name pairs to be crossed, e.g. [[‘a’, ‘b’], [‘c’, ‘d’]].
bucket_sizes – hash bucket size for each crossed pair, e.g. [1000, 300].
- Returns
A FeatureTable that includes the crossed columns (e.g. ‘a_b’, ‘c_d’).
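The idea of a hashed feature cross can be sketched in plain Python: join the values of each pair, hash the result, and take it modulo the bucket size. The helpers below are hypothetical, and the hash function (SHA-256 here, chosen for determinism across runs) is an assumption; the library's actual hash will give different bucket ids.

```python
import hashlib


def cross_hash(row, pair, bucket_size):
    """Hash the joined values of `pair` into the range [0, bucket_size)."""
    key = "_".join(str(row[c]) for c in pair)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % bucket_size


def cross_columns_rows(rows, crossed_columns, bucket_sizes):
    """Add one hashed column per pair, named by joining the pair with '_'."""
    out = []
    for row in rows:
        new_row = dict(row)
        for pair, size in zip(crossed_columns, bucket_sizes):
            new_row["_".join(pair)] = cross_hash(row, pair, size)
        out.append(new_row)
    return out


rows = [{"a": "x", "b": 1, "c": "u", "d": 2}, {"a": "x", "b": 1, "c": "v", "d": 3}]
crossed = cross_columns_rows(rows, [["a", "b"], ["c", "d"]], [1000, 300])
```

Rows with the same pair of values always land in the same bucket, while distinct pairs may collide when the bucket size is small.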
- normalize(columns)[source]¶
Normalize numeric columns.
- Parameters
columns – list of column names.
- Returns
FeatureTable
- add_negative_samples(item_size, item_col='item', label_col='label', neg_num=1)[source]¶
Generate negative item visits for each positive item visit.
- Parameters
item_size – integer, max of item.
item_col – string, name of item column
label_col – string, name of label column
neg_num – integer, for each positive record, add neg_num of negative samples
- Returns
FeatureTable
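Negative sampling of this kind can be sketched as follows: every input row is treated as a positive (label 1), and neg_num extra rows with a randomly drawn, different item get label 0. The helper `add_negatives` and its `seed` parameter are hypothetical, and drawing item ids from 1 to item_size - 1 is an assumption about the id range.

```python
import random


def add_negatives(rows, item_size, item_col="item", label_col="label",
                  neg_num=1, seed=None):
    """Emit each positive row plus `neg_num` rows with a random other item."""
    if item_size <= 2:
        raise ValueError("item_size must allow at least two distinct items")
    rng = random.Random(seed)
    out = []
    for row in rows:
        out.append({**row, label_col: 1})
        for _ in range(neg_num):
            # Assumption: valid item ids are 1 .. item_size - 1.
            neg = rng.randint(1, item_size - 1)
            while neg == row[item_col]:
                neg = rng.randint(1, item_size - 1)
            out.append({**row, item_col: neg, label_col: 0})
    return out


samples = add_negatives([{"user": 1, "item": 3}], item_size=10, neg_num=2, seed=0)
```

Each positive record therefore yields 1 + neg_num output rows.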
- add_hist_seq(user_col, cols, sort_col='time', min_len=1, max_len=100)[source]¶
Generate a list of item visits in history
- Parameters
user_col – string, user column.
cols – list of string, columns to be aggregated.
sort_col – string, sort by sort_col
min_len – int, minimal length of a history list
max_len – int, maximal length of a history list
- Returns
FeatureTable
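One plausible reading of these parameters is sketched below: for each user, visits are sorted by sort_col, and each visit receives the list of that user's earlier values (at most max_len of them); visits with fewer than min_len earlier entries are dropped. The helper `add_hist_seq_rows` and the `_hist_seq` column suffix are assumptions made for this illustration.

```python
from collections import defaultdict


def add_hist_seq_rows(rows, user_col, cols, sort_col="time", min_len=1, max_len=100):
    """Attach, per visit, the list of the same user's preceding values."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[user_col]].append(row)
    out = []
    for visits in by_user.values():
        visits.sort(key=lambda r: r[sort_col])
        for i, row in enumerate(visits):
            history = visits[max(0, i - max_len):i]  # most recent max_len visits
            if len(history) < min_len:
                continue  # not enough history for this visit
            new_row = dict(row)
            for c in cols:
                new_row[c + "_hist_seq"] = [h[c] for h in history]  # hypothetical name
            out.append(new_row)
    return out


visits = [
    {"user": 1, "item": 12, "time": 3},
    {"user": 1, "item": 10, "time": 1},
    {"user": 1, "item": 11, "time": 2},
]
with_history = add_hist_seq_rows(visits, "user", ["item"])
short_history = add_hist_seq_rows(visits, "user", ["item"], max_len=1)
```

The first visit of each user has no history, so with the default min_len=1 it does not appear in the output.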
- add_neg_hist_seq(item_size, item_history_col, neg_num)[source]¶
Generate a list of negative samples for each item in item_history_col.
- Parameters
item_size – int, max of item.
item2cat – FeatureTable with a dataframe of item to category mapping
item_history_col – string, this column should be a list of visits in history
neg_num – int, for each positive record, add neg_num of negative samples
- Returns
FeatureTable
- pad(padding_cols, seq_len=100)[source]¶
Post-pad the columns in padding_cols.
- Parameters
padding_cols – list of string, columns to be padded with 0s.
seq_len – int, length of padded column
- Returns
FeatureTable
- mask(mask_cols, seq_len=100)[source]¶
Mask the columns in mask_cols.
- Parameters
mask_cols – list of string, columns to be masked with 1s and 0s.
seq_len – int, length of masked column
- Returns
FeatureTable
- add_length(col_name)[source]¶
Generate the length of a column.
- Parameters
col_name – string.
- Returns
FeatureTable
- mask_pad(padding_cols, mask_cols, seq_len=100)[source]¶
Mask and pad columns.
- Parameters
padding_cols – list of string, columns to be padded with 0s.
mask_cols – list of string, columns to be masked with 1s and 0s.
seq_len – int, length of masked column
- Returns
FeatureTable
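Padding and masking of a single sequence can be sketched in plain Python. The helpers `pad_seq` and `mask_seq` are hypothetical; the sketch assumes that "post padding" appends 0s at the end, that sequences longer than seq_len are truncated to their first seq_len entries, and that the mask holds 1 for real entries and 0 for padded positions.

```python
def pad_seq(seq, seq_len=100):
    """Append 0s (post padding) or truncate so the result has length seq_len."""
    return (list(seq) + [0] * seq_len)[:seq_len]


def mask_seq(seq, seq_len=100):
    """1 for each real entry, 0 for each padded position, length seq_len."""
    n = min(len(seq), seq_len)
    return [1] * n + [0] * (seq_len - n)


padded = pad_seq([4, 7], seq_len=4)
masked = mask_seq([4, 7], seq_len=4)
truncated = pad_seq([1, 2, 3, 4, 5], seq_len=3)
```

The mask lets a downstream model distinguish a genuine item id from a padding 0.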
- transform_python_udf(in_col, out_col, udf_func)[source]¶
Transform a FeatureTable using a python udf
- Parameters
in_col – string, name of column needed to be transformed.
out_col – string, output column.
udf_func – user defined python function
- Returns
FeatureTable
- join(table, on=None, how=None)[source]¶
Join a FeatureTable with another FeatureTable. This is a wrapper around the Spark DataFrame join.
- Parameters
table – FeatureTable
on – string, join on this column
how – string
- Returns
FeatureTable
- add_feature(item_cols, feature_tbl, default_value)[source]¶
Get the category or another field from another map-like FeatureTable.
- Parameters
item_cols – list[string]
feature_tbl – FeatureTable with two columns [category, item]
default_value – default value for category if key does not exist
- Returns
FeatureTable
- class zoo.friesian.feature.table.StringIndex(df, col_name)[source]¶
Bases: zoo.friesian.feature.table.Table
- classmethod read_parquet(paths, col_name=None)[source]¶
Loads Parquet files as a StringIndex.
- Parameters
paths – str or a list of str. The path/paths to Parquet file(s).
col_name – str. The column name of the corresponding categorical column. If col_name is None, the file name will be used as col_name.
- Returns
A StringIndex.
- classmethod from_dict(indices, col_name)[source]¶
Create the StringIndex from a dict of indices.
- Parameters
indices – dict. The key is a value of the categorical column, the value is the corresponding index. We assume that the key is a str and the value is an int.
col_name – str. The column name of the categorical column.
- Returns
A StringIndex.
- write_parquet(path, mode='overwrite')[source]¶
Write StringIndex to Parquet file.
- Parameters
path – str. The path to the folder of the Parquet file. Note that the col_name will be used as basename of the Parquet file.
mode – str. append, overwrite, error or ignore. append: Append contents of this StringIndex to existing data. overwrite: Overwrite existing data. error: Throw an exception if data already exists. ignore: Silently ignore this operation if data already exists.