red_panda package¶
Submodules¶
red_panda.red_panda module¶
class red_panda.red_panda.RedPanda(redshift_config: dict, aws_config: dict, default_bucket: str = None, dryrun: bool = False)¶
Bases: red_panda.aws.redshift.RedshiftUtils, red_panda.aws.s3.S3Utils
Class for operations between Pandas and Redshift/S3.
Parameters: - redshift_config – Redshift configuration.
- aws_config (optional) – AWS configuration.
- default_bucket (optional) – Default bucket to store files.
- dryrun (optional) – If True, queries will be printed instead of executed.
redshift_config¶
Redshift configuration.
Type: dict
aws_config¶
AWS configuration.
Type: dict
default_bucket¶
Default bucket to store files.
Type: str
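The two configuration dictionaries might be assembled as below. Note: the exact keys are assumptions based on typical psycopg2/boto3 connection options, not confirmed by this page; consult the library's README for the real key names.

```python
# Hypothetical configuration dicts for constructing a RedPanda instance.
# Key names here are illustrative assumptions, not the library's documented API.
redshift_config = {
    "user": "awsuser",
    "password": "secret",
    "host": "my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    "port": 5439,
    "dbname": "analytics",
}

aws_config = {
    "aws_access_key_id": "AKIA...",   # placeholder
    "aws_secret_access_key": "...",   # placeholder
}

# With the library installed, construction would look like:
# from red_panda import RedPanda
# rp = RedPanda(redshift_config, aws_config, default_bucket="my-bucket", dryrun=True)
```

Setting dryrun=True makes the class print queries instead of executing them, which is useful for verifying the generated SQL before touching the cluster.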
df_to_redshift(df: pandas.core.frame.DataFrame, table_name: str, bucket: str = None, column_definition: dict = None, append: bool = False, path: str = None, file_name: str = None, cleanup: bool = True, **kwargs)¶
Pandas DataFrame to Redshift table.
Parameters: - df – Source dataframe.
- table_name – Redshift table name (optionally include schema name).
- bucket (optional) – S3 bucket name, fallback to default_bucket if not present.
- column_definition (optional) – Specify the column definition for CREATE TABLE. If not given and append is False, data type will be inferred.
- append (optional) – If True, df will be appended to the Redshift table; otherwise the table will be dropped and recreated.
- path (optional) – S3 key excluding file name.
- file_name (optional) – If None, file_name will be randomly generated.
- cleanup (optional) – Default True; the S3 file will be deleted after COPY.
- **kwargs – keyword arguments to pass to Pandas to_csv and Redshift COPY.
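When append is False, the method drops and recreates the table from column_definition before issuing COPY. A minimal sketch of the CREATE TABLE rendering step (the helper below is illustrative, not the library's actual implementation):

```python
# Sketch only: render a CREATE TABLE statement from a column_definition
# mapping of {column name: redshift type}, as df_to_redshift conceptually
# does when append=False.
def build_create_table(table_name: str, column_definition: dict) -> str:
    cols = ", ".join(f"{name} {dtype}" for name, dtype in column_definition.items())
    return f"CREATE TABLE {table_name} ({cols})"

sql = build_create_table("public.events", {"id": "bigint", "name": "varchar(256)"})
print(sql)  # CREATE TABLE public.events (id bigint, name varchar(256))
```

If column_definition is omitted and append is False, the types are inferred from the DataFrame's dtypes instead.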
redshift_to_s3(sql: str, bucket: str = None, path: str = None, prefix: str = None, iam_role: str = None, file_format: str = None, partition_by: List[str] = None, include_partition_column: bool = False, manifest: bool = False, header: bool = False, delimiter: str = None, fixedwidth: Union[str, int] = None, encrypted: bool = False, bzip2: bool = False, gzip: bool = False, zstd: bool = False, addquotes: bool = False, null: str = None, escape: bool = False, allowoverwrite: bool = False, parallel: str = 'ON', maxfilesize: Union[str, int, float] = None, region: str = None)¶
Run sql and unload the result to S3.
Parameters: - sql – SQL query.
- bucket (optional) – S3 bucket name; falls back to default_bucket if not provided.
- path (optional) – S3 path excluding the file name. Created if it does not exist.
- prefix (optional) – Prefix of the set of files.
- iam_role (optional) – IAM Role string. If provided, this will be used as authorization instead of access_key_id/secret_access_key. This feature is untested.
- file_format (optional) – CSV or PARQUET.
- manifest (optional) – Whether or not to create the manifest file.
- header (optional) – Whether or not to include header.
- delimiter (optional) – Delimiter character if the output file is delimited.
- fixedwidth (optional) – If not None, it will overwrite delimiter and use fixedwidth format instead.
- encrypted (optional) – Whether or not the files should be encrypted.
- bzip2 (optional) – Whether or not the files should be compressed with bzip2.
- gzip (optional) – Whether or not the files should be compressed with gzip.
- zstd (optional) – Whether or not the files should be compressed with zstd.
- addquotes (optional) – Whether or not values with delimiter characters should be quoted.
- null (optional) – Specify the NULL AS string.
- escape (optional) – Whether to include the ESCAPE argument in UNLOAD.
- allowoverwrite (optional) – Whether or not existing files should be overwritten. Redshift will fail with an error message if this is False and files already exist.
- parallel (optional) – ON or OFF. Whether or not to use parallel and unload into multiple files.
- maxfilesize (optional) – Maxfilesize argument for UNLOAD.
- region (optional) – AWS region if S3 region is different from Redshift region.
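These keyword options map onto clauses of the Redshift UNLOAD statement. A rough sketch of that translation for a few of the flags (the helper is illustrative; it is not the library's code, and authorization clauses are omitted):

```python
# Sketch: assemble an UNLOAD statement from a subset of redshift_to_s3's
# options. Option spellings (HEADER, GZIP, ALLOWOVERWRITE, PARALLEL) follow
# the Redshift UNLOAD syntax.
def build_unload(sql: str, bucket: str, prefix: str, *,
                 gzip: bool = False, header: bool = False,
                 allowoverwrite: bool = False, parallel: str = "ON") -> str:
    opts = []
    if header:
        opts.append("HEADER")
    if gzip:
        opts.append("GZIP")
    if allowoverwrite:
        opts.append("ALLOWOVERWRITE")
    opts.append(f"PARALLEL {parallel}")
    escaped = sql.replace("'", "''")  # UNLOAD takes the query as a quoted literal
    return f"UNLOAD ('{escaped}') TO 's3://{bucket}/{prefix}' " + " ".join(opts)

stmt = build_unload("select * from events", "my-bucket", "exports/events_",
                    gzip=True, allowoverwrite=True)
print(stmt)
```

With parallel='ON' (the default) Redshift splits the result into multiple files named prefix + slice/part suffixes, which is why prefix rather than a single file name is taken here.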
s3_to_redshift(bucket: str, key: str, table_name: str, column_definition: dict = None, append: bool = False, delimiter: str = ',', ignoreheader: int = 1, quote_character: str = '"', dateformat: str = 'auto', timeformat: str = 'auto', acceptinvchars: str = '?', acceptanydate: bool = False, blanksasnull: bool = False, emptyasnull: bool = False, escape: bool = False, null: str = None, encoding: str = None, explicit_ids: bool = False, fillrecord: bool = False, ignoreblanklines: bool = False, removequotes: bool = False, roundec: bool = False, trimblanks: bool = False, truncatecolumns: bool = False, region: str = None, iam_role: str = None, column_list: list = None)¶
Load S3 file into Redshift.
Parameters: - bucket – S3 bucket name.
- key – S3 key.
- table_name – Redshift table name (optionally include schema name).
- column_definition (optional) – Specify the column definition for CREATE TABLE.
- append (optional) – If True, the data will be appended to the Redshift table; otherwise the table will be dropped and recreated.
- delimiter (optional) – Delimiter of file. Default is “,”.
- ignoreheader (optional) – Number of header lines to skip during COPY.
- quote_character (optional) – QUOTE_CHARACTER for COPY. Only used when delimiter is “,”. Default to ‘”’.
- dateformat (optional) – DATEFORMAT argument for COPY. Default is “auto”.
- timeformat (optional) – TIMEFORMAT argument for COPY. Default is “auto”.
- acceptinvchars (optional) – Whether to include the ACCEPTINVCHARS argument in COPY.
- acceptanydate (optional) – Allows any date format, including invalid formats.
- blanksasnull (optional) – Loads blank fields, which consist of only white space characters, as NULL.
- emptyasnull (optional) – Indicates that Amazon Redshift should load empty CHAR and VARCHAR fields as NULL.
- escape (optional) – Whether to include the ESCAPE argument in COPY.
- null (optional) – Specify the NULL AS string.
- encoding (optional) – Specifies the encoding type of the load data.
- explicit_ids (optional) – Use EXPLICIT_IDS with tables that have IDENTITY column.
- fillrecord (optional) – Allows data files to be loaded when contiguous columns are missing at the end of some of the records.
- ignoreblanklines (optional) – Ignores blank lines that only contain a line feed in a data file and does not try to load them.
- removequotes (optional) – Removes surrounding quotation marks from strings in the incoming data.
- roundec (optional) – Rounds up numeric values when the scale of the input value is greater than the scale of the column.
- trimblanks (optional) – Removes the trailing white space characters from a VARCHAR string.
- truncatecolumns (optional) – Truncates data in columns to the appropriate number of characters so that it fits the column specification.
- region (optional) – S3 region.
- iam_role (optional) – Use IAM Role for access control.
- column_list (optional) – List of columns to COPY.
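As with UNLOAD above, most of these options correspond directly to clauses of the Redshift COPY statement. An illustrative sketch of that assembly for a few options (not the library's code; the credentials/authorization clause is omitted):

```python
# Sketch: assemble a COPY statement from a subset of s3_to_redshift's
# options. Clause spellings (DELIMITER, IGNOREHEADER, REMOVEQUOTES,
# TRUNCATECOLUMNS) follow the Redshift COPY syntax.
def build_copy(table_name: str, bucket: str, key: str, *,
               delimiter: str = ",", ignoreheader: int = 1,
               removequotes: bool = False, truncatecolumns: bool = False) -> str:
    opts = [f"DELIMITER '{delimiter}'", f"IGNOREHEADER {ignoreheader}"]
    if removequotes:
        opts.append("REMOVEQUOTES")
    if truncatecolumns:
        opts.append("TRUNCATECOLUMNS")
    return f"COPY {table_name} FROM 's3://{bucket}/{key}' " + " ".join(opts)

stmt = build_copy("public.events", "my-bucket", "exports/events.csv",
                  removequotes=True)
print(stmt)
```

The column_list parameter, when given, restricts the COPY to those columns in order, which is useful when the file's columns are a subset of the table's.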
red_panda.red_panda.check_invalid_columns(columns: list)¶
Check column names against Redshift reserved words.
Parameters: columns – A list of column names to check.
Raises: ValueError – If a column name is invalid.
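A toy version of this check, assuming a small reserved-word set for illustration (the library validates against Redshift's full reserved-word list):

```python
# Sketch: reject column names that collide with Redshift reserved words.
# Only a handful of reserved words are shown here.
REDSHIFT_RESERVED = {"select", "table", "user", "order", "group"}

def check_invalid_columns(columns: list) -> None:
    invalid = [c for c in columns if c.lower() in REDSHIFT_RESERVED]
    if invalid:
        raise ValueError(f"Invalid column names: {invalid}")

check_invalid_columns(["id", "event_name"])  # passes silently
```

Calling this before CREATE TABLE surfaces a clear ValueError instead of a cryptic Redshift syntax error at COPY time.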
red_panda.red_panda.map_types(columns_types: dict) → dict¶
Convert Pandas dtypes to Redshift data types.
Parameters: columns_types – The return value of dict(df.dtypes), where df is a Pandas DataFrame.
Returns: A dict of {original column name: mapped redshift data type}
Return type: dict
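A simplified sketch of such a mapping. The real function operates on dict(df.dtypes); this example keys on dtype string names to stay dependency-free, and the chosen Redshift types reflect common conventions that may differ from the library's actual mapping.

```python
# Sketch: map pandas dtype names to Redshift column types.
DTYPE_TO_REDSHIFT = {
    "int64": "BIGINT",
    "float64": "DOUBLE PRECISION",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP",
    "object": "VARCHAR(256)",
}

def map_types(columns_types: dict) -> dict:
    """{column: dtype name} -> {column: redshift type}; unknown dtypes fall back to VARCHAR."""
    return {col: DTYPE_TO_REDSHIFT.get(str(dt), "VARCHAR(256)")
            for col, dt in columns_types.items()}

mapped = map_types({"id": "int64", "name": "object", "score": "float64"})
print(mapped)
```

With a real DataFrame you would pass dict(df.dtypes), whose values stringify to the same dtype names used as keys above.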