red_panda package

Submodules

red_panda.red_panda module

class red_panda.red_panda.RedPanda(redshift_config: dict, aws_config: dict, default_bucket: str = None, dryrun: bool = False)

Bases: red_panda.aws.redshift.RedshiftUtils, red_panda.aws.s3.S3Utils

Class for operations between Pandas and Redshift/S3.

Parameters:
  • redshift_config – Redshift configuration.
  • aws_config (optional) – AWS configuration.
  • default_bucket (optional) – Default bucket to store files.
  • dryrun (optional) – If True, queries will be printed instead of executed.
redshift_config

Redshift configuration.

Type:dict
aws_config

AWS configuration.

Type:dict
default_bucket

Default bucket to store files.

Type:str
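For reference, a minimal construction might look like the following. The configuration key names and values shown are assumptions based on common psycopg2/boto3 conventions, not values taken from this documentation; check your installed version for the exact keys it expects.

```python
# Hypothetical configuration for constructing a RedPanda instance.
# Key names are assumptions (typical psycopg2/boto3 conventions).
redshift_config = {
    "user": "awsuser",
    "password": "<password>",      # placeholder
    "host": "<cluster-endpoint>",  # placeholder
    "port": 5439,
    "dbname": "analytics",
}
aws_config = {
    "aws_access_key_id": "<key-id>",          # placeholder
    "aws_secret_access_key": "<secret-key>",  # placeholder
}

# Constructing the instance requires the library and a live cluster,
# so the call is shown commented:
# from red_panda import RedPanda
# rp = RedPanda(redshift_config, aws_config,
#               default_bucket="my-bucket", dryrun=True)
```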
df_to_redshift(df: pandas.core.frame.DataFrame, table_name: str, bucket: str = None, column_definition: dict = None, append: bool = False, path: str = None, file_name: str = None, cleanup: bool = True, **kwargs)

Pandas DataFrame to Redshift table.

Parameters:
  • df – Source dataframe.
  • table_name – Redshift table name (optionally include schema name).
  • bucket (optional) – S3 bucket name, fallback to default_bucket if not present.
  • column_definition (optional) – Specify the column definition for CREATE TABLE. If not given and append is False, data type will be inferred.
  • append (optional) – If True, df will be appended to the Redshift table, otherwise the table will be dropped and recreated.
  • path (optional) – S3 key excluding file name.
  • file_name (optional) – If None, file_name will be randomly generated.
  • cleanup (optional) – Default True; the S3 file will be deleted after COPY.
  • **kwargs – Keyword arguments to pass to Pandas to_csv and Redshift COPY.
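As an illustration, a call might be set up as below. The table name and the RedPanda instance `rp` are hypothetical, and the call itself is shown commented because it requires live Redshift/S3 credentials.

```python
import pandas as pd

# Sample frame to load; column_definition maps column name -> Redshift type.
df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
column_definition = {"id": "bigint", "name": "varchar(256)"}

# Hypothetical call (requires a configured RedPanda instance `rp`):
# rp.df_to_redshift(df, "analytics.users",
#                   column_definition=column_definition,
#                   append=False, cleanup=True)
```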
redshift_to_s3(sql: str, bucket: str = None, path: str = None, prefix: str = None, iam_role: str = None, file_format: str = None, partition_by: List[str] = None, include_partition_column: bool = False, manifest: bool = False, header: bool = False, delimiter: str = None, fixedwidth: Union[str, int] = None, encrypted: bool = False, bzip2: bool = False, gzip: bool = False, zstd: bool = False, addquotes: bool = False, null: str = None, escape: bool = False, allowoverwrite: bool = False, parallel: str = 'ON', maxfilesize: Union[str, int, float] = None, region: str = None)

Run sql and unload result to S3.

Parameters:
  • sql – SQL query.
  • bucket – S3 bucket name.
  • path (optional) – S3 key excluding file name. Created if it does not exist.
  • prefix (optional) – Prefix of the set of files.
  • iam_role (optional) – IAM Role string. If provided, this will be used as authorization instead of access_key_id/secret_access_key. This feature is untested.
  • file_format (optional) – CSV or PARQUET.
  • partition_by (optional) – Column names to partition the unloaded files by.
  • include_partition_column (optional) – Whether to include the partition columns in the unloaded data.
  • manifest (optional) – Whether or not to create the manifest file.
  • header (optional) – Whether or not to include header.
  • delimiter (optional) – Delimiter character if the output file is delimited.
  • fixedwidth (optional) – If not None, it will overwrite delimiter and use fixedwidth format instead.
  • encrypted (optional) – Whether or not the files should be encrypted.
  • bzip2 (optional) – Whether or not the files should be compressed with bzip2.
  • gzip (optional) – Whether or not the files should be compressed with gzip.
  • zstd (optional) – Whether or not the files should be compressed with zstd.
  • addquotes (optional) – Whether or not values with delimiter characters should be quoted.
  • null (optional) – Specify the NULL AS string.
  • escape (optional) – Whether to include the ESCAPE argument in UNLOAD.
  • allowoverwrite (optional) – Whether or not existing files should be overwritten. Redshift will fail with error message if this is False and there are existing files.
  • parallel (optional) – ON or OFF. Whether or not to unload in parallel into multiple files.
  • maxfilesize (optional) – Maxfilesize argument for UNLOAD.
  • region (optional) – AWS region if S3 region is different from Redshift region.
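A sketch of assembling UNLOAD arguments for this method; all names (bucket, path, query) are hypothetical, and the call is commented because it requires a live cluster.

```python
# Hypothetical UNLOAD arguments for redshift_to_s3.
unload_kwargs = {
    "bucket": "my-bucket",    # falls back to default_bucket if omitted
    "path": "exports/users",  # S3 key excluding file name
    "prefix": "users_",       # prefix of the unloaded file set
    "file_format": "PARQUET",
    "allowoverwrite": True,   # otherwise Redshift errors on existing files
    "parallel": "ON",         # unload into multiple files in parallel
}

# rp.redshift_to_s3("select * from analytics.users", **unload_kwargs)
```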
s3_to_redshift(bucket: str, key: str, table_name: str, column_definition: dict = None, append: bool = False, delimiter: str = ', ', ignoreheader: int = 1, quote_character: str = '"', dateformat: str = 'auto', timeformat: str = 'auto', acceptinvchars: str = '?', acceptanydate: bool = False, blanksasnull: bool = False, emptyasnull: bool = False, escape: bool = False, null: str = None, encoding: str = None, explicit_ids: bool = False, fillrecord: bool = False, ignoreblanklines: bool = False, removequotes: bool = False, roundec: bool = False, trimblanks: bool = False, truncatecolumns: bool = False, region: str = None, iam_role: str = None, column_list: list = None)

Load S3 file into Redshift.

Parameters:
  • bucket – S3 bucket name.
  • key – S3 key.
  • table_name – Redshift table name (optionally include schema name).
  • column_definition (optional) – Specify the column definition for CREATE TABLE.
  • append (optional) – If True, data will be appended to the Redshift table, otherwise the table will be dropped and recreated.
  • delimiter (optional) – Delimiter of file. Default is ",".
  • ignoreheader (optional) – Number of header lines to skip during COPY.
  • quote_character (optional) – QUOTE_CHARACTER for COPY. Only used when delimiter is ",". Defaults to '"'.
  • dateformat (optional) – DATEFORMAT argument for COPY. Default is "auto".
  • timeformat (optional) – TIMEFORMAT argument for COPY. Default is "auto".
  • acceptinvchars (optional) – Replacement character for the ACCEPTINVCHARS argument in COPY. Default is "?".
  • acceptanydate (optional) – Allows any date format, including invalid formats.
  • blanksasnull (optional) – Loads blank fields, which consist of only white space characters, as NULL.
  • emptyasnull (optional) – Indicates that Amazon Redshift should load empty CHAR and VARCHAR fields as NULL.
  • escape (optional) – Whether to include the ESCAPE argument in COPY.
  • null (optional) – Specify the NULL AS string.
  • encoding (optional) – Specifies the encoding type of the load data.
  • explicit_ids (optional) – Use EXPLICIT_IDS with tables that have IDENTITY column.
  • fillrecord (optional) – Allows data files to be loaded when contiguous columns are missing at the end of some of the records.
  • ignoreblanklines (optional) – Ignores blank lines that only contain a line feed in a data file and does not try to load them.
  • removequotes (optional) – Removes surrounding quotation marks from strings in the incoming data.
  • roundec (optional) – Rounds up numeric values when the scale of the input value is greater than the scale of the column.
  • trimblanks (optional) – Removes the trailing white space characters from a VARCHAR string.
  • truncatecolumns (optional) – Truncates data in columns to the appropriate number of characters so that it fits the column specification.
  • region (optional) – S3 region.
  • iam_role (optional) – Use IAM Role for access control.
  • column_list (optional) – List of columns to COPY.
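A sketch of assembling COPY arguments for this method; the bucket, key, and table names are hypothetical, and the call is commented because it requires a live cluster.

```python
# Hypothetical COPY arguments for s3_to_redshift.
copy_kwargs = {
    "bucket": "my-bucket",
    "key": "exports/users/users_000.csv",
    "table_name": "analytics.users",
    "delimiter": ",",
    "ignoreheader": 1,        # skip the header row of the CSV
    "blanksasnull": True,     # load whitespace-only fields as NULL
    "emptyasnull": True,      # load empty CHAR/VARCHAR fields as NULL
    "truncatecolumns": True,  # truncate values that exceed the column width
}

# rp.s3_to_redshift(**copy_kwargs)
```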
red_panda.red_panda.check_invalid_columns(columns: list)

Check column names against Redshift reserved words.

Parameters:columns – A list of column names to check.
Raises:ValueError – If any column name is a Redshift reserved word.
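The behavior can be sketched as follows. The reserved-word set here is a tiny sample for illustration (Redshift's actual list is much longer), and check_invalid_columns_sketch is a hypothetical stand-in, not the library's implementation.

```python
# Tiny sample of Redshift reserved words -- the real list is much longer.
RESERVED_WORDS = {"select", "from", "where", "table", "user", "order", "group"}

def check_invalid_columns_sketch(columns):
    """Raise ValueError if any column name is a reserved word."""
    invalid = [c for c in columns if c.lower() in RESERVED_WORDS]
    if invalid:
        raise ValueError(f"Invalid column names (reserved words): {invalid}")

check_invalid_columns_sketch(["id", "name"])  # valid names pass silently
```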
red_panda.red_panda.map_types(columns_types: dict) → dict

Convert Pandas dtypes to Redshift data types.

Parameters:columns_types – The return value of dict(df.dtypes), where df is a Pandas DataFrame.
Returns:A dict of {original column name: mapped redshift data type}
Return type:dict
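The idea can be illustrated with a simplified mapping. Both map_types_sketch and its DTYPE_MAP are illustrative assumptions and may not match the library's exact type choices.

```python
import pandas as pd

# Illustrative dtype -> Redshift type map; red_panda's actual mapping may differ.
DTYPE_MAP = {
    "int64": "BIGINT",
    "float64": "DOUBLE PRECISION",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP",
    "object": "VARCHAR(256)",
}

def map_types_sketch(columns_types):
    """Map {column: dtype} to {column: Redshift type}, defaulting to VARCHAR."""
    return {col: DTYPE_MAP.get(str(dtype), "VARCHAR(256)")
            for col, dtype in columns_types.items()}

df = pd.DataFrame({"id": [1], "score": [0.5], "name": ["x"]})
mapped = map_types_sketch(dict(df.dtypes))
# mapped == {"id": "BIGINT", "score": "DOUBLE PRECISION", "name": "VARCHAR(256)"}
```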

red_panda.typing module

Module contents