red_panda package¶
Submodules¶
red_panda.red_panda module¶
class red_panda.red_panda.RedPanda(redshift_config: dict, aws_config: dict, default_bucket: str = None, dryrun: bool = False)¶
Bases: red_panda.aws.redshift.RedshiftUtils, red_panda.aws.s3.S3Utils
Class for operations between Pandas and Redshift/S3.
Parameters: - redshift_config – Redshift configuration.
- aws_config (optional) – AWS configuration.
- default_bucket (optional) – Default bucket to store files.
- dryrun (optional) – If True, queries will be printed instead of executed.
redshift_config¶
Redshift configuration.
Type: dict
aws_config¶
AWS configuration.
Type: dict
default_bucket¶
Default bucket to store files.
Type: str
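The two configuration dictionaries might be assembled as below. Note: the exact keys are assumptions based on typical psycopg2/boto3 connection options, not confirmed by this page; consult the library's README for the real key names.

```python
# Hypothetical configuration dicts for constructing a RedPanda instance.
# Key names here are illustrative assumptions, not the library's documented API.
redshift_config = {
    "user": "awsuser",
    "password": "secret",
    "host": "my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    "port": 5439,
    "dbname": "analytics",
}

aws_config = {
    "aws_access_key_id": "AKIA...",   # placeholder
    "aws_secret_access_key": "...",   # placeholder
}

# With the library installed, construction would look like:
# from red_panda import RedPanda
# rp = RedPanda(redshift_config, aws_config, default_bucket="my-bucket", dryrun=True)
```

Setting dryrun=True makes the class print queries instead of executing them, which is useful for verifying the generated SQL before touching the cluster.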
df_to_redshift(df: pandas.core.frame.DataFrame, table_name: str, bucket: str = None, column_definition: dict = None, append: bool = False, path: str = None, file_name: str = None, cleanup: bool = True, **kwargs)¶
Pandas DataFrame to Redshift table.
Parameters: - df – Source dataframe.
- table_name – Redshift table name (optionally include schema name).
- bucket (optional) – S3 bucket name, fallback to default_bucket if not present.
- column_definition (optional) – Specify the column definition for CREATE TABLE. If not given and append is False, data type will be inferred.
- append (optional) – If True, df will be appended to the Redshift table; otherwise the table will be dropped and recreated.
- path (optional) – S3 key excluding file name.
- file_name (optional) – If None, file_name will be randomly generated.
- cleanup (optional) – Default True; the S3 file will be deleted after COPY.
- **kwargs – keyword arguments to pass to Pandas to_csv and Redshift COPY.
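When append is False, the method drops and recreates the table from column_definition before issuing COPY. A minimal sketch of the CREATE TABLE rendering step (the helper below is illustrative, not the library's actual implementation):

```python
# Sketch only: render a CREATE TABLE statement from a column_definition
# mapping of {column name: redshift type}, as df_to_redshift conceptually
# does when append=False.
def build_create_table(table_name: str, column_definition: dict) -> str:
    cols = ", ".join(f"{name} {dtype}" for name, dtype in column_definition.items())
    return f"CREATE TABLE {table_name} ({cols})"

sql = build_create_table("public.events", {"id": "bigint", "name": "varchar(256)"})
print(sql)  # CREATE TABLE public.events (id bigint, name varchar(256))
```

If column_definition is omitted and append is False, the types are inferred from the DataFrame's dtypes instead.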
redshift_to_s3(sql: str, bucket: str = None, path: str = None, prefix: str = None, iam_role: str = None, file_format: str = None, partition_by: List[str] = None, include_partition_column: bool = False, manifest: bool = False, header: bool = False, delimiter: str = None, fixedwidth: Union[str, int] = None, encrypted: bool = False, bzip2: bool = False, gzip: bool = False, zstd: bool = False, addquotes: bool = False, null: str = None, escape: bool = False, allowoverwrite: bool = False, parallel: str = 'ON', maxfilesize: Union[str, int, float] = None, region: str = None)¶
Run sql and unload the result to S3.
Parameters: - sql – SQL query.
- bucket (optional) – S3 bucket name; falls back to default_bucket if not provided.
- path (optional) – S3 path excluding the file name. Created if it does not exist.
- prefix (optional) – Prefix of the set of files.
- iam_role (optional) – IAM Role string. If provided, this will be used as authorization instead of access_key_id/secret_access_key. This feature is untested.
- file_format (optional) – CSV or PARQUET.
- manifest (optional) – Whether or not to create the manifest file.
- header (optional) – Whether or not to include header.
- delimiter (optional) – Delimiter character if the output file is delimited.
- fixedwidth (optional) – If not None, it will overwrite delimiter and use fixedwidth format instead.
- encrypted (optional) – Whether or not the files should be encrypted.
- bzip2 (optional) – Whether or not the files should be compressed with bzip2.
- gzip (optional) – Whether or not the files should be compressed with gzip.
- zstd (optional) – Whether or not the files should be compressed with zstd.
- addquotes (optional) – Whether or not values with delimiter characters should be quoted.
- null (optional) – Specify the NULL AS string.
- escape (optional) – Whether to include the ESCAPE argument in UNLOAD.
- allowoverwrite (optional) – Whether or not existing files should be overwritten. Redshift will fail with an error message if this is False and files already exist.
- parallel (optional) – ON or OFF. Whether or not to use parallel and unload into multiple files.
- maxfilesize (optional) – Maxfilesize argument for UNLOAD.
- region (optional) – AWS region if S3 region is different from Redshift region.
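These keyword options map onto clauses of the Redshift UNLOAD statement. A rough sketch of that translation for a few of the flags (the helper is illustrative; it is not the library's code, and authorization clauses are omitted):

```python
# Sketch: assemble an UNLOAD statement from a subset of redshift_to_s3's
# options. Option spellings (HEADER, GZIP, ALLOWOVERWRITE, PARALLEL) follow
# the Redshift UNLOAD syntax.
def build_unload(sql: str, bucket: str, prefix: str, *,
                 gzip: bool = False, header: bool = False,
                 allowoverwrite: bool = False, parallel: str = "ON") -> str:
    opts = []
    if header:
        opts.append("HEADER")
    if gzip:
        opts.append("GZIP")
    if allowoverwrite:
        opts.append("ALLOWOVERWRITE")
    opts.append(f"PARALLEL {parallel}")
    escaped = sql.replace("'", "''")  # UNLOAD takes the query as a quoted literal
    return f"UNLOAD ('{escaped}') TO 's3://{bucket}/{prefix}' " + " ".join(opts)

stmt = build_unload("select * from events", "my-bucket", "exports/events_",
                    gzip=True, allowoverwrite=True)
print(stmt)
```

With parallel='ON' (the default) Redshift splits the result into multiple files named prefix + slice/part suffixes, which is why prefix rather than a single file name is taken here.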
s3_to_redshift(bucket: str, key: str, table_name: str, column_definition: dict = None, append: bool = False, delimiter: str = ',', ignoreheader: int = 1, quote_character: str = '"', dateformat: str = 'auto', timeformat: str = 'auto', acceptinvchars: str = '?', acceptanydate: bool = False, blanksasnull: bool = False, emptyasnull: bool = False, escape: bool = False, null: str = None, encoding: str = None, explicit_ids: bool = False, fillrecord: bool = False, ignoreblanklines: bool = False, removequotes: bool = False, roundec: bool = False, trimblanks: bool = False, truncatecolumns: bool = False, region: str = None, iam_role: str = None, column_list: list = None)¶
Load S3 file into Redshift.
Parameters: - bucket – S3 bucket name.
- key – S3 key.
- table_name – Redshift table name (optionally include schema name).
- column_definition (optional) – Specify the column definition for CREATE TABLE.
- append (optional) – If True, the data will be appended to the Redshift table; otherwise the table will be dropped and recreated.
- delimiter (optional) – Delimiter of file. Default is “,”.
- ignoreheader (optional) – Number of header lines to skip during COPY.
- quote_character (optional) – QUOTE_CHARACTER for COPY. Only used when delimiter is “,”. Default to ‘”’.
- dateformat (optional) – DATEFORMAT argument for COPY. Default is “auto”.
- timeformat (optional) – TIMEFORMAT argument for COPY. Default is “auto”.
- acceptinvchars (optional) – Whether to include the ACCEPTINVCHARS argument in COPY.
- acceptanydate (optional) – Allows any date format, including invalid formats.
- blanksasnull (optional) – Loads blank fields, which consist of only white space characters, as NULL.
- emptyasnull (optional) – Indicates that Amazon Redshift should load empty CHAR and VARCHAR fields as NULL.
- escape (optional) – Whether to include the ESCAPE argument in COPY.
- null (optional) – Specify the NULL AS string.
- encoding (optional) – Specifies the encoding type of the load data.
- explicit_ids (optional) – Use EXPLICIT_IDS with tables that have IDENTITY column.
- fillrecord (optional) – Allows data files to be loaded when contiguous columns are missing at the end of some of the records.
- ignoreblanklines (optional) – Ignores blank lines that only contain a line feed in a data file and does not try to load them.
- removequotes (optional) – Removes surrounding quotation marks from strings in the incoming data.
- roundec (optional) – Rounds up numeric values when the scale of the input value is greater than the scale of the column.
- trimblanks (optional) – Removes the trailing white space characters from a VARCHAR string.
- truncatecolumns (optional) – Truncates data in columns to the appropriate number of characters so that it fits the column specification.
- region (optional) – S3 region.
- iam_role (optional) – Use IAM Role for access control.
- column_list (optional) – List of columns to COPY.
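As with UNLOAD above, most of these options correspond directly to clauses of the Redshift COPY statement. An illustrative sketch of that assembly for a few options (not the library's code; the credentials/authorization clause is omitted):

```python
# Sketch: assemble a COPY statement from a subset of s3_to_redshift's
# options. Clause spellings (DELIMITER, IGNOREHEADER, REMOVEQUOTES,
# TRUNCATECOLUMNS) follow the Redshift COPY syntax.
def build_copy(table_name: str, bucket: str, key: str, *,
               delimiter: str = ",", ignoreheader: int = 1,
               removequotes: bool = False, truncatecolumns: bool = False) -> str:
    opts = [f"DELIMITER '{delimiter}'", f"IGNOREHEADER {ignoreheader}"]
    if removequotes:
        opts.append("REMOVEQUOTES")
    if truncatecolumns:
        opts.append("TRUNCATECOLUMNS")
    return f"COPY {table_name} FROM 's3://{bucket}/{key}' " + " ".join(opts)

stmt = build_copy("public.events", "my-bucket", "exports/events.csv",
                  removequotes=True)
print(stmt)
```

The column_list parameter, when given, restricts the COPY to those columns in order, which is useful when the file's columns are a subset of the table's.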
red_panda.red_panda.check_invalid_columns(columns: list)¶
Check column names against Redshift reserved words.
Parameters: columns – A list of column names to check.
Raises: ValueError – If a column name is invalid.
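A toy version of this check, assuming a small reserved-word set for illustration (the library validates against Redshift's full reserved-word list):

```python
# Sketch: reject column names that collide with Redshift reserved words.
# Only a handful of reserved words are shown here.
REDSHIFT_RESERVED = {"select", "table", "user", "order", "group"}

def check_invalid_columns(columns: list) -> None:
    invalid = [c for c in columns if c.lower() in REDSHIFT_RESERVED]
    if invalid:
        raise ValueError(f"Invalid column names: {invalid}")

check_invalid_columns(["id", "event_name"])  # passes silently
```

Calling this before CREATE TABLE surfaces a clear ValueError instead of a cryptic Redshift syntax error at COPY time.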
red_panda.red_panda.map_types(columns_types: dict) → dict¶
Convert Pandas dtypes to Redshift data types.
Parameters: columns_types – The return value of dict(df.dtypes), where df is a Pandas DataFrame.
Returns: A dict of {original column name: mapped redshift data type}
Return type: dict
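A simplified sketch of such a mapping. The real function operates on dict(df.dtypes); this example keys on dtype string names to stay dependency-free, and the chosen Redshift types reflect common conventions that may differ from the library's actual mapping.

```python
# Sketch: map pandas dtype names to Redshift column types.
DTYPE_TO_REDSHIFT = {
    "int64": "BIGINT",
    "float64": "DOUBLE PRECISION",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP",
    "object": "VARCHAR(256)",
}

def map_types(columns_types: dict) -> dict:
    """{column: dtype name} -> {column: redshift type}; unknown dtypes fall back to VARCHAR."""
    return {col: DTYPE_TO_REDSHIFT.get(str(dt), "VARCHAR(256)")
            for col, dt in columns_types.items()}

mapped = map_types({"id": "int64", "name": "object", "score": "float64"})
print(mapped)
```

With a real DataFrame you would pass dict(df.dtypes), whose values stringify to the same dtype names used as keys above.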