Before registering a new dataset, please ensure you have the following:
The DataFamily ID (df_id) of the data family this dataset will belong to. You can find instructions here if you have not yet registered a data family.
The credential_id that maps to the cloud credentials registered with Markov. You can find instructions to register your cloud credentials here.
The S3 file paths (URIs) of the dataset segments.
NOTE: You can find the complete example here, at the end of this page.
Register a Dataset Using Python Library
First, define the segments of your dataset (Train, Test, Validate), as shown in the example below. If your data is not segmented, define a single segment path with a segment_type of unsplit; a sketch of this variant follows the example. The path points to a file that contains the data for that segment type (e.g., train).
import markov
# Define the dataset segments
segment_paths = [
    markov.datasegment.DataSegmentPath(
        path="s3://PATH_TO_YOUR_TRAIN_FILE.csv",
        segment_type=markov.SegmentType.Train,
    ),
    markov.datasegment.DataSegmentPath(
        path="s3://PATH_TO_YOUR_TEST_FILE.csv",
        segment_type=markov.SegmentType.Test,
    ),
    markov.datasegment.DataSegmentPath(
        path="s3://PATH_TO_YOUR_VALIDATE_FILE.csv",
        segment_type=markov.SegmentType.Validate,
    ),
]
# Create the dataset properties object
dataset_properties = {
    "name": "NAME_OF_YOUR_DATASET",
    "data_category": markov.DataCategory.Text,
    "datafamily_id": "DATA_FAMILY_ID",  # data family this dataset will belong to
    "storage_type": markov.StorageType.S3,  # currently only S3 is supported
    "credentials": "CREDENTIAL_ID",
    "data_segment_path": segment_paths,
    "delimiter": "DELIMITER_FOR_YOUR_DATASET",  # options: ",", ";", ":", "\t"
    "notes": "Optional description of this dataset for your records",
    "x_col_names": ["NAME_OF_COLS_CONTAINING_FEATURE_DATA"],  # list of feature/data column names
    "y_col_name": "TARGET_COLUMN_IF_APPLICABLE",
    "meta_data": {"YOUR_KEY": "YOUR_VALUE"},  # key-value pairs to send additional info
}
markov.data.register_dataset(**dataset_properties)
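If your data is not split into train/test/validate files, you can register a single segment path instead. The sketch below assumes the unsplit segment type is exposed as markov.SegmentType.Unsplit; check the SegmentType enum in your installed library version for the exact member name.
# Minimal sketch for an unsegmented dataset
# (assumes markov.SegmentType.Unsplit is the enum member for unsplit data)
unsplit_segment_paths = [
    markov.datasegment.DataSegmentPath(
        path="s3://PATH_TO_YOUR_UNSPLIT_FILE.csv",
        segment_type=markov.SegmentType.Unsplit,
    ),
]
# Pass unsplit_segment_paths as "data_segment_path" when registering the dataset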
Complete Example
The code below shows a complete example of registering a new data family, a credentials set, and a dataset using the MarkovML Python library.
import markov
# Create a new data family for the dataset.
# If you have an existing data family, please SKIP this step.
# STEP 1: Register DataFamily
df_reg_resp = markov.data.register_datafamily(
    name="Hate Speech Data Family",
    notes="This is a data family for hate speech datasets",
    lang="en-us",
    source="3pInternet",  # source of your dataset
)
# STEP 2: Register New Credential
# You can skip STEP 2 if you've already registered your cloud credentials
# and have a Markov credential_id
cred_resp = markov.credentials.register_s3_credentials(
    name="S3TestCredentials",
    access_key="S3_ACCESS_KEY",
    access_secret="S3_ACCESS_SECRET",
    notes="Credentials for S3",
)
# use an existing data family id or the one created in STEP 1
df_id = df_reg_resp.df_id
# use an existing credential_id registered with Markov or the one created in STEP 2
cred_id = cred_resp.credential_id
# STEP 3: Register Dataset with Markov
# Create Segment Paths
segment_paths = [
    markov.datasegment.DataSegmentPath(
        path="s3://PATH_TO_YOUR_TRAIN_FILE.csv",
        segment_type=markov.SegmentType.Train,
    ),
    markov.datasegment.DataSegmentPath(
        path="s3://PATH_TO_YOUR_TEST_FILE.csv",
        segment_type=markov.SegmentType.Test,
    ),
    markov.datasegment.DataSegmentPath(
        path="s3://PATH_TO_YOUR_VALIDATE_FILE.csv",
        segment_type=markov.SegmentType.Validate,
    ),
]
dataset_properties = {
    "name": "NAME_OF_YOUR_DATASET",
    "data_category": markov.DataCategory.Text,
    "datafamily_id": df_id,  # data family this dataset will belong to
    "storage_type": markov.StorageType.S3,  # currently only S3 is supported
    "credentials": cred_id,
    "data_segment_path": segment_paths,
    "delimiter": "DELIMITER_FOR_YOUR_DATASET",  # options: ",", ";", ":", "\t"
    "notes": "Optional description of this dataset for your records",
    "x_col_names": ["NAME_OF_COLS_CONTAINING_FEATURE_DATA"],  # list of feature/data column names
    "y_col_name": "TARGET_COLUMN_IF_APPLICABLE",
    "meta_data": {"YOUR_KEY": "YOUR_VALUE"},  # key-value pairs to send additional info
}
markov.data.register_dataset(**dataset_properties)