Register a Dataset

Pre-requisites

Before registering a new dataset, make sure you have the following.

  1. DataFamily id (df_id) of the data family this dataset will belong to. You can find instructions here if you have not registered the data family.

  2. The credential_id that maps to the cloud credentials you registered with Markov. You can find instructions to register your cloud credentials here.

  3. The S3 file paths (URIs) of the dataset's segments.

NOTE: A complete example appears at the end of this page.

Register a Dataset Using Python Library

First, define the segments of your dataset (Train, Test, Validate), as shown in the example below. If your data is not segmented, define a single segment path with a segment_type of unsplit. Each path points to a file that contains the data for that segment type (e.g., train).

import markov

# Define the dataset segments
segment_paths = [
    markov.datasegment.DataSegmentPath(
        path="s3://PATH_TO_YOUR_TRAIN_FILE.csv",
        segment_type=markov.SegmentType.Train,
    ),
    markov.datasegment.DataSegmentPath(
        path="s3://PATH_TO_YOUR_TEST_FILE.csv",
        segment_type=markov.SegmentType.Test,
    ),
    markov.datasegment.DataSegmentPath(
        path="s3://PATH_TO_YOUR_VALIDATE_FILE.csv",
        segment_type=markov.SegmentType.Validate,
    ),
]

# Create the properties for the dataset.
# (Named dataset_properties to avoid shadowing Python's built-in `property`.)
dataset_properties = {
    "name": "NAME_OF_YOUR_DATASET",
    "data_category": markov.DataCategory.Text,
    "datafamily_id": "DATA_FAMILY_ID",  # ID of the data family this dataset will belong to
    "storage_type": markov.StorageType.S3,  # currently only S3 is supported
    "credentials": "CREDENTIAL_ID",
    "data_segment_path": segment_paths,
    "delimiter": "DELIMITER_FOR_YOUR_DATASET",  # options: ",", ";", ":", "\t"
    "notes": "Optional description of this dataset for your records",
    "x_col_names": ["NAME_OF_COLS_CONTAINING_FEATURE_DATA"],  # list of feature/data column names
    "y_col_name": "TARGET_COLUMN_IF_APPLICABLE",
    "meta_data": {"YOUR_KEY": "YOUR_VALUE"},  # key-value pairs for additional info
}

# Register the dataset
markov.data.register_dataset(**dataset_properties)
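
Before calling register_dataset, it can help to sanity-check the values you are about to send. The sketch below is not part of the MarkovML library: check_dataset_inputs and ALLOWED_DELIMITERS are hypothetical names introduced here for illustration, and the only rules they encode are the ones stated on this page (segment paths are S3 URIs, and the delimiter is one of the four listed options).

```python
from urllib.parse import urlparse

# Delimiters listed as options in the properties example above.
ALLOWED_DELIMITERS = {",", ";", ":", "\t"}


def check_dataset_inputs(paths, delimiter):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    for path in paths:
        parsed = urlparse(path)
        # An S3 URI has the form s3://bucket/key, so scheme, bucket, and key
        # must all be present.
        if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
            problems.append(f"not a valid S3 URI: {path!r}")
    if delimiter not in ALLOWED_DELIMITERS:
        problems.append(f"unsupported delimiter: {delimiter!r}")
    return problems


# Example bucket and keys are made up for illustration.
problems = check_dataset_inputs(
    ["s3://my-bucket/data/train.csv", "s3://my-bucket/data/test.csv"],
    ",",
)
print(problems or "inputs look fine")
```

A check like this only catches malformed input locally. It does not confirm that the files exist or that the registered credentials can read them.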

Complete Example

The code below shows a complete example of registering a new data family, a set of S3 credentials, and a dataset using the MarkovML Python library.

import markov

# Create a new data family for the dataset.
# If you have an existing data family, SKIP this step.

# STEP 1: Register DataFamily
df_reg_resp = markov.data.register_datafamily(
    name="Hate Speech Data Family",
    notes="This is a data family for hate speech datasets",
    lang="en-us",
    source="3pInternet",  # source of your dataset
)

# STEP 2: Register New Credential
# You can skip STEP 2 if you've already registered your cloud credentials
# and have a Markov credential ID
cred_resp = markov.credentials.register_s3_credentials(
    name="S3TestCredentials",
    access_key="S3_ACCESS_KEY",
    access_secret="S3_ACCESS_SECRET",
    notes="Credentials for S3",
)

# Use an existing data family ID or the one created in STEP 1
df_id = df_reg_resp.df_id
# Use an existing credential_id registered with Markov or the one created in STEP 2
cred_id = cred_resp.credential_id

# STEP 3: Register Dataset with Markov
# Create Segment Paths

segment_paths = [
    markov.datasegment.DataSegmentPath(
        path="s3://PATH_TO_YOUR_TRAIN_FILE.csv",
        segment_type=markov.SegmentType.Train,
    ),
    markov.datasegment.DataSegmentPath(
        path="s3://PATH_TO_YOUR_TEST_FILE.csv",
        segment_type=markov.SegmentType.Test,
    ),
    markov.datasegment.DataSegmentPath(
        path="s3://PATH_TO_YOUR_VALIDATE_FILE.csv",
        segment_type=markov.SegmentType.Validate,
    ),
]

# Create the properties for the dataset.
# (Named dataset_properties to avoid shadowing Python's built-in `property`.)
dataset_properties = {
    "name": "NAME_OF_YOUR_DATASET",
    "data_category": markov.DataCategory.Text,
    "datafamily_id": df_id,  # ID of the data family this dataset will belong to
    "storage_type": markov.StorageType.S3,  # currently only S3 is supported
    "credentials": cred_id,
    "data_segment_path": segment_paths,
    "delimiter": "DELIMITER_FOR_YOUR_DATASET",  # options: ",", ";", ":", "\t"
    "notes": "Optional description of this dataset for your records",
    "x_col_names": ["NAME_OF_COLS_CONTAINING_FEATURE_DATA"],  # list of feature/data column names
    "y_col_name": "TARGET_COLUMN_IF_APPLICABLE",
    "meta_data": {"YOUR_KEY": "YOUR_VALUE"},  # key-value pairs for additional info
}

# Register the dataset
markov.data.register_dataset(**dataset_properties)
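
STEP 2 shows the access key and secret as placeholder strings. One way to keep real secrets out of source files is to read them from environment variables. The sketch below assumes the standard AWS variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. The helper s3_credential_kwargs is hypothetical and not part of the MarkovML library. It only builds the keyword arguments that the STEP 2 call uses (name, access_key, access_secret, notes).

```python
import os


def s3_credential_kwargs(name, notes=""):
    """Hypothetical helper: build register_s3_credentials arguments from the environment."""
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    # Treat unset and empty variables the same way: both are unusable.
    missing = [var for var in required if not os.environ.get(var)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "name": name,
        "access_key": os.environ["AWS_ACCESS_KEY_ID"],
        "access_secret": os.environ["AWS_SECRET_ACCESS_KEY"],
        "notes": notes,
    }
```

With both variables set, the STEP 2 call could then be written as markov.credentials.register_s3_credentials(**s3_credential_kwargs("S3TestCredentials", notes="Credentials for S3")).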
