
How to add a dataset to Layer


Layer helps you build, train, and track all of your machine learning project metadata, including ML models and datasets, with semantic versioning, extensive artifact logging, and dynamic reporting with local↔cloud training.

In this quick walkthrough, we'll take a look at how to register and track datasets with Layer.

Install Layer

Ensure that you have the latest version of Layer installed.

!pip install layer --upgrade -qqq

Authenticate your Layer account

Once Layer is installed, you need to log in to your Layer account. This step is required because the dataset you create will be stored under that account.

import layer
layer.login()
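layer.login() opens a browser-based flow, which is not always practical in non-interactive environments such as CI jobs. The snippet below is a sketch of a token-based alternative; layer.login_with_api_key() and the placeholder key are assumptions, so check the login helpers available in your version of the SDK.

import layer
# Assumed non-interactive alternative to layer.login(); replace the
# placeholder with an API key from your Layer account settings.
layer.login_with_api_key("YOUR_LAYER_API_KEY")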

Create a project

The next step is to create a project. The dataset will be saved under this project.

Layer Projects are smart containers that organize your machine learning metadata, such as models, datasets, metrics, and reports. A project acts as the front page of your ML work, bringing together all of its metadata: models, datasets, metrics, parameters, and more.

In Layer, projects are created by calling layer.init with the name of the project.

layer.init("iris")

⬆️ Click the link in the output above to visit your Layer Project page.

Create your dataset function

Next, define a dataset function that loads the data and does any pre-processing you'd like.

!git clone https://github.com/layerai/examples.git
!mv /content/examples/tutorials/add-datasets-to-layer/iris.csv iris.csv 

def save_iris():
    import pandas as pd
    data_file = 'iris.csv'
    df = pd.read_csv(data_file)
    classes = df['Species'].nunique()
    # Log data about your data
    print(f"Number of classes {classes}")
    return df

df = save_iris()
df.head()

Save the data to Layer

We can interact with Layer using decorators. Layer provides built-in decorators for different purposes; here we are interested in the @dataset decorator, which is used to create new datasets.

Let's demonstrate how to use the @dataset decorator by saving the Iris dataset.

If your dataset depends on a file, such as a CSV file, you can bundle it with your decorated function using the @resources decorator. Layer automatically uploads the local file; the decorator expects the path to the data file.

Let's also replace print() with layer.log() to enable experiment tracking.

import layer
from layer.decorators import dataset, pip_requirements
from layer.decorators import resources

data_file = 'iris.csv'

@resources(data_file)
@pip_requirements(packages=["matplotlib", "seaborn"])
@dataset('iris_data')
def save_iris():
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    df = pd.read_csv(data_file)
    classes = df['Species'].nunique()
    # Log data about your data
    layer.log({"Number of classes": classes})
    # Log some data statistics
    plt.figure(figsize=(12, 8))
    plt.title('Species Countplot')
    plt.xticks(rotation=90, fontsize=12)
    sns.countplot(x='Species', data=df)
    layer.log({"Species Countplot": plt.gcf()})

    plt.figure(figsize=(12, 8))
    plt.xticks(rotation=90, fontsize=12)
    sns.violinplot(x='Species', y='PetalWidthCm', data=df)
    layer.log({"Species violinplot": plt.gcf()})

    plt.figure(figsize=(12, 8))
    plt.xticks(rotation=90, fontsize=12)
    sns.boxplot(x="Species", y="PetalLengthCm", data=df)
    layer.log({"Boxplot": plt.gcf()})

    plt.figure(figsize=(12, 8))
    sns.scatterplot(x='SepalLengthCm', y='PetalLengthCm', hue='Species', data=df)
    layer.log({"Scatterplot": plt.gcf()})

    return df

When you execute this function, the data will be stored in Layer under the project you just initialized.

You can execute this function in two ways.

Run the function locally

Running the function locally uses your local infrastructure. However, the resulting DataFrame will still be saved to Layer. Layer will also print a link that you can use to view the data immediately.

save_iris()

⬆️ Click the above link to see the registered data in your Layer Project.
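Because the decorated function returns the DataFrame it builds, you can also capture that return value and keep exploring it in the notebook while Layer stores the dataset. This is a minimal sketch, assuming the local call returns the same DataFrame as the undecorated version did.

# Capture the DataFrame returned by the decorated function
# (assumed to return the DataFrame when run locally) and inspect it.
df = save_iris()
df.head()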

Run the function on Layer infrastructure

You can also choose to execute the function on Layer's infrastructure. This is especially useful when dealing with large datasets that require a lot of compute.

To run functions on Layer infrastructure, pass them to layer.run, which expects a list of functions.

# Execute the function on Layer infra
layer.run([save_iris])

⬆️ Click the above link to view the dataset in your Layer Project. Notice that Layer automatically registered and versioned your data.

Data on Layer

How to load and use your data from Layer

Once you register your data to Layer, you can load it back by simply calling layer.get_dataset(DATASET_NAME).

df = layer.get_dataset("layer/iris/datasets/iris_data").to_pandas()
df.head()
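Layer creates a new version of the dataset on every build. If you need a specific version instead of the latest one, you can append a version selector to the dataset path. The :1.1 below is only an illustration of the assumed name:version syntax; use a version number that actually exists on your project page.

# Fetch a specific version of the dataset (version selector syntax is an
# assumption; adjust the number to a version shown in your Layer Project).
df_v1 = layer.get_dataset("layer/iris/datasets/iris_data:1.1").to_pandas()
df_v1.head()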

Where to go from here?

Now that you have registered your first dataset to Layer, you can: