Data Schema

The Data Schema contains all the information about the data in the Task. It is created at the beginning and is immutable.

Tip

A Data Schema can easily be created from a template in the Web App. After creating a Task, go to the Data Schema page to see the precompiled version of the Data Schema, then update it and insert new Columns to create your custom version.

A Data Schema is composed of a list of objects named Column, each representing a data entity in the Task. The number and type of Column objects depend on the Task Type and on the structure of the Task data.

Column attributes

A Column object has some mandatory attributes and others that depend on its role or data type:

Attribute	Description	Mandatory
Name	Name of the entity used to read it from raw data. For instance, in Tabular tasks, it represents the name of the column of the CSV file.	Mandatory
Data type	Data format of the entity. Possible values are: Float: numeric value; Categorical: entity that can assume only specified values (a Categorical Column requires the attribute possible_values to be specified); String: generic textual data such as text input or customer id, not to be used for categorical columns; Array 1: one-dimensional array, requires the dims attribute to be defined as a list of 1 element [n] that specifies the number of elements of the array; Array 2: two-dimensional array, requires the dims attribute to be defined as a list of 2 elements [n, m] that specifies the number of elements of each dimension of the array; Array 3: three-dimensional array, requires the dims attribute to be defined as a list of 3 elements [n, m, k] that specifies the number of elements of each dimension of the array.	Mandatory
Role	Defines the role the Column object has in the Task. According to the Task type some roles are required or not allowed. More information in the following sections.	Mandatory
Subrole	Additional specification of the role in the Task. Some entities belong to the same Role but have different meanings; the Subrole distinguishes between them. More information in the following sections.	Depends on Task Type
Is Nullable	Whether the entity allows missing values.	Mandatory
Dims	List with the number of elements of each dimension of the array. The value -1 indicates that the dimension can have an arbitrary number of elements.	Required when Data Type is Array
Tolerance	Specifies the tolerance for image data, defining the acceptable pixel variation in image size. Tol=0: Strict matching, only images of the exact specified size are accepted. Tol > 0: Allows a size variation of up to ±Tol pixels in each dimension. For example, if the expected size is (100, 100) and Tol = 5, images between (95, 95) and (105, 105) are accepted. Tol=none: Fully flexible, images of any size are allowed.	Required when Column Role is Input and Data Structure is Image.
Possible values	List of values the categorical variable can assume. They can be either strings or numbers. When Task Type is Classification Multilabel and Role is Target, possible values must be [0, 1], indicating the absence or presence of that class.	Required when Column Data Type is Categorical
Classes Names	Names of the classes in the Task. The length of this list must match the number of elements specified by the Dims of the array.	Required when Column Role is Target and Task Type is Classification Multilabel.
Image Mode	Type of image: RGB, RGBA, or GRAYSCALE. It also determines the Data Type, which is Array 3 for RGB and RGBA and Array 2 for GRAYSCALE.	Required when Column Role is Input and Data Structure is Image.
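
The dependencies between attributes in the table above can be illustrated with a minimal Python sketch. The `Column` dataclass, its field names, the lowercase data type strings, and the `image_size_accepted` helper are hypothetical: they are not the ML cube Platform SDK and only mirror the rules stated in the table.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Union

# Number of dims entries expected for each array data type.
ARRAY_RANK = {"array_1": 1, "array_2": 2, "array_3": 3}


@dataclass
class Column:
    """Hypothetical stand-in for a Data Schema Column (not the SDK class)."""

    name: str
    data_type: str  # "float", "categorical", "string", "array_1", ...
    role: str
    is_nullable: bool
    dims: Optional[List[int]] = None
    possible_values: Optional[List[Union[str, int, float]]] = None

    def __post_init__(self) -> None:
        # Categorical columns must list the values they can assume.
        if self.data_type == "categorical" and not self.possible_values:
            raise ValueError(f"{self.name}: possible_values is required")
        # Array columns need one dims entry per array dimension.
        rank = ARRAY_RANK.get(self.data_type)
        if rank is not None and (self.dims is None or len(self.dims) != rank):
            raise ValueError(f"{self.name}: dims must have {rank} element(s)")


def image_size_accepted(
    expected: Sequence[int], actual: Sequence[int], tol: Optional[int]
) -> bool:
    """Tolerance rule: tol=None accepts any size, otherwise +/- tol pixels."""
    if tol is None:
        return True
    return all(abs(e - a) <= tol for e, a in zip(expected, actual))


label = Column("label", "categorical", "target", False, possible_values=[0, 1])
print(image_size_accepted((100, 100), (95, 105), tol=5))  # True
print(image_size_accepted((100, 100), (94, 100), tol=5))  # False
```

With an expected size of (100, 100) and a tolerance of 5, the first image falls inside the accepted range from (95, 95) to (105, 105), while the second is one pixel too small in its first dimension.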

Role

The Role defines what the Column object represents for the Task. ML cube Platform uses Roles to handle the provided data correctly. Some Roles are needed to uniquely identify a sample, others to retrieve the correct information. Moreover, some Roles must be inserted by you when creating the Data Schema the first time, while others, like the model predictions, are created automatically by ML cube Platform.

User defined roles are:

Role	Data Type	Description	Mandatory
ID	String	Unique identifier of the sample. It is used during data validation to avoid duplicate data and to communicate information about data with you without sending the actual data.	It must always be present when sending data to ML cube Platform.
Time ID	Float	Timestamp of the sample expressed in seconds (for that reason it is a Float). It is used to temporally order samples, maintaining coherence in the analysis of ML cube Platform.	It must always be present when sending data to ML cube Platform.
Metadata	Float, Categorical and String	Represents additional data that are not used as input by the algorithm but that provide contextual information for each sample. For instance, a metadata column can represent the country code.	It is optional since it depends on your choice to upload additional information in ML cube Platform.
Input	Any available Data Type	Represents input data, like a single feature for Tabular tasks, an image in Image tasks, or text in Text tasks.	According to the Task Type, the number of Input Column objects varies from 1 to unlimited. See the section Data schema templates.
Target	Any available Data Type. It must be coherent with Task Type	Represents the true value of the sample in supervised tasks.	It is mandatory for supervised tasks.
Input additional embedding	Array 1	Embedding vector of the Input Column. It is allowed only when the Data Structure of the Task is Image or Text. When this Column object is present, ML cube Platform uses it as the numerical representation of the data; otherwise, it uses an internal embedding algorithm.	It is optional since it depends on your choice to share this type of data with ML cube Platform.
Target additional embedding	Array 1	Embedding vector of the Target Column. It is allowed only when Task Type is RAG. When this Column object is present, ML cube Platform uses it as the numerical representation of the data; otherwise, it uses an internal embedding algorithm.	It is optional since it depends on your choice to share this type of data with ML cube Platform.
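
Since the Time ID is a Float timestamp expressed in seconds, a conversion from a Python `datetime` could look like the sketch below. It assumes the timestamp is counted from the Unix epoch and that naive datetimes are in UTC; neither assumption is stated in this page, and the `to_time_id` helper is hypothetical.

```python
from datetime import datetime, timezone


def to_time_id(moment: datetime) -> float:
    """Return the timestamp of `moment` in seconds since the Unix epoch."""
    if moment.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    return moment.timestamp()


time_id = to_time_id(datetime(2024, 1, 1, 12, 0, 0))
print(time_id)  # 1704110400.0
```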

ML cube Platform defined roles are:

Role	Data Type	Description
Prediction	Same Data Type as the Target Column	Prediction Column object automatically created when the Task Model is created. The name has the fixed template: <MODEL_NAME>@<MODEL_VERSION>
Prediction additional embedding	Array 1	Embedding vector of the Prediction Column. It is allowed only when Task Type is RAG. When this Column object is present, ML cube Platform uses it as the numerical representation of the data; otherwise, it uses an internal embedding algorithm.

Subrole

Some tasks can have different data entities for the same Role; the Column object's Subrole attribute specifies the correct type of data.

Subrole	Associated Role	Data Type	Description
RAG User Input	INPUT	String	In RAG Tasks it is the user query submitted to the system.
RAG Retrieved Context	INPUT	String	In RAG Tasks it is the set of retrieved contexts (separated by the Task attribute context separator) that the retrieval system has selected to answer the query.
Model probability	PREDICTION	Depends on Task Type: RAG: Array 1; Classification Binary: Float; Classification Multiclass: Array 1; Classification Multilabel: Array 1; Semantic Segmentation: Array 3	It is automatically created by ML cube Platform when the created Model has the flag additional probabilistic output set to True. The name has the fixed template: <MODEL_NAME>_probability@<MODEL_VERSION>.
Object prediction label	PREDICTION	Array 1	It is automatically created when Task Type is Object Detection, Semantic Segmentation, or OCR (`with_labels` mode). It is an array with length equal to the number of predicted entities (bounding boxes for Object Detection and OCR, segmented regions for Semantic Segmentation), where each element contains the class label assigned to the corresponding entity. The name has a fixed template: <MODEL_NAME>_predicted_labels@<MODEL_VERSION>.
Object target label	TARGET	Array 1	It is mandatory when Task Type is Object Detection, Semantic Segmentation, or OCR (`with_labels` mode). It is an array with length equal to the number of ground truth entities (bounding boxes for Object Detection and OCR, annotated regions for Semantic Segmentation), where each element contains the class label assigned to the corresponding entity.
Object prediction text	PREDICTION	Array 1	It is used when Task Type is OCR (`with_labels` mode). It contains the extracted text associated with each detected text region. The name has a fixed template: <MODEL_NAME>_predicted_text@<MODEL_VERSION>.
Object target text	TARGET	Array 1	It is used when Task Type is OCR (`with_labels` mode). It contains the ground truth text associated with each annotated text region.
Seasonality	INPUT	Float	It is used in Timeseries Tasks to represent seasonal components of the signal.
Trend	INPUT	Float	It is used in Timeseries Tasks to represent the long-term trend component of the signal.
Regressor	INPUT	Float	It is used in Timeseries Tasks to represent external explanatory variables that influence the target but are not part of the temporal signal itself.
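
The fixed name templates of the automatically created prediction columns can be reproduced with plain string formatting. The sketch below is only illustrative: the function name, the dictionary keys, and the sample model name and version are hypothetical, while the templates themselves come from the tables on this page.

```python
from typing import Dict


def prediction_column_names(model_name: str, model_version: str) -> Dict[str, str]:
    """Build the column names that follow the fixed templates."""
    return {
        "prediction": f"{model_name}@{model_version}",
        "probability": f"{model_name}_probability@{model_version}",
        "predicted_labels": f"{model_name}_predicted_labels@{model_version}",
        "predicted_text": f"{model_name}_predicted_text@{model_version}",
    }


# Hypothetical model name and version, used only for illustration.
names = prediction_column_names("churn_model", "v1.0")
print(names["prediction"])   # churn_model@v1.0
print(names["probability"])  # churn_model_probability@v1.0
```

Knowing these names in advance is useful when looking up prediction columns in the data you exchange with ML cube Platform.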

Data schema constraints

Each combination of Task Type and Data Structure leads to different Data Schema requirements that must be satisfied when the Data Schema is created for the Task. For instance, image binary classification tasks require exactly one input Column object with image data type, and the target Column object must be categorical with only two possible values.

Note

Object Detection and Semantic Segmentation have specific constraints about the dims attribute of the TARGET and PREDICTION columns:

Object Detection [-1, 4]: the first dimension is for the identified objects, the second is for the bounding box specification: x_min, x_max, y_min, y_max
Semantic Segmentation [-1, -1, 2]: the first dimension is for the identified objects, the second is for the polygon vertices, the third is for the vertex coordinates x, y
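
As a minimal sketch of how a dims specification with -1 wildcards could be compared against the shape of actual data, consider the helper below. The function `shape_matches` and the constant names are hypothetical and not part of ML cube Platform; only the dims values come from the note above.

```python
from typing import Sequence


def shape_matches(dims: Sequence[int], shape: Sequence[int]) -> bool:
    """Return True when `shape` satisfies `dims`; -1 accepts any length."""
    if len(dims) != len(shape):
        return False
    return all(d == -1 or d == s for d, s in zip(dims, shape))


OBJECT_DETECTION_DIMS = [-1, 4]
SEMANTIC_SEGMENTATION_DIMS = [-1, -1, 2]

# Two bounding boxes, each given as x_min, x_max, y_min, y_max.
boxes = [[10, 50, 20, 80], [0, 15, 5, 30]]
print(shape_matches(OBJECT_DETECTION_DIMS, (len(boxes), 4)))  # True

# Three objects with seven (x, y) vertices each.
print(shape_matches(SEMANTIC_SEGMENTATION_DIMS, (3, 7, 2)))  # True
print(shape_matches(SEMANTIC_SEGMENTATION_DIMS, (3, 7, 3)))  # False
```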

Here is the list of constraints on the number of Column objects for each Role:


Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING
Regression	Tabular	1	1	\(\ge\) 1	\(\ge\) 0	1	0
Regression	Embedding	1	1	1	\(\ge\) 0	1	0
Regression	Image	1	1	1	\(\ge\) 0	1	\(\le\) 1
Regression	Text	1	1	1	\(\ge\) 0	1	\(\le\) 1

Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING
Classification Binary	Tabular	1	1	\(\ge\) 1	\(\ge\) 0	1	0
Classification Binary	Embedding	1	1	1	\(\ge\) 0	1	0
Classification Binary	Image	1	1	1	\(\ge\) 0	1	\(\le\) 1
Classification Binary	Text	1	1	1	\(\ge\) 0	1	\(\le\) 1

Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING
Classification Multiclass	Tabular	1	1	\(\ge\) 1	\(\ge\) 0	1	0
Classification Multiclass	Embedding	1	1	1	\(\ge\) 0	1	0
Classification Multiclass	Image	1	1	1	\(\ge\) 0	1	\(\le\) 1
Classification Multiclass	Text	1	1	1	\(\ge\) 0	1	\(\le\) 1

Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING
Classification Multilabel	Tabular	1	1	\(\ge\) 1	\(\ge\) 0	1	0
Classification Multilabel	Embedding	1	1	1	\(\ge\) 0	1	0
Classification Multilabel	Image	1	1	1	\(\ge\) 0	1	\(\le\) 1
Classification Multilabel	Text	1	1	1	\(\ge\) 0	1	\(\le\) 1

Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING	TARGET ADDITIONAL EMBEDDING	OBJECT LABEL TARGET	OBJECT TEXT TARGET	USER INPUT	RETRIEVED CONTEXT	SEASONALITY	TREND	REGRESSOR
RAG	Text	1	1	2	\(\ge\) 0	0	0	0	0	0	1	1	0	0	0

Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING	TARGET ADDITIONAL EMBEDDING	OBJECT LABEL TARGET	OBJECT TEXT TARGET	USER INPUT	RETRIEVED CONTEXT	SEASONALITY	TREND	REGRESSOR
Object Detection	Image	1	1	1	\(\ge\) 0	1	\(\le\) 1	0	1	0	0	0	0	0	0

Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING	TARGET ADDITIONAL EMBEDDING	OBJECT LABEL TARGET	OBJECT TEXT TARGET	USER INPUT	RETRIEVED CONTEXT	SEASONALITY	TREND	REGRESSOR
Semantic Segmentation	Image	1	1	1	\(\ge\) 0	1	\(\le\) 1	0	1	0	0	0	0	0	0

Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING
Clustering	Tabular	1	1	\(\ge\) 1	\(\ge\) 0	1	0
Clustering	Embedding	1	1	1	\(\ge\) 0	1	0
Clustering	Image	1	1	1	\(\ge\) 0	1	\(\le\) 1
Clustering	Text	1	1	1	\(\ge\) 0	1	\(\le\) 1

Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING	TARGET ADDITIONAL EMBEDDING	OBJECT LABEL TARGET	OBJECT TEXT TARGET	USER INPUT	RETRIEVED CONTEXT	SEASONALITY	TREND	REGRESSOR
OCR plain_text	Image	1	1	1	\(\ge\) 0	1	\(\le\) 1	0	0	0	0	0	0	0	0
OCR with_labels	Image	1	1	1	\(\ge\) 0	1	\(\le\) 1	0	1	1	0	0	0	0	0
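
The quantity constraints can be read as a minimum and maximum count per Role. The sketch below encodes two rows of the Regression table in that form and checks a list of column roles against them. The dictionary layout, the lowercase identifiers, and the `check_role_counts` function are hypothetical; `None` stands for "no upper bound", and roles not listed in the constraints are ignored.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Bounds = Tuple[int, Optional[int]]  # (minimum, maximum); None means unbounded

# Two rows taken from the Regression quantity table.
CONSTRAINTS: Dict[Tuple[str, str], Dict[str, Bounds]] = {
    ("regression", "tabular"): {
        "id": (1, 1), "time_id": (1, 1), "input": (1, None),
        "metadata": (0, None), "target": (1, 1),
        "input_additional_embedding": (0, 0),
    },
    ("regression", "image"): {
        "id": (1, 1), "time_id": (1, 1), "input": (1, 1),
        "metadata": (0, None), "target": (1, 1),
        "input_additional_embedding": (0, 1),
    },
}


def check_role_counts(task_type: str, data_structure: str, roles: List[str]) -> List[str]:
    """Return one message per Role whose count violates the constraints."""
    counts = Counter(roles)
    errors = []
    for role, (low, high) in CONSTRAINTS[(task_type, data_structure)].items():
        found = counts[role]
        if found < low or (high is not None and found > high):
            upper = "unbounded" if high is None else high
            errors.append(f"{role}: found {found}, allowed {low} to {upper}")
    return errors


roles = ["id", "time_id", "input", "input", "target"]
print(check_role_counts("regression", "tabular", roles))  # []
print(check_role_counts("regression", "image", roles))
# ['input: found 2, allowed 1 to 1']
```

The same list of roles is valid for a Tabular task, which accepts any number of Input columns, but not for an Image task, which accepts exactly one.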

Here is the list of constraints on Data Types for each Role:


Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING	TARGET ADDITIONAL EMBEDDING	OBJECT LABEL TARGET	OBJECT TEXT TARGET	USER INPUT	RETRIEVED CONTEXT	SEASONALITY	TREND	REGRESSOR
Regression	Tabular	STRING	FLOAT	FLOAT, CATEGORY	FLOAT, CATEGORY, STRING	FLOAT	-	-	-	-	-	-	-	-	-
Regression	Embedding	STRING	FLOAT	ARRAY_1	FLOAT, CATEGORY, STRING	FLOAT	-	-	-	-	-	-	-	-	-
Regression	Image	STRING	FLOAT	ARRAY_3	FLOAT, CATEGORY, STRING	FLOAT	ARRAY_1	-	-	-	-	-	-	-	-
Regression	Text	STRING	FLOAT	STRING	FLOAT, CATEGORY, STRING	FLOAT	ARRAY_1	-	-	-	-	-	-	-	-

Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING	TARGET ADDITIONAL EMBEDDING	OBJECT LABEL TARGET	OBJECT TEXT TARGET	USER INPUT	RETRIEVED CONTEXT	SEASONALITY	TREND	REGRESSOR
Classification Binary	Tabular	STRING	FLOAT	FLOAT, CATEGORY	FLOAT, CATEGORY, STRING	CATEGORY	-	-	-	-	-	-	-	-	-
Classification Binary	Embedding	STRING	FLOAT	ARRAY_1	FLOAT, CATEGORY, STRING	CATEGORY	-	-	-	-	-	-	-	-	-
Classification Binary	Image	STRING	FLOAT	ARRAY_3	FLOAT, CATEGORY, STRING	CATEGORY	ARRAY_1	-	-	-	-	-	-	-	-
Classification Binary	Text	STRING	FLOAT	STRING	FLOAT, CATEGORY, STRING	CATEGORY	ARRAY_1	-	-	-	-	-	-	-	-

Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING	TARGET ADDITIONAL EMBEDDING	OBJECT LABEL TARGET	OBJECT TEXT TARGET	USER INPUT	RETRIEVED CONTEXT	SEASONALITY	TREND	REGRESSOR
Classification Multiclass	Tabular	STRING	FLOAT	FLOAT, CATEGORY	FLOAT, CATEGORY, STRING	CATEGORY	-	-	-	-	-	-	-	-	-
Classification Multiclass	Embedding	STRING	FLOAT	ARRAY_1	FLOAT, CATEGORY, STRING	CATEGORY	-	-	-	-	-	-	-	-	-
Classification Multiclass	Image	STRING	FLOAT	ARRAY_3	FLOAT, CATEGORY, STRING	CATEGORY	ARRAY_1	-	-	-	-	-	-	-	-
Classification Multiclass	Text	STRING	FLOAT	STRING	FLOAT, CATEGORY, STRING	CATEGORY	ARRAY_1	-	-	-	-	-	-	-	-

Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING	TARGET ADDITIONAL EMBEDDING	OBJECT LABEL TARGET	OBJECT TEXT TARGET	USER INPUT	RETRIEVED CONTEXT	SEASONALITY	TREND	REGRESSOR
Classification Multilabel	Tabular	STRING	FLOAT	FLOAT, CATEGORY	FLOAT, CATEGORY, STRING	ARRAY_1	-	-	-	-	-	-	-	-	-
Classification Multilabel	Embedding	STRING	FLOAT	ARRAY_1	FLOAT, CATEGORY, STRING	ARRAY_1	-	-	-	-	-	-	-	-	-
Classification Multilabel	Image	STRING	FLOAT	ARRAY_3	FLOAT, CATEGORY, STRING	ARRAY_1	ARRAY_1	-	-	-	-	-	-	-	-
Classification Multilabel	Text	STRING	FLOAT	STRING	FLOAT, CATEGORY, STRING	ARRAY_1	ARRAY_1	-	-	-	-	-	-	-	-

Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING	TARGET ADDITIONAL EMBEDDING	OBJECT LABEL TARGET	OBJECT TEXT TARGET	USER INPUT	RETRIEVED CONTEXT	SEASONALITY	TREND	REGRESSOR
RAG	Text	STRING	FLOAT	STRING	FLOAT, CATEGORY, STRING	-	ARRAY_1	-	-	-	STRING	STRING	-	-	-

Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING	TARGET ADDITIONAL EMBEDDING	OBJECT LABEL TARGET	OBJECT TEXT TARGET	USER INPUT	RETRIEVED CONTEXT	SEASONALITY	TREND	REGRESSOR
Object Detection	Image	STRING	FLOAT	ARRAY_3	FLOAT, CATEGORY, STRING	ARRAY_2	ARRAY_1	-	ARRAY_1	-	-	-	-	-	-

Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING	TARGET ADDITIONAL EMBEDDING	OBJECT LABEL TARGET	OBJECT TEXT TARGET	USER INPUT	RETRIEVED CONTEXT	SEASONALITY	TREND	REGRESSOR
Semantic Segmentation	Image	STRING	FLOAT	ARRAY_3	FLOAT, CATEGORY, STRING	ARRAY_3	ARRAY_1	-	ARRAY_1	-	-	-	-	-	-

Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING	TARGET ADDITIONAL EMBEDDING	OBJECT LABEL TARGET	OBJECT TEXT TARGET	USER INPUT	RETRIEVED CONTEXT	SEASONALITY	TREND	REGRESSOR
Timeseries	Tabular	STRING	FLOAT	FLOAT, CATEGORY	FLOAT, CATEGORY, STRING	FLOAT	-	-	-	-	-	-	FLOAT	FLOAT	FLOAT

Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING	TARGET ADDITIONAL EMBEDDING	OBJECT LABEL TARGET	OBJECT TEXT TARGET	USER INPUT	RETRIEVED CONTEXT	SEASONALITY	TREND	REGRESSOR
Clustering	Tabular	STRING	FLOAT	FLOAT, CATEGORY	FLOAT, CATEGORY, STRING	STRING	-	-	-	-	-	-	-	-	-
Clustering	Embedding	STRING	FLOAT	ARRAY_1	FLOAT, CATEGORY, STRING	STRING	-	-	-	-	-	-	-	-	-
Clustering	Image	STRING	FLOAT	ARRAY_3	FLOAT, CATEGORY, STRING	STRING	ARRAY_1	-	-	-	-	-	-	-	-
Clustering	Text	STRING	FLOAT	STRING	FLOAT, CATEGORY, STRING	STRING	ARRAY_1	-	-	-	-	-	-	-	-

Task Type	Data Structure	ID	TIME ID	INPUT	METADATA	TARGET	INPUT ADDITIONAL EMBEDDING	TARGET ADDITIONAL EMBEDDING	OBJECT LABEL TARGET	OBJECT TEXT TARGET	USER INPUT	RETRIEVED CONTEXT	SEASONALITY	TREND	REGRESSOR
OCR plain_text	Image	STRING	FLOAT	ARRAY_3	FLOAT, CATEGORY, STRING	STRING	ARRAY_1	-	-	-	-	-	-	-	-
OCR with_labels	Image	STRING	FLOAT	ARRAY_3	FLOAT, CATEGORY, STRING	ARRAY_3	ARRAY_1	-	ARRAY_1	ARRAY_1	-	-	-	-	-
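
Two patterns in the tables above are that the INPUT Data Type follows the Data Structure and the TARGET Data Type follows the Task Type. The sketch below collects both into lookup tables and checks a pair of types against them. The identifiers and the `check_types` function are hypothetical; the values are copied from the INPUT and TARGET columns of the tables, and `None` marks the RAG row, which has no TARGET.

```python
from typing import Dict, List, Optional, Set

# INPUT Data Types per Data Structure, as listed in the tables.
INPUT_TYPES: Dict[str, Set[str]] = {
    "tabular": {"FLOAT", "CATEGORY"},
    "embedding": {"ARRAY_1"},
    "image": {"ARRAY_3"},
    "text": {"STRING"},
}

# TARGET Data Type per Task Type, as listed in the tables.
TARGET_TYPE: Dict[str, Optional[str]] = {
    "regression": "FLOAT",
    "classification_binary": "CATEGORY",
    "classification_multiclass": "CATEGORY",
    "classification_multilabel": "ARRAY_1",
    "rag": None,
    "object_detection": "ARRAY_2",
    "semantic_segmentation": "ARRAY_3",
    "timeseries": "FLOAT",
    "clustering": "STRING",
    "ocr_plain_text": "STRING",
    "ocr_with_labels": "ARRAY_3",
}


def check_types(
    task_type: str, data_structure: str, input_type: str, target_type: Optional[str]
) -> List[str]:
    """Return one message per Data Type that does not match the tables."""
    errors = []
    if input_type not in INPUT_TYPES[data_structure]:
        errors.append(f"input: {input_type} not allowed for {data_structure}")
    expected = TARGET_TYPE[task_type]
    if target_type != expected:
        errors.append(f"target: expected {expected}, found {target_type}")
    return errors


print(check_types("regression", "tabular", "FLOAT", "FLOAT"))  # []
print(check_types("classification_multilabel", "text", "STRING", "CATEGORY"))
# ['target: expected ARRAY_1, found CATEGORY']
```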

Data schema templates

Task	Variants
Classification (binary)	embedding - image - tabular - text
Classification (multiclass)	embedding - image - tabular - text
Classification (multilabel)	embedding - image - tabular - text
Clustering	embedding - image - tabular - text
Regression	embedding - image - tabular - text
OCR	plain text - with labels
RAG	single turn - multi turn
Other	object detection - semantic segmentation - timeseries