Schema

A schema is the logical representation of a catalog that specifies the types of entities that can be stored and the relationships between them. It allows you to maintain the consistency of your data and is very useful for automatic generation of the web APIs on top of it.

evitaDB internally maintains a schema for each entity collection / catalog, although it supports a relaxed approach, where the schema is automatically built according to data inserted into the database.

The schema is not only crucial for maintaining data consistency, but is also a key source for web API schema generation. It allows us to create OpenAPI and GraphQL schemas. If you pay close attention to the schema definition, you'll be rewarded with nice, understandable, and self-documented APIs. Every single piece of information in the schema affects the way the web APIs look. For example, relation cardinality (zero or one, exactly one, zero or more, one or more) affects whether the API marks the relation as optional, returns a single value/object, or returns an array of them. Filterable attributes are propagated to the documented query language blocks, while non-filterable attributes are not. The data type of an attribute determines which query constraints can be used with that attribute, and so on. The documentation you write in the evitaDB schema is propagated to all your APIs. You can read more about this projection in the dedicated Web API chapters of the documentation.

Mutations and versioning

The schema can only be changed by what are called mutations. While this is a rather cumbersome approach, it has some big advantages for the system:

mutation represents an isolated change to the schema - this means that the client making the schema change only sends deltas to the server, which saves a lot of network traffic and also implies server-side logic that doesn't need to resolve deltas internally
mutation is directly used as a WAL entry - the mutation represents an atomic operation in the transactional log that is distributed across the cluster, and it also represents a place where conflict resolution takes place (if the server receives similar mutations from two parallel sessions, it easily decides whether to throw a concurrent change exception - if the mutations are equal, there is no conflict; if they are different, the first mutation is accepted and the second is rejected with an exception)

The schema is versioned - each time a schema mutation is performed, its version number is incremented by one. If you have two schema instances on the client side, you can easily tell if they're the same by comparing their version number, and if not, which one is newer.
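The versioning and conflict rules above can be illustrated with a minimal sketch. Every name in it (`Schema`, `SetDescription`, `resolve`, `ConcurrentChange`) is hypothetical and only models the described behaviour; none of it is the evitaDB API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SetDescription:
    """Hypothetical mutation: an isolated schema change, comparable by value."""
    description: str


@dataclass(frozen=True)
class Schema:
    """Hypothetical immutable schema snapshot."""
    name: str
    description: str = ""
    version: int = 1

    def apply(self, mutation: SetDescription) -> "Schema":
        # Each applied mutation yields a new snapshot with the version raised by one.
        return replace(self, description=mutation.description, version=self.version + 1)


class ConcurrentChange(Exception):
    """Raised when two parallel sessions send different mutations."""


def resolve(first: SetDescription, second: SetDescription) -> SetDescription:
    # Equal mutations do not conflict; otherwise the first one is accepted
    # and the second one is rejected with an exception.
    if first == second:
        return first
    raise ConcurrentChange(f"rejected {second!r}, {first!r} was accepted first")
```

Comparing the `version` fields of two snapshots is then enough to tell whether they are the same and, if not, which one is newer.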

Do I really have to write all the mutations by hand?

Hopefully not. We're aware that writing mutations is cumbersome, so our client drivers provide better support. The drivers wrap the immutable schemas in builder objects, so you can simply call alter methods on them and the builder generates the list of mutations at the end.
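The builder idea can be sketched roughly as follows. This is not the API of any evitaDB driver: `SchemaBuilder`, `Mutation`, and the mutation kinds are invented for illustration, and the schema is modelled as a plain dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    kind: str       # hypothetical mutation name, not an evitaDB class
    payload: tuple


class SchemaBuilder:
    """Hypothetical builder wrapping an immutable schema snapshot."""

    def __init__(self, schema: dict):
        self._base = dict(schema)
        self._mutations = []

    def with_description(self, text: str) -> "SchemaBuilder":
        # Record a delta only when the value really changes.
        if self._base.get("description") != text:
            self._mutations.append(Mutation("set_description", (text,)))
        return self

    def with_attribute(self, name: str, type_name: str) -> "SchemaBuilder":
        if name not in self._base.get("attributes", {}):
            self._mutations.append(Mutation("create_attribute", (name, type_name)))
        return self

    def to_mutations(self) -> list:
        # The collected deltas are what would be sent to the server.
        return list(self._mutations)
```

Calling the alter methods with values that already match the wrapped schema produces no mutations, so only real deltas travel over the network.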

However, if you want to use evitaDB on a platform that is not yet supported and covered by a specific client driver, you have to work directly with our web APIs that only accept mutations, and you have no other options than to write the mutations directly or to write your own client driver. But you can open source it and help the community. Let us know about it!

All schema mutations implement a common interface.

Structure

There are the following types of schemas:

catalog schema
entity schema
attribute schema
sortable attribute compound schema
associated data schema
reference schema

Catalog

The catalog schema contains a list of entity schemas and the name and description of the catalog. It also keeps a dictionary of global attribute schemas that can be shared among multiple entity schemas.

Name requirements and name variants

Each named data object (catalog, entity, attribute, associated data, and reference) must be uniquely identifiable by its name within its parent scope.

The name validation logic and reserved words are present in the class

There is also a special property called nameVariants in the schema of each named object. It contains variants of the object name in different "developer" notations such as camelCase, PascalCase, snake_case and so on.
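To make the idea of name variants concrete, here is a small sketch that derives several notations from one name. The splitting rules and the exact set of notations (including UPPER_SNAKE_CASE and kebab-case) are assumptions made for this example and may differ from what evitaDB actually produces.

```python
import re


def split_words(name: str) -> list:
    # Split on lower-to-upper case boundaries and on non-alphanumeric characters.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return [word.lower() for word in re.split(r"[^A-Za-z0-9]+", spaced) if word]


def name_variants(name: str) -> dict:
    words = split_words(name)
    if not words:
        raise ValueError("name must contain at least one alphanumeric character")
    return {
        "camelCase": words[0] + "".join(w.capitalize() for w in words[1:]),
        "PascalCase": "".join(w.capitalize() for w in words),
        "snake_case": "_".join(words),
        "UPPER_SNAKE_CASE": "_".join(words).upper(),
        "kebab-case": "-".join(words),
    }
```

Precomputing the variants once means that each web API can pick the notation that is idiomatic for it without renaming anything in the schema.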

Top-level mutations:

Within ModifyCatalogSchemaMutation you can use mutations:

And entity top-level mutations.

Global attribute schema

Global attribute schema has the same structure as attribute schema except for one additional characteristic. A global attribute can be made uniqueGlobally, which means that values of such an attribute must be unique across all entities and entity types in the entire catalog.

What is global uniqueness good for?

It is useful for entity URLs, which we naturally want to be unique among all entities in the catalog. The globally unique attribute allows us to ask evitaDB for an entity with a specific value without knowing its type in advance. This solves the use case when a new request arrives in your application and you need to check if there is an entity that matches it (no matter if it's a product, category, brand, group or whatever types you have in your project).
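The lookup can be pictured as one catalog-wide map from attribute value to the owning entity. The sketch below is an illustration only; `GloballyUniqueIndex` and its methods are hypothetical names, not evitaDB internals.

```python
class GloballyUniqueIndex:
    """Hypothetical catalog-wide index: value -> (entity type, primary key)."""

    def __init__(self):
        self._by_value = {}

    def register(self, value: str, entity_type: str, primary_key: int) -> None:
        owner = self._by_value.get(value)
        # The value must be unique across all entity types in the catalog.
        if owner is not None and owner != (entity_type, primary_key):
            raise ValueError(f"{value!r} is already used by {owner}")
        self._by_value[value] = (entity_type, primary_key)

    def find(self, value: str):
        # The caller does not need to know the entity type in advance.
        return self._by_value.get(value)
```

An incoming request URL can be resolved with a single `find` call, which returns the entity type together with the primary key, or `None` when nothing matches.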

A global attribute can also be used as a "dictionary definition" for an attribute that is used in multiple entity collections, and we want to make sure it's named and described the same in all of them. An entity collection cannot define an attribute with the same name as the global attribute. It can only "use" the global attribute with that name and thus share its complete definition.

And of course all standard attribute mutations.

Entity

The entity schema contains the name and description of the entity type, along with the following settings:

enabling primary key generation
evolution limits
allowed locales and currencies
enabling hierarchical structure
enabling price information
attributes
sortable attribute compound
associated data
references

Entity schema can be made deprecated, which will be propagated to generated web API documentation.

Top-level entity mutations:

Within ModifyEntitySchemaMutation you can use mutations:

Primary key generation

If primary key generation is enabled, evitaDB assigns a unique int number to a newly inserted entity. The primary key always starts with 1 and is incremented by 1. evitaDB guarantees its uniqueness within the same entity type. The primary keys generated in this way are optimal for binary operations in the data structures used.
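A minimal model of this behaviour is one independent counter per entity type. `PrimaryKeySequences` is a hypothetical name used only for this sketch.

```python
import itertools
from collections import defaultdict


class PrimaryKeySequences:
    """Hypothetical generator of primary keys, one sequence per entity type."""

    def __init__(self):
        # Each entity type gets its own counter that starts at 1.
        self._counters = defaultdict(lambda: itertools.count(1))

    def next_key(self, entity_type: str) -> int:
        return next(self._counters[entity_type])
```

Because the counters are independent, the same number can appear in two different entity types while staying unique within each of them.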

Within ModifyEntitySchemaMutation you can use mutation:

Evolution

We recommend the schema-first approach, but there are cases where you don't want to bother with the schema and just want to insert and query the data (e.g., rapid prototyping). When a new catalog is created, it is set up in "auto evolution" mode, where the schema adapts to the data on first insertion. If you want to control the schema strictly, you have to limit the evolution by changing the default schema. In strict mode, evitaDB throws an exception if the input data violates the schema.

You still need to create entity collections manually, but after that you can immediately insert your data and the schema will be built accordingly. The existing schemas will still be validated on each entity insertion/update - you will not be allowed to store the same attribute as a number type the first time and as a string the next time. The first use will set up the schema, which must be respected from that moment on.

If the first entity has its primary key set, evitaDB expects all entities to have their primary key set when inserting. If the first entity has its primary key set to NULL, evitaDB will generate primary keys for you and will reject external primary keys. New attribute schemas are implicitly created as nullable and filterable, and those with non-array data types also as sortable. This means that the client is immediately able to filter/sort on almost anything, but the database itself will consume a lot of resources. The references will be created as indexed but not faceted.
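The rules of automatic evolution described above can be sketched as follows. `EvolvingEntityCollection` is a hypothetical model: it tracks only attribute types and the primary key mode, and it uses Python types in place of evitaDB data types.

```python
class EvolvingEntityCollection:
    """Hypothetical collection whose schema is inferred on first use."""

    def __init__(self):
        self.attribute_types = {}
        self.generates_keys = None  # decided by the first inserted entity
        self._next_key = 1

    def upsert(self, attributes: dict, primary_key=None) -> int:
        if self.generates_keys is None:
            self.generates_keys = primary_key is None
        if self.generates_keys and primary_key is not None:
            raise ValueError("external primary keys are rejected, keys are generated")
        if not self.generates_keys and primary_key is None:
            raise ValueError("the primary key must be provided by the client")
        # Validate against the schema built so far before changing anything.
        for name, value in attributes.items():
            expected = self.attribute_types.get(name, type(value))
            if type(value) is not expected:
                raise TypeError(
                    f"attribute {name!r} is {expected.__name__}, got {type(value).__name__}"
                )
        # The first use of an attribute sets up its schema.
        for name, value in attributes.items():
            self.attribute_types.setdefault(name, type(value))
        if self.generates_keys:
            primary_key = self._next_key
            self._next_key += 1
        return primary_key
```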

There are several partial lax modes between the strict and the fully automatic evolution mode. For example, you can strictly control the entire schema, except for new locale or currency definitions, which are allowed to be added automatically on first use.
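A partial lax mode can be thought of as a set of flags, each of which permits one kind of automatic schema change. The flag names below are hypothetical and chosen only for this sketch; the real list of modes is defined by evitaDB.

```python
from enum import Flag, auto


class Evolution(Flag):
    # Hypothetical flags, not the evitaDB enumeration.
    NONE = 0
    ADD_ATTRIBUTES = auto()
    ADD_LOCALES = auto()
    ADD_CURRENCIES = auto()


class SchemaViolation(Exception):
    pass


def accept_locale(allowed: set, locale: str, evolution: Evolution) -> set:
    """Returns the set of allowed locales after an entity used `locale`."""
    if locale in allowed:
        return allowed
    if Evolution.ADD_LOCALES in evolution:
        # Lax for locales: the new locale is added to the schema on first use.
        return allowed | {locale}
    raise SchemaViolation(f"locale {locale!r} is not allowed by the schema")
```

With `Evolution.ADD_LOCALES | Evolution.ADD_CURRENCIES` the schema stays strict about attributes while new locales and currencies are still accepted.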

Within ModifyEntitySchemaMutation you can use mutations:

Locales and currencies

The schema specifies a list of allowed currencies and locales. We assume that the list of allowed currencies/locales will be relatively small (units, at most lower tens of them) and if the system knows them in advance, it can generate enums for each of them in web APIs. This helps developers write queries with auto-completion. There is another positive effect. E-commerce systems don't often extend the list of used currencies or locales (because there are usually a lot of manual operations involved), and having the allowed set guarded by the system eliminates the possibility of inserting invalid prices or localizations by mistake.

Why are price lists not listed in the schema if currencies are?

The price lists are closer to "data" than locales or currencies. The set of price lists is expected to change very often, and their numbers can reach high cardinality (thousands, tens of thousands). It wouldn't be practical to generate enumeration values for them and change the Web API schemas every time a price list is added or removed.

Within ModifyEntitySchemaMutation you can use mutations:

Hierarchy placement

When hierarchy placement is enabled, entities of this type can form a tree structure. Each entity can have a maximum of one parent node and zero or more child entities. Neither the depth of the tree nor the number of siblings at each level is limited.

Enabling hierarchy placement implies the creation of a new index for the involved entity type. When another entity references a hierarchy entity and the reference is marked as indexed, a special index is created for each hierarchical entity. This index holds reduced attribute and price indices of the referencing entity, allowing quick evaluation of withinHierarchy filter conditions.

Orphan hierarchy nodes

The typical problem associated with creating a tree structure is the order in which nodes are attached to it. In order to have a consistent tree, one should start from the root nodes and gradually descend along the axis of their children. This isn't always easy to do when we need to copy an existing tree to an external system (for scripting purposes, it's much easier and more performance-effective to index in batches using the natural order of records). A similar situation occurs when an intermediate tree node needs to be removed, but its children do not. We can force developers to rewire children to different parents before removing their parent, but they often don't have direct control over the order of operations and can't easily do that.

That's why evitaDB recognizes so-called orphan hierarchy nodes. An orphan node is a node that declares itself to be a child of a parent node with a certain primary key that evitaDB doesn't know yet (or whose parent is itself an orphan node). Orphan nodes do not participate in the evaluation of queries on hierarchical structures, but are present in the index. Once a node with the referenced primary key is appended to the main hierarchy tree, the orphan nodes (sub-trees) are appended as well. In this way, the hierarchy tree eventually becomes consistent.
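The orphan mechanism can be modelled with a simple parent map. In this sketch a node counts as attached only if walking up its parents ends in a root. `HierarchyIndex` here is a hypothetical toy structure and it decides orphan status lazily at query time, which is a simplification and not necessarily how evitaDB implements it.

```python
class HierarchyIndex:
    """Hypothetical tree index that tolerates nodes arriving before their parents."""

    def __init__(self):
        self.parent_of = {}  # primary key -> parent primary key (None for roots)

    def add(self, primary_key: int, parent=None) -> None:
        self.parent_of[primary_key] = parent

    def is_orphan(self, primary_key: int) -> bool:
        seen = set()
        current = primary_key
        while current is not None:
            # An unknown ancestor (or a cycle) means the node is not attached.
            if current not in self.parent_of or current in seen:
                return True
            seen.add(current)
            current = self.parent_of[current]
        return False

    def visible_nodes(self) -> list:
        # Only attached nodes take part in hierarchy queries.
        return sorted(pk for pk in self.parent_of if not self.is_orphan(pk))
```

Nodes can therefore be indexed in any order: a sub-tree inserted before its parent stays invisible to queries and becomes visible as soon as the missing parent is added.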

Within ModifyEntitySchemaMutation you can use mutation:

Prices

When prices are enabled, entities of this type can have a set of prices associated with them and can be filtered and sorted by price constraints. A single entity can have zero or more prices (the system is designed for situations where an entity has tens or hundreds of prices attached to it). For each combination of priceList and currency there is a special index.

Within ModifyEntitySchemaMutation you can use mutation:

Attributes

An entity type can have zero or more attributes. The system is designed for situations where an entity has tens of attributes. You should pay attention to the number of filterable / sortable / unique attributes. There is a separate index instance for each filterable attribute, for each sortable attribute, and for each unique attribute. Attributes that are not filterable, sortable, or unique don't consume operating memory.

Attribute schema can be marked as localized, meaning that it only makes sense in a specific locale.

Attribute schema can be made deprecated, which will be propagated to generated web API documentation.

Within ModifyEntitySchemaMutation you can use mutation:

Default value

An attribute may have a default value defined. The value is used when a new entity is created and no value has been assigned to a particular attribute. There is no other situation where the default value matters.

Allowed decimal places

The allowed decimal places setting is an optimization that allows rich numeric types (such as BigDecimal for precise number representation) to be converted to the primitive int type, which is much more compact and can be used for fast binary searches in array/bitset representation. The original rich format is still present in an attribute container, but internally the database uses the primitive form when an attribute is part of filter or sort conditions.

If a number cannot be converted to a compact form (for example, it has more digits in the fractional part than expected), an exception is thrown and the entity update is refused.
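The conversion can be sketched as shifting the decimal point by the allowed number of places and checking that nothing remains in the fractional part. The function name is hypothetical, Python's `Decimal` stands in for BigDecimal, and the sketch does not check the 32-bit range of the target int type.

```python
from decimal import Decimal


def to_compact_int(value: Decimal, allowed_decimal_places: int) -> int:
    """Converts a precise decimal number to its compact integer form."""
    # Shift the decimal point to the right by the allowed number of places.
    scaled = value.scaleb(allowed_decimal_places)
    if scaled != scaled.to_integral_value():
        # More fractional digits than allowed: the update would be refused.
        raise ValueError(
            f"{value} has more than {allowed_decimal_places} decimal places"
        )
    return int(scaled)
```

With two allowed decimal places, `Decimal("12.35")` maps to the integer 1235, while `Decimal("12.345")` raises an error because the third fractional digit would be lost.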

Sortable attribute compounds

Sortable attribute compound is a virtual attribute composed of the values of several other attributes, which can only be used for sorting. evitaDB requires a previously prepared sort index to be able to sort entities. This fact makes sorting much faster than ad-hoc sorting by attribute value. Also, the sorting mechanism of evitaDB is somewhat different from what you might be used to. If you sort entities by two attributes in an orderBy clause of the query, evitaDB sorts them first by the first attribute (if present) and then by the second (but only those where the first attribute is missing). If two entities have the same value of the first attribute, they are not sorted by the second attribute, but by the primary key (in ascending order). If we want to use fast "pre-sorted" indexes, there is no other way to do it, because the secondary order would not be known until a query time.

This default sorting behavior by multiple attributes is not always desirable, so evitaDB allows you to define a sortable attribute compound, which is a virtual attribute composed of the values of several other attributes. evitaDB also allows you to specify the order of the "pre-sorting" behavior (ascending/descending) for each of these attributes, and also the behavior for NULL values (first/last) if the attribute is completely missing in the entity. The sortable attribute compound is then used in the orderBy clause of the query instead of specifying the multiple individual attributes to achieve the expected sorting behavior while maintaining the speed of the "pre-sorted" indexes.

A sortable attribute compound is only created if at least one of its attributes is present in the entity. This fact is crucial for the standard sorting mechanism of evitaDB, where such entities are passed to the next sorter defined in the query (or sorted by the primary key in ascending order if no other sorter is defined).
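The difference between the two sorting strategies is easier to see on data. The sketch below models entities as dictionaries with a `pk` key; `chained_sort` mimics ordering by several individual attributes as described above, and `compound_sort` mimics a compound that is fixed to ascending order with missing values last. Both functions are hypothetical illustrations that sort on the fly, whereas evitaDB relies on pre-sorted indexes.

```python
def chained_sort(entities: list, attributes: list) -> list:
    """Each sorter orders only entities having its attribute; the rest fall through."""
    remaining = list(entities)
    result = []
    for attr in attributes:
        present = [e for e in remaining if e.get(attr) is not None]
        # Ties on the attribute value are broken by primary key, not by the next attribute.
        result += sorted(present, key=lambda e: (e[attr], e["pk"]))
        remaining = [e for e in remaining if e.get(attr) is None]
    return result + sorted(remaining, key=lambda e: e["pk"])


def compound_sort(entities: list, attributes: list) -> list:
    """Orders by all attributes at once (ascending, missing values last)."""

    def key(entity):
        parts = []
        for attr in attributes:
            value = entity.get(attr)
            # The leading flag keeps missing values last and avoids comparing None.
            parts.append((value is None, 0 if value is None else value))
        return tuple(parts) + (entity["pk"],)

    # The compound exists only if at least one of its attributes is present.
    covered = [e for e in entities if any(e.get(a) is not None for a in attributes)]
    rest = [e for e in entities if all(e.get(a) is None for a in attributes)]
    return sorted(covered, key=key) + sorted(rest, key=lambda e: e["pk"])
```

For two products with the same name, `chained_sort` orders them by primary key, while `compound_sort` orders them by the second attribute.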

Sortable attribute compound schema can be made deprecated, which will be propagated to generated web API documentation.

Within ModifyEntitySchemaMutation you can use mutation:

The sortable attribute compound schema is described by:

Associated data

An entity type may have zero or more associated data. The system is designed for situations where an entity has tens of associated data items.

Associated data schema can be marked as localized, meaning that it only makes sense in a specific locale.

Associated data schema can be made deprecated, which will be propagated to generated web API documentation.

Within ModifyEntitySchemaMutation you can use mutation:

Reference

An entity type may have zero or more references. References can be managed or unmanaged. Managed references refer to entities within the same catalog and can be checked for consistency by evitaDB. Unmanaged references refer to entities that are managed by external systems outside the scope of evitaDB. An entity can have a self-reference that refers to the same entity type. An entity type can have several references to the same entity type.

References can have zero or more attributes that apply only to a particular "link" between these two entity instances. A global attribute cannot be used as a reference attribute. Otherwise, the same rules apply for reference attributes as for regular entity attributes.

Within ModifyEntitySchemaMutation you can use mutation:

The ModifyReferenceAttributeSchemaMutation expects nested attribute mutations.

Reference directionality

References are unidirectional in nature, which means that if the reference points from entity A to entity B, it does not mean that entity B automatically references entity A. It is possible to set up a bi-directional reference by creating a so-called "reflected reference" on the other entity type and identifying the original reference that should be reflected. The reflected reference may or may not inherit attributes from the original reference, and it may also define its own separate attributes. This can be described by the following ERD diagram:

erDiagram
    A ||--o{ A_to_B : references
    B ||--o{ A_to_B : references
    A_to_B {
        string A1
        string A2
    }
    B ||--o{ B_to_A : references
    A ||--o{ B_to_A : references
    B_to_A {
        string A1
        string B2
    }

Reflected references are automatically created, updated, and removed when the original reference is manipulated. It also works the other way around - when the reflected reference is manipulated, the original reference is updated.

There is a subtle difference between the original reference and the reflected reference. The original reference can exist even if the referenced entity does not (yet) exist (the reference is orphaned). On the other hand, when you create a reflected reference, the referenced entity must exist. This is because the reflected reference immediately creates the original reference, and the original reference must have a valid target. This behavior is needed to maintain consistency when moving entities between different scopes that treat original and reflected references differently.

If the reference contains an attribute that is not defined on the other side, and the reference is created - the missing attribute on the other side is created with its default value (if no such default value is defined, an exception is thrown).
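One way to read the attribute rule above is sketched below, using the attribute names from the diagram (A1 shared by both sides, A2 only on the original reference, B2 only on the reflected one). The function and its parameters are hypothetical, and the sketch treats a default of `None` as "no default defined".

```python
def mirror_attributes(given: dict, inherited: list, own_defaults: dict) -> dict:
    """Builds the attributes of the automatically created counterpart reference.

    given        - attributes supplied on the side the client wrote
    inherited    - attribute names shared by both sides
    own_defaults - attributes defined only on the other side -> default value
    """
    mirrored = {name: given[name] for name in inherited if name in given}
    for name, default in own_defaults.items():
        if default is None:
            # No default is available for an attribute the client could not supply.
            raise ValueError(f"attribute {name!r} has no default value")
        mirrored[name] = default
    return mirrored
```

Creating the original reference with A1 and A2 would then produce a reflected reference carrying the inherited A1 and the default value of B2.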

Reference indexing

You need to select the indexing level for each of the references defined in the entity schema. There are three indexing levels available:

NONE: Reference has no index available. This means that the reference cannot be used in any query filtering or sorting. Use this type when you do not need to filter nor sort by reference existence or any of the reference attributes, and you want to minimize memory and disk usage.
FOR_FILTERING: Reference has only a basic index available that is necessary for referencedEntityHaving filter conditions and referenceProperty sorting constraint interpretation. This is the minimal indexing level that allows filtering by reference existence and reference attributes. Use this type when you need basic reference filtering capabilities but want to minimize memory and disk usage. This is suitable for references that are not frequently used in complex queries or when storage optimization is more important than query performance. This is the recommended default indexing type for references and is sufficient for most use cases.
FOR_FILTERING_AND_PARTITIONING: Reference has basic index available that is necessary for referencedEntityHaving filter conditions and referenceProperty sorting constraint interpretation, and also partitioning indexes for the main entity type (i.e. entity type that contains the reference schema), which may greatly speed up the query execution when the reference is part of the query filtering. This advanced indexing creates additional data structures that allow for more efficient query execution by partitioning the data based on the reference relationships. This can significantly improve performance for complex queries that involve reference filtering, especially when dealing with large datasets. Use this type when reference filtering is frequently used in queries and query performance is critical. Be aware that this option requires more memory and disk space compared to FOR_FILTERING level.

A partitioning index is created for each reference used in any entity in the schema and contains a subset of the attribute, price and other indexes, reduced to entities with the given reference. Let's describe it with an example. Say we have entity type Product that has reference categories to entity type Category, which is indexed FOR_FILTERING_AND_PARTITIONING. Let's imagine that we need to find all products classified in a specific category that also meet ten other conditions (they are published, currently valid, have an available price in the user's price list and in EUR, etc.). We can evaluate such a query over one large index, where this information is available for all known products in the database, or (if we use partitioning) we can use a much smaller index, in which we can find all the necessary information only for products that have a valid link to the category for which we are evaluating this query. The response to the query will be significantly faster because the amount of data searched is significantly smaller. The downside of this approach is that it requires a relatively large amount of memory.
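The trade-off can be illustrated with a toy model that keeps one global index and one reduced index per referenced category. `ProductIndexes` is a hypothetical structure; real indexes are far more elaborate, but the effect on the amount of scanned data is the same in principle.

```python
from collections import defaultdict


class ProductIndexes:
    """Hypothetical global index plus per-category partitioning indexes."""

    def __init__(self):
        self.global_index = {}                # pk -> attributes of every product
        self.by_category = defaultdict(dict)  # category pk -> reduced index

    def add(self, pk: int, attributes: dict, categories: list) -> None:
        self.global_index[pk] = attributes
        # The same data is stored again in each partition: faster queries, more memory.
        for category in categories:
            self.by_category[category][pk] = attributes

    def find(self, predicate, category=None):
        """Returns matching primary keys and the number of records scanned."""
        if category is None:
            source = self.global_index
        else:
            source = self.by_category.get(category, {})
        matches = sorted(pk for pk, attrs in source.items() if predicate(attrs))
        return matches, len(source)
```

A query restricted to one category scans only the records of that partition instead of every product in the collection.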

If the reference is marked as faceted, a special facet reference index is created for the entity type. This index contains optimized data structures for facet summary computation. All reference instances of a given type are then inserted into the facet reference index (there is no way to exclude a reference from indexing in the facet reference index). References can (but don't have to) be organized into facet groups that refer to a managed or non-managed entity type.

Reference cardinality

Each reference schema has a certain cardinality. The cardinality describes the expected number of relations of this type. In evitaDB we define only one-way relations from the perspective of the entity. We follow the ERD modeling standards. Cardinality affects the design of the Web API schemas (returning only single references or arrays) and also helps us to protect the consistency of the data so that it conforms to the creator's mental model.
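The effect of cardinality on the API shape and on validation can be sketched with an enumeration of (minimum, maximum) pairs. The four constant names follow the cardinalities mentioned in this chapter, but their exact spelling, as well as the helper properties, are assumptions of this example.

```python
from enum import Enum


class Cardinality(Enum):
    # (minimum, maximum) number of references; None means unbounded.
    ZERO_OR_ONE = (0, 1)
    EXACTLY_ONE = (1, 1)
    ZERO_OR_MORE = (0, None)
    ONE_OR_MORE = (1, None)

    @property
    def optional(self) -> bool:
        # Optional relations can be marked as nullable in the generated API.
        return self.value[0] == 0

    @property
    def returns_array(self) -> bool:
        # Unbounded relations are returned as arrays, the others as single objects.
        return self.value[1] is None

    def accepts(self, count: int) -> bool:
        low, high = self.value
        return count >= low and (high is None or count <= high)
```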

When you allow definition of duplicate references using one of the cardinality types: ZERO_OR_MORE_WITH_DUPLICATES or ONE_OR_MORE_WITH_DUPLICATES, you'll be able to define two references to the same target entity within a single entity instance. In such a case you need to select at least one reference attribute that would make the two references distinguishable and set it as representative. The representative attribute would then be used to identify the specific reference when querying or manipulating the entity. If no representative attribute is defined, an exception is thrown when you try to create duplicate references.

There are situations when duplicate references come in handy. Imagine you have an entity type Product that has a reference medias to an entity of type Media. You want to be able to link multiple media items to a single product, and you also want to be able to distinguish between them based on their role (e.g., "thumbnail", "gallery", "video", etc.). In such a case you can define reference attribute role as representative, and then you'll be able to create multiple references to the same Media entity with different role values.
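The role of the representative attribute can be sketched as part of the key under which a reference is stored. `ReferenceSet` is a hypothetical container; in this simplified model, writing a reference with an already used key overwrites the stored one.

```python
class ReferenceSet:
    """Hypothetical store of references of one type within a single entity."""

    def __init__(self, allow_duplicates: bool = False, representative=None):
        self.allow_duplicates = allow_duplicates
        self.representative = representative  # name of the distinguishing attribute
        self._refs = {}

    def put(self, target_pk: int, **attributes) -> None:
        if not self.allow_duplicates:
            # Without duplicates the target primary key alone identifies the reference.
            self._refs[(target_pk, None)] = attributes
            return
        if self.representative is None:
            if any(pk == target_pk for pk, _ in self._refs):
                raise ValueError("duplicate references need a representative attribute")
            self._refs[(target_pk, None)] = attributes
            return
        # With duplicates the representative attribute value becomes part of the key.
        self._refs[(target_pk, attributes[self.representative])] = attributes

    def get(self, target_pk: int, discriminator=None) -> dict:
        return self._refs[(target_pk, discriminator)]

    def __len__(self) -> int:
        return len(self._refs)
```

For the medias example, the same Media entity can be referenced once as "thumbnail" and once as "gallery", and each reference is addressed by the target primary key together with its role.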

Scopes

Scopes are separate areas of memory where entity indexes are stored. They separate live data from archived data and are used to handle so-called "soft deletes": the application can choose between a hard delete and archiving the entity, which simply moves the entity to the archive scope. The reasons for this feature are explained in the dedicated blog post.

By default, archived entities have no indexes other than the primary key index. This is because archived entities are not normally queried and are only looked up by their primary key. By not maintaining the indexes of archived entities, we save memory and CPU resources. There may be cases where you want to query the archived entities and therefore you have full control over which indexes are maintained in the archive scope when you define the entity schema. Note that the more indexes you maintain, the more memory and CPU resources you will consume as the list of archived entities grows.

Changes in reference behavior

When you move an entity from one scope to another, the original references are retained, while the reflected references are removed if either of the following is true:

the reflected reference schema is not marked as indexed in the target scope
the primary reference schema (i.e., the original reference being reflected) is not marked as indexed in the target scope

Reflected references are something that is maintained by the evitaDB engine, and it requires appropriate indexes to be present in the target scope in order to work. By default, the archive scope does not maintain any indexes other than the primary key and a few others explicitly specified by you in the entity schema.

Therefore, the reflected references are usually removed when the entity is moved to the archive scope. The engine can recreate them if the entity is moved back to the live scope where appropriate indexes exist.
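The rule for keeping reflected references can be expressed as a single predicate. The function name and the scope labels "live" and "archive" are assumptions made for this sketch.

```python
def reflected_reference_survives(
    target_scope: str,
    reflected_indexed_in: set,
    original_indexed_in: set,
) -> bool:
    """Tells whether a reflected reference is kept after a move to `target_scope`."""
    # Both the reflected and the original reference schema must be indexed
    # in the target scope; otherwise the reflected reference is removed.
    return target_scope in reflected_indexed_in and target_scope in original_indexed_in
```

With the default setup, where references are indexed only in the live scope, the predicate is false for the archive scope, which matches the removal described above.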

What's next?

The next obvious step is to learn how to define the schema using the evitaDB API. But you might be interested in writing or querying the data instead.

Author: Ing. Jan Novotný

Date updated: 17.1.2023
