A schema is the logical representation of a catalog that specifies the types of entities that can be stored and
the relationships between them. It allows you to maintain the consistency of your data and is very useful
for automatic generation of the web APIs on top of it.
evitaDB internally maintains a schema for each entity collection / catalog,
although it supports a relaxed approach, where the schema is automatically built according to data
inserted into the database.
The schema is not only crucial for maintaining data consistency, but is also a key source for web API schema
generation. It allows us to create OpenAPI and GraphQL schemas. If you
pay close attention to the schema definition, you'll be rewarded with nice, understandable, and self-documented APIs.
Every single piece of information in the schema affects the way the web APIs look. For example, relation cardinality
(zero or one, exactly one, zero or more, one or more) affects whether the API marks the relation as optional, returns
a single value/object, or returns an array of them. Filterable attributes are propagated to the documented query
language blocks, while non-filterable attributes are not. The data types of the attributes affect which query
constraints can be used in relation to this very attribute, and so on. The documentation you write in the evitaDB schema
is propagated to all your APIs. You can read more about this projection in the dedicated Web API chapters of the
documentation.
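To make the cardinality example concrete, here is a minimal sketch of how a relation's cardinality could translate into the shape of a generated API field. The enum, its methods and the GraphQL-like notation are illustrative assumptions, not the actual evitaDB classes or the exact output of its API generators.

```java
// Hypothetical model: neither this enum nor its methods are evitaDB API.
enum CardinalitySketch {
    ZERO_OR_ONE, EXACTLY_ONE, ZERO_OR_MORE, ONE_OR_MORE;

    /** True when the generated API would return an array instead of a single object. */
    boolean isArray() {
        return this == ZERO_OR_MORE || this == ONE_OR_MORE;
    }

    /** True when the relation may be absent (assumption: minimum count of zero means optional). */
    boolean isOptional() {
        return this == ZERO_OR_ONE || this == ZERO_OR_MORE;
    }

    /** Renders a GraphQL-like type for a relation that points to the given target type. */
    String apiType(String target) {
        String base = isArray() ? "[" + target + "!]" : target;
        return isOptional() ? base : base + "!";
    }
}
```

Under these assumptions, a `ZERO_OR_ONE` relation to `Brand` is rendered as a nullable single object, while `ONE_OR_MORE` is rendered as a mandatory array.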
Mutations and versioning
The schema can only be changed by what are called mutations. While this is a rather cumbersome approach, it has some
big advantages for the system:
A mutation represents an isolated change to the schema - this means that the client making the schema change
only sends deltas to the server, which saves a lot of network traffic and also implies server-side logic that doesn't
need to resolve deltas internally.
A mutation is directly used as a WAL (write-ahead log) entry - the mutation
represents an atomic operation in the transactional log that is distributed across the cluster, and it also
represents a place where conflict resolution takes place (if the server receives similar mutations from two
parallel sessions, it easily decides whether to throw a concurrent change exception - if the mutations are equal,
there is no conflict; if they are different, the first mutation is accepted and the second is rejected with an
exception).
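The conflict rule described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the `SetDescription` mutation, the map-based log and the exception type are hypothetical and do not mirror the real evitaDB implementation.

```java
import java.util.HashMap;
import java.util.Map;

final class MutationConflictSketch {
    /** Hypothetical schema mutation: set the description of a named attribute. */
    record SetDescription(String attribute, String description) {}

    // Mutations accepted so far, keyed by the schema element they touch.
    private final Map<String, SetDescription> accepted = new HashMap<>();

    /** Accepts the mutation, or throws if a different mutation already changed the same target. */
    void apply(SetDescription mutation) {
        SetDescription previous = accepted.putIfAbsent(mutation.attribute(), mutation);
        if (previous != null && !previous.equals(mutation)) {
            // Different change to the same element: the first one wins, the second is rejected.
            throw new IllegalStateException("Concurrent change of attribute " + mutation.attribute());
        }
        // Equal mutations are not a conflict, so nothing else needs to happen.
    }
}
```

Because records implement value equality, two sessions sending the same change are recognized as non-conflicting without any extra code.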
The schema is versioned - each time a schema mutation is performed, its version number is incremented by one. If you
have two schema instances on the client side, you can easily tell if they're the same by comparing their version
number, and if not, which one is newer.
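A minimal sketch of the versioning rule, assuming a hypothetical immutable `Schema` snapshot that is not part of the evitaDB API: every applied mutation yields a new snapshot whose version is one higher, so two snapshots can be compared by version alone.

```java
import java.util.ArrayList;
import java.util.List;

final class SchemaVersionSketch {
    /** Hypothetical immutable schema snapshot: a version plus the attribute names it defines. */
    record Schema(int version, List<String> attributes) {

        /** Applies one "add attribute" mutation and returns a new snapshot with the version bumped by one. */
        Schema addAttribute(String name) {
            List<String> next = new ArrayList<>(attributes);
            next.add(name);
            return new Schema(version + 1, List.copyOf(next));
        }

        /** Two snapshots of the same schema are ordered by their version number only. */
        boolean isNewerThan(Schema other) {
            return version > other.version;
        }
    }
}
```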
You rarely need to write mutations by hand. We're aware that writing mutations is cumbersome, and provide better support in our drivers. The client
drivers wrap the immutable schemas inside the builder objects, so you can just call alter methods on them and
the builder will generate the list of mutations at the end. See the example.
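The builder idea can be illustrated with a small, self-contained sketch. The class and method names are hypothetical and the mutations are plain strings for brevity; the real driver builders and mutation classes differ. The point is only that each "alter" call records a delta that is handed over as a list at the end.

```java
import java.util.ArrayList;
import java.util.List;

final class SchemaBuilderSketch {
    // Deltas collected while the caller alters the schema through the builder.
    private final List<String> mutations = new ArrayList<>();

    /** Records a change of the schema description. */
    SchemaBuilderSketch withDescription(String text) {
        mutations.add("setDescription(" + text + ")");
        return this;
    }

    /** Records the creation of a new attribute of the given type. */
    SchemaBuilderSketch withAttribute(String name, Class<?> type) {
        mutations.add("createAttribute(" + name + ", " + type.getSimpleName() + ")");
        return this;
    }

    /** Returns the recorded deltas in the order in which they were requested. */
    List<String> toMutations() {
        return List.copyOf(mutations);
    }
}
```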
However, if you want to use evitaDB on a platform that is not yet supported and covered by a specific client driver,
you have to work directly with our web APIs that only accept mutations, and you have no other options than to write
the mutations directly or to write your own client driver. But you can open source it and help the community. Let us
know about it!
The name validation logic and reserved words are present in the class
.
There is also a special property called nameVariants in the schema of each named object. It contains variants
of the object name in different "developer" notations such as camelCase, PascalCase, snake_case and so on. See
.
for a complete listing.
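As a rough illustration of what such name variants look like, the sketch below derives three notations from one name. It is an approximation written for this text: the word-splitting rules are an assumption and the actual evitaDB conversion logic may treat edge cases (digits, acronyms) differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class NameVariantsSketch {
    /** Splits a name into lower-case words on separators and on lower-to-upper case boundaries. */
    static List<String> words(String name) {
        String spaced = name.replaceAll("([a-z0-9])([A-Z])", "$1 $2");
        List<String> result = new ArrayList<>();
        for (String part : spaced.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                result.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    static String pascalCase(String name) {
        StringBuilder sb = new StringBuilder();
        for (String word : words(name)) {
            sb.append(Character.toUpperCase(word.charAt(0))).append(word.substring(1));
        }
        return sb.toString();
    }

    static String camelCase(String name) {
        String pascal = pascalCase(name);
        return pascal.isEmpty() ? pascal : Character.toLowerCase(pascal.charAt(0)) + pascal.substring(1);
    }

    static String snakeCase(String name) {
        return String.join("_", words(name));
    }
}
```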
List of mutations related to catalog
Top-level mutations:
Within ModifyCatalogSchemaMutation you can use mutations:
Global attribute schema has the same structure as attribute schema except for one additional
characteristic. A global attribute can be made uniqueGlobally, which means that values of such an attribute must be
unique across all entities and entity types in the entire catalog.
Global uniqueness is useful, for example, for entity URLs that we naturally want to be unique among all entities in the catalog. The globally
unique attribute allows us to ask evitaDB for an entity with a specific value without knowing its type in advance.
This solves the use case when a new request arrives in your application and you need to check if there is an entity
that matches it (no matter if it's a product, category, brand, group or whatever types you have in your project).
A global attribute can also be used as a "dictionary definition" for an attribute that is used in multiple entity
collections, and we want to make sure it's named and described the same in all of them. An entity collection cannot
define an attribute with the same name as the global attribute. It can only "use" the global attribute with that name
and thus share its complete definition.
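The URL use case can be sketched as a single catalog-wide lookup table. This is a conceptual model with hypothetical names, not the evitaDB index structure: one map spans all entity types, so a value resolves to an entity without knowing its type in advance, and a second entity cannot claim the same value.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class GlobalUniqueSketch {
    /** Identifies an entity by its type and primary key. */
    record EntityRef(String entityType, int primaryKey) {}

    // One map for the whole catalog, shared by all entity types.
    private final Map<String, EntityRef> byUrl = new HashMap<>();

    /** Registers the URL for the entity, or throws when another entity already owns it. */
    void register(String url, String entityType, int primaryKey) {
        EntityRef ref = new EntityRef(entityType, primaryKey);
        EntityRef existing = byUrl.putIfAbsent(url, ref);
        if (existing != null && !existing.equals(ref)) {
            throw new IllegalArgumentException("URL " + url + " is already used by " + existing);
        }
    }

    /** Finds the owner of the URL regardless of its entity type. */
    Optional<EntityRef> find(String url) {
        return Optional.ofNullable(byUrl.get(url));
    }
}
```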
When primary keys are generated by the database, evitaDB assigns a unique number to each newly inserted entity.
The primary key always starts with 1 and is incremented by 1. evitaDB guarantees its uniqueness within the same
entity type. The primary keys generated in this way are optimal for binary operations in the data structures used.
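A minimal sketch of such a sequence, assuming one independent counter per entity type (the class is hypothetical and ignores persistence and concurrency):

```java
import java.util.HashMap;
import java.util.Map;

final class PrimaryKeySequenceSketch {
    // Last primary key handed out for each entity type.
    private final Map<String, Integer> lastAssigned = new HashMap<>();

    /** Returns the next primary key for the entity type: 1 on first use, then incremented by 1. */
    int next(String entityType) {
        return lastAssigned.merge(entityType, 1, Integer::sum);
    }
}
```

Keys are unique only within one entity type, so a product and a category may both have the primary key 1.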
List of mutations related to primary key
Within ModifyEntitySchemaMutation you can use mutation:
Evolution
We recommend the schema-first approach, but there are cases where you don't want to bother with the schema and just want
to insert and query the data (e.g., rapid prototyping). When a new catalog is created, it is set up
in "auto evolution" mode, where the schema adapts to the data on first insertion. If you want to control the schema
strictly, you have to limit the evolution by changing the default schema. In strict mode, evitaDB throws an exception
if the input data violates the schema.
You still need to create entity collections manually, but after that you can immediately insert
your data and the schema will be built accordingly. The existing schemas will still be validated on each entity
insertion/update - you will not be allowed to store the same attribute as a number type the first time and as a string
the next time. The first use will set up the schema, which must be respected from that moment on.
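The "first use sets up the schema" rule can be sketched as follows. The class is a hypothetical simplification that tracks only attribute types; the real schema records much more.

```java
import java.util.HashMap;
import java.util.Map;

final class AutoEvolutionSketch {
    // Attribute types learned from the first value stored under each name.
    private final Map<String, Class<?>> attributeTypes = new HashMap<>();

    /** Learns the type on first use and rejects values of a different type afterwards. */
    void store(String attribute, Object value) {
        Class<?> known = attributeTypes.putIfAbsent(attribute, value.getClass());
        if (known != null && !known.equals(value.getClass())) {
            throw new IllegalArgumentException(
                "Attribute " + attribute + " is " + known.getSimpleName()
                    + ", got " + value.getClass().getSimpleName());
        }
    }

    /** Returns the type learned for the attribute, or null when it was never stored. */
    Class<?> typeOf(String attribute) {
        return attributeTypes.get(attribute);
    }
}
```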
If the first entity has its primary key, evitaDB expects all entities to have their primary key set when inserting.
If the first entity has its primary key set to NULL, evitaDB will generate primary keys for you and will reject
external primary keys. New attribute schemas are implicitly created as nullable and filterable, and those with
a non-array data type also as sortable. This means that the client is immediately able to filter/sort on almost anything,
but the database itself will consume a lot of resources. The references will be created as indexed but not faceted.
There are several partial lax modes between strict and fully automatic evolution mode - see
for details.
For example - you can strictly control the entire schema, except for new locale or currency definitions, which are
allowed to be added automatically on first use.
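A partial evolution mode can be thought of as a set of changes the schema may still accept automatically. The sketch below models this with a hypothetical `Change` enum; the constant names are assumptions made for this example and are not the evitaDB evolution mode names.

```java
import java.util.Set;

final class EvolutionModeSketch {
    /** Hypothetical kinds of automatic schema changes. */
    enum Change { ADD_ATTRIBUTE, ADD_REFERENCE, ADD_LOCALE, ADD_CURRENCY }

    // Changes that may still happen automatically on first use.
    private final Set<Change> allowed;

    EvolutionModeSketch(Set<Change> allowed) {
        this.allowed = Set.copyOf(allowed);
    }

    /** Throws when the data would require a schema change that is not allowed. */
    void check(Change change) {
        if (!allowed.contains(change)) {
            throw new IllegalStateException("Schema change " + change + " is not allowed");
        }
    }
}
```

A schema that is strict except for new locales and currencies would be created with `Set.of(Change.ADD_LOCALE, Change.ADD_CURRENCY)`; a fully strict schema with an empty set.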
List of mutations related to evolution mode
Within ModifyEntitySchemaMutation you can use mutations:
Locales and currencies
The schema specifies a list of allowed currencies and locales. We assume that the list of allowed currencies/locales
will be relatively small (a handful, at most a few dozen) and if the system knows them in advance, it can generate enums
for each of them in web APIs. This helps developers write queries with auto-completion. There is another positive
effect. E-commerce systems don't often extend the list of used currencies or locales (because there are usually a lot
of manual operations involved), and having the allowed set guarded by the system eliminates the possibility of inserting
invalid prices or localizations by mistake.
The price lists are closer to "data" than locales or currencies. The set of price lists is expected to change very
often, and their numbers can reach high cardinality (thousands, tens of thousands). It wouldn't be practical to generate
enumeration values for them and change the Web API schemas every time a price list is added or removed.
List of mutations related to locales & currencies
Within ModifyEntitySchemaMutation you can use mutations:
Hierarchy placement
When hierarchy placement is enabled, entities of this type can form a tree structure. Each entity can have a maximum
of one parent node and zero or more child entities. Neither the depth of the tree nor the number of siblings at each
level is limited.
Enabling hierarchy placement implies the creation of a new index for the involved
entity type. When another entity references a hierarchy entity and the reference is marked as indexed, a special
index is created for each hierarchical entity. This index will
hold reduced attribute and price indices of the referencing entity, allowing quick evaluation of
withinHierarchy filter conditions.
Orphan hierarchy nodes
The typical problem associated with creating a tree structure is the order in which nodes are attached to it. In
order to have a consistent tree, one should start from the root nodes and gradually descend along the axis of their
children. This isn't always easy to do when we need to copy an existing tree to an external system (for scripting
purposes, it's much easier and more efficient to index in batches using the natural order of records). A similar
situation occurs when an intermediate tree node needs to be removed, but its children do not. We can force developers to
rewire children to different parents before removing their parent, but they often don't have direct control over the
order of operations and can't easily do that.
That's why evitaDB recognizes so-called orphan hierarchy nodes. An orphan node is a node that declares itself to be
a child of a parent node with a primary key that evitaDB doesn't know yet, or of a parent that is itself an orphan. Orphan
nodes do not participate in the evaluation of queries on hierarchical structures,
but are present in the index. If a node of a referenced primary key is appended to the main hierarchy tree, the
orphan nodes (sub-trees) are also appended. In this way, the hierarchy tree eventually becomes consistent.
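The behavior of orphan nodes can be sketched with a small model. It is a simplification under stated assumptions: the class is hypothetical, and it decides whether a node is attached lazily by walking up the declared parents, whereas a real index would maintain this information incrementally.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class OrphanHierarchySketch {
    // Node primary key -> declared parent primary key (null marks a root node).
    private final Map<Integer, Integer> parentOf = new HashMap<>();

    void addRoot(int primaryKey) {
        parentOf.put(primaryKey, null);
    }

    /** Registers a node even if its parent is not known yet; until then it is an orphan. */
    void addChild(int primaryKey, int parentPrimaryKey) {
        parentOf.put(primaryKey, parentPrimaryKey);
    }

    /** A node is attached when following the declared parents leads to a root node. */
    boolean isAttached(int primaryKey) {
        Set<Integer> seen = new HashSet<>();
        Integer current = primaryKey;
        while (current != null && seen.add(current)) {
            if (!parentOf.containsKey(current)) {
                return false; // unknown node on the path: the whole sub-tree is orphaned
            }
            Integer parent = parentOf.get(current);
            if (parent == null) {
                return true; // reached a root
            }
            current = parent;
        }
        return false; // a cycle never reaches a root
    }
}
```

Nodes can be inserted in any order: children 3 and 2 stay orphaned until root 1 arrives, at which point the whole sub-tree becomes part of the hierarchy.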
List of mutations related to hierarchy placement
Within ModifyEntitySchemaMutation you can use mutation:
Prices
When prices are enabled, entities of this type can have a set of prices associated with them and can be
filtered and sorted by price constraints. A single entity
can have zero or more prices (the system is designed for situations where an entity has tens or hundreds of prices attached
to it). For each combination of priceList and currency there is a special index.
List of mutations related to prices
Within ModifyEntitySchemaMutation you can use mutation:
Attributes
An entity type can have zero or more attributes. The system is designed for situations where an entity has tens of
attributes. You should pay attention to the number of filterable / sortable / unique attributes. There is a
separate index instance for each filterable attribute, for each sortable attribute, and for each
unique attribute. Attributes that are neither filterable, sortable, nor unique don't consume operating memory.
An attribute schema can be marked as localized, meaning that it only makes sense in a specific locale.
An attribute schema can be marked as deprecated, which will be propagated to generated web API documentation.
List of mutations related to attribute
Within ModifyEntitySchemaMutation you can use mutation:
Default value
An attribute may have a default value defined. The value is used when a new entity is created and no value has been
assigned to a particular attribute. There is no other situation where the default value matters.
Allowed decimal places
The allowed decimal places setting is an optimization that allows rich numeric types
to be converted to a primitive type, which is much more
compact and can be used for fast binary searches in array/bitset representation. The original rich format is still
present in an attribute container, but internally the database uses the primitive form when an attribute is part of
filter or sort conditions.
If a number cannot be converted to a compact form (for example, it has more digits in the fractional part than expected),
an exception is thrown and the entity update is refused.
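The conversion can be sketched as follows. The concrete types are assumptions made for this example (`BigDecimal` as the rich type, `long` as the compact form); the helper is hypothetical and only demonstrates the principle of shifting the decimal point by the allowed number of places and refusing values that would lose precision.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class DecimalPlacesSketch {
    /**
     * Converts the value to a whole number scaled by 10^allowedDecimalPlaces.
     * Throws ArithmeticException when the value has more fractional digits than allowed
     * or when the result does not fit into a long.
     */
    static long toCompact(BigDecimal value, int allowedDecimalPlaces) {
        return value
            .setScale(allowedDecimalPlaces, RoundingMode.UNNECESSARY) // fails instead of rounding
            .unscaledValue()
            .longValueExact();
    }
}
```

With two allowed decimal places, `12.5` becomes `1250`, while `12.345` is rejected because it cannot be represented without rounding.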
Sortable attribute compounds
Sortable attribute compound is a virtual attribute composed of the values of several other attributes, which can only be
used for sorting. evitaDB requires a previously prepared sort index to be able to sort entities. This fact makes sorting
much faster than ad-hoc sorting by attribute value. Also, the sorting mechanism of evitaDB is somewhat different from
what you might be used to. If you sort entities by two attributes in an orderBy clause of the query, evitaDB sorts
them first by the first attribute (if present) and then by the second (but only those where the first attribute is
missing). If two entities have the same value of the first attribute, they are not sorted by the second attribute, but
by the primary key (in ascending order). If we want to use fast "pre-sorted" indexes, there is no other way to do it,
because the secondary order would not be known until query time.
This default sorting behavior by multiple attributes is not always desirable, so evitaDB allows you to define a sortable
attribute compound, which is a virtual attribute composed of the values of several other attributes. evitaDB also allows
you to specify the order of the "pre-sorting" behavior (ascending/descending) for each of these attributes, and also
the behavior for NULL values (first/last) if the attribute is completely missing in the entity. The sortable attribute
compound is then used in the orderBy clause of the query instead of specifying the multiple individual attributes to
achieve the expected sorting behavior while maintaining the speed of the "pre-sorted" indexes.
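The difference between the two approaches can be sketched with plain Java comparators. This is a conceptual model only: the `Item` record and both methods are hypothetical, and the real engine works with pre-sorted indexes rather than sorting at query time. A `null` value stands for a missing attribute.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SortCompoundSketch {
    /** Hypothetical entity with two sortable attributes; null means the attribute is missing. */
    record Item(int pk, String brand, Integer price) {}

    /** Two chained sorters: ties in brand fall back to the primary key, not to price. */
    static List<Integer> chained(List<Item> items) {
        List<Integer> result = new ArrayList<>();
        items.stream()
            .filter(i -> i.brand() != null)
            .sorted(Comparator.comparing(Item::brand).thenComparingInt(Item::pk))
            .forEach(i -> result.add(i.pk()));
        // Only entities that the first sorter could not handle reach the second one.
        items.stream()
            .filter(i -> i.brand() == null && i.price() != null)
            .sorted(Comparator.comparing(Item::price).thenComparingInt(Item::pk))
            .forEach(i -> result.add(i.pk()));
        return result;
    }

    /** One compound of (brand, price), both ascending with missing values last. */
    static List<Integer> compound(List<Item> items) {
        Comparator<String> brandOrder = Comparator.nullsLast(Comparator.naturalOrder());
        Comparator<Integer> priceOrder = Comparator.nullsLast(Comparator.naturalOrder());
        Comparator<Item> order = Comparator.comparing(Item::brand, brandOrder)
            .thenComparing(Item::price, priceOrder)
            .thenComparingInt(Item::pk);
        return items.stream()
            .filter(i -> i.brand() != null || i.price() != null)
            .sorted(order)
            .map(Item::pk)
            .toList();
    }
}
```

For two items of the same brand, `chained` orders them by primary key, while `compound` orders them by price, which is usually what a user expects from "sort by brand, then by price".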
A sortable attribute compound is only created if at least one of its attributes is present in the entity. This is
crucial for the standard sorting mechanism of evitaDB: entities that lack the compound are passed to the next sorter
defined in the query (or sorted by the primary key in ascending order if no other sorter is defined).
Sortable attribute compound schema can be made deprecated, which will be propagated to generated web API documentation.
List of mutations related to sortable attribute compound
Within ModifyEntitySchemaMutation you can use mutation:
The sortable attribute compound schema is described by:
Associated data
An entity type may have zero or more associated data. The system is designed for situations where an entity has
tens of associated data items.
An associated data schema can be marked as localized, meaning that it only makes sense in a specific locale.
An associated data schema can be marked as deprecated, which will be propagated to generated web API documentation.
List of mutations related to associated data
Within ModifyEntitySchemaMutation you can use mutation:
Reference
An entity type may have zero or more references. References can be managed or unmanaged. Managed references refer to entities within the same catalog and can be checked for consistency by evitaDB. Unmanaged references refer to entities that are managed by external systems outside the scope of evitaDB. An entity can have a self-reference that refers to the same entity type. An entity type can have several references to the same entity type.
References can have zero or more attributes that apply only to a particular "link" between these two entity instances. A global attribute cannot be used as a reference attribute. Otherwise, the same rules apply for reference attributes as for regular entity attributes.
List of mutations related to reference
Within ModifyEntitySchemaMutation you can use mutation:
The ModifyReferenceAttributeSchemaMutation expects nested attribute mutations.
Reference directionality
References are unidirectional in nature, which means that if the reference points from entity A to entity B, it does not mean that entity B automatically references entity A. It is possible to set up a bi-directional reference by creating a so-called "reflected reference" on the other entity type and identifying the original reference that should be reflected. The reflected reference may or may not inherit attributes from the original reference, and it may also define its own separate attributes. This can be described by the following ERD diagram:
erDiagram
    A ||--o{ A_to_B : references
    B ||--o{ A_to_B : references
    A_to_B {
        string A1
        string A2
    }
    B ||--o{ B_to_A : references
    A ||--o{ B_to_A : references
    B_to_A {
        string A1
        string B2
    }
Reflected references are automatically created, updated, and removed when the original reference is manipulated. It also works the other way around - when the reflected reference is manipulated, the original reference is updated.
There is a subtle difference between the original reference and the reflected reference. The original reference can exist even if the referenced entity does not (yet) exist (the reference is orphaned). On the other hand, when you create a reflected reference, the referenced entity must exist. This is because the reflected reference immediately creates the original reference, and the original reference must have a valid target. This behavior is needed to maintain
consistency when moving entities between different scopes that treat original and reflected references differently.
If the reference contains an attribute that is not defined on the other side, and the reference is created - the missing attribute on the other side is created with its default value (if no such default value is defined, an exception is thrown).
Reference indexing
You need to select the indexing level for each of the references defined in the entity schema. There are three indexing levels available:
NONE
Reference has no index available. This means that the reference cannot be used in any query filtering or sorting. Use this type when you do not need to filter nor sort by reference existence or any of the reference attributes, and you want to minimize memory and disk usage.
FOR_FILTERING
Reference has only basic index available that is necessary for referencedEntityHaving filter conditions and referenceProperty sorting constraint interpretation. This is the minimal indexing level that allows filtering by reference existence and reference attributes. Use this type when you need basic reference filtering capabilities but want to minimize memory and disk usage. This is suitable for references that are not frequently used in complex queries or when storage optimization is more important than query performance. This is the recommended default indexing type for references and is sufficient for most use cases.
FOR_FILTERING_AND_PARTITIONING
Reference has basic index available that is necessary for referencedEntityHaving filter conditions and referenceProperty sorting constraint interpretation, and also partitioning indexes for the main entity type (i.e. entity type that contains the reference schema), which may greatly speed up the query execution when the reference is part of the query filtering. This advanced indexing creates additional data structures that allow for more efficient query execution by partitioning the data based on the reference relationships. This can significantly improve performance for complex queries that involve reference filtering, especially when dealing with large datasets. Use this type when reference filtering is frequently used in queries and query performance is critical. Be aware that this option requires more memory and disk space compared to FOR_FILTERING level.
A partitioning index is created for each reference used in any entity in the schema, and contains a subset of the attribute, price and other indexes reduced only to entities with the given reference. Let's describe it with an example: let's say we have entity type Product that has reference categories to entity type Category, which is indexed FOR_FILTERING_AND_PARTITIONING. Let's imagine that we need to find all products classified in a specific category that also meet ten other conditions (they are published, currently valid, have an available price in the user's price list and in EUR, etc.). We can evaluate such a query over one large index, where this information is available for all known products in the database, or (if we use partitioning) we can use a much smaller index, in which we can find all the necessary information only for products that have a valid link to the category for which we are evaluating this query. The response to the query will be significantly faster because the amount of data searched is significantly smaller. The downside of this approach is that it requires a relatively large amount of memory.
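The trade-off can be illustrated with a toy model that indexes a single "published" flag. Everything here is a hypothetical simplification, not the evitaDB index layout: the same information is stored once in a large index over all products and again in a small index per category, so a category query touches less data at the cost of extra memory.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class PartitionIndexSketch {
    // One large index over all known products.
    private final Set<Integer> publishedAll = new HashSet<>();
    // Reduced copies of the same index, one per referenced category.
    private final Map<Integer, Set<Integer>> publishedByCategory = new HashMap<>();

    /** Indexes a product that references the given category. */
    void index(int productPk, int categoryPk, boolean published) {
        if (!published) {
            return;
        }
        publishedAll.add(productPk);
        publishedByCategory.computeIfAbsent(categoryPk, k -> new HashSet<>()).add(productPk);
    }

    /** Answers "published products in the category" from the small per-category index only. */
    Set<Integer> publishedIn(int categoryPk) {
        return Set.copyOf(publishedByCategory.getOrDefault(categoryPk, Set.of()));
    }

    /** Size of the large index that a non-partitioned query would have to scan. */
    int globalIndexSize() {
        return publishedAll.size();
    }
}
```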
If the reference is marked as faceted, a special facet reference index is created for the entity type. This index contains optimized data structures for facet summary computation. All reference instances of a given type are then inserted into the facet reference index (there is no way to exclude a reference from indexing in the facet reference index). References can (but don't have to) be organized into facet groups that refer to a managed or non-managed entity type.
Reference cardinality
Each reference schema has a certain cardinality. The cardinality describes the expected number of relations of this type. In evitaDB we define only one-way relations from the perspective of the entity. We follow the ERD modeling standards. Cardinality affects the design of the Web API schemas (returning only single references or arrays) and also helps us to protect the consistency of the data so that it conforms to the creator's mental model.
When you allow definition of duplicate references using one of the cardinality types: ZERO_OR_MORE_WITH_DUPLICATES or ONE_OR_MORE_WITH_DUPLICATES, you'll be able to define two references to the same target entity within a single entity instance. In such a case you need to select at least one reference attribute that would make the two references distinguishable and set it as representative. The representative attribute would then be used to identify the specific reference when querying or manipulating the entity. If no representative attribute is defined, an exception is thrown when you try to create duplicate references.
There are situations when duplicate references come in handy. Imagine you have an entity type Product that has a reference medias to an entity of type Media. You want to be able to link multiple media items to a single product, and you also want to be able to distinguish between them based on their role (e.g., "thumbnail", "gallery", "video", etc.). In such a case you can define reference attribute role as representative, and then you'll be able to create multiple references to the same Media entity with different role values.
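The role of the representative attribute can be sketched as part of the reference identity. The class below is a hypothetical model of a single product's references to Media entities, not the evitaDB data structure: two references to the same target are distinct as long as their `role` differs.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class DuplicateReferenceSketch {
    /** Identity of a reference: the target entity plus the representative attribute value. */
    record ReferenceKey(int mediaPk, String role) {}

    private final Set<ReferenceKey> references = new LinkedHashSet<>();

    /** Adds a reference and returns false when the same target with the same role already exists. */
    boolean link(int mediaPk, String role) {
        if (role == null) {
            throw new IllegalArgumentException("The representative attribute 'role' must be set");
        }
        return references.add(new ReferenceKey(mediaPk, role));
    }

    int count() {
        return references.size();
    }
}
```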
Scopes
Scopes are separate areas of memory where entity indexes are stored. Scopes are used to separate live data from archived
data. Scopes are used to handle so-called "soft deletes" - the application can choose between a hard delete and
archiving the entity, which simply moves the entity to the archive scope. The reasons for this feature are explained in
the dedicated blog post.
By default, archived entities have no indexes other than the primary key index. This is because archived entities are
not normally queried and are only looked up by their primary key. By not maintaining the indexes of archived entities,
we save memory and CPU resources. There may be cases where you want to query the archived entities and therefore you
have full control over which indexes are maintained in the archive scope when you define the entity schema. Note that
the more indexes you maintain, the more memory and CPU resources you will consume as the list of archived entities grows.
Changes in reference behavior
When you move an entity from one scope to another, the original references are retained, while the reflected references
are removed if either of the following conditions applies:
the reflected reference schema is not marked as indexed in the target scope
the primary reference schema (i.e., the original reference being reflected) is not marked as indexed in the target scope.
Reflected references are something that is maintained by the evitaDB engine, and it requires appropriate indexes to be
present in the target scope in order to work. By default, the archive scope does not maintain any indexes other than
the primary key and a few others explicitly specified by you in the entity schema.
Therefore, the reflected references are usually removed when the entity is moved to the archive scope. The engine can
recreate them if the entity is moved back to the live scope where appropriate indexes exist.
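The rule for reflected references can be written as a single predicate. The sketch assumes a hypothetical `Scope` enum and represents "indexed in scope" as a set of scopes; it only restates the two conditions above and is not engine code.

```java
import java.util.Set;

final class ScopeMoveSketch {
    /** Hypothetical scope names used only in this example. */
    enum Scope { LIVE, ARCHIVED }

    /**
     * A reflected reference survives the move only when both the reflected reference schema
     * and the original reference schema are indexed in the target scope.
     */
    static boolean keepsReflectedReference(
        Set<Scope> reflectedIndexedIn, Set<Scope> originalIndexedIn, Scope target
    ) {
        return reflectedIndexedIn.contains(target) && originalIndexedIn.contains(target);
    }
}
```

With the default setup, where references are indexed only in the live scope, moving an entity to the archive drops its reflected references, and moving it back allows them to be recreated.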