Concept
Embeddings encode data as a vector, that is, a one-dimensional list of floats, for the purpose of comparison rather than retrieval. They are created by embedding models that represent data semantically. Encoded data can be compared using trigonometric measures such as cosine distance: small distances suggest high relatedness and large distances suggest low relatedness. Embedding search expresses those distances as a normalized score between 0 and 1 that conveys how related two embeddings are within the context of the embedding model.
https://cloud.google.com/blog/topics/developers-practitioners/meet-ais-multitool-vector-embeddings
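To make the comparison step concrete, here is a minimal sketch of cosine similarity between two vectors. The `normalizedScore` mapping onto the 0 to 1 range is an assumption chosen for illustration (one common convention); this document does not state which normalization Saga actually uses.

```javascript
// Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// ASSUMPTION: maps the [-1, 1] similarity range onto [0, 1].
// Saga's actual normalization is not documented here.
function normalizedScore(a, b) {
  return (1 + cosineSimilarity(a, b)) / 2;
}
```

Vectors pointing in the same direction yield a similarity of 1, orthogonal vectors yield 0, and opposite vectors yield -1, which the hypothetical `normalizedScore` maps to 1, 0.5 and 0 respectively.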
Saga Embeddings
Saga currently only supports text encodings using the text-embedding-ada-002 model from OpenAI, which is applied automatically when creating or updating a Saga embedding. A Saga embedding has to be associated with a Saga bot or user, can be tagged, and can contain further metadata intended for the search consumer. Embeddings are automatically deleted if the referenced bot or user is deleted. Saga caches embeddings: if the same value has been encoded before, it reuses the existing embedding and bypasses the creation process.
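The caching behaviour can be pictured with a small sketch. This is not Saga's implementation: `createCachedEncoder` is a hypothetical helper, and `encode` stands in for whatever function calls the embedding model.

```javascript
// HYPOTHETICAL sketch of an embedding cache keyed by the text value.
// `encode` is a placeholder for the call to the embedding model.
function createCachedEncoder(encode) {
  const cache = new Map(); // text value -> Promise resolving to the vector
  return function cachedEncode(value) {
    if (!cache.has(value)) {
      // First time this value is seen: encode it and remember the result.
      const pending = Promise.resolve(encode(value)).catch((err) => {
        cache.delete(value); // do not keep failed encodings in the cache
        throw err;
      });
      cache.set(value, pending);
    }
    // Same value seen before: reuse the stored result, skipping the model call.
    return cache.get(value);
  };
}
```

Keying on the raw text keeps the sketch short; a real cache would more likely key on a hash of the value and persist entries outside process memory.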
Embeddings can be created, searched and managed via the HTTP API or in Scripts/Jobs using the JS API. A common use case is a signal-based script that creates and updates a per-bot embedding, using type and metadata to identify the embedding.
Note: The actual embedding float list is never sent back by the APIs, as this would create too much I/O and is not that useful.
HTTP API
POST /embeddings
Create a new embedding:
- value: the text to encode
- collection_name: either bots or users
- collection_id: the id of the bot or user; non-existing references result in an error
- tags: list of Saga tags
POST
{
"value": "I came, I saw, I conquered",
"collection_name": "bots",
"collection_id": "60bf52a2dbd52e0069edcb35",
"tags": [
"quote",
"military",
"empire"
],
"meta_data": {
"year": -49
}
}
RESPONSE
{
"_id": "577a6490fedf1e030057095c",
"createdAt": "2016-07-04T13:28:48.378Z",
"updatedAt": "2016-07-04T13:28:48.378Z",
"value": "I came, I saw, I conquered",
"collection_name": "bots",
"collection_id": "8791273987123971329",
"tags": [
"quote",
"military",
"empire"
],
"meta_data": {
"year": -49
}
}
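As an illustration, the request above could be sent from JavaScript roughly as follows. The host name, the bearer-token header and the `createEmbedding` helper are assumptions made for this sketch; the document does not specify the base URL or the authentication scheme.

```javascript
// ASSUMPTIONS: `baseUrl` and the bearer-token header are placeholders;
// check your Saga deployment for the real host and authentication scheme.
async function createEmbedding(baseUrl, token, embedding, fetchImpl = fetch) {
  const response = await fetchImpl(`${baseUrl}/embeddings`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify(embedding),
  });
  if (!response.ok) {
    // For example, a collection_id that does not reference an existing bot or user.
    throw new Error(`Creating embedding failed with status ${response.status}`);
  }
  return response.json();
}
```

A call such as `await createEmbedding("https://saga.example.com", "<token>", { value: "I came, I saw, I conquered", collection_name: "bots", collection_id: "60bf52a2dbd52e0069edcb35", tags: ["quote"] })` would then resolve to the response document. The optional `fetchImpl` parameter exists only so the function can be exercised without a network.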
PUT /embeddings/:id
Update the embedding with the given id; same fields and responses as in POST.
DELETE /embeddings/:id
Delete the embedding with the given id.
GET /embeddings/
List and search the embeddings. When the search parameter is supplied, the embeddings are searched using the value of the parameter:
- the part of the value that is not a tags:..., collection_id:... or collection_type:... filter is used as the search term and is encoded for the embedding search
- filter by tags with tags:...
- filter by collection_type with collection_type:...
- filter by collection_id with collection_id:...
In addition to the embedding data itself, search results also contain a score that represents the similarity with the given search term.
Search example 'coffee tags:agent'
[
{
"_id": "658a532ee925c206483cb7a0",
"score": 0.8814504146575928,
"value": "I am going to El Cafe Del Coroin",
"collection_name": "bots",
"collection_id": "64a56874f56130e80944208d",
"meta": {
"trigger": "/bots/properties/destination",
"name": "destination"
},
"createdAt": "2024-01-05T04:07:51.740Z",
"updatedAt": "2024-01-05T04:07:51.740Z",
"tags": [
"agent",
"destination"
]
},
{
"_id": "658af4532bbcdf139ba79023",
"score": 0.8727329969406128,
"value": "I am going to Starbucks in Las Vegas",
"collection_name": "bots",
"collection_id": "64a5e00af56130e80944ae8f",
"meta": {
"trigger": "/bots/properties/destination",
"name": "destination"
},
"createdAt": "2024-01-06T13:20:36.745Z",
"updatedAt": "2024-01-06T13:20:36.745Z",
"tags": [
"agent",
"destination"
]
},
{
"_id": "658a583a4b6bda821090fed6",
"score": 0.8553992509841919,
"value": "The weather is clear sky. Listening to Frozen Grasslands by Cora Zea. I am going to Pinho's Bakery in Roselle",
"collection_name": "bots",
"collection_id": "60bf52a2dbd52e0069edcb35",
"meta": {
"trigger": "/bots/properties/*",
"name": "combined"
},
"createdAt": "2024-01-06T14:10:01.622Z",
"updatedAt": "2024-01-06T14:10:01.622Z",
"tags": [
"agent",
"combined"
]
}
]
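A search value like the one in the example can be assembled programmatically. The sketch below makes two assumptions not confirmed by this document: that `search` is passed as a URL query parameter, and that each filter takes a single value. `buildSearchValue`, `buildSearchUrl` and the host name are hypothetical.

```javascript
// Builds the search value: free-text term first, then `key:value` filters.
// Filter keys follow the list above: tags, collection_type, collection_id.
function buildSearchValue(term, filters = {}) {
  const parts = term ? [term] : [];
  for (const key of ["tags", "collection_type", "collection_id"]) {
    if (filters[key]) {
      parts.push(`${key}:${filters[key]}`);
    }
  }
  return parts.join(" ");
}

// ASSUMPTION: `search` is a URL query parameter; `baseUrl` is a placeholder.
function buildSearchUrl(baseUrl, term, filters) {
  const value = buildSearchValue(term, filters);
  return `${baseUrl}/embeddings/?search=${encodeURIComponent(value)}`;
}
```

For instance, `buildSearchValue("coffee", { tags: "agent" })` produces the value `coffee tags:agent` used in the example above, and `buildSearchUrl` percent-encodes it so the space and colon survive as part of the URL.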
JavaScript API
The embeddings JavaScript API is more feature-rich than the HTTP API: it supports the standard MongoDB functions and a custom search function.
Create / Update
When used in scripts, the API can check whether an embedding matching certain criteria already exists, which avoids leaving outdated entries for the same criteria.
Embeddings are typically created in scripts or jobs. This example script creates an embedding describing the weather: if the bot already has a weather embedding, the script updates it; otherwise it creates a new one. The meta field serves to identify the embedding in that case.
const value = `The weather is ${property.value.weather[0].description}`;

// Existing embedding for the same type? Update it in place to maintain its timestamps.
const existing = await Embedding.findOne(
  {"meta.name": "weather", collection_name: "bots", collection_id: parent._id},
  {_id: 1, createdAt: 1, updatedAt: 1}
);
if (existing) {
  existing.set("value", value);
  await existing.save();
} else {
  // No weather embedding for this bot yet: create one, identified by meta.name.
  await Embedding.create({
    value,
    collection_name: "bots",
    collection_id: parent._id,
    meta: {trigger: "/bots/properties/weather", name: "weather"},
    tags: ["agent", "weather"]
  });
}
Search
The search function can populate the result with the referenced collection object in a single MongoDB aggregation operation.
Definition:
/**
 * Embeddings search function. The field `collection_object` is only returned when either parent_base or parent_properties is defined.
 * @param {string} term the vector search term
 * @param {object} [filter] optional, query filter to be applied prior to vector search, can include 'tags', 'collection_id' and 'collection_name'
 * @param {object} [options] optional, controls whether `collection_object` is populated
 * @param {boolean} [options.parent_base] optional, include the parent base: names, timestamps and id
 * @param {string[]} [options.parent_properties] optional, include the properties in the array, e.g. `["geoJSON"]`. Also includes `parent_base`
 * @return [{_id:string, score: number,value:string,meta:object,collection_id:string,collection_name:string, collection_object:object}]
 */
EmbeddingsSchema.statics.search = async function (term, filter, {parent_base, parent_properties} = {}) { /**...**/ }
Example: Query the embeddings for sunny weather, limit to bots, and include collection_object with the bot name and the properties geoJSON and weather if defined.
// search is an async function, so await its result
const embeddings = await Embedding.search("sunny weather", {collection_name:'bots'}, {parent_properties:["geoJSON","weather"]})
//...
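What a script does with the search result depends on the use case. As one possible sketch, the helper below keeps only results above a score threshold and reduces them to a compact summary. `summarizeMatches`, the default threshold of 0.85 and the `collection_object.name` field are assumptions for illustration; the exact shape of `collection_object` is not specified in this document.

```javascript
// Keeps results at or above `minScore` and maps them to a compact summary.
// ASSUMPTION: `collection_object.name` holds the bot name; the threshold
// default is arbitrary and should be tuned for the data at hand.
function summarizeMatches(results, minScore = 0.85) {
  return results
    .filter((r) => r.score >= minScore)
    .map((r) => ({
      id: r.collection_id,
      // collection_object is only present when parent_base or
      // parent_properties was requested in the search call.
      name: r.collection_object ? r.collection_object.name : undefined,
      value: r.value,
      score: r.score,
    }));
}
```

In a script this could be applied directly to the search result, for example `summarizeMatches(embeddings, 0.87)`.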