TN002 Learning from Human Feedback

Objective

Allow Foyle to learn from human feedback.

TL;DR

If a user corrects a command generated by the AI, we want to be able to use that feedback to improve the AI. The simplest way to do this is few-shot prompting. This tech note focuses on how we will collect and retrieve the examples used to enrich the prompts. At a high level, the process will look like the following:

  • Process logs to identify examples where the human corrected the AI generated response
  • Turn each mistake into one or more examples of a query and response
  • Compute and store the embeddings of the query
  • At query time, use brute force to compute the distance from the query’s embedding to all stored embeddings

This initial design prioritizes simplicity and flexibility over performance. For example, we may want to experiment with using AI to generate alternative versions of a query. For now, we avoid using a vector database to efficiently store and query a large corpus of documents; I think it’s premature to optimize for large corpora given that users may not have large corpora.

Generate Documents

The first step of learning from human feedback is to generate examples that can be used to train the AI. We can think of each example as a tuple (Doc, Blocks) where Doc is the document sent to the AI for completion and Blocks are the Blocks the AI should return.

We can obtain these examples from our block logs. A good starting point is to look at where the AI made a mistake, i.e., where the human had to edit the command the AI provided. We can find these examples by looking for BlockLogs where the executed cell differed from the generated block.
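
To make this concrete, here is a minimal sketch in Go of the filtering step. The BlockLog, Doc, and Block types are simplified stand-ins for illustration; the real Foyle schema may differ.

package learn

// BlockLog is a simplified stand-in for a Foyle block log entry.
type BlockLog struct {
  ID             string
  Doc            *Doc   // the document that was sent to the AI for completion
  GeneratedBlock string // the command the AI generated
  ExecutedBlock  string // the command the human actually executed
}

// Doc and Block are placeholders for the Foyle document proto.
type Doc struct {
  Blocks []*Block
}

type Block struct {
  Contents string
}

// mistakes returns the log entries where the executed cell differs from
// the generated block, i.e. the human corrected the AI.
func mistakes(logs []BlockLog) []BlockLog {
  var out []BlockLog
  for _, l := range logs {
    if l.ExecutedBlock != "" && l.ExecutedBlock != l.GeneratedBlock {
      out = append(out, l)
    }
  }
  return out
}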

There are lots of ways we could store the tuples (Doc, Blocks), but the simplest, most obvious way is to append the desired block to Doc.blocks and then serialize the Doc proto as a .foyle file. We can start by assuming that the query corresponds to all but the last block, Doc.blocks[:-1], and the expected answer is the last block, Doc.blocks[-1]. This will break when we want to allow the AI to respond with more than one block, but to get started it should be good enough. Since the embeddings will be stored in an auxiliary file containing a serialized proto, we could extend that proto to include information about which blocks to use as the answer.
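
Continuing the sketch above, turning one mistake into an example might look like the following. The real implementation would serialize the Doc proto directly (e.g. with proto.Marshal); JSON is used here only to keep the sketch dependency-free.

import (
  "encoding/json"
  "os"
  "path/filepath"
)

// writeExample appends the block the human actually ran to the document
// and serializes the result as ${BLOCK_ID}.foyle.
func writeExample(dir string, l BlockLog) error {
  doc := &Doc{Blocks: append([]*Block{}, l.Doc.Blocks...)}
  doc.Blocks = append(doc.Blocks, &Block{Contents: l.ExecutedBlock})
  data, err := json.Marshal(doc) // stand-in for proto.Marshal
  if err != nil {
    return err
  }
  return os.WriteFile(filepath.Join(dir, l.ID+".foyle"), data, 0o644)
}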

Embeddings

In Memory + Brute Force

With text-embedding-3-small, embeddings have dimension 1536 and are stored as float32, so each embedding is roughly 6 KB (1536 × 4 bytes). If we have 1,000 documents, that’s about 6 MB, which should easily fit in memory for the near future.

Computation-wise, computing a dot product against 1,000 documents is about 3.1 million floating point operations (FLOPs): 1,000 documents × 1536 dimensions × 2 operations (a multiply and an add) per dimension. This is orders of magnitude less than LLAMA2, which clocks in at roughly 1,700 GFLOPs. Given that people are running LLAMA2 locally (albeit on GPUs), it seems like we should be able to get pretty far with a brute-force approach.

A brute-force, in-memory option would work as follows (see the sketch after this list):

  1. Load the vector embeddings of all documents into memory
  2. Compute the dot product between the query embedding and every document embedding using matrix multiplication
  3. Find the K documents with the largest dot products (equivalently, the smallest distances, since the embeddings are normalized; see References)
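
Here is a minimal Go sketch of this search, assuming the query and document embeddings all have the same dimension and are already normalized, so the largest dot products identify the nearest documents.

import "sort"

// topK returns the indices of the k document embeddings with the largest
// dot product against the query embedding.
func topK(query []float32, docs [][]float32, k int) []int {
  type scored struct {
    idx   int
    score float32
  }
  scores := make([]scored, len(docs))
  for i, d := range docs {
    var dot float32
    for j := range d { // assumes len(d) == len(query)
      dot += query[j] * d[j]
    }
    scores[i] = scored{i, dot}
  }
  sort.Slice(scores, func(a, b int) bool { return scores[a].score > scores[b].score })
  if k > len(scores) {
    k = len(scores)
  }
  out := make([]int, k)
  for i := 0; i < k; i++ {
    out[i] = scores[i].idx
  }
  return out
}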

Serializing the embeddings

To store the embeddings we will use a serialized proto that lives side by side with the file we computed the embeddings for. As proposed in the previous section, the example will be stored in the file ${BLOCK_ID}.foyle and its embeddings will live in the file ${BLOCK_ID}.embeddings.binpb. The latter will contain a serialized proto like the following

message Example {
  repeated float embedding = 1;
}
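
A sketch of persisting an embedding, continuing the Go sketches above. With the generated proto type this would be proto.Marshal(&Example{Embedding: vec}); the raw little-endian dump below just keeps the sketch dependency-free.

import (
  "bytes"
  "encoding/binary"
  "os"
)

// writeEmbedding persists the query embedding next to the example file,
// e.g. as ${BLOCK_ID}.embeddings.binpb.
func writeEmbedding(path string, vec []float32) error {
  buf := new(bytes.Buffer)
  if err := binary.Write(buf, binary.LittleEndian, vec); err != nil {
    return err
  }
  return os.WriteFile(path, buf.Bytes(), 0o644)
}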

We use a proto so that we can potentially enrich the data format over time. For example, we may want to

  • Store a hash of the source text so we can determine when to recompute embeddings
  • Store additional metadata
  • Store multiple embeddings for the same document corresponding to different segmentations

This design makes it easy to add and remove documents from the collection; for example, we can

  • Add “.foyle” or “.md” documents
  • Use globs to match files
  • Check if the embeddings already exist

The downside of this approach is likely performance. Opening and deserializing large numbers of small files is almost certainly less efficient than using a format like HDF5 that is optimized for matrices.

Learn command

We can add a command to the Foyle CLI to perform all these steps.

foyle learn

This command will operate in a level-based, declarative way. Each time it is invoked it will determine what work needs to be done and then perform it. If no additional work is needed, it will be a no-op. This design means we can run it periodically as a background process so that learning happens automatically.

Here’s how it will work: it will iterate over the entries in the block logs to identify logs that need to be processed. We can use a watermark to keep track of processed logs and avoid constantly rereading the entire log history. For each BlockLog that should be turned into an example, we look for the file {BLOCK_ID}.foyle; if the file doesn’t exist, we create it.

Next, we can check that for each {BLOCK_ID}.foyle file there is a corresponding {BLOCK_ID}.embeddings.binpb file. If it doesn’t exist, we will compute the embeddings. A sketch of this reconcile loop follows.
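
Putting the pieces together, one level-based pass might look like the sketch below, continuing the earlier Go sketches. The embed parameter stands in for a call to an embeddings API, and queryText is a hypothetical helper that extracts the query portion of the example.

import (
  "os"
  "path/filepath"
  "strings"
)

// queryText concatenates the blocks of the original document; per the
// convention above, the query is everything except the appended answer.
func queryText(l BlockLog) string {
  var parts []string
  for _, b := range l.Doc.Blocks {
    parts = append(parts, b.Contents)
  }
  return strings.Join(parts, "\n")
}

// learnOnce performs one declarative pass: every mistake gets a
// ${BLOCK_ID}.foyle file, and every example gets a
// ${BLOCK_ID}.embeddings.binpb file. Work already done is skipped.
func learnOnce(logs []BlockLog, dir string, embed func(string) ([]float32, error)) error {
  for _, l := range mistakes(logs) {
    examplePath := filepath.Join(dir, l.ID+".foyle")
    if _, err := os.Stat(examplePath); os.IsNotExist(err) {
      if err := writeExample(dir, l); err != nil {
        return err
      }
    }
    embPath := filepath.Join(dir, l.ID+".embeddings.binpb")
    if _, err := os.Stat(embPath); os.IsNotExist(err) {
      vec, err := embed(queryText(l))
      if err != nil {
        return err
      }
      if err := writeEmbedding(embPath, vec); err != nil {
        return err
      }
    }
  }
  return nil
}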

Discussion

Why Only Mistakes

In the current proposal we only turn mistakes into examples. In principle, we could also use examples where the model got things right. The intuition is that if the model is already handling a query correctly, there’s no reason to include few-shot examples. Arguably, those positive examples might end up confusing the AI if we retrieve them instead of examples corresponding to mistakes.

Why duplicate the documents?

The ${BLOCK_ID}.foyle files are likely very similar to the actual .foyle files the user created. An alternative design would be to just reuse the original .foyle documents. This is problematic for several reasons. The {BLOCK_ID}.foyle files are best considered internal to Foyle’s self-learning and shouldn’t be directly under the user’s control. Under the current proposal the {BLOCK_ID}.foyle are generated from logs and represent snapshots of the user’s documents at specific points in time. If we used the user’s actual files we’d have to worry about them changing over time and causing problems. Treating them as internal to Foyle also makes it easier to move them in the future to a different storage backend. It also doesn’t require Foyle to have access to the user’s storage system.

Use a vector DB

Another option would be to use a vector database (e.g. Weaviate, Chroma, Pinecone), but that adds a lot of complexity. In particular, it creates an additional dependency, which could be a barrier for users just interested in trying Foyle out. Furthermore, a vector DB means we need to think about schemas, indexes, updates, etc. Until it’s clear we have sufficient data to benefit from one, it’s not worth introducing.

References

OpenAI Embeddings Are Normalized