Without a copilot for devops we won't keep up

Since co-pilot launched in 2021, AI accelerated software development has become the norm. More importantly, as Simon Willison argued at last year’s AI Engineer Summit with AI there has never been an easier time to learn to code. This means the population of people writing code is set to explode. All of this begs the question, who is going to operate all this software? While writing code is getting easier, our infrastructure is becoming more complex and harder to understand. Perhaps we shouldn’t be surprised that as the cost of writing software decreases, we see an explosion in the number of tools and abstractions increasing complexity; the expanding CNCF landscape is a great illustration of this.

The only way to keep up with AI assisted coding is with AI assisted operations. While we are being flooded with copilots and autopilots for writing software, there has been much less progress with assistants for operations. This is because 1) everyone’s infrastructure is different; an outcome of the combinatorial complexity of today’s options and 2) there is no single system with a complete and up to date picture of a company’s infrastructure.

Consider a problem that has bedeviled me ever since I started working on Kubeflow; “I just want to deploy Jupyter on my company’s Cloud and access it securely.” To begin to ask AIs (ChatGPT, Claude, Bard) for help we need to teach them about our infrastructure; e.g. What do we use for compute, ECS, GKE, GCE? What are we using for VPN; tailscale, IAP, Cognito? How do we attach credentials to Jupyter so we can access internal data stores? What should we do for storage; Persistent disk or File store?

The fundamental problem is mapping a user’s intent, “deploy Jupyter”, to the specific set of operations to achieve that within our organization. The current solution is to build platforms that create higher level abstractions that hopefully more closely map to user intent while hiding implementation details. Unfortunately, building platforms is expensive and time consuming. I have talked to organizations with 100s of engineers building an internal developer platform (IDP).

Foyle is an OSS project that aims to simplify software operations with AI. Foyle uses notebooks to create a UX that encourages developers to express intent as well as actions. By logging this data, Foyle is able to build models that predict the operations needed to achieve a given intent. This is a problem which LLMs are unquestionably good at.

Demo

Let’s consider one of the most basic operations; fetching the logs to understand why something isn’t working. Observability is critical but at least for me a constant headache. Each observability tool has their own hard to remember query language and queries depend on how applications were instrumented. As an example, Hydros is a tool I built for CICD. To figure out whether hydros successfully built an image or hydrated some manifests I need to query its logs.

A convenient and easy way for me to express my intent is with the query

fetch the hydros logs for the image vscode-ext

If we send this to an AI (e.g. ChatGPT) with no knowledge of our infrastructure we get an answer which is completely wrong.

hydros logs vscode-ext

This is a good guess but wrong because my logs are stored in Google Cloud Logging. The correct query executed using gcloud is the following.

gcloud logging read 'logName="projects/foyle-dev/logs/hydros" jsonPayload.image="vscode-ext"' --freshness=1d  --project=foyle-dev

Now the very first time I access hydros logs I’m going to have to help Foyle understand that I’m using Cloud Logging and how hydros structures its logs; i.e. that each log entry contains a field image with the name of the image being logged. However, since Foyle logs the intent and final action it is able to learn. The next time I need to access logs if I issue a query like

show the hydros logs for the image caribou

Foyle responds with the correct query

gcloud logging read 'logName="projects/foyle-dev/logs/hydros" jsonPayload.image="caribou"' --freshness=1d --project=foyle-dev

I have intentionally asked for an image that doesn’t exist because I wanted to test whether Foyle is able to learn the correct pattern as opposed to simply memorizing commands. Using a single example Foyle learns 1) what log is used by hydros and 2) how hydros uses structured logging to associate log messages with a particular image. This is possible because foundation models already have significant knowledge of relevant systems (i.e. gcloud and Cloud Logging).

Foyle relies on a UX which prioritizes collecting implicit feedback to teach the AI about our infrastructure.

implicit feedback interaction diagram

In this interaction, a user asks an AI to translate their intent into one or more tools the user can invoke. The tools are rendered in executable, editable cells inside the notebook. This experience allows the user to iterate on the commands if necessary to arrive at the correct answer. Foyle logs these iterations (see this previous blog post for a detailed discussion) so it can learn from them.

The learning mechanism is quite simple. As denoted above we have the original query, Q, the initial answer from the AI, A, and then the final command, A’, the user executed. This gives us a triplet (Q, A, A’). If A=A’ the AI got the answer right; otherwise the AI made a mistake the user had to fix.

The AI can easily learn from its mistakes by storing the pairs (Q, A’). Given a new query Q* we can easily search for similar queries from the past where the AI made a mistake. Matching text based on semantic similarity is one of the problems LLMs excel at. Using LLMs we can compute the embedding of Q and Q* and measure the similarity to find similar queries from the past. Given a set of similar examples from the past {(Q1,A1’),(Q2,A2’),…,(Qn,An’)} we can use few shot prompting to get the LLM to learn from those past examples and answer the new query correctly. As demonstrated by the example above this works quite well.

This pattern of collecting implicit human feedback and learning from it is becoming increasingly common. Dosu uses this pattern to build AIs that can automatically label issues.

An IDE For DevOps

One of the biggest barriers to building copilots for devops is that when it comes to operating infrastructure we are constantly switching between different modalities

  • We use IDEs/Editors to edit our IAC configs
  • We use terminals to invoke CLIs
  • We use UIs for click ops and visualization
  • We use tickets/docs to capture intent and analysis
  • We use proprietary web apps to get help from AIs

This fragmented experience for operations is a barrier to collecting data that would let us train powerful assistants. Compare this to writing software where a developer can use a single IDE to write, build, and test their software. When these systems are well instrumented you can train really valuable software assistants like Google’s DIDACT and Replit Code Repair.

This is an opportunity to create a better experience for devops even in the absence of AI. A great example of this is what the Runme.dev project is doing. Below is a screenshot of a Runme.dev interactive widget for VMs rendered directly in the notebook.

runme gce renderer

This illustrates a UX where users don’t need to choose between the convenience of ClickOps and being able to log intent and action. Another great example is Datadog Notebooks. When I was at Primer, I found using Datadog notebooks to troubleshoot and document issues was far superior to copying and pasting links and images into tickets or Google Docs.

Conclusion: Leading the AI Wave

If you’re a platform engineer like me you’ve probably spent previous waves of AI building tools to support AI builders; e.g. by exposing GPUs or deploying critical applications like Jupyter. Now we, platform engineers, are in a position to use AI to solve our own problems and better serve our customers. Despite all the excitement about AI, there’s a shortage of examples of AI positively transforming how we work. Let’s make platform engineering a success story for AI.

Acknowledgements

I really appreciate Hamel Husain reviewing and editing this post.

Logging Implicit Human Feedback

Foyle is an open source assistant to help software developers deal with the pain of devops. Developers are expected to operate their software which means dealing with the complexity of Cloud. Foyle aims to simplify operations with AI. One of Foyle’s central premises is that creating a UX that implicitly captures human feedback is critical to building AIs that effectively assist us with operations. This post describes how Foyle logs that feedback.

The Problem

As software developers, we all ask AIs (ChatGPT, Claude, Bard, Ollama, etc.…) to write commands to perform operations. These AIs often make mistakes. This is especially true when the correct answer depends on internal knowledge, which the AI doesn’t have.

  • What region, cluster, or namespace is used for dev vs. prod?
  • What resources is the internal code name “caribou” referring to?
  • What logging schema is used by our internal CICD tool?

The experience today is

  • Ask an assistant for one or more commands
  • Copy those commands to a terminal
  • Iterate on those commands until they are correct

When it comes to building better AIs, the human feedback provided by the last step is gold. Yet today’s UX doesn’t allow us to capture this feedback easily. At best, this feedback is often collected out of band as part of a data curation step. This is problematic for two reasons. First, it’s more expensive because it requires paying for labels (in time or money). Second, if we’re dealing with complex, bespoke internal systems, it can be hard to find people with the requisite expertise.

Frontend

If we want to collect human feedback, we need to create a single unified experience for

  1. Asking the AI for help
  2. Editing/Executing AI suggested operations

If users are copying and pasting between two different applications the likelihood of being able to instrument it to collect feedback goes way down. Fortunately, we already have a well-adopted and familiar pattern for combining exposition, commands/code, and rich output. Its notebooks.

Foyle’s frontend is VSCode notebooks. In Foyle, when you ask an AI for assistance, the output is rendered as cells in the notebook. The cells contain shell commands that can then be used to execute those commands either locally or remotely using the notebook controller API, which talks to a Foyle server. Here’s a short video illustrating the key interactions.

Crucially, cells are central to how Foyle creates a UX that automatically collects human feedback. When the AI generates a cell, it attaches a UUID to that cell. That UUID links the cell to a trace that captures all the processing the AI did to generate it (e.g any LLM calls, RAG calls, etc…). In VSCode, we can use cell metadata to track the UUID associated with a cell.

When a user executes a cell, the frontend sends the contents of the cell along with its UUID to the Foyle server. The UUID then links the cell to a trace of its execution. The cell’s UUID can be used to join the trace of how the AI generated the cell with a trace of what the user actually executed. By comparing the two we can easily see if the user made any corrections to what the AI suggested.

cell ids interaction diagram

Traces

Capturing traces of the AI and execution are essential to logging human feedback. Foyle is designed to run on your infrastructure (whether locally or in your Cloud). Therefore, it’s critical that Foyle not be too opinionated about how traces are logged. Fortunately, this is a well-solved problem. The standard pattern is:

  1. Instrument the app using structured logs
  2. App emits logs to stdout and stderr
  3. When deploying the app collect stdout and stderr and ship them to whatever backend you want to use (e.g. Google Cloud Logging, Datadog, Splunk etc…)

When running locally, setting up an agent to collect logs can be annoying, so Foyle has the built-in ability to log to files. We are currently evaluating the need to add direct support for other backends like Cloud Logging. This should only matter when running locally because if you’re deploying on Cloud chances are your infrastructure is already instrumented to collect stdout and stderr and ship them to your backend of choice.

Don’t reinvent logging

Using existing logging libraries that support structured logging seems so obvious to me that it hardly seems worth mentioning. Except, within the AI Engineering/LLMOps community, it’s not clear to me that people are reusing existing libraries and patterns. Notably, I’m seeing a new class of observability solutions that require you to instrument your code with their SDK. I think this is undesirable as it violates the separation of concerns between how an application is instrumented and how that telemetry is stored, processed, and rendered. My current opinion is that Agent/LLM observability can often be achieved by reusing existing logging patterns. So, in defense of that view, here’s the solution I’ve opted for.

Structured logging means that each log line is a JSON record which can contain arbitrary fields. To Capture LLM or RAG requests and responses, I log them; e.g.

request := openai.ChatCompletionRequest{
			Model:       a.config.GetModel(),
			Messages:    messages,
			MaxTokens:   2000,
			Temperature: temperature,
	    }

log.Info("OpenAI:CreateChatCompletion", "request", request)

This ends up logging the request in JSON format. Here’s an example

{
  "severity": "info",
  "time": 1713818994.8880482,
  "caller": "agent/agent.go:132",
  "function": "github.com/jlewi/foyle/app/pkg/agent.(*Agent).completeWithRetries",
  "message": "OpenAI:CreateChatCompletion response",
  "traceId": "36eb348d00d373e40552600565fccd03",
  "resp": {
    "id": "chatcmpl-9GutlxUSClFaksqjtOg0StpGe9mqu",
    "object": "chat.completion",
    "created": 1713818993,
    "model": "gpt-3.5-turbo-0125",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "To list all the images in Artifact Registry using `gcloud`, you can use the following command:\n\n```bash\ngcloud artifacts repositories list --location=LOCATION\n```\n\nReplace `LOCATION` with the location of your Artifact Registry. For example, if your Artifact Registry is in the `us-central1` location, you would run:\n\n```bash\ngcloud artifacts repositories list --location=us-central1\n```"
        },
        "finish_reason": "stop"
      }
    ],
    "usage": {
      "prompt_tokens": 329,
      "completion_tokens": 84,
      "total_tokens": 413
    },
    "system_fingerprint": "fp_c2295e73ad"
  }
}

A single request can generate multiple log entries. To group all the log entries related to a particular request, I attach a trace id to each log message.

func (a *Agent) Generate(ctx context.Context, req *v1alpha1.GenerateRequest) (*v1alpha1.GenerateResponse, error) {
   span := trace.SpanFromContext(ctx)
   log := logs.FromContext(ctx)
   log = log.WithValues("traceId", span.SpanContext().TraceID())

Since I’ve instrumented Foyle with open telemetry(OTEL), each request is automatically assigned a trace id. I attach that trace id to all the log entries associated with that request. Using the trace id assigned by OTEL means I can link the logs with the open telemetry trace data.

OTEL is an open standard for distributed tracing. I find OTEL great for instrumenting my code to understand how long different parts of my code took, how often errors occur and how many requests I’m getting. You can use OTEL for LLM Observability; here’s an example. However, I chose logs because as noted in the next section they are easier to mine.

Aside: Structured Logging In Python

Python’s logging module supports structured logging. In Python you can use the extra argument to pass an arbitrary dictionary of values. In python the equivalent would be:

logger.info("OpenAI:CreateChatCompletion", extra={'request': request, "traceId": traceId})

You then configure the logging module to use the python-json-logger formatter to emit logs as JSON. Here’s the logging.conf I use for Python.

Logs Need To Be Mined

Post-processing your logs is often critical to unlocking the most valuable insights. In the context of Foyle, I want a record for each cell that captures how it was generated and any subsequent executions of that cell. To produce this, I need to write a simple ETL pipeline that does the following:

  • Build a trace by grouping log entries by trace ID
  • Reykey each trace by the cell id the trace is associated with
  • Group traces by cell id

This logic is highly specific to Foyle. No observability tool will support it out of box.

Consequently, a key consideration for my observability backend is how easily it can be wired up to my preferred ETL tool. Logs processing is such a common use case that most existing logging providers likely have good support for exporting your logs. With Google Cloud Logging for example it’s easy to setup log sinks to route logs to GCS, BigQuery or PubSub for additional processing.

Visualization

The final piece is being able to easily visualize the traces to inspect what’s going on. Arguably, this is where you might expect LLM/AI focused tools might shine. Unfortunately, as the previous section illustrates, the primary way I want to view Foyle’s data is to look at the processing associated with a particular cell. This requires post-processing the raw logs. As a result, out of box visualizations won’t let me view the data in the most meaningful way.

To solve the visualization problem, I’ve built a lightweight progressive web app(PWA) in Go (code) using maxence-charriere/go-app. While I won’t be winning any design awards, it allows me to get the job done quickly and reuse existing libraries. For example, to render markdown as HTML I could reuse the Go libraries I was already using (yuin/goldmark). More importantly, I don’t have to wrestle with a stack(typescript, REACT, etc…) that I’m not proficient in. With Google Logs Analytics, I can query the logs using SQL. This makes it very easy to join and process a trace in the web app. This makes it possible to view traces in real-time without having to build and deploy a streaming pipeline.

Try Foyle

Please consider following the getting started guide to try out an early version of Foyle and share your thoughts by email(jeremy@lewi.us) on GitHub(jlewi/foyle) or on twitter (@jeremylewi)!

About Me

I’m a Machine Learning platform engineer with over 15 years of experience. I create platforms that facilitate the rapid deployment of AI into production. I worked on Google’s Vertex AI where I created Kubeflow, one of the most popular OSS frameworks for ML.

I’m open to new consulting work and other forms of advisory. If you need help with your project, send me a brief email at jeremy@lewi.us.

Acknowledgements

I really appreciate Hamel Husain and Joseph Gleasure reviewing and editing this post.