<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Hamel&#39;s Blog - Hamel Husain</title>
<link>https://hamel.dev/</link>
<atom:link href="https://hamel.dev/index.xml" rel="self" type="application/rss+xml"/>
<description>Notes on applied AI engineering, machine learning, and data science.</description>
<image>
<url>https://hamel.dev/quarto.png</url>
<title>Hamel&#39;s Blog - Hamel Husain</title>
<link>https://hamel.dev/</link>
<height>81</height>
<width>144</width>
</image>
<generator>quarto-1.8.25</generator>
<lastBuildDate>Mon, 02 Mar 2026 08:00:00 GMT</lastBuildDate>
<item>
  <title>Evals Skills for Coding Agents</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/evals-skills/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<p><img src="https://hamel.dev/blog/posts/evals-skills/cover-original.png" class="img-fluid"></p>
<p>Today, I’m publishing <a href="https://github.com/hamelsmu/evals-skills">evals-skills</a>, a set of skills for AI product evals<sup>1</sup>. They guard against common mistakes I’ve seen helping 50+ companies and teaching 4,000+ students in our <a href="https://maven.com/parlance-labs/evals">course</a>.</p>
<section id="why-skills-for-evals" class="level2">
<h2 class="anchored" data-anchor-id="why-skills-for-evals">Why Skills for Evals</h2>
<p>Coding agents now instrument applications, run experiments, analyze data, and build interfaces. I’ve been pointing them at evals.</p>
<p>OpenAI’s Harness Engineering <a href="https://openai.com/index/harness-engineering/">article</a> makes the case well. They built a product entirely with Codex agents — three engineers, five months, ~1 million lines of code — and found that <strong>improving the infrastructure around the agent</strong> mattered more than improving the model. The agents queried traces to verify their own work. Documentation tells the agent what to do. Telemetry tells it whether it worked. Evals tell it whether the output is good.</p>
<p>All major eval vendors now ship an MCP server<sup>2</sup>. The tedious parts (instrumenting your app, orchestrating experiments, and building annotation tools) now fall to coding agents.</p>
<p>But an agent with an eval platform still needs to know what to do with it. Say a support bot tells a customer “your plan includes free returns” when it doesn’t. Another says “I’ve canceled your order” when nobody asked. Both are hallucinations, but one gets a fact wrong and the other makes up a user action. If you lump them together in a generic “hallucination score,” you’ll miss the distinction that tells you what to fix.</p>
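<p>One way to encode that distinction, sketched here with hypothetical prompt wording (not taken from the evals-skills repo), is to give each failure mode its own binary judge rather than one blended score:</p>

```python
# Hypothetical sketch: two separate binary Pass/Fail judges, one per
# failure mode, instead of a single generic "hallucination score".
# Prompt wording is illustrative only.

FACTUAL_JUDGE = """Does the response state a policy or fact that the
provided context does not support? Answer Pass (fully supported) or Fail.

Context: {context}
Response: {response}"""

ACTION_JUDGE = """Does the response claim to have performed an action
(refund, cancellation, account change) the user never requested?
Answer Pass or Fail.

Response: {response}"""

# Each judge gets its own labeled examples and its own pass rate.
prompt = FACTUAL_JUDGE.format(
    context="Plan: Basic. Returns: customer pays shipping.",
    response="Your plan includes free returns.",
)
```

<p>Tracking the two pass rates separately tells you which failure mode to fix first; a single blended score hides that.</p>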
<p>These skills fill the gaps. They complement the vendor MCP servers: those give your agent access to traces and experiments, these teach it what to do with them.</p>
</section>
<section id="the-skills" class="level2">
<h2 class="anchored" data-anchor-id="the-skills">The Skills</h2>
<p>If you’re new to evals or inheriting an existing eval pipeline, start with <strong>eval-audit</strong>. It inspects your current setup (or lack of one), runs diagnostic checks across six areas, and produces a prioritized list of problems with next steps. Install the skills or give your agent this prompt:</p>
<blockquote class="blockquote">
<p>Install the eval skills plugin from https://github.com/hamelsmu/evals-skills, then run /evals-skills:eval-audit on my eval pipeline. Investigate each diagnostic area using a separate subagent in parallel, then synthesize the findings into a single report. Use other skills in the plugin as recommended by the audit.</p>
</blockquote>
<p>If you’re experienced with evals, you can skip the audit and pick the skill you need:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 35%">
<col style="width: 65%">
</colgroup>
<thead>
<tr class="header">
<th>Skill</th>
<th>What it does</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>error-analysis</td>
<td>Read traces, categorize failures, build a vocabulary of what’s broken</td>
</tr>
<tr class="even">
<td>generate-synthetic-data</td>
<td>Create diverse test inputs when real data is sparse</td>
</tr>
<tr class="odd">
<td>write-judge-prompt</td>
<td>Design binary Pass/Fail LLM-as-Judge evaluators</td>
</tr>
<tr class="even">
<td>validate-evaluator</td>
<td>Calibrate judges against human labels using TPR/TNR and bias correction</td>
</tr>
<tr class="odd">
<td>evaluate-rag</td>
<td>Evaluate retrieval and generation quality separately</td>
</tr>
<tr class="even">
<td>build-review-interface</td>
<td>Generate annotation interfaces for human trace review</td>
</tr>
</tbody>
</table>
<p><br></p>
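<p>As an illustration of the kind of calibration math a skill like validate-evaluator involves, here is the standard sensitivity/specificity correction (the Rogan-Gladen estimator). This is an assumed sketch, not the skill's actual implementation:</p>

```python
# Sketch of judge calibration against human labels (assumed math, not
# the exact evals-skills implementation). 1 = Fail, 0 = Pass.
human = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]   # human verdicts (ground truth)
judge = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]   # LLM judge's verdicts

tp = sum(h and j for h, j in zip(human, judge))
tn = sum((not h) and (not j) for h, j in zip(human, judge))
tpr = tp / sum(human)                     # sensitivity: true Fails caught
tnr = tn / (len(human) - sum(human))      # specificity: true Passes kept

# Observed failure rate on unlabeled traces, corrected for judge bias:
# theta = (p_obs + TNR - 1) / (TPR + TNR - 1)
p_obs = 0.30
theta = (p_obs + tnr - 1) / (tpr + tnr - 1)
print(round(tpr, 2), round(tnr, 2), round(theta, 3))  # → 0.8 0.8 0.167
```

<p>With a judge that catches 80% of true failures and passes 80% of true successes, an observed 30% failure rate corresponds to a corrected estimate of about 17%.</p>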
<p>These skills are a starting point and only encode common mistakes that generalize across projects. Skills grounded in your stack, your domain, and your data will outperform them. Start here, then write your own.</p>
<p>👉 The repo is here: <a href="https://github.com/hamelsmu/evals-skills">github.com/hamelsmu/evals-skills</a> 👈</p>
<p>If these skills help you, I’d love to hear from you! You can find me on <a href="https://x.com/hamelhusain">X</a> or email me through my <a href="https://ai.hamel.dev">newsletter</a>.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Not foundation model benchmarks like MMLU or HELM that measure general LLM capabilities. Product evals measure whether <em>your</em> pipeline works on <em>your</em> task with <em>your</em> data. If you aren’t familiar with product-specific AI evals, check out my <a href="../../../blog/posts/evals-faq/index.html">AI Evals FAQ</a>.↩︎</p></li>
<li id="fn2"><p><a href="https://www.braintrust.dev/docs/reference/mcp">Braintrust</a>, <a href="https://github.com/langchain-ai/langsmith-mcp-server">LangSmith</a>, <a href="https://github.com/Arize-ai/phoenix/tree/main/js/packages/phoenix-mcp">Phoenix</a>, <a href="https://truesight.goodeyelabs.com/docs/mcp-integration">Truesight</a>, and others.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>AI</category>
  <category>Evals</category>
  <guid>https://hamel.dev/blog/posts/evals-skills/</guid>
  <pubDate>Mon, 02 Mar 2026 08:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/evals-skills/cover-gradient.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>Why I Stopped Using nbdev</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/ai-stack/</link>
  <description><![CDATA[ 





<p>Programmers love to proclaim they’ve found the best tool. Paul Graham called Lisp his “<a href="https://paulgraham.com/avg.html">secret weapon</a>.” DHH described Ruby as “<a href="https://rubyonrails.org/doctrine">a magical glove that just fit my brain perfectly</a>.” Pieter Levels ships million-dollar products with <a href="https://lexfridman.com/pieter-levels-transcript/">vanilla PHP and jQuery</a>.</p>
<p>These declarations aren’t about the languages themselves. They’re about developers finding tools that fit how they think. When the environment clicks, you move fast.</p>
<p>I had that experience with <a href="https://www.fast.ai/posts/2022-07-28-nbdev2.html">nbdev</a>, a development environment for literate programming that I helped build and maintain<sup>1</sup>. I created hundreds of projects with it and was one of its <a href="https://www.youtube.com/watch?v=rX1yGxJijsI">biggest</a> <a href="https://github.blog/developer-skills/programming-languages-and-frameworks/nbdev-a-literate-programming-environment-that-democratizes-software-engineering-best-practices/">proponents</a>.</p>
<p>Today, I no longer use it. AI coding tools changed the trade-offs.</p>
<section id="fighting-the-ai" class="level2">
<h2 class="anchored" data-anchor-id="fighting-the-ai">Fighting the AI</h2>
<p>The beauty of nbdev is its workflow. You write code, documentation, and tests in one source of truth: Jupyter notebooks. Afterwards, these notebooks are transpiled into a Python library and documentation website.</p>
<p>This workflow is idiosyncratic. AI coding tools, trained on vast amounts of conventional source code, get confused. They struggle to differentiate between editing the notebook and editing the final source code. It feels like fighting the AI instead of working with it.</p>
<p>I write software to solve problems, not to write code. I want to work in an environment where AI has the highest chance of success. With nbdev, I was swimming upstream.</p>
<p>Some argue that AI tools encourage lazy thinking: that without guardrails, developers skip the hard work of breaking problems into steps. But thinking step-by-step is a human skill. Notebooks don’t force you to write clean code. AI tools don’t force you to think carefully. Discipline comes from the developer, not the environment.</p>
</section>
<section id="tools-dont-matter-as-much-as-i-thought" class="level2">
<h2 class="anchored" data-anchor-id="tools-dont-matter-as-much-as-i-thought">Tools Don’t Matter As Much As I Thought</h2>
<p>A central promise of literate programming is better documentation. By keeping code and docs in one place, you reduce the chance they become stale.</p>
<p>Strangely, many nbdev projects lacked sufficient documentation for my taste. Sometimes, this helped me learn a codebase by <a href="https://fastpages.fast.ai/fastcore/">contributing</a> to the docs. Other times, it was frustrating. This reinforced my belief that good documentation comes from effort, not tooling.</p>
<p>This workflow is also less compelling now. AI can read a codebase without documentation and give you an overview on the fly. It can help maintain documentation that is separate from the code, handling the tedious parts. Keeping code and docs together isn’t the selling point it used to be.</p>
</section>
<section id="collaboration-and-adoption" class="level2">
<h2 class="anchored" data-anchor-id="collaboration-and-adoption">Collaboration and Adoption</h2>
<p>nbdev asks developers to adopt a different system. It does not meet them where they are. Cursor won because it felt familiar and let developers change their habits slowly, rather than demanding a new workflow on day one.</p>
<p>I didn’t worry about collaboration as much before. But collaborating with AI is table stakes. The same impediments that get in the way of collaborating with humans tend to get in the way of collaborating with AI.</p>
<p>Today, developers cover an increasingly broad scope. Backend people do frontend. PMs create prototypes. Everyone is more polyglot. Idiosyncratic frameworks isolate you from your team to a greater extent than ever before. Idiosyncratic tooling once had a hidden upside: it filtered for a certain kind of contributor. Now I believe it’s more of a liability.</p>
</section>
<section id="what-im-using-now" class="level2">
<h2 class="anchored" data-anchor-id="what-im-using-now">What I’m Using Now</h2>
<p>Because I have invested thousands of hours into nbdev, it’s difficult to admit there are better tools for the outcomes I want. But I must check my ego at the door<sup>2</sup>.</p>
<p>My daily drivers now include Amp, Cursor, and Claude Code. I still enjoy notebooks, but only for data analysis, machine learning, or other exploratory workflows where the iterative, visual nature of notebooks shines.</p>
<p>More importantly, AI has nudged me out of my previous “Python for everything” mindset. I now use different languages for different tasks. For web development, I prefer the Next.js stack. Using a notebook for web development (even with specialized tooling) adds unnecessary complexity for me.</p>
<p>This isn’t arbitrary preference. AI performs best on code with abundant training data for specific domains. <a href="https://github.blog/news-insights/octoverse/octoverse-a-new-developer-joins-github-every-second-as-ai-leads-typescript-to-1/">TypeScript recently overtook both Python and JavaScript on GitHub</a>, driven partly by the fact that typed languages make AI-generated code more reliable in production<sup>3</sup>. Even Jane Street, famous for its OCaml-heavy infrastructure, <a href="https://www.efinancialcareers.com/news/python-ocaml-jane-street-ai">now uses Python</a> for machine learning and data work.</p>
</section>
<section id="a-place-for-joy" class="level2">
<h2 class="anchored" data-anchor-id="a-place-for-joy">A Place for Joy</h2>
<p>None of this means idiosyncratic tools are worthless. Lisp, Haskell, and APL each teach you something different about computing. Joy is a valid reason to choose a language, even with less AI support. It’s just not my focus right now. My joy resides in solving problems, and I want tools that maximize my leverage. For that, conventional wins.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>I <a href="https://github.blog/developer-skills/programming-languages-and-frameworks/nbdev-a-literate-programming-environment-that-democratizes-software-engineering-best-practices/">joined the nbdev project in 2020</a> while at GitHub. That same year I built <a href="https://www.youtube.com/watch?v=cduXZwZaBbM">fastpages</a>, a notebook blogging system that informed nbdev’s documentation approach. Along the way I contributed to tools like <a href="https://github.blog/developer-skills/programming-languages-and-frameworks/learn-about-ghapi-a-new-third-party-python-client-for-the-github-api/">ghapi</a> and <a href="https://fastpages.fast.ai/fastcore/">fastcore</a>. In 2022, I helped lead a <a href="https://www.fast.ai/posts/2022-07-28-nbdev2.html">complete rewrite</a> of nbdev. I discussed the philosophy behind this work on the <a href="https://vanishinggradients.fireside.fm/9">Vanishing Gradients Podcast</a> and at <a href="https://www.youtube.com/watch?v=rX1yGxJijsI">Data Council</a>. I was briefly interested in commercializing nbdev, but <a href="../../../blog/posts/nbdev/">decided not to pursue it</a>.↩︎</p></li>
<li id="fn2"><p>Other top nbdev maintainers and power users like <a href="https://github.com/sgugger">Sylvain Gugger</a>, <a href="https://github.com/seeM">Wasim Lorgat</a>, <a href="https://github.com/isaac-flath">Isaac Flath</a>, <a href="https://github.com/muellerzr">Zach Mueller</a>, and <a href="https://ohmeow.com/">Wayde Gilliam</a> have made similar moves.↩︎</p></li>
<li id="fn3"><p>Research shows 94% of LLM compilation errors in TypeScript come from type violations, suggesting type systems can guide better code generation. See <a href="https://arxiv.org/abs/2504.09246">Mündler et al., 2025</a>.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>AI</category>
  <category>nbdev</category>
  <category>Tools</category>
  <guid>https://hamel.dev/blog/posts/ai-stack/</guid>
  <pubDate>Sun, 18 Jan 2026 08:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/ai-stack/nbdev2-cover.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>LLM Evals: Everything You Need to Know</title>
  <dc:creator>Hamel Husain</dc:creator>
  <dc:creator>Shreya Shankar</dc:creator>
  <link>https://hamel.dev/blog/posts/evals-faq/</link>
  <description><![CDATA[ 

<script async="" data-uid="df8af2b6ed" src="https://hamel.kit.com/df8af2b6ed/index.js"></script>




<p>This document curates the most common questions Shreya and I received while <a href="https://bit.ly/evals-ai" target="_blank">teaching</a> 700+ engineers &amp; PMs AI Evals. <em>Warning: These are sharp opinions about what works in most cases. They are not universal truths. Use your judgment.</em></p>
<hr>
<div class="cta" style="text-align: center;">
<p><strong>👉 <em>Want to learn more about AI Evals? Check out our <a href="https://bit.ly/evals-ai" target="_blank">AI Evals course</a></em></strong>. It’s a live cohort with hands-on exercises and office hours. Here is a <a href="https://bit.ly/evals-ai" target="_blank">25% discount code</a> for readers. 👈</p>
</div>
<hr>
<section id="listen-to-the-audio-version-of-this-faq" class="level1">
<h1>Listen to the audio version of this FAQ</h1>
<p>If you prefer to listen to the audio version (narrated by AI), you can play it <a href="https://soundcloud.com/hamel-husain/llm-evals-faq">here</a>.</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" allow="autoplay" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/2138083206&amp;color=%23447099&amp;auto_play=false&amp;hide_related=true&amp;show_comments=false&amp;show_user=true&amp;show_reposts=false&amp;show_teaser=true">
</iframe>
<div style="font-size: 10px; color: #cccccc;line-break: anywhere;word-break: normal;overflow: hidden;white-space: nowrap;text-overflow: ellipsis; font-family: Interstate,Lucida Grande,Lucida Sans Unicode,Lucida Sans,Garuda,Verdana,Tahoma,sans-serif;font-weight: 100;">
<a href="https://soundcloud.com/hamel-husain" title="Hamel Husain" target="_blank" style="color: #cccccc; text-decoration: none;">Hamel Husain</a> · <a href="https://soundcloud.com/hamel-husain/llm-evals-faq" title="LLM Evals FAQ" target="_blank" style="color: #cccccc; text-decoration: none;">LLM Evals FAQ</a>
</div>
</section>
<section id="getting-started-fundamentals" class="level1">
<h1>Getting Started &amp; Fundamentals</h1>
<section id="q-what-are-llm-evals" class="level2">
<h2 class="anchored" data-anchor-id="q-what-are-llm-evals">Q: What are LLM Evals?</h2>
<p>If you are completely new to product-specific LLM evals (not foundation model benchmarks), see these posts: <a href="../../../blog/posts/evals/index.html" target="_blank">part 1</a>, <a href="../../../blog/posts/llm-judge/index.html" target="_blank">part 2</a> and <a href="../../../blog/posts/field-guide/index.html" target="_blank">part 3</a>. Otherwise, keep reading.</p>
<div class="grid">
<div class="g-col-4">
<p><a href="https://hamel.dev/evals" target="_blank"><img src="https://hamel.dev/blog/posts/evals/images/diagram-cover.png" class="img-fluid"></a></p>
<p><a href="https://hamel.dev/evals" target="_blank"><strong>Your AI Product Needs Eval (Evaluation Systems)</strong></a></p>
<p><strong>Contents:</strong></p>
<ol type="1">
<li>Motivation</li>
<li>Iterating Quickly == Success<br>
</li>
<li>Case Study: Lucy, A Real Estate AI Assistant</li>
<li>The Types Of Evaluation
<ol type="a">
<li>Level 1: Unit Tests</li>
<li>Level 2: Human &amp; Model Eval</li>
<li>Level 3: A/B Testing</li>
<li>Evaluating RAG</li>
</ol></li>
<li>Eval Systems Unlock Superpowers For Free
<ol type="a">
<li>Fine-Tuning</li>
<li>Data Synthesis &amp; Curation</li>
<li>Debugging</li>
</ol></li>
</ol>
</div>
<div class="g-col-4">
<p><a href="https://hamel.dev/llm-judge/" target="_blank"><img src="https://hamel.dev/blog/posts/llm-judge/images/cover_img.png" class="img-fluid"></a></p>
<p><a href="https://hamel.dev/llm-judge/" target="_blank"><strong>Creating a LLM-as-a-Judge That Drives Business Results</strong></a></p>
<p><strong>Contents:</strong></p>
<ol type="1">
<li>The Problem: AI Teams Are Drowning in Data</li>
<li>Step 1: Find The Principal Domain Expert</li>
<li>Step 2: Create a Dataset</li>
<li>Step 3: Direct The Domain Expert to Make Pass/Fail Judgments with Critiques</li>
<li>Step 4: Fix Errors</li>
<li>Step 5: Build Your LLM as A Judge, Iteratively</li>
<li>Step 6: Perform Error Analysis</li>
<li>Step 7: Create More Specialized LLM Judges (if needed)</li>
<li>Recap of Critique Shadowing</li>
<li>Resources</li>
</ol>
</div>
<div class="g-col-4">
<p><a href="https://hamel.dev/field-guide" target="_blank"><img src="https://hamel.dev/blog/posts/field-guide/images/field_guide_2.png" class="img-fluid"></a></p>
<p><a href="https://hamel.dev/field-guide" target="_blank"><strong>A Field Guide to Rapidly Improving AI Products</strong></a></p>
<p><strong>Contents:</strong></p>
<ol type="1">
<li>How error analysis consistently reveals the highest-ROI improvements</li>
<li>Why a simple data viewer is your most important AI investment</li>
<li>How to empower domain experts (not just engineers) to improve your AI</li>
<li>Why synthetic data is more effective than you think</li>
<li>How to maintain trust in your evaluation system</li>
<li>Why your AI roadmap should count experiments, not features</li>
</ol>
</div>
</div>
<p><a href="../../../blog/posts/evals-faq/what-are-llm-evals.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-what-is-a-trace" class="level2">
<h2 class="anchored" data-anchor-id="q-what-is-a-trace">Q: What is a trace?</h2>
<p>A trace is the complete record of all actions, messages, tool calls, and data retrievals from a single initial user query through to the final response. It includes every step across all agents, tools, and system components in a session: multiple user messages, assistant responses, retrieved documents, and intermediate tool interactions.</p>
<p><strong>Note on terminology:</strong> Different observability vendors use varying definitions of traces and spans. <a href="https://mlops.systems/posts/2025-06-04-instrumenting-an-agentic-app-with-arize-phoenix-and-litellm.html#llm-tracing-tools-naming-conventions-june-2025">Alex Strick van Linschoten’s analysis</a> highlights these differences (screenshot below):</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/evals-faq/alex.jpeg" class="img-fluid figure-img"></p>
<figcaption>Vendor differences in trace definitions as of 2025-07-02</figcaption>
</figure>
</div>
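<p>Regardless of vendor terminology, a trace reduces to an ordered record of steps from the initial query to the final response. A minimal sketch (field names are illustrative, not any vendor's schema):</p>

```python
# Minimal sketch of a trace per the definition above; field names are
# illustrative, not any observability vendor's schema.
from dataclasses import dataclass, field

@dataclass
class Step:
    kind: str      # "user", "assistant", "tool_call", "retrieval"
    content: str

@dataclass
class Trace:
    session_id: str
    steps: list[Step] = field(default_factory=list)

trace = Trace("sess-42", [
    Step("user", "Where is my order?"),
    Step("tool_call", "lookup_order(id=981)"),
    Step("retrieval", "Order 981: shipped 2024-05-01"),
    Step("assistant", "Your order shipped on May 1."),
])
print(len(trace.steps))  # every step from query to final response
```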
<p><a href="../../../blog/posts/evals-faq/what-is-a-trace.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-whats-a-minimum-viable-evaluation-setup" class="level2">
<h2 class="anchored" data-anchor-id="q-whats-a-minimum-viable-evaluation-setup">Q: What’s a minimum viable evaluation setup?</h2>
<p>Start with error analysis, not infrastructure. Spend 30 minutes manually reviewing 20-50 LLM outputs whenever you make significant changes. Use one domain expert who understands your users as your quality decision maker (a “benevolent dictator”).</p>
<p>If possible, <strong>use notebooks</strong> to help you review traces and analyze data. In our opinion, this is the single most effective tool for evals because you can write arbitrary code, visualize data, and iterate quickly. You can even build your own custom annotation interface right inside notebooks, as shown in this <a href="https://youtu.be/aqKUwPKBkB0?si=5KDmMQnRzO_Ce9xH" target="_blank">video</a>.</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/aqKUwPKBkB0" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
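<p>Before building any custom interface, the review loop itself can be a few lines in a notebook cell. A hypothetical minimal version (annotation values are hardcoded so the sketch runs end to end; in practice you would display each output and fill in the verdict and note by hand):</p>

```python
# Hypothetical notebook review loop: sample traces, then record a
# pass/fail verdict and an open-ended note for each one.
import random

traces = [{"id": i, "output": f"response {i}"} for i in range(200)]
random.seed(0)
sample = random.sample(traces, 20)  # 20-50 outputs per review session

annotations = []
for t in sample:
    # Real use: display t["output"], then fill these in by hand.
    annotations.append({"id": t["id"], "pass": True, "note": ""})
```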
<p><a href="../../../blog/posts/evals-faq/whats-a-minimum-viable-evaluation-setup.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-much-of-my-development-budget-should-i-allocate-to-evals" class="level2">
<h2 class="anchored" data-anchor-id="q-how-much-of-my-development-budget-should-i-allocate-to-evals">Q: How much of my development budget should I allocate to evals?</h2>
<p>It’s important to recognize that evaluation is part of the development process rather than a distinct line item, similar to how debugging is part of software development.</p>
<p>You should always be doing <a href="https://www.youtube.com/watch?v=qH1dZ8JLLdU" target="_blank">error analysis</a>. When you discover issues through error analysis, many will be straightforward bugs you’ll fix immediately. These fixes don’t require separate evaluation infrastructure as they’re just part of development.</p>
<p>The decision to build automated evaluators comes down to cost-benefit analysis. If you can catch an error with a simple assertion or regex check, the cost is minimal and probably worth it. But if you need to align an LLM-as-judge evaluator, consider whether the failure mode warrants that investment.</p>
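<p>For example, a cheap code-based evaluator can be a handful of deterministic rules (the rules below are illustrative, not from any particular project):</p>

```python
# Example of the cheap assertion/regex checks mentioned above
# (illustrative rules, not from any specific project).
import re

def check(response: str) -> list[str]:
    """Return failure labels caught by simple deterministic rules."""
    failures = []
    if re.search(r"\bAs an AI\b", response, re.IGNORECASE):
        failures.append("meta-disclaimer")
    if re.search(r"\$\d", response) and "estimate" not in response.lower():
        failures.append("unhedged-price")
    return failures

print(check("As an AI, I think that costs $50."))
# → ['meta-disclaimer', 'unhedged-price']
```

<p>Checks like these cost almost nothing to run on every trace, which is exactly why they clear the cost-benefit bar that an aligned LLM-as-judge may not.</p>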
<p>In the projects we’ve worked on, <strong>we’ve spent 60-80% of our development time on error analysis and evaluation</strong>. Expect most of your effort to go toward understanding failures (i.e.&nbsp;looking at data) rather than building automated checks.</p>
<p>Be <a href="https://ai-execs.com/2_intro.html#a-case-study-in-misleading-ai-advice" target="_blank">wary of optimizing for high eval pass rates</a>. If you’re passing 100% of your evals, you’re likely not challenging your system enough. A 70% pass rate might indicate a more meaningful evaluation that’s actually stress-testing your application. Focus on evals that help you catch real issues, not ones that make your metrics look good.</p>
<p><a href="../../../blog/posts/evals-faq/how-much-of-my-development-budget-should-i-allocate-to-evals.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-will-todays-evaluation-methods-still-be-relevant-in-5-10-years-given-how-fast-ai-is-changing" class="level2">
<h2 class="anchored" data-anchor-id="q-will-todays-evaluation-methods-still-be-relevant-in-5-10-years-given-how-fast-ai-is-changing">Q: Will today’s evaluation methods still be relevant in 5-10 years given how fast AI is changing?</h2>
<p>Yes. Even with perfect models, you still need to verify they’re solving the right problem. The need for systematic error analysis, domain-specific testing, and monitoring will still be important.</p>
<p>Today’s prompt engineering tricks might become obsolete, but you’ll still need to understand failure modes. Additionally, an LLM cannot read your mind, and <a href="https://arxiv.org/abs/2404.12272" target="_blank">research shows</a> that people need to observe the LLM’s behavior in order to properly externalize their requirements.</p>
<p>For deeper perspective on this debate, see these two viewpoints: <a href="https://m.youtube.com/watch?si=qknrtQeITqJ7VsJH&amp;v=4dUFIRj-BWo&amp;feature=youtu.be" target="_blank">“The model is the product”</a> versus <a href="https://www.youtube.com/watch?v=EEw2PpL-_NM" target="_blank">“The model is NOT the product”</a>.</p>
<p><strong>“The model is the product”:</strong> </p><div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/4dUFIRj-BWo" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div><p></p>
<p><strong>“The model is NOT the product”:</strong> </p><div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/EEw2PpL-_NM" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div><p></p>
<p><a href="../../../blog/posts/evals-faq/will-these-evaluation-methods-still-be-relevant-in-5-10-years-given-how-fast-ai-is-changing.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-do-i-make-the-case-for-investing-in-evaluations-to-my-team" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-i-make-the-case-for-investing-in-evaluations-to-my-team">Q: How do I make the case for investing in evaluations to my team?</h2>
<p>Don’t try to sell your team on “evals”. Instead, show them what you find when you look at the data.</p>
<p>Start by doing the error analysis yourself. Look at 50 to 100 real user conversations and find the most common ways the product is failing. Use these findings to tell a story with data.</p>
<p>Present your team with:</p>
<ul>
<li>A list of the top failure modes you discovered.</li>
<li>Metrics showing how often high-impact errors are happening.</li>
<li>Surprising ways that users are interacting with the product.</li>
<li>Reports on the bugs you found and fixed, framed as “prevented production issues”.</li>
</ul>
<p>This approach builds trust. Don’t just show dashboards and metrics; tell the story of what you’re finding in the data. By narrating your findings, you teach the team what you’re learning, providing immediate value. When you fix an issue, show how the error rate for that specific problem went down. Soon, your team will see the progress and ask how you’re doing it. Let results instead of methods lead the conversation.</p>
<p>This is similar to classic machine learning projects, where outcomes are speculative and progress is bounded by <a href="https://hamel.dev/blog/posts/field-guide/#your-ai-roadmap-should-count-experiments-not-features" target="_blank">iterating on experiments</a>. In this situation, it’s important that you share the learnings from each experiment to show progress and encourage investment.</p>
<p><a href="../../../blog/posts/evals-faq/how-do-i-make-the-case-for-investing-in-evaluations-to-my-team.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="error-analysis-data-collection" class="level1">
<h1>Error Analysis &amp; Data Collection</h1>
<section id="q-why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed" class="level2">
<h2 class="anchored" data-anchor-id="q-why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed">Q: Why is "error analysis" so important in LLM evals, and how is it performed?</h2>
<p>Error analysis is <strong>the most important activity in evals</strong>. Error analysis helps you decide what evals to write in the first place. It allows you to identify failure modes unique to your application and data. The process involves:</p>
<section id="creating-a-dataset" class="level3">
<h3 class="anchored" data-anchor-id="creating-a-dataset">1. Creating a Dataset</h3>
<p>Gathering representative traces of user interactions with the LLM. If you do not have any data, you can generate synthetic data to get started.</p>
</section>
<section id="open-coding" class="level3">
<h3 class="anchored" data-anchor-id="open-coding">2. Open Coding</h3>
<p>Human annotator(s) (ideally a benevolent dictator) review and write open-ended notes about traces, noting any issues. This process is akin to “journaling” and is adapted from qualitative research methodologies. When beginning, it is recommended to focus on noting the first failure observed in a trace, as upstream errors can cause downstream issues, though you can also tag all independent failures if feasible. A <a href="https://hamel.dev/blog/posts/llm-judge/#step-1-find-the-principal-domain-expert" target="_blank">domain expert</a> should be performing this step.</p>
</section>
<section id="axial-coding" class="level3">
<h3 class="anchored" data-anchor-id="axial-coding">3. Axial Coding</h3>
<p>Categorize the open-ended notes into a “failure taxonomy.” In other words, group similar failures into distinct categories. This is the most important step. At the end, count the number of failures in each category. You can use an LLM to help with this step.</p>
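<p>The counting at the end of this step can be as simple as a frequency table over the taxonomy labels (the category names below are made up for illustration):</p>

```python
# Sketch of the counting step: open-coding notes already mapped to
# taxonomy categories (category names are made up for illustration).
from collections import Counter

labels = ["wrong-policy-fact", "fabricated-action", "wrong-policy-fact",
          "ignored-context", "wrong-policy-fact", "fabricated-action"]

taxonomy = Counter(labels)
for category, n in taxonomy.most_common():
    print(category, n)
# → wrong-policy-fact 3, fabricated-action 2, ignored-context 1
```

<p>The resulting counts tell you which failure category to attack first.</p>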
</section>
<section id="iterative-refinement" class="level3">
<h3 class="anchored" data-anchor-id="iterative-refinement">4. Iterative Refinement</h3>
<p>Keep iterating on more traces until you reach <a href="https://delvetool.com/blog/theoreticalsaturation" target="_blank">theoretical saturation</a>, meaning new traces do not seem to reveal new failure modes or information to you. As a rule of thumb, you should aim to review at least 100 traces.</p>
<p>You should frequently revisit this process. There are advanced ways to <a href="how-can-i-efficiently-sample-production-traces-for-review.html" target="_blank">sample data more efficiently</a>, like clustering, sorting by user feedback, and sorting by high probability failure patterns. Over time, you’ll develop a “nose” for where to look for failures in your data.</p>
<p>Do not skip error analysis. It ensures that the evaluation metrics you develop are supported by real application behaviors instead of counter-productive generic metrics (which most platforms nudge you to use). For examples of how error analysis can be helpful, see <a href="https://www.youtube.com/watch?v=e2i6JbU2R-s" target="_blank">this video</a>, or this <a href="https://hamel.dev/blog/posts/field-guide/" target="_blank">blog post</a>.</p>
<p>Here is a visualization of the error analysis process by one of our students, <a href="https://www.linkedin.com/in/pawel-huryn/" target="_blank">Pawel Huryn</a> - including how it fits into the overall evaluation process:</p>
<p><img src="https://hamel.dev/blog/posts/evals-faq/pawel-error-analysis.png" class="img-fluid"></p>
<p><a href="../../../blog/posts/evals-faq/why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-how-do-i-surface-problematic-traces-for-review-beyond-user-feedback" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-i-surface-problematic-traces-for-review-beyond-user-feedback">Q: How do I surface problematic traces for review beyond user feedback?</h2>
<p>While user feedback is a good way to narrow in on problematic traces, other methods are also useful. Here are three complementary approaches:</p>
<section id="start-with-random-sampling" class="level3">
<h3 class="anchored" data-anchor-id="start-with-random-sampling">Start with random sampling</h3>
<p>The simplest approach is reviewing a random sample of traces. If you find few issues, escalate to stress testing: create queries that deliberately test your prompt constraints to see if the AI follows your rules.</p>
</section>
<section id="use-evals-for-initial-screening" class="level3">
<h3 class="anchored" data-anchor-id="use-evals-for-initial-screening">Use evals for initial screening</h3>
<p>Use existing evals to find problematic traces and potential issues. Once you’ve identified these, you can proceed with the typical evaluation process starting with error analysis.</p>
</section>
<section id="leverage-efficient-sampling-strategies" class="level3">
<h3 class="anchored" data-anchor-id="leverage-efficient-sampling-strategies">Leverage efficient sampling strategies</h3>
<p>For more sophisticated trace discovery, use outlier detection, metric-based sorting, and stratified sampling to find interesting traces. Generic metrics can serve as exploration signals to identify traces worth reviewing, even if they don’t directly measure quality.</p>
<p><a href="../../../blog/posts/evals-faq/how-do-i-surface-problematic-traces-for-review-beyond-user-feedback.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-how-often-should-i-re-run-error-analysis-on-my-production-system" class="level2">
<h2 class="anchored" data-anchor-id="q-how-often-should-i-re-run-error-analysis-on-my-production-system">Q: How often should I re-run error analysis on my production system?</h2>
<p>Re-run error analysis when making significant changes: new features, prompt updates, model switches, or major bug fixes. A useful heuristic is to set a goal for reviewing <em>at least</em> 100+ fresh traces each review cycle. Typical review cycles we’ve seen range from 2-4 weeks. See this FAQ on how to sample traces effectively.</p>
<p>Between major analyses, review 10-20 traces weekly, focusing on outliers: unusually long conversations, sessions with multiple retries, or traces flagged by automated monitoring. Adjust frequency based on system stability and usage growth. New systems need weekly analysis until failure patterns stabilize. Mature systems might need only monthly analysis unless usage patterns change. Always analyze after incidents, user complaint spikes, or metric drift. Scaling usage introduces new edge cases.</p>
<p><a href="../../../blog/posts/evals-faq/how-often-should-i-re-run-error-analysis-on-my-production-system.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-what-is-the-best-approach-for-generating-synthetic-data" class="level2">
<h2 class="anchored" data-anchor-id="q-what-is-the-best-approach-for-generating-synthetic-data">Q: What is the best approach for generating synthetic data?</h2>
<p>A common mistake is prompting an LLM to <code>"give me test queries"</code> without structure, resulting in generic, repetitive outputs. A structured approach using dimensions produces far better synthetic data for testing LLM applications.</p>
<p><strong>Start by defining dimensions</strong>: categories that describe different aspects of user queries. Each dimension captures one type of variation in user behavior. For example:</p>
<ul>
<li>For a recipe app, dimensions might include Dietary Restriction (<em>vegan</em>, <em>gluten-free</em>, <em>none</em>), Cuisine Type (<em>Italian</em>, <em>Asian</em>, <em>comfort food</em>), and Query Complexity (<em>simple request</em>, <em>multi-step</em>, <em>edge case</em>).</li>
<li>For a customer support bot, dimensions could be Issue Type (<em>billing</em>, <em>technical</em>, <em>general</em>), Customer Mood (<em>frustrated</em>, <em>neutral</em>, <em>happy</em>), and Prior Context (<em>new issue</em>, <em>follow-up</em>, <em>resolved</em>).</li>
</ul>
<p><strong>Start with failure hypotheses</strong>. If you lack intuition about failure modes, use your application extensively or recruit friends to use it. Then choose dimensions targeting those likely failures.</p>
<p><strong>Create tuples manually first</strong>: Write 20 tuples by hand—specific combinations selecting one value from each dimension. Example: (<em>Vegan</em>, <em>Italian</em>, <em>Multi-step</em>). This manual work helps you understand your problem space.</p>
<p><strong>Scale with two-step generation</strong>:</p>
<ol type="1">
<li><strong>Generate structured tuples</strong>: Have the LLM create more combinations like (<em>Gluten-free</em>, <em>Asian</em>, <em>Simple</em>)</li>
<li><strong>Convert tuples to queries</strong>: In a separate prompt, transform each tuple into natural language</li>
</ol>
<p>This separation avoids repetitive phrasing. The (<em>Vegan</em>, <em>Italian</em>, <em>Multi-step</em>) tuple becomes: <code>"I need a dairy-free lasagna recipe that I can prep the day before."</code></p>
<section id="generation-approaches" class="level3">
<h3 class="anchored" data-anchor-id="generation-approaches">Generation approaches</h3>
<p>You can generate tuples two ways:</p>
<p><strong>Cross product then filter</strong>: Generate all dimension combinations, then filter with an LLM. Guarantees coverage including edge cases. Use when most combinations are valid.</p>
<p><strong>Direct LLM generation</strong>: Ask the LLM to generate tuples directly. More realistic but tends toward generic outputs and misses rare scenarios. Use when many dimension combinations are invalid.</p>
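<p>The cross-product approach can be sketched as follows. The dimensions, the filtering rule, and the prompt template are illustrative assumptions; in practice the filter and the tuple-to-query conversion in step 2 would each be an LLM call:</p>

```python
from itertools import product

# Hypothetical dimensions for a recipe app.
dimensions = {
    "dietary": ["vegan", "gluten-free", "none"],
    "cuisine": ["Italian", "Asian", "comfort food"],
    "complexity": ["simple", "multi-step", "edge case"],
}

# Step 1: generate all structured tuples via the cross product.
tuples = list(product(*dimensions.values()))  # 3 x 3 x 3 = 27 combinations

# Filter out invalid combinations (here a toy rule; an LLM could do this).
valid = [t for t in tuples if t != ("none", "comfort food", "edge case")]

# Step 2: convert each tuple into a natural-language query.
# In practice this prompt would be sent to an LLM; here we just build it.
def query_prompt(dietary, cuisine, complexity):
    return (
        f"Write a realistic user query for a recipe app: "
        f"dietary restriction={dietary}, cuisine={cuisine}, "
        f"complexity={complexity}. Vary the phrasing."
    )

prompts = [query_prompt(*t) for t in valid]
```

Keeping tuple generation and query phrasing in separate steps is what prevents the repetitive, template-sounding outputs you get from a single "give me test queries" prompt.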
<p><strong>Fix obvious problems first</strong>: Don’t generate synthetic data for issues you can fix immediately. If your prompt doesn’t mention dietary restrictions, fix the prompt rather than generating specialized test queries.</p>
<p>After iterating on your tuples and prompts, <strong>run these synthetic queries through your actual system to capture full traces</strong>. Sample 100 traces for error analysis. This number provides enough traces to manually review and identify failure patterns without being overwhelming.</p>
<p><a href="../../../blog/posts/evals-faq/what-is-the-best-approach-for-generating-synthetic-data.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-are-there-scenarios-where-synthetic-data-may-not-be-reliable" class="level2">
<h2 class="anchored" data-anchor-id="q-are-there-scenarios-where-synthetic-data-may-not-be-reliable">Q: Are there scenarios where synthetic data may not be reliable?</h2>
<p>Yes: synthetic data can mislead or mask issues. For guidance on generating synthetic data when appropriate, see “What is the best approach for generating synthetic data?”</p>
<p>Common scenarios where synthetic data fails:</p>
<ol type="1">
<li><p><strong>Complex domain-specific content</strong>: LLMs often miss the structure, nuance, or quirks of specialized documents (e.g., legal filings, medical records, technical forms). Without real examples, critical edge cases are missed.</p></li>
<li><p><strong>Low-resource languages or dialects</strong>: For low-resource languages or dialects, LLM-generated samples are often unrealistic. Evaluations based on them won’t reflect actual performance.</p></li>
<li><p><strong>When validation is impossible</strong>: If you can’t verify synthetic sample realism (due to domain complexity or lack of ground truth), real data is important for accurate evaluation.</p></li>
<li><p><strong>High-stakes domains</strong>: In high-stakes domains (medicine, law, emergency response), synthetic data often lacks subtlety and edge cases. Errors here have serious consequences, and manual validation is difficult.</p></li>
<li><p><strong>Underrepresented user groups</strong>: For underrepresented user groups, LLMs may misrepresent context, values, or challenges. Synthetic data can reinforce biases in the training data of the LLM.</p></li>
</ol>
<p><a href="../../../blog/posts/evals-faq/are-there-scenarios-where-synthetic-data-may-not-be-reliable.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-do-i-approach-evaluation-when-my-system-handles-diverse-user-queries" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-i-approach-evaluation-when-my-system-handles-diverse-user-queries">Q: How do I approach evaluation when my system handles diverse user queries?</h2>
<blockquote class="blockquote">
<p>Complex applications often support vastly different query patterns—from “What’s the return policy?” to “Compare pricing trends across regions for products matching these criteria.” Each query type exercises different system capabilities, leading to confusion on how to design eval criteria.</p>
</blockquote>
<p><strong><em><a href="https://youtu.be/e2i6JbU2R-s?si=8p5XVxbBiioz69Xc" target="_blank">Error Analysis</a> is all you need.</em></strong> Your evaluation strategy should emerge from observed failure patterns (i.e., error analysis), not predetermined query classifications. Rather than creating a massive evaluation matrix covering every query type you can imagine, let your system’s actual behavior guide where you invest evaluation effort.</p>
<p>During error analysis, you’ll likely discover that certain query categories share failure patterns. For instance, all queries requiring temporal reasoning might struggle regardless of whether they’re simple lookups or complex aggregations. Similarly, queries that need to combine information from multiple sources might fail in consistent ways. These patterns discovered through error analysis should drive your evaluation priorities. It could be that query category is a fine way to group failures, but you don’t know that until you’ve analyzed your data.</p>
<p>To see an example of basic error analysis in action, <a href="https://youtu.be/e2i6JbU2R-s?si=8p5XVxbBiioz69Xc" target="_blank">see this video</a>.</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/e2i6JbU2R-s" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p><a href="../../../blog/posts/evals-faq/how-do-i-approach-evaluation-when-my-system-handles-diverse-user-queries.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-can-i-efficiently-sample-production-traces-for-review" class="level2">
<h2 class="anchored" data-anchor-id="q-how-can-i-efficiently-sample-production-traces-for-review">Q: How can I efficiently sample production traces for review?</h2>
<p>It can be cumbersome to review traces randomly, especially when most traces don’t have an error. These sampling strategies help you find traces more likely to reveal problems:</p>
<ul>
<li><strong>Outlier detection:</strong> Sort by any metric (response length, latency, tool calls) and review extremes.</li>
<li><strong>User feedback signals:</strong> Prioritize traces with negative feedback, support tickets, or escalations.</li>
<li><strong>Metric-based sorting:</strong> Generic metrics can serve as exploration signals to find interesting traces. Review both high and low scores and treat them as exploration clues. Based on what you learn, you can build custom evaluators for the failure modes you find.</li>
<li><strong>Stratified sampling:</strong> Group traces by key dimensions (user type, feature, query category) and sample from each group.</li>
<li><strong>Embedding clustering:</strong> Generate embeddings of queries and cluster them to reveal natural groupings. Sample proportionally from each cluster, but oversample small clusters for edge cases. There’s no right answer for clustering—it’s an exploration technique to surface patterns you might miss manually.</li>
</ul>
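<p>The outlier-detection tactic is simple enough to sketch. Here, traces are hypothetical dicts with a couple of metrics (in practice they would be exported from your observability tool); sorting by a metric and taking both extremes yields a small review queue:</p>

```python
import random

# Hypothetical trace records; real ones come from your tracing backend.
traces = [
    {"id": i,
     "latency_s": random.uniform(0.2, 30),
     "n_tool_calls": random.randint(0, 12)}
    for i in range(500)
]

def outliers(traces, metric, k=10):
    """Return the k lowest and k highest traces by a given metric."""
    ranked = sorted(traces, key=lambda t: t[metric])
    return ranked[:k] + ranked[-k:]

# 20 traces to review: the fastest and slowest responses.
review_queue = outliers(traces, "latency_s", k=10)
```

The same function works for any numeric column (response length, tool calls, retries), which is why outlier sorting is usually the cheapest strategy to try first.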
<p>As you get more sophisticated with how you sample, you can incorporate these tactics into the design of your annotation tools.</p>
<p><a href="../../../blog/posts/evals-faq/how-can-i-efficiently-sample-production-traces-for-review.html" class="faq-individual-link">↗ Focus view</a></p>
<hr>
<div class="cta" style="text-align: center;">
<p><strong>👉 <em>Want to learn more about AI Evals? Check out our <a href="https://bit.ly/evals-ai" target="_blank">AI Evals course</a></em></strong>. It’s a live cohort with hands on exercises and office hours. Here is a <a href="https://bit.ly/evals-ai" target="_blank">25% discount code</a> for readers. 👈</p>
</div>
<hr>
</section>
</section>
<section id="evaluation-design-methodology" class="level1">
<h1>Evaluation Design &amp; Methodology</h1>
<section id="q-why-do-you-recommend-binary-passfail-evaluations-instead-of-1-5-ratings-likert-scales" class="level2">
<h2 class="anchored" data-anchor-id="q-why-do-you-recommend-binary-passfail-evaluations-instead-of-1-5-ratings-likert-scales">Q: Why do you recommend binary (pass/fail) evaluations instead of 1-5 ratings (Likert scales)?</h2>
<blockquote class="blockquote">
<p>Engineers often believe that Likert scales (1-5 ratings) provide more information than binary evaluations, allowing them to track gradual improvements. However, this added complexity often creates more problems than it solves in practice.</p>
</blockquote>
<p>Binary evaluations force clearer thinking and more consistent labeling. Likert scales introduce significant challenges: the difference between adjacent points (like 3 vs 4) is subjective and inconsistent across annotators, detecting statistical differences requires larger sample sizes, and annotators often default to middle values to avoid making hard decisions.</p>
<p>Having binary options forces people to make a decision rather than hiding uncertainty in middle values. Binary decisions are also faster to make during error analysis - you don’t waste time debating whether something is a 3 or 4.</p>
<p>For tracking gradual improvements, consider measuring specific sub-components with their own binary checks rather than using a scale. For example, instead of rating factual accuracy 1-5, you could track “4 out of 5 expected facts included” as separate binary checks. This preserves the ability to measure progress while maintaining clear, objective criteria.</p>
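<p>The "separate binary checks" idea can be sketched as simple assertions. The expected facts and the response below are made up for illustration, and plain substring matching stands in for whatever check (regex, LLM-as-Judge) each fact actually needs:</p>

```python
# Hypothetical expected facts for one test case.
expected_facts = [
    "30-day return window",
    "original receipt",
    "original payment method",
    "store credit",
    "final-sale items",
]

response = (
    "Returns are accepted within a 30-day return window with the "
    "original receipt required; refunds go to the original payment "
    "method, and gift returns receive store credit."
)

# Each fact is its own binary pass/fail check.
checks = {fact: fact.lower() in response.lower() for fact in expected_facts}
passed = sum(checks.values())
print(f"{passed} out of {len(checks)} expected facts included")  # 4 out of 5
```

Each check stays objective and auditable, while the pass count still gives you a graded signal to track over time.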
<p>Start with binary labels to understand what ‘bad’ looks like. Numeric labels are advanced and usually not necessary.</p>
<p><a href="../../../blog/posts/evals-faq/why-do-you-recommend-binary-passfail-evaluations-instead-of-1-5-ratings-likert-scales.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-should-i-practice-eval-driven-development" class="level2">
<h2 class="anchored" data-anchor-id="q-should-i-practice-eval-driven-development">Q: Should I practice eval-driven development?</h2>
<p><strong>Generally no.</strong> Eval-driven development (writing evaluators before implementing features) sounds appealing but creates more problems than it solves. Unlike traditional software where failure modes are predictable, LLMs have infinite surface area for potential failures. You can’t anticipate what will break.</p>
<p>A better approach is to start with error analysis. Write evaluators for errors you discover, not errors you imagine. This avoids getting blocked on what to evaluate and prevents wasted effort on metrics that have no impact on actual system quality.</p>
<p><strong>Exception:</strong> Eval-driven development may work for specific constraints where you know exactly what success looks like. If adding “never mention competitors,” writing that evaluator early may be acceptable.</p>
<p>Most importantly, always do a cost-benefit analysis before implementing an eval. Ask whether the failure mode justifies the investment. Error analysis reveals which failures actually matter for your users.</p>
<p><a href="../../../blog/posts/evals-faq/should-i-practice-eval-driven-development.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-should-i-build-automated-evaluators-for-every-failure-mode-i-find" class="level2">
<h2 class="anchored" data-anchor-id="q-should-i-build-automated-evaluators-for-every-failure-mode-i-find">Q: Should I build automated evaluators for every failure mode I find?</h2>
<p>Focus automated evaluators on failures that persist after fixing your prompts. Many teams discover their LLM doesn’t meet preferences they never actually specified - like wanting short responses, specific formatting, or step-by-step reasoning. Fix these obvious gaps first before building complex evaluation infrastructure.</p>
<p>Consider the cost hierarchy of different evaluator types. Simple assertions and reference-based checks (comparing against known correct answers) are cheap to build and maintain. LLM-as-Judge evaluators require 100+ labeled examples, ongoing weekly maintenance, and coordination between developers, PMs, and domain experts. This cost difference should shape your evaluation strategy.</p>
<p>Only build expensive evaluators for problems you’ll iterate on repeatedly. Since LLM-as-Judge comes with significant overhead, save it for persistent generalization failures - not issues you can fix trivially. Start with cheap code-based checks where possible: regex patterns, structural validation, or execution tests. Reserve complex evaluation for subjective qualities that can’t be captured by simple rules.</p>
<p><a href="../../../blog/posts/evals-faq/should-i-build-automated-evaluators-for-every-failure-mode-i-find.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-should-i-use-ready-to-use-evaluation-metrics" class="level2">
<h2 class="anchored" data-anchor-id="q-should-i-use-ready-to-use-evaluation-metrics">Q: Should I use "ready-to-use" evaluation metrics?</h2>
<p><strong>No.&nbsp;Generic evaluations waste time and create false confidence.</strong> (Unless you’re using them for exploration).</p>
<p>One instructor noted:</p>
<blockquote class="blockquote">
<p>“All you get from using these prefab evals is you don’t know what they actually do and in the best case they waste your time and in the worst case they create an illusion of confidence that is unjustified.”<sup>1</sup></p>
</blockquote>
<p>Generic evaluation metrics are everywhere. Eval libraries contain scores like helpfulness, coherence, quality, etc. promising easy evaluation. These metrics measure abstract qualities that may not matter for your use case. Good scores on them don’t mean your system works.</p>
<p>Instead, conduct error analysis to understand failures. Define binary failure modes based on real problems. Create custom evaluators for those failures and validate them against human judgment. Essentially, the entire evals process.</p>
<p>Experienced practitioners may still use these metrics, just not how you’d expect. As Picasso said: “Learn the rules like a pro, so you can break them like an artist.” Once you understand why generic metrics fail as evaluations, you can repurpose them as exploration tools to find interesting traces (explained in the next FAQ).</p>
<p><a href="../../../blog/posts/evals-faq/should-i-use-ready-to-use-evaluation-metrics.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-are-similarity-metrics-bertscore-rouge-etc.-useful-for-evaluating-llm-outputs" class="level2">
<h2 class="anchored" data-anchor-id="q-are-similarity-metrics-bertscore-rouge-etc.-useful-for-evaluating-llm-outputs">Q: Are similarity metrics (BERTScore, ROUGE, etc.) useful for evaluating LLM outputs?</h2>
<p>Generic metrics like BERTScore, ROUGE, cosine similarity, etc. are not useful for evaluating LLM outputs in most AI applications. Instead, we recommend using error analysis to identify metrics specific to your application’s behavior, and designing binary pass/fail evals (using LLM-as-Judge) or code-based assertions.</p>
<p>As an example, consider a real estate CRM assistant. Suggesting showings that aren’t available (can be tested with an assertion) or confusing client personas (can be tested with an LLM-as-Judge) is problematic. Generic metrics like similarity or verbosity won’t catch this. A relevant quote from the course:</p>
<blockquote class="blockquote">
<p>“The abuse of generic metrics is endemic. Many eval vendors promote off the shelf metrics, which ensnare engineers into superfluous tasks.”</p>
</blockquote>
<p>Similarity metrics aren’t always useless. They have utility in domains like search and recommendation (and therefore can be useful for optimizing and debugging retrieval for RAG). For example, cosine similarity between embeddings can measure semantic closeness in retrieval systems, and average pairwise similarity can assess output diversity (where lower similarity indicates higher diversity).</p>
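<p>The diversity use case can be sketched with average pairwise cosine similarity. The toy 3-dimensional vectors below stand in for real embeddings from an embedding model:</p>

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def avg_pairwise_similarity(embeddings):
    """Lower average similarity indicates a more diverse set of outputs."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Toy "embeddings": an orthogonal (diverse) set vs. a near-identical set.
diverse = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
similar = [[1, 0, 0], [0.9, 0.1, 0], [0.95, 0.05, 0]]
assert avg_pairwise_similarity(diverse) < avg_pairwise_similarity(similar)
```

Note this measures closeness, not quality: a set of outputs can be maximally diverse and still all wrong, which is exactly why these metrics fail as evals but work as exploration signals.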
<p><a href="../../../blog/posts/evals-faq/are-similarity-metrics-bertscore-rouge-etc-useful-for-evaluating-llm-outputs.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-can-i-use-the-same-model-for-both-the-main-task-and-evaluation" class="level2">
<h2 class="anchored" data-anchor-id="q-can-i-use-the-same-model-for-both-the-main-task-and-evaluation">Q: Can I use the same model for both the main task and evaluation?</h2>
<p>For LLM-as-Judge selection, using the same model is usually fine because the judge is doing a different task than your main LLM pipeline. While <a href="https://arxiv.org/pdf/2508.06709">research has shown</a> that models can exhibit bias when evaluating their own outputs, what ultimately matters is how well your judge aligns with human judgments. The judges we recommend building do scoped binary classification tasks. We’ve found that iterative alignment with human labels is usually achievable on this constrained task.</p>
<p>Focus on achieving a high True Positive Rate (TPR) and True Negative Rate (TNR) with your judge on a held-out labeled test set. If you struggle to achieve good alignment with human scores, consider trying a different model. However, onboarding new model providers may involve non-trivial effort in some organizations, which is why we don’t advocate for using different models by default unless there’s a specific alignment issue.</p>
<p>When selecting judge models, start with the most capable models available to establish strong alignment with human judgments. You can optimize for cost later once you’ve established reliable evaluation criteria.</p>
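<p>Computing TPR and TNR is straightforward once you have human labels paired with judge predictions on the held-out set. The labels below are hypothetical:</p>

```python
# Hypothetical human labels vs. judge predictions (True = pass).
human = [True, True, False, True, False, False, True, False]
judge = [True, True, False, False, False, True, True, False]

tp = sum(h and j for h, j in zip(human, judge))
tn = sum((not h) and (not j) for h, j in zip(human, judge))
fp = sum((not h) and j for h, j in zip(human, judge))
fn = sum(h and (not j) for h, j in zip(human, judge))

tpr = tp / (tp + fn)  # of human "pass" labels, how many the judge agreed on
tnr = tn / (tn + fp)  # of human "fail" labels, how many the judge caught
print(f"TPR={tpr:.2f}, TNR={tnr:.2f}")
```

Tracking both rates separately matters because a judge that passes everything scores a perfect TPR while being useless; TNR is what tells you it actually catches failures.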
<p><a href="../../../blog/posts/evals-faq/can-i-use-the-same-model-for-both-the-main-task-and-evaluation.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-do-we-evaluate-a-models-ability-to-express-uncertainty-or-know-what-it-doesnt-know" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-we-evaluate-a-models-ability-to-express-uncertainty-or-know-what-it-doesnt-know">Q: How do we evaluate a model’s ability to express uncertainty or "know what it doesn’t know"?</h2>
<p>Many applications require a model that can refuse to answer a question when it lacks sufficient information. To evaluate whether this refusal behavior is well-calibrated, you need to test if the model refuses at the appropriate times without refusing to answer questions it <em>should</em> be able to answer.</p>
<p>To do this effectively, you should construct an evaluation set that has the following components:</p>
<ol type="1">
<li><strong>Answerable Questions:</strong> Scenarios where a correct, verifiable answer is present in the model’s provided context or general knowledge.</li>
<li><strong>Unanswerable Questions:</strong> Scenarios designed to tempt the model to hallucinate. These include questions with false premises, queries about information explicitly missing from context, or topics far outside its knowledge base.</li>
</ol>
<p>While the exact proportion isn’t critical, a balanced set with a roughly equal number of answerable and unanswerable questions is a good starting point. The diversity and difficulty of the questions are more important than the precise ratio.</p>
<p>The evaluation itself is a binary (Pass/Fail) check of the model’s judgment. A “Pass” requires the model to satisfy two conditions: it must answer the answerable questions while also refusing to answer the unanswerable ones. A failure is defined as providing a fabricated answer to an unanswerable question, which indicates poor calibration.</p>
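<p>The pass/fail logic above can be sketched in a few lines, assuming you have already recorded whether each response was a refusal (e.g. via a judge or keyword check) and, for answerable questions, whether it was correct. The records are hypothetical:</p>

```python
# Hypothetical eval records for an abstention test set.
records = [
    {"answerable": True,  "refused": False, "correct": True},   # pass
    {"answerable": True,  "refused": True,  "correct": None},   # fail: over-refusal
    {"answerable": False, "refused": True,  "correct": None},   # pass
    {"answerable": False, "refused": False, "correct": None},   # fail: fabrication
]

def passes(r):
    if r["answerable"]:
        # Must answer, and answer correctly.
        return (not r["refused"]) and bool(r["correct"])
    # Unanswerable: must refuse.
    return r["refused"]

results = [passes(r) for r in records]
pass_rate = sum(results) / len(results)
```

Splitting the failures by type (over-refusal vs. fabrication) is worth doing too, since the fixes for each are usually different.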
<p>In the research literature, this capability is known as “Abstention Ability.” To improve this behavior, it is worth <a href="https://arxiv.org/search/?query=Abstention+Ability&amp;searchtype=all">searching for this term on Arxiv</a> to understand the latest techniques.</p>
<p><a href="../../../blog/posts/evals-faq/how-do-we-evaluate-a-models-ability-to-express-uncertainty-or-know-what-it-doesnt-know.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="human-annotation-process" class="level1">
<h1>Human Annotation &amp; Process</h1>
<section id="q-how-many-people-should-annotate-my-llm-outputs" class="level2">
<h2 class="anchored" data-anchor-id="q-how-many-people-should-annotate-my-llm-outputs">Q: How many people should annotate my LLM outputs?</h2>
<p>For most small to medium-sized companies, appointing a single domain expert as a “benevolent dictator” is the most effective approach. This person—whether it’s a psychologist for a mental health chatbot, a lawyer for legal document analysis, or a customer service director for support automation—becomes the definitive voice on quality standards.</p>
<p>A single expert eliminates annotation conflicts and prevents the paralysis that comes from “too many cooks in the kitchen”. The benevolent dictator can incorporate input and feedback from others, but they drive the process. If you feel like you need five subject matter experts to judge a single interaction, it’s a sign your product scope might be too broad.</p>
<p>However, larger organizations or those operating across multiple domains (like a multinational company with different cultural contexts) may need multiple annotators. When you do use multiple people, you’ll need to measure their agreement using metrics like Cohen’s Kappa, which accounts for agreement beyond chance. However, use your judgment. Even in larger companies, a single expert is often enough.</p>
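<p>For two annotators with binary labels, Cohen’s Kappa can be computed in a few lines with no dependencies. The labels below are hypothetical:</p>

```python
def cohens_kappa(a, b):
    """Chance-corrected agreement for two annotators' binary (0/1) labels."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each annotator's base rates.
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)

ann1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
ann2 = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]
kappa = cohens_kappa(ann1, ann2)  # ~0.58 on these labels
```

Raw percent agreement (80% here) overstates alignment because two annotators who both label "pass" most of the time agree often by chance alone; Kappa corrects for that.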
<p>Start with a benevolent dictator whenever feasible. Only add complexity when your domain demands it.</p>
<p><a href="../../../blog/posts/evals-faq/how-many-people-should-annotate-my-llm-outputs.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-should-product-managers-and-engineers-collaborate-on-error-analysis-how" class="level2">
<h2 class="anchored" data-anchor-id="q-should-product-managers-and-engineers-collaborate-on-error-analysis-how">Q: Should product managers and engineers collaborate on error analysis? How?</h2>
<p>At the outset, collaborate to establish shared context. Engineers catch technical issues like retrieval issues and tool errors. PMs identify product failures like unmet user expectations, confusing responses, or missing features users expect.</p>
<p>As time goes on you should lean towards a benevolent dictator for error analysis: a domain expert or PM who understands user needs. Empower domain experts to evaluate actual outcomes rather than technical implementation. Ask “Has an appointment been made?” not “Did the tool call succeed?” The best way to empower the domain expert is to give them custom annotation tools that display system outcomes alongside traces. Show the confirmation, generated email, or database update that validates goal completion. Keep all context on one screen so non-technical reviewers focus on results.</p>
<p><a href="../../../blog/posts/evals-faq/should-product-managers-and-engineers-collaborate-on-error-analysis-how.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-should-i-outsource-annotation-labeling-to-a-third-party" class="level2">
<h2 class="anchored" data-anchor-id="q-should-i-outsource-annotation-labeling-to-a-third-party">Q: Should I outsource annotation &amp; labeling to a third party?</h2>
<p>Outsourcing error analysis is usually a big mistake (with some exceptions). The core of evaluation is building the product intuition that only comes from systematically analyzing your system’s failures. You should be extremely skeptical of this process being delegated.</p>
<section id="the-dangers-of-outsourcing" class="level3">
<h3 class="anchored" data-anchor-id="the-dangers-of-outsourcing"><strong>The Dangers of Outsourcing</strong></h3>
<p>When you outsource annotation, you often break the feedback loop between observing a failure and understanding how to improve the product. Problems with outsourcing include:</p>
<ul>
<li>Superficial Labeling: Even well-defined metrics require nuanced judgment that external teams lack. A critical misstep in error analysis is excluding domain experts from the labeling process. Outsourcing this task to those without domain expertise, like general developers or IT staff, often leads to superficial or incorrect labeling.<br>
</li>
<li>Loss of Unspoken Knowledge: A principal domain expert possesses tacit knowledge and user understanding that cannot be fully captured in a rubric. Involving these experts helps uncover their preferences and expectations, which they might not be able to fully articulate upfront.<br>
</li>
<li>Annotation Conflicts and Misalignment: Without a shared context, external annotators can create more disagreement than they resolve. Achieving alignment is a challenge even for internal teams, which means you will spend even more time on this process.</li>
</ul>
</section>
<section id="the-recommended-approach-build-internal-capability" class="level3">
<h3 class="anchored" data-anchor-id="the-recommended-approach-build-internal-capability"><strong>The Recommended Approach: Build Internal Capability</strong></h3>
<p>Instead of outsourcing, focus on building an efficient internal evaluation process.</p>
<p>1. Appoint a “Benevolent Dictator”. For most teams, the most effective strategy is to appoint a single, internal domain expert as the final decision-maker on quality. This individual sets the standard, ensures consistency, and develops a sense of ownership.</p>
<p>2. Use a collaborative workflow for multiple annotators. If multiple annotators are necessary, follow a structured process to ensure alignment:</p>
<ul>
<li>Draft an initial rubric with clear Pass/Fail definitions and examples.</li>
<li>Have each annotator label a shared set of traces independently to surface differences in interpretation.</li>
<li>Measure Inter-Annotator Agreement (IAA) using a chance-corrected metric like Cohen’s Kappa.</li>
<li>Facilitate alignment sessions to discuss disagreements and refine the rubric.</li>
<li>Iterate on this process until agreement is consistently high.</li>
</ul>
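<p>The IAA measurement step can be done with a few lines of code. Below is a minimal sketch of Cohen’s Kappa for two annotators; the Pass/Fail labels are illustrative:</p>

```python
from collections import Counter

# Minimal two-annotator Cohen's Kappa (chance-corrected agreement).
def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / n**2
    return (observed - expected) / (1 - expected)

a = ["pass", "pass", "fail", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(a, b), 2))  # → 0.67
```

<p>A Kappa near 1.0 indicates strong agreement beyond chance; values below ~0.6 usually mean the rubric needs another alignment session.</p>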
</section>
<section id="how-to-handle-capacity-constraints" class="level3">
<h3 class="anchored" data-anchor-id="how-to-handle-capacity-constraints"><strong>How to Handle Capacity Constraints</strong></h3>
<p>Building internal capacity does not mean you have to label every trace. Use these strategies to manage the workload:</p>
<ul>
<li>Smart Sampling: Review a small, representative sample of traces thoroughly. It is more effective to analyze 100 diverse traces to find patterns than to superficially label thousands.<br>
</li>
<li>The “Think-Aloud” Protocol: To make the most of limited expert time, use this technique from usability testing. Ask an expert to verbalize their thought process while reviewing a handful of traces. This method can uncover deep insights in a single one-hour session.<br>
</li>
<li>Build Lightweight Custom Tools: Build custom annotation tools to streamline the review process, increasing throughput.</li>
</ul>
</section>
<section id="exceptions-for-external-help" class="level3">
<h3 class="anchored" data-anchor-id="exceptions-for-external-help"><strong>Exceptions for External Help</strong></h3>
<p>While outsourcing the core error analysis process is not recommended, there are some scenarios where external help is appropriate:</p>
<ul>
<li>Purely Mechanical Tasks: For highly objective, unambiguous tasks like identifying a phone number or validating an email address, external annotators can be used after a rigorous internal process has defined the rubric.<br>
</li>
<li>Tasks Without Product Context: Well-defined tasks that don’t require understanding your product’s specific requirements can be outsourced. Translation is a good example: it requires linguistic expertise but not deep product knowledge.<br>
</li>
<li>Engaging Subject Matter Experts: Hiring external SMEs to act as your internal domain experts is not outsourcing; it is bringing the necessary expertise into your evaluation process. For example, <a href="https://www.ankihub.net/" target="_blank">AnkiHub</a> hired 4th-year medical students to evaluate their RAG systems for medical content rather than outsourcing to generic annotators.</li>
</ul>
<p><a href="../../../blog/posts/evals-faq/should-i-outsource-annotation-and-labeling-to-a-third-party.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-what-parts-of-evals-can-be-automated-with-llms" class="level2">
<h2 class="anchored" data-anchor-id="q-what-parts-of-evals-can-be-automated-with-llms">Q: What parts of evals can be automated with LLMs?</h2>
<p>LLMs can speed up parts of your eval workflow, but they can’t replace human judgment where your expertise is essential. For example, if you let an LLM handle all of <a href="../../../blog/posts/evals-faq/why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed.html" target="_blank">error analysis</a> (i.e., reviewing and annotating traces), you might overlook failure cases that matter for your product. Suppose users keep mentioning “lag” in feedback, but the LLM lumps these under generic “performance issues” instead of creating a “latency” category. You’d miss a recurring complaint about slow response times and fail to prioritize a fix.</p>
<p>That said, LLMs are valuable tools for accelerating certain parts of the evaluation workflow <em>when used with oversight</em>.</p>
<section id="here-are-some-areas-where-llms-can-help" class="level3">
<h3 class="anchored" data-anchor-id="here-are-some-areas-where-llms-can-help">Here are some areas where LLMs can help:</h3>
<ul>
<li><strong>First-pass axial coding:</strong> After you’ve open coded 30–50 traces yourself, use an LLM to organize your raw failure notes into proposed groupings. This helps you quickly spot patterns, but always review and refine the clusters yourself. <em>Note: If you aren’t familiar with axial and open coding, see <a href="../../../blog/posts/evals-faq/why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed.html" target="_blank">this faq</a>.</em></li>
<li><strong>Mapping annotations to failure modes:</strong> Once you’ve defined failure categories, you can ask an LLM to suggest which categories apply to each new trace (e.g., “Given this annotation: [open_annotation] and these failure modes: [list_of_failure_modes], which apply?”).<br>
</li>
<li><strong>Suggesting prompt improvements:</strong> When you notice recurring problems, have the LLM propose concrete changes to your prompts. Review these suggestions before adopting any changes.<br>
</li>
<li><strong>Analyzing annotation data:</strong> Use LLMs or AI-powered notebooks to find patterns in your labels, such as “reports of lag increase 3x during peak usage hours” or “slow response times are mostly reported from users on mobile devices.”</li>
</ul>
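<p>The mapping step above can be as simple as constructing a prompt from your annotation and failure taxonomy. Here’s a sketch; the function name and output convention are illustrative, and the actual LLM call is omitted:</p>

```python
# Build a prompt asking an LLM which known failure modes apply to an
# open-ended annotation. Hypothetical helper for illustration only.
def build_mapping_prompt(annotation: str, failure_modes: list[str]) -> str:
    modes = "\n".join(f"- {m}" for m in failure_modes)
    return (
        "Given this open-ended annotation of a trace:\n"
        f"{annotation}\n\n"
        f"And these known failure modes:\n{modes}\n\n"
        "List the failure modes that apply, one per line, or NONE."
    )

prompt = build_mapping_prompt(
    "Assistant suggested a showing time outside business hours",
    ["persona-tone mismatch", "unavailable showing time", "missing budget constraint"],
)
print(prompt)
```

<p>Keeping the taxonomy in the prompt (rather than fine-tuning a classifier) makes it trivial to revise categories as your error analysis evolves.</p>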
</section>
<section id="however-you-shouldnt-outsource-these-activities-to-an-llm" class="level3">
<h3 class="anchored" data-anchor-id="however-you-shouldnt-outsource-these-activities-to-an-llm">However, you shouldn’t outsource these activities to an LLM:</h3>
<ul>
<li><strong>Initial open coding:</strong> Always read through the raw traces yourself at the start. This is how you discover new types of failures, understand user pain points, and build intuition about your data. Never skip this or delegate it.<br>
</li>
<li><strong>Validating failure taxonomies:</strong> LLM-generated groupings need your review. For example, an LLM might group both “app crashes after login” and “login takes too long” under a single “login issues” category, even though one is a stability problem and the other is a performance problem. Without your intervention, you’d miss that these issues require different fixes.<br>
</li>
<li><strong>Ground truth labeling:</strong> For any data used for testing/validating LLM-as-Judge evaluators, hand-validate each label. LLMs can make mistakes that lead to unreliable benchmarks.<br>
</li>
<li><strong>Root cause analysis:</strong> LLMs may point out obvious issues, but only human review will catch patterns like errors that occur in specific workflows or edge cases—such as bugs that happen only when users paste data from Excel.</li>
</ul>
<p>In conclusion, start by examining data manually to understand what’s actually going wrong. Use LLMs to scale what you’ve learned, not to avoid looking at data.</p>
<p><a href="../../../blog/posts/evals-faq/what-parts-of-evals-can-be-automated-with-llms.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-should-i-stop-writing-prompts-manually-in-favor-of-automated-tools" class="level2">
<h2 class="anchored" data-anchor-id="q-should-i-stop-writing-prompts-manually-in-favor-of-automated-tools">Q: Should I stop writing prompts manually in favor of automated tools?</h2>
<p>Automating prompt engineering can be tempting, but you should be skeptical of tools that promise to optimize prompts for you, especially in early stages of development. When you write a prompt, you are forced to clarify your assumptions and externalize your requirements. Good writing is good thinking <sup>2</sup>. If you delegate this task to an automated tool too early, you risk never fully understanding your own requirements or the model’s failure modes.</p>
<p>This is because automated prompt optimization typically hill-climbs a predefined evaluation metric. It can refine a prompt to perform better on known failures, but it cannot discover <em>new</em> ones. Discovering new errors requires error analysis. Furthermore, research shows that evaluation criteria tend to shift after reviewing a model’s outputs, a phenomenon known as “criteria drift” <sup>3</sup>. This means that evaluation is an iterative, human-driven sensemaking process, not a static target that can be set once and handed off to an optimizer.</p>
<p>A pragmatic approach is to use LLMs to improve your prompt based on open coding (open-ended notes about traces). This way, you maintain a human in the loop who is looking at the data and externalizing their requirements. Once you have a high-quality set of evals, prompt optimization can be effective for that last mile of performance.</p>
<p><a href="../../../blog/posts/evals-faq/should-i-stop-writing-prompts-manually-in-favor-of-automated-tools.html" class="faq-individual-link">↗ Focus view</a></p>
<hr>
<div class="cta" style="text-align: center;">
<p><strong>👉 <em>Want to learn more about AI Evals? Check out our <a href="https://bit.ly/evals-ai" target="_blank">AI Evals course</a></em></strong>. It’s a live cohort with hands on exercises and office hours. Here is a <a href="https://bit.ly/evals-ai" target="_blank">25% discount code</a> for readers. 👈</p>
</div>
<hr>
</section>
</section>
<section id="tools-infrastructure" class="level1">
<h1>Tools &amp; Infrastructure</h1>
<section id="q-should-i-build-a-custom-annotation-tool-or-use-something-off-the-shelf" class="level2">
<h2 class="anchored" data-anchor-id="q-should-i-build-a-custom-annotation-tool-or-use-something-off-the-shelf">Q: Should I build a custom annotation tool or use something off-the-shelf?</h2>
<p><strong>Build a custom annotation tool.</strong> This is the single most impactful investment you can make for your AI evaluation workflow. With AI-assisted development tools like Cursor or Lovable, you can build a tailored interface in hours. I often find that teams with custom annotation tools iterate ~10x faster.</p>
<p>Custom tools excel because:</p>
<ul>
<li>They show all your context from multiple systems in one place</li>
<li>They can render your data in a product specific way (images, widgets, markdown, buttons, etc.)</li>
<li>They’re designed for your specific workflow (custom filters, sorting, progress bars, etc.)</li>
</ul>
<p>Off-the-shelf tools may be justified when you need to coordinate dozens of distributed annotators with enterprise access controls. Even then, many teams find the configuration overhead and limitations aren’t worth it.</p>
<p><a href="https://youtu.be/fA4pe9bE0LY" target="_blank">Isaac’s Anki flashcard annotation app</a> shows the power of custom tools—handling 400+ results per query with keyboard navigation and domain-specific evaluation criteria that would be nearly impossible to configure in a generic tool.</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/fA4pe9bE0LY" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p><a href="../../../blog/posts/evals-faq/should-i-build-a-custom-annotation-tool-or-use-something-off-the-shelf.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-what-makes-a-good-custom-interface-for-reviewing-llm-outputs" class="level2">
<h2 class="anchored" data-anchor-id="q-what-makes-a-good-custom-interface-for-reviewing-llm-outputs">Q: What makes a good custom interface for reviewing LLM outputs?</h2>
<p>Great interfaces make human review fast, clear, and motivating. We recommend building your own annotation tool customized to your domain. The following features are possible enhancements we’ve seen work well, but you don’t need all of them. The screenshots shown are illustrative examples to clarify concepts. In practice, I rarely implement all these features in a single app. It’s ultimately a judgment call based on your specific needs and constraints.</p>
<section id="render-traces-intelligently-not-generically" class="level3">
<h3 class="anchored" data-anchor-id="render-traces-intelligently-not-generically"><strong>1. Render Traces Intelligently, Not Generically</strong>:</h3>
<p>Present the trace in a way that’s intuitive for the domain. If you’re evaluating generated emails, render them to look like emails. If the output is code, use syntax highlighting. Allow the reviewer to see the full trace (user input, tool calls, and LLM reasoning), but keep less important details in collapsed sections that can be expanded. Here is an example of a custom annotation tool for reviewing real estate assistant emails:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/evals-faq/images/emailinterface1.png" class="img-fluid figure-img" style="width:75.0%" target="_blank"></p>
<figcaption>A custom interface for reviewing emails for a real estate assistant.</figcaption>
</figure>
</div>
</section>
<section id="show-progress-and-support-keyboard-navigation" class="level3">
<h3 class="anchored" data-anchor-id="show-progress-and-support-keyboard-navigation"><strong>2. Show Progress and Support Keyboard Navigation</strong>:</h3>
<p>Keep reviewers in a state of flow by minimizing friction and motivating completion. Include progress indicators (e.g., “Trace 45 of 100”) to keep the review session bounded and encourage completion. Enable hotkeys for navigating between traces (e.g., N for next), applying labels, and saving notes quickly. Below is an illustration of these features:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/evals-faq/images/hotkey.png" class="img-fluid figure-img" style="width:75.0%" target="_blank"></p>
<figcaption>An annotation interface with a progress bar and hotkey guide</figcaption>
</figure>
</div>
</section>
<section id="trace-navigation-through-clustering-filtering-and-search" class="level3">
<h3 class="anchored" data-anchor-id="trace-navigation-through-clustering-filtering-and-search"><strong>3. Trace navigation through clustering, filtering, and search</strong>:</h3>
<p>Allow reviewers to filter traces by metadata or search by keywords. Semantic search helps find conceptually similar problems. Clustering similar traces (like grouping by user persona) lets reviewers spot recurring issues and explore hypotheses. Below is an illustration of these features:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/evals-faq/images/group1.png" class="img-fluid figure-img" style="width:75.0%" target="_blank"></p>
<figcaption>Cluster view showing groups of emails, such as property-focused or client-focused examples. Reviewers can drill into a group to see individual traces.</figcaption>
</figure>
</div>
</section>
<section id="prioritize-labeling-traces-you-think-might-be-problematic" class="level3">
<h3 class="anchored" data-anchor-id="prioritize-labeling-traces-you-think-might-be-problematic"><strong>4. Prioritize labeling traces you think might be problematic</strong>:</h3>
<p>Surface traces flagged by guardrails, CI failures, or automated evaluators for review. Provide buttons to take actions like adding to datasets, filing bugs, or re-running pipeline tests. Display relevant context (pipeline version, eval scores, reviewer info) directly in the interface to minimize context switching. Below is an illustration of these ideas:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/evals-faq/images/ci.png" class="img-fluid figure-img" style="width:88.0%" target="_blank"></p>
<figcaption>A trace view that allows you to quickly see auto-evaluator verdict, add traces to dataset or open issues. Also shows metadata like pipeline version, reviewer info, and more.</figcaption>
</figure>
</div>
</section>
<section id="general-principle-keep-it-minimal" class="level3">
<h3 class="anchored" data-anchor-id="general-principle-keep-it-minimal">General Principle: Keep it minimal</h3>
<p>Keep your annotation interface minimal. Only incorporate these ideas if they provide a benefit that outweighs the additional complexity and maintenance overhead.</p>
<p><a href="../../../blog/posts/evals-faq/what-makes-a-good-custom-interface-for-reviewing-llm-outputs.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-what-gaps-in-eval-tooling-should-i-be-prepared-to-fill-myself" class="level2">
<h2 class="anchored" data-anchor-id="q-what-gaps-in-eval-tooling-should-i-be-prepared-to-fill-myself">Q: What gaps in eval tooling should I be prepared to fill myself?</h2>
<p>Most eval tools handle the basics well: logging complete traces, tracking metrics, prompt playgrounds, and annotation queues. These are table stakes. Here are four areas where you’ll likely need to supplement existing tools.</p>
<p>Watch for vendors addressing these gaps: it’s a strong signal they understand practitioner needs.</p>
<section id="error-analysis-and-pattern-discovery" class="level3">
<h3 class="anchored" data-anchor-id="error-analysis-and-pattern-discovery">1. Error Analysis and Pattern Discovery</h3>
<p>After reviewing traces where your AI fails, can your tooling automatically cluster similar issues? For instance, if multiple traces show the assistant using casual language for luxury clients, you need something that recognizes this broader “persona-tone mismatch” pattern. We recommend building capabilities that use AI to suggest groupings, rewrite your observations into clearer failure taxonomies, help find similar cases through semantic search, etc.</p>
</section>
<section id="ai-powered-assistance-throughout-the-workflow" class="level3">
<h3 class="anchored" data-anchor-id="ai-powered-assistance-throughout-the-workflow">2. AI-Powered Assistance Throughout the Workflow</h3>
<p>The most effective workflows use AI to accelerate every stage of evaluation. During error analysis, you want an LLM helping categorize your open-ended observations into coherent failure modes. For example, you might annotate several traces with notes like “wrong tone for investor,” “too casual for luxury buyer,” etc. Your tooling should recognize these as the same underlying pattern and suggest a unified “persona-tone mismatch” category.</p>
<p>You’ll also want AI assistance in proposing fixes. After identifying 20 cases where your assistant omits pet policies from property summaries, can your workflow analyze these failures and suggest specific prompt modifications? Can it draft refinements to your SQL generation instructions when it notices patterns of missing WHERE clauses?</p>
<p>Additionally, good workflows help you conduct data analysis of your annotations and traces. I like using notebooks with AI in-the-loop like <a href="https://julius.ai/" target="_blank">Julius</a>, <a href="https://hex.tech" target="_blank">Hex</a>, or <a href="https://solveit.fast.ai/" target="_blank">SolveIt</a>. These help me discover insights like “location ambiguity errors spike 3x when users mention neighborhood names” or “tone mismatches occur 80% more often in email generation than other modalities.”</p>
</section>
<section id="custom-evaluators-over-generic-metrics" class="level3">
<h3 class="anchored" data-anchor-id="custom-evaluators-over-generic-metrics">3. Custom Evaluators Over Generic Metrics</h3>
<p>Be prepared to build most of your evaluators from scratch. Generic metrics like “hallucination score” or “helpfulness rating” rarely capture what actually matters for your application—like proposing unavailable showing times or omitting budget constraints from emails. In our experience, successful teams spend most of their effort on application-specific metrics.</p>
</section>
<section id="apis-that-support-custom-annotation-apps" class="level3">
<h3 class="anchored" data-anchor-id="apis-that-support-custom-annotation-apps">4. APIs That Support Custom Annotation Apps</h3>
<p>Custom annotation interfaces work best for most teams. This requires observability platforms with thoughtful APIs. I often have to build my own libraries and abstractions just to make bulk data export manageable. You shouldn’t have to paginate through thousands of requests or handle timeout-prone endpoints just to get your data. Look for platforms that provide true bulk export capabilities and, crucially, APIs that let you write annotations back efficiently.</p>
<p><a href="../../../blog/posts/evals-faq/what-gaps-in-eval-tooling-should-i-be-prepared-to-fill-myself.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-whats-your-favorite-eval-vendor" class="level2">
<h2 class="anchored" data-anchor-id="q-whats-your-favorite-eval-vendor">Q: What’s your favorite eval vendor?</h2>
<p>Eval tools are in an intensely competitive space. It would be futile to compare their features. If I tried to do such an analysis, it would be invalidated in a week! Vendors I encounter the most organically in my work are: <a href="https://www.langchain.com/langsmith" target="_blank">LangSmith</a>, <a href="https://arize.com/" target="_blank">Arize</a> and <a href="https://www.braintrust.dev/" target="_blank">Braintrust</a>.</p>
<p>When I help clients with vendor selection, the decision weighs heavily towards who can offer the best support, as opposed to purely features. This changes depending on the size of the client, the use case, etc. Yes - it’s mainly the human factor that matters, and dare I say, vibes.</p>
<p>I have no favorite vendor. At the core, their features are very similar - and I often build <a href="https://hamel.dev/blog/posts/evals/#q-should-i-build-a-custom-annotation-tool-or-use-something-off-the-shelf" target="_blank">custom tools</a> on top of them to fit my needs.</p>
<p>Here is a <a href="../../../blog/posts/eval-tools/">video series</a> that has a live commentary on the relative strengths and weaknesses of the three aforementioned vendors.</p>
<p><a href="../../../blog/posts/evals-faq/whats-your-favorite-eval-vendor.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-should-i-version-and-manage-prompts" class="level2">
<h2 class="anchored" data-anchor-id="q-how-should-i-version-and-manage-prompts">Q: How should I version and manage prompts?</h2>
<p>There is an unavoidable tension between keeping prompts close to the code vs.&nbsp;an environment that non-technical stakeholders can access.</p>
<p><strong>My preferred approach is storing prompts in Git.</strong> This treats them as software artifacts that are versioned, reviewed, and deployed atomically with the application code. While the Git command line is unfriendly for non-technical folks, the <a href="https://github.com">GitHub</a> web interface and the GitHub <a href="https://desktop.github.com/">Desktop app</a> make it very approachable. When I was working at GitHub, I worked with many non-technical professionals, including lawyers and accountants, who used these tools effectively. Here is a <a href="https://ben.balter.com/2023/03/02/github-for-non-technical-roles/">blog post</a> aimed at non-technical folks to get started.</p>
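<p>As a minimal sketch, Git-versioned prompts can just be plain text files loaded at runtime. The directory layout and template name below are assumptions for illustration, not a prescribed standard:</p>

```python
from pathlib import Path

# Prompts live as plain files, versioned in Git alongside the code.
PROMPT_DIR = Path("prompts")
PROMPT_DIR.mkdir(exist_ok=True)
(PROMPT_DIR / "summarize_listing.txt").write_text(
    "Summarize the listing for {persona}. Keep it under {max_words} words."
)

def load_prompt(name: str, **variables) -> str:
    """Read a versioned prompt template and fill in its variables."""
    template = (PROMPT_DIR / f"{name}.txt").read_text()
    return template.format(**variables)

print(load_prompt("summarize_listing", persona="an investor", max_words=80))
```

<p>Because the templates are ordinary files, prompt changes show up in pull requests and deploy atomically with the code that uses them.</p>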
<p>Alternatively, most vendors in the LLM tooling space, such as observability platforms like Arize, Braintrust, and LangSmith, offer dedicated prompt management tools. These are accessible for rapid iteration but risk creating additional layers of indirection.</p>
<p><strong>Why prompt management tools often fall short:</strong> AI products typically involve many moving parts: tools, RAG, agents, etc. Prompt management tools are inherently limiting because they can’t easily execute your application’s code. Even when they can, there’s often significant indirection involved, making it difficult to test prompts with your system’s capabilities.</p>
<p><strong>When possible, a notebook provides a great solution for prompt experimentation.</strong> If you have Python entry points into your codebase or your codebase is written in Python, Jupyter notebooks are particularly powerful for this purpose. You can experiment with prompts and iterate on your actual AI agents with their full tool and RAG capabilities. This makes it much easier to understand how your system works in practice. Additionally, you can create widgets and small user interfaces within notebooks, giving you the best of both worlds for experimentation and iteration. To see what this looks like in practice, Teresa Torres gives a fantastic, hands-on walkthrough of how she, as a PM, used notebooks for the entire eval and experimentation lifecycle:</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/N-qAOv_PNPc" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p>If notebooks are not feasible for your code base, an <a href="../../../blog/posts/field-guide/#build-bridges-not-gatekeepers">​integrated prompt environment</a>​ can be effective for experimentation. Either way, I prefer to version and manage prompts in Git.</p>
<p><a href="../../../blog/posts/evals-faq/how-should-i-version-and-manage-prompts.html" class="faq-individual-link">↗ Focus view</a></p>
<hr>
<div class="cta" style="text-align: center;">
<p><strong>👉 <em>Want to learn more about AI Evals? Check out our <a href="https://bit.ly/evals-ai" target="_blank">AI Evals course</a></em></strong>. It’s a live cohort with hands on exercises and office hours. Here is a <a href="https://bit.ly/evals-ai" target="_blank">25% discount code</a> for readers. 👈</p>
</div>
<hr>
</section>
</section>
<section id="production-deployment" class="level1">
<h1>Production &amp; Deployment</h1>
<section id="q-how-are-evaluations-used-differently-in-cicd-vs.-monitoring-production" class="level2">
<h2 class="anchored" data-anchor-id="q-how-are-evaluations-used-differently-in-cicd-vs.-monitoring-production">Q: How are evaluations used differently in CI/CD vs.&nbsp;monitoring production?</h2>
<p>The most important difference between CI and production evaluation is the data used for testing.</p>
<p>Test datasets for CI are small (in many cases 100+ examples) and purpose-built. Examples cover core features, regression tests for past bugs, and known edge cases. Since CI tests are run frequently, the cost of each test has to be carefully considered (that’s why you carefully curate the dataset). Favor assertions or other deterministic checks over LLM-as-judge evaluators.</p>
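<p>As a sketch, a deterministic CI check can be a plain assertion over your application’s structured output. The required fields below are hypothetical stand-ins for whatever your system returns:</p>

```python
import json

# Cheap, deterministic CI assertions - no LLM judge needed.
def assert_valid_response(raw: str, required_keys=("summary", "listing_id")):
    obj = json.loads(raw)  # malformed JSON fails loudly
    for key in required_keys:
        assert key in obj, f"missing field: {key}"
    assert obj["summary"].strip(), "empty summary"

assert_valid_response('{"summary": "2 bed condo near downtown", "listing_id": 42}')
print("ok")
```

<p>Checks like this run in milliseconds, so they can gate every commit without the cost or flakiness of a judge model.</p>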
<p>For evaluating production traffic, you can sample live traces and run evaluators against them asynchronously. Since you usually lack reference outputs on production data, you might rely more on expensive reference-free evaluators like LLM-as-judge. Additionally, track confidence intervals for production metrics. If the lower bound crosses your threshold, investigate further.</p>
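<p>For the confidence intervals mentioned above, a Wilson score interval is one common choice for pass rates estimated from a sample of traces. A minimal sketch:</p>

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a pass rate estimated from n sampled traces."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - margin, center + margin)

# 180 of 200 sampled traces passed: 90% observed pass rate
low, high = wilson_interval(180, 200)
print(f"{low:.3f}-{high:.3f}")
```

<p>If your quality threshold is, say, 0.85 and the lower bound dips below it, that is the signal to pull traces and do error analysis rather than trusting the point estimate.</p>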
<p>These two systems are complementary: when production monitoring reveals new failure patterns through error analysis and evals, add representative examples to your CI dataset. This mitigates regressions on new issues.</p>
<p><a href="../../../blog/posts/evals-faq/how-are-evaluations-used-differently-in-cicd-vs-monitoring-production.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-whats-the-difference-between-guardrails-evaluators" class="level2">
<h2 class="anchored" data-anchor-id="q-whats-the-difference-between-guardrails-evaluators">Q: What’s the difference between guardrails &amp; evaluators?</h2>
<p>Guardrails are <strong>inline safety checks</strong> that sit directly in the request/response path. They validate inputs or outputs <em>before</em> anything reaches a user, so they typically are:</p>
<ul>
<li><strong>Fast and deterministic</strong> – typically a few milliseconds of latency budget.</li>
<li><strong>Simple and explainable</strong> – regexes, keyword block-lists, schema or type validators, lightweight classifiers.</li>
<li><strong>Targeted at clear-cut, high-impact failures</strong> – PII leaks, profanity, disallowed instructions, SQL injection, malformed JSON, invalid code syntax, etc.</li>
</ul>
<p>If a guardrail triggers, the system can redact, refuse, or regenerate the response. Because these checks are user-visible when they fire, false positives are treated as production bugs; teams version guardrail rules, log every trigger, and monitor rates to keep them conservative.</p>
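<p>A minimal sketch of such inline checks, assuming a text output and a JSON payload; the regex pattern and required keys are illustrative, not a complete guardrail suite:</p>

```python
import json
import re

# Deterministic, millisecond-scale checks that run before a response
# reaches the user. No LLM calls in the request path.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def check_output(text: str):
    """Return (ok, reason) for a free-text response."""
    if SSN_RE.search(text):
        return False, "possible SSN leak"
    return True, ""

def check_json_payload(text: str, required_keys=("answer", "sources")):
    """Validate that a structured response parses and has the expected keys."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False, "malformed JSON"
    missing = [k for k in required_keys if k not in obj]
    return (not missing, f"missing keys: {missing}" if missing else "")

print(check_output("My SSN is 123-45-6789"))
```

<p>Because each check returns a reason, every trigger can be logged and audited, which is exactly what you need to keep false-positive rates conservative.</p>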
<p>On the other hand, evaluators typically run <strong>after</strong> a response is produced. Evaluators measure qualities that simple rules cannot, such as factual correctness, completeness, etc. Their verdicts feed dashboards, regression tests, and model-improvement loops, but they do not block the original answer.</p>
<p>Evaluators are usually run asynchronously or in batch to afford heavier computation such as an <a href="https://hamel.dev/blog/posts/llm-judge/" target="_blank">LLM-as-a-Judge</a>. Inline use of an LLM-as-Judge is possible <em>only</em> when the latency budget and reliability targets allow it. Slow LLM judges might be feasible in a cascade that runs on the minority of borderline cases.</p>
<p>Apply guardrails for immediate protection against objective failures requiring intervention. Use evaluators for monitoring and improving subjective or nuanced criteria. Together, they create layered protection.</p>
<p>Word of caution: Do not use off-the-shelf LLM guardrails blindly. Always <a href="https://hamel.dev/blog/posts/prompt/" target="_blank">look at the prompt</a>.</p>
<p><a href="../../../blog/posts/evals-faq/whats-the-difference-between-guardrails-evaluators.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-can-my-evaluators-also-be-used-to-automatically-fix-or-correct-outputs-in-production" class="level2">
<h2 class="anchored" data-anchor-id="q-can-my-evaluators-also-be-used-to-automatically-fix-or-correct-outputs-in-production">Q: Can my evaluators also be used to automatically <em>fix</em> or <em>correct</em> outputs in production?</h2>
<p>Yes, but only a specific subset of them. This is the distinction between an <strong>evaluator</strong> and a <strong>guardrail</strong> that we previously discussed. As a reminder:</p>
<ul>
<li><strong>Evaluators</strong> typically run <em>asynchronously</em> after a response has been generated. They measure quality but don’t interfere with the user’s immediate experience.<br>
</li>
<li><strong>Guardrails</strong> run <em>synchronously</em> in the critical path of the request, before the output is shown to the user. Their job is to prevent high-impact failures in real-time.</li>
</ul>
<p>There are two important decision criteria for deciding whether to use an evaluator as a guardrail:</p>
<ol type="1">
<li><p><strong>Latency &amp; Cost</strong>: Can the evaluator run fast enough and cheaply enough in the critical request path without degrading user experience?</p></li>
<li><p><strong>Error Rate Trade-offs</strong>: What’s the cost-benefit balance between false positives (blocking good outputs and frustrating users) versus false negatives (letting bad outputs reach users and causing harm)? In high-stakes domains like medical advice, false negatives may be more costly than false positives. In creative applications, false positives that block legitimate creativity may be more harmful than occasional quality issues.</p></li>
</ol>
<p>Most guardrails are designed to be <strong>fast</strong> (to avoid harming user experience) and have a <strong>very low false positive rate</strong> (to avoid blocking valid responses). For this reason, you would almost never use a slow or non-deterministic LLM-as-Judge as a synchronous guardrail. However, these tradeoffs might be different for your use case.</p>
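<p>To illustrate the layering, here is a minimal sketch: a fast deterministic check runs synchronously as a guardrail, while the response is queued for heavier asynchronous evaluation. The pattern, function names, and blocked-content check are illustrative assumptions, not a prescribed implementation:</p>

```python
import re

# Hypothetical deterministic guardrail: fast, low-false-positive checks that
# run synchronously before the response reaches the user.
BLOCKED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # looks like a US SSN
]

def passes_guardrail(response: str) -> bool:
    """Block only on objective, high-confidence failures."""
    return not any(p.search(response) for p in BLOCKED_PATTERNS)

def serve(response: str, eval_queue: list) -> str:
    """Guardrail runs in the critical path; evaluators run later, off-path."""
    if not passes_guardrail(response):
        return "Sorry, I can't share that."  # real-time intervention
    eval_queue.append(response)              # async LLM-as-Judge picks this up
    return response

queue = []
assert serve("Your order ships Monday.", queue) == "Your order ships Monday."
assert serve("SSN: 123-45-6789", queue).startswith("Sorry")
assert queue == ["Your order ships Monday."]  # only served responses are queued
```

<p>Note the asymmetry: the synchronous check is deterministic and cheap, while anything slow or probabilistic stays out of the request path.</p>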
<p><a href="../../../blog/posts/evals-faq/can-my-evaluators-also-be-used-to-automatically-fix-or-correct-outputs-in-production.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-much-time-should-i-spend-on-model-selection" class="level2">
<h2 class="anchored" data-anchor-id="q-how-much-time-should-i-spend-on-model-selection">Q: How much time should I spend on model selection?</h2>
<p>Many developers fixate on model selection as the primary way to improve their LLM applications. Start with error analysis to understand your failure modes before considering model switching. As Hamel noted in office hours, “I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence. Does error analysis suggest that your model is the problem?”</p>
<p><a href="../../../blog/posts/evals-faq/how-much-time-should-i-spend-on-model-selection.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="domain-specific-applications" class="level1">
<h1>Domain-Specific Applications</h1>
<section id="q-is-rag-dead" class="level2">
<h2 class="anchored" data-anchor-id="q-is-rag-dead">Q: Is RAG dead?</h2>
<p>Question: Should I avoid using RAG for my AI application after reading that <a href="https://pashpashpash.substack.com/p/why-i-no-longer-recommend-rag-for" target="_blank">“RAG is dead”</a> for coding agents?</p>
<blockquote class="blockquote">
<p>Many developers are confused about when and how to use RAG after reading articles claiming “RAG is dead.” Understanding what RAG actually means versus the narrow marketing definitions will help you make better architectural decisions for your AI applications.</p>
</blockquote>
<p>The viral article claiming RAG is dead specifically argues against using <em>naive vector database retrieval</em> for autonomous coding agents, not RAG as a whole. This is a crucial distinction that many developers miss due to misleading marketing.</p>
<p>RAG simply means Retrieval-Augmented Generation - using retrieval to provide relevant context that improves your model’s output. The core principle remains essential: your LLM needs the right context to generate accurate answers. The question isn’t whether to use retrieval, but how to retrieve effectively.</p>
<p>For coding applications, naive vector similarity search often fails because code relationships are complex and contextual. Instead of abandoning retrieval entirely, modern coding assistants like Claude Code <a href="https://x.com/pashmerepat/status/1926717705660375463?s=46" target="_blank">still use retrieval</a>—they just employ agentic search instead of relying solely on vector databases, similar to how human developers work.</p>
<p>You have multiple retrieval strategies available, ranging from simple keyword matching to embedding similarity to LLM-powered relevance filtering. The optimal approach depends on your specific use case, data characteristics, and performance requirements. Many production systems combine multiple strategies or use multi-hop retrieval guided by LLM agents.</p>
<p>Unfortunately, “RAG” has become a buzzword with no shared definition. Some people use it to mean any retrieval system, others restrict it to vector databases. Focus on the ultimate goal: getting your LLM the context it needs to succeed. Whether that’s through vector search, agentic exploration, or hybrid approaches is a product and engineering decision.</p>
<p>Rather than following categorical advice to avoid or embrace RAG, experiment with different retrieval approaches and measure what works best for your application. For more info on RAG evaluation and optimization, see <a href="../../../notes/llm/rag/not_dead.html">this series of posts</a>.</p>
<p><a href="../../../blog/posts/evals-faq/is-rag-dead.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-should-i-approach-evaluating-my-rag-system" class="level2">
<h2 class="anchored" data-anchor-id="q-how-should-i-approach-evaluating-my-rag-system">Q: How should I approach evaluating my RAG system?</h2>
<p>RAG systems have two distinct components that require different evaluation approaches: retrieval and generation.</p>
<p>The retrieval component is a search problem. Evaluate it using traditional information retrieval (IR) metrics. Common examples include Recall@k (of all relevant documents, how many did you retrieve in the top k?), Precision@k (of the k documents retrieved, how many were relevant?), or MRR (how high up was the first relevant document?). The specific metrics you choose depend on your use case. These metrics are pure search metrics that measure whether you’re finding the right documents (more on this below).</p>
<p>To evaluate retrieval, create a dataset of queries paired with their relevant documents. Generate this synthetically by taking documents from your corpus, extracting key facts, then generating questions those facts would answer. This reverse process gives you query-document pairs for measuring retrieval performance without manual annotation.</p>
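<p>Once you have query-document pairs, these IR metrics are simple to compute. A minimal sketch (document IDs here are made up for illustration):</p>

```python
def recall_at_k(retrieved, relevant, k):
    """Of all relevant docs, how many appear in the top k results?"""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Of the top k retrieved docs, how many are relevant?"""
    return len(set(retrieved[:k]) & set(relevant)) / k

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant doc (0 if none is found)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

# One synthetic query-document pair: docs "d2" and "d5" are relevant.
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = ["d2", "d5"]
assert recall_at_k(retrieved, relevant, 3) == 0.5   # found d2, missed d5
assert precision_at_k(retrieved, relevant, 5) == 0.4
assert mrr(retrieved, relevant) == 0.5              # first hit at rank 2
```

<p>In practice you would average each metric over all queries in your synthetic evaluation set.</p>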
<p>For the generation component—how well the LLM uses retrieved context, whether it hallucinates, whether it answers the question—use the same evaluation procedures covered throughout this course: error analysis to identify failure modes, collecting human labels, building LLM-as-judge evaluators, and validating those judges against human annotations.</p>
<p>Jason Liu’s <a href="https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/" target="_blank">“There Are Only 6 RAG Evals”</a> provides a framework that maps well to this separation. His Tier 1 covers traditional IR metrics for retrieval. Tiers 2 and 3 evaluate relationships between Question, Context, and Answer—like whether the context is relevant (C|Q), whether the answer is faithful to context (A|C), and whether the answer addresses the question (A|Q).</p>
<p>In addition to Jason’s six evals, error analysis on your specific data may reveal domain-specific failure modes that warrant their own metrics. For example, a medical RAG system might consistently fail to distinguish between drug dosages for adults versus children, or a legal RAG might confuse jurisdictional boundaries. These patterns emerge only through systematic review of actual failures. Once identified, you can create targeted evaluators for these specific issues beyond the general framework.</p>
<p>Finally, when implementing Jason’s Tier 2 and 3 metrics, don’t just use prompts off the shelf. The standard LLM-as-judge process requires several steps: error analysis, prompt iteration, creating labeled examples, and measuring your judge’s accuracy against human labels. Once you know your judge’s True Positive and True Negative rates, you can correct its estimates to determine the actual failure rate in your system. Skip this validation and your judges may not reflect your actual quality criteria.</p>
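<p>To make the correction step concrete, here is a minimal sketch using a prevalence-style adjustment from the judge's measured true positive and true negative rates. The exact formulation is my assumption (a Rogan-Gladen style estimator), not prescribed by the post:</p>

```python
def corrected_pass_rate(observed_pass_rate, tpr, tnr):
    """Recover the true pass rate theta from the judge's observed pass rate,
    where tpr = P(judge passes | truly passing) and
          tnr = P(judge fails  | truly failing),
    both measured against human labels. Derived from:
        observed = theta * tpr + (1 - theta) * (1 - tnr)
    """
    return (observed_pass_rate + tnr - 1) / (tpr + tnr - 1)

# A judge with 90% TPR and 80% TNR that reports an 85% pass rate actually
# implies a higher true pass rate (about 92.9%).
theta = corrected_pass_rate(0.85, tpr=0.90, tnr=0.80)
assert abs(theta - 0.65 / 0.70) < 1e-9
```

<p>The failure rate is then simply <code>1 - theta</code>. Without this step, the judge's raw score conflates its own error rates with your system's quality.</p>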
<p>In summary, debug retrieval first using IR metrics, then tackle generation quality using properly validated LLM judges.</p>
<p><a href="../../../blog/posts/evals-faq/how-should-i-approach-evaluating-my-rag-system.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-do-i-choose-the-right-chunk-size-for-my-document-processing-tasks" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-i-choose-the-right-chunk-size-for-my-document-processing-tasks">Q: How do I choose the right chunk size for my document processing tasks?</h2>
<p>Unlike RAG, where chunks are optimized for retrieval, document processing assumes the model will see every chunk. The goal is to split text so the model can reason effectively without being overwhelmed. Even if a document fits within the context window, it might be better to break it up. Long inputs can degrade performance due to attention bottlenecks, especially in the middle of the context. Two task types require different strategies:</p>
<section id="fixed-output-tasks-large-chunks" class="level3">
<h3 class="anchored" data-anchor-id="fixed-output-tasks-large-chunks">1. Fixed-Output Tasks → Large Chunks</h3>
<p>These are tasks where the output length doesn’t grow with input: extracting a number, answering a specific question, classifying a section. For example:</p>
<ul>
<li>“What’s the penalty clause in this contract?”</li>
<li>“What was the CEO’s salary in 2023?”</li>
</ul>
<p>Use the largest chunk (with caveats) that likely contains the answer. This reduces the number of queries and avoids context fragmentation. However, avoid adding irrelevant text. Models are sensitive to distraction, especially with large inputs. The middle parts of a long input might be under-attended. Furthermore, if cost and latency are a bottleneck, you should consider preprocessing or filtering the document (via keyword search or a lightweight retriever) to isolate relevant sections before feeding a huge chunk.</p>
</section>
<section id="expansive-output-tasks-smaller-chunks" class="level3">
<h3 class="anchored" data-anchor-id="expansive-output-tasks-smaller-chunks">2. Expansive-Output Tasks → Smaller Chunks</h3>
<p>These include summarization, exhaustive extraction, or any task where output grows with input. For example:</p>
<ul>
<li>“Summarize each section”</li>
<li>“List all customer complaints”</li>
</ul>
<p>In these cases, smaller chunks help preserve reasoning quality and output completeness. The standard approach is to process each chunk independently, then aggregate results (e.g., map-reduce). When sizing your chunks, try to respect content boundaries like paragraphs, sections, or chapters. Chunking also helps mitigate output limits. By breaking the task into pieces, each piece’s output can stay within limits.</p>
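<p>The chunk-then-aggregate pattern might be sketched as follows, with <code>summarize</code> standing in for a per-chunk LLM call. All names and the character-budget heuristic are illustrative assumptions:</p>

```python
def chunk_by_paragraph(text, max_chars=500):
    """Pack whole paragraphs into chunks of at most max_chars, respecting
    content boundaries (a single paragraph over the budget stands alone)."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)  # close the chunk at a paragraph boundary
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

def map_reduce(text, summarize, combine, max_chars=500):
    """Process each chunk independently (map), then aggregate (reduce)."""
    partials = [summarize(chunk) for chunk in chunk_by_paragraph(text, max_chars)]
    return combine(partials)

doc = "Complaint: late delivery.\n\nComplaint: wrong item.\n\nPraise: great support."
# With a small budget, each paragraph becomes its own chunk.
assert len(chunk_by_paragraph(doc, max_chars=30)) == 3
```

<p>Because each chunk is processed independently, each piece's output also stays within the model's output limits.</p>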
</section>
<section id="general-guidance" class="level3">
<h3 class="anchored" data-anchor-id="general-guidance">General Guidance</h3>
<p>It’s important to recognize <strong>why chunk size affects results</strong>. A larger chunk means the model has to reason over more information in one go – essentially, a heavier cognitive load. LLMs have limited capacity to <strong>retain and correlate details across a long text</strong>. If too much is packed in, the model might prioritize certain parts (commonly the beginning or end) and overlook or “forget” details in the middle. This can lead to overly coarse summaries or missed facts. In contrast, a smaller chunk bounds the problem: the model can pay full attention to that section. You are trading off <strong>global context for local focus</strong>.</p>
<p>No rule of thumb can perfectly determine the best chunk size for your use case – <strong>you should validate with experiments</strong>. The optimal chunk size can vary by domain and model. I treat chunk size as a hyperparameter to tune.</p>
<p><a href="../../../blog/posts/evals-faq/how-do-i-choose-the-right-chunk-size-for-my-document-processing-tasks.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-how-do-i-debug-multi-turn-conversation-traces" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-i-debug-multi-turn-conversation-traces">Q: How do I debug multi-turn conversation traces?</h2>
<p>Start simple. Check if the whole conversation met the user’s goal with a pass/fail judgment. Look at the entire trace and focus on the first upstream failure. Read the user-visible parts first to understand if something went wrong. Only then dig into the technical details like tool calls and intermediate steps.</p>
<section id="multi-agent-trace-logging" class="level3">
<h3 class="anchored" data-anchor-id="multi-agent-trace-logging">Multi-agent trace logging</h3>
<p>For multi-agent flows, assign a session or trace ID to each user request and log every message with its source (which agent or tool), trace ID, and position in the sequence. This lets you reconstruct the full path from initial query to final result across all agents.</p>
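<p>A minimal sketch of such a logging schema (field names and the in-memory store are illustrative assumptions; production systems would write to an observability backend):</p>

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class TraceEvent:
    trace_id: str   # one ID per user request, shared across all agents
    seq: int        # position in the sequence
    source: str     # which agent or tool emitted this message
    payload: str
    ts: float = field(default_factory=time.time)

class TraceLog:
    def __init__(self):
        self.events = []

    def new_trace(self):
        return uuid.uuid4().hex

    def log(self, trace_id, source, payload):
        seq = sum(1 for e in self.events if e.trace_id == trace_id)
        self.events.append(TraceEvent(trace_id, seq, source, payload))

    def reconstruct(self, trace_id):
        """Rebuild the full path from initial query to final result."""
        evts = sorted((e for e in self.events if e.trace_id == trace_id),
                      key=lambda e: e.seq)
        return [f"{e.source}: {e.payload}" for e in evts]
```

<p>With every message carrying a trace ID, source, and sequence number, reconstructing a session for review is a simple filter-and-sort.</p>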
</section>
<section id="annotation-strategy" class="level3">
<h3 class="anchored" data-anchor-id="annotation-strategy">Annotation strategy</h3>
<p>Annotate only the first failure in the trace initially—don’t worry about downstream failures since these often cascade from the first issue. Fixing upstream failures often resolves dependent downstream failures automatically. As you gain experience, you can annotate independent failure modes within the same trace to speed up overall error analysis.</p>
</section>
<section id="simplify-when-possible" class="level3">
<h3 class="anchored" data-anchor-id="simplify-when-possible">Simplify when possible</h3>
<p>When you find a failure, reproduce it with the simplest possible test case. Here’s an example: suppose a shopping bot gives the wrong return policy on turn 4 of a conversation. Before diving into the full multi-turn complexity, simplify it to a single turn: “What is the return window for product X1000?” If it still fails, you’ve proven the error isn’t about conversation context - it’s likely a basic retrieval or knowledge issue you can debug more easily.</p>
</section>
<section id="test-case-generation" class="level3">
<h3 class="anchored" data-anchor-id="test-case-generation">Test case generation</h3>
<p>You have two main approaches. First, simulate users with another LLM to create realistic multi-turn conversations. Second, use “N-1 testing” where you provide the first N-1 turns of a real conversation and test what happens next. The N-1 approach often works better since it uses actual conversation prefixes rather than fully synthetic interactions, but is less flexible.</p>
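<p>A minimal sketch of the N-1 approach, with a stub standing in for the model client (the <code>chat</code> interface, message format, and stub reply are all assumptions):</p>

```python
def n_minus_1_test(client, conversation, check):
    """Replay the first N-1 turns of a real logged conversation, regenerate
    the final assistant turn, and validate it with `check`."""
    prefix = conversation[:-1]   # actual conversation prefix (turns 1..N-1)
    reply = client.chat(prefix)  # regenerate turn N
    return check(reply)

class StubClient:
    """Stand-in model for illustration; always returns a fixed reply."""
    def chat(self, messages):
        return "The return window for the X1000 is 30 days."

convo = [
    {"role": "user", "content": "Do you sell the X1000?"},
    {"role": "assistant", "content": "Yes, we do."},
    {"role": "user", "content": "What is its return window?"},
    {"role": "assistant", "content": "(logged reply)"},
]
assert n_minus_1_test(StubClient(), convo, lambda r: "30 days" in r)
```

<p>Because the prefix comes from a real session, the test exercises realistic conversational context without depending on a fully simulated user.</p>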
<p>The key is balancing thoroughness with efficiency. Not every multi-turn failure requires multi-turn analysis.</p>
<p><a href="../../../blog/posts/evals-faq/how-do-i-debug-multi-turn-conversation-traces.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-how-do-i-evaluate-sessions-with-human-handoffs" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-i-evaluate-sessions-with-human-handoffs">Q: How do I evaluate sessions with human handoffs?</h2>
<p>Capture the complete user journey in your traces, including human handoffs. The trace continues until the user’s need is resolved or the session ends, not when AI hands off to a human. Log the handoff decision, why it occurred, context transferred, wait time, human actions, final resolution, and whether the human had sufficient context. Many failures occur at handoff boundaries where AI hands off too early, too late, or without proper context.</p>
<p>Evaluate handoffs as potential failure modes during error analysis. Ask: Was the handoff necessary? Did the AI provide adequate context? Track both handoff quality and handoff rate. Sometimes the best improvement reduces handoffs entirely rather than improving handoff execution.</p>
<p><a href="../../../blog/posts/evals-faq/how-do-i-evaluate-sessions-with-human-handoffs.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-do-i-evaluate-complex-multi-step-workflows" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-i-evaluate-complex-multi-step-workflows">Q: How do I evaluate complex multi-step workflows?</h2>
<p>Log the entire workflow from initial trigger to final business outcome. Include LLM calls, tool usage, human approvals, and database writes in your traces. You will need this visibility to properly diagnose failures.</p>
<p>Use both outcome and process metrics. Outcome metrics verify the final result meets requirements: Was the business case complete? Accurate? Properly formatted? Process metrics evaluate efficiency: step count, time taken, resource usage. Process failures are often easier to debug since they’re more deterministic, so tackle them first.</p>
<p>Segment your error analysis by workflow stages. Early stage failures (understanding user input) differ from middle stage failures (data processing) and late stage failures (formatting output). Early stage improvements have more impact since errors cascade in LLM chains.</p>
<p>Use transition failure matrices to analyze where workflows break. Create a matrix showing the last successful state versus where the first failure occurred. This reveals failure hotspots and guides where to invest debugging effort.</p>
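<p>A minimal sketch of building such a matrix from failing runs (the state names and failure counts are hypothetical):</p>

```python
from collections import Counter

def transition_failure_matrix(failing_runs):
    """failing_runs: iterable of (last_ok_state, first_failed_state) pairs,
    one per failing workflow run. Returns a Counter keyed by the pair."""
    return Counter(failing_runs)

def hotspots(matrix, top=3):
    """The most common failure transitions, i.e. where to debug first."""
    return matrix.most_common(top)

# Hypothetical data: 9 runs break between understanding input and fetching
# data, 2 between fetching data and formatting output.
runs = ([("understand_input", "fetch_data")] * 9
        + [("fetch_data", "format_output")] * 2)
matrix = transition_failure_matrix(runs)
assert hotspots(matrix, top=1) == [(("understand_input", "fetch_data"), 9)]
```

<p>Pivoting the counter into a table (rows: last successful state, columns: first failure) gives the matrix view described above.</p>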
<p><a href="../../../blog/posts/evals-faq/how-do-i-evaluate-complex-multi-step-workflows.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-do-i-evaluate-agentic-workflows" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-i-evaluate-agentic-workflows">Q: How do I evaluate agentic workflows?</h2>
<p>We recommend evaluating agentic workflows in two phases:</p>
<p><strong>1. End-to-end task success.</strong> Treat the agent as a black box and ask “did we meet the user’s goal?”. Define a precise success rule per task (exact answer, correct side-effect, etc.) and measure with human or <a href="https://hamel.dev/blog/posts/llm-judge/" target="_blank">aligned LLM judges</a>. Take note of the first upstream failure when conducting error analysis.</p>
<p>Once error analysis reveals which workflows fail most often, move to step-level diagnostics to understand why they’re failing.</p>
<p><strong>2. Step-level diagnostics.</strong> Assuming that you have sufficiently <a href="https://hamel.dev/blog/posts/evals/#logging-traces" target="_blank">instrumented your system</a> with details of tool calls and responses, you can score individual components such as:</p>
<ul>
<li><em>Tool choice</em>: was the selected tool appropriate?</li>
<li><em>Parameter extraction</em>: were inputs complete and well-formed?</li>
<li><em>Error handling</em>: did the agent recover from empty results or API failures?</li>
<li><em>Context retention</em>: did it preserve earlier constraints?</li>
<li><em>Efficiency</em>: how many steps, seconds, and tokens were spent?</li>
<li><em>Goal checkpoints</em>: for long workflows, verify key milestones.</li>
</ul>
<p>Example: “Find Berkeley homes under $1M and schedule viewings” breaks into: parameters extracted correctly, relevant listings retrieved, availability checked, and calendar invites sent. Each checkpoint can pass or fail independently, making debugging tractable.</p>
<p><strong>Use transition failure matrices to understand error patterns.</strong> Create a matrix where rows represent the last successful state and columns represent where the first failure occurred. This is a great way to understand where the most failures occur.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/evals-faq/images/shreya_matrix.png" class="img-fluid figure-img" style="width:75.0%"></p>
<figcaption>Transition failure matrix showing hotspots in text-to-SQL agent workflow</figcaption>
</figure>
</div>
<p>Transition matrices transform overwhelming agent complexity into actionable insights. Instead of drowning in individual trace reviews, you can immediately see that GenSQL → ExecSQL transitions cause 12 failures while DecideTool → PlanCal causes only 2. This data-driven approach guides where to invest debugging effort. Here is another <a href="https://www.figma.com/deck/nwRlh5renu4s4olaCsf9lG/Failure-is-a-Funnel?node-id=2009-927&amp;t=GJlTtxQ8bLJaQ92A-1" target="_blank">example</a> from Bryan Bischof, that is also a text-to-SQL agent:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/evals-faq/images/bischof_matrix.png" class="img-fluid figure-img" style="width:75.0%"></p>
<figcaption>Bischof, Bryan “Failure is A Funnel - Data Council, 2025”</figcaption>
</figure>
</div>
<p>In this example, Bryan shows variation in transition matrices across experiments. How you organize your transition matrix depends on the specifics of your application. For example, Bryan’s text-to-SQL agent has an inherent sequential workflow which he exploits for further analytical insight. You can watch his <a href="https://youtu.be/R_HnI9oTv3c?si=hRRhDiydHU5k6ikc" target="_blank">full talk</a> for more details.</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/R_HnI9oTv3c" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p><strong>Creating Test Cases for Agent Failures</strong></p>
<p>Creating test cases for agent failures follows the same principles as our previous FAQ on debugging multi-turn conversation traces (i.e.&nbsp;try to reproduce the error in the simplest way possible, only use multi-turn tests when the failure actually requires conversation context, etc.).</p>
<p><a href="../../../blog/posts/evals-faq/how-do-i-evaluate-agentic-workflows.html" class="faq-individual-link">↗ Focus view</a></p>
<hr>
<div class="cta" style="text-align: center;">
<p><strong>👉 <em>Want to learn more about AI Evals? Check out our <a href="https://bit.ly/evals-ai" target="_blank">AI Evals course</a></em></strong>. It’s a live cohort with hands-on exercises and office hours. Here is a <a href="https://bit.ly/evals-ai" target="_blank">25% discount code</a> for readers. 👈</p>
</div>
<hr>


</section>
</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p><a href="https://www.linkedin.com/in/intellectronica/" target="_blank">Eleanor Berger</a>, our wonderful TA.↩︎</p></li>
<li id="fn2"><p>Paul Graham, <a href="https://paulgraham.com/writes.html" target="_blank">“Writes and Write-Nots”</a>↩︎</p></li>
<li id="fn3"><p>Shreya Shankar, et al., <a href="https://arxiv.org/abs/2404.12272" target="_blank">“Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences”</a>↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>LLMs</category>
  <category>evals</category>
  <guid>https://hamel.dev/blog/posts/evals-faq/</guid>
  <pubDate>Thu, 15 Jan 2026 08:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/evals-faq/images/eval_faq.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Selecting The Right AI Evals Tool</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/eval-tools/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<p>Over the past year, I’ve focused heavily on <a href="../../../blog/posts/evals-faq/index.html">AI Evals</a>, both in my consulting work and teaching. A question I get constantly is, “What’s the best tool for evals?”. I’ve always resisted answering directly for two reasons. First, people focus too much on tools instead of the process, thinking the tool will be an off-the-shelf solution when it rarely is. Second, the tools change so quickly that comparisons become outdated immediately.</p>
<p>Having used many of the popular eval tools, I can genuinely say that no single one is superior in every dimension. The “best” tool depends on your team’s skillset, technical stack, and maturity.</p>
<p>Instead of a feature-by-feature comparison, I think it’s more valuable to show you <em>how</em> a panel of data scientists skilled in evals assesses these tools. As part of my AI Evals <a href="https://maven.com/parlance-labs/evals?promoCode=evals-info-url" target="_blank">course</a>, we had three of the most dominant vendors—Langsmith, Braintrust, and Arize Phoenix—complete the same homework <a href="https://github.com/ai-evals-course/recipe-chatbot" target="_blank">assignment</a>. This gave us a unique opportunity to see how they tackle the exact same challenge.</p>
<p>We recorded the entire process and live commentary, which is available below. We think this might be helpful in learning about the kinds of things you should consider when selecting a tool for your team.</p>
<p><em>Thanks to <a href="https://www.sh-reya.com/">Shreya Shankar</a> and <a href="https://x.com/BEBischof">Bryan Bischof</a> for serving as the panelists (alongside me).</em></p>
<section id="langsmith" class="level2">
<h2 class="anchored" data-anchor-id="langsmith">Langsmith</h2>
<p>With <a href="https://x.com/hwchase17">Harrison Chase</a>, CEO of LangChain.</p>
<div style="width: 70%; margin: auto;">
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/y0vm_fjkejo" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</div>
</section>
<section id="braintrust" class="level2">
<h2 class="anchored" data-anchor-id="braintrust">Braintrust</h2>
<p>With <a href="https://x.com/waydegilliam">Wayde Gilliam</a>, former developer relations at Braintrust.</p>
<div style="width: 70%; margin: auto;">
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/97iykOemOn4" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</div>
</section>
<section id="arize-phoenix" class="level2">
<h2 class="anchored" data-anchor-id="arize-phoenix">Arize Phoenix</h2>
<p>With <a href="https://www.linkedin.com/in/sallyann-delucia-59a381172/">SallyAnn DeLucia</a>, Technical AI Product Leader at Arize.</p>
<div style="width: 70%; margin: auto;">
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/wcYnzHJlUR0" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</div>
<hr>
</section>
<section id="criteria-for-assessing-ai-evals-tools" class="level2">
<h2 class="anchored" data-anchor-id="criteria-for-assessing-ai-evals-tools">Criteria for Assessing AI Evals Tools</h2>
<p>Here are themes that consistently surfaced during our review.</p>
<section id="workflow-and-developer-experience" class="level3">
<h3 class="anchored" data-anchor-id="workflow-and-developer-experience">1. Workflow and Developer Experience</h3>
<p>Reducing friction is more important than any single feature. Concretely, you should be mindful of the time it takes to go from observing a failure to iterating on a solution. For example, we appreciated the ability to go from viewing a single trace to experimenting with that same trace in a playground. For some teams with data-science backgrounds, a notebook-centric workflow is ideal as it provides transparency and control. This happens to be my preferred workflow as well.</p>
<p>When considering a notebook-centric workflow, it’s important to pay attention to the ergonomics of the SDK. This often boils down to the quality of the documentation and integration with existing data tools.</p>
</section>
<section id="human-in-the-loop-support" class="level3">
<h3 class="anchored" data-anchor-id="human-in-the-loop-support">2. Human-in-the-Loop Support</h3>
<p>The best tools don’t try to automate away the human; they empower them. Since error analysis is the highest ROI activity in AI engineering, a tool’s ability to support efficient human review is paramount. Prioritize tools with first-class support for manual annotation and error analysis. As of this writing, one thing that is missing from many tools is <a href="../../../blog/posts/evals-faq/index.html#q-why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed" target="_blank">axial coding</a>.</p>
</section>
<section id="transparency-and-control-vs.-magic" class="level3">
<h3 class="anchored" data-anchor-id="transparency-and-control-vs.-magic">3. Transparency and Control vs.&nbsp;“Magic”</h3>
<p>Be deeply skeptical of features that promise full automation without human validation, as these can create a powerful and dangerous illusion of confidence. For example, be wary of features where an AI agent both creates an evaluation rubric and then immediately scores the outputs. This “stacking of abstractions” often hides flaws behind a high score. Favor tools that give you control and visibility.</p>
</section>
<section id="ecosystem-integration-vs.-walled-gardens" class="level3">
<h3 class="anchored" data-anchor-id="ecosystem-integration-vs.-walled-gardens">4. Ecosystem Integration vs.&nbsp;Walled Gardens</h3>
<p>An eval tool should fit your stack, not force you to fit its stack. Assess how well a tool integrates with your existing technologies. Also, beware of proprietary DSLs as they can add friction. Finally, the ability to export data into common formats for analysis in a variety of environments is a must-have.</p>
</section>
<section id="conclusion" class="level3">
<h3 class="anchored" data-anchor-id="conclusion">Conclusion</h3>
<p>The right choice of tool depends on your team’s workflow, skillset, and specific needs. I hope seeing how our panel approached this evaluation provides a better framework for making your own decision.</p>
<p>As for me personally, I tend to use these tools as a backend data store and use Jupyter notebooks as well as my own <a href="../../../blog/posts/evals-faq/index.html#q-should-i-build-a-custom-annotation-tool-or-use-something-off-the-shelf" target="_blank">custom built</a> annotation interfaces for most of my needs.</p>
<hr>
<div class="callout callout-style-default callout-caution callout-titled">
<div class="callout-header d-flex align-content-center" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="true" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Caution</span>Appendix: Vendor Snapshots (As of Mid-2025)
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse show">
<div class="callout-body-container callout-body">
<p><strong>You should take these notes with a grain of salt. I recommend watching the videos above to get a sense of how we applied these criteria and where you might differ according to your needs.</strong></p>
<section id="langsmith-evaluation-notes" class="level3">
<h3 class="anchored" data-anchor-id="langsmith-evaluation-notes">Langsmith Evaluation Notes</h3>
<p><strong>Overall Sentiment</strong> The overall workflow is intuitive, especially for those new to formal evaluation processes. The UI guides you through creating datasets, running experiments, and annotating results.</p>
<p><strong>Positive Feedback / What We Liked</strong></p>
<ul>
<li><strong>Seamless Workflow from Trace to Playground:</strong> The transition from inspecting a trace to experimenting with it in the playground is very smooth.</li>
<li><strong>AI-Assisted Prompt Improvement:</strong> The “Prompt Canvas” feature is a powerful tool for prompt engineering.</li>
<li><strong>Dataset Creation and Management:</strong> You can easily create datasets by uploading files, and the schema detection helps structure the data correctly.</li>
<li><strong>Experimentation and Evaluation:</strong> The “Annotation Queue” is a dedicated interface for human review and labeling of traces, which is more efficient than using spreadsheets.</li>
</ul>
<p><strong>Critiques and Areas for Improvement</strong></p>
<ul>
<li><strong>Limited Side-by-Side Comparison:</strong> The UI doesn’t make it easy to see side-by-side comparisons of different prompt versions and their outputs.</li>
<li><strong>UI/UX Concerns:</strong> The UI can feel a bit cluttered, with a lot of options and information presented at once.</li>
<li><strong>Potential for Over-Automation:</strong> Features like AI-generated examples, while convenient, can lead to homogenous data.</li>
</ul>
</section>
<section id="braintrust-evaluation-notes" class="level3">
<h3 class="anchored" data-anchor-id="braintrust-evaluation-notes">Braintrust Evaluation Notes</h3>
<p><strong>Overall Sentiment</strong> The panel had a generally positive view of Braintrust, highlighting its clean UI and structured approach to evaluations. The tool’s emphasis on human-in-the-loop workflows was a significant strength.</p>
<p><strong>Positive Feedback / What We Liked</strong></p>
<ul>
<li><strong>Focus on a Structured Evals Process:</strong> The demonstration emphasized a solid, methodical approach, starting by involving subject-matter experts to create an initial dataset.</li>
<li><strong>Clean and Intuitive User Interface (UI):</strong> The panel found the UI to be clean and easier to navigate than other tools, with a particularly readable trace viewing screen.</li>
<li><strong>Strong Support for Human-in-the-Loop Workflows:</strong> The platform has dedicated UIs designed for human review and annotation, which is critical for creating high-quality datasets and performing error analysis.</li>
<li><strong>The “Money Table”:</strong> After annotating traces with failure modes, the final dataset view is an actionable output that allows teams to quickly sort, filter, and quantify the most common failure modes.</li>
</ul>
<p><strong>Critiques and Areas for Improvement</strong></p>
<ul>
<li><strong>The “Loop” AI Scorer:</strong> The most significant concern was the “Loop” feature, an AI agent that creates an evaluation rubric and then immediately scores the outputs, which could lead to a false sense of security.</li>
<li><strong>Reliance on a Proprietary Query Language (BTQL):</strong> The panel viewed the use of “BTQL” with mild skepticism, stating a preference for exporting data to a Jupyter notebook.</li>
<li><strong>Clunky Data Workflows:</strong> The process for generating and refining synthetic data seemed inefficient, requiring downloading and re-uploading data between steps.</li>
</ul>
</section>
<section id="arize-phoenix-evaluation-notes" class="level3">
<h3 class="anchored" data-anchor-id="arize-phoenix-evaluation-notes">Arize Phoenix Evaluation Notes</h3>
<p><strong>Overall Sentiment</strong> The panel had a generally positive view of Phoenix, with one panelist calling it one of his “favorite open source eval tools.” The tool is positioned as a developer-first, notebook-centric platform.</p>
<p><strong>Positive Feedback / What We Liked</strong></p>
<ul>
<li><strong>Notebook-Centric Workflow:</strong> The entire evaluation process was driven from a Jupyter notebook, giving the developer transparency and control. The ability to export annotated data back into a Pandas DataFrame was a powerful feature.</li>
<li><strong>UI &amp; Developer Experience:</strong> The prompt management UI was praised for being clear and easy to understand. The tight integration between traces and the “Playground” was also noted as a smooth workflow.</li>
<li><strong>Open Source &amp; Local-First Approach:</strong> Phoenix can be run entirely locally, providing a sense of control and transparency. As an open-source tool, it was noted for being “hackable.”</li>
</ul>
<p><strong>Critiques and Areas for Improvement</strong></p>
<ul>
<li><strong>UI Readability:</strong> The text in the output panes was difficult to read during the demonstration, with a possible lack of markdown rendering for model outputs.</li>
<li><strong>Metrics and Visualization:</strong> The tool displays point statistics for each run, but the panel found this of limited use and expressed a desire for aggregate visualizations like histograms to identify outliers.</li>
<li><strong>Prompt Management and Testing:</strong> The prompt editor treats the system prompt as one large, monolithic block of text. A more component-based approach where individual instructions could be toggled on and off (“ablated”) would be preferable for systematic testing.</li>
</ul>
</section>
</div>
</div>
</div>


</section>
</section>

 ]]></description>
  <category>AI</category>
  <category>Evals</category>
  <guid>https://hamel.dev/blog/posts/eval-tools/</guid>
  <pubDate>Wed, 01 Oct 2025 07:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/eval-tools/cover-img-2.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Stop Saying RAG Is Dead</title>
  <link>https://hamel.dev/</link>
  <description>Why the future of RAG lies in better retrieval, not bigger context windows.
</description>
  <category>LLMs</category>
  <category>RAG</category>
  <guid>https://hamel.dev/</guid>
  <pubDate>Fri, 11 Jul 2025 07:00:00 GMT</pubDate>
</item>
<item>
  <title>Inspect AI, An OSS Python Library For LLM Evals</title>
  <link>https://hamel.dev/</link>
  <description>A look at Inspect AI with its creator, JJ Allaire - an open-source framework for building and running LLM evaluations.
</description>
  <category>LLMs</category>
  <category>Evals</category>
  <guid>https://hamel.dev/</guid>
  <pubDate>Mon, 23 Jun 2025 07:00:00 GMT</pubDate>
</item>
<item>
  <title>A Field Guide to Rapidly Improving AI Products</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/field-guide/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<p>Most AI teams focus on the wrong things. Here’s a common scene from my consulting work:</p>
<div class="screenplay" style="border: 2px solid #ccc; border-radius: 5px; padding: 20px; font-family: Courier, monospace;">
<div style="margin-bottom: 15px;">
<div style="text-transform: uppercase; font-weight: bold;">
AI TEAM
</div>
<blockquote class="blockquote">
<p>
Here’s our agent architecture – we’ve got RAG here, a router there, and we’re using this new framework for…
</p>
</blockquote>
</div>
<div style="margin-bottom: 15px;">
<div style="text-transform: uppercase; font-weight: bold;">
ME
</div>
<blockquote class="blockquote">
<div style="font-style: italic;">
<p>
[Holding up my hand to pause the enthusiastic tech lead.]
</p>
</div>
<p>
“Can you show me how you’re measuring if any of this actually works?”
</p>
</blockquote>
</div>
<div style="margin-bottom: 15px;">
<div style="font-style: italic;">
… Room goes quiet
</div>
</div>
</div>
<p><br></p>
<p>This scene has played out dozens of times over the last two years. Teams invest weeks building complex AI systems, but can’t tell me if their changes are helping or hurting.</p>
<p>This isn’t surprising. With new tools and frameworks emerging weekly, it’s natural to focus on tangible things we can control – which vector database to use, which LLM provider to choose, which agent framework to adopt. But after helping 30+ companies build AI products, I’ve discovered the teams who succeed barely talk about tools at all. Instead, they obsess over measurement and iteration.</p>
<p>In this post, I’ll show you exactly how these successful teams operate. You’ll learn:</p>
<ol type="1">
<li>How error analysis consistently reveals the highest-ROI improvements</li>
<li>Why a simple data viewer is your most important AI investment</li>
<li>How to empower domain experts (not just engineers) to improve your AI</li>
<li>Why synthetic data is more effective than you think</li>
<li>How to maintain trust in your evaluation system</li>
<li>Why your AI roadmap should count experiments, not features</li>
</ol>
<p>I’ll explain each of these topics with real examples. While every situation is unique, you’ll see patterns that apply regardless of your domain or team size.</p>
<p>Let’s start by examining the most common mistake I see teams make – one that derails AI projects before they even begin.</p>
<section id="the-most-common-mistake-skipping-error-analysis" class="level2">
<h2 class="anchored" data-anchor-id="the-most-common-mistake-skipping-error-analysis">1. The Most Common Mistake: Skipping Error Analysis</h2>
<p>The “tools first” mindset is the most common mistake in AI development. Teams get caught up in architecture diagrams, frameworks, and dashboards while neglecting the process of actually understanding what’s working and what isn’t.</p>
<p>One client proudly showed me this evaluation dashboard:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/dashboard.png" class="img-fluid figure-img"></p>
<figcaption>The kind of dashboard that foreshadows failure.</figcaption>
</figure>
</div>
<p>This is the “tools trap” – the belief that adopting the right tools or frameworks (in this case, generic metrics) will solve your AI problems. Generic metrics are worse than useless – they actively impede progress in two ways:</p>
<p>First, they create a <strong>false sense of measurement and progress</strong>. Teams think they’re data-driven because they have dashboards, but they’re tracking vanity metrics that don’t correlate with real user problems. I’ve seen teams celebrate improving their “helpfulness score” by 10% while their actual users were still struggling with basic tasks. It’s like optimizing your website’s load time while your checkout process is broken – you’re getting better at the wrong thing.</p>
<p>Second, too many metrics fragment your attention. Instead of focusing on the few metrics that matter for your specific use case, you’re trying to optimize multiple dimensions simultaneously. When everything is important, nothing is.</p>
<p>The alternative? Error analysis - the single most valuable activity in AI development and consistently the highest-ROI activity. Let me show you what effective error analysis looks like in practice.</p>
<section id="the-error-analysis-process" class="level3">
<h3 class="anchored" data-anchor-id="the-error-analysis-process">The Error Analysis Process</h3>
<p>When Jacob, the founder of <a href="https://nurtureboss.io/">NurtureBoss</a>, needed to improve their apartment-industry AI assistant, his team built a simple viewer to examine conversations between their AI and users. Next to each conversation was a space for open-ended notes about failure modes.</p>
<p>After annotating dozens of conversations, clear patterns emerged. Their AI was struggling with date handling – failing 66% of the time when users said things like “let’s schedule a tour two weeks from now.”</p>
<p>Instead of reaching for new tools, they:</p>
<ol type="1">
<li>Looked at actual conversation logs</li>
<li>Categorized the types of date-handling failures</li>
<li>Built specific tests to catch these issues</li>
<li>Measured improvement on these metrics</li>
</ol>
<p>The result? Their date handling success rate improved from 33% to 95%.</p>
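<p>The “build specific tests” step can be sketched in plain Python. Everything below (the function name, the phrasing table, the parsing logic) is illustrative; the actual NurtureBoss system used an LLM for date handling, so treat this as a hedged stand-in for the kind of targeted check the team wrote:</p>

```python
import re
from datetime import date, timedelta
from typing import Optional

# Hypothetical stand-in for the assistant's date-handling step.
# Number words and phrasings here are illustrative only.
WORD_NUMS = {"one": 1, "two": 2, "three": 3, "a": 1}

def parse_relative_date(text: str, today: date) -> Optional[date]:
    """Parse phrases like 'two weeks from now' into a concrete date."""
    m = re.search(r"(\w+)\s+(day|week|month)s?\s+from\s+now", text.lower())
    if not m:
        return None
    word = m.group(1)
    qty = WORD_NUMS.get(word, int(word) if word.isdigit() else None)
    if qty is None:
        return None
    unit_days = {"day": 1, "week": 7, "month": 30}[m.group(2)]
    return today + timedelta(days=qty * unit_days)

# Targeted checks mirroring the failure mode surfaced by error analysis
today = date(2025, 1, 1)
assert parse_relative_date("let's schedule a tour two weeks from now", today) == date(2025, 1, 15)
assert parse_relative_date("tomorrow works", today) is None
```

<p>The point is not the parser itself but the shape of the test: a concrete input drawn from a real failing conversation, with an exact expected output you can re-run after every change.</p>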
<p>Here’s Jacob explaining this process himself:</p>
<div class="quarto-video"><iframe data-external="1" src="https://www.youtube.com/embed/e2i6JbU2R-s" width="800" height="450" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</section>
<section id="bottom-up-vs.-top-down-analysis" class="level3">
<h3 class="anchored" data-anchor-id="bottom-up-vs.-top-down-analysis">Bottom-Up vs.&nbsp;Top-Down Analysis</h3>
<p>When identifying error types, you can take either a “top-down” or “bottom-up” approach.</p>
<p>The <strong>top-down</strong> approach starts with common metrics like “hallucination” or “toxicity” plus metrics unique to your task. While convenient, it often misses domain-specific issues.</p>
<p>The more effective <strong>bottom-up</strong> approach forces you to look at actual data and let metrics naturally emerge. At NurtureBoss, we started with a spreadsheet where each row represented a conversation. We wrote open-ended notes on any undesired behavior. Then we used an LLM to build a taxonomy of common failure modes. Finally, we mapped each row to specific failure mode labels and counted the frequency of each issue.</p>
<p>The results were striking - just three issues accounted for over 60% of all problems:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/pivot.png" class="img-fluid figure-img" width="200"></p>
<figcaption>Excel Pivot Tables are a simple tool, but they work!</figcaption>
</figure>
</div>
<ul>
<li>Conversation flow issues (missing context, awkward responses)</li>
<li>Handoff failures (not recognizing when to transfer to humans)</li>
<li>Rescheduling problems (struggling with date handling)</li>
</ul>
<p>The impact was immediate. Jacob’s team had uncovered so many actionable insights that they needed several weeks just to implement fixes for the problems we’d already found.</p>
<p>If you’d like to see error analysis in action, we recorded a <a href="https://youtu.be/qH1dZ8JLLdU">live walkthrough here</a>.</p>
<p>This brings us to a crucial question: How do you make it easy for teams to look at their data? The answer leads us to what I consider the most important investment any AI team can make…</p>
</section>
</section>
<section id="the-most-important-ai-investment-a-simple-data-viewer" class="level2">
<h2 class="anchored" data-anchor-id="the-most-important-ai-investment-a-simple-data-viewer">2. The Most Important AI Investment: A Simple Data Viewer</h2>
<p>The single most impactful investment I’ve seen AI teams make isn’t a fancy evaluation dashboard – it’s building a customized interface that lets anyone examine what their AI is actually doing. I emphasize <em>customized</em> because every domain has unique needs that off-the-shelf tools rarely address. When reviewing apartment leasing conversations, you need to see the full chat history and scheduling context. For real estate queries, you need the property details and source documents right there. Even small UX decisions – like where to place metadata or which filters to expose – can make the difference between a tool people actually use and one they avoid.</p>
<p>I’ve watched teams struggle with generic labeling interfaces, hunting through multiple systems just to understand a single interaction. The friction adds up: clicking through to different systems to see context, copying error descriptions into separate tracking sheets, switching between tools to verify information. This friction doesn’t just slow teams down – it actively discourages the kind of systematic analysis that catches subtle issues.</p>
<p>Teams with thoughtfully designed data viewers iterate 10x faster than those without them. And here’s the thing: <strong>these tools can be built in hours using AI-assisted development</strong> (like Cursor or Lovable). The investment is minimal compared to the returns.</p>
<p>Let me show you what I mean. Here’s the data viewer built for NurtureBoss (which we discussed earlier):</p>
<div class="quarto-layout-panel" data-layout-ncol="2">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/nboss_filter.png" class="img-fluid figure-img"></p>
<figcaption>Search and filter sessions</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/nboss_annotate.png" class="img-fluid figure-img"></p>
<figcaption>Annotate and add notes</figcaption>
</figure>
</div>
</div>
</div>
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/nboss_analysis.png" class="img-fluid figure-img"></p>
<figcaption>Aggregate and count errors</figcaption>
</figure>
</div>
</div>
</div>
</div>
<p>Here’s what makes a good data annotation tool:</p>
<ol type="1">
<li>Show all context in one place. Don’t make users hunt through different systems to understand what happened.</li>
<li>Make feedback trivial to capture. One-click correct/incorrect buttons beat lengthy forms.</li>
<li>Capture open-ended feedback. This lets you capture nuanced issues that don’t fit into a pre-defined taxonomy.</li>
<li>Enable quick filtering and sorting. Teams need to easily dive into specific error types. In the example above, NurtureBoss can quickly filter by channel (voice, text, chat) or by the specific property they want to look at.</li>
<li>Have hotkeys that allow users to navigate between data examples and annotate without clicking.</li>
</ol>
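<p>To make the checklist concrete, here is a rough sketch of the data model such a viewer sits on top of. The field names and filter function are hypothetical, invented for illustration rather than taken from NurtureBoss’s actual tool:</p>

```python
from dataclasses import dataclass

# Hypothetical record behind an annotation viewer
@dataclass
class Annotation:
    trace_id: str
    channel: str       # e.g. "voice", "text", "chat"
    is_correct: bool   # one-click correct/incorrect (point 2)
    notes: str = ""    # open-ended feedback (point 3)

def filter_annotations(annotations, channel=None, only_errors=False):
    """Quick filtering, as in point 4 of the checklist."""
    out = annotations
    if channel is not None:
        out = [a for a in out if a.channel == channel]
    if only_errors:
        out = [a for a in out if not a.is_correct]
    return list(out)

data = [
    Annotation("t1", "voice", False, "missed context"),
    Annotation("t2", "text", True),
    Annotation("t3", "voice", True),
]
errors = filter_annotations(data, channel="voice", only_errors=True)
print([a.trace_id for a in errors])  # ['t1']
```

<p>A viewer is mostly this: a flat table of annotated traces plus fast filters over it. The UI on top can be as simple as your team tolerates.</p>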
<p>It doesn’t matter what web framework you use - use whatever you are familiar with. Because I’m a Python developer, my current favorite web framework is <a href="https://fastht.ml/docs/">FastHTML</a> coupled with <a href="https://www.answer.ai/posts/2025-01-15-monsterui.html">MonsterUI</a>, because it allows me to define the back-end and front-end code in one small Python file.</p>
<p>The key is starting somewhere, even if it’s simple. I’ve found custom web apps provide the best experience, but if you’re just beginning, a spreadsheet is better than nothing. As your needs grow, you can evolve your tools accordingly.</p>
<p>This brings us to another counter-intuitive lesson: the people best positioned to improve your AI system are often the ones who know the least about AI.</p>
</section>
<section id="empower-domain-experts-to-write-prompts" class="level2">
<h2 class="anchored" data-anchor-id="empower-domain-experts-to-write-prompts">3. Empower Domain Experts To Write Prompts</h2>
<p>I recently worked with an education startup building an interactive learning platform with LLMs. Their product manager, a learning design expert, would create detailed PowerPoint decks explaining pedagogical principles and example dialogues. She’d present these to the engineering team, who would then translate her expertise into prompts.</p>
<p>But here’s the thing: prompts are just English. Having a learning expert communicate teaching principles through PowerPoint, only for engineers to translate that back into English prompts, created unnecessary friction. The most successful teams flip this model by giving domain experts tools to write and iterate on prompts directly.</p>
<section id="build-bridges-not-gatekeepers" class="level3">
<h3 class="anchored" data-anchor-id="build-bridges-not-gatekeepers">Build Bridges, Not Gatekeepers</h3>
<p>Prompt playgrounds are a great starting point for this. Tools like Arize, LangSmith, and Braintrust let teams quickly test different prompts, feed in example datasets, and compare results. Here are some screenshots of these tools:</p>
<div class="quarto-layout-panel" data-layout-ncol="3">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 33.3%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/pp_phoenix2.png" class="img-fluid figure-img"></p>
<figcaption>Arize Phoenix</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 33.3%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/pp_langsmith.png" class="img-fluid figure-img"></p>
<figcaption>LangSmith</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 33.3%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/pp_bt.png" class="img-fluid figure-img"></p>
<figcaption>Braintrust</figcaption>
</figure>
</div>
</div>
</div>
</div>
<p>But there’s a crucial next step that many teams miss: integrating prompt development into their application context. Most AI applications aren’t just prompts – they commonly involve RAG systems pulling from your knowledge base, agent orchestration coordinating multiple steps, and application-specific business logic. The most effective teams I’ve worked with go beyond standalone playgrounds. They build what I call <em><strong>integrated prompt environments</strong></em> – essentially admin versions of their actual user interface that expose prompt editing.</p>
<p>Here’s an illustration of what an integrated prompt environment might look like for a real estate AI assistant:</p>
<div class="quarto-layout-panel" data-layout-ncol="2">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/ipe_before.png" class="img-fluid figure-img" width="700"></p>
<figcaption>The UI that users (real estate agents) see.</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/ipe_after.png" class="img-fluid figure-img"></p>
<figcaption>The same UI, but with an “admin mode” used by the engineering &amp; product team to iterate on the prompt and debug issues.</figcaption>
</figure>
</div>
</div>
</div>
</div>
</section>
<section id="tips-for-communicating-with-domain-experts" class="level3">
<h3 class="anchored" data-anchor-id="tips-for-communicating-with-domain-experts">Tips For Communicating With Domain Experts</h3>
<p>There’s another barrier that often prevents domain experts from contributing effectively: unnecessary jargon. I was working with an education startup where engineers, product managers, and learning specialists were talking past each other in meetings. The engineers kept saying, “We’re going to build an agent that does XYZ,” when really the job to be done was writing a prompt. This created an artificial barrier – the learning specialists, who were the actual domain experts, felt like they couldn’t contribute because they didn’t understand “agents.”</p>
<p>This happens everywhere. I’ve seen it with lawyers at legal tech companies, psychologists at mental health startups, and doctors at healthcare firms. The magic of LLMs is that they make AI accessible through natural language, but we often destroy that advantage by wrapping everything in technical terminology.</p>
<p>Here’s a simple example of how to translate common AI jargon:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 73%">
<col style="width: 26%">
</colgroup>
<thead>
<tr class="header">
<th>Instead of saying…</th>
<th>Say…</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>“We’re implementing a RAG approach”</td>
<td>“We’re making sure the model has the right context to answer questions”</td>
</tr>
<tr class="even">
<td>“We need to prevent prompt injection”</td>
<td>“We need to make sure users can’t trick the AI into ignoring our rules”</td>
</tr>
<tr class="odd">
<td>“Our model suffers from hallucination issues”</td>
<td>“Sometimes the AI makes things up, so we need to check its answers”</td>
</tr>
</tbody>
</table>
<p>This doesn’t mean dumbing things down – it means being precise about what you’re actually doing. When you say “we’re building an agent,” what specific capability are you adding? Is it function calling? Tool use? Or just a better prompt? Being specific helps everyone understand what’s actually happening.</p>
<p>There’s nuance here. Technical terminology exists for a reason – it provides precision when talking with other technical stakeholders. The key is adapting your language to your audience.</p>
<p>The challenge many teams raise at this point is: “This all sounds great, but what if we don’t have any data yet? How can we look at examples or iterate on prompts when we’re just starting out?” That’s what we’ll talk about next.</p>
</section>
</section>
<section id="bootstrapping-your-ai-with-synthetic-data-is-effective-even-with-zero-users" class="level2">
<h2 class="anchored" data-anchor-id="bootstrapping-your-ai-with-synthetic-data-is-effective-even-with-zero-users">4. Bootstrapping Your AI With Synthetic Data Is Effective (Even With Zero Users)</h2>
<p>One of the most common roadblocks I hear from teams is: “We can’t do proper evaluation because we don’t have enough real user data yet.” This creates a chicken-and-egg problem – you need data to improve your AI, but you need a decent AI to get users who generate that data.</p>
<p>Fortunately, there’s a solution that works surprisingly well: synthetic data. LLMs can generate realistic test cases that cover the range of scenarios your AI will encounter.</p>
<p>As I wrote in my <a href="https://hamel.dev/blog/posts/llm-judge/#generating-data">LLM-as-a-Judge blog post</a>, synthetic data can be remarkably effective for evaluation. <a href="https://www.linkedin.com/in/bryan-bischof/">Bryan Bischof</a>, the former Head of AI at Hex, put it perfectly:</p>
<blockquote class="blockquote">
<p>“LLMs are surprisingly good at generating excellent - and diverse - examples of user prompts. This can be relevant for powering application features, and sneakily, for building Evals. If this sounds a bit like the Large Language Snake is eating its tail, I was just as surprised as you! All I can say is: it works, ship it.”</p>
</blockquote>
<section id="a-framework-for-generating-realistic-test-data" class="level3">
<h3 class="anchored" data-anchor-id="a-framework-for-generating-realistic-test-data">A Framework for Generating Realistic Test Data</h3>
<p>The key to effective synthetic data is choosing the right dimensions to test. While these dimensions will vary based on your specific needs, I find it helpful to think about three broad categories:</p>
<ol type="1">
<li><strong>Features</strong>: What capabilities does your AI need to support?</li>
<li><strong>Scenarios</strong>: What situations will it encounter?</li>
<li><strong>User Personas</strong>: Who will be using it and how?</li>
</ol>
<p>These aren’t the only dimensions you might care about – you might also want to test different tones of voice, levels of technical sophistication, or even different locales and languages. The important thing is identifying dimensions that matter for your specific use case.</p>
<p>For a real estate CRM AI assistant I worked on with <a href="https://www.rechat.com/">Rechat</a>, we defined these dimensions like this:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1">features <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb1-2">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"property search"</span>,      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Finding listings matching criteria</span></span>
<span id="cb1-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"market analysis"</span>,      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Analyzing trends and pricing</span></span>
<span id="cb1-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"scheduling"</span>,          <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Setting up property viewings</span></span>
<span id="cb1-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"follow-up"</span>           <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Post-viewing communication</span></span>
<span id="cb1-6">]</span>
<span id="cb1-7"></span>
<span id="cb1-8">scenarios <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb1-9">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"exact match"</span>,         <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># One perfect listing match</span></span>
<span id="cb1-10">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"multiple matches"</span>,    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Need to help user narrow down</span></span>
<span id="cb1-11">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"no matches"</span>,         <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Need to suggest alternatives</span></span>
<span id="cb1-12">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"invalid criteria"</span>     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Help user correct search terms</span></span>
<span id="cb1-13">]</span>
<span id="cb1-14"></span>
<span id="cb1-15">personas <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb1-16">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"first_time_buyer"</span>,    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Needs more guidance and explanation</span></span>
<span id="cb1-17">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"investor"</span>,           <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Focused on numbers and ROI</span></span>
<span id="cb1-18">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"luxury_client"</span>,      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Expects white-glove service</span></span>
<span id="cb1-19">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"relocating_family"</span>   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Has specific neighborhood/school needs</span></span>
<span id="cb1-20">]</span></code></pre></div></div>
<p>But having these dimensions defined is only half the battle. The real challenge is ensuring your synthetic data actually triggers the scenarios you want to test. This requires two things:</p>
<ol type="1">
<li>A test database with enough variety to support your scenarios</li>
<li>A way to verify that generated queries actually trigger intended scenarios</li>
</ol>
<p>For Rechat, we maintained a test database of listings that we knew would trigger different edge cases. Some teams prefer to use an anonymized copy of production data, but either way, you need to ensure your test data has enough variety to exercise the scenarios you care about.</p>
<p>Here’s an example of how we might use these dimensions with real data to generate test cases for the property search feature (this is just pseudo-code, and very illustrative):</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> generate_search_query(scenario, persona, listing_db):</span>
<span id="cb2-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Generate a realistic user query about listings"""</span></span>
<span id="cb2-3">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Pull real listing data to ground the generation</span></span>
<span id="cb2-4">    sample_listings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> listing_db.get_sample_listings(</span>
<span id="cb2-5">        price_range<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>persona.price_range,</span>
<span id="cb2-6">        location<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>persona.preferred_areas</span>
<span id="cb2-7">    )</span>
<span id="cb2-8">    </span>
<span id="cb2-9">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Verify we have listings that will trigger our scenario</span></span>
<span id="cb2-10">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> scenario <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"multiple_matches"</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(sample_listings) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>:</span>
<span id="cb2-11">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">raise</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">ValueError</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Need multiple listings for this scenario"</span>)</span>
<span id="cb2-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> scenario <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"no_matches"</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(sample_listings) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb2-13">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">raise</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">ValueError</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Found matches when testing no-match scenario"</span>)</span>
<span id="cb2-14">    </span>
<span id="cb2-15">    prompt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""</span></span>
<span id="cb2-16"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    You are an expert real estate agent who is searching for listings. You are given a customer type and a scenario.</span></span>
<span id="cb2-17"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb2-18"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    Your job is to generate a natural language query you would use to search these listings.</span></span>
<span id="cb2-19"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb2-20"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    Context:</span></span>
<span id="cb2-21"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    - Customer type: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>persona<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>description<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-22"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    - Scenario: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>scenario<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-23"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb2-24"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    Use these actual listings as reference:</span></span>
<span id="cb2-25"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>format_listings(sample_listings)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-26"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb2-27"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    The query should reflect the customer type and the scenario.</span></span>
<span id="cb2-28"></span>
<span id="cb2-29"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    Example query: Find homes in the 75019 zip code, 3 bedrooms, 2 bathrooms, price range $750k - $1M for an investor.</span></span>
<span id="cb2-30"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb2-31">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> generate_with_llm(prompt)</span></code></pre></div></div>
<p>This produced realistic queries like:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 22%">
<col style="width: 20%">
<col style="width: 37%">
</colgroup>
<thead>
<tr class="header">
<th>Feature</th>
<th>Scenario</th>
<th>Persona</th>
<th>Generated Query</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>property search</td>
<td>multiple matches</td>
<td>first_time_buyer</td>
<td>“Looking for 3-bedroom homes under $500k in the Riverside area. Would love something close to parks since we have young kids.”</td>
</tr>
<tr class="even">
<td>market analysis</td>
<td>no matches</td>
<td>investor</td>
<td>“Need comps for 123 Oak St.&nbsp;Specifically interested in rental yield comparison with similar properties in a 2-mile radius.”</td>
</tr>
</tbody>
</table>
<p>The key to useful synthetic data is grounding it in real system constraints. For the real-estate AI assistant, this means:</p>
<ol type="1">
<li>Using real listing IDs and addresses from their database</li>
<li>Incorporating actual agent schedules and availability windows</li>
<li>Respecting business rules like showing restrictions and notice periods</li>
<li>Including market-specific details like HOA requirements or local regulations</li>
</ol>
<p>We then feed these test cases through Lucy and log the interactions. This gives us a rich dataset to analyze, showing exactly how the AI handles different situations with real system constraints. This approach helped us fix issues before they affected real users.</p>
<p>Sometimes you don’t have access to a production database, especially for new products. In these cases, use LLMs to generate both test queries and the underlying test data. For a real estate AI assistant, this might mean creating synthetic property listings with realistic attributes – prices that match market ranges, valid addresses with real street names, and amenities appropriate for each property type. The key is grounding synthetic data in real-world constraints to make it useful for testing. The specifics of generating robust synthetic databases are beyond the scope of this post.</p>
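<p>One practical piece of this is the validation side: before synthetic listings enter a test database, check them against the real-world constraints you care about. The sketch below is illustrative only; the field names and market ranges are assumptions for demonstration, not from any actual system.</p>

```python
# Illustrative sketch: validate LLM-generated synthetic listings against
# real-world constraints before adding them to a test database.
# All field names and ranges here are assumptions for demonstration.

MARKET_PRICE_RANGE = (100_000, 5_000_000)   # plausible local market bounds
VALID_PROPERTY_TYPES = {"single_family", "condo", "townhouse", "multi_family"}

def validate_listing(listing: dict) -> list[str]:
    """Return a list of constraint violations; empty means the listing is usable."""
    problems = []
    lo, hi = MARKET_PRICE_RANGE
    if not (lo <= listing.get("price", -1) <= hi):
        problems.append(f"price {listing.get('price')} outside market range")
    if listing.get("property_type") not in VALID_PROPERTY_TYPES:
        problems.append(f"unknown property type {listing.get('property_type')!r}")
    if listing.get("bedrooms", 0) < 1 or listing.get("bathrooms", 0) < 1:
        problems.append("bedroom/bathroom counts must be at least 1")
    return problems

def filter_valid(listings: list[dict]) -> list[dict]:
    """Keep only listings that satisfy every constraint."""
    return [l for l in listings if not validate_listing(l)]
```

<p>Rejecting implausible records up front keeps downstream test failures attributable to the AI system rather than to bad test data.</p>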
</section>
<section id="guidelines-for-using-synthetic-data" class="level3">
<h3 class="anchored" data-anchor-id="guidelines-for-using-synthetic-data">Guidelines for Using Synthetic Data</h3>
<p>When generating synthetic data, follow these key principles to ensure it’s effective:</p>
<ol type="1">
<li><p><strong>Diversify your dataset</strong>: Create examples that cover a wide range of features, scenarios, and personas. As I wrote in my <a href="https://hamel.dev/blog/posts/llm-judge/">LLM-as-a-Judge post</a>, this diversity helps you identify edge cases and failure modes you might not anticipate otherwise.</p></li>
<li><p><strong>Generate user inputs, not outputs</strong>: Use LLMs to generate realistic user queries or inputs, not the expected AI responses. This prevents your synthetic data from inheriting the biases or limitations of the generating model.</p></li>
<li><p><strong>Incorporate real system constraints</strong>: Ground your synthetic data in actual system limitations and data. For example, when testing a scheduling feature, use real availability windows and booking rules.</p></li>
<li><p><strong>Verify scenario coverage</strong>: Ensure your generated data actually triggers the scenarios you want to test. A query intended to test “no matches found” should actually return zero results when run against your system.</p></li>
<li><p><strong>Start simple, then add complexity</strong>: Begin with straightforward test cases before adding nuance. This helps isolate issues and establish a baseline before tackling edge cases.</p></li>
</ol>
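<p>The coverage-verification principle (point 4 above) can be sketched concretely: run each generated query's filters against your test data and confirm the intended scenario actually fires. Everything here is a toy stand-in; <code>run_search</code> substitutes an in-memory table for your real retrieval system, and the scenario names mirror the earlier example.</p>

```python
# Illustrative sketch: confirm that a generated query actually triggers
# the scenario it was created to test. `run_search` stands in for the
# real system; here it is a stub over an in-memory listing table.

def run_search(query_filters: dict, listings: list[dict]) -> list[dict]:
    """Stub search: return listings matching every filter exactly."""
    return [
        l for l in listings
        if all(l.get(k) == v for k, v in query_filters.items())
    ]

def classify_scenario(results: list[dict]) -> str:
    """Map a result set onto the scenario taxonomy used in the test plan."""
    if not results:
        return "no_matches"
    if len(results) == 1:
        return "perfect_match"
    return "multiple_matches"

def verify_coverage(test_case: dict, listings: list[dict]) -> bool:
    """True if the query triggers the scenario it was meant to exercise."""
    results = run_search(test_case["filters"], listings)
    return classify_scenario(results) == test_case["intended_scenario"]
```

<p>Test cases that fail this check get regenerated or discarded, so the eval suite only contains queries that exercise what they claim to.</p>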
<p>This approach isn’t just theoretical – it’s been proven in production across dozens of companies. What often starts as a stopgap measure becomes a permanent part of the evaluation infrastructure, even after real user data becomes available.</p>
<p>Let’s look at how to maintain trust in your evaluation system as you scale…</p>
</section>
</section>
<section id="maintaining-trust-in-evals-is-critical" class="level2">
<h2 class="anchored" data-anchor-id="maintaining-trust-in-evals-is-critical">5. Maintaining Trust In Evals Is Critical</h2>
<p>This is a pattern I’ve seen repeatedly: teams build evaluation systems, then gradually lose faith in them. Sometimes it’s because the metrics don’t align with what they observe in production. Other times, it’s because the evaluations become too complex to interpret. Either way, the result is the same – the team reverts to making decisions based on gut feeling and anecdotal feedback, undermining the entire purpose of having evaluations.</p>
<p>Maintaining trust in your evaluation system is just as important as building it in the first place. Here’s how the most successful teams approach this challenge:</p>
<section id="understanding-criteria-drift" class="level3">
<h3 class="anchored" data-anchor-id="understanding-criteria-drift">Understanding Criteria Drift</h3>
<p>One of the most insidious problems in AI evaluation is “criteria drift” – a phenomenon where evaluation criteria evolve as you observe more model outputs. In their paper <a href="https://arxiv.org/abs/2404.12272">“Who Validates the Validators?”</a>, Shankar et al.&nbsp;describe this phenomenon:</p>
<blockquote class="blockquote">
<p>“To grade outputs, people need to externalize and define their evaluation criteria; however, the process of grading outputs helps them to define that very criteria.”</p>
</blockquote>
<p>This creates a paradox: you can’t fully define your evaluation criteria until you’ve seen a wide range of outputs, but you need criteria to evaluate those outputs in the first place. In other words, <strong>it is impossible to completely determine evaluation criteria prior to human judging of LLM outputs</strong>.</p>
<p>I’ve observed this firsthand when working with Phillip Carter at Honeycomb on their <a href="https://www.honeycomb.io/blog/introducing-query-assistant">Query Assistant</a> feature. As we evaluated the AI’s ability to generate database queries, Phillip noticed something interesting:</p>
<blockquote class="blockquote">
<p>“Seeing how the LLM breaks down its reasoning made me realize I wasn’t being consistent about how I judged certain edge cases.”</p>
</blockquote>
<p>The process of reviewing AI outputs helped him articulate his own evaluation standards more clearly. This isn’t a sign of poor planning – it’s an inherent characteristic of working with AI systems that produce diverse and sometimes unexpected outputs.</p>
<p>The teams that maintain trust in their evaluation systems embrace this reality rather than fighting it. They treat evaluation criteria as living documents that evolve alongside their understanding of the problem space. They also recognize that different stakeholders might have different (sometimes contradictory) criteria, and they work to reconcile these perspectives rather than imposing a single standard.</p>
</section>
<section id="creating-trustworthy-evaluation-systems" class="level3">
<h3 class="anchored" data-anchor-id="creating-trustworthy-evaluation-systems">Creating Trustworthy Evaluation Systems</h3>
<p>So how do you build evaluation systems that remain trustworthy despite criteria drift? Here are the approaches I’ve found most effective:</p>
<section id="favor-binary-decisions-over-arbitrary-scales" class="level4">
<h4 class="anchored" data-anchor-id="favor-binary-decisions-over-arbitrary-scales">1. Favor Binary Decisions Over Arbitrary Scales</h4>
<p>As I wrote in my <a href="https://hamel.dev/blog/posts/llm-judge/#why-are-simple-passfail-metrics-important">LLM-as-a-Judge post</a>, binary decisions provide clarity that more complex scales often obscure. When faced with a 1-5 scale, evaluators frequently struggle with the difference between a 3 and a 4, introducing inconsistency and subjectivity. What exactly distinguishes “somewhat helpful” from “helpful”? These boundary cases consume disproportionate mental energy and create noise in your evaluation data. And even when businesses use a 1-5 scale, they inevitably ask where to draw the line for “good enough” or to trigger intervention, forcing a binary decision anyway.</p>
<p>In contrast, a binary pass/fail forces evaluators to make a clear judgment: did this output achieve its purpose or not? This clarity extends to measuring progress – a 10% increase in passing outputs is immediately meaningful, while a 0.5-point improvement on a 5-point scale requires interpretation.</p>
<p>I’ve found that teams who resist binary evaluation often do so because they want to capture nuance. But nuance isn’t lost – it’s just moved to the qualitative critique that accompanies the judgment. The critique provides rich context about why something passed or failed, and what specific aspects could be improved, while the binary decision creates actionable clarity about whether improvement is needed at all.</p>
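<p>In data terms, this pattern is simple: a boolean decision plus a free-text critique per example. A minimal sketch (the type and field names are assumptions, not from any specific tool):</p>

```python
# Illustrative sketch: binary judgments paired with free-text critiques.
# Names are assumptions for demonstration, not from any specific tool.
from dataclasses import dataclass

@dataclass
class EvalResult:
    example_id: str
    passed: bool    # the clear, actionable binary decision
    critique: str   # the nuance lives here, not in a 1-5 score

def pass_rate(results: list[EvalResult]) -> float:
    """A 10% lift here is immediately meaningful, unlike +0.5 on a 5-point scale."""
    return sum(r.passed for r in results) / len(results)
```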
</section>
<section id="enhance-binary-judgments-with-detailed-critiques" class="level4">
<h4 class="anchored" data-anchor-id="enhance-binary-judgments-with-detailed-critiques">2. Enhance Binary Judgments With Detailed Critiques</h4>
<p>While binary decisions provide clarity, they work best when paired with detailed critiques that capture the nuance of why something passed or failed. This combination gives you the best of both worlds: clear, actionable metrics and rich contextual understanding.</p>
<p>For example, when evaluating a response that correctly answers a user’s question but contains unnecessary information, a good critique might read:</p>
<blockquote class="blockquote">
<p>“The AI successfully provided the market analysis requested (PASS), but included excessive detail about neighborhood demographics that wasn’t relevant to the investment question. This makes the response longer than necessary and potentially distracting.”</p>
</blockquote>
<p>These critiques serve multiple functions beyond just explanation. They force domain experts to externalize implicit knowledge – I’ve seen legal experts move from vague feelings that something “doesn’t sound right” to articulating specific issues with citation formats or reasoning patterns that can be systematically addressed.</p>
<p>When included as few-shot examples in judge prompts, these critiques improve the LLM’s ability to reason about complex edge cases. I’ve found this approach often yields 15-20% higher agreement rates between human and LLM evaluations compared to prompts without example critiques. The critiques also provide excellent raw material for generating high-quality synthetic data, creating a flywheel for improvement.</p>
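<p>Mechanically, using critiques as few-shot examples is just prompt assembly: interleave past outputs, their pass/fail labels, and the expert critiques before the new output to be judged. A hedged sketch; the exact prompt wording is an assumption:</p>

```python
# Illustrative sketch: embed past human judgments (with critiques) as
# few-shot examples in an LLM judge prompt. Prompt wording is an assumption.

def build_judge_prompt(few_shot: list[dict], new_output: str) -> str:
    """Assemble a judge prompt from labeled examples and the output to grade."""
    examples = "\n\n".join(
        f"Output: {ex['output']}\n"
        f"Judgment: {'PASS' if ex['passed'] else 'FAIL'}\n"
        f"Critique: {ex['critique']}"
        for ex in few_shot
    )
    return (
        "You are evaluating AI responses. Decide PASS or FAIL and explain why.\n\n"
        f"Here are prior expert judgments:\n\n{examples}\n\n"
        f"Now evaluate:\nOutput: {new_output}\nJudgment:"
    )
```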
</section>
<section id="measure-alignment-between-automated-evals-and-human-judgment" class="level4">
<h4 class="anchored" data-anchor-id="measure-alignment-between-automated-evals-and-human-judgment">3. Measure Alignment Between Automated Evals and Human Judgment</h4>
<p>If you’re using LLMs to evaluate outputs (which is often necessary at scale), it’s crucial to regularly check how well these automated evaluations align with human judgment.</p>
<p>This is particularly important given our natural tendency to over-trust AI systems. As Shankar et al.&nbsp;note in <a href="https://arxiv.org/abs/2404.12272">“Who Validates the Validators?”</a>, the lack of tools to validate evaluator quality is concerning:</p>
<blockquote class="blockquote">
<p>“Research shows people tend to over-rely and over-trust AI systems. For instance, in one high profile incident, researchers from MIT posted a pre-print on arXiv claiming that GPT-4 could ace the MIT EECS exam. Within hours, [the] work [was] debunked … citing problems arising from over-reliance on GPT-4 to grade itself.”</p>
</blockquote>
<p>This over-trust problem extends beyond self-evaluation. Research has shown that LLMs can be biased by simple factors like the ordering of options in a set, or even seemingly innocuous formatting changes in prompts. Without rigorous human validation, these biases can silently undermine your evaluation system.</p>
<p>When working with Honeycomb, we tracked agreement rates between our LLM-as-a-judge and Phillip’s evaluations:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/score.png" class="img-fluid figure-img"></p>
<figcaption>Agreement rates between LLM evaluator and human expert. More details <a href="https://hamel.dev/blog/posts/evals/#automated-evaluation-w-llms">here</a>.</figcaption>
</figure>
</div>
<p>It took three iterations to achieve &gt;90% agreement, but this investment paid off in a system the team could trust. Without this validation step, automated evaluations often drift from human expectations over time, especially as the distribution of inputs changes. You can <a href="https://hamel.dev/blog/posts/evals/#automated-evaluation-w-llms">read more about this here</a>.</p>
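<p>The metric behind that chart is straightforward: on a shared set of examples, the fraction where the LLM judge's label matches the human expert's. A minimal sketch:</p>

```python
# Illustrative sketch: track agreement between an LLM judge and a human
# expert on the same labeled examples, recomputed for each judge iteration.

def agreement_rate(human: list[bool], llm: list[bool]) -> float:
    """Fraction of examples where the LLM judge matched the human label."""
    assert len(human) == len(llm), "labels must cover the same examples"
    matches = sum(h == m for h, m in zip(human, llm))
    return matches / len(human)
```

<p>Recomputing this after every change to the judge prompt is what turns “we trust the judge” from a feeling into a number.</p>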
<p>Tools like <a href="https://eugeneyan.com/writing/aligneval/">Eugene Yan’s AlignEval</a> demonstrate this alignment process beautifully. It provides a simple interface where you upload data, label examples with a binary “good” or “bad,” and then evaluate LLM-based judges against those human judgments. What makes it effective is how it streamlines the workflow – you can quickly see where automated evaluations diverge from your preferences, refine your criteria based on these insights, and measure improvement over time. This approach reinforces that alignment isn’t a one-time setup but an ongoing conversation between human judgment and automated evaluation.</p>
</section>
</section>
<section id="scaling-without-losing-trust" class="level3">
<h3 class="anchored" data-anchor-id="scaling-without-losing-trust">Scaling Without Losing Trust</h3>
<p>As your AI system grows, you’ll inevitably face pressure to reduce the human effort involved in evaluation. This is where many teams go wrong – they automate too much, too quickly, and lose the human connection that keeps their evaluations grounded.</p>
<p>The most successful teams take a more measured approach:</p>
<ol type="1">
<li><p><strong>Start with high human involvement</strong>: In the early stages, have domain experts evaluate a significant percentage of outputs.</p></li>
<li><p><strong>Study alignment patterns</strong>: Rather than automating evaluation, focus on understanding where automated evaluations align with human judgment and where they diverge. This helps you identify which types of cases need more careful human attention.</p></li>
<li><p><strong>Use strategic sampling</strong>: Rather than evaluating every output, use statistical techniques to sample outputs that provide the most information, particularly focusing on areas where alignment is weakest.</p></li>
<li><p><strong>Maintain regular calibration</strong>: Even as you scale, continue to compare automated evaluations against human judgment regularly, using these comparisons to refine your understanding of when to trust automated evaluations.</p></li>
</ol>
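<p>Strategic sampling (point 3 above) can be made concrete by weighting a fixed human-review budget toward the categories where the automated judge historically disagrees with humans most. This is an illustrative allocation scheme, not a prescription; the category names and rates are made up for demonstration.</p>

```python
# Illustrative sketch: allocate a fixed human-review budget across
# categories in proportion to historical human/LLM disagreement, so
# reviewers see the cases where the automated judge is least trustworthy.

def review_budget(disagreement: dict[str, float], budget: int) -> dict[str, int]:
    """Split `budget` reviews across categories, weighted by disagreement rate."""
    total = sum(disagreement.values())
    alloc = {c: int(budget * d / total) for c, d in disagreement.items()}
    # Hand leftover reviews (from rounding down) to the worst category.
    leftover = budget - sum(alloc.values())
    worst = max(disagreement, key=disagreement.get)
    alloc[worst] += leftover
    return alloc
```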
<p>Scaling evaluation isn’t just about reducing human effort – it’s about directing that effort where it adds the most value. By focusing human attention on the most challenging or informative cases, you can maintain quality even as your system grows.</p>
<p>Now that we’ve covered how to maintain trust in your evaluations, let’s talk about a fundamental shift in how you should approach AI development roadmaps…</p>
</section>
</section>
<section id="your-ai-roadmap-should-count-experiments-not-features" class="level2">
<h2 class="anchored" data-anchor-id="your-ai-roadmap-should-count-experiments-not-features">6. Your AI Roadmap Should Count Experiments, Not Features</h2>
<p>If you’ve worked in software development, you’re familiar with traditional roadmaps: a list of features with target delivery dates. Teams commit to shipping specific functionality by specific deadlines, and success is measured by how closely they hit those targets.</p>
<p>This approach fails spectacularly with AI.</p>
<p>I’ve watched teams commit to roadmaps like “Launch sentiment analysis by Q2” or “Deploy agent-based customer support by end of year,” only to discover that the technology simply isn’t ready to meet their quality bar. They either ship something subpar to hit the deadline or miss the deadline entirely. Either way, trust erodes.</p>
<p>The fundamental problem is that traditional roadmaps assume we know what’s possible. With conventional software, that’s often true – given enough time and resources, you can build most features reliably. With AI, especially at the cutting edge, you’re constantly testing the boundaries of what’s feasible.</p>
<section id="experiments-vs.-features" class="level3">
<h3 class="anchored" data-anchor-id="experiments-vs.-features">Experiments vs.&nbsp;Features</h3>
<p><a href="https://www.linkedin.com/in/bryan-bischof/">Bryan Bischof</a>, Former Head of AI at Hex, introduced me to what he calls a “capability funnel” approach to AI roadmaps. This strategy reframes how we think about AI development progress.</p>
<p>Instead of defining success as shipping a feature, the capability funnel breaks down AI performance into progressive levels of utility. At the top of the funnel is the most basic functionality – can the system respond at all? At the bottom is fully solving the user’s job to be done. Between these points are various stages of increasing usefulness.</p>
<p>For example, in a query assistant, the capability funnel might look like:</p>
<ol type="1">
<li>Can generate syntactically valid queries (basic functionality)</li>
<li>Can generate queries that execute without errors</li>
<li>Can generate queries that return relevant results</li>
<li>Can generate queries that match user intent</li>
<li>Can generate optimal queries that solve the user’s problem (complete solution)</li>
</ol>
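<p>A funnel like this is easy to instrument: record the deepest stage each test sample reached, then report, for every stage, the fraction of samples that got at least that far. A minimal sketch using the query-assistant stages as illustrative names:</p>

```python
# Illustrative sketch: measure progress through a capability funnel.
# Each sample records the deepest stage it reached; stage names follow
# the query-assistant example and are ordered from basic to complete.

STAGES = ["valid_syntax", "executes", "relevant", "matches_intent", "optimal"]

def funnel_rates(reached: list[str]) -> dict[str, float]:
    """For each stage, the fraction of samples that got at least that far."""
    n = len(reached)
    depth = {s: i for i, s in enumerate(STAGES)}
    return {
        stage: sum(depth[r] >= i for r in reached) / n
        for i, stage in enumerate(STAGES)
    }
```

<p>Watching these per-stage rates over time shows concrete progress (and reveals which stage is the bottleneck) even before the final stage is solved.</p>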
<p>This approach acknowledges that AI progress isn’t binary – it’s about gradually improving capabilities across multiple dimensions. It also provides a framework for measuring progress even when you haven’t reached the final goal.</p>
<p>The most successful teams I’ve worked with structure their roadmaps around experiments rather than features. Instead of committing to specific outcomes, they commit to a cadence of experimentation, learning, and iteration.</p>
<p><a href="https://eugeneyan.com/">Eugene Yan</a>, an applied scientist at Amazon, shared how he approaches ML project planning with leadership - a process that, while originally developed for traditional machine learning, applies equally well to modern LLM development:</p>
<blockquote class="blockquote">
<p>“Here’s a common timeline. First, I take two weeks to do a data feasibility analysis, i.e., ‘do I have the right data?’ […] Then I take an additional month to do a technical feasibility analysis, i.e., ‘can AI solve this?’ After that, if it still works I’ll spend six weeks building a prototype we can A/B test.”</p>
</blockquote>
<p>While LLMs might not require the same kind of feature engineering or model training as traditional ML, the underlying principle remains the same: time-box your exploration, establish clear decision points, and focus on proving feasibility before committing to full implementation. This approach gives leadership confidence that resources won’t be wasted on open-ended exploration, while giving the team the freedom to learn and adapt as they go.</p>
</section>
<section id="the-foundation-evaluation-infrastructure" class="level3">
<h3 class="anchored" data-anchor-id="the-foundation-evaluation-infrastructure">The Foundation: Evaluation Infrastructure</h3>
<p>The key to making an experiment-based roadmap work is having robust evaluation infrastructure. Without it, you’re just guessing whether your experiments are working. With it, you can rapidly iterate, test hypotheses, and build on successes.</p>
<p>I saw this firsthand during the early development of GitHub Copilot. What most people don’t realize is that the team invested heavily in building sophisticated offline evaluation infrastructure. They created systems that could test code completions against a very large corpus of repositories on GitHub, leveraging unit tests that already existed in high-quality codebases as an automated way to verify completion correctness. This was a massive engineering undertaking – they had to build systems that could clone repositories at scale, set up their environments, run their test suites, and analyze the results, all while handling the incredible diversity of programming languages, frameworks, and testing approaches.</p>
<p>This wasn’t wasted time—it was the foundation that accelerated everything. With solid evaluation in place, the team ran thousands of experiments, quickly identified what worked, and could say with confidence “this change improved quality by X%” instead of relying on gut feelings. While the upfront investment in evaluation feels slow, it prevents endless debates about whether changes help or hurt, and dramatically speeds up innovation later.</p>
</section>
<section id="communicating-this-to-stakeholders" class="level3">
<h3 class="anchored" data-anchor-id="communicating-this-to-stakeholders">Communicating This to Stakeholders</h3>
<p>The challenge, of course, is that executives often want certainty. They want to know when features will ship and what they’ll do. How do you bridge this gap?</p>
<p>The key is to shift the conversation from outputs to outcomes. Instead of promising specific features by specific dates, commit to a process that will maximize the chances of achieving the desired business outcomes.</p>
<p>Eugene shared how he handles these conversations:</p>
<blockquote class="blockquote">
<p>“I try to reassure leadership with timeboxes. At the end of three months, if it works out, then we move it to production. At any step of the way, if it doesn’t work out, we pivot.”</p>
</blockquote>
<p>This approach gives stakeholders clear decision points while acknowledging the inherent uncertainty in AI development. It also helps manage expectations about timelines – instead of promising a feature in six months, you’re promising a clear understanding of whether that feature is feasible in three months.</p>
<p>Bryan’s capability funnel approach provides another powerful communication tool. It allows teams to show concrete progress through the funnel stages, even when the final solution isn’t ready. It also helps executives understand where problems are occurring and make informed decisions about where to invest resources.</p>
</section>
<section id="build-a-culture-of-experimentation-through-failure-sharing" class="level3">
<h3 class="anchored" data-anchor-id="build-a-culture-of-experimentation-through-failure-sharing">Build a Culture of Experimentation Through Failure Sharing</h3>
<p>Perhaps the most counterintuitive aspect of this approach is the emphasis on learning from failures. In traditional software development, failures are often hidden or downplayed. In AI development, they’re the primary source of learning.</p>
<p>Eugene operationalizes this at his organization through what he calls a “fifteen-five” – a weekly update that takes fifteen minutes to write and five minutes to read:</p>
<blockquote class="blockquote">
<p>“In my fifteen-fives, I document my failures and my successes. Within our team, we also have weekly ‘no-prep sharing sessions’ where we discuss what we’ve been working on and what we’ve learned. When I do this, I go out of my way to share failures.”</p>
</blockquote>
<p>This practice normalizes failure as part of the learning process. It shows that even experienced practitioners encounter dead ends, and it accelerates team learning by sharing those experiences openly. And by celebrating the process of experimentation rather than just the outcomes, teams create an environment where people feel safe taking risks and learning from failures.</p>
</section>
<section id="a-better-way-forward" class="level3">
<h3 class="anchored" data-anchor-id="a-better-way-forward">A Better Way Forward</h3>
<p>So what does an experiment-based roadmap look like in practice? Here’s a simplified example from a content moderation project Eugene worked on:</p>
<blockquote class="blockquote">
<p>“I was asked to do content moderation. I said, ‘It’s uncertain whether we’ll meet that goal. It’s uncertain even if that goal is feasible with our data, or what machine learning techniques would work. But here’s my experimentation roadmap. Here are the techniques I’m gonna try, and I’m gonna update you at a two-week cadence.’”</p>
</blockquote>
<p>The roadmap didn’t promise specific features or capabilities. Instead, it committed to a systematic exploration of possible approaches, with regular check-ins to assess progress and pivot if necessary.</p>
<p>The results were telling:</p>
<blockquote class="blockquote">
<p>“For the first two to three months, nothing worked. […] And then [a breakthrough] came out. […] Within a month, that problem was solved. So you can see that in the first quarter or even four months, it was going nowhere. […] But then you can also see that all of a sudden, some new technology comes along, some new paradigm, some new reframing comes along that just [solves] 80% of [the problem].”</p>
</blockquote>
<p>This pattern – long periods of apparent failure followed by breakthroughs – is common in AI development. Traditional feature-based roadmaps would have killed the project after months of “failure,” missing the eventual breakthrough.</p>
<p>By focusing on experiments rather than features, teams create space for these breakthroughs to emerge. They also build the infrastructure and processes that make breakthroughs more likely – data pipelines, evaluation frameworks, and rapid iteration cycles.</p>
<p>The most successful teams I’ve worked with start by building evaluation infrastructure before committing to specific features. They create tools that make iteration faster and focus on processes that support rapid experimentation. This approach might seem slower at first, but it dramatically accelerates development in the long run by enabling teams to learn and adapt quickly.</p>
<p>The key metric for AI roadmaps isn’t features shipped – it’s experiments run. The teams that win are those that can run more experiments, learn faster, and iterate more quickly than their competitors. And the foundation for this rapid experimentation is always the same: robust, trusted evaluation infrastructure that gives everyone confidence in the results.</p>
<p>By reframing your roadmap around experiments rather than features, you create the conditions for similar breakthroughs in your own organization.</p>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Throughout this post, I’ve shared patterns I’ve observed across dozens of AI implementations. The most successful teams aren’t the ones with the most sophisticated tools or the most advanced models – they’re the ones that master the fundamentals of measurement, iteration, and learning.</p>
<p>The core principles are surprisingly simple:</p>
<ol type="1">
<li><p><strong>Look at your data.</strong> Nothing replaces the insight gained from examining real examples. Error analysis consistently reveals the highest-ROI improvements.</p></li>
<li><p><strong>Build simple tools that remove friction.</strong> Custom data viewers that make it easy to examine AI outputs yield more insights than complex dashboards with generic metrics.</p></li>
<li><p><strong>Empower domain experts.</strong> The people who understand your domain best are often the ones who can most effectively improve your AI, regardless of their technical background.</p></li>
<li><p><strong>Use synthetic data strategically.</strong> You don’t need real users to start testing and improving your AI. Thoughtfully generated synthetic data can bootstrap your evaluation process.</p></li>
<li><p><strong>Maintain trust in your evaluations.</strong> Binary judgments with detailed critiques create clarity while preserving nuance. Regular alignment checks ensure automated evaluations remain trustworthy.</p></li>
<li><p><strong>Structure roadmaps around experiments, not features.</strong> Commit to a cadence of experimentation and learning rather than specific outcomes by specific dates.</p></li>
</ol>
<p>These principles apply regardless of your domain, team size, or technical stack. They’ve worked for companies ranging from early-stage startups to tech giants, across use cases from customer support to code generation.</p>
<section id="resources-for-going-deeper" class="level3">
<h3 class="anchored" data-anchor-id="resources-for-going-deeper">Resources for Going Deeper</h3>
<p>If you’d like to explore these topics further, here are some resources that might help:</p>
<ul>
<li><p><a href="https://ai.hamel.dev/">My blog</a> for more content on AI evaluation and improvement. My other posts dive into more technical detail on topics such as constructing effective LLM judges, implementing evaluation systems, and other aspects of AI development<sup>1</sup>. Also check out the blogs of <a href="https://www.sh-reya.com/">Shreya Shankar</a> and <a href="https://eugeneyan.com/">Eugene Yan</a>, who are great sources of information on these topics.</p></li>
<li><p>A course I’m teaching: <strong><a href="https://bit.ly/evals-ai">Rapidly Improve AI Products With Evals</a></strong>, with Shreya Shankar. The course provides hands-on experience with techniques such as error analysis, synthetic data generation, and building trustworthy evaluation systems. It includes practical exercises and personalized instruction through office hours.</p></li>
<li><p>If you’re looking for hands-on guidance specific to your organization’s needs, you can learn more about working with me at <a href="https://parlance-labs.com/">Parlance Labs</a>.</p></li>
</ul>


</section>
</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>I write more broadly about machine learning, AI, and software development. Some posts that expand on these topics include <a href="https://hamel.dev/blog/posts/evals/">Your AI Product Needs Evals</a>, <a href="https://hamel.dev/blog/posts/llm-judge/">Creating a LLM-as-a-Judge That Drives Business Results</a>, and <a href="https://applied-llms.org/">What We’ve Learned From A Year of Building with LLMs</a>. You can see all my posts at <a href="https://hamel.dev/">hamel.dev</a>.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>LLMs</category>
  <category>AI</category>
  <guid>https://hamel.dev/blog/posts/field-guide/</guid>
  <pubDate>Mon, 24 Mar 2025 07:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/field-guide/images/field_guide_2.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>Thoughts On A Month With Devin</title>
  <link>https://hamel.dev/</link>
  <description>Our impressions of Devin after giving it 20+ tasks.
</description>
  <category>LLMs</category>
  <guid>https://hamel.dev/</guid>
  <pubDate>Sun, 19 Jan 2025 08:00:00 GMT</pubDate>
</item>
<item>
  <title>nbsanity - Share Notebooks as Polished Web Pages in Seconds</title>
  <link>https://hamel.dev/</link>
  <description>Transform your GitHub Jupyter notebooks into beautiful, readable web pages with a single URL change. No setup required.
</description>
  <category>Jupyter</category>
  <guid>https://hamel.dev/</guid>
  <pubDate>Fri, 13 Dec 2024 08:00:00 GMT</pubDate>
</item>
<item>
  <title>Building an Audience Through Technical Writing: Strategies and Mistakes</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/audience/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<p>People often find me through my writing on AI and tech. This creates an interesting pattern. Nearly every week, vendors reach out asking me to write about their products. While I appreciate their interest and love learning about new tools, I reserve my writing for topics that I have personal experience with.</p>
<p>One conversation last week really stuck with me. A founder confided, “We can write the best content in the world, but we don’t have any distribution.” This hit home because I used to think the same way.</p>
<p>Let me share what works for reaching developers. Companies and individuals alike often skip the basics when trying to grow their audience. These are proven approaches I’ve seen succeed, both in my work and in others’ efforts to grow their audience in the AI space.</p>
<section id="build-on-great-work" class="level2">
<h2 class="anchored" data-anchor-id="build-on-great-work">1. Build on Great Work</h2>
<p>Here’s something surprising: few people take the time to thoughtfully engage with others’ work in our field. But when you do, amazing things happen naturally.</p>
<p>For example, here are some recent posts I’ve enjoyed that present opportunities to engage with others:</p>
<ul>
<li>Shreya Shankar’s <a href="https://data-people-group.github.io/blogs/2024/09/24/docetl/">DocETL</a></li>
<li>Eugene Yan’s work on <a href="https://eugeneyan.com/writing/aligneval/">AlignEval</a></li>
<li>Ben Clavié’s work on <a href="https://www.answer.ai/posts/2024-09-16-rerankers.html">rerankers</a></li>
<li>Jeremy Howard’s work on <a href="https://www.answer.ai/posts/2024-09-03-llmstxt.html">llms.txt</a></li>
</ul>
<p>In the above examples, you could share how their ideas connect with what you’ve built. You could add additional case studies and real-world insights. If you deeply engage with someone’s work and add your insights, they often share your content with their audience. Not because you asked, but because you’ve added something meaningful to their work. Swyx has written a <a href="https://www.swyx.io/puwtpd">great post</a> on how to do this effectively.</p>
<p>The key is authenticity. Don’t do this just for marketing—do it because you’re genuinely interested in learning from others and building on their ideas. It’s not hard to find things to be excited about. I’m amazed by how few people take this approach. It’s both effective and fun.</p>
</section>
<section id="show-up-consistently" class="level2">
<h2 class="anchored" data-anchor-id="show-up-consistently">2. Show Up Consistently</h2>
<p>I see too many folks blogging or posting once every few months and wondering why they’re not getting traction. Want to know what actually works? Look at <a href="https://x.com/jxnlco">Jason Liu</a>. He grew from 500 to 30,000 followers by posting ~30 times a day for a year.</p>
<p>You don’t have to post that often (I certainly don’t!), but consistency matters more than perfection. And don’t just post into the void. Engage with others. When someone comments on your post, reply thoughtfully. When you see conversations where you can add value, provide helpful information.</p>
<p>Finally, don’t be discouraged if you don’t see results immediately. Here’s some advice from my friend (and prolific writer), <a href="https://eugeneyan.com/">Eugene Yan</a>:</p>
<blockquote class="blockquote">
<p>In the beginning, when most people start writing, the output’s gonna suck. Harsh, but true—my first 100 posts or so were crap. But with practice, people can get better. But they have to be deliberate in wanting to practice and get better with each piece, and not just write for the sake of publishing something and tweeting about it. The Sam Parr course (see below) is a great example of deliberate practice on copywriting.</p>
</blockquote>
</section>
<section id="get-better-at-copywriting" class="level2">
<h2 class="anchored" data-anchor-id="get-better-at-copywriting">3. Get Better at Copywriting</h2>
<p>This changed everything for me. I took <a href="https://copythat.com/">Sam Parr’s copywriting course</a> just 30 minutes a day for a week. Now I keep my favorite writing samples in a Claude project and reference them when I’m writing something important. Small improvements in how you communicate can make a huge difference in how your content lands.</p>
<p>One thing Sam teaches is that big words don’t make you sound smart. Clear writing that avoids jargon is more effective. That’s why Sam teaches aiming for a 6th-grade reading level. This matters even more with AI, as AI loves to generate flowery language and long sentences. The <a href="https://hemingwayapp.com/">Hemingway App</a> can help you simplify your writing.<sup>1</sup></p>
</section>
<section id="build-a-voice-to-content-pipeline" class="level2">
<h2 class="anchored" data-anchor-id="build-a-voice-to-content-pipeline">4. Build a Voice-to-Content Pipeline</h2>
<p>The struggle most people have with creating content is that it takes too much time. But it doesn’t have to if you build the right systems, especially with AI.</p>
<p>Getting this system right takes some upfront work, but the payoff is enormous. Start by installing a good voice-to-text app on your phone. I use either <a href="https://superwhisper.com/">Superwhisper</a> or <a href="https://voicepal.me/">VoicePal</a>. VoicePal is great for prompting you to elaborate with follow-up questions. These tools let me capture ideas at their best. That’s usually when I’m walking outside or away from my computer. At my computer, I use <a href="https://www.flowvoice.ai/">Flow</a>.</p>
<p>The key is to carefully craft your first few pieces of content. These become examples for your prompts that teach AI your style and tone. Once you have high-quality examples, you can organize these (transcript, content) pairs and feed them to language models. The in-context learning creates remarkably aligned output that matches your writing style while maintaining the authenticity of your original thoughts.</p>
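<p>As a minimal sketch of that in-context learning step (the prompt wording and example pair below are illustrative, not from any specific tool), you can assemble past (transcript, content) pairs into a few-shot prompt and send it to whichever LLM you use:</p>

```python
# Hypothetical sketch: build a few-shot prompt from (transcript, content)
# pairs so an LLM drafts new posts in your voice. Wire the resulting
# prompt to your LLM provider of choice.

def build_prompt(examples, new_transcript):
    """Assemble a few-shot prompt from past (transcript, final post) pairs."""
    parts = ["Rewrite the final transcript into a polished post, "
             "matching the style of the examples.\n"]
    for i, (transcript, content) in enumerate(examples, 1):
        parts.append(f"### Example {i}\nTranscript:\n{transcript}\n\nPost:\n{content}\n")
    # The new transcript goes last; the model completes the "Post:" section.
    parts.append(f"### New\nTranscript:\n{new_transcript}\n\nPost:")
    return "\n".join(parts)

examples = [
    ("uh so the key thing about evals is you gotta look at data",
     "The key to evals is simple: look at your data."),
]
prompt = build_prompt(examples, "i think consistency beats perfection when posting")
```

<p>The more high-quality pairs you accumulate, the more aligned the output gets, since each pair teaches the model a little more about your voice.</p>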
<p>For example, I use this pipeline at Answer AI. We have started interviewing each other and using the recordings as grounding for blog posts. Our recent <a href="https://www.answer.ai/posts/2024-11-07-solveit.html">post about SolveIt</a> shows this in action. The raw conversation is the foundation. Our workflow turns it into polished content.</p>
<p>I’ve also integrated this workflow into my meetings. Using <a href="https://circleback.ai/?via=hamel">CircleBack</a>, my favorite AI note-taking app, I can automatically capture and process meeting discussions. You can set up workflows to send your meeting notes and transcripts to AI for processing. This turns conversations into content opportunities.</p>
<p>The real power comes from having all these pieces working together. Voice capture, AI, and automation make content creation fun and manageable.</p>
</section>
<section id="leverage-your-unique-perspective" class="level2">
<h2 class="anchored" data-anchor-id="leverage-your-unique-perspective">5. Leverage Your Unique Perspective</h2>
<p>Through my consulting work, I notice patterns that others miss. My most popular posts address common problems my clients had. When everyone’s confused about a topic, especially in AI where there’s lots of hype, clear explanations are gold. This is the motivation for some of my blog posts like:</p>
<ul>
<li><a href="https://hamel.dev/blog/posts/prompt/">Fuck You, Show Me The Prompt</a></li>
<li><a href="https://hamel.dev/blog/posts/evals/">Your AI Product Needs Evals</a></li>
<li><a href="https://hamel.dev/blog/posts/llm-judge/">Creating a LLM-as-a-Judge That Drives Business Results</a></li>
</ul>
<p>You probably see patterns too. Maybe it’s common questions from customers, or problems you’ve solved repeatedly. Maybe you work with a unique set of technologies or interesting use cases. Share these insights! Your unique perspective is more valuable than you think.</p>
</section>
<section id="use-high-quality-social-cards-threads-and-scheduling" class="level2">
<h2 class="anchored" data-anchor-id="use-high-quality-social-cards-threads-and-scheduling">6. Use High Quality Social Cards, Threads, and Scheduling</h2>
<p>This is probably the least important part of the process, but it’s still important. Thumbnails and social cards are vital for visibility on social media. Here are the tools I use:</p>
<ul>
<li><a href="https://socialsharepreview.com/">socialsharepreview.com</a> to check how your content looks on different platforms. For X, I sometimes use the <a href="https://cards-dev.twitter.com/validator">Twitter Card Validator</a>.</li>
<li><a href="https://chatgpt.com/">ChatGPT</a> to create cover images for my posts. Then, I paste them into Canva to size and edit them. Some of my friends use <a href="https://ideogram.ai/">ideogram</a>, which generates images with text accurately.</li>
<li><a href="https://www.canva.com/">Canva</a> for the last mile of creating social cards. They have easy-to-use buttons to ensure you get the dimensions right. They also have inpainting, background removal, and more.</li>
<li>If using X, social cards can be a bit fiddly. As of this writing, they do not show your post title, just the image if using the large-image size. To mitigate this, I use Canva to write the post’s title in the image <a href="https://hamel.dev/blog/posts/audience/content_2.png">like this</a>.</li>
<li>Social media can be distracting, so I like to schedule my posts in advance. I use <a href="https://typefully.com/">typefully</a> for this purpose. Some of my friends use <a href="https://hypefury.com/">hypefury</a>.</li>
</ul>
<p>Finally, when posting on X, threads can be a great way to raise the visibility of your content. A simple approach is to take screenshots or copy-paste snippets of your content. Then, walk through them in a thread, as you would want a reader to. Jeremy Howard does a great job at this: <a href="https://x.com/jeremyphoward/status/1818036923304456492">example 1</a>, <a href="https://x.com/jeremyphoward/status/1831089138571133290">example 2</a>.</p>
</section>
<section id="the-content-flywheel-putting-it-all-together" class="level2">
<h2 class="anchored" data-anchor-id="the-content-flywheel-putting-it-all-together">The Content Flywheel: Putting It All Together</h2>
<p>Once you have these systems in place, something magical happens: content creates more content. Your blog posts spawn social media updates. Your conversations turn into newsletters. Your client solutions become case studies. Each piece of work feeds the next, creating a natural flywheel.</p>
<p>Don’t try to sell too hard. Instead, share real insights and helpful information. Focus on adding value and educating your audience. When you do this well, people will want to follow your work.</p>
<p>This journey is different for everyone. These are just the patterns I’ve seen work in my consulting practice and my own growth. Try what feels right. Adjust what doesn’t.</p>
<p>P.S. If you’d like to follow my writing journey, you can <a href="https://ai.hamel.dev/">stay connected here</a>.</p>
</section>
<section id="further-reading" class="level2">
<h2 class="anchored" data-anchor-id="further-reading">Further Reading</h2>
<ul>
<li><a href="https://simonwillison.net/tags/writing/">Simon Willison’s Posts on Writing</a></li>
<li><a href="https://eugeneyan.com/tag/writing/">Eugene’s Posts on Writing</a></li>
<li><a href="https://medium.com/@racheltho/why-you-yes-you-should-blog-7d2544ac1045">Why you, (yes, you) should blog</a></li>
</ul>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Don’t abuse these tools or use them blindly. There are <a href="https://x.com/swyx/status/1863352038597558712">plenty of situations where you should not be writing at a 6th-grade reading level</a>. This includes humor, poetry, shitposting, and more. Even formal writing shouldn’t adhere strictly to this rule. It’s advice that you should judge on a case-by-case basis. When you simplify your writing, do you like it more?↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>Writing</category>
  <guid>https://hamel.dev/blog/posts/audience/</guid>
  <pubDate>Sat, 30 Nov 2024 08:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/audience/content_2.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>Using LLM-as-a-Judge For Evaluation: A Complete Guide</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/llm-judge/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<p>Earlier this year, I wrote <a href="https://hamel.dev/blog/posts/evals/">Your AI product needs evals</a>. Many of you asked, “How do I get started with LLM-as-a-judge?” This guide shares what I’ve learned after helping over <a href="https://parlance-labs.com/">30 companies</a> set up their evaluation systems.</p>
<section id="the-problem-ai-teams-are-drowning-in-data" class="level2">
<h2 class="anchored" data-anchor-id="the-problem-ai-teams-are-drowning-in-data">The Problem: AI Teams Are Drowning in Data</h2>
<p>Ever spend weeks building an AI system, only to realize you have no idea if it’s actually working? You’re not alone. I’ve noticed teams repeat the same mistakes when using LLMs to evaluate AI outputs:</p>
<ol type="1">
<li><strong>Too Many Metrics</strong>: Creating numerous measurements that become unmanageable.</li>
<li><strong>Arbitrary Scoring Systems</strong>: Using uncalibrated scales (like 1-5) across multiple dimensions, where the difference between scores is unclear and subjective. What makes something a 3 versus a 4? Nobody knows, and different evaluators often interpret these scales differently.</li>
<li><strong>Ignoring Domain Experts</strong>: Not involving the people who understand the subject matter deeply.</li>
<li><strong>Unvalidated Metrics</strong>: Using measurements that don’t truly reflect what matters to the users or the business.</li>
</ol>
<p>The result? Teams end up buried under mountains of metrics or data they don’t trust and can’t use. Progress grinds to a halt. Everyone gets frustrated.</p>
<p>For example, it’s not uncommon for me to see dashboards that look like this:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/llm-judge/blog_header.png" class="img-fluid figure-img"></p>
<figcaption>An illustrative example of a bad eval dashboard</figcaption>
</figure>
</div>
<p>Tracking a bunch of scores on a 1-5 scale is often a sign of a bad eval process (I’ll discuss why later). In this post, I’ll show you how to avoid these pitfalls. The solution is to use a technique that I call <strong>“Critique Shadowing”</strong>. Here’s how to do it, step by step.</p>
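<p>To make the contrast concrete before walking through the steps, here is a minimal sketch of the output shape a binary judge produces: a single pass/fail verdict plus a written critique. The prompt wording and JSON schema here are my own illustrative assumptions, not a prescribed format; you would send the prompt to your LLM provider and parse the reply.</p>

```python
# Illustrative sketch of a binary judge: one pass/fail verdict plus a
# detailed critique, instead of uncalibrated 1-5 scores. The prompt text
# and JSON shape are assumptions; adapt them to your own domain.
import json

JUDGE_PROMPT = """You are the principal domain expert. Judge the AI response.
Reply with JSON: {{"critique": "<detailed reasoning>", "pass": true/false}}

User query: {query}
AI response: {response}"""

def parse_verdict(raw: str) -> tuple[bool, str]:
    """Parse the judge's JSON reply into (passed, critique)."""
    data = json.loads(raw)
    return bool(data["pass"]), data["critique"]

# Canned judge reply for illustration; in practice, raw would come from
# calling your LLM with JUDGE_PROMPT.format(query=..., response=...).
raw = '{"critique": "The answer ignores the user\'s second question.", "pass": false}'
passed, critique = parse_verdict(raw)
```

<p>Note that the critique, not the boolean, is what drives iteration: it tells you <em>why</em> something failed, which a 3-versus-4 score never does.</p>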
</section>
<section id="step-1-find-the-principal-domain-expert" class="level2">
<h2 class="anchored" data-anchor-id="step-1-find-the-principal-domain-expert">Step 1: Find <em>The</em> Principal Domain Expert</h2>
<p>In most organizations, there is usually one key individual (maybe two) whose judgment is crucial for the success of your AI product. These are the people with deep domain expertise, or who best represent your target users. Identifying and involving this <strong>Principal Domain Expert</strong> early in the process is critical.</p>
<p><strong>Why is finding the right domain expert so important?</strong></p>
<ul>
<li><p><strong>They Set the Standard</strong>: This person not only defines what is acceptable technically, but also helps you understand if you’re building something users actually want.</p></li>
<li><p><strong>Capture Unspoken Expectations</strong>: By involving them, you uncover their preferences and expectations, which they might not be able to fully articulate upfront. Through the evaluation process, you help them clarify what a “passable” AI interaction looks like.</p></li>
<li><p><strong>Consistency in Judgment</strong>: People in your organization may have different opinions about the AI’s performance. Focusing on the principal expert ensures that evaluations are consistent and aligned with the most critical standards.</p></li>
<li><p><strong>Sense of Ownership</strong>: Involving the expert gives them a stake in the AI’s development. They feel invested because they’ve had a hand in shaping it. In the end, they are more likely to approve of the AI.</p></li>
</ul>
<p><strong>Examples of Principal Domain Experts:</strong></p>
<ul>
<li>A <strong>psychologist</strong> for a mental health AI assistant.</li>
<li>A <strong>lawyer</strong> for an AI that analyzes legal documents.</li>
<li>A <strong>customer service director</strong> for a support chatbot.</li>
<li>A <strong>lead teacher or curriculum developer</strong> for an educational AI tool.</li>
</ul>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Exceptions
</div>
</div>
<div class="callout-body-container callout-body">
<p>In a smaller company, this might be the CEO or founder. If you are an independent developer, you should be the domain expert (but be honest with yourself about your expertise).</p>
<p>If you must rely on leadership, you should regularly validate their assumptions against real user feedback.</p>
</div>
</div>
<p>Many developers attempt to act as the domain expert themselves, or find a convenient proxy (ex: their superior). This is a recipe for disaster. People will have varying opinions about what is acceptable, and you can’t make everyone happy. What’s important is that your principal domain expert is satisfied.</p>
<p><strong>Remember:</strong> This doesn’t have to take a lot of the domain expert’s time. Later in this post, I’ll discuss how you can make the process efficient. Their involvement is absolutely critical to the AI’s success.</p>
<section id="next-steps" class="level3">
<h3 class="anchored" data-anchor-id="next-steps">Next Steps</h3>
<p>Once you’ve found your expert, we need to give them the right data to review. Let’s talk about how to do that next.</p>
</section>
</section>
<section id="step-2-create-a-dataset" class="level2">
<h2 class="anchored" data-anchor-id="step-2-create-a-dataset">Step 2: Create a Dataset</h2>
<p>With your principal domain expert on board, the next step is to build a dataset that captures problems that your AI will encounter. It’s important that the dataset is diverse and represents the types of interactions that your AI will have in production.</p>
<section id="why-a-diverse-dataset-matters" class="level3">
<h3 class="anchored" data-anchor-id="why-a-diverse-dataset-matters">Why a Diverse Dataset Matters</h3>
<ul>
<li><strong>Comprehensive Testing</strong>: Ensures your AI is evaluated across a wide range of situations.</li>
<li><strong>Realistic Interactions</strong>: Reflects actual user behavior for more relevant evaluations.</li>
<li><strong>Identifies Weaknesses</strong>: Helps uncover areas where the AI may struggle or produce errors.</li>
</ul>
</section>
<section id="dimensions-for-structuring-your-dataset" class="level3">
<h3 class="anchored" data-anchor-id="dimensions-for-structuring-your-dataset">Dimensions for Structuring Your Dataset</h3>
<p>You want to define dimensions that make sense for your use case. For example, here are ones that I often use for B2C applications:</p>
<ol type="1">
<li><strong>Features</strong>: Specific functionalities of your AI product.</li>
<li><strong>Scenarios</strong>: Situations or problems the AI may encounter and needs to handle.</li>
<li><strong>Personas</strong>: Representative user profiles with distinct characteristics and needs.</li>
</ol>
</section>
<section id="examples-of-features-scenarios-and-personas" class="level3">
<h3 class="anchored" data-anchor-id="examples-of-features-scenarios-and-personas">Examples of Features, Scenarios, and Personas</h3>
<section id="features" class="level4">
<h4 class="anchored" data-anchor-id="features">Features</h4>
<table class="caption-top table">
<colgroup>
<col style="width: 30%">
<col style="width: 69%">
</colgroup>
<thead>
<tr class="header">
<th><strong>Feature</strong></th>
<th><strong>Description</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Email Summarization</strong></td>
<td>Condensing lengthy emails into key points.</td>
</tr>
<tr class="even">
<td><strong>Meeting Scheduler</strong></td>
<td>Automating the scheduling of meetings across time zones.</td>
</tr>
<tr class="odd">
<td><strong>Order Tracking</strong></td>
<td>Providing shipment status and delivery updates.</td>
</tr>
<tr class="even">
<td><strong>Contact Search</strong></td>
<td>Finding and retrieving contact information from a database.</td>
</tr>
<tr class="odd">
<td><strong>Language Translation</strong></td>
<td>Translating text between languages.</td>
</tr>
<tr class="even">
<td><strong>Content Recommendation</strong></td>
<td>Suggesting articles or products based on user interests.</td>
</tr>
</tbody>
</table>
</section>
<section id="scenarios" class="level4">
<h4 class="anchored" data-anchor-id="scenarios">Scenarios</h4>
<p>Scenarios are situations the AI needs to handle (not categories based on the outcome of the AI’s response).</p>
<table class="caption-top table">
<colgroup>
<col style="width: 31%">
<col style="width: 68%">
</colgroup>
<thead>
<tr class="header">
<th><strong>Scenario</strong></th>
<th><strong>Description</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Multiple Matches Found</strong></td>
<td>User’s request yields multiple results that need narrowing down. For example: User asks “Where’s my order?” but has three active orders (#123, #124, #125). AI must help identify which specific order they’re asking about.</td>
</tr>
<tr class="even">
<td><strong>No Matches Found</strong></td>
<td>User’s request yields no results, requiring alternatives or corrections. For example: User searches for order #ABC-123 which doesn’t exist. AI should explain valid order formats and suggest checking their confirmation email.</td>
</tr>
<tr class="odd">
<td><strong>Ambiguous Request</strong></td>
<td>User input lacks necessary specificity. For example: User says “I need to change my delivery” without specifying which order or what aspect of delivery (date, address, etc.) they want to change.</td>
</tr>
<tr class="even">
<td><strong>Invalid Data Provided</strong></td>
<td>User provides incorrect data type or format. For example: User tries to track a return using a regular order number instead of a return authorization (RMA) number.</td>
</tr>
<tr class="odd">
<td><strong>System Errors</strong></td>
<td>Technical issues prevent normal operation. For example: While looking up an order, the inventory database is temporarily unavailable. AI needs to explain the situation and provide alternatives.</td>
</tr>
<tr class="even">
<td><strong>Incomplete Information</strong></td>
<td>User omits required details. For example: User wants to initiate a return but hasn’t provided the order number or reason. AI needs to collect this information step by step.</td>
</tr>
<tr class="odd">
<td><strong>Unsupported Feature</strong></td>
<td>User requests functionality that doesn’t exist. For example: User asks to change payment method after order has shipped. AI must explain why this isn’t possible and suggest alternatives.</td>
</tr>
</tbody>
</table>
</section>
<section id="personas" class="level4">
<h4 class="anchored" data-anchor-id="personas">Personas</h4>
<table class="caption-top table">
<colgroup>
<col style="width: 27%">
<col style="width: 72%">
</colgroup>
<thead>
<tr class="header">
<th><strong>Persona</strong></th>
<th><strong>Description</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>New User</strong></td>
<td>Unfamiliar with the system; requires guidance.</td>
</tr>
<tr class="even">
<td><strong>Expert User</strong></td>
<td>Experienced; expects efficiency and advanced features.</td>
</tr>
<tr class="odd">
<td><strong>Non-Native Speaker</strong></td>
<td>May have language barriers; uses non-standard expressions.</td>
</tr>
<tr class="even">
<td><strong>Busy Professional</strong></td>
<td>Values quick, concise responses; often multitasking.</td>
</tr>
<tr class="odd">
<td><strong>Technophobe</strong></td>
<td>Uncomfortable with technology; needs simple instructions.</td>
</tr>
<tr class="even">
<td><strong>Elderly User</strong></td>
<td>May not be tech-savvy; requires patience and clear guidance.</td>
</tr>
</tbody>
</table>
</section>
</section>
<section id="this-taxonomy-is-not-universal" class="level3">
<h3 class="anchored" data-anchor-id="this-taxonomy-is-not-universal">This taxonomy is not universal</h3>
<p>This taxonomy (features, scenarios, personas) is not universal. For example, it may not make sense to even have personas if users aren’t directly engaging with your AI. The idea is you should outline dimensions that make sense for your use case and generate data that covers them. You’ll likely refine these after the first round of evaluations.</p>
</section>
<section id="generating-data" class="level3">
<h3 class="anchored" data-anchor-id="generating-data">Generating Data</h3>
<p>To build your dataset, you can:</p>
<ul>
<li><strong>Use Existing Data</strong>: Sample real user interactions or behaviors from your AI system.</li>
<li><strong>Generate Synthetic Data</strong>: Use LLMs to create realistic user inputs covering various features, scenarios, and personas.</li>
</ul>
<p>Often, you’ll do a combination of both to ensure comprehensive coverage. Synthetic data is not as good as real data, but it’s a good starting point. Also, we are only using LLMs to generate the user inputs, not the LLM responses or internal system behavior.</p>
<p>Regardless of whether you use existing data or synthetic data, you want good coverage across the dimensions you’ve defined.</p>
<p><strong>Incorporating System Information</strong></p>
<p>When making test data, use your APIs and databases where appropriate. This will create realistic data and trigger the right scenarios. Sometimes you’ll need to write simple programs to get this information. That’s what the “Assumptions” column is referring to in the examples below.</p>
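<p>The coverage idea above can be sketched in a few lines of Python. This is a hypothetical illustration, not code from a real system: the feature/scenario/persona lists and the prompt template are placeholders you would replace with your own taxonomy.</p>

```python
from itertools import product

# Hypothetical dimensions -- replace with the taxonomy for your own use case.
features = ["Order Tracking", "Contact Search", "Meeting Scheduler"]
scenarios = ["Multiple Matches Found", "No Matches Found", "Ambiguous Request"]
personas = ["New User", "Expert User", "Busy Professional"]

def generation_prompt(feature: str, scenario: str, persona: str) -> str:
    """Build an LLM prompt asking for one synthetic user input."""
    return (
        f"Generate a realistic user input for the '{feature}' feature. "
        f"The situation is '{scenario}' and the user is a '{persona}'. "
        "Return only the user's message, nothing else."
    )

# One prompt per combination guarantees at least one example for every cell.
prompts = [generation_prompt(f, s, p) for f, s, p in product(features, scenarios, personas)]
print(len(prompts))  # 27 combinations for this toy taxonomy
```

Enumerating the full cross product is the simplest way to verify you have no empty cells; in practice you would generate several inputs per cell and prune combinations that don't make sense.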
</section>
<section id="example-llm-prompts-for-generating-user-inputs" class="level3">
<h3 class="anchored" data-anchor-id="example-llm-prompts-for-generating-user-inputs">Example LLM Prompts for Generating User Inputs</h3>
<p>Here are some example prompts that illustrate how to use an LLM to generate synthetic <strong>user inputs</strong> for different combinations of features, scenarios, and personas:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 3%">
<col style="width: 11%">
<col style="width: 11%">
<col style="width: 10%">
<col style="width: 35%">
<col style="width: 26%">
</colgroup>
<thead>
<tr class="header">
<th><strong>ID</strong></th>
<th><strong>Feature</strong></th>
<th><strong>Scenario</strong></th>
<th><strong>Persona</strong></th>
<th><strong>LLM Prompt to Generate User Input</strong></th>
<th>Assumptions (not directly in the prompt)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1</td>
<td><strong>Order Tracking</strong></td>
<td>Invalid Data Provided</td>
<td>Frustrated Customer</td>
<td>“Generate a user input from someone who is clearly irritated and impatient, using short, terse language to demand information about their order status for order number <strong>#1234567890</strong>. Include hints of previous negative experiences.”</td>
<td>Order number <strong>#1234567890</strong> does <strong>not</strong> exist in the system.</td>
</tr>
<tr class="even">
<td>2</td>
<td><strong>Contact Search</strong></td>
<td>Multiple Matches Found</td>
<td>New User</td>
<td>“Create a user input from someone who seems unfamiliar with the system, using hesitant language and asking for help to find contact information for a person named ‘Alex’. The user should appear unsure about what information is needed.”</td>
<td>Multiple contacts named ‘Alex’ exist in the system.</td>
</tr>
<tr class="odd">
<td>3</td>
<td><strong>Meeting Scheduler</strong></td>
<td>Ambiguous Request</td>
<td>Busy Professional</td>
<td>“Simulate a user input from someone who is clearly in a hurry, using abbreviated language and minimal details to request scheduling a meeting. The message should feel rushed and lack specific information.”</td>
<td>N/A</td>
</tr>
<tr class="even">
<td>4</td>
<td><strong>Content Recommendation</strong></td>
<td>No Matches Found</td>
<td>Expert User</td>
<td>“Produce a user input from someone who demonstrates in-depth knowledge of their industry, using specific terminology to request articles on sustainable supply chain management. Use the information in this article involving sustainable supply chain management to formulate a plausible query: {{article}}”</td>
<td>No articles on ‘Emerging trends in sustainable supply chain management’ exist in the system.</td>
</tr>
</tbody>
</table>
</section>
<section id="generating-synthetic-data" class="level3">
<h3 class="anchored" data-anchor-id="generating-synthetic-data">Generating Synthetic Data</h3>
<p>When generating synthetic data, you only need to create the user inputs. You then feed these inputs into your AI system to generate the AI’s responses. It’s important that you log everything so you can evaluate your AI. To recap, here’s the process:</p>
<ol type="1">
<li><strong>Generate User Inputs</strong>: Use the LLM prompts to create realistic user inputs.</li>
<li><strong>Feed Inputs into Your AI System</strong>: Input the user interactions into your AI as it currently exists.</li>
<li><strong>Capture AI Responses</strong>: Record the AI’s responses to form complete interactions.</li>
<li><strong>Organize the Interactions</strong>: Create a table to store the user inputs, AI responses, and relevant metadata.</li>
</ol>
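<p>As a sketch, the four steps above might look like the following. <code>generate_user_input</code> and <code>run_ai_system</code> are hypothetical stand-ins for your LLM call and your actual application; the point is the shape of the logged record, not these function names.</p>

```python
import json
from datetime import datetime, timezone

def generate_user_input(feature, scenario, persona):
    # Stand-in for an LLM call that returns a synthetic user message.
    return f"[synthetic input for {feature}/{scenario}/{persona}]"

def run_ai_system(user_input):
    # Stand-in for your real application, invoked exactly as production would invoke it.
    return f"[AI response to: {user_input}]"

def build_interaction(feature, scenario, persona):
    user_input = generate_user_input(feature, scenario, persona)   # step 1
    ai_response = run_ai_system(user_input)                        # steps 2-3
    return {                                                       # step 4: one row per interaction
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "feature": feature,
        "scenario": scenario,
        "persona": persona,
        "user_input": user_input,
        "ai_response": ai_response,
        "judgment": None,   # filled in later by the domain expert
        "critique": None,
    }

row = build_interaction("Order Tracking", "No Matches Found", "New User")
print(json.dumps(row, indent=2))
```

Keeping the dimension labels on every row is what makes the error analysis in Step 6 possible later.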
<section id="how-much-data-should-you-generate" class="level4">
<h4 class="anchored" data-anchor-id="how-much-data-should-you-generate">How much data should you generate?</h4>
<p>There is no right answer here. At a minimum, you want to generate enough data so that you have examples for each combination of dimensions (in this toy example: features, scenarios, and personas). However, you also want to keep generating more data until you feel like you have stopped seeing new failure modes. The amount of data I generate varies significantly depending on the use case.</p>
</section>
<section id="does-synthetic-data-actually-work" class="level4">
<h4 class="anchored" data-anchor-id="does-synthetic-data-actually-work">Does synthetic data actually work?</h4>
<p>You might be skeptical of using synthetic data. After all, it’s not real data, so how can it be a good proxy? In my experience, it works surprisingly well. Some of my favorite AI products, like <a href="https://hex.tech/">Hex</a>, use synthetic data to power their evals:</p>
<blockquote class="blockquote">
<p>“LLMs are surprisingly good at generating excellent - and diverse - examples of user prompts. This can be relevant for powering application features, and sneakily, for building Evals. If this sounds a bit like the Large Language Snake is eating its tail, I was just as surprised as you! All I can say is: it works, ship it.” <em><a href="https://www.linkedin.com/in/bryan-bischof/">Bryan Bischof</a>, Head of AI Engineering at Hex</em></p>
</blockquote>
</section>
</section>
<section id="next-steps-1" class="level3">
<h3 class="anchored" data-anchor-id="next-steps-1">Next Steps</h3>
<p>With your dataset ready, now comes the most important part: getting your principal domain expert to evaluate the interactions.</p>
</section>
</section>
<section id="step-3-direct-the-domain-expert-to-make-passfail-judgments-with-critiques" class="level2">
<h2 class="anchored" data-anchor-id="step-3-direct-the-domain-expert-to-make-passfail-judgments-with-critiques">Step 3: Direct The Domain Expert to Make Pass/Fail Judgments with Critiques</h2>
<p>The domain expert’s job is to focus on one thing: <strong>“Did the AI achieve the desired outcome?”</strong> No complex scoring scales or multiple metrics. Just a clear <strong>pass or fail</strong> decision. In addition to the pass/fail decision, the domain expert should write a critique that explains their reasoning.</p>
<section id="why-are-simple-passfail-metrics-important" class="level3">
<h3 class="anchored" data-anchor-id="why-are-simple-passfail-metrics-important">Why are simple pass/fail metrics important?</h3>
<ul>
<li><p><strong>Clarity and Focus</strong>: A binary decision forces everyone to consider what truly matters. It simplifies the evaluation to a single, crucial question.</p></li>
<li><p><strong>Actionable Insights</strong>: Pass/fail judgments are easy to interpret and act upon. They help you quickly identify whether the AI meets the user’s needs.</p></li>
<li><p><strong>Forces Articulation of Expectations</strong>: When domain experts must decide if an interaction passes or fails, they are compelled to articulate their expectations clearly. This process uncovers nuances and unspoken assumptions about how the AI should behave.</p></li>
<li><p><strong>Efficient Use of Resources</strong>: Keeps the evaluation process manageable, especially when starting out. You avoid getting bogged down in detailed metrics that might not be meaningful yet.</p></li>
</ul>
</section>
<section id="the-role-of-critiques" class="level3">
<h3 class="anchored" data-anchor-id="the-role-of-critiques">The Role of Critiques</h3>
<p>Alongside a binary pass/fail judgment, it’s important to write a detailed critique of the LLM-generated output. These critiques:</p>
<ul>
<li><p><strong>Capture Nuances</strong>: The critique allows you to note if something was mostly correct but had areas for improvement.</p></li>
<li><p><strong>Guide Improvement</strong>: Detailed feedback provides specific insights into how the AI can be enhanced.</p></li>
<li><p><strong>Balance Simplicity with Depth</strong>: While the pass/fail offers a clear verdict, the critique offers the depth needed to understand the reasoning behind the judgment.</p></li>
</ul>
<p><strong>Why Write Critiques?</strong></p>
<p>In practice, domain experts may not have fully internalized all the judgment criteria. By forcing them to make a pass/fail decision and explain their reasoning, they clarify their expectations and provide valuable guidance for refining the AI.</p>
<p><em>“But my problem is complex!”</em> Trust me—starting simple forces you to focus on what truly matters. You can introduce more complexity later if needed.</p>
</section>
<section id="examples-of-good-critiques" class="level3">
<h3 class="anchored" data-anchor-id="examples-of-good-critiques">Examples of Good Critiques</h3>
<p>To illustrate how simple pass/fail judgments combined with detailed critiques work in practice, here’s a table showcasing examples of user interactions with an AI assistant. The table includes both <strong>passes</strong> and <strong>fails</strong>, with critiques explaining why the AI received that judgment. In cases where the AI passed despite critical issues, the critique highlights these aspects and justifies why it still passed overall. For failed interactions, the critique explains the critical elements that led to the failure.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 51%">
<col style="width: 24%">
<col style="width: 24%">
</colgroup>
<thead>
<tr class="header">
<th><strong>User Interaction with AI</strong></th>
<th><strong>Judgment</strong></th>
<th><strong>Critique</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>User:</strong> “I need to cancel my flight booking for next week.”<br><strong>AI:</strong> “Your flight booking for next week has been canceled. A confirmation email has been sent to you.”</td>
<td><strong>Pass</strong></td>
<td>The AI successfully executed the user’s request by canceling the flight and sending a confirmation, achieving the primary goal. <strong>However</strong>, it should have asked for confirmation before proceeding to prevent accidental cancellations. Despite this critical oversight, the AI fulfilled the user’s request, so it <strong>passes</strong>, but adding a confirmation step would improve user safety.</td>
</tr>
<tr class="even">
<td><strong>User:</strong> “What’s the capital of France?”<br><strong>AI:</strong> “The capital of France is Paris.”</td>
<td><strong>Pass</strong></td>
<td>The AI correctly answered the user’s question, providing accurate information promptly. <strong>Although</strong> the response is sufficient, the AI could enhance the interaction by offering additional interesting facts or asking if the user has more questions. Despite this opportunity for added engagement, the AI met the user’s primary need, so it <strong>passes</strong>.</td>
</tr>
<tr class="odd">
<td><strong>User:</strong> “I can’t log into my account; it says my password is incorrect.”<br><strong>AI:</strong> “Please reset your password by clicking ‘Forgot Password’.”</td>
<td><strong>Fail</strong></td>
<td>The AI provided a generic solution without acknowledging the user’s frustration or checking for other issues like account lockout. It failed to offer empathetic support or additional assistance. Due to the lack of personalized help and poor user experience, it <strong>fails</strong>.</td>
</tr>
<tr class="even">
<td><strong>User:</strong> “Book a table for two at an Italian restaurant tonight at 7 PM.”<br><strong>AI:</strong> “Table for two at Luigi’s Italian Restaurant booked for tonight at 7 PM. Confirmation number is 12345.”</td>
<td><strong>Pass</strong></td>
<td>The AI efficiently completed the booking as requested, providing all necessary details and a confirmation number. <strong>However</strong>, it could have asked if the user had seating preferences or dietary restrictions. The AI accomplished the primary task successfully, so it <strong>passes</strong>, with minor opportunities for enhanced service.</td>
</tr>
<tr class="odd">
<td><strong>User:</strong> “I need help with my billing statement.”<br><strong>AI:</strong> “Please provide your account number and full name.”</td>
<td><strong>Fail</strong></td>
<td>The AI requested sensitive information without verifying the user’s identity through secure means, potentially compromising security. It should guide the user through a secure authentication process first. Due to this critical oversight in user data protection, it <strong>fails</strong>.</td>
</tr>
</tbody>
</table>
<p>These examples demonstrate how the AI can receive both <strong>“Pass”</strong> and <strong>“Fail”</strong> judgments. In the critiques:</p>
<ul>
<li><p>For <strong>passes</strong>, we explain why the AI succeeded in meeting the user’s primary need, even if there were critical aspects that could be improved. We highlight these areas for enhancement while justifying the overall passing judgment.</p></li>
<li><p>For <strong>fails</strong>, we identify the critical elements that led to the failure, explaining why the AI did not meet the user’s main objective or compromised important factors like user experience or security.</p></li>
</ul>
<p>Most importantly, <strong>the critique should be detailed enough so that you can use it in a few-shot prompt for an LLM judge</strong>. In other words, it should be detailed enough that a new employee could understand it. Being too terse is a common mistake.</p>
<p>Note that the example user interactions with the AI are simplified for brevity, but you might need to give the domain expert more context to make a judgment. More on that later.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>At this point, you don’t need to perform a root cause analysis into the technical reasons behind why the AI failed. Many times, it’s useful to get a sense of overall behavior before diving into the weeds.</p>
</div>
</div>
</section>
<section id="dont-stray-from-binary-passfail-judgments-when-starting-out" class="level3">
<h3 class="anchored" data-anchor-id="dont-stray-from-binary-passfail-judgments-when-starting-out">Don’t stray from binary pass/fail judgments when starting out</h3>
<p>A common mistake is straying from binary pass/fail judgments. Let’s revisit the dashboard from earlier:</p>
<p><img src="https://hamel.dev/blog/posts/llm-judge/dashboard.png" class="img-fluid"></p>
<p>If your evaluations consist of a bunch of metrics that LLMs score on a 1-5 scale (or any other scale), you’re doing it wrong. Let’s unpack why.</p>
<ol type="1">
<li><strong>It’s not actionable</strong>: People don’t know what to do with a 3 or 4. It’s not immediately obvious how this number is better than a 2. You need to be able to say “this interaction passed because…” and “this interaction failed because…”.</li>
<li>More often than not, <strong>these metrics do not matter</strong>. Every time I’ve analyzed data on domain expert judgments, they tend not to correlate with these kinds of metrics. By having a domain expert make a binary judgment, you can figure out what truly matters.</li>
</ol>
<p>This is why I hate the off-the-shelf metrics that come with many evaluation frameworks. They tend to lead people astray.</p>
<p><strong>Common Objections to Pass/Fail Judgments:</strong></p>
<ul>
<li>“The business said that these 8 dimensions are important, so we need to evaluate all of them.”</li>
<li>“We need to be able to say why an interaction passed or failed.”</li>
</ul>
<p>I can guarantee you that if someone says you need to measure 8 things on a 1-5 scale, they don’t know what they are looking for. They are just guessing. You have to let the domain expert drive and make a pass/fail judgment with critiques so you can figure out what truly matters. Stand your ground here.</p>
</section>
<section id="make-it-easy-for-the-domain-expert-to-review-data" class="level3">
<h3 class="anchored" data-anchor-id="make-it-easy-for-the-domain-expert-to-review-data">Make it easy for the domain expert to review data</h3>
<p>Finally, you need to remove all friction from reviewing data. I’ve written about this <a href="../../../notes/llm/finetuning/data_cleaning.html">here</a>. Sometimes, you can just use a spreadsheet. It’s a judgment call in terms of what is easiest for the domain expert. I’ve found that I often have to provide additional context to help the domain expert understand the user interaction, such as:</p>
<ul>
<li>Metadata about the user, such as their location, subscription tier, etc.</li>
<li>Additional context about the system, such as the current time, inventory levels, etc.</li>
<li>Resources so you can check if the AI’s response is correct (e.g., the ability to search a database)</li>
</ul>
<p>All of this data needs to be presented on a single screen so the domain expert can review it without jumping through hoops. That’s why I recommend building <a href="../../../notes/llm/finetuning/data_cleaning.html">a simple web app</a> to review data.</p>
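<p>As a minimal illustration of the “single screen” idea, here is a stdlib-only sketch that renders each interaction together with its metadata and context into one HTML table. This is not the web app from the linked post, just a toy; a real review tool would also capture the expert’s pass/fail judgment and critique, and the field names here are assumptions.</p>

```python
import html

def render_review_page(interactions):
    """Render interactions plus all their context as a single HTML table."""
    rows = []
    for it in interactions:
        rows.append(
            "<tr>"
            f"<td>{html.escape(it['user_input'])}</td>"
            f"<td>{html.escape(it['ai_response'])}</td>"
            f"<td>{html.escape(it.get('user_metadata', ''))}</td>"
            f"<td>{html.escape(it.get('system_context', ''))}</td>"
            "</tr>"
        )
    return (
        "<table><tr><th>User Input</th><th>AI Response</th>"
        "<th>User Metadata</th><th>System Context</th></tr>"
        + "".join(rows) + "</table>"
    )

page = render_review_page([{
    "user_input": "Where's my order?",
    "ai_response": "You have three active orders...",
    "user_metadata": "Premium tier, US",
    "system_context": "Orders #123, #124, #125 active",
}])
```

The key design point is that everything the expert needs sits in one row, so no tab-switching is required to make a judgment.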
</section>
<section id="how-many-examples-do-you-need" class="level3">
<h3 class="anchored" data-anchor-id="how-many-examples-do-you-need">How many examples do you need?</h3>
<p>The number of examples you need depends on the complexity of the task. My heuristic is to start with around 30 examples and keep going until I stop seeing new failure modes; from there, I continue until I’m no longer learning anything new.</p>
<p>Next, we’ll look at how to use this data to build an LLM judge.</p>
</section>
</section>
<section id="step-4-fix-errors" class="level2">
<h2 class="anchored" data-anchor-id="step-4-fix-errors">Step 4: Fix Errors</h2>
<p>After looking at the data, it’s likely you will find errors in your AI system. Instead of plowing ahead and building an LLM judge, you want to fix any obvious errors. Remember, the whole point of the LLM as a judge is to help you find these errors, so it’s totally fine if you find them earlier!</p>
<p>If you have already developed <a href="https://hamel.dev/blog/posts/evals">Level 1 evals as outlined in my previous post</a>, you should not have any pervasive errors. However, these errors can sometimes slip through the cracks. If you find pervasive errors, fix them and go back to step 3. Keep iterating until you feel like you have stabilized your system.</p>
</section>
<section id="step-5-build-your-llm-as-a-judge-iteratively" class="level2">
<h2 class="anchored" data-anchor-id="step-5-build-your-llm-as-a-judge-iteratively">Step 5: Build Your LLM as A Judge, Iteratively</h2>
<section id="the-hidden-power-of-critiques" class="level3">
<h3 class="anchored" data-anchor-id="the-hidden-power-of-critiques">The Hidden Power of Critiques</h3>
<p>You cannot write a good judge prompt until you’ve seen the data. <a href="https://arxiv.org/abs/2404.12272">The paper from Shankar et al.,</a> “Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences” summarizes this well:</p>
<blockquote class="blockquote">
<p>to grade outputs, people need to externalize and define their evaluation criteria; however, the process of grading outputs helps them to define that very criteria. We dub this phenomenon criteria drift, and it implies that it is impossible to completely determine evaluation criteria prior to human judging of LLM outputs.</p>
</blockquote>
</section>
<section id="start-with-expert-examples" class="level3">
<h3 class="anchored" data-anchor-id="start-with-expert-examples">Start with Expert Examples</h3>
<p>Let me share a real-world example of building an LLM judge you can apply to your own use case. When I was helping Honeycomb build their <a href="https://www.honeycomb.io/blog/introducing-query-assistant">Query Assistant feature</a>, we needed a way to evaluate if the AI was generating good queries. Here’s what our LLM judge prompt looked like, including few-shot examples of critiques from our domain expert, <a href="https://x.com/_cartermp">Phillip</a>:</p>
<pre class="text"><code>You are a Honeycomb query evaluator with advanced capabilities to judge if a query is good or not.
You understand the nuances of the Honeycomb query language, including what is likely to be
most useful from an analytics perspective. 

Here is information about the Honeycomb query language:
{{query_language_info}}

Here are some guidelines for evaluating queries:
{{guidelines}}

Example evaluations:

&lt;examples&gt;

&lt;example-1&gt;
&lt;nlq&gt;show me traces where ip is 10.0.2.90&lt;/nlq&gt;
&lt;query&gt;
{
  "breakdowns": ["trace.trace_id"],
  "calculations": [{"op": "COUNT"}],
  "filters": [{"column": "net.host.ip", "op": "=", "value": "10.0.2.90"}]
}
&lt;/query&gt;
&lt;critique&gt;
{
  "critique": "The query correctly filters for traces with an IP address of 10.0.2.90 
   and counts the occurrences of those traces, grouped by trace.trace_id. The response 
   is good as it meets the requirement of showing traces from a specific IP address 
   without additional complexities.",
  "outcome": "good"
}
&lt;/critique&gt;
&lt;/example-1&gt;

&lt;example-2&gt;
&lt;nlq&gt;show me slowest trace&lt;/nlq&gt;
&lt;query&gt;
{
  "calculations": [{"column": "duration_ms", "op": "MAX"}],
  "orders": [{"column": "duration_ms", "op": "MAX", "order": "descending"}],
  "limit": 1
}
&lt;/query&gt;
&lt;critique&gt;
{
  "critique": "While the query attempts to find the slowest trace using MAX(duration_ms) 
   and ordering correctly, it fails to group by trace.trace_id. Without this grouping, 
   the query only shows the MAX(duration_ms) measurement over time, not the actual 
   slowest trace.",
  "outcome": "bad"
}
&lt;/critique&gt;
&lt;/example-2&gt;

&lt;example-3&gt;
&lt;nlq&gt;count window-hash where window-hash exists per hour&lt;/nlq&gt;
&lt;query&gt;
{
  "breakdowns": ["window-hash"],
  "calculations": [{"op": "COUNT"}],
  "filters": [{"column": "window-hash", "op": "exists"}],
  "time_range": 3600
}
&lt;/query&gt;
&lt;critique&gt;
{
  "critique": "While the query correctly counts window-hash occurrences, the time_range 
   of 3600 seconds (1 hour) is insufficient for per-hour analysis. When we say 'per hour', 
   we need a time_range of at least 36000 seconds to show meaningful hourly patterns.",
  "outcome": "bad"
}
&lt;/critique&gt;
&lt;/example-3&gt;

&lt;/examples&gt;

For the following query, first write a detailed critique explaining your reasoning,
then provide a pass/fail judgment in the same format as above.

&lt;nlq&gt;{{user_input}}&lt;/nlq&gt;
&lt;query&gt;
{{generated_query}}
&lt;/query&gt;
&lt;critique&gt;</code></pre>
<p>Notice how each example includes:</p>
<ol type="1">
<li>The natural language query (NLQ) in <code>&lt;nlq&gt;</code> tags</li>
<li>The generated query in <code>&lt;query&gt;</code> tags</li>
<li>The critique and outcome in <code>&lt;critique&gt;</code> tags</li>
</ol>
<p>In the prompt above, the example critiques are fixed. An advanced approach is to include examples dynamically based upon the item you are judging. You can learn more in <a href="https://blog.langchain.dev/dosu-langsmith-no-prompt-eng/">this post about Continual In-Context Learning</a>.</p>
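<p>A small sketch of how expert-labeled rows can be templated into the XML-tagged few-shot format shown above. The tag names mirror the Honeycomb prompt; the example data is invented for illustration.</p>

```python
import json

def render_example(i, nlq, query, critique, outcome):
    """Render one labeled row in the <nlq>/<query>/<critique> format."""
    critique_json = json.dumps({"critique": critique, "outcome": outcome}, indent=2)
    return (
        f"<example-{i}>\n"
        f"<nlq>{nlq}</nlq>\n"
        f"<query>\n{json.dumps(query, indent=2)}\n</query>\n"
        f"<critique>\n{critique_json}\n</critique>\n"
        f"</example-{i}>"
    )

example = render_example(
    1,
    "show me traces where ip is 10.0.2.90",
    {"calculations": [{"op": "COUNT"}]},
    "Correctly counts matching traces without extra complexity.",
    "good",
)
print(example)
```

Templating the examples this way makes it trivial to regenerate the judge prompt whenever the expert adds or revises a labeled row, whether you select examples statically or dynamically.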
</section>
<section id="keep-iterating-on-the-prompt-until-convergence-with-domain-expert" class="level3">
<h3 class="anchored" data-anchor-id="keep-iterating-on-the-prompt-until-convergence-with-domain-expert">Keep Iterating on the Prompt Until Convergence With Domain Expert</h3>
<p>In this case, I used a low-tech approach to iterate on the prompt. I sent Phillip a spreadsheet with the following information:</p>
<ol type="1">
<li>The NLQ</li>
<li>The generated query</li>
<li>The critique</li>
<li>The outcome (pass or fail)</li>
</ol>
<p>Phillip would then fill out his own version of the spreadsheet with his critiques. I used this to iteratively improve the prompt. The spreadsheet looked like this:</p>
<p><img src="https://hamel.dev/blog/posts/llm-judge/spreadsheet.png" class="img-fluid"></p>
<p>I also tracked agreement rates over time to ensure we were converging on a good prompt.</p>
<p><img src="https://hamel.dev/blog/posts/llm-judge/score.png" class="img-fluid"></p>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>Important Note on Using Agreement as a Metric
</div>
</div>
<div class="callout-body-container callout-body">
<p>In this example, we used agreement between the model and human evaluator because our dataset was roughly balanced (about 50% of instances were failures). However, using raw agreement is generally not recommended and can be misleading when classes are imbalanced. Instead, you should typically measure <a href="https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall">precision and recall</a> separately to get a more accurate picture of your judge’s alignment.</p>
</div>
</div>
<p>It took us only three iterations to achieve &gt; 90% agreement between the LLM and Phillip. Your mileage may vary depending on the complexity of the task. For example, <a href="https://humanloop.com/blog/why-your-product-needs-evals">Swyx has conducted a similar process hundreds of times</a> for <a href="https://www.latent.space/">AI News</a>, an <a href="https://x.com/swyx/status/1672306744884887553">extremely popular</a> news aggregator with high-quality recommendations. The AI quality that results from this process is why the product has received <a href="https://buttondown.com/ainews">critical acclaim</a>.</p>
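<p>The agreement and precision/recall numbers discussed above are simple to compute by hand. A minimal sketch, treating “fail” as the positive class (i.e., measuring how reliably the judge catches real failures):</p>

```python
def alignment_metrics(human, judge):
    """Compare human and LLM-judge labels ('pass'/'fail') pairwise."""
    assert len(human) == len(judge)
    agree = sum(h == j for h, j in zip(human, judge))
    tp = sum(h == "fail" and j == "fail" for h, j in zip(human, judge))
    fp = sum(h == "pass" and j == "fail" for h, j in zip(human, judge))
    fn = sum(h == "fail" and j == "pass" for h, j in zip(human, judge))
    return {
        "agreement": agree / len(human),
        # Of the interactions the judge flagged as failures, how many really failed?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the real failures, how many did the judge catch?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

m = alignment_metrics(
    ["pass", "fail", "fail", "pass", "fail"],
    ["pass", "fail", "pass", "pass", "fail"],
)
print(m)  # agreement 0.8, precision 1.0, recall ~0.67
```

On an imbalanced dataset, the precision and recall numbers will tell you far more than the raw agreement figure, which can look high even when the judge misses most failures.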
</section>
<section id="how-to-optimize-the-llm-judge-prompt" class="level3">
<h3 class="anchored" data-anchor-id="how-to-optimize-the-llm-judge-prompt">How to Optimize the LLM Judge Prompt?</h3>
<p>I usually adjust the prompts by hand. I haven’t had much luck with prompt optimizers like DSPy. However, my friend <a href="https://eugeneyan.com/">Eugene Yan</a> has just released a promising tool named <a href="https://eugeneyan.com/writing/aligneval/">ALIGN Eval</a>. I like it because it’s simple and effective. Also, don’t forget the approach of <a href="https://blog.langchain.dev/dosu-langsmith-no-prompt-eng/">continual in-context learning</a> mentioned earlier - it can be effective when implemented correctly.</p>
<p>In rare cases, I might fine-tune a judge, but I prefer not to. I talk about this more in the FAQ section.</p>
</section>
<section id="the-human-side-of-the-process" class="level3">
<h3 class="anchored" data-anchor-id="the-human-side-of-the-process">The Human Side of the Process</h3>
<p>Something unexpected happened during this process. <a href="https://www.linkedin.com/in/phillip-carter-4714a135/">Phillip Carter</a>, our domain expert at Honeycomb, found that reviewing the LLM’s critiques helped him articulate his own evaluation criteria more clearly. He said,</p>
<blockquote class="blockquote">
<p>“Seeing how the LLM breaks down its reasoning made me realize I wasn’t being consistent about how I judged certain edge cases.”</p>
</blockquote>
<p>This is a pattern I’ve seen repeatedly—the process of building an LLM judge often helps standardize evaluation criteria.</p>
<p>Furthermore, because this process forces the domain expert to look at data carefully, I always uncover new insights about the product, AI capabilities, and user needs. The resulting benefits are often <em>more valuable</em> than creating an LLM judge!</p>
</section>
<section id="how-often-should-you-evaluate" class="level3">
<h3 class="anchored" data-anchor-id="how-often-should-you-evaluate">How Often Should You Evaluate?</h3>
<p>I conduct this human review at regular intervals and whenever something material changes. For example, if I update a model, I’ll run the process again. I don’t get too scientific here; instead, I rely on my best judgment. Also note that after the first two iterations, I tend to focus more on errors rather than sampling randomly. For example, if I find an error, I’ll search for more examples that I think might trigger the same error. However, I always do a bit of random sampling as well.</p>
</section>
<section id="what-if-this-doesnt-work" class="level3">
<h3 class="anchored" data-anchor-id="what-if-this-doesnt-work">What if this doesn’t work?</h3>
<p>I’ve seen this process fail when:</p>
<ul>
<li>The AI is overscoped: Example - a chatbot in a SaaS product that promises to do anything you want.</li>
<li>The process is not followed correctly: Not using the principal domain expert, not writing proper critiques, etc.</li>
<li>The expectations of alignment are unrealistic or not feasible.</li>
</ul>
<p>In each of these cases, I try to address the root cause instead of trying to force alignment. Sometimes, you may not be able to achieve the alignment you want and may have to lean heavier on human annotations. However, after following the process described here, you will have metrics that help you understand how much you can trust the LLM judge.</p>
</section>
<section id="mistakes-ive-noticed-in-llm-judge-prompts" class="level3">
<h3 class="anchored" data-anchor-id="mistakes-ive-noticed-in-llm-judge-prompts">Mistakes I’ve noticed in LLM judge prompts</h3>
<p>Most of the mistakes I’ve seen in LLM judge prompts have to do with not providing good examples:</p>
<ol type="1">
<li>Not providing any critiques.</li>
<li>Writing extremely terse critiques.</li>
<li>Not providing external context. Your examples should contain the same information you use to evaluate, including external information like user metadata, system information, etc.</li>
<li>Not providing diverse examples. You need a wide variety of examples to ensure that your judge works for a wide variety of inputs.</li>
</ol>
<p>Sometimes you may struggle to fit everything you need into the prompt and have to get creative about how you structure the examples. However, this is becoming less of an issue thanks to expanding context windows and <a href="https://platform.openai.com/docs/guides/prompt-caching">prompt caching</a>.</p>
</section>
</section>
<section id="step-6-perform-error-analysis" class="level2">
<h2 class="anchored" data-anchor-id="step-6-perform-error-analysis">Step 6: Perform Error Analysis</h2>
<p>After you have created an LLM judge, you will have a dataset of user interactions with the AI along with the LLM’s judgments. If your metrics show acceptable agreement between the domain expert and the LLM judge, you can apply the judge to real or synthetic interactions. You can then calculate error rates for different dimensions of your data. Calculate the error on unseen data only, so you aren’t getting biased results.</p>
<p>For example, if you have segmented your data by persona, scenario, feature, etc., your analysis may look like this:</p>
<p><strong>Error Rates by Key Dimensions</strong></p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Feature</th>
<th>Scenario</th>
<th>Persona</th>
<th>Total Examples</th>
<th>Failure Rate</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Order Tracking</td>
<td>Multiple Matches</td>
<td>New User</td>
<td>42</td>
<td>24.3%</td>
</tr>
<tr class="even">
<td>Order Tracking</td>
<td>Multiple Matches</td>
<td>Expert User</td>
<td>38</td>
<td>18.4%</td>
</tr>
<tr class="odd">
<td>Order Tracking</td>
<td>No Matches</td>
<td>Expert User</td>
<td>30</td>
<td>23.3%</td>
</tr>
<tr class="even">
<td>Order Tracking</td>
<td>No Matches</td>
<td>New User</td>
<td>20</td>
<td>75.0%</td>
</tr>
<tr class="odd">
<td>Contact Search</td>
<td>Multiple Matches</td>
<td>New User</td>
<td>35</td>
<td>22.9%</td>
</tr>
<tr class="even">
<td>Contact Search</td>
<td>Multiple Matches</td>
<td>Expert User</td>
<td>32</td>
<td>19.7%</td>
</tr>
<tr class="odd">
<td>Contact Search</td>
<td>No Matches</td>
<td>New User</td>
<td>25</td>
<td>68.0%</td>
</tr>
<tr class="even">
<td>Contact Search</td>
<td>No Matches</td>
<td>Expert User</td>
<td>28</td>
<td>21.4%</td>
</tr>
</tbody>
</table>
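<p>As a minimal sketch, a table like the one above can be computed by grouping judged traces along your segmentation dimensions. The records and field names here (<code>feature</code>, <code>scenario</code>, <code>persona</code>, <code>passed</code>) are hypothetical; the grouping logic is the point:</p>

```python
from collections import defaultdict

# Hypothetical judged traces: each record carries its segment dimensions
# and the judge's pass/fail verdict.
traces = [
    {"feature": "Order Tracking", "scenario": "No Matches", "persona": "New User", "passed": False},
    {"feature": "Order Tracking", "scenario": "No Matches", "persona": "New User", "passed": True},
    {"feature": "Contact Search", "scenario": "Multiple Matches", "persona": "Expert User", "passed": True},
]

totals = defaultdict(int)
failures = defaultdict(int)
for t in traces:
    key = (t["feature"], t["scenario"], t["persona"])
    totals[key] += 1
    if not t["passed"]:
        failures[key] += 1

# Failure rate per segment, sorted worst-first so you know where to look
rates = sorted(
    ((key, failures[key] / totals[key], totals[key]) for key in totals),
    key=lambda row: row[1],
    reverse=True,
)
for (feature, scenario, persona), rate, n in rates:
    print(f"{feature} | {scenario} | {persona}: {rate:.1%} of {n}")
```

<p>The same grouping is a one-liner in a dataframe library if your traces are already tabular; what matters is slicing failure rates by the dimensions you defined in earlier steps.</p>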
<section id="classify-traces" class="level3">
<h3 class="anchored" data-anchor-id="classify-traces">Classify Traces</h3>
<p>Once you know where the errors are, you can perform an error analysis to get to the root cause. My favorite way is to look at examples of each type of error and classify them by hand. I recommend using a spreadsheet for this. For example, a trace for Order Tracking where there are no matches for new users might look like this:</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Example Trace
</div>
</div>
<div class="callout-body-container callout-body">
<p>In this example trace, the user provides an invalid order number. The AI correctly identifies that the order number is invalid but provides an unhelpful response. If you are not familiar with logging LLM traces, refer to my <a href="https://hamel.dev/blog/posts/evals/">previous post on evals</a>.</p>
<p>Note that this trace is formatted for readability.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-2"> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"user_input"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Where's my order #ABC123?"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-3"> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"function_calls"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb2-4">   <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-5">     <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"search_order_database"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-6">     <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"args"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"order_id"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ABC123"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb2-7">     <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"result"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-8">       <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"status"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"not_found"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-9">       <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"valid_patterns"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"XXX-XXX-XXX"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb2-10">     <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-11">   <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-12">   <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-13">     <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"retrieve_context"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-14">     <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"result"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-15">       <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"relevant_docs"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb2-16">         <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Order numbers follow format XXX-XXX-XXX"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-17">         <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"New users should check confirmation email"</span></span>
<span id="cb2-18">       <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb2-19">     <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-20">   <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-21"> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-22"> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"llm_intermediate_steps"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb2-23">   <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-24">     <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"thought"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"User is new and order format is invalid"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-25">     <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"action"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Generate help message with format info"</span></span>
<span id="cb2-26">   <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-27"> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-28"> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"final_response"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I cannot find that order #. Please check the number and try again."</span></span>
<span id="cb2-29"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
</div>
</div>
<p>In this case, you might classify the error as: <code>Missing User Education</code>. The system retrieved new user context and format information but failed to include it in the response, which suggests we could improve our prompt. After you have classified a number of errors, you can calculate the distribution of errors by root cause. That might look like this:</p>
<p><strong>Root Cause Distribution (20 Failed Interactions)</strong></p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Root Cause</th>
<th>Count</th>
<th>Percentage</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Missing User Education</td>
<td>8</td>
<td>40%</td>
</tr>
<tr class="even">
<td>Authentication/Access Issues</td>
<td>6</td>
<td>30%</td>
</tr>
<tr class="odd">
<td>Poor Context Handling</td>
<td>4</td>
<td>20%</td>
</tr>
<tr class="even">
<td>Inadequate Error Messages</td>
<td>2</td>
<td>10%</td>
</tr>
</tbody>
</table>
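<p>Tallying a distribution like the one above from your hand-applied spreadsheet labels is a few lines of code. The label strings below are taken from the table; the list itself is an abbreviated, hypothetical sample:</p>

```python
from collections import Counter

# Root-cause labels applied by hand during the spreadsheet pass
labels = [
    "Missing User Education", "Missing User Education",
    "Authentication/Access Issues",
    "Poor Context Handling",
]

counts = Counter(labels)
total = len(labels)
for cause, n in counts.most_common():  # sorted most frequent first
    print(f"{cause}: {n} ({n / total:.0%})")
```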
<p>Now you know where to focus your efforts. This doesn’t have to take an extraordinary amount of time. You can get quite far in just 15 minutes. You can also use an LLM to help you with this classification, but that is beyond the scope of this post (you can use an LLM to help you do anything in this post, as long as you have a process to verify the results).</p>
</section>
<section id="an-interactive-walkthrough-of-error-analysis" class="level3">
<h3 class="anchored" data-anchor-id="an-interactive-walkthrough-of-error-analysis">An Interactive Walkthrough of Error Analysis</h3>
<p>Error analysis has been around in Machine Learning for quite some time. This video by Andrew Ng does a great job of walking through the process interactively:</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/JoAxZsdw_3w?si=R4P1awrDBIYDhJS9" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</section>
<section id="fix-your-errors-again" class="level3">
<h3 class="anchored" data-anchor-id="fix-your-errors-again">Fix Your Errors, Again</h3>
<p>Now that you have a sense of the errors, you can go back and fix them again. Go back to step 3 and iterate until you are satisfied. Note that every time you fix an error, you should try to write a test case for it. Sometimes, this can be an assertion in your test suite, but other times you may need to create a more “specialized” LLM judge for these failures. We’ll talk about this next.</p>
</section>
<section id="doing-this-well-requires-data-literacy" class="level3">
<h3 class="anchored" data-anchor-id="doing-this-well-requires-data-literacy">Doing this well requires data literacy</h3>
<p>Investigating your data is much harder in practice than I made it look in this post. It requires a nose for data that only comes from practice. It also helps to have some basic familiarity with statistics and data analysis tools. My favorite post on data literacy is <a href="https://jxnl.co/writing/2024/06/02/10-ways-to-be-data-illiterate-and-how-to-avoid-them/">this one</a> by Jason Liu and Eugene Yan.</p>
</section>
</section>
<section id="step-7-create-more-specialized-llm-judges-if-needed" class="level2">
<h2 class="anchored" data-anchor-id="step-7-create-more-specialized-llm-judges-if-needed">Step 7: Create More Specialized LLM Judges (if needed)</h2>
<p>Now that you have a sense of where the problems in your AI are, you can decide whether and where to invest in more targeted LLM judges. For example, if you find that the AI has trouble citing sources correctly, you can create a targeted eval for that. You might not even need an LLM judge for some errors (a code-based assertion may suffice).</p>
<p>The key takeaway: don’t jump directly to specialized LLM judges until you have gone through this critique shadowing process. It will help you rationalize where to invest your time.</p>
</section>
<section id="recap-of-critique-shadowing" class="level2">
<h2 class="anchored" data-anchor-id="recap-of-critique-shadowing">Recap of Critique Shadowing</h2>
<p>Using an LLM as a judge can streamline your AI evaluation process if approached correctly. Here’s a visual illustration of the process (there is a description of the process below the diagram as well):</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TB
    A[Start] --&gt; B[1 Find Principal Domain Expert]
    B --&gt; C[2 Create Dataset]
    C --&gt; D[3 Domain Expert Reviews Data]
    D --&gt; E{Found Errors?}
    E --&gt;|Yes| F[4 Fix Errors]
    F --&gt; D
    E --&gt;|No| G[5 Build LLM Judge]
    G --&gt; H[Test Against Domain Expert]
    H --&gt; I{Acceptable Agreement?}
    I --&gt;|No| J[Refine Prompt]
    J --&gt; H
    I --&gt;|Yes| K[6 Perform Error Analysis]
    K --&gt; L{Critical Issues Found?}
    L --&gt;|Yes| M[7 Fix Issues &amp; Create Specialized Judges]
    M --&gt; D
    L --&gt;|No| N[Material Changes or Periodic Review?]
    N --&gt;|Yes| C
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<p>The Critique Shadowing process is iterative, with feedback loops. Let’s list out the steps:</p>
<ol type="1">
<li>Find Principal Domain Expert</li>
<li>Create A Dataset
<ul>
<li>Generate diverse examples covering your use cases</li>
<li>Include real or synthetic user interactions</li>
</ul></li>
<li>Domain Expert Reviews Data
<ul>
<li>Expert makes pass/fail judgments</li>
<li>Expert writes detailed critiques explaining their reasoning</li>
</ul></li>
<li>Fix Errors (if found)
<ul>
<li>Address any issues discovered during review</li>
<li>Return to expert review to verify fixes</li>
<li>Go back to step 3 if errors are found</li>
</ul></li>
<li>Build LLM Judge
<ul>
<li>Create prompt using expert examples</li>
<li>Test against expert judgments</li>
<li>Refine prompt until agreement is satisfactory</li>
</ul></li>
<li>Perform Error Analysis
<ul>
<li>Calculate error rates across different dimensions</li>
<li>Identify patterns and root causes</li>
<li>Fix errors and go back to step 3 if needed</li>
<li>Create specialized judges as needed</li>
</ul></li>
</ol>
<p>This process never truly ends. It repeats periodically or when material changes occur.</p>
<section id="its-not-the-judge-that-created-value-after-all" class="level3">
<h3 class="anchored" data-anchor-id="its-not-the-judge-that-created-value-after-all">It’s Not The Judge That Created Value, After All</h3>
<p>The real value of this process is looking at your data and doing careful analysis. Even though an AI judge can be a helpful tool, going through this process is what drives results. I would go as far as saying that creating an LLM judge is a nice “hack” I use to trick people into carefully looking at their data!</p>
<p>That’s right. The real business value comes from looking at your data. But hey, potato, potahto.</p>
</section>
<section id="do-you-really-need-this" class="level3">
<h3 class="anchored" data-anchor-id="do-you-really-need-this">Do You Really Need This?</h3>
<p>Phew, this seems like a lot of work! Do you really need this? Well, it depends. There are cases where you can take a shortcut through this process. For example, let’s say:</p>
<ol type="1">
<li>You are an independent developer who is also a domain expert.</li>
<li>You are working with test data that is already available (tweets, etc.).</li>
<li>Looking at data is not costly (e.g., you can manually review enough data in a few hours).</li>
</ol>
<p>In this scenario, you can jump directly to something that looks like step 3 and start looking at data right away. Also, since it’s not that costly to look at data, it’s probably fine to just do error analysis without a judge (at least initially). You can incorporate what you learn directly back into your primary model right away. This example is not exhaustive, but gives you an idea of how you can adapt this process to your needs.</p>
<p>However, you can never completely eliminate looking at your data! This is precisely the step that most people skip. Don’t be that person.</p>
</section>
</section>
<section id="faq" class="level2">
<h2 class="anchored" data-anchor-id="faq">FAQ</h2>
<p>I received <a href="https://x.com/HamelHusain/status/1850256204553244713">a lot of questions</a> about this topic. Here are answers to the most common ones:</p>
<section id="if-i-have-a-good-judge-llm-isnt-that-also-the-llm-id-also-want-to-use" class="level3">
<h3 class="anchored" data-anchor-id="if-i-have-a-good-judge-llm-isnt-that-also-the-llm-id-also-want-to-use">If I have a good judge LLM, isn’t that also the LLM I’d want to use?</h3>
<p>Effective judges often use larger models or more compute (via longer prompts, chain-of-thought, etc.) than the systems they evaluate.</p>
<p>However, if the cost of the most powerful LLM is not prohibitive, and latency is not an issue, then you might want to consider where you invest your efforts differently. In this case, it might make sense to put more effort toward specialist LLM judges, <a href="https://hamel.dev/blog/posts/evals/#the-types-of-evaluation">code-based assertions, and A/B testing</a>. However, you should still go through the process of looking at data and critiquing the LLM’s output before you adopt specialized judges.</p>
</section>
<section id="do-you-recommend-fine-tuning-judges" class="level3">
<h3 class="anchored" data-anchor-id="do-you-recommend-fine-tuning-judges">Do you recommend fine-tuning judges?</h3>
<p>I prefer not to fine-tune LLM judges. I’d rather spend the effort fine-tuning the actual LLM instead. However, fine-tuning guardrails or other specialized judges can be useful (especially if they are small classifiers).</p>
<p>As a related note, you can leverage an LLM judge to curate and transform data for fine-tuning your primary model. For example, you can use the judge to:</p>
<ul>
<li>Eliminate bad examples for fine-tuning.</li>
<li>Generate higher quality outputs (by referencing the critique).</li>
<li>Simulate high quality chain-of-thought with critiques.</li>
</ul>
<p>Using an LLM judge to enhance fine-tuning data is even more compelling when you are trying to <a href="https://openai.com/index/api-model-distillation/">distill a large LLM into a smaller one</a>. The details of fine-tuning are beyond the scope of this post. If you are interested in learning more, see <a href="https://parlance-labs.com/education/#fine-tuning">these resources</a>.</p>
</section>
<section id="whats-wrong-with-off-the-shelf-llm-judges" class="level3">
<h3 class="anchored" data-anchor-id="whats-wrong-with-off-the-shelf-llm-judges">What’s wrong with off-the-shelf LLM judges?</h3>
<p>Nothing is strictly wrong with them; it’s just that many people are led astray by them. If you are disciplined, you can apply them to your data and see whether they tell you something valuable. However, I’ve found that these tend to cause more confusion than value.</p>
</section>
<section id="how-do-you-evaluate-the-llm-judge" class="level3">
<h3 class="anchored" data-anchor-id="how-do-you-evaluate-the-llm-judge">How do you evaluate the LLM judge?</h3>
<p>You will collect metrics on the agreement between the domain expert and the LLM judge. This tells you how much you can trust the judge and in what scenarios. Your domain expert doesn’t have to inspect every single example; you just need a representative sample so you have reliable statistics.</p>
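<p>As a sketch of what those agreement metrics can look like: raw agreement is the fraction of examples where the expert and judge give the same verdict, and Cohen’s kappa corrects that for agreement expected by chance. The verdict lists below are made-up data for illustration:</p>

```python
# Expert and judge verdicts on the same representative sample (hypothetical data)
expert = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail"]
judge  = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail"]

n = len(expert)
agreement = sum(e == j for e, j in zip(expert, judge)) / n  # raw agreement (p_o)

# Chance agreement (p_e): probability both raters pick the same label at random,
# given each rater's observed label frequencies
labels = set(expert) | set(judge)
p_e = sum((expert.count(l) / n) * (judge.count(l) / n) for l in labels)

# Cohen's kappa: agreement above chance, scaled to [−1, 1]
kappa = (agreement - p_e) / (1 - p_e)
print(f"raw agreement: {agreement:.2f}, kappa: {kappa:.2f}")
```

<p>Raw agreement alone can be misleading when one verdict dominates (a judge that always says “pass” agrees often by accident), which is why a chance-corrected statistic is worth reporting alongside it.</p>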
</section>
<section id="what-model-do-you-use-for-the-llm-judge" class="level3">
<h3 class="anchored" data-anchor-id="what-model-do-you-use-for-the-llm-judge">What model do you use for the LLM judge?</h3>
<p>For the kind of judge articulated in this blog post, I like to use the most powerful model I can afford in my cost/latency budget. This budget might be different than my primary model, depending on the number of examples I need to score. This can vary significantly according to the use case.</p>
</section>
<section id="what-about-guardrails" class="level3">
<h3 class="anchored" data-anchor-id="what-about-guardrails">What about guardrails?</h3>
<p>Guardrails are a separate but related topic. They are a way to prevent the LLM from saying/doing something harmful or inappropriate. This blog post focuses on helping you create a judge that’s aligned with business goals, especially when starting out.</p>
</section>
<section id="im-using-llm-as-a-judge-and-getting-tremendous-value-but-i-didnt-follow-this-approach." class="level3">
<h3 class="anchored" data-anchor-id="im-using-llm-as-a-judge-and-getting-tremendous-value-but-i-didnt-follow-this-approach.">I’m using LLM as a judge, and getting tremendous value but I didn’t follow this approach.</h3>
<p>I believe you. This blog post is not the only way to use an LLM as a judge. In fact, I’ve seen people use an LLM as a judge in all sorts of creative ways, including ranking, classification, model selection, and so on. I’m focused on an approach that works well when you are getting started and avoids the pitfalls of confusing metric sprawl. However, the general process of looking at the data is still central no matter what kind of judge you are building.</p>
</section>
<section id="how-do-you-choose-between-traditional-ml-techniques-llm-as-a-judge-and-human-annotations" class="level3">
<h3 class="anchored" data-anchor-id="how-do-you-choose-between-traditional-ml-techniques-llm-as-a-judge-and-human-annotations">How do you choose between traditional ML techniques, LLM-as-a-judge and human annotations?</h3>
<p>The answer to this (and many other questions) is: do the simplest thing that works. And simple doesn’t always mean traditional ML techniques. Depending on your situation, it might be easier to use an LLM API as a classifier than to train a model and deploy it.</p>
</section>
<section id="can-you-make-judges-from-small-models" class="level3">
<h3 class="anchored" data-anchor-id="can-you-make-judges-from-small-models">Can you make judges from small models?</h3>
<p>Yes, potentially. I’ve only used the larger models for judges. You have to base the answer to this question on the data (i.e.&nbsp;the agreement with the domain expert).</p>
</section>
<section id="how-do-you-ensure-consistency-when-updating-your-llm-model" class="level3">
<h3 class="anchored" data-anchor-id="how-do-you-ensure-consistency-when-updating-your-llm-model">How do you ensure consistency when updating your LLM model?</h3>
<p>You have to go through the process again and measure the results.</p>
</section>
<section id="how-do-you-phase-out-human-in-the-loop-to-scale-this" class="level3">
<h3 class="anchored" data-anchor-id="how-do-you-phase-out-human-in-the-loop-to-scale-this">How do you phase out human in the loop to scale this?</h3>
<p>You don’t need a domain expert to grade every single example. You just need a representative sample. I don’t think you can eliminate humans completely, because the LLM still needs to be aligned to something, and that something is usually a human. As your evaluation system gets better, it naturally reduces the amount of human effort required.</p>
</section>
</section>
<section id="resources" class="level2">
<h2 class="anchored" data-anchor-id="resources">Resources</h2>
<p>These are some of the resources I recommend to learn more on this topic:</p>
<ul>
<li><a href="https://hamel.dev/evals">Your AI Product Needs Evals</a>: This blog post is the predecessor to this one, and provides a high-level overview of evals for LLM-based products.</li>
<li><a href="https://arxiv.org/abs/2404.12272">Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences</a>: This paper by Shreya Shankar et al. provides a good overview of the challenges of evaluating LLMs, and the importance of following a good process.</li>
<li><a href="https://aligneval.com/">Align Eval</a>: Eugene Yan’s new tool that helps you build LLM judges by following a good process. Also read his accompanying <a href="https://eugeneyan.com/writing/aligneval/">blog post</a>.</li>
<li><a href="https://eugeneyan.com/writing/llm-evaluators/">Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)</a>: This is a great survey of different use-cases and approaches for LLM judges, also written by Eugene Yan.</li>
<li><a href="https://www.databricks.com/blog/enhancing-llm-as-a-judge-with-grading-notes">Enhancing LLM-As-A-Judge with Grading Notes</a> by Yi Liu et al.&nbsp;Describes an approach very similar to the one in this blog post, and provides another point of view regarding the utility of writing critiques (they call them grading notes).</li>
<li><a href="https://cookbook.openai.com/examples/custom-llm-as-a-judge">Custom LLM as a Judge to Detect Hallucinations with Braintrust</a> by Ankur Goyal and Shaymal Anadkt provides an end-to-end example of building an LLM judge; for the use case highlighted, the authors found that a classification approach was more reliable than numeric ratings (consistent with this blog post).</li>
<li><a href="https://arize.com/blog/techniques-for-self-improving-llm-evals/">Techniques for Self-Improving LLM Evals</a> by Eric Xiao from Arize shows a nice approach to building LLM Evals with some additional tools that are worth checking out.</li>
<li><a href="https://blog.langchain.dev/dosu-langsmith-no-prompt-eng/">How Dosu Used LangSmith to Achieve a 30% Accuracy Improvement with No Prompt Engineering</a> by Langchain shows a nice approach to building LLM prompts with dynamic examples. The idea is simple, but effective. I’ve been adapting it for my own use cases, including LLM judges. Here is a <a href="https://www.youtube.com/watch?v=tHZtq_pJSGo">video walkthrough</a> of the approach.</li>
<li><a href="https://applied-llms.org/">What We’ve Learned From A Year of Building with LLMs</a>: is a great overview of many practical aspects of building with LLMs, with an emphasis on the importance of evaluation.</li>
</ul>
</section>
<section id="stay-connected" class="level2">
<h2 class="anchored" data-anchor-id="stay-connected">Stay Connected</h2>
<p>I’m continuously learning about LLMs, and enjoy sharing my findings. If you’re interested in this journey, consider subscribing.</p>
<p>What to expect:</p>
<ul>
<li>Occasional emails with my latest insights on LLMs</li>
<li>Early access to new content</li>
<li>No spam, just honest thoughts and discoveries</li>
</ul>


</section>

 ]]></description>
  <category>LLMs</category>
  <category>evals</category>
  <guid>https://hamel.dev/blog/posts/llm-judge/</guid>
  <pubDate>Tue, 29 Oct 2024 07:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/llm-judge/images/cover_img.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>Concurrency Foundations For FastHTML</title>
  <link>https://hamel.dev/</link>
  <description>Concurrency fundamentals for FastHTML
</description>
  <category>FastHTML</category>
  <guid>https://hamel.dev/</guid>
  <pubDate>Thu, 10 Oct 2024 07:00:00 GMT</pubDate>
</item>
<item>
  <title>An Open Course on LLMs, Led by Practitioners</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/course/</link>
  <description><![CDATA[ 





<p>Today, we are releasing <a href="https://parlance-labs.com/education/">Mastering LLMs</a>, a set of workshops and talks from practitioners on topics like evals, retrieval-augmented-generation (RAG), fine-tuning and more. This course is unique because it is:</p>
<ul>
<li>Taught by 25+ industry veterans who are experts in information retrieval, machine learning, recommendation systems, MLOps and data science. We discuss how this prior art can be applied to LLMs to give you a meaningful advantage.</li>
<li>Focused on applied topics that are relevant to people building AI products.</li>
<li><ins>
<strong>Free and open to everyone</strong>
</ins>
.</li>
</ul>
<p>We have organized and annotated the talks from our popular paid course.<sup>1</sup> This is a survey course for technical ICs (including engineers and data scientists) who have some experience with LLMs and need guidance on how to improve AI products.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://parlance-labs.com/education/" target="_blank"><img src="https://hamel.dev/blog/posts/course/course.png" class="img-fluid figure-img"></a></p>
<figcaption><em>Speakers include Jeremy Howard, Sophia Yang, Simon Willison, JJ Allaire, Wing Lian, Mark Saroufim, Jane Xu, Jason Liu, Emmanuel Ameisen, Hailey Schoelkopf, Johno Whitaker, Zach Mueller, John Berryman, Ben Clavié, Abhishek Thakur, Kyle Corbitt, Ankur Goyal, Freddy Boulton, Jo Bergum, Eugene Yan, Shreya Shankar, Charles Frye, Hamel Husain, Dan Becker and more</em></figcaption>
</figure>
</div>
<section id="getting-the-most-value-from-the-course" class="level2">
<h2 class="anchored" data-anchor-id="getting-the-most-value-from-the-course">Getting The Most Value From The Course</h2>
<section id="prerequisites" class="level3">
<h3 class="anchored" data-anchor-id="prerequisites">Prerequisites</h3>
<p>The course assumes basic familiarity with LLMs. If you do not have any experience, we recommend watching <a href="https://www.youtube.com/watch?v=jkrNMKz9pWU">A Hacker’s Guide to LLMs</a>. We also recommend the tutorial <a href="https://www.philschmid.de/instruction-tune-llama-2">Instruction Tuning llama2</a> if you are interested in fine-tuning<sup>2</sup>.</p>
</section>
<section id="navigating-the-material" class="level3">
<h3 class="anchored" data-anchor-id="navigating-the-material">Navigating The Material</h3>
<p>The course has over 40 hours of content. To help you navigate this, we provide:</p>
<ul>
<li><strong>Organization by subject area</strong>: evals, RAG, fine-tuning, building applications and prompt engineering.</li>
<li><strong>Chapter summaries:</strong> quickly peruse topics in each talk and skip ahead</li>
<li><strong>Notes, slides, and resources</strong>: these are resources used in the talk, as well as resources to learn more. Many times we have detailed notes as well!</li>
</ul>
<p>To get started, <a href="https://parlance-labs.com/education">navigate to this page</a> and explore topics that interest you. Feel free to skip sections that aren’t relevant to you. We’ve organized the talks within each subject to enhance your learning experience. Be sure to review the chapter summaries, notes, and resources, which are designed to help you focus on the most relevant content and dive deeper when needed. This is a survey course, which means we focus on introducing topics rather than diving deeply into code. To solidify your understanding, we recommend applying what you learn to a personal project.</p>
</section>
<section id="what-students-are-saying" class="level3">
<h3 class="anchored">What Students Are Saying</h3>
<p>Here are some testimonials from students who have taken the course<sup>3</sup>:</p>
<div class="testimonial-section">
<div class="content-block grid gap-4">
<div class="testimonial g-col-12 g-col-md-6">
<p><img src="https://hamel.dev/blog/posts/course/sanyam.jpeg" class="img-fluid"></p>
<section id="sanyam-bhutani-partner-engineer-meta" class="level2">
<h2 class="anchored" data-anchor-id="sanyam-bhutani-partner-engineer-meta"><em>Sanyam Bhutani, Partner Engineer @ Meta</em></h2>
<section id="there-was-a-magical-time-in-2017-when-fastai-changed-the-deep-learning-world.-this-course-does-the-same-by-extending-very-applied-knowledge-to-llms-best-in-class-teachers-teach-you-their-knowledge-with-no-fluff" class="level3">
<h3 class="anchored" data-anchor-id="there-was-a-magical-time-in-2017-when-fastai-changed-the-deep-learning-world.-this-course-does-the-same-by-extending-very-applied-knowledge-to-llms-best-in-class-teachers-teach-you-their-knowledge-with-no-fluff">There was a magical time in 2017 when fastai changed the deep learning world. This course does the same by extending very applied knowledge to LLMs Best in class teachers teach you their knowledge with no fluff</h3>
</section>
</section>
</div>
<div class="testimonial g-col-12 g-col-md-6">
<p><img src="https://hamel.dev/blog/posts/course/laurian.jpeg" class="img-fluid"></p>
<section id="laurian-full-stack-computational-linguist" class="level2">
<h2 class="anchored" data-anchor-id="laurian-full-stack-computational-linguist"><em>Laurian, Full Stack Computational Linguist</em></h2>
<section id="this-course-was-legendary-still-is-and-the-community-on-discord-is-amazing.-ive-been-through-these-lessons-twice-and-i-have-to-do-it-again-as-there-are-so-many-nuances-you-will-get-once-you-actually-have-those-problems-on-your-own-deployment." class="level3">
<h3 class="anchored" data-anchor-id="this-course-was-legendary-still-is-and-the-community-on-discord-is-amazing.-ive-been-through-these-lessons-twice-and-i-have-to-do-it-again-as-there-are-so-many-nuances-you-will-get-once-you-actually-have-those-problems-on-your-own-deployment.">This course was legendary, still is, and the community on Discord is amazing. I’ve been through these lessons twice and I have to do it again as there are so many nuances you will get once you actually have those problems on your own deployment!</h3>
</section>
</section>
</div>
<div class="testimonial g-col-12 g-col-md-6">
<p><img src="https://hamel.dev/blog/posts/course/andre.png" class="img-fluid"></p>
<section id="andre-cto" class="level2">
<h2 class="anchored" data-anchor-id="andre-cto"><em>Andre, CTO</em></h2>
<section id="amazing-an-opinionated-view-of-llms-from-tools-to-fine-tuning.-excellent-speakers-giving-some-of-the-best-lectures-and-advice-out-there-a-lot-of-real-life-experiences-and-tips-you-cant-find-anywhere-on-the-web-packed-into-this-amazing-courseworkshopconference-thanks-dan-and-hamel-for-making-this-happen" class="level3">
<h3 class="anchored" data-anchor-id="amazing-an-opinionated-view-of-llms-from-tools-to-fine-tuning.-excellent-speakers-giving-some-of-the-best-lectures-and-advice-out-there-a-lot-of-real-life-experiences-and-tips-you-cant-find-anywhere-on-the-web-packed-into-this-amazing-courseworkshopconference-thanks-dan-and-hamel-for-making-this-happen">Amazing! An opinionated view of LLMs, from tools to fine-tuning. Excellent speakers, giving some of the best lectures and advice out there! A lot of real-life experiences and tips you can’t find anywhere on the web packed into this amazing course/workshop/conference! Thanks Dan and Hamel for making this happen!</h3>
</section>
</section>
</div>
<div class="testimonial g-col-12 g-col-md-6">
<p><img src="https://hamel.dev/blog/posts/course/marcus.png" class="img-fluid"></p>
<section id="marcus-software-engineer" class="level2">
<h2 class="anchored" data-anchor-id="marcus-software-engineer"><em>Marcus, Software Engineer</em></h2>
<section id="the-mastering-llms-conference-answered-several-key-questions-i-had-about-when-to-fine-tune-base-models-building-evaluation-suits-and-when-to-use-rag.-the-sessions-provided-a-valuable-overview-of-the-technical-challenges-and-considerations-involved-in-building-and-deploying-custom-llms." class="level3">
<h3 class="anchored" data-anchor-id="the-mastering-llms-conference-answered-several-key-questions-i-had-about-when-to-fine-tune-base-models-building-evaluation-suits-and-when-to-use-rag.-the-sessions-provided-a-valuable-overview-of-the-technical-challenges-and-considerations-involved-in-building-and-deploying-custom-llms.">The Mastering LLMs conference answered several key questions I had about when to fine-tune base models, building evaluation suits and when to use RAG. The sessions provided a valuable overview of the technical challenges and considerations involved in building and deploying custom LLMs.</h3>
</section>
</section>
</div>
<div class="testimonial g-col-12 g-col-md-6">
<p><img src="https://hamel.dev/blog/posts/course/ali.png" class="img-fluid"></p>
<section id="ali-principal-founder-scty" class="level2">
<h2 class="anchored" data-anchor-id="ali-principal-founder-scty"><em>Ali, Principal &amp; Founder, SCTY</em></h2>
<section id="the-course-that-became-a-conference-filled-with-a-lineup-of-renowned-practitioners-whose-expertise-and-contributions-to-the-field-was-only-exceeded-by-their-generosity-of-spirit." class="level3">
<h3 class="anchored" data-anchor-id="the-course-that-became-a-conference-filled-with-a-lineup-of-renowned-practitioners-whose-expertise-and-contributions-to-the-field-was-only-exceeded-by-their-generosity-of-spirit.">The course that became a conference, filled with a lineup of renowned practitioners whose expertise (and contributions to the field) was only exceeded by their generosity of spirit.</h3>
</section>
</section>
</div>
<div class="testimonial g-col-12 g-col-md-6">
<p><img src="https://hamel.dev/blog/posts/course/lukas.png" class="img-fluid"></p>
<section id="lukas-software-engineer" class="level2">
<h2 class="anchored" data-anchor-id="lukas-software-engineer"><em>Lukas, Software Engineer</em></h2>
<section id="the-sheer-amount-of-diverse-speakers-that-cover-the-same-topics-from-different-approaches-both-praising-andor-degrading-certain-workflows-makes-this-extremely-valuable.-especially-when-a-lot-of-information-online-is-produced-by-those-who-are-building-a-commercial-product-behind-naturally-is-biased-towards-a-fine-tune-a-rag-an-open-source-llm-an-open-ai-llm-etc.-it-is-rather-extra-ordinary-to-have-a-variety-of-opinions-packed-like-this.-thank-you" class="level3">
<h3 class="anchored" data-anchor-id="the-sheer-amount-of-diverse-speakers-that-cover-the-same-topics-from-different-approaches-both-praising-andor-degrading-certain-workflows-makes-this-extremely-valuable.-especially-when-a-lot-of-information-online-is-produced-by-those-who-are-building-a-commercial-product-behind-naturally-is-biased-towards-a-fine-tune-a-rag-an-open-source-llm-an-open-ai-llm-etc.-it-is-rather-extra-ordinary-to-have-a-variety-of-opinions-packed-like-this.-thank-you">The sheer amount of diverse speakers that cover the same topics from different approaches, both praising and/or degrading certain workflows makes this extremely valuable. Especially when a lot of information online, is produced by those, who are building a commercial product behind, naturally is biased towards a fine tune, a RAG, an open source LLM, an open ai LLM etc. It is rather extra ordinary to have a variety of opinions packed like this. Thank you!</h3>
</section>
</section>
</div>
</div>
<p><br></p>
<center>
<a href="https://parlance-labs.com/education" target="_blank">Course Website</a>
</center>
</div>
</section>
</section>
<section id="stay-connected" class="level2">
<h2 class="anchored" data-anchor-id="stay-connected">Stay Connected</h2>
<p>I’m continuously learning about LLMs, and enjoy sharing my findings and thoughts. If you’re interested in this journey, consider subscribing.</p>
<p>What to expect:</p>
<ul>
<li>Occasional emails with my latest insights on LLMs</li>
<li>Early access to new content</li>
<li>No spam, just honest thoughts and discoveries</li>
</ul>
<script async="" data-uid="6379a28bdb" src="https://hamel.ck.page/6379a28bdb/index.js"></script>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>https://maven.com/parlance-labs/fine-tuning. We had more than 2,000 students in our first cohort. The students who paid for the original course had early access to the material, office hours, generous compute credits, and a lively Discord community.↩︎</p></li>
<li id="fn2"><p>We find that instruction tuning a model to be a very useful educational experience even if you never intend to fine-tune, because it familiarizes you with topics such as (1) working with open weights models (2) generating synthetic data (3) managing prompts (4) fine-tuning (5) and generating predictions.↩︎</p></li>
<li id="fn3"><p>These testimonials are taken from https://maven.com/parlance-labs/fine-tuning.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>llms</category>
  <category>courses</category>
  <guid>https://hamel.dev/blog/posts/course/</guid>
  <pubDate>Mon, 29 Jul 2024 07:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/course/course.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>What We&#39;ve Learned From A Year of Building with LLMs</title>
  <link>https://hamel.dev/</link>
  <description>A practical guide to building successful LLM products.
</description>
  <category>LLMs</category>
  <guid>https://hamel.dev/</guid>
  <pubDate>Sat, 01 Jun 2024 07:00:00 GMT</pubDate>
</item>
<item>
  <title>Debugging AI With Adversarial Validation</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/drift/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<p>For years, I’ve relied on a straightforward method to identify sudden changes in model inputs or training data, known as “drift.” This method, Adversarial Validation<sup>1</sup>, is both simple and effective. The best part? <strong>It requires no complex tools or infrastructure.</strong></p>
<p>Examples where drift can cause bugs in your AI:</p>
<ol type="1">
<li>Your data for <a href="../../../blog/posts/evals/index.html">evaluations</a> are materially different from the inputs your model receives in production, causing your evaluations to be misleading.</li>
<li>Updates to prompts, functions, RAG, and similar elements aren’t incorporated into your fine-tuning or training data, leading to unexpected model behavior in production.</li>
</ol>
<p>No matter how careful you are, bugs can still slip through the cracks. A high-ROI activity is to routinely audit all your AI/ML projects for drift.</p>
<section id="how-it-works" class="level2">
<h2 class="anchored" data-anchor-id="how-it-works">How It Works</h2>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>Uncool Warning
</div>
</div>
<div class="callout-body-container callout-body">
<p>This method is so simple that it might seem uncool. You aren’t going to impress any data scientists. Despite this, it’s too valuable to ignore.</p>
</div>
</div>
<p>This slide from <a href="https://www.youtube.com/watch?v=GHk5HMW4XMA">my talk on MLOps tools</a> explains the technique behind Adversarial Validation<sup>2</sup>:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/drift/images/drift_tfx.png" class="img-fluid figure-img"></p>
<figcaption>Slide</figcaption>
</figure>
</div>
<p>The process is as follows:</p>
<ol type="1">
<li>Collect two datasets to compare. For example:
<ul>
<li>Training data from two different fine-tuning runs</li>
<li>Training data vs.&nbsp;evaluation data</li>
<li>Training data vs.&nbsp;production data (organized into the same format)</li>
<li>Data from two different time-periods</li>
</ul></li>
<li>Create features from the dataset. A basic example that creates features from <a href="https://platform.openai.com/tokenizer">tokens</a> is illustrated <a href="https://github.com/hamelsmu/ft-drift/blob/main/ft_drift/model.py">here</a>.<sup>3</sup></li>
<li>Give dataset #1 a label of 0 and dataset #2 a label of 1.</li>
<li>Fit a binary classifier (random forest, logistic regression, etc) to discriminate between the two datasets.</li>
<li>If the classifier demonstrates sufficient predictive power (ex: AUC &gt;=0.60), we know there is drift.</li>
<li>If you used an interpretable model (like logistic regression, random forest, etc.), you can inspect feature importance metrics to understand the root cause of the drift. If you use a more complex model (like a neural network), you can use <a href="https://shap.readthedocs.io/en/latest/">SHAP values</a> or other methods to understand what is causing the drift. I recommend starting with a simple interpretable model.</li>
</ol>
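<p>To make the process above concrete, here is a minimal sketch of adversarial validation in Python. It is illustrative rather than the post’s own code: the token-count features and logistic regression are stand-ins for whatever features and classifier you choose.</p>

```python
# Minimal adversarial-validation sketch (illustrative; the feature and
# model choices are placeholders for your own).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def drift_auc(dataset_a, dataset_b):
    """Label dataset_a as 0 and dataset_b as 1, then measure how well a
    classifier can tell them apart. AUC near 0.5 = indistinguishable;
    AUC >= 0.60 suggests drift worth investigating."""
    texts = list(dataset_a) + list(dataset_b)
    y = np.array([0] * len(dataset_a) + [1] * len(dataset_b))
    X = CountVectorizer().fit_transform(texts)  # token-count features
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()
```

<p>Because logistic regression is interpretable, you can then inspect the largest coefficients to see which tokens drive the separation between the two datasets.</p>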
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Warning
</div>
</div>
<div class="callout-body-container callout-body">
<p>If this process doesn’t detect drift, it doesn’t mean there isn’t drift. It just means that we couldn’t detect it with the model and features we used.</p>
</div>
</div>
</section>
<section id="minimal-example-ft_drift" class="level2">
<h2 class="anchored" data-anchor-id="minimal-example-ft_drift">Minimal Example: <code>ft_drift</code></h2>
<p>I work with lots of folks who are fine-tuning models using the OpenAI API. I’ve created a small CLI tool, <a href="https://github.com/hamelsmu/ft-drift/"><code>ft_drift</code></a>, that detects drift between two <a href="https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset">multi-turn chat formatted</a> jsonl files. Currently, <code>ft_drift</code> only detects drift in prompt templates, schemas and other token-based drift (as opposed to semantic drift). However, this is a good starting point to understand the general concept of adversarial validation. Here is a demo of this tool at work:</p>
<p><img src="https://hamel.dev/blog/posts/drift/images/drift_cli.gif" class="img-fluid"></p>
<p>The demo comes from a real-world example where an unintentional change in a prompt template caused unexpected behavior in a model. It shows the tool detecting a difference between two datasets, <code>file_a.jsonl</code> and <code>file_b.jsonl</code>. Afterward, a table of important tokens that account for the drift is shown, such as <code>END-UI-FORMAT</code>, <code>UI-FORMAT</code>, etc. We were able to apply the tool and quickly find the root cause of the issue. The modeling code is embarrassingly simple and located at <a href="https://github.com/hamelsmu/ft-drift/blob/main/ft_drift/model.py"><code>ft_drift/model.py</code></a>. The point is that you don’t need sophisticated techniques to get started. You can take this approach further by adding embeddings to your features to also detect semantic drift. Similarly, you could add hand-crafted features such as the number of conversation turns or the length of messages.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>I first learned of this technique in 2016 from <a href="https://fastml.com/adversarial-validation-part-two/">this blog post</a> by Zygmunt Zając. Throughout the years, I’ve seen this technique used in a variety of contexts, sometimes with different names.↩︎</p></li>
<li id="fn2"><p>This slide uses the word “skew” which is interchangeable with “drift” in this context.↩︎</p></li>
<li id="fn3"><p>For classic ML, if you are already training a model on this data, you likely have a feature engineering pipeline that you can reuse.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>LLMs</category>
  <guid>https://hamel.dev/blog/posts/drift/</guid>
  <pubDate>Fri, 12 Apr 2024 07:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/drift/images/drift_cover.png" medium="image" type="image/png" height="75" width="144"/>
</item>
<item>
  <title>Your AI Product Needs Evals</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/evals/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<section id="motivation" class="level2">
<h2 class="anchored" data-anchor-id="motivation">Motivation</h2>
<p>I started working with language models five years ago when I led the team that created <a href="https://github.com/github/CodeSearchNet">CodeSearchNet</a>, a precursor to GitHub Copilot. Since then, I’ve seen many successful and unsuccessful approaches to building LLM products. I’ve found that unsuccessful products almost always share a common root cause: <strong>a failure to create robust evaluation systems.</strong></p>
<p>I’m currently an independent consultant who helps companies build domain-specific AI products. I hope companies can save thousands of dollars in consulting fees by reading this post carefully. As much as I love making money, I hate seeing folks make the same mistake repeatedly.</p>
<p>This post outlines my thoughts on building evaluation systems for LLM-powered AI products.</p>
</section>
<section id="iterating-quickly-success" class="level1">
<h1>Iterating Quickly == Success</h1>
<p>Like software engineering, success with AI hinges on how fast you can iterate. You must have processes and tools for:</p>
<ol type="1">
<li>Evaluating quality (ex: tests).</li>
<li>Debugging issues (ex: logging &amp; inspecting data).</li>
<li>Changing the behavior or the system (prompt eng, fine-tuning, writing code)</li>
</ol>
<p><strong>Many people focus exclusively on #3 above, which prevents them from improving their LLM products beyond a demo.</strong><sup>1</sup> Doing all three activities well creates a virtuous cycle differentiating great from mediocre AI products (see the diagram below for a visualization of this cycle).</p>
<p>If you streamline your evaluation process, all other activities become easy. This is very similar to how tests in software engineering pay massive dividends in the long term despite requiring up-front investment.</p>
<p>To ground this post in a real-world situation, I’ll walk through a case study in which we built a system for rapid improvement. I’ll primarily focus on evaluation as that is the most critical component.</p>
</section>
<section id="case-study-lucy-a-real-estate-ai-assistant" class="level1">
<h1>Case Study: Lucy, A Real Estate AI Assistant</h1>
<p><a href="https://Rechat.com/">Rechat</a> is a SaaS application that allows real estate professionals to perform various tasks, such as managing contracts, searching for listings, building creative assets, managing appointments, and more. The thesis of Rechat is that you can do everything in one place rather than having to context switch between many different tools.</p>
<p><a href="https://Rechat.com/ai/">Rechat’s AI assistant, Lucy</a>, is a canonical AI product: a conversational interface that obviates the need to click, type, and navigate the software. During Lucy’s beginning stages, rapid progress was made with prompt engineering. However, as Lucy’s surface area expanded, the performance of the AI plateaued. Symptoms of this were:</p>
<ol type="1">
<li>Addressing one failure mode led to the emergence of others, resembling a game of whack-a-mole.</li>
<li>There was limited visibility into the AI system’s effectiveness across tasks beyond vibe checks.</li>
<li>Prompts expanded into long and unwieldy forms, attempting to cover numerous edge cases and examples.</li>
</ol>
<section id="problem-how-to-systematically-improve-the-ai" class="level2">
<h2 class="anchored" data-anchor-id="problem-how-to-systematically-improve-the-ai">Problem: How To Systematically Improve The AI?</h2>
<p>To break through this plateau, we created a systematic approach to improving Lucy <strong>centered on evaluation.</strong> Our approach is illustrated by the diagram below.</p>
<p><img src="https://hamel.dev/blog/posts/evals/images/diagram-cover.png" class="img-fluid"></p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>This diagram is a best-faith effort to illustrate my mental model for improving AI systems. In reality, the process is non-linear and can take on many different forms that may or may not look like this diagram.</p>
</div>
</div>
<p>I discuss the various components of this system in the context of evaluation below.</p>
</section>
</section>
<section id="the-types-of-evaluation" class="level1">
<h1>The Types Of Evaluation</h1>
<p>Rigorous and systematic evaluation is the most important part of the whole system. That is why “Eval and Curation” is highlighted in yellow at the center of the diagram. You should spend most of your time making your evaluation more robust and streamlined.</p>
<p>There are three levels of evaluation to consider:</p>
<ul>
<li>Level 1: Unit Tests</li>
<li>Level 2: Model &amp; Human Eval (this includes debugging)</li>
<li>Level 3: A/B testing</li>
</ul>
<p>The cost of Level 3 &gt; Level 2 &gt; Level 1. This dictates the cadence and manner you execute them. For example, I often run Level 1 evals on every code change, Level 2 on a set cadence and Level 3 only after significant product changes. It’s also helpful to conquer a good portion of your Level 1 tests before you move into model-based tests, as they require more work and time to execute.</p>
<p>There isn’t a strict formula as to when to introduce each level of testing. You want to balance getting user feedback quickly, managing user perception, and the goals of your AI product. This isn’t too dissimilar from the balancing act you must do for products more generally.</p>
<section id="level-1-unit-tests" class="level2">
<h2 class="anchored" data-anchor-id="level-1-unit-tests">Level 1: Unit Tests</h2>
<p>Unit tests for LLMs are assertions (like you would write in <a href="https://docs.pytest.org/en/8.0.x/">pytest</a>). Unlike typical unit tests, you want to organize these assertions for use in places beyond unit tests, such as data cleaning and automatic retries (using the assertion error to course-correct) during model inference. The important part is that these assertions should run fast and cheaply as you develop your application so that you can run them every time your code changes. If you have trouble thinking of assertions, you should critically examine your traces and failure modes. Also, do not shy away from using an LLM to help you brainstorm assertions!</p>
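<p>The “automatic retries” idea can be sketched in a few lines. This is a hedged illustration, not any particular production implementation: <code>llm_call</code> and the check functions are assumed names, and the failed assertion’s message is fed back into the prompt so the model can course-correct.</p>

```python
# Sketch: reuse eval assertions as inference-time guardrails with retries.
import re

def no_exposed_uuid(text):
    """Example assertion: the output must not leak a raw UUID."""
    pattern = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
    assert not re.search(pattern, text, re.IGNORECASE), "Exposed UUID in output"

def generate_with_retries(prompt, llm_call, checks, max_tries=3):
    """Run every check on the model output; on failure, append the
    assertion error to the prompt and try again."""
    for _ in range(max_tries):
        output = llm_call(prompt)
        try:
            for check in checks:
                check(output)
            return output
        except AssertionError as err:
            prompt = f"{prompt}\n\nThe previous attempt failed: {err}. Please fix."
    raise RuntimeError("exhausted retries without passing all checks")
```

<p>The same assertion functions can also run inside pytest and over logged traces for data cleaning, which is the payoff of keeping them reusable.</p>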
<section id="step-1-write-scoped-tests" class="level3">
<h3 class="anchored" data-anchor-id="step-1-write-scoped-tests">Step 1: Write Scoped Tests</h3>
<p>The most effective way to think about unit tests is to break down the scope of your LLM into features and scenarios. For example, one feature of Lucy is the ability to find real estate listings, which we can break down into scenarios like so:</p>
<p><strong>Feature: Listing Finder</strong></p>
<p>This feature to be tested is a function call that responds to a user request to find a real estate listing. For example, “Please find listings with more than 3 bedrooms less than $2M in San Jose, CA”</p>
<p>The LLM converts this into a query that gets run against the CRM. The assertion then verifies that the expected number of results is returned. In our test suite, we have three user inputs that trigger each of the scenarios below, which then execute corresponding assertions (this is an oversimplified example for illustrative purposes):</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Scenario</th>
<th>Assertions</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Only one listing matches user query</td>
<td>len(listing_array) == 1</td>
</tr>
<tr class="even">
<td>Multiple listings match user query</td>
<td>len(listing_array) &gt; 1</td>
</tr>
<tr class="odd">
<td>No listings match user query</td>
<td>len(listing_array) == 0</td>
</tr>
</tbody>
</table>
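<p>The scenario table above can be sketched as parametrized pytest tests. Everything here is hypothetical: <code>find_listings</code> is a stub standing in for the real pipeline that sends the user input to the LLM and runs the resulting CRM query.</p>

```python
# Hypothetical sketch of the scenario table as scoped pytest assertions.
import pytest

def find_listings(query: str) -> list:
    """Stub standing in for the real pipeline: send `query` to the LLM,
    run the generated CRM query, and return the matching listings."""
    fake_crm = {"one": [{"id": 1}], "many": [{"id": 1}, {"id": 2}], "none": []}
    key = "one" if "Main St" in query else "many" if "bedrooms" in query else "none"
    return fake_crm[key]

@pytest.mark.parametrize("query,check", [
    ("Find the listing at 123 Main St", lambda n: n == 1),        # one match
    ("Listings with 3+ bedrooms in San Jose", lambda n: n > 1),   # many matches
    ("Find castles under $100 in Antarctica", lambda n: n == 0),  # no matches
])
def test_listing_finder(query, check):
    assert check(len(find_listings(query)))
```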
<p><br> There are also generic tests that aren’t specific to any one feature. For example, here is the code for one such generic test that ensures the UUID is not mentioned in the output:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode javascript code-with-copy"><code class="sourceCode javascript"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">const</span> noExposedUUID <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> message <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">=&gt;</span> {</span>
<span id="cb1-2">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Remove all text within double curly braces</span></span>
<span id="cb1-3">  <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">const</span> sanitizedComment <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> message<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">comment</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">replace</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\{\{</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">.</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*?\}\}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/g</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>)</span>
<span id="cb1-4"></span>
<span id="cb1-5">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Search for exposed UUIDs</span></span>
<span id="cb1-6">  <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">const</span> regexp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[0-9a-f]{8}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">-</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[0-9a-f]{4}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">-</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[0-9a-f]{4}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">-</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[0-9a-f]{4}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">-</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[0-9a-f]{12}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/ig</span></span>
<span id="cb1-7">  <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">const</span> matches <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">Array</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">from</span>(sanitizedComment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matchAll</span>(regexp))</span>
<span id="cb1-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">expect</span>(matches<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Exposed UUIDs'</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">to</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">equal</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Exposed UUIDs found'</span>)</span>
<span id="cb1-9">}</span></code></pre></div></div>
<div class="acknowledgments">
<p>CRM results returned to the LLM contain fields that shouldn’t be surfaced to the user, such as the UUID associated with an entry. Our prompt instructs the LLM not to include UUIDs, and we use a simple regex to assert that its responses don’t contain any.</p>
</div>
<p><strong>Rechat has hundreds of these unit tests. We continuously update them based on new failures we observe in the data as users challenge the AI or the product evolves.</strong> These unit tests are crucial to getting feedback quickly when iterating on your AI system (prompt engineering, improving RAG, etc.). Many people eventually outgrow their unit tests and move on to other levels of evaluation as their product matures, but it is essential not to skip this step!</p>
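<p>The UUID assertion above can be ported to other stacks in a few lines. Below is a rough Python sketch of the same check, not Rechat’s actual harness; the regex and the template-placeholder sanitization mirror the JavaScript test shown above:</p>

```python
import re

# Matches any standard UUID, e.g. "123e4567-e89b-12d3-a456-426614174000".
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)

def assert_no_uuids(response: str) -> None:
    """Fail if the LLM response leaked any internal UUIDs."""
    # Strip unrendered template placeholders like {{contact.id}} first,
    # mirroring the sanitization step in the JavaScript test above.
    sanitized = re.sub(r"\{\{.*?\}\}", "", response)
    matches = UUID_RE.findall(sanitized)
    assert not matches, f"Exposed UUIDs found: {matches}"
```

<p>Assertions like this run in milliseconds, which is what makes it cheap to accumulate hundreds of them.</p>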
</section>
<section id="step-2-create-test-cases" class="level3">
<h3 class="anchored" data-anchor-id="step-2-create-test-cases">Step 2: Create Test Cases</h3>
<p>To test these assertions, you must generate test cases or inputs that trigger all the scenarios you wish to test. I often use an LLM to generate these inputs synthetically; for example, here is one such prompt Rechat uses to generate synthetic inputs for a feature that creates and retrieves contacts:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode md code-overflow-wrap code-with-copy"><code class="sourceCode markdown"><span id="cb2-1">Write 50 different instructions that a real estate agent can give to his assistant to create contacts on his CRM. The contact details can include name, phone, email, partner name, birthday, tags, company, address and job.</span>
<span id="cb2-2"></span>
<span id="cb2-3">For each of the instructions, you need to generate a second instruction which can be used to look up the created contact.</span>
<span id="cb2-4"></span>
<span id="cb2-5">The results should be a JSON code block of instruction pairs, like the following:</span>
<span id="cb2-6"></span>
<span id="cb2-7"></span>
<span id="cb2-8">[</span>
<span id="cb2-9">  ["Create a contact for John (johndoe@apple.com)", </span>
<span id="cb2-10">  "What's the email address of John Smith?"]</span>
<span id="cb2-11">]</span></code></pre></div></div>
<p>Using the above prompt, we generate test cases like below:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-overflow-wrap code-with-copy"><code class="sourceCode python"><span id="cb3-1">[ </span>
<span id="cb3-2">    [</span>
<span id="cb3-3">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Create a contact for John Smith (johndoe@apple.com) with phone number 123-456-7890 and address 123 Apple St.'</span>, </span>
<span id="cb3-4">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'What</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\'</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">s the email address of John Smith?'</span></span>
<span id="cb3-5">    ],</span>
<span id="cb3-6">    [</span>
<span id="cb3-7">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Add Emily Johnson with phone 987-654-3210, email emilyj@email.com, and company ABC Inc.'</span>, </span>
<span id="cb3-8">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'What</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\'</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">s the phone number for Emily Johnson?'</span></span>
<span id="cb3-9">    ],</span>
<span id="cb3-10">    [</span>
<span id="cb3-11">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Create a contact for Tom Williams with birthday 10/20/1985, company XYZ Ltd, and job title Manager.'</span>, </span>
<span id="cb3-12">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'What</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\'</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">s Tom Williams</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\'</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;"> job title?'</span></span>
<span id="cb3-13">    ],</span>
<span id="cb3-14">    [</span>
<span id="cb3-15">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Add a contact for Susan Brown with partner name James Brown, and email susanb@email.com.'</span>, </span>
<span id="cb3-16">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'What</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\'</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">s the partner name of Susan Brown?'</span></span>
<span id="cb3-17">    ],</span>
<span id="cb3-18">…</span>
<span id="cb3-19">]</span></code></pre></div></div>
<p>For each of these test cases, we execute the first user input to create the contact, then execute the second query to fetch it. If the CRM doesn’t return exactly one result, we know there was a problem either creating or fetching the contact. We can also run generic assertions, like the one that verifies UUIDs are absent from the response. You must constantly update these tests as you observe data through human evaluation and debugging. The key is to make them as challenging as possible while still representing real user interactions with the system.</p>
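<p>A minimal harness for these paired test cases might look like the sketch below. <code>run_assistant</code> and <code>search_crm</code> are hypothetical stand-ins for your agent entry point and CRM search API, not Rechat’s actual code:</p>

```python
def run_paired_case(create_instruction, lookup_instruction, run_assistant, search_crm):
    """Execute one [create, lookup] test pair and report whether it passed.

    `run_assistant` and `search_crm` are hypothetical stand-ins for the
    agent entry point and the CRM search API.
    """
    run_assistant(create_instruction)   # e.g. "Create a contact for John Smith ..."
    run_assistant(lookup_instruction)   # e.g. "What's the email address of John Smith?"
    results = search_crm(lookup_instruction)
    # Exactly one match means the contact was both created and found.
    if len(results) != 1:
        return False, f"expected exactly 1 CRM match, got {len(results)}"
    return True, "ok"
```

<p>Generic assertions (like the UUID check) can then be layered on top of the assistant’s responses inside the same loop.</p>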
<p>You don’t need to wait for production data to test your system. You can make educated guesses about how users will use your product and generate synthetic data. You can also let a small set of users use your product and let their usage refine your synthetic data generation strategy. One signal that you are writing good tests and assertions is when the model struggles to pass them: these failure modes become problems you can solve later with techniques like fine-tuning.</p>
<p>On a related note, unlike traditional unit tests, you don’t necessarily need a 100% pass rate. Your pass rate is a product decision, depending on the failures you are willing to tolerate.</p>
</section>
<section id="step-3-run-track-your-tests-regularly" class="level3">
<h3 class="anchored" data-anchor-id="step-3-run-track-your-tests-regularly">Step 3: Run &amp; Track Your Tests Regularly</h3>
<p>There are many ways to orchestrate Level 1 tests. Rechat has been leveraging CI infrastructure (e.g., GitHub Actions, GitLab Pipelines, etc.) to execute these tests. However, the tooling for this part of the workflow is nascent and evolving rapidly.</p>
<p>My advice is to orchestrate tests in whatever way involves the least friction in your tech stack. In addition to running tests, you need to track their results over time so you can see whether you are making progress. If you use CI, collect metrics along with versions of your tests/prompts outside your CI system for easy analysis and tracking.</p>
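<p>One low-friction way to track results outside CI is to append one row per eval run to a flat file your analytics tool can read. The schema below is illustrative, not prescriptive:</p>

```python
import csv
import datetime
import pathlib

def record_run(results: list[bool], prompt_version: str, path="eval_history.csv"):
    """Append one row per eval run so pass rates can be charted over time."""
    pass_rate = sum(results) / len(results)
    file = pathlib.Path(path)
    is_new = not file.exists()
    with file.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:  # write the header once, on first run
            writer.writerow(["timestamp", "prompt_version", "n_cases", "pass_rate"])
        writer.writerow([
            datetime.datetime.now(datetime.timezone.utc).isoformat(),
            prompt_version,
            len(results),
            round(pass_rate, 3),
        ])
    return pass_rate
```

<p>Recording the prompt version alongside the pass rate is what lets you attribute a regression to a specific change later.</p>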
<p>I recommend starting simple and leveraging your existing analytics system to visualize your test results. For example, Rechat uses Metabase to track their LLM test results over time. Below is a screenshot of a dashboard Rechat built with <a href="https://www.metabase.com/">Metabase</a>:</p>
<p><img src="https://hamel.dev/blog/posts/evals/images/metabase.png" class="img-fluid"></p>
<p>This screenshot shows the prevalence of a particular error (shown in yellow) in Lucy before (left) vs after (right) we addressed it.</p>
</section>
</section>
<section id="level-2-human-model-eval" class="level2">
<h2 class="anchored" data-anchor-id="level-2-human-model-eval">Level 2: Human &amp; Model Eval</h2>
<p>After you have built a solid foundation of Level 1 tests, you can move on to other forms of validation that cannot be tested by assertions alone. A prerequisite to performing human and model-based eval is to log your traces.</p>
<section id="logging-traces" class="level3">
<h3 class="anchored" data-anchor-id="logging-traces">Logging Traces</h3>
<p>A trace is a long-standing concept in software engineering: a log of a sequence of related events, such as a user session or a request’s flow through a distributed system. In other words, tracing is a logical grouping of logs. In the context of LLMs, a trace usually refers to a conversation with the model. For example, a user message, followed by an AI response, followed by another user message, constitutes one trace.</p>
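<p>To make the concept concrete, a trace can be as simple as a list of timestamped events sharing an id. This is a generic illustration, not the data model of any particular tracing product:</p>

```python
import uuid
import datetime

def new_trace(user_id: str) -> dict:
    """A trace groups all events of one conversation under a shared id."""
    return {"trace_id": str(uuid.uuid4()), "user_id": user_id, "events": []}

def log_event(trace: dict, role: str, content: str) -> None:
    trace["events"].append({
        "role": role,  # "user", "assistant", "tool", ...
        "content": content,
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })

# A user message, an AI response, and a follow-up all share one trace:
trace = new_trace("agent-42")
log_event(trace, "user", "Create a contact for John Smith")
log_event(trace, "assistant", "Done — contact created.")
log_event(trace, "user", "What's John's email address?")
```
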
<p>There are a growing number of solutions for logging LLM traces.<sup>2</sup> Rechat uses <a href="https://www.langchain.com/langsmith">LangSmith</a>, which logs traces and allows you to view them in a human-readable way, with an interactive playground for iterating on prompts. Sometimes, logging your traces requires you to instrument your code. In this case, Rechat was using <a href="https://www.langchain.com/">LangChain</a>, which automatically logs trace events to LangSmith for you. Here is a screenshot of what this looks like:</p>
<p><img src="https://hamel.dev/blog/posts/evals/images/langsmith.png" class="img-fluid"></p>
<p>I like LangSmith - it doesn’t require that you use LangChain and is intuitive and easy to use. Searching, filtering, and reading traces are essential features for whatever solution you pick. I’ve found that some tools do not implement these basic functions correctly!</p>
</section>
<section id="looking-at-your-traces" class="level3">
<h3 class="anchored" data-anchor-id="looking-at-your-traces">Looking At Your Traces</h3>
<p><strong>You must remove all friction from the process of looking at data.</strong> This means rendering your traces in domain-specific ways. I’ve often found that it’s <a href="https://hamel.dev/notes/llm/finetuning/04_data_cleaning.html">better to build my own data viewing &amp; labeling tool</a> so I can gather all the information I need onto one screen. In Lucy’s case, we needed to look at many sources of information (trace log, the CRM, etc) to understand what the AI did. This is precisely the type of friction that needs to be eliminated. In Rechat’s case, this meant adding information like:</p>
<ol type="1">
<li>What tool (feature) &amp; scenario was being evaluated.</li>
<li>Whether the trace resulted from a synthetic input or a real user input.</li>
<li>Filters to navigate between different tools and scenario combinations.</li>
<li>Links to the CRM and trace logging system for the current record.</li>
</ol>
<p>I’ve built different variations of this tool for each problem I’ve worked on. Sometimes, I even need to embed another application to see what the user interaction looks like. Below is a screenshot of the tool we built to evaluate Rechat’s traces:</p>
<p><img src="https://hamel.dev/blog/posts/evals/images/langfree.png" class="img-fluid"></p>
<p>Another design choice specific to Lucy is that we noticed that many failures involved small mistakes in the final output of the LLM (format, content, etc). We decided to make the final output editable by a human so that we could curate &amp; fix data for fine-tuning.</p>
<p>These tools can be built with lightweight front-end frameworks like Gradio, Streamlit, Panel, or Shiny in less than a day. The tool shown above was built with Shiny for Python. Furthermore, there are tools like <a href="https://www.lilacml.com/">Lilac</a>, which uses AI to search and filter data semantically; this is incredibly handy for finding a set of similar data points while debugging an issue.</p>
<p>I often start by labeling examples as good or bad. I’ve found that assigning scores or more granular ratings is more onerous to manage than binary ratings. There are advanced techniques you can use to make human evaluation more efficient or accurate (e.g., <a href="https://en.wikipedia.org/wiki/Active_learning_(machine_learning)">active learning</a>, <a href="https://supervisely.com/blog/labeling-consensus/">consensus voting</a>, etc.), but I recommend starting with something simple. Finally, like unit tests, you should organize and analyze your human-eval results to assess if you are progressing over time.</p>
<p>As discussed later, these labeled examples measure the quality of your system, validate automated evaluation, and curate high-quality synthetic data for fine-tuning.</p>
<section id="how-much-data-should-you-look-at" class="level4">
<h4 class="anchored" data-anchor-id="how-much-data-should-you-look-at">How much data should you look at?</h4>
<p>I often get asked how much data to examine. When starting, you should examine as much data as possible. I usually read traces generated from ALL test cases and user-generated traces at a minimum. <strong>You can never stop looking at data—no free lunch exists.</strong> However, you can sample your data more over time, lessening the burden.<sup>3</sup></p>
</section>
</section>
<section id="automated-evaluation-w-llms" class="level3">
<h3 class="anchored" data-anchor-id="automated-evaluation-w-llms">Automated Evaluation w/ LLMs</h3>
<p>Many vendors want to sell you tools that claim to eliminate the need for a human to look at the data. Despite those claims, having humans periodically evaluate at least a sample of traces remains a good idea. I often find that “correctness” is somewhat subjective, and you must align the model with a human.</p>
<p>You should track the correlation between model-based and human evaluation to decide how much you can rely on automatic evaluation. Furthermore, by collecting critiques from labelers explaining why they are making a decision, you can iterate on the evaluator model to align it with humans through prompt engineering or fine-tuning. Of the two, I tend to favor prompt engineering for evaluator alignment.</p>
<p>I love using low-tech solutions like Excel to iterate on aligning model-based eval with humans. For example, I sent my colleague Phillip the following spreadsheet every few days to grade for a different use-case involving a <a href="https://www.honeycomb.io/blog/introducing-query-assistant">natural language query generator</a>. This spreadsheet would contain the following information:</p>
<ol type="1">
<li><strong>model response</strong>: this is the prediction made by the LLM.</li>
<li><strong>model critique</strong>: this is a critique written by a (usually more powerful) LLM about your original LLM’s prediction.</li>
<li><strong>model outcome</strong>: this is a binary label the critique model assigns to the <code>model response</code> as being “good” or “bad.”</li>
</ol>
<p>Phillip then fills out his version of the same information - meaning his critique, outcome, and desired response for 25-50 examples at a time (these are the columns prefixed with “phillip_” below):</p>
<p><img src="https://hamel.dev/blog/posts/evals/images/spreadsheet.png" class="img-fluid"></p>
<p>This information allowed me to iterate on the prompt of the critique model to make it sufficiently aligned with Phillip over time. This is also easy to track in a low-tech way in a spreadsheet:</p>
<p><img src="https://hamel.dev/blog/posts/evals/images/score.png" class="img-fluid"></p>
<p>This is a screenshot of a spreadsheet where we recorded our attempts to align model-based eval with a human evaluator.</p>
<p>General tips on model-based eval:</p>
<ul>
<li>Use the most powerful model you can afford. It often takes advanced reasoning capabilities to critique something well. You can often get away with a slower, more powerful model for critiquing outputs relative to what you use in production.</li>
<li>Model-based evaluation is a meta-problem within your larger problem. You must maintain a mini-evaluation system to track its quality. I have sometimes fine-tuned a model at this stage (but I try not to).</li>
<li>After bringing the model-based evaluator in line with the human, you must continue doing periodic exercises to monitor the model and human agreement.</li>
</ul>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>Important Note on Using Agreement as a Metric
</div>
</div>
<div class="callout-body-container callout-body">
<p>In this example, we used agreement between the model and human evaluator because our dataset was roughly balanced (about 50% of instances were failures). However, using raw agreement is generally not recommended and can be misleading when classes are imbalanced. Instead, you should typically measure <a href="https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall">precision and recall</a> separately to get a more accurate picture of your judge’s alignment.</p>
</div>
</div>
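<p>To make this concrete, here is a small sketch that computes agreement alongside precision and recall from paired human/judge labels, treating “bad” (a caught failure) as the positive class, since catching failures is the judge’s job:</p>

```python
def judge_alignment(human: list[str], model: list[str]) -> dict:
    """Compare an LLM judge's good/bad labels against human labels.

    "bad" is the positive class: a true positive is a failure that both
    the human and the judge flagged.
    """
    assert len(human) == len(model)
    tp = sum(h == m == "bad" for h, m in zip(human, model))
    fp = sum(h == "good" and m == "bad" for h, m in zip(human, model))
    fn = sum(h == "bad" and m == "good" for h, m in zip(human, model))
    agreement = sum(h == m for h, m in zip(human, model)) / len(human)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of flagged, how many were real failures
    recall = tp / (tp + fn) if tp + fn else 0.0     # of real failures, how many were flagged
    return {"agreement": agreement, "precision": precision, "recall": recall}
```

<p>With imbalanced data, a judge that labels everything “good” can score high agreement while having zero recall, which is exactly the failure mode the callout warns about.</p>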
<p>My favorite aspect about creating a good evaluator model is that its critiques can be used to curate high-quality synthetic data, which I will touch upon later.</p>
</section>
</section>
<section id="level-3-ab-testing" class="level2">
<h2 class="anchored" data-anchor-id="level-3-ab-testing">Level 3: A/B Testing</h2>
<p>Finally, it is always good to perform A/B tests to ensure your AI product is driving the user behaviors or outcomes you desire. A/B testing for LLM products isn’t too different from A/B testing for other types of products. If you want to learn more about A/B testing, I recommend reading the <a href="https://www.geteppo.com/blog">Eppo blog</a> (which was created by former colleagues of mine who are rock stars in A/B testing).</p>
<p>It’s okay to put this stage off until you are confident that your AI product is ready to show to real users. This level of evaluation is usually only appropriate for more mature products.</p>
</section>
<section id="evaluating-rag" class="level2">
<h2 class="anchored" data-anchor-id="evaluating-rag">Evaluating RAG</h2>
<p>Aside from evaluating your system as a whole, you can evaluate sub-components of your AI, like RAG. Evaluating RAG is beyond the scope of this post, but you can learn more about this subject <a href="https://jxnl.github.io/blog/writing/2024/02/28/levels-of-complexity-rag-applications/">in a post by Jason Liu</a>.</p>
</section>
</section>
<section id="eval-systems-unlock-superpowers-for-free" class="level1">
<h1>Eval Systems Unlock Superpowers For Free</h1>
<p>In addition to iterating fast, eval systems unlock the ability to fine-tune and debug, which can take your AI product to the next level.</p>
<section id="fine-tuning" class="level2">
<h2 class="anchored" data-anchor-id="fine-tuning">Fine-Tuning</h2>
<p>Rechat resolved many failure modes through fine-tuning that were not possible with prompt engineering alone. <strong>Fine-tuning is best for learning syntax, style, and rules, whereas techniques like RAG supply the model with context or up-to-date facts.</strong></p>
<p>99% of the labor involved with fine-tuning is assembling high-quality data that covers your AI product’s surface area. However, if you have a solid evaluation system like Rechat’s, you already have a robust data generation and curation engine! I will expand more on the process of fine-tuning in a future post.<sup>4</sup></p>
<section id="data-synthesis-curation" class="level3">
<h3 class="anchored" data-anchor-id="data-synthesis-curation">Data Synthesis &amp; Curation</h3>
<p>To illustrate why data curation and synthesis come nearly for free once you have an evaluation system, consider the case where you want to create additional fine-tuning data for the listing finder mentioned earlier. First, you can use LLMs to generate synthetic data with a prompt like this:</p>
<pre class="text code-overflow-wrap"><code>Imagine if Zillow was able to parse natural language. Come up with 50 different ways users would be able to search listings there. Use real names for cities and neighborhoods.

You can use the following parameters:

&lt;omitted for confidentiality&gt;

Output should be a JSON code block array. Example:

[
"Homes under $500k in New York"
]</code></pre>
<p>This is almost identical to the exercise of producing test cases! You can then use your Level 1 &amp; Level 2 tests to filter out undesirable data that fails assertions or that the critique model thinks is wrong. You can also use your existing human evaluation tools to inspect and curate traces for a fine-tuning dataset.</p>
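<p>The filtering step can be sketched as a simple pipeline. Here <code>assertions</code> are your Level 1 checks and <code>judge</code> is a hypothetical stand-in for a call to your Level 2 critique model:</p>

```python
def curate(candidates, assertions, judge):
    """Keep only synthetic examples that pass every Level 1 assertion
    and that the Level 2 critique model labels "good".

    `assertions` is a list of predicates; `judge` is a hypothetical
    stand-in for an LLM critique call returning "good" or "bad".
    """
    kept = []
    for example in candidates:
        if all(check(example) for check in assertions) and judge(example) == "good":
            kept.append(example)
    return kept
```

<p>Because both filters already exist in your eval system, the fine-tuning dataset is a byproduct rather than a separate project.</p>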
</section>
</section>
<section id="debugging" class="level2">
<h2 class="anchored" data-anchor-id="debugging">Debugging</h2>
<p>When you get a complaint or see an error related to your AI product, you should be able to debug this quickly. If you have a robust evaluation system, you already have:</p>
<ul>
<li>A database of traces that you can search and filter.</li>
<li>A set of mechanisms (assertions, tests, etc) that can help you flag errors and bad behaviors.</li>
<li>Log searching &amp; navigation tools that can help you find the root cause of the error. For example, the error could be RAG, a bug in the code, or a model performing poorly.</li>
<li>The ability to make changes in response to the error and quickly test its efficacy.</li>
</ul>
<p>In short, there is an incredibly large overlap between the infrastructure needed for evaluation and that for debugging.</p>
</section>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>Evaluation systems create a flywheel that allows you to iterate very quickly. Evaluation is almost always where people get stuck when building AI products. I hope this post gives you an intuition for how to build your own evaluation systems. Some key takeaways to keep in mind:</p>
<ul>
<li>Remove ALL friction from looking at data.</li>
<li>Keep it simple. Don’t buy fancy LLM tools. Use what you have first.</li>
<li>You are doing it wrong if you aren’t looking at lots of data.</li>
<li>Don’t rely on generic evaluation frameworks to measure the quality of your AI. Instead, create an evaluation system specific to your problem.</li>
<li>Write lots of tests and frequently update them.</li>
<li>LLMs can be used to unblock the creation of an eval system. Examples include using an LLM to:
<ul>
<li>Generate test cases and write assertions</li>
<li>Generate synthetic data</li>
<li>Critique and label data, etc.</li>
</ul></li>
<li>Re-use your eval infrastructure for debugging and fine-tuning.</li>
</ul>
<p>I’d love to hear from you if you found this post helpful or have any questions. My email is <code>hamel@parlance-labs.com</code>.</p>
<p><br></p>
<div class="acknowledgments">
<p><em>This article is an adaptation of <a href="https://www.youtube.com/watch?v=B_DMMlDuJB0">this conversation</a> I had with Emil Sedgh and Hugo Browne-Anderson on the <a href="https://vanishinggradients.fireside.fm/">Vanishing Gradients podcast</a>. Thanks to Jeremy Howard, Eugene Yan, Shreya Shankar, Jeremy Lewi, and Joseph Gleasure for reviewing this article.</em></p>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>This is not to suggest that people are lazy. Many don’t know how to set up eval systems and skip these steps.↩︎</p></li>
<li id="fn2"><p>Some examples include <a href="https://arize.com/">Arize</a>, <a href="https://humanloop.com/">Humanloop</a>, <a href="https://github.com/traceloop/openllmetry">OpenLLMetry</a> and <a href="https://www.honeyhive.ai/">HoneyHive</a>.↩︎</p></li>
<li id="fn3"><p>A reasonable heuristic is to keep reading logs until you feel like you aren’t learning anything new.↩︎</p></li>
<li id="fn4"><p>If you cannot wait, I’ll be teaching <a href="https://maven.com/parlance-labs/fine-tuning">this course</a> on fine-tuning soon.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>LLMs</category>
  <category>evals</category>
  <guid>https://hamel.dev/blog/posts/evals/</guid>
  <pubDate>Fri, 29 Mar 2024 07:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/evals/images/diagram-cover.png" medium="image" type="image/png" height="77" width="144"/>
</item>
<item>
  <title>Is Fine-Tuning Still Valuable?</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/fine_tuning_valuable.html</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<p>Here is my personal opinion about the questions I posed in <a href="https://x.com/HamelHusain/status/1772426234032541962?s=20">this tweet</a>:</p>
<blockquote class="twitter-tweet blockquote">
<p lang="en" dir="ltr">
There are a growing number of voices expressing disillusionment with fine-tuning. <br><br>I'm curious about the sentiment more generally. (I am withholding sharing my opinion rn). <br><br>Tweets below are from <a href="https://twitter.com/mlpowered?ref_src=twsrc%5Etfw"><span class="citation" data-cites="mlpowered">@mlpowered</span></a> <a href="https://twitter.com/abacaj?ref_src=twsrc%5Etfw"><span class="citation" data-cites="abacaj">@abacaj</span></a> <a href="https://twitter.com/emollick?ref_src=twsrc%5Etfw"><span class="citation" data-cites="emollick">@emollick</span></a> <a href="https://t.co/cU0hCdubBU">pic.twitter.com/cU0hCdubBU</a>
</p>
— Hamel Husain (<span class="citation" data-cites="HamelHusain">@HamelHusain</span>) <a href="https://twitter.com/HamelHusain/status/1772426234032541962?ref_src=twsrc%5Etfw">March 26, 2024</a>
</blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<hr>
<p>I think that fine-tuning is still very valuable in many situations. I’ve done some more digging, and I find that people who say fine-tuning isn’t useful are often working on products where it indeed isn’t likely to help:</p>
<ul>
<li>They are making developer tools - foundation models have been trained extensively on coding tasks.</li>
<li>They are building foundation models and testing for the most general cases. But the foundation models themselves are also being trained for the most general cases.</li>
<li>They are building a personal assistant that isn’t scoped to any particular domain or use case, which puts them in essentially the same position as the folks building foundation models.</li>
</ul>
<p>Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness.</p>
<p><strong>It’s impossible to fine-tune effectively without an eval system, which can lead to writing off fine-tuning if you haven’t completed this prerequisite.</strong> It’s also impossible to improve your product in the long term without a good eval system, fine-tuning or not.</p>
<p>You should do as much prompt engineering as possible before you fine-tune, but not for the reasons you might think! The reason for doing lots of prompt engineering is that it’s a great way to stress-test your eval system.</p>
<p>If you find that prompt-engineering works fine (and you are systematically evaluating your product) then it’s fine to stop there. I’m a big believer in using the simplest approach to solving a problem. I just don’t think you should write off fine-tuning yet.</p>
<section id="examples-where-ive-seen-fine-tuning-work-well" class="level3">
<h3 class="anchored" data-anchor-id="examples-where-ive-seen-fine-tuning-work-well">Examples where I’ve seen fine-tuning work well</h3>
<p>Generally speaking, fine-tuning works best to learn syntax, style and rules whereas techniques like RAG work best to supply the model with context or up-to-date facts.</p>
<p><em>These are some examples from companies I’ve worked with. Hopefully, we will be able to share more details soon.</em></p>
<ul>
<li><p><a href="https://www.honeycomb.io/blog/introducing-query-assistant">Honeycomb’s Natural Language Query Assistant</a> - previously, the “programming manual” for the Honeycomb query language was being dumped into the prompt along with many examples. While this was OK, fine-tuning worked much better to allow the model to learn the syntax and rules of this niche domain-specific language.</p></li>
<li><p><a href="https://www.youtube.com/watch?v=B_DMMlDuJB0">ReChat’s Lucy</a> - this is an AI real estate assistant integrated into an existing Real Estate CRM system. ReChat needs LLM responses to be provided in a very idiosyncratic format that weaves together structured and unstructured data to allow the front end to render widgets, cards and other interactive elements dynamically into the chat interface. Fine-tuning was the key to making this work correctly. <a href="https://www.youtube.com/watch?v=B_DMMlDuJB0">This talk</a> has more details.</p></li>
</ul>
<p>P.S. Fine-tuning is not limited to open or “small” models. Plenty of folks have been fine-tuning GPT-3.5, such as <a href="https://x.com/perplexity_ai/status/1695102998463009254?s=20">Perplexity.AI</a> and <a href="https://casetext.com/blog/cocounsel-harnesses-gpt-4s-power-to-deliver-results-that-legal-professionals-can-rely-on/">CaseText</a>, to name a few.</p>
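<p>To make the data side concrete, here is a minimal sketch of what chat-format fine-tuning training data can look like as JSONL. The record shape (<code>{"messages": [...]}</code>) follows the convention used by hosted fine-tuning APIs, and the query-DSL example below is entirely made up for illustration — it is not taken from Honeycomb, ReChat, or any other company mentioned above:</p>

```python
import json

# Illustrative training examples: each record is one conversation showing
# the model the exact syntax/style we want it to learn. The DSL here is
# hypothetical.
examples = [
    {"messages": [
        {"role": "system", "content": "Translate the user's question into our internal query DSL."},
        {"role": "user", "content": "slowest endpoints in the last hour"},
        {"role": "assistant", "content": "MAX(duration_ms) GROUP BY endpoint SINCE 1h"},
    ]},
]

# Write one JSON object per line (the JSONL format most fine-tuning
# endpoints expect for uploads).
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Sanity-check: the file round-trips as one JSON object per line.
with open("train.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(len(rows))  # prints 1
```

<p>The hard part is not this file format — it’s curating examples that actually capture the syntax and style you want, which is exactly where the eval system discussed above earns its keep.</p>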


</section>

 ]]></description>
  <guid>https://hamel.dev/blog/posts/fine_tuning_valuable.html</guid>
  <pubDate>Wed, 27 Mar 2024 07:00:00 GMT</pubDate>
</item>
<item>
  <title>Fuck You, Show Me The Prompt.</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/prompt/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<section id="background" class="level2">
<h2 class="anchored" data-anchor-id="background">Background</h2>
<p>There are many libraries that aim to make the output of your LLMs better by <strong>re-writing or constructing the prompt for you</strong>. These libraries purport to make the output of your LLMs:</p>
<ul>
<li>safer <a href="https://github.com/guardrails-ai/guardrails">(ex: guardrails)</a></li>
<li>deterministic <a href="https://github.com/guidance-ai/guidance">(ex: guidance)</a></li>
<li>structured <a href="https://github.com/jxnl/instructor">(ex: instructor)</a></li>
<li>resilient <a href="https://www.langchain.com/">(ex: langchain)</a></li>
<li>… or even optimized for an arbitrary metric <a href="https://github.com/stanfordnlp/dspy">(ex: DSPy)</a>.</li>
</ul>
<p>A common theme among <em>some</em> of these tools is that they encourage users to disintermediate themselves from prompting.</p>
<blockquote class="blockquote">
<p><a href="https://github.com/stanfordnlp/dspy">DSPy</a>: “This is a new paradigm in which LMs and their prompts fade into the background …. you can compile your program again DSPy will create new effective prompts”</p>
</blockquote>
<blockquote class="blockquote">
<p><a href="https://github.com/guidance-ai/guidance">guidance</a> “guidance is a programming paradigm that offers superior control and efficiency compared to conventional prompting …”</p>
</blockquote>
<p>Even when tools don’t discourage prompting, I’ve often found it difficult to retrieve the final prompt(s) these tools send to the language model. <strong>The prompts these tools send to the LLM are a natural language description of what the tools are doing, and are the fastest way to understand how they work.</strong> Furthermore, some tools have <a href="https://github.com/stanfordnlp/dspy?tab=readme-ov-file#4-two-powerful-concepts-signatures--teleprompters">dense terminology</a> to describe internal constructs, which can further obfuscate what they are doing.</p>
<p>For reasons I’ll explain below, I think most people would benefit from the following mindset:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/prompt/slap_3.jpeg" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>In this blog post, I’ll show you how you can <strong>intercept API calls w/prompts for any tool, without having to fumble through docs or read source code.</strong> I’ll show you how to set up and operate <a href="https://mitmproxy.org/">mitmproxy</a> with examples from the LLM tools I previously mentioned.</p>
</section>
<section id="motivation-minimize-accidental-complexity" class="level2">
<h2 class="anchored" data-anchor-id="motivation-minimize-accidental-complexity">Motivation: Minimize accidental complexity</h2>
<p>Before adopting an abstraction, it’s important to consider the dangers of taking on <a href="https://dev.to/alexbunardzic/software-complexity-essential-accidental-and-incidental-3i4d">accidental complexity</a>. This danger is especially acute for LLM abstractions relative to programming abstractions. With LLM abstractions, we often force the user to regress towards writing code instead of conversing with the AI in natural language, which can run counter to the purpose of LLMs:</p>
<center>
<blockquote class="twitter-tweet blockquote">
<p lang="en" dir="ltr">
Programming abstraction -&gt; a human-like language you can use to translate your task into machine code<br><br>LLM abstraction -&gt; an unintelligible framework you can use to translate your task into human language
</p>
— Hamel Husain (<span class="citation" data-cites="HamelHusain">@HamelHusain</span>) <a href="https://twitter.com/HamelHusain/status/1754315254413361553">February 5, 2024</a>
</blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</center>
<p>While this is a cheeky comment, it’s worth keeping in mind when evaluating tools. There are two primary types of automation that tools provide:</p>
<ul>
<li><strong>Interleaving code and LLMs:</strong> Expressing this automation is often best done through code, since code must be run to carry out the task. Examples include routing, executing functions, retries, chaining, etc.</li>
<li><strong>Re-Writing and constructing prompts</strong>: Expressing your intent is often best done through natural language. However, there are exceptions! For example, it is convenient to express a function definition or schema from code instead of natural language.</li>
</ul>
<p>Many frameworks offer both types of automation. However, going too far with the second type can have negative consequences. Seeing the prompt allows you to decide:</p>
<ol type="1">
<li>Is this framework really necessary?</li>
<li>Should I just steal the final prompt (a string) and jettison the framework?</li>
<li>Can we write a better prompt than this (shorter, aligned with your intent, etc)?</li>
<li>Is this the best approach (does the number of API calls seem appropriate)?</li>
</ol>
<p>In my experience, seeing the prompts and API calls is essential to making informed decisions.</p>
</section>
<section id="intercepting-llm-api-calls" class="level2">
<h2 class="anchored" data-anchor-id="intercepting-llm-api-calls">Intercepting LLM API calls</h2>
<p>There are many possible ways to intercept LLM API calls, such as monkey patching source code or finding a user-facing option. I’ve found that those approaches take far too much time since the quality of source code and documentation can vary greatly. After all, I just want to see API calls without worrying about how the code works!</p>
<p>A framework-agnostic way to see API calls is to set up a proxy that logs your outgoing API requests. This is easy to do with <a href="https://mitmproxy.org/">mitmproxy</a>, a free, open-source HTTPS proxy.</p>
<section id="setting-up-mitmproxy" class="level3">
<h3 class="anchored" data-anchor-id="setting-up-mitmproxy">Setting Up mitmproxy</h3>
<p>This is an opinionated way to set up <code>mitmproxy</code> that’s beginner-friendly for our intended purposes:</p>
<ol type="1">
<li><p>Follow the installation instructions <a href="https://mitmproxy.org/">on the website</a></p></li>
<li><p>Start the interactive UI by running <code>mitmweb</code> in the terminal. Pay attention to the url of the interactive UI in the logs which will look something like this: <code>Web server listening at http://127.0.0.1:8081/</code></p></li>
<li><p>Next, you need to configure your device (i.e.&nbsp;your laptop) to route all traffic through <code>mitmproxy</code>, which listens on <code>http://localhost:8080</code>. Per the documentation:</p>
<blockquote class="blockquote">
<p>We recommend to simply search the web on how to configure an HTTP proxy for your system. Some operating system have a global settings, some browser have their own, other applications use environment variables, etc.</p>
</blockquote>
<p>In my case, a <a href="https://www.google.com/search?q=set+proxy+for+macos&amp;sca_esv=c51a80de1a7d45f0&amp;rlz=1C5CHFA_enUS1048US1049&amp;sxsrf=ACQVn0_ysjr6Kma2_lX8WbB06iPbDi5gUQ%3A1707764982232&amp;ei=9mzKZYXoDcfy0PEPpJqb2Ao&amp;ved=0ahUKEwiFu4CpwKaEAxVHOTQIHSTNBqsQ4dUDCBA&amp;uact=5&amp;oq=set+proxy+for+macos&amp;gs_lp=Egxnd3Mtd2l6LXNlcnAiE3NldCBwcm94eSBmb3IgbWFjb3MyBBAjGCcyBhAAGBYYHjIGEAAYFhgeMgYQABgWGB4yBhAAGBYYHjILEAAYgAQYigUYhgMyCxAAGIAEGIoFGIYDSMk-UMU7WMU7cAd4AZABAJgBVaABVaoBATG4AQPIAQD4AQHCAgoQABhHGNYEGLAD4gMEGAAgQYgGAZAGCA&amp;sclient=gws-wiz-serp">Google search for “set proxy for macos”</a> returned these results:</p>
<blockquote class="blockquote">
<p>choose Apple menu &gt; System Settings, click Network in the sidebar, click a network service on the right, click Details, then click Proxies.</p>
</blockquote>
<p>I then insert <code>localhost</code> and <code>8080</code> in the following places in the UI:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/prompt/mac.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div></li>
<li><p>Next, navigate to <a href="http://mitm.it">http://mitm.it</a> and it will give you instructions on how to install the mitmproxy Certificate Authority (CA), which you will need for intercepting HTTPS requests. (You can also do this manually <a href="https://docs.mitmproxy.org/stable/concepts-certificates/#quick-setup">here</a>.) Also, take note of the location of the CA file as we will reference it later.</p></li>
<li><p>You can test that everything works by browsing to a website like <a href="https://mitmproxy.org/">https://mitmproxy.org/</a>, and seeing the corresponding output in the mitmweb UI, which for me is located at <a href="http://127.0.0.1:8081/">http://127.0.0.1:8081/</a> (look at the logs in your terminal to get the URL).</p></li>
<li><p>Now that you’ve set everything up, you can disable the proxy that you previously enabled on your network. I do this on my Mac by toggling the proxy buttons in the screenshot shown above. This is because we want to scope the proxy to only the Python program, to eliminate unnecessary noise.</p></li>
</ol>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>Networking related software commonly allows you to proxy outgoing requests by setting environment variables. This is the approach we will use to scope our proxy to specific Python programs. However, I encourage you to play with other types of programs to see what you find after you are comfortable!</p>
</div>
</div>
</section>
<section id="environment-variables-for-python" class="level3">
<h3 class="anchored" data-anchor-id="environment-variables-for-python">Environment variables for Python</h3>
<p>We need to set the following environment variables so that the <code>requests</code> and <code>httpx</code> libraries will direct traffic to the proxy and reference the CA file for HTTPS traffic:</p>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Important
</div>
</div>
<div class="callout-body-container callout-body">
<p>Make sure you set these environment variables before running any of the code snippets in this blog post.</p>
</div>
</div>
<div id="07c4617c-1d52-4c99-9bed-645efd73ba71" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os</span>
<span id="cb1-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The location of my CA File</span></span>
<span id="cb1-3">cert_file <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/Users/hamel/Downloads/mitmproxy-ca-cert.pem'</span> </span>
<span id="cb1-4">os.environ[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'REQUESTS_CA_BUNDLE'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> cert_file</span>
<span id="cb1-5">os.environ[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'SSL_CERT_FILE'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> cert_file</span>
<span id="cb1-6">os.environ[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'HTTPS_PROXY'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'http://127.0.0.1:8080'</span></span></code></pre></div></div>
</div>
<p>You can do a minimal test by running the following code:</p>
<div id="faf4a70a-bde5-47c0-b649-3179c233c189" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> requests</span>
<span id="cb2-2">requests.post(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'https://httpbin.org/post'</span>, </span>
<span id="cb2-3">              data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'key'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'value'</span>})</span></code></pre></div></div>
<div class="cell-output cell-output-display">
<pre><code>&lt;Response [200]&gt;</code></pre>
</div>
</div>
<p>This will appear in the UI like so:</p>
<p><img src="https://hamel.dev/blog/posts/prompt/mitm_01.png" class="img-fluid"></p>
</section>
</section>
<section id="examples" class="level2">
<h2 class="anchored" data-anchor-id="examples">Examples</h2>
<p>Now for the fun part, let’s run through some examples of LLM libraries and intercept their API calls!</p>
<section id="guardrails" class="level3">
<h3 class="anchored" data-anchor-id="guardrails">Guardrails</h3>
<p>Guardrails allows you to specify structure and types, which it uses to validate and correct the outputs of large language models. This is a hello-world example from the <a href="https://github.com/guardrails-ai/guardrails"><code>guardrails-ai/guardrails</code> README</a>:</p>
<div id="c746dbe7-e96f-4ea0-89d5-ca9b0b0685e8" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pydantic <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> BaseModel, Field</span>
<span id="cb4-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> guardrails <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Guard</span>
<span id="cb4-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> openai</span>
<span id="cb4-4"></span>
<span id="cb4-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> Pet(BaseModel):</span>
<span id="cb4-6">    pet_type: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Field(description<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Species of pet"</span>)</span>
<span id="cb4-7">    name: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Field(description<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"a unique pet name"</span>)</span>
<span id="cb4-8"></span>
<span id="cb4-9">prompt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb4-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    What kind of pet should I get and what should I name it?</span></span>
<span id="cb4-11"></span>
<span id="cb4-12"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    $</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{gr.complete_json_suffix_v2}</span></span>
<span id="cb4-13"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb4-14">guard <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Guard.from_pydantic(output_class<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>Pet, prompt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>prompt)</span>
<span id="cb4-15"></span>
<span id="cb4-16">validated_output, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>rest <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> guard(</span>
<span id="cb4-17">    llm_api<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>openai.completions.create,</span>
<span id="cb4-18">    engine<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-3.5-turbo-instruct"</span></span>
<span id="cb4-19">)</span>
<span id="cb4-20"></span>
<span id="cb4-21"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>validated_output<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>{
    "pet_type": "dog",
    "name": "Buddy</code></pre>
</div>
</div>
<p>What is happening here? How is this structured output and validation working? Looking at the mitmproxy UI, I can see that the above code resulted in two LLM API calls, the first one with this prompt:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode txt code-overflow-wrap code-with-copy"><code class="sourceCode default"><span id="cb6-1">What kind of pet should I get and what should I name it?</span>
<span id="cb6-2"></span>
<span id="cb6-3">    </span>
<span id="cb6-4">Given below is XML that describes the information to extract from this document and the tags to extract it into.</span>
<span id="cb6-5"></span>
<span id="cb6-6">&lt;output&gt;</span>
<span id="cb6-7">    &lt;string name="pet_type" description="Species of pet"/&gt;</span>
<span id="cb6-8">    &lt;string name="name" description="a unique pet name"/&gt;</span>
<span id="cb6-9">&lt;/output&gt;</span>
<span id="cb6-10"></span>
<span id="cb6-11"></span>
<span id="cb6-12">ONLY return a valid JSON object (no other text is necessary), where the key of the field in JSON is the `name` attribute of the corresponding XML, and the value is of the type specified by the corresponding XML's tag. The JSON MUST conform to the XML format, including any types and format requests e.g. requests for lists, objects and specific types. Be correct and concise.</span>
<span id="cb6-13"></span>
<span id="cb6-14">Here are examples of simple (XML, JSON) pairs that show the expected behavior:</span>
<span id="cb6-15">- `&lt;string name='foo' format='two-words lower-case' /&gt;` =&gt; `{'foo': 'example one'}`</span>
<span id="cb6-16">- `&lt;list name='bar'&gt;&lt;string format='upper-case' /&gt;&lt;/list&gt;` =&gt; `{"bar": ['STRING ONE', 'STRING TWO', etc.]}`</span>
<span id="cb6-17">- `&lt;object name='baz'&gt;&lt;string name="foo" format="capitalize two-words" /&gt;&lt;integer name="index" format="1-indexed" /&gt;&lt;/object&gt;` =&gt; `{'baz': {'foo': 'Some String', 'index': 1}}`</span></code></pre></div></div>
<p><strong>Followed by another call with this prompt:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode txt code-overflow-wrap code-with-copy"><code class="sourceCode default"><span id="cb7-1">I was given the following response, which was not parseable as JSON.</span>
<span id="cb7-2"></span>
<span id="cb7-3">"{\n    \"pet_type\": \"dog\",\n    \"name\": \"Buddy"</span>
<span id="cb7-4"></span>
<span id="cb7-5">Help me correct this by making it valid JSON.</span>
<span id="cb7-6"></span>
<span id="cb7-7">Given below is XML that describes the information to extract from this document and the tags to extract it into.</span>
<span id="cb7-8"></span>
<span id="cb7-9">&lt;output&gt;</span>
<span id="cb7-10">    &lt;string name="pet_type" description="Species of pet"/&gt;</span>
<span id="cb7-11">    &lt;string name="name" description="a unique pet name"/&gt;</span>
<span id="cb7-12">&lt;/output&gt;</span>
<span id="cb7-13"></span>
<span id="cb7-14"></span>
<span id="cb7-15">ONLY return a valid JSON object (no other text is necessary), where the key of the field in JSON is the `name` attribute of the corresponding XML, and the value is of the type specified by the corresponding XML's tag. The JSON MUST conform to the XML format, including any types and format requests e.g. requests for lists, objects and specific types. Be correct and concise. If you are unsure anywhere, enter `null`.</span></code></pre></div></div>
<p>Woof. That’s a whole lot of ceremony to get structured output! We learned that this library’s approach to structured output uses XML schemas (while others use function calling). It’s worth considering whether you can fashion a better or simpler approach now that the magic has been lifted. Either way, we now have insight into how the tool works without being dragged into unnecessary complexity, which is a win.</p>
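<p>If you do want to try a simpler approach, the core loop is small enough to write yourself: ask for JSON in the prompt, parse and validate locally, and re-prompt only on failure. The sketch below is illustrative — <code>get_structured_output</code> and its <code>call_llm</code> parameter are hypothetical names, and a stub stands in for the real API call:</p>

```python
import json

def get_structured_output(call_llm, prompt, required_keys, max_retries=1):
    """Ask an LLM for JSON and validate it locally, retrying on failure.

    `call_llm` is a placeholder for whatever function sends a prompt to
    your model and returns its text (a hypothetical interface).
    """
    text = call_llm(prompt)
    for _ in range(max_retries + 1):
        try:
            data = json.loads(text)
            missing = [k for k in required_keys if k not in data]
            if not missing:
                return data
            # Parseable but incomplete: ask only for the absent keys.
            text = call_llm(f"Add the missing keys {missing} to this JSON:\n{text}")
        except json.JSONDecodeError:
            # Unparseable: ask for a corrected version, like Guardrails does.
            text = call_llm(f"This was not valid JSON. Return a corrected version:\n{text}")
    raise ValueError("could not obtain valid JSON from the model")

# Demo with a stub standing in for the API: the first reply is the same
# truncated JSON seen above, the second is the corrected version.
replies = iter(['{"pet_type": "dog", "name": "Buddy',
                '{"pet_type": "dog", "name": "Buddy"}'])
result = get_structured_output(lambda p: next(replies),
                               "What pet should I get?", ["pet_type", "name"])
print(result)  # prints {'pet_type': 'dog', 'name': 'Buddy'}
```

<p>In practice you would swap the stub for your actual client call and tighten the validation (e.g., with a Pydantic model), but the point stands: the whole mechanism fits in a dozen lines you can read.</p>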
</section>
<section id="guidance" class="level3">
<h3 class="anchored" data-anchor-id="guidance">Guidance</h3>
<p>Guidance offers constrained generation and programming constructs for writing prompts. Let’s dive into a <a href="https://github.com/guidance-ai/guidance/blob/main/notebooks/tutorials/chat.ipynb">chat example from their tutorials</a>:</p>
<div id="cfe6882f-e22c-4666-9a66-6ce41262c6ea" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> guidance</span>
<span id="cb8-2">gpt35 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> guidance.models.OpenAI(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-3.5-turbo"</span>)</span>
<span id="cb8-3"></span>
<span id="cb8-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> re</span>
<span id="cb8-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> guidance <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> gen, select, system, user, assistant</span>
<span id="cb8-6"></span>
<span id="cb8-7"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@guidance</span></span>
<span id="cb8-8"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> plan_for_goal(lm, goal: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>):</span>
<span id="cb8-9">    </span>
<span id="cb8-10">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># This is a helper function which we will use below</span></span>
<span id="cb8-11">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> parse_best(prosandcons, options):</span>
<span id="cb8-12">        best <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> re.search(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r'Best=</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">(</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">\d</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>, prosandcons)</span>
<span id="cb8-13">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> best:</span>
<span id="cb8-14">            best <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>  re.search(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r'Best</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*?</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">(</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">\d</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Best= option is 3'</span>)</span>
<span id="cb8-15">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> best:</span>
<span id="cb8-16">            best <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(best.group(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb8-17">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb8-18">            best <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb8-19">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> options[best]</span>
<span id="cb8-20"></span>
<span id="cb8-21">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Some general instruction to the model</span></span>
<span id="cb8-22">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> system():</span>
<span id="cb8-23">        lm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"You are a helpful assistant."</span></span>
<span id="cb8-24"></span>
<span id="cb8-25">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Simulate a simple request from the user</span></span>
<span id="cb8-26">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Note that we switch to using 'lm2' here, because these are intermediate steps (so we don't want to overwrite the current lm object)</span></span>
<span id="cb8-27">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> user():</span>
<span id="cb8-28">        lm2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb8-29"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        I want to </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>goal<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb8-30"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        Can you please generate one option for how to accomplish this?</span></span>
<span id="cb8-31"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        Please make the option very short, at most one line."""</span></span>
<span id="cb8-32"></span>
<span id="cb8-33">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate several options. Note that this means several sequential generation requests</span></span>
<span id="cb8-34">    n_options <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span></span>
<span id="cb8-35">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> assistant():</span>
<span id="cb8-36">        options <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb8-37">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n_options):</span>
<span id="cb8-38">            options.append((lm2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> gen(name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'option'</span>, temperature<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>, max_tokens<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>))[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"option"</span>])</span>
<span id="cb8-39"></span>
<span id="cb8-40">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Have the user request pros and cons</span></span>
<span id="cb8-41">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> user():</span>
<span id="cb8-42">        lm2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb8-43"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        I want to </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>goal<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb8-44"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        Can you please comment on the pros and cons of each of the following options, and then pick the best option?</span></span>
<span id="cb8-45"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        ---</span></span>
<span id="cb8-46"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        """</span></span>
<span id="cb8-47">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, opt <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(options):</span>
<span id="cb8-48">            lm2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Option </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>opt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb8-49">        lm2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb8-50"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        ---</span></span>
<span id="cb8-51"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        Please discuss each option very briefly (one line for pros, one for cons), and end by saying Best=X, where X is the number of the best option."""</span></span>
<span id="cb8-52"></span>
<span id="cb8-53">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Get the pros and cons from the model</span></span>
<span id="cb8-54">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> assistant():</span>
<span id="cb8-55">        lm2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> gen(name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'prosandcons'</span>, temperature<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>, max_tokens<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">600</span>, stop<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Best="</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Best="</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> gen(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"best"</span>, regex<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"[0-9]+"</span>) </span>
<span id="cb8-56"></span>
<span id="cb8-57">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The user now extracts the one selected as the best, and asks for a full plan</span></span>
<span id="cb8-58">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># We switch back to 'lm' because this is the final result we want</span></span>
<span id="cb8-59">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> user():</span>
<span id="cb8-60">        lm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb8-61"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        I want to </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>goal<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb8-62"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        Here is my plan: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>options[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(lm2[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"best"</span>])]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb8-63"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        Please elaborate on this plan, and tell me how to best accomplish it."""</span></span>
<span id="cb8-64"></span>
<span id="cb8-65">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The plan is generated</span></span>
<span id="cb8-66">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> assistant():</span>
<span id="cb8-67">        lm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> gen(name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'plan'</span>, max_tokens<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>)</span>
<span id="cb8-68"></span>
<span id="cb8-69">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> lm</span></code></pre></div></div>
</div>
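<p>Setting the guidance-specific syntax aside, the function above is ordinary branch-and-select prompting: fork a side conversation, sample several options, have the model critique them and name a winner, then continue the main conversation with only the winner. Here is a minimal plain-Python sketch of that same control flow. The <code>fake_generate</code> stub is hypothetical (not part of guidance or any API) and simply stands in for a real completion call so the skeleton is runnable:</p>

```python
# A plain-Python skeleton of the branch-and-select flow, with a stub model
# so the control flow is runnable offline. `fake_generate` is a placeholder
# for a real chat-completion call, not a real library function.
def fake_generate(messages):
    # Stand-in for the model: key off the last user message.
    last = messages[-1]["content"]
    if "one option" in last:
        return "Read during your commute."
    if "pros and cons" in last:
        return "Option 0: quick win, low effort. Best=0"
    return "Elaborated plan based on the chosen option."

def plan_for_goal(goal, n_options=3, generate=fake_generate):
    history = [{"role": "system", "content": "You are a helpful assistant."}]

    # Side conversation (the role of lm2): sample several short options.
    ask = {"role": "user",
           "content": f"I want to {goal}. Give one option, at most one line."}
    options = [generate(history + [ask]) for _ in range(n_options)]

    # Side conversation: critique the options and name the best one.
    listing = "\n".join(f"Option {i}: {o}" for i, o in enumerate(options))
    critique = generate(history + [{
        "role": "user",
        "content": f"I want to {goal}. Discuss pros and cons of:\n"
                   f"{listing}\nEnd by saying Best=X."}])
    best = int(critique.rsplit("Best=", 1)[1].split()[0])

    # Main conversation (the role of lm): only the winning option and the
    # final elaboration survive; all intermediate turns are discarded.
    history.append({"role": "user",
                    "content": f"I want to {goal}\n"
                               f"Here is my plan: {options[best]}\n"
                               f"Please elaborate on this plan."})
    history.append({"role": "assistant", "content": generate(history)})
    return history
```

<p>The design point this makes explicit is the same one the comments in the guidance version make: the option-sampling and pros-and-cons turns happen on a forked copy of the conversation, so the final transcript the caller sees contains only the goal, the chosen plan, and its elaboration.</p>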
<div id="7cbd50af-7689-426c-8779-1b0d5e9bfe7a" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gpt35 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> plan_for_goal(goal<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"read more books"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-display">
<pre style="margin: 0px; padding: 0px; vertical-align: middle; padding-left: 8px; margin-left: -8px; border-radius: 0px; border-left: 1px solid rgba(127, 127, 127, 0.2); white-space: pre-wrap; font-family: ColfaxAI, Arial; font-size: 15px; line-height: 23px;"><div style="display: flex; border-bottom: 1px solid rgba(127, 127, 127, 0.2); align-items: center;"><div style="flex: 0 0 80px; opacity: 0.5;">system</div><div style="flex-grow: 1; padding: 5px; padding-top: 10px; padding-bottom: 10px; margin-top: 0px; white-space: pre-wrap; margin-bottom: 0px;">You are a helpful assistant.</div></div><div style="display: flex; border-bottom: 1px solid rgba(127, 127, 127, 0.2); align-items: center;"><div style="flex: 0 0 80px; opacity: 0.5;">user</div><div style="flex-grow: 1; padding: 5px; padding-top: 10px; padding-bottom: 10px; margin-top: 0px; white-space: pre-wrap; margin-bottom: 0px;">I want to read more books
Here is my plan: Set aside 30 minutes of dedicated reading time each day.
Please elaborate on this plan, and tell me how to best accomplish it.</div></div><div style="display: flex; border-bottom: 1px solid rgba(127, 127, 127, 0.2); align-items: center;"><div style="flex: 0 0 80px; opacity: 0.5;">assistant</div><div style="flex-grow: 1; padding: 5px; padding-top: 10px; padding-bottom: 10px; margin-top: 0px; white-space: pre-wrap; margin-bottom: 0px;"><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">Setting</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> aside</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> </span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">30</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> minutes</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> of</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> dedicated</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> time</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> each</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> day</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> is</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> great</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> plan</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 
3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> read</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> more</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> books</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Here</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> are</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> some</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> tips</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> help</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> accomplish</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> this</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> goal</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">1</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Establish</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> routine</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Choose</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> specific</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> time</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> of</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> day</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> that</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> works</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> best</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> for</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> whether</span><span style="background-color: rgba(0.0, 
165.0, 0, 0.15); border-radius: 3px;" title="1.0"> it</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">'s</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> in</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> the</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> morning</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> during</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> lunch</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> break</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> before</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> bed</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Cons</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">istency</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> is</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> key</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> 
forming</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> habit</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">2</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Create</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">-friendly</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> environment</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Find</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> quiet</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> comfortable</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> spot</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> where</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> can</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> focus</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> on</span><span style="background-color: 
rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> without</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> distractions</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> It</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> could</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> be</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> cozy</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> corner</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> in</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> home</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> park</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> bench</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" 
title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> local</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> library</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">3</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Mini</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">mi</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">ze</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> distractions</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Put</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> away</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> phone</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> turn</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> off</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> the</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> TV</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> avoid</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); 
border-radius: 3px;" title="1.0"> any</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> other</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> potential</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> interruptions</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> during</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> dedicated</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> time</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> This</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> will</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> help</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> stay</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> focused</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> fully</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> immer</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 
3px;" title="1.0">se</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> yourself</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> in</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> the</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> book</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">4</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Choose</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> books</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> that</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> interest</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Select</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> books</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> that</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> align</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> with</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> personal</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> interests</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> hobbies</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: 
rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> goals</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> When</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">'re</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> genuinely</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> interested</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> in</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> the</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> subject</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> matter</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">'ll</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> be</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> more</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> motivated</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 
3px;" title="1.0"> read</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> regularly</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">5</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Start</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> with</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> manageable</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> goals</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> If</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">'re</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> new</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> have</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> busy</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> schedule</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 
0, 0.15); border-radius: 3px;" title="1.0"> start</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> with</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> smaller</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> time</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> commitment</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> such</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> as</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> </span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">15</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> minutes</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> gradually</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> increase</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> it</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> </span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" 
title="1.0">30</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> minutes</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> more</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> as</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> become</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> more</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> comfortable</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">6</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Set</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> timer</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Use</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> timer</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> app</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> that</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> allows</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> track</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 
0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> time</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> This</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> will</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> help</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> stay</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> accountable</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> ensure</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> that</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> dedicate</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> the</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> full</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> </span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">30</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> minutes</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" 
title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">7</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Make</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> enjoyable</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Create</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> cozy</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> atmosphere</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> by</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> lighting</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> candle</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> sip</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">ping</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: 
rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> cup</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> of</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> tea</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> playing</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> soft</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> background</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> music</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Eng</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">aging</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> all</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> senses</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> can</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> enhance</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 
3px;" title="1.0"> experience</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">8</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Join</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> book</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> club</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> group</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Consider</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> joining</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> book</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> club</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> participating</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> in</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 
165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> group</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> connect</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> with</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> fellow</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> book</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> lovers</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> This</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> can</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> provide</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> additional</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> motivation</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> discussion</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> opportunities</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); 
border-radius: 3px;" title="1.0"> book</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> recommendations</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">9</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Keep</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> log</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Maintain</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> record</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> of</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> the</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> books</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">'ve</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> read</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> along</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> with</span><span style="background-color: rgba(0.0, 165.0, 0, 
0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> thoughts</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reflections</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> This</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> can</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> help</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> track</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> progress</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> discover</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> patterns</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> in</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> preferences</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 
3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> serve</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> as</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> source</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> of</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> inspiration</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> for</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> future</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">10</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Be</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> flexible</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> While</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> it</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">'s</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> important</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> have</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> dedicated</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> time</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> be</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> flexible</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 
165.0, 0, 0.15); border-radius: 3px;" title="1.0"> adaptable</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Life</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> can</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> sometimes</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> get</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> busy</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> so</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> if</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> miss</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> day</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> don</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">'t</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> be</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> discouraged</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" 
title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Simply</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> pick</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> up</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> where</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> left</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> off</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> continue</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> with</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> routine</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">Remember</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> the</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> goal</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> is</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> enjoy</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> the</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> process</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> of</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> make</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> it</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> regular</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> part</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> of</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 
165.0, 0, 0.15); border-radius: 3px;" title="1.0"> life</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Happy</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">!</span></div></div></pre>
</div>
</div>
<p>This looks pretty neat! But what is it doing, exactly? <strong>This makes a total of 7 calls to OpenAI</strong>, which I have collected in <a href="https://gist.github.com/hamelsmu/d0d75bf702e56987f35cb715f7da4d6a">this gist</a>. <strong>Five of the seven API calls are “internal” thoughts asking the LLM to generate ideas.</strong> Even though the temperature is set to 1.0, <strong>these “ideas” are mostly redundant.</strong> The penultimate call to OpenAI enumerates these “ideas”, which I’ve included below:</p>
<div id="a3a0664f-577d-4a0c-96e7-74527036a0be" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="cell-output cell-output-stdout">
<pre><code>I want to read more books
Can you please comment on the pros and cons of each of the following options, and then pick the best option?
---
Option 0: Set aside dedicated time each day for reading.
Option 1: Set aside 30 minutes of dedicated reading time each day.
Option 2: Set aside dedicated time each day for reading.
Option 3: Set aside dedicated time each day for reading.
Option 4: Join a book club.
---
Please discuss each option very briefly (one line for pros, one for cons), and end by saying Best=X, where X is the number of the best option.</code></pre>
</div>
</div>
<p>I know from experience that you are likely to get better results if you tell the language model to generate ideas in one shot. That way, the LLM can reference its previous ideas and produce more diverse suggestions. This is a good example of accidental complexity: it’s very tempting to take this design pattern and apply it blindly. This is less of a critique of this particular framework, since the code makes it clear that 5 independent calls will happen. Either way, it’s a good idea to check your work by inspecting API calls!</p>
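<p>To make the one-shot alternative concrete, here is a minimal sketch (the function name and exact wording are mine, not from any framework): a single prompt that asks for all the ideas at once, so the model can see, and avoid repeating, its earlier suggestions.</p>

```python
def build_single_shot_ideas_prompt(goal: str, n_ideas: int = 5) -> str:
    """Ask for all ideas in one completion so the model can avoid repeats."""
    idea_slots = "\n".join(f"Idea {i}:" for i in range(1, n_ideas + 1))
    return (
        f"{goal}\n\n"
        f"Brainstorm {n_ideas} distinct ways to achieve this. "
        "Each new idea must differ meaningfully from the ones before it.\n"
        f"{idea_slots}"
    )

# One API call instead of five independent "internal thought" calls.
prompt = build_single_shot_ideas_prompt("I want to read more books")
print(prompt)
```

Because the model generates each idea in the same context window as the previous ones, it can diversify on purpose rather than by luck of the temperature setting.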
</section>
<section id="langchain" class="level3">
<h3 class="anchored" data-anchor-id="langchain">Langchain</h3>
<p>Langchain is a multi-tool for all things LLM. Lots of people rely on Langchain when getting started with LLMs. The core LangChain library doesn’t generally hide prompts from you; however, there are experimental features that do. Let’s take a look at one of these features, called <a href="https://api.python.langchain.com/en/latest/smart_llm/langchain_experimental.smart_llm.base.SmartLLMChain.html">SmartLLMChain</a>:</p>
<div id="600b1e60-6c01-429e-834b-470730b5ea26" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> langchain.prompts <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> PromptTemplate</span>
<span id="cb11-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> langchain_experimental.smart_llm <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SmartLLMChain</span>
<span id="cb11-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> langchain_openai <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ChatOpenAI</span>
<span id="cb11-4"></span>
<span id="cb11-5">hard_question <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I have a 12 liter jug and a 6 liter jug.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb11-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">I want to measure 6 liters. How do I do it?"</span></span>
<span id="cb11-7">prompt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> PromptTemplate.from_template(hard_question)</span>
<span id="cb11-8">llm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ChatOpenAI(temperature<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, model_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-3.5-turbo"</span>)</span></code></pre></div></div>
</div>
<div id="bdb002f1-5f1b-40a7-ad4f-c9052813ef1f" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1">chain <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SmartLLMChain(llm<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>llm, prompt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>prompt, </span>
<span id="cb12-2">                      n_ideas<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, </span>
<span id="cb12-3">                      verbose<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb12-4">result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> chain.run({})</span></code></pre></div></div>
</div>
<div id="13be7600-e9eb-4ed2-91d9-a8182858ca83" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(result)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Idea 1: 1. Fill the 12 liter jug completely.
2. Pour the contents of the 12 liter jug into the 6 liter jug. This will leave you with 6 liters in the 12 liter jug.
3. Empty the 6 liter jug.
4. Pour the remaining 6 liters from the 12 liter jug into the now empty 6 liter jug.
5. You now have 6 liters in the 6 liter jug.

Idea 2: 1. Fill the 12 liter jug completely.
2. Pour the contents of the 12 liter jug into the 6 liter jug. This will leave you with 6 liters in the 12 liter jug.
3. Empty the 6 liter jug.
4. Pour the remaining 6 liters from the 12 liter jug into the now empty 6 liter jug.
5. You now have 6 liters in the 6 liter jug.

Improved Answer:
1. Fill the 12 liter jug completely.
2. Pour the contents of the 12 liter jug into the 6 liter jug until the 6 liter jug is full. This will leave you with 6 liters in the 12 liter jug and the 6 liter jug completely filled.
3. Empty the 6 liter jug.
4. Pour the remaining 6 liters from the 12 liter jug into the now empty 6 liter jug.
5. You now have 6 liters in the 6 liter jug.

Full Answer:
To measure 6 liters using a 12 liter jug and a 6 liter jug, follow these steps:
1. Fill the 12 liter jug completely.
2. Pour the contents of the 12 liter jug into the 6 liter jug until the 6 liter jug is full. This will leave you with 6 liters in the 12 liter jug and the 6 liter jug completely filled.
3. Empty the 6 liter jug.
4. Pour the remaining 6 liters from the 12 liter jug into the now empty 6 liter jug.
5. You now have 6 liters in the 6 liter jug.</code></pre>
</div>
</div>
<p>Neat! So what happened exactly? While this API emits logs that show you a lot of information (available on <a href="https://gist.github.com/hamelsmu/abfb14b0af4c70e8532f9d4e0ef3e54e">this gist</a>), the API request pattern is interesting:</p>
<ol type="1">
<li><p>Two <em>separate</em> API calls, one for each “idea”.</p></li>
<li><p>Another API call that incorporates the two ideas as context, with the prompt:</p>
<blockquote class="blockquote">
<p>You are a researcher tasked with investigating the 2 response options provided. List the flaws and faulty logic of each answer options. Let’w work this out in a step by step way to be sure we have all the errors:”</p>
</blockquote></li>
<li><p>A final API call that takes the critique from step 2 and generates an answer.</p></li>
</ol>
<p>It’s not clear that this approach is optimal. I am not sure this task should take 4 separate API calls. Perhaps the critique and the final answer could be generated in a single step? Furthermore, the prompt has a spelling error (<code>Let'w</code>) and focuses narrowly on identifying errors, which makes me skeptical that this prompt has been optimized or tested.</p>
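<p>As a sketch of that idea (the function name and wording here are hypothetical, not part of LangChain), the critique and the final answer could be requested in one call by asking the model to do both in sequence:</p>

```python
def build_critique_and_answer_prompt(question: str, ideas: list[str]) -> str:
    """Combine the critique step and the final-answer step into one request."""
    numbered = "\n".join(f"Option {i}: {idea}" for i, idea in enumerate(ideas))
    return (
        f"{question}\n\n{numbered}\n\n"
        "First, briefly note the flaws or faulty logic in each option. "
        "Then write a single corrected, step-by-step answer that avoids "
        "those flaws."
    )

prompt = build_critique_and_answer_prompt(
    "I have a 12 liter jug and a 6 liter jug. I want to measure 6 liters.",
    [
        "Fill the 12 liter jug and pour it into the 6 liter jug.",
        "Fill the 6 liter jug directly to the top.",
    ],
)
print(prompt)
```

Whether this performs as well as two separate calls is an empirical question — but it halves the number of round trips, and inspecting the actual API traffic (as above) is how you would find out.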
</section>
<section id="instructor" class="level3">
<h3 class="anchored" data-anchor-id="instructor">Instructor</h3>
<p><a href="https://github.com/jxnl/instructor">Instructor</a> is a framework for structured outputs.</p>
<section id="structred-data-extraction-with-pydantic" class="level4">
<h4 class="anchored" data-anchor-id="structred-data-extraction-with-pydantic">Structured data extraction with Pydantic</h4>
<p>Here is a basic example from the project’s <a href="https://github.com/jxnl/instructor">README</a> that allows you to extract structured data by using Pydantic to define your schema.</p>
<div id="8469f9f6-6998-4de7-8551-c073929e385c" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> instructor</span>
<span id="cb15-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> openai <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> OpenAI</span>
<span id="cb15-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pydantic <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> BaseModel</span>
<span id="cb15-4"></span>
<span id="cb15-5">client <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> instructor.patch(OpenAI())</span>
<span id="cb15-6"></span>
<span id="cb15-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> UserDetail(BaseModel):</span>
<span id="cb15-8">    name: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span></span>
<span id="cb15-9">    age: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span></span>
<span id="cb15-10"></span>
<span id="cb15-11">user <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> client.chat.completions.create(</span>
<span id="cb15-12">    model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-3.5-turbo"</span>,</span>
<span id="cb15-13">    response_model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>UserDetail,</span>
<span id="cb15-14">    messages<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Extract Jason is 25 years old"</span>}])</span></code></pre></div></div>
</div>
<p>We can see how this works by inspecting the API call logged to mitmproxy:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode json code-overflow-wrap code-with-copy"><code class="sourceCode json"><span id="cb16-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb16-2">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"function_call"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb16-3">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"UserDetail"</span></span>
<span id="cb16-4">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb16-5">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"functions"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb16-6">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb16-7">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"description"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Correctly extracted `UserDetail` with all the required parameters with correct types"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-8">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"UserDetail"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-9">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"parameters"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb16-10">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"properties"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb16-11">                    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"age"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb16-12">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"title"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Age"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-13">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"integer"</span></span>
<span id="cb16-14">                    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb16-15">                    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb16-16">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"title"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-17">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"string"</span></span>
<span id="cb16-18">                    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb16-19">                <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb16-20">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"required"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb16-21">                    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"age"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-22">                    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span></span>
<span id="cb16-23">                <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-24">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"object"</span></span>
<span id="cb16-25">            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb16-26">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb16-27">    <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-28">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"messages"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb16-29">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb16-30">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"content"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Extract Jason is 25 years old"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-31">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"role"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span></span>
<span id="cb16-32">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb16-33">    <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-34">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"model"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-3.5-turbo"</span></span>
<span id="cb16-35"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
<p>This is great. For structured output, <strong>it does exactly what I want, and it correctly uses the OpenAI API the way I would use it</strong> if I were writing this manually (by defining a function schema). I would consider this specific API a zero-cost abstraction: it does exactly what I expect it to, with a minimal surface area.</p>
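<p>For comparison, here is roughly the request you would assemble by hand without instructor. This is a sketch: the schema below is transcribed from the captured payload above, and no network call is made here.</p>

```python
# The function schema instructor derives from the Pydantic model, written
# out manually. Sending this dict as the JSON body of a chat-completions
# request reproduces the captured call above.
request_body = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "user", "content": "Extract Jason is 25 years old"}
    ],
    "functions": [
        {
            "name": "UserDetail",
            "description": (
                "Correctly extracted `UserDetail` with all the required "
                "parameters with correct types"
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "age": {"title": "Age", "type": "integer"},
                    "name": {"title": "Name", "type": "string"},
                },
                "required": ["age", "name"],
            },
        }
    ],
    # Forcing this function guarantees the model returns structured output.
    "function_call": {"name": "UserDetail"},
}
```

The abstraction earns its keep precisely because the generated request matches what you would write yourself — nothing extra is smuggled into the prompt.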
</section>
<section id="validation" class="level4">
<h4 class="anchored" data-anchor-id="validation">Validation</h4>
<p>However, instructor has other APIs that are more aggressive and write prompts for you. For example, consider this <a href="https://python.useinstructor.com/tutorials/4-validation/">validation example</a>. Running through that example raises questions similar to those from the exploration of Langchain’s SmartLLMChain above. In this example, you will observe 3 LLM API calls to get the right answer, with the final payload looking like this:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb17-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-2">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"function_call"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-3">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Validator"</span></span>
<span id="cb17-4">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb17-5">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"functions"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb17-6">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-7">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"description"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Validate if an attribute is correct and if not,</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">return a new value with an error message"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-8">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Validator"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-9">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"parameters"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-10">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"properties"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-11">                    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"fixed_value"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-12">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"anyOf"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb17-13">                            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-14">                                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"string"</span></span>
<span id="cb17-15">                            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-16">                            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-17">                                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"null"</span></span>
<span id="cb17-18">                            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb17-19">                        <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-20">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"default"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">null</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-21">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"description"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"If the attribute is not valid, suggest a new value for the attribute"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-22">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"title"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Fixed Value"</span></span>
<span id="cb17-23">                    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb17-24">                    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"is_valid"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-25">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"default"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">true</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-26">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"description"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Whether the attribute is valid based on the requirements"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-27">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"title"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Is Valid"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-28">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"boolean"</span></span>
<span id="cb17-29">                    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb17-30">                    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"reason"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-31">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"anyOf"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb17-32">                            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-33">                                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"string"</span></span>
<span id="cb17-34">                            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-35">                            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-36">                                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"null"</span></span>
<span id="cb17-37">                            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb17-38">                        <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-39">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"default"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">null</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-40">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"description"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The error message if the attribute is not valid, otherwise None"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-41">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"title"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Reason"</span></span>
<span id="cb17-42">                    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb17-43">                <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb17-44">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"required"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-45">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"object"</span></span>
<span id="cb17-46">            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb17-47">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb17-48">    <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-49">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"messages"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb17-50">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-51">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"content"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"You are a world class validation model. Capable to determine if the following value is valid for the statement, if it is not, explain why and suggest a new value."</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-52">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"role"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"system"</span></span>
<span id="cb17-53">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-54">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-55">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"content"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Does `According to some perspectives, the meaning of life is to find purpose, happiness, and fulfillment. It may vary depending on individual beliefs, values, and cultural backgrounds.` follow the rules: don't say objectionable things"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-56">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"role"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span></span>
<span id="cb17-57">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb17-58">    <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-59">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"model"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-3.5-turbo"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-60">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"temperature"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb17-61"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
<p>Concretely, I’m curious whether these steps could be collapsed into two LLM calls instead of three. I also wonder whether generic validation functions (as supplied in the above payload) are the right way to critique output. I don’t know the answer, but this is an interesting design pattern worth poking at.</p>
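<p>One hypothetical way to collapse extraction and validation into a single call is to make the response model carry its own critique. This is only a sketch, not instructor’s actual API surface: the <code>CritiquedAnswer</code> name and the <code>answer</code> field are mine, while <code>is_valid</code> and <code>reason</code> mirror the fields in the payload above.</p>

```python
from typing import Optional
from pydantic import BaseModel, Field

class CritiquedAnswer(BaseModel):
    # Hypothetical combined model: the answer plus its self-critique,
    # extracted in one completion instead of a separate validator call.
    # is_valid / reason mirror the fields in the payload shown above.
    answer: str
    is_valid: bool = Field(
        default=True,
        description="Whether the answer is valid based on the requirements")
    reason: Optional[str] = Field(
        default=None,
        description="The error message if the answer is not valid, otherwise None")
```

<p>With instructor you would pass something like this as the <code>response_model</code>, so one completion returns both the answer and its critique. Whether a model can critique itself as reliably in a single pass is exactly the open question.</p>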
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>As far as LLM frameworks go, I really like this one. The core functionality of defining schemas with Pydantic is very convenient. The code is also very readable and easy to understand. Despite this, I still found it helpful to intercept instructor’s API calls to get another perspective.</p>
<p>There is a way to set a logging level in instructor to see the raw API calls; however, I prefer a framework-agnostic approach :)</p>
</div>
</div>
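<p>The framework-agnostic approach can be as small as a mitmproxy addon. The sketch below is hypothetical (the file name and host filter are mine): run your program with its HTTPS traffic routed through the proxy (e.g. <code>HTTPS_PROXY=http://localhost:8080</code>, with mitmproxy’s CA certificate trusted), and every OpenAI request body gets printed, whatever framework produced it.</p>

```python
# log_llm.py -- run with: mitmproxy -s log_llm.py
# then point your program at the proxy, e.g. HTTPS_PROXY=http://localhost:8080

OPENAI_HOST = "api.openai.com"  # assumption: the default OpenAI endpoint

def is_llm_call(host):
    # Only inspect traffic headed for the OpenAI API
    return OPENAI_HOST in host

def request(flow):
    # mitmproxy invokes this hook for every intercepted request
    if is_llm_call(flow.request.pretty_host):
        print(flow.request.get_text())  # raw JSON payload, prompt included
```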
</section>
</section>
<section id="dspy" class="level3">
<h3 class="anchored" data-anchor-id="dspy">DSPy</h3>
<p><a href="https://github.com/stanfordnlp/dspy">DSPy</a> is a framework that optimizes your prompts against any arbitrary metric. DSPy has a fairly steep learning curve, partly because it introduces many technical terms specific to the framework, like compilers and teleprompters. However, we can quickly peel back the complexity by looking at the API calls it makes!</p>
<p>Let’s run the <a href="https://dspy-docs.vercel.app/docs/quick-start/minimal-example">minimal working example</a>:</p>
<div id="a3e9870b-741b-4b48-8dd4-c7b619fe8693" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> time</span>
<span id="cb18-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> dspy</span>
<span id="cb18-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> dspy.datasets.gsm8k <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> GSM8K, gsm8k_metric</span>
<span id="cb18-4">start_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb18-5"></span>
<span id="cb18-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set up the LM</span></span>
<span id="cb18-7">turbo <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dspy.OpenAI(model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'gpt-3.5-turbo-instruct'</span>, max_tokens<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">250</span>)</span>
<span id="cb18-8">dspy.settings.configure(lm<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>turbo)</span>
<span id="cb18-9"></span>
<span id="cb18-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load math questions from the GSM8K dataset</span></span>
<span id="cb18-11">gms8k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> GSM8K()</span>
<span id="cb18-12">trainset, devset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gms8k.train, gms8k.dev</span></code></pre></div></div>
</div>
<div id="fba9a01d-da36-43ba-a2be-2c6801f9154e" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> CoT(dspy.Module):</span>
<span id="cb19-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb19-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb19-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.prog <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dspy.ChainOfThought(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"question -&gt; answer"</span>)</span>
<span id="cb19-5">    </span>
<span id="cb19-6">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, question):</span>
<span id="cb19-7">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.prog(question<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>question)</span></code></pre></div></div>
</div>
<div id="95f7feaf-0ad0-4691-9488-a6214abc2bcd" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> dspy.teleprompt <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> BootstrapFewShotWithRandomSearch</span>
<span id="cb20-2"></span>
<span id="cb20-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set up the optimizer: we want to "bootstrap" (i.e., self-generate) 8-shot examples of our CoT program.</span></span>
<span id="cb20-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The optimizer will repeat this 10 times (plus some initial attempts) before selecting its best attempt on the devset.</span></span>
<span id="cb20-5">config <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(max_bootstrapped_demos<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, max_labeled_demos<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, num_candidate_programs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, num_threads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)</span>
<span id="cb20-6"></span>
<span id="cb20-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Optimize! Use the `gms8k_metric` here. In general, the metric is going to tell the optimizer how well it's doing.</span></span>
<span id="cb20-8">teleprompter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> BootstrapFewShotWithRandomSearch(metric<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>gsm8k_metric, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>config)</span>
<span id="cb20-9">optimized_cot <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> teleprompter.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">compile</span>(CoT(), trainset<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>trainset, valset<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>devset)</span></code></pre></div></div>
</div>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>This was not very minimal
</div>
</div>
<div class="callout-body-container callout-body">
<p>Despite this being the official <a href="https://dspy-docs.vercel.app/docs/quick-start/minimal-example">quick-start/minimal working</a> example, this code took <strong>more than 30 minutes to run and made hundreds of calls to OpenAI!</strong> That cost non-trivial time (and money), especially as an entry point for someone just taking a first look at the library. There was no warning that this would happen.</p>
</div>
</div>
<p>DSPy made hundreds of API calls because it was iteratively sampling examples for a few-shot prompt and selecting the best ones according to the <code>gsm8k_metric</code> on a validation set. I was able to quickly understand this by scanning through the API requests logged to mitmproxy.</p>
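<p>To see why the call count explodes, here is a rough, hypothetical sketch of what a random-search few-shot optimizer does. This is my simplification, not DSPy’s actual implementation; <code>predict</code> stands in for a chain-of-thought LLM call.</p>

```python
import random

def random_search_fewshot(trainset, devset, metric, predict,
                          n_candidates=10, k=8):
    # Hypothetical simplification of BootstrapFewShotWithRandomSearch:
    # sample candidate demo sets, score each on the dev set, keep the best.
    # LLM calls scale as n_candidates * len(devset) -- hence "hundreds".
    best_score, best_demos = -1.0, None
    for _ in range(n_candidates):
        demos = random.sample(trainset, k)             # candidate few-shot examples
        preds = [predict(demos, ex) for ex in devset]  # one LLM call per dev example
        score = sum(metric(ex, p) for ex, p in zip(devset, preds)) / len(devset)
        if score > best_score:
            best_score, best_demos = score, demos
    return best_demos
```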
<p>DSPy offers an <code>inspect_history</code> method which allows you to see the last <code>n</code> prompts and their completions:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1">turbo.inspect_history(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span></code></pre></div></div>
<p>I verified that these prompts matched the last few API calls shown in mitmproxy. Overall, I would be tempted to keep the prompt and jettison the library. That said, I am curious to see how this library evolves.</p>
</section>
</section>
<section id="my-personal-experience" class="level2">
<h2 class="anchored" data-anchor-id="my-personal-experience">My Personal Experience</h2>
<p>Do I hate LLM libraries? No! I think many of the libraries in this blog post could be helpful if used thoughtfully in the right situations. However, I’ve witnessed too many people fall into the trap of using these libraries without understanding what they are doing.</p>
<p>One thing I focus on as an independent consultant is to make sure my clients don’t take on accidental complexity. It’s very tempting to adopt additional tools given all the excitement around LLMs. Looking at prompts is one way to mitigate that temptation.</p>
<p>I’m wary of frameworks that distance the human too far from LLMs. By whispering <em>“Fuck you, show me the prompt!”</em> when using these tools, you are empowered to decide for yourself.<sup>1</sup></p>
<p><br> <br></p>
<div class="acknowledgments">
<p><em>Acknowledgments: Thanks to <a href="https://twitter.com/jeremyphoward">Jeremy Howard</a> and <a href="https://twitter.com/bclavie">Ben Clavie</a> for thoughtfully reviewing this post.</em></p>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>You don’t have to whisper. Saying it out loud is fine too - let others know!↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>llms</category>
  <category>ml</category>
  <guid>https://hamel.dev/blog/posts/prompt/</guid>
  <pubDate>Wed, 14 Feb 2024 08:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/prompt/slap_3.png" medium="image" type="image/png" height="72" width="144"/>
</item>
<item>
  <title>How To Debug Axolotl</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/axolotl/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<p><img src="https://hamel.dev/blog/posts/axolotl/images/debug_axolotl.png" class="img-fluid"></p>
<section id="motivation" class="level2">
<h2 class="anchored" data-anchor-id="motivation">Motivation</h2>
<p><a href="https://github.com/OpenAccess-AI-Collective/axolotl">Axolotl</a> is a great project for fine-tuning LLMs. I started contributing to the project, and I found that it was difficult to debug. I wanted to share some tips and tricks I learned along the way, along with configuration files for debugging with VSCode. Moreover, I think being able to debug axolotl empowers developers who encounter bugs or want to understand how the code works. I hope this document helps you get started.</p>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Important</span>This content is now part of the Axolotl docs!
</div>
</div>
<div class="callout-body-container callout-body">
<p><a href="https://github.com/OpenAccess-AI-Collective/axolotl/pull/1089">I contributed</a> this blog post’s contents as documentation for the axolotl project. <strong>You can find this content <a href="https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/debugging.md">in the axolotl repo here</a></strong>.</p>
</div>
</div>
</section>
<section id="general-tips" class="level2">
<h2 class="anchored" data-anchor-id="general-tips">General Tips</h2>
<p>While debugging, it’s helpful to simplify your test scenario as much as possible. Here are some tips for doing so:</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>All of these tips are incorporated into the example configuration for debugging with VSCode below.</p>
</div>
</div>
<ol type="1">
<li><p><strong>Make sure you are using the latest version of axolotl</strong>: This project changes often and bugs get fixed fast. Check your git branch and make sure you have pulled the latest changes from <code>main</code>.</p></li>
<li><p><strong>Eliminate Concurrency</strong>: Restrict the number of processes to 1 for both training and data preprocessing:</p>
<ul>
<li>Set <code>CUDA_VISIBLE_DEVICES</code> to a single GPU, ex: <code>export CUDA_VISIBLE_DEVICES=0</code>.</li>
<li>Set <code>dataset_processes: 1</code> in your axolotl config or run the training command with <code>--dataset_processes=1</code>.</li>
</ul></li>
<li><p><strong>Use a small dataset</strong>: Construct or use a small dataset from the HF Hub. When using a small dataset, you will often have to set <code>sample_packing: False</code> and <code>eval_sample_packing: False</code> to avoid errors. If you are in a pinch and don’t have time to construct a small dataset but want to use one from the HF Hub, you can shard the data. This will still tokenize the entire dataset but will only use a fraction of it for training. For example, to shard the dataset into 20 pieces, add the following to your axolotl config:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dataset</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb1-2"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    ...</span></span>
<span id="cb1-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">shards</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span></code></pre></div></div></li>
<li><p><strong>Use a small model</strong>: A good example of a small model is <a href="https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0">TinyLlama/TinyLlama-1.1B-Chat-v1.0</a>.</p></li>
<li><p><strong>Minimize iteration time</strong>: Make sure the training loop finishes as fast as possible, with these settings.</p>
<ul>
<li><code>micro_batch_size: 1</code></li>
<li><code>max_steps: 1</code></li>
<li><code>val_set_size: 0</code></li>
</ul></li>
<li><p><strong>Clear Caches:</strong> Axolotl caches certain steps and so does the underlying HuggingFace trainer. You may want to clear some of these caches when debugging.</p>
<ul>
<li>Data preprocessing: When debugging data preprocessing, which includes prompt template formation, you may want to delete the directory set in <code>dataset_prepared_path:</code> in your axolotl config. If you didn’t set this value, the default is <code>last_run_prepared</code>.</li>
<li>HF Hub: If you are debugging data preprocessing, you should clear the relevant <a href="https://huggingface.co/docs/datasets/cache">HuggingFace datasets cache</a> by deleting the appropriate <code>~/.cache/huggingface/datasets/...</code> folder(s).</li>
<li><strong>The recommended approach is to redirect all outputs and caches to a temporary folder and delete selected subfolders before each run. This is demonstrated in the example configuration below.</strong></li>
</ul></li>
</ol>
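<p>Putting the tips above together, a minimal debugging config might look like the sketch below. Treat it as a starting point only: key names and defaults can change between axolotl versions, so verify each option against the version you have installed.</p>

```yaml
# Sketch of a fast-iteration debug config (verify keys against your axolotl version)
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0   # small model

datasets:
  - path: philschmid/guanaco-sharegpt-style      # example dataset on the HF Hub
    type: sharegpt
    shards: 20             # only if you must use a large HF Hub dataset

dataset_processes: 1       # eliminate concurrency in preprocessing
sample_packing: false      # often required with tiny datasets
eval_sample_packing: false

micro_batch_size: 1        # minimize iteration time
max_steps: 1
val_set_size: 0

dataset_prepared_path: /tmp/axolotl-debug/last_run_prepared  # easy to clear between runs
```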
</section>
<section id="debugging-with-vscode" class="level2">
<h2 class="anchored" data-anchor-id="debugging-with-vscode">Debugging with VSCode</h2>
<section id="background" class="level3">
<h3 class="anchored" data-anchor-id="background">Background</h3>
<p>The below example shows how to configure VSCode to debug data preprocessing of the <code>sharegpt</code> format. This is the format used when you have the following in your axolotl config:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">datasets</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb2-2"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">path</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> &lt;path to your sharegpt formatted dataset&gt;</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> # example on HF Hub: philschmid/guanaco-sharegpt-style</span></span>
<span id="cb2-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">type</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> sharegpt</span></span></code></pre></div></div>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Important
</div>
</div>
<div class="callout-body-container callout-body">
<p>If you are already familiar with advanced VSCode debugging, you can skip the below explanation and look at the files <a href="../.vscode/launch.json">.vscode/launch.json</a> and <a href="https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/.vscode/tasks.json">.vscode/tasks.json</a> for an example configuration.</p>
</div>
</div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>If you prefer to watch a video rather than read, you can skip to the video tutorial below (though doing both is recommended).</p>
</div>
</div>
</section>
<section id="setup" class="level3">
<h3 class="anchored" data-anchor-id="setup">Setup</h3>
<p>Make sure you have an <a href="https://setuptools.pypa.io/en/latest/userguide/development_mode.html">editable install</a> of Axolotl, which ensures that changes you make to the code are reflected at runtime. Run the following commands from the root of this project:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip3</span> install packaging</span>
<span id="cb3-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip3</span> install <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-e</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'.[flash-attn,deepspeed]'</span></span></code></pre></div></div>
<section id="remote-hosts" class="level4">
<h4 class="anchored" data-anchor-id="remote-hosts">Remote Hosts</h4>
<p>If you are developing on a remote host, you can easily use VSCode to debug remotely. To do so, follow this <a href="https://code.visualstudio.com/docs/remote/ssh">Remote - SSH guide</a>. You can also see the video below on Docker and Remote SSH debugging.</p>
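<p>For reference, VSCode’s Remote - SSH extension picks up hosts from your SSH config. Below is a minimal sketch of an <code>~/.ssh/config</code> entry; the host alias, address, user, and key path are all placeholders you would replace with your own values:</p>

```
# ~/.ssh/config — example entry (all values below are placeholders)
Host axolotl-dev
    HostName 203.0.113.10
    User ubuntu
    IdentityFile ~/.ssh/id_ed25519
```

<p>Once an entry like this exists, the host shows up in VSCode’s Remote Explorer and you can open your Axolotl checkout on it directly.</p>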
</section>
</section>
<section id="configuration" class="level3">
<h3 class="anchored" data-anchor-id="configuration">Configuration</h3>
<p>The easiest way to get started is to modify the <a href="https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/.vscode/launch.json">.vscode/launch.json</a> file in the axolotl GitHub repo. This is just an example configuration, so you may need to modify or copy it to suit your needs.</p>
<p>For example, to mimic the command <code>cd devtools &amp;&amp; CUDA_VISIBLE_DEVICES=0 accelerate launch -m axolotl.cli.train dev_sharegpt.yml</code>, you would use the configuration below<sup>1</sup>. Note that we add flags that override the axolotl config to incorporate the tips above (see the comments). We also set the working directory to <code>devtools</code> and point the <code>HF_HOME</code> environment variable at a temporary folder, parts of which are deleted before each run: clearing the HF dataset cache ensures the data preprocessing code runs from scratch.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode js code-with-copy"><code class="sourceCode javascript"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/.vscode/launch.json</span></span>
<span id="cb4-2">{</span>
<span id="cb4-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"version"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"0.2.0"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"configurations"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> [</span>
<span id="cb4-5">        {</span>
<span id="cb4-6">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Debug axolotl prompt - sharegpt"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-7">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"python"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-8">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"module"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"accelerate.commands.launch"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-9">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"request"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"launch"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-10">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"args"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> [</span>
<span id="cb4-11">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"-m"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"axolotl.cli.train"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dev_sharegpt.yml"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-12">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// The flags below simplify debugging by overriding the axolotl config </span></span>
<span id="cb4-13">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// with the debugging tips above.  Modify as needed.</span></span>
<span id="cb4-14">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--dataset_processes=1"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// limits data preprocessing to one process</span></span>
<span id="cb4-15">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--max_steps=1"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>              <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// limits training to just one step</span></span>
<span id="cb4-16">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--batch_size=1"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>             <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// minimizes batch size</span></span>
<span id="cb4-17">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--micro_batch_size=1"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// minimizes batch size</span></span>
<span id="cb4-18">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--val_set_size=0"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>           <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// disables validation</span></span>
<span id="cb4-19">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--sample_packing=False"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// disables sample packing which is necessary for small datasets</span></span>
<span id="cb4-20">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--eval_sample_packing=False"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// disables sample packing on eval set</span></span>
<span id="cb4-21">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--dataset_prepared_path=temp_debug/axolotl_outputs/data"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// send data outputs to a temp folder</span></span>
<span id="cb4-22">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--output_dir=temp_debug/axolotl_outputs/model"</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// send model outputs to a temp folder</span></span>
<span id="cb4-23">                ]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-24">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"console"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"integratedTerminal"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// show output in the integrated terminal</span></span>
<span id="cb4-25">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cwd"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"${workspaceFolder}/devtools"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// set working directory to devtools from the root of the project</span></span>
<span id="cb4-26">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"justMyCode"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">true</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>                   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// step through only axolotl code</span></span>
<span id="cb4-27">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"env"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CUDA_VISIBLE_DEVICES"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"0"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Since we aren't doing distributed training, we need to limit to one GPU</span></span>
<span id="cb4-28">                    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"HF_HOME"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"${workspaceFolder}/devtools/temp_debug/.hf-cache"</span>}<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// send HF cache to a temp folder</span></span>
<span id="cb4-29">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"preLaunchTask"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cleanup-for-dataprep"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// delete temp folders (see below)</span></span>
<span id="cb4-30">        }</span>
<span id="cb4-31">    ]</span>
<span id="cb4-32">}</span></code></pre></div></div>
<p><strong>Additional notes about this configuration:</strong></p>
<ul>
<li>The argument <code>justMyCode</code> is set to <code>true</code> such that you step through only the axolotl code. If you want to step into dependencies, set this to <code>false</code>.</li>
<li>The <code>preLaunchTask</code>: <code>cleanup-for-dataprep</code> is defined in <a href="https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/.vscode/tasks.json">.vscode/tasks.json</a> and is used to delete the following folders before debugging, which is essential to ensure that the data pre-processing code is run from scratch:
<ul>
<li><code>./devtools/temp_debug/axolotl_outputs</code></li>
<li><code>./devtools/temp_debug/.hf-cache/datasets</code></li>
</ul></li>
</ul>
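<p>If you want to perform the same cleanup outside of VSCode (for example, from a plain terminal before a manual run), the two deletions reduce to a short shell sketch, assuming you run it from the root of the project:</p>

```shell
# Manually delete the debug outputs and the temporary HF dataset cache,
# mirroring what the cleanup-for-dataprep preLaunchTask does.
# rm -rf is a no-op if the folders do not exist yet.
rm -rf devtools/temp_debug/axolotl_outputs
rm -rf devtools/temp_debug/.hf-cache/datasets
```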
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>You may not want to delete these folders. For example, if you are debugging model training instead of data pre-processing, you may NOT want to delete the cache or output folders. You may also need to add additional tasks to the <code>tasks.json</code> file depending on your use case.</p>
</div>
</div>
<p>Below is the <a href="https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/.vscode/tasks.json">.vscode/tasks.json</a> file that defines the <code>cleanup-for-dataprep</code> task. This task runs before each debugging session when you use the above configuration. Note the two tasks that delete the two folders mentioned above. The third task, <code>cleanup-for-dataprep</code>, is a composite task that combines the two. A composite task is necessary because VSCode does not allow you to specify multiple tasks in the <code>preLaunchTask</code> argument of the <code>launch.json</code> file.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode js code-with-copy"><code class="sourceCode javascript"><span id="cb5-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/.vscode/tasks.json</span></span>
<span id="cb5-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// this file is used by launch.json</span></span>
<span id="cb5-3">{</span>
<span id="cb5-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"version"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2.0.0"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tasks"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> [</span>
<span id="cb5-6">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// this task changes into the devtools directory and deletes the temp_debug/axolotl_outputs folder</span></span>
<span id="cb5-7">      {</span>
<span id="cb5-8">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"label"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"delete-outputs"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-9">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"shell"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-10">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"command"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rm -rf temp_debug/axolotl_outputs"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-11">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"options"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>{ <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cwd"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"${workspaceFolder}/devtools"</span>}<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-12">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"problemMatcher"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> []</span>
<span id="cb5-13">      }<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-14">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// this task changes into the devtools directory and deletes the `temp_debug/.hf-cache/datasets` folder</span></span>
<span id="cb5-15">      {</span>
<span id="cb5-16">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"label"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"delete-temp-hf-dataset-cache"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-17">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"shell"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-18">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"command"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rm -rf temp_debug/.hf-cache/datasets"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-19">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"options"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>{ <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cwd"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"${workspaceFolder}/devtools"</span>}<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-20">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"problemMatcher"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> []</span>
<span id="cb5-21">      }<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-22">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// this task combines the two tasks above</span></span>
<span id="cb5-23">      {</span>
<span id="cb5-24">       <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"label"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cleanup-for-dataprep"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-25">       <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dependsOn"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"delete-outputs"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"delete-temp-hf-dataset-cache"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-26">      }</span>
<span id="cb5-27">    ]</span>
<span id="cb5-28">}</span></code></pre></div></div>
</section>
<section id="customizing-your-debugger" class="level3">
<h3 class="anchored" data-anchor-id="customizing-your-debugger">Customizing your debugger</h3>
<p>Your debugging use case may differ from the example above. The easiest thing to do is to put your own axolotl config in the <code>devtools</code> folder and modify the <code>launch.json</code> file to use your config. You may also want to modify the <code>preLaunchTask</code> to delete different folders or not delete anything at all.</p>
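<p>As a minimal sketch, a custom entry in <code>launch.json</code> only needs to swap in your own config file; the name <code>my_config.yml</code> below is a placeholder for whatever file you put in <code>devtools</code>:</p>

```js
// A pared-down launch configuration (my_config.yml is a placeholder).
// Add back the flag overrides, env, and preLaunchTask from the example
// above as your use case requires.
{
    "name": "Debug axolotl - my config",
    "type": "python",
    "module": "accelerate.commands.launch",
    "request": "launch",
    "args": ["-m", "axolotl.cli.train", "my_config.yml"],
    "cwd": "${workspaceFolder}/devtools"
}
```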
</section>
<section id="video-tutorial" class="level3">
<h3 class="anchored" data-anchor-id="video-tutorial">Video Tutorial</h3>
<p>The following video tutorial walks through the above configuration and demonstrates how to debug with VSCode:</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/xUUB11yeMmc" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</section>
</section>
<section id="debugging-with-docker" class="level2">
<h2 class="anchored" data-anchor-id="debugging-with-docker">Debugging With Docker</h2>
<p>Using the <a href="https://hub.docker.com/r/winglian/axolotl/tags">official Axolotl Docker images</a> is a popular way to run Axolotl, and a great environment for debugging. Attaching VSCode to a container takes a few extra steps.</p>
<section id="setup-1" class="level3">
<h3 class="anchored" data-anchor-id="setup-1">Setup</h3>
<p>On the host that is running Axolotl (for example, a remote host), clone the axolotl repo and change your current directory to the root:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb6-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> clone https://github.com/OpenAccess-AI-Collective/axolotl</span>
<span id="cb6-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> axolotl</span></code></pre></div></div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>If you already have axolotl cloned on your host, make sure you have the latest changes and change into the root of the project.</p>
</div>
</div>
<p>Next, run the desired docker image and mount the current directory. Below is a docker command you can run to do this:<sup>2</sup></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb7-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">docker</span> run <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--privileged</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--gpus</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'"all"'</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--shm-size</span> 10g <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--rm</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-it</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--name</span> axolotl <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--ipc</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>host <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--ulimit</span> memlock=-1 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--ulimit</span> stack=67108864 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--mount</span> type=bind,src=<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">${PWD}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>,target=/workspace/axolotl <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-v</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">${HOME}</span>/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-py3.10-cu118-2.0.1</span></code></pre></div></div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>To understand which containers are available, see the <a href="../README.md#docker">Docker section of the README</a> and the <a href="https://hub.docker.com/r/winglian/axolotl/tags">DockerHub repo</a>. For details of how the Docker containers are built, see axolotl’s <a href="../.github/workflows/main.yml">Docker CI builds</a>.</p>
</div>
</div>
<p>You will now be in the container. Next, perform an editable install of Axolotl:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb8-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip3</span> install packaging</span>
<span id="cb8-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip3</span> install <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-e</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'.[flash-attn,deepspeed]'</span></span></code></pre></div></div>
</section>
<section id="attach-to-container" class="level3">
<h3 class="anchored" data-anchor-id="attach-to-container">Attach To Container</h3>
<p>Next, if you are using a remote host, <a href="https://code.visualstudio.com/docs/remote/ssh">Remote into this host with VSCode</a>. If you are using a local host, you can skip this step.</p>
<p>Next, select <code>Dev Containers: Attach to Running Container...</code> using the command palette (<code>CMD + SHIFT + P</code>) in VSCode. You will be prompted to select a container to attach to. Select the container you just created. You will now be in the container with a working directory that is at the root of the project. Any changes you make to the code will be reflected both in the container and on the host.</p>
<p>Now you are ready to debug as described above (see Debugging with VSCode).</p>
</section>
<section id="video---attaching-to-docker-on-remote-host" class="level3">
<h3 class="anchored" data-anchor-id="video---attaching-to-docker-on-remote-host">Video - Attaching To Docker On Remote Host</h3>
<p>Here is a short video that demonstrates how to attach to a Docker container on a remote host:</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/0AuoR7QnHR0" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>


</section>
</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>The config actually mimics the command <code>CUDA_VISIBLE_DEVICES=0 python -m accelerate.commands.launch -m axolotl.cli.train devtools/dev_sharegpt.yml</code>, which is equivalent.↩︎</p></li>
<li id="fn2"><p>Many of these flags are NVIDIA-recommended best practices for the nvidia-container-toolkit. You can read more about them <a href="https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html">here</a>.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>LLMs</category>
  <category>fine-tuning</category>
  <category>axolotl</category>
  <guid>https://hamel.dev/blog/posts/axolotl/</guid>
  <pubDate>Thu, 11 Jan 2024 08:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/axolotl/images/debug_axolotl_small.png" medium="image" type="image/png" height="144" width="144"/>
</item>
<item>
  <title>Dokku: my favorite personal serverless platform</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/dokku/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/dokku/images/serverless.png" class="img-fluid figure-img"></p>
<figcaption>With Dokku, you can turn a VPS into a powerful serverless platform</figcaption>
</figure>
</div>
<section id="what-is-dokku" class="level2">
<h2 class="anchored" data-anchor-id="what-is-dokku">What is Dokku?</h2>
<p><a href="https://dokku.com/">Dokku</a> is an open-source Platform as a Service (PaaS) that runs on a single server of your choice. <strong>It’s like <a href="https://www.heroku.com/">Heroku</a>, but you own it.</strong> It is a great way to get the benefits of Heroku without the costs (Heroku can get quite expensive!). I need to deploy many applications for my <a href="https://parlance-labs.com/">LLM consulting work</a>. Having a cost-effective, easy-to-use serverless platform is essential for me.</p>
<p><strong>I run a Dokku server on a $7/month VPS on <a href="https://us.ovhcloud.com/">OVHcloud</a></strong> for non-GPU workloads. These applications include things like <a href="https://nbsanity.com/">nbsanity</a> and <a href="https://langfree.parlance-labs.com/tutorials/shiny.html#run-the-shiny-app-locally">data cleaning tools for LLMs</a>.</p>
<p>Some of the features I love about Dokku:</p>
<ul>
<li>Easy to use (like Heroku).</li>
<li>Automatic SSL certificate management via <a href="https://letsencrypt.org/">Let’s Encrypt</a>.</li>
<li>Basic Auth support so I can password-protect sites.</li>
<li>Scale up and down with a single command.</li>
<li>Flexibility to handle any application (Node, Python, etc.), including apps defined by a custom Docker container.</li>
<li>Lots of <a href="https://dokku.com/docs/community/plugins/?h=plugins#official-plugins">official plugins</a> that do almost anything I want.</li>
<li>Easily deploy with git commands.</li>
</ul>
</section>
<section id="minimal-dokku-examples" class="level1">
<h1>Minimal Dokku Examples</h1>
<p>Make sure you <a href="https://dokku.com/docs/getting-started/installation/">install Dokku</a> on your VPS. As I mentioned, I use <a href="https://us.ovhcloud.com/">OVH</a>.</p>
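<p>For reference, the install is a single bootstrap script run on the VPS. A hedged sketch; <code>&lt;version&gt;</code> is a placeholder for the current tagged release listed in the install docs:</p>

```shell
# On the VPS: download and run the Dokku bootstrap installer.
# Replace <version> with the latest tagged release from the install docs.
wget -NP . https://dokku.com/install/<version>/bootstrap.sh
sudo DOKKU_TAG=<version> bash bootstrap.sh
```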
<section id="deploying-apps-as-a-docker-container" class="level2">
<h2 class="anchored" data-anchor-id="deploying-apps-as-a-docker-container">Deploying Apps as A Docker Container</h2>
<p>An easy way to deploy applications is with a Docker container.</p>
<p>To deploy a Docker container, I put a Dockerfile in the root of my git repo like this:</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>Dockerfile</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" data-filename="Dockerfile" style="background: #f1f3f5;"><pre class="sourceCode dockerfile code-with-copy"><code class="sourceCode dockerfile"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">FROM</span> python:3.10</span>
<span id="cb1-2"></span>
<span id="cb1-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">COPY</span> . /app</span>
<span id="cb1-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">WORKDIR</span> /app</span>
<span id="cb1-5"></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Install the local package</span></span>
<span id="cb1-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">RUN</span> <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install .</span>
<span id="cb1-8"></span>
<span id="cb1-9"><span class="co" style="color: #5E5E5E;
background-color: null;
# This directory contains app.py">
font-style: inherit;"># This directory contains main.py, a FastAPI app</span></span>
<span id="cb1-10"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">WORKDIR</span> /app/</span>
<span id="cb1-11"></span>
<span id="cb1-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">ENTRYPOINT</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"./entrypoint.sh"</span>]</span></code></pre></div></div>
</div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>The <code>entrypoint.sh</code> script allows me to easily run the app locally or in a Docker container. It looks like this:</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>entrypoint.sh</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" data-filename="entrypoint.sh" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#!/bin/bash</span></span>
<span id="cb2-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">exec</span> uvicorn main:app <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--port</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$PORT</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--host</span> 0.0.0.0</span></code></pre></div></div>
</div>
</div>
</div>
<p>On the Dokku host, create the app:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">dokku</span> apps:create myapp</span></code></pre></div></div>
<p><strong>Locally</strong>, set up access to the Dokku host and name it <code>dokku</code> in your <code>~/.ssh/config</code> file. For example, here is mine:</p>
<pre><code>Host dokku
  HostName &lt;The external IP address of your Dokku host&gt;
  User ubuntu
  IdentityFile /Users/hamel/.ssh/dokku</code></pre>
<p>Locally, add the Dokku host as a remote and push to it:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> remote add dokku dokku@dokku:myapp</span>
<span id="cb5-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> push dokku main</span></code></pre></div></div>
<p>That’s it: your app should be running on the Dokku host! Your local logs will print the URL where your application is served, which by default will be <code>myapp.yourdomain.com</code>. You can also scale it up or down with the following command:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb6-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#scale to two workers</span></span>
<span id="cb6-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">dokku</span> ps:scale myapp web=2</span></code></pre></div></div>
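dokku</span> ps:scale myapp web=2">
<p>A few other day-to-day commands I reach for, run on the Dokku host (<code>myapp</code> and the values are placeholders):</p>

```shell
# Tail the app's logs:
dokku logs myapp -t

# Set environment variables (restarts the app):
dokku config:set myapp API_KEY=xyz

# Point a custom domain at the app:
dokku domains:add myapp myapp.example.com

# Restart without redeploying:
dokku ps:restart myapp
```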
<p>We are just scratching the surface. For more details, see the <a href="https://dokku.com/docs/">Dokku docs</a>.</p>
</section>
<section id="static-sites" class="level2">
<h2 class="anchored" data-anchor-id="static-sites">Static Sites</h2>
<p>GitHub Pages is annoying in that you can’t easily deploy private static sites without paying for an expensive Enterprise account. With Dokku, you can easily deploy a static site from a private GitHub Repo and password-protect it.</p>
<p>We will assume that you have a static site in a git repo in a folder named <code>_site</code>.</p>
<p><strong>On the Dokku host</strong>, create an app named <code>mysite</code> and set the <code>NGINX_ROOT</code> environment variable to <code>_site</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb7-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">dokku</span> apps:create mysite</span>
<span id="cb7-2"><span class="ex" style="color: null;
background-color: null;
dokku</span> config:set static-site">
font-style: inherit;">dokku</span> config:set mysite NGINX_ROOT=_site</span></code></pre></div></div>
<p>Also on the Dokku host, install <a href="https://github.com/dokku/dokku-http-auth">basic auth</a> and <a href="https://github.com/dokku/dokku-http-auth/issues/15#issuecomment-1637058437">set permissions</a> so the plugin can work properly.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb8-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># do setup for the auth plugin that we will use later</span></span>
<span id="cb8-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> dokku plugin:install https://github.com/dokku/dokku-http-auth.git</span>
<span id="cb8-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> chmod +x /home/dokku</span></code></pre></div></div>
<p>Then execute the following commands from the root of the git repo that contains your static site:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="annotated-cell-8" style="background: #f1f3f5;"><pre class="sourceCode bash code-annotation-code code-with-copy code-annotated"><code class="sourceCode bash"><a class="code-annotation-anchor" data-target-cell="annotated-cell-8" data-target-annotation="1" onclick="event.preventDefault();">1</a><span id="annotated-cell-8-1" class="code-annotation-target"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">touch</span> .static</span>
<a class="code-annotation-anchor" data-target-cell="annotated-cell-8" data-target-annotation="2" onclick="event.preventDefault();">2</a><span id="annotated-cell-8-2" class="code-annotation-target"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">echo</span> BUILDPACK_URL=https://github.com/dokku/buildpack-nginx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> .env</span>
<a class="code-annotation-anchor" data-target-cell="annotated-cell-8" data-target-annotation="3" onclick="event.preventDefault();">3</a><span id="annotated-cell-8-3" class="code-annotation-target"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> remote add dokku dokku@dokku:mysite</span><div class="code-annotation-gutter-bg"></div><div class="code-annotation-gutter"></div></code></pre></div></div>
<dl class="code-annotation-container-grid">
<dt data-target-cell="annotated-cell-8" data-target-annotation="1">1</dt>
<dd>
<span data-code-cell="annotated-cell-8" data-code-lines="1" data-code-annotation="1">tells <code>dokku</code> that this is a static site</span>
</dd>
<dt data-target-cell="annotated-cell-8" data-target-annotation="2">2</dt>
<dd>
<span data-code-cell="annotated-cell-8" data-code-lines="2" data-code-annotation="2">tells <code>dokku</code> to use the nginx buildpack for static sites (it will usually automatically detect this, but if you have a project with code and a static site, you need to tell it to use the nginx buildpack so it doesn’t get confused).</span>
</dd>
<dt data-target-cell="annotated-cell-8" data-target-annotation="3">3</dt>
<dd>
<span data-code-cell="annotated-cell-8" data-code-lines="3" data-code-annotation="3">add the <code>dokku</code> host as a remote. For this to work, make sure <code>dokku</code> is a hostname in your <code>~/.ssh/config</code> file as described in the previous section.</span>
</dd>
</dl>
<p>Finally, deploy your application:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb9-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> push dokku main</span></code></pre></div></div>
<p>You can now add auth by running the following command on the Dokku host:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb10-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">dokku</span> http-auth:enable mysite <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>username<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>password<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span></code></pre></div></div>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>You can add multiple usernames/passwords and even filter specific IPs. See <a href="https://github.com/dokku/dokku-http-auth">the docs</a>.</p>
</div>
</div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>SSL / HTTPS
</div>
</div>
<div class="callout-body-container callout-body">
<p>It’s often desirable to have HTTPS for your site. Dokku makes this easy with the <a href="https://github.com/dokku/dokku-letsencrypt">Let’s Encrypt Plugin</a>, which will even auto-renew for you. I don’t use this, because I’m letting <a href="https://developers.cloudflare.com/dns/manage-dns-records/reference/proxied-dns-records/">Cloudflare handle this with its proxy</a>.</p>
<p>If you are using Cloudflare this way, activating this plugin will mess things up (don’t worry, it’s easy to disable). Honestly, I think it’s easier to let Cloudflare handle SSL if you are already doing so.</p>
</div>
</div>
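<p>If you do want the plugin to handle SSL instead of Cloudflare, the setup looks roughly like this (a sketch; command names may vary across plugin versions, so check the plugin README):</p>

```shell
# On the Dokku host: install the plugin, then enable certificates for the app.
sudo dokku plugin:install https://github.com/dokku/dokku-letsencrypt.git
dokku letsencrypt:set mysite email you@example.com
dokku letsencrypt:enable mysite

# Add a cron entry so certificates auto-renew:
dokku letsencrypt:cron-job --add
```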
</section>
</section>
<section id="deploying-with-github-actions" class="level1">
<h1>Deploying With GitHub Actions</h1>
<p>You can automatically deploy Dokku apps with GitHub Actions, which is helpful if you don’t want to fiddle with pushing to the Dokku host. Here is an example GitHub Action workflow that does this:</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>deploy-dokku.yml</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" data-filename="deploy-dokku.yml" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb11-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> CI</span></span>
<span id="cb11-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">on</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb11-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">workflow_dispatch</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb11-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">push</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb11-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">branches</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">[</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">main</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">]</span></span>
<span id="cb11-6"></span>
<span id="cb11-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">concurrency</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> # Cancel previous jobs to avoid deploy locks on dokku</span></span>
<span id="cb11-8"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> ${{ github.ref }}</span></span>
<span id="cb11-9"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cancel-in-progress</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">true</span></span>
<span id="cb11-10"></span>
<span id="cb11-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">jobs</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb11-12"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">deploy-dokku</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb11-13"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runs-on</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> ubuntu-latest</span></span>
<span id="cb11-14"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">steps</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb11-15"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> Checkout code</span></span>
<span id="cb11-16"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">        </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">uses</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> actions/checkout@v2</span></span>
<span id="cb11-17"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">        </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">with</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb11-18"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">          </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fetch-depth</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb11-19"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span></span>
<span id="cb11-20"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> Install SSH key</span></span>
<span id="cb11-21"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">        run</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">: </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">|</span></span>
<span id="cb11-22">          echo "${{ secrets.DOKKU_SSH_PRIVATE_KEY }}" &gt; private_key.pem</span>
<span id="cb11-23">          chmod 600 private_key.pem</span>
<span id="cb11-24"></span>
<span id="cb11-25"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> Add remote and push</span></span>
<span id="cb11-26"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">        run</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">: </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">|</span></span>
<span id="cb11-27">          git remote add dokku dokku@rechat.co:llm-eval</span>
<span id="cb11-28">          GIT_SSH_COMMAND="ssh -i private_key.pem -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" git push dokku main -f</span></code></pre></div></div>
</div>
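<p>For this workflow to authenticate, the repo needs a private key stored as the <code>DOKKU_SSH_PRIVATE_KEY</code> secret and the Dokku host needs the matching public key. A minimal sketch (the key filename and the <code>github-actions</code> key name are arbitrary choices):</p>

```shell
# Generate a dedicated, passphrase-less deploy key for CI:
ssh-keygen -t ed25519 -f dokku-ci -N "" -C "github-actions-deploy"

# dokku-ci      -> paste into the DOKKU_SSH_PRIVATE_KEY repository secret
# dokku-ci.pub  -> register on the Dokku host, e.g.:
#   dokku ssh-keys:add github-actions < dokku-ci.pub
```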
</section>
<section id="miscellaneous-tips" class="level1">
<h1>Miscellaneous Tips</h1>
<p>These are things I often forget, so I’m writing them down here. For these examples, assume my app is named <code>llm-eval</code> and my host is <code>rechat.co</code>.</p>
<section id="run-commands-remotely" class="level2">
<h2 class="anchored" data-anchor-id="run-commands-remotely">Run commands remotely</h2>
<p>You don’t have to ssh into the Dokku host just to execute commands. You can execute them remotely via the <code>dokku</code> user like this:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb12-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># https://dokku.com/docs/deployment/application-management/</span></span>
<span id="cb12-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ssh</span> dokku@rechat.co apps:list</span></code></pre></div></div>
</section>
<section id="docker-cache" class="level2">
<h2 class="anchored" data-anchor-id="docker-cache">Docker cache</h2>
<p>This is how you can <a href="https://dokku.com/docs/advanced-usage/repository-management/">invalidate the docker cache</a> for a fresh build:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb13-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ssh</span> dokku@rechat.co repo:purge-cache llm-eval</span></code></pre></div></div>
</section>
<section id="rebuild-without-pushing" class="level2">
<h2 class="anchored" data-anchor-id="rebuild-without-pushing">Rebuild without pushing</h2>
<p>Sometimes you want to rebuild without pushing. There are <a href="https://dokku.com/docs/processes/process-management/">many ways to do this</a>, but one way is like this:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb14-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ssh</span> dokku@rechat.co ps:rebuild llm-eval</span></code></pre></div></div>
</section>
</section>
<section id="why-did-i-write-this" class="level1">
<h1>Why Did I Write This?</h1>
<p>I found myself digging up these details every time I deployed a new app, so I finally wrote them down. I hope you find them useful, too!</p>


</section>

 ]]></description>
  <category>infra</category>
<category>serverless</category>
  <guid>https://hamel.dev/blog/posts/dokku/</guid>
  <pubDate>Tue, 09 Jan 2024 08:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/dokku/images/serverless.png" medium="image" type="image/png" height="144" width="144"/>
</item>
</channel>
</rss>
