Curating LLM data

A review of tools
Published: October 30, 2024

TLDR

I think many people should build their own data annotation/curation tools for LLMs. The benefits far outweigh the costs in many situations, especially when you use a general-purpose front-end framework. Data curation is too critical a task to outsource without careful consideration, and you don't want to be constrained by the limitations of a vendor's tool early on.

I recommend using Shiny for Python, and I wouldn't recommend Streamlit, for the reasons discussed below.

Background

One pattern I noticed is that great AI researchers are willing to manually inspect lots of data. And more than that, they build infrastructure that allows them to manually inspect data quickly. Though not glamorous, manually examining data gives valuable intuitions about the problem. The canonical example here is Andrej Karpathy doing the ImageNet 2000-way classification task himself.

Jason Wei, AI Researcher at OpenAI

I couldn’t agree with Jason more. I don’t think people look at their data enough. Building your own tools so you can quickly sort through and curate your data is one of the highest-impact activities you can do when working with LLMs. Looking at and curating your own data is critical for both evaluation and fine-tuning.

Things I tried

At the outset, I tried to avoid building something myself. I tried the following vendors that provide tools for data curation/review:

Vendors

Warning

These tools are at varying levels of maturity. I interacted with the developers of all these products, and they were super responsive, kind, and aware of these limitations. I expect these tools to improve significantly over time.

  • Spacy Prodigy: This was my favorite “pre-packaged” tool/framework. It has the cleanest UI and excellent features for lots of different NLP tasks. However, I found it a bit difficult to quickly hack for my specific needs. In the end, I drew inspiration from its UI and built my own tool.
  • Argilla: This platform has lots of functionality; however, the LLM functionality fell short for me. I couldn’t do simple things like sorting, filtering, and labeling. The LLM and non-LLM features have very different APIs, which makes things quite fragmented at the moment. I think it could have potential once it matures.
  • Lilac: I found that this was more of a dataset viewer than a tool for labeling and curating data, so it didn’t really fit my needs. The user interface also didn’t seem very hackable/extendable.

One thing that became clear to me while trying these vendors is the importance of being able to hack these tools to fit your specific needs. Every company you work with will have an idiosyncratic tech stack and tools that you might want to integrate into this data annotation tool. This led me to build my own tools using general-purpose frameworks.

General Purpose Frameworks

Python has really great front-end frameworks that are easy to use, like Gradio, Panel, and Streamlit. The new kid on the block, Shiny for Python, was my favorite after evaluating all of them.

Reasons I liked Shiny the most:

  • Native integration with Quarto.
  • A powerful reactive model that is snappy.
  • A small API that is easy to learn and keep in your head.
  • Amazing WASM support; for example, I have embedded a version of the app in this blog post! (A sketch of how such an embed works follows this list.)
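
On that last point, here is a rough sketch of how a Shiny app can be embedded in a Quarto post so it runs entirely client-side via WASM. This isn’t necessarily the exact setup used for this post; it assumes the shinylive Quarto extension, installed with `quarto add quarto-ext/shinylive`:

````markdown
---
filters:
  - shinylive
---

Prose before the embedded app, which runs in the browser with no server:

```{shinylive-python}
#| standalone: true
#| viewerHeight: 300
from shiny import App, render, ui

app_ui = ui.page_fluid(ui.output_text("msg"))

def server(input, output, session):
    @render.text
    def msg():
        return "Hello from WASM!"

app = App(app_ui, server)
```
````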

I found that Shiny apps always required much less code and were easier to understand than the other frameworks.
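
To make that concrete, here is a minimal sketch of the kind of annotation app I mean, written in Shiny for Python. The records and label choices are placeholders for illustration; a real tool would load your own LLM outputs and persist the labels somewhere:

```python
from shiny import App, reactive, render, ui

# Placeholder LLM outputs to review; swap in your own data source.
records = [
    "The capital of France is Paris.",
    "2 + 2 = 5",
    "Water boils at 100 C at sea level.",
]

app_ui = ui.page_fluid(
    ui.h4("Record under review"),
    ui.output_text_verbatim("record"),
    ui.input_action_button("good", "Good"),
    ui.input_action_button("bad", "Bad"),
    ui.output_text("progress"),
)

def server(input, output, session):
    idx = reactive.value(0)      # which record is on screen
    labels = reactive.value({})  # record index -> label

    def annotate(label: str):
        # Store the label for the current record, then advance.
        updated = dict(labels())
        updated[idx()] = label
        labels.set(updated)
        if idx() < len(records) - 1:
            idx.set(idx() + 1)

    @reactive.effect
    @reactive.event(input.good)
    def _():
        annotate("good")

    @reactive.effect
    @reactive.event(input.bad)
    def _():
        annotate("bad")

    @render.text
    def record():
        return records[idx()]

    @render.text
    def progress():
        return f"Labeled {len(labels())} of {len(records)} records"

app = App(app_ui, server)
```

Save it as app.py and run it with `shiny run app.py`. Everything downstream of the two reactive values updates automatically when a button is clicked; the reactive model does the state plumbing for you, which is a big part of why these apps stay small.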