Introduction

This post serves as a transition guide for those who mainly use closed-source LLM APIs and their finetuning interfaces. Whether you are a prompt engineer, an ML developer or a software engineer, this guide provides step-by-step instructions to make the transition as simple as possible.

I will draw parallels between the closed-source APIs and their open-source alternatives, and flag any important points of consideration along the way. I hope this guide serves you well so you can start your open-source LLM journey and begin to find alternatives to closed-source LLM APIs.

The guide will be structured as such:

  1. Why use open-source LLMs?
  2. How to prepare your dataset for open-source LLMs
  3. Axolotl and Modal (what are they?)
  4. Prepare your own dataset
  5. Using Modal and Axolotl for LLM finetuning
  6. LoRA and QLoRA

1. Why use open-source LLMs?

The big players such as OpenAI and Anthropic have been churning out models that are superior in quality to the open-source models in the field. It is super simple to set up a script using their APIs to run some prompts and get results, so why do we even need open-source models?

There are a few reasons:

  • The LLM calls are cheaper, since you are only paying for the GPU hosting cost.
  • Open-source models can be better at some specific niche tasks than the closed-source top-tier models (more on this later).
  • You do not have to send your private data over to the big corporations and risk data breaches or unauthorised access.
  • Zero data drift: unlike the GPT models, your local model does not receive blackbox updates that change its performance.

2. How to prepare your dataset

Let’s set prompting aside in this section. There are numerous definitions of prompting online, but the one that works best for me personally is the following.

Prompting is the act of using natural language to guide the LLM to respond as close to your intention as possible, without the need to change the model’s internal parameters.

Hence, I will focus the discussion on LLM Finetuning in this section.

For closed-source LLMs, the dataset format is fairly consistent across model providers. Typically, it is a JSONL file where each line contains one data sample, which in turn contains message bodies tagged with their roles (i.e., system, user, assistant).

OpenAI Finetuning Format (JSONL file)

---
{
	"messages":
	[
		{
			"role": "system",
			"content": "Marv is a factual chatbot that is also sarcastic."
		},
		{
			"role": "user",
			"content": "What's the capital of France?"
		},
		{
			"role": "assistant",
			"content": "Paris, as if everyone doesn't know that already."
		}
	]
}
---
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

*NOTE: the first “messages” body is wrapped across multiple lines here for visibility (each “messages” body normally spans only one line)*

Open-source SQLQA dataset (JSONL file)

{
	"question": "How many heads of the departments are older than 56 ?",
	"context": "CREATE TABLE head (age INTEGER)",
	"answer": "[SQL] SELECT COUNT(*) FROM head WHERE age > 56 [/SQL]"
}

*NOTE: once again, this example is multi-line wrapped for visibility; each data point is still only one line in the file*

Now you can see that although both of them are JSONL files, the keys in the JSON bodies look quite different. The OpenAI format organises each sample around conversational roles (system, user, assistant), whereas the SQLQA dataset’s keys are task-specific: the question, the table schema as context, and the expected SQL answer. This is deliberate: open-source datasets are usually structured around the task they are meant to teach, which is exactly why a Text-to-SQL dataset carries question/context/answer fields instead of chat messages. The takeaway is to formulate your dataset in whatever shape best fits your task, rather than forcing everything into a chat-style format.
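To make the difference concrete, here is a minimal Python sketch that maps a task-specific SQLQA row onto the role-based chat format (the file names and the system prompt are my own illustrative choices):

import json

def sqlqa_to_chat(row: dict) -> dict:
    # Map the task-specific keys onto the role-based chat format.
    return {
        "messages": [
            {"role": "system", "content": "You are a Text-to-SQL assistant."},
            {"role": "user", "content": row["context"] + "\n\n" + row["question"]},
            {"role": "assistant", "content": row["answer"]},
        ]
    }

with open("sqlqa.jsonl") as src, open("sqlqa_chat.jsonl", "w") as dst:
    for line in src:
        dst.write(json.dumps(sqlqa_to_chat(json.loads(line))) + "\n")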

3. Axolotl and Modal (what are they?)

Before diving into these two, let’s mention Hugging Face (HF) first. You’ve probably come across Hugging Face: it is that giant GitHub-like collection of machine learning models, datasets and task-specific leaderboards. You can download datasets and LLMs directly from there, and the big-tech companies that train open-source LLMs (e.g. Meta with the Llama models) usually release them there first. This short description of HF is just a refresher, as we will be needing it later.

a. Axolotl

What is Axolotl? I didn’t have time to come up with a proper analogy but I think this temporary one actually works quite well.

Here it is:

  • Imagine you are at McDonald’s and you want a Double Cheeseburger.
  • You can order the Double Cheeseburger by listing out the ingredients: two buns, two patties, cheese, etc.
  • OR you can simply tap the picture of the already-assembled Double Cheeseburger on the kiosk machine screen.
  • Axolotl is basically that kiosk machine: it assembles all the ingredients for you. It abstracts away a lot of the tedious work in finetuning LLMs, making it very simple to change your configuration by updating a YAML file.

Axolotl uses YAML files that control the model hyperparameters for finetuning, model quantization, dataset formats and many other settings. This workflow lets us change the YAML to suit our own needs instead of writing a bunch of code. In addition, since the LLMs are open-source, there are many more knobs to turn (e.g. special tokens, LoRA rank) that are usually absent from closed-source LLM finetuning.

In short, Axolotl is a package that lets you pass in your JSONL dataset file in a specific format; it then automatically structures your data as input for the LLM and lets you easily start your finetuning. The same goes for inference, after you have finetuned your model and wish to use it.

b. Modal

Modal is a cloud-hosted GPU platform that makes it easy to finetune your LLM, with Axolotl as an intermediary (helpful, but not a must).

We have all faced the timeouts when using Google Colab GPUs, and most of us don’t have spare 4090s just lying around.

Modal is a simple-to-use Python package that gives you cloud GPUs and Docker-like, code-defined environments without actually needing to use Docker. It also has good dashboards to track your experiments and to start and stop your applications easily. Think of it as a friendlier Google Cloud.
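To give you a feel for it, here is a minimal sketch of a Modal app; the app name, image and GPU type are illustrative assumptions, not taken from the finetuning repo:

import modal

app = modal.App("gpu-hello")

# Define the container image in code, Docker-style but without a Dockerfile.
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="A10G", image=image)
def hello_gpu() -> str:
    import torch
    # This body runs remotely on a cloud GPU provisioned by Modal.
    return torch.cuda.get_device_name(0)

@app.local_entrypoint()
def main():
    print(hello_gpu.remote())

Running this with “modal run” builds the image, spins up a GPU container in the cloud, executes the function there and streams the result back to your terminal.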


4. Prepare your own dataset

After defining your task, you will have a better sense of what your input and output data should look like. Using that knowledge as the prerequisite, restructure your dataset into appropriate input and output pairs. For most tasks, the typical data flow is that an input goes into the LLM and the LLM generates an output. Hence, most datasets contain many pairs of inputs and outputs.

Before jumping into finetuning our LLM, we should do some exploratory data analysis (EDA) first. This step would typically be more rigorous in more serious environments, but since this blogpost focuses on guiding people through the end-to-end flow of finetuning open-source LLMs, I won’t spend too much time on the EDA process.

Exploratory Data Analysis (EDA)

For my first attempt, I am not aiming for perfect accuracy, so my EDA was only done to check whether the entire conversation flow makes sense. I scanned through the raw input and output messages for some sanity checks.

During the EDA process, I found some obvious yet interesting trends:

  • Trend 1: In each conversation turn, we humans typically send multiple messages consisting of short sentences instead of sending one gigantic paragraph of text.
  • Trend 2: A large proportion of the Telegram dataset consists of non-text entities (i.e., video links, GIFs, Telebubbles, pictures, stickers, etc.)

Trend 1: Multiple messages in one conversation turn

I realised that in order to stop the LLM from outputting one big chunk of text as its response, I needed to combine all messages from one user in a turn, grouping them together BUT still preserving some indication that they were separate messages.

My solution was to use double newlines (“\n\n”) as the separator. Then, in my post-processing step, I split the big chunk back into smaller messages on those newlines.

For example, three raw messages in one conversation turn:

	“hey”
	“you free later?”
	“thinking of grabbing dinner”

become a single training sample:

	“hey\n\nyou free later?\n\nthinking of grabbing dinner”

(The messages here are illustrative, not from the actual dataset.)
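A minimal sketch of this join/split logic (the function names are my own, not from any library):

def merge_turn(messages: list[str]) -> str:
    # Join one user's messages in a single turn, preserving message boundaries.
    return "\n\n".join(messages)

def split_response(generated: str) -> list[str]:
    # Post-processing: split the LLM's output back into separate messages.
    return [part.strip() for part in generated.split("\n\n") if part.strip()]

assert split_response(merge_turn(["hey", "you free later?"])) == ["hey", "you free later?"]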

Trend 2: Non-text entities contain non-trivial information

Initially I tried focusing my efforts on finetuning with text-only messages, but the resulting conversation flow did not make sense. This showed that the non-text entities actually provide a lot of contextual information. For simplicity’s sake, I grouped all these non-text entities together and replaced each one with the label “NON-TEXT” as an indication that a non-text entity was present.

With text only, the reply seems to come out of nowhere:

	A: “HAHAHA that’s so accurate”

With the label kept in, the reaction makes sense (here B had sent a sticker; the exchange is illustrative):

	B: “NON-TEXT”
	A: “HAHAHA that’s so accurate”
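The replacement itself is trivial. A sketch, assuming each raw record has an optional “text” field that is empty for stickers, GIFs and other media:

def label_non_text(message: dict) -> str:
    # Keep real text; replace any non-text entity with the placeholder label.
    text = message.get("text", "")
    return text if text else "NON-TEXT"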

Overall thoughts

All these solutions could be seen as half-baked, but they work for now, so let us run with them. Any improvements can be made in future iterations. One potential idea to improve performance is to replace the emojis and stickers with their matching expressions in natural language. Another solution (one that does not modify the raw dataset) is to include an additional “context” section in each sample containing the prior conversation history for that day. This could help improve the relevancy of the LLM-generated responses.
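That second idea could look something like this sketch (the “context” field and the helper name are hypothetical):

def add_context(sample: dict, history: list[str]) -> dict:
    # Attach the day's prior conversation without touching the raw input/output pair.
    return {**sample, "context": "\n".join(history)}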


5. Using Modal and Axolotl for LLM finetuning

Choosing a dataset format in Axolotl

There are many, many different Axolotl dataset formats to choose from. The choice usually depends on the dataset you are working with and your defined task.

Here are some examples:

  • Alpaca
{"instruction": "...", "input": "...", "output": "..."}
  • ShareGPT
{"conversations": [{"from": "...", "value": "..."}]}
  • SummarizeTLDR
{"article": "...", "summary": "..."}

The whole idea of using a specific dataset type is to allow for structured context and role-based inputs and outputs. Logically speaking, we would expect a conversational chatbot LLM and a news article summarizer LLM to have different datasets for their finetuning. Dataset types also help us steer the LLM’s behaviour through prompting. More information on instruction fine-tuning will follow in a future blogpost.
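As a concrete illustration of how a dataset type turns structured fields into a single prompt, here is a sketch that renders one Alpaca-style sample; the [INST] template is an illustrative choice, and the exact prompt Axolotl builds depends on your config:

def render_alpaca(sample: dict) -> str:
    # Combine the structured fields into one training prompt.
    return (
        "[INST] " + sample["instruction"] + "\n"
        + sample["input"] + " [/INST]\n"
        + sample["output"]
    )

print(render_alpaca({"instruction": "Summarize the input.", "input": "A long article ...", "output": "A short summary."}))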

Dataset format must match YML config (be careful!)

The dataset format in your JSONL file and the dataset type in your YML config file must match, to prevent dataset creation errors. Here is an Alpaca format example:

  • Alpaca dataset example (jsonl file)
{"instruction": "...", "input": "...", "output": "..."}
  • Config file example (yml file)
  - path: data.jsonl
    ds_type: json
    type:
      # JSONL file contains instruction, input, output fields per line.
      # These get mapped to axolotl's instruction, input, output tags.
      field_instruction: instruction
      field_input: input
      field_output: output
      # Format is used by axolotl to generate the prompt.
      format: |-
        [INST] Given an input text message, respond with an appropriate
        text message.
        {input}
        {instruction} [/INST]

The three field_* values should match the three respective keys in the JSONL file. For example, the “summarizetldr” dataset format contains two keys, “article” and “summary”. Make sure those two keys are explicitly mentioned underneath your dataset type in your config file, so that Axolotl can create your dataset without running into errors.
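Before launching a training job, it can be worth sanity-checking this match yourself. A small sketch (the required key set here assumes the Alpaca example above):

import json

REQUIRED_KEYS = {"instruction", "input", "output"}  # must mirror the field_* entries in the YML

with open("data.jsonl") as f:
    for i, line in enumerate(f, start=1):
        missing = REQUIRED_KEYS - json.loads(line).keys()
        if missing:
            print(f"line {i} is missing keys: {missing}")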

As mentioned earlier, Modal is a cloud GPU provider that works similarly to Docker and bundles a simple end-to-end package for practitioners to finetune and serve LLMs in the cloud. You can try other cloud providers like JarvisLabs and Google Colab, but for this example I’ll be using Modal, since it’s relatively straightforward.

Use modal to finetune LLM

Here are the basic steps (as seen in their GitHub example at the time of writing, Jun ’24) on how to finetune an LLM:

  1. Clone the Modal LLM finetuning GitHub repo
git clone https://github.com/modal-labs/llm-finetuning.git
cd llm-finetuning
  2. Launch the finetuning job
export ALLOW_WANDB=true  # if you're using Weights & Biases
modal run --detach src.train --config=config/mistral-memorize.yml --data=data/sqlqa.subsample.jsonl

Finetuning in progress

Here are some pointers to know that you are on the right track:

  • loss values start getting printed out during training
  • the live usage cost in your Modal dashboard starts increasing lol

Potential bugs

  1. Hugging Face (HF) authentication error

    • Despite adding the Hugging Face API key to my Modal workspace secrets, I still ran into authentication errors when trying to download models. I resolved it by importing modal in my “src.train” script and adding an extra Modal secrets flag to my function decorators (the train and launch functions):
    import modal
    ...
    @app.function(
        image=axolotl_image,
        gpu=SINGLE_GPU_CONFIG,
        volumes=VOLUME_CONFIG,
        timeout=24 * HOURS,
        # The secret name must match the one saved in your Modal workspace.
        secrets=[modal.Secret.from_name("HF_TOKEN")],
        _allow_background_volume_commits=True,
    )
    def train(...)
    ...
    
  2. Multi-GPU and DeepSpeed issues

  • I ran into problems when doing multi-GPU finetuning with DeepSpeed.
  • For my personal task, I didn’t need multiple GPUs or DeepSpeed for LLM finetuning, so I switched to a single-GPU config and left the DeepSpeed parameter blank in my YML config file.
  • I also changed the decorators to use “SINGLE_GPU_CONFIG” in their “gpu” flag, to force the training process to utilize only a single GPU.