[[{“value”:”

In the rapidly evolving world of AI, the ability to customize language models for specific industries has become more important. Although large language models (LLMs) are adept at handling a wide range of tasks with natural language, they excel at general purpose tasks as compared with specialized tasks. This can create challenges when processing text data from highly specialized domains with their own distinct terminology or specialized tasks where intrinsic knowledge of the LLM is not well-suited for solutions such as Retrieval Augmented Generation (RAG).

For instance, in the automotive industry, users might not always provide specific diagnostic trouble codes (DTCs), which are often proprietary to each manufacturer. These codes, such as P0300 for a generic engine misfire or C1201 for an ABS system fault, are crucial for precise diagnosis. Without these specific codes, a general purpose LLM might struggle to provide accurate information. This lack of specificity can lead to hallucinations in the generated responses, where the model invents plausible but incorrect diagnoses, or sometimes result in no answers at all. For example, if a user simply describes “engine running rough” without providing the specific DTC, a general LLM might suggest a wide range of potential issues, some of which may be irrelevant to the actual problem, or fail to provide any meaningful diagnosis due to insufficient context. Similarly, in tasks like code generation and suggestions through chat-based applications, users might not specify the APIs they want to use. Instead, they often request help in resolving a general issue or in generating code that utilizes proprietary APIs and SDKs.

Moreover, generative AI applications for consumers can offer valuable insights into the types of interactions from end-users. With appropriate feedback mechanisms, these applications can also gather important data to continuously improve the behavior and responses generated by these models.

For these reasons, there is a growing trend in the adoption and customization of small language models (SLMs). SLMs are compact transformer models, primarily utilizing decoder-only or encoder-decoder architectures, typically with parameters ranging from 1–8 billion. They are generally more efficient and cost-effective to train and deploy compared to LLMs, and are highly effective when fine-tuned for specific domains or tasks. SLMs offer faster inference times, lower resource requirements, and are suitable for deployment on a wider range of devices, making them particularly valuable for specialized applications and edge computing scenarios. Additionally, more efficient techniques for customizing both LLMs and SLMs, such as Low Rank Adaptation (LoRA), are making these capabilities increasingly accessible to a broader range of customers.

AWS offers a wide range of solutions for interacting with language models. Amazon Bedrock is a fully managed service that offers foundation models (FMs) from Amazon and other AI companies to help you build generative AI applications and host customized models. Amazon SageMaker is a comprehensive, fully managed machine learning (ML) service to build, train, and deploy LLMs and other FMs at scale. You can fine-tune and deploy models with Amazon SageMaker JumpStart or directly through Hugging Face containers.

In this post, we guide you through the phases of customizing SLMs on AWS, with a specific focus on automotive terminology for diagnostics as a Q&A task. We begin with the data analysis phase and progress through the end-to-end process, covering fine-tuning, deployment, and evaluation. We compare a customized SLM with a general purpose LLM, using various metrics to assess vocabulary richness and overall accuracy. We provide a clear understanding of customizing language models specific to the automotive domain and its benefits. Although this post focuses on the automotive domain, the approaches are applicable to other domains. You can find the source code for the post in the associated Github repository.

Solution overview

This solution uses multiple features of SageMaker and Amazon Bedrock, and can be divided into four main steps:

Data analysis and preparation – In this step, we assess the available data, understand how it can be used to develop solution, select data for fine-tuning, and identify required data preparation steps. We use Amazon SageMaker Studio, a comprehensive web-based integrated development environment (IDE) designed to facilitate all aspects of ML development. We also employ SageMaker jobs to access more computational power on-demand, thanks to the SageMaker Python SDK.
Model fine-tuning – In this step, we prepare prompt templates for fine-tuning SLM. For this post, we use Meta Llama3.1 8B Instruct from Hugging Face as the SLM. We run our fine-tuning script directly from the SageMaker Studio JupyterLab environment. We use the @remote decorator feature of the SageMaker Python SDK to launch a remote training job. The fine-tuning script uses LoRA, distributing compute across all available GPUs on a single instance.
Model deployment – When the fine-tuning job is complete and the model is ready, we have two deployment options:
- Deploy in SageMaker by selecting the best instance and container options available.
- Deploy in Amazon Bedrock by importing the fine-tuned model for on-demand use.
Model evaluation – In this final step, we evaluate the fine-tuned model against a similar base model and a larger model available from Amazon Bedrock. Our evaluation focuses on how well the model uses specific terminology for the automotive space, as well as the improvements provided by fine-tuning in generating answers.

The following diagram illustrates the solution architecture.

Using the Automotive_NER dataset

The Automotive_NER dataset, available on the Hugging Face platform, is designed for named entity recognition (NER) tasks specific to the automotive domain. This dataset is specifically curated to help identify and classify various entities related to the automotive industry and uses domain-specific terminologies.

The dataset contains approximately 256,000 rows; each row contains annotated text data with entities related to the automotive domain, such as car brands, models, component, description of defects, consequences, and corrective actions. The terminology used to describe defects, reference to components, or error codes reported is a standard for the automotive industry. The fine-tuning process enables the language model to learn the domain terminologies better and helps improve the vocabulary used in the generation of answers and overall accuracy for the generated answers.

The following table is an example of rows contained in the dataset.

1	COMPNAME	DESC_DEFECT	CONEQUENCE_DEFECT	CORRECTIVE_ACTION
2	ELECTRICAL SYSTEM:12V/24V/48V BATTERY:CABLES	CERTAIN PASSENGER VEHICLES EQUIPPED WITH ZETEC ENGINES, LOOSE OR BROKEN ATTACHMENTS AND MISROUTED BATTERY CABLES COULD LEAD TO CABLE INSULATION DAMAGE.	THIS, IN TURN, COULD CAUSE THE BATTERY CABLES TO SHORT RESULTING IN HEAT DAMAGE TO THE CABLES. BESIDES HEAT DAMAGE, THE “CHECK ENGINE” LIGHT MAY ILLUMINATE, THE VEHICLE MAY FAIL TO START, OR SMOKE, MELTING, OR FIRE COULD ALSO OCCUR.	DEALERS WILL INSPECT THE BATTERY CABLES FOR THE CONDITION OF THE CABLE INSULATION AND PROPER TIGHTENING OF THE TERMINAL ENDS. AS NECESSARY, CABLES WILL BE REROUTED, RETAINING CLIPS INSTALLED, AND DAMAGED BATTERY CABLES REPLACED. OWNER NOTIFICATION BEGAN FEBRUARY 10, 2003. OWNERS WHO DO NOT RECEIVE THE FREE REMEDY WITHIN A REASONABLE TIME SHOULD CONTACT FORD AT 1-866-436-7332.
3	ELECTRICAL SYSTEM:12V/24V/48V BATTERY:CABLES	CERTAIN PASSENGER VEHICLES EQUIPPED WITH ZETEC ENGINES, LOOSE OR BROKEN ATTACHMENTS AND MISROUTED BATTERY CABLES COULD LEAD TO CABLE INSULATION DAMAGE.	THIS, IN TURN, COULD CAUSE THE BATTERY CABLES TO SHORT RESULTING IN HEAT DAMAGE TO THE CABLES. BESIDES HEAT DAMAGE, THE “CHECK ENGINE” LIGHT MAY ILLUMINATE, THE VEHICLE MAY FAIL TO START, OR SMOKE, MELTING, OR FIRE COULD ALSO OCCUR.	DEALERS WILL INSPECT THE BATTERY CABLES FOR THE CONDITION OF THE CABLE INSULATION AND PROPER TIGHTENING OF THE TERMINAL ENDS. AS NECESSARY, CABLES WILL BE REROUTED, RETAINING CLIPS INSTALLED, AND DAMAGED BATTERY CABLES REPLACED. OWNER NOTIFICATION BEGAN FEBRUARY 10, 2003. OWNERS WHO DO NOT RECEIVE THE FREE REMEDY WITHIN A REASONABLE TIME SHOULD CONTACT FORD AT 1-866-436-7332.
4	EQUIPMENT:OTHER:LABELS	ON CERTAIN FOLDING TENT CAMPERS, THE FEDERAL CERTIFICATION (AND RVIA) LABELS HAVE THE INCORRECT GROSS VEHICLE WEIGHT RATING, TIRE SIZE, AND INFLATION PRESSURE LISTED.	IF THE TIRES WERE INFLATED TO 80 PSI, THEY COULD BLOW RESULTING IN A POSSIBLE CRASH.	OWNERS WILL BE MAILED CORRECT LABELS FOR INSTALLATION ON THEIR VEHICLES. OWNER NOTIFICATION BEGAN SEPTEMBER 23, 2002. OWNERS SHOULD CONTACT JAYCO AT 1-877-825-4782.
5	STRUCTURE	ON CERTAIN CLASS A MOTOR HOMES, THE FLOOR TRUSS NETWORK SUPPORT SYSTEM HAS A POTENTIAL TO WEAKEN CAUSING INTERNAL AND EXTERNAL FEATURES TO BECOME MISALIGNED. THE AFFECTED VEHICLES ARE 1999 – 2003 CLASS A MOTOR HOMES MANUFACTURED ON F53 20,500 POUND GROSS VEHICLE WEIGHT RATING (GVWR), FORD CHASSIS, AND 2000-2003 CLASS A MOTOR HOMES MANUFACTURED ON W-22 22,000 POUND GVWR, WORKHORSE CHASSIS.	CONDITIONS CAN RESULT IN THE BOTTOMING OUT THE SUSPENSION AND AMPLIFICATION OF THE STRESS PLACED ON THE FLOOR TRUSS NETWORK. THE ADDITIONAL STRESS CAN RESULT IN THE FRACTURE OF WELDS SECURING THE FLOOR TRUSS NETWORK SYSTEM TO THE CHASSIS FRAME RAIL AND/OR FRACTURE OF THE FLOOR TRUSS NETWORK SUPPORT SYSTEM. THE POSSIBILITY EXISTS THAT THERE COULD BE DAMAGE TO ELECTRICAL WIRING AND/OR FUEL LINES WHICH COULD POTENTIALLY LEAD TO A FIRE.	DEALERS WILL INSPECT THE FLOOR TRUSS NETWORK SUPPORT SYSTEM, REINFORCE THE EXISTING STRUCTURE, AND REPAIR, AS NEEDED, THE FLOOR TRUSS NETWORK SUPPORT. OWNER NOTIFICATION BEGAN NOVEMBER 5, 2002. OWNERS SHOULD CONTACT MONACO AT 1-800-685-6545.
6	STRUCTURE	ON CERTAIN CLASS A MOTOR HOMES, THE FLOOR TRUSS NETWORK SUPPORT SYSTEM HAS A POTENTIAL TO WEAKEN CAUSING INTERNAL AND EXTERNAL FEATURES TO BECOME MISALIGNED. THE AFFECTED VEHICLES ARE 1999 – 2003 CLASS A MOTOR HOMES MANUFACTURED ON F53 20,500 POUND GROSS VEHICLE WEIGHT RATING (GVWR), FORD CHASSIS, AND 2000-2003 CLASS A MOTOR HOMES MANUFACTURED ON W-22 22,000 POUND GVWR, WORKHORSE CHASSIS.	CONDITIONS CAN RESULT IN THE BOTTOMING OUT THE SUSPENSION AND AMPLIFICATION OF THE STRESS PLACED ON THE FLOOR TRUSS NETWORK. THE ADDITIONAL STRESS CAN RESULT IN THE FRACTURE OF WELDS SECURING THE FLOOR TRUSS NETWORK SYSTEM TO THE CHASSIS FRAME RAIL AND/OR FRACTURE OF THE FLOOR TRUSS NETWORK SUPPORT SYSTEM. THE POSSIBILITY EXISTS THAT THERE COULD BE DAMAGE TO ELECTRICAL WIRING AND/OR FUEL LINES WHICH COULD POTENTIALLY LEAD TO A FIRE.	DEALERS WILL INSPECT THE FLOOR TRUSS NETWORK SUPPORT SYSTEM, REINFORCE THE EXISTING STRUCTURE, AND REPAIR, AS NEEDED, THE FLOOR TRUSS NETWORK SUPPORT. OWNER NOTIFICATION BEGAN NOVEMBER 5, 2002. OWNERS SHOULD CONTACT MONACO AT 1-800-685-6545.

Data analysis and preparation on SageMaker Studio

When you’re fine-tuning LLMs, the quality and composition of your training data are crucial (quality over quantity). For this post, we implemented a sophisticated method to select 6,000 rows out of 256,000. This method uses TF-IDF vectorization to identify the most significant and the rarest words in the dataset. By selecting rows containing these words, we maintained a balanced representation of common patterns and edge cases. This improves computational efficiency and creates a high-quality, diverse subset leading to effective model training.

The first step is to open a JupyterLab application previously created in our SageMaker Studio domain.

After you clone the git repository, install the required libraries and dependencies:

pip install -r requirements.txt

The next step is to read the dataset:

from datasets import load_dataset
import pandas as pd

dataset = load_dataset("sp01/Automotive_NER")
df = pd.DataFrame(dataset['train'])

The first step of our data preparation activity is to analyze the importance of the words in our dataset, for identifying both the most important (frequent and distinctive) words and the rarest words in the dataset, by using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization.

Given the dataset’s size, we decided to run the fine-tuning job using Amazon SageMaker Training.

By using the @remote function capability of the SageMaker Python SDK, we can run our code into a remote job with ease.

In our case, the TF-IDF vectorization and the extraction of the top words and bottom words are performed in a SageMaker training job directly from our notebook, without any code changes, by simply adding the @remote decorator on top of our function. You can define the configurations required by the SageMaker training job, such as dependencies and training image, in a config.yaml file. For more details on the settings supported by the config file, see Using the SageMaker Python SDK

See the following code:

SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        Dependencies: ./requirements.txt
        ImageUri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.4-gpu-py311'
        InstanceType: ml.g5.12xlarge
        PreExecutionCommands:
          - 'export NCCL_P2P_DISABLE=1'
  Model:
    EnableNetworkIsolation: false

Next step is to define and execute our processing function:

import numpy as np
import re
from sagemaker.remote_function import remote
from sklearn.feature_extraction.text import TfidfVectorizer
import string

@remote(volume_size=10, job_name_prefix=f"preprocess-auto-ner-auto-merge", instance_type="ml.m4.10xlarge")
def preprocess(df,
               top_n=6000,
               bottom_n=6000
    ):
    # Download nltk stopwords
    import nltk
    nltk.download('stopwords')
    from nltk.corpus import stopwords

    # Define a function to preprocess text
    def preprocess_text(text):
        if not isinstance(text, str):
            # Return an empty string or handle the non-string value as needed
            return ''
    
        # Remove punctuation
        text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    
        # Convert to lowercase
        text = text.lower()
    
        # Remove stop words (optional)
        stop_words = set(stopwords.words('english'))
        text = ' '.join([word for word in text.split() if word not in stop_words])
    
        return text
    
    print("Applying text preprocessing")
    
    # Preprocess the text columns
    df['DESC_DEFECT'] = df['DESC_DEFECT'].apply(preprocess_text)
    df['CONEQUENCE_DEFECT'] = df['CONEQUENCE_DEFECT'].apply(preprocess_text)
    df['CORRECTIVE_ACTION'] = df['CORRECTIVE_ACTION'].apply(preprocess_text)
    
    # Create a TfidfVectorizer object
    tfidf_vectorizer = TfidfVectorizer()

    print("Compute TF-IDF")
    
    # Fit and transform the text data
    X_tfidf = tfidf_vectorizer.fit_transform(df['DESC_DEFECT'] + ' ' + df['CONEQUENCE_DEFECT'] + ' ' + df['CORRECTIVE_ACTION'])
    
    # Get the feature names (words)
    feature_names = tfidf_vectorizer.get_feature_names_out()
    
    # Get the TF-IDF scores
    tfidf_scores = X_tfidf.toarray()
    
    top_word_indices = np.argsort(tfidf_scores.sum(axis=0))[-top_n:]
    bottom_word_indices = np.argsort(tfidf_scores.sum(axis=0))[:bottom_n]

    print("Extracting top and bottom words")
    
    # Get the top and bottom words
    top_words = [feature_names[i] for i in top_word_indices]
    bottom_words = [feature_names[i] for i in bottom_word_indices]

    return top_words, bottom_words

top_words, bottom_words = preprocess(df)

After we extract the top and bottom 6,000 words based on their TF-IDF scores from our original dataset, we classify each row in the dataset based on whether it contained any of these important or rare words. Rows are labeled as ‘top’ if they contained important words, ‘bottom’ if they contained rare words, or ‘neither’ if they don’t contain either:

# Create a function to check if a row contains important or rare words
def contains_important_or_rare_words(row):
    try:
        if ("DESC_DEFECT" in row.keys() and row["DESC_DEFECT"] is not None and
            "CONEQUENCE_DEFECT" in row.keys() and row["CONEQUENCE_DEFECT"] is not None and
            "CORRECTIVE_ACTION" in row.keys() and row["CORRECTIVE_ACTION"] is not None):
            text = row['DESC_DEFECT'] + ' ' + row['CONEQUENCE_DEFECT'] + ' ' + row['CORRECTIVE_ACTION']
        
            text_words = set(text.split())
        
            # Check if the row contains any important words (top_words)
            for word in top_words:
                if word in text_words:
                    return 'top'
        
            # Check if the row contains any rare words (bottom_words)
            for word in bottom_words:
                if word in text_words:
                    return 'bottom'
        
            return 'neither'
        else:
            return 'none'
    except Exception as e:
        raise e

df['word_type'] = df.apply(contains_important_or_rare_words, axis=1)

Finally, we create a balanced subset of the dataset by selecting all rows containing important words (‘top’) and an equal number of rows containing rare words (‘bottom’). If there aren’t enough ‘bottom’ rows, we filled the remaining slots with ‘neither’ rows.

	DESC_DEFECT	CONEQUENCE_DEFECT	CORRECTIVE_ACTION	word_type
2	ON CERTAIN FOLDING TENT CAMPERS, THE FEDERAL C…	IF THE TIRES WERE INFLATED TO 80 PSI, THEY COU…	OWNERS WILL BE MAILED CORRECT LABELS FOR INSTA…	top
2402	CERTAIN PASSENGER VEHICLES EQUIPPED WITH DUNLO…	THIS COULD RESULT IN PREMATURE TIRE WEAR.	DEALERS WILL INSPECT AND IF NECESSARY REPLACE …	bottom
0	CERTAIN PASSENGER VEHICLES EQUIPPED WITH ZETEC…	THIS, IN TURN, COULD CAUSE THE BATTERY CABLES …	DEALERS WILL INSPECT THE BATTERY CABLES FOR TH…	neither

Finally, we randomly sampled 6,000 rows from this balanced set:

# Select all rows from each group
top_rows = df[df['word_type'] == 'top']
bottom_rows = df[df['word_type'] == 'bottom']

# Combine the two groups, ensuring a balanced dataset
if len(bottom_rows) > 0:
    df = pd.concat([top_rows, bottom_rows.sample(n=len(bottom_rows), random_state=42)], ignore_index=True)
else:
    df = top_rows.copy()

# If the combined dataset has fewer than 6010 rows, fill with remaining rows
if len(df) < 6000:
    remaining_rows = df[df['word_type'] == 'neither'].sample(n=6010 - len(df), random_state=42)
    df = pd.concat([df, remaining_rows], ignore_index=True)

df = df.sample(n=6000, random_state=42)

Fine-tuning Meta Llama 3.1 8B with a SageMaker training job

After selecting the data, we need to prepare the resulting dataset for the fine-tuning activity. By examining the columns, we aim to adapt the model for two different tasks:

The following code is for the first prompt:

# User: 
{MFGNAME}
{COMPNAME}
{DESC_DEFECT}
# AI: 
{CONEQUENCE_DEFECT}

With this prompt, we instruct the model to highlight the possible consequences of a defect, given the manufacturer, component name, and description of the defect.

The following code is for the second prompt:

# User:
{MFGNAME}
{COMPNAME}
{DESC_DEFECT}
# AI: 
{CORRECTIVE_ACTION}

With this second prompt, we instruct the model to suggest possible corrective actions for a given defect and component of a specific manufacturer.

First, let’s split the dataset into train, test, and validation subsets:

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.1, random_state=42)
train, valid = train_test_split(train, test_size=10, random_state=42)

Next, we create prompt templates to convert each row item into the two prompt formats previously described:

from random import randint

# template dataset to add prompt to each sample
def template_dataset_consequence(sample):
    # custom instruct prompt start
    prompt_template = f"""
    <|begin_of_text|><|start_header_id|>user<|end_header_id|>
    These are the information related to the defect    

    Manufacturer: {{mfg_name}}
    Component: {{comp_name}}
    Description of a defect:
    {{desc_defect}}
    
    What are the consequences of defect?
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    {{consequence_defect}}
    <|end_of_text|><|eot_id|>
    """
    sample["text"] = prompt_template.format(
        mfg_name=sample["MFGNAME"],
        comp_name=sample["COMPNAME"],
        desc_defect=sample["DESC_DEFECT"].lower(),
        consequence_defect=sample["CONEQUENCE_DEFECT"].lower())
    return sample

from random import randint

# template dataset to add prompt to each sample
def template_dataset_corrective_action(sample):
    # custom instruct prompt start
    prompt_template = f"""
    <|begin_of_text|><|start_header_id|>user<|end_header_id|>
    Manufacturer: {{mfg_name}}
    Component: {{comp_name}}
    
    Description of a defect:
    {{desc_defect}}
    
    What are the possible corrective actions?
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    {{corrective_action}}
    <|end_of_text|><|eot_id|>
    """
    sample["text"] = prompt_template.format(
        mfg_name=sample["MFGNAME"],
        comp_name=sample["COMPNAME"],
        desc_defect=sample["DESC_DEFECT"].lower(),
        corrective_action=sample["CORRECTIVE_ACTION"].lower())
    return sample

Now we can apply the template functions template_dataset_consequence and template_dataset_corrective_action to our datasets:

As a final step, we concatenate the four resulting datasets for train and test:

Our final training dataset comprises approximately 12,000 elements, properly split into about 11,000 for training and 1,000 for testing.

Now we can prepare the training script and define the training function train_fn and put the @remote decorator on the function.

The training function does the following:

Tokenizes and chunks the dataset
Sets up BitsAndBytesConfig, for model quantization, which specifies the model should be loaded in 4-bit
Uses mixed precision for the computation, by converting model parameters to bfloat16
Loads the model
Creates LoRA configurations that specify ranking of update matrices (r), scaling factor (lora_alpha), the modules to apply the LoRA update matrices (target_modules), dropout probability for Lora layers (lora_dropout), task_type, and more
Starts the training and evaluation

Because we want to distribute the training across all the available GPUs in our instance, by using PyTorch Distributed Data Parallel (DDP), we use the Hugging Face Accelerate library that enables us to run the same PyTorch code across distributed configurations.

For optimizing memory resources, we have decided to run a mixed precision training:

from accelerate import Accelerator
from huggingface_hub import login
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from sagemaker.remote_function import remote

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, set_seed
import transformers

# Start training
@remote(
    keep_alive_period_in_seconds=0,
    volume_size=100, job_name_prefix=f"train-{model_id.split('/')[-1].replace('.', '-')}-auto",
    use_torchrun=True,
    nproc_per_node=4)

def train_fn(
        model_name,
        train_ds,
        test_ds=None,
        lora_r=8,
        lora_alpha=16,
        lora_dropout=0.1,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        gradient_accumulation_steps=1,
        learning_rate=2e-4,
        num_train_epochs=1,
        fsdp="",
        fsdp_config=None,
        gradient_checkpointing=False,
        merge_weights=False,
        seed=42,
        token=None
):

    set_seed(seed)
    accelerator = Accelerator()
    if token is not None:
        login(token=token)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Set Tokenizer pad Token
    tokenizer.pad_token = tokenizer.eos_token
    with accelerator.main_process_first():

        # tokenize and chunk dataset
        lm_train_dataset = train_ds.map(
            lambda sample: tokenizer(sample["text"]), remove_columns=list(train_ds.features)
        )

        print(f"Total number of train samples: {len(lm_train_dataset)}")

        if test_ds is not None:

            lm_test_dataset = test_ds.map(
                lambda sample: tokenizer(sample["text"]), remove_columns=list(test_ds.features)
            )

            print(f"Total number of test samples: {len(lm_test_dataset)}")
        else:
            lm_test_dataset = None
      
    torch_dtype = torch.bfloat16

    # Defining additional configs for FSDP
    if fsdp != "" and fsdp_config is not None:
        bnb_config_params = {
            "bnb_4bit_quant_storage": torch_dtype
        }

        model_configs = {
            "torch_dtype": torch_dtype
        }

        fsdp_configurations = {
            "fsdp": fsdp,
            "fsdp_config": fsdp_config,
            "gradient_checkpointing_kwargs": {
                "use_reentrant": False
            },
            "tf32": True
        }

    else:
        bnb_config_params = dict()
        model_configs = dict()
        fsdp_configurations = dict()
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch_dtype,
        **bnb_config_params
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        quantization_config=bnb_config,
        attn_implementation="flash_attention_2",
        use_cache=not gradient_checkpointing,
        cache_dir="/tmp/.cache",
        **model_configs
    )

    if fsdp == "" and fsdp_config is None:
        model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=gradient_checkpointing)

    if gradient_checkpointing:
        model.gradient_checkpointing_enable()

    config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules="all-linear",
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM"
    )

    model = get_peft_model(model, config)
    print_trainable_parameters(model)

    trainer = transformers.Trainer(
        model=model,
        train_dataset=lm_train_dataset,
        eval_dataset=lm_test_dataset if lm_test_dataset is not None else None,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=per_device_train_batch_size,
            per_device_eval_batch_size=per_device_eval_batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            gradient_checkpointing=gradient_checkpointing,
            logging_strategy="steps",
            logging_steps=1,
            log_on_each_node=False,
            num_train_epochs=num_train_epochs,
            learning_rate=learning_rate,
            bf16=True,
            ddp_find_unused_parameters=False,
            save_strategy="no",
            output_dir="outputs",
            **fsdp_configurations
        ),

        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )

    trainer.train()

    if trainer.is_fsdp_enabled:
        trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")

    if merge_weights:
        output_dir = "/tmp/model"
        # merge adapter weights with base model and save
        # save int 4 model
        trainer.model.save_pretrained(output_dir, safe_serialization=False)
      
        if accelerator.is_main_process:
            # clear memory
            del model
            del trainer
            torch.cuda.empty_cache()

            # load PEFT model
            model = AutoPeftModelForCausalLM.from_pretrained(
                output_dir,
                torch_dtype=torch.float16,
                low_cpu_mem_usage=True,
                trust_remote_code=True,
            ) 

            # Merge LoRA and base model and save
            model = model.merge_and_unload()
            model.save_pretrained(
                "/opt/ml/model", safe_serialization=True, max_shard_size="2GB"
            )

    else:
        trainer.model.save_pretrained("/opt/ml/model", safe_serialization=True)

    if accelerator.is_main_process:
        tokenizer.save_pretrained("/opt/ml/model")

We can specify to run a distributed job in the @remote function through the parameters use_torchrun and nproc_per_node, which indicates if the SageMaker job should use as entrypoint torchrun and the number of GPUs to use. You can pass optional parameters like volume_size, subnets, and security_group_ids using the @remote decorator.

Finally, we run the job by invoking train_fn():

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

train_fn(
    model_id,
    train_ds=train_dataset,
    test_ds=test_dataset,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    num_train_epochs=1,
    merge_weights=True,
    token="<HF_TOKEN>"
)

The training job runs on the SageMaker training cluster. The training job took about 42 minutes, by distributing the computation across the 4 available GPUs on the selected instance type ml.g5.12xlarge.

We choose to merge the LoRA adapter with the base model. This decision was made during the training process by setting the merge_weights parameter to True in our train_fn() function. Merging the weights provides us with a single, cohesive model that incorporates both the base knowledge and the domain-specific adaptations we’ve made through fine-tuning.

By merging the model, we gain flexibility in our deployment options.

Model deployment

When deploying a fine-tuned model on AWS, multiple deployment strategies are available. In this post, we explore two deployment methods:

SageMaker real-time inference – This option is designed for having full control of the inference resources. We can use a set of available instances and deployment options for hosting our model. By using the SageMaker built-in containers, such as DJL Serving or Hugging Face TGI, we can use the inference script and the optimization options provided in the container.
Amazon Bedrock Custom Model Import – This option is designed for importing and deploying custom language models. We can use this fully managed capability for interacting with the deployed model with on-demand throughput.

Model deployment with SageMaker real-time inference

SageMaker real-time inference is designed for having full control over the inference resources. It allows you to use a set of available instances and deployment options for hosting your model. By using the SageMaker built-in container Hugging Face Text Generation Inference (TGI), you can take advantage of the inference script and optimization options available in the container.

In this post, we deploy the fine-tuned model to a SageMaker endpoint for running inference, which will be used for evaluating the model in the next step.

We create the HuggingFaceModel object, which is a high-level SageMaker model class for working with Hugging Face models. The image_uri parameter specifies the container image URI for the model, and model_data points to the Amazon Simple Storage Service (Amazon S3) location containing the model artifact (automatically uploaded by the SageMaker training job). We also specify a set of environment variables to configure the number of GPUs (SM_NUM_GPUS), quantization methodology (QUANTIZE), and maximum input and total token lengths (MAX_INPUT_LENGTH and MAX_TOTAL_TOKENS).

model = HuggingFaceModel(
    image_uri=image_uri,
    model_data=f"s3://{bucket_name}/{job_name}/{job_name}/output/model.tar.gz",
    role=get_execution_role(),
    env={
        'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
        'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
        'QUANTIZE': 'bitsandbytes',
        'MAX_INPUT_LENGTH': '4096',
        'MAX_TOTAL_TOKENS': '8192'
    }
)

After creating the model object, we can deploy it to an endpoint using the deploy method. The initial_instance_count and instance_type parameters specify the number and type of instances to use for the endpoint. The container_startup_health_check_timeout and model_data_download_timeout parameters set the timeout values for the container startup health check and model data download, respectively.

predictor = model.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    model_data_download_timeout=3600
)

It takes a few minutes to deploy the model before it becomes available for inference and evaluation. The endpoint is invoked using the AWS SDK with the boto3 client for sagemaker-runtime, or directly by using the SageMaker Python SDK and the predictor previously created, by using the predict API.

body = {
        'inputs': prompt,
        'parameters': {
            "top_p": 0.9,
            "temperature": 0.2,
            "max_new_tokens": 512,
            "return_full_text": False,
            "stop": [
                '<|eot_id|>',
                '<|end_of_text|>'
            ]
        }
    }
response = predictor.predict(body)

Model deployment with Amazon Bedrock Custom Model Import

Amazon Bedrock Custom Model Import is a fully managed capability, currently in public preview, designed for importing and deploying custom language models. It allows you to interact with the deployed model both on-demand and by provisioning the throughput.

In this section, we use the Custom Model Import feature in Amazon Bedrock for deploying our fine-tuned model in the fully managed environment of Amazon Bedrock.

After defining the model and job_name variables, we import our model from the S3 bucket by supplying it in the Hugging Face weights format.

Next, we use a preexisting AWS Identity and Access Management (IAM) role that allows reading the binary file from Amazon S3 and create the import job resource in Amazon Bedrock for hosting our model.

It takes a few minutes to deploy the model, and it can be invoked using the AWS SDK with the boto3 client for bedrock-runtime by using the invoke_model API:

fine_tuned_model_id = “<MODEL_ARN>”

body = {
        "prompt": prompt,
        "temperature": 0.1,
        "top_p": 0.9,
    }

response = bedrock_client.invoke_model(
        modelId=fine_tuned_model_id,
        body=json.dumps(body)
)

Model evaluation

In this final step, we evaluate the fine-tuned model against the base models Meta Llama 3 8B Instruct and Meta Llama 3 70B Instruct on Amazon Bedrock. Our evaluation focuses on how well the model uses specific terminology for the automotive space and the improvements provided by fine-tuning in generating answers.

The fine-tuned model’s ability to understand components and error descriptions for diagnostics, as well as identify corrective actions and consequences in the generated answers, can be evaluated on two dimensions.

To evaluate the quality of the generated text and whether the vocabulary and terminology used are appropriate for the task and industry, we use the Bilingual Evaluation Understudy (BLEU) score. BLEU is an algorithm for evaluating the quality of text, by calculating n-gram overlap between the generated and the reference text.

To evaluate the accuracy of the generated text and see if the generated answer is similar to the expected one, we use the Normalized Levenshtein distance. This algorithm evaluates how close the calculated or measured values are to the actual value.

The evaluation dataset comprises 10 unseen examples of component diagnostics extracted from the original training dataset.

The prompt template for the evaluation is structured as follows:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Manufacturer: {row['MFGNAME']}
Component: {row['COMPNAME']}

Description of a defect:
{row['DESC_DEFECT']}

What are the consequences?
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

BLEU score evaluation with base Meta Llama 3 8B and 70B Instruct

The following table and figures show the calculated values for the BLEU score comparison (higher is better) with Meta Llama 3 8B and 70 B Instruct.

	Example	Fine-Tuned Score	Base Score: Meta Llama 3 8B	Base Score: Meta Llama 3 70B
1	2733	0. 2936	5.10E-155	4.85E-155
2	3382	0.1619	0.058	1.134E-78
3	1198	0.2338	1.144E-231	3.473E-155
4	2942	0.94854	2.622E-231	3.55E-155
5	5151	1.28E-155	0	0
6	2101	0.80345	1.34E-78	1.27E-78
7	5178	0.94854	0.045	3.66E-155
8	1595	0.40412	4.875E-155	0.1326
9	2313	0.94854	3.03E-155	9.10E-232
10	557	0.89315	8.66E-79	0.1954

By comparing the fine-tuned and base scores, we can assess the performance improvement (or degradation) achieved by fine-tuning the model in the vocabulary and terminology used.

The analysis suggests that for the analyzed cases, the fine-tuned model outperforms the base model in the vocabulary and terminology used in the generated answer. The fine-tuned model appears to be more consistent in its performance.

Normalized Levenshtein distance with base Meta Llama 3 8B Instruct

The following table and figures show the calculated values for the Normalized Levenshtein distance comparison with Meta Llama 3 8B and 70B Instruct.

	Example	Fine-tuned Score	Base Score – Llama 3 8B	Base Score – Llama 3 70B
1	2733	0.42198	0.29900	0.27226
2	3382	0.40322	0.25304	0.21717
3	1198	0.50617	0.26158	0.19320
4	2942	0.99328	0.18088	0.19420
5	5151	0.34286	0.01983	0.02163
6	2101	0.94309	0.25349	0.23206
7	5178	0.99107	0.14475	0.17613
8	1595	0.58182	0.19910	0.27317
9	2313	0.98519	0.21412	0.26956
10	557	0.98611	0.10877	0.32620

By comparing the fine-tuned and base scores, we can assess the performance improvement (or degradation) achieved by fine-tuning the model on the specific task or domain.

The analysis shows that the fine-tuned model clearly outperforms the base model across the selected examples, suggesting the fine-tuning process has been quite effective in improving the model’s accuracy and generalization in understanding the specific cause of the component defect and providing suggestions on the consequences.

In the evaluation analysis performed for both selected metrics, we can also highlight some areas for improvement:

Example repetition – Provide similar examples for further improvements in the vocabulary and generalization of the generated answer, increasing the accuracy of the fine-tuned model.
Evaluate different data processing techniques – In our example, we selected a subset of the original dataset by analyzing the frequency of words across the entire dataset, extracting the rows containing the most meaningful information and identifying outliers. Further curation of the dataset by properly cleaning and expanding the number of examples can increase the overall performance of the fine-tuned model.

Clean up

After you complete your training and evaluation experiments, clean up your resources to avoid unnecessary charges. If you deployed the model with SageMaker, you can delete the created real-time endpoints using the SageMaker console. Next, delete any unused SageMaker Studio resources. If you deployed the model with Amazon Bedrock Custom Model Import, you can delete the imported model using the Amazon Bedrock console.

Conclusion

This post demonstrated the process of customizing SLMs on AWS for domain-specific applications, focusing on automotive terminology for diagnostics. The provided steps and source code show how to analyze data, fine-tune models, deploy them efficiently, and evaluate their performance against larger base models using SageMaker and Amazon Bedrock. We further highlighted the benefits of customization by enhancing vocabulary within specialized domains.

You can evolve this solution further by implementing proper ML pipelines and LLMOps practices through Amazon SageMaker Pipelines. SageMaker Pipelines enables you to automate and streamline the end-to-end workflow, from data preparation to model deployment, enhancing reproducibility and efficiency. You can also improve the quality of training data using advanced data processing techniques. Additionally, using the Reinforcement Learning from Human Feedback (RLHF) approach can align the model response to human preferences. These enhancements can further elevate the performance of customized language models across various specialized domains. You can find the sample code discussed in this post on the GitHub repo.

About the authors

Bruno Pistone is a Senior Generative AI and ML Specialist Solutions Architect for AWS based in Milan. He works with large customers helping them to deeply understand their technical needs and design AI and Machine Learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His expertise include: Machine Learning end to end, Machine Learning Industrialization, and Generative AI. He enjoys spending time with his friends and exploring new places, as well as travelling to new destinations

Gopi Krishnamurthy is a Senior AI/ML Solutions Architect at Amazon Web Services based in New York City. He works with large Automotive and Industrial customers as their trusted advisor to transform their Machine Learning workloads and migrate to the cloud. His core interests include deep learning and serverless technologies. Outside of work, he likes to spend time with his family and explore a wide range of music.

“}]] In this post, we guide you through the phases of customizing SLMs on AWS, with a specific focus on automotive terminology for diagnostics as a Q&A task. We begin with the data analysis phase and progress through the end-to-end process, covering fine-tuning, deployment, and evaluation. We compare a customized SLM with a general purpose LLM, using various metrics to assess vocabulary richness and overall accuracy. Read More Advanced (300), Amazon Bedrock, Amazon SageMaker, Amazon SageMaker Studio, Generative AI, AI/ML

Customize small language models on AWS with automotive terminology Bruno Pistone AWS Machine Learning Blog