<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Nexus Notes: The Art of Multi-Agent Design]]></title><description><![CDATA[Your playbook for building with multi-agent AI. We dissect complex AI application designs, architectural patterns, training strategies, and deployment guides for technical stakeholders. Learn to build robust, intelligent, and autonomous systems.]]></description><link>https://www.nexusnotes.co</link><image><url>https://substackcdn.com/image/fetch/$s_!g7ea!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c1d6a0e-8acf-4086-8dd0-e09018927183_684x684.png</url><title>Nexus Notes: The Art of Multi-Agent Design</title><link>https://www.nexusnotes.co</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 11:48:30 GMT</lastBuildDate><atom:link href="https://www.nexusnotes.co/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Alfeo Sabay]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[alfeo@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[alfeo@substack.com]]></itunes:email><itunes:name><![CDATA[Alfeo Sabay]]></itunes:name></itunes:owner><itunes:author><![CDATA[Alfeo Sabay]]></itunes:author><googleplay:owner><![CDATA[alfeo@substack.com]]></googleplay:owner><googleplay:email><![CDATA[alfeo@substack.com]]></googleplay:email><googleplay:author><![CDATA[Alfeo Sabay]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[A Technical Guide to Multi-Agent AI]]></title><description><![CDATA[Building the Sentient Supply 
Chain]]></description><link>https://www.nexusnotes.co/p/a-technical-guide-to-multi-agent</link><guid isPermaLink="false">https://www.nexusnotes.co/p/a-technical-guide-to-multi-agent</guid><dc:creator><![CDATA[Alfeo Sabay]]></dc:creator><pubDate>Sat, 02 Aug 2025 19:00:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d23473a9-bcbc-4546-a51f-00fe7b8f3550_2048x2048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The world of supply chain management is undergoing a seismic shift. I&#8217;ve seen the &#8216;before&#8217; picture firsthand; years ago, I led an engineering team building a B2B Demand Planning Engine. While powerful for its time, it was a world of static, sequential processes that are simply no match for today's market volatility. The future now belongs to a new paradigm: a sentient, self-optimizing supply chain powered by a collaborative crew of AI agents. This isn't science fiction; it's the next evolution in enterprise automation, transforming the supply chain from a rigid set of operations into a responsive, digital nervous system.</p><p>For technical stakeholders, the question isn't <em>why</em> but <em>how</em>. This guide provides a condensed technical blueprint for designing, training, and deploying a sophisticated multi-agent system for autonomous supply chain management.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.nexusnotes.co/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Nexus Notes: Where AI and ML Meet Innovation! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h3><strong>The Architectural Blueprint: A Hierarchy of Specialists</strong></h3><p>At its core, a multi-agent system tackles complexity by breaking down a massive problem into manageable tasks assigned to specialized agents. For a mission-critical process like supply chain management, a <strong>hierarchical architecture</strong> provides the necessary control, auditability, and alignment with business goals.</p><p>We use <strong>LangGraph</strong>, a framework for building stateful, agentic workflows, to orchestrate this hierarchy. A master SupplyChainSupervisor agent sits at the top, managing the global plan and delegating tasks to a crew of worker agents.</p><p>The workflow operates as a continuous, self-optimizing loop:</p><ol><li><p><strong>Demand Forecasting:</strong> The DemandForecastingAgent analyzes historical data and real-time market signals to predict future demand.</p></li><li><p><strong>Inventory &amp; Risk Analysis:</strong> The forecast is passed to the InventoryManagementAgent to check against current stock levels, while a RiskComplianceAgent scans for potential disruptions.</p></li><li><p><strong>Procurement &amp; Logistics:</strong> Based on this analysis, a nested team of agents manages supplier selection (SupplierLiaisonAgent) and plans optimal shipping routes (LogisticsOptimizationAgent).</p></li><li><p><strong>Quality Assurance &amp; Execution:</strong> A QualityAssuranceAgent reviews the final plan against a predefined rubric before the Supervisor executes the plan by interacting with external systems.</p></li><li><p><strong>Continuous 
Monitoring:</strong> The system constantly monitors for new events, ready to trigger a re-planning cycle at a moment's notice.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jQ7b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4639e1-31a3-4a79-9ad5-555f07d0b089_789x652.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jQ7b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4639e1-31a3-4a79-9ad5-555f07d0b089_789x652.png 424w, https://substackcdn.com/image/fetch/$s_!jQ7b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4639e1-31a3-4a79-9ad5-555f07d0b089_789x652.png 848w, https://substackcdn.com/image/fetch/$s_!jQ7b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4639e1-31a3-4a79-9ad5-555f07d0b089_789x652.png 1272w, https://substackcdn.com/image/fetch/$s_!jQ7b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4639e1-31a3-4a79-9ad5-555f07d0b089_789x652.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jQ7b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4639e1-31a3-4a79-9ad5-555f07d0b089_789x652.png" width="789" height="652" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e4639e1-31a3-4a79-9ad5-555f07d0b089_789x652.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:652,&quot;width&quot;:789,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:107173,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.nexusnotes.co/i/169936527?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4639e1-31a3-4a79-9ad5-555f07d0b089_789x652.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jQ7b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4639e1-31a3-4a79-9ad5-555f07d0b089_789x652.png 424w, https://substackcdn.com/image/fetch/$s_!jQ7b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4639e1-31a3-4a79-9ad5-555f07d0b089_789x652.png 848w, https://substackcdn.com/image/fetch/$s_!jQ7b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4639e1-31a3-4a79-9ad5-555f07d0b089_789x652.png 1272w, https://substackcdn.com/image/fetch/$s_!jQ7b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4639e1-31a3-4a79-9ad5-555f07d0b089_789x652.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h3><strong>Engineering the Agent Crew: Beyond Basic Prompting</strong></h3><p>The intelligence of this system comes from a crew of specialized agents, each strategically engineered for its role. We employ a mix of powerful frontier models, smaller fine-tuned models, and sophisticated training techniques.</p><h4><strong>1. The SupplyChainSupervisor (Orchestrator)</strong></h4><ul><li><p><strong>Role:</strong> The master controller of the LangGraph workflow.</p></li><li><p><strong>Training Strategy:</strong> This agent is <strong>not fine-tuned</strong>. 
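</p></li></ul><p>The delegation loop described above can be sketched in plain Python. This is an illustrative stand-in for the LangGraph state machine, not real LangGraph code; the agent names and state keys are assumptions made for the sketch:</p>

```python
# Plain-Python stand-in for the supervisor's LangGraph state machine.
# Agent names and state keys are invented for this sketch.
def run_supply_chain_cycle(state, agents):
    """Pass shared state through each worker agent; QA may flag for human review."""
    for step in ["forecast", "inventory", "risk",
                 "procurement", "logistics", "qa"]:
        state = agents[step](state)            # each agent returns updated state
        if state.get("flagged_for_human"):     # QA short-circuits to the HITL path
            return state
    state["executed"] = True                   # supervisor executes the approved plan
    return state

# Minimal stub agents that only demonstrate the control flow.
agents = {name: (lambda s, n=name: {**s, n: "done"})
          for name in ["forecast", "inventory", "risk", "procurement", "logistics"]}
agents["qa"] = lambda s: {**s, "qa": "pass", "flagged_for_human": False}

final_state = run_supply_chain_cycle({}, agents)
```

<ul><li><p>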
Its intelligence is procedural, defined by meticulously engineered prompts and the coded logic of the LangGraph state machine.</p></li><li><p><strong>Model:</strong> A fast, capable model with excellent tool-use, like <strong>Llama 3 8B-Instruct</strong> or <strong>Gemini 1.5 Flash</strong>, is ideal for low-latency delegation.</p></li></ul><h4><strong>2. The DemandForecastingAgent (The Analyst)</strong></h4><ul><li><p><strong>Role:</strong> Predict future demand by synthesizing historical data and market trends.</p></li><li><p><strong>Training Strategy:</strong> A hybrid approach is used for maximum accuracy:</p></li></ul><ul><li><p><strong>Fine-Tuning:</strong> A model like <strong>Claude 3.5 Sonnet</strong> or a fine-tuned <strong>Mistral 7B</strong> is trained on a dataset of historical sales data paired with ground-truth outcomes. This teaches the model the fundamental demand patterns of the business.</p></li><li><p><strong>Retrieval-Augmented Generation (RAG):</strong> The agent is equipped with a RAG tool connected to a vector database containing market research reports, economic indicators, and news feeds. Before generating a forecast, it retrieves relevant, up-to-the-minute context, dramatically improving its accuracy and relevance.</p></li></ul><h4><strong>3. The InventoryManagementAgent (The Decision-Maker)</strong></h4><ul><li><p><strong>Role:</strong> Monitor stock levels and decide when and how much to reorder.</p></li><li><p><strong>Training Strategy:</strong> This agent is fine-tuned to be a specialized decision engine.</p></li></ul><ul><li><p><strong>Fine-Tuning:</strong> A smaller model like <strong>Llama 3 8B</strong> is trained on a dataset of inventory scenarios. Each data point consists of inputs (current stock, demand forecast, lead times) and the optimal output action (e.g., { "action": "CREATE_PURCHASE_ORDER", "quantity": 5000 }), derived from established inventory formulas like Economic Order Quantity (EOQ). 
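</p></li></ul><p>The EOQ-derived labels can be generated programmatically. A minimal sketch, assuming illustrative parameter names, a simple reorder-point rule, and an invented HOLD action:</p>

```python
import math

def economic_order_quantity(annual_demand, order_cost, holding_cost):
    """Classic EOQ formula: sqrt(2 * D * S / H)."""
    return math.sqrt(2 * annual_demand * order_cost / holding_cost)

def label_scenario(stock, reorder_point, annual_demand, order_cost, holding_cost):
    """Produce one fine-tuning label in the JSON shape shown above."""
    if stock > reorder_point:
        return {"action": "HOLD"}          # invented no-op action for the sketch
    qty = round(economic_order_quantity(annual_demand, order_cost, holding_cost))
    return {"action": "CREATE_PURCHASE_ORDER", "quantity": qty}
```

<ul><li><p>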
This transforms the LLM into a reliable, cost-effective tool for a specific, repetitive task.</p></li></ul><h4><strong>4. The QualityAssuranceAgent &amp; The Human-in-the-Loop (The Guardian)</strong></h4><ul><li><p><strong>Role:</strong> The critical junction for ensuring trust and accuracy. This agent reviews the system's proposed plans and flags anomalies for human review.</p></li><li><p><strong>Training Strategy:</strong> The focus here is on the process, not the model.</p></li></ul><ul><li><p><strong>LLM-as-Judge:</strong> A powerful frontier model like <strong>GPT-4o</strong> is given a detailed, expert-defined <strong>rubric</strong>. It is prompted to evaluate the final plan against this rubric, providing scores and justifications. If any criterion fails, the plan is automatically flagged.</p></li><li><p><strong>Human-in-the-Loop (HITL):</strong> Flagged plans are sent to a human expert via a dedicated UI. The expert's role is twofold: 1) to make the final decision and correct the plan, and 2) to provide structured feedback on <em>why</em> the plan was flawed.</p></li><li><p><strong>The Feedback Loop:</strong> This expert feedback is invaluable. Each correction becomes a high-quality data point that is fed back into the fine-tuning datasets for the other agents. This creates a virtuous cycle where the system learns directly from expert oversight, continuously improving its accuracy and reliability over time.</p></li></ul><h3><strong>Production Deployment on AWS</strong></h3><p>An enterprise-grade agentic system requires a scalable and resilient infrastructure. Our blueprint leverages a container-based approach orchestrated with Kubernetes.</p><ul><li><p><strong>Containerization:</strong> Each agent is packaged as a microservice using <strong>Docker</strong> and stored in <strong>Amazon Elastic Container Registry (ECR)</strong>. 
This isolates dependencies and allows for independent scaling.</p></li><li><p><strong>Orchestration:</strong> <strong>Amazon Elastic Kubernetes Service (EKS)</strong> manages the container lifecycle. We use a multi-node group strategy:</p></li></ul><ul><li><p><strong>CPU Node Group:</strong> For standard services like the API gateway and message brokers.</p></li><li><p><strong>GPU Node Group:</strong> For the agent services that require GPU acceleration (e.g., DemandForecastingAgent), using instances from the P3 or G5 families.</p></li></ul><ul><li><p><strong>Autoscaling:</strong> We use <strong>Karpenter</strong>, an AWS-built cluster autoscaler, to provision right-sized EC2 instances (including GPUs) on-demand, optimizing both performance and cost.</p></li><li><p><strong>Communication:</strong> A hybrid model is used. Low-latency internal requests between agents use <strong>gRPC</strong>, load-balanced by an <strong>Application Load Balancer (ALB)</strong>. Asynchronous handoffs to external systems use a managed message broker like <strong>Amazon MQ for RabbitMQ</strong> for resilience.</p></li></ul><h3><strong>The Path Forward</strong></h3><p>Building a multi-agent system is a journey from automation to autonomy. By combining hierarchical control, specialized agent training, and a robust human-in-the-loop validation process, we can construct a supply chain that doesn't just execute commands but senses, reasons, and adapts. 
This is the foundation for creating a truly sentient supply chain&#8212;one that is not only more efficient and resilient but also a powerful competitive advantage in an increasingly unpredictable world.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.nexusnotes.co/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Nexus Notes: Where AI and ML Meet Innovation! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Architecting HIPAA-Compliant Cloud-Native AI/LLM Services on AWS]]></title><description><![CDATA[A Comprehensive Guide]]></description><link>https://www.nexusnotes.co/p/architecting-hipaa-compliant-cloud</link><guid isPermaLink="false">https://www.nexusnotes.co/p/architecting-hipaa-compliant-cloud</guid><dc:creator><![CDATA[Alfeo Sabay]]></dc:creator><pubDate>Sat, 11 Nov 2023 03:48:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6tJH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9585538-0542-47e5-adb9-ddb708429ea8_1000x348.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Introduction</h3><p>In the ever-evolving landscape of healthcare technology, the fusion of Cloud-Native architecture with Artificial Intelligence (AI) and Large Language Models (LLM) is redefining patient care. 
As organizations harness the power of cloud platforms, such as AWS, to develop sophisticated AI/LLM solutions, ensuring compliance with the Health Insurance Portability and Accountability Act (HIPAA) becomes paramount. This comprehensive exploration dives into the intricacies of designing HIPAA-compliant Cloud-Native AI/LLM services on AWS, with each technical facet tailored to the unique challenges of healthcare AI.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6tJH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9585538-0542-47e5-adb9-ddb708429ea8_1000x348.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6tJH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9585538-0542-47e5-adb9-ddb708429ea8_1000x348.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6tJH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9585538-0542-47e5-adb9-ddb708429ea8_1000x348.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6tJH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9585538-0542-47e5-adb9-ddb708429ea8_1000x348.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6tJH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9585538-0542-47e5-adb9-ddb708429ea8_1000x348.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6tJH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9585538-0542-47e5-adb9-ddb708429ea8_1000x348.jpeg" width="1000" height="348" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c9585538-0542-47e5-adb9-ddb708429ea8_1000x348.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:348,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:211270,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6tJH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9585538-0542-47e5-adb9-ddb708429ea8_1000x348.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6tJH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9585538-0542-47e5-adb9-ddb708429ea8_1000x348.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6tJH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9585538-0542-47e5-adb9-ddb708429ea8_1000x348.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6tJH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9585538-0542-47e5-adb9-ddb708429ea8_1000x348.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>LLM Inference Design Considerations</h4><p>In the realm of designing HIPAA-compliant Cloud-Native AI/LLM services on AWS, the choice of Large Language Model (LLM) is a crucial consideration. Organizations often opt for well-established models such as OpenAI's GPT series or Google's BERT for their natural language processing capabilities in healthcare applications. The careful selection of an LLM depends on the specific requirements of the AI/LLM service, such as the complexity of language understanding, context retention, and data leak prevention. Moreover, privacy concerns in healthcare underscore the importance of running LLM inference privately. 
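</p><p>As one hedged sketch of that private-inference pattern, the kwargs below mirror SageMaker's create_model parameters for a VPC-pinned deployment; every name, ARN, and ID is a placeholder:</p>

```python
def private_model_kwargs(model_name, image_uri, role_arn, subnets, security_groups):
    """Build kwargs for sagemaker create_model that pin inference inside a VPC.

    All identifiers are placeholders. Pass the result to boto3, e.g.:
        boto3.client("sagemaker").create_model(**kwargs)
    """
    return {
        "ModelName": model_name,
        "PrimaryContainer": {"Image": image_uri},
        "ExecutionRoleArn": role_arn,
        # Keep inference traffic inside the VPC, off the public internet.
        "VpcConfig": {"Subnets": subnets, "SecurityGroupIds": security_groups},
        # Block all outbound network calls from the model container.
        "EnableNetworkIsolation": True,
    }

kwargs = private_model_kwargs(
    "clinical-llm",                                        # placeholder model name
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/llm",    # placeholder image URI
    "arn:aws:iam::123456789012:role/SageMakerExecRole",    # placeholder role ARN
    ["subnet-0abc"], ["sg-0abc"])
```

<p>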
AWS facilitates this by offering services like Amazon SageMaker for deploying models within a Virtual Private Cloud (VPC), ensuring that LLM inferences are performed in a secure and isolated environment. This approach aligns with HIPAA compliance, as it adds an additional layer of protection to patient data processed through language models, assuring confidentiality and meeting the stringent privacy requirements of healthcare regulations.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.nexusnotes.co/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Nexus Notes: Where AI and ML Meet Innovation! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z70z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a52c0-9382-47fa-860b-e80c3a805c3e_2290x1376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z70z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a52c0-9382-47fa-860b-e80c3a805c3e_2290x1376.png 424w, 
https://substackcdn.com/image/fetch/$s_!z70z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a52c0-9382-47fa-860b-e80c3a805c3e_2290x1376.png 848w, https://substackcdn.com/image/fetch/$s_!z70z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a52c0-9382-47fa-860b-e80c3a805c3e_2290x1376.png 1272w, https://substackcdn.com/image/fetch/$s_!z70z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a52c0-9382-47fa-860b-e80c3a805c3e_2290x1376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z70z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a52c0-9382-47fa-860b-e80c3a805c3e_2290x1376.png" width="723" height="434.4951923076923" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e5a52c0-9382-47fa-860b-e80c3a805c3e_2290x1376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:875,&quot;width&quot;:1456,&quot;resizeWidth&quot;:723,&quot;bytes&quot;:772133,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!z70z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a52c0-9382-47fa-860b-e80c3a805c3e_2290x1376.png 424w, 
https://substackcdn.com/image/fetch/$s_!z70z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a52c0-9382-47fa-860b-e80c3a805c3e_2290x1376.png 848w, https://substackcdn.com/image/fetch/$s_!z70z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a52c0-9382-47fa-860b-e80c3a805c3e_2290x1376.png 1272w, https://substackcdn.com/image/fetch/$s_!z70z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a52c0-9382-47fa-860b-e80c3a805c3e_2290x1376.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" 
x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Image Credit: AWS HIPAA Reference Architecture</p><h3><strong>Foundational Security: Integrating Security by Design into AI/LLM</strong></h3><p>The foundation of a Cloud-Native AI/LLM service lies in the principle of Security by Design. AWS Identity and Access Management (IAM) plays a pivotal role in managing access to AI/LLM models and datasets. Encryption, facilitated by AWS Key Management Service (KMS), is strategically embedded into the architecture, securing both at-rest and in-transit data. This design ensures that sensitive patient information processed by AI/LLM algorithms is shielded, aligning seamlessly with HIPAA's stringent security requirements.</p><h3><strong>Decoupled Architecture: Enhancing Scalability and Privacy in AI/LLM</strong></h3><p>AI/LLM services should adopt a decoupled architecture, leveraging AWS-native Kubernetes service and microservices. This design not only enhances the scalability of AI/LLM models but also aligns with the privacy requirements stipulated by HIPAA. Each microservice operates independently, allowing for isolated updates and reducing the potential attack surface which are critical considerations when dealing with sensitive patient data.</p><h3><strong>Data Encryption: Safeguarding Patient Data in AI/LLM Processing</strong></h3><p>In the context of AI/LLM services, data encryption is a critical component in maintaining HIPAA compliance. Utilizing AWS KMS, our Cloud-Native AI/LLM service ensures that patient data is encrypted before, during, and after processing. This robust encryption strategy not only adheres to HIPAA standards but also fortifies the AI/LLM service against potential security threats.</p><h3><strong>Access Controls and Identity Management: Ensuring Authorized Usage</strong></h3><p>AI/LLM services often involve multiple stakeholders accessing and interacting with models and datasets. 
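</p><p>Least privilege is typically expressed as a narrowly scoped IAM policy. A minimal sketch in Python; the bucket name and action list are illustrative, not a complete healthcare policy:</p>

```python
import json

def model_artifact_read_policy(bucket):
    """Read-only S3 access scoped to one model-artifact bucket (placeholder name)."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],  # read-only, nothing broader
            "Resource": [f"arn:aws:s3:::{bucket}",
                         f"arn:aws:s3:::{bucket}/*"],
        }],
    }

policy_json = json.dumps(model_artifact_read_policy("clinical-llm-artifacts"), indent=2)
```

<p>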
AWS IAM, in this scenario, plays a pivotal role in defining granular access controls. By adhering to the principle of least privilege, access to AI/LLM resources is restricted, ensuring that only authorized personnel can utilize and modify the models. This aligns seamlessly with HIPAA's access control mandates, crucial in safeguarding patient data processed by AI/LLM algorithms.</p><h3><strong>Audit Trails: Tracking AI/LLM Model Interactions for Compliance</strong></h3><p>HIPAA compliance demands detailed audit trails, a requirement met by AWS CloudTrail in a Cloud-Native AI/LLM service. Every interaction with AI/LLM models, including data inputs and outputs, can be meticulously logged. These logs not only satisfy regulatory requirements but also provide a comprehensive view of how AI/LLM models are utilized, aiding in both auditing and improving the overall AI/LLM service.</p><h3><strong>Secure APIs: Enabling Interoperability in AI/LLM Solutions</strong></h3><p>Interoperability is a key consideration in healthcare AI/LLM solutions. Secure APIs, facilitated by AWS API Gateway and following the OAuth standard, enable seamless integration with other healthcare systems while adhering to HIPAA's encryption and access control requirements. This ensures that data exchanged between AI/LLM services and external systems complies with privacy and security standards.</p><h3><strong>Disaster Recovery and Data Resilience: Ensuring Continuity in AI/LLM Operations</strong></h3><p>For AI/LLM services, data resilience and disaster recovery are paramount. AWS-native services like Amazon S3 for data storage and Amazon Aurora for databases contribute to the robustness of our Cloud-Native AI/LLM service. 
Automated backup mechanisms and geo-redundancy ensure that patient data remains accessible, even in the face of unforeseen disruptions, aligning with HIPAA's requirements for data availability and integrity.</p><h3><strong>Compliance as Code: Automating Governance and Compliance in AI/LLM</strong></h3><p>In the world of AI/LLM, automation becomes a powerful ally. Compliance as Code principles, implemented through AWS CloudFormation, define and deploy the entire infrastructure of our Cloud-Native AI/LLM service. AWS Config continuously monitors compliance, automatically remediating any deviations. This approach not only streamlines the governance of AI/LLM systems but also ensures ongoing HIPAA compliance.</p><h3><strong>Ongoing Compliance Monitoring: Adapting to Evolving AI/LLM Regulations</strong></h3><p>AI/LLM systems are subject to evolving regulatory landscapes. Regular vulnerability scans, facilitated by AWS-native services like Amazon Inspector, ensure that the AI/LLM service remains resilient to emerging threats. Continuous compliance monitoring, with tools like AWS Security Hub, empowers the AI/LLM service to adapt proactively to evolving HIPAA regulations.</p><h3><strong>Conclusion: Navigating the Future of Healthcare with AI/LLM Designs</strong></h3><p>In conclusion, the design of HIPAA-compliant Cloud-Native AI/LLM services on AWS exemplifies the delicate balance between innovation and regulatory adherence. By intertwining the capabilities of AWS cloud infrastructure with the intricacies of AI and LLM, healthcare organizations can pioneer the future of patient-centric, data-driven care. This approach not only meets current HIPAA standards but also positions healthcare AI/LLM systems to navigate the evolving landscape of regulatory requirements, ensuring a future where cutting-edge technology aligns with the principles of privacy, security, and patient well-being. 
The amalgamation of AWS's robust cloud services with AI/LLM solutions sets the stage for a healthcare revolution where innovation and compliance coexist harmoniously.</p><p><strong>References:</strong></p><p>US Department of Health and Human Services: https://www.hhs.gov/hipaa/index.html</p><p>Architecting for HIPAA Security and Compliance on Amazon Web Services: https://docs.aws.amazon.com/whitepapers/latest/architecting-hipaa-security-and-compliance-on-aws/architecting-hipaa-security-and-compliance-on-aws.html</p><p>HIPAA Reference Architecture on AWS: https://aws.amazon.com/solutions/implementations/compliance-hipaa/</p><p>Health Insurance Portability and Accountability Act (HIPAA) Security Rule 2003: https://docs.aws.amazon.com/audit-manager/latest/userguide/HIPAA.html</p>]]></content:encoded></item><item><title><![CDATA[🚀 🔥 AI Medical Assistant Part 2: LLM and Chatbot Deployment]]></title><description><![CDATA[AI augmentation is set to reshape medical clinicians' roles, boosting efficiency and healthcare quality. 
- Contact info@mlpipes.xyz for more info]]></description><link>https://www.nexusnotes.co/p/ai-medical-assistant-part-2-llm-and</link><guid isPermaLink="false">https://www.nexusnotes.co/p/ai-medical-assistant-part-2-llm-and</guid><dc:creator><![CDATA[Alfeo Sabay]]></dc:creator><pubDate>Wed, 20 Sep 2023 20:29:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lozG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae27c2c6-6a15-4c7e-baf3-c8c04839e6a5_721x682.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Introduction:</strong></p><p>Let's revisit the key highlights from <a href="https://www.nexusnotes.blog/p/building-an-ai-medical-assistant">Part 1</a> of our journey. To optimize training costs and fit smaller GPU instances, we configured a PEFT/LoRA model with 4-bit quantization (QLoRA) through BitsAndBytesConfig. This setup, applied before training, substantially reduced memory consumption during the training phase.</p><p>With the publicly available <a href="https://arxiv.org/abs/2004.03329">MedDialog</a> dataset, the full fine-tuning run took 97 hours, spanning 2 epochs, with the learning rate decaying to 4.55e-8 and the training loss stabilizing at 1.71. AWS SageMaker spot instances significantly reduced expenditure, albeit extending the training period due to intermittent interruptions, ultimately cutting costs by roughly 58%, from an estimated $687.73 to $287.92. 
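</p><p>The spot setup itself is compact. A minimal sketch using keyword names from the SageMaker Python SDK estimators; the checkpoint bucket and timing values are assumptions:</p>

```python
# Sketch of the Managed Spot Training setup described above. The keyword
# names match the SageMaker Python SDK estimators; the checkpoint bucket
# and the max_wait heuristic are assumptions.

def spot_training_kwargs(max_run_hours: int) -> dict:
    """Estimator kwargs enabling spot training; max_wait must be at least
    max_run, with headroom for time lost to spot interruptions."""
    max_run = max_run_hours * 3600
    return {
        "instance_type": "ml.g5.12xlarge",
        "instance_count": 1,
        "use_spot_instances": True,
        "max_run": max_run,
        "max_wait": 2 * max_run,  # absorb interruption-induced delays
        "checkpoint_s3_uri": "s3://my-training-bucket/checkpoints/",  # hypothetical; resume point
    }

def spot_savings(on_demand_cost: float, spot_cost: float) -> float:
    """Percentage saved by training on spot capacity."""
    return round(100 * (1 - spot_cost / on_demand_cost), 1)
```

<p>With the figures quoted above, spot_savings(687.73, 287.92) works out to roughly 58 percent.</p><p>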
Notably, these insights underscore the intricate balance between time and expenditure.</p><p>Throughout this process, we harnessed the power of the <a href="https://wandb.ai">W&amp;B platform</a>, seamlessly integrated with the wandb API within our training script, enabling us to diligently monitor our progress.</p><p>The overarching lesson derived from this phase is the pivotal trade-off between time efficiency and cost-effectiveness, particularly on larger GPU instances. It is noteworthy that our customized md-assistant model is readily accessible through <a href="https://huggingface.co/mlpipes-asabay/md-assistant">MLPipes on the Hugging Face model hub</a>, adhering to llama2 license terms.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lozG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae27c2c6-6a15-4c7e-baf3-c8c04839e6a5_721x682.png"><img src="https://substackcdn.com/image/fetch/$s_!lozG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae27c2c6-6a15-4c7e-baf3-c8c04839e6a5_721x682.png" width="721" height="682" alt=""></a></figure></div><p>Now, as we move forward armed with our finely-tuned model, we embark on the exciting journey of deploying it onto a production-grade AWS inference endpoint and constructing a Streamlit Chatbot, serving as a Minimum Viable Product (MVP) User Interface. 
Furthermore, with the Chatbot in place, we gain the ability to rigorously assess the accuracy and relevance of our model's outputs. On to the next phase! &#128640;</p><p><strong>Deployment on AWS SageMaker:</strong></p><p>Deploying the md-assistant model on AWS SageMaker is straightforward. We create the model and endpoint configurations, then deploy the model into a Docker container, opting for the Hugging Face Deep Learning Containers (DLCs), which align well with our model's architecture. As in the fine-tuning stage, we use a 4-GPU instance (ml.g5.12xlarge) to host the inference endpoint. Below is an illustrative snippet of this deployment process. 
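</p><p>The shape of that deployment configuration can be sketched as follows, assuming the SageMaker Hugging Face LLM container; the timeout and token limits are assumptions:</p>

```python
# Sketch of the deployment configuration described above, assuming the
# SageMaker Hugging Face LLM (TGI) container. Token limits and the startup
# timeout are assumptions, not the post's exact values.

def build_deploy_config(model_id: str = "mlpipes-asabay/md-assistant") -> dict:
    """Build the environment and deploy kwargs for a HuggingFaceModel."""
    env = {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": "4",           # ml.g5.12xlarge exposes 4 GPUs
        "MAX_INPUT_LENGTH": "1024",
        "MAX_TOTAL_TOKENS": "2048",
    }
    deploy_kwargs = {
        "initial_instance_count": 1,
        "instance_type": "ml.g5.12xlarge",
        "container_startup_health_check_timeout": 600,  # large model load time
    }
    return {"env": env, "deploy_kwargs": deploy_kwargs}
```

<p>With the SageMaker SDK installed, cfg["env"] feeds the HuggingFaceModel constructor and cfg["deploy_kwargs"] its deploy() call; keeping references to both the model and the predictor makes it easy to delete the endpoint later and avoid idle hosting charges. 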
</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gMDx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa012aa2f-d2d4-499f-80a7-8a3ebcb10534_888x739.png"><img src="https://substackcdn.com/image/fetch/$s_!gMDx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa012aa2f-d2d4-499f-80a7-8a3ebcb10534_888x739.png" width="888" height="739" alt=""></a></figure></div><p>By returning the huggingface_model object in addition to the predictor object, we are able to delete and clean up these resources later and avoid unnecessary inference hosting charges.</p><p><strong>Configuring endpoints for real-time interaction</strong></p><p>In a full production deployment, it is customary to establish a RESTful API endpoint, traditionally built with the Flask framework and fortified with robust authentication and security measures. However, because our test client, the Streamlit Chatbot, runs from an AWS CLI-enabled local machine, we have chosen a streamlined approach: we implement the Langchain LLM endpoint object designed for SageMaker and integrate it within a ConversationChain object enriched with memory. 
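</p><p>A sketch of that Langchain wiring, assuming langchain's SagemakerEndpoint LLM class with a content handler shaped like LLMContentHandler; the request format assumes the Hugging Face LLM container, and the endpoint name is hypothetical:</p>

```python
# Sketch of the Langchain-to-SageMaker bridge described above. The handler
# mirrors langchain's LLMContentHandler shape; the payload format assumes
# the Hugging Face TGI container, and the endpoint name is hypothetical.
import json

class ContentHandler:
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # TGI-style request body: prompt plus generation parameters.
        return json.dumps({"inputs": prompt, "parameters": model_kwargs}).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # TGI returns a list with one {"generated_text": ...} entry.
        return json.loads(output.decode("utf-8"))[0]["generated_text"]

# With langchain installed, the chain is assembled roughly as:
#   llm = SagemakerEndpoint(
#       endpoint_name="md-assistant",          # hypothetical name
#       region_name="us-east-1",
#       content_handler=ContentHandler(),
#       model_kwargs={"temperature": 0.7, "repetition_penalty": 1.03},
#   )
#   chain = ConversationChain(llm=llm, memory=ConversationBufferMemory())
```

<p>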
This streamlined choice matches the project's requirements and avoids unnecessary moving parts. While the intricacies of the Langchain implementation are beyond the scope of this discussion, a forthcoming post may offer a comprehensive exploration of its integration within a Chatbot QA use-case. Below is a code snippet illustrating the deployment strategy.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7iMu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02826ad5-034f-43aa-a048-76a8e148a7ec_572x684.png"><img src="https://substackcdn.com/image/fetch/$s_!7iMu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02826ad5-034f-43aa-a048-76a8e148a7ec_572x684.png" width="572" height="684" alt=""></a></figure></div><p><strong>The Streamlit Chatbot</strong></p><p>And within chat-md.py we access the endpoint through the ConversationChain&#8230;</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YIUz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2644159-8ae7-400a-a8ce-926f29b579e4_766x166.png"><img src="https://substackcdn.com/image/fetch/$s_!YIUz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2644159-8ae7-400a-a8ce-926f29b579e4_766x166.png" width="766" height="166" alt=""></a></figure></div><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T3ba!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2b452d2-0709-4029-8e18-69c398359ff5_703x358.png"><img src="https://substackcdn.com/image/fetch/$s_!T3ba!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2b452d2-0709-4029-8e18-69c398359ff5_703x358.png" width="703" height="358" alt=""></a></figure></div><p>Upon user input into the chatbot, the on_click_callback() function is invoked. Within this function lies the core of our implementation, where we execute the ConversationChain object. 
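</p><p>The heart of that callback can be sketched as a small pure function plus the Streamlit wiring; the session-state keys are assumptions, not the post's exact code:</p>

```python
# Sketch of the core of on_click_callback() described above. The Streamlit
# wiring is shown in comments; the session-state key names are assumptions.

def run_turn(conversation, history: list, prompt: str) -> str:
    """Invoke the ConversationChain on the user's prompt and record the turn."""
    reply = conversation.run(prompt)  # the chain carries its own memory
    history.append(("user", prompt))
    history.append(("assistant", reply))
    return reply

# Inside chat-md.py this would be wired to Streamlit roughly as:
#   def on_click_callback():
#       run_turn(st.session_state.chain, st.session_state.history,
#                st.session_state.user_prompt)
#   st.text_input("Your question", key="user_prompt")
#   st.button("Send", on_click=on_click_callback)
```

<p>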
This masterful orchestration involves processing the input text and summoning the deployed md-assistant to procure a precise and context-aware reply.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T7mx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c54d91-2f45-4fb8-b30c-347496409960_662x281.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T7mx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c54d91-2f45-4fb8-b30c-347496409960_662x281.png 424w, https://substackcdn.com/image/fetch/$s_!T7mx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c54d91-2f45-4fb8-b30c-347496409960_662x281.png 848w, https://substackcdn.com/image/fetch/$s_!T7mx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c54d91-2f45-4fb8-b30c-347496409960_662x281.png 1272w, https://substackcdn.com/image/fetch/$s_!T7mx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c54d91-2f45-4fb8-b30c-347496409960_662x281.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T7mx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c54d91-2f45-4fb8-b30c-347496409960_662x281.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83c54d91-2f45-4fb8-b30c-347496409960_662x281.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48265,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T7mx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c54d91-2f45-4fb8-b30c-347496409960_662x281.png 424w, https://substackcdn.com/image/fetch/$s_!T7mx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c54d91-2f45-4fb8-b30c-347496409960_662x281.png 848w, https://substackcdn.com/image/fetch/$s_!T7mx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c54d91-2f45-4fb8-b30c-347496409960_662x281.png 1272w, https://substackcdn.com/image/fetch/$s_!T7mx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c54d91-2f45-4fb8-b30c-347496409960_662x281.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Streamlit's default component styling felt plain out of the box, so I turned to CSS customization to build a more polished user interface.
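</p><p>A common pattern for this kind of styling is to inject a style block into the app. The sketch below shows a small helper that wraps raw CSS for injection; the selectors and colors are invented examples rather than the styles used in the actual md-assistant chatbot, and the st.markdown call (with unsafe_allow_html=True) is shown only as a comment since it runs inside a Streamlit app.</p>

```python
# Hypothetical helper for injecting custom CSS into a Streamlit app.
# The CSS rules below are made-up examples, not the project's styles.

def wrap_css(css: str) -> str:
    """Wrap raw CSS in a <style> tag suitable for st.markdown."""
    return f"<style>{css}</style>"

CHAT_CSS = """
.stChatBubble { border-radius: 12px; padding: 0.5rem 1rem; }
.stChatBubble.user { background-color: #e8f0fe; }
"""

snippet = wrap_css(CHAT_CSS)
# Inside a Streamlit app you would then render it with:
# st.markdown(snippet, unsafe_allow_html=True)
```

<p>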
For a deeper look at this customization process, see Fanilo Andrianasolo's GitHub repository, which offers practical examples. You can also find the chatbot's code in the Chatbot folder of my GitHub repository for this project.</p><p>Here&#8217;s a screen recording of the md-assistant chatbot in operation &#129299;</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;b1f7369d-a5be-42e3-961b-8f0ef3270365&quot;,&quot;duration&quot;:null}"></div><p><strong>Model Performance Evaluation:</strong></p><p>Because md-assistant is fine-tuned specifically for clinician-to-patient dialogue, our testing protocol uses simulated user inputs that the model has never seen. Testing on held-out inputs is essential for assessing accuracy and contextual relevance. I tested the model with 100 out-of-bag questions, and its responses stayed contextual and on topic. The model also answered repeated questions with slight variations while preserving context, thanks to a 'repetition_penalty' setting of 1.03 and a 'temperature' value of 0.7. That said, the true accuracy of the LLM's responses can only be fully evaluated by licensed medical clinicians, a step we have yet to undertake.</p><p>In a real-world, commercially deployed application of the md-assistant LLM, a critical phase would be Subject Matter Expert (SME) evaluation for model testing and validation. 
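</p><p>The generation settings quoted above (temperature 0.7, repetition_penalty 1.03) map directly onto the request payload of a Hugging Face TGI-style endpoint. The sketch below assembles such a request body; the endpoint name in the comment is a made-up placeholder, not the project's actual endpoint, and max_new_tokens is an illustrative value.</p>

```python
import json

# Sketch: build a text-generation request for a Hugging Face TGI /
# SageMaker inference endpoint. The parameter names follow the TGI
# JSON schema; the endpoint name below is a placeholder.

def build_request(prompt: str) -> str:
    payload = {
        "inputs": prompt,
        "parameters": {
            "temperature": 0.7,          # sampling randomness
            "repetition_penalty": 1.03,  # discourage verbatim repeats
            "max_new_tokens": 256,       # illustrative limit
        },
    }
    return json.dumps(payload)

body = build_request("Patient: I have had a mild fever for two days.")
# With boto3 this body would be sent roughly as:
# boto3.client("sagemaker-runtime").invoke_endpoint(
#     EndpointName="md-assistant-endpoint",  # placeholder name
#     ContentType="application/json", Body=body)
```

<p>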
Furthermore, standard operating procedure for deploying a fine-tuned chat model like this involves careful review of the dialogue data and thorough anonymization to satisfy HIPAA requirements. In such scenarios, the model can also be fine-tuned on a specific clinician's own advisory history, rather than on the MedDialog public dataset, which spans more than 250,000 utterances across 51 medical categories and 96 specialties from a diverse range of patients and medical professionals.</p><p><strong>Conclusion and Key Takeaways</strong></p><p>Our journey, spanning both fine-tuning and deployment stages, surfaced several key insights and achievements:</p><p>1. <strong>Optimized Training:</strong> We tailored our PEFT/LoRA setup with 4-bit quantization (QLoRA) via BitsAndBytesConfig, reducing memory consumption during training, a crucial step towards cost-effective GPU utilization.</p><p>2. <strong>Efficiency vs. Cost:</strong> Balancing time efficiency against cost was paramount in GPU instance selection. AWS SageMaker spot instances delivered roughly a 60% cost reduction, albeit with longer training runs.</p><p>3. <strong>Monitoring with W&amp;B:</strong> Integrating the Weights &amp; Biases platform and the wandb API gave us invaluable progress monitoring throughout fine-tuning.</p><p>4. <strong>Deployment Excellence:</strong> Deploying on AWS SageMaker was a seamless, container-based process; we chose Hugging Face DLC containers for compatibility and adaptability.</p><p>5. <strong>Real-time Interaction:</strong> We wrapped the Langchain LLM endpoint in a ConversationChain object rather than a conventional RESTful API layer, keeping the chat loop simple and efficient.</p><p>6. 
<strong>UI Customization:</strong> We used CSS customization to improve the Streamlit chatbot's user interface.</p><p>7. <strong>Model Testing:</strong> Testing the fine-tuned model with simulated user inputs produced contextually precise responses, and the repetition_penalty setting allowed nuanced variation between runs.</p><p>8. <strong>SME Evaluation: </strong>In a real-world, commercial deployment, Subject Matter Expert (SME) evaluation and HIPAA-compliant data anonymization would be pivotal, and the model could be further adapted to a specific clinician's advisory history.</p><p>In short, this project combined modern AI tooling with cost-conscious deployment and careful model testing. There is more to do as we continue to refine the md-assistant LLM for the dynamic field of clinician-patient interactions.</p><p><strong>Project Demo:</strong></p><p>If you'd like a live demonstration of the end-to-end chatbot application, reach out to me directly at <a href="mailto:info@mlpipes.xyz">MLPipes</a> and we can schedule a time. We can't keep a live deployment of the model running because of GPU hosting costs, but you can also chat with me via the <a href="https://www.mlpipes.xyz">MLPipes Chatbot</a>.</p><p>If you have an LLM customization project in mind, contact me at <a href="mailto:info@mlpipes.xyz">MLPipes</a>. At <a href="https://mlpipes.xyz">MLPipes</a>, we specialize in Machine Learning Engineering and LLM Customization, and we're eager to collaborate on your next endeavor. 
&#128640;</p><p>If you found value in this post, you're invited to a complimentary subscription below.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.nexusnotes.co/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.nexusnotes.co/subscribe?"><span>Subscribe now</span></a></p><p><strong>References:</strong></p><p><a href="https://ai.meta.com/llama/">Meta Llama 2</a></p><p><a href="https://github.com/huggingface/notebooks/blob/main/sagemaker/27_deploy_large_language_models/sagemaker-notebook.ipynb">Philip Schmid: How to deploy Large Language Models (LLMs) to Amazon SageMaker using new Hugging Face LLM DLC</a></p><p><a href="https://github.com/samrawal/llama2_chat_templater/tree/main">Sam Rawal: Llama2 Chat Templater</a></p><p><a href="https://api.python.langchain.com/en/latest/api_reference.html">Langchain API</a></p><p><a href="https://github.com/andfanilo/social-media-tutorials">Fanilo Andrianasolo: Social Media Tutorials Github</a> - Fantastic examples of how to add CSS to Streamlit!</p><p><a href="https://github.com/alsabay/ai_md_assistant/blob/main/README.md">Alfeo Sabay: MD Assistant Github</a></p><p><a href="https://huggingface.co/mlpipes-asabay/md-assistant">Alfeo Sabay: MLPipes Hugging Face Model Hub</a> - Where you can access the model &#128170;&#128640;&#128513;</p>]]></content:encoded></item><item><title><![CDATA[Building an AI Medical Assistant Part 1: LLama2 Fine-Tuning with Hugging Face Containers, QLoRA and PEFT With WANDB in AWS Sagemaker Spot Instances to Cut LLM Customization Costs]]></title><description><![CDATA[&#128640; Boost your LLama2 fine-tuning projects to new heights with the power of Hugging Face Transformers and AWS Spot Instances! 
&#128176; Save up to 90% on your fine-tuning customization costs.]]></description><link>https://www.nexusnotes.co/p/building-an-ai-medical-assistant</link><guid isPermaLink="false">https://www.nexusnotes.co/p/building-an-ai-medical-assistant</guid><dc:creator><![CDATA[Alfeo Sabay]]></dc:creator><pubDate>Fri, 01 Sep 2023 13:00:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kz_i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c5d08e-7734-480c-9121-3495b49944d6_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>I. <strong>Introduction</strong></h3><p>Generative AI is currently gaining a lot of momentum. Modern Large Language Models have the ability to generate relevant responses to human queries. This has motivated many new AI startups to apply this capability to everyday business tasks, automate workflows and processes, and improve work products.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kz_i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c5d08e-7734-480c-9121-3495b49944d6_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kz_i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c5d08e-7734-480c-9121-3495b49944d6_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!kz_i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c5d08e-7734-480c-9121-3495b49944d6_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!kz_i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c5d08e-7734-480c-9121-3495b49944d6_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!kz_i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c5d08e-7734-480c-9121-3495b49944d6_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kz_i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c5d08e-7734-480c-9121-3495b49944d6_1024x1024.png" width="372" height="372" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42c5d08e-7734-480c-9121-3495b49944d6_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:372,&quot;bytes&quot;:1609043,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kz_i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c5d08e-7734-480c-9121-3495b49944d6_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!kz_i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c5d08e-7734-480c-9121-3495b49944d6_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!kz_i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c5d08e-7734-480c-9121-3495b49944d6_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!kz_i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c5d08e-7734-480c-9121-3495b49944d6_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>One potential workflow that comes to mind is a clinical conversation between a clinician and a patient. 
Clinics and medical care providers already hold documentation of these clinical scenarios, and Large Language Models are particularly good at learning from this kind of dialogue data. With anonymized patient conversation data, can we train a Large Language Model to respond to patient questions? How effective will this be? What benefits can modern clinical practice gain from modern Large Language Models?</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.nexusnotes.co/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Nexus Notes: Where AI and ML Meet Innovation! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><ul><li><p>In part 1 of this newsletter, we cover the data pre-processing and model fine-tuning using QLoRA and Hugging Face PEFT on AWS Sagemaker spot instances and see the resulting cost savings! </p></li><li><p>In part 2, we will deploy the fine-tuned llama-2-13b-chat-hf (md-assistant) model to an inference container in AWS Sagemaker and build a simple chat user interface on top of the new chat model.</p></li></ul><p><strong>Amazon SageMaker Spot Instances, Hugging Face DLC containers, and Weights and Biases API</strong></p><p>In this two-part newsletter, we will dive into the use of Amazon SageMaker spot instances, Hugging Face DLC containers, and the Weights and Biases API to fine-tune the LLaMA2-13B-chat-hf model on a patient-clinician conversation dataset. 
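</p><p>Concretely, the spot-instance setup described here boils down to a few estimator settings. The sketch below shows representative keyword arguments for the sagemaker HuggingFace estimator; the instance type, framework versions, time limits, and script name are illustrative assumptions, not the project's exact values.</p>

```python
# Sketch of a spot-training configuration for fine-tuning
# (illustrative; values below are assumptions, not the exact
# settings used in this project).

estimator_kwargs = {
    "entry_point": "train.py",         # hypothetical training script
    "instance_type": "ml.g5.2xlarge",  # example GPU instance
    "instance_count": 1,
    "transformers_version": "4.28",
    "pytorch_version": "2.0",
    "py_version": "py310",
    "use_spot_instances": True,  # bid on spare capacity for the discount
    "max_run": 36000,            # training time limit (seconds)
    "max_wait": 72000,           # must be >= max_run when using spot
}

# With the sagemaker SDK installed, the job would be launched roughly as:
# from sagemaker.huggingface import HuggingFace
# HuggingFace(role=..., **estimator_kwargs).fit({"training": "s3://..."})
```

<p>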
This workflow has potential applications in clinical practice, where Large Language Models can be trained to respond to patient questions as a tool for the clinician. We will discuss the benefits of using SageMaker spot instances, including cost savings and scalability, as well as the advantages of using Hugging Face DLC containers for model deployment. Additionally, we will explore the role of the Weights and Biases API in monitoring and optimizing machine learning experiments.</p><h3><strong>II. Background</strong> </h3><p><strong>Overview of the chosen model architecture: LLaMA-2-13B-chat-hf</strong></p><p>LLaMA2 is a family of pre-trained language models developed by Meta AI, which have gained popularity among researchers and practitioners in natural language processing (NLP) due to their impressive performance across various benchmarks. Among the many available variants, we selected the LLaMA2-13B-chat-hf model for our fine-tuning experiment. But why did we choose this particular model?</p><p>One of the main reasons behind our choice was that the LLaMA2-13B-chat-hf model has been tuned specifically for dialogue: on top of its general pre-training, Meta applied supervised fine-tuning and reinforcement learning from human feedback on conversational data. This allows the model to learn patterns and structures commonly found in human dialogue, making it well suited for generating coherent and contextually appropriate responses in chatbot scenarios. In other words, the "chat" in the model name reflects this dialogue tuning, which aligns with our goal of developing a conversational AI system for healthcare professionals.</p><p>Another important factor was the balance between model size and computational requirements. Compared to smaller models like LLaMA2-7B, the LLaMA2-13B-chat-hf model offers better performance and more capacity to handle complex conversations. 
However, larger models like LLaMA2-70B come with increased computational demands and longer training times, which could limit our ability to experiment with different hyperparameter configurations within a reasonable time frame. By selecting the mid-sized LLaMA2-13B-chat-hf model, we were able to strike a good balance between these competing factors.</p><p><strong>Major factors in choosing LLama 2 models pertaining to licensing.</strong></p><ul><li><p>LLama2 is Meta's successor to the original LLaMA family of language models, released with openly downloadable weights.</p></li><li><p>Strictly speaking, the models are released under Meta's Llama 2 Community License rather than a standard open-source license such as Apache 2.0; it grants a broad, worldwide, royalty-free right to use, reproduce, modify, and redistribute the models.</p></li><li><p>Commercial use is permitted, subject to conditions such as an acceptable use policy and a separate licensing requirement for services with more than 700 million monthly active users.</p></li><li><p>The surrounding tooling is open source under its own licenses: PyTorch carries a BSD-style license and Hugging Face transformers the Apache License 2.0.</p></li><li><p>In practice, this licensing gives most teams the freedom to fine-tune, adapt, and deploy LLama2-based models, which aligns with the goals of this project and with the open-source community's emphasis on collaboration and transparency.</p></li></ul><p><strong>Fine-tuning the model is like doing the fit and finish of the raw material which is the pre-trained model</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SaSq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6bc81b-1095-4d9c-a2c9-21efaab651c9_1784x1272.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SaSq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6bc81b-1095-4d9c-a2c9-21efaab651c9_1784x1272.png 424w, https://substackcdn.com/image/fetch/$s_!SaSq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6bc81b-1095-4d9c-a2c9-21efaab651c9_1784x1272.png 848w, https://substackcdn.com/image/fetch/$s_!SaSq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6bc81b-1095-4d9c-a2c9-21efaab651c9_1784x1272.png 1272w, https://substackcdn.com/image/fetch/$s_!SaSq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6bc81b-1095-4d9c-a2c9-21efaab651c9_1784x1272.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SaSq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6bc81b-1095-4d9c-a2c9-21efaab651c9_1784x1272.png" width="1456" height="1038" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f6bc81b-1095-4d9c-a2c9-21efaab651c9_1784x1272.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1038,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:250825,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!SaSq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6bc81b-1095-4d9c-a2c9-21efaab651c9_1784x1272.png 424w, https://substackcdn.com/image/fetch/$s_!SaSq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6bc81b-1095-4d9c-a2c9-21efaab651c9_1784x1272.png 848w, https://substackcdn.com/image/fetch/$s_!SaSq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6bc81b-1095-4d9c-a2c9-21efaab651c9_1784x1272.png 1272w, https://substackcdn.com/image/fetch/$s_!SaSq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6bc81b-1095-4d9c-a2c9-21efaab651c9_1784x1272.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Fine-tuning pre-trained language models for specific domains is crucial for achieving optimal performance in various applications, especially in scenarios where the model needs to understand domain-specific terminology, concepts, and nuances. In the context of patient-clinician conversations, fine-tuning a pre-trained language model can significantly improve its ability to comprehend medical jargon, diagnose diseases, and provide relevant treatment recommendations. In essence, the pre-trained model is the raw material and fine-tuning the model is the process of carving and shaping the raw material into a beautiful piece of sculpture!</p><p>Here are some reasons why fine-tuning pre-trained language models is important for specific domains like patient-clinician conversations:</p><ul><li><p>Domain-specific vocabulary: Medical conversations often involve specialized terminology that may not be present in general language datasets. By fine-tuning a pre-trained model on a dataset containing medical terms and phrases commonly used in patient-clinician interactions, the model becomes more adept at understanding the unique vocabulary associated with this domain.</p></li><li><p>Concept drift: The underlying distribution of data in different domains can differ significantly, leading to a phenomenon known as concept drift. 
Fine-tuning a pre-trained model helps adapt it to the new domain, ensuring that it can capture subtle variations in language usage, sentiment, and topics that are specific to patient-clinician conversations.</p></li><li><p>Contextual understanding: Patient-clinician conversations often involve complex dialogues that require an understanding of the context, including the patient's medical history, symptoms, and treatment plans. Fine-tuning a pre-trained model enables it to better grasp the relationships between these elements and provide more accurate responses.</p></li><li><p>Personalization: Every patient's situation is unique, and clinicians must consider individual factors when making decisions about diagnosis and treatment. By fine-tuning a language model on a dataset that reflects the diversity of patients and clinicians, the model can learn to recognize patterns and tailor its responses to suit each person's needs. Think about the correlations that the LLM can point out in a particular patient&#8217;s history.</p></li><li><p>Ethical considerations: Healthcare is a sensitive domain, and there are ethical concerns surrounding the use of AI in patient care. Fine-tuning a pre-trained model on a dataset that adheres to privacy regulations and respects patient autonomy helps ensure that the model's responses align with ethical principles and standards.</p></li><li><p>Improved accuracy: Fine-tuning a pre-trained model typically leads to improved accuracy compared to using a generic, pre-trained model. This is because the model has learned to recognize patterns and relationships specific to the target domain, resulting in fewer errors and more effective communication.</p></li><li><p>Efficient use of resources: Fine-tuning a pre-trained model requires less data and computational resources than training a model from scratch. 
By leveraging the knowledge captured by the pre-trained model, we can adapt it to the target domain more efficiently and effectively.</p></li><li><p>Faster adaptation to new tasks: Once a pre-trained model has been fine-tuned for a specific domain, it can quickly adapt to new tasks within that domain. This is particularly useful in healthcare, where new treatments, technologies, and regulations emerge regularly, and the model needs to be able to respond accordingly.</p></li><li><p>Enhanced interpretability: Fine-tuning a pre-trained model can help make its internal workings more transparent and interpretable. By analyzing the model's weights and activations, we can gain insights into how it processes domain-specific language and which features it deems most important.</p></li><li><p>Better handling of out-of-distribution inputs: Fine-tuning a pre-trained model improves its ability to handle unexpected or out-of-distribution inputs that may arise in real-world applications. This is critical in healthcare, where unusual cases or unforeseen situations can have significant consequences.</p></li></ul><p>So, fine-tuning pre-trained language models for specific domains like patient-clinician conversations is essential for achieving high accuracy, efficiency, and ethical considerations. By adapting these models to the unique characteristics of the target domain, we can develop more effective and reliable AI language models that support clinicians in providing better patient care.</p><h3><strong>III. System Design</strong></h3><p><strong>Pre-requisites</strong> <em>(Disclaimer: I am not sponsored by any of these companies. 
I am a paid subscriber to AWS, Weights &amp; Biases and Hugging Face)</em></p><ol><li><p><strong>AWS Account:</strong></p></li></ol><p><a href="https://aws.amazon.com/">Cloud Computing Services - Amazon Web Services (AWS)</a></p><ol start="2"><li><p><strong>HuggingFace Account: Free signup</strong></p></li></ol><p><a href="https://huggingface.co/pricing">Hugging Face &#8211; Pricing</a></p><ol start="3"><li><p><strong>Weights &amp; Biases Account: Free signup</strong></p></li></ol><p><a href="https://docs.wandb.ai/">W&amp;B Docs | Weights &amp; Biases Documentation</a></p><ol start="4"><li><p><strong>Meta LLama-2 approval for access:</strong></p></li></ol><p><a href="https://ai.meta.com/resources/models-and-libraries/llama-downloads/">Llama access request form - Meta AI</a></p><p><strong>Amazon SageMaker and its benefits for machine learning and AI development</strong></p><p>With AWS SageMaker, the process of fine-tuning a pre-trained language model (LLM) becomes significantly simpler and more efficient for machine learning engineers. By leveraging the cloud infrastructure provided by SageMaker, engineers can focus solely on the fine-tuning task without worrying about the underlying infrastructure.</p><p>Here are some ways in which SageMaker streamlines the LLM fine-tuning process:</p><ul><li><p>No Infrastructure Setup: SageMaker eliminates the need for engineers to perform low-level setup and infrastructure management, such as spinning up containers, managing data storage, and configuring network security. This saves time and effort, allowing engineers to focus on the core task of fine-tuning the LLM.</p></li><li><p>Easy Access to Data: SageMaker provides integrated data management capabilities, making it easy for engineers to access and manipulate data for fine-tuning. 
This includes data loading, preprocessing, and feature engineering, all of which can be performed within the SageMaker framework.</p></li><li><p>Automated Hyperparameter Tuning: SageMaker automates the hyperparameter tuning process, allowing engineers to focus on other aspects of the fine-tuning task. This feature saves time and reduces the risk of overfitting or underfitting the model.</p></li><li><p>Support for Various Frameworks: SageMaker supports a variety of machine learning frameworks, including TensorFlow, PyTorch, and Scikit-learn. This means that engineers can use their preferred framework for LLM fine-tuning, further simplifying the process.</p></li><li><p>Flexible Deployment Options: Once the fine-tuning process is complete, SageMaker provides flexible deployment options, including hosting the model in a SageMaker endpoint, deploying it to AWS Lambda, or exporting it to a containerized application. This enables engineers to easily integrate the fine-tuned LLM into their desired environment.</p></li><li><p>Time Savings: By leveraging SageMaker's automated infrastructure provisioning, data management, and hyperparameter tuning capabilities, engineers can save a significant amount of time compared to setting up and managing the infrastructure themselves. This allows them to focus on the fine-tuning task at hand and deliver high-quality LLM models more rapidly.</p></li><li><p>Improved Productivity: With SageMaker, engineers can work more efficiently and avoid tedious, repetitive tasks. They can focus on developing and refining their LLM models, leading to improved productivity and better model performance.</p></li><li><p>Better Collaboration: SageMaker facilitates collaboration among team members, enabling them to work together more effectively. 
Features like version control, reproducibility, and shared notebooks simplify the collaborative fine-tuning process, ensuring that everyone is on the same page.</p></li><li><p>Cost Optimization: SageMaker provides optimized computing resources that adjust to meet changing demand. This means that engineers can minimize costs associated with LLM fine-tuning while still achieving optimal results.</p></li><li><p>Security and Compliance: SageMaker adheres to strict security and compliance standards, giving engineers peace of mind regarding data privacy and protection. This allows them to focus on the fine-tuning task without worrying about potential security breaches or non-compliance issues.</p></li></ul><p>By using AWS SageMaker for LLM fine-tuning, machine learning engineers can offload the burden of managing infrastructure and focus exclusively on optimizing their models. This leads to increased productivity, simplified collaboration, reduced costs, and improved model performance, ultimately resulting in better outcomes for their organization.</p><p><strong>Introduction to Hugging Face DLC containers and their integration with SageMaker</strong></p><p>HuggingFace DLC (Deep Learning Container) is a containerization technology specifically designed for deep learning models. It provides a simple and efficient way to package and distribute deep learning models and their dependencies, allowing developers to focus on building models instead of managing infrastructure.</p><p>DLCs are built on top of Docker and provide a standardized way to package models, datasets, and other dependencies required for training and inference. They support a wide range of deep learning frameworks, including TensorFlow, PyTorch, and Keras.</p><p>SageMaker, on the other hand, is a fully managed service provided by Amazon Web Services (AWS) that makes it easy to build, train, and deploy machine learning models at scale. 
It provides a variety of features, including automatic model tuning, hyperparameter optimization, and deployment of models to production environments.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c6JS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2bd7bae-c8fb-4de0-8ec0-24c66ae226b5_798x620.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c6JS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2bd7bae-c8fb-4de0-8ec0-24c66ae226b5_798x620.png 424w, https://substackcdn.com/image/fetch/$s_!c6JS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2bd7bae-c8fb-4de0-8ec0-24c66ae226b5_798x620.png 848w, https://substackcdn.com/image/fetch/$s_!c6JS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2bd7bae-c8fb-4de0-8ec0-24c66ae226b5_798x620.png 1272w, https://substackcdn.com/image/fetch/$s_!c6JS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2bd7bae-c8fb-4de0-8ec0-24c66ae226b5_798x620.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c6JS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2bd7bae-c8fb-4de0-8ec0-24c66ae226b5_798x620.png" width="798" height="620" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f2bd7bae-c8fb-4de0-8ec0-24c66ae226b5_798x620.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:620,&quot;width&quot;:798,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!c6JS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2bd7bae-c8fb-4de0-8ec0-24c66ae226b5_798x620.png 424w, https://substackcdn.com/image/fetch/$s_!c6JS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2bd7bae-c8fb-4de0-8ec0-24c66ae226b5_798x620.png 848w, https://substackcdn.com/image/fetch/$s_!c6JS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2bd7bae-c8fb-4de0-8ec0-24c66ae226b5_798x620.png 1272w, https://substackcdn.com/image/fetch/$s_!c6JS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2bd7bae-c8fb-4de0-8ec0-24c66ae226b5_798x620.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Advantages of integrating Hugging Face DLCs with SageMaker</strong></p><ul><li><p><strong>Simplified model packaging</strong>: Hugging Face DLCs provide a standardized way to package models and their dependencies, making it easy to manage and distribute models across different environments.</p></li><li><p><strong>Faster model deployment</strong>: By using DLCs, you can quickly deploy models to SageMaker, reducing the time and effort required to set up and configure environments.</p></li><li><p><strong>Improved reproducibility</strong>: DLCs ensure that models are trained and deployed consistently across different environments, which improves reproducibility and reduces the risk of errors caused by inconsistent environments.</p></li><li><p><strong>Easier collaboration</strong>: DLCs make it easier for data scientists to collaborate on projects by providing a standardized way to exchange models and reproduce experiments.</p></li><li><p><strong>Better resource utilization</strong>: SageMaker's integration with DLCs allows you 
to take advantage of spot instances and other cost-effective compute resources, reducing the cost of training and deploying models.</p></li></ul><p><strong>Fine-tuning Set-up:</strong></p><p>When leveraging the MedDialog dataset to fine-tune a pre-trained LLaMA-2-13B-chat-hf model, there are several key considerations to keep in mind to ensure optimal performance and efficiency.</p><p>First and foremost, it is crucial to select an appropriate batch size for training. In this case, a batch size of 2 per device is recommended to strike a balance between resource utilization and training speed. By doing so, the model can be trained efficiently without sacrificing too much time or computational resources.</p><p>Next, the number of epochs must be carefully chosen. In this scenario, running the experiment for 3 epochs should suffice for the model to converge properly and achieve satisfactory results. Selecting the right number of epochs is critical, as it impacts both the accuracy of the model and the time required for training.</p><p>AdamW optimization is also vital for achieving optimal model performance. AdamW is a well-known algorithm that adapts the learning rate for each parameter individually, based on the magnitude of the gradient. This helps to prevent overshooting or undershooting the optimal learning rate, resulting in faster convergence and better model accuracy.</p><p>Another important aspect to consider is the choice of instance type. For this particular use case, an ml.g5.4xlarge SageMaker spot instance is recommended. Not only does this instance type offer powerful GPU acceleration, but it also comes with a discounted pricing model thanks to Amazon Elastic Compute Cloud (EC2) Spot Instances. By leveraging EC2 Spot Instances, SageMaker can automatically handle bid management and instance selection, streamlining the process and minimizing costs.</p><p>Monitoring the fine-tuning progress is equally important. 
From the smart folks at Weights &amp; Biases (<a href="https://wandb.ai">https://wandb.ai</a>), the wandb API provides comprehensive monitoring for SageMaker/HuggingFace experiments, including essential metrics like training loss, accuracy, and validation accuracy. Real-time monitoring lets engineers spot potential issues early and make informed decisions throughout the experiment. It also lets you version and iterate on datasets, evaluate model performance, reproduce models, visualize results, spot regressions, and share findings with colleagues.</p><p>In the fine-tuning script (run_clm.py), I imported wandb and, using my API key, logged in to the <a href="http://wandb.ai">wandb.ai</a> platform from within the training code. The snippet below is all that is required, and you can then view the real-time progress as shown in the chart that follows. The different colors represent separate traces resumed after spot instance interruptions: wandb remembers the last trace and plots each new one in a different color.</p><pre><code># enable wandb experiment tracking
import wandb
# ...
# login to wandb instrumentation platform
    if args.wandb_api_key:
        print(f'logging in to wandb.....')
        wandb.login(anonymous='never', key=args.wandb_api_key)
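        # (Note: as an alternative, one could set the WANDB_API_KEY environment
        # variable on the training job instead of passing the key as a script
        # argument, which avoids handling the key in code.)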
# ...
# add the report_to='wandb' parameter
# Define training args
    output_dir = args.output_dir
    training_args = TrainingArguments(
        output_dir=output_dir,
        resume_from_checkpoint=True,
        overwrite_output_dir=True,
        per_device_train_batch_size=args.per_device_train_batch_size,
        bf16=args.bf16,  # Use BF16 if available
        learning_rate=args.lr,
        num_train_epochs=args.epochs,
        gradient_checkpointing=args.gradient_checkpointing,
        # logging strategies
        logging_dir=f"{output_dir}/logs",
        logging_strategy="steps",
        logging_steps=10,
        #warmup_steps=100,
        save_steps=50,
        save_strategy="steps",
        report_to="wandb",
        run_name=f'md-asistant-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
    )</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rh9r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9c521-161c-4c9c-8816-d83d773c16e9_888x640.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rh9r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9c521-161c-4c9c-8816-d83d773c16e9_888x640.png 424w, https://substackcdn.com/image/fetch/$s_!rh9r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9c521-161c-4c9c-8816-d83d773c16e9_888x640.png 848w, https://substackcdn.com/image/fetch/$s_!rh9r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9c521-161c-4c9c-8816-d83d773c16e9_888x640.png 1272w, https://substackcdn.com/image/fetch/$s_!rh9r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9c521-161c-4c9c-8816-d83d773c16e9_888x640.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rh9r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9c521-161c-4c9c-8816-d83d773c16e9_888x640.png" width="888" height="640" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6eb9c521-161c-4c9c-8816-d83d773c16e9_888x640.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:640,&quot;width&quot;:888,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:104739,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rh9r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9c521-161c-4c9c-8816-d83d773c16e9_888x640.png 424w, https://substackcdn.com/image/fetch/$s_!rh9r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9c521-161c-4c9c-8816-d83d773c16e9_888x640.png 848w, https://substackcdn.com/image/fetch/$s_!rh9r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9c521-161c-4c9c-8816-d83d773c16e9_888x640.png 1272w, https://substackcdn.com/image/fetch/$s_!rh9r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9c521-161c-4c9c-8816-d83d773c16e9_888x640.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Finally, it's worth noting that using spot instances has its advantages over on-demand instances. Besides being significantly cheaper, spot instances grant access to a larger pool of available instances, making it easier to launch experiments promptly. Additionally, spot instances allow for greater flexibility in scaling resources up or down according to changing demands. Of course, there are some risks associated with spot instances, such as potential instance termination due to fluctuations in EC2 spot instance supply and demand. 
Nevertheless, SageMaker's integration with EC2 Spot Instances mitigates these risks while maximizing the benefits.</p><p>By taking these factors into account when fine-tuning the LLaMA-2-13B-chat-hf model on the MedDialog dataset, experienced ML engineers can create a robust and cost-efficient chatbot solution that delivers high-quality user experiences while optimizing resource utilization.</p><p><strong>Dataset: description of the patient-clinician conversation dataset used for fine-tuning</strong></p><p>First introduced in a 2020 paper by Xuehai He et al., MedDialog consists of two large-scale medical dialogue datasets that capture conversations between patients and doctors across various medical domains.</p><p>What makes MedDialog particularly interesting for business applications is its scope and diversity. The dataset contains over 250,000 utterances from both patients and doctors, spanning 51 different medical categories and 96 specialties. This wealth of information provides a unique opportunity for machine learning models to learn patterns and relationships within medical dialogues, which can ultimately enhance decision-making processes in healthcare settings.</p><p>Developing conversational AI systems that can facilitate patient-doctor interactions using clinic- or provider-specific dialog datasets is a compelling use case. By analyzing the language used in medical consultations, these systems can better understand patient concerns and provide personalized recommendations for treatment options.
This not only improves patient satisfaction but also streamlines the consultation process for doctors, allowing them to focus on more complex cases.</p><p><strong>MedDialog Dataset Preprocessing</strong></p><p>The code provided (modified from <a href="https://github.com/philschmid/sagemaker-huggingface-llama-2-samples">https://github.com/philschmid/sagemaker-huggingface-llama-2-samples</a>) is responsible for formatting medical dialog data into a suitable format for training a language model. Here's an overview of the steps involved in this process:</p><ol><li><p>Loading the Data: The first step is to load the medical dialog data from a dataset file. This is done using the&nbsp;<code>load_dataset</code>&nbsp;function, which returns a pandas dataframe containing the data.</p></li><li><p>Removing Unwanted Columns: The next step is to remove any unwanted columns from the dataframe. In this case, we only need the "text" column, so we remove all other columns using the&nbsp;<code>remove_columns</code>&nbsp;parameter of the&nbsp;<code>df.map()</code>&nbsp;method.</p></li><li><p>Formatting Samples: We then apply a custom function called&nbsp;<code>template_dataset</code>&nbsp;to each row of the dataframe. This function takes a sample and formats it according to the required format for our language model. Specifically, it adds a system prompt and user prompt to each sample, separated by a newline character.</p></li><li><p>Chunking and Tokenizing: After formatting the samples, we use another custom function called&nbsp;<code>chunk</code>&nbsp;to split the text into smaller chunks. Each chunk has a maximum length of 2048 tokens. Any remaining tokens are saved as a global variable called&nbsp;<code>remainder</code>&nbsp;to be used in the next batch. Within each chunk, we also tokenize the text using the&nbsp;<code>tokenizer</code>&nbsp;function.</p></li><li><p>Preparing Labels: Once we have our chunks of text, we create labels for them. 
The labels are simply copies of the input IDs.</p></li><li><p>Saving the Data: Finally, we save the processed data to disk using the&nbsp;<code>save_to_disk</code>&nbsp;method. The output path is specified using the&nbsp;<code>training_input_path</code>&nbsp;variable which is an s3 bucket.</p></li></ol><p>The code loads medical dialog data, removes unnecessary columns, formats each sample with system and user prompts, chunks and tokenizes the text, prepares labels, and saves the processed data to disk. These steps are necessary to prepare the data for training a language model capable of generating appropriate responses to medical queries.</p><pre><code><code>######################################################
# Pre-processing of medical_dialog dataset from hf hub.
# Stores formatted, tokenized, chunked 
# training data to s3 bucket.
# Derived from @philschmid huggingface-llama-2-samples
# on a different hf dataset
######################################################

import sagemaker
import boto3
from random import randint
from itertools import chain
from functools import partial
from datasets import load_dataset
from random import randrange
import json
import pandas as pd
from transformers import AutoTokenizer

#sagemaker_session_bucket='mlpipes-sm'                                # us-west-2
sagemaker_session_bucket='mlpipes-03-29-2023-asabay'                  # us-east-1
role_name = 'Sagemaker-mle'
dataset_name = 'medical_dialog'
dataset_lang = 'en'
model_id = 'meta-llama/Llama-2-13b-chat-hf'
# dict of empty lists to collect leftover tokens from each batch for the next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}
# sess = sagemaker.Session()

# fetch tokenizer pad_token
def fetch_tokenizer(model_id):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    return tokenizer
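
# Note: Llama-2's tokenizer ships without a pad token, so the EOS token is
# reused for padding above -- a common convention when fine-tuning causal LMs.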

tokenizer = fetch_tokenizer(model_id)

# sagemaker session bucket -&gt; used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
def init_sagemaker(role, session_bucket):
    sess = sagemaker.Session()
    if session_bucket is None:
        # fall back to the session's default bucket if no bucket name is given
        session_bucket = sess.default_bucket()
    try:
        # works when running inside a SageMaker notebook or job
        role = sagemaker.get_execution_role()
    except ValueError:
        # otherwise look up the role ARN by name
        iam = boto3.client('iam')
        role = iam.get_role(RoleName=role_name)['Role']['Arn']

    sess = sagemaker.Session(default_bucket=session_bucket)
    return (sess, role)

# load dataset and remove un-used fields
def load_and_extract(dataset_name, dataset_lang):
    dataset = load_dataset(dataset_name, dataset_lang)
    dataset = dataset['train'].remove_columns(['file_name', 'dialogue_id', 'dialogue_url'])
    return dataset

# function to format samples to llama-2-chat-hf format
# which is:
# &lt;s&gt;[INST] &lt;&lt;SYS&gt;&gt;
# System prompt
# &lt;&lt;/SYS&gt;&gt;
# User prompt [/INST] Model answer &lt;/s&gt;
def format_dialogue(sample):
    instruction = f"[INST]{sample['dialogue_turns']['utterance'][0]}[/INST]"
    response = f"{sample['dialogue_turns']['utterance'][1]}"
    # join all the parts together
    prompt = "\n".join([i for i in [instruction, response] if i is not None])
    return '&lt;s&gt;' + prompt + '&lt;/s&gt;'
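
# Illustrative example (hypothetical utterances): for a sample whose first two
# dialogue turns are 'I have a persistent cough.' and 'How long has it lasted?',
# format_dialogue returns the patient turn wrapped as
# &lt;s&gt;[INST]I have a persistent cough.[/INST] followed by the doctor's
# response and a closing &lt;/s&gt;, matching the llama-2 chat template above.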

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dialogue(sample)}{tokenizer.eos_token}"
    return sample

# chunk and tokenize
def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # get the largest multiple of chunk_length that fits in this batch
    batch_chunk_length = 0
    if batch_total_length &gt;= chunk_length:
        batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result
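
# Worked example: with chunk_length=2048, a batch totalling 5000 tokens gives
# batch_chunk_length = (5000 // 2048) * 2048 = 4096, i.e. two chunks of 2048
# tokens each, while the remaining 904 tokens carry over to the next batch
# through the global `remainder`.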

def process_data():
    sm_session, _ = init_sagemaker(role_name, sagemaker_session_bucket)
    ds = load_and_extract(dataset_name, dataset_lang)
    ds = ds.map(template_dataset)
    print(ds[randrange(len(ds))]["text"])  # print random sample
    lm_dataset = ds.map(
        lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(ds.features)
            ).map(partial(chunk, chunk_length=2048),
            batched=True,
        )
    print(f"Total number of samples: {len(lm_dataset)}")
    # save train_dataset to s3
    training_input_path = f's3://{sm_session.default_bucket()}/processed/llama/md_dialouge/train'
    lm_dataset.save_to_disk(training_input_path)
    print("uploaded data to:")
    print(f"training dataset to: {training_input_path}")

if __name__ == '__main__':
    process_data()
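
# Sketch (assumption, not part of this script): the fine-tuning job that
# consumes training_input_path can then be launched on a spot instance with
# the HuggingFace estimator, e.g.:
#
#   from sagemaker.huggingface import HuggingFace
#   estimator = HuggingFace(
#       entry_point='run_clm.py',
#       instance_type='ml.g5.4xlarge',
#       instance_count=1,
#       role=role_name,
#       use_spot_instances=True,
#       max_run=36000,
#       max_wait=72000,                  # must be at least max_run
#       checkpoint_s3_uri=f's3://{sagemaker_session_bucket}/checkpoints',
#       transformers_version='4.28',
#       pytorch_version='2.0',
#       py_version='py310',
#       hyperparameters={'model_id': model_id, 'epochs': 3},
#   )
#   estimator.fit({'training': training_input_path})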
</code></code></pre><p><strong>Fine-Tuning with QLoRA and PEFT for Cost Savings</strong></p><p>To reduce training costs on AWS SageMaker, the QLoRA method was used. QLoRA makes it possible to train on smaller GPU instances, in this case a single-GPU ml.g5.4xlarge SageMaker spot training instance (16 vCPUs, 64 GB memory, one NVIDIA A10G GPU with 24 GB of GPU memory). The on-demand cost in the AWS us-east-1 region is $2.03 per hour, discounted by up to 90% when using spot instances; in this project, the discount came to around 65% with this setup.</p><p>QLoRA (Quantized Low-Rank Adaptation) is a method for efficient finetuning of quantized large language models (LLMs) proposed by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer in their research paper titled "QLORA: Efficient Finetuning of Quantized LLMs." The authors address the challenge that full finetuning of large models is prohibitively memory-intensive, and they combine two existing techniques: weight quantization and LoRA (Low-Rank Adaptation). The base model's weights are frozen and quantized to 4 bits, and gradients are backpropagated through the quantized weights into a small set of trainable low-rank adapter matrices. The paper also introduces the 4-bit NormalFloat (NF4) data type, double quantization (quantizing the quantization constants themselves), and paged optimizers to manage memory spikes during training.</p><p>The authors evaluate QLoRA on several benchmark datasets and compare its performance to full-precision baselines. Their results show that QLoRA matches the performance of standard 16-bit finetuning while dramatically reducing memory requirements, enough to finetune a 65B-parameter model on a single 48 GB GPU.</p><p>In the training script, the Llama-2 model is instantiated with a quantization_config parameter passed in as a BitsAndBytesConfig object that specifies:</p><pre><code><code># from &lt;https://github.com/philschmid/sagemaker-huggingface-llama-2-samples&gt;, run_clm.py
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
model = AutoModelForCausalLM.from_pretrained(
        args.model_id,
        use_cache=False
        if args.gradient_checkpointing
        else True,  # this is needed for gradient checkpointing
        device_map="auto",
        quantization_config=bnb_config,
    )
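
# Rough memory arithmetic, for intuition: 13B parameters stored in 4-bit NF4
# take about 13e9 * 0.5 bytes = ~6.5 GB for the frozen base weights, versus
# roughly 26 GB in bf16. That difference is what lets the 24 GB A10G on an
# ml.g5.4xlarge hold the model plus LoRA adapters and activations.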
</code></code></pre><p>Further, Parameter-Efficient Fine-Tuning (PEFT) is implemented to strike a middle ground between full fine-tuning, which is resource-intensive, and feature extraction, which leaves the model unadapted. PEFT adapts pre-trained language models (PLMs) to downstream tasks without updating all model parameters: it selectively trains a small number of extra parameters, striking a balance between performance and efficiency. Recent PEFT techniques have achieved performance comparable to full fine-tuning while significantly reducing computational and storage costs, making PEFT a valuable tool for scaling NLP models, particularly in resource-constrained scenarios.</p><p>The PEFT setup is done below:</p><pre><code><code># from &lt;https://github.com/philschmid/sagemaker-huggingface-llama-2-samples&gt;, run_clm.py
def create_peft_model(model, gradient_checkpointing=True, bf16=True):
    from peft import (
        get_peft_model,
        LoraConfig,
        TaskType,
        prepare_model_for_kbit_training,
    )
    from peft.tuners.lora import LoraLayer

    # prepare int-4 model for training
    model = prepare_model_for_kbit_training(
        model, use_gradient_checkpointing=gradient_checkpointing
    )
    if gradient_checkpointing:
        model.gradient_checkpointing_enable()

    # get LoRA target modules (find_all_linear_names is a helper defined
    # elsewhere in run_clm.py that lists the model's linear layer names)
    modules = find_all_linear_names(model)
    print(f"Found {len(modules)} modules to quantize: {modules}")

    peft_config = LoraConfig(
        r=64,
        lora_alpha=16,
        target_modules=modules,
        lora_dropout=0.1,
        bias="none",
        task_type=TaskType.CAUSAL_LM,
    )

    model = get_peft_model(model, peft_config)
    return model  # without this return, the function would hand back None
</code></code></pre><p>Finally, the training_function prepares the llama-2-13b-chat-hf model with the QLoRA and PEFT setup shown above before trainer.train() is called. Note how the script checks whether it is recovering from a Spot Instance interruption (or a user stop) by looking for the last checkpoint. The TrainingArguments instantiation, just before the Trainer is created, contains the parameters required for Spot Instance training: checkpointing is enabled by setting save_strategy to &#8220;steps&#8221;, and save_steps=50 writes a checkpoint every 50 steps to the s3:// checkpoint location. This is what makes training recovery possible. See the code below:</p><pre><code><code># modified from &lt;https://github.com/philschmid/sagemaker-huggingface-llama-2-samples&gt;
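# Illustration (an assumption about internals, not part of run_clm.py):
# transformers' get_last_checkpoint, used below for Spot recovery, behaves
# roughly like this sketch -- scan the output folder for "checkpoint-NNN"
# sub-directories and return the highest-numbered one, else None.
def _last_checkpoint_sketch(folder):
    import os, re
    ckpts = [d for d in os.listdir(folder) if re.fullmatch(r"checkpoint-\d+", d)]
    if not ckpts:
        return None
    return os.path.join(folder, max(ckpts, key=lambda d: int(d.split("-")[1])))
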
def training_function(args):
    # set seed
    set_seed(args.seed)

    dataset = load_from_disk(args.dataset_path)

    # load model from the hub with a bnb config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        args.model_id,
        use_cache=False
        if args.gradient_checkpointing
        else True,  # this is needed for gradient checkpointing
        device_map="auto",
        quantization_config=bnb_config,
    )

    # create peft config
    model = create_peft_model(
        model, gradient_checkpointing=args.gradient_checkpointing, bf16=args.bf16
    )

    # Define training args
    output_dir = args.output_dir
    training_args = TrainingArguments(
        output_dir=output_dir,
        resume_from_checkpoint=True,
        overwrite_output_dir=True,
        per_device_train_batch_size=args.per_device_train_batch_size,
        bf16=args.bf16,  # Use BF16 if available
        learning_rate=args.lr,
        num_train_epochs=args.epochs,
        gradient_checkpointing=args.gradient_checkpointing,
        # logging strategies
        logging_dir=f"{output_dir}/logs",
        logging_strategy="steps",
        logging_steps=10,
        #warmup_steps=100,
        save_steps=50,
        save_strategy="steps",
        report_to="wandb",
        run_name=f'md-asistant-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
    )

    # Create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=default_data_collator,
    )

    # check if checkpoint exists. if so continue training from where we left off, 
    # this is only for spot instances
    if get_last_checkpoint(args.output_dir) is not None:
        logger.info("***** continue training *****")
        last_checkpoint = get_last_checkpoint(args.output_dir)
        print(f'**********got last checkpoint = {last_checkpoint}**********************')
        trainer.train(resume_from_checkpoint=last_checkpoint)
    else:
        print('!!!!!!!!!!!!!!INITIAL TRAINING RUN!!!!!!!!!!!!!!!!!!!!')
        trainer.train() # no checkpoints found

    sagemaker_save_dir="/opt/ml/model/" # local container directory
    if args.merge_weights:
        # merge adapter weights with base model and save
        # save int 4 model
        trainer.model.save_pretrained(output_dir, safe_serialization=False)
        # clear memory
        del model
        del trainer
        torch.cuda.empty_cache()

        # load PEFT model in fp16
        model = AutoPeftModelForCausalLM.from_pretrained(
            output_dir,
            low_cpu_mem_usage=True,
            torch_dtype=torch.float16,
        )  
        # Merge LoRA and base model and save
        model = model.merge_and_unload()        
        model.save_pretrained(
            sagemaker_save_dir, safe_serialization=True, max_shard_size="2GB"
        )
    else:
        trainer.model.save_pretrained(
            sagemaker_save_dir, safe_serialization=True
        )

    # save tokenizer for easy inference
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
    tokenizer.save_pretrained(sagemaker_save_dir)
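
# Illustrative launch configuration (an assumption, not part of run_clm.py):
# the script above would be submitted from a notebook via the SageMaker
# HuggingFace estimator, with Spot training enabled roughly like so:
#
#   from sagemaker.huggingface import HuggingFace
#   estimator = HuggingFace(
#       entry_point="run_clm.py",
#       instance_type="ml.g5.4xlarge",
#       use_spot_instances=True,   # request discounted Spot capacity
#       max_run=60 * 60 * 72,      # hard limit on training seconds
#       max_wait=60 * 60 * 96,     # must be >= max_run; includes waiting for Spot capacity
#       checkpoint_s3_uri="s3://your-bucket/your-prefix/checkpoints",  # hypothetical bucket
#       ...
#   )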
</code></code></pre><p><strong>IV. Results</strong></p><p><strong>Fine-tuning traces on Weights &amp; Biases platform. </strong></p><p>http://wandb.ai</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pj71!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e02371b-409d-4a9d-aad7-d817851cf59e_880x552.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pj71!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e02371b-409d-4a9d-aad7-d817851cf59e_880x552.png 424w, https://substackcdn.com/image/fetch/$s_!pj71!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e02371b-409d-4a9d-aad7-d817851cf59e_880x552.png 848w, https://substackcdn.com/image/fetch/$s_!pj71!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e02371b-409d-4a9d-aad7-d817851cf59e_880x552.png 1272w, https://substackcdn.com/image/fetch/$s_!pj71!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e02371b-409d-4a9d-aad7-d817851cf59e_880x552.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pj71!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e02371b-409d-4a9d-aad7-d817851cf59e_880x552.png" width="880" height="552" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e02371b-409d-4a9d-aad7-d817851cf59e_880x552.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:552,&quot;width&quot;:880,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:50306,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pj71!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e02371b-409d-4a9d-aad7-d817851cf59e_880x552.png 424w, https://substackcdn.com/image/fetch/$s_!pj71!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e02371b-409d-4a9d-aad7-d817851cf59e_880x552.png 848w, https://substackcdn.com/image/fetch/$s_!pj71!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e02371b-409d-4a9d-aad7-d817851cf59e_880x552.png 1272w, https://substackcdn.com/image/fetch/$s_!pj71!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e02371b-409d-4a9d-aad7-d817851cf59e_880x552.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Model cost performance: How much did I save?</strong></p><p><strong>Table 1: On-demand versus Spot Instance cost on ml.g5.4xlarge</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fgwy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8204c15-efa0-4ace-ba34-bf6b5c66252a_822x92.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fgwy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8204c15-efa0-4ace-ba34-bf6b5c66252a_822x92.png 424w, 
https://substackcdn.com/image/fetch/$s_!Fgwy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8204c15-efa0-4ace-ba34-bf6b5c66252a_822x92.png 848w, https://substackcdn.com/image/fetch/$s_!Fgwy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8204c15-efa0-4ace-ba34-bf6b5c66252a_822x92.png 1272w, https://substackcdn.com/image/fetch/$s_!Fgwy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8204c15-efa0-4ace-ba34-bf6b5c66252a_822x92.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fgwy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8204c15-efa0-4ace-ba34-bf6b5c66252a_822x92.png" width="822" height="92" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8204c15-efa0-4ace-ba34-bf6b5c66252a_822x92.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:92,&quot;width&quot;:822,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:27925,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fgwy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8204c15-efa0-4ace-ba34-bf6b5c66252a_822x92.png 424w, 
https://substackcdn.com/image/fetch/$s_!Fgwy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8204c15-efa0-4ace-ba34-bf6b5c66252a_822x92.png 848w, https://substackcdn.com/image/fetch/$s_!Fgwy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8204c15-efa0-4ace-ba34-bf6b5c66252a_822x92.png 1272w, https://substackcdn.com/image/fetch/$s_!Fgwy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8204c15-efa0-4ace-ba34-bf6b5c66252a_822x92.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>As Table 1 shows, training on a small single-GPU instance takes longer (147 hours for 3 epochs), but the Spot Instance discounted cost for the entire fine-tuning cycle is $104.44, versus $298.41 on an equivalent on-demand instance. And this is a modest dataset: without QLoRA, PEFT, and Spot Instances, the fine-tuning cost could easily exceed a thousand dollars. Depending on the resources and time available, we have to balance cost, time, and accuracy to deliver the best product possible. In part 2 I plan to make trial runs with larger AWS SageMaker instances to compare cost and time, and perhaps identify a sweet spot for cost efficiency. In the meantime, we can see very significant fine-tuning cost savings by using QLoRA, PEFT, and AWS Spot Instances.</p><p>I encourage you to try different LLM fine-tuning setups using this newsletter article as a starting point, find what works for you, and share your experiences as we learn from each other.</p><p>In part 2 of this newsletter article, we will analyze and test the tuned model, deploy it to an inference instance, and build a chat UI for you to try&#128512;</p><p><strong>V. 
References</strong></p><p><a href="https://github.com/alsabay/ai_md_assistant/tree/main">All code in this project can be found here</a>.</p><p><a href="https://aws.amazon.com/sagemaker/">AWS SageMaker Platform</a></p><p><a href="https://github.com/huggingface/peft">Hugging Face PEFT</a></p><p><a href="https://arxiv.org/abs/2305.14314">QLoRA: Efficient Finetuning of Quantized LLMs</a></p><p><a href="https://arxiv.org/abs/2004.03329">MedDialog: Two Large-scale Medical Dialogue Datasets</a></p><p><a href="https://www.philschmid.de/sagemaker-spot-instance">Philschmid Blog on Spot Instances and Hugging Face Transformers</a></p><p><a href="https://github.com/philschmid/sagemaker-huggingface-llama-2-samples">Philschmid Github on Sagemaker-HuggingFace llama-2 fine-tuning and deployment</a></p><p><a href="https://docs.wandb.ai/guides">Weights &amp; Biases platform documentation</a></p>]]></content:encoded></item><item><title><![CDATA[Unraveling the Hidden Magic of Vector Databases 🔍✨]]></title><description><![CDATA[A simple example of how Vector Databases work behind the scenes]]></description><link>https://www.nexusnotes.co/p/unraveling-the-hidden-magic-of-vector</link><guid isPermaLink="false">https://www.nexusnotes.co/p/unraveling-the-hidden-magic-of-vector</guid><dc:creator><![CDATA[Alfeo Sabay]]></dc:creator><pubDate>Sat, 22 Jul 2023 11:43:55 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4535f86e-b55e-4a89-91ea-5bb25cb987f2_1024x1024.png" length="0" type="image/png"/><content:encoded><![CDATA[<h3>Introduction:</h3><p>Today, I'm thrilled to share the transformative potential of vector databases in the realm of data-driven decision making. Over the years, I've witnessed how businesses have harnessed the full potential of analytics, and vector databases have emerged as a game-changer, revolutionizing AI and ML architectures. Now, I'm excited to take you on a journey to explore how this cutting-edge technology can be applied in a real-world scenario: a Movie Recommender system, similar to what powers personalized recommendations on platforms like Netflix or Prime Video.</p><h4>How Vector Databases Drive Personalized Movie Recommendations:</h4><p>In our Python code example, I'll showcase the seamless integration of Faiss, Meta's powerful open-source similarity-search library, to create a sophisticated Movie Recommender. 
By leveraging Faiss alongside Word2Vec embeddings, I'll demonstrate how this technology powers accurate and efficient similarity searches based on movie preferences. This approach enables streaming platforms to offer users a highly personalized and engaging movie-watching experience.</p><h4>Unlocking the Secrets of Faiss in Movie Recommendations:</h4><p>Through the code example, I'll reveal how Faiss efficiently organizes and indexes movie vectors, ensuring rapid retrieval of the most relevant movie recommendations for each user. By combining fast vector search with learned embeddings, this system elevates the movie discovery process, enticing users to explore new genres and undiscovered gems.</p><p>Join me as we dive into the Python code, where you'll see firsthand the capabilities of vector search in action in the Movie Recommender we're about to build. 
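</p><p>To make the idea concrete before diving into the full example, here is a minimal, self-contained sketch (with toy 3-d vectors and invented titles, not the real Word2Vec embeddings) of the nearest-neighbour search that Faiss performs over movie vectors:</p><pre><code>import math

# Toy "embeddings" for a handful of movies (illustrative values only)
item_vectors = {
    "alien":    [0.90, 0.10, 0.00],
    "aliens":   [0.85, 0.15, 0.05],
    "titanic":  [0.10, 0.90, 0.20],
    "notebook": [0.05, 0.95, 0.25],
}

def l2(a, b):
    # Euclidean distance -- the same metric as a flat L2 Faiss index
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, k=2):
    # brute-force scan, which is what a flat index does (minus the speed)
    return sorted(item_vectors, key=lambda m: l2(item_vectors[m], query))[:k]

print(nearest(item_vectors["alien"]))  # the query movie itself, then its closest neighbour
</code></pre><p>Faiss layers optimized indexing structures and vectorized distance computation on top of this same idea, which is what makes it practical at catalogue scale.</p><p>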
</p><h3>How the Recommendation System Works</h3><p>Let me walk you through the technology behind our movie recommendation system. At the core of this system are Word2Vec ( <em>here I&#8217;m using the <a href="https://github.com/RaRe-Technologies/gensim/tree/develop">gensim</a> Python library, it&#8217;s super fast</em> ) embeddings, a powerful technique that represents movie titles as dense vectors in a high-dimensional space. This enables the system to capture semantic relationships between movies, facilitating efficient comparison and similarity calculations.</p><p>Word2Vec uses a neural network to learn word embeddings from vast amounts of textual data, such as movie titles in our case. Each word is mapped to a dense vector, where similar words are placed closer together in the vector space. This linguistic context allows us to establish connections between movie titles, even those that may not share identical words but are conceptually related.</p><p>To show how these similarity searches work, let&#8217;s use Faiss, a remarkable similarity-search library from Meta ( <em>though there are also many dedicated vector databases to choose from, such as Pinecone, Milvus, Weaviate, etc.</em> ). Faiss optimizes the organization and indexing of movie vectors, enabling us to find similar movies rapidly based on user input. By employing Faiss alongside Word2Vec embeddings, we achieve lightning-fast response times, providing users with an exceptional movie discovery experience.</p><p>Now that we understand the technology behind our recommendation system, let's dive into it!</p><h3>Building the Recommendation System</h3><p>I ran this example Python code on AWS SageMaker with a <strong>ml.g4dn.2xlarge </strong>single GPU notebook instance with 32 GiB of memory, but you should be able to run this on a local machine, preferably with a GPU. 
I opted to use the Kaggle <strong><a href="https://www.kaggle.com/datasets/singole/tmdb-1000-movie-dataset-2023-new-updated?select=Movies_dataset.csv">TMDB 3000+ Movie Dataset-2023</a> . </strong></p><p><strong>Loading Word2Vec Model and Dataset</strong>: We start by importing necessary libraries, capturing messages (output), and loading a pre-trained Word2Vec model from the <em><a href="https://github.com/RaRe-Technologies/gensim/tree/develop">gensim</a></em> library. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Wy02!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a35646-764b-4448-a88d-3fab7ced2c8a_1754x386.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Wy02!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a35646-764b-4448-a88d-3fab7ced2c8a_1754x386.png 424w, https://substackcdn.com/image/fetch/$s_!Wy02!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a35646-764b-4448-a88d-3fab7ced2c8a_1754x386.png 848w, https://substackcdn.com/image/fetch/$s_!Wy02!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a35646-764b-4448-a88d-3fab7ced2c8a_1754x386.png 1272w, https://substackcdn.com/image/fetch/$s_!Wy02!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a35646-764b-4448-a88d-3fab7ced2c8a_1754x386.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Wy02!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a35646-764b-4448-a88d-3fab7ced2c8a_1754x386.png" width="1456" height="320" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/66a35646-764b-4448-a88d-3fab7ced2c8a_1754x386.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:320,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:89032,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Wy02!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a35646-764b-4448-a88d-3fab7ced2c8a_1754x386.png 424w, https://substackcdn.com/image/fetch/$s_!Wy02!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a35646-764b-4448-a88d-3fab7ced2c8a_1754x386.png 848w, https://substackcdn.com/image/fetch/$s_!Wy02!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a35646-764b-4448-a88d-3fab7ced2c8a_1754x386.png 1272w, https://substackcdn.com/image/fetch/$s_!Wy02!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a35646-764b-4448-a88d-3fab7ced2c8a_1754x386.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Preprocessing Text Data</strong>: The 'Movie_Name' column in the dataset contains movie titles 
that need to be preprocessed before using them as input to the Word2Vec model. The function <code>preprocess_text</code> removes non-alphanumeric characters and converts the text to lowercase. The cleaned movie titles are then stored in a new column called 'title_cleaned' as shown here.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NYJq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65aa9a3-af19-4a91-a181-50fa344b37b1_1756x568.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NYJq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65aa9a3-af19-4a91-a181-50fa344b37b1_1756x568.png 424w, https://substackcdn.com/image/fetch/$s_!NYJq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65aa9a3-af19-4a91-a181-50fa344b37b1_1756x568.png 848w, https://substackcdn.com/image/fetch/$s_!NYJq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65aa9a3-af19-4a91-a181-50fa344b37b1_1756x568.png 1272w, https://substackcdn.com/image/fetch/$s_!NYJq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65aa9a3-af19-4a91-a181-50fa344b37b1_1756x568.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NYJq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65aa9a3-af19-4a91-a181-50fa344b37b1_1756x568.png" width="1456" height="471" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f65aa9a3-af19-4a91-a181-50fa344b37b1_1756x568.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:471,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:160176,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NYJq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65aa9a3-af19-4a91-a181-50fa344b37b1_1756x568.png 424w, https://substackcdn.com/image/fetch/$s_!NYJq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65aa9a3-af19-4a91-a181-50fa344b37b1_1756x568.png 848w, https://substackcdn.com/image/fetch/$s_!NYJq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65aa9a3-af19-4a91-a181-50fa344b37b1_1756x568.png 1272w, https://substackcdn.com/image/fetch/$s_!NYJq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65aa9a3-af19-4a91-a181-50fa344b37b1_1756x568.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Creating Item Vectors</strong>: The function <code>item_name_to_vector</code> converts movie titles to their vector representations using the Word2Vec model. For each movie in the dataset, a dictionary 'item_vectors' is created, where the key is the movie title, and the value is its corresponding vector. 
Movie titles not found in the Word2Vec vocabulary are excluded from the 'item_vectors' dictionary.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qSob!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2187c354-4fd7-4638-b6ba-d4acf2754a7c_1756x556.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qSob!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2187c354-4fd7-4638-b6ba-d4acf2754a7c_1756x556.png 424w, https://substackcdn.com/image/fetch/$s_!qSob!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2187c354-4fd7-4638-b6ba-d4acf2754a7c_1756x556.png 848w, https://substackcdn.com/image/fetch/$s_!qSob!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2187c354-4fd7-4638-b6ba-d4acf2754a7c_1756x556.png 1272w, https://substackcdn.com/image/fetch/$s_!qSob!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2187c354-4fd7-4638-b6ba-d4acf2754a7c_1756x556.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qSob!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2187c354-4fd7-4638-b6ba-d4acf2754a7c_1756x556.png" width="1456" height="461" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2187c354-4fd7-4638-b6ba-d4acf2754a7c_1756x556.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:461,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:163383,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qSob!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2187c354-4fd7-4638-b6ba-d4acf2754a7c_1756x556.png 424w, https://substackcdn.com/image/fetch/$s_!qSob!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2187c354-4fd7-4638-b6ba-d4acf2754a7c_1756x556.png 848w, https://substackcdn.com/image/fetch/$s_!qSob!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2187c354-4fd7-4638-b6ba-d4acf2754a7c_1756x556.png 1272w, https://substackcdn.com/image/fetch/$s_!qSob!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2187c354-4fd7-4638-b6ba-d4acf2754a7c_1756x556.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Setting up Faiss Index</strong>: The 'item_vectors' are then converted to a NumPy array, which is used to initialize a Faiss index with L2 (Euclidean) distance metric. 
The item vectors are added to the Faiss index, and the index is saved to a file for future use.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!poU6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c12207c-f2dc-421b-b414-4f104ffccb3a_1752x766.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!poU6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c12207c-f2dc-421b-b414-4f104ffccb3a_1752x766.png 424w, https://substackcdn.com/image/fetch/$s_!poU6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c12207c-f2dc-421b-b414-4f104ffccb3a_1752x766.png 848w, https://substackcdn.com/image/fetch/$s_!poU6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c12207c-f2dc-421b-b414-4f104ffccb3a_1752x766.png 1272w, https://substackcdn.com/image/fetch/$s_!poU6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c12207c-f2dc-421b-b414-4f104ffccb3a_1752x766.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!poU6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c12207c-f2dc-421b-b414-4f104ffccb3a_1752x766.png" width="1456" height="637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7c12207c-f2dc-421b-b414-4f104ffccb3a_1752x766.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:637,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:179392,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!poU6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c12207c-f2dc-421b-b414-4f104ffccb3a_1752x766.png 424w, https://substackcdn.com/image/fetch/$s_!poU6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c12207c-f2dc-421b-b414-4f104ffccb3a_1752x766.png 848w, https://substackcdn.com/image/fetch/$s_!poU6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c12207c-f2dc-421b-b414-4f104ffccb3a_1752x766.png 1272w, https://substackcdn.com/image/fetch/$s_!poU6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c12207c-f2dc-421b-b414-4f104ffccb3a_1752x766.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Shown below is a partial output of the print statement above showing the curated vector items.  
We will use some of these vector &#8220;items&#8221; later to represent user preference data collected from user interaction.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AGyf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f1291f7-76be-4f37-9272-092c2fcdbee0_1772x334.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AGyf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f1291f7-76be-4f37-9272-092c2fcdbee0_1772x334.png 424w, https://substackcdn.com/image/fetch/$s_!AGyf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f1291f7-76be-4f37-9272-092c2fcdbee0_1772x334.png 848w, https://substackcdn.com/image/fetch/$s_!AGyf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f1291f7-76be-4f37-9272-092c2fcdbee0_1772x334.png 1272w, https://substackcdn.com/image/fetch/$s_!AGyf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f1291f7-76be-4f37-9272-092c2fcdbee0_1772x334.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AGyf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f1291f7-76be-4f37-9272-092c2fcdbee0_1772x334.png" width="1456" height="274" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f1291f7-76be-4f37-9272-092c2fcdbee0_1772x334.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:274,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:155692,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AGyf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f1291f7-76be-4f37-9272-092c2fcdbee0_1772x334.png 424w, https://substackcdn.com/image/fetch/$s_!AGyf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f1291f7-76be-4f37-9272-092c2fcdbee0_1772x334.png 848w, https://substackcdn.com/image/fetch/$s_!AGyf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f1291f7-76be-4f37-9272-092c2fcdbee0_1772x334.png 1272w, https://substackcdn.com/image/fetch/$s_!AGyf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f1291f7-76be-4f37-9272-092c2fcdbee0_1772x334.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Getting Similar Movie Recommendations</strong>: A function called <code>find_similar_items</code> is defined to get similar movie recommendations based on user input. Users can input their preferences or interests, and the function uses Faiss to find similar movies to the user's input. 
The similarity is based on the Word2Vec vectors of the movie titles. The function then displays the top-k similar movie recommendations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8JF9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad79708-227f-47e3-9564-c7349a48a891_1754x916.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8JF9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad79708-227f-47e3-9564-c7349a48a891_1754x916.png 424w, https://substackcdn.com/image/fetch/$s_!8JF9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad79708-227f-47e3-9564-c7349a48a891_1754x916.png 848w, https://substackcdn.com/image/fetch/$s_!8JF9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad79708-227f-47e3-9564-c7349a48a891_1754x916.png 1272w, https://substackcdn.com/image/fetch/$s_!8JF9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad79708-227f-47e3-9564-c7349a48a891_1754x916.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8JF9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad79708-227f-47e3-9564-c7349a48a891_1754x916.png" width="1456" height="760" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ad79708-227f-47e3-9564-c7349a48a891_1754x916.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:760,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:268550,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8JF9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad79708-227f-47e3-9564-c7349a48a891_1754x916.png 424w, https://substackcdn.com/image/fetch/$s_!8JF9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad79708-227f-47e3-9564-c7349a48a891_1754x916.png 848w, https://substackcdn.com/image/fetch/$s_!8JF9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad79708-227f-47e3-9564-c7349a48a891_1754x916.png 1272w, https://substackcdn.com/image/fetch/$s_!8JF9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad79708-227f-47e3-9564-c7349a48a891_1754x916.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Sample Usage</strong>: The notebook provides sample usage examples to demonstrate how to use the recommendation system. 
It shows how to get recommendations for a general movie preference (using <code>find_similar_items</code>) and how to get personalized recommendations for a user (using <code>get_user_recommendations</code>) based on their liked movies.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I6k7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F973f96dd-f416-4856-86e7-f0fa230d31a4_1782x404.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I6k7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F973f96dd-f416-4856-86e7-f0fa230d31a4_1782x404.png 424w, https://substackcdn.com/image/fetch/$s_!I6k7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F973f96dd-f416-4856-86e7-f0fa230d31a4_1782x404.png 848w, https://substackcdn.com/image/fetch/$s_!I6k7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F973f96dd-f416-4856-86e7-f0fa230d31a4_1782x404.png 1272w, https://substackcdn.com/image/fetch/$s_!I6k7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F973f96dd-f416-4856-86e7-f0fa230d31a4_1782x404.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I6k7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F973f96dd-f416-4856-86e7-f0fa230d31a4_1782x404.png" width="1456" height="330" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/973f96dd-f416-4856-86e7-f0fa230d31a4_1782x404.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:330,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63414,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I6k7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F973f96dd-f416-4856-86e7-f0fa230d31a4_1782x404.png 424w, https://substackcdn.com/image/fetch/$s_!I6k7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F973f96dd-f416-4856-86e7-f0fa230d31a4_1782x404.png 848w, https://substackcdn.com/image/fetch/$s_!I6k7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F973f96dd-f416-4856-86e7-f0fa230d31a4_1782x404.png 1272w, https://substackcdn.com/image/fetch/$s_!I6k7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F973f96dd-f416-4856-86e7-f0fa230d31a4_1782x404.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S3eJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafdbe90d-4b34-4591-b855-8120ef9bb7c3_1762x510.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S3eJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafdbe90d-4b34-4591-b855-8120ef9bb7c3_1762x510.png 424w, https://substackcdn.com/image/fetch/$s_!S3eJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafdbe90d-4b34-4591-b855-8120ef9bb7c3_1762x510.png 848w, https://substackcdn.com/image/fetch/$s_!S3eJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafdbe90d-4b34-4591-b855-8120ef9bb7c3_1762x510.png 1272w, https://substackcdn.com/image/fetch/$s_!S3eJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafdbe90d-4b34-4591-b855-8120ef9bb7c3_1762x510.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S3eJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafdbe90d-4b34-4591-b855-8120ef9bb7c3_1762x510.png" width="1456" height="421" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/afdbe90d-4b34-4591-b855-8120ef9bb7c3_1762x510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:421,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:121781,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!S3eJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafdbe90d-4b34-4591-b855-8120ef9bb7c3_1762x510.png 424w, https://substackcdn.com/image/fetch/$s_!S3eJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafdbe90d-4b34-4591-b855-8120ef9bb7c3_1762x510.png 848w, https://substackcdn.com/image/fetch/$s_!S3eJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafdbe90d-4b34-4591-b855-8120ef9bb7c3_1762x510.png 1272w, https://substackcdn.com/image/fetch/$s_!S3eJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafdbe90d-4b34-4591-b855-8120ef9bb7c3_1762x510.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here&#8217;s the output for User 3:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WVR8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1745da8-7c62-43f6-b94b-86ce20fcaf03_1774x234.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WVR8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1745da8-7c62-43f6-b94b-86ce20fcaf03_1774x234.png 424w, https://substackcdn.com/image/fetch/$s_!WVR8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1745da8-7c62-43f6-b94b-86ce20fcaf03_1774x234.png 848w, https://substackcdn.com/image/fetch/$s_!WVR8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1745da8-7c62-43f6-b94b-86ce20fcaf03_1774x234.png 1272w, https://substackcdn.com/image/fetch/$s_!WVR8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1745da8-7c62-43f6-b94b-86ce20fcaf03_1774x234.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WVR8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1745da8-7c62-43f6-b94b-86ce20fcaf03_1774x234.png" width="1456" height="192" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1745da8-7c62-43f6-b94b-86ce20fcaf03_1774x234.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:192,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33167,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WVR8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1745da8-7c62-43f6-b94b-86ce20fcaf03_1774x234.png 424w, https://substackcdn.com/image/fetch/$s_!WVR8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1745da8-7c62-43f6-b94b-86ce20fcaf03_1774x234.png 848w, https://substackcdn.com/image/fetch/$s_!WVR8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1745da8-7c62-43f6-b94b-86ce20fcaf03_1774x234.png 1272w, https://substackcdn.com/image/fetch/$s_!WVR8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1745da8-7c62-43f6-b94b-86ce20fcaf03_1774x234.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h3>Conclusion</h3><p>In conclusion, vector databases have revolutionized data-driven decision making, as exemplified by our movie recommendation system. Leveraging Word2Vec embeddings and Faiss, we provided users with highly personalized movie suggestions. 
Across industries, vector databases power personalized experiences, fraud detection, and faster decision-making. Their speed and accuracy unlock real potential in recommendation systems and data analysis, and the techniques shown here scale well beyond movie suggestions. Let's continue pushing the boundaries of AI and ML together.</p><p>If you would like to kick the tires and take this sample code for a spin, you can clone the code in this blog from my <a href="https://github.com/alsabay/vector_db_recommender/tree/main">GitHub repository</a>.</p>]]></content:encoded></item></channel></rss>