In this tutorial, we build an advanced voice AI agent using Hugging Face’s freely available models, and we keep the entire pipeline simple enough to run smoothly on Google Colab. We combine Whisper for speech recognition, FLAN-T5 for natural language reasoning, and Bark for speech synthesis, all connected through transformers pipelines. By doing this, we avoid heavy dependencies, API keys, or complicated setups, and we focus on showing how we can turn voice input into meaningful conversation and get back natural-sounding voice responses in real time. Check out the FULL CODES here.
!pip -q install "transformers>=4.42.0" accelerate torchaudio sentencepiece gradio soundfile
import os, torch, tempfile, numpy as np
import gradio as gr
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
DEVICE = 0 if torch.cuda.is_available() else -1
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small.en",
    device=DEVICE,
    chunk_length_s=30,
    return_timestamps=False
)
LLM_MODEL = "google/flan-t5-base"
tok = AutoTokenizer.from_pretrained(LLM_MODEL)
llm = AutoModelForSeq2SeqLM.from_pretrained(LLM_MODEL, device_map="auto")
tts = pipeline("text-to-speech", model="suno/bark-small")
We install the necessary libraries and load three Hugging Face pipelines: Whisper for speech-to-text, FLAN-T5 for generating responses, and Bark for text-to-speech. We set the device automatically so that we can use GPU if available. Check out the FULL CODES here.
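As an optional sanity check, we can poke the loaded pipelines directly before wiring anything together; the short snippet below is a sketch that assumes it runs in the same Colab session right after the cell above.
out = tts("Hello! The pipelines are loaded and ready.")  # Bark returns raw audio plus its sampling rate
print(type(out["audio"]), out["sampling_rate"])
test = tok("Summarize: Whisper transcribes, FLAN-T5 reasons, Bark speaks.", return_tensors="pt").to(llm.device)
print(tok.decode(llm.generate(**test, max_new_tokens=32)[0], skip_special_tokens=True))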
SYSTEM_PROMPT = (
    "You are a helpful, concise voice assistant. "
    "Prefer direct, structured answers. "
    "If the user asks for steps or code, use short bullet points."
)
def format_dialog(history, user_text):
    turns = []
    for u, a in history:
        if u: turns.append(f"User: {u}")
        if a: turns.append(f"Assistant: {a}")
    turns.append(f"User: {user_text}")
    prompt = (
        "Instruction:\n"
        f"{SYSTEM_PROMPT}\n\n"
        "Dialog so far:\n" + "\n".join(turns) + "\n\n"
        "Assistant:"
    )
    return prompt
We define a system prompt that guides our agent to stay concise and structured, and we implement a format_dialog function that takes past conversation history along with the user input and builds a prompt string for the model to generate the assistant’s reply. Check out the FULL CODES here.
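To see exactly what the model receives, we can print the prompt for a small, hypothetical two-turn history; the example below is only illustrative and is not part of the agent itself.
demo_history = [("What is FLAN-T5?", "An instruction-tuned T5 model from Google.")]  # hypothetical turns
print(format_dialog(demo_history, "Can it run on a free Colab GPU?"))
# Prints the system prompt, the User/Assistant turns, and a trailing "Assistant:" cue for generation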
def transcribe(filepath):
    out = asr(filepath)
    text = out["text"].strip()
    return text

def generate_reply(history, user_text, max_new_tokens=256):
    prompt = format_dialog(history, user_text)
    inputs = tok(prompt, return_tensors="pt", truncation=True).to(llm.device)
    with torch.no_grad():
        ids = llm.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.05,
        )
    reply = tok.decode(ids[0], skip_special_tokens=True).strip()
    return reply
def synthesize_speech(text):
    out = tts(text)
    audio = out["audio"]
    sr = out["sampling_rate"]
    # Bark may return audio shaped (1, samples); flatten so Gradio treats it as mono
    audio = np.asarray(audio, dtype=np.float32).squeeze()
    return (sr, audio)
We create three core functions for our voice agent: transcribe converts recorded audio into text using Whisper, generate_reply builds a context-aware response from FLAN-T5, and synthesize_speech turns that response back into spoken audio with Bark. Check out the FULL CODES here.
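If we want to test the chain without the UI, we can run the three functions back to back; this is a sketch that assumes sample.wav is a short recording we have uploaded to the Colab session ourselves.
user_text = transcribe("sample.wav")   # assumed: any short speech clip uploaded to Colab
reply = generate_reply([], user_text)  # empty history for a single-turn test
sr, wav = synthesize_speech(reply)
print(user_text, "->", reply, f"({len(wav) / sr:.1f}s of audio)")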
def clear_history():
    return [], []

def voice_to_voice(mic_file, history):
    history = history or []
    if not mic_file:
        return history, None, "Please record something!"
    try:
        user_text = transcribe(mic_file)
    except Exception as e:
        return history, None, f"ASR error: {e}"
    if not user_text:
        return history, None, "Didn't catch that. Try again?"
    try:
        reply = generate_reply(history, user_text)
    except Exception as e:
        return history, None, f"LLM error: {e}"
    try:
        sr, wav = synthesize_speech(reply)
    except Exception as e:
        return history + [(user_text, reply)], None, f"TTS error: {e}"
    return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"

def text_to_voice(user_text, history):
    history = history or []
    user_text = (user_text or "").strip()
    if not user_text:
        return history, None, "Type a message first."
    try:
        reply = generate_reply(history, user_text)
        sr, wav = synthesize_speech(reply)
    except Exception as e:
        return history, None, f"Error: {e}"
    return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"

def export_chat(history):
    lines = []
    for u, a in history or []:
        lines += [f"User: {u}", f"Assistant: {a}", ""]
    text = "\n".join(lines).strip() or "No conversation yet."
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt", mode="w") as f:
        f.write(text)
        path = f.name
    return path
We add interactive functions for our agent: clear_history resets the conversation, voice_to_voice handles speech input and returns a spoken reply, text_to_voice processes typed input and speaks back, and export_chat saves the entire dialog into a downloadable text file. Check out the FULL CODES here.
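These callbacks also work headlessly, which is handy for debugging before launching the UI; the snippet below is a sketch that writes the spoken reply to disk with soundfile (already installed in the first cell).
import soundfile as sf

history, (sr, wav), transcript_text = text_to_voice("Give me three tips for clean Python code.", [])
sf.write("reply.wav", wav, sr)  # save the Bark audio so we can replay or download it
print(transcript_text)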
with gr.Blocks(title="Advanced Voice AI Agent (HF Pipelines)") as demo:
    gr.Markdown(
        "## Advanced Voice AI Agent (Hugging Face Pipelines Only)\n"
        "- **ASR**: openai/whisper-small.en\n"
        "- **LLM**: google/flan-t5-base\n"
        "- **TTS**: suno/bark-small\n"
        "Speak or type; the agent replies with voice + text."
    )
    with gr.Row():
        with gr.Column(scale=1):
            mic = gr.Audio(sources=["microphone"], type="filepath", label="Record")
            say_btn = gr.Button("Speak")
            text_in = gr.Textbox(label="Or type instead", placeholder="Ask me anything…")
            text_btn = gr.Button("Send")
            export_btn = gr.Button("Export Chat (.txt)")
            reset_btn = gr.Button("Reset")
        with gr.Column(scale=1):
            audio_out = gr.Audio(label="Assistant Voice", autoplay=True)
            transcript = gr.Textbox(label="Transcript", lines=6)
            chat = gr.Chatbot(height=360)
    state = gr.State([])

    def update_chat(history):
        return [(u, a) for u, a in (history or [])]

    say_btn.click(voice_to_voice, [mic, state], [state, audio_out, transcript]).then(
        update_chat, inputs=state, outputs=chat
    )
    text_btn.click(text_to_voice, [text_in, state], [state, audio_out, transcript]).then(
        update_chat, inputs=state, outputs=chat
    )
    reset_btn.click(clear_history, None, [chat, state])
    export_btn.click(export_chat, state, gr.File(label="Download chat.txt"))

demo.launch(debug=False)
We build a clean Gradio UI that lets us speak or type and then hear the agent’s response. We wire buttons to our callbacks, maintain chat state, and stream results into a chatbot, transcript, and audio player, all launched in one Colab app.
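If the inline Colab preview is flaky, one option is Gradio's built-in tunnel, which serves the same app over a temporary public URL; replacing the final line with the call below is enough.
demo.launch(share=True, debug=False)  # share=True asks Gradio for a temporary public link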
In conclusion, we see how seamlessly Hugging Face pipelines enable us to create a voice-driven conversational agent that listens, thinks, and responds. We now have a working demo that captures audio, transcribes it, generates intelligent responses, and returns speech output, all inside Colab. With this foundation, we can experiment with larger models, add multilingual support, or even extend the system with custom logic. Still, the core idea remains the same: we can bring together ASR, LLM, and TTS into one smooth workflow for an interactive voice AI experience.
Check out the FULL CODES here.