Do you have a lot of incredible information to share, but can't find the motivation to create your first podcast episode? Getting started can be easier than you think with the AI tools available today. Imagine dumping all of your useful notes, research, and documents into a folder, asking AI to read them all and write a podcast, and getting back a polished transcript for you to perform. Now picture that same AI taking the transcript and generating the entire episode using AI voices. It's truly amazing what this technology can do! Strap in – here's how you can build it!
What tools are we going to use?
- ElevenLabs for voice cloning and text-to-audio
- LlamaIndex for indexing our content and handling LLM prompts
Create voice clones
To start, we need to record audio for each voice in our podcast – the host(s) and guest(s). This step is important because a unique voice sets your podcast apart from the pack of AI-voiced podcasts out there.
Crafting the perfect voice clone: a comprehensive guide
Creating a voice clone is exciting because it bridges the gap between technology and personalization. However, getting it right requires attention to detail, starting with the recordings that will be used to create the clone. Here's how to navigate the process for the best results.
Preparing to clone
Timeline expectations: The journey to your voice clone starts with understanding the time commitment. Custom models demand fine-tuning and training. We recommend preparing for a timeline of approximately 4 weeks, though this is an estimate and can vary based on demand and other factors.
Recording best practices
To ensure your voice clone mirrors your expectations, each step in the recording process plays a pivotal role. Best practices:
- Professional equipment: The foundation of a great voice clone is high-quality recording equipment. We suggest using an XLR microphone with a dedicated audio interface for optimal fidelity. Consider starting with models like the Audio Technica AT2020 or Rode NT1, paired with interfaces such as Focusrite.
- Pop filter usage: Minimize plosives in your recordings by employing a pop filter. It’s a simple addition that significantly enhances audio clarity.
- Optimal microphone distance: The distance between the speaker and the microphone can drastically affect sound quality. A good rule of thumb is to maintain about two fists of distance, though this may vary depending on the recording style you're aiming for.
- Clarity is key: Aim for a noise-free recording environment to avoid any unwanted background sounds. The cleaner your audio input, the better the cloning results.
- Room acoustics: Recording in an acoustically-treated space minimizes echoes and background noise. If professional treatment isn’t feasible, improvise with thick materials like duvets or quilts to dampen your space.
Fine-tuning your audio
- Pre-processing: For those desiring a specific audio output (e.g., a polished podcast sound), consider pre-processing your audio to eliminate long pauses or fillers like “uhm”s and “ahm”s, which the AI will replicate.
- Volume control: Maintain a consistent and clear volume without causing distortion. Aiming for an audio level between -23dB and -18dB RMS, with a true peak of -3dB, is ideal for balance.
- Sufficient audio length: Quality and quantity both matter here. Providing at least 30 minutes of high-quality audio, and ideally closer to 3 hours, will greatly enhance the cloning process. If uploading hours of audio, divide it into 30-minute segments to ease the process (see the sketch after this list).
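If you want to sanity-check your levels and split a long recording programmatically, here is a minimal sketch using pydub (a third-party library that requires ffmpeg; the file names are illustrative). pydub reports average and peak loudness in dBFS, which is a rough proxy for the RMS and true-peak targets above rather than a broadcast-grade meter:
from pydub import AudioSegment

audio = AudioSegment.from_file('raw_recording.wav')

# Rough level check: average and peak loudness in dBFS
print(f'average loudness: {audio.dBFS:.1f} dBFS')
print(f'peak: {audio.max_dBFS:.1f} dBFS')

# Split into 30-minute segments for upload
segment_ms = 30 * 60 * 1000
for i, start in enumerate(range(0, len(audio), segment_ms)):
    chunk = audio[start:start + segment_ms]
    chunk.export(f'voice_sample_{i:02d}.wav', format='wav')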
Final steps before cloning
- Uploading your audio: Once you hit upload, the audio samples are locked in. Ensure accuracy in what you submit for cloning.
- Verify your voice: Verification is the final checkpoint. Using similar equipment and tone as your recorded samples is crucial for a smooth verification process.
By adhering to these guidelines, you’re setting the stage for a successful voice cloning experience. Remember, the quality of your clone largely depends on the care and precision applied throughout this process. Let’s embark on this remarkable technological journey together.
Uploading the voice clone recording
Now that we have a professional audio sample to generate our voice clone from, we can upload the sample and wait for the clone to be generated. This process can take up to 4 weeks for professional voice cloning, so it's best to upload now and start preparing the other parts of your podcast. There are open-source options for voice cloning, but they are not as high-quality as ElevenLabs, and for the price it's worth using the best.
Let's get started uploading our voice sample.
1. Sign in to ElevenLabs and open the Voices section
2. Click Add a new voice
3. Click on Professional Voice Cloning
4. Upload your voice sample
Once subscribed to the Creator plan, you will notice that the character limit is 110,000. The average podcast episode has about 7,500 words for a runtime of 45 minutes, and each word averages about 5 characters, so an average episode comes to about 37,500 characters. The Creator plan also has an "Enable usage-based billing (surpass 110,000 characters)" option; you can turn on the toggle if you need more.
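As a quick sanity check on those numbers (plain Python, using the estimates above):
words_per_episode = 7500            # average 45-minute episode
chars_per_word = 5
print(words_per_episode * chars_per_word)  # 37500 – well under the 110,000 limit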
Creating the podcast transcript
With any creative project, getting started is the hardest part. That is why we are going to use AI to create a first draft that we can edit and enhance.
Before we get started, we need to iron out the details of the show:
- How many hosts will your show have?
- Will your show feature guests?
Now that we have determined the number of voices we will need, we can generate a rough draft of the podcast transcript using LlamaIndex.
Set up our environment
Installing LlamaIndex and loading documents into your RAG system.
LlamaIndex is a powerful Python package that allows you to efficiently index and query your document data to augment language models like GPT-4.
To get started, make sure you have Python 3.8+ installed. Open a terminal or command prompt and run:
pip install llama-index elevenlabs
Once LlamaIndex is installed, it’s time to load your documents. LlamaIndex supports a wide variety of data sources including .txt files, PDFs, URLs, Google Docs, Notion pages, and more. For this example, let’s assume you have a directory full of .txt files you want to index.
First, import the necessary modules:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, \
    StorageContext, load_index_from_storage
import json
import os
def createVectorIndex(path):
    PERSIST_DIR = "./storage"
    if not os.path.exists(PERSIST_DIR):
        # load the documents and create the index
        documents = SimpleDirectoryReader(path).load_data()
        index = VectorStoreIndex.from_documents(documents)
        # store it for later
        index.storage_context.persist(persist_dir=PERSIST_DIR)
    else:
        # load the existing index
        storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        index = load_index_from_storage(storage_context)
    return index
This function takes a path to your document directory, loads the files using the SimpleDirectoryReader, and builds a simple vector index. Calling the function with your document path will persist the index to a local ./storage directory, ready for querying!
vectorIndex = createVectorIndex('data/mydocuments')
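Before moving on, it's worth confirming the index answers questions sensibly. A quick test query (the question string is just an illustration):
query_engine = vectorIndex.as_query_engine()
print(query_engine.query('What are the main topics covered in these documents?'))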
Write the code to generate transcripts
Now that we have all of our documents properly indexed and stored in a local vector index, we can write the code for generating a podcast transcript. We want to start with high-level outline steps and work our way down to the content of the transcript. To get our LLM to generate enough content for an entire podcast, we break the work into steps of no more than about 500 words each, so that the combined output can fill an entire episode.
First, let’s generate a podcast outline that will serve as our task list for breaking up the work into smaller sections. The outline will have topics and documents to reference during the writing of the transcript. This strategy will save context window space during the writing of the transcript so every topic writer does not have to read a large set of documents.
Outline prompt
Create a prompt to explain the task and how you want the response to be returned.
prompt = """
Objective: Write a podcast episode outline using all the information given.
Podcast subject: <INSERT_PODCAST_SUBJECT>
Podcast cast: <INSERT_HOST_NAME> (host), <INSERT_HOST_NAME> (host), <INSERT_GUEST_NAME> (guest)
Podcast estimated runtime: 45 minutes
Return the response in a JSON format that can be parsed using Python `json.loads()`.
Example response format:
{
    "outline": [
        {
            "topic": "<intro_topic>",
            "documents_to_reference": [<doc1>, <doc2>]
        },
        {
            "topic": "<topic>",
            "documents_to_reference": [<doc1>, <doc2>]
        },
        {
            "topic": "<topic>",
            "documents_to_reference": [<doc1>, <doc2>]
        },
        {
            "topic": "<topic>",
            "documents_to_reference": [<doc1>, <doc2>]
        },
        {
            "topic": "<topic>",
            "documents_to_reference": [<doc1>, <doc2>]
        },
        {
            "topic": "<topic>",
            "documents_to_reference": [<doc1>, <doc2>]
        },
        {
            "topic": "<outro_topic>",
            "documents_to_reference": [<doc1>, <doc2>]
        }
    ]
}
"""
# Adjust top k for the amount of information needed to create the outline
query_engine = vectorIndex.as_query_engine(similarity_top_k=10)
response = query_engine.query(prompt)
outline_response = json.loads(str(response))
This will generate our outline and return parsable JSON that we can use to build the transcript step by step. With each outline topic and the documents needed to write it, we create another prompt and let our LLM write each speaker's transcript to cover the topic.
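One practical caveat: LLMs sometimes wrap their JSON in Markdown code fences, which makes `json.loads()` fail. A small helper like the sketch below (the function name is our own, not a LlamaIndex API) can be swapped in for the `json.loads()` calls if you run into that:
def parse_json_response(raw: str) -> dict:
    # Strip Markdown code fences the model may have added around the JSON
    cleaned = raw.strip()
    if cleaned.startswith('```'):
        cleaned = cleaned.strip('`')
        if cleaned.startswith('json'):
            cleaned = cleaned[len('json'):]
    return json.loads(cleaned)

outline_response = parse_json_response(str(response))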
Create speaker transcripts
transcript_prompt_template = """
Podcast Outline:
{outline}
Current Transcript:
{transcript}
Podcast subject: <INSERT_PODCAST_SUBJECT>
Podcast cast: <INSERT_HOST_NAME> (host), <INSERT_HOST_NAME> (host), <INSERT_GUEST_NAME> (guest)
Podcast estimated runtime: 45 minutes
Documents to reference during writing transcript:
{documents_to_reference}
Write the podcast transcript only for the topic "{topic}", using as many speaker turns as needed.
Return the response in a JSON format that can be parsed using Python `json.loads()`.
Example response format:
{
    "transcripts": [
        {
            "speaker": "<speaker_name>",
            "transcript": "<transcript>"
        },
        ...
    ]
}
"""
full_transcript = []
full_outline_text = '\n'.join('- ' + section.get('topic') for section in outline_response.get('outline'))

for section in outline_response.get('outline'):
    topic_text = section.get('topic')
    documents_to_reference = section.get('documents_to_reference')
    documents_to_reference_text = '\n'.join(documents_to_reference)
    # Give the writer the transcript so far, so each topic flows from the last
    current_transcript = '\n'.join(f'{speaker.get("speaker")}: {speaker.get("transcript")}' for speaker in full_transcript)
    # str.replace() is used instead of str.format() because the template contains literal JSON braces
    transcript_prompt = transcript_prompt_template.replace('{topic}', topic_text).replace('{outline}', full_outline_text)\
        .replace('{transcript}', current_transcript).replace('{documents_to_reference}', documents_to_reference_text)
    transcript_response = query_engine.query(transcript_prompt)
    transcript_dict_response = json.loads(str(transcript_response))
    for speaker in transcript_dict_response.get('transcripts', []):
        full_transcript.append(speaker)

print(full_transcript)
# save transcript (see below)
The above code generates transcripts for each topic and adds them to the full_transcript list that we will use later for requesting audio. At this point, we have an outline and a full transcript of the podcast built from the information in our documents.
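To fill in the "save transcript" step, a minimal sketch that writes the transcript to disk (the transcript.json file name is our own choice):
with open('transcript.json', 'w') as file:
    json.dump(full_transcript, file, indent=2)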
Create audio data using ElevenLabs
Next, let's take our entire transcript and create audio clips for each section. At the end we can join these together into a single audio file and upload it to a podcast platform. It is important to save these clips individually in case you want to add music in post-production editing. For this next section we need to gather ElevenLabs voice IDs and create a dictionary mapping speaker names to voice IDs. Once we have this mapping we can match each speaker's transcript to the correct voice.
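If you don't have your voice IDs handy, the ElevenLabs Python SDK (the v0.x API used in this example) can list the voices on your account. A quick sketch:
from elevenlabs import set_api_key, voices

set_api_key('<INSERT_API_KEY>')
for voice in voices():
    print(voice.name, voice.voice_id)
With the IDs in hand, build the name-to-voice mapping: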
speaker_name_to_voice_id = {
    '<INSERT_SPEAKER_NAME>': '29vD33N1CtxCmqQRPOHJ',
    '<INSERT_SPEAKER_NAME>': '2EiwWnXFnvU5JabPnv8n',
    '<INSERT_SPEAKER_NAME>': '21m00Tcm4TlvDq8ikWAM'
}
import base64

from elevenlabs import generate, Voice, VoiceSettings  # v0.x elevenlabs SDK

for speaker in full_transcript:
    audio = generate(
        api_key='<INSERT_API_KEY>',
        text=speaker.get('transcript'),
        voice=Voice(
            voice_id=speaker_name_to_voice_id[speaker.get('speaker')],
            settings=VoiceSettings(stability=0.71, similarity_boost=0.5, style=0.0, use_speaker_boost=True)
        )
    )
    # Store the raw audio bytes as base64 so the clip survives JSON serialization
    speaker['audio'] = base64.b64encode(audio).decode('utf-8')
    print('returned audio', len(audio))

print(full_transcript)

# Save the full transcript together with its audio data
with open('audio_data.json', 'w') as file:
    json.dump(full_transcript, file)
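When you are ready to publish, here is a minimal sketch for decoding the saved clips and joining them into one episode file using pydub (a third-party library that requires ffmpeg; the episode.mp3 output name is our own choice). ElevenLabs returns MP3 audio by default, which is what the decoder assumes:
import base64
import io
import json

from pydub import AudioSegment

with open('audio_data.json') as file:
    clips = json.load(file)

# Decode each base64 clip and append it to a single episode track
episode = AudioSegment.empty()
for clip in clips:
    audio_bytes = base64.b64decode(clip['audio'])
    episode += AudioSegment.from_file(io.BytesIO(audio_bytes), format='mp3')

episode.export('episode.mp3', format='mp3')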
Now that you have a full outline, transcript, and audio recordings for your podcast, it's time to edit and upload!
AI audio content production
Generating a podcast from start to finish using AI tools can revolutionize your content creation process. By harnessing the power of voice cloning technology from ElevenLabs and document indexing with LlamaIndex, you can transform your notes, research, and documents into a captivating podcast episode with minimal effort.
The process begins with creating unique voice clones for your podcast hosts and guests, ensuring a distinct audio identity. By following best practices for recording and fine-tuning your audio samples, you lay the foundation for high-quality voice cloning results.
Next, indexing your content using LlamaIndex allows you to efficiently retrieve relevant information for generating a podcast outline and transcript. By breaking down the task into manageable steps and leveraging the power of language models like GPT-4, you can create a comprehensive podcast script that covers all the desired topics.
While AI-generated podcasts may require some editing and post-production work, the time and effort saved in the content creation process are significant. By embracing these cutting-edge technologies, you can focus on refining your message and delivering high-quality content to your audience more efficiently than ever before. These technologies also open up new possibilities for multi-lingual support, helping content reach listeners worldwide.
As AI continues to advance, tools like ElevenLabs and LlamaIndex will undoubtedly play a crucial role in shaping the future of podcasting and content creation. By staying at the forefront of these developments and experimenting with innovative workflows, you can unlock new opportunities for growth and engagement in the ever-evolving world of digital media.
If you are unable to code or can't follow the example above, use the ElevenLabs web interface and a web-based RAG application like Parallel AI.