@happyvertical/smrt-voice
TTS voice profiles with two creation modes (AI design or audio cloning), VoiceOutput with word-level timings for lip-sync, and audio sample validation.
Overview
smrt-voice manages voice profiles for AI-powered text-to-speech synthesis. A VoiceProfile is created in one of two mutually exclusive modes: AI design (from a natural language prompt) or
cloning (from audio samples). Generated TTS output carries word-level timings that drive lip-sync
in smrt-video.
Installation
npm install @happyvertical/smrt-voice
Quick Start
import { VoiceProfile, VoiceSample, VoiceOutput } from '@happyvertical/smrt-voice';
import { createAssetRuntime, ASSET_ROLES } from '@happyvertical/smrt-assets';

// db, storage, and wavBytes are assumed to be supplied by your application
const runtime = await createAssetRuntime({ db, storage });

// Mode 1: Voice design -- AI generates from prompt
const designed = new VoiceProfile({
  name: 'News Anchor',
  language: 'en-US',
  gender: 'male',
  designPrompt: 'Warm, authoritative male voice with clear enunciation',
  defaultSpeed: 1.0, // 0.5 - 2.0
  defaultPitch: 0, // -20 to 20 semitones
});
await designed.save();

// Mode 2: Voice cloning -- replicate from audio sample(s)
const sampleAsset = await runtime.storeSourceAsset(
  'sample.wav', wavBytes,
  { mimeType: 'audio/wav', role: ASSET_ROLES.source_document },
);
const cloned = new VoiceProfile({
  name: 'Custom Voice',
  language: 'en-US',
  sampleAssetId: sampleAsset.id,
});
await cloned.save();

// Add training samples (minimum 3 seconds, quality != low)
const sample = new VoiceSample({
  voiceProfileId: cloned.id,
  assetId: sampleAsset.id,
  duration: 5.2,
  transcription: 'Hello, this is a test recording for voice cloning.',
  quality: 'high',
  sampleRate: 48000,
  format: 'wav',
  isPrimary: true,
});
await sample.save();

// TTS output with word-level timing for lip-sync
const output = new VoiceOutput({
  voiceProfileId: designed.id,
  sourceText: 'Welcome to the evening news.',
  audioAssetId: 'asset-789', // placeholder ID of the stored audio asset
  duration: 2.8,
  wordTimings: [
    { word: 'Welcome', start: 0.0, end: 0.4 },
    { word: 'to', start: 0.4, end: 0.5 },
    { word: 'the', start: 0.5, end: 0.6 },
    { word: 'evening', start: 0.6, end: 1.0 },
    { word: 'news', start: 1.0, end: 1.3 },
  ],
});
output.getWordAtTime(0.7); // { word: 'evening', start: 0.6, end: 1.0 }
Core Models
VoiceProfile
class VoiceProfile extends SmrtObject {
  name: string
  language: string
  gender: 'male' | 'female' | 'neutral'
  designPrompt?: string // AI voice design (mutually exclusive with sampleAssetId)
  sampleAssetId?: string // Cloned from audio (mutually exclusive with designPrompt)
  defaultSpeed: number // 0.5 - 2.0
  defaultPitch: number // -20 to 20 semitones
  voiceData?: Record<string, any> // Provider-specific (opaque, no schema)
  status: 'pending' | 'processing' | 'ready' | 'failed'
  provider: string // TTS provider, defaults to 'qwen3-tts'
  errorMessage?: string // Populated when status === 'failed'
  get isCloned(): boolean
  get isDesigned(): boolean
  get isReady(): boolean
}
The provider field defaults to 'qwen3-tts'. It is a plain string tag
recording which TTS provider created the voice; there is no provider abstraction layer.
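Because designPrompt and sampleAssetId are mutually exclusive, it can help to check which mode a set of fields describes before constructing a profile. The following is a minimal sketch, not the package's own logic: ProfileModeFields and resolveProfileMode are hypothetical names, and treating an empty or whitespace-only string as "not set" is an assumption.

```typescript
// Hypothetical helper, not part of @happyvertical/smrt-voice.
// It mirrors only the two mode fields documented on VoiceProfile.
interface ProfileModeFields {
  designPrompt?: string;
  sampleAssetId?: string;
}

type ProfileMode = 'designed' | 'cloned';

// Returns the creation mode, or throws when zero or both mode fields are set.
// Assumption: empty or whitespace-only strings count as "not set".
function resolveProfileMode(fields: ProfileModeFields): ProfileMode {
  const hasPrompt = Boolean(fields.designPrompt && fields.designPrompt.trim());
  const hasSample = Boolean(fields.sampleAssetId && fields.sampleAssetId.trim());

  if (hasPrompt === hasSample) {
    throw new Error(
      hasPrompt
        ? 'Set either designPrompt or sampleAssetId, not both'
        : 'Set one of designPrompt or sampleAssetId',
    );
  }
  return hasPrompt ? 'designed' : 'cloned';
}
```

Calling resolveProfileMode(fields) before new VoiceProfile(fields) surfaces a mixed-mode mistake early instead of leaving it to the provider.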
VoiceSample
class VoiceSample extends SmrtObject {
  voiceProfileId: string
  assetId: string // Audio asset stored via smrt-assets
  duration: number // Seconds
  transcription?: string
  quality: 'low' | 'medium' | 'high'
  sampleRate?: number
  format?: string
  isPrimary: boolean
  get meetsMinDuration(): boolean // >= 3 seconds
  get isSuitableForCloning(): boolean // >= 3 sec AND quality != low
}
VoiceOutput (extends Content)
class VoiceOutput extends Content {
  voiceProfileId: string
  sourceText: string
  audioAssetId: string
  duration: number
  wordTimings: WordTiming[] // [{ word, start, end }] in seconds
  audioMetadata?: VoiceOutputMetadata // sampleRate, format, channels, bitDepth, provider, model
  get wordCount(): number
  get wordsPerSecond(): number
  get hasWordTimings(): boolean
  getWordAtTime(seconds: number): WordTiming | null
}
Asset integration
Both training samples (sampleAssetId) and generated audio (audioAssetId) reference Asset rows stored via smrt-assets. Use createAssetRuntime() + storeSourceAsset() / storeDerivedAsset() to write the bytes, and rely on serveAsset() when delivering audio to clients.
Best Practices
DOs
- Use designPrompt XOR sampleAssetId (mutually exclusive modes)
- Check isSuitableForCloning before using samples (3+ sec, not low quality)
- Use getWordAtTime() for precise lip-sync alignment in smrt-video
- Check isReady before using a profile for TTS generation
- Set tenantId: null for global / default voice profiles
- Store audio bytes via AssetRuntime.storeSourceAsset() rather than ad-hoc writes
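To illustrate how a word-level lookup can drive lip-sync, here is a minimal standalone sketch. It is not the package's getWordAtTime() implementation: findWordAtTime and wordForFrame are hypothetical names, and the half-open interval [start, end) and the frame-rate mapping are assumptions.

```typescript
// Shape taken from the wordTimings entries shown in this README.
interface WordTiming {
  word: string;
  start: number; // seconds
  end: number; // seconds
}

// Hypothetical stand-in for VoiceOutput.getWordAtTime().
// Assumption: a word covers the half-open interval [start, end).
function findWordAtTime(timings: WordTiming[], seconds: number): WordTiming | null {
  for (const timing of timings) {
    if (seconds >= timing.start && seconds < timing.end) return timing;
  }
  return null; // silence, or a time outside every word
}

// Hypothetical helper: maps a video frame index to the word spoken at that frame.
function wordForFrame(timings: WordTiming[], frameIndex: number, fps: number): WordTiming | null {
  return findWordAtTime(timings, frameIndex / fps);
}
```

With the Quick Start timings, a lookup at 0.7 seconds falls inside "evening", while any time after 1.3 seconds returns null because no word covers it.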
DON'Ts
- Don't set both designPrompt and sampleAssetId on the same profile
- Don't expect the framework to generate wordTimings; they come from the TTS provider
- Don't rely on the 3-second minimum being enforced in the constructor (documented only)
- Don't assume status transitions are enforced (manual status setting is possible)
- Don't depend on a specific voiceData schema (provider-specific, opaque)
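Since the 3-second minimum is documented rather than enforced in the constructor, callers may want an explicit guard before saving a sample. The sketch below restates the documented rule (>= 3 seconds and quality other than low) as standalone functions. SampleInfo, MIN_CLONING_SECONDS, and assertSuitableForCloning are hypothetical names, and the standalone isSuitableForCloning here is an illustration, not the package's getter.

```typescript
// Mirrors the documented VoiceSample fields used by the suitability rule.
interface SampleInfo {
  duration: number; // seconds
  quality: 'low' | 'medium' | 'high';
}

// Minimum duration stated in this README.
const MIN_CLONING_SECONDS = 3;

// Restates the documented rule: >= 3 sec AND quality != low.
function isSuitableForCloning(sample: SampleInfo): boolean {
  return sample.duration >= MIN_CLONING_SECONDS && sample.quality !== 'low';
}

// Hypothetical guard to call before saving a sample.
function assertSuitableForCloning(sample: SampleInfo): void {
  if (!isSuitableForCloning(sample)) {
    throw new Error(
      `Sample rejected: duration=${sample.duration}s, quality=${sample.quality}`,
    );
  }
}
```

Running the guard on the fields passed to new VoiceSample(...) rejects short or low-quality recordings before they are attached to a cloned profile.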