@happyvertical/smrt-voice

TTS voice profiles with two creation modes (AI design or audio cloning), VoiceOutput with word-level timings for lip-sync, and audio sample validation.

v0.29.34 · Voice Design · Cloning · Lip-Sync Timings

Overview

smrt-voice manages voice profiles for AI-powered text-to-speech synthesis. A VoiceProfile is created in one of two mutually exclusive modes — AI design (from a natural language prompt) or cloning (from audio samples) — and generated TTS output carries word-level timings that feed lip-sync into smrt-video.

Installation

bash
npm install @happyvertical/smrt-voice

Quick Start

typescript
import { VoiceProfile, VoiceSample, VoiceOutput } from '@happyvertical/smrt-voice';
import { createAssetRuntime, ASSET_ROLES } from '@happyvertical/smrt-assets';

const runtime = await createAssetRuntime({ db, storage });

// Mode 1: Voice design -- AI generates from prompt
const designed = new VoiceProfile({
  name: 'News Anchor',
  language: 'en-US',
  gender: 'male',
  designPrompt: 'Warm, authoritative male voice with clear enunciation',
  defaultSpeed: 1.0,   // 0.5 - 2.0
  defaultPitch: 0,     // -20 to 20 semitones
});
await designed.save();

// Mode 2: Voice cloning -- replicate from audio sample(s)
const sampleAsset = await runtime.storeSourceAsset(
  'sample.wav', wavBytes,
  { mimeType: 'audio/wav', role: ASSET_ROLES.source_document },
);

const cloned = new VoiceProfile({
  name: 'Custom Voice',
  language: 'en-US',
  sampleAssetId: sampleAsset.id,
});
await cloned.save();

// Add a training sample (cloning expects >= 3 seconds and quality other than 'low';
// neither is enforced by the constructor)
const sample = new VoiceSample({
  voiceProfileId: cloned.id,
  assetId: sampleAsset.id,
  duration: 5.2,
  transcription: 'Hello, this is a test recording for voice cloning.',
  quality: 'high',
  sampleRate: 48000,
  format: 'wav',
  isPrimary: true,
});
await sample.save();

// TTS output with word-level timing for lip-sync
const output = new VoiceOutput({
  voiceProfileId: designed.id,
  sourceText: 'Welcome to the evening news.',
  audioAssetId: 'asset-789',   // placeholder ID of the generated audio asset in smrt-assets
  duration: 2.8,
  wordTimings: [
    { word: 'Welcome', start: 0.0, end: 0.4 },
    { word: 'to',      start: 0.4, end: 0.5 },
    { word: 'the',     start: 0.5, end: 0.6 },
    { word: 'evening', start: 0.6, end: 1.0 },
    { word: 'news',    start: 1.0, end: 1.3 },
  ],
});
output.getWordAtTime(0.7); // { word: 'evening', start: 0.6, end: 1.0 }

Core Models

VoiceProfile

typescript
class VoiceProfile extends SmrtObject {
  name: string
  language: string
  gender: 'male' | 'female' | 'neutral'
  designPrompt?: string       // AI voice design (mutually exclusive with sampleAssetId)
  sampleAssetId?: string      // Cloned from audio (mutually exclusive with designPrompt)
  defaultSpeed: number        // 0.5 - 2.0
  defaultPitch: number        // -20 to 20 semitones
  voiceData?: Record<string, any>  // Provider-specific (opaque, no schema)
  status: 'pending' | 'processing' | 'ready' | 'failed'
  provider: string            // TTS provider, defaults to 'qwen3-tts'
  errorMessage?: string       // Populated when status === 'failed'

  get isCloned(): boolean
  get isDesigned(): boolean
  get isReady(): boolean
}

The provider field defaults to 'qwen3-tts'. It is a plain string tag recording which TTS provider created the voice — there is no provider abstraction layer.
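
Because the two creation modes are mutually exclusive, it can help to check the fields before constructing a profile. The following is a minimal sketch under stated assumptions: `VoiceProfileFields` is a local stand-in for the documented fields, and `resolveCreationMode` is a hypothetical helper, not part of the smrt-voice API.

```typescript
// Local stand-in for the two documented mode fields; not imported from the package.
interface VoiceProfileFields {
  designPrompt?: string;
  sampleAssetId?: string;
}

type CreationMode = 'designed' | 'cloned';

// Hypothetical helper: returns the creation mode, or throws when the fields
// do not describe exactly one mode (both set, or neither set).
function resolveCreationMode(fields: VoiceProfileFields): CreationMode {
  const hasPrompt = Boolean(fields.designPrompt?.trim());
  const hasSample = Boolean(fields.sampleAssetId?.trim());

  if (hasPrompt === hasSample) {
    throw new Error(
      hasPrompt
        ? 'Set either designPrompt or sampleAssetId, not both'
        : 'Set one of designPrompt or sampleAssetId',
    );
  }
  return hasPrompt ? 'designed' : 'cloned';
}
```

Calling such a helper before `new VoiceProfile({...})` surfaces a mixed-mode mistake early instead of leaving it to the provider.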

VoiceSample

typescript
class VoiceSample extends SmrtObject {
  voiceProfileId: string
  assetId: string             // Audio asset stored via smrt-assets
  duration: number            // Seconds
  transcription?: string
  quality: 'low' | 'medium' | 'high'
  sampleRate?: number
  format?: string
  isPrimary: boolean

  get meetsMinDuration(): boolean      // >= 3 seconds
  get isSuitableForCloning(): boolean  // >= 3 sec AND quality != low
}
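
The suitability rule in the comments above (at least 3 seconds and quality other than low) can be applied to a batch of samples before cloning. This sketch re-implements the rule on a local `SampleFields` type so that it runs on its own. `selectCloningSamples` and the primary-first ordering are illustrative assumptions, not package behavior.

```typescript
type SampleQuality = 'low' | 'medium' | 'high';

// Local stand-in mirroring the documented VoiceSample fields used here.
interface SampleFields {
  assetId: string;
  duration: number; // seconds
  quality: SampleQuality;
  isPrimary: boolean;
}

const MIN_CLONING_SECONDS = 3;

// Same rule as the documented getter: >= 3 seconds AND quality != 'low'.
function isSuitableForCloning(sample: SampleFields): boolean {
  return sample.duration >= MIN_CLONING_SECONDS && sample.quality !== 'low';
}

// Hypothetical helper: keeps usable samples and lists the primary one first.
function selectCloningSamples(samples: SampleFields[]): SampleFields[] {
  return samples
    .filter(isSuitableForCloning)
    .sort((a, b) => Number(b.isPrimary) - Number(a.isPrimary));
}
```

With real `VoiceSample` instances, the documented `isSuitableForCloning` getter replaces the local function.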

VoiceOutput (extends Content)

typescript
class VoiceOutput extends Content {
  voiceProfileId: string
  sourceText: string
  audioAssetId: string
  duration: number
  wordTimings: WordTiming[]   // [{ word, start, end }] in seconds
  audioMetadata?: VoiceOutputMetadata  // sampleRate, format, channels, bitDepth, provider, model

  get wordCount(): number
  get wordsPerSecond(): number
  get hasWordTimings(): boolean
  getWordAtTime(seconds: number): WordTiming | null
}
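
To illustrate how a lookup such as `getWordAtTime()` can drive lip-sync, here is a standalone sketch over the `{ word, start, end }` shape. `findWordAtTime` is a hypothetical function. Treating each interval as half-open (`start` inclusive, `end` exclusive) is an assumption, since the package's boundary behavior is not documented here.

```typescript
interface WordTiming {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

// Returns the word being spoken at `seconds`, or null during silence.
// Assumption: intervals are half-open, [start, end).
function findWordAtTime(timings: WordTiming[], seconds: number): WordTiming | null {
  return timings.find((t) => seconds >= t.start && seconds < t.end) ?? null;
}

const timings: WordTiming[] = [
  { word: 'Welcome', start: 0.0, end: 0.4 },
  { word: 'to',      start: 0.4, end: 0.5 },
  { word: 'the',     start: 0.5, end: 0.6 },
  { word: 'evening', start: 0.6, end: 1.0 },
  { word: 'news',    start: 1.0, end: 1.3 },
];

const current = findWordAtTime(timings, 0.7);   // the 'evening' entry
const afterEnd = findWordAtTime(timings, 2.0);  // null: no word is spoken at 2.0 s
```

A renderer can call the lookup once per video frame and keep the mouth closed whenever the result is null.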

Asset integration

Both training samples (sampleAssetId on VoiceProfile, assetId on VoiceSample) and generated audio (audioAssetId on VoiceOutput) reference Asset rows stored via smrt-assets. Use createAssetRuntime() with storeSourceAsset() or storeDerivedAsset() to write the bytes, and rely on serveAsset() when delivering audio to clients.

Best Practices

DOs

  • Use designPrompt XOR sampleAssetId (mutually exclusive modes)
  • Check isSuitableForCloning before using samples (3+ sec, not low quality)
  • Use getWordAtTime() for precise lip-sync alignment in smrt-video
  • Check isReady before using a profile for TTS generation
  • Set tenantId: null for global / default voice profiles
  • Store audio bytes via AssetRuntime.storeSourceAsset() rather than ad-hoc writes

DON'Ts

  • Don't set both designPrompt and sampleAssetId on the same profile
  • Don't expect the framework to generate wordTimings; they come from the TTS provider
  • Don't rely on the constructor to enforce the 3-second minimum (it is documented only); check meetsMinDuration or isSuitableForCloning yourself
  • Don't assume status transitions are enforced (manual status setting is possible)
  • Don't depend on a specific voiceData schema (provider-specific, opaque)
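
Since status transitions are not enforced, callers that want a guard have to add their own. The sketch below is one possible shape. The allowed-transition table (pending to processing, processing to ready or failed, failed back to pending for a retry) is an assumption about a sensible lifecycle, not documented package behavior. `canTransition` and `nextStatus` are hypothetical names.

```typescript
type VoiceStatus = 'pending' | 'processing' | 'ready' | 'failed';

// Assumed lifecycle; adjust to match how your provider integration behaves.
const ALLOWED_TRANSITIONS: Record<VoiceStatus, readonly VoiceStatus[]> = {
  pending: ['processing'],
  processing: ['ready', 'failed'],
  ready: [],
  failed: ['pending'], // assumed: a failed profile may be retried
};

function canTransition(from: VoiceStatus, to: VoiceStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}

// Returns the new status, or throws when the move is not in the table.
function nextStatus(from: VoiceStatus, to: VoiceStatus): VoiceStatus {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal status transition: ${from} -> ${to}`);
  }
  return to;
}
```

Routing every status write through a function like `nextStatus` keeps a profile from jumping straight from pending to ready by accident.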

Related Modules