Whisper ASR with Pyannote Speaker Diarization

Upload an audio file to get a transcript with speaker labels. This demo uses openai/whisper-small for automatic speech recognition (ASR) and pyannote/speaker-diarization-3.1 for speaker diarization. A Hugging Face token with access to pyannote/speaker-diarization-3.1 is required; set it as the HF_TOKEN environment variable before launching (see the script comments).
Note: For long audio files or many concurrent users, consider running on a GPU and using a larger model such as whisper-large-v3.
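
The token requirement above can be checked at startup so the demo fails early with a readable message. The following is a minimal sketch, not the demo's actual code: the helper name get_hf_token is hypothetical, and the commented pyannote call assumes the 3.1-era keyword use_auth_token (newer pyannote.audio releases may name it differently).

```python
import os


def get_hf_token(env=None):
    """Return the Hugging Face token stored in HF_TOKEN, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    token = (env.get("HF_TOKEN") or "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Export a Hugging Face token that has access to "
            "pyannote/speaker-diarization-3.1 before launching the demo."
        )
    return token


# Assumed usage (requires pyannote.audio; not executed here):
#   from pyannote.audio import Pipeline
#   diarizer = Pipeline.from_pretrained(
#       "pyannote/speaker-diarization-3.1", use_auth_token=get_hf_token()
#   )
```

On a Unix-like shell the variable would typically be set with `export HF_TOKEN=...` before starting the script.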

Upload Audio File (WAV, MP3, FLAC, etc.)

ASR Batch Size

Range: 1 to 32

ASR Chunk Length (seconds)

Range: 1 to 30

ASR Language
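
The three ASR settings above could be translated into call-time arguments for a Hugging Face transformers automatic-speech-recognition pipeline roughly as follows. This is a sketch under assumptions: build_asr_kwargs is a hypothetical helper, the slider limits (1 to 32 and 1 to 30) are taken from the controls shown here, and how the demo actually passes the language to Whisper is not shown in the page.

```python
def build_asr_kwargs(batch_size, chunk_length_s, language=None):
    """Validate the UI values and build keyword arguments for the ASR call.

    batch_size:     number of audio chunks decoded together (1 to 32)
    chunk_length_s: length of each audio chunk in seconds (1 to 30)
    language:       optional language name or code; None lets Whisper detect it
    """
    if not 1 <= batch_size <= 32:
        raise ValueError("batch_size must be between 1 and 32")
    if not 1 <= chunk_length_s <= 30:
        raise ValueError("chunk_length_s must be between 1 and 30")

    kwargs = {
        "batch_size": int(batch_size),
        "chunk_length_s": float(chunk_length_s),
        # Timestamps are needed later to line chunks up with speaker turns.
        "return_timestamps": True,
    }
    if language:
        kwargs["generate_kwargs"] = {"language": language}
    return kwargs


# Assumed usage (requires transformers; not executed here):
#   from transformers import pipeline
#   asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
#   result = asr("audio.wav", **build_asr_kwargs(8, 30, "english"))
```

Larger batch sizes mainly help on a GPU; on CPU a small batch size is usually the safer choice.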

Diarization: Number of Speakers (optional)

Expected total number of speakers (positive integer, or leave empty for auto-detect).

Diarization: Min Speakers (optional)

Minimum number of speakers to detect (positive integer, or leave empty for auto-detect).

Diarization: Max Speakers (optional)

Maximum number of speakers to detect (positive integer, or leave empty for auto-detect).
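
The three optional speaker fields above map naturally onto the num_speakers, min_speakers, and max_speakers keyword arguments that the pyannote diarization pipeline accepts. The sketch below shows one way to turn the form values into those arguments; the helper names are hypothetical, and the rule that an empty field is simply omitted (so pyannote auto-detects) is an assumption about how the demo behaves.

```python
def parse_optional_positive_int(value, name):
    """Return None for an empty field, otherwise a validated positive integer."""
    if value is None or str(value).strip() == "":
        return None
    number = float(str(value).strip())  # raises ValueError for non-numeric text
    if not number.is_integer() or number < 1:
        raise ValueError(f"{name} must be a positive integer")
    return int(number)


def build_diarization_kwargs(num_speakers="", min_speakers="", max_speakers=""):
    """Build keyword arguments for the diarization call, skipping empty fields."""
    parsed = {
        "num_speakers": parse_optional_positive_int(num_speakers, "num_speakers"),
        "min_speakers": parse_optional_positive_int(min_speakers, "min_speakers"),
        "max_speakers": parse_optional_positive_int(max_speakers, "max_speakers"),
    }
    lo, hi = parsed["min_speakers"], parsed["max_speakers"]
    if lo is not None and hi is not None and lo > hi:
        raise ValueError("min_speakers must not exceed max_speakers")
    return {key: val for key, val in parsed.items() if val is not None}


# Assumed usage (requires pyannote.audio; not executed here):
#   diarization = diarizer("audio.wav", **build_diarization_kwargs("", "1", "4"))
```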

Diarized Transcript
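
The page does not show how the diarized transcript is assembled. One common approach, sketched here as an assumption rather than the demo's actual method, is to give each timestamped ASR chunk the speaker whose turn overlaps it the most and then merge consecutive chunks from the same speaker. The function name assign_speakers and the tuple layouts are hypothetical.

```python
def assign_speakers(chunks, turns):
    """Label ASR chunks with speakers by maximum time overlap.

    chunks: list of (start_s, end_s, text) from the ASR step
    turns:  list of (start_s, end_s, speaker) from the diarization step
    Returns a list of (speaker, text) with consecutive same-speaker chunks merged.
    """
    lines = []
    for c_start, c_end, text in chunks:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in turns:
            # Length of the intersection of the two intervals (negative if disjoint).
            overlap = min(c_end, t_end) - max(c_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        text = text.strip()
        if lines and lines[-1][0] == best_speaker:
            # Same speaker as the previous chunk: extend the current line.
            lines[-1] = (best_speaker, lines[-1][1] + " " + text)
        else:
            lines.append((best_speaker, text))
    return lines


def format_transcript(lines):
    """Render (speaker, text) pairs as one 'SPEAKER: text' line each."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in lines)
```

Chunks that overlap no speaker turn are labeled UNKNOWN in this sketch instead of being dropped, so no transcribed text is lost.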

Full ASR Transcript

Status Message
