Speech-to-text APIs have a size limit. Ours is 24MB. Our users upload files up to 5GB.
This is the story of how we built the pipeline that bridges that gap, what broke along the way, and the fallback layers we added to make it resilient. It sounds straightforward. It was not.
Why we needed a splitting pipeline
The approach is conceptually simple. Take a large audio file, split it into 10-minute segments, send each segment to the transcription API, and merge the results in chronological order.
FFmpeg handles the splitting. It extracts segments without re-encoding when possible, which means a 2-hour file can be chunked in seconds rather than minutes. Each chunk is transcribed in parallel, and the results are reassembled by index.
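The split step can be sketched as building an FFmpeg segment-muxer command plus a chunk-count calculation. This is an illustrative sketch, not our exact code: the function names and the output pattern are made up, and the 10-minute segment length matches the description above.

```python
import math

SEGMENT_SECONDS = 600  # 10-minute chunks, per the pipeline design above


def build_split_command(input_path: str, out_pattern: str) -> list[str]:
    """Build an ffmpeg command that splits audio into fixed-length
    segments via stream copy, i.e. without re-encoding."""
    return [
        "ffmpeg", "-i", input_path,
        "-f", "segment",
        "-segment_time", str(SEGMENT_SECONDS),
        "-c", "copy",              # stream copy: no decode/encode pass
        "-reset_timestamps", "1",  # each chunk starts at t=0
        out_pattern,               # e.g. "chunk_%03d.mp3"
    ]


def chunk_count(duration_seconds: float) -> int:
    """How many chunks a file of the given duration produces."""
    return math.ceil(duration_seconds / SEGMENT_SECONDS)
```

A 2-hour file (7,200 seconds) yields 12 chunks, which matches the parallel-processing numbers later in this post.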
The basic pipeline worked on day one. Everything that follows is what we learned once real users started uploading real files.
Browser recordings broke the first assumption
The first production issue arrived within days of launch. Users recording audio directly in the browser produce WebM files with incomplete container headers. These files play fine in a browser, but when FFmpeg tries to read their metadata, it returns nothing. No duration, no codec information.
Without duration metadata, we cannot determine how many chunks to create. The pipeline failed silently.
We solved this with three fallback layers, each catching what the previous one misses:
Extended analysis. When the standard probe fails, we retry with parameters that force FFmpeg to scan more data before giving up. This catches most browser-recorded files.
File-size estimation. When even the extended probe fails, we estimate duration from file size using format-specific bitrate approximations. A WebM file at roughly 96kbps Opus encoding means approximately 12,000 bytes per second. These estimates are rough, but they do not need to be precise. We are estimating duration to decide how many chunks to create, not to display timestamps.
Format conversion. For particularly stubborn WebM files, we convert to MP3 first using error-tolerant input options. The conversion produces a container that FFmpeg can read reliably.
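The layered fallback can be sketched as a chain of probe functions with the size-based estimate as the floor. The bitrate table and function names here are illustrative assumptions; only the 96 kbps Opus ≈ 12,000 bytes/second figure comes from the text above.

```python
import os

# Assumed bytes-per-second approximations by container; 96 kbps Opus
# works out to 12,000 bytes per second (96,000 bits / 8).
BYTES_PER_SECOND = {".webm": 12_000, ".mp3": 16_000}


def estimate_duration_from_size(size_bytes: int, extension: str) -> float:
    """Last-resort estimate: file size divided by an assumed bitrate."""
    bytes_per_sec = BYTES_PER_SECOND.get(extension, 12_000)
    return size_bytes / bytes_per_sec


def probe_duration(path: str, size_bytes: int, probes) -> float:
    """Try each probe (standard, then extended) in order; fall back to
    the size-based estimate only when every probe returns None."""
    for probe in probes:
        duration = probe(path)
        if duration is not None:
            return duration
    extension = os.path.splitext(path)[1].lower()
    return estimate_duration_from_size(size_bytes, extension)
```

The estimate only has to be good enough to pick a chunk count, so an error of even 20% rarely changes the outcome by more than one chunk.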
What this changed: Instead of failing on ~15% of browser recordings, the pipeline now handles every format we have encountered. The layered fallback pattern became a template we reuse elsewhere: try the fast path, fall back to the tolerant path, convert as a last resort.
Two merge strategies for different codecs
After transcription, we need to merge chunks back into a single audio file for playback. We discovered that different codecs require different merge approaches.
Fast path: concat demuxer. For MP3 and formats where chunks can be concatenated without re-encoding, FFmpeg's concat demuxer joins them in a single pass. With no transcoding, the merge is bounded by disk I/O rather than decode-encode speed, so it finishes in seconds even for multi-hour files.
Slow path: concat filter. For WebM with Opus codec, stream-level concatenation requires re-encoding. The concat filter processes each chunk through the decode-encode pipeline. Slower, but produces a valid output file.
The pipeline detects the input format and selects the appropriate strategy automatically. We initially tried to use the fast path for everything and hit corrupted output files that were difficult to diagnose. Making the codec detection explicit eliminated an entire class of subtle bugs.
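Making that selection explicit can be sketched as a function that builds one of two FFmpeg command arrays depending on the codec. The codec check and file names are simplified assumptions; the concat demuxer and concat filter invocations follow standard FFmpeg usage.

```python
def build_merge_command(codec: str, chunks: list[str],
                        list_file: str, output: str) -> list[str]:
    """Pick the fast path (concat demuxer, stream copy) or the slow
    path (concat filter, re-encode) based on the detected codec."""
    if codec != "opus":
        # Fast path: concatenate via the concat demuxer, no transcoding.
        # list_file contains one "file 'chunk_000.mp3'" line per chunk.
        return ["ffmpeg", "-f", "concat", "-safe", "0",
                "-i", list_file, "-c", "copy", output]

    # Slow path: the concat filter decodes and re-encodes every chunk.
    cmd = ["ffmpeg"]
    for chunk in chunks:
        cmd += ["-i", chunk]
    inputs = "".join(f"[{i}:a]" for i in range(len(chunks)))
    cmd += ["-filter_complex",
            f"{inputs}concat=n={len(chunks)}:v=0:a=1[out]",
            "-map", "[out]", output]
    return cmd
```

Encoding the choice in one place is what eliminated the corrupted-output bugs: the fast path can no longer be applied to a codec that needs re-encoding.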
Parallel processing with ordered reassembly
Each chunk is transcribed by the speech-to-text API independently. We process all chunks in parallel to minimize wall time. A 2-hour file with 12 chunks, each taking about 30 seconds to transcribe, finishes in roughly 30 seconds instead of 6 minutes.
But the results must come back in order. We assign each chunk an index at split time and sort by that index after all parallel jobs complete.
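The fan-out/fan-in pattern can be sketched in a few lines. The worker count and the `transcribe` callable are stand-ins here; the real call goes to the speech-to-text API.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_all(chunks: list[str], transcribe) -> list[str]:
    """Transcribe chunks in parallel, then reassemble by the index
    assigned at split time so the transcript stays chronological."""
    with ThreadPoolExecutor(max_workers=12) as pool:  # assumed concurrency limit
        futures = {pool.submit(transcribe, chunk): index
                   for index, chunk in enumerate(chunks)}
        results = [(futures[future], future.result()) for future in futures]
    results.sort(key=lambda pair: pair[0])  # restore chronological order
    return [text for _, text in results]
```

Completion order is nondeterministic, but because each result carries its split-time index, the sort makes the output deterministic regardless of which chunk finishes first.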
Speaker diarization adds complexity. If Speaker A is talking at the end of chunk 3 and the beginning of chunk 4, the same speaker needs to be recognized across the boundary. The speech-to-text API handles per-chunk diarization, and we reconcile speaker labels during merge.
What this changed: Processing time scales with chunk duration, not file duration. An 8-hour recording with 48 chunks takes about 2 minutes, because all chunks process concurrently. The limiting factor is API concurrency, not local processing.
Preventing command injection
Audio processing tools are command-line programs. When user input touches a command-line argument, command injection is possible. A file named "recording; rm -rf /tmp/*.mp3" interpolated directly into a shell command would be catastrophic.
We found a real command injection vulnerability during our pre-launch security audit. It was rated CVSS 9.8.
Our path sanitization resolves to an absolute path, rejects shell metacharacters, normalizes against path traversal, and validates the path is within the expected temporary directory. We also switched from string interpolation to array-based command arguments where possible, which avoids shell interpretation entirely.
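Those checks can be sketched as a single validation function. The metacharacter list and the temp-directory root below are assumptions for illustration, not our exact rules.

```python
import re
from pathlib import Path

# Characters with shell meaning; an assumed (non-exhaustive) blocklist.
SHELL_METACHARACTERS = re.compile(r"[;&|`$<>\n]")


def sanitize_path(user_path: str, allowed_root: str = "/tmp/transcribe") -> str:
    """Reject shell metacharacters, resolve to an absolute path
    (normalizing ../ traversal), and confine to the temp directory."""
    if SHELL_METACHARACTERS.search(user_path):
        raise ValueError("shell metacharacters in path")
    root = Path(allowed_root).resolve()
    resolved = Path(allowed_root, user_path).resolve()
    if root not in resolved.parents and resolved != root:
        raise ValueError("path escapes the allowed directory")
    return str(resolved)
```

Even with sanitization, passing arguments as an array (e.g. `subprocess.run([...])` without `shell=True`) means the filename is never parsed by a shell at all, which is the stronger guarantee.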
What this changed: This vulnerability shaped how we think about any code path where user input reaches a system boundary. We now treat file paths with the same suspicion as SQL queries.
Automated cleanup at three levels
The pipeline creates temporary files. Chunks live on disk during processing. Converted files accumulate. If a job fails midway, orphaned files remain. Without cleanup, storage grows indefinitely.
We handle this at three levels:
Job-level cleanup. After a transcription completes (success or failure), the processor deletes all temporary chunk files and intermediate conversion products.
Hourly zombie sweep. A cron job scans for transcriptions stuck in processing for more than 24 hours, marks them as failed, and deletes their orphaned files from storage.
Retention-based cleanup. A separate cron handles long-term retention, deleting audio files older than the configured window.
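The zombie sweep reduces to a query over job records. This sketch assumes a simple tuple shape for jobs; the real implementation reads from our database, but the 24-hour threshold is as described above.

```python
from datetime import datetime, timedelta, timezone

STUCK_AFTER = timedelta(hours=24)


def find_zombies(jobs, now=None):
    """Return IDs of jobs stuck in 'processing' beyond the threshold.
    `jobs` is assumed to be (id, status, started_at) tuples."""
    now = now or datetime.now(timezone.utc)
    return [job_id for job_id, status, started_at in jobs
            if status == "processing" and now - started_at > STUCK_AFTER]
```

The sweep then marks each returned job as failed and deletes its orphaned files, so no transcription stays in "processing" forever.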
What this changed: Storage costs are predictable and bounded. More importantly, users never see transcriptions stuck in a permanent "processing" state. The system acknowledges failure and lets them retry.
What this pipeline taught us about resilience
The pipeline is not elegant. It is a collection of fallbacks, retries, and edge case handlers accumulated over months of production use. But it processes thousands of files reliably, handles every audio format browsers and mobile devices produce, and recovers gracefully from every failure mode we have encountered.
Fallback layers beat optimistic assumptions. Our first version assumed clean metadata, consistent codecs, and reliable API calls. Production taught us that every assumption needs a fallback. The three-layer probe (standard → extended → file-size estimate) is the clearest example.
Parallel processing is worth the complexity. The ordered reassembly adds code and potential for bugs. But cutting an 8-hour file from roughly 24 minutes of sequential processing to about 2 minutes is the kind of improvement users notice immediately.
Cleanup is infrastructure, not housekeeping. Without automated cleanup at multiple levels, the system would accumulate corrupted state and orphaned files. Self-healing processes are as important as the happy-path pipeline.
Sometimes resilient beats elegant.