Technical overview
WordCut started as one question: how much of video editing can be automated without taking creative control away from the editor?
Upload
Drop in a video and say what you want. That one message becomes the starting point for the whole pipeline.
Intent parsing
This is where messy human language gets turned into a clean list of steps the system can actually execute.
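To make "a clean list of steps" concrete, here is a minimal sketch of validating a plan before anything runs. It assumes a language model has already returned the plan as JSON; the operation names, the `Step` type, and `validate_plan` are hypothetical illustrations, not WordCut's actual schema.

```python
import json
from dataclasses import dataclass, field

# Hypothetical set of operations the pipeline knows how to execute.
ALLOWED_OPS = {"transcribe", "cut", "reframe", "add_subtitles", "export"}


@dataclass
class Step:
    op: str
    params: dict = field(default_factory=dict)


def validate_plan(raw_json: str) -> list[Step]:
    """Parse model output into steps, rejecting anything not executable."""
    data = json.loads(raw_json)
    if not isinstance(data, list):
        raise ValueError("plan must be a JSON list of steps")

    steps = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"step {i}: expected an object")
        op = item.get("op")
        if op not in ALLOWED_OPS:
            raise ValueError(f"step {i}: unknown op {op!r}")
        steps.append(Step(op=op, params=item.get("params", {})))
    return steps
```

Rejecting unknown operations at this boundary means a misread request fails early, before any video work starts.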
Transcription
Once every word has a timestamp, the video becomes searchable, cuttable, and editable through language.
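A small sketch of why word-level timestamps make video searchable: given a transcript as a list of timed words, finding a phrase gives you a time range you can cut on. The `Word` type and `find_phrase` helper are assumptions for illustration; a real transcript would come from whichever speech-to-text model is in use.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def normalize(token: str) -> str:
    """Lowercase and strip surrounding punctuation so 'Started.' matches 'started'."""
    return token.lower().strip(".,!?;:\"'")


def find_phrase(words: list[Word], phrase: str) -> list[tuple[float, float]]:
    """Return (start, end) times for every exact occurrence of the phrase."""
    target = [normalize(t) for t in phrase.split()]
    if not target:
        return []
    tokens = [normalize(w.text) for w in words]
    n = len(target)
    hits = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            # The range runs from the first matched word's start to the last one's end.
            hits.append((words[i].start, words[i + n - 1].end))
    return hits
```

This only does exact matching; finding moments by meaning is a separate problem, handled in the next step.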
Semantic segmentation
This was one of the most interesting parts to build. The system finds moments based on meaning, not just silence or scene changes.
Video processing
This is where I realized the hard part was not the AI. It was getting FFmpeg to turn every planned edit into actual video.
Render & export
Everything the user built in the editor has to line up exactly with what comes out the other end as an MP4.
Frontend
This is where the user feels the product: chat, preview, timeline, subtitles, and controls. The challenge was making AI automation still feel editable and personal, not like a black box.
Backend orchestration
The backend acts like the conductor. It figures out which step runs next, keeps track of files, calls the AI models, and hands work off to the video tools at the right time.
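One way to sketch the "which step runs next" decision is as a dependency graph resolved with a topological sort. The step names and dependency map below are assumptions based on the pipeline stages described above, not the actual orchestration code.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each step lists the steps that must finish first.
DEPENDENCIES = {
    "intent_parsing": {"upload"},
    "transcription": {"upload"},
    "segmentation": {"intent_parsing", "transcription"},
    "video_processing": {"segmentation"},
    "render_export": {"video_processing"},
}


def run_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for step in run_order(DEPENDENCIES):
        print(step)
```

Steps with no dependency on each other, such as intent parsing and transcription here, could run in parallel; the sorter only guarantees that prerequisites come first.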
Video workers
This layer does the heavy lifting. FFmpeg and Remotion turn decisions into actual frames, clips, subtitles, and exports. Most of the production pain lived here.
The engineering challenges that aren't obvious from the outside
Video pipeline reliability
I thought the AI would be the hardest part. It wasn't. The hard part was making FFmpeg reliably cut, reframe, concatenate, and burn subtitles across formats, resolutions, codecs, and edge cases I didn't know existed.
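For a sense of what a single cut involves, here is a minimal sketch that builds an FFmpeg command for one clip. It assumes re-encoding with `libx264` and `aac` (which requires an FFmpeg build that includes libx264); re-encoding is used because stream copy can only start cleanly on keyframes. The function names are hypothetical, and this is not the production command.

```python
import subprocess


def build_cut_command(src: str, dst: str, start: float, end: float) -> list[str]:
    """Build an argv list that cuts [start, end) seconds out of src into dst."""
    if start < 0 or end <= start:
        raise ValueError("need 0 <= start < end")
    return [
        "ffmpeg",
        "-y",                           # overwrite the output file if it exists
        "-ss", f"{start:.3f}",          # seek in the input before decoding
        "-i", src,
        "-t", f"{end - start:.3f}",     # keep this many seconds after the seek
        "-c:v", "libx264",              # re-encode video for frame-accurate cuts
        "-c:a", "aac",
        dst,
    ]


def run_cut(src: str, dst: str, start: float, end: float) -> None:
    """Run the cut; raises CalledProcessError if FFmpeg exits with an error."""
    cmd = build_cut_command(src, dst, start, end)
    subprocess.run(cmd, check=True, capture_output=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with filenames that contain spaces or special characters.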
Timeline state coherence
The editor is constantly balancing multiple truths at once: what the user sees in the preview, what the timeline stores, what the backend has processed, and what will actually come out in the export.
Preview vs. final render parity
The canvas preview has to feel instant. The export has to be accurate. Matching those two worlds, an HTML5 canvas and FFmpeg's ASS subtitle engine, was harder than it looks because they handle text layout differently.
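Layout differences are hard to show in a few lines, but timing is a smaller piece of the same parity problem: a preview typically works in floating-point seconds, while ASS timestamps use `H:MM:SS.cc` with centisecond precision. The sketch below converts between the two; the helper names and the `Default` style are assumptions for illustration.

```python
def ass_timestamp(seconds: float) -> str:
    """Format seconds as an ASS timestamp (H:MM:SS.cc, centisecond precision)."""
    if seconds < 0:
        raise ValueError("timestamp must be non-negative")
    # Round once, in centiseconds, so carries propagate (59.999 s becomes 1:00.00).
    total_cs = round(seconds * 100)
    cs = total_cs % 100
    total_s = total_cs // 100
    s = total_s % 60
    m = (total_s // 60) % 60
    h = total_s // 3600
    return f"{h}:{m:02d}:{s:02d}.{cs:02d}"


def ass_dialogue(start: float, end: float, text: str, style: str = "Default") -> str:
    """Build one Dialogue line for the [Events] section of an ASS file."""
    # Fields: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
    return (
        f"Dialogue: 0,{ass_timestamp(start)},{ass_timestamp(end)},"
        f"{style},,0,0,0,,{text}"
    )
```

If the preview snaps its own cue times to the same centisecond grid, at least the moment a subtitle appears should agree between the canvas and the export.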
Infrastructure constraints
Rendering video is expensive. Remotion is powerful, but running Chromium inside a small container taught me a lot about memory limits, OOM kills, and what production deployment actually costs.
WordCut is not about replacing editors. It is about removing the repetitive parts so creators can spend more time on taste, pacing, and story. Building it made me realize that the future of creative tools is not just automation. It is giving people faster ways to stay in control.