Overview

DeepFocus YT Video Pipeline

Automated pipeline for producing 2-4 hour ambient "window view" YouTube videos. Drop clips in a folder, run one command, get a complete video with looped scenery, mixed audio, color grading, thumbnail, and metadata.

What It Produces

Long-form ambient videos for focus, relaxation, and study. Each video features a seamless looping window view (beach, mountain, city, rainforest) with layered ambient audio and gentle color grading.

Target: 2-4 hours per video, upload-ready for YouTube.

Tech Stack

  • Python — orchestration + scene config
  • FFmpeg — video processing, looping, grading
  • Freesound API — ambient audio sourcing
  • YouTube Data API v3 — metadata + upload

9 Scene Configurations

beach_sunrise · beach_sunset · mountain_cabin · mountain_snow · alpine_meadow · tokyo_rain · london_rain · nyc_skyline · rainforest_morning
💡 Clip Intake Model: No AI-generated clips needed. You generate or source 6-10 video clips per scene externally, drop them into .tmp/clips/{scene_id}/, and the pipeline handles the rest.
Architecture

Pipeline Architecture

End-to-end flow from raw clips to YouTube-ready video. Each stage is a discrete, testable step.

Full Pipeline Flow

  1. Clip Intake — 6-10 clips per scene
  2. Scene Config — load params
  3. Audio Fetch — Freesound API
  4. Loop Build — FFmpeg xfade
  5. Color Grade — LUT + curves
  6. Audio Mix — layer + normalize
  7. Final Render — H.264 encode
  8. Thumbnail — auto-generate
  9. Upload — YouTube Data API v3
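Each stage being "a discrete, testable step" suggests a simple pattern: plain functions chained by a small orchestrator. A minimal sketch (all names are illustrative, not the real pipeline.py API; only two stub stages are shown):

```python
# Sketch: each stage takes and returns a context dict, so stages can be
# run and tested in isolation or chained end to end.

def intake(ctx):
    # Clips are processed in alphabetical order (see naming convention).
    ctx["clips"] = sorted(ctx.get("clip_dir_files", []))
    return ctx

def load_config(ctx):
    # Stub: the real loader would read scenes/{scene}.json.
    ctx["config"] = {"scene_id": ctx["scene"], "xfade_duration": 2.0}
    return ctx

# ... audio_fetch, loop_build, color_grade, audio_mix, render, etc.
STAGES = [intake, load_config]

def run_pipeline(scene, clip_files):
    ctx = {"scene": scene, "clip_dir_files": clip_files}
    for stage in STAGES:
        ctx = stage(ctx)
    return ctx
```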

Key Files

File Structure
  • pipeline.py — Main orchestrator; runs all stages
  • scene_config.py — Scene definitions and parameters
  • audio_mixer.py — Freesound fetch + layered mixing
  • loop_builder.py — Seamless loop via FFmpeg xfade
  • color_grader.py — Per-scene LUT and curve adjustments
  • thumbnail_gen.py — Auto thumbnail from best frame
  • uploader.py — YouTube Data API v3 upload

Directory Layout

# Working directories
.tmp/
  clips/
    beach_sunrise/   # 6-10 source clips
    mountain_cabin/
    tokyo_rain/
    ...
  audio/             # Downloaded ambient
  renders/           # Final outputs
  thumbnails/        # Generated thumbs
scenes/              # Scene config JSONs
luts/                # Color grading LUTs
⚠️ FFmpeg required: The pipeline depends on FFmpeg built with libx264 and libfdk_aac (note that libfdk_aac ships only in non-free FFmpeg builds). Ensure ffmpeg and ffprobe are on your PATH before running.
Production

Scene Configurations

Each scene defines the visual style, audio profile, color grading, and loop parameters. 9 configs ship by default.

Scene Configuration Table

All 9 Scene Configs
Scene              | Look                            | Grade   | Ambient audio                      | Target
beach_sunrise      | Golden hour beach, gentle waves | warm    | Waves, seagulls, breeze            | 3-4 hrs
beach_sunset       | Dusk beach, deep orange tones   | sunset  | Waves, crickets, distant chatter   | 2-3 hrs
mountain_cabin     | Cozy cabin view of mountains    | natural | Fire crackle, wind, birds          | 3-4 hrs
mountain_snow      | Snowy peak, cold blue tones     | cold    | Wind, snow crunch, silence         | 2-3 hrs
alpine_meadow      | Green meadow, wildflowers       | vibrant | Birds, stream, rustling grass      | 3-4 hrs
tokyo_rain         | Neon-lit Tokyo street, rain     | cyber   | Rain, traffic, city hum            | 2-3 hrs
london_rain        | Victorian window, grey drizzle  | muted   | Rain on glass, distant traffic     | 2-3 hrs
nyc_skyline        | Penthouse view, city lights     | cool    | City ambient, distant sirens       | 3-4 hrs
rainforest_morning | Dense canopy, morning mist      | lush    | Exotic birds, water drips, insects | 3-4 hrs

Config Structure

{
  "scene_id": "beach_sunrise",
  "display_name": "Beach Sunrise",
  "target_duration_hrs": 3,
  "color_grade": {
    "lut": "warm_golden.cube",
    "saturation": 1.15,
    "contrast": 1.05
  },
  "audio_tags": ["ocean-waves", "seagulls", "beach-wind"],
  "xfade_duration": 2.0,
  "resolution": "3840x2160"
}
💡 Adding a new scene: Create a JSON config in scenes/, create a matching folder in .tmp/clips/{new_scene_id}/, and add 6-10 clips. The pipeline auto-discovers new scenes.
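Loading such a config might look like the following sketch. The dataclass fields mirror the JSON shown above; the names are illustrative, not necessarily scene_config.py's real API:

```python
import json
from dataclasses import dataclass, field

@dataclass
class SceneConfig:
    # Required fields, matching the JSON keys.
    scene_id: str
    display_name: str
    target_duration_hrs: float
    audio_tags: list
    # Defaults for fields a scene config may omit.
    xfade_duration: float = 2.0
    resolution: str = "3840x2160"
    color_grade: dict = field(default_factory=dict)

def load_scene(raw_json: str) -> SceneConfig:
    """Parse one scene config; unknown keys would raise a TypeError,
    which doubles as a cheap schema check."""
    return SceneConfig(**json.loads(raw_json))
```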
Input

Clip Intake Model

You generate or source clips externally, then drop them into the intake folder. The pipeline handles normalization, sequencing, and looping from there.

Intake Flow

  1. Source Clips — 6-10 per scene
  2. Drop into folder — .tmp/clips/{scene_id}/
  3. Auto-normalize — resolution, FPS, codec
  4. Sequence & Loop — frame-aligned xfade
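The auto-normalize step can be sketched as an ffmpeg command builder. The filter chain below (scale/pad to fit, constant fps, yuv420p) is one reasonable way to meet the frame-alignment requirements; the function name and exact encoder settings are illustrative:

```python
def normalize_cmd(src, dst, resolution="3840x2160", fps=30):
    """Build an ffmpeg argv that conforms a clip to the scene's
    resolution, a constant framerate, and a uniform pixel format
    (all required before xfade chaining)."""
    w, h = resolution.split("x")
    vf = (f"scale={w}:{h}:force_original_aspect_ratio=decrease,"
          f"pad={w}:{h}:(ow-iw)/2:(oh-ih)/2,"   # letterbox instead of stretching
          f"fps={fps},format=yuv420p")
    return ["ffmpeg", "-y", "-i", src, "-vf", vf,
            "-an",                               # strip any source audio
            "-c:v", "libx264", "-preset", "slow", "-crf", "18", dst]
```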

Clip Requirements

  • Count per scene — 6-10 clips (more variety = better loops)
  • Duration each — 15-60 seconds (the pipeline loops to fill the target duration)
  • Resolution — 4K (3840x2160) preferred; 1080p minimum; the pipeline upscales if needed
  • Framerate — 24, 30, or 60 FPS; the pipeline normalizes to the scene config FPS
  • Format — MP4 (H.264) or MOV; no variable framerate
  • Content — static or slow-pan window views; no abrupt movements or scene changes
  • Audio — may have audio (stripped during processing) or be silent
🚨 No variable framerate clips. Screen recordings and phone videos often have VFR. Run ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate,avg_frame_rate clip.mp4 to check; if r_frame_rate and avg_frame_rate differ, re-encode first.
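A helper wrapping the ffprobe check above might parse its output like this. This is a hypothetical sketch: it assumes ffprobe's default key=value text output, and treats any disagreement between the two rates beyond a small tolerance as a VFR symptom:

```python
def is_vfr(ffprobe_output, tolerance=1e-3):
    """Return True if r_frame_rate and avg_frame_rate disagree,
    which usually indicates a variable-framerate clip."""
    rates = {}
    for line in ffprobe_output.splitlines():
        key, _, val = line.partition("=")
        if key in ("r_frame_rate", "avg_frame_rate") and "/" in val:
            num, den = val.strip().split("/")
            if int(den):                 # guard against "0/0"
                rates[key] = int(num) / int(den)
    if len(rates) < 2:
        return False                     # not enough info to judge
    return abs(rates["r_frame_rate"] - rates["avg_frame_rate"]) > tolerance
```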

Folder Structure Example

.tmp/clips/
  beach_sunrise/
    clip_01_waves_wide.mp4
    clip_02_waves_close.mp4
    clip_03_horizon_pan.mp4
    clip_04_palm_silhouette.mp4
    clip_05_foam_detail.mp4
    clip_06_golden_light.mp4
    clip_07_birds_passing.mp4
    clip_08_tide_receding.mp4
Naming convention: Files are processed in alphabetical order. Use numbered prefixes (clip_01_, clip_02_) to control sequence order. Descriptive suffixes are optional but helpful.
Audio

Audio Mixing

Ambient audio is sourced from Freesound API based on scene tags, then layered, looped, and normalized to match the video duration.

Audio Pipeline

  1. Scene Tags — e.g. "ocean-waves"
  2. Freesound Search — API query + filter
  3. Download — WAV/FLAC preferred
  4. Layer & Mix — 3-5 tracks blended
  5. Normalize — LUFS target: -16

Layering Strategy

Audio Layers
  • Base layer — continuous ambient (waves, rain, wind) — 0 dB
  • Texture layer — secondary ambient (rustling, distant hum) — -6 dB
  • Detail layer — intermittent sounds (birds, drips) — -12 dB
  • Atmosphere — very low presence (room tone, sub-bass) — -18 dB
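The layer gains and the -16 LUFS target can be baked into a single ffmpeg filter_complex. A sketch of how audio_mixer.py might build it (the function name is illustrative; volume, amix, and loudnorm are standard ffmpeg filters):

```python
# Per-layer gains from the layering strategy table, in dB.
LAYER_GAINS_DB = {"base": 0, "texture": -6, "detail": -12, "atmosphere": -18}

def mix_filtergraph(layers, lufs_target=-16):
    """Build an ffmpeg filter_complex that gains each input by its
    layer role, mixes everything, and normalizes loudness.
    `layers` lists one layer name per input file, in input order."""
    parts, labels = [], []
    for i, name in enumerate(layers):
        parts.append(f"[{i}:a]volume={LAYER_GAINS_DB[name]}dB[a{i}]")
        labels.append(f"[a{i}]")
    parts.append(f"{''.join(labels)}amix=inputs={len(layers)}:duration=longest,"
                 f"loudnorm=I={lufs_target}[aout]")
    return ";".join(parts)
```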

Freesound API

  • API Key: stored in .env as FREESOUND_API_KEY
  • Rate limit: 2000 requests/day (free tier)
  • Preferred formats: WAV > FLAC > OGG
  • Min duration: 30 seconds per sample
  • License filter: CC0 or CC-BY only
  • Cache: Downloaded files cached in .tmp/audio/
⚠️ Attribution tracking: CC-BY sounds require attribution. The pipeline auto-generates an attribution list saved to .tmp/renders/{scene_id}_credits.txt. Include this in the YouTube description.
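A sketch of how the credits file might be assembled. The entry fields (name, author, license, url) are illustrative, not the real Freesound response shape; CC0 entries are listed without a link since CC0 does not require attribution:

```python
def build_credits(sounds):
    """Render an attribution list for the downloaded sounds.
    Each entry is a dict with name, author, license, and url."""
    lines = ["Ambient audio via Freesound:"]
    for s in sounds:
        line = f'- "{s["name"]}" by {s["author"]} ({s["license"]})'
        if s["license"] != "CC0":
            line += f' {s["url"]}'   # CC-BY terms require a link back
        lines.append(line)
    return "\n".join(lines)
```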
Core

Seamless Loop Technique

The secret to natural-looking long videos: frame-aligned crossfade transitions using FFmpeg's xfade filter. No visible cuts, no jarring jumps.

How xfade Looping Works

  1. Normalize all clips — transcode to matching resolution, FPS, and pixel format. Ensures frame-level alignment.
  2. Calculate transition points — each xfade offset is the running output length minus the fade duration (for the first pair, clip_duration - xfade_duration; later offsets accumulate). Default xfade is 2 seconds.
  3. Chain xfade filters — an FFmpeg filter_complex chains xfade between every adjacent clip. The last clip crossfades back into the first for a seamless loop.
  4. Extend to target duration — the looped sequence is repeated via stream_loop or concat to fill the full 2-4 hour target.
  5. Apply color grading — the LUT file is applied as a video filter in the same FFmpeg command, with per-scene saturation and contrast curves.

FFmpeg xfade Example

ffmpeg \
  -i clip_01.mp4 -i clip_02.mp4 -i clip_03.mp4 \
  -filter_complex "
    [0:v][1:v]xfade=transition=fade:duration=2:offset=13[v01];
    [v01][2:v]xfade=transition=fade:duration=2:offset=26[vout]
  " \
  -map "[vout]" -c:v libx264 -preset slow \
  -crf 18 loop_sequence.mp4
🚨 Frame alignment is critical. If clips have different FPS or pixel formats, xfade will fail or produce artifacts. The normalize step handles this, but never skip it. All clips must be identical in format before chaining.
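The offsets in the example above (13 and 26, which assume three 15-second clips with a 2-second fade) can be computed programmatically, and the filtergraph emitted from them. A sketch; the function names are illustrative, not loop_builder.py's real API:

```python
def xfade_offsets(durations, xfade=2.0):
    """Offset for the k-th xfade is the running output length minus
    the fade duration: sum(d_1..d_k) - k * xfade."""
    return [sum(durations[:k]) - k * xfade for k in range(1, len(durations))]

def xfade_filtergraph(durations, xfade=2.0):
    """Chain an xfade between every adjacent clip, labelling the
    final output [vout] as in the CLI example."""
    offsets = xfade_offsets(durations, xfade)
    parts, prev = [], "[0:v]"
    for i, off in enumerate(offsets, start=1):
        out = "[vout]" if i == len(offsets) else f"[v{i:02d}]"
        parts.append(f"{prev}[{i}:v]xfade=transition=fade:"
                     f"duration={xfade:g}:offset={off:g}{out}")
        prev = out
    return ";".join(parts)
```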

Supported Transitions

  • fade — default, most natural for ambient
  • dissolve — softer blend
  • wiperight — for city scenes
  • smoothup — gentle vertical pan effect

Duration Guidelines

  • 2.0s xfade — standard for nature scenes
  • 3.0s xfade — very slow, dreamy feel
  • 1.0s xfade — faster cuts for city scenes
  • Set in scene config xfade_duration
Output

Thumbnail & Metadata

Auto-generated thumbnails from the best frame, plus SEO-optimized titles, descriptions, and tags for YouTube.

Thumbnail Generation

  1. Frame extraction — FFmpeg extracts 1 frame per second from the loop sequence.
  2. Quality scoring — each frame is scored on sharpness (Laplacian variance), color vibrancy (saturation histogram), and composition (rule of thirds).
  3. Overlay text — scene title and "DeepFocus" branding are overlaid using Pillow/PIL. Font, position, and shadow are configurable per scene.
  4. Export at 1280x720 — YouTube thumbnail spec. Saved to .tmp/thumbnails/{scene_id}.jpg
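Step 2's sharpness metric, Laplacian variance, can be computed without any imaging library. A pure-Python sketch on a grayscale frame represented as a 2-D list (the real thumbnail_gen.py implementation may differ):

```python
def laplacian_variance(gray):
    """Sharpness score: variance of the 4-neighbour Laplacian over a
    grayscale frame. Flat frames score 0; sharp edges score high."""
    h, w = len(gray), len(gray[0])
    vals = [gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
            - 4 * gray[y][x]
            for y in range(1, h - 1) for x in range(1, w - 1)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)
```

Ranking candidate frames then reduces to `max(frames, key=laplacian_variance)` (possibly blended with the vibrancy and composition scores).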

YouTube Metadata Template

Metadata Fields
  • Title — {Scene Name} Window View | {Duration}hrs Ambient for Study & Focus
  • Description — scene description + audio credits + timestamps + channel links
  • Tags — scene-specific + generic ambient tags (max 500 chars total)
  • Category — Music (ID: 10) for ambient content
  • Privacy — Unlisted by default; publish manually after review
  • Playlist — auto-added to the scene-type playlist (Nature, City, etc.)
💡 YouTube Data API v3: The default quota is 10,000 units/day, and each videos.insert upload costs about 1,600 units, so roughly six uploads/day at most. Plan batch uploads accordingly. API key in .env as YOUTUBE_API_KEY.
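The title and tag fields above can be filled straight from the scene config. A sketch; the function names and the exact tag-truncation rule are illustrative:

```python
def build_title(display_name, duration_hrs):
    """Fill the title template from the metadata table."""
    return f"{display_name} Window View | {duration_hrs}hrs Ambient for Study & Focus"

def build_tags(scene_tags, generic_tags, limit=500):
    """Combine scene-specific and generic tags, stopping before the
    total exceeds the character budget for the tags field."""
    tags, used = [], 0
    for tag in scene_tags + generic_tags:
        if used + len(tag) > limit:
            break
        tags.append(tag)
        used += len(tag)
    return tags
```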
Quick Start

Quick Start Guide

Everything you need to go from zero to a rendered video.

Prerequisites

  • Python 3.10+
  • FFmpeg with libx264 (on PATH)
  • Freesound API key (free account at freesound.org)
  • YouTube Data API v3 credentials (for upload step only)

Step-by-Step

  1. Install dependencies — pip install -r requirements.txt
  2. Configure .env — set FREESOUND_API_KEY and optionally YOUTUBE_API_KEY
  3. Choose a scene — pick from the 9 scene configs (e.g., beach_sunrise)
  4. Drop clips into the intake folder — place 6-10 clips in .tmp/clips/beach_sunrise/
  5. Run the pipeline — python pipeline.py --scene beach_sunrise
  6. Review output — final video in .tmp/renders/, thumbnail in .tmp/thumbnails/
  7. Upload (optional) — python pipeline.py --scene beach_sunrise --upload

Common Commands

# Full pipeline for one scene
python pipeline.py --scene beach_sunrise

# Skip audio fetch (use cached)
python pipeline.py --scene beach_sunrise --skip-audio

# Render only (no upload)
python pipeline.py --scene tokyo_rain --render-only

# Generate thumbnail only
python thumbnail_gen.py --scene mountain_cabin

# List all available scenes
python pipeline.py --list-scenes

DO

  • Use 4K clips for best quality
  • Provide at least 6 clips per scene
  • Check xfade duration matches scene mood
  • Review the loop sequence before full render
  • Include Freesound attribution in descriptions
  • Cache audio downloads for reuse

DON'T

  • Don't use variable framerate clips
  • Don't mix landscape and portrait clips
  • Don't use clips with abrupt movement
  • Don't skip the normalize step
  • Don't exceed your daily YouTube API upload quota
  • Don't delete .tmp/audio/ cache unnecessarily
🎉 Walkthrough complete! You now understand the full DeepFocus YT pipeline — from clip intake through audio mixing, seamless looping, and YouTube upload. Bookmark this page for reference.