DMB: Bruno Knowledge Base Project

Overview

170h of Bruno Guyot courses

5 project phases

~12h estimated total effort

∞ long-term value

Pipeline architecture

YouTube (170h of video) → yt-dlp (SRT extraction) → Python (cleaning + chunking) → Supabase (pgvector embeddings) → MCP Server (search_bruno()) → Claude Desktop (natural-language queries)

Prerequisites

Python 3.11+ (already installed)

Supabase (already used for AdsMind)

Claude Desktop with MCP (configured tonight)

yt-dlp (to install: pip install yt-dlp)

List of Bruno's YouTube URLs

API key for embeddings (OpenAI; the scripts below use text-embedding-3-small)

Phase 1: Collect the YouTube URLs

Effort: ~2h · Priority: do this first

Inventory all of Bruno's videos

The videos are scattered (not organized in playlists), so you have to gather them manually. Create a file named bruno-urls.txt with one URL per line.

Tip: go to Bruno's YouTube channel → "Videos" tab → sort by date. Use the Chrome extension "Export YouTube URLs" to extract them in bulk.

Categorize the videos (optional but recommended)

Add a tag after each URL so you can find the topic later. Suggested format:

bruno-urls.txt
https://youtube.com/watch?v=XXXXX | pmax | Scaling PMAX : la méthode complète
https://youtube.com/watch?v=YYYYY | structure-compte | Structure de compte e-commerce
https://youtube.com/watch?v=ZZZZZ | encheres | Smart Bidding avancé
https://youtube.com/watch?v=AAAA | search-terms | Gestion des search terms négatifs

Recommended categories (use one of them verbatim, because the search_bruno category filter is an exact match): structure-compte, scaling, encheres, search-terms, pmax, tracking, meta-ads, strategie, audit, reporting, compliance, shopping, youtube-ads, crm, general
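A quick way to catch typos in tags before extraction is to validate the file. The sketch below is a hypothetical helper, not one of the project scripts: it assumes the "URL | tag | title" format suggested above and copies its allowed set from the category enum of the search_bruno tool.

```python
# Hypothetical helper (not part of the original scripts).
# The allowed set mirrors the category enum declared in bruno_mcp_server.py.
ALLOWED_CATEGORIES = {
    "structure-compte", "scaling", "encheres", "search-terms", "pmax",
    "tracking", "meta-ads", "strategie", "audit", "reporting",
    "compliance", "shopping", "youtube-ads", "crm", "general",
}


def parse_url_line(line):
    """Split 'URL | tag | title' into a dict; return None for blank or comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = [p.strip() for p in line.split("|", 2)]
    url = parts[0]
    tag = parts[1] if len(parts) > 1 and parts[1] else "general"
    title = parts[2] if len(parts) > 2 else ""
    return {"url": url, "tag": tag, "title": title, "tag_ok": tag in ALLOWED_CATEGORIES}


def check_file(path="bruno-urls.txt"):
    """Return (line_number, tag) pairs whose tag is not in the allowed set."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            entry = parse_url_line(line)
            if entry and not entry["tag_ok"]:
                problems.append((number, entry["tag"]))
    return problems
```

Calling check_file() from the bruno-kb folder returns an empty list when every tag is usable as a search filter.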

Check access

Some videos may be unlisted (direct link required) or private (cannot be extracted). Test with:

PowerShell
pip install yt-dlp
yt-dlp --list-subs "TEST_URL"

If the output lists available subtitles (fr, auto-generated), you're good.

Private or password-protected videos cannot be transcribed with yt-dlp. If some of them are hosted on a course platform (Teachable, etc.), you will need a different workflow.

Phase 2: Extract the transcripts

Effort: ~2h · Semi-automatic

Create the working folder

PowerShell
mkdir C:\Users\Steve\Documents\bruno-kb
cd C:\Users\Steve\Documents\bruno-kb
mkdir transcripts-raw
mkdir transcripts-clean
mkdir chunks

Batch extraction script

This script reads your URL file and extracts all the subtitles automatically:

extract_transcripts.py
import subprocess
import os

URLS_FILE = "bruno-urls.txt"
OUTPUT_DIR = "transcripts-raw"

os.makedirs(OUTPUT_DIR, exist_ok=True)

with open(URLS_FILE, "r", encoding="utf-8") as f:
    lines = f.readlines()

for i, line in enumerate(lines):
    line = line.strip()
    if not line or line.startswith("#"):
        continue

    parts = line.split("|")
    url = parts[0].strip()
    tag = parts[1].strip() if len(parts) > 1 else "general"
    title = parts[2].strip() if len(parts) > 2 else f"video_{i}"

    safe_title = "".join(c if c.isalnum() or c in " -_" else "" for c in title)
    filename = f"{i:03d}_{tag}_{safe_title}"

    print(f"\n[{i+1}/{len(lines)}] Extracting: {title}")

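    cmd = [
        "yt-dlp",
        "--write-subs",        # manually uploaded subtitles, when they exist
        "--write-auto-subs",   # otherwise YouTube's auto-generated captions
        "--sub-langs", "fr",
        "--skip-download",
        "--convert-subs", "srt",
        # %(ext)s lets yt-dlp name the subtitle file <filename>.fr.srt
        "-o", os.path.join(OUTPUT_DIR, filename + ".%(ext)s"),
        url
    ]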

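    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        print(f"  ❌ Failed: {e.stderr[:200]}")
        continue

    # yt-dlp can exit successfully even when the video has no French subtitles,
    # so confirm that an .srt file was actually written.
    if any(f.startswith(filename) and f.endswith(".srt") for f in os.listdir(OUTPUT_DIR)):
        print(f"  ✅ Done: {filename}")
    else:
        print(f"  ⚠️ No French subtitles: {filename}")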

print(f"\n🎉 Extraction complete! Check {OUTPUT_DIR}/")

Run it with:

PowerShell
cd C:\Users\Steve\Documents\bruno-kb
python extract_transcripts.py

Estimated time: ~30 seconds per video, so about 1h25 for 170 one-hour videos. Start it in the evening and let it run.

Check the results

After extraction, check how many .srt files were created:

PowerShell
(Get-ChildItem .\transcripts-raw\*.srt).Count

If some are missing, those videos had no auto-generated subtitles. For those, see Phase 2b.
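The count tells you how many files are missing, not which ones. A minimal sketch for listing them follows. It assumes the naming used by extract_transcripts.py (a 3-digit line index as filename prefix); the function name is hypothetical.

```python
import os


def missing_transcripts(urls_file="bruno-urls.txt", raw_dir="transcripts-raw"):
    """Return (index, url) pairs from the URL file that have no .srt in raw_dir.

    Mirrors extract_transcripts.py: the index is the line position given by
    enumerate(), and each output file starts with that index on 3 digits.
    """
    have = {name.split("_", 1)[0] for name in os.listdir(raw_dir) if name.endswith(".srt")}
    missing = []
    with open(urls_file, encoding="utf-8") as f:
        for i, line in enumerate(f):
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            url = line.split("|")[0].strip()
            if f"{i:03d}" not in have:
                missing.append((i, url))
    return missing
```

The returned URLs are the ones to send through the Whisper route.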

Phase 2b: Videos without subtitles (Whisper)

For videos without YouTube auto-generated subtitles, download the audio and transcribe it with Whisper:

PowerShell
# Install Whisper (it needs ffmpeg on the PATH, as does yt-dlp -x)
pip install openai-whisper

# Download the audio only
yt-dlp -x --audio-format mp3 -o "audio_%(title)s.%(ext)s" "VIDEO_URL"

# Transcribe
whisper "audio_file.mp3" --language fr --output_format srt

Running Whisper locally is slow (~1x real time on CPU). Alternative: use OpenAI's Whisper API ($0.006/min, so ~$61 for 170h) to go faster.
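In practice only the videos without subtitles need Whisper, so the bill is likely to be a fraction of that figure. The arithmetic can be sketched as below; the rate and the ~1x real-time factor come from the paragraph above, and the function names are illustrative.

```python
def whisper_api_cost(audio_hours, usd_per_minute=0.006):
    """Cost in USD of transcribing `audio_hours` of audio at a per-minute rate."""
    return round(audio_hours * 60 * usd_per_minute, 2)


def local_cpu_hours(audio_hours, realtime_factor=1.0):
    """Rough processing time on CPU, assuming ~1x real time as stated above."""
    return audio_hours * realtime_factor


# Worst case: every video goes through Whisper.
print(whisper_api_cost(170), "USD via the API")
print(local_cpu_hours(170), "hours locally on CPU")
```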

Phase 3: Clean and chunk the transcripts

Effort: ~2h · Python script

SRT → Markdown cleaning script

Converts the .srt files (timestamps + fragmented text) into clean Markdown with metadata:

clean_srt.py
import re, os, json

RAW_DIR = "transcripts-raw"
CLEAN_DIR = "transcripts-clean"
os.makedirs(CLEAN_DIR, exist_ok=True)

def clean_srt(filepath):
    """Parse SRT, merge lines, remove timestamps, keep paragraph breaks."""
    with open(filepath, "r", encoding="utf-8") as f:
        content = f.read()

    # Remove SRT sequence numbers and timestamps
    content = re.sub(r"\d+\n\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\n", "", content)
    # Remove remaining numbers-only lines
    content = re.sub(r"^\d+$", "", content, flags=re.MULTILINE)
    # Collapse multiple newlines
    content = re.sub(r"\n{3,}", "\n\n", content)
    # Remove leading/trailing whitespace per line
    lines = [line.strip() for line in content.split("\n")]
    # Merge short consecutive lines into paragraphs
    paragraphs = []
    current = []
    for line in lines:
        if not line:
            if current:
                paragraphs.append(" ".join(current))
                current = []
        else:
            current.append(line)
    if current:
        paragraphs.append(" ".join(current))

    return "\n\n".join(paragraphs)

def extract_timestamps_map(filepath):
    """Extract timestamp→text mapping for linking back to YouTube."""
    with open(filepath, "r", encoding="utf-8") as f:
        content = f.read()
    timestamps = {}
    blocks = re.findall(
        r"(\d{2}:\d{2}:\d{2}),\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\n(.+?)(?=\n\n|\Z)",
        content, re.DOTALL
    )
    for ts, text in blocks:
        h, m, s = ts.split(":")
        seconds = int(h)*3600 + int(m)*60 + int(s)
        timestamps[seconds] = text.replace("\n", " ").strip()
    return timestamps

for filename in os.listdir(RAW_DIR):
    if not filename.endswith(".srt"):
        continue

    filepath = os.path.join(RAW_DIR, filename)
    clean_text = clean_srt(filepath)
    ts_map = extract_timestamps_map(filepath)

    # Extract metadata from filename (format: 001_tag_title.fr.srt)
    base = filename.replace(".fr.srt", "").replace(".srt", "")
    parts = base.split("_", 2)
    idx = parts[0] if parts else "000"
    tag = parts[1] if len(parts) > 1 else "general"
    title = parts[2].replace("_", " ") if len(parts) > 2 else base

    # Save clean transcript
    output_path = os.path.join(CLEAN_DIR, f"{base}.md")
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(f"# {title}\n\n")
        f.write(f"**Catégorie :** {tag}\n")
        f.write(f"**Source :** Bruno Guyot\n\n")
        f.write(f"---\n\n")
        f.write(clean_text)

    # Save timestamps mapping (for YouTube link generation later)
    ts_path = os.path.join(CLEAN_DIR, f"{base}_timestamps.json")
    with open(ts_path, "w", encoding="utf-8") as f:
        json.dump({"title": title, "tag": tag, "timestamps": ts_map}, f,
                  ensure_ascii=False, indent=2)

    print(f"✅ {filename} → {base}.md ({len(clean_text)} chars)")

print(f"\n🎉 Cleaning complete! {len(os.listdir(CLEAN_DIR))//2} transcripts in {CLEAN_DIR}/")
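One thing to watch for: YouTube's auto-generated captions often repeat the previous line in the next cue (rolling captions), so the cleaned text can contain the same phrase two or three times in a row. If you see that in your .md files, a post-processing step along these lines may help. It is a sketch with a hypothetical function name, not part of the script above, and it gives up paragraph breaks in favor of a single flowing text.

```python
def dedupe_caption_lines(text):
    """Drop blank lines and consecutive repeats, then join into one flowing text."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if kept and line == kept[-1]:
            continue  # same caption line repeated by the next cue
        kept.append(line)
    return " ".join(kept)
```

It could be applied to the value returned by clean_srt() before the Markdown file is written.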

Chunking script

Splits each transcript into pieces of ~500 tokens (~2000 characters) with a 200-character overlap to preserve context:

chunk_transcripts.py
import os, json

CLEAN_DIR = "transcripts-clean"
CHUNKS_DIR = "chunks"
CHUNK_SIZE = 2000  # ~500 tokens
OVERLAP = 200

os.makedirs(CHUNKS_DIR, exist_ok=True)

all_chunks = []

for filename in sorted(os.listdir(CLEAN_DIR)):
    if not filename.endswith(".md"):
        continue

    filepath = os.path.join(CLEAN_DIR, filename)
    with open(filepath, "r", encoding="utf-8") as f:
        content = f.read()

    # Extract metadata from first lines
    lines = content.split("\n")
    title = lines[0].replace("# ", "").strip() if lines else filename
    tag = "general"
    for line in lines[:5]:
        if line.startswith("**Catégorie :**"):
            tag = line.replace("**Catégorie :**", "").strip()
            break

    # Get the body text (after the --- separator)
    body_start = content.find("---")
    body = content[body_start+3:].strip() if body_start > 0 else content

    # Chunk with overlap
    pos = 0
    chunk_idx = 0
    while pos < len(body):
        end = pos + CHUNK_SIZE
        # Try to break at a sentence boundary
        if end < len(body):
            last_period = body[pos:end].rfind(". ")
            if last_period > CHUNK_SIZE * 0.5:
                end = pos + last_period + 2

        chunk_text = body[pos:end].strip()
        if len(chunk_text) > 50:  # Skip tiny chunks
            chunk = {
                "id": f"{filename.replace('.md','')}_{chunk_idx:03d}",
                "source_file": filename,
                "title": title,
                "category": tag,
                "chunk_index": chunk_idx,
                "text": chunk_text,
                "char_count": len(chunk_text)
            }
            all_chunks.append(chunk)
            chunk_idx += 1

        if end >= len(body):
            break  # last chunk emitted; stepping back would only re-emit its tail
        pos = end - OVERLAP

    print(f"  {filename}: {chunk_idx} chunks")

# Save all chunks
output_path = os.path.join(CHUNKS_DIR, "all_chunks.json")
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(all_chunks, f, ensure_ascii=False, indent=2)

print(f"\n🎉 Total: {len(all_chunks)} chunks saved to {output_path}")
if all_chunks:  # avoid dividing by zero when no transcript was found
    print(f"   Average chunk size: {sum(c['char_count'] for c in all_chunks)//len(all_chunks)} chars")
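To see the overlap logic in isolation, here is the same idea as a small function you can try on a short string. It is a simplified sketch, not a drop-in replacement: it has no minimum chunk length, it stops as soon as the end of the text is reached, and the name chunk_text is illustrative.

```python
def chunk_text(body, chunk_size=2000, overlap=200):
    """Split body into overlapping chunks, preferring to cut right after '. '."""
    chunks = []
    pos = 0
    while pos < len(body):
        end = pos + chunk_size
        if end < len(body):
            # Look for a sentence boundary in the second half of the window
            last_period = body[pos:end].rfind(". ")
            if last_period > chunk_size * 0.5:
                end = pos + last_period + 2
        piece = body[pos:end].strip()
        if piece:
            chunks.append(piece)
        if end >= len(body):
            break  # stop here instead of re-emitting the tail of the last chunk
        pos = end - overlap
    return chunks


demo = chunk_text("abcdefghij" * 5, chunk_size=20, overlap=5)
print(len(demo), demo)
```

Each chunk starts with the last `overlap` characters of the previous one, which is what keeps a sentence cut at a boundary searchable from both sides.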

Phase 4: Embeddings + Supabase

Effort: ~3h (including machine time) · Supabase setup + batch embeddings

Create the Supabase table

In the Supabase dashboard → SQL Editor, run:

SQL (Supabase)
-- Enable the pgvector extension
create extension if not exists vector;

-- Chunks table
create table bruno_chunks (
  id text primary key,
  source_file text not null,
  title text not null,
  category text not null,
  chunk_index integer not null,
  text text not null,
  char_count integer,
  youtube_url text,
  embedding vector(1536),
  created_at timestamp with time zone default now()
);

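-- Index for vector search
-- (ivfflat builds its lists from existing rows, so recall is better if this
--  statement is run after the chunks have been uploaded)
create index on bruno_chunks
  using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);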

-- Semantic search function
create or replace function search_bruno(
  query_embedding vector(1536),
  match_count int default 5,
  filter_category text default null
)
returns table (
  id text,
  title text,
  category text,
  text text,
  youtube_url text,
  similarity float
)
language plpgsql
as $$
begin
  return query
  select
    bc.id, bc.title, bc.category, bc.text, bc.youtube_url,
    1 - (bc.embedding <=> query_embedding) as similarity
  from bruno_chunks bc
  where (filter_category is null or bc.category = filter_category)
  order by bc.embedding <=> query_embedding
  limit match_count;
end;
$$;
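To make the SQL less of a black box: pgvector's <=> operator is cosine distance, and the function returns 1 minus that distance as the similarity. The pure-Python sketch below reproduces the ranking on toy vectors. It is an illustration only (the database does this for you, and the ivfflat index makes it approximate), and the function names are made up for the example.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two vectors: 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


def rank(query, rows, match_count=5):
    """rows: list of (id, vector). Returns (id, similarity), best match first."""
    scored = [(row_id, 1 - cosine_distance(query, vec)) for row_id, vec in rows]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:match_count]


toy_rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print(rank([1.0, 0.0], toy_rows, match_count=2))
```

Ordering by ascending distance, as the SQL does, gives the same order as sorting by descending similarity here.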

Embedding generation and upload script

This script generates the embeddings through the OpenAI API (text-embedding-3-small, the cheapest model) and inserts them into Supabase. It needs the openai and supabase packages (pip install openai supabase):

generate_embeddings.py
import json, os, time
from openai import OpenAI
from supabase import create_client

# Config
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "sk-...")
SUPABASE_URL = os.environ.get("SUPABASE_URL", "https://xxx.supabase.co")
SUPABASE_KEY = os.environ.get("SUPABASE_KEY", "eyJ...")
CHUNKS_FILE = "chunks/all_chunks.json"
BATCH_SIZE = 50  # OpenAI accepts up to 2048 inputs per request

openai = OpenAI(api_key=OPENAI_API_KEY)
supabase = create_client(SUPABASE_URL, SUPABASE_KEY)

with open(CHUNKS_FILE, "r", encoding="utf-8") as f:
    chunks = json.load(f)

print(f"Processing {len(chunks)} chunks...")

for i in range(0, len(chunks), BATCH_SIZE):
    batch = chunks[i:i+BATCH_SIZE]
    texts = [c["text"] for c in batch]

    # Generate embeddings
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )

    # Upload to Supabase
    rows = []
    for j, chunk in enumerate(batch):
        rows.append({
            "id": chunk["id"],
            "source_file": chunk["source_file"],
            "title": chunk["title"],
            "category": chunk["category"],
            "chunk_index": chunk["chunk_index"],
            "text": chunk["text"],
            "char_count": chunk["char_count"],
            "embedding": response.data[j].embedding,
        })

    supabase.table("bruno_chunks").upsert(rows).execute()
    print(f"  Batch {i//BATCH_SIZE + 1}/{(len(chunks)-1)//BATCH_SIZE + 1}: {len(batch)} chunks uploaded")
    time.sleep(0.5)  # Rate limiting

print(f"\n🎉 All {len(chunks)} chunks embedded and stored in Supabase!")
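Note that the script above never fills the youtube_url column, so search results come back without a link. One possible way to recover the URL is to map the 3-digit prefix of source_file back to the matching line of bruno-urls.txt. This is a sketch under that assumption, and both helper names are hypothetical.

```python
def load_url_index(urls_file="bruno-urls.txt"):
    """Map the 3-digit index used in filenames to the URL found on that line."""
    index = {}
    with open(urls_file, encoding="utf-8") as f:
        for i, line in enumerate(f):  # same numbering as extract_transcripts.py
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            index[f"{i:03d}"] = line.split("|")[0].strip()
    return index


def youtube_url_for(source_file, url_index):
    """source_file looks like '001_pmax_Title.md'; returns the URL or None."""
    return url_index.get(source_file.split("_", 1)[0])
```

With url_index = load_url_index() loaded once, each row could carry "youtube_url": youtube_url_for(chunk["source_file"], url_index) alongside the other fields.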

Estimated cost: text-embedding-3-small costs $0.02 per 1M tokens. For ~170h of transcripts (~2M tokens), that comes to about $0.04. Practically free.
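You can check that estimate against your real data once the chunks exist. The sketch below assumes roughly 4 characters per token (the same ratio as "~500 tokens ≈ 2000 characters" in the chunking step) and reads the char_count field written by chunk_transcripts.py. The function names are illustrative.

```python
import json


def embedding_cost(total_chars, chars_per_token=4, usd_per_million_tokens=0.02):
    """Rough estimate: characters -> tokens (approximate) -> USD."""
    tokens = total_chars / chars_per_token
    return tokens, tokens / 1_000_000 * usd_per_million_tokens


def cost_from_chunks(chunks_file="chunks/all_chunks.json"):
    """Sum char_count over all chunks and estimate the embedding cost."""
    with open(chunks_file, encoding="utf-8") as f:
        chunks = json.load(f)
    return embedding_cost(sum(c["char_count"] for c in chunks))


tokens, usd = embedding_cost(8_000_000)  # ~8M characters, i.e. ~2M tokens
print(f"{tokens:,.0f} tokens, about ${usd:.2f}")
```

Because of the 200-character overlap, the real total is slightly higher than the raw transcript length, but it stays in the range of a few cents.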

Phase 5: MCP Server "Bruno Knowledge Base"

Effort: ~2h · The custom MCP server

Create the MCP server

A small Python server that exposes a search_bruno(query, category?, count?) tool (it needs the mcp package: pip install mcp):

bruno_mcp_server.py
import os, json
from mcp.server import Server
from mcp.types import Tool, TextContent
from openai import OpenAI
from supabase import create_client

# Config
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_KEY = os.environ["SUPABASE_KEY"]

openai_client = OpenAI(api_key=OPENAI_API_KEY)
supabase = create_client(SUPABASE_URL, SUPABASE_KEY)
server = Server("bruno-knowledge-base")

@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="search_bruno",
            description=(
                "Recherche dans la base de connaissances de Bruno Guyot "
                "(170h de cours Google Ads). Retourne les passages les plus "
                "pertinents par recherche sémantique. Utiliser pour : "
                "méthodologie enchères, structure de compte, scaling, PMAX, "
                "search terms, tracking, compliance, stratégie."
            ),
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "La question ou le sujet à rechercher"
                    },
                    "category": {
                        "type": "string",
                        "description": "Filtrer par catégorie (optionnel)",
                        "enum": [
                            "structure-compte", "scaling", "encheres",
                            "search-terms", "pmax", "tracking", "meta-ads",
                            "strategie", "audit", "reporting", "compliance",
                            "shopping", "youtube-ads", "crm", "general"
                        ]
                    },
                    "count": {
                        "type": "integer",
                        "description": "Nombre de résultats (défaut: 5, max: 10)",
                        "default": 5
                    }
                },
                "required": ["query"]
            }
        )
    ]

@server.call_tool()
async def call_tool(name, arguments):
    if name != "search_bruno":
        return [TextContent(type="text", text=f"Outil inconnu: {name}")]

    query = arguments["query"]
    category = arguments.get("category")
    count = min(arguments.get("count", 5), 10)

    # Generate query embedding
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[query]
    )
    query_embedding = response.data[0].embedding

    # Search Supabase
    results = supabase.rpc("search_bruno", {
        "query_embedding": query_embedding,
        "match_count": count,
        "filter_category": category
    }).execute()

    if not results.data:
        return [TextContent(type="text", text="Aucun résultat trouvé.")]

    # Format results
    output = f"## Résultats Bruno KB — \"{query}\"\n\n"
    for i, r in enumerate(results.data):
        similarity_pct = round(r["similarity"] * 100, 1)
        output += f"### [{i+1}] {r['title']} ({r['category']}) — {similarity_pct}% pertinent\n"
        if r.get("youtube_url"):
            output += f"🔗 {r['youtube_url']}\n"
        output += f"\n{r['text']}\n\n---\n\n"

    return [TextContent(type="text", text=output)]

if __name__ == "__main__":
    import asyncio
    from mcp.server.stdio import stdio_server

    async def main():
        async with stdio_server() as (read, write):
            await server.run(read, write, server.create_initialization_options())

    asyncio.run(main())

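Configure it in Claude Desktop

Add this server to your claude_desktop_config.json next to the Google Ads MCP. The command below assumes the packages are installed in a virtual environment at bruno-kb\.venv; adjust the path if your Python lives elsewhere:

claude_desktop_config.json (excerpt)
{
  "mcpServers": {
    "googleAdsServer": {
      "...": "your existing config"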
    },
    "brunoKB": {
      "command": "C:\\Users\\Steve\\Documents\\bruno-kb\\.venv\\Scripts\\python.exe",
      "args": ["C:\\Users\\Steve\\Documents\\bruno-kb\\bruno_mcp_server.py"],
      "env": {
        "OPENAI_API_KEY": "sk-...",
        "SUPABASE_URL": "https://xxx.supabase.co",
        "SUPABASE_KEY": "eyJ..."
      }
    }
  }
}

Test it!

Restart Claude Desktop and try these prompts:

Test prompts
What does Bruno say about scaling PMAX campaigns?

How does Bruno structure an e-commerce account with a lot of products?

What is Bruno's methodology for managing Smart Bidding?

According to Bruno, when should you switch from maximize clicks to target CPA?

Search Bruno's courses for what he says about DEBT_SERVICES compliance

The end result

You ask Claude Desktop anything about Google Ads, and it combines three sources:

1. Its general knowledge (from Anthropic's training)
2. Your live data through the Google Ads MCP (Cohnen)
3. Bruno's methodology through the Bruno KB MCP

Example: "Blick Frères' CPA went up 30% this week. Look at the change events to understand why, and search Bruno's courses for how he handles this kind of situation."

Claude pulls the live data, finds the changes, AND cites Bruno's methodology to recommend next steps. It's like having Bruno looking over your shoulder.

Suggested schedule

Week 1 (2h): Phase 1, collect and categorize the YouTube URLs
Week 1, evening (start it and let it run): Phase 2, transcript extraction
Week 2 (2h): Phase 3, cleaning and chunking
Week 2 (3h): Phase 4, Supabase setup + embeddings
Week 2-3 (2h): Phase 5, MCP server + Claude Desktop configuration
Week 3: testing and iteration