Friday, December 5, 2025

Pyhton: playwright to get audio url stream

There are many tool to find url audio stream from radio station's page, here are a few list:

  1. playwright
  2. Selenium
  3. Puppeteer

To use Wget to find url audo on statik and simple web  

$ wget --spider -r -l 5 -nv -w 1 --no-clobber -A .mp3,.m3u8 "https://www.example.com/audio/stations" 2>&1 | grep -E '\.(mp3|m3u8)$' | awk '{print $3}' | sort -u > out_audio_urls.txt

Install playwright on debian with virtual environment:

myuser@mypc:~$ mkdir spider_playwright
myuser@mypc:~$ cd spider_playwright/
myuser@mypc:~/spider_playwright$ python3 -m venv venv
myuser@mypc:~/spider_playwright$ source venv/bin/activate
(venv) myuser@mypc:~/spider_playwright$ pip install playwright
Collecting playwright
  Downloading playwright-1.56.0-py3-none-manylinux1_x86_64.whl.metadata (3.5 kB)
Collecting pyee<14,>=13 (from playwright)
  Downloading pyee-13.0.0-py3-none-any.whl.metadata (2.9 kB)
Collecting greenlet<4.0.0,>=3.1.1 (from playwright)
  Using cached greenlet-3.2.4-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (4.1 kB)
Collecting typing-extensions (from pyee<14,>=13->playwright)
  Using cached typing_extensions-4.15.0-py3-none-any.whl.metadata (3.3 kB)
Downloading playwright-1.56.0-py3-none-manylinux1_x86_64.whl (46.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46.3/46.3 MB 2.5 MB/s eta 0:00:00
Using cached greenlet-3.2.4-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (610 kB)
Downloading pyee-13.0.0-py3-none-any.whl (15 kB)
Using cached typing_extensions-4.15.0-py3-none-any.whl (44 kB)
Installing collected packages: typing-extensions, greenlet, pyee, playwright
Successfully installed greenlet-3.2.4 playwright-1.56.0 pyee-13.0.0 typing-extensions-4.15.0
(venv) myuser@mypc:~/spider_playwright$ playwright install
Downloading Chromium 141.0.7390.37 (playwright build v1194) from https://cdn.playwright.dev/dbazure/download/playwright/builds/chromium/1194/chromium-linux.zip
173.9 MiB [====================] 100% 0.0s
Chromium 141.0.7390.37 (playwright build v1194) downloaded to /home/myuser/.cache/ms-playwright/chromium-1194
Downloading Chromium Headless Shell 141.0.7390.37 (playwright build v1194) from https://cdn.playwright.dev/dbazure/download/playwright/builds/chromium/1194/chromium-headless-shell-linux.zip
104.3 MiB [====================] 100% 0.0s
Chromium Headless Shell 141.0.7390.37 (playwright build v1194) downloaded to /home/myuser/.cache/ms-playwright/chromium_headless_shell-1194
Downloading Firefox 142.0.1 (playwright build v1495) from https://cdn.playwright.dev/dbazure/download/playwright/builds/firefox/1495/firefox-debian-13.zip
96.7 MiB [====================] 100% 0.0s
Firefox 142.0.1 (playwright build v1495) downloaded to /home/myuser/.cache/ms-playwright/firefox-1495
Downloading Webkit 26.0 (playwright build v2215) from https://cdn.playwright.dev/dbazure/download/playwright/builds/webkit/2215/webkit-debian-13.zip
88.1 MiB [====================] 100% 0.0s
Webkit 26.0 (playwright build v2215) downloaded to /home/myuser/.cache/ms-playwright/webkit-2215
Downloading FFMPEG playwright build v1011 from https://cdn.playwright.dev/dbazure/download/playwright/builds/ffmpeg/1011/ffmpeg-linux.zip
2.3 MiB [====================] 100% 0.0s
FFMPEG playwright build v1011 downloaded to /home/myuser/.cache/ms-playwright/ffmpeg-1011

create file myspider.py and customize to your requirements, make it executable

(venv) myuser@mypc:~/spider_playwright$ chmod u+x myspider.py 

source code myspider.py modify it to meet your requirements 

import asyncio
from playwright.async_api import async_playwright

# generated and adjust by chatgpt.com

AUDIO_HINTS = [
    ".mp3", ".aac", ".m3u8", ".ogg", ".opus", ".wav",
    "stream", "/proxy/", "/live", "radio"
]

async def scan_audio_stream(url: str, listen_seconds: int = 15):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()

        # ----- Detect manual browser close -----
        browser_disconnected = asyncio.Event()

        def on_browser_disconnect():
            print("\n[BROWSER CLOSED] Exiting application.")
            browser_disconnected.set()

        browser.on("disconnected", on_browser_disconnect)
        # ---------------------------------------

        found = set()

        async def on_response(res):
            u = res.url.lower()
            ct = res.headers.get("content-type", "").lower()

            if "audio" in ct or any(k in u for k in AUDIO_HINTS):
                if u not in found:
                    print(f"[NEW AUDIO STREAM] {u}")
                    found.add(u)

        page.on("response", on_response)

        print("Opening page...")
        await page.goto(url, wait_until="domcontentloaded")
        await asyncio.sleep(2)

        # ---- Find audio player iframe ----
        candidate_frames = [
            f for f in page.frames
            if any(kw in f.url.lower() for kw in ["player", "radio", "embed"])
        ]
        if not candidate_frames:
            candidate_frames = page.frames  # fallback

        print("Attempting to click play...")
        for f in candidate_frames:
            try:
                for sel in ["button[aria-label='Play']", "button.play", ".jp-play", "button"]:
                    btn = await f.query_selector(sel)
                    if btn:
                        print(f"Clicked play in iframe → {f.url}")
                        await btn.click()
                        break
            except:
                pass

        print(f"\nListening for audio streams up to {listen_seconds} seconds...\n")

        # ----- Wait for 15 seconds OR browser close -----
        try:
            await asyncio.wait_for(browser_disconnected.wait(), timeout=listen_seconds)
        except asyncio.TimeoutError:
            print("\n[TIMEOUT] Done scanning.")
        # -------------------------------------------------

        await browser.close()
        return list(found)


if __name__ == "__main__":
    results = asyncio.run(scan_audio_stream(
        "https://www.example.com", # CHANGE HERE
        listen_seconds=15     # ← Fast scan
    ))

    print("\n===== ALL STREAMS FOUND =====")
    for s in results:
        print(s)

Enjoy