Parsers

Parsers are responsible for recognizing and extracting data from URLs of a specific platform.

The PlatformParser Protocol

Every parser implements the PlatformParser protocol:

class PlatformParser(Protocol):
    platform: str
    schemes: ClassVar[set[str]]

    def handles_hostname(self, hostname: str) -> bool: ...
    def parse(self, url: str) -> SocialsURL | None: ...

Attributes

platform: String identifier (e.g., "github", "twitter")
schemes: Set of URL schemes the parser handles (e.g., {"http", "https"} or {"mailto"})

Methods

handles_hostname(hostname): Returns True if the parser can handle URLs from this hostname
parse(url): Parses the URL and returns a typed URL object, or None if not recognized

Built-in Parsers

Parser	Platform	Hostnames	Entity Types
`GitHubParser`	github	github.com	profile, repo
`TwitterParser`	twitter	twitter.com, x.com	profile
`LinkedInParser`	linkedin	linkedin.com	profile, company
`FacebookParser`	facebook	facebook.com, fb.com	profile
`InstagramParser`	instagram	instagram.com	profile
`YouTubeParser`	youtube	youtube.com, youtu.be	channel
`EmailParser`	email	(mailto: scheme)	email
`PhoneParser`	phone	(tel: scheme)	phone

How Parsing Works

The Registry receives a URL
It extracts the scheme (http, https, mailto, etc.)
For http/https URLs, it extracts the hostname and finds the matching parser via handles_hostname()
For other schemes (mailto, tel), it finds the parser that declares that scheme
The matched parser's parse() method is called
The parser returns a typed URL object or None

URL Evolution

URLs change over time. Parsers handle this gracefully:

Domain changes (twitter.com to x.com): The parser's handles_hostname() accepts multiple hostnames
Path changes: The parser handles multiple regex patterns, returning the same typed class

Example of handling multiple domains:

class TwitterParser:
    platform = "twitter"

    def handles_hostname(self, hostname: str) -> bool:
        return hostname in {"twitter.com", "x.com", "www.twitter.com", "www.x.com"}

Parser Example

Here's a simplified view of how GitHubParser works:

REPO_REGEX = re.compile(
    r"^https?://(?:www\.)?github\.com/(?P<owner>[A-Za-z0-9_-]+)/"
    r"(?P<repo>[A-Za-z0-9._-]+)/?$"
)
PROFILE_REGEX = re.compile(
    r"^https?://(?:www\.)?github\.com/(?P<username>[A-Za-z0-9_-]+)/?$"
)

class GitHubParser:
    platform = "github"
    schemes: ClassVar[set[str]] = {"http", "https"}

    def handles_hostname(self, hostname: str) -> bool:
        return hostname in {"github.com", "www.github.com"}

    def parse(self, url: str) -> GitHubProfileURL | GitHubRepoURL | None:
        if match := REPO_REGEX.match(url):
            return GitHubRepoURL(url=url, **match.groupdict())
        if match := PROFILE_REGEX.match(url):
            return GitHubProfileURL(url=url, **match.groupdict())
        return None

Key points: - More specific patterns are tried first (repo before profile) - Named capture groups in regex map directly to URL object fields - Returns None if the URL doesn't match any pattern

Creating Custom Parsers

See Contributing for how to add a new platform parser.