Parsers
Parsers are responsible for recognizing and extracting data from URLs of a specific platform.
The PlatformParser Protocol
Every parser implements the PlatformParser protocol:
from typing import ClassVar, Protocol

class PlatformParser(Protocol):
    platform: str
    schemes: ClassVar[set[str]]

    def handles_hostname(self, hostname: str) -> bool: ...
    def parse(self, url: str) -> SocialsURL | None: ...
Attributes
- `platform`: String identifier (e.g., `"github"`, `"twitter"`)
- `schemes`: Set of URL schemes the parser handles (e.g., `{"http", "https"}` or `{"mailto"}`)
Methods
- `handles_hostname(hostname)`: Returns `True` if the parser can handle URLs from this hostname
- `parse(url)`: Parses the URL and returns a typed URL object, or `None` if not recognized
Built-in Parsers
| Parser | Platform | Hostnames | Entity Types |
|---|---|---|---|
| GitHubParser | github | github.com | profile, repo |
| TwitterParser | twitter | twitter.com, x.com | profile |
| LinkedInParser | linkedin | linkedin.com | profile, company |
| FacebookParser | facebook | facebook.com, fb.com | profile |
| InstagramParser | instagram | instagram.com | profile |
| YouTubeParser | youtube | youtube.com, youtu.be | channel |
| EmailParser | email | (mailto: scheme) | |
| PhoneParser | phone | (tel: scheme) | phone |
How Parsing Works
- The Registry receives a URL
- It extracts the scheme (http, https, mailto, etc.)
- For http/https URLs, it extracts the hostname and finds the matching parser via `handles_hostname()`
- For other schemes (mailto, tel), it finds the parser that declares that scheme
- The matched parser's `parse()` method is called
- The parser returns a typed URL object or `None`
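The dispatch steps above can be sketched roughly as follows. This is an assumption-laden illustration, not the library's actual Registry: the `Registry` constructor, its `parse` method, and the two demo parsers are hypothetical, and plain tuples stand in for typed URL objects.

```python
from typing import ClassVar
from urllib.parse import urlsplit


class DemoWebParser:
    """Hypothetical hostname-based parser used only for this sketch."""
    platform = "demo"
    schemes: ClassVar[set[str]] = {"http", "https"}

    def handles_hostname(self, hostname: str) -> bool:
        return hostname == "example.com"

    def parse(self, url: str):
        return ("demo", url)  # tuple stands in for a typed URL object


class DemoTelParser:
    """Hypothetical scheme-based parser used only for this sketch."""
    platform = "phone"
    schemes: ClassVar[set[str]] = {"tel"}

    def handles_hostname(self, hostname: str) -> bool:
        return False

    def parse(self, url: str):
        return ("phone", urlsplit(url).path)


class Registry:
    """Sketch of the dispatch logic; the real Registry API may differ."""

    def __init__(self, parsers):
        self._parsers = list(parsers)

    def parse(self, url: str):
        parts = urlsplit(url)
        scheme = parts.scheme.lower()
        if scheme in {"http", "https"}:
            # Web URLs: match on scheme, then ask each parser about the hostname.
            hostname = (parts.hostname or "").lower()
            matches = (
                p for p in self._parsers
                if scheme in p.schemes and p.handles_hostname(hostname)
            )
        else:
            # Other schemes (mailto, tel): match on the declared scheme alone.
            matches = (p for p in self._parsers if scheme in p.schemes)
        parser = next(matches, None)
        return parser.parse(url) if parser is not None else None


registry = Registry([DemoWebParser(), DemoTelParser()])
web_result = registry.parse("https://example.com/alice")
tel_result = registry.parse("tel:+15551234567")
```

In this sketch the first matching parser wins, so registration order would matter if two parsers claimed the same hostname or scheme.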
URL Evolution
URLs change over time as platforms rename domains or restructure paths. Parsers accommodate both kinds of change:
- Domain changes (twitter.com to x.com): The parser's `handles_hostname()` accepts multiple hostnames
- Path changes: The parser tries multiple regex patterns, all returning the same typed class
Example of handling multiple domains:
class TwitterParser:
    platform = "twitter"

    def handles_hostname(self, hostname: str) -> bool:
        return hostname in {"twitter.com", "x.com", "www.twitter.com", "www.x.com"}
Parser Example
Here's a simplified view of how GitHubParser works:
import re
from typing import ClassVar

# Two path segments: /<owner>/<repo>
REPO_REGEX = re.compile(
    r"^https?://(?:www\.)?github\.com/(?P<owner>[A-Za-z0-9_-]+)/"
    r"(?P<repo>[A-Za-z0-9._-]+)/?$"
)
# One path segment: /<username>
PROFILE_REGEX = re.compile(
    r"^https?://(?:www\.)?github\.com/(?P<username>[A-Za-z0-9_-]+)/?$"
)

class GitHubParser:
    platform = "github"
    schemes: ClassVar[set[str]] = {"http", "https"}

    def handles_hostname(self, hostname: str) -> bool:
        return hostname in {"github.com", "www.github.com"}

    def parse(self, url: str) -> GitHubProfileURL | GitHubRepoURL | None:
        if match := REPO_REGEX.match(url):
            return GitHubRepoURL(url=url, **match.groupdict())
        if match := PROFILE_REGEX.match(url):
            return GitHubProfileURL(url=url, **match.groupdict())
        return None
Key points:
- More specific patterns are tried first (repo before profile)
- Named capture groups in regex map directly to URL object fields
- Returns None if the URL doesn't match any pattern
Creating Custom Parsers
See Contributing for how to add a new platform parser.