Since you asked to generate a good feature, I will assume you need a Python function that processes raw text to extract, clean, and rank mentions of the major email providers.
Here is a Python function that takes raw text input and outputs the "Top 2022 Email Providers" feature:
import re
from collections import Counter
from datetime import datetime


def extract_top_email_features(text_data: str, year: int = 2022) -> dict:
    """
    Feature engineering: extracts email provider statistics from text data.

    `year` only labels the output (default 2022); the text itself is not
    filtered by date.
    """
    # Normalize text
    text_data = text_data.lower()

    # One pattern per provider, so each mention is counted exactly once.
    # Handles user@gmail.com, stray spaces (user@gmail .com), and the
    # concatenated no-dot form ("gmailcom", "usernamegmailcom").
    provider_patterns = {
        'gmail': r'(?:[\w.+-]+@\s*)?gmail\s*\.?\s*com',
        'yahoo': r'(?:[\w.+-]+@\s*)?yahoo\s*\.?\s*com',
        'hotmail': r'(?:[\w.+-]+@\s*)?hotmail\s*\.?\s*com',
        'aol': r'(?:[\w.+-]+@\s*)?aol\s*\.?\s*com',
    }

    counts = Counter()
    for provider, pattern in provider_patterns.items():
        counts[provider] = len(re.findall(pattern, text_data))

    # Feature: total email mentions for the year
    total_mentions = sum(counts.values())

    # Feature: top provider in the dataset (None when nothing matched)
    if total_mentions > 0:
        top_provider, top_count = counts.most_common(1)[0]
    else:
        top_provider, top_count = None, 0

    # Feature: provider diversity score (share of tracked providers seen)
    diversity = sum(1 for c in counts.values() if c > 0) / len(provider_patterns)

    return {
        "year": year,
        "provider_counts": dict(counts),
        "top_provider": top_provider,
        "top_provider_count": top_count,
        "total_email_mentions": total_mentions,
        "provider_diversity_score": round(diversity, 2),
        "feature_timestamp": datetime.now().isoformat(),
    }
The Context: The 2022 Landscape
In 2022, the email landscape had settled into a stable duopoly, with Gmail dominating the market and Outlook (Microsoft) holding a strong second place. Yahoo and AOL had become legacy brands, both owned by Apollo Global Management (which acquired them from Verizon in 2021), catering mostly to long-time users who preferred not to switch.
Here is the review of each service:
Part 5: A Step-by-Step Guide to Setting Up Your DNS TXT Records for All Four
If your goal in 2022 was to ensure your emails landed in the inbox (not spam), you followed this checklist:
Step 1: Aggregate your providers.
- If you use Gmail to send: Add include:_spf.google.com to your SPF TXT record.
- If you use Outlook (Hotmail) to send: Add include:spf.protection.outlook.com.
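Both includes belong in a single SPF TXT record, because a domain that publishes more than one SPF record fails evaluation. A minimal sketch of assembling that record follows; the helper name build_spf_record and the ~all (soft fail) default are illustrative assumptions, not part of the checklist itself.

```python
def build_spf_record(includes, policy="~all"):
    """Join include mechanisms into one SPF TXT value (hypothetical helper)."""
    if policy not in {"+all", "-all", "~all", "?all"}:
        raise ValueError(f"unsupported policy: {policy}")
    # Preserve the caller's order while dropping duplicate hosts.
    unique_hosts = []
    for host in includes:
        if host not in unique_hosts:
            unique_hosts.append(host)
    parts = ["v=spf1"] + [f"include:{host}" for host in unique_hosts] + [policy]
    return " ".join(parts)


record = build_spf_record(["_spf.google.com", "spf.protection.outlook.com"])
print(record)
# v=spf1 include:_spf.google.com include:spf.protection.outlook.com ~all
```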
Step 2: Generate DKIM keys.
- For Gmail: Enable DKIM in Google Admin console -> Apps -> Google Workspace -> Gmail -> Authenticate email.
- For Yahoo/AOL: Yahoo provides DKIM keys in Yahoo Small Business settings.
Step 3: Create the DMARC TXT record.
_dmarc.yourdomain.com. TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@yourdomain.com"
- 2022 Pro tip: Start with p=none to monitor, then move to p=quarantine after 2 weeks.
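A DMARC record is a semicolon-separated list of tag=value pairs, so it is easy to sanity-check before publishing. The sketch below parses the example record into a dictionary and verifies the version and policy tags; parse_dmarc is a hypothetical helper, and it does not check every rule in the DMARC specification (for instance, tag order).

```python
def parse_dmarc(record):
    """Split a DMARC TXT value into a tag dictionary (hypothetical helper)."""
    tags = {}
    for part in record.split(";"):
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"malformed tag: {part!r}")
        tags[key.strip().lower()] = value.strip()
    if tags.get("v") != "DMARC1":
        raise ValueError("missing or wrong version tag (expected v=DMARC1)")
    if tags.get("p") not in {"none", "quarantine", "reject"}:
        raise ValueError("p must be none, quarantine, or reject")
    return tags


tags = parse_dmarc("v=DMARC1; p=quarantine; rua=mailto:dmarc@yourdomain.com")
print(tags["p"], tags["rua"])
# quarantine mailto:dmarc@yourdomain.com
```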
Step 4: Test your "txt" string.
Use MXToolbox. Enter your domain and check the "SPF" and "DMARC" sections. A green checkmark means the record is published and parses correctly; it does not by itself guarantee inbox placement.
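Before pasting a record into an online checker, a quick local lint can catch the most common SPF mistakes. The sketch below is an assumption-laden illustration: lint_spf is a hypothetical helper, and it only counts lookup-triggering terms in the top-level record, whereas the real 10-lookup limit also covers lookups made by nested includes.

```python
import re

# Terms that trigger a DNS lookup during SPF evaluation.
LOOKUP_TERMS = {"include", "a", "mx", "ptr", "exists", "redirect"}


def lint_spf(record):
    """Return a list of problems found in an SPF TXT value (hypothetical helper)."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return ["record must start with v=spf1"]
    problems = []
    lookups = 0
    for term in terms[1:]:
        # Strip the qualifier, then take the name before ':', '/', or '='.
        name = re.split(r"[:/=]", term.lstrip("+-~?").lower(), maxsplit=1)[0]
        if name in LOOKUP_TERMS:
            lookups += 1
    if lookups > 10:
        problems.append(f"{lookups} lookup terms exceed the limit of 10")
    last = terms[-1].lstrip("+-~?").lower()
    if last != "all" and not last.startswith("redirect="):
        problems.append("record should end with an all mechanism or a redirect")
    return problems


print(lint_spf("v=spf1 include:_spf.google.com include:spf.protection.outlook.com ~all"))
# []
```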
Important Security & Ethical Considerations
Downloading or possessing leaked credential lists (text files of email addresses and passwords) can be risky and illegal:
- Malware Risk: Text files hosted on file-sharing sites (especially those labeled as "hacks" or "combos") frequently contain hidden malware or redirect to malicious websites.
- Legal Issues: In many jurisdictions, possessing or distributing lists of stolen credentials is illegal.
- Ethical Use: Using these lists to access accounts you do not own is a cybercrime (unauthorized access).
If you are researching this for cybersecurity purposes:
If you are a security professional or student, it is safer to use sanitized datasets designed for educational purposes, such as those provided by Have I Been Pwned or other reputable breach notification services, rather than downloading raw text files from the open web.
If you are concerned your email is on such a list:
You can check the security of your own accounts safely by visiting:
- Have I Been Pwned: A legitimate site that checks if your email has appeared in known data breaches.
- Google Password Manager: Checks if your saved passwords have been compromised.
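For passwords specifically, Have I Been Pwned also offers a Pwned Passwords range endpoint built on k-anonymity: only the first five hex characters of the password's SHA-1 hash leave your machine, and the response lists matching hash suffixes with breach counts. The sketch below shows the client-side half only. The helper names are hypothetical, the network request is deliberately left out, and sample_body is made-up data in the SUFFIX:COUNT line format.

```python
import hashlib


def split_sha1(password):
    """Return (5-char prefix, 35-char suffix) of the uppercase SHA-1 hex digest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_range(suffix, body):
    """Look up a hash suffix in a range response body; 0 means not found."""
    for line in body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return int(count)
    return 0


prefix, suffix = split_sha1("password")
# Only the prefix would be sent: https://api.pwnedpasswords.com/range/<prefix>
sample_body = f"0018A45C4D1DEF81644B54AB7F969B88D65:3\r\n{suffix}:42"
print(prefix, count_in_range(suffix, sample_body))
```

Because the full hash never leaves the client, the service cannot tell which password was checked.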