OpenPII Watcher

Systematic Detection of Exposed PII in Publicly Shared Documents

84.3% Precision
100% F1 (Email, SSN)
+84% vs Baseline

Suryakiran Valavala • Arsh Advani • Vijay Arvind Ramamoorthy

UC Santa Cruz • CSE 253: Network Security

The Problem

Exposure Risks

  • Google Docs with "anyone with link" permissions
  • Pastebin posts shared publicly by users
  • Contact lists and signup sheets left open
  • Sensitive data accidentally exposed in dumps

Security Impacts

  • Phishing attacks targeting exposed emails
  • Social engineering using names and phones
  • Identity theft from SSN and financial data
  • Comprehensive profiling from aggregated PII

Users lack tools to check what PII their documents expose before sharing.

Research Question

Can we systematically detect PII exposed through public sharing links?

Accuracy

Using transparent, lightweight pattern matching

Coverage

Emails, Phones, Names, Addresses, SSN

Privacy

100% Client-side processing for safety

Our Solution

Enhanced Detection

Improved regex patterns with false positive filtering

Multi-Platform

Pastebin (95%+) and Google Docs (60-80%)

Privacy-First

100% client-side processing, no server needed

Rigorous Evaluation

16 documents, 192 PII instances, ground truth

Working Demo

Deployed web application on GitHub Pages

Open Source

Complete implementation and evaluation code

System Architecture

Data Layer

URL Parsing • Platform Detection • Content Fetching

Detection Layer

Regex Pattern Matching • False Positive Filtering

Output Layer

Results Aggregation • Risk Assessment • Recommendations
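
The detection and output layers can be illustrated with a minimal Python sketch. The deployed tool runs in the browser, so this is not its code. The two patterns, the function names `detect` and `assess_risk`, and the risk rules are all hypothetical choices made for illustration.

```python
import re
from typing import Dict, List

# Hypothetical patterns for two structured PII types (illustrative only).
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect(text: str) -> Dict[str, List[str]]:
    """Detection layer: run every pattern over the fetched text."""
    return {kind: pattern.findall(text) for kind, pattern in PATTERNS.items()}


def assess_risk(findings: Dict[str, List[str]]) -> str:
    """Output layer: map aggregated findings to a coarse risk label.

    The rules below are assumptions, not the project's scoring.
    """
    if findings.get("ssn"):
        return "high"
    if any(findings.values()):
        return "medium"
    return "low"


if __name__ == "__main__":
    sample = "Contact user+tag@domain.com, SSN 123-45-6789."
    found = detect(sample)
    print(found, assess_risk(found))
```

Keeping detection and risk assessment as separate functions mirrors the layered design: patterns can be tuned without touching the reporting logic.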

How We Access Different Platforms

Different platforms require different approaches to fetch content. Here's how we handle each:

Pastebin

Simple approach: Pastebin provides a special URL format that allows direct access to raw text.

How it works: Transform pastebin.com/ABC123 → pastebin.com/raw/ABC123
Why it works: Pastebin's /raw/ endpoint allows cross-origin requests
Success rate: 95%+
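
The URL transformation can be sketched in a few lines of Python. The function name `pastebin_raw_url` and its validation rules are assumptions for illustration. Only the mapping of pastebin.com/ABC123 to pastebin.com/raw/ABC123 comes from the slide.

```python
from urllib.parse import urlparse


def pastebin_raw_url(url: str) -> str:
    """Rewrite a Pastebin share link to its /raw/ form (hypothetical helper)."""
    # Accept links pasted without a scheme, e.g. "pastebin.com/ABC123".
    parsed = urlparse(url if "://" in url else "https://" + url)
    if parsed.netloc.lower() not in ("pastebin.com", "www.pastebin.com"):
        raise ValueError(f"not a Pastebin URL: {url}")
    paste_id = parsed.path.strip("/")
    # Leave links that already point at the raw endpoint unchanged in effect.
    if paste_id.startswith("raw/"):
        paste_id = paste_id[len("raw/"):]
    if not paste_id or "/" in paste_id:
        raise ValueError(f"could not find a paste ID in: {url}")
    return f"https://pastebin.com/raw/{paste_id}"


if __name__ == "__main__":
    print(pastebin_raw_url("pastebin.com/ABC123"))
```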

Google Docs

Challenge: Google Docs blocks direct access from browsers (CORS restrictions). We use a two-step approach:

Step 1 (Try Direct): Attempt direct access (works ~40% of the time)
Step 2 (Use Proxy): If blocked, retry through a proxy service (adds 20-40 percentage points)
Combined success: 60-80%
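
The two-step strategy amounts to a try-then-fall-back pattern, sketched below in Python. The fetchers are passed in as plain callables because the slides do not name the proxy service or the request details. `fetch_with_fallback` and the `Fetcher` type are hypothetical names.

```python
from typing import Callable, Optional, Tuple

# A fetcher returns the document text, or None when access is blocked.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(
    url: str, direct: Fetcher, proxy: Fetcher
) -> Tuple[Optional[str], str]:
    """Try direct access first, then the proxy; report which path succeeded."""
    text = direct(url)
    if text is not None:
        return text, "direct"
    text = proxy(url)
    if text is not None:
        return text, "proxy"
    return None, "failed"


if __name__ == "__main__":
    # Stub fetchers standing in for real network calls.
    def blocked(_url: str) -> Optional[str]:
        return None

    def via_proxy(_url: str) -> Optional[str]:
        return "document text"

    print(fetch_with_fallback("https://docs.google.com/...", blocked, via_proxy))
```

Returning the path alongside the text lets the interface tell users whether their document passed through a third-party proxy.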

What We Detect and How Well

F1-Score: The harmonic mean of precision and recall (1.000 = perfect, 0.000 = no correct detections). Higher is better.
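
As a worked example, F1 is computed as 2PR / (P + R), where P is precision and R is recall. The short Python sketch below applies this to the name-detection figures reported later in the deck (58.7% precision, 94.1% recall). The function name `f1_score` is just a label chosen here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Name detection: 58.7% precision, 94.1% recall.
    print(round(f1_score(0.587, 0.941), 3))  # 0.723
```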

Email

F1: 1.000 (Perfect!)
user+tag@domain.com

Consistent format makes detection reliable

Address

F1: 0.92 (Excellent)
123 Main Street

Street addresses follow predictable patterns, but some variations exist

SSN

F1: 1.000 (Perfect!)
123-45-6789

Always formatted as XXX-XX-XXXX

Name

F1: 0.723 (Good)
Dr. Mary-Anne O'Brien

Catches 94% of names, but produces some false positives (e.g., "Credit Card")
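
One way to picture name detection with false positive filtering is a capitalized-word pattern followed by a stoplist check, as in the Python sketch below. The regex, the title list, and the `NOT_NAMES` stoplist are assumptions made for illustration and not the project's actual patterns.

```python
import re
from typing import List

# A name word: "O'Brien"-style, or a capitalized word with optional hyphen parts.
WORD = r"(?:[A-Z]'[A-Z][a-z]+|[A-Z][a-z]+(?:-[A-Z][a-z]+)*)"
# Optional honorific such as "Dr. " (hypothetical list).
TITLE = r"(?:(?:Dr|Mr|Mrs|Ms|Prof)\.\s+)?"
NAME_RE = re.compile(rf"\b{TITLE}{WORD}(?:\s+{WORD})+\b")

# Capitalized phrases that look like names but are not (hypothetical stoplist).
NOT_NAMES = {"credit card", "main street", "social security"}


def find_names(text: str) -> List[str]:
    """Return name candidates after dropping known false positives."""
    candidates = [m.group(0) for m in NAME_RE.finditer(text)]
    return [c for c in candidates if c.lower() not in NOT_NAMES]


if __name__ == "__main__":
    print(find_names("Dr. Mary-Anne O'Brien paid by Credit Card."))
```

A stoplist keeps recall high, since the broad pattern still runs, while removing the most common false alarms.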

Phone

F1: 0.681 (Good)
(555) 123-4567

Handles multiple formats, but misses international numbers
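
The trade-off can be seen in a small Python sketch of a US-style phone pattern. The pattern is a hypothetical example, not the project's regex. It accepts a few common separators and, by construction, does not match an internationally formatted number.

```python
import re
from typing import List

# Hypothetical pattern: (555) 123-4567, 555-123-4567, 555.123.4567, 555 123 4567.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
)


def find_phones(text: str) -> List[str]:
    """Return all substrings that look like 10-digit US phone numbers."""
    return PHONE_RE.findall(text)


if __name__ == "__main__":
    print(find_phones("Call (555) 123-4567 or 555.123.4567"))
    print(find_phones("UK office: +44 20 7946 0958"))  # not matched
```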

Credit Card

F1: 0.333 (Limited)
4532-1488-0343-6467

Validates candidates with the Luhn algorithm, but the test data contains few card numbers
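
For reference, the Luhn checksum can be sketched in Python as follows. The function name `luhn_valid` and the 13-19 digit length bound are assumptions. The example uses the widely published test number 4111 1111 1111 1111 rather than any real card.

```python
def luhn_valid(number: str) -> bool:
    """Check a card-like string against the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    # Typical payment card lengths; treat anything else as not a card.
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))  # True
    print(luhn_valid("4111 1111 1111 1112"))  # False
```

Running the checksum after the regex match filters out 16-digit strings that merely resemble card numbers, such as order IDs.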

How Well Does It Work?

Tested on: 16 documents with 192 PII instances

Overall Performance

84%
Precision

Of every 100 items flagged, about 84 are real PII

88%
Recall (Coverage)

Finds about 88 of every 100 PII instances

100%
Perfect Types

2 types (Email, SSN) detected perfectly

Results by PII Type

PII Type   F1     Rating
Email      100%   Perfect
Address    92%    Excellent
SSN        100%   Perfect
Name       72%    Good
Phone      68%    Good

Key Takeaway: Emails and SSNs are detected perfectly (100%). Addresses work excellently (92%). Names and phones work well (68-72%).

How Much Better Are We?

Baseline = Simple patterns (basic regex). Our approach = Enhanced patterns with careful design.

Average Improvement: +84%

Our system is nearly twice as good as simple baseline patterns

Side-by-Side Comparison

PII Type   Baseline (Simple)     Our System (Enhanced)   Change
Email      37%                   100%                    +172% better
Phone      34%                   68%                     +100% better
Address    0% (cannot detect)    92%                     New capability

Bottom Line: Simple patterns miss most PII. Our enhanced approach finds much more and adds new capabilities (like address detection).
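
The "% better" figures are relative improvements over the baseline score, (enhanced - baseline) / baseline. The Python sketch below shows the arithmetic. The function name is hypothetical, and a zero baseline is treated as a new capability rather than a percentage. Starting from the rounded 37% email baseline gives about +170%, so the +172% on the slide presumably comes from unrounded scores.

```python
from typing import Optional


def relative_improvement(baseline: float, enhanced: float) -> Optional[float]:
    """Percent improvement over the baseline; None when the baseline is zero."""
    if baseline == 0:
        return None  # undefined ratio: report as a new capability instead
    return (enhanced - baseline) / baseline * 100.0


if __name__ == "__main__":
    print(relative_improvement(0.34, 0.68))  # phone: doubles the baseline
    print(relative_improvement(0.37, 1.00))  # email, from rounded scores
    print(relative_improvement(0.00, 0.92))  # address: None
```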

What We Learned

Four key insights from our evaluation:

1. Regex Works for Structured PII

Finding: Perfect 100% F1-scores for emails and SSNs, and 92% for addresses.

Why it matters: You don't always need machine learning. For data with consistent formats, simple regex is both effective and transparent.

2. Precision-Recall Trade-offs

Finding: Name detection reaches 94.1% recall (catches almost all names) but only 58.7% precision (some false alarms).

Why it matters: For privacy tools, it's better to alert on potential names (even if some are false) than miss real ones. This is an intentional design choice.

3. Client-Side Processing Works

Finding: Everything runs in the browser, no server needed.

Why it matters: Users' documents never leave their device. Privacy is preserved, and there are no infrastructure costs. Processing is fast (under 500ms).

4. Enhanced Patterns Matter

Finding: +84.3% improvement over simple baseline patterns.

Why it matters: Our careful pattern design and false positive filtering make a huge difference. You can't just use naive regex and expect good results.

Limitations & Future Work

Current Limitations

  • Name Precision (58.7%): False positives from capitalized common words.
  • Phone Coverage (74.2%): Misses some international formats and extensions.
  • Google Docs: CORS restrictions and authentication issues (60-80% success).

Future Directions

  • Context-Aware Filtering: Using NLP to improve name detection.
  • Browser Extension: For auto-scanning and better Google Docs access.
  • Platform Expansion: Support for Notion, Dropbox Paper, and Sheets.

Live Demo

Real-time Detection • Risk Assessment • Security Tips • 100% Client-Side

Conclusion

Systematic PII detection in shared documents is feasible, effective, and privacy-preserving.

84.3% Precision
100% F1 (Email, SSN)
+84% Improvement