OpenPII Watcher

Systematic Detection of Exposed PII in Publicly Shared Documents

84.3% Precision

100% F1 on Email and SSN

+84% vs Baseline

Suryakiran Valavala • Arsh Advani • Vijay Arvind Ramamoorthy

UC Santa Cruz • CSE 253: Network Security

The Problem

Exposure Risks

Google Docs with "anyone with link" permissions
Pastebin posts shared publicly by users
Contact lists and signup sheets left open
Sensitive data accidentally exposed in dumps

Security Impacts

Phishing attacks targeting exposed emails
Social engineering using names and phones
Identity theft from SSN and financial data
Comprehensive profiling from aggregated PII

Users lack tools to check what PII their documents expose before sharing.

Research Question

Can we systematically detect PII exposed through public sharing links?

Accuracy

Using transparent, lightweight pattern matching

Coverage

Emails, Phones, Names, Addresses, SSN

Privacy

100% Client-side processing for safety

Our Solution

Enhanced Detection

Improved regex patterns with false positive filtering

Multi-Platform

Pastebin (95%+ fetch success) and Google Docs (60-80% fetch success)

Privacy-First

100% client-side processing, no server needed

Rigorous Evaluation

16 documents, 192 PII instances, ground truth

Working Demo

Deployed web application on GitHub Pages

Open Source

Complete implementation and evaluation code

System Architecture

Data Layer

URL Parsing • Platform Detection • Content Fetching

↓

Detection Layer

Regex Pattern Matching • False Positive Filtering

↓

Output Layer

Results Aggregation • Risk Assessment • Recommendations
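The three layers above can be sketched as a small pipeline. This is an illustrative Python sketch, not the project's code (the deployed tool runs in the browser): the function names, the two toy patterns, and the risk rule are all assumptions made for the example.

```python
import re
from urllib.parse import urlparse

# Toy patterns for illustration only (assumed, not the project's patterns).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_platform(url):
    """Data layer: classify a share link by its hostname."""
    parsed = urlparse(url if "://" in url else "https://" + url)
    host = (parsed.hostname or "").lower()
    if host == "pastebin.com" or host.endswith(".pastebin.com"):
        return "pastebin"
    if host == "docs.google.com":
        return "google_docs"
    return "unknown"

def detect(text):
    """Detection layer: run every pattern over the fetched text."""
    return {kind: rx.findall(text) for kind, rx in PATTERNS.items()}

def summarize(findings):
    """Output layer: per-type counts plus a coarse risk label (hypothetical rule)."""
    counts = {kind: len(hits) for kind, hits in findings.items() if hits}
    if counts.get("ssn"):
        risk = "high"
    elif counts:
        risk = "medium"
    else:
        risk = "low"
    return {"counts": counts, "risk": risk}
```

Keeping the layers as separate functions means a new platform only touches the data layer and a new PII type only adds a pattern.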

How We Access Different Platforms

Different platforms require different approaches to fetch content. Here's how we handle each:

Pastebin

Simple approach: Pastebin provides a special URL format that allows direct access to raw text.

How it works: Transform pastebin.com/ABC123 → pastebin.com/raw/ABC123

Why it works: Pastebin's /raw/ endpoint allows cross-origin requests

Success rate: 95%+
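The URL transform above can be sketched in a few lines. This is a Python illustration under assumptions: the function name is hypothetical, and it assumes the paste ID is the only path segment.

```python
from urllib.parse import urlparse

def to_raw_pastebin_url(url):
    """Rewrite pastebin.com/<id> to pastebin.com/raw/<id>."""
    parsed = urlparse(url if "://" in url else "https://" + url)
    host = (parsed.hostname or "").lower()
    if host not in ("pastebin.com", "www.pastebin.com"):
        raise ValueError(f"not a Pastebin URL: {url}")
    parts = [p for p in parsed.path.split("/") if p]
    if parts and parts[0] == "raw":  # already a raw link
        parts = parts[1:]
    if len(parts) != 1:
        raise ValueError(f"cannot find a paste ID in: {url}")
    return f"https://pastebin.com/raw/{parts[0]}"
```

For example, `to_raw_pastebin_url("pastebin.com/ABC123")` returns `https://pastebin.com/raw/ABC123`.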

Google Docs

Challenge: Google Docs blocks direct access from browsers (CORS restrictions). We use a two-step approach:

Step 1 (Try Direct): Attempt direct access (works ~40% of the time)

Step 2 (Use Proxy): If blocked, use a proxy service (+20-40% success)

Combined success: 60-80%
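The two-step approach is a try-in-order fallback: direct access succeeds about 40% of the time and the proxy recovers a further 20-40%, which gives the combined 60-80%. A minimal Python sketch of that control flow follows. The fetchers are passed in as plain callables, so no specific proxy service or Google endpoint is assumed; all names here are hypothetical.

```python
from typing import Callable, Iterable, Optional

Fetcher = Callable[[str], str]  # takes a URL, returns the text or raises

def fetch_with_fallback(url: str, fetchers: Iterable[Fetcher]) -> Optional[str]:
    """Try each fetcher in order; return the first successful body, else None."""
    for fetch in fetchers:
        try:
            return fetch(url)
        except Exception:
            continue  # blocked or failed: fall through to the next strategy
    return None
```

A caller would pass something like `[fetch_direct, fetch_via_proxy]`, where both functions are stand-ins for whatever the real data layer uses.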

What We Detect and How Well

F1-Score: The harmonic mean of precision (how many flagged items are real PII) and recall (how many real PII items are found). 1.000 = perfect, 0.000 = no detection. Higher is better.
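As a worked check, F1 = 2PR / (P + R). The short Python sketch below (function name is illustrative) plugs in the name-detection figures reported later in this deck, 58.7% precision and 94.1% recall, and reproduces the 0.723 shown for names.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.587, 0.941), 3))  # names: 0.723
```

Because the harmonic mean is pulled toward the smaller value, high recall cannot fully compensate for low precision.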

Email

F1: 1.000 (Perfect!)

user+tag@domain.com

Consistent format makes detection reliable
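A minimal Python sketch of an email pattern that accepts plus-tags and subdomains. The pattern is an assumption written for illustration, not the project's actual regex, and it deliberately ignores rarer address forms such as quoted local parts.

```python
import re

# Assumed illustrative pattern: local part, "@", one or more domain labels, TLD.
EMAIL_RE = re.compile(
    r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}\b"
)

def find_emails(text):
    """Return every email-like substring in the text."""
    return EMAIL_RE.findall(text)
```

For example, `find_emails("Contact user+tag@domain.com today.")` returns `['user+tag@domain.com']`.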

Address

F1: 0.92 (Excellent)

123 Main Street

Street addresses follow predictable patterns, but some variations exist
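The "predictable pattern" can be sketched as house number, one to three capitalized words, then a street suffix. This Python pattern and its suffix list are assumptions for illustration; variations such as unit numbers or PO boxes fall outside it, which is the kind of miss the score above reflects.

```python
import re

# Assumed illustrative pattern, not the project's actual regex.
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+){1,3}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Drive|Dr|Lane|Ln|Court|Ct|Way)\b"
)

def find_addresses(text):
    """Return every street-address-like substring in the text."""
    return ADDRESS_RE.findall(text)
```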

SSN

F1: 1.000 (Perfect!)

123-45-6789

Conventionally written as XXX-XX-XXXX
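A Python sketch of a pattern for the dashed format. The extra checks (no 000, 666, or 9xx area; no 00 group; no 0000 serial) follow the published rules for SSN assignment and are an assumption about what "false positive filtering" could look like here, not a description of the project's regex.

```python
import re

# Assumed illustrative pattern: dashed SSN with never-issued ranges excluded.
SSN_RE = re.compile(
    r"(?<!\d)(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}(?!\d)"
)

def find_ssns(text):
    """Return every plausible dashed SSN in the text."""
    return SSN_RE.findall(text)
```

The lookarounds on both ends keep the pattern from firing inside longer digit runs.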

Name

F1: 0.723 (Good)

Dr. Mary-Anne O'Brien

Catches 94% of names, but some false positives (e.g., "Credit Card")
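The "Credit Card" false positive shows why a filter step is needed: any pattern built on capitalized words also matches capitalized phrases. The Python sketch below pairs a permissive name pattern with a small stoplist. Both the pattern and the stoplist entries are assumptions for illustration.

```python
import re

# One name word: optional O'/D' prefix, capitalized word, optional hyphenated parts.
WORD = r"(?:[A-Z]')?[A-Z][a-z]+(?:-[A-Z][a-z]+)*"
NAME_RE = re.compile(
    rf"\b(?:(?:Dr|Mr|Mrs|Ms|Prof)\.[ \t]+)?{WORD}(?:[ \t]+{WORD})+\b"
)

# Hypothetical stoplist of capitalized words that are not names.
NON_NAME_WORDS = {"Credit", "Card", "Social", "Security", "Phone", "Number"}

def extract_names(text, stoplist=NON_NAME_WORDS):
    """Return name candidates, dropping any that contain a stoplisted word."""
    names = []
    for match in NAME_RE.finditer(text):
        tokens = re.split(r"[ \t]+", match.group())
        if not any(token in stoplist for token in tokens):
            names.append(match.group())
    return names
```

On "Dr. Mary-Anne O'Brien paid by Credit Card." the raw pattern yields two candidates and the filter keeps only the first.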

Phone

F1: 0.681 (Good)

(555) 123-4567

Handles multiple formats, but misses international numbers
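A Python sketch of a pattern covering several common US formats. It is an assumed illustration, not the project's regex, and it shows the stated limitation directly: a number grouped the UK way does not match.

```python
import re

# Assumed illustrative pattern: (555) 123-4567, 555-123-4567, 555.123.4567.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
)

def find_phones(text):
    """Return every US-style phone number in the text."""
    return PHONE_RE.findall(text)
```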

Credit Card

F1: 0.333 (Limited)

4532-1488-0343-6467

Validates with Luhn algorithm, but limited test data
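The Luhn check doubles every second digit counting from the right, subtracts 9 from any result above 9, and requires the total to be divisible by 10. A minimal Python sketch follows; the function name and the 13-19 digit length bound are assumptions.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits of `number` pass the Luhn checksum."""
    stripped = number.replace(" ", "").replace("-", "")
    if not (stripped.isascii() and stripped.isdigit()):
        return False
    if not 13 <= len(stripped) <= 19:  # assumed card-length bound
        return False
    total = 0
    for i, ch in enumerate(reversed(stripped)):
        d = int(ch)
        if i % 2 == 1:  # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

The widely published test number 4111 1111 1111 1111 passes this check. Illustrative sample numbers on slides need not pass it, so the checksum is best tested against known-valid test numbers.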

How Well Does It Work?

Tested on: 16 documents with 192 PII instances

Overall Performance

84%

Accuracy (Precision)

Of every 100 items flagged, 84 are real PII

88%

Coverage (Recall)

Finds 88 of every 100 PII instances

100%

Perfect Types

2 types (Email, SSN) reach a perfect F1

Results by PII Type

Email

100% Perfect

Perfect!

Address

92%

Excellent

SSN

100% Perfect

Perfect!

Name

72%

Good

Phone

68%

Good

Key Takeaway: Emails and SSNs are detected perfectly (F1 of 100%). Addresses work excellently (92%). Names and phones work reasonably well (72% and 68%).

How Much Better Are We?

Baseline = Simple patterns (basic regex). Our approach = Enhanced patterns with careful design.

Average Improvement: +84%

Our system is nearly twice as good as simple baseline patterns

Side-by-Side Comparison

Email Detection

Baseline (Simple)

37%

Only finds 37 of 100

Our System (Enhanced)

100%

Finds all 100 of 100

+172% Better!

Phone Detection

Baseline (Simple)

34%

Only finds 34 of 100

Our System (Enhanced)

68%

Finds 68 of 100

+100% Better!

Address Detection

Baseline (Simple)

0%

Cannot detect at all

Our System (Enhanced)

92%

Finds 92 of 100

New Capability!

Bottom Line: Simple patterns miss most PII. Our enhanced approach finds much more and adds new capabilities (like address detection).
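The "+N% Better" figures are relative improvements: (enhanced - baseline) / baseline. A small Python sketch (function name is illustrative) reproduces the phone figure and shows why address detection is labeled a new capability instead of a percentage: with a baseline of 0 the ratio is undefined.

```python
from typing import Optional

def relative_improvement(baseline: float, enhanced: float) -> Optional[float]:
    """Percent gain over the baseline, or None when the baseline is zero."""
    if baseline == 0:
        return None  # no meaningful ratio: the capability is new
    return (enhanced - baseline) / baseline * 100

print(round(relative_improvement(0.34, 0.68)))  # phone: 100
print(relative_improvement(0.0, 0.92))          # address: None
```

With the rounded 37% email baseline the same formula gives about +170%; the +172% shown above presumably comes from the unrounded baseline score.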

What We Learned

Four key insights from our evaluation:

1. Regex Works for Structured PII

Finding: Perfect F1-scores (1.000) for emails and SSNs, and 0.92 for addresses.

Why it matters: You don't always need machine learning. For data with consistent formats, simple regex is both effective and transparent.

2. Precision-Recall Trade-offs

Finding: Names: 94.1% recall (catches almost all) but 58.7% precision (some false alarms).

Why it matters: For privacy tools, it's better to alert on potential names (even if some are false) than miss real ones. This is an intentional design choice.

3. Client-Side Processing Works

Finding: Everything runs in the browser, no server needed.

Why it matters: Users' documents never leave their device. Privacy is preserved, and there are no infrastructure costs. Processing is fast (under 500ms).

4. Enhanced Patterns Matter

Finding: +84.3% improvement over simple baseline patterns.

Why it matters: Our careful pattern design and false positive filtering make a huge difference. You can't just use naive regex and expect good results.

Limitations & Future Work

Current Limitations

Name Precision (58.7%): False positives from capitalized common words.
Phone Coverage (74.2%): Misses some international formats and extensions.
Google Docs: CORS restrictions and authentication issues (60-80% success).

Future Directions

Context-Aware Filtering: Using NLP to improve name detection.
Browser Extension: For auto-scanning and better Google Docs access.
Platform Expansion: Support for Notion, Dropbox Paper, and Sheets.

Live Demo

suryacs719.github.io/cse253-openPII-web

Real-time Detection • Risk Assessment • Security Tips • 100% Client-Side

Conclusion

Systematic PII detection in shared documents is feasible, effective, and privacy-preserving.

84.3% Precision

100% Structured F1

+84% Improvement

Resources

Web Tool

suryacs719.github.io/cse253-openPII-web

Source Code

github.com/SuryaCS719/cse253-openPII

Questions?