ENOSUCHBLOG

Programming, philosophy, pedaling.


Paste Scraping: Lessons (Not) Learned

Mar 3, 2015     Tags: programming, security    


For those who don’t know, pomf.se is an online service that provides convenient file uploads and online text file hosting. It rolls the services provided by puush and pastebin (among others) into one, making it a one-stop shop for rapidly sharing files of all sorts.

Unfortunately for them (and us), their paste URLs look something like this:

http://p.pomf.se/6380

and their download URLs look like this:

http://p.pomf.se/?dl=6380

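Being the curious soul that I am, I naturally tried substituting other numbers for 6380 and, unsurprisingly, a pattern emerged: IDs are assigned sequentially, so every ID from 1 up to the most recent one can simply be enumerated.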

Thanks to this pattern, scraping the entire site was a single command away:

$ for i in $(seq 1 6380); do wget "http://p.pomf.se/?dl=${i}" -O "${i}.txt"; done

Ten minutes and 6380 text files later, I have this in my temp folder:

[Image: The scrapes.]

$ ls -l *.txt | wc -l
6380 # 6380 text files

Of course, there are plenty of empty files among these, thanks to user-chosen expiration and deletion policies. Getting rid of these is also a single line away:

$ find . -type f -size 0 -delete
$ ls -l *.txt | wc -l
4273 # 4273 non-empty text files

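Then, there are password-protected files. Since scraping these returned an identical error HTML page for each, we can remove the duplicates with fdupes (the -f flag omits the first file in each set of matches from the listing, so one copy of each duplicated file is kept):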

$ fdupes -rf . | grep -v '^$' | xargs rm -f # we grep to filter out empty lines first
$ ls -l *.txt | wc -l
3251 # 3251 unique, non-empty text files
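
Since fdupes -f leaves the first file of every duplicate set in place, one copy of the error page survives that step. If the aim is to drop every copy, one option is to compare each file against a saved sample of the error page. This is only a minimal sketch, not what I ran for this post; the function name remove_copies_of and the sample file name in the usage line are made up for illustration.

```shell
# Hypothetical helper: delete every *.txt file in DIR whose bytes are
# identical to SAMPLE (for example, a saved copy of the error page).
# Usage: remove_copies_of SAMPLE DIR
remove_copies_of() {
    rc_ref=$(mktemp) || return 1
    # Work from a private copy so the loop still has a reference
    # even if SAMPLE itself lives in DIR and gets deleted.
    cp "$1" "$rc_ref" || { rm -f "$rc_ref"; return 1; }
    for rc_file in "$2"/*.txt; do
        [ -f "$rc_file" ] || continue
        # cmp -s is silent and exits 0 only when the files are identical.
        if cmp -s "$rc_ref" "$rc_file"; then
            rm -f -- "$rc_file"
        fi
    done
    rm -f "$rc_ref"
}
```

Something like `remove_copies_of error_sample.txt .` would then clear out every paste that is byte-for-byte the error page, assuming error_sample.txt is a copy you saved by hand.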

Now, we’re left with the meat of the scrape: three thousand, two hundred and fifty-one unique text files.

First, we’ll grep for passwords. Nobody would put real passwords on a public, unencrypted pastebin, right?

Wrong:

$ grep -ri 'password' . > ~/password_matches.txt
$ du -h ~/password_matches.txt
116K /home/william/password_matches.txt

Of course, plenty of these matches are garbage:

./[redacted].txt:# python2 find_friends.py $username $password < numbers.txt > results.txt
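
False positives like this one can be thinned out by requiring a separator and a value after the keyword. Here is a rough sketch, assuming credentials tend to be written as "password: value" or "password=value". The helper name credential_lines and the pattern itself are illustrative guesses, and the pattern will still miss some real credentials and match some junk.

```shell
# Hypothetical helper: print lines under DIR where "password" (any case)
# is followed by ":" or "=" and then a non-blank value.
# Usage: credential_lines DIR
credential_lines() {
    grep -riE 'password[[:space:]]*[:=][[:space:]]*[^[:space:]]+' "$1"
}
```

A shell variable such as $password followed by a redirect has no separator after it, so lines like the one above drop out.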

…and plenty aren’t:

./[redacted].txt:Twitter - User: [redacted] Password: [redacted]
./[redacted].txt:Twitter - User: [redacted] Password: [redacted]
./[redacted].txt:Twitter - User: [redacted] Password: [redacted]
./[redacted].txt:Twitter - User: [redacted] Password: [redacted]
./[redacted].txt:Twitter - User: [redacted] Password: [redacted]
./[redacted].txt:Twitter - User: [redacted] Password: [redacted]
./[redacted].txt:Twitter - User: [redacted] Password: [redacted]
./[redacted].txt:Twitter - User: [redacted] Password: [redacted]
...

Yes, those are real Twitter users:

[Image: Some poor guy's Twitter.]

In all likelihood, these accounts were stolen by a spammer or password trader. User incompetence aside, one would think that spammers and traders would take better precautions to protect their “business”.

There are plenty of other things we could grep for, as well: Social Security Numbers, formal names, email addresses, and so on. This particular site may not be the best source for those searches, but others certainly are.
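
As a sketch of what those other searches might look like, here are two small helpers. Both function names are hypothetical, and both patterns are deliberately loose approximations: the first matches anything email-shaped, the second anything shaped like NNN-NN-NNNN, whether or not it is a real Social Security Number.

```shell
# Hypothetical helper: list the unique email-shaped strings found under DIR.
# Usage: find_emails DIR
find_emails() {
    # -o prints only the matching part, -h drops the file name prefix.
    grep -rEoh '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' "$1" | sort -u
}

# Hypothetical helper: list files under DIR containing an NNN-NN-NNNN
# digit group that is not part of a longer run of digits.
# Usage: find_ssn_files DIR
find_ssn_files() {
    grep -rlE '(^|[^0-9])[0-9]{3}-[0-9]{2}-[0-9]{4}([^0-9]|$)' "$1" | sort
}
```

Listing file names rather than the matches themselves keeps the second helper's output short enough to review by hand.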

Summary

So, what have we learned?

We’ve learned that, with a few lines of shell, we can scrape an entire paste service in minutes, filter its output for uniqueness, and quickly search for private user information that should never have been posted online, much less on an unencrypted, public server.

Pomf.se isn’t the exception, either. There are dozens of public paste and upload services that weaken the privacy of their users by failing to properly distribute and/or randomize their URLs, much less provide encryption or expiry-by-default.
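
To make "randomize their URLs" a bit more concrete, here is a minimal sketch of generating an unguessable paste ID from the shell. It is not how any particular service does it; the function names, the 16-character default, and the 62-symbol alphabet are all my own assumptions.

```shell
# Hypothetical helper: print a random ID of N characters (default 16)
# drawn uniformly from [A-Za-z0-9], using the kernel's random source.
# Usage: random_paste_id [N]
random_paste_id() {
    LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "${1:-16}"
    echo
}

# Hypothetical helper: bits of entropy in a random ID of N characters
# over an alphabet of SIZE symbols, i.e. N * log2(SIZE).
# Usage: id_entropy_bits N SIZE
id_entropy_bits() {
    awk -v n="$1" -v size="$2" 'BEGIN { printf "%.1f\n", n * log(size) / log(2) }'
}
```

Running `id_entropy_bits 16 62` prints 95.3, meaning a scraper would have to guess among roughly 2^95 possible URLs instead of counting up from 1.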

If you, the reader, are running a public paste service, please do not do these things under any circumstances:

Please, please, do these things:

Finally, for the users out there, think twice before making that personal paste. Scrapers are ubiquitous and trivial to construct, and you cannot “hide in the crowd” from them. Always set a password, and an expiration if available. Above all else, consider switching to a pasting service that offers greater URL entropy for everything but innocuous public pastes.

- William

P.S.: This post isn’t meant to knock Pomf.se in particular. It’s actually a great service, one that I use regularly for (non-private) file sharing. My many thanks to Eric Johansson for providing it to the community.