I'm BaaaaAAAAaccck!
RE: I'm BaaaaAAAAaccck!
#9
Yeah, my bit is mostly as advertised, and my ripping script is open source.  There were lots of fiddly little issues, like Tapatalk intentionally obfuscating their API and pretending they were serving plain HTML files, but eventually I was able to hook into the standard phpBB API.  I think the domain or the path may have changed in the middle of the rip; that was fun.

Oh yeah, I had to parse the HTML inside an HTML comment, because they chose to remove post titles from the DOM by wrapping them in comment tags.  \o/ All sorts of data was buried under a mountain of divs and spans.  I could tell that this shop was freaking crazy with their dev process, because I would have failed pretty much everything they changed from the stock code in code review.  Other than that, it was just keeping track of users, and doing a simple randomizing process to get all the threads without anyone noticing.  I mean, it wouldn't look like organic traffic, but based on their web dev team I was pretty confident their sysadmin team wouldn't notice anything in the logs if the thread numbers weren't sequential.
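
For anyone curious what "parse the HTML inside an HTML comment" looks like, here is a minimal sketch using only Python's standard library. This is not the actual ripping script: the `h3` tag and the sample markup are assumptions for illustration, and the real pages were surely messier.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the raw contents of every <!-- ... --> comment in a page."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


class TagTextCollector(HTMLParser):
    """Collects the text found inside each occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if self.depth == 0:
                self.texts.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def titles_from_comments(page_html, tag="h3"):
    """Re-parse each comment body as HTML and pull out the text of `tag`.

    The tag name is a guess; the real title markup is not shown in this post.
    """
    outer = CommentCollector()
    outer.feed(page_html)
    outer.close()
    titles = []
    for comment in outer.comments:
        inner = TagTextCollector(tag)
        inner.feed(comment)
        inner.close()
        titles.extend(t.strip() for t in inner.texts if t.strip())
    return titles
```

The trick is just a second parse: the first pass treats the comment as opaque text, and the second pass treats that text as a tiny HTML document of its own.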

And then came the part where I was rewriting all of this HTML data back into BBCode -- specifically MyBB's flavor, because that's what Bob chose.  This script was much crappier than the download one, but text munging always looks messy.  All of this was to change HTML tags into their BBCode equivalents, but this is less clean than it looks too.  Basically all of the embeds needed to be rewritten -- and some of them were actually things where the forum user had just pasted a link, and Tapatalk automagically turned it into an embed to help with product placement.  So I had to un-Crapatalk those.  URLs needed to be upgraded to HTTPS while I was there.  And I had to fix the font sizes given as "15px%".
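
To give a flavor of that kind of munging, here is a rough sketch of a single regex-based pass. It is not the real conversion script: the tag mapping is a small made-up subset, the link pattern only handles the simplest `<a href="...">` form, and turning "15px%" into "15px" is my assumption about what the intended value was.

```python
import re

# Hypothetical subset of HTML tags and their BBCode equivalents.
SIMPLE_TAGS = {"b": "b", "strong": "b", "i": "i", "em": "i", "u": "u"}


def html_to_bbcode(html):
    """One pass of HTML-to-BBCode rewriting over a post body."""
    text = html
    # Paired inline tags without attributes, e.g. <b>...</b> -> [b]...[/b].
    for html_tag, bb_tag in SIMPLE_TAGS.items():
        text = re.sub(
            rf"<{html_tag}>(.*?)</{html_tag}>",
            rf"[{bb_tag}]\1[/{bb_tag}]",
            text,
            flags=re.S | re.I,
        )
    # Plain links: <a href="URL">label</a> -> [url=URL]label[/url].
    text = re.sub(
        r'<a href="([^"]+)">(.*?)</a>',
        r"[url=\1]\2[/url]",
        text,
        flags=re.S | re.I,
    )
    # Upgrade every remaining http:// URL to https://.
    text = re.sub(r"\bhttp://", "https://", text)
    # Repair sizes such as "15px%" by dropping the stray percent sign.
    text = re.sub(r"(\d+)px%", r"\1px", text)
    return text
```

Regexes are the wrong tool for HTML in general, which is a big part of why this sort of script ends up looking messy.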

It didn't do a depth-first parse, so you basically had to run the script several times until it stopped finding changes, then run it once more.  Yes, super classy.  But it's one of those things where you keep developing until your project is done, and there's no need to actually fix bugs once you have your data set.  Anyway, all this work was done in isolation so Bob could claim to have done most of the work, but we all know that programmers do more work than DBAs, right?
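
The "run it until it stops finding changes" routine is basically a fixed-point loop. Here is a small sketch of the idea; `inner_pass` is a toy stand-in I made up (it only converts `<b>` tags that contain no other tags), not a piece of the real script.

```python
import re


def inner_pass(text):
    """Toy single pass: converts only <b> elements with no tags inside them."""
    return re.sub(r"<b>([^<]*)</b>", r"[b]\1[/b]", text)


def run_until_stable(transform, text, max_passes=50):
    """Apply `transform` repeatedly until a pass changes nothing.

    Returns the final text and the number of passes run, counting the last
    pass that only confirmed there was nothing left to change.
    """
    for passes in range(1, max_passes + 1):
        new_text = transform(text)
        if new_text == text:
            return text, passes
        text = new_text
    raise RuntimeError(f"still changing after {max_passes} passes")
```

With nested input such as `<b>a<b>c</b></b>`, the first pass can only touch the innermost element, the second pass handles the outer one, and a third pass confirms nothing is left. The `max_passes` cap is there so that a pair of rules that keep undoing each other can't loop forever.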

Bob Wrote: My next db project is fixing a date/time shift that crept in when I moved the dates in UNIX era format -- it seems to have ignored the offset from UTC for my time zone and set everything a few hours off.
Haha no, that's totally my location that caused it; I just ignored time zones on input data.  I'm not sure exactly what happened, but it was either 3 or 7 hours off, which would be either my offset from Bob or my offset from GMT.  Isn't the fix something as simple as this?
UPDATE posts SET timestamp = DATE_ADD(timestamp, INTERVAL 3 HOUR) WHERE timestamp < @migration_date;
Actually, here are the MySQL date function docs.
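
To show how a shift like this sneaks in, here is a small sketch of converting a displayed post time to a Unix timestamp with an explicit UTC offset. The function name, the offsets, and the date format are assumptions for illustration (the format just mimics how dates are shown on this board); it is not the code that produced the bad data.

```python
from datetime import datetime, timedelta, timezone


def to_epoch(naive_text, utc_offset_hours, fmt="%m-%d-%Y, %I:%M %p"):
    """Interpret a displayed local time at a fixed UTC offset, return epoch seconds."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(naive_text, fmt).replace(tzinfo=tz)
    return int(local.timestamp())


# The same displayed time, read once as UTC and once as UTC-7 (hypothetical).
as_utc = to_epoch("04-19-2019, 09:48 PM", 0)
as_local = to_epoch("04-19-2019, 09:48 PM", -7)
shift_hours = (as_local - as_utc) // 3600
```

Reading the string as if it were UTC when it was really local time leaves every stored value off by exactly the offset, so a one-time constant correction is enough as long as the offset never changed over the affected range. Daylight saving time can break that assumption.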
"Kitto daijoubu da yo." - Sakura Kinomoto


Messages In This Thread
I'm BaaaaAAAAaccck! - by Wiregeek - 04-19-2019, 09:48 PM
RE: I'm BaaaaAAAAaccck! - by Bob Schroeck - 04-19-2019, 09:50 PM
RE: I'm BaaaaAAAAaccck! - by Black Aeronaut - 04-19-2019, 09:53 PM
RE: I'm BaaaaAAAAaccck! - by Wiregeek - 04-19-2019, 10:11 PM
RE: I'm BaaaaAAAAaccck! - by Black Aeronaut - 04-20-2019, 01:07 AM
RE: I'm BaaaaAAAAaccck! - by Bob Schroeck - 04-22-2019, 09:34 PM
RE: I'm BaaaaAAAAaccck! - by DHBirr - 04-20-2019, 05:30 AM
RE: I'm BaaaaAAAAaccck! - by Wiregeek - 04-23-2019, 12:21 AM
RE: I'm BaaaaAAAAaccck! - by Labster - 04-23-2019, 03:45 AM
RE: I'm BaaaaAAAAaccck! - by classicdrogn - 04-23-2019, 06:52 AM
RE: I'm BaaaaAAAAaccck! - by Bob Schroeck - 04-23-2019, 07:11 AM
RE: I'm BaaaaAAAAaccck! - by Labster - 04-24-2019, 02:38 AM
