Can you create a table that keeps track of the new thread ID and the old thread ID? That might make it easier to deduplicate.
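For what it's worth, the mapping table could look something like the sketch below. It assumes SQLite and integer thread IDs; the table name `thread_id_map`, the column names, and the helper functions are all made up for illustration and don't come from either forum's schema.

```python
import sqlite3


def create_thread_map(conn):
    # Hypothetical table: one row per migrated thread.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS thread_id_map (
            old_tid INTEGER PRIMARY KEY,
            new_tid INTEGER NOT NULL UNIQUE
        )
        """
    )


def record_mapping(conn, old_tid, new_tid):
    # INSERT OR IGNORE leaves an existing row alone, so re-importing
    # the same old thread does not create a second mapping.
    conn.execute(
        "INSERT OR IGNORE INTO thread_id_map (old_tid, new_tid) VALUES (?, ?)",
        (old_tid, new_tid),
    )


def lookup_new_tid(conn, old_tid):
    # Returns the new thread ID, or None if the thread was never migrated.
    row = conn.execute(
        "SELECT new_tid FROM thread_id_map WHERE old_tid = ?", (old_tid,)
    ).fetchone()
    return row[0] if row else None
```

Checking `lookup_new_tid` before importing a thread would be one way to skip threads that were already converted.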
To get updates, I'll wait until we're locked, then grab everything from the recent changes going back to the day I finished my crawl. It should be easy enough to limit that to new posts. As long as people aren't editing posts that are more than a day old, it should be fine.
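Roughly, the update pass would be a filter like the one below. This is only a sketch: I'm assuming each recent-changes entry has already been parsed into a dict with a `timestamp` (a `datetime`) and an `is_new` flag, and both of those field names are made up.

```python
from datetime import datetime, timezone


def posts_to_fetch(recent_changes, crawl_finished, new_only=True):
    # Keep only entries made on or after the crawl cutoff.
    # With new_only=True, edits to existing posts are skipped.
    selected = []
    for change in recent_changes:
        if change["timestamp"] < crawl_finished:
            continue
        if new_only and not change["is_new"]:
            continue
        selected.append(change)
    return selected
```

Passing `new_only=False` would also pick up edits, in case people do turn out to be editing older posts.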
I'm actually kind of surprised I found hidden threads, but I guess that's possible. They can't be all that hidden if they're displayed to logged-out users.
tid=37 has some translation failures in the first post. It looks like Tapatalk is routing Photobucket images through an image proxy, which might even be good for privacy. Still, it's not good to have in the post data. That should be an easy regex cleanup, though.
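The cleanup could be as small as the sketch below. Big caveat: I haven't written down the real proxy URL format here, so the host `imgproxy.example.invalid` and the `?url=` parameter are placeholders, and I'm assuming the original image URL is percent-encoded inside the proxy URL. The pattern would need to be adjusted to whatever actually shows up in the post data.

```python
import re
from urllib.parse import unquote

# PLACEHOLDER pattern: the host and the "?url=" parameter are assumptions,
# not the real proxy format. The capture group stops at whitespace, quotes,
# square brackets, and angle brackets so BBCode/HTML around it is untouched.
PROXY_RE = re.compile(r"https?://imgproxy\.example\.invalid/\?url=([^\s\"'\]\[<>]+)")


def unwrap_proxied_images(post_body):
    # Replace each proxied URL with the decoded original image URL.
    return PROXY_RE.sub(lambda m: unquote(m.group(1)), post_body)
```

Posts without a proxied URL pass through unchanged, so this would be safe to run over every post rather than just tid=37.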
Altogether, so many failure modes to cover in conversion. Four different forms of email protection, and obfuscation on top of that. I know the trouble Bob went through because I had similar issues with translation. Special cases for different products: very fun. -- ?×V