@morganni
I've used HTTracker, and yes, it will snag EVERYTHING connected to the site you are scraping for data.
And usually, with the default settings, all of the JavaScript code and the HTML pages will be saved, along with the "overscript" of the page layout and an offline version of the master index of the site, so you can use it as if you actually were online, as long as you don't try to access offsite links while you actually are offline.
That said, given enough time, it will snag everything: pages, subpages, and source code pages. Thankfully, this means any page still on the server yet hidden behind those hideous "we don't want to talk about it" curtains will be accessible, though it may require tweaking the page code manually and removing the offending filter page, since HTTracker rebuilds the code base offline as close as possible to the form it was in online.
Posts: 3,698
Threads: 95
Joined: May 2012
Reputation: 9
I'm not using HTTracker, nor wget, but my own custom code that I wrote last night. And yes, I'm getting all of the subpages. In fact, since they're indexed, they're going to be much easier to get than the Main namespace. Anyway, since there are 170,000+ pages on the wiki, it's going to take quite a while.
What I'm not getting: forums, reviews, edit histories, YKTTW, discussion pages, or the long-lost TroperTales and FetishFuel pages. It will be a one-time (across a few days) snapshot of the wiki content only. Is the lack of any of that a problem?
@Morganni. Wow, it is in fact worse than I thought at Wikia. Oh, Jimbo. I don't want Wikia getting a piece of this pie either. We have to make the pie higher.
-- ∇×V
@vorticity
Glad you know how to write your own code. My skills in that area are so laughable it's sad.
Also, as for Wikia, I took a closer look at them, and honestly, while they do have the advantage of letting you preview your edits, they lack most of the coding that made pages easy to navigate on TV Tropes. I also found out they already have a similar project in the pipeline to siphon off the index pages and reconstitute their own version of TV Tropes.
In short, don't worry about Wikia getting a piece of the pie, they just decided to procure their own pie mix and make a new one themselves.
Anyhoo, I've already saved quite a bit of what has already been burnt off the TV Tropes servers long before any of us had this discussion (in the form of HTML source code, most of which I reconstructed by hand to resemble their former selves), which I'd be happy to donate to your archive if you'd like.
As for what you're NOT copying, that's fine. The forums wouldn't really be applicable to a new site, edit histories would be pointless, YKTTW would be the same, as would discussion pages. As for TroperTales and FetishFuel, they'd be nice, but I wouldn't consider their absence to be a deal breaker.
Update: 122,409 pages downloaded. This includes most of the work namespaces by now, as well as things like Trivia. The pie is baking nicely.
Okay, I do have an ethical problem here. I'm not entirely sure what to do about the Tropers/ namespace. I think that making a distribution of it is A Bad Thing, even if it is technically CC-BY-SA content. People have an assumption that they're making the page for that site, and shouldn't have to manage anything else. So I'm left with two choices:
- Don't download the userpages. Easiest on me, and provides the most privacy for users. What little privacy that is. Rely on the Internet Archive to provide a backup (in HTML form only).
-- ∇×V
Posts: 1,382
Threads: 33
Joined: Sep 2007
Reputation: 0
I don't really see what the Troper pages have to do with saving the tropes themselves. I don't really frequent the place, but from what I understand it's by and large personal info on the tropers' backgrounds, interests and such. There's nothing there you need for research/info about literary devices.
---
The Master said: "It is all in vain! I have never yet seen a man who can perceive his own faults and bring the charge home against himself."
>Analects: Book V, Chapter XXVI
@vorticity
Yeah, I wouldn't worry about the Tropers namespace.
For one, they're mostly relevant to TV Tropes, hence would be useless for other sites, unless their owners are defecting to another site (and even then it would probably be better to just cook up new ones).
Also, those contain personal info, so not including them would be prudent in any regard IMO.
Anyhoo, just let me know when your pie is done, and I'll see what I can do about helping you spread the pie mix around.
Posts: 25,537
Threads: 2,060
Joined: Feb 2005
Reputation: 12
What they said. When it comes to personal data, "If you can't protect it, don't collect it."
If a particular contributor wants a Troper page on the new site, he or she can create one.
--
Rob Kelk
"Governments have no right to question the loyalty of those who oppose
them. Adversaries remain citizens of the same state, common subjects of
the same sovereign, servants of the same law."
- Michael Ignatieff, addressing Stanford University in 2012
Okay, thanks for your opinions everyone. I'm not going to download any of the Tropers/ namespace. That will make my download queue much shorter, too.
-- ∇×V
Breaking news.
It seems our plans to archive TV Tropes were wise, as more stuff, even things blessed by the P5, has now disappeared with no reason given by FE himself.
http://tvtropes.org/pmwik...026600&page=198#4936
Here's hoping our pie is ready to leave the oven soon.
Update: False alarm. Seems his cutlisting was too zealous, most of it has been brought back. Still, archiving seems wise if Google demands another purge.
P.S. - I probably can't seed a torrent very well, but I'd be willing to lend my mediafire account to hosting a DDL mirror of the archive
Posts: 8,933
Threads: 386
Joined: May 2006
Reputation: 3
How big is this collection going to be? I mean, in terms of disk space?
A long time ago, I named the partitions on my hard drives after the characters in Firefly. So when I started this project, I noticed that Zoe was an empty partition, and made a directory called 'trope' there. I got my own joke about 10 minutes later.
[serenity:/Volumes/Zoe/trope] vorticity% ls | grep Main. | wc -l (because ls Main.* dies with this much data)
65260 (pages in the Main namespace)
[serenity:/Volumes/Zoe/trope] vorticity% du -ch .
1.8G total
1.8 GB is an awful lot of text, but 7zip can compress this to about a quarter of the size, based on some testing. So in other words, no bigger than 10 times EPU's creative output.
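(For the curious: that quarter-size figure is easy to sanity-check with Python's standard-library LZMA, the same compression family 7-Zip defaults to. The sample text below is invented for illustration; this is a quick sketch, not a benchmark of the actual archive.)

```python
import lzma

def compression_ratio(data: bytes) -> float:
    """Return compressed size as a fraction of the original size."""
    compressed = lzma.compress(data, preset=9)
    return len(compressed) / len(data)

# Wiki markup is repetitive plain text, so it compresses very well.
sample = ("[[redirect:SomeTrope]]\n" * 50 +
          "* ''Series/SomeShow'': An example entry with WikiWords.\n" * 200).encode()
print(f"{compression_ratio(sample):.2%} of original size")
```

Real article text is less repetitive than this toy sample, so the ratio on the full 1.8 GB would land closer to the quarter-size estimate than to this.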
I'm still 8,100 pages short in the Main namespace, because it's not indexed. I'm not sure if I'm just missing zero-wick articles, or if the page count is listing articles that have been blanked. I'm running some last-ditch attempts to find them, but I assume that most of the remaining articles are either unwicked redirects or low-content pages.
Okay, now is the time that I start asking for donations of articles that have been deleted due to the content policy. I'm looking for things in wikicode format -- not HTML, as some have been suggesting. This is the smallest representation of the content, and would be the easiest to port to another wiki. If someone wants to mine old copies of say, this page, for links, I would greatly appreciate it. Remember, the wiki code can be gotten by clicking the "source" button on the page. You can send stuff to me at bslaabs@ [the google mail service].
Also, I'd like to know if anyone has a recommendation for BitTorrent trackers. This enterprise is 100% legal, for what that's worth. Once you get a copy, Rpg1, you can upload it to your DDL service.
-- ∇×V
@vorticity
I assume that 1.8 GB is just HTML source files only, right? Not that I doubt you know what you're doing, but you might have duplicates of some stuff (some pages have the same material but are mere redirects/snowclones with different titles), and pruning any of those might make your archiving easier.
Anyway, let me know when you have the torrent up, and I'll get right on it.
Note: It seems the Rape AND Porn tropes have been purged from TV Tropes (you can't even access the source for the Rape Tropes anymore), as well as a few other things.
I've also been backing up all the stuff the P5 has been deleting, so I'll add that in with the above-mentioned copies (I saved the latest version of all the Rape Tropes before the source pages went bye-bye). Once I make sure I've got all of that bundled together for you, I'll make a quick mediafire or sendspace link and post it either here (if Bob allows), or I'll send you the link via email.
(Everything I have is in HTML source only)
One question: how quickly do you need this? I can fast-track it if need be, but I'm still hunting down the latest version of a few pages and scouring TVT for the few things not purged from the server (thank goodness for the namespace splits that left fully intact copies of some things in the Main category or vice versa), as well as the Wayback Machine.
Note 2: Given how DDL services are twitchy these days, anyone else willing to use their own DDL account with mediafire/sendspace/depositfiles, etc. to make another backup like I am? What we're doing is legal, so I don't think we'll get our accounts thumped, but having several places to get this stuff (via torrent and DDL) would probably be wise.
Redirects are saved as [[redirect:TropePage]] only.
Don't worry about getting things done too fast. This is going to take a week or so to get organized, still.
-- ∇×V
@vorticity
Good to know I have a little time.
I have all the Rape Tropes and Porn Tropes (thankfully, they shared overlap on a lot of pages), and I have the latest version of both, either from the IWM or before they were hosed from TVT's servers.
I also have (and am still collecting) all the stuff they are trying to kill on the CVR, including the original versions of pages kept but cleaned.
I also plan to hit up a few other dissident tropers (who have hoarded things they didn't want to see die) for some more, so in about a week I should have a nice present for you.
Posts: 1,452
Threads: 58
Joined: Apr 2006
Reputation: 0
Then you grab all the 'safe' pages, and pull it all together, right?
When you get it all together, I'd like a copy. Heck, I'd help host it.
My Unitarian Jihad Name is: Brother Atom Bomb of Courteous Debate. Get yours.
I've been writing a bit.
Posts: 27,583
Threads: 2,269
Joined: Sep 2002
Reputation: 21
Rpg1 Wrote:I've also been backing up all the stuff the P5 has been deleting, so I'll add that in with the above mentioned copies of the above (I saved the latest version of all the Rape Tropes before the source pages went bye bye), and once I make sure I get all of that bundled together for you, I'll make a quick mediafire or sendspace link and post it either here (if Bob allows), or I'll send you the link via email.
I don't think I'd have a problem with that.
-- Bob
---------
Then the horns kicked in...
...and my shoes began to squeak.
Psshht. Yeah. Ain't like Fast Eddie!Google Adsense wants them anymore. They wouldn't have a leg to stand on if they were to come after us.
Posts: 2,072
Threads: 62
Joined: May 2006
Reputation: 0
http://www.mediafire.com/?ct84r288j08gfa4
This has source for a lot of the pages that were initially cut. (Including a fair number that have returned, mind you.)
In other news, Martello has ragequit. I'm amused. Seems he's also hanging out with a group of people who want to make a new TV Tropes website that has even more stuff cut. So I guess I have to thank him for making the rest of us look good!
-Morgan.
"Cirno! Lend me your power!"
Thanks for that, Morganni. Looks like exactly the kind of thing I was looking for. I'll decide later whether to integrate it, or just leave it as an attachment.
Okay, I'm essentially done with my crawl of the wiki. I was intentionally vague about what I was doing, because I didn't know exactly who was watching the thread. Now that that's over, I can give you a better idea of what I downloaded. I have 180,080 pages in wiki code of the 188,536 that the article count lists. Some of the disparity is in namespaces like Sekrit/, which is completely locked down, but 7,518 of the missing articles are from the Main namespace. Main/ does not have a listing page, so I just had to get that by crawling the wiki. Any page which was intentionally blanked would be indistinguishable from a non-page, so that may be part of the missing data. The remainder is likely zero-wick pages that don't attract much attention.
I ended up crawling the wiki source code only, because this let me minimize the bandwidth used while saving the data in its most portable, compact form. I did have to write a lightweight parser to find the links, which wasn't as perfect as I would have liked. It ended up picking up line-noise gibberish -- probably parts of external links -- that managed to fit the syntax for WikiWords. I probably could have prevented that, but it took me four hours from inspiration to operational code, and it did the job well for that amount of time. If enough bad links built up, I just asked the TV Tropes parser to preview a page and tell me which links are redlinks.
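A minimal stand-in for that kind of WikiWord link scraper might look like the following Python sketch. This is my own guess at the pattern, not vorticity's actual code, and it shares the stated weakness: CamelCase runs inside external links will match unless URLs are filtered out first, which the crude substitution below attempts.

```python
import re

# Two or more capitalized runs (e.g. TropesOfLegend), optionally
# prefixed with a Namespace/ part (e.g. Main/HomePage).
WIKIWORD = re.compile(r"\b(?:(?:[A-Z][a-z0-9]+)+/)?(?:[A-Z][a-z0-9]+){2,}\b")

def find_links(wiki_source: str) -> set[str]:
    """Collect candidate WikiWord links, skipping anything inside a URL."""
    links = set()
    for line in wiki_source.splitlines():
        # Crude URL filter: drop http(s) tokens before matching.
        cleaned = re.sub(r"https?://\S+", " ", line)
        links.update(WIKIWORD.findall(cleaned))
    return links

src = "Compare TropesOfLegend and Main/HomePage, but not http://example.com/SomePage"
print(sorted(find_links(src)))  # -> ['Main/HomePage', 'TropesOfLegend']
```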
I seeded with the page HGamePOVCharacter, which I just happened to have sitting around because it looked like TRS was going to delete it. I probably should have started with IndexIndex, but it eventually got there. To finish off anything I may have lost, I looked through every wick to every single one of the TropesOfLegend. The upshot of crawling the source code is it avoids the messages about "We do not want a page on this topic" -- those display just fine in source code mode. It also keeps all of the redirects as separate pages, which are very useful for finding stuff.
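The seeding strategy described here amounts to a breadth-first traversal of the wiki's link graph. Stripped to its skeleton (with fetching and link extraction stubbed out, since the real crawler isn't shown), it might look like this:

```python
from collections import deque

def crawl(seed: str, get_links) -> set[str]:
    """Breadth-first traversal of a wiki's link graph from a seed page.

    `get_links(page)` should fetch a page's source and return the pages it
    links to; it's injected here so the traversal logic stands alone.
    """
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

# Tiny in-memory wiki standing in for real HTTP fetches.
toy_wiki = {"IndexIndex": ["TropesOfLegend", "HomePage"],
            "TropesOfLegend": ["HomePage"],
            "HomePage": []}
print(sorted(crawl("IndexIndex", lambda p: toy_wiki.get(p, []))))
```

Starting from a well-connected page like IndexIndex reaches more of the graph sooner, but as noted above, any seed eventually gets there if the graph is connected.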
This was quite a fun, complicated project. It's more data than I've dealt with since the last time I ran a weather model in grad school. The final size is 1.95GB on disk, though it should compress nicely. Alright, time to give the router a rest.
-- ∇×V
Cool. I'll go ahead and download both files as soon as I get my external drives set up. I may not be able to contribute much to this project, but at the very least I can keep untainted backup copies for you guys in case something weird happens and the data gets borked.
Is there anything that needs to be rescued via the Wayback Machine? I'm just curious.
-- Bob
---------
Then the horns kicked in...
...and my shoes began to squeak.
I have a secret to reveal.
I'm responsible for that mediafire link. In fact, I was really PO'ed about the whole page salt-and-delete process when it first started, so I decided to back everything up and preserve it, because I found the whole thing offensive for reasons I won't belabor (and I'm sure the text file in the archive will explain).
I will be updating it again soon (within a day or two at the latest); will that be soon enough?
Also, will you post the torrent link here when you're ready?
P.S. - I was archiving as things wound up on the CVR, so it may have things that were spared but cleaned; my versions are the non-Bowdlerised versions. Also, I think the "Keep Abreast Of This Index" can be rescued off the Wayback Machine; still looking for more.
Rpg1 Wrote:I have a secret to reveal. I'm responsible for that mediafire link.
It's not much of a secret if I can identify you by writing style. Also, you're the one here who doesn't know the difference between HTML and wiki code source. The good news is that everything in those files is in the right format, so keep doing what you've been doing.
Go ahead and get the pages from the Wayback Machine that are mentioned here: http://tvtropesmirror.wikia.com/wiki/Ca ... _TV_Tropes Also, these pages: TonariNoOneesan, AHeatForAllSeasons, Anime/Campus, CoedAffairs, OfficeAffairs, Gibo
If you're really feeling adventurous, I can send you a copy of the downloaded pages list by email, and you can grep it against this index on the Internet Archive. Be warned that one of the failure modes for pages is to produce a copy of the home page -- specifically when the page doesn't begin with a lowercase character. Since all of the Wayback Machine URLs are stored as lowercase, this makes interpreting the data a bit of a challenge.
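The lowercase quirk can be worked around by normalizing both lists before comparing. A Python sketch of the grep step, with invented page names and a hypothetical helper:

```python
def not_yet_downloaded(downloaded: list[str], wayback_urls: list[str]) -> set[str]:
    """Find Wayback-archived page URLs absent from the downloaded list.

    The Wayback Machine stores TV Tropes URLs lowercased, so compare
    case-insensitively against the downloaded page names.
    """
    have = {page.lower() for page in downloaded}
    return {url for url in wayback_urls
            if url.rsplit("/", 1)[-1].lower() not in have}

downloaded = ["TropesOfLegend"]
archived = [
    "http://tvtropes.org/pmwiki/pmwiki.php/main/tropesoflegend",
    "http://tvtropes.org/pmwiki/pmwiki.php/main/gibo",
]
print(not_yet_downloaded(downloaded, archived))
```

Note this only narrows the candidate list; each hit still has to be fetched and checked against the home-page failure mode described above.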
Huh, I'm starting to wonder if I can just upload the end result to the Internet Archive.
-- ∇×V
@vorticity
Yeah, I figured it was already an Open Secret, I just thought I'd confirm it.
Also, yes, I'm an idiot on HTML code and wiki code. This whole situation is what's been prompting me to learn about them.
BTW, I am feeling adventurous, so feel free to send me an email. I already have most of the pages in question, but a nice list to look at would help.
Also, some of these pages I had to reconstruct by hand (I know enough about HTML source formatting and TV Tropes/Wikia code format to do so for either site); is that acceptable? I even scoured histories where I couldn't find source pages and cut and pasted several pages back together that way.
I forgot to post the link to everything Archive.org has in the Main namespace: http://wayback.archive.org/web/*/http://tvtropes.org/pmwiki/pmwiki.php/Main/* "5,368 URLs have been captured for this domain." Ouch, that's not very many. Some of those are bad links or source or diffs, so they really don't have that much.
Reconstruct what you can; we're trying to preserve any information we can find. I'm going to attach your section as a separate Deleted Pages part of the archive.
-- ∇×V