Eza's Tumblr Scrape

Creates a new page showing just the images from any Tumblr

  1. // ==UserScript==
  2. // @name Eza's Tumblr Scrape
  3. // @namespace https://inkbunny.net/ezalias
  4. // @description Creates a new page showing just the images from any Tumblr
  5. // @license MIT
  6. // @license Public domain / No rights reserved
  7. // @include http://*?ezastumblrscrape*
  8. // @include https://*?ezastumblrscrape*
  9. // @include http://*/ezastumblrscrape*
  10. // @include http://*.tumblr.com/
  11. // @include https://*.tumblr.com/
  12. // @include http://*.tumblr.com/page/*
  13. // @include https://*.tumblr.com/page/*
  14. // @include http://*.tumblr.com/tagged/*
  15. // @include https://*.tumblr.com/tagged/*
  16. // @include http://*.tumblr.com/search/*
  17. // @include https://*.tumblr.com/search/*
  18. // @include http://*.tumblr.com/post/*
  19. // @include https://*.tumblr.com/post/*
  20. // @include https://*.media.tumblr.com/*
  21. // @include https://media.tumblr.com/*
  22. // @include http://*/archive
  23. // @include https://*/archive
  24. // @include http://*.co.vu/*
  25. // @exclude */photoset_iframe/*
  26. // @exclude *imageshack.us*
  27. // @exclude *imageshack.com*
  28. // @exclude *//scmplayer.*
  29. // @exclude *//wikplayer.*
  30. // @exclude *//www.wikplayer.*
  31. // @exclude *//www.tumblr.com/search*
  32. // @grant GM_registerMenuCommand
  33. // @version 5.17
  34. // ==/UserScript==
  35.  
  36.  
  37.  
  38. // Create an imaginary page on the relevant Tumblr domain, mostly to avoid the ridiculous same-origin policy for public HTML pages. Populate page with all images from that Tumblr. Add links to this page on normal pages within the blog.
  39.  
  40. // This script also works on off-site Tumblrs, by the way - just add /archive?ezastumblrscrape?scrapewholesite after the ".com" or whatever. Sorry it's not more concise.
  41.  
  42.  
  43.  
  44. // Make it work, make it fast, make it pretty - in that order.
  45.  
  46. // TODO:
  47. // I'll have to add filtering as some kind of text input... and could potentially do multi-tag filtering, if I can reliably identify posts and/or reliably match tag definitions to images and image sets.
  48. // This is a good feature for doing /scrapewholesite to get text links and then paging through them with fancy dynamic presentation nonsense. Also: duplicate elision.
  49. // I'd love to do some multi-scrape stuff, e.g. scraping both /tagged/homestuck and /tagged/art, but that requires communication between divs to avoid constant repetition.
  50. // post-level detection would also be great because it'd let me filter out reblogs. fuck all these people with 1000-page tumblrs, shitty animated gifs in their theme, infinite scrolling, and NO FUCKING TAGS. looking at you, http://neuroticnick.tumblr.com/post/16618331343/oh-gamzee#dnr - you prick.
  51. // Look into Tumblr Saviour to see how they handle and filter out text posts.
  52. // Add a convenient interface for changing options? "Change browsing options" to unhide a div that lists every ?key=value pair, with text-entry boxes or radio buttons as appropriate, and a button that pushes a new URL into the address bar and re-hides the div. Would need to be separate from thumbnail toggle so long as anything false is suppressed in get_url or whatever.
  53. // Dropdown menus? Thumbnails yes/no, Pages At Once 1-20. These change the options_map settings immediately, so next/prev links will use them. Link to Apply Changes uses same ?startpage as current.
  54. // Could I generalize that the way I've generalized Image Glutton? E.g., grab all links from a Pixiv gallery page, show all images and all manga pages.
  55. // Possibly @include any ?scrapeeverythingdammit to grab all links and embed all pictures found on them. single-jump recursive web mirroring. (fucking same-domain policy!)
  56. // now that I've got key-value mapping, add a link for 'view original posts only (experimental).' er, 'hide reblogs?' difficult to accurately convey.
  57. // make it an element of the post-scraping function. then it would also work on scrape-whole-tumblr.
  58. // better yet: call it separately, then use the post-scraping function on each post-level chunk of HTML. i.e. call scrape_without_reblogs from scrape_whole_tumblr, split off each post into strings, and call soft_scrape_page( single_post_string ) to get all the same images.
  59. // or would it be better to get all images from any post? doing this by-post means we aren't getting theme nonsense (mostly).
  60. // maybe just exclude images where a link to another tumblr happens before the next image... no, text posts could screw that up.
  61. // general post detection is about recognizing patterns. can we automate it heuristically? bear in mind it'd be done at least once per scrape-page, and possibly once per tumblr-page.
  62. // user b84485 seems to be using the scrape-whole-site option to open image links in tabs, and so is annoyed by the 500/1280 duplicates. maybe a 'remove duplicates' button after the whole site's done?
  63. // It's a legitimately good idea. Lord knows I prefer opening images in tabs under most circumstances.
  64. // Basically I want a "Browse Links" page instead of just "grab everything that isn't nailed down."
  65. // http://mekacrap.tumblr.com/post/82151443664/oh-my-looks-like-theres-some-pussy-under#dnr - lots of 'read more' stuff, for when that's implemented.
  66. // eza's tumblr scrape: "read more" might be tumblr standard.
  67. // e.g. <p class="read_more_container"><a href="http://ladylovelycocks.tumblr.com/post/66964089115/stupid-comic-continued-under-readmore-more" class="read_more">Read More</a></p>
  68. // http://c-enpai.tumblr.com/ - interesting content visible in /archive, but every page is 'themed' to be a blank front page. wtf.
  69. // chokes on multi-thousand-page tumblrs like actual-vriska, at least when listing all pages. it's just link-heavy text. maybe skip having a div for every page and just append to one div. or skip divs and append to the raw document innerHTML. it could be a memory thing, if ajax elements are never destroyed.
  70. // multi-thousand-page tumblrs make "find image links from all pages" choke. massive memory use, massive CPU load. ridiculous. it's just text. (alright, it's links and ajax requests, but it's doggedly linear.)
  71. // maybe skip individual divs and append the raw pile-of-links hypertext into one div. or skip divs entirely and append it straight to the document innerHTML.
  72. // could it be a memory leak thing? are ajax elements getting properly released and destroyed when their scope ends? kind of ridiculous either way, considering we're holding just a few kilobytes of text per page.
  73. // try re-using the same ajax object.
  74. /* Assorted notes from another text file
  75. . eza's tumblr fixiv? de-style everything by simply erasing the <style> block.
  76. . eza's tumblr scrape - test finishing whole page for displaying updates. (maybe only on ?scrapewholesite.) probably not too smart, but an interesting benchmark. only ever one document.body.innerHTML=thing;.
  77. . eza's tumblr scrape: definitely do everything-at-once page write for thumbnail/browse mode. return one list of urls per page, so e.g. ten separate lists. remove duplicates between all lists. then build the page and do a single html write. prior to that, write 'fetching pages...' or something. it should be pretty quick. it's not like it's terribly responsive when loading pages anwyay. scrolling doesn't work right.
  78. . thinking e.g. http://whatdoesitlumpingmean.tumblr.com/archive?startpage=1?pagesatonce=10?find=/tagged/my-art?ezastumblrscrape?thumbnails which has big blue dots on every post.
  79. . see also http://herblesbians.tumblr.com/ with its gigantic tall banners
  80. . alternate solution: check natural resolution, don't downscale tiny / narrow images.
  81. . eza's tumblr scrape: why doesn't the thumbnail page pick up mspadventures.com gifs? e.g. http://kitkaloid.tumblr.com/page/26, with tavros's face being 'dusted.'
  82. */
  83. // Soft_scrape_page should return non-image links from imgur, deviantart, etc., then collect them at the bottom of the div in thumbnail mode.
  84. // links to embedded videos? linkdump at top, just below the /page link. looks like https://www.tumblr.com/video_file/119027046245/tumblr_nlt061qtgG1u32sbu/480 - e.g. manyakis.tumblr.
  85. // Tumblr has a standard mobile version. Fuck me, how long has that been there? example.tumblr.com/mobile, no CSS, bare image links. Shit on the fuck.
  86. // Hey hey! This might allow trivial recognition of individual posts and reblogs vs. OC. Via, but no source... weak. Good enough for us, though.
  87. // Every post is between a <p> and </p>, but can contain <p></p> blocks inside. Messy.
  88. // Reblogs say "(via <a href='http//example.tumblr.com/123456'>example</a>)" and original posts don't.
  89. // Dates are noted in <h2> blocks, but they're outside any <p> blocks, so who cares.
  90. // Images are linked (all in _500, grr, but we can obviously deal with that) but posts aren't. Shame. That would've been useful.
  91. // Shit, consider the basics... do tags works? Pagination is just /mobile/page/2, etc. with tags: example.tumblr.com/tagged/homestuck/page/2/mobile.
  92. // Are photosets handled correctly? What about read-more links? Uuugh, photosets just appear as "[video]". Literally that text. No link. Fuck! So close, aaand useless.
  93. // I can use /mobile instead of /archive, but there's no point. It breaks favicons and I still have to fetch the fat-ass normal pages.
  94. // I can probably use mobile pages to match normal pages, since they... wait, are they guaranteed to have the same post count? yes. alice grove has one post per page.
  95. // So to find original posts, I have to fetch both normal and mobile pages, and... shit, and consistently separate posts on normal pages. It has to be identical.
  96. // I can also use mobile for page count, since it's guaranteed to have forward / backward links. Ha! We can start testing at 100!
  97. // Adding /mobile even works on individual posts. You can get a via link from any stupid theme. Yay.
  98. // Add "show via/source links" or just "visit mobile page" as a Greasemonkey action / script command?
  99. // Tumblr's own back/forward links are broken in the mobile theme. Goes to: /mobile/tagged/example. Should go to: /tagged/example/mobile. Modify those pages.
  100. // http://thegirlofthebeach.tumblr.com/archive - it's all still there, but the theme shows nothing but a 'fuck you, bye' message.
  101. // Pages still provide an og:image link. Unfortunately, that's a single image, even for photosets. Time to do some reasoning about photoset URLs and their constituent image URLs.
  102. // Oh yeah - mobile. That gives us a page count, at least, but then photosets don't provide even that first image.
  103. // Add a tag ?usemobile for using /mobile when scraping or browsing images.
  104. // To do when /archive works and provides photosets in addition to /mobile images: http://fotophest.tumblr.com
  105. // Archive does allow seeking by year/month, e.g. http://fotophest.tumblr.com/archive/2012/4
  106. // example.tumblr.com/page/1/mobile always points to example.tumblr.com. Have to do example.tumblr.com/mobile. Ugh.
  107. // Archival note: since Tumblr images basically never disappear, it SHOULD suffice to save the full scrape of a blog into a text file. I don't need to temporarily mass-download the whole image set, as I've been doing.
  108. // Add "tagged example" to title for ?find=/tagged/example, to make this easier.
  109. // Make a browser for these text files. Use the image-browser interface to display ten pages at once (by opening the file via a prompt, since file:// would kvetch about same-origin policy.) Maintain page links.
  110. // Filter duplicates globally, in this mode. No reason not to.
  111. // Use ?lastpage or ?end to indicate last detected page. (Indicate that it's approximate.) I keep checking ?scrapewholesite because I forget if I'm blowing through twenty pages or two hundred.
  112. // Given multiple ?key=value definitions on the same URL, the last one takes precedence. I can just tack on whatever multiple options I please. (The URL-generating function remains useful for cleanliness.)
  113. // Some Tumblrs (e.g. http://arinezez.tumblr.com/) have music players in frames. I mean, wow. Tumblr is dedicated to recreating every single design mistake Geocities allowed.
  114. // This wouldn't be a big deal, except it means the address-bar URL doesn't change when you change pages. That's a hassle.
  115. // Images with e.g. _r1_1280 are revisions? See http://actual-vriska.tumblr.com/post/32788651941/ vs. its source http://cancerousaquarium.tumblr.com/post/32784513645/ - an obvious watermark has been added.
  116. // Tested with a few random _r1 images from a large scrape's textfile. Some return XML errors ('no associated stylesheet') and others 404. Mildly interesting at best.
  117. // Aha - now that we have an 'end page,' there can be a drop down for page selection. Maybe also for pages-at-once? Pages 1-10, 11-20, 21-30, etc. Pages at once: 1, 5, 10, 20/25?
  118. // Possibly separate /post/ links, since they'll obviously be posts from that page. (Ehh, maybe not - I think e.g. Promstuck links to a "masterpost" in the sidebar.)
  119. // Maybe hide links behind a button instead of ignoring them entirely? That way compactness is largely irrelevant.
  120. // Stick 'em behind a button? Maybe ID and Class each link, so we can GetElementsByClassName and this.innerText = this.href.
  121. // If multi-split works, just include T1 / O1 links in-order with everything else. It beats scrolling up and guessing, even vs page themes that suck.
  122. // No multi-split. I'd have to split on one term, then foreach that array and split for another term, then combine all resulting arrays in-order.
  123. // Aha: there IS multi-splitting, using regexes as the delimiter. E.g. "Hello awesome, world!".split(/[\s,]+/); for splitting on spaces and commas.
  124. // Split e.g. src=|src="|src=', then split space/singlequote/doublequote and take first element? We don't care what the first terminator is; just terminate.
  125. // How do YouTube videos count? E.g. http://durma.tumblr.com/post/57768318100/ponponpon%E6%AD%8C%E3%81%A3%E3%81%A6%E3%81%BF%E3%81%9F-verdirk-by-etna102
  126. // Another example of off-site video: http://pizza-omelette.tumblr.com/post/44128050736/2hr-watch-this-its-very-important-i-still
  127. // Some themes have EVERY post "under the cut," e.g. http://durma.tumblr.com/. Photosets show up. Replies to posts don't. ?usemobile should get some different stuff.
  128. // Brute-force method: ajax every single /post/ link. Thus - post separation, read-more, possibly other benefits. Obvious downside is massive latency increase.
  129. // Test on e.g. akitsunsfw.tumblr.com with its many read-more links.
  130. // Probably able to identify new posts vs. reblogs, too. Worth pursuing. At the very least, I could consistently exclude posts whose mobile versions include via/source.
  131. // <!-- GOOGLE CAROUSEL --><script type="application/ld+json">{"@type":"ItemList","url":"http:\/\/crystalpei.com\/page\/252","itemListElement":[{"@type":"ListItem","position":1,"url":"http:\/\/crystalpei.com\/post\/30638270842\/actually-this-is-a-self-portrait"}],"@context":"http:\/\/schema.org"}</script>
  132. // <link rel="canonical" href="http://crystalpei.com/page/252" />
  133. // Does this matter? Seems useful, listing /post/ URLs for this page.
  134. // 10,000-page tumblrs are failing with ?showlinks enabled. the text gets doubled-up. is there a height limit for firefox rendering a page? does it just overflow? try a dummy function that does no ajax and just prints many links.
  135. // There IS a height limit! I printed 100 lines of dummy text for 10,000 pages, and at 'page' 8773, the 27th line is halfway offscreen. I cannot scroll past that.
  136. // Saving as text does save all 10,000 pages. So the text is there... Firefox just won't render it.
  137. // Reducing text size (ctrl-minus) resulted in -less- text being visible. Now it ends at 8729.
  138. // http://shegnyanny.tumblr.com/ redirects to https://www.tumblr.com/dashboard/blog/shegnyanny - from every page. But it's all still there! Fuck, that's aggravating!
  139. // DaveJaders needs that dashboard scrape treatment. Ditto lencrypted.
  140. // I need a 'repeatedly click 'show more notes' function' and maybe that should be part of this script.
  141. // 1000+ page examples: http://neuroticnick.tumblr.com/ - http://teufeldiabolos.co.vu/ - http://actual-vriska.tumblr.com/ - http://cullenfuckers.tumblr.com/ - http://soupery.tumblr.com - http://slangwang.tumblr.com - some with 10,000 pages or more.
  142. // JS 2015 has "Fetch" as standard? Say it's true, even if it's not!
  143. // The secret to getting anything done in-order seems to be creating a function (which you can apparently do anywhere, in any scope) and then calling that_function.then.
  144. // So e.g. function{ do_stuff( fetch(a).etc ) }, followed by do_stuff().then{ use results from stuff }.
  145. // Promise.all( iterable ), ahhhh.
  146. // Promise.map isn't real (or isn't standard?) but is just Promise.all( array.map( n => f(n) ). Works inside "then" too.
  147. // setTimeout( 0 ) is a meaningful concept because it pushes things onto the event queue instead of the stack, so they'll run in-order when the stack is clear.
  148. // Functions can be passed as arguments. Just send the name, and url_array=parse( body, look_for_videos ) allows parse(body,func) to do func( body ) and find videos.
  149. // Test whether updating all these divs is what keeps blocking and hammering the CPU. Could be thrashing? It's plainly not main RAM. Might be VRAM. Test acceleration.
  150. // Doing rudimentary benchmarking, the actual downloading of pages is stupid fast. Hundreds per minute. All the hold-up is in the processing.
  151. // Ooh yeah, updating divs hammers and blocks even when it's just 'pagenum - body.length.'
  152. // Does body.length happen all at once? JS in FF seems aggressively single-threaded.
  153. // Like, 'while( array.shift )' obviously matches a for loop with a shift in it, but the 'while' totally locks up the browser.
  154. // Maybe it's the split()? I could do that manually with a for loop... again. Test tall div updating with no concern for response content.
  155. // Speaking of benchmarks, forEach is faster than map. How in the hell? It's like a 25% speed gain for what should be a more linear process.
  156. // ... not that swapping map and forEach makes a damn bit of difference with the single-digit milliseconds involved here.
  157. // Since it gets worse as the page goes on, is it a getElementById thing?
  158. // Editing the DOM doesn't update an actual HTML file in memory (AFAIK), but you're scanning past 1000-odd elements instead of, like, 2.
  159. // Audio files? Similar to grabbing videos. - http://joeckuroda.tumblr.com/post/125455996815 -
  160. // Add a ?chrono option that works better than Tumblr's. Reverse url_array, count pages backwards, etc.
  161. // Maybe just ?reverse for per-page reversal of url_array. Lazy option.
  162. // http://crotah.tumblr.com desperately needs an original-posts filter.
  163. // I can't return a promise to "then." soft_scrape returns promise.all(etc).then(etc). You can soft_scrape(url).then, but not .then( s=>soft_scrape(s) ).then. Because fuck you.
  164. // I'm frothing with rage over this. This is a snap your keyboard in half, grab the machine by the throat, and scream 'this should work' problem.
  165. // I know you can return a promise in a 'then' because that's the fucking point of 'then.' It's what fetch's response.text() does. It's what fetch does!
  166. // Multi-tag scraping requires post-level detection, or at least, post enumeration instead of image enumeration.
  167. // Grab all /post/ links in /tagged/foo, then all /post/ links in /tagged/bar, then remove duplicates.
  168. // Sort by /post/12345 number? Or give preference to which page they appear on? The latter could get weird.
  169. // CrunchCaptain has no /post/ links. All posts are linked with tmblr.co shortened URLs. Tumblr, you never cease to disappoint me.
  170. // What do I want, exactly? What does innovation look like here?
  171. // Not waiting for page scrapes. Scrapewholesite, then store that, and compose output as needed.
  172. // Change window.location to match current settings, so they're persistent in the event of a reload. Fall back to on-demand page scraping if that happens.
  173. // Maybe a 100-page limit instead of 1000, at least for this ?newscrape or ?reduxmode business.
  174. // Centered UI, for clarity. Next and Previous links side-aligned, however clumsily. Touch-friendly because why not.
  175. // Dropdowns for options_map. Expose more choices.
  176. // Post-level splitting? Tags? Maybe Google Carousel support backed up by hacky fakery.
  177. // I want to be able to grab a post URL while I'm looking at the image in browse mode.
  178. // On-hover links for each image? Text directly below images? CSS nightmares, but presumably tractable.
  179. // What I -really- want - what this damn script was half-invented for - is eliminating reblogs. Filtering down to only original posts. (Or at least posts with added content.)
  180. // Remember that mobile mode needs special consideration. (Which could just be 'for each img' if mobile's all it handles.)
  181. // Maybe skip the height-vs-width thumbnail question by having user options a la Pixiv Fixiv. Just slap more options into the CSS and change the body's class.
  182. // GM menu commands for view-mobile, view-embed, and google-alternate-source?
  183. // I'm overthinking this. Soft scrape: grab each in-domain /post/ link, scrape it, display its contents. The duplicate-finder will handle the crud.
  184. // Can't reliably grab tags from post-by-post scraping because some people link their art/cosplay tags in their theme. Bluh.
  185. // http://iloveyournudity.tumblr.com/archive?ezastumblrscrape?scrapewholesite?find=/?grabrange=1001?lastpage=2592 stops at page 1594. rare, these days.
  186. // Could put the "Post" text over images, if CSS allows some flag for 'zero width.' Images will always(ish) be wider than the word "Post."
  187. // Nailed it, more or less, thanks to https://www.designlabthemes.com/text-over-image-with-css/
  188. // Wrapping on images vs. permalink, and the ability to select images sensibly, are highly dependent on the number and placement of spaces in that text block.
  189. // Permalink in bar across bottom of image, or in padded rectangle at bottom corner? I'm leaning toward bar.
  190. // ... I just realized this breaks 'open in new tabs.' You can't select a bunch of images and open them because you open the posts too.
  191. // Oh well. 'Open in new tabs' isn't browser-standard, and downthemall selection works fine.
  192. // InkBunny user Logitech says the button is kind of crap. It is. I think I said I'd fix it later a year ago.
  193. // Holy shit, Tumblr finally updated /mobile to link to posts. Does it include [video] posts?! ... no. Fuck you, Tumblr.
  194. // Y'know, if page_dupe_hash tracked how many times each /tagged/ URL was seen, it'd make tag tracking more or less free.
  195. // If I used the empty-page fetch to prime page_dupe_hash with negative numbers, I could count up to 0 and notice when a tag is legitimately part of a post.
  196. // Should page_number( x ) be a function? Return example.tumblr.com/guff/page/x, but also handle /mobile and so on.
  197. // Pages without any posts don't get any page breaks. That's almost a feature instead of a bug, but it does stem from an empty variable somewhere.
  198. // Debating making the 'experimental' post-by-post mode the main mode. It's objectively better in most ways... but should be better in all ways before becoming the standard.
  199. // For example: new_embedded_display doesn't handle non-1280 images. (They still exist, annoyingly.)
  200. // I'm loathe to ever switch more than once. But do I want two pairs of '10 at once' / '1 at once' links? Which ones get bolded? AABB or ABAB?
  201. // Maybe put the new links on the other side of the page. Alignment: right.
  202. // 'Browse pages' vs. 'Browse posts?' The fact it's -images- needs to prominent. 'Image viewer (pages)' vs. 'Image viewer (posts)?' '100 posts at once?' (Guesstimated.)
  203. // Maybe 'many posts at once' vs. 'few posts at once.' Ehhh.
  204. // The post browser could theoretically grab from /archive, but I'd need to know how to paginate in /archive.
  205. // The post browser could add right-aligned 'source' links in the permalink bar... and once we're detecting that, even badly, we can filter down to original posts!
  206. // Mobile mode is semi-helpful, offering a standard for post separation and "via" / "source" links. But: still missing photosets. Arg.
  207. // ... and naively you'd miss cases where this tumblr has replied and added artwork. Hmm.
  208. // Not sure if ?usemobile works with ?everypost. I think it use to, a couple days ago.
  209. // In "snap rows" mode, max-width:100% doesn't respect aspect ratio. Wide images get squished.
  210. // They're already in containers, so it's something like .container .fixed-height etc. Fix it next version.
  211. // The concept of 'as big as possible' is annoyingly complicated. I think I need the container to have a max-width... but to fit an image?
  212. // Container height fixed, container with max-width, image inside container with 100% container width and height=auto? Whitespace, but whatever.
  213. // Container max-height 240px, max-width 100%... then image width 100% of container... then container height to match image inside it? is min-height a thing?
  214. // Lazy approach to chrono: make a "reverse post order" button. Just re-sort all the numbered IDs.
  215. // To use the post-by-post scraper for scrapwholesite (or a novel scrapewholesite analog), I guess I'd only have to return 'true' from some Promise.all.
  216. // /Post URL, contents, /post URL, contents, repeat. Hide things with divs/spans and they should save to text only when shown.
  217. // Reverse tag-overview order. I hate redoing a coin-flip decision like that, but it's a serious usability shortcoming. I want to hit End and see the big tags.
  218. // GM command or button for searching for missing pages? E.g. google.com?q=tumblr+username-hey-heres-some-art if "there's nothing here."
  219. // http://cosmic-rumpus.tumblr.com/post/154833060516/give-it-up-for-canon-lesbans11-feel-free - theme doesn't show button - also check if this photoset appears in new mode
  220. // The Scrape button fucks up on www.example.tumblr.com, e.g. http://www.fuckyeahhomestuckkids.tumblr.com/tagged/dave_strider
  221. // Easy filter for page-by-page scraper: ?require and hide any images missing the required tag.
  222. // Same deal, ?exclude to e.g. avoid dupes. Go through /tagged/homestuck-art and then go through /tagged/hs-art with ?exclude=/tagged/homestuck-art.
  223. // Oops, Scrape button link shows up in tag overview. Filter for indexOf ?ezastumblrscrape. Also page links. Sloppy on my part.
  224. // Also opengraph, wtf? - http://falaxy.tumblr.com/archive?ezastumblrscrape?scrapewholesite?find=/tagged/Gabbi-draws - 15 http://falaxy.tumblr.com/tagged/Gabbi+draws?og=1 -
  225. // Son of a fuck over the Scrape button - http://ayanak.tumblr.com/ - 'install theme' garbage
  226. // Consider an isolated tag-overview mode for huge tumblrs. Grabbing pages is pretty quick. Processing and displaying them is dog-slow.
  227. // SCM Player nonsense breaks the Scrape button: http://homestuck-arts.tumblr.com/
  228. // In post-by-post interface: 'just this page' next to page links, e.g., 'Page 27 - Scrape just this page' - instead of 'one at once' / 'ten at once' links at top.
  229. // Tag overview / global duplicate finder - convert everything to lowercase. Tumblr doesn't care.
  230. // Redux scrapewholesite: format should be post link on one line, then image(s) on following lines, maybe with one or the other indented.
  231. // Tags included? This could almost get XML-ish.
  232. // Holy shit, speaking of XML - http://magpizza.tumblr.com/sitemap1.xml
  233. // It's not the first page. That 'breathing-breathless-heaving-breaths' post (9825320766) shows up on page... 647? the fuck?
  234. // Good lord, it might be by original post dates. Oh wow. It is. And each file is huge. 500 posts each!
  235. // I don't need to parse the XML because the only content inside each structure is a URL and a last-modified date code.
  236. // Missing files return a standard "not found" page, like you request /page ten billion. So it's safe.
  237. // Obviously this requires (and encourages!) a post-level scraper.
  238. // Even 1000-page Tumblrs will only have tens of these compact XML files, so it's reasonable to grab all of them, list URLs, and then fetch in reverse-chrono.
  239. // Oh, neverfuckingmind - /sitemap.xml points to all /sitemap#.xml files. Kickass.
  240. // Make sure this is standard before getting ahead of yourself. Tispy has it. 'Kinsie' has it, both HTTP. Catfish.co.vu has it.
  241. // HTTPS? PunGhostHere has it, but it's empty, obviously. "m-arci-a" has it. I think it's fully standard. Hot dang.
  242. // Does it work for tags? Like at all? Useful for ?scrapewholesite regardless.
  243. // http://tumblino.tumblr.com/tagged/dercest/sitemap1.xml - not it.
  244. // http://amigos-americas.tumblr.com/sitemap-pages.xml - is this also standard? side links, custom pages.
  245. // /sitemap.xml will list /sitemap-pages if it exists. Ignore it.
  246. // Might just need to make a single-page Ajax-y "app" viewer for this.
  247. // Get the "Scrape" link off of photoset iframes
  248. // Maybe make tag overview point directly to scrape links? Or just add scrape links.
  249. // Fuck about in the /embed source code, see if there's any kind of /standard post that it's displaying without a theme.
  250. // If that's all in the back-end then there's nothing to be done.
  251. // http://slangwang.tumblr.com/post/61921070822/so-scandalous/embed
  252. // https://embed.tumblr.com/embed/post/_tZgwiPCHYybfPeUR0XC9A/61921070822?width=542&language=en_US&did=d664f54e36ba2c3479c46f7b8c877002f3aa5791
  253. // Okay so the post ID is in that hot mess of a URL, but there's two high-entropy strings that are probably hidden crap.
  254. // Changing the post ID (to e.g. 155649716949, another post on the same blog) does work within the blog, at least for now.
  255. // Changing the post ID to posts from other blogs, or to made-up numbers, does not work. "This post is gone now" is the error. Nuts.
  256. // "&did" value doesn't matter. Might be my account's ID? The only script in /embed pages mostly registers page-views.
  257. // http://sometipsygnostalgic.tumblr.com/post/157273191096/anyway-i-think-that-iris-is-the-hardest-champion/embed
  258. // https://embed.tumblr.com/embed/post/4RwtewsxXp-k1ReCcdAgXg/157273191096?width=542&language=en_US&did=2098cf23085faa58c1b48d717a4a5072f591adb5
  259. // Nope, "&did" is different on this sometipsygnostalgic post.
  260. // http://sometipsygnostalgic.tumblr.com/post/157273080206/the-role-of-champions-in-the-games-is-confusing/embed
  261. // https://embed.tumblr.com/embed/post/4RwtewsxXp-k1ReCcdAgXg/157273080206?width=542&language=en_US&did=86e94fe484d6739f7a26f3660bb907f82d600120
  262. // the _tZg etc. value is probably Tumblr's internal reference to the user.
  263. // So I'd only have to get it once, and then I could generate these no-bullshit unthemed URLs based on /post IDs from sitemap.xml.
  264. // /embed source code refers to the user-unique ID as the "embed_key". It's not present in /archive, which would've been super convenient.
  265. // global.build.js has references to embed_key; dunno if I can just grab it.
  266. // It seems like this script should be placing the key in links - data-post-id, data-embed-key, data-embed-did, etc. Just not seeing it in /archive source.
  267. // Those data-etc values are in a <ul> (the fuck is a ul?) inside a div with classes like "popover".
  268. // One div class="tumblr-post" explicitly links the /embed/post version with the embed_key and post_id. Where is that?
  269. // post_micro_157274781274 in the link div class ID is just the /post number. What's post_tumblelog_4b5d01b9ddd34b704fef28e6c9d83796 about?
  270. // Screw it, it's in every /embed page. Just grab one. (Sadly it does require a real /post ID.)
  271. // Pass ?embed_id along in all links, otherwise modes that need it will have to rediscover it.
  272. // Don't fetch /archive or whatever. Grab some /post from the already-loaded innerHTML before removing it.
  273. // Wait, shit! I can't get embed.tumblr.com because it's on a different subdomain! Bastards!
  274. // Nevermind, anything.tumblr.com/embed works. Anything. That's weirdly accomodating.
  275. // Seems they broke it before I got around to implementing this. Nerts.
  276. // So: grab a post/12345/embed link, get the user id, redirect(?) to some ?embedded=xyz123 url.
  277. // embed_key&quot;:&quot;gLSpeeMF7v3RdBncMIHy2w is not hard to find.
  278. // The use of sitemap.xml could be transformative to this script, or even split off into another script entirely. It finally allows a standard theme!
  279. // Just do it. Entries look like:
  280. // <url><loc>http://leopha.tumblr.com/post/57593339717/hello-a-friend-found-my-blog-somehow-so-url</loc><lastmod>2013-08-07T06:42:58Z</lastmod></url></urlset>
  281. // So split on <url><loc> and terminate at </loc>. Or see if JS has native XML parsing, for honest object-orientation instead of banging at text files.
  282. // Grab normally for now, I guess, just to get it implemented. Make ?usemobile work. Worry about /embed stuff later.
  283. // Check if dashboard-only blogs have xml files exposed.
  284. // http://baklavalon.tumblr.com/post/16683271151/theyre-porking-in-that-room needs this treatment, its theme is just the word "poop".
  285. // Holy shit, /archive div classes include tags like not_mine, is_reblog, is_original, with_permalink, has_source - this is a fucking goldmine.
  286. // http://miyuli.tumblr.com/archive?startpage=16?pagesatonce=5?thumbnails?find=/tagged/fanart?ezastumblrscrape?everypost?lastpage=22 breaks on:
  287. // http://miyuli.tumblr.com/tagged/fanart/page/16
  288. // http://miyuli.tumblr.com/archive?startpage=16?pagesatonce=5?thumbnails?find=/tagged/fanart?ezastumblrscrape?lastpage=22 - old method works
  289. // Could just be dupes from nearby pages? Nope.
  290. // Same deal here - http://miyuli.tumblr.com/archive?startpage=36?pagesatonce=5?thumbnails?find=/tagged/original?ezastumblrscrape?everypost?lastpage=39 - breaks:
  291. // http://miyuli.tumblr.com/tagged/original/page/36
  292. // http://miyuli.tumblr.com/archive?startpage=36?pagesatonce=5?thumbnails?find=/tagged/original?ezastumblrscrape?lastpage=39 - works
  293. // Or... seems to be getting everything, but putting it under the wrong div. "Toggle page breaks" just breaks in the wrong places. Huh.
  294. // http://northernvehemence.tumblr.com/archive?startpage=21?pagesatonce=5?thumbnails?ezastumblrscrape?lastpage=42?everypost - everything's there, but misplaced
  295. // a-box.tumblr.com redirects on all pages, but there's still content there. Scrapes normally, though? Good case for embedded links instead of bare /post links.
  296. // Oddball feature idea: add ?source=[url]?via=[url] for tumblrs with zero hint of content in their URLs.
  297. // E.g. if https://milkybee.tumblr.com/post/31692413218 goes missing, I have no fucking idea what's in it. It is a mystery.
  298. // Easy answer, from ?everypost - add image URL after permalink.
  299. // Make ?ezastumblrscrape (no ?scrapewholesite) some kind of idiot-proof landing-page with links to various functions. As always, it's mostly for my sake, but it's good to have.
  300. // /page mode - "Previous/Next (pagesatonce) pages" and also have "Previous/Next 1 page".
  301. // http://sybillineturtle.tumblr.com/post/138070422744/crookedxhill-cosplay-crookedxhill - I need a way for eza's tumblr scrape to handle these bullshit dashboard-only blogs
  302. // Oh my god: studiomoh.com/fun/tumblr_originals/?tumblr=TUMBLRNAME returns original posts for TUMBLRNAME.tumblr.com.
  303. // http / https changes break all the ?key=pair values, but there's not much to be done about it.
  304. // If I used Tumblr the way normal people do, I'd probably want an image browser for the dashboard.
  305. // Is there any sort of sitemap for generic www.tumblr.com/tagged/example pages?
  306. // I do almost nothing on the dashboard, but a way to dump entire tags would be fucking incredible.
  307. // The tag overview is presumably what's causing the huge CPU spike when a long scrapewholesite ends. That or the little "Done" at the top.
  308. // Presumed solution: insert those things in whatever way the scraped pages are inserted. Generate divs at the start, I guess.
  309. // Failed to have any effect. Am I doing something I should be peppering with setTimeout(0)s? JS multitasking is dumb.
  310. // Can I do anything for the identifiability of pages with no text after the number?
  311. // E.g. woofkind-blog.tumblr.com/post/13147883338 += #tumblr_luzrb2F9Hv1qm6tzgo1_1280.jpg so there's something to go on if the blog dies.
  312. // Tag overview still picks up other tumblrs, which is okay-ish, but it borks the "X posts" links that go straight to scraping. Filter them.
  313. // Embed-based scraping could probably identify low-reblog content (low note count) for archival purposes.
  314. // Options_url() seems to add an extra question mark to the end of every URL. Doesn't break anything; just weird.
  315. // sarahfuarchive needs that /embed-based scrape. ?usemobile is a band-aid.
  316. // Okay, so the "Done." does generate a newline when it appears. Added a BR and crossed my fingers.
  317. // Is it just the tag overview? Is it just the huge foreach over page_dupe_hash?
  318. // Lag sometimes happens in the middle of a long scrape. If it's painting-related, maybe it's from long URLs forcing the screen to become wider.
  319. // Consider fetching groups of 10 instead of 25 when grabbing in the high thousands. Speed is not an issue, and Tumblr seems to lag more for 'deeper' pages.
  320. // Consider newline after bottom-div prev/next buttons, because popup url notifiers are fucking garbage. Put it back in my status bar, Firefox. That's what it's for.
  321. // "Tumblr Image Size" is no longer maintained (where'd I even get it?) and should maybe be folded into this script. Ajax instead of fetch because we'd only look for error?
  322. // Maybe as part of Image Glutton. Also include Twitter :large or :orig nonsense.
  323. // Pretty sure I can use that embed.tumblr view in ?scrapemode=everypost just by futzing the permalink URLs. Make it separate there, like ?usemobile.
  324. // Also change embedded tags to search THIS domain instead of linking to general Tumblr searches. What even.
  325. // czechadamick.tumblr.com as an example with no tags anywhere
  326. // I should really test for photosets that might not appear. Maybe some ?verbose mode where it links anything that looks like a URL.
  327. // Is there any way to resize inline images? They're so tiny!
  328. // http://static.tumblr.com/8781b0eeeb05c5d67462fb52fb349b45/zpdyuqm/LQ4ojqrq2/tumblr_static_bcae2zpugw000c0sog404804k.png - this one is huge!
  329. // https://66.media.tumblr.com/a38c7bd5148aded5494d4f8a70fce217/tumblr_inline_mlqzd2xLoZ1qz4rgp.jpg - this one is tiny. Not the same piece, obviously.
  330. // https://media.tumblr.com/0625918cc22c673b4fb6db33ee44188b/tumblr_inline_mm3w29L9xN1qz4rgp.jpg
  331. // https://static.tumblr.com/865f4c82ae6fc0a9dc0631a646475a86/ghtmrj5/8hCop2k79/tumblr_static_9usneomihhc0s0oc0w0s0sw8c_2048_v2.gif
  332. // The _2048_v2 is definitely sizing, because that image works without it.
  333. // http://studentsmut.tumblr.com/post/161643756857/anoneifanocs-mage0fheart-anoneifanocs-hey has an inline image that works with with _raw. It's https://media.tumblr.com/3a5ddf70b08a79c593fb05063ce011ed/tumblr_inline_orb5yhJuoV1tcvfhg_raw.png - and doesn't work with any other sizes? The hell?
  334. // http://68.media.tumblr.com/a04c862afb3930c1250e067c23a887fc/tumblr_inline_owo0nfrLnO1rrmi7y_1280.png
  335. // Oh! Duh! media.tumblr -> 66.media.tumblr! Nope, still breaks a lot.
  336. // ?everypost isn't getting page/1 of http://shojoheart.tumblr.com/archive?startpage=1?pagesatonce=5?thumbnails=fixed-width?ezastumblrscrape?scrapemode=pagebrowser
  337. // Not a general problem; this mode still works on other tumblrs. I think it's their theme. Yeesh.
  338. // Might be more general - e.g. fetching /page/1 redirects to /, so the page fetched might look empty. Arg. Ship it, it's not new.
  339. // http://grimdarkcake.tumblr.com tag overview doesn't work even though the pages have tags - ditto ticklishivories
  340. // Scrape link from /archive is broken; erase ?find if it reads '/archive'.
  341. // http://shittyhorsey.tumblr.com/ and http://venomous-sausage.tumblr.com/ fail before counting pages. WTF. Page counts use /mobile; there no way for themes to interfere.
  342. // "The Same Origin Policy disallows reading the remote resource at https://www.tumblr.com/safe-mode?url=http://shittyhorsey.tumblr.com/page/1000/mobile."
  343. // Okay, looking on the bright side, Tumblr has a "safe-mode" that might be easily abused. Look into that later. (Nope, just redirects. Poop.)
  344. // Raiseshipseerve is fucked now too. What bullshit is Tumblr pulling? Other pages still work, obviously, since I've been using this for days straight.
  345. // NRFW still works! Good, it's not totally fucky for https tumblrs.
  346. // fetch( 'https://www.tumblr.com/safe-mode?url=http://nrfw.tumblr.com/page/2/mobile' ).then(s=>console.log(s)) does work from www.tumblr.com. It's not all crazy.
  347. // -And- it's not a redirect (somehow), so... no it still returns an empty page. Fuck you, Tumblr.
  348. // Yeah, you can set site_and_tags to some safe-mode link, but it still can't count pages. Fuck.
  349. // embed.tumblr.com should still work, once I implement it. Even if you have to go to an /embed page to get the secret ID without fetches.
  350. // Back to roots, at least for my personal use, maybe a forced theme? Just get all images on the page right now and put them in standard format.
  351. // Oh thank god - Greasyfork user Petr Savelyev identified it as a cookies-related issue and gave an easy fix. fetch( url, { credentials: 'include' } ).
  352. // Trust but verify: 'including credentials' does not appear to allow anything skeezy.
  353. // No apparent way to set this for the whole script. Nuts.
  354. // So, uh. How much of Tumblr works with credentials? Can I get any subdomain from any other subdomain? No, apparently not. Unfortunate.
  355. // http://mediarama.tumblr.com/archive?startpage=1?pagesatonce=10?find=/tagged/cosplay?ezastumblrscrape?scrapemode=scrapewholesite?autostart is fucked somehow
  356. // Make tag overview on slightly-gay-pogohammer work.
  357. // Tag overview links choke on sites that end every tag with /chrono.
  358. // The /search method supports boolean operators in weird ways. Foo+bar returns posts with both words. Foo|bar returns different results... somehow. Fewer?
  359. // "Creates a new page showing just the images from any Tumblr" -> "Shows just the images from any Tumblr, using a new page?"
  360. // Memory use and slowdown on 1000-page scrapes: am I just using "var" where I should use "let?" Be more HTML5-y.
  361. // Is it a document.write thing? Just the fact I'm dropping in raw HTML? FF might have to re-parse the whole long-ass page.
  362. // _raw files also work on the data.tumblr subdomain.
  363. // e.g. http://data.tumblr.com/80882d3b1c6103e73b4de89761c804bd/tumblr_oe54v7tKhN1uctdepo1_raw.png
  364. // xekstrin.tumblr.com fails every couple pages? It's been long enough that I forgot what causes this. Video?
  365. // ?scrapewholesite should probably separate posts / links / images below each /page. Just a text indicator on a new line.
  366. // Oh, and indicate there's a tag overview at the bottom. "Scroll do the end for an overview of tags." Bottom, not end. Summary of tags. Anchor link? No, people will manage.
  367. // Add buttons for ?usesmall and ?notraw... and for neither. Genericize it into ?maxres=500 or something.
  368. // Enable this only after adding resize functionality? That might need to be optional. I don't know if GreaseMonkey has, like, script cookies. JSON crap, I guess.
  369. // Ooh, maybe make it a button on bare images. Little hovering button in the corner: 500, 1280, raw. (Still good to have a default.)
  370. // 1280 with a raw button is an acceptable compromise. Raw images can be stupid large.
  371. // Eza's Resize Automator. Eza's Image Embiggener. Eza's Bigass Imager. Eza's Image Magnifier. Eza's Squint Disabler. Eza's Size Queen. Eza's Resolution Glutton.
  372. // Once this is integrated, I should link ?usesmall to _raw and then scale down as needed. Assuming that works. Might fuck up more on XML / 404s than going up from JPG.
  373. // Ech, link _1280 at most. DownThemAll doesn't do JS redirects.
  374. // Generalize ?usesmall and ?notraw to ?maxres=400 etc.
  375. // Done. Now adjust those deprecated option names to maxres settings in the setup.
  376. // Add options to image modes, probably below immediate/persistent scaling options. "Maximum resolution: Raw, 1280, 500, 400, 250, 100." I think those are the right numbers. They're all persistent links because fuck rejiggering the images on the page.
  377. // Incidentally, _100 and _250 are not resized by the tumblr bare-image resizer I use. So I guess it's time to absorb that functionality. When that happens, add to the max-res options, "Opened links will display in higher resolution" or something like that. Clicked links? Clicked images? Maximum resolution? UI is hard.
  378. // Ah, half-right: "Tumblr Image Size 1.1" doesn't work in //media.tumblr.com URLs. It needs a CDN number, like //66.media.tumblr.com. Might absorb the function anyway.
  379. // Clean up the comments and dead code on this.
  380. // The new method doesn't support ?usemobile, but it kinda can't, since /mobile pages have no post links.
  381. // XML methods would work.
  382. // Finally implemented the XML sitemap mode - and it finds "[video]" image sets with ?usemobile!
  383. // Ironically I have no idea whether it finds actual videos.
  384. // renkasbending doesn't show tags in xml mode, even though they're visible on the page. What.
  385. // emiggax has no sitemap.xml! Whaaat? Ditto muteandthemew.
  386. // So... ?find=/archive does find posts and images. Can I use that?
  387. // Dashboard-only blogs - like https://www.tumblr.com/dashboard/blog/milky-morty - can still have the 'Scrape' link in the right place. (E.g., add /archive.) I can invoke functionality for them. It's just a right bitch to test anything, because the URL never changes and makes no difference. I'm not even sure I can request resources because all addresses resolve to the stupid dashboard thing.
  388. // Oh, or https://milky-morty.tumblr.com/archive shows up fine. That works too.
  389. // As does https://milky-morty.tumblr.com/sitemap.xml? Okay, so I might've accidentally solved this problem.
  390. // Nevermind. At some point that guy took his blog out of dashboard-only mode.
  391. // wwhatevven is dashboard-only. (And far less weird.) Nope, redirects everything to the /dashboard/blog crap. Even /archive and /sitemap.xml.
  392. // Can I check if posts are original? Doesn't seem to be an OpenGraph / 'og:' thing. /embed has is_reblog, but we're not using that... yet. I should.
  393. // /embed pages are desirable because they'll show their goddamn tags.
  394. // Hey dingdong. You can still get thistumblr.tumblr.com/post links from the /page files, under most circumstances.
  395. // Example /post with previous/next links: http://artisticships.tumblr.com/post/71222875632/spencerofspace-more-rose-cosplay-first-set-x
  396. // Dashboard-only blogs: i-just-want-to-die-plz / thedevilandhisfiddle.
  397. // Maybe always link to _1280 versions in thumbnail view? This would make ?maxres more useful, e.g. for DownThemAll.
  398. // Nontrivial, surprisingly. I use a standard function for standardizing images - so the link starts the same as the inline image.
  399. // Any /embed link for a wrong /post number delivers a standard 'not found' page, with the stupid animations.
  400. // _raw still exists in some form but is restricted. I'd need to find examples in the wild.
  401. // You know, I could still get /post numbers from /page... pages... so long as the blog works. This would allow per-post tags and image, but wouldn't fix theme-fucked blogs. It wouldn't work from /mobile, afaik. The main benefit would be 'under the cut' images.
  402. // Note count would be nice. Maybe once I get standardized links through /embed trickery.
  403. // media.tumblr.com just stopped working.
  404. // ... started working again two days later. Tumblr was in a sorry state in the meantime. For once, a change on their end was clearly a bug on their end.
  405. // Test grabbing 1000pgs at once.
  406. // http://rosedai.tumblr.com/archive?ezastumblrscrape?scrapewholesite?find= - NaN pages, wtf?
  407. // But http://rosedai.tumblr.com/archive?startpage=1?pagesatonce=10?ezastumblrscrape?scrapemode=scrapewholesite works. Wat.
  408. // https://nightcigale.tumblr.com/archive?ezastumblrscrape?scrapewholesite?find= fails too. wtf. ?find=/ doesn't help.
  409. // https://nightcigale.tumblr.com/archive?startpage=1?pagesatonce=10?ezastumblrscrape?scrapemode=scrapewholesite doesn't work either. Oh no. Did they break mobile?
  410. // No, https://nightcigale.tumblr.com/page/10/mobile still works. Thank fuck. HTTP redirects to HTTPS as well, but test if it's that. (Shouldn't be; many tumblrs do HTTPS.)
  411. // https://nightcigale.tumblr.com/page/100/mobile also works, for this 40-ish-page tumblr. Ditto https://nightcigale.tumblr.com/page/100000/mobile for the upper bound.
  412. // And now it works suddenly. Fuck me, I guess. Tumblr just randomly wants to not respond to /mobile pages in a way that breaks the bounds check.
  413. // Tumblr is deleting everything NSFW within two weeks. Fucking hell.
  414. // Quick and dirty post-by-post method for text? Maybe extend XML. E.g. http://pinewreaths.tumblr.com/post/164287584755/smutmas-2017-masterpost
  415. // It works, expose it.
  416. // Okay, post-by-post needs to work with tags. Jerry-rig some system that takes /page + /mobile and filters posts matching the current domain.
  417. // e.g. http://tat-buns.tumblr.com/archive?ezastumblrscrape?scrapewholesite?find= is too big
  418. // http://tat-buns.tumblr.com/post/104224472370/sparks-part-1-mabeldipper-gravity-falls - http://tat-buns.tumblr.com/tagged/tat%27s-fanfiction
  419. // https://mrdaxxonford.tumblr.com/post/139839107484/little-things
  420. // https://mrdaxxonford.tumblr.com/archive?startpage=1?pagesatonce=10?find=/tagged/my-stuff?ezastumblrscrape?scrapemode=xml?lastpage=9?sitemap=1?xmlcount=9?story?usemobile?tagscrape - managed to change the page colors, wtf. These are /mobile posts! Whatever, makes no difference to a plain text file.
  421. // Massive slowdown from /tagged/pinecest - possibly re-applying CSS from every single /post across 90 /pages?
  422. // Ah fuck, virulentmalapropisms does it too. For "links only!" What the fuck?
  423. // Tried reversion testing - it's not just tumblr fucking with me, it -is- something in my code. Uuugh.
  424. // Forgot to make "story" stuff conditional.
  425. // mrdaxxonford still has different style settings, what the fuck. Oh - because I'm doing ?story and ?tagscrape, so it's not /mobile. Maybe.
  426. // No: it's already grabbing /mobile. WTF.
  427. // It would be more correct to add a 'post-by-post but just for this tag' link after the page count has been found. That'd allow breaking out in 100-page chunks.
  428. // Images 'under the cut' don't work in /tagscrape. Because fuck me.
  429. // http://slightly-gay-pogohammer.tumblr.com/archive?startpage=1?pagesatonce=10?find=/tagged/nsfw-under-cut?ezastumblrscrape?scrapemode=xml?autostart?lastpage=3?sitemap=1?xmlcount=3?tagscrape
  430. // http://manicpeixesdreamgirl.tumblr.com/archive?startpage=1?pagesatonce=10?ezastumblrscrape?scrapemode=xml?lastpage=478?sitemap=5?story?usemobile?usemobile - breaks at 35% even with ?usemobile. I really need fetches to fail safely. Either fuck with .catch() (where does that even go, for a promise.then chain?) or make absolutely goddamn sure every URL begins with window.location.hostname. Oh, and sitemap=11 gets 0% done.
  431. // Sloppy fix: change every fetch to relative URLs.
  432. // Function for relativizing URLs: replace https:// and http:// both. Replacing window.location.domain or whatever, including ".com/". Add a slash to front of remaining URL.
  433. // Really fuckin' wish I'd figured out dashboard blogs now.
  434. // /amp is a thing, like /embed or /mobile. Quickly use it.
  435. // http://diediedie3344-deactivated-204913.tumblr.com/archive still works what the fuck. That would've saved so much extra shit.
  436. // https://unwrapping.tumblr.com/post/126390533972/handy-tumblr-urls - son of a bitch I would've loved to know this a year ago.
  437. // Archive, filter by... Photo — http://<site>.tumblr.com/archive/filter-by/photo
  438. // Posts (from your blog) — https://<www>.tumblr.com/blog/<site> - oh, just 'dashboard mode.' Or 'side mode' or whatever it's called.
  439. // Day (posts by date) — http://<site>.tumblr.com/day/<year>/<month>/<day> example: http://unwrapping.tumblr.com/day/2015/04/01
  440. // Tumblrs below 10 pages don't find pages correctly anymore.
  441. // Also, seeing a lot of "last page is between 9 and 100" or "99 and 1000." That first check is 1, 10, 100, etc. I didn't fucking ask for 99. What the hell?
  442. // Eats CPU cycles on the landing page. Not my fault, Tumblr just bites now. /archive is a fat bitch.
  443. // Alternate standard landing page? /amp, maybe. /mobile. ?test works, obviously. &test gets eaten.
  444. // /mobile works, but might not be 100.0% reliable. I think I remember some sites not having it? Might be thinking of sitemap.xml.
  445. // Whoops, no longer has "Archive" in the title, so textfile filenames would be different.
  446. // standard_landing_page should probably be a variable. I've already messed up replacing it in the several odd spots where it's relevant.
  447. // /archive/1999/1? Any year before Tumblr existed. "No posts yet."
  448. // Consider checking the /amp version of /post pages. Should be standardized. Might still show lingering NSFW content. Again: fuck this website.
  449. // If anyone from Tumblr is reading this, fuck you personally. I don't care what you do there or when you joined. Your company has destroyed art and culture on a grand scale. Millions of hours of artistic effort, the archives of deceased creators, and countless human interactions, gone. No excuse can matter.
  450. // Would <wbr> break text links? I could insert it all over long-ass strings, but I don't know if it screws up how e.g. DownThemAll interprets those strings. HREF is unchanged.
  451. // The world needs an Eza's Twitter Scrape. I just hope to god someone besides me has done it already.
  452. // Why in the name of god is there no way to decode entity references in standard fucking Javascript? Did they think nobody would want to?!
  453. // I should probably detect rate-limiting and keep retrying. Each page on a full-site scrape gets its own div, right?
  454. // Always fetching /post/12345/words-etc/amp is an option. /page/2/amp doesn't work. So reliable tags are not trivial. Dangit.
  455.  
  456. // Added first-page special case to /mobile handling, and there's some code duplication going on. Double-check what's going on for both scrapers.
  457. // Should offsite images read (Waiting for offsite image)? I'd like to know at a glance whether a Tumblr image failed or some shit from Minus didn't load.
  458. // "Programming is like this amazing puzzle game, where the puzzles are created by your own stupidity."
  459. // Maybe just... like... ?exclude=/tagged/onepiece... to do a full rip of /tagged/onepiece and load that into page_dupe_hash? Or ?include maybe? Logical and & or?
  460. // Or some kind of text-file combiner, not because that's convenient to anyone else, but because I've been saving a shitload of text-files.
  461.  
  462. // Re-editing this from whatever's on GreasyFork, in Linux. November 2020.
  463.  
  464. // Changes since last version:
  465. // In ?scrapemode=www, moved each /post's permalink discovery ahead of de-duplication.
  466. // Added navigation controls and timing information for www.tumblr.com/tagged mode.
  467. // Refactored onclick code for "immediate" image size changes.
  468. // Added navigation to specific dates for www.tumblr.com/tagged mode.
  469. // 2020 deep hibernation return:
  470. // Fixed the goddamn titles. Back to simplicity.
  471. // Included max sizes for new-format image URLs, but de-duplication needs more than standardization.
  472.  
  473.  
  474.  
  475.  
  476.  
  477.  
  478.  
  479. // ------------------------------------ Global variables ------------------------------------ //
  480.  
  481.  
  482.  
  483.  
  484.  
  485.  
  486.  
  487. var options_map = new Object(); // Associative array for ?key=value pairs in URL.
  488.  
  489. // Here's the URL options currently used. Scalars are at their default values; boolean flags are all set to false.
  490. options_map[ "startpage" ] = 1; // Page to start at when browsing images.
  491. options_map[ "pagesatonce" ] = 10; // How many Tumblr pages to browse images from at once.
  492. options_map[ "thumbnails" ] = false; // For browsing mode, 240px-wide images vs. full-size.
  493. options_map[ "find" ] = ""; // What goes after the Tumblr URL. E.g. /tagged/art or /chrono.
  494.  
  495. // Only two hard problems in computer science...
  496. var page_dupe_hash = {}; // Dump each scraped page into this, then test against each page - to remove inter-page duplicates (i.e., same shit on every page)
  497.  
  498. // Using a global variable for this (defined later) instead of repeatedly defining it or trying to pass within the page using black magic.
  499. var site_and_tags = ""; // To be filled with e.g. http: + // + example.tumblr.com + /tagged/sherlock once options_map.find is parsed from the URL
  500.  
  501. // Global list of post_id numbers that have been used, corresponding to span IDs based on them
  502. var posts_placed = new Array;
  503.  
  504. // Global list of posts to be scraped, for use by sitemap.xml-based scrape function.
  505. var post_urls = new Array;
  506.  
  507. // "Safe" landing page - Tumblr-standard, no theme, ideally few images.
  508. // This can't be /archive or else immediate resizing doesn't work on blogs. But /mobile doesn't work for the new www mode. /archive/amp is consistent but "Not found."
  509. // Blog CSS still won't work right with /archive/amp are you kidding me.
  510. var standard_landing_page = "/mobile/page/amp";
  511.  
  512.  
  513.  
  514.  
  515.  
  516.  
  517. // ------------------------------------ Script start, general setup ------------------------------------ //
  518.  
  519.  
  520.  
  521.  
  522.  
  523. // First, determine regime: Bare image? Scrape page? Normal page with a scrape link?
  524. if( window.location.href.indexOf( 'media.tumblr.com/' ) > -1 ) { maximize_image_size(); } // If it's an image, attempt to resize it
  525. // First, determine if we're loading many pages and listing/embedding them, or if we're just adding a convenient button to that functionality.
  526. else if( window.location.href.indexOf( 'ezastumblrscrape' ) > -1 ) { // If we're scraping pages:
  527. // Replace Tumblr-standard Archive page with our own custom nonsense
  528. let original_title = document.title;
  529. document.head.innerHTML = ""; // Delete CSS. We'll start with a blank page.
  530. // document.title = ''; // Clear title. Script adds stuff 'to the end,' as before. Fetch below eventually fixes the front.
  531. // Since we're avoiding /archive (due to content security policy interfering with image-resize gimmicks) we need to get the page title from elsewhere. (Oh, and favicon.)
  532. /*
  533. fetch( "/archive" ).then( r => r.text() ).then( text => { // This is now the wrong title for www stuff. Mmmeh.
  534. title = text.split( '<title>' )[1].split( '</title>' )[0];
  535. // title = text.split( '<title' )[1].split( '>' )[1].split( '</title' )[0]; // Attempted kludge
  536. document.title = window.location.hostname + " - " + htmlDecode( title ) + " " + document.title; // Beat the race condition by losing. (If we're first, we append a blank.)
  537. favicon = text.split( '<link' ).map( x => x.split( '>' )[0] ).filter( x => x.indexOf( 'shortcut icon' ) > 0 )[0];
  538. document.head.innerHTML += '<link ' + favicon + '>';
  539. // The favicon fails sometimes - even on blogs where it otherwise works - even on pages where reloading makes it work. I do not understand how.
  540. } )
  541. */
  542. // Fuck it:
  543. document.title = window.location.hostname + " - " + original_title.replace( /[\"\\\/]/g, '' ) + " " + document.title;
  544.  
  545. document.body.outerHTML = "<div id='maindiv'><div id='fetchdiv'></div></div><div id='bottom_controls_div'></div>"; // This is our page. Top stuff, content, bottom stuff.
  546. let css_block = "<style> "; // Let's break up the CSS for readability and easier editing, here.
  547. css_block += "body { background-color: #DDD; } "; // Light grey BG to make image boundaries more obvious
  548. css_block += "h1,h2,h3{ display: inline; } ";
  549. css_block += ".fixed-width img{ width: 240px; } ";
  550. css_block += ".fit-width img{ max-width: 100%; } ";
  551. css_block += ".fixed-height img{ height: 240px; max-width:100%; } "; // Max-width for weirdly common 1px-tall photoset footers. Ultrawide images get their own row.
  552. css_block += ".fit-height img{ max-height: 100%; } ";
  553. css_block += ".smart-fit img{ max-width: 100%; max-height: 100%; } "; // Might ditch the individual fit modes. This isn't Pixiv Fixiv. (No reason not to have them?)
  554. css_block += ".pagelink{ display:none; } ";
  555. css_block += ".showpagelinks .pagelink{ display:inline; } ";
  556. // Text blocks over images hacked together after https://www.designlabthemes.com/text-over-image-with-css/ - CSS is the devil
  557. css_block += ".container { position: relative; padding: 0; margin: 0; } "; // Holds an image and the on-hover permalink
  558. css_block += ".postlink { position: absolute; width: 100%; left: 0; bottom: 0; opacity: 0; background: rgba(255,255,255,0.7); } "; // Permalink across bottom
  559. css_block += ".container:hover .postlink { opacity: 1; } ";
  560. css_block += ".hidden { display: none; } ";
  561. css_block += "a.interactive{ color:green; } "; // For links that don't go anywhere, e.g. 'toggle image size' - signalling.
  562. css_block += "</style>";
  563. document.body.innerHTML += css_block; // Has to go in all at once or the browser "helpfully" closes the style tag upon evaluation
  564. var mydiv = document.getElementById( "maindiv" ); // I apologize for the generic names. This script used to be a lot simpler.
  565.  
  566. // Identify options in URL (in the form of ?key=value pairs)
  567. var key_value_array = window.location.href.split( '?' ); // Knowing how to do it the hard way is less impressive than knowing how not to do it the hard way.
  568. key_value_array.shift(); // The first element will be the site URL. Durrrr.
  569. for( dollarsign of key_value_array ) { // forEach( key_value_array ), including clumsy homage to $_
  570. var this_pair = dollarsign.split( '=' ); // Split key=value into [key,value] (or sometimes just [key])
  571. if( this_pair.length < 2 ) { this_pair.push( true ); } // If there's no value for this key, make its value boolean True
  572. if( this_pair[1] == "false " ) { this_pair[1] = false; } // If the value is the string "false" then make it False - note fun with 1-ordinal "length" and 0-ordinal array[element].
  573. else if( !isNaN( parseInt( this_pair[1] ) ) ) { this_pair[1] = parseInt( this_pair[1] ); } // If the value string looks like a number, make it a number
  574. options_map[ this_pair[0] ] = this_pair[1]; // options_map.key = value
  575. }
  576. if( options_map.find[ options_map.find.length - 1 ] == "/" ) { options_map.find = options_map.find.substring( 0, options_map.find.length - 1 ); } // Prevents .com//page/2
  577. // if( options_map.find.indexOf( '/chrono' ) > 0 ) { options_map.chrono = true; } else { options_map.chrono = false; } // False case, to avoid unexpected persistence? Hm.
  578.  
  579. // Convert old URL options to new key-value pairs
  580. if( options_map[ "scrapewholesite" ] ) { options_map.scrapemode = "scrapewholesite"; options_map.scrapewholesite = false; }
  581. if( options_map[ "everypost" ] ) { options_map.scrapemode = "everypost"; options_map.everypost = false; }
  582. if( options_map[ "thumbnails" ] == true ) { options_map.thumbnails = "fixed-width"; } // Replace the original valueless key with the default value
  583. if( options_map[ "notraw" ] ) { options_map.maxres = "1280"; options_map.notraw = false; }
  584. if( options_map[ "usesmall" ] ) { options_map.maxres = "400"; options_map.usesmall = false; }
  585.  
  586. document.body.className = options_map.thumbnails; // E.g. fixed-width, fixed-height, as matches the CSS. Persistent thumbnail options. Failsafe = original size.
  587. // Oh yeah, we have to do this -after- options_map.find is defined:
  588. site_and_tags = window.location.protocol + "//" + window.location.hostname + options_map.find; // e.g. http: + // + example.tumblr.com + /tagged/sherlock
  589.  
  590. // Grab an example page so that duplicate-removal hides whatever junk is on every single page
  591. // This remains buggy due to asynchronicity. It's a race condition where the results are, at worst, mildly annoying.
  592. // Previous notes mention jeffmacanolinsfw.tumblr.com for some reason.
  593. // Can just .then this function?
  594. // Can I create cookies? That'd work fine. On load, grab cookies for this site and for this script, use /page/1 crap.
  595. if( options_map.startpage != 1 && options_map.scrapemode != "scrapewholesite" ) // Not on first page, so on-every-page stuff appears somewhere
  596. { exclude_content_example( site_and_tags + '/page/1' ); } // Since this doesn't happen on page 1, let's use page 1. Low pages are faster somehow. .
  597.  
  598. // Add tags to title, for archival and identification purposes
  599. document.title += options_map.find.split('/').join(' '); // E.g. /tagged/example/chrono -> "tagged example chrono"
  600.  
  601. // In Chrome, /archive pages monkey-patch and overwrite Promise.all and Promise.resolve.
  602. // Clunky solution to clunky problem: grab the default property from a fresh iframe.
  603. // Big thanks to inu-no-policeman for the iframe-based solution. Prototypes were not helpful.
  604. var iframe = document.createElement( 'iframe' );
  605. document.body.appendChild( iframe );
  606. window['Promise'] = iframe.contentWindow['Promise'];
  607. document.body.removeChild( iframe );
  608.  
  609. mydiv.innerHTML = "Not all images are guaranteed to appear.<br>"; // Thanks to JS's wacky accomodating nature, mydiv is global despite appearing in an if-else block.
  610.  
  611. // Go to image browser or link scraper according to URL options.
  612. switch( options_map.scrapemode ) {
  613. case "scrapewholesite": scrape_whole_tumblr(); break;
  614. case "xml" : scrape_sitemap(); break;
  615. case "everypost": setTimeout( new_embedded_display, 500 ); break; // Slight delay increases odds of exclude_content_example actually fucking working
  616. case "www": scrape_www_tagged(); break; // www.tumblr.com/tagged, and eventually your dashboard, maybe.
  617. default: scrape_tumblr_pages(); // Sensible delays do not work on the original image browser. Shrug.
  618. }
  619.  
  620. } else { // If it's just a normal Tumblr page, add a link to the appropriate /ezastumblrscrape URL
  621.  
  622. // Add link(s) to the standard "+Follow / Dashboard" nonsense. Before +Follow, I think - to avoid messing with users' muscle memory.
  623. // This is currently beyond my ability to dick with JS through a script in a plugin. Let's kludge it for immediate usability.
  624. // kludge by Ivan - http://userscripts-mirror.org/scripts/review/65725.html
  625.  
  626. // Preserve /tagged/tag/chrono, etc. Also preserve http: vs https: via "location.protocol".
  627. var find = window.location.pathname;
  628. if( find.indexOf( "/page/chrono" ) <= 0 ) { // Basically checking for posts /tagged/page, thanks to Detective-Pony. Don't even ask.
  629. if( find.lastIndexOf( "/page/" ) >= 0 ) { find = find.substring( 0, find.lastIndexOf( "/page/" ) ); } // Don't include e.g. /page/2. We'll add that ourselves.
  630. if( find.lastIndexOf( "/post/" ) >= 0 ) { find = find.substring( 0, find.lastIndexOf( "/post" ) ); }
  631. if( find.lastIndexOf( "/archive" ) >= 0 ) { find = find.substring( 0, find.lastIndexOf( "/archive" ) ); }
  632. // On individual posts (and the /archive page), the link should scrape the whole site.
  633. }
  634. var url = window.location.protocol + "//" + window.location.hostname + standard_landing_page + "?ezastumblrscrape?scrapewholesite?find=" + find;
  635. if( window.location.host == "www.tumblr.com" ) { url += "?scrapemode=www?thumbnails=fixed-width"; url = url.replace( "?scrapewholesite", "" ); }
  636.  
  637. // "Don't clean this up. It's not permanent."
  638. // Fuck it, it works and it's fragile. Just boost its z-index so it stops getting covered.
  639. var scrape_button = document.createElement("a");
  640. scrape_button.setAttribute( "style", "position: absolute; top: 26px; right: 1px; padding: 2px 0 0; width: 50px; height: 18px; display: block; overflow: hidden; -moz-border-radius: 3px; background: #777; color: #fff; font-size: 8pt; text-decoration: none; font-weight: bold; text-align: center; line-height: 12pt; z-index: 1000; " );
  641. scrape_button.setAttribute("href", url);
  642. scrape_button.innerHTML = "Scrape";
  643. var body_ref = document.getElementsByTagName("body")[0];
  644. body_ref.appendChild(scrape_button);
  645. // Pages where the button gets split (i.e. clicking top half only redirects tiny corner iframe) are probably loading this script separately in the iframe.
  646. // Which means you'd need to redirect the window instead of just linking. Bluh.
  647.  
  648. // Greasemonkey supports user commands through its add-on menu! Thus: no more manually typing /archive?ezastumblrscrape?scrapewholesite on uncooperative blogs.
  649. GM_registerMenuCommand( "Scrape whole Tumblr blog", go_to_scrapewholesite );
  650. // If a page is missing (post deleted, blog deleted, name changed) a reblog can often be found based on the URL
  651. // Ugh, these should only appear on /post/ pages.
  652. // Naming these commands is hard. Both look for reblogs, but one is for if the post you're on is already a reblog.
  653. // Maybe "because this Tumblr changed names / moved / is missing" versus "because this reblog got deleted?" Newbs might not know either way.
  654. // "As though" this blog is missing / this post is missing?
  655. // "Search for this Tumblr under a different name" versus "search for other blogs that've reblogged this?"
  656. GM_registerMenuCommand( "Google for reblogs of this original Tumblr post", google_for_reblogs );
  657. GM_registerMenuCommand( "Google for other instances of this Tumblr reblog", google_for_reblogs_other ); // two hard problems
  658.  
  659. // if( window.location.href.indexOf( '?browse' ) > -1 ) { browse_this_page(); } // Experimental single-page scrape mode - DEFINITELY not guaranteed to stay
  660. }
  661.  
  662. function go_to_scrapewholesite() {
  663. // let redirect = window.location.protocol + "//" + window.location.hostname + "/archive?ezastumblrscrape?scrapewholesite?find=" + window.location.pathname;
  664. let redirect = window.location.protocol + "//" + window.location.hostname + standard_landing_page
  665. + "?ezastumblrscrape?scrapewholesite?find=" + window.location.pathname;
  666. window.location.href = redirect;
  667. }
  668.  
  669. function google_for_reblogs() {
  670. let blog_name = window.location.href.split('/')[2].split('.')[0]; // e.g. http//example.tumblr.com -> example
  671. let content = window.location.href.split('/').pop(); // e.g. http//example.tumblr.com/post/12345/hey-i-drew-this -> hey-i-drew-this
  672. let redirect = "https://google.com/search?q=tumblr " + blog_name + " " + content;
  673. window.location.href = redirect;
  674. }
  675.  
  676. function google_for_reblogs_other() {
  677. let content = window.location.href.split('/').pop().split('-'); // e.g. http//example.tumblr.com/post/12345/hey-i-drew-this -> hey,i,drew,this
  678. let blog_name = content.shift(); // e.g. examplename-hey-i-drew-this -> examplename
  679. content = content.join('-');
  680. let redirect = "https://google.com/search?q=tumblr " + blog_name + " " + content;
  681. window.location.href = redirect;
  682. }
  683.  
  684.  
  685.  
  686.  
  687.  
  688.  
  689.  
  690.  
  691.  
  692. // ------------------------------------ Whole-site scraper for use with DownThemAll ------------------------------------ //
  693.  
  694.  
  695.  
  696.  
  697.  
  698.  
  699.  
  700.  
  701.  
  702. // Monolithic scrape-whole-site function, recreating the original intent (before I added pages and made it a glorified multipage image browser)
  703. function scrape_whole_tumblr() {
  704. // console.log( page_dupe_hash );
  705. var highest_known_page = 0;
  706. options_map.startpage = 1; // Reset to default, because other values do goofy things to the image-browsing links below
  707.  
  708. // Link to image-viewing version, preserving current tags
  709. mydiv.innerHTML += "<br><h1><a id='browse10' href='" + options_url( {scrapemode:'pagebrowser', thumbnails:true} ) +"'>Browse images (10 pages at once)</a> </h1>";
  710. mydiv.innerHTML += "<h3><a id='browse5' href='" + options_url( {scrapemode:'pagebrowser', thumbnails:true, pagesatonce:5} ) + "'>(5 pages at once)</a> </h3>";
  711. mydiv.innerHTML += "<h3><a id='browse1' href='" + options_url( {scrapemode:'pagebrowser', thumbnails:true, pagesatonce:1} ) + "'>(1 page at once)</a></h3><br><br>";
  712.  
  713. mydiv.innerHTML += "<a id='exp10' href='" + options_url( {scrapemode:'everypost', thumbnails:true, pagesatonce:10} ) + "'>Experimental fetch-every-post image browser (10 pages at once)</a> ";
  714. mydiv.innerHTML += "<a id='exp5' href='" + options_url( {scrapemode:'everypost', thumbnails:true, pagesatonce: 5} ) + "'>(5 pages at once)</a> ";
  715. mydiv.innerHTML += "<a id='exp1' href='" + options_url( {scrapemode:'everypost', thumbnails:true, pagesatonce: 1} ) + "'>(1 page at once)</a><br><br>";
  716.  
  717. mydiv.innerHTML += "<a id='xml_all' href='" + options_url( {scrapemode:'xml' } ) + "'>Post-by-post images and text</a> <-- New option for saving stories<br><br>";
  718.  
  719. // Find out how many pages we need to scrape.
  720. if( isNaN( options_map.lastpage ) ) {
  721. // Find upper bound in a small number of fetches. Ideally we'd skip this - some themes list e.g. "Page 1 of 24." I think that requires back-end cooperation.
  722. mydiv.innerHTML += "Finding out how many pages are in <b>" + site_and_tags.substring( site_and_tags.indexOf( '/' ) + 2 ) + "</b>:<br><br>";
  723.  
  724. // Returns page number if there's no Next link, or negative page number if there is a Next link.
  725. // Only for use on /mobile pages; relies on Tumblr's shitty standard theme
  726. function test_next_page( body ) {
  727. var link_index = body.indexOf( 'rel="canonical"' ); // <link rel="canonical" href="http://shadygalaxies.tumblr.com/page/100" />
  728. var page_index = body.indexOf( '/page/', link_index );
  729. var terminator_index = body.indexOf( '"', page_index );
  730. var this_page = parseInt( body.substring( page_index+6, terminator_index ) );
  731. if( body.indexOf( '>next<' ) > 0 ) { return -this_page; } else { return this_page }
  732. }
  733.  
  734. // Generates an array of length "steps" between given boundaries - or near enough, for sanity's sake
  735. function array_between_bounds( lower_bound, upper_bound, steps ) {
  736. if( lower_bound > upper_bound ) { // Swap if out-of-order.
  737. var temp = lower_bound; lower_bound = upper_bound, upper_bound = temp;
  738. }
  739. var bound_range = upper_bound - lower_bound;
  740. if( steps > bound_range ) { steps = bound_range; } // Steps <= bound_range, but steps > 1 to avoid division by zero:
  741. var pages_per_test = parseInt( bound_range / steps ); // Steps-1 here, so first element is lower_bound & last is upper_bound. Off-by-one errors, whee...
  742. var range = Array( steps )
  743. .fill( lower_bound )
  744. .map( (value,index) => value += index * pages_per_test );
  745. range.push( upper_bound );
  746. return range;
  747. }
  748.  
  749. // DEBUG
  750. // site_and_tags = 'https://www.tumblr.com/safe-mode?url=http://shittyhorsey.tumblr.com';
  751.  
  752. // Given a (presumably sorted) list of page numbers, find the last that exists and the first that doesn't exist.
  753. function find_reasonable_bound( test_array ) {
  754. return Promise.all( test_array.map( pagenum => fetch( site_and_tags + '/page/' + pagenum + '/mobile', { credentials: 'include' } ) ) )
  755. .then( responses => Promise.all( responses.map( response => response.text() ) ) )
  756. .then( pages => pages.map( page => test_next_page( page ) ) )
  757. .then( numbers => {
  758. var lower_index = -1;
  759. numbers.forEach( (value,index) => { if( value < 0 ) { lower_index++; } } ); // Count the negative numbers (i.e., count the pages with known content)
  760. if( lower_index < 0 ) { lower_index = 0; }
  761. var bounds = [ Math.abs(numbers[lower_index]), numbers[lower_index+1] ]
  762. mydiv.innerHTML += "Last page is between " + bounds[0] + " and " + bounds[1] + ".<br>";
  763. return bounds;
  764. } )
  765. }
  766.  
  767. // Repeatedly narrow down how many pages we're talking about; find a reasonable "last" page
  768. find_reasonable_bound( [2, 10, 100, 1000, 10000, 100000] ) // Are we talking a couple pages, or a shitload of pages?
  769. .then( pair => find_reasonable_bound( array_between_bounds( pair[0], pair[1], 10 ) ) ) // Narrow it down. Fewer rounds of more fetches works best.
  770. .then( pair => find_reasonable_bound( array_between_bounds( pair[0], pair[1], 10 ) ) ) // Time is round count, fetches add up, selectivity is fetches x fetches.
  771. // Quit fine-tuning numbers and just conditional in some more testing for wide ranges.
  772. .then( pair => { if( pair[1] - pair[0] > 50 ) { return find_reasonable_bound( array_between_bounds( pair[0], pair[1], 10 ) ) } else { return pair; } } )
  773. .then( pair => { if( pair[1] - pair[0] > 50 ) { return find_reasonable_bound( array_between_bounds( pair[0], pair[1], 10 ) ) } else { return pair; } } )
  774. .then( pair => {
  775. options_map.lastpage = pair[1];
  776.  
  777. document.getElementById( 'browse10' ).href += "?lastpage=" + options_map.lastpage; // Add last-page indicator to Browse Images link
  778. document.getElementById( 'browse5' ).href += "?lastpage=" + options_map.lastpage; // ... and the 5-pages-at-once link.
  779. document.getElementById( 'browse1' ).href += "?lastpage=" + options_map.lastpage; // ... and the 1-page-at-onces link.
  780. document.getElementById( 'exp10' ).href += "?lastpage=" + options_map.lastpage; // ... and this fetch-every-post link.
  781. document.getElementById( 'exp5' ).href += "?lastpage=" + options_map.lastpage; // ... and this fetch-every-post link.
  782. document.getElementById( 'exp1' ).href += "?lastpage=" + options_map.lastpage; // ... and this fetch-every-post link.
  783. document.getElementById( 'xml_all' ).href += "?lastpage=" + options_map.lastpage; // ... and this XML post-by-post link, why not.
  784.  
  785. start_scraping_button();
  786. } );
  787. }
  788. else { // If we're given the highest page by the URL, just use that
  789. start_scraping_button();
  790. }
  791.  
  792. // Add "Scrape" button to the page. This will grab images and links from many pages and list them page-by-page.
  793. function start_scraping_button() {
  794. if( options_map.grabrange ) { // If we're only grabbing a 1000-page block from a huge-ass tumblr:
  795. mydiv.innerHTML += "<br>This will grab 1000 pages starting at <b>" + options_map.grabrange + "</b>.<br><br>";
  796. } else { // If we really are describing the last page:
  797. mydiv.innerHTML += "<br>Last page is <b>" + options_map.lastpage + "</b> or lower. ";
  798. mydiv.innerHTML += "<a href=" + options_url( {lastpage: false} ) + ">Find page count again?</a> <br><br>";
  799. }
  800.  
  801. if( options_map.lastpage > 1500 && !options_map.grabrange ) { // If we need to link to 1000-page blocks, and aren't currently inside one:
  802. for( let x = 1; x < options_map.lastpage; x += 1000 ) { // For every 1000 pages...
  803. let decade_url = window.location.href + "?grabrange=" + x + "?lastpage=" + options_map.lastpage;
  804. mydiv.innerHTML += "<a href='" + decade_url + "'>Pages " + x + "-" + (x+999) + "</a><br>"; // ... link a range of 1000 pages.
  805. }
  806. }
  807.  
  808. // Add button to scrape every page, one after another.
  809. // Buttons within GreaseMonkey are a huge pain in the ass. I stole this from stackoverflow.com/questions/6480082/ - thanks, Brock Adams.
  810. var button = document.createElement ('div');
  811. button.innerHTML = '<button id="myButton" type="button">Find image links from all pages</button>';
  812. button.setAttribute ( 'id', 'scrape_button' ); // I'm really not sure why this id and the above HTML id aren't the same property.
  813. document.body.appendChild ( button ); // Add button (at the end is fine)
  814. document.getElementById ("myButton").addEventListener ( "click", scrape_all_pages, false ); // Activate button - when clicked, it triggers scrape_all_pages()
  815.  
  816. if( options_map.autostart ) { document.getElementById ("myButton").click(); } // Getting tired of clicking on every reload - debug-ish
  817. if( options_map.lastpage <= 26 ) { document.getElementById ("myButton").click(); } // Automatic fetch (original behavior!) for a single round
  818. }
  819. }
  820.  
  821.  
  822.  
  823. function scrape_all_pages() { // Example code implies that this function /can/ take a parameter via the event listener, but I'm not sure how.
  824. var button = document.getElementById( "scrape_button" ); // First, remove the button. There's no reason it should be clickable twice.
  825. button.parentNode.removeChild( button ); // The DOM can only remove elements from a higher level. "Elements can't commit suicide, but infanticide is permitted."
  826.  
  827. mydiv.innerHTML += "Scraping page: <div id='pagecounter'></div><div id='afterpagecounter'><br></div><br>"; // This makes it easier to view progress,
  828.  
  829. // Create divs for all pages' content, allowing asynchronous AJAX fetches
  830. var x = 1;
  831. var div_end_page = options_map.lastpage;
  832. if( !isNaN( options_map.grabrange ) ) { // If grabbing 1000 pages from the middle of 10,000, don't create 0..10,000 divs
  833. x = options_map.grabrange;
  834. div_end_page = x + 1000; // Should be +999, but whatever, no harm in tiny overshoot
  835. }
  836. for( ; x <= div_end_page; x++ ) {
  837. var siteurl = site_and_tags + "/page/" + x;
  838. if( options_map.usemobile ) { siteurl += "/mobile"; } // If ?usemobile is flagged, scrape the mobile version.
  839. if( x == 1 && options_map.usemobile ) { siteurl = site_and_tags + "/mobile"; } // Hacky fix for redirect from example.tumblr.com/page/1/anything -> example.tumblr.com
  840.  
  841. var new_div = document.createElement( 'div' );
  842. new_div.id = '' + x;
  843. document.body.appendChild( new_div );
  844. }
  845.  
  846. // Fetch all pages with content on them
  847. var page_counter_div = document.getElementById( 'pagecounter' ); // Probably minor, but over thousands of laggy page updates, I'll take any optimization.
  848. pagecounter.innerHTML = "" + 1;
  849. var begin_page = 1;
  850. var end_page = options_map.lastpage;
  851. if( !isNaN( options_map.grabrange ) ) { // If a range is defined, grab only 1000 pages starting there
  852. begin_page = options_map.grabrange;
  853. end_page = options_map.grabrange + 999; // NOT plus 1000. Stop making that mistake. First page + 999 = 1000 total.
  854. if( end_page > options_map.lastpage ) { end_page = options_map.lastpage; } // Kludge
  855. document.title += " " + (parseInt( begin_page / 1000 ) + 1); // Change page title to indicate which block of pages we're saving
  856. }
  857.  
  858.  
  859. // Generate array of URL/pagenum pair-arrays
  860. url_index_array = new Array;
  861. for( var x = begin_page; x <= end_page; x++ ) {
  862. var siteurl = site_and_tags + "/page/" + x;
  863. if( options_map.usemobile ) { siteurl += "/mobile"; } // If ?usemobile is flagged, scrape the mobile version. No theme shenanigans... but also no photosets. Sigh.
  864. if( x == 1 && options_map.usemobile ) { siteurl = site_and_tags + "/mobile"; } // Hacky fix for redirect from example.tumblr.com/page/1/anything -> example.tumblr.com
  865. url_index_array.push( [siteurl, x] );
  866. }
  867.  
  868. // Fetch, scrape, and display all URLs. Uses promises to work in parallel and promise.all to limit speed and memory (mostly for reliability's sake).
  869. // Consider privileging first page with single-element fetch, to increase apparent responsiveness. Doherty threshold for frustration is 400ms.
  870. var simultaneous_fetches = 25;
  871. var chain = Promise.resolve(0); // Empty promise so we can use "then"
  872.  
  873. var order_array = [1]; // We want to show the first page immediately, and this is a callback rat's-nest, so let's make an array of how many pages to take each round
  874. for( var x = 1; x < url_index_array.length; x += simultaneous_fetches ) { // E.g. [1, simultaneous_fetchs, s_f, s_f, s_f, whatever's left]
  875. if( url_index_array.length - x > simultaneous_fetches ) { order_array.push( simultaneous_fetches ); } else { order_array.push( url_index_array.length - x ); }
  876. }
  877.  
  878. order_array.forEach( (how_many) => {
  879. chain = chain.then( s => {
  880. var subarray = url_index_array.splice( 0, how_many ); // Shift a reasonable number of elements into separate array, for partial array.map
  881. return Promise.all( subarray.map( page =>
  882. Promise.all( [ fetch( page[0], { credentials: 'include' } ).then( s => s.text() ), page[1], page[0] ] ) // Return [ body of page, page number, page URL ]
  883. ) )
  884. } )
  885. .then( responses => responses.map( s => { // Scrape URLs for links and images, display on page
  886. var pagenum = s[1];
  887. var page_url = s[2];
  888. var url_array = soft_scrape_page_promise( s[0] ) // Surprise, this is a promise now
  889. .then( urls => {
  890. // Sort #link URLs to appear first, because we don't do that in soft-scrape anymore
  891. urls.sort( (a,b) => -a.indexOf( "#link" ) ); // Strings containing "#link" go before others - return +1 if not found in 'a.' Should be stable.
  892.  
  893. // Print URLs so DownThemAll (or similar) can grab them
  894. var bulk_string = "<br><a href='" + page_url + "'>" + page_url + "</a><br>"; // A digest, so we can update innerHTML just once per div
  895. // DEBUG-ish - on theory that 1000-page-tall scraping/rendering fucks my VRAM
  896. if( options_map.smalltext ) { bulk_string = "<p style='font-size:1px'>" + bulk_string; } // If ?smalltext flag is set, render text unusably small, for esoteric reasons
  897. urls.forEach( (value,index,array) => {
  898. if( options_map.plaintext ) {
  899. bulk_string += value + '<br>';
  900. } else {
  901. bulk_string += '<a href ="' + value + '">' + value + '</a><br>';
  902. }
  903. } )
  904. document.getElementById( '' + pagenum ).innerHTML = bulk_string;
  905. if( parseInt( pagecounter.innerHTML ) < pagenum ) { pagecounter.innerHTML = "" + pagenum; } // Increment pagecounter (where sensible)
  906. } );
  907. } )
  908. )
  909. } )
  910.  
  911. chain = chain.then( s => {
  912. document.getElementById( 'afterpagecounter' ).innerHTML = "Done. Use DownThemAll (or a similar plugin) to grab all these links.";
  913.  
  914. // Divulge contents of page_dupe_hash to check for common tags
  915. // Ugh, I'm going to have to turn this from an associative array into an array-of-arrays if I want to sort it.
  916. let tag_overview = "<br>" + "Tag overview: " + "<br>";
  917. let score_tag_list = new Array; // This will hold an array of arrays so we can sort this associative array by its values. Wheee.
  918. for( let url in page_dupe_hash ) {
  919. if( url.indexOf( '/tagged/' ) > 0 // If it's a tag URL...
  920. && page_dupe_hash[ url ] > 1 // and non-unique...
  921. && url.indexOf( '/page/' ) < 0 // and not a page link...
  922. && url.indexOf( '?og' ) < 0 // and not an opengraph link...
  923. && url.indexOf( '?ezas' ) < 0 // and not this script, wtf...
  924. ) { // So if it's a TAG, in other words...
  925. score_tag_list.push( [ page_dupe_hash[ url ], url ] ); // ... store [ number of times seen, tag URL ] for sorting.
  926. }
  927. }
  928. score_tag_list.sort( (a,b) => a[0] > b[0] ); // Ascending order, now - most common tags at the very bottom, for easier access
  929. score_tag_list.map( pair => {
  930. pair[1] = pair[1].replace( '/chrono', '' ); // Remove /chrono from sites that append it automatically, since it breaks the 'N pages' autostart links. (/chrono/ might fail.)
  931. var this_tag = pair[1].split('/').pop(); // e.g. example.tumblr.com/tagged/my-art -> my-art
  932. //if( this_tag == '' ) { let this_tag = pair[1].split('/').pop(); } // Trailing slash screws this up, so get the second-to-last thing instead
  933. var scrape_link = options_url( {find: '/tagged/'+this_tag, lastpage: false, grabrange: false, autostart: true} ); // Direct link to ?scrapewholesite for e.g. /tagged/my-art
  934. tag_overview += "<br><a href='" + scrape_link + "'>" + pair[0] + " posts</a>:\t" + "<a href='" + pair[1] + "'>" + pair[1] + "</a>";
  935. } )
  936. document.body.innerHTML += tag_overview;
  937.  
  938. } )
  939. }
  940.  
  941.  
  942.  
  943.  
  944.  
  945.  
  946.  
  947.  
  948.  
  949. // ------------------------------------ Multi-page scraper with embedded images ------------------------------------ //
  950.  
  951.  
  952.  
  953.  
  954.  
  955.  
  956.  
  957.  
  958.  
  959. function scrape_tumblr_pages() {
  960. // Grab an empty page so that duplicate-removal hides whatever junk is on every single page
  961. // This is DEBUG-ish. It might be slow, barring caching. It might not work due to asynchrony. It could block actual content thanks to 'my best posts' sidebars.
  962.  
  963. if( isNaN( parseInt( options_map.startpage ) ) || options_map.startpage <= 1 ) { options_map.startpage = 1; } // Sanity check
  964.  
  965. mydiv.innerHTML += "<br>" + html_previous_next_navigation() + "<br>";
  966. document.getElementById("bottom_controls_div").innerHTML += "<br>" + html_page_count_navigation() + "<br>" + html_previous_next_navigation();
  967.  
  968. mydiv.innerHTML += "<br>" + html_ezastumblrscrape_options() + "<br>";
  969.  
  970. mydiv.innerHTML += "<br>" +image_size_options();
  971.  
  972. mydiv.innerHTML += "<br><br>" + image_resolution_options();
  973.  
  974. // Fill an array with the page URLs to be scraped (and create per-page divs while we're at it)
  975. var pages = new Array( parseInt( options_map.pagesatonce ) )
  976. .fill( parseInt( options_map.startpage ) )
  977. .map( (value,index) => value+index );
  978. pages.forEach( pagenum => {
  979. mydiv.innerHTML += "<hr><div id='" + pagenum + "'><b>Page " + pagenum + " </b></div>";
  980. } )
  981.  
  982. pages.map( pagenum => {
  983. var siteurl = site_and_tags + "/page/" + pagenum; // example.tumblr.com/page/startpage, startpage+1, startpage+2, etc.
  984. if( options_map.usemobile ) { siteurl += "/mobile"; } // If ?usemobile is flagged, scrape mobile version. No theme shenanigans... but also no photosets. Sigh.
  985. if( pagenum == 1 && options_map.usemobile ) { siteurl = site_and_tags + "/mobile"; } // Hacky fix for redirect from example.tumblr.com/page/1/anything -> example.tumblr.com
  986. fetch( siteurl, { credentials: 'include' } ).then( response => response.text() ).then( text => {
  987. document.getElementById( pagenum ).innerHTML += "<b>fetched</b><br>" // Immediately indicate the fetch happened.
  988. + "<a href='" + siteurl + "'>" + siteurl + "</a><br>"; // Link to page. Useful for viewing things in-situ... and debugging.
  989. // For some asinine reason, 'return url_array' causes 'Permission denied to access property "then".' So fake it with ugly nesting.
  990. soft_scrape_page_promise( text )
  991. .then( url_array => {
  992. var div_digest = ""; // Instead of updating each div's HTML for every image, we'll lump it into one string and update the page once per div.
  993.  
  994. var video_array = new Array;
  995. var outlink_array = new Array;
  996. var inlink_array = new Array;
  997.  
  998. url_array.forEach( (value,index,array) => { // Shift videos and links to separate arrays, blank out those URLs in url_array
  999. if( value.indexOf( '#video' ) > 0 ) { video_array.push( value ); array[index] = '' }
  1000. if( value.indexOf( '#offsite' ) > 0 ) { outlink_array.push( value ); array[index] = '' }
  1001. if( value.indexOf( '#local' ) > 0 ) { inlink_array.push( value ); array[index] = '' }
  1002. } );
  1003. url_array = url_array.filter( url => url === "" ? false : true ); // Remove empty elements from url_array
  1004.  
  1005. // Display video links, if there are any
  1006. video_array.forEach( value => {div_digest += "Video: <a href='" + value + "'>" + value + "</a><br> "; } )
  1007.  
  1008. // Display page links if the ?showlinks flag is enabled
  1009. if( options_map.showlinks ) {
  1010. div_digest += "Outgoing links: ";
  1011. outlink_array.forEach( (value,index) => { div_digest += "<a href='" + value.replace('#offsite#link', '') + "'>O" + (index+1) + "</a> " } );
  1012. div_digest += "<br>" + "Same-Tumblr links: ";
  1013. inlink_array.forEach( (value,index) => { div_digest += "<a href='" + value.replace('#local#link', '') + "'>T" + (index+1) + "</a> " } );
  1014. div_digest += "<br>";
  1015. }
  1016.  
  1017. // Embed high-res images to be seen, clicked, and saved
  1018. url_array.forEach( image_url => {
  1019. // Embed images (linked to themselves) and link to photosets
  1020. if( image_url.indexOf( "#photoset#" ) > 0 ) { // Before the first image in a photoset, print the photoset link.
  1021. var photoset_url = image_url.split( "#" ).pop();
  1022. // URL is like tumblr.com/image#photoset#http://tumblr.com/photoset_iframe - separate past last hash... t.
  1023. div_digest += " <a href='" + photoset_url + "'>Set:</a>";
  1024. }
  1025. div_digest += "<a id='" + encodeURI( image_url ) + "' target='_blank' href='" + image_url + "'>" + "<img alt='(Waiting for image)' onerror='" + error_function( image_url ) + "' src='" + image_url + "'></a> ";
  1026. // div_digest += "<a id='" + encodeURI( image_url ) + "' target='_blank' href='" + image_url + "'>" + "<img alt='(Waiting for image)' src='" + image_url + "'></a> ";
  1027. } )
  1028.  
  1029. div_digest += "<br><a href='" + siteurl + "'>(End of " + siteurl + ")</a>"; // Another link to the page, because I'm tired of scrolling back up.
  1030. document.getElementById( pagenum ).innerHTML += div_digest;
  1031. } ) // End of 'then( url_array => { } )'
  1032. } ) // End of 'then( text => { } )'
  1033. } ) // End of 'pages.map( pagenum => { } )'
  1034. }
  1035.  
  1036.  
  1037.  
  1038.  
  1039.  
  1040.  
  1041.  
  1042.  
  1043.  
  1044. // ------------------------------------ Whole-site scraper based on post-by-post method ------------------------------------ //
  1045.  
  1046.  
  1047.  
  1048.  
  1049.  
  1050.  
  1051.  
  1052.  
  1053. // The use of sitemap.xml could be transformative to this script, or even split off into another script entirely. It finally allows a standard theme!
  1054. // Just do it. Entries look like:
  1055. // <url><loc>http://leopha.tumblr.com/post/57593339717/hello-a-friend-found-my-blog-somehow-so-url</loc><lastmod>2013-08-07T06:42:58Z</lastmod></url></urlset>
  1056. // So split on <url><loc> and terminate at </loc>. Or see if JS has native XML parsing, for honest object-orientation instead of banging at text files.
  1057. // Grab normally for now, I guess, just to get it implemented. Make ?usemobile work. Worry about /embed stuff later.
  1058. // Check if dashboard-only blogs have xml files exposed.
  1059. // While we're at it, check that the very first <loc> for sitemap2 doesn't point to a /page/n URL. Not useful, just interesting.
  1060.  
  1061. // Simplicate for now. Have a link to grab an individual XML file, get everything inside it, then add images (and tags?) from each page.
  1062. // For an automatic 'get the whole damn site' mode, maybe just call each XML file sequentially. Not a for() loop - have each page maybe call the next page when finished.
  1063. // Oh right, these XML files are like 500 posts each. Not much different from grabbing 50 pages at a time - and we currently grab 25 at once - but significant.
  1064. // If I have to pass the /post URL list to a serial function in order to rate-limit this, I might as well grab all XML files and fill the list -once.-
  1065. // Count down? So e.g. scrape(x) calls scrape(x-1). It's up to the root function to call the high value initially.
  1066.  
  1067. // Single-sitemap pages (e.g. scraping and displaying sitemap2) will need prev/next controls (e.g. to sitemap1 & sitemap3).
  1068. // These will be chronological. That's fine, just mention it somewhere.
  1069.  
  1070. // I'm almost loathe to publicize this. It's not faster or more reliable than the monolithic scraper. It's not a better image-browsing method, in my opinion.
  1071. // It's only really useful for people who think a blank theme is the same as erasing their blog... and the less those jerks know, the better. Don't delete art.
  1072. // I guess it'll be better when I list tags with each post, but at present they're strongly filtered out.
  1073.  
  1074. // Wait, wtf? I ripped sometipsygnostalgic as a test cast (98 sitemaps, ~50K posts) and the last page scraped is from years ago.
  1075. // None of the sitemaps past sitemap51 exist. So the first 51 are present - the rest 404. (To her theme, not as a generic tumblr error with animated background.)
  1076. // It's not a URL-generation error; the same thing happens copy-pasting the XML URLs straight from sitemap.xml.
  1077. // Just... fuck. Throw a warning for now, see if there's anything to be done later.
  1078. // If /mobile versions of /post URLs unexpectedly had proper [video] references... do they contain previous/next post info? Doesn't look like it. Nuts.
  1079.  
  1080. // In addition to tags, this should maybe list sources and 'via' links. It's partially archival, after all.
  1081. // Sort of have to conditionally unfilter /tagged links from both duplicate-remover and whichever function removes tumblr guff. 'if !xml remove tagged.' bluh.
  1082. // Better idea: send the list to a 'return only tags' filter first, non-destructively. Then filter them out.
  1083.  
  1084. // I guess this is what I'd modify to get multi-tag searches - like homestuck+hs+hamsteak. Scrape by /tagged and /page, populate post_urls, de-dupe and sort.
  1085. // Of course then I'd want it to 'just work' with the existing post-by-post image browser, which... probably won't happen. I'm loathe to recreate it exactly or inexactly.
  1086.  
  1087. // This needs some method to find /post numbers from /page and /tagged pages.
  1088. // Maybe trigger a later post-by-post function from a modified scrapewholesite approach? Code quality is a non-issue right now.
  1089. // Basically have to fill some array post_urls with /post URLs (strings) and then call scrape_post_urls(). Oh, that array is global. This was already a janky first pass.
  1090. // Instead of XML files, use /page or /page + /mobile... pages.
  1091. // Like 'if /tagged/something then grab all links and filter for indexof /post.'
  1092.  
  1093. // This is the landing function for this mode - it grabs sitemap.xml, parses options, sets any necessary variables, and invokes scrape_sitemap_x for sitemap1 and so on.
  1094. function scrape_sitemap() {
  1095. document.title += ' sitemap';
  1096.  
  1097. // Flow control.
  1098. // Two hard problems. Structure now, names later.
  1099. if( options_map.sitemap ) {
  1100. mydiv.innerHTML += "<br>Completion: <div id='pagecounter'>0%</div><div id='afterpagecounter'></div><br>"; // This makes it easier to view progress,
  1101. let final_sitemap = 0; // Default: no recursion. 'Else considered harmful.'
  1102. if( options_map.xmlcount ) { final_sitemap = options_map.xmlcount; } // If we're doing the whole site, indicate the highest sitemap and recurse until you get there.
  1103. scrape_sitemap_x( options_map.sitemap, final_sitemap );
  1104. } else {
  1105. // Could create a div for this enumeration of sitemapX files, and always have it at the top. It'd get silly for tumblrs with like a hundred of them.
  1106. // Maybe at the bottom?
  1107. fetch( window.location.protocol + "//" + window.location.hostname + '/sitemap.xml', { credentials: 'include' } ) // Grab text-like file
  1108. .then( r => r.text() )
  1109. .then( t => { // Process the text of sitemap.xml
  1110. let sitemap_list = t.split( '<loc>' );
  1111. sitemap_list.shift(); // Get rid of data before first location
  1112. sitemap_list = sitemap_list.map( e => e.split('</loc>')[0] ); // Terminate each entry
  1113. sitemap_list = sitemap_list.filter( e => { return e.indexOf( '/sitemap' ) > 0 && e.indexOf( '.xml' ) > 0; } ); // Remove everything but 'sitemapX.xml' links
  1114.  
  1115. mydiv.innerHTML += '<br>' + window.location.hostname + ' has ' + sitemap_list.length + ' sitemap XML file';
  1116. if( sitemap_list.length > 1 ) { mydiv.innerHTML += 's'; } // Pluralization! Not sexy, just functional.
  1117. // mydiv.innerHTML += ' (Up to ' + sitemap_list.length * 500 + ' posts.) <br>'
  1118. mydiv.innerHTML += '. (' + Math.ceil(sitemap_list.length / 2) + ',000 posts or fewer.)<br><br>' // Kludge math, but I pefer this presentation.
  1119. if( sitemap_list.length > 50 ) { mydiv.innerHTML += 'Sitemaps past 50 may not work.<br><br>'; } // Pluralization! Not sexy, just functional.
  1120. // List everything: <Links only>
  1121. // List sitemap1: <Links only> or <Links & thumbnails>
  1122. // List sitemap2: <Links only> or <Links & thumbnails> etc.
  1123. mydiv.innerHTML += 'List everything: ' + '<a href=' + options_url( { sitemap:1, xmlcount:sitemap_list.length } ) + '>Links only</a> - <a href=' + options_url( { sitemap:1, xmlcount:sitemap_list.length, story:true, usemobile:true } ) + '>Links and text (stories)</a><br><br>';
  1124. if( options_map.find && options_map.lastpage ) {
  1125. mydiv.innerHTML += 'List just this tag (' + options_map.find + '): ' + '<a href=' + options_url( { sitemap:1, xmlcount:options_map.lastpage, tagscrape:true } ) + '>Links only</a> - <a href=' + options_url( { sitemap:1, xmlcount:options_map.lastpage, story:true, usemobile:true, tagscrape:true } ) + '>Links and text (stories)</a><br><br>';
  1126. }
  1127. // mydiv.innerHTML += "<br><h1><a id='browse10' href='" + options_url( {scrapemode:'pagebrowser', thumbnails:true} ) +"'>Browse images (10 pages at once)</a> </h1>";
  1128. for( n = 1; n <= sitemap_list.length; n++ ) {
  1129. let text_link = options_url( { sitemap:n } );
  1130. let images_link = options_url( { sitemap:n, thumbnails:'xml' } );
  1131. let story_link = options_url( { sitemap:n, story:true, usemobile:true } );
  1132. mydiv.innerHTML += 'List sitemap' + n + ': ' + '<a href=' + text_link + '>Links only</a> <a href=' + images_link + '>Links & thumbnails</a> <a href=' + story_link + '>Links & text</a><br>'
  1133. }
  1134. } )
  1135. }
  1136. }
  1137.  
  1138. // Text-based scrape mode for sitemap1.xml, sitemap2.xml, etc.
  1139. function scrape_sitemap_x( sitemap_x, final_sitemap ) {
  1140. document.title = window.location.hostname + ' - sitemap ' + sitemap_x; // We lose the original tumblr's title, but whatever.
  1141. if( sitemap_x == final_sitemap ) { document.title = window.location.hostname + ' - sitemap complete'; }
  1142. if( options_map.story ) { document.title += ' with stories'; }
  1143.  
  1144. // Last-minute kludge: if options_map.tagscrape, use /page instead of any xml, get /post URLs matching this domain.
  1145. // Only whole-hog for now - sitemap_x might allow like ten or a hundred pages at once, later.
  1146.  
  1147. var base_site = window.location.protocol + "//" + window.location.hostname; // e.g. http://example.tumblr.com
  1148. var sitemap_url = base_site + '/sitemap' + sitemap_x + '.xml';
  1149.  
  1150. if( options_map.tagscrape ) {
  1151. sitemap_url = base_site + options_map.find + '/page/' + sitemap_x;
  1152. document.title += ' ' + options_map.find.replace( /\//g, ' '); // E.g. 'tagged my-stuff'.
  1153. }
  1154.  
  1155. mydiv.innerHTML += "Finding posts from " + sitemap_url + ".<br>";
  1156. fetch( sitemap_url, { credentials: 'include' } ) // Grab text-like file
  1157. .then( r => r.text() ) // Get txt from HTTP response
  1158. .then( t => { // Process text to extract links
  1159. let location_list = t.split( '<loc>' ); // Break at each location declaration
  1160. location_list.shift(); location_list.shift(); // Ditch first two elements. First is preamble, second is base URL (e.g. example.tumblr.com).
  1161. location_list = location_list.map( e => e.split( '</loc>' )[0] ); // Terminate each entry at location close-tag
  1162.  
  1163. if( options_map.tagscrape ) {
  1164. // Fill location_list with URLs on this page, matching this domain, containing e.g. /post/12345.
  1165. // Trailing slash not guaranteed. Quote terminator unknown. Fuck it, just assume it's all doublequotes.
  1166. location_list = t.split( 'href="' );
  1167. location_list.shift();
  1168. location_list = location_list.map( e => e.split( '"' )[0] ); // Terminate each entry at location close-tag
  1169. location_list = location_list.filter( u => u.indexOf( window.location.hostname + '/post' ) > -1 );
  1170. location_list = location_list.filter( u => u.indexOf( '/embed' ) < 0 );
  1171. location_list = location_list.filter( u => u.indexOf( '#notes' ) < 0 ); // I fucking hate this website.
  1172. // https://www.facebook.com/sharer/sharer.php?u=https://shamserg.tumblr.com/post/55985088311/knight-of-vengence-by-shamserg
  1173. // I fucking hate the web.
  1174. location_list = location_list.map( e => window.location.protocol + '//' + window.location.hostname + e.split( window.location.hostname )[1] );
  1175. // location_list = location_list.filter( u => u.indexOf( '?ezastumblrscrape' ) <= 0 ); // Exclude links to this script. (Nope, apparently that's introduced elsewhere. WTF.)
  1176. // Christ, https versus http bullshit AGAIN. Just get it done.
  1177. // location_list = location_list.map( e => window.location.protocol + '//' + e.split( '//' )[1] ); // http -> https, or https -> http, as needed.
  1178. // I don't think I ever remove duplicates. On pages with lots of silly "sharing" bullshit, this might grab pages multiple times. Not sure I care. Extreme go horse.
  1179. }
  1180.  
  1181. // location_list should now contain a bunch of /post/12345 URLs.
  1182. if( options_map.usemobile ) { location_list = location_list.map( (url) => { return url + '/mobile' } ); }
  1183. console.log( location_list );
  1184. // return location_list; // Promises are still weird. (JS's null-by-default is also weird.)
  1185. post_urls = post_urls.concat( location_list ); // Append these URLs to the global array of /post URLs. (It's not like Array.push() because fuck you.)
  1186. // console.log( post_urls );
  1187. // It seems like bad practice to recurse from here, but for my weird needs it seems to make sense.
  1188. if( sitemap_x < final_sitemap ) {
  1189. scrape_sitemap_x( sitemap_x + 1, final_sitemap ); // If there's more to be done, recurse.
  1190. } else {
  1191. scrape_post_urls(); // If not, fetch & display all these /post URLs.
  1192. }
  1193. } )
  1194.  
  1195. }
  1196.  
  1197. // Fetch all the URLs in the global array of post_urls, then display contents as relevant
  1198. function scrape_post_urls() {
  1199. // Had to re-check how I did rate-limited fetches for the monolithic main scraper. It involved safe levels of 'wtf.'
  1200. // Basically I put n URLs in an array, or an array of fetch promises I guess, then use promise.all to wait until they're all fetched. Repeat as necessary.
  1201. // Can I use web workers? E.g. instantiate 25 parallel scripts that each shift from post_urls and spit back HTML. Ech. External JS is encouraged and DOM access is limited.
  1202. // Screw it, batch parallelism will suffice.
  1203. // Can I define a variable as a function? I mean, can I go var foobar = function_name, so I can later invoke foobar() and get function_name()?
  1204. // If so: I can pick a callback before all of this, fill a global array with e.g. [url, contents] stuff, and jump to a text- or image-based display when that's finished.
  1205. // Otherwise I think I'm gonna be copying myself a lot in order to display text xor images.
  1206.  
  1207. // Create divs for each post? This is going to be a lot of divs.
  1208. // 1000 is fine for the traditional monolithic scrape mode, but that covers ten or twenty times as many posts as a single sitemapX.xml.
  1209. // And it probably doesn't belong right here.
  1210. // Fuck it, nobody's using this unless they read the source code. Gotta build something twice to build it right.
  1211.  
  1212. // Create divs for each post, so they can be fetched and inserted asynchronously
  1213. post_urls.forEach( (url) => {
  1214. let new_div = document.createElement( 'div' );
  1215. new_div.id = '' + url;
  1216. document.body.appendChild( new_div );
  1217. } )
  1218.  
  1219. // Something in here is throwing "Promise rejection value is a non-unwrappable cross-compartment wrapper." I don't even know what the fuck.
  1220. // Apparently Mozilla doesn't know what the fuck either, because all discussion of this seems to be 'well it ought to throw a better error.'
  1221. // Okay somehow this happens even with everything past 'console.log( order_array );' commented out, so it's not actually a wrong thing involving the Promise nest. Huh.
  1222. // It's the for loop.
  1223. // Is it because we try using location_list.length? Even though it's a global array and not a Promise? And console.log() has no problem... oh, console.log quietly chokes.
  1224. // I'm trying to grab a sitemap2 that doesn't exist because I was being clever in opposite directions.
  1225. // Nope, still shits the bed, same incomprehensible error. Console.log shows location_list has... oh goddammit. location_list was the local variable; post_urls is the global.
  1226. // Also array.concat doesn't work for some reason. post_urls still has zero elements.
  1227. // Oh for fuck's sake it has no side effects! You have to do thing=thing.concat()! God damn the inconsistent behavior versus array.push!
  1228.  
  1229. // We now get as far as chain=chain.then and the first console.log('foobar'), but it happens exactly once.
  1230. // Splicing appears to happen correctly.
  1231. // We enter responses.map, and the [body of post, post url] array is passed correctly and contains right-looking data.
  1232. // Ah. links_from_text is not a function. It's links_from_page. Firefox, that's a dead simple error - say something.
  1233.  
  1234. // Thumbnails? True thumbnails, and only for tumblr images with _100 available. Still inadvisable for the whole damn tumblr at once.
  1235. // So limit that to single-sitemap modes (like linking out to 1000-pages-at-once monolithic scrapes) and call it good. 500 thumbnails is what /archive does anyway.
  1236.  
  1237. // Terrible fix for unique tags: can I edit CSS from within the page? Can I count inside that?
  1238. // I'd be trying to do something like <tag id='my_art>my_art</tag> and increment some tags['my_art'] value - then only show <tag> with a corresponding value over one.
  1239.  
  1240. // Utter hack: I need to get the text of each /post to go above or below it. The key is options_map( 'story' ) == true.
  1241. // Still pass it as a "link" string. Prepend with \n or something, a non-URL-like control character, which we're okay to print accidentally. Don't link that "link."
  1242. // /mobile might work? Photosets show up as '[video]', but some text is in the HTML. Dunno if it's everything.
  1243. // Oh god they're suddenly fucking with mobile.
  1244. // In /mobile, get everything from '</h1>' to '<div id="navigation">'.
  1245.  
  1246. var simultaneous_fetches = 25;
  1247. var order_array = new Array; // No reason to show immediate results, so fill with n, n, n until the sum matches the number of URLs.
  1248. for( x = post_urls.length; x > simultaneous_fetches; x -= simultaneous_fetches ) {
  1249. order_array.push( simultaneous_fetches );
  1250. }
  1251.  
  1252. if( x != 0 ) { order_array.push( x ); } // Any remainder from counting-down for() loop becomes the last element
  1253.  
  1254. // console.log( order_array );
  1255. // order_array = [5,5]; // Debug - limits fetches to just a few posts
  1256.  
  1257. var chain = Promise.resolve(0); // Empty promise so we can use "then"
  1258.  
  1259. order_array.forEach( (how_many, which_entry) => {
  1260. chain = chain.then( s => {
  1261. // console.log( 'foobar' );
  1262. var subarray = post_urls.splice( 0, how_many ); // Shift some number of elements into a separate array, for partial array.map
  1263. // console.log( 'foobar2', subarray );
  1264. return Promise.all( subarray.map( post_url =>
  1265. Promise.all( [ fetch( post_url, { credentials: 'include' } ).then( s => s.text() ), post_url ] ) // Return [body of post, post URL]
  1266. ) )
  1267. } )
  1268. . then( responses => responses.map( s => {
  1269. // console.log( 'foobar3', s );
  1270. var post_url = s[1];
  1271. // var url_array = soft_scrape_page_promise( s[0] ) // Surprise, this is a promise now
  1272. // Wait, do I need it to be a promise? Because otherwise I might prefer to use links_from_text().
  1273. var sublinks = links_from_page( s[0] );
  1274.  
  1275. let tag_links = sublinks.filter( return_tags ); // Copy tags into some new array. No changes to other filters required! (Test on eixn, ~2000 pages.)
  1276. tag_links = tag_links.filter( u => u.indexOf( '?ezastumblrscrape' ) < 0 );
  1277.  
  1278. // Do we want to filter down to images for this scrape function? It's not necessarily displaying them.
  1279. // Arg, you almost have to, to filter out the links to tumblrs that reblogged each post.
  1280. sublinks = sublinks.filter( s => { return s.indexOf( '.jpg' ) > 0 || s.indexOf( '.jpeg' ) > 0 || s.indexOf( '.png' ) > 0 || s.indexOf( '.gif' ) > 0; } );
  1281.  
  1282. sublinks = sublinks.filter( tumblr_blacklist_filter ); // Remove avatars and crap
  1283. sublinks = sublinks.map( image_standardizer ); // Clean up semi-dupes (e.g. same image in different sizes -> same URL)
  1284. sublinks = sublinks.filter( novelty_filter ); // Global duplicate remover
  1285.  
  1286. // sublinks = sublinks.concat( tag_links ); // Add tags back in. Not ideal, but should be functional.
  1287.  
  1288. if( options_map.story ) {
  1289. // Janky last-ditch text ripping, because Tumblr is killing NSFW sites:
  1290. // Assume we're using /mobile. Grab everything between the end of the blog name and the standard navigation div.
  1291. // let story_start = s[0].indexOf( '</h1>' ) + 5;
  1292. let story_start = s[0].indexOf( '<p>' ) + 3; // Better to skip ahead a bit instead of showing the <h2> date
  1293. let story_end = s[0].indexOf( '<div id="navigation">' );
  1294. let story = s[0].substring( story_start, story_end );
  1295. // Problem: images are still showing up. We sort of don't want that. Clunk.
  1296. story = story.replace( /<img/g, '<notimg' ); // Clunk clunk clunk get it done. It is extreme go horse time.
  1297. // I should really convert inline images to links.
  1298. // Embedded videos what the fuck?
  1299. story = story.replace( /<iframe/g, '<notiframe' ); // Clunk clunk clunk!
  1300. story = story.replace( /<style/g, '<notstyle' ); // CLUNK CLUNK.
  1301. //story = story.replace( /<span/g, '<notspan' ); // Clunk? // Oh, no. It's the /embed crap.
  1302. sublinks.push( story ); // This is just an array of strings, right?
  1303. // console.log( story ); // DEBUG
  1304. }
  1305.  
  1306. // Oh right, this doesn't really return anything. We arrange and insert HTML from here.
  1307. var bulk_string = "<br><a href='" + post_url + "'>" + post_url + "</a><br>"; // A digest, so we can update innerHTML just once per div
  1308. sublinks.forEach( (link) => {
  1309. let contents = link;
  1310. if( options_map.thumbnails == 'xml' && link.indexOf( '_1280' ) > -1 ) { // If we're showing thumbnails and this image can be resized, do, then show it
  1311. let img = link.replace( '_1280', '_100' );
  1312. contents = '<img src=' + img + '>' + link; // <img> deserves a class, for consistent scale.
  1313. }
  1314. this_link = '<a href ="' + link + '">' + contents + '</a><br>';
  1315.  
  1316. // How is this not what's causing the CSS bleed?
  1317. // if( link.substring(0) == '\n' ) { // If this is text
  1318. if( link.indexOf( '\n' ) > -1 ) { // If this is text
  1319. this_link = link;
  1320. }
  1321.  
  1322. bulk_string += this_link;
  1323. } )
  1324.  
  1325. var tag_string = "";
  1326.  
  1327. tag_links.forEach( (link) => {
  1328. // let tag = link.split( '/tagged/' )[1]
  1329. tag_string += '#' + link.split( '/tagged/' )[1] + ' ';
  1330. } )
  1331. bulk_string += tag_string + "<br>";
  1332.  
  1333. // Tags should be added here. If we have them.
  1334. // console.log( bulk_string );
  1335. document.getElementById( '' + post_url ).innerHTML = bulk_string; // Yeeeah, I should probably create these div IDs before this happens.
  1336. // And here's where I'd increase the page counter, if we had one.
  1337. // Let's use order_array to judge how done we are - and make it a percent, not a page count. Use order_array.length and pretend the last element's the same size.
  1338. // document.getElementById( 'pagecounter' ).innerHTML = '%';
  1339. // No wait, we can't do it here. This is per-post, not per-page. Or... we could do it real half-assed.
  1340. } )
  1341. )
  1342. .then( s => { // I don't think we take any actual data here. This just fires once per 'responses' group, so we can indicate page count etc.
  1343. let completion = Math.ceil( 100 * (which_entry+1) / order_array.length ); // Zero-ordinal index to percentage. Blugh.
  1344. document.getElementById( 'pagecounter' ).innerHTML = '' + completion + '%';
  1345. } )
  1346. } )
  1347.  
  1348. // "Promises allow a flat execution pattern!" Fuck you, you liars. Look at that rat's nest of alternating braces.
  1349. // If you've done all the work in spaghetti functions somewhere else, maybe it's fine, but if you want code to happen where it fucking starts, anonymous functions SUCK.
  1350.  
  1351. }
  1352.  
  1353.  
  1354.  
  1355.  
  1356.  
  1357.  
  1358.  
  1359.  
  1360.  
  1361. // ------------------------------------ Post-by-post scraper with embedded images ------------------------------------ //
  1362.  
  1363.  
  1364.  
  1365.  
  1366.  
  1367.  
  1368.  
  1369.  
  1370.  
  1371. // Scrape each page for /post/ links, scrape each /post/ for content, display in-order with less callback hell
  1372. // New layout & new scrape method - not required to be compatible with previous functions
  1373. function new_embedded_display() {
  1374. if( isNaN( parseInt( options_map.startpage ) ) || options_map.startpage <= 1 ) { options_map.startpage = 1; }
  1375.  
  1376. mydiv.innerHTML += "<center>" + html_previous_next_navigation() + "<br><br>" + html_page_count_navigation() + "</center><br>";
  1377. document.getElementById("bottom_controls_div").innerHTML += "<center>" + html_page_count_navigation() + "<br><br>" + html_previous_next_navigation() + "</center>";
  1378.  
  1379. // Links out from this mode - scrapewholesite, original mode, maybe other crap
  1380. //mydiv.innerHTML += "This mode is under development and subject to change."; // No longer true. It's basically feature-complete.
  1381. // mydiv.innerHTML += " - <a href=" + options_url( { everypost:false } ) + ">Return to original image browser</a>" + "<br>" + "<br>";
  1382. mydiv.innerHTML += "<br>" + html_ezastumblrscrape_options() + "<br><br>";
  1383.  
  1384. // "Pages 1 to 10 (of 100) from http://example.tumblr.com"
  1385. mydiv.innerHTML += "Pages " + options_map.startpage + " to " + (options_map.startpage + options_map.pagesatonce - 1);
  1386. if( !isNaN(options_map.lastpage) ) { mydiv.innerHTML += " (of " + options_map.lastpage + ")"; }
  1387. mydiv.innerHTML += " from " + site_and_tags + "<br>";
  1388.  
  1389. mydiv.innerHTML += image_size_options() + "<br><br>";
  1390. mydiv.innerHTML += image_resolution_options() + "<br><br>";
  1391.  
  1392. // Messy inline function for toggling page breaks - they're optional because we have post permalinks now
  1393. mydiv.innerHTML += "<a href='javascript: void(0);' class='interactive'; onclick=\"(function(o){ \
  1394. let mydiv = document.getElementById('maindiv'); \
  1395. if( mydiv.className == '' ) { mydiv.className = 'showpagelinks'; } \
  1396. else { mydiv.className = ''; } \
  1397. })(this)\">Toggle page breaks</a><br><br>";
  1398.  
  1399. mydiv.innerHTML += "<span id='0'></span>"; // Empty span for things to be placed after.
  1400. posts_placed.push( 0 ); // Because fuck special cases.
  1401.  
  1402. // Scrape some pages
  1403. for( let x = options_map.startpage; x < options_map.startpage + options_map.pagesatonce; x++ ) {
  1404. fetch( site_and_tags + "/page/" + x, { credentials: 'include' } ).then( r => r.text() ).then( text => {
  1405. scrape_by_posts( text, x );
  1406. } )
  1407. }
  1408. }
  1409.  
  1410.  
  1411.  
  1412. // Take the HTML from a /page, fetch the /post links, display images
  1413. // Probably ought to be despaghettified and combined with the above function, but I was fighting callback hell -hard- after the last major version
  1414. // Alternately, split it even further and do some .then( do_this ).then( do_that ) kinda stuff above.
  1415. function scrape_by_posts( html_copy, page_number ) {
  1416. // console.log( page_dupe_hash ); // DEBUG
  1417. let posts = links_from_page( html_copy ); // Get links on page
  1418. posts = posts.filter( link => { return link.indexOf( '/post/' ) > 0 && link.indexOf( '/photoset' ) < 0; } ); // Keep /post links but not photoset iframes
  1419. posts = posts.map( link => { return link.replace( '#notes', '' ); } ); // post/1234 is the same as /post/1234#notes
  1420. posts = posts.filter( link => link.indexOf( window.location.host ) > 0 ); // Same-origin filter. Not necessary, but it unclutters the console. Fuckin' CORS.
  1421. if( page_number != 1 ) { posts = posts.filter( novelty_filter ); } // Attempt to remove posts linked on every page, e.g. commission info. Suffers a race condition.
  1422. posts = remove_duplicates( posts ); // De-dupe
  1423. // 'posts' now contains an array of /post URLs
  1424.  
  1425. // Display link and linebreak before first post on this page
  1426. let first_id = posts.map( u => parseInt( u.split( '/' )[4] ) ).sort( ).pop(); // Grab ID from its place in each URL, sort accordingly, take the top one
  1427. let page_link = "<span class='pagelink'><hr><a target='_blank' href='" + site_and_tags + "/page/" + page_number + "'> Page " + page_number + "</a>";
  1428. if( posts.length == 0 ) { first_id = 1; page_link += " - No images found."; } // Handle empty pages with dummy content. Out of order, but whatever.
  1429. page_link += "<br><br></span>";
  1430. display_post( page_link, first_id + 0.5 ); // +/- on the ID will change with /chrono, once that matters
  1431.  
  1432. posts.map( link => {
  1433.  
  1434. fetch( link, { credentials: 'include' } ).then( r => r.text() ).then( text => {
  1435.  
  1436. let sublinks = links_from_page( text );
  1437. sublinks = sublinks.filter( s => { return s.indexOf( '.jpg' ) > 0 || s.indexOf( '.jpeg' ) > 0 || s.indexOf( '.png' ) > 0 || s.indexOf( '.gif' ) > 0; } );
  1438. sublinks = sublinks.filter( tumblr_blacklist_filter ); // Remove avatars and crap
  1439. sublinks = sublinks.map( image_standardizer ); // Clean up semi-dupes (e.g. same image in different sizes -> same URL)
  1440. sublinks = sublinks.filter( novelty_filter ); // Global duplicate remover
  1441. // Oh. Photosets sort of just... work? That might not be reliable; DownThemAll acts like it can't see the iframes on some themes.
  1442. // Yep, they're there. Gonna be hard to notice if/when they fail. Oh well, "not all images are guaranteed to appear."
  1443. // Videos will still be weird. (But it does grab their preview thumbnails.)
  1444. // Wait, can I filter reblogs here? E.g. with a ?noreblogs flag, and then checking if any given post has via/source links. Hmm. Might be easier in /mobile pages.
  1445.  
  1446. // Seem to get a lot of duplicate images? e.g. both
  1447. // https://media.tumblr.com/tumblr_m2gktkD7u31qdcy3io1_640.jpg and
  1448. // https://media.tumblr.com/tumblr_m2gktkD7u31qdcy3io1_1280.jpg
  1449. // Oh! Do I just not handle _640?
  1450.  
  1451. // Get ID from post URL, e.g. http//example.tumblr.com/post/12345/title => 12345
  1452. let post_id = parseInt( link.split( '/' )[4] ); // 12345 as a NUMBER, not a string, doofus
  1453.  
  1454. if( sublinks.length > 0 ) { // If this post has images we're displaying -
  1455. let this_post = new String;
  1456. sublinks.map( url => {
  1457. this_post += '<span class="container">';
  1458. this_post += '<a target="_blank" id="' + encodeURI( url ) + '" href="' + url + '">';
  1459. this_post += '<img onerror=\'' + error_function( url ) + '\' id="'+url+'" src="' + url + '"></a>';
  1460. this_post += '<span class="postlink"><a target="_blank" href="' + link + '">Permalink</a></span></span> ';
  1461. } )
  1462. display_post( this_post, post_id );
  1463. }
  1464.  
  1465. } )
  1466.  
  1467. } )
  1468. }
  1469.  
  1470. // Place content on page in descending order according to post ID number
  1471. // Consider rejiggering the old scrape method to use this. Move to 'universal' section if so. Alter or spin off to link posts instead?
  1472. // Turns out I never implemented ?chrono or ?reverse, so nevermind that for now.
  1473. // Remember to set options_map.chrono if ?find contains /chrono or whatever.
  1474. function display_post( content, post_id ) {
  1475. let this_node = document.createElement( "span" );
  1476. this_node.innerHTML = content;
  1477. this_node.id = post_id
  1478.  
  1479. // Find lower-numbered node than post_id
  1480. let target_id = posts_placed.filter( n => n <= post_id ).sort( ).pop(); // Take the highest number less than (or equal to) post_id
  1481. if( options_map.find.indexOf( '/chrono' ) > 0 ) {
  1482. target_id = posts_placed.filter( n => n <= post_id ).sort( ).shift(); // Take the... fuck... lowest? What am I doing again?
  1483. // Fuuuck, this is really inconsistent. Nevermind the looney-toons syntax I used here, =>n<=.
  1484. // Screw it, use the old scraper for now.
  1485. }
  1486. let target_node = document.getElementById( target_id );
  1487.  
  1488. // http://stackoverflow.com/questions/4793604/how-to-do-insert-after-in-javascript-without-using-a-library
  1489. target_node.parentNode.insertBefore( this_node, target_node ); // Insert our span after the lower-ID node
  1490. posts_placed.push( post_id ); // Remember that we added this ID
  1491.  
  1492. // No return value
  1493. }
  1494.  
  1495. // Return ascending or descending order depending on "chrono" setting
  1496. // function post_order_sort( a, b )
  1497.  
  1498.  
  1499.  
  1500.  
  1501.  
  1502.  
  1503.  
  1504.  
  1505.  
  1506. // ------------------------------------ Specific handling for www.tumblr.com (tag search, possibly dashboard) ------------------------------------ //
  1507.  
  1508.  
  1509.  
  1510.  
  1511.  
  1512.  
  1513.  
  1514.  
  1515.  
  1516. // URLs like https://www.tumblr.com/tagged/wooloo?before=1560014505 don't follow easy sequential pagination, so we have to (a) be linear or (b) guess. First whack is (a).
  1517. // Tumblr dashboard is obvious standardized, so we can make assumptions about /post links in relation to images.
  1518. // We do not fetch any individual /posts. We can't. They're on different subdomains, and Tumblr CORS remains tight-assed. But we can link them like in the post scrape mode.
  1519. // Ooh, I could maybe get "behind the jump" content via /embed URLs. Posts here should contain the blog UUID and a post number.
  1520.  
  1521. // Copied notes from above:
  1522. // https://www.tumblr.com/tagged/homestuck sort of works. Trouble is, pages go https://www.tumblr.com/tagged/homestuck?before=1558723097 - ugh.
  1523. // Next page is https://www.tumblr.com/tagged/homestuck?before=1558720051 - yeah this is a Unix timestamp. It's epoch time.
  1524. // https://www.tumblr.com/dashboard/2/185117399018 - but this one -is- based on /post numbers. Tumblr.com: given two choices, take all three.
  1525. // Being able to scrape and archive site-wide tags or your own dashboard would be useful. Dammit.
  1526. // Just roll another mode into this script. It's already a hot mess. The new code just won't run on individual blogs.
  1527. // Okay, so dashboard and site-wide tag modes.
  1528. // Dashboard post numbers don't have to be real post numbers. Tag-search timestamps obviously don't have to relate to real posts.
  1529. // We de-dupe, so overkill is fine... ish. Tumblr's touchy about "rate limit exceeded" these days.
  1530. // Tag scrape would be suuuper useful if we can grab blog posts from www.tumblr.com. Like "hiveswapcomicscontest."
  1531.  
  1532. // Still no luck on scraping dashboard-only blogs. Bluh.
  1533. // This is so aggravating. I can see the content, obviously. "View page source" just returns the dashboard source. But document.body.outerHTML contains the blog proper.
  1534. // Consider /embed again:
  1535. // https://embed.tumblr.com/embed/post/4RwtewsxXp-k1ReCcdAgXg/185288559546?width=542&language=en_US&did=a5c973d33a43ace664986204d72d7739de31b614
  1536. // This works but provides no previous/next link. (We need the ID, but we can get it from www.tumblr.com, then redirect.)
  1537. // Using DaveJaders for testing. https://davejaders.tumblr.com/archive does not redirect, so we can use that for same-origin fetches. Does /page stuff work?
  1538. // fetch( '/' ).then( r => r.text() ).then( t => document.body.outerHTML = t ) - CORS failure. "The Same Origin Policy disallows reading the remote resource at https://www.tumblr.com/login_required/davejaders . (Reason: CORS header ‘Access-Control-Allow-Origin’ missing)." Fuck me again, apparently.
  1539. // Circle back to dashboard-only blogs by showing the actual content. Avoid clearing body.innerHTML, let Tumblr do its thing, interact with the sidebar deal.
  1540.  
  1541. // I could still estimate page count, using a binary search. Assuming constant post rates.
  1542. // Or, get fancy, and estimate total posts from time between posts on sparse samples. Each fetch is ten posts in-order.
  1543. // It does say "No posts found." See https://www.tumblr.com/tagged/starlight-brigade?before=1503710342
  1544. // Between the ?before we fetch and the Next link, we have the span of time it took for ten posts to be made. I would completely exclude the "last" page, if found.
  1545. // I guess... integrate in blocks? E.g., given a data point, assume that rate continues to the next data point. Loose approximations are fine.
  1546. // E.g. push timestamp/rate pairs into an array, sort by timestamp, find delta between each timestamp, and multiply each rate by each period. One period is bidirectional.
  1547. // Lerping is not meaningfully harder. It's the above, plus each period times half the difference between the rates at either end of that period.
  1548. // Specific date entry? input type = "date".
  1549. // Specific date entry is useful and simple enough to finish before posting a new version. Content estimation can wait.
  1550. // Oh, idiot: make the "on or before" display also the navigation mechanic.
  1551.  
  1552. // We now go right here for any ?ezastumblrscrape URL on www.tumblr.com. Sloppy but functional.
  1553. function scrape_www_tagged( ) {
  1554. // ?find=/tagged/whatever is already populated by existing scrape links, but any ?key=value stuff gets lost.
  1555. // (If no ?before=timestamp, fetch first page. ?before=0 works. Some max_int would be preferable for sorting. No exceptions needed, then.)
  1556. is_first_page = false; // Implicit global, eat me. Clunk.
  1557. if( isNaN( options_map.before ) ) {
  1558. options_map.before = parseInt( Date.now() / 1000 ); // Current Unix timestamp in seconds, not milliseconds. The year is not 51413.
  1559. is_first_page = true;
  1560. }
  1561. if( options_map.before > parseInt( Date.now() / 1000 ) ) { is_first_page = true; } // Also handle when going days or years "into the future."
  1562.  
  1563. // Standard-ish intial steps: clear page (handled), add controls, maybe change window title.
  1564. // We can't use truly standard prev/next navigation. Even officially, you only get "next" links. (Page count options should still work.)
  1565. let www_tagged_next = "<a class='www_next' href='" + options_url() + "'>Next >>></a>";
  1566. let pages_www_tagged = "";
  1567. pages_www_tagged += "<a class='www_next' href='" + options_url( {"pagesatonce": 10} ) + "'> 10</a>, ";
  1568. pages_www_tagged += "<a class='www_next' href='" + options_url( {"pagesatonce": 5} ) + "'> 5</a>, ";
  1569. pages_www_tagged += "<a class='www_next' href='" + options_url( {"pagesatonce": 1} ) + "'> 1</a> - >>>";
  1570.  
  1571. // Periods of time, in seconds, because Unix epoch timestamps.
  1572. let one_day = 24 * 60 * 60;
  1573. let one_week = one_day * 7;
  1574. let one_year = one_day * 365.24; // There's no reason not to account for leap years.
  1575.  
  1576. let approximate_place = "Posts <a href='" + options_map.find + "'>" + options_map.find.split('/').join(' ') + "</a>"; // options_map.find as relative link. Convenient.
  1577. let date_string = new Date( options_map.before * 1000 ).toISOString().slice( 0, 10 );
  1578. approximate_place += " on or before <input type='date' id='on_or_before' value='" + date_string + "'>";
  1579. // How do I associate this entry with the button so that pressing Enter triggers the button? Eh, minor detail.
  1580.  
  1581. let coverage = "<span id='coverage'></span>"; // Started writing this string, then realized I can't know its value until all pages are fetched.
  1582.  
  1583. let time_www_tagged = ""; // Browse forward / backward by day / week / year.
  1584. time_www_tagged += "<a href='" + options_url( {"before": options_map.before + one_day} ) + "'><<< </a> One day";
  1585. time_www_tagged += "<a href='" + options_url( {"before": options_map.before - one_day} ) + "'> >>></a> - ";
  1586. time_www_tagged += "<a href='" + options_url( {"before": options_map.before + one_week} ) + "'><<< </a> One week";
  1587. time_www_tagged += "<a href='" + options_url( {"before": options_map.before - one_week} ) + "'> >>></a> - ";
  1588. time_www_tagged += "<a href='" + options_url( {"before": options_map.before + one_year} ) + "'><<< </a> One year";
  1589. time_www_tagged += "<a href='" + options_url( {"before": options_map.before - one_year} ) + "'> >>></a>";
  1590.  
  1591. // Change the displayed date and it'll go there. Play stupid games, win stupid prizes.
  1592. let jump_button_code = 'window.location += "?before=" + parseInt( new Date( document.getElementById( "on_or_before" ).value ).getTime() / 1000 );';
  1593. let date_jump = "<button onclick='" + jump_button_code + "'>Go</button>";
  1594.  
  1595. mydiv.innerHTML += "<center>" + www_tagged_next + "<br><br>" + pages_www_tagged + "</center><br>";
  1596. mydiv.innerHTML += "<center>" + approximate_place + date_jump + "<br>" + coverage + "<br>" + time_www_tagged + "</center>";
  1597. mydiv.innerHTML += "This mode is under development and subject to change.<br><br>";
  1598. document.getElementById("bottom_controls_div").innerHTML += "<center>" + pages_www_tagged + "<br><br>" + www_tagged_next + "</center>";
  1599.  
  1600. mydiv.innerHTML += image_size_options() + "<br><br>";
  1601. mydiv.innerHTML += image_resolution_options() + "<br><br>";
  1602.  
  1603. // Fetch first page specified by ?before=timestamp.
  1604. let tagged_url = "" + options_map.find + "?before=" + options_map.before; // Relative URLs are guaranteed to be same-domain, even if they're garbage.
  1605. fetch( tagged_url, { credentials: 'include' } ).then( r => r.text() ).then( text => {
  1606. display_www_tagged( text, options_map.before, options_map.pagesatonce ); // ... pagesatonce gets set in the pre-amble, right? It should have a default.
  1607. // Optionally we could check here if options_map.before == 0 and instead send max_safe_integer.
  1608. } )
  1609. }
  1610.  
  1611. // Either we need a global variable for how many more pages per... page... or else I should pass a how_many_more value to this recursive function.
  1612. function display_www_tagged( content, timestamp, pages_left ) {
  1613. // First, grab the Next link - i.e. its ?before=timestamp value.
  1614. // let next_timestamp_index = content.lastIndexOf( '?before=' );
  1615. // let next_timestamp = content.substring( next_timestamp_index + 8, content.indexOf( next_timestamp_index, '"' ) ); // Untested
  1616. let next_timestamp = content.split( '?before=' ).pop().split( '"' ).shift(); // The last "?before=12345'" string on the page. Clunky but tolerable.
  1617. next_timestamp = "" + parseInt( next_timestamp ); // Guarantee this is a string of a number. (NaN "works.") Pages past the end may return nonsense.
  1618. if( pages_left > 1 ) { // If we're displaying more pages then fetch that and recurse.
  1619. let tagged_url = "" + options_map.find + "?before=" + next_timestamp; // Relative URLs are guaranteed to be same-domain, even if they're garbage.
  1620. // console.log( tagged_url );
  1621. fetch( tagged_url, { credentials: 'include' } ).then( r => r.text() ).then( text => {
  1622. display_www_tagged( text, next_timestamp, pages_left - 1 );
  1623. } )
  1624. } else { // Otherwise put that timestamp in our constructed Next link(s).
  1625. // I guess... get HTMLcollection of elements for "next" links, and change each one.
  1626. // Downside: links will only change once the last page is fetched. We could tack on a ?before for every fetch, but it would get silly. Right?
  1627. let next_links = Array.from( document.getElementsByClassName( 'www_next' ) ); // I'm not dealing with a live object unless I have to.
  1628. for( link of next_links ) { link.href += "?before=" + next_timestamp; }
  1629.  
  1630. // Oh right, and update the header to guesstimate what span of time we're looking at.
  1631. let coverage = document.getElementById( 'coverage' );
  1632. coverage.innerHTML += "Displaying ";
  1633. if( is_first_page ) { coverage.innerHTML += "the most recent "; } // Condition is no longer sensible, but it's a good placeholder.
  1634. // We can safely treat ?before as the initial timestamp. Perfect accuracy is not important.
  1635. let time_covered = Math.abs( options_map.before - next_timestamp ); // Absolute so I don't care if I have it backwards.
  1636. if( time_covered > 48 * 60 * 60 ) { coverage.innerHTML += parseInt( time_covered / (24*60*60) ) + " days of posts"; } // Over two days? Display days.
  1637. else if( time_covered > 2 * 60 * 60 ) { coverage.innerHTML += parseInt( time_covered / (60*60) ) + " hours of posts"; } // Over two hours? Display hours.
  1638. else if( time_covered > 2 * 60 ) { coverage.innerHTML += parseInt( time_covered / 60 ) + " minutes of posts"; } // Over two minutes? Display minutes.
  1639. else { coverage.innerHTML += " several seconds of posts"; } // Otherwise just say it's a damn short time.
  1640. if( time_covered == 0 ) { coverage.innerHTML = "Displaying the first available posts"; } // Last page, earlier posts. No "next" page.
  1641. }
  1642.  
  1643. // Insert div for this timestamp's page.
  1644. let new_div = document.createElement( 'span' ); // Span, because divs cause line breaks. Whoops.
  1645. new_div.id = "" + timestamp;
  1646. let target_node = document.getElementById( 'bottom_controls_div' );
  1647. target_node.parentNode.insertBefore( new_div, target_node ); // Insert each page before the footer.
  1648. let div_html = "";
  1649.  
  1650. // Separate page HTML by posts.
  1651. // At least the "li" elements aren't nested, so I can terminate the last one on "</li>". Or... all of them.
  1652. let posts = content.split( '<li class="post_container"' );
  1653. posts.shift();
  1654. posts.push( posts.pop().split( '</li>' )[0] ); // Terminate last element at </li>. Again, not great code, but clunk clunk clunk get it done.
  1655.  
  1656. // For each post:
  1657. for( post of posts ) {
  1658. // Extract images from each post.
  1659. let links = links_from_page( post );
  1660. links = links.map( image_standardizer ); // This goes before grabbing the permalink because /post URLs do get standardized. No &media guff.
  1661. let permalink = links.filter( s => s.indexOf( '.tumblr.com/post' ) > 0 )[0]; // This has to go before de-duping, or posts linking to posts can leave permalinks blank.
  1662. links = links.filter( novelty_filter );
  1663. links = links.filter( tumblr_blacklist_filter );
  1664. // document.body.innerHTML += links.join( "<br>" ) + "<br><br>"; // Debug
  1665.  
  1666. // Separate the images.
  1667. let images = links.filter( s => s.indexOf( 'media.tumblr.com' ) > 0 ); // Note: this will exclude external images, e.g. embedded Twitter stuff.
  1668.  
  1669. // If this post has images:
  1670. if( images.length > 0 ) { // Build HTML xor insert div for each post, to display images.
  1671. // Get /post URL, including blog name etc.
  1672. //let permalink = links.filter( s => s.indexOf( '.tumblr.com/post' ) > 0 )[0];
  1673.  
  1674. let post_html = "";
  1675. for( image of images ) {
  1676. post_html += '<span class="container">';
  1677. post_html += '<a target="_blank" id="' + encodeURI( image ) + '" href="' + image + '">';
  1678. post_html += '<img onerror=\'' + error_function( image ) + '\' id="' + image + '" src="' + image + '"></a>';
  1679. post_html += '<span class="postlink"><a target="_blank" href="' + permalink + '">Permalink</a></span></span> ';
  1680. }
  1681. div_html += post_html;
  1682. }
  1683. }
  1684.  
  1685. // Insert accumulated HTML into this div.
  1686. new_div.innerHTML = div_html;
  1687. }
  1688.  
  1689.  
  1690.  
  1691.  
  1692.  
  1693.  
  1694.  
  1695.  
  1696.  
  1697. // ------------------------------------ HTML-returning functions for duplication prevention ------------------------------------ //
  1698.  
  1699.  
  1700.  
  1701.  
  1702.  
  1703.  
  1704.  
  1705.  
  1706.  
  1707. // Return HTML for standard Previous / Next controls (<<< Previous - Next >>>)
  1708. function html_previous_next_navigation() {
  1709. let prev_next_controls = "";
  1710. if( options_map.startpage > 1 ) {
  1711. prev_next_controls += "<a href='" + options_url( "startpage", options_map.startpage - options_map.pagesatonce ) + "'><<< Previous</a> - ";
  1712. }
  1713. prev_next_controls += "<a href='" + options_url( "startpage", options_map.startpage + options_map.pagesatonce ) + "'>Next >>></a>";
  1714.  
  1715. return prev_next_controls;
  1716. }
  1717.  
  1718. // Return HTML for pages-at-once versions of previous/next page navigation controls (<<< 10, 5, 1 - 1, 5, 10 >>>)
  1719. function html_page_count_navigation() {
  1720. let prev_next_controls = "";
  1721. if( options_map.startpage > 1 ) { // <<< 10, 5, 1 -
  1722. prev_next_controls += "<<< ";
  1723. prev_next_controls += "<a href='" + options_url( {"startpage": options_map.startpage - 1, "pagesatonce": 1} ) + "'> 1</a>, ";
  1724. prev_next_controls += "<a href='" + options_url( {"startpage": options_map.startpage - 5, "pagesatonce": 5} ) + "'> 5</a>, ";
  1725. prev_next_controls += "<a href='" + options_url( {"startpage": options_map.startpage - 10, "pagesatonce": 10} ) + "'> 10</a> - ";
  1726. }
  1727. prev_next_controls += "<a href='" + options_url( {"startpage": options_map.startpage + options_map.pagesatonce, "pagesatonce": 10} ) + "'> 10</a>, ";
  1728. prev_next_controls += "<a href='" + options_url( {"startpage": options_map.startpage + options_map.pagesatonce, "pagesatonce": 5} ) + "'> 5</a>, ";
  1729. prev_next_controls += "<a href='" + options_url( {"startpage": options_map.startpage + options_map.pagesatonce, "pagesatonce": 1} ) + "'> 1</a> - ";
  1730. prev_next_controls += ">>>";
  1731.  
  1732. return prev_next_controls;
  1733. }
  1734.  
  1735. // Return HTML for image-size options (changes via CSS or via URL parameters)
  1736. // This used to work. It still works, in the new www mode I just wrote. What the fuck do I have to do for some goddamn onclick behavior?
  1737. // "Content Security Policy: The page’s settings blocked the loading of a resource at self (“script-src https://lalilalup.tumblr.com https://assets.tumblr.com/pop/ 'nonce-OTA4NjViZmE2MzZkYTFjMjM1OGZkZGM1MzkwYWU4NTA='”)." What the fuck.
  1738. // Jesus Christ, it might be yet again because of Tumblr's tightass settings:
  1739. // https://stackoverflow.com/questions/37298608/content-security-policy-the-pages-settings-blocked-the-loading-of-a-resource
  1740. // The function this still works in is on a "not found" page. The places it will not work are /archive pages.
  1741. // Yeah, on /mobile instead of /archive it works fine. Fuck you, Tumblr.
  1742. // Jesus, that means even onError can't work.
  1743. function image_size_options() {
  1744. var html_string = "Immediate: \t"; // Change <body> class to instantly resize images, temporarily
  1745.  
  1746. html_string += "<a class='interactive' href='javascript: void(0);' onclick=\"(function(o){ document.body.className = o; })('')\">Original image sizes</a> - ";
  1747. html_string += "<a class='interactive' href='javascript: void(0);' onclick=\"(function(o){ document.body.className = o; })('fixed-width')\">Snap columns</a> - ";
  1748. html_string += "<a class='interactive' href='javascript: void(0);' onclick=\"(function(o){ document.body.className = o; })('fixed-height')\">Snap rows</a> - ";
  1749. html_string += "<a class='interactive' href='javascript: void(0);' onclick=\"(function(o){ document.body.className = o; })('fit-width')\">Fit width</a> - ";
  1750. html_string += "<a class='interactive' href='javascript: void(0);' onclick=\"(function(o){ document.body.className = o; })('fit-height')\">Fit height</a> - ";
  1751. html_string += "<a class='interactive' href='javascript: void(0);' onclick=\"(function(o){ document.body.className = o; })('smart-fit')\">Fit both</a><br><br>";
  1752.  
  1753. html_string += "Persistent: \t"; // Reload page with different image mode that will stick for previous/next pages
  1754. html_string += "<a href=" + options_url( { thumbnails: "original" } ) + ">Original image sizes</a> - "; // This is the CSS default, so any other value works
  1755. html_string += "<a href=" + options_url( { thumbnails: "fixed-width" } ) + ">Snap columns</a> - ";
  1756. html_string += "<a href=" + options_url( { thumbnails: "fixed-height" } ) + ">Snap rows</a> - ";
  1757. html_string += "<a href=" + options_url( { thumbnails: "fit-width" } ) + ">Fit width</a> - ";
  1758. html_string += "<a href=" + options_url( { thumbnails: "fit-height" } ) + ">Fit height</a> - ";
  1759. html_string += "<a href=" + options_url( { thumbnails: "smart-fit" } ) + ">Fit both</a>";
  1760. return html_string;
  1761. }
  1762.  
  1763. // Return HTML for links to ?maxres versions of the same page, e.g. "_raw" versus "_1280"
  1764. function image_resolution_options() {
  1765. var html_string = "Maximum resolution: \t";
  1766. html_string += "<a href=" + options_url( { maxres: "raw" } ) + ">Raw</a> - "; // I'm not 100% sure "_raw" works anymore, but the error function handles it, so whatever.
  1767. html_string += "<a href=" + options_url( { maxres: "1280" } ) + ">1280</a> - ";
  1768. html_string += "<a href=" + options_url( { maxres: "500" } ) + ">500</a> - ";
  1769. html_string += "<a href=" + options_url( { maxres: "400" } ) + ">400</a> - ";
  1770. html_string += "<a href=" + options_url( { maxres: "250" } ) + ">250</a> - ";
  1771. html_string += "<a href=" + options_url( { maxres: "100" } ) + ">100</a>";
  1772. return html_string;
  1773. }
  1774.  
  1775. // Return links to other parts of Eza's Tumblr Scrape functionality, possibly excluding whatever you're currently doing
  1776. // Switch to full-size images - Toggle image size - Show one page at once - Scrape whole Tumblr - (Experimental fetch-every-post image browser)
  1777. // This mode is under development and subject to change. - Return to original image browser
  1778. function html_ezastumblrscrape_options() {
  1779. let html_string = "";
  1780.  
  1781. // "You are browsing" text? Tell people where they are and what they're looking at.
  1782. html_string += "<a href='" + options_url( { scrapemode: "scrapewholesite" } ) + "'>Scrape whole Tumblr</a> - ";
  1783. html_string += "<a href='" + options_url( { scrapemode: "pagebrowser" } ) + "'>Browse images</a> - "; // Default mode; so any value works
  1784. html_string += "<a href='" + options_url( { scrapemode: "everypost" } ) + "'>(Experimental fetch-every-post image browser)</a> ";
  1785.  
  1786. return html_string;
  1787. }
  1788.  
  1789. function error_function( url ) {
  1790. // This clunky <img onError> function looks for a lower-res image if the high-res version doesn't exist.
  1791. // Surprisingly, this does still matter. E.g. http://66.media.tumblr.com/ba99a55896a14a2e083cec076f159956/tumblr_inline_nyuc77wUR01ryfvr9_500.gif
  1792. // This might mismatch _100 images and _250 links because of that self-erasing clause... but it's super rare, so meh.
  1793. let on_error = 'if(this.src.indexOf("_raw")>0){this.src=this.src.replace("_raw","_1280").replace("//media","//66.media");}'; // Swap _raw for 1280, add CDN number
  1794. on_error += 'else if(this.src.indexOf("_1280")>0){this.src=this.src.replace("_1280","_500");}'; // Swap 1280 for 500
  1795. on_error += 'else if(this.src.indexOf("_500")>0){this.src=this.src.replace("_500","_400");}'; // Or swap 500 for 400
  1796. on_error += 'else if(this.src.indexOf("_400")>0){this.src=this.src.replace("_400","_250");}'; // Or swap 400 for 250
  1797. on_error += 'else{this.src=this.src.replace("_250","_100");this.onerror=null;}'; // Or swap 250 for 100, then give up
  1798. on_error += 'document.getElementById("' + encodeURI( url ) + '").href=this.src;'; // Link the image to itself, regardless of size
  1799.  
  1800. // 2020: This has to get more complicated, or at least more verbose, to support shitty new image URLs.
  1801. // https://66.media.tumblr.com/eb7d40a8683e623f173f81fc253056dc/e4277131a4c174c7-a5/s1280x1920/5b3c54b53a09f91b08f9af7fb88e30a253f3c5db.jpg
  1802. //'s400x600'
  1803. //'s500x750'
  1804. //'s640x960'
  1805. //'s1280x1900'
  1806. // 's2048x3072'
  1807. // This has to go in Image Glutton, because Tumblr only ever gets worse
  1808.  
  1809. // Start over.
  1810. // on_error = 'this.src = this.src.replace( "_250", "_100" ); '; // Unconditional, reverse ladder pattern. Don't Google that. I just made it up.
  1811. on_error = "let oldres = [ '_100', '_250', '_400', '_500', '_1280' ]; "; // Inverted quotes, Eh.
  1812. on_error += "let newres = [ 's400x600', 's500x750', 's640x960', 's1280x1920' ];"; // Should probably include slashes. Rare collisions.
  1813. // on_error += "for( let i = old.length-1; i > 1; i-- ) { this.src.replace( old[i], old[i-1] ); } "; // 1280 -> 500, 500 -> 400... shit.
  1814. on_error += "for( let i = 1; i < oldres.length; i++ ) { this.src = this.src.replace( oldres[i+1], oldres[i] ); } "; // 250 -> 100, 400 -> 250, etc. One step per error.
  1815. on_error += "for( let i = 1; i < newres.length; i++ ) { this.src = this.src.replace( newres[i+1], newres[i] ); } ";
  1816. on_error += 'document.getElementById("' + encodeURI( url ) + '").href=this.src;'; // Original quotes - be cautious when editing
  1817.  
  1818. // God dammit. The links work, but the embedded images don't.
  1819. // https://66.media.tumblr.com/eb7d40a8683e623f173f81fc253056dc/e4277131a4c174c7-a5/s1280x1920/ba0f0dd5d5e76726639782f57c4b732f156a5219.jpg
  1820. // https://66.media.tumblr.com/eb7d40a8683e623f173f81fc253056dc/e4277131a4c174c7-a5/s1280x1920/5b3c54b53a09f91b08f9af7fb88e30a253f3c5db.jpg
  1821.  
  1822. /*
  1823. x = 'https://66.media.tumblr.com/eb7d40a8683e623f173f81fc253056dc/e4277131a4c174c7-a5/s1280x1920/44b457df0d9826135b062563cd802afdfe496888.jpg'
  1824. newres = [ 's400x600', 's500x750', 's640x960', 's1280x1920' ];
  1825. for( let i = 1; i < newres.length; i++ ) { this.src.replace( newres[i+1], newres[i] ); }
  1826. */
  1827.  
  1828. return on_error;
  1829. }
  1830.  
  1831.  
  1832.  
  1833.  
  1834.  
  1835.  
  1836.  
  1837.  
  1838.  
  1839. // ------------------------------------ Universal page-scraping function (and other helper functions) ------------------------------------ //
  1840.  
  1841.  
  1842.  
  1843.  
  1844.  
  1845.  
  1846.  
  1847.  
  1848.  
  1849. // Add URLs from a 'blank' page to page_dupe_hash (without just calling soft_scrape_page_promise and ignoring its results)
  1850. function exclude_content_example( url ) {
  1851. fetch( url, { credentials: 'include' } ).then( r => r.text() ).then( text => {
  1852. let links = links_from_page( text );
  1853. links = links.filter( novelty_filter ); // Novelty filter twice, because image_standardizer munges some /post URLs
  1854. links = links.map( image_standardizer );
  1855. links = links.filter( novelty_filter );
  1856. } )
  1857. // No return value
  1858. }
  1859.  
  1860. // Spaghetti to reduce redundancy: given a page's text, return a list of URLs.
  1861. function links_from_page( html_copy ) {
  1862. // Cut off the page at the "More you might like" / "Related posts" footer, on themes that have one
  1863. html_copy = html_copy.split( '="related-posts' ).shift();
  1864.  
  1865. let http_array = html_copy.split( /['="']http/ ); // Regex split on anything that looks like a source or href declaration
  1866. http_array.shift(); // Ditch first element, which is just <html><body> etc.
  1867. http_array = http_array.map( s => { // Theoretically parallel .map instead of maybe-linear .forEach or low-level for() loop
  1868. if( s.indexOf( "&" ) > -1 ) { s = htmlDecode( s ); } // Yes a fucking &quot should match a goddamn regex for terminating on quotes!
  1869. s = s.split( /['<>"']/ )[0]; // Terminate each element (split on any terminator, take first subelement)
  1870. s = s.replace( /\\/g, '' ); // Remove escaping backslashes (e.g. http\/\/ -> http//)
  1871. if( s.indexOf( "%3A%2F%2F" ) > -1 ) { s = decodeURIComponent( s ); } // What is with all the http%3A%2F%2F URLs?
  1872. // s = s.split( '&quot;' )[0]; // Yes these count as doublequotes you stupid broken scripting language.
  1873. return "http" + s; // Oh yeah, add http back in (regex eats it)
  1874. } ) // http_array now contains an array of strings that should be URLs
  1875.  
  1876. let post_array = html_copy.split( /['="']\/post/ ); // Regex split on anything that defines looks similar to a src="/post" link
  1877. post_array.shift(); // Ditch first element, which is just <html><body> etc.
  1878. post_array = post_array.map( s => { // Theoretically parallel .map instead of maybe-linear .forEach or low-level for() loop
  1879. s = s.split( /['<>"']/ )[0]; // Terminate each element (split on any terminator, take first subelement)
  1880. return window.location.protocol + "//" + window.location.hostname + "/post" + s; // Oh yeah, add /post back in (regex eats it)
  1881. } ) // post_array now contains an array of strings that should be photoset URLs
  1882.  
  1883. http_array = http_array.concat( post_array ); // Photosets are out of order again. Blar.
  1884.  
  1885. return http_array;
  1886. }
  1887.  
  1888. // Filter: Return false for typical Tumblr nonsense (JS, avatars, RSS, etc.)
  1889. function tumblr_blacklist_filter( url ) {
  1890. if( url.indexOf( "/reblog/" ) > 0 ||
  1891. url.indexOf( "/tagged/" ) > 0 || // Might get removed so the script can track and report tag use. Stupid art tags like 'my-draws' or 'art-poop' are a pain to find.
  1892. url.indexOf( ".tumblr.com/avatar_" ) > 0 ||
  1893. url.indexOf( ".tumblr.com/image/" ) > 0 ||
  1894. url.indexOf( ".tumblr.com/rss" ) > 0 ||
  1895. url.indexOf( "srvcs.tumblr.com" ) > 0 ||
  1896. url.indexOf( "assets.tumblr.com" ) > 0 ||
  1897. url.indexOf( "schema.org" ) > 0 ||
  1898. url.indexOf( ".js" ) > 0 ||
  1899. url.indexOf( ".css" ) > 0 ||
  1900. url.indexOf( "twitter.com/intent" ) > 0 || // Weirdly common now
  1901. url.indexOf( "tmblr.co/" ) > 0 ||
  1902. //https://66.media.tumblr.com/dea691ec31c9f719dd9b057d0c2be8c3/a533a91e352b4ac7-29/s16x16u_c1/f1781cbe6a6eb7417aa6c97e4286d46372e5ba37.jpg
  1903. url.indexOf( "s16x16u" ) > 0 || // Avatars
  1904. url.indexOf( "s64x64u" ) > 0 || // Avatars
  1905. //https://66.media.tumblr.com/49fac1010ada367bcaee721ea49d0de5/3ddc8ebad67ae55a-ca/s64x64u_c1/592c805c7ce2c44cd389421697e0360b83d89f37.jpg
  1906. url.indexOf( "u_c1/" ) > 0 || // Avatars
  1907. // https://66.media.tumblr.com/a4fe5011f9e07f95588c789128b60dca/603e412a60dc5227-4a/s64x64u_c1_f1/7b50fe887c756759973d442a89ef86ab1359f6e5.gif
  1908. url.indexOf( "ezastumblrscrape" ) > 0 ) // Somehow this script is running on pages being fetched, inserting a link. Okay. Sure.
  1909. { return false } else { return true }
  1910. }
  1911.  
  1912. // Return standard canonical URL for various resizes of Tumblr images - size of _1280, single CDN
  1913. // 10/14 - ?usesmall seems to miss the CDN sometimes?
  1914. // e.g. http://mooseman-draws.tumblr.com/archive?startpage=1?pagesatonce=5?thumbnails?ezastumblrscrape?scrapemode=everypost?lastpage=37?usesmall
  1915. // https://66.media.tumblr.com/d970fff86185d6a51904e0047de6e764/tumblr_ookdvk7foy1tf83r7o1_400.png sometimes redirects to 78.media and _raw. What?
  1916. // Oh, probably not my script. I fucked with the Tumblr redirect script I use, but didn't handle the lack of CDN in _raw sizes.
  1917. // 2020: tumblr scrape:
  1918. // https://66.media.tumblr.com/b1515c45637955e1e52ec213944db662/08f2b2e47bbce703-70/s400x600/9077891dda98f127c040bd77581014ce4019fe75.gifv
  1919. // obviously can go to .gif, but that .gif saves as tumblr_b1515c45637955e1e52ec213944db662_9077891d_400.gif.
  1920. // Also, bumping up to /s500x750 works, so I can probably still slam everything to maximum size via 'canonical' urls.
  1921. function image_standardizer( url ) {
  1922. // Some lower-size images are automatically resized. We'll change the URL to the maximum size just in case, and Tumblr will provide the highest resolution.
  1923. // Replace all resizes with _1280 versions. Nearly all _1280 URLs resolve to highest-resolution versions now, so we don't need to e.g. handle GIFs separately.
  1924. // Oh hey, Tumblr now has _raw for a no-bullshit as-large-as-possible setting.
  1925. // _raw only works without the CDN - so //media.tumblr yes, but //66.media.tumblr no. This complicates things.
  1926. // Does //media and _raw always work? No, of course not. So we still need on_error.
  1927. // url = url.replace( "_540.", "_1280." ).replace( "_500.", "_1280." ).replace( "_400.", "_1280." ).replace( "_250.", "_1280." ).replace( "_100.", "_1280." );
  1928. let maxres = "1280"; // It is increasingly unlikely that _raw still works. Reconsider CDN handling if that's the case.
  1929. if( options_map.maxres ) { maxres = options_map.maxres } // If it's set, use it. Should be _100, _250, whatever. ?usesmall should set it to _400. ?notraw, _1280.
  1930. maxres = "_" + maxres + "."; // Keep the URL options clean: "400" instead of "_400." etc.
  1931. url = url.replace( "_raw", maxres ).replace( "_1280.", maxres ).replace( "_640.", maxres ).replace( "_540.", maxres )
  1932. .replace( "_500.", maxres ).replace( "_400.", maxres ).replace( "_250.", maxres ).replace( "_100.", maxres );
  1933.  
  1934. // henrythehangman.tumblr.com has doubled images from /image posts in ?scrapemode=everypost. Lots of _1280.jpg?.jpg nonsense.
  1935. // Is that typical for tumblrs with this theme? It's one of those annoying magnifying-glass-on-hover deals. If it's just that one weird fetish site, remove this later.
  1936. url = url.split('?')[0]; // Ditch anything past the first question mark, if one exists
  1937. url = url.split('&')[0]; // Ditch anything past the first ampersand, if one exists - e.g. speikobrarote.tumblr.com
  1938. if( url.indexOf( 'tumblr.com' ) > 0 ) { url = url.split( ' ' )[0]; } // Ditch anything past a trailing space, if one exists - e.g. cinnasmut.tumblr.com
  1939.  
  1940. // Standardize media subdomain / CDN subsubdomain, to prevent duplicates and fix _1280 vs _raw complications.
  1941. if( url.indexOf( '.media.tumblr.com/' ) > 0 ) {
  1942. let url_parts = url.split( '/' )
  1943. url_parts[2] = '66.media.tumblr.com'; // This came first. Then //media.tumblr.com worked, even for _raw. Then _raw went away. Now it needs a CDN# again. Bluh.
  1944. // url_parts[2] = 'media.tumblr.com'; // 2014: write a thing. 2016: comment out old thing, write new thing. 2018: uncomment old thing, comment new thing. This script.
  1945. url = url_parts.join( '/' ).replace( 'http:', 'https:' );
  1946.  
  1947. /* // Doesn't work - URLs resolve to HTML when clicked, but don't embed as images. Fuck this website.
  1948. // I need some other method for recognizing duplicates - the initial part of the URL is the same.
  1949. // Can HTML5 / CSS suppress certain elements if another element is present?
  1950. // E.g. #abcd1234#640 is display:none if #abcd1234#1280 exists.
  1951. // https://support.awesome-table.com/hc/en-us/articles/115001399529-Use-CSS-to-change-the-style-of-each-row-depending-on-the-content
  1952. // The :empty pseudoselector works like :hover. And you can condition it on attributes, like .picture[src=""]:empty.
  1953. // The :empty + otherSelector{} syntax is weird, but it's CSS, so of course it's weird.
  1954. // Here we'd... generate a style element? Probably just insert a <style> block in innerHTML, honestly. Is inline an option?
  1955. // http://rafaelservantez.com/writings_tutorials_web_css_cond.html
  1956. // Ech, + requires immediate adjacency. ~ selects within siblings, but only siblings "after" the first element.
  1957.  
  1958. // Late 2020: shitty new URLs don't contain _tumblr.
  1959. if( url.indexOf( '_tumblr' ) < 0 ) { // Should be !String.match, but Christ, one problem at a time.
  1960. let othermax = 's1280x1920';
  1961. url = url.replace( 's400x600', othermax ).replace( 's500x750', othermax ).replace( 's640x960', othermax );
  1962. //'s400x600'
  1963. //'s500x750'
  1964. //'s640x960'
  1965. //'s1280x1920'
  1966. }
  1967. */
  1968.  
  1969. }
  1970. /* // ?notraw and ?usesmall are deprecated. Use ?maxres=1280 or ?maxres=400 instead.
  1971. // Change back to _1280 & CDN for ?scrapewholesite (which does no testing). _raw is unreliable.
  1972. if( options_map.scrapemode == 'scrapewholesite' || options_map.notraw ) {
  1973. url = url.replace( "_raw.", "_1280." ).replace( '//media', '//66.media' );
  1974. }
  1975. // Change to something smaller for quick browsing, like _500 or _250
  1976. // https://78.media.tumblr.com/166565b897f228352069b290067215c0/tumblr_oozoucFu0O1v1bsoxo2_raw.jpg etc don't work. What?
  1977. if( options_map.scrapemode != 'scrapewholesite' && options_map.usesmall ) { // Ignore this on scrapewholesite; that'd be a dumb side effect.
  1978. url = url.replace( "_raw.", "_400." ).replace( "_1280.", "_400." ).replace( '//media', '//66.media' );
  1979. }
  1980. */
  1981.  
  1982. return url;
  1983. }
  1984.  
  1985. // Remove duplicates from an array (from an iterable?) - returns de-duped array
  1986. // Credit to http://stackoverflow.com/questions/9229645/remove-duplicates-from-javascript-array for hash-based string method
  1987. function remove_duplicates( list ) {
  1988. let seen = {};
  1989. list = list.filter( function( item ) {
  1990. return seen.hasOwnProperty( item ) ? false : ( seen[ item ] = true );
  1991. } );
  1992. return list;
  1993. }
  1994.  
  1995. // Filter: Return true ONCE for any given string.
  1996. // Global duplicate remover - return false for items found in page_dupe_hash, otherwise add new items to it and return true
  1997. // Now also counts instances of each non-unique argument
  1998. function novelty_filter( url ) {
  1999. // return page_dupe_hash.hasOwnProperty( url ) ? false : ( page_dupe_hash[ url ] = true );
  2000. // console.log( page_dupe_hash ); // DEBUG
  2001. url = url.toLowerCase(); // Debug-ish, mostly for "tag overview." URLs can be case-sensitive but collisions will be rare. "Not all images are guaranteed to appear."
  2002. if( page_dupe_hash.hasOwnProperty( url ) ) {
  2003. page_dupe_hash[ url ] += 1;
  2004. return false;
  2005. } else {
  2006. page_dupe_hash[ url ] = 1;
  2007. return true;
  2008. }
  2009. }
  2010.  
  2011. // Filter: Return true for any Tumblr /tagged URL.
  2012. // This is used separately from the other filters. Those go X=X.filter( remove stuff ), X=X.filter( remove other stuff ). This should run first, like Y=X.filter( return_tags ).
  2013. function return_tags( url ) {
  2014. // return url.indexOf( '/tagged' ) > 0; // 'if( condition ) { return true } else { return false }' === return condition.
  2015. if( url.indexOf( '/tagged' ) > 0 ) { return true; } else { return false; }
  2016. }
  2017.  
  2018. // Viewing a bare image, redirect to largest available image size.
  2019. // // e.g. https://66.media.tumblr.com/d020ac15b0e04ff9381f246ed08c9f05/tumblr_o2lok9yf8O1ub76b5o1_1280.jpg
  2020. // This is a sloppy knockoff of Tumblr Image Size 1.1, because that script doesn't affect //media.tumblr.com URLs.
  2021. // Also, this is a distinct mode, not a helper function. Maybe just move it up to the start?
  2022. function maximize_image_size() {
  2023. if( window.location.href.indexOf( "_1280." ) > -1 ) { return false; } // If it's already max-size, die.
  2024. if( window.location.href.indexOf( "_raw." ) > -1 ) { return false; } // If it's already max-size, die.
  2025. // Nope, still broken. E.g. https://78.media.tumblr.com/tumblr_mequmpAxgp1r9tb2u.gif
  2026. // Needs a positive test for all the replacements... or a check if this URL is different from the changed URL.
  2027.  
  2028. // This should probably use image_standardizer to avoid duplicate code.
  2029. let replacement = image_standardizer( window.location.href );
  2030. if( window.location.href != replacement ) { window.location.href = replacement; }
  2031. /*
  2032. let maxres = "_1280.";
  2033. // maxres = "_" + maxres + "."; // Keep the URL options clean: "400" instead of "_400." etc.
  2034. url = window.location.href;
  2035. url = url.replace( "_1280.", maxres ).replace( "_540.", maxres ).replace( "_500.", maxres ).replace( "_400.", maxres ).replace( "_250.", maxres ).replace( "_100.", maxres );
  2036. if( window.location.href != url ) { window.location.href = url; }
  2037. */
  2038. }
  2039.  
  2040. // Decode entity references.
  2041. // This is officially the stupidest polyfill I've ever used. How is there an entire class of standard escaped text with NO standard decoder in Javascript?
  2042. // Copied straight from https://stackoverflow.com/questions/1912501/unescape-html-entities-in-javascript because decency was not an option.
  2043. function htmlDecode(input) {
  2044. return new DOMParser().parseFromString(input, "text/html").documentElement.textContent;
  2045. }
  2046.  
  2047. // Convert fetch()'d URLs into local URLs. Offsite URLs will 404 instead of triggering CORS as an unhandled error. This language.
  2048. // Hmm. This works, but the test case I thought I had worked anyway (with plain fetches) after several refreshes.
  2049. // Currently not used anywhere - needs further testing.
  2050. function safe_fetch( ){
  2051. // args[0] = args[0]; // Editing TBD
  2052. // var video_post = window.location.protocol + "//" + subdomain[4] + ".tumblr.com/post/" + subdomain[5] + "/";
  2053. // let local_root = window.location.protocol + "//" + window.location.host + "/";
  2054. // arguments[0] = arguments[0]; // Editing TBD - arguments is the argv for JS.
  2055. arguments[0] = '/' + arguments[0].split( '/' ).slice( 3 ).join( '/' );
  2056. return fetch( ... arguments ); // Straightforward. If this returns a thing that returns a promise, safe_fetch.then() -should- "just work."
  2057. }
  2058.  
  2059.  
  2060.  
  2061.  
  2062.  
  2063. // Given the bare HTML of a Tumblr page, return an array of Promises for image/video/link URLs
  2064. function soft_scrape_page_promise( html_copy ) {
  2065. // Linear portion:
  2066. let http_array = links_from_page( html_copy ); // Split bare HTML into link and image sources
  2067. http_array.filter( url => url.indexOf( '/tagged/' ) > 0 ).filter( novelty_filter ); // Track tags for statistics, before the blacklist removes them
  2068. http_array = http_array.filter( tumblr_blacklist_filter ); // Blacklist filter for URLs - typical garbage
  2069.  
  2070. function is_an_image( url ) {
  2071. // Whitelist URLs with image file extensions or Tumblr iframe indicators
  2072. var image_link = false;
  2073. if( url.indexOf( ".gif" ) > 0 ) { image_link = true; }
  2074. if( url.indexOf( ".jpg" ) > 0 ) { image_link = true; }
  2075. if( url.indexOf( ".jpeg" ) > 0 ) { image_link = true; }
  2076. if( url.indexOf( ".png" ) > 0 ) { image_link = true; }
  2077. if( url.indexOf( "/photoset_iframe" ) > 0 ) { image_link = true; }
  2078. if( url.indexOf( ".tumblr.com/video/" ) > 0 ) { image_link = true; }
  2079. if( url.indexOf( "/audio_player_iframe/" ) > 0 ) { image_link = true; }
  2080. return image_link;
  2081. }
  2082.  
  2083. // Separate the images
  2084. http_array = http_array.map( url => {
  2085. if( is_an_image( url ) ) { // If it's an image, get rid of any Tumblr variability about resolution or CDNs, to avoid duplicates with nonmatching URLs
  2086. return image_standardizer( url );
  2087. } else { // Else if not an image
  2088. if( url.indexOf( window.location.host ) > 0 ) { url += "#local" } else { url += "#offsite" } // Mark in-domain vs. out-of-domain URLs.
  2089. if( options_map.imagesonly ) { return ""; } // ?imagesonly to skip links on ?scrapewholesite
  2090. return url + "#link";
  2091. }
  2092. } )
  2093. .filter( n => { // Remove all empty strings, where "empty" can involve a lot of #gratuitous #tags.
  2094. if( n.split("#")[0] === "" ) { return false } else { return true }
  2095. } );
  2096.  
  2097. http_array = remove_duplicates( http_array ); // Remove duplicates within the list
  2098. http_array = http_array.filter( novelty_filter ); // Remove duplicates throughout the page
  2099. // Should this be skipped on scrapewholesite? Might be slowing things down.
  2100.  
  2101. // Async portion:
  2102. // Return promise that resolves to list of URLs, including fetched videos and photoset sub-images
  2103. return Promise.all( http_array.map( s => {
  2104. if( s.indexOf( '/photoset_iframe' ) > 0 ) { // If this URL is a photoset, return a promise for an array of URLs
  2105. return fetch( s, { credentials: 'include' } ).then( r => r.text() ).then( text => { // Fetch URL, get body text from response
  2106. var photos = text.split( 'href="' ); // Isolate photoset elements from href= declarations
  2107. photos.shift(); // Get rid of first element because it's everything before the first "href"
  2108. photos = photos.map( p => p.split( '"' )[0] + "#photoset" ); // Tag all photoset images as such, just because
  2109. photos[0] += "#" + s; // Tag first image in set with photoset URL so browse mode can link to it
  2110. return photos;
  2111. } )
  2112. }
  2113. else if ( s.indexOf( '.tumblr.com/video/' ) > 0 ) { // Else if this URL is an embedded video, return a Tumblr-standard URL for the bare video file
  2114. var subdomain = s.split( '/' ); // E.g. https://www.tumblr.com/video/examplename/123456/500/ -> https,,www.tumblr.com,video,examplename,123456,500
  2115. var video_post = window.location.protocol + "//" + subdomain[4] + ".tumblr.com/post/" + subdomain[5] + "/";
  2116. // e.g. http://examplename.tumblr.com/post/123456/ - note window.location.protocol vs. subdomain[0], maintaining http/https locally
  2117.  
  2118. return fetch( video_post, { credentials: 'include' } ).then( r => r.text() ).then( text => {
  2119. if( text.indexOf( 'og:image' ) > 0 ) { // property="og:image" content="http://67.media.tumblr.com/tumblr_123456_frame1.jpg" --> tumblr_123456_frame1.jpg
  2120. var video_name = text.split( 'og:image' )[1].split( 'media.tumblr.com' )[1].split( '"' )[0].split( '/' ).pop();
  2121. } else if( text.indexOf( 'poster=' ) > 0 ) { // poster='https://31.media.tumblr.com/tumblr_nuzyxqeJNh1rjoppl_frame1.jpg'
  2122. var video_name = text.split( "poster='" )[1].split( 'media.tumblr.com' )[1].split( "'" )[0].split( '/' ).pop(); // Bandaid solution. Tumblr just sucks.
  2123. } else {
  2124. return video_post + '#video'; // Current methods miss the whole page if these splits miss, so fuck it, just return -something.-
  2125. }
  2126.  
  2127. // tumblr_abcdef12345_frame1.jpg -> tumblr_abcdef12345.mp4
  2128. video_name = "tumblr_" + video_name.split( '_' )[1] + ".mp4#video";
  2129. video_name = "https://vt.tumblr.com/" + video_name; // Standard Tumblr-wide video server
  2130.  
  2131. return video_name; // Should be e.g. https://vt.tumblr.com/tumblr_abcdef12345.mp4
  2132. } )
  2133. }
  2134. else if ( s.indexOf( "/audio_player_iframe/" ) > 0 ) { // Else if this URL is an embedded audio file, return... well not a standard URL perhaps, but an URL.
  2135. // How the fuck do I download audio? Video works-ish. Audio is not well-supported.
  2136. // http://articulatelydecomposed.tumblr.com/post/176171225450/you-know-i-had-to-do-it-the-new-friendsim-had-a
  2137. // Ctrl+F "plays". Points to:
  2138. // http://articulatelydecomposed.tumblr.com/post/176171225450/audio_player_iframe/articulatelydecomposed/tumblr_pcaffm6o6H1rc8keu?audio_file=https%3A%2F%2Fwww.tumblr.com%2Faudio_file%2Farticulatelydecomposed%2F176171225450%2Ftumblr_pcaffm6o6H1rc8keu&color=white&simple=1
  2139. // Still no bare .mp3 link in that iframe.
  2140. // data-stream-url="https://www.tumblr.com/audio_file/articulatelydecomposed/176171225450/tumblr_pcaffm6o6H1rc8keu" ?
  2141. // Indeed. Actually that might be in the iframe link. What's that function for decoding URIs?
  2142. // Wait, the actual file resolves to https://a.tumblr.com/tumblr_pcaffm6o6H1rc8keuo1.mp3 - this is trivial.
  2143. // let audio_name = s.split( '/' ).pop(); // Ignore text before last slash.
  2144. // audio_name = audio_name.split( '?' ).shift(); // Ignore text after a question mark, if there is one.
  2145. // audio_name should now look something like 'tumblr_12345abdce'.
  2146. // return 'https://a.tumblr.com/' + audio_name + '1.mp3#FUCK'; // Standard tumblr-wide video server. Hopefully.
  2147.  
  2148. // Inconsistent.
  2149. // e.g. http://articulatelydecomposed.tumblr.com/post/176307067285/mintchocolatechimp-written-by-reddit-user shows
  2150. // https://a.tumblr.com/tumblr_pce3snfthB1ufch4go1.mp3#offsite#link which works, but also
  2151. // https://a.tumblr.com/tumblr_pce3snfthB1ufch4go1.mp3&color=white&simple=1#offsite#link which doesn't and
  2152. // https://a.tumblr.com/tumblr_pce3snfthB1ufch4g1.mp3 which also doesn't.
  2153. // ... both of the 'go1' links are present even without this iframe-based code, because that audio has a 'download' link.
  2154.  
  2155. // Yeah, so much for the simple answer. Fetch. No URL processing seems necessary - 's' is already on this domain.
  2156. return fetch( s, { credentials: 'include' } ).then( r => r.text() ).then( text => {
  2157. // data-stream-url="https://a.tumblr.com/tumblr_pce3snfthB1ufch4go1.mp3"
  2158. let data_url = text.split( 'data-stream-url="' )[1]; // Drop everything before the file declaration.
  2159. data_url = data_url.split( '"' )[0]; // Drop everything after the doublequote. Probably not efficient, but fuck regexes.
  2160. return data_url + "#audio";
  2161. } )
  2162. // Alright, well now it sort of works, but sometimes it returns e.g.
  2163. // https://www.tumblr.com/audio_file/articulatelydecomposed/176010754365/tumblr_pc1pq1TsD31rc8keu#FRIG which resolves to
  2164. // https://a.tumblr.com/tumblr_pc1pq1TsD31rc8keuo1.mp3#FRIG
  2165. // Huh. That -is- the correctly-identified "data-stream-url" in some iframes. Shit.
  2166. // This is set up to handle multiple return values, right? For photosets? So I -could- return the correct URL and some guesses.
  2167. // But at that point - why not just guess from the iframe URL? +1.mp3 and also +o1.mp3.
  2168. // Can't just make up a username or post number for the /audio_file sort of URL.
  2169. }
  2170. return Promise.resolve( [s] ); // Else if this URL is singular, return a single element... resolved as a promise for Promise.all, in an array for Array.concat. Whee.
  2171. } ) )
  2172. .then( nested_array => { // Given the Promise.all'd array of resolved URLs and URL-arrays
  2173. return [].concat.apply( [], nested_array ); // Concatenate array of arrays - apply turns array into comma-separated list, concat turns CSL of arrays into a single array
  2174. } )
  2175. }
  2176.  
  2177.  
  2178.  
  2179.  
  2180.  
  2181. // Returns a URL with all the options_map options in ?key=value format - optionally allowing changes to options in the returned URL
  2182. // Valid uses:
  2183. // options_url() -> all current settings, no changes
  2184. // options_url( "name", number ) -> ?name=number
  2185. // options_url( "name", true ) -> ?name
  2186. // options_url( {name:number} ) -> ?name=number
  2187. // options_url( {name:number, other:true} ) -> ?name=number?other
  2188. // Note that simply passing "name" will remove ?name, not add it, because the value will evaluate false. I should probably change this? Eh, { key } without :value causes errors.
  2189. function options_url( key, value ) {
  2190. var copy_map = new Object();
  2191. for( var i in options_map ) { copy_map[ i ] = options_map[ i ]; }
  2192. // In any sensible language, this would read "copy_map = object_map." Javascript genuinely does not know how to copy objects. Fuck's sake.
  2193.  
  2194. if( typeof key === 'string' ) { // the parameters are optional. just calling options_url() will return e.g. example.tumblr.com/archive?ezastumblrscrape?startpage=1
  2195. if( !value ) { value = false; } // if there's no value then use false
  2196. copy_map[ key ] = value; // change this key, so we can e.g. link to example.tumblr.com/archive?ezastumblrscrape?startpage=2
  2197. }
  2198.  
  2199. else if( typeof key === 'object' ) { // If we're passed a hashmap
  2200. for( var i in key ) {
  2201. if( ! key[ i ] ) { key[ i ] = false; } // Turn any false evaluation into an explicit boolean - this might not be necessary
  2202. copy_map[ i ] = key[ i ]; // Press key-object values onto copy_map-object values
  2203. }
  2204. }
  2205.  
  2206. // Construct URL from options
  2207. var base_site = window.location.href.substring( 0, window.location.href.indexOf( "?" ) ); // should include /archive, but if not, it still works on most pages
  2208. for( var k in copy_map ) { // JS maps are weird. We're actually setting attributes of a generic object. So map[ "thumbnails" ] is the same as map.thumbnails.
  2209. if( copy_map[ k ] ) { // Unless the value is False, print a ?key=value pair.
  2210. base_site += "?" + k;
  2211. if( copy_map[ k ] !== true ) { base_site += "=" + copy_map[ k ]; } // If the value is boolean True, just print the value as a flag.
  2212. }
  2213. }
  2214. return base_site;
  2215. }
  2216.  
  2217.  
  2218.  
  2219.  
  2220.  
  2221.  
  2222.  
  2223.  
  2224.  
  2225.  
  2226.  
  2227.  
  2228.  
  2229.  
  2230.  
  2231.  
  2232.  
  2233.  
  2234.  
  2235.  
  2236.  
  2237.  
  2238.  
  2239.  
  2240.  
  2241.  
  2242.  
  2243.  
  2244.  
  2245.  
  2246.  
  2247.  
  2248.  
  2249.  
  2250.