Hey everybody, the ArchiveTeam tumblr tracker is up and running!

tracker.archiveteam.org/tumblr

If you have resources to spare, please install ArchiveTeam's Warrior program to contribute to the project! We're already up to 11TB and 187 million pages archived, but we're going to need a lot more help to get all the NSFW content before the 17th!

IRC is on EFnet; you can add blogs to be saved at goo.gl/RtXZEq

@nightpool it's a shame my VPSes' connections aren't being fully utilised, they're only doing a few MB/s
not sure if i can improve this since they're both quite low on storage

@f0x hmm, you might be able to tinker with job sizes and then run a higher concurrency.

@f0x i'm unsure if the job size is enforced by the tracker or requested by the client

@nightpool hmmm

do you need a whole lot of disk space sitting around to do this?

could get my desktop running this in the background

@nightpool actually those requirements are pretty doable

we all should be on this imo

@KitRedgrave disk space is one of our biggest obstacles to parallelization right now, yeah. However, one thing to mention is that the warrior has a pretty low RAM limit by default, so if you want to crank the concurrency up (to take advantage of the disk space), you should increase the default RAM limit in the appliance (or run the grab scripts standalone)
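
(if your warrior runs under VirtualBox, raising the cap is one command from the host; rough sketch below assumes the VM is named "archiveteam-warrior", check VBoxManage list vms for yours)

    # shut the warrior VM down first, then bump it to e.g. 2GB
    VBoxManage modifyvm "archiveteam-warrior" --memory 2048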

@nightpool i happen to have a freebsd server sitting mostly idle apart from serving Minecraft

could be useful

@KitRedgrave definitely! the instructions are here: github.com/ArchiveTeam/tumblr- and they're super simple! just 2 python dependencies

@nightpool "For FreeBSD: Honestly I have no idea."

that's a mood

i'll get it working

@KitRedgrave @nightpool if it's 2 python dependencies, it should run just fine from a virtualenv, and _theoretically_ that would mean pip install would handle everything.

Of course, if it needs to compile something like psycopg2, then you should probably install that from ports and use system packages within the virtualenv.

Yes, as a FreeBSD admin, you likely knew this already, but in case you didn't think of the virtualenv possibility, throwing it out there.
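
(rough sketch of the virtualenv route, assuming the grab repo lists its two dependencies in a requirements.txt, check its README; the second form pulls in ports-installed packages like psycopg2)

    virtualenv venv            # or: virtualenv --system-site-packages venv
    . venv/bin/activate
    pip install -r requirements.txt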

@Truck @KitRedgrave it also needs to compile wget-lua, i don't know what the current status of that is on FreeBSD (my guess is "works but not thoroughly tested")

@Truck (just as like.... a general note, maybe check to see if your advice is relevant before giving it?)

@nightpool You're right, we old unix admins with years of experience should just fuck right off, nothing we ever say could possibly be useful, and we should stop sharing information. We are useless and should go die.

How do i block someone on this? Back in the day, you'd be added to my killfile. What is the equivalent of a killfile? Because you need plonked.

@nightpool @Truck compiling wget-lua was the tricky part, actually, because the lua 5.1 port puts the headers and libraries in a slightly different place than what the automake config was expecting
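
(for reference, the usual autoconf workaround for that looks something like the below; the paths are a guess at the lua51 port's layout, double-check with pkg info -l lua51)

    # point configure at where the lua51 port actually puts headers and libs
    env CPPFLAGS="-I/usr/local/include/lua51" \
        LDFLAGS="-L/usr/local/lib" \
        LIBS="-llua-5.1" \
        ./configure && make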

@nightpool I bumped up the RAM limit (and concurrent uploads and downloads) on mine. Does it make good use of extra cores?

@qwertystop ish? i mean, it's not like it's a particularly CPU-bound operation, so there's no real benefit to giving it more than 1 core

@nightpool I'd assume compression is happening; doesn't that take a good chunk of CPU? Plus some sort of metadata aggregation, going by step names.

@qwertystop there is some compression happening, but it's vastly more IO-bound than CPU-bound

@nightpool Is something wrong with it? I've been getting "No items available currently" on both attempts (separated by several hours) today.

@nightpool 👏 They 👏 need 👏 a 👏 better 👏 client 👏

@Sir_Boops archive warrior itself is not particularly slow; these jobs are exceedingly big (they cover nearly an entire blog), and wget-lua, which is somewhat unparalleled in its WARC creation ability, is not as great at parsing html on the fly

@nightpool

Why parse html at all? the API exposes everything publicly XD

tumblr-utils is bae for backing up blogs :blobuwu:
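
(the keyless v1 endpoint tumblr-utils reads from looks roughly like this, if memory serves; blog name is just an example)

    curl 'https://staff.tumblr.com/api/read/json?start=0&num=50'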

@Sir_Boops pretty hard to shove stuff from tumblr-utils into the Wayback Machine

@nightpool Sadly

I was pulling from tumblr at 700+ Mb/s using it :<

@nightpool
Got 30 concurrent threads running, let's see how things look tomorrow morning :p

@frinkeldoodle rule of thumb is to make sure you have about 250MB of RAM for each worker and 15GB of disk space. If you're running in the warrior virtual appliance, you need to increase the default RAM limit or you will quickly run out
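
(for example, the 30 concurrent threads you mentioned work out to roughly 30 × 250MB ≈ 7.5GB of RAM, well past the appliance default)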

@nightpool
Running one warrior docker instance (without a RAM limit) and four instances of the script without warrior. Server has 20 GB RAM, 30 GB swap, and at least 1 TB of free space.
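
(for anyone else going the docker route, the usual invocation is something like the below; image name is from the ArchiveTeam warrior-dockerfile repo, 8001 is the warrior's web UI, and you could add --memory if you do want a cap)

    docker run -d -p 8001:8001 archiveteam/warrior-dockerfile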

@nightpool sorry if this has been asked before, but do you know the reason for the switch from wpull to wget-lua for the warriors/scripts?

@007 No, the scripts have been using wget-lua for most of the time i've been involved with them.

@nightpool Just started my VM. Where is the data going to? How do you access it?

@PuppyJack data is uploaded to archive.org after the completion of a job, which may take many hours. the blogs archived will be made available as part of the Wayback Machine
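
(once a blog is indexed, its snapshots show up under the usual wayback URL pattern, e.g. web.archive.org/web/*/someblog.tumblr.com, where someblog is just a placeholder)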
