Hey everybody, the ArchiveTeam tumblr tracker is up and running!
If you have resources, please install ArchiveTeam's warrior program to contribute to the project! We're already up to 11TB and 187 million pages archived, but we're going to need a lot more help to get all the NSFW content before the 17th!
do you need a whole lot of disk space sitting around to do this?
could get my desktop running this in the background
@KitRedgrave disk space is one of our biggest obstacles to parallelization right now, yeah. However, one thing to mention is that the warrior has a pretty low RAM limit by default, so if you want to crank the concurrency up (to take advantage of the disk space), you should increase the default RAM limit in the appliance (or run the grab scripts standalone)
@nightpool i happen to have a freebsd server sitting mostly idle apart from serving Minecraft
could be useful
Of course, if it needs to compile something like psycopg2, then you should probably install that from ports and use system packages within the virtualenv.
Yes, as a FreeBSD admin, you likely knew this already, but in case you didn't think of the virtualenv possibility, throwing it out there.
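(For anyone following along: the "system packages inside the virtualenv" trick is just the `--system-site-packages` flag. A minimal sketch using only the Python stdlib `venv` module, assuming you've already installed the compiled package, e.g. psycopg2, from ports/pkg:)

```python
# Sketch: create a virtualenv that can see system-wide (ports-installed)
# packages, so a compiled module like psycopg2 installed via pkg/ports is
# importable inside the env. Uses only the stdlib venv module.
import tempfile
import venv
from pathlib import Path

env_dir = Path(tempfile.mkdtemp()) / "warrior-env"
venv.create(env_dir, system_site_packages=True, with_pip=False)

# The flag is recorded in pyvenv.cfg; tools inside the env will fall back
# to the system copy of any package not installed in the env itself.
cfg = (env_dir / "pyvenv.cfg").read_text()
print("include-system-site-packages = true" in cfg.lower())  # → True
```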
@Truck (just as like.... a general note, maybe check to see if your advice is relevant before giving it?)
@nightpool You're right, we old unix admins with years of experience should just fuck right off, nothing we ever say could possibly be useful and stop sharing information. We are useless and should go die.
How do I block someone on this? Back in the day, you'd be added to my killfile. What is the equivalent of a killfile? Because you need plonked.
@nightpool I bumped up the RAM limit (and concurrent uploads and downloads) on mine. Does it make good use of extra cores?
@qwertystop ish? i mean, it's not like it's a particularly CPU-bound operation, so there's no real benefit to giving it more than 1 core
@nightpool I'd assume compression is happening; doesn't that take a good chunk of CPU? Plus some sort of metadata aggregation, going by step names.
@nightpool Is something wrong with it? I've been getting "No items available currently" on both attempts (separated by several hours) today.
@Sir_Boops archive warrior itself is not particularly slow, these jobs are just exceedingly big (they cover nearly an entire blog each), and wget-lua, which is somewhat unparalleled in its WARC creation ability, is not as great at parsing html on the fly
Why parse html at all? The api exposes everything that's public XD
tumblr-utils is bae for backing up blogs
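(Context for the API point: tumblr-utils leans on Tumblr's old v1 read API, which serves a public blog's posts as paged JSON via `start`/`num` parameters, with `num` capped around 50 per request. A hedged sketch of the URL shape only, no network call; `api_read_url` is my own illustrative helper, not part of tumblr-utils:)

```python
# Illustrative helper: build a Tumblr v1 read-API URL for a public blog.
# The paging scheme (start offset + num per page, num max ~50) is what
# backup tools iterate over; this is URL construction only.
from urllib.parse import urlencode

def api_read_url(blog: str, start: int = 0, num: int = 50) -> str:
    """Return the v1 JSON read URL for one page of a public blog's posts."""
    query = urlencode({"start": start, "num": num})
    return f"https://{blog}.tumblr.com/api/read/json?{query}"

print(api_read_url("staff", start=0, num=50))
# → https://staff.tumblr.com/api/read/json?start=0&num=50
```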
@frinkeldoodle rule of thumb is to make sure you have about 250 MB of RAM for each worker and 15 GB of disk space. If you're running in the warrior virtual appliance, you need to increase the default RAM limit or you will quickly run out
Running one warrior docker instance (without a RAM limit) and four instances of the script without warrior. Server has 20 GB RAM, 30 GB swap, and at least 1 TB of free space.
@nightpool sorry if this has been asked before, but do you know the reason for the switch from wpull to wget-lua for the warriors/scripts?
@PuppyJack data is uploaded to archive.org after the completion of a job, which may take many hours. the blogs archived will be made available as part of the Wayback Machine