wow the distilled version of gpt2 is like, loads faster

@violet reading some papers about quantizing the parameters down to 8 bits without much accuracy loss, and it's pretty cool
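
For a rough feel of what 8-bit quantization means, here is a minimal sketch. It assumes a simple symmetric per-tensor scheme (one scale factor, values rounded into the int8 range), which is not necessarily what those papers do; `quantize_int8` and `dequantize` are hypothetical helper names, not a library API.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale (assumed scheme)."""
    # Fall back to a scale of 1.0 if every weight is zero, to avoid dividing by zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

Each restored weight is off by at most about half of `scale`, which is why the accuracy loss can stay small when the weights in a tensor have similar magnitudes.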

@violet I'm not talking about the training step I'm talking about the evaluation step

I wonder if anyone has performance optimized neural networks yet

I tried getting GPT-2 working on my computer, but my graphics card is too old to use CUDA, so I had to use a single thread of my CPU (pytorch doesn't know how to multi-process apparently) and I got on average 1 token generated every 5 seconds 😩

@RumPartov what is your opinion on powerful language models that can generate plausible text extremely quickly?

@halcy well, my understanding is GPT-2 actually has quite a bit of long-scale context at its disposal, so lack of context isn't that big of a problem?

@SuricrasiaOnline better example: train a model to predict a (fair) coin toss result

say it learns it's heads 51% of the time and tails 49% of the time

if you sample using T=1, you'll get something like THHTTHTTHHHTHTH

if you sample using T=0 (ie max), you'll get HHHHHHHHHHHHHHHH

which looks more natural?
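
The coin example can be sketched in a few lines. This assumes temperature sampling means raising each probability to the power 1/T and renormalizing, with T=0 treated as picking the most likely outcome; `sample_with_temperature` and `toss_sequence` are names made up for the illustration.

```python
import math
import random


def sample_with_temperature(probs, temperature, rng):
    """Return an index drawn from probs after rescaling by the temperature."""
    if temperature == 0:
        # T=0 is the limiting case: always take the most likely outcome.
        return max(range(len(probs)), key=lambda i: probs[i])
    # p ** (1/T), computed through logs; rng.choices normalizes the weights.
    weights = [math.exp(math.log(p) / temperature) for p in probs]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


def toss_sequence(temperature, n=16, seed=0):
    """Generate n tosses from the learned 51% heads / 49% tails model."""
    rng = random.Random(seed)
    probs = [0.51, 0.49]  # index 0 is heads, index 1 is tails
    return "".join("HT"[sample_with_temperature(probs, temperature, rng)] for _ in range(n))


print(toss_sequence(1.0))  # a mix of H and T
print(toss_sequence(0))    # all H
```

At T=0 the 2% edge for heads decides every single toss, so the output is the single most likely sequence and also the least coin-like one.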

however I think most of the difficulty in predicting the next word is when someone is communicating an unguessable statement. like "I couldn't find the ____" isn't guessable with high confidence. like of course the next word isn't going to be "is", but the number of likely next words is incredibly large, and that's the source of the perplexity

@halcy yeah, 1 is true for certain words in a sentence. for example common phrases where the next word is highly likely (the paper gives "I ate the pizza while it was still ____", gpt-2 gives HOT or WARM, with HOT being about 80% likely.)
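
To put rough numbers on the guessable versus unguessable cases, here is a small sketch. The pizza distribution assumes WARM takes the entire remaining 20%, and the open-ended case assumes 1000 equally likely next words; both are simplifications for illustration, not figures from the paper.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a next-word distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def perplexity(probs):
    """Perplexity is 2 raised to the entropy: the effective number of choices."""
    return 2 ** entropy_bits(probs)


pizza = [0.8, 0.2]               # HOT vs WARM (the 20% for WARM is assumed)
open_ended = [1 / 1000] * 1000   # "I couldn't find the ____" (hypothetical)

print(entropy_bits(pizza), perplexity(pizza))
print(entropy_bits(open_ended), perplexity(open_ended))
```

The pizza blank comes out under one bit of entropy, while the open-ended blank costs about ten bits under these assumptions, so a few unguessable words can dominate the perplexity of a whole sentence.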


so maybe the fact that your language model considers a repeated, meaningless sentence to be "low entropy" is a feature and not a bug

my idea is perhaps human text is more unlikely than repeated sentences because human text is communication. Meaning, it is an encoding of actual, incompressible semantic data. So the most likely sample is a sample that contains no semantic information. Would it then make sense to optimize for a sample that has a specific (non-maximal) likelihood? :thounking:

they talk about how text generated to be "maximally likely" just repeats itself forever, however human text does not do that. This is perplexing given that these models are designed to give high probability to human-written text. So why is the most likely sample from a language model actually the most unlikely thing for a person to write?

I forgot to post this here, but here was my freestyle graphics entry to Demosplash 2019. I made it in 15 minutes in bonzomatic while waiting for my flight. It got last place 😂
