Random insight of the night: every couple years, someone stands up and bemoans the fact that programming is still primarily done through the medium of text. And surely with all the power of modern graphical systems there must be a better way. But consider:
* the most powerful tool we have as humans for handling abstract concepts is language
* our brains have several hundred millennia of optimizations for processing language
* we have about 5 millennia of experimenting with ways to represent language outside our heads, using media (paper, parchment, clay, cave walls) that don't prejudice any particular form of representation, at least in two dimensions
* the most wildly successful and enduring scheme we have stuck with over all that time is linear strings of symbols. Which is text.
So it is no great surprise that text is well adapted to our latest adventure in encoding and manipulating abstract concepts.
@rafial Both accurate and also misses the fact that Excel is REGULARLY misused for scientific calculations and near-programming level things since its GUI is so intuitive for doing math on things.
Like, GUI programming is HERE, we just don't want to admit it due to how embarrassing it is.
@rafial Now what we need to do is make a cheap, easy-to-use version of it that is designed for what scientists are using it for: column labels, semantic labels, faster calculations, better handling of mid-sized data (tens of thousands of data points), etc.
@rafial I have not. I've done math and such in Excel (making a molecular weight calculator, sheets to automatically work out student marks by letting me see which step of a calculation they got wrong, etc.) and I've done actual programming (a little Python, C, C++, heck, QBASIC back in high school and a tiny bit of FORTRAN90 one summer), but not anything in between.
I can't figure out the use of them? They seem like the worst of both worlds: you have to debug Python AND you have to deal with a slow-loading GUI program.
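For comparison with the spreadsheet version, here is a minimal sketch of what a molecular weight calculator like the one mentioned above could look like in Python. The function name `molecular_weight` and the short element table are assumptions made for illustration; it only handles flat formulas such as `C6H12O6`, with no parentheses, hydrates, or charges.

```python
import re

# Standard atomic weights in g/mol, rounded. Only a handful of elements
# are listed to keep the sketch short; extend the table as needed.
ATOMIC_WEIGHTS = {
    "H": 1.008,
    "C": 12.011,
    "N": 14.007,
    "O": 15.999,
    "Na": 22.990,
    "Cl": 35.45,
}

# One element symbol (capital letter plus optional lowercase letter)
# followed by an optional count, e.g. "C6", "H", "Cl2".
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def molecular_weight(formula: str) -> float:
    """Return the molar mass of a flat formula such as 'C6H12O6'.

    Hypothetical helper: parentheses, hydrates and charges are not handled.
    """
    total = 0.0
    pos = 0
    for match in TOKEN.finditer(formula):
        # Any gap between matches means there was a character we could not parse.
        if match.start() != pos:
            raise ValueError(f"cannot parse {formula!r} at position {pos}")
        symbol, count = match.groups()
        if symbol not in ATOMIC_WEIGHTS:
            raise ValueError(f"unknown element {symbol!r}")
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
        pos = match.end()
    if not formula or pos != len(formula):
        raise ValueError(f"cannot parse {formula!r} at position {pos}")
    return total


if __name__ == "__main__":
    for f in ("H2O", "C6H12O6", "NaCl"):
        print(f, round(molecular_weight(f), 3))
```

The trade-off is roughly the one the thread describes: the spreadsheet shows every intermediate value at a glance, while the script is easier to reuse and test.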
@Canageek I'm very interested in that space, insofar as it seems to intersect with the ideas of Don Knuth's literate programming. But I also admit to being slightly unclear as to what domains it is best for. I think it's big with the data science crowd?
@rafial Yeah, I'm a chemist who works with very small data, no statistical analysis or anything like that. About the most I have to do is "Adjust this data to account for the lamp response on that day" or "Normalize and scale these two spectra against one another"
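The two small-data tasks described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: both spectra are sampled on the same wavelength grid, the lamp correction is a point-by-point division by a relative lamp response, and the function names (`correct_for_lamp`, `normalize`, `scale_to_match`) are hypothetical. A real instrument may call for a different correction.

```python
def correct_for_lamp(signal, lamp_response):
    """Divide each measured point by the lamp response at the same wavelength.

    Assumes both lists share one wavelength grid and that the response
    values are non-zero relative intensities.
    """
    if len(signal) != len(lamp_response):
        raise ValueError("signal and lamp response must have the same length")
    return [s / r for s, r in zip(signal, lamp_response)]


def normalize(spectrum):
    """Scale a spectrum so that its largest value becomes 1.0."""
    peak = max(spectrum)
    if peak <= 0:
        raise ValueError("spectrum has no positive peak to normalize against")
    return [v / peak for v in spectrum]


def scale_to_match(spectrum, reference, index):
    """Scale `spectrum` so it equals `reference` at one chosen data point."""
    factor = reference[index] / spectrum[index]
    return [v * factor for v in spectrum]


if __name__ == "__main__":
    raw = [2.0, 4.0, 3.0]          # made-up intensities, for illustration only
    lamp = [0.5, 2.0, 1.5]         # made-up relative lamp response
    corrected = correct_for_lamp(raw, lamp)
    print(normalize(corrected))
```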
@Canageek if I had to guess, I would say if your domain involved exploration of data sets, with visualization as a key component of that, the notebook things might well be a killer app.
@rafial @Canageek I've done a fair amount with Python notebooks when that first happened, a little with Julia notebooks.
It's hard to make a coherent program in them, it's a long series of fragments that get run in arbitrary (creation) order.
But as a mix of documentation and calculation, it's super useful. I'd rather work in a REPL (and Julia has a fantastic REPL), but notebooks are better for persistent math and proving your results.
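The "fragments run in arbitrary order" problem can be shown with a toy model. This is a sketch, not how Jupyter is actually implemented: it assumes only that all cells share one namespace, and `run_cells` is a hypothetical helper that executes cell sources in whatever order it is given.

```python
def run_cells(cells, order):
    """Execute cell sources in the given order inside one shared namespace."""
    namespace = {}
    for i in order:
        exec(cells[i], namespace)
    return namespace


# Three tiny "cells", as they would appear top to bottom in a notebook.
cells = [
    "x = 10",          # cell 0
    "x = x * 2",       # cell 1
    "result = x + 1",  # cell 2
]

# Running each cell once, top to bottom.
top_to_bottom = run_cells(cells, [0, 1, 2])["result"]

# Accidentally running cell 1 twice before cell 2 changes the answer,
# even though the notebook text on screen is identical.
rerun_cell_1 = run_cells(cells, [0, 1, 1, 2])["result"]

print(top_to_bottom, rerun_cell_1)
```

The saved notebook looks the same in both cases, which is why restarting the kernel and running every cell in order is a common check before sharing results.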
@rafial @Canageek There's annoyances with #Julia: Startup time is many (5-30) seconds, indexing starts at 1 unless you do crazy things to "fix it", docs can be confusing/nonexistent.
But it has a great method dispatch/object model which is not just "static Smalltalk" like most, it's super fast at runtime, and like I say the REPL is amazing.
@mdhughes I mean for those who, unlike us, didn't come up through C and "indexing is really pointer multiplication", index base 1 kinda makes sense, nah?
@rafial
"Should array indices start at 0 or 1? My compromise of 0.5 was rejected without, I thought, proper consideration."
—Stan Kelly-Bootle
@mdhughes @rafial @Canageek Also relevant to this discussion:
https://juliapackages.com/p/pluto
I found out about this here:
https://www.youtube.com/watch?v=g8RkArhtCc4
@urusan @rafial It totally ignores the advantages of PDF though, like the fact there are a stack of independent implementations that can view it, which means we will still be able to read these files in 50 years, unlike whatever format they are using, and we can even print them out on paper to edit them (for example, how my boss and I do it, as he doesn't know LaTeX, which is what I write in).
Or the fact that like, 90% of scientists don't know how to program and are unlikely to learn.
@urusan @rafial In my lab right now... I can program a little, one post doc has spent some time doing python tutorials, and I think that is it, out of ten people.
Organic labs might have fewer than that, as there is a reputation at least that organic chemists and programming are like oil and water: if you are good at one, you are likely terrible at the other.
@Canageek @rafial Well, I don't personally think this kind of dynamic documentation is going to fully replace static documentation. Static webpages haven't been fully replaced by dynamic webpages either.
However, the growing notebook ecosystem does address issues that didn't have well-formed norms around them before.
In addition to the obvious CS/Math applications, there's a lot of areas where complicated statistical work gets done with code, and then how do you publish that?
@Canageek @rafial Actually, this gets at the heart of what I'm talking about.
Jupyter handles the software dependencies for you as part of the kernel selection.
As these norms become better established over time, the tools to deal with these issues automatically will just be there.
Just dropping some source code on someone is putting the burden of replication on them and not on the platform.
@urusan @rafial Right but then you are stuck with a system 0 people understand instead of 1 person.
If I wanted to use GNUplot as it was built I could use the script a former grad student passed on, but I want it in LaTeX so the fonts and such match.
(Honestly, if I could go into campus it would be done, but the software we have that makes these graphs is trapped on one computer, and not worth me going into campus during a pandemic to get, at least not yet)
Honestly, if you are going to try and replicate any of my work you aren't going to go to my data: you are going to synthesize the compounds yourself and take the measurements on your equipment against your standards, so that it doesn't turn out to be some dumb difference between how I set up my experiments and how you do, or some defect in my hardware, etc etc.
@clacke @urusan @rafial Well yeah, but most scientists don't do statistics. Most chemists, most biologists, geologists, etc.
Like, there is a reason computational is a subfield of every discipline.
I think it is going up, due to more stats and computations being used, but I also think we are way too reliant on stats these days and use them instead of getting good data.
@Canageek @clacke @rafial Getting good data is a noble goal, but you'll always have to cope with statistical uncertainty in science.
Even in computer science, where we theoretically control the underlying systems we're studying perfectly, there's often still statistical uncertainty to deal with.
I don't see how that would be any better in the real world where there's uncertainty in measurement.
That said, you're right that you want to get good enough data that your statistics are simple.
@urusan @clacke @rafial Right, I've been frustrated with this in science for decades. We should do half as many studies and put twice or more as much funding into each one so we have actually decent stats.
For example, lately you have to justify the minimum number of rats for ethics committees for any experiment. Fuck that, use 4 times as many so we can be confident in our work instead of justifying it to heck and back.
@urusan @clacke @rafial Nope, it's counted as bad but justifiable, so you have to minimize the number you use, at least as I understand it.
Likewise, academic human studies are typically very underfunded, which is why there is such a bias towards small sample sizes and all the participants being undergrads found on campus.
> Listen, my grandpa was a mason, from Borlänge, Gösta was his name, construction worker, he used to say this:
> "In my day, we made an honor out of building houses as strong as possible, so they would last as long as possible. But now, they have computers that calculate how weak they can build a joint without it falling in on itself."
> *ding*
> Ain't it weird?
Thanks @urusan, I found the article interesting, and it touched on the issue of how to balance the coherence of a centrally designed tool with the need for something open, inspectable, non-gatekept, and universally accessible.
PDF started its life tied to what was once a very expensive, proprietary tool set. The outside implementations that @Canageek refers to were crucial in it becoming a universally accepted format.
I think the core idea of the computational notebook is a strong one. The question for me remains if we can arrive at a point where a notebook created 5, 10, 20 or more years ago can still be read and executed without resorting to software archeology. Even old PDFs sometimes break when viewed through new apps.
@rafial @urusan Fair, though I'd say source code is pointless and what we need is more focus on good, easy access to the raw data.
If you can't reproduce what was done from what is in the paper, you haven't described what you've done well enough, and redoing it is better than just rerunning code, as a bug might have been removed between software versions, you might notice something not seen in the original, etc.
@Canageek @rafial This is something I have been thinking about while talking about this. The Jupyter notebook approach is much better when code gets involved.
However, the main alternative is to just eschew code entirely. I think this is valid, especially in fields where code is largely irrelevant and you can just provide your data and describe your statistical approach and let the reader deal with it.
@Canageek @rafial You aren't processing those ShelX files on any sort of hardware (or software binaries) that existed in the late 1960s. At best, you're running the original code in an emulation of the original hardware, but you are probably running it on modern software designed to run on modern hardware.
Software archeology is inevitable and even desirable
What we want is an open platform maintained by software archeology experts that lets users not sweat the details
@Canageek @rafial Admittedly, we also want the software we use to communicate science with to not change at a blistering pace.
However, natural language and scientific techniques naturally change over time too, so it's inevitable that we will have to cope with change.
We already have to do this, it's just our brains do a good job smoothing inconsistencies out.
@urusan @rafial No, they've kept updating the software since then so it can use the same input files and data files. I'm reprocessing the data using the newest version of the software using the same list of reflections that was measured using optical data from wayyyy back.
The code has been through two major rewrites in that time, so I don't know how much of the original Fortran is the same, but it doesn't matter? I'm doing the calculations on the same raw data as was measured in the 60s.
There is rarely a POINT to doing so rather than growing a new crystal, but I know someone that has done it (he used Crystals rather than Shelx, but he could do that as the modern input file converter works on old data just fine).
@Canageek @rafial We're talking about 2 different things here. Of course data from over half a century ago is still useful.
The thing that's hard to keep running decades later is the code, and code is becoming more and more relevant in many areas of science.
Keeping old code alive so it can produce consistent results for future researchers is a specialized job
Ignoring the issue isn't going to stop researchers from using and publishing code, so it's best to have norms
@urusan @Canageek one other thing to keep in mind is that data formats are in some ways only relevant if there is code that consumes them. Even with a standard, at the end of the day a valid PDF document is, by de facto definition, one that can be rendered by extant software. The same goes for ShelX scripts. To keep the data alive, one must also keep the code alive.
@rafial @urusan No, what you need is a good description of how the data was gathered. Analysis is just processing and modeling and can be redone whenever. As long as you know enough about the data.
There are *six* programs I can think of that can process hkl data and model it (shelx, crystals, GSAS-II, Jana, olex2), so it doesn't REALLY matter which you use or if any of them are around in ten years, as long as there is *A* program that can do the same type or better modeling (reading the same input file is a really good idea as well, as it makes things easy).
If a solution is physically relevant any program should be able to do the same thing.
@mdhughes @rafial @urusan I mean, that is why Shelx's first major version came out in 1965 and the most recent one in 2013 (last minor revision was 2018).
I mean, modern versions of Fortran aren't any harder to write than C, which is still one of the most used programming languages on the planet. I don't see why everyone makes fun of it.
@Canageek @rafial @urusan I'm kind of not making fun of Fortran, though the last time I saw any in production it was still F-77, because F-90 changed something they relied on and was too slow; I last worked on some F-77 for the same reason ~30 years ago.
I am indeed making fun of COBOL, but it'll outlive us by thousands of years as well.
Stable languages are good… but also fossilize practices that we've improved on slightly in the many decades since.
> SHELX is developed by George M. Sheldrick since the late 1960s. Important releases are SHELX76 and SHELX97. It is still developed but releases are usually after ten years of testing.
This is amazing.
@clacke @mdhughes @urusan @rafial Yeah, the big worry is that George Sheldrick is getting very, very old and people wonder whether anyone will take over maintaining and improving the software when he dies. Luckily its largest competitor does have two people working on it, the original author and a younger professor, so it has a clear succession path.
@Canageek I'm wondering, given your professional leanings if you can comment on the use of "notebook" style programming systems such as Jupyter and of course Mathematica. Do you have experience with those? And if so how do they address those needs?