I'm a scientist programmer working in a field made up of biologists and computer scientists, and what I've experienced is almost exactly the opposite of what the author describes.
I've found the problems that biologists cause are mostly:
* Not understanding dependencies, public/private, SCM or versioning, making their own code uninstallable after a few months
* Writing completely unreadable code, even to themselves, making it impossible to maintain. This means they always restart from zero, and projects grow into folders of a hundred individual scripts with no order, depending on files that no longer exist
* Foregoing any kind of testing or quality control, making real and nasty bugs rampant.
IMO the main issue with the software people in our field (of which I am one, even though I'm formally trained in biology) is that they are less interested in biology than in programming, so they are bad at choosing which scientific problems to solve. They are also less productive when coding than the scientists because they care too much about the quality of their work and not enough about getting shit done.
>They are also less productive when coding than the scientists because they care too much about the quality of their work and not enough about getting shit done.
Ultimately I’d say the core issue here is that research is complex and those environments are often resource strapped relative to other environments. As such this idea of “getting shit done” takes priority over everything. To some degree it’s not that much different than startup business environments that favor shipping features over writing maintainable and well (or even partially) documented code.
The difference in research that many fail to grasp is that the code is often as ephemeral as the specific exploratory path of research it’s tied to. Sometimes software in research is more general purpose but more often it’s tightly coupled to a new idea deep seated in some theory in some fashion. Just as exploration paths into the unknown are rapidly explored and often discarded, much of the work around them is as well, including software.
When you combine that understanding with an already resource-strapped environment, it shouldn't be surprising at all that much of the work done around the science, be it a physical apparatus or something virtual like code, is duct-taped together and barely functional. To some degree that's by design: it's choosing where to focus your limited resources, which is to explore and test an idea.
Software is very rarely the end goal, just like in business. The exception in business is that if the software is viewed as a long-term asset, more time is spent trying to reduce long-term costs. In research and science, if something is very successful and becomes mature enough that it's expected to remain around for a while, more mature code bases often emerge. Even then there's not a lot of money out there to create that stuff, but it does happen, though only after it's proven to be worth the time investment.
>Ultimately I’d say the core issue here is that research is complex and those environments are often resource strapped relative to other environments. As such this idea of “getting shit done” takes priority over everything.
The rule-of-thumb of factoring out only when you've written the same code three times rarely gets a chance here, because as soon as you notice a regularity, and you think critically about it, your next experiment breaks that regularity.
It's tempting to create reusable modules, but for one-off exploratory code, for testing hypotheses, it's far more efficient to just write it.
I have tons of examples of code where I did the simplest thing to solve the problem. Then later I needed a change. I could refactor the entire thing to add this change or just hack in the change. Refactoring the entire thing takes more work than the hack, so hack it is, unless I foresee this is going to matter later. Usually it doesn't.
That’s just anecdote, just like mine. Even a simple lack of experience or skills can cause that (which was definitely the case for me). Also, I’m quite sure that a terrific coder can create maintainable code faster than an average one can create bad code. That’s why I asked for some statistical data about that.
>I've found the problems that biologists cause are mostly 1. Not understanding dependencies, public/private, SCM or versioning, making their own code uninstallable after a few months
That's not on them though. That's on the state of the tooling in the industry.
Most of the time, dependencies could just be a folder you delete, and that's that (node_modules isn't very far from that). Instead it's a nightmare - and not for any good reason, except historical baggage.
The biologists writing scientific programs don't want "shared libraries" and other such BS. But the tooling often doesn't give them the option.
And the higher level abstractions like conda and pip and poetry and whatever, are just patches on top of a broken low level model.
None of those should be needed for isolated environments, only for dependency installation and update. Isolated environments should just come for free based on lower level implementation.
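As a concrete illustration of "dependencies could just be a folder you delete", Python's built-in venv module is one of the few places where this almost works today (a minimal sketch; `--without-pip` is used only to keep it runnable on stripped-down Python installs):

```shell
# An isolated environment is just a folder in the project directory
# (--without-pip keeps this runnable even on minimal Python installs).
python3 -m venv --without-pip env

# The interpreter inside it is scoped to that folder; packages installed
# into it (e.g. with `env/bin/pip install numpy` on a full install)
# live there and nowhere else.
env/bin/python -c 'import sys; print(sys.prefix)'

# "Uninstalling" the whole environment is deleting the folder.
rm -rf env
```

The frustrating part is that this model is opt-in extra tooling rather than the default behavior of the interpreter.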
While I agree tooling could be better, while in grad school I found that a lot of academics / grad students don't know that any of this tooling even exists and never bothered to learn whether tooling existed that could improve their lives. Ditto with updating their language runtimes. It really seemed like they viewed code as a necessary evil they had to write to achieve their research goal.
I was going to write a response but you've put what I would have said perfectly. The problem, at least in academia, is the pressure to publish. There is very little incentive to write maintainable code and finalise a project into something accessible to an end user. The goal is to come up with something new, publish, and move on or develop the idea further. This alone is not reason enough to forgo practices such as unit tests, containerisation and versatile code, but on top of it, most academic code is written by temporary "employees": PhD students are in a department for 3-4 years, and postdocs are there about the same amount of time.
For someone to shake these bad practices, they need to fight an uphill battle and ultimately sacrifice their research time so that others will have an easier time understanding and using their code. Another battle that people trying to write "good" code need to fight is that a lot of academics aren't interested in programming and see coding as simply a means to an end to solve a specific problem.
Also, a few more bad practices to add to the list:
* Not writing documentation.
* Copying, cutting, pasting and commenting out lines of code in lieu of version control.
* Not understanding the programming language they're using and spending time solving problems that the language has a built-in solution for.
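A hypothetical but typical instance of that last point, in Python: hand-rolling a frequency count that the standard library already provides:

```python
from collections import Counter

words = ["gene", "protein", "gene", "cell", "gene"]

# What often gets written: a manual counting loop.
counts = {}
for w in words:
    if w not in counts:
        counts[w] = 0
    counts[w] += 1

# What the language already provides.
counts_builtin = Counter(words)

assert counts == counts_builtin
print(counts_builtin.most_common(1))  # prints [('gene', 3)]
```

Neither version is wrong, but the second is one line, harder to get wrong, and instantly recognizable to any reader.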
This is at least based on my own experience as a PhD student in numerical methods working with Engineers, Physicists, Biologists and Mathematicians.
Sometimes I don’t blame people for committing the ‘sin’ of leaving commented-out code; unless you know that code used to exist in a previous version, it may as well have never existed.
It can be very warranted. For a client I'm working with now I'll routinely comment out big swaths of code as they change their mind back and forth every month or so on certain things. They won't even remember it used to exist.
These patterns appear in many fields. I take it as a sign that the tooling in the field is underdeveloped.
This leads to a split between domain problem solvers, who are driven to solve the field's actual problems at all costs (including unreliable code that produces false results) and software engineers, who keep things tidy but are too risk-averse to attempt any real problems.
I encourage folks with interests in both software and an area of application to look at what Hadley Wickham did for tabular data analysis and think about what it would look like to do that for your field.
Unreliable code that produces false results does not solve the field's actual problems, and is likely to contribute to the reproducibility problem. It might solve the author's immediate problem of needing to publish something.
Update: I guess I misinterpreted OP's intent here, with "unreliable code that produces false results" being part of the field's actual problems rather than one of the costs to be borne.
I meant that the drive to solve problems at all costs can be self-defeating if you overextend yourself by making unreliable code that produces false results.
Maybe it's biology (or really, maybe not), but honestly it's just the nature of the beast. Fortran is literally the oldest language; the attitude and spirit are simply different from those of software development.
Journals, research universities/institutions, and grant orgs have the resources and the gatekeeping role to encourage and enforce standards, and to train and support investigators in conducting real science rather than pseudoscience. But these entities are actively disowning their responsibility in the name of empty "empowerment" (of course, since rationally no one has a real chance of successfully pushing through a reform, the smart choice is to just not rock the boat).
Not the person you are replying to, but here are my thoughts:
He wrote the tidyverse package/group of packages, which includes/is tightly associated with ggplot. It is an extensive set of tools for analyzing and plotting data. None of it is anything that can't be done in base R or with existing packages, but it streamlined the process. It is an especially big improvement when doing grouped/apply operations, which, in my experience, are a huge part of scientific data analysis.
For many R users (especially those trained in the past 5 years or so) tidyverse and ggplot are barely distinguishable from core R features. I personally don't like ggplot for plotting and do all my figures in base R graphics, but the rest of tidyverse has dramatically improved my workflow. While my code is by no means perfect (I agree with all the aforementioned criticisms of academic coding, especially in biology/ecology), it is cleaner, more legible, and more reproducible, in large part thanks to tidyverse.
I work in an R&D environment with a lot of people from scientific backgrounds who have picked up some programming but aren't software people at heart. I couldn't agree more with your assessment, and I say that without any disrespect to their competence. (Though, perhaps with some frustration for having to deal with bad code!)
As ever, the best work comes when you're able to have a tight collaboration between a domain expert and a maintainability-minded person. This requires humility from both: the expert must see that writing good software is valuable and not an afterthought, and the developer must appreciate that the expert knows more about what's relevant or important than them.
> As ever, the best work comes when you're able to have a tight collaboration between a domain expert and a maintainability-minded person. This requires humility from both: the expert must see that writing good software is valuable and not an afterthought, and the developer must appreciate that the expert knows more about what's relevant or important than them.
I do work in such an environment (though in some industry, and not in academia).
An important problem, in my opinion, is that many software-minded people have a very different way of using a computer than typical users, and are always learning/thinking about new things, while the typical user is much less willing to be permanently learning (both in their subject matter area and with computers).
So, the differences in mindset and computer usage are, in my opinion, much larger than your post suggests. What you list are, in my experience, differences that are much easier to resolve and - if both sides are open - not really a problem in practice.
> They are also less productive when coding than the scientists because they care too much about the quality of their work and not enough about getting shit done.
You can't solve the first 3 issues without having people who care about software quality. People not caring about the quality of the software is what caused those initial 3 problems in the first place.
And you can't fix any of this as long as "software quality" (the "best practices") means byzantine enterprise architecture mammoths that don't even actually fix any of the quality issues.
There are crazy over-engineered solutions with strict requirements and insane dependency management, with terrible trade-offs and compromises. I've worked in the aerospace field before, so I've seen how terrible this can be. But it's also possible to have unit tests, a design and documentation without all that, and it would go a long way toward solving the original 3 issues.
> Yeah, if only scientists would put the same care into the quality of their science...
I guess we see survivorship bias here: the people who deeply care about the quality of their science instead of bulk producing papers are weeded out from their scientific jobs ... :-( Publish or perish.
I only worked briefly in software for research, and what you described matched my experience, but with a couple of caveats.
Firstly, a lot of the programs people were writing were messy, but didn't need to last longer than their current research project. They didn't necessarily need to be maintained long-term, and therefore the mess was often a reasonable trade-off for speed.
Secondly, almost none of the software people had any experience writing code in any industry outside of research. Many of them were quite good programmers, and there were a lot of "hacker" types who would fiddle with stuff in their spare time, but in terms of actual engineering, they had almost no experience. There were a lot of people who were just reciting the best practice rules they'd learned from blog posts, without really having the experience to know where the advice was coming from, or how best to apply it.
The result was often too much focus on easy-to-fix, visible, but ultimately low-impact changes, and a lot of difficulty in looking at the bigger picture issues.
> There were a lot of people who were just reciting the best practice rules they'd learned from blog posts, without really having the experience to know where the advice was coming from, or how best to apply it
This is exactly my experience too. Also, the problem with learning things from youtube and blogs is that whatever the author decides to cover is what we end up knowing, but they never intended to give a comprehensive lecture about these topics. The result is people who dogmatically apply some principles and entirely ignore others - neither of those really work. (I'm also guilty of this in ML topics.)
> Not understanding dependencies, public/private, SCM or versioning, making their own code uninstallable after a few months
I'm not sure what "uninstallable" code is, but why does it matter? Do scientists really need to know about dependencies when they need the same 3 libraries over and over? Pandas, numpy, Apache arrow, maybe OpenCV. Install them and keep them updated. Maybe let the IT guys worry about dependencies if it needs more complexity than that.
> Writing completely unreadable code, even to themselves, making it impossible to maintain. This means they always restart from zero, and projects grow into folders of a hundred individual scripts with no order, depending on files that no longer exist
This is actually kind of a benefit. Instead of following sunk cost and trying to address tech debt on years-old code, you can just toss a 200-liner script out of the window along with its tech debt, presumably because the research it was written for is already complete.
> Foregoing any kind of testing or quality control, making real and nasty bugs rampant.
Scientific code only needs to transform data. If it's written in a way that does that (e.g. uses the right function calls and returns a sensible data array) then it succeeded in its goal.
> They are also less productive when coding than the scientists because they care too much about the quality of their work and not enough about getting shit done.
Sooo...another argument in favor of the way scientists write code then? Isn't "getting shit done" kind of the point?
Yeah, these problems with "engineer code" the author describes are real, but they're a well-known thing in software engineering. It's exactly what you can expect from junior developers trying to do their best. More experienced programmers have gone through the suffering of having to work on such code, like the author himself, and don't make these mistakes. Meanwhile, experienced scientists still write terrible code...
I'm a software engineer working with scientist-turned-programmers, and what I've experienced is also exactly the opposite of the author. The code written by the physicists, geoscientists and data scientists I work with often suffers from the following issues:
* "Big ball of mud" design [0]: No thought given to how the software should be architected or what the entities that comprise the design space of the problem are and how they fit together. The symptoms of this lack of thinking are obvious: multi-thousand-line swiss-army-knife functions, blocks of code repeated in dozens of places with minor variations, and a total lack of composability of any components. This kind of software design (or lack of design, really) ends up causing a serious hit to productivity because it's often useless outside of the narrow problem it was written to solve and because it's exceedingly hard to maintain or add new features to.
* Lack of tests: some of this is that the scientist-turned-programmer doesn't want to "waste time" writing tests, but more often it's that they don't know _how_ to write good tests. Or they have designed the code in such a way (see above) that it's really hard to test. In any case--unsurprisingly--their code tends to be buggy.
* Lack of familiarity with common data structures and algorithms: this often results in overly-complicated brute-force solutions to problems being used when they needn't have and in sub-par performance.
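A small made-up example of that last point: deduplicating IDs with a list is quadratic, while the same loop with a set is linear, and the gap grows dramatically with input size:

```python
import time

ids = list(range(2_000)) * 2  # 4,000 items with duplicates

# Brute force: membership test against a list is O(n), so the loop is O(n^2).
t0 = time.perf_counter()
seen_list = []
for x in ids:
    if x not in seen_list:
        seen_list.append(x)
slow = time.perf_counter() - t0

# Right data structure: set membership is O(1), so the loop is O(n).
t0 = time.perf_counter()
seen_set = set()
unique = []
for x in ids:
    if x not in seen_set:
        seen_set.add(x)
        unique.append(x)
fast = time.perf_counter() - t0

assert unique == seen_list  # identical result, wildly different cost
print(f"list: {slow:.4f}s  set: {fast:.4f}s")
```

Nothing here requires a CS degree, just knowing that sets exist and when to reach for them.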
This quote from the author stood out to me:
> I claim to have repented, mostly. I try rather hard to keep things boringly simple.
...because it's really odd to me. Writing code that is as simple as it can be is precisely what good programmers do! But in order to get to the simplest possible solution to a non-trivial problem you need to think hard about the design of the code and ensure that the abstractions you implement are the right ones for the problem space. Following the "unix philosophy" of building small, simple components that each do one thing well but are highly composable is undoubtedly the more "boringly simple" approach in terms of the final result, but it's harder to do (in the sense that it may take more thought and more experience) than diving into the problem without thinking and cranking out a big ball of mud. Similarly, reaching for the correct data structure or algorithm often results in a massively simpler solution to your problem, but you have to know about it or be willing to research the problem a bit to find it.
The author did at least try to support his thesis with examples of "bad things software engineers do", but a lot of them seem like things that--in almost every organization I've worked at in the last ten years--would definitely be looked down on/would not pass code review. Or are things ("A forest of near-identical names along the lines of DriverController, ControllerManager, DriverManager, ManagerController, controlDriver") that are narrowly tailored to a specific language at a specific window in time.
> they care too much about the quality of their work and not enough about getting shit done.
I think the appearance of "I'm just getting shit done" is often a superficial one, because it doesn't factor in the real costs: other scientists and engineers can't use their solutions because they're not designed in a way that makes them work in any other setting than the narrow one they were solving for. Or other scientists and engineers have trouble using the person's solutions because they are hard to understand and badly-documented. Or other scientists and engineers spend time going back and fixing the person's solutions later because they are buggy or slow. The mindset of "let's just get shit done and crank this out as fast as we can" might be fine in a research setting where, once you've solved the problem, you can abandon it and move on to the next thing. But in a commercial setting (i.e. at a company that builds and maintains software critical for the organization to function) this mindset often starts to impose greater and greater maintenance costs over time.
> Lack of familiarity with common data structures and algorithms
This part I 100% agree with. I adapt a lot of scientific code as my day-to-day, and most of the issues in it tend to be making things 100x slower than they need to be, and then even implementing insane approximations to "fix" the speed issue instead of actually fixing it.
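A made-up but representative case of the pattern: an interpreted Python loop over a NumPy array can easily be ~100x slower than the equivalent vectorized call, and the real fix is vectorization, not a lossy approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(100_000)

# The slow way: an interpreted Python loop over array elements.
total = 0.0
for x in data:
    total += x * x
slow_result = total

# The actual fix: push the loop into NumPy's compiled code.
fast_result = np.dot(data, data)

# Same answer (up to float summation order), orders of magnitude faster.
assert np.isclose(slow_result, fast_result)
```

The "insane approximation" route would be, say, subsampling `data` to make the loop bearable, which silently changes the science.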
>"Big ball of mud" design
Funny enough, this is explicitly how my PI at my current job wants to implement software. In his opinion the biggest roadblock in scientific software is actually convincing scientists to use the software. And what scientists want is a big ball of mud which they can iterate on easily and which basically requires no installation. In his opinion a giant Python file with a requirements.txt file and a Python version is all you need. I find the attitude interesting. For the record he is a software engineer turned scientist, not the other way around, but our mutual hatred for Conda makes me wonder if he is onto something ...
>I think the appearance of "I'm just getting shit done" is often a superficial one, because it doesn't factor in the real costs: other scientists and engineers can't use their solutions because they're not designed in a way that makes them work in any other setting than the narrow one they were solving for.
For the record my experience is the exact opposite. The crazy trash software probably written in Python that is produced by scientists are often the ones more easily iterated on and used by other scientists. The software scientists and researchers can't use are the over-engineered stuff written in a language they don't know (e.g. Scala or Rust) that requires them to install a hundred things before they are able to use it.
> The mindset … might be fine in a research setting
A vast amount of software is written for research papers that would be useful to people other than the paper’s authors. A lot of software that is in common use by commercial teams started off in academia.
One of the major issues I see is the lack of maintenance of this software, especially given all the problems written in your post and the one above. If the software is a big ball of mud, good luck to anyone trying to come in and make a modification for their similar research paper, or commercial application.
I don’t know the answer to this, but I think additional funding to biology labs to have something like a software developer who is devoted to making sure their lab’s software follows reasonably close to software development best practices would be a great start. If it’s a full time position where they’d likely stick around for many years, some of the maintenance issues would resolve themselves, too. This software-minded person at a lab would still be there even after the biology researchers have moved on elsewhere, and this software developer could answer questions from other people interested about code written years ago.
> * Not understanding dependencies, public/private, SCM or versioning, making their own code uninstallable after a few months
This is definitely true, but I've searched *far and wide*, and unfortunately it's not a simple task to get this right.
Ultimately, if there were a simple way to get data in the correct state in an os-independent, machine independent (from raspberry pi to HPC the code should always work), concise, and idempotent way - people would use it.
There isn't.
But there certainly could be.
The solution we desperately need is basically a pull request to a simple build tool (make, Snakemake, just, task, etc) that makes this idempotent and os-independent setup simple.
Snakemake works on windows and Unix, so that's a decent start.
One big point is matching data outputs to source code and input state. *Adding ipfs or torrent backends to Snakemake could solve this problem.*
The idea would be to simply wrap `input/output: "/my/file/here"` in `ipfs()`, wherein this would silently check if the file is locally cached to return, but if not go to IPFS as a secondary location to check for the file, then if the file isn't at either place, calculate it with the run command specified in Snakemake.
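A rough sketch of what such a wrapper might look like. Every name here is hypothetical (this is not a real Snakemake or IPFS API), and the actual IPFS fetch is stubbed out; a real implementation would shell out to an IPFS daemon or gateway:

```python
import os

# Hypothetical map from a file's project path to its IPFS content hash (CID).
# In the proposal this would be derived from the git commit hash of the code.
CID_INDEX = {"results/weights.npz": "QmExampleHashNotReal"}

def ipfs(path, fetch=None):
    """Resolve a data file: local cache first, then IPFS, else recompute.

    `fetch` is a stand-in for an IPFS download (e.g. `ipfs get <cid>`);
    returning None from it signals "not available, recompute".
    """
    if os.path.exists(path):           # 1. locally cached: use it as-is
        return path
    cid = CID_INDEX.get(path)
    if cid and fetch is not None:      # 2. try the decentralized cache
        fetched = fetch(cid, path)
        if fetched:
            return fetched
    return None                        # 3. caller falls back to recomputing
```

In a Snakemake rule this would wrap the `input:`/`output:` paths, so a `None` result is what triggers the rule's `run:` block to recompute the file.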
It's useful to have this type of decentralized cache, because it's extremely common to run commands that may take several months on a supercomputer that give files that may only be a few MBs (exchange correlation functional) or only a few GBs (NN weights) so downloading the file is *immensely* cheaper to do than re-running the code - and the output is specified by the input source code (hence git commit hash maps to data hash).
The reason IPFS or torrent is the answer here is for several reasons:
1) The data location is specified by the hash of the content, which can be used to make a hash map from git commit hashes of source code state to data outputs (the code uniquely specifies the data in almost all cases, and input data can be included for the very rare cases where it doesn't)
2) The availability and speed of download scale with popularity. Right now, we're at the mercy of centralized storage systems, wherein the download rate can be however low they want it to be. However, LLM NN weights on IPFS can be downloaded very fast when millions of people *and* many centralized storage providers host the file.
3) The data is far more robust to disappearing. Almost all scientific data output links point to nothing (MAG, sra/geomdb - the examples are endless). This happens for many reasons, such as academics moving and the storage location no longer being funded, accounts being moved, or simply because they don't have enough storage space for emails on their personal Google Drive and they delete the database files from their research. However, these files are often downloaded many times by others in the field - the data exists somewhere - so it just needs to be made accessible, by decentralizing the data and allowing anyone to download the file from the entire community which has it.
One of the important aspects to include in this buildtool would be to ensure that, every time someone downloads a certain file (specified by the git commit hash - data hash map) or uploads a file after computing it, they host the file as well. This way the community grows automatically by having a very low resource and extremely secure IPFS daemon host all of the important data files for different projects.
Having this all achieved by the addition of just 6 characters in a Snakemake file might actually solve this problem for the scientific / data science community, as it would be the standard and hard to mess up.
The next issue to solve would be to popularize a standard way to get a package to work on all available cores/GPUs/resources, from a Raspberry Pi to HPC, without any changes or special considerations. PySpark almost does this, but there's still more config than desirable for the community, and the requirement of installing OS-level dependencies (Java stuff) to make it work on Python can often halt its use completely. If the package using PySpark is a dependency of a dependency of a dependency, wet lab biologists [the real target users] *will not* figure out how to fix that problem if it doesn't "just work"[TM].
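For the "use whatever cores exist, zero config" part, Python's standard library already gets surprisingly close for embarrassingly parallel work; a minimal sketch, assuming a pure per-sample function:

```python
from multiprocessing import Pool

def analyze(x):
    # Stand-in for a per-sample computation.
    return x * x

if __name__ == "__main__":
    samples = range(1000)
    # Pool() with no arguments uses every core the machine has, so the
    # same code runs unchanged on a Raspberry Pi or a fat HPC node.
    with Pool() as pool:
        results = pool.map(analyze, samples)
    print(sum(results))  # prints 332833500
```

It obviously doesn't cover GPUs or multi-node clusters, which is exactly the gap the comment is pointing at.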
Of course, it's absolutely DVC. The problem is that I've never seen a DVC solution that solves the problem by making the hosting decentralized. So all of the huge problems I listed still exist even with these DVC packages. What's more is, even in addition to the cost of the hosting, some of the DVC packages cost money on top of that.
So, when a researcher deletes a file to make room for others on their storage provider and/or moves institutions and their account gets deleted, the data is gone. The only way around this is to use torrent or ipfs.
Also, I'm not sure what your issue with ipfs is; If it's 'I saw something something crypto one time' - it's a really poor argument. IPFS works completely independently of any crypto - it has nothing really to do with it.
The solution can also be torrent - I don't care too much - it's just possible that IPFS can run with far less resource usage on lower power, etc because it's more modern (likely uses better algorithms in the protocol, deals with modern filesystems better, with better performance, hopefully have better security, etc) and it's likely easier to implement. But it doesn't matter if it's torrent because it would work essentially the same way.
If you want to host the data on Dropbox, Dropbox becomes a part of the network and is hosted by the community and Dropbox or whatever service you like. The problem is that it is very prohibitive to use services like AWS. For instance, downloading all of Arxiv.org from AWS is around $600 last time I checked.
Not every student trying to run an experiment is going to have $600 laying around to run the experiment.
But with torrent/ipfs it would be free to download.
It only adds. I don't understand how it subtracts.