In search of a new programming language
I have been programming in R for the past 10 years, mostly as a scientist in
academia. R is extremely useful to quickly validate hypotheses and generate
plots to communicate results to collaborators. As I transition into industry, I
have become increasingly frustrated with some of R's idiosyncrasies that make it
difficult to write error-free programs. Examples include silently recycling
shorter vectors (with a warning only if the length of the longer vector is not
a multiple of the length of the shorter one), and returning results when
selecting with an NA value. The following snippet shows both:
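A minimal sketch of both behaviours. The vectors `x`, `y`, `z` and `keep` are made up for illustration:

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- c(10, 20)

# Recycling: y is silently repeated three times, no warning,
# because length(x) is a multiple of length(y).
recycled <- x + y
print(recycled)

# Recycling with a warning: 6 is not a multiple of 4, so R warns
# ("longer object length is not a multiple of shorter object length")
# but still returns a result.
z <- c(10, 20, 30, 40)
mismatched <- x + z
print(mismatched)

# Selecting with NA: a bare NA is a logical value, so it is recycled
# across the whole vector and every element comes back as NA.
all_na <- x[NA]
print(all_na)

# An NA hidden inside a logical selection yields an NA element
# instead of raising an error.
keep <- c(TRUE, NA, FALSE, TRUE, FALSE, FALSE)
selected <- x[keep]
print(selected)
```

None of these calls stops the program. At most you get a warning, and the NA entries only show up if you inspect the result.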
Do you really think this is what I meant? It is very easy for an NA value to slip
into your selection operation, and it will not appear in the code. So you have
to keep both your code and your data in your head to fully comprehend your program. More
of these can be found in The R
Things I like about R are its functional nature, such as mapping functions over vectors (with lapply, sapply…) instead of writing loops, and being able to work interactively in the REPL.
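To make the mapping style concrete, here is a small sketch that computes the same result with a loop and with `sapply`. The input vector and the squaring function are arbitrary choices for the example:

```r
values <- 1:5

# Loop version: pre-allocate the result, then fill it element by element.
squares_loop <- numeric(length(values))
for (i in seq_along(values)) {
  squares_loop[i] <- values[i]^2
}

# Mapping version: apply the function to every element in one expression.
squares_map <- sapply(values, function(v) v^2)

# vapply is a stricter variant that also checks the type of each result.
squares_checked <- vapply(values, function(v) v^2, numeric(1))

print(squares_map)
```

The `vapply` form is worth a look given the complaints above: it fails loudly if the function returns something other than a single number.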
So what would be a good language to replace R with? I need it for bioinformatics and data analysis. Let’s first set some criteria and then discuss some candidates that match them, with no claim to exhaustiveness.
- functional, meaning I can use a syntax close to R’s vectorized operations (apply and co.)
- easy to parallelize
- interactive, has a REPL
- libraries for statistics, bioinformatics and plotting
- type system
- compiles to native code
- pleasant syntax
- good library ecosystem
OCaml and F# belong to a strongly typed family of functional languages. OCaml is used in many industries and has a flexible compiler backend. F# lets you tap into the .NET framework. Bioinformatics libraries include BioFsharp and dotnetbio for F#, and biocaml for OCaml.
Haskell is a functional language with a focus on purity (no side effects) that compiles to native code. There is Biohaskell, but Haskell does not seem to be used much for bioinformatics. I am not sure it is practical for interactive use.
Scheme is certainly an odd one. I recently looked into it while reading the seminal book “Structure and Interpretation of Computer Programs”. I fell in love with its simplicity and expressiveness, and also with its syntax (prefix notation and parentheses), which many find unappealing. It allows a style of coding similar to R (which is sometimes considered a Lisp) and interactive development at the command line. Most implementations have good FFI support, which makes it possible to interface with bioinformatics tools implemented in other languages. The Racket implementation probably has the largest library ecosystem, including libraries for plotting and statistics.
Julia has a type system, is designed for fast numerics, and compiles to native code. There is a bioinformatics package, Biojulia. The language is still young and moves quickly, with new packages being added and changes still being made to the core language.
Rust is a memory-safe language. I am not sure it sees much use in bioinformatics, but there is an actively developed library, rust-bio.
To sum up, there is no way around R and Python when it comes to library ecosystems, with over 10 000 packages in R/Bioconductor or Python for bioinformatics, plotting and scientific computing. I think Julia is a language to watch in this space, but it might still need some time to mature. Scheme would be my personal favourite at this point, but more from an ideological than a practical perspective.