Solving Chemical Equations Part 4a: Chemical Formulas and Parsing

Expressing chemical formulas on a computer is not as easy as writing them on paper, but it is still quite doable. Python to the rescue.


Let’s first take a look at a chemical formula, Iron(II) Ferricyanide (do you know how to pronounce the second part … I don’t).

\[
Fe_3[Fe(CN)_6]_2
\]

This chemical formula has way more information than we need for this project. A simpler form is as follows. BTW, I am not a chemist and am very sorry if this upcoming simplification offends.

\[
Fe_5C_{12}N_{12}
\]

All we really care about is how many of each element the formula has.

So let’s break this down into things that we will need.


Elements

We will not rely on any prior knowledge of which elements exist. We will just assume that an element symbol starts with a capital letter and may or may not be followed by a lowercase letter. This is mostly for convenience, but it does mean a user could input equations containing non-existent elements.

Ok, I am skipping over how we will store the elements for now, because I need to move on to the formula to explain it.


Formula

Ok so as above, when storing our formula, we don’t care about brackets or parentheses. We only care about which elements are there and how many of each we have. We won’t even care about the order in which they appear. Great, we will just use a Python dictionary, plain and simple, with element symbols as keys and counts as values. I will note here that Python dictionaries do remember insertion order (since Python 3.7), but we will never rely on that order, so for our purposes a dictionary behaves like an unordered collection, and that is perfectly fine.
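To make the data structure concrete, here is a small sketch of how the simplified Iron(II) Ferricyanide formula from above could be stored. The variable names are just my own placeholders, not anything fixed by the project.

```python
# Fe3[Fe(CN)6]2 reduced to plain element counts.
# Keys are element symbols, values are how many of each the formula has.
iron_ferricyanide = {"Fe": 5, "C": 12, "N": 12}

# The same counts entered in a different order still describe the same formula,
# because dictionary equality ignores insertion order.
same_formula = {"N": 12, "C": 12, "Fe": 5}

print(iron_ferricyanide == same_formula)
```

Comparing two dictionaries with `==` only looks at the keys and values, which is exactly the "we don’t care about order" behavior we want.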


Parsing … yeah about that ….

Oh dear, first of all, there will be no magic here. I am just going to “Make it Work”, nothing more nothing less.

With that being said let’s break down what we will have to do.

  1. Identify element symbols
  2. Identify numerical subscripts
  3. Handle parentheses and brackets
  4. Handle Dots (crystalline structures, hydrates, the Chemist has no idea)

Let’s look at an example that has all of these things, potassium ferrioxalate trihydrate. We will look at the nicely formatted chemical formula and also the plain text.

\[
K_3[Fe(C_2O_4)_3] \cdot 3H_2O
\]

K3[Fe(C2O4)3].3H2O

Element Symbol & Numerical Subscript: two functions

We will have two functions for this. Again we will assume that an element symbol starts with a capital letter and may or may not have a lowercase letter following. If there is a number directly after the element symbol we will take it as the subscript.

When adding to our data structure we will test if the element is already accounted for. If it is, we add to its existing count; if not, we create a new entry.
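Here is a rough sketch of what those two functions might look like. The names `read_symbol` and `read_subscript`, and the choice to pass a position around, are my own assumptions for illustration; the real code may end up looking different. A missing subscript is treated as 1.

```python
DIGITS = "0123456789"

def read_symbol(text, pos):
    """Return (symbol, next_pos) for the element symbol starting at text[pos]."""
    if pos >= len(text) or not text[pos].isupper():
        raise ValueError(f"expected an element symbol at position {pos} of {text!r}")
    end = pos + 1
    # An optional single lowercase letter belongs to the same symbol.
    if end < len(text) and text[end].islower():
        end += 1
    return text[pos:end], end

def read_subscript(text, pos):
    """Return (subscript, next_pos); a missing subscript counts as 1."""
    end = pos
    while end < len(text) and text[end] in DIGITS:
        end += 1
    if end == pos:
        return 1, pos
    return int(text[pos:end]), end

# Usage: acetic acid, where C, H and O each show up more than once.
counts = {}
text = "CH3COOH"
pos = 0
while pos < len(text):
    symbol, pos = read_symbol(text, pos)
    subscript, pos = read_subscript(text, pos)
    # Add to the existing count if the element is already accounted for.
    counts[symbol] = counts.get(symbol, 0) + subscript

print(counts)
```

Using `counts.get(symbol, 0)` handles the "is the element already accounted for" test in one step.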

Driver: a Magical function

At this point, I will note that we will be using a driver function to orchestrate all this parsing. It will be Magical because it will use recursion. All that means is that the function can call itself. Don’t worry if you have never heard of recursion before. This function will take two inputs, the text of the formula as well as a coefficient. The coefficient will come into play when we handle parentheses, brackets, and dots.

Dots: two functions

Ok, again I want to mention that I am not a chemist. Dots are most commonly used to show that something is associated with the crystalline structure of the base molecule. However, they can also be used when the chemist does not know exactly what is going on. Either way, after the dot, there will be a coefficient representing how many of them there are, followed by a formula.

We will have one function to tell if there is a Dot in our formula, and another function to break this part off and call our driver to read the broken off formula. To read the coefficient, we will reuse the subscript-reading function mentioned above. We will then pass the driver the broken off text as well as the coefficient.

I will illustrate what the breaking off will look like.

K3[Fe(C2O4)3].3H2O

Will become

K3[Fe(C2O4)3]

and

3 H2O

Recall that the 3 is the coefficient, so this part will be read into our data structure as

H6O3
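The dot handling described above could be sketched roughly like this. The names `has_dot` and `split_dot` are placeholders of mine, and in this sketch the split function only returns the pieces instead of calling the driver itself. I am also assuming that a dot never appears inside parentheses, and I split at the last dot so that a formula with more than one dot would still peel off one piece at a time.

```python
DIGITS = "0123456789"

def has_dot(formula):
    """Tell whether the formula contains a dot."""
    return "." in formula

def split_dot(formula):
    """Split at the last dot into (base, coefficient, broken_off_formula)."""
    base, _, tail = formula.rpartition(".")
    # Read the coefficient that follows the dot; no digits means 1.
    digits = 0
    while digits < len(tail) and tail[digits] in DIGITS:
        digits += 1
    coefficient = int(tail[:digits]) if digits else 1
    return base, coefficient, tail[digits:]

formula = "K3[Fe(C2O4)3].3H2O"
if has_dot(formula):
    base, coefficient, broken_off = split_dot(formula)
    print(base, coefficient, broken_off)
```

The driver would then be called once on the base with the current coefficient, and once on the broken off text with the current coefficient multiplied by the one we just read.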

Parentheses and Brackets: two functions

As far as I know, there is no meaningful difference between parentheses and brackets. I believe parentheses are preferred and brackets are only used if a formula already uses parentheses. That being said, we will replace any brackets with parentheses to make things easier.

We will use two functions again, just like for the Dot. One to test if they are there, and another to break it off.

This will be very similar to what we did with the dot. The only difference is that we are looking for a subscript after the closing parenthesis rather than a coefficient after the dot, but we will still pass it to the driver as a coefficient. Outermost parentheses will always be considered first.
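A possible sketch of these two functions follows. Again the names (`normalize_brackets`, `has_group`, `split_group`) are my own assumptions, and the split function just returns its pieces rather than calling the driver. To find the outermost group I count how deep we are in nested parentheses and stop when the count returns to zero.

```python
DIGITS = "0123456789"

def normalize_brackets(formula):
    """Replace square brackets with parentheses."""
    return formula.replace("[", "(").replace("]", ")")

def has_group(formula):
    """Tell whether the formula contains a parenthesized group."""
    return "(" in formula

def split_group(formula):
    """Break off the first outermost group.

    Returns (remainder, inner, subscript), where remainder is the formula
    with the group and its subscript removed.
    """
    start = formula.index("(")
    depth = 0
    for close in range(start, len(formula)):
        if formula[close] == "(":
            depth += 1
        elif formula[close] == ")":
            depth -= 1
            if depth == 0:
                break
    else:
        # The loop ran out of characters without closing the group.
        raise ValueError(f"unbalanced parentheses in {formula!r}")
    # Read the subscript after the closing parenthesis; none means 1.
    end = close + 1
    while end < len(formula) and formula[end] in DIGITS:
        end += 1
    subscript = int(formula[close + 1:end]) if end > close + 1 else 1
    remainder = formula[:start] + formula[end:]
    inner = formula[start + 1:close]
    return remainder, inner, subscript

print(split_group(normalize_brackets("K3[Fe(C2O4)3].3H2O")))
print(split_group("Fe(C2O4)3"))
```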

I will do another illustration, adding back the dot.

K3[Fe(C2O4)3].3H2O

Change brackets to parentheses

K3(Fe(C2O4)3).3H2O

Breaking off, we will have

K3.3H2O

and

Fe(C2O4)3

then

Fe(C2O4)3

will be broken off as

Fe

and

3 C2O4

where the 3 will be passed to the driver as the coefficient.
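Putting the pieces together, here is one way the whole recursive driver could look. Please treat this as a sketch under my own assumptions, not the final code: the function names, the optional `counts` argument, and the choice to handle dots before parentheses are all mine. It also assumes dots only appear outside of parentheses.

```python
DIGITS = "0123456789"

def read_symbol(text, pos):
    """Return (symbol, next_pos): a capital letter plus an optional lowercase one."""
    if not text[pos].isupper():
        raise ValueError(f"expected an element symbol at position {pos} of {text!r}")
    end = pos + 1
    if end < len(text) and text[end].islower():
        end += 1
    return text[pos:end], end

def read_number(text, pos):
    """Return (number, next_pos); a missing number counts as 1."""
    end = pos
    while end < len(text) and text[end] in DIGITS:
        end += 1
    return (int(text[pos:end]) if end > pos else 1), end

def split_dot(formula):
    """Split at the last dot into (base, coefficient, broken_off_formula)."""
    base, _, tail = formula.rpartition(".")
    coefficient, end = read_number(tail, 0)
    return base, coefficient, tail[end:]

def split_group(formula):
    """Break off the first outermost group as (remainder, inner, subscript)."""
    start = formula.index("(")
    depth = 0
    for close in range(start, len(formula)):
        if formula[close] == "(":
            depth += 1
        elif formula[close] == ")":
            depth -= 1
            if depth == 0:
                break
    else:
        raise ValueError(f"unbalanced parentheses in {formula!r}")
    subscript, end = read_number(formula, close + 1)
    return formula[:start] + formula[end:], formula[start + 1:close], subscript

def parse(formula, coefficient=1, counts=None):
    """Recursive driver: add coefficient * (element counts of formula) to counts."""
    if counts is None:
        counts = {}
    formula = formula.replace("[", "(").replace("]", ")")
    if "." in formula:
        base, dot_coefficient, broken_off = split_dot(formula)
        parse(base, coefficient, counts)
        parse(broken_off, coefficient * dot_coefficient, counts)
    elif "(" in formula:
        remainder, inner, subscript = split_group(formula)
        parse(remainder, coefficient, counts)
        parse(inner, coefficient * subscript, counts)
    else:
        # Nothing left but element symbols and subscripts.
        pos = 0
        while pos < len(formula):
            symbol, pos = read_symbol(formula, pos)
            subscript, pos = read_number(formula, pos)
            counts[symbol] = counts.get(symbol, 0) + coefficient * subscript
    return counts

print(parse("K3[Fe(C2O4)3].3H2O"))
print(parse("Fe3[Fe(CN)6]2"))
```

Every recursive call multiplies the incoming coefficient by the subscript or dot coefficient it just read, which is how the 3 in front of H2O and the 3 after (C2O4) end up scaling the right elements.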

Final Thoughts

I hope that was all clear. If not, I hope it will become clearer once we start working on the actual code.