How Computers “See” Molecules

To a computer, Edvard Munch’s The Scream is nothing more than a grid of pixel values. It has no sense of why swirling lines in a twilight sky convey the agony of a scream. That’s because (modern digital) computers fundamentally process only binary signals [1,2]; they don’t inherently comprehend the objects and emotions we perceive.

To mimic human intelligence, we first need an intermediate form (representation) to “translate” our sensory world into something a computer can handle. For The Scream, that might mean extracting edges, colors, shapes, etc. Likewise, in Natural Language Processing (NLP), a computer sees human language as an unstructured stream of symbols that must be turned into numeric vectors or other structured forms. Only then can it begin to map raw input to higher-level concepts (i.e., building a model).

Human intelligence also depends on internal representations.

In psychology, a representation refers to an internal mental symbol or image that stands for something in the outside world [3]. In other words, a representation is how information is encoded in the brain: the symbols we use (words, images, memories, artistic depictions, etc.) to stand for objects and ideas.

Our senses don’t simply put the external world directly into our brains; instead, they convert sensory input into abstract neural signals. For example, the eyes convert light into electrical signals on the retina, and the ears turn air vibrations into nerve impulses. These neural signals are the brain’s representation of the external world, which is used to reconstruct our perception of reality, essentially building a “model” in our mind.

Between ages one and two, children enter Piaget’s early preoperational stage [4]. This is when kids start using one thing to represent another: a toddler might hold a banana up to their ear and babble as if it’s a phone, or push a box around pretending it’s a car. This kind of symbolic play is important for cognitive development, because it shows the child can move beyond the here-and-now and project the concepts in their mind onto reality [5].

Without our senses translating physical signals into internal codes, we couldn’t perceive anything [5].

“Garbage in, garbage out”. The quality of a representation sets an upper bound on the performance of any model built on it [6,7].

Much of the progress in human intelligence has come from improving how we represent knowledge [8].

One of the core goals of education is to help students form effective mental representations of new knowledge. Seasoned educators use diagrams, animations, analogies and other tools to present abstract concepts in a vivid, relatable way. Richard Mayer argues that meaningful learning happens when learners form a coherent mental representation or model of the material, rather than just memorizing disconnected facts [8]. In meaningful learning, new information integrates into existing knowledge, allowing students to transfer and apply it in novel situations.

However, in practice, factors like limited model capacity and finite computing resources constrain how complex our representations can be. Compressing input data inevitably risks information loss, noise, and artifacts. So, as the first step, developing a “good enough” representation requires balancing several key properties:

  • It should retain the information critical to the task. (A clear problem definition helps filter out the rest.)
  • It should be as compact as possible: minimizing redundancy and keeping dimensionality low.
  • It should separate classes in feature space, so that samples from the same class cluster together while those from different classes stay far apart.
  • It should be robust to input noise, compression artifacts, and shifts in data modality.
  • It should be invariant to task-irrelevant changes (e.g., rotating or translating an image, or changing its brightness).
  • It should generalize beyond the training data.
  • It should be interpretable.
  • It should transfer to related tasks.

These limitations on representation complexity are somewhat analogous to the limited capacity of our own working memory.

Human short-term memory, on average, can only hold about 7±2 items at once [9]. When too many independent pieces of information arrive simultaneously (beyond what our cognitive load can handle), our brains bog down. Cognitive psychology research shows that with the right guidance (by adjusting how information is represented), people can reorganize information to overcome this apparent limit [10,11]. For example, we can remember a long string of digits more easily by chunking them into meaningful groups (which is why phone numbers are often split into shorter blocks).

Now, shifting from The Scream to the microscopic world of molecules, we face the same challenge: how can we translate real-world molecules into a form that a computer can understand? With the right representation, a computer can infer chemical properties or biological functions, and ultimately map those to higher‑level concepts (e.g., a drug’s activity or a molecule’s protein binding). In this article, we’ll explore the common methods that let computers “see” molecules.

Chemical Formula

Perhaps the most straightforward depiction of a molecule is its chemical formula, like C8H10N4O2 (caffeine), which tells us there are 8 carbon atoms, 10 hydrogen atoms, 4 nitrogen atoms and 2 oxygen atoms. However, its very simplicity is also its limitation: a formula conveys nothing about how those atoms are connected (the bonding topology), how they are arranged in space, or where functional groups are located. That’s why isomers like ethanol and dimethyl ether share the same formula, C2H6O, yet differ completely in structure and properties.

Chemical formula and 2D structures of ethanol and dimethyl ether. Image by author.
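To see just how little a formula carries, here is a minimal sketch in plain Python (the `parse_formula` helper is ours, purely for illustration) that splits a formula string into element counts. Ethanol and dimethyl ether produce exactly the same counts, which is all a computer can “see” at this level.

```python
import re
from collections import Counter

def parse_formula(formula: str) -> Counter:
    """Split a molecular formula into element counts, e.g. 'C8H10N4O2' -> {C: 8, H: 10, N: 4, O: 2}."""
    counts = Counter()
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element:  # skip the empty match the regex produces at the end of the string
            counts[element] += int(count) if count else 1
    return counts

print(parse_formula("C8H10N4O2"))  # caffeine
# Ethanol and dimethyl ether are different molecules, but their formulas are identical:
print(parse_formula("C2H6O"))      # the formula alone cannot tell them apart
```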

Linear String

Another common way to represent molecules is to encode them as a linear string of characters, a format widely adopted in databases [12,13].

SMILES

The most classic example is SMILES (Simplified Molecular Input Line Entry System) [14], developed by David Weininger in the 1980s. SMILES treats atoms as nodes and bonds as edges, then “flattens” them into a 1D string via a depth‑first traversal, preserving all the connectivity and ring information. Single, double, triple, and aromatic bonds are denoted by the symbols “-”, “=”, “#”, and “:”, respectively. Numbers are used to mark the start and end of rings, and branches off the main chain are enclosed in parentheses. (See more in SMILES – Wikipedia.)

SMILES is simple, intuitive, and compact for storage. Its extended syntax supports stereochemistry and isotopes. There is also a rich ecosystem of tools supporting it: most chemistry libraries let us convert between SMILES and other standard formats.

However, without an agreed-upon canonicalization algorithm, the same molecule can be written in multiple valid SMILES forms. This can potentially lead to inconsistencies or “data pollution”, especially when merging data from multiple sources.
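As a quick illustration (a sketch assuming RDKit [19] is installed), we can parse two different but equally valid SMILES strings for caffeine and let the toolkit emit its canonical form, which resolves both to the same string:

```python
from rdkit import Chem

# Two different-looking (but both valid) SMILES strings for caffeine
variants = [
    "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
]

# MolFromSmiles parses each string into a molecule object;
# MolToSmiles writes it back out in RDKit's canonical form
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in variants}
print(canonical)  # a single canonical SMILES: both inputs are recognized as the same molecule
```

Note that canonical forms are toolkit-specific: RDKit’s canonical SMILES may differ from another package’s, which is one reason merged datasets are usually re-canonicalized with a single tool.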

InChI

Another widely used string format is InChI (International Chemical Identifier) [15], introduced by IUPAC in 2005 to provide globally standardized, machine-readable, and unique molecular identifiers. InChI strings, though longer than SMILES, encode more details in layers (including atoms and their bond connectivity, tautomeric state, isotopes, stereochemistry, and charge), each with strict rules and priority. (See more in InChI – Wikipedia.)

Because an InChI string can become very lengthy as a molecule grows more complex, it is often paired with a 27‑character InChIKey hash [15]. The InChIKeys aren’t human‑friendly, but they’re ideal for database indexing and for exchanging molecule identifiers across systems.

Linear representations of caffeine. Image by author.
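Continuing with caffeine, a short sketch (assuming an RDKit build with InChI support) converts the molecule into its InChI string and the corresponding InChIKey:

```python
from rdkit import Chem

caffeine = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C)C(=O)N2C")

# InChI encodes formula, connectivity, hydrogen positions, etc. in ordered layers;
# the InChIKey is a fixed-length hash of it, convenient for indexing and lookup.
print(Chem.MolToInchi(caffeine))
print(Chem.MolToInchiKey(caffeine))
```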

Molecular Descriptor

Many computational models require numeric inputs. Compared to linear string representations, molecular descriptors turn a molecule’s properties and patterns into a vector of numerical features, delivering satisfactory performance in many tasks [7, 16-18].

Todeschini and Consonni describe the molecular descriptor as the “final result of a logical and mathematical procedure, which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment” [16].

We can think of a set of molecular descriptors as a standardized “physical exam sheet” for a molecule, asking questions like:

  • Does it have a benzene ring?
  • How many carbon atoms does it have?
  • What’s the predicted octanol-water partition coefficient (LogP)?
  • Which functional groups are present?
  • What is its 3D conformation or electron distribution like?

Their answers can take various forms, such as numerical values, categorical flags, vectors, graph-based structures, tensors, etc. Because every molecule in our dataset is described using the same set of questions (the same “physical exam sheet”), comparisons and model inputs become straightforward. And because each feature has a clear meaning, descriptors improve the interpretability of the model.
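Here is a minimal “exam sheet” sketch using a handful of RDKit descriptors; the particular selection below is ours and only for illustration:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C)C(=O)N2C")  # caffeine

# A tiny "exam sheet": each entry is one named, numeric answer about the molecule
exam_sheet = {
    "molecular_weight": Descriptors.MolWt(mol),
    "logp_estimate": Descriptors.MolLogP(mol),   # predicted octanol-water partition coefficient
    "tpsa": Descriptors.TPSA(mol),               # topological polar surface area
    "h_bond_donors": Descriptors.NumHDonors(mol),
    "ring_count": Descriptors.RingCount(mol),
}
print(exam_sheet)
```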

Of course, just as a physical exam sheet can’t capture absolutely everything about a person’s health, a finite set of molecular descriptors can never capture all aspects of a molecule’s chemical and physical nature. Computing descriptors is typically a non-invertible process, inevitably leading to a loss of information, and the results are not guaranteed to be unique. Therefore, there are different types of molecular descriptors, each focusing on different aspects.

Thousands of molecular descriptors have been developed over the years (for example, RDKit [19], CDK [20], Mordred [17], etc.). They can be broadly categorized by the dimensionality of information they encode (these categories aren’t strict divisions):

  • 0D: formula‑based properties independent of structure (e.g., atom counts or molecular weight).
  • 1D: sequence-based properties (e.g., counts of certain functional groups).
  • 2D: derived from the 2D topology (e.g., eccentric connectivity index [21]).
  • 3D: derived from 3D conformation, capturing geometric or spatial properties (e.g., charged partial surface area [22]).
  • 4D and higher: these incorporate additional dimensions such as time, ensemble, or environmental factors (e.g., descriptors derived from molecular dynamics simulations, or from quantum chemical calculations like HOMO/LUMO).
  • Descriptors obtained from other sources including experimental measurements.

Molecular fingerprints are a special kind of molecular descriptor that encode substructures into a fixed-length numerical vector [16]. Boldini et al. [23] summarize some commonly used molecular fingerprints, such as MACCS [24], which is shown in the figure below.

Similarly, human fingerprints or product barcodes can also be seen as (or converted to) fixed-format numerical representations.
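For example, a short RDKit sketch computes the MACCS keys for caffeine, yielding a fixed-length bit vector in which each set bit answers “yes” to one predefined substructure question:

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C)C(=O)N2C")  # caffeine

# MACCS keys: a fixed-length bit vector where each position asks
# "is this particular substructure present?"
fp = MACCSkeys.GenMACCSKeys(mol)
print(fp.GetNumBits())            # 167 bits (bit 0 is unused by convention)
print(list(fp.GetOnBits())[:10])  # indices of the first few substructure keys that are set
```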

Different descriptors describe molecules from various aspects, so their contributions to different tasks naturally vary. In a task of predicting the aqueous solubility of drug-like molecules, over 4,000 computed descriptors were evaluated, but only about 800 made significant contributions to the prediction [7].

Some molecular descriptors of caffeine from PubChem, DrugBank and RDKit. Image by author.

Point Cloud

Sometimes, we need our models to learn directly from a molecule’s 3D structure. For example, this is important when we’re interested in how two molecules might interact with each other [25], need to search the possible conformations of a molecule [26], or want to simulate its behavior in a certain environment [27].

One straightforward way to represent a 3D structure is as a point cloud of its atoms [28]. In other words, a point cloud is a collection of coordinates of the atoms in 3D space. However, while this representation shows which atoms are near each other, it doesn’t explicitly tell us which pairs of atoms are bonded. Inferring connectivity from interatomic distances (e.g., via cutoffs) can be error-prone, and may miss higher‑order chemistry like aromaticity or conjugation. Moreover, our model must account for changes of raw coordinates due to rotation or translation. (More on this later.)
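A sketch of this idea with RDKit: we generate one 3D conformer for caffeine and read out the raw atomic coordinates, which is all a point cloud representation keeps.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C)C(=O)N2C"))  # caffeine with explicit hydrogens
AllChem.EmbedMolecule(mol, randomSeed=0)   # generate one 3D conformer

# The point cloud: one (x, y, z) row per atom; note that no bond information is included
coords = mol.GetConformer().GetPositions()
symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
print(coords.shape)   # (num_atoms, 3)
print(symbols[:5], coords[:5])
```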

Graph

A molecule can also be represented as a graph, where atoms (nodes) are connected by bonds (edges). Graph representations elegantly handle rings, branches, and complex bonding arrangements. For example, in a SMILES string, a benzene ring must be “opened” and denoted by special symbols, whereas in a graph, it’s simply a cycle of nodes connected in a loop.

Molecules are commonly modeled as undirected graphs (since bonds have no inherent direction) [29-31]. We can further “decorate” the graph with additional domain-specific knowledge to make the representation more interpretable: tagging nodes with atom features (e.g., element type, charge, aromaticity) and edges with bond properties (e.g., order, length, strength). Therefore,

  • (uniqueness) each distinct molecular structure could correspond to a unique graph, and
  • (reversibility) we could reconstruct the original molecule from its graph representation.
Ball-and-stick and two representations of caffeine’s 3D conformation. (Gray: carbon; blue: nitrogen; plum: hydrogen; red: oxygen). Image by author.
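Below is a small sketch that turns an RDKit molecule into graph ingredients: a list of node (atom) features, a list of edges (bonds) with bond orders, and an adjacency matrix. The specific features chosen are only illustrative.

```python
import numpy as np
from rdkit import Chem

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C)C(=O)N2C")  # caffeine (heavy atoms only)

# Node features: one entry per atom (here just a few illustrative properties)
nodes = [(a.GetSymbol(), a.GetFormalCharge(), a.GetIsAromatic()) for a in mol.GetAtoms()]

# Edges: one entry per bond, plus a symmetric adjacency matrix
n = mol.GetNumAtoms()
adjacency = np.zeros((n, n))
edges = []
for bond in mol.GetBonds():
    i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
    edges.append((i, j, bond.GetBondTypeAsDouble()))  # e.g. 1.0, 2.0, or 1.5 (aromatic)
    adjacency[i, j] = adjacency[j, i] = 1

print(len(nodes), "atoms,", len(edges), "bonds")
```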

Chemical reactions essentially involve breaking bonds and forming new ones. Using graphs makes it easier to track these changes. Some reaction‑prediction models encode reactants and products as graphs and infer the transformation by comparing them [32,33].

Graph Neural Networks (GNNs) can directly process graphs and learn from them. Using molecular graph representations, these models naturally handle molecules of arbitrary size and topology. In fact, GNNs have outperformed models that rely only on descriptors or linear strings on many molecular tasks [7,30,34].

Often, when a GNN makes a prediction, we can inspect which parts of the graph were most influential. These “important bits” frequently correspond to actual chemical substructures or functional groups. In contrast, if we were looking at a particular substring of a SMILES, it’s not guaranteed to map neatly to a meaningful substructure.
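To give a flavor of how GNNs consume such a graph, here is a framework-free sketch of a single, simplified message-passing step: each node averages the features of its neighbors (and itself) and applies a learned linear map. Real GNNs learn the weight matrices and stack several such layers; the function name and toy data below are purely illustrative.

```python
import numpy as np

def message_passing_step(node_features, adjacency, weights):
    """One simplified message-passing/aggregation step over a graph."""
    a_hat = adjacency + np.eye(adjacency.shape[0])      # add self-loops
    degree = a_hat.sum(axis=1, keepdims=True)
    aggregated = (a_hat @ node_features) / degree        # mean over each node's neighborhood
    return np.maximum(aggregated @ weights, 0)            # learned linear map + ReLU

# Toy example: 4 nodes with 3-dimensional features, transformed to 2 dimensions
rng = np.random.default_rng(0)
node_features = rng.normal(size=(4, 3))
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 1],
                      [0, 1, 0, 0],
                      [0, 1, 0, 0]], dtype=float)
weights = rng.normal(size=(3, 2))
print(message_passing_step(node_features, adjacency, weights))
```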

A graph doesn’t always mean just the direct bonds connecting atoms. We can construct different kinds of graphs from molecular data depending on our needs, and sometimes these alternate graphs yield better results for particular applications. For example:

  • Complete graph: every pair of nodes is connected by an edge. It can introduce redundant connections, but might be used to let a model consider all pairwise interactions.
  • Bipartite graph: nodes are divided into two sets, and edges only connect nodes from one set to nodes from the other.
  • Nearest-neighbor graph: each node is connected only to its nearest neighbors (according to some criterion), for controlling complexity.
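For instance, a nearest-neighbor graph can be built directly from 3D coordinates. This sketch (with a hypothetical `knn_edges` helper) connects each point to its k closest neighbors:

```python
import numpy as np

def knn_edges(coords, k=3):
    """Connect each point to its k nearest neighbors (by Euclidean distance)."""
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                # don't connect a node to itself
    neighbors = np.argsort(dists, axis=1)[:, :k]   # indices of the k closest points per node
    return [(i, int(j)) for i in range(len(coords)) for j in neighbors[i]]

coords = np.random.default_rng(0).normal(size=(6, 3))  # e.g. 6 atoms in 3D
print(knn_edges(coords, k=2))
```

The resulting edge list is directed; adding the reverse of every edge symmetrizes it into an undirected neighbor graph.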

Extensible Graph Representations

We can incorporate chemical rules or impose constraints within molecular graphs. In de novo molecular design, (early) SMILES-based generative models often ended up proposing invalid molecules, because: (1) assembling characters may break SMILES syntax, and (2) even a syntactically correct SMILES might encode an impossible structure. Graph-based generative models avoid both pitfalls by building molecules atom by atom and bond by bond (under user-specified chemical rules). Graphs also let us impose constraints: require or forbid specific substructures, enforce 3D shapes or chirality, and so on, thereby guiding generation toward valid candidates that meet our goals [35,36].

Molecular graphs can also handle multiple molecules and their interactions (e.g., drug-protein binding, protein-protein interfaces). “Graph-of-graphs” approaches treat each molecule as its own graph, then deploy a higher-level model to learn how the graphs interact [37]. Alternatively, we may merge the molecules into one composite graph that includes all atoms from both partners, adding special (dummy) edges or nodes to mark their contacts [38].

So far, we’ve been considering the standard graph of bonds (the 2D connectivity), but what if the 3D arrangement matters? Graph representations can certainly be augmented with 3D information: 3D coordinates can be attached to each node, or distances and angles added as attributes on the edges, to make models more sensitive to differences in 3D configurations. A better option is to use models like SE(3)-equivariant GNNs, whose outputs (or key internal features) transform consistently with, or stay invariant under, any rotation or translation of the input.

In 3D space, the special Euclidean group SE(3) describes all possible rigid motions (any combination of rotations and translations). (It’s sometimes described as a semidirect product of the rotation group SO(3) with the translation group ℝ³.) [28]

When we say a model or a function has SE(3) invariance, we mean that it gives the same result no matter how we rotate or translate the input in 3D. This kind of invariance is often an essential requirement for many molecular modeling tasks: a molecule floating in solution has no fixed reference frame (i.e., it can tumble around in space). So, if we predict some property of the molecule (say its binding affinity), that prediction should not be influenced by the molecule’s orientation or position.
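A quick numerical check of this idea (a sketch assuming NumPy and SciPy are available): apply a random rotation and translation to a toy set of coordinates, and compare a coordinate-based feature with a distance-based one.

```python
import numpy as np
from scipy.spatial.transform import Rotation

coords = np.random.default_rng(0).normal(size=(5, 3))  # a toy "molecule": 5 atoms in 3D

def pairwise_distances(x):
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

# Apply an arbitrary rigid motion: a random rotation followed by a translation
rotation = Rotation.random(random_state=0).as_matrix()
moved = coords @ rotation.T + np.array([1.0, -2.0, 0.5])

# Raw coordinates change, but pairwise distances (an SE(3)-invariant feature) do not
print(np.allclose(coords, moved))                                           # False
print(np.allclose(pairwise_distances(coords), pairwise_distances(moved)))   # True
```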

Sequence Representations of Biomacromolecules

We’ve talked mostly about small molecules. But biological macromolecules (like proteins, DNA, and RNA) can contain thousands or even millions of atoms. Their SMILES or InChI strings become extremely long and complex, with correspondingly massive computational, storage, and analysis costs.

This brings us back to the importance of defining the problem: for biomacromolecules, we’re often not interested in the precise position of every single atom or the exact bonds between each pair of atoms. Instead, we care about higher-level structural patterns and functional modules: like a protein’s amino acid backbone and its alpha‑helices or beta‑sheets, which fold into tertiary and quaternary structures. For DNA and RNA, we may care about nucleotide sequences and motifs.

We describe these biological polymers as sequences of their building blocks (i.e., primary structure): proteins as chains of amino acids, and DNA/RNA as strings of nucleotides. There are well-established codes for these building blocks (defined by IUPAC/IUBMB): for instance, in DNA, the letters A, C, G, T represent the bases adenine, cytosine, guanine, and thymine respectively.

Static Embeddings and Pretrained Embeddings

To convert a sequence into numerical vectors, we can use static embeddings: assigning a fixed vector to each residue (or k-mer fragment). The simplest static embedding is one-hot encoding (e.g., encode adenine A as [1,0,0,0]), turning a sequence into a matrix. Another approach is to learn dense (pretrained) embeddings by leveraging large databases of sequences. For example, ProtVec [39] breaks proteins into overlapping 3‑mers and trains a Word2Vec‑like model (commonly used in NLP) on a large corpus of sequences, assigning each 3-mer a 100D vector. These learned fragment embeddings are shown to capture biochemical and biophysical patterns: fragments with similar functions or properties cluster closer in the embedding space.

k-mer fragments (or k-mers) are substrings of length k extracted from a biological sequence.
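Both ideas are easy to sketch in a few lines of Python: one-hot encoding turns a DNA sequence into a matrix, and a k-mer splitter produces the overlapping “words” that ProtVec- or DNABERT-style models consume (the helper names below are ours):

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """One row per base, one column per letter of the alphabet."""
    return np.array([[1 if base == b else 0 for b in BASES] for base in seq])

def kmers(seq: str, k: int = 3) -> list[str]:
    """Overlapping substrings of length k: the 'words' of the sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = "ACGTTGCA"
print(one_hot(seq).shape)   # (8, 4): the sequence becomes a matrix
print(kmers(seq, k=3))      # ['ACG', 'CGT', 'GTT', ...]
```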

Tokens

Inspired by NLP, we can treat a sequence as if it’s a sentence composed of tokens or words (i.e., residues or k-mer fragments), and then feed them into deep language models. Trained on massive collections of sequences, these models learn biology’s “grammar” and “semantics” just as they do in human language.

Transformers can use self-attention to capture long-range dependencies in sequences; we essentially use them to learn a “language of biology”. Meta’s ESM series of models [40-42], for example, trained Transformers on hundreds of millions of protein sequences. Similarly, DNABERT [43] tokenizes DNA into k-mers for BERT training on genomic data. The resulting embeddings have been shown to encapsulate a wealth of biological information, and in many cases they can be used directly for various downstream tasks (i.e., transfer learning).

Descriptors

In practice, sequence-based models often combine their embeddings with physicochemical properties, statistical features, and other descriptors, such as the percentage of each amino acid in a protein, the GC content of a DNA sequence, or indices like hydrophobicity, polarity, charge, and molecular volume.
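Two of these descriptors, GC content and amino acid composition, are simple enough to compute directly; the helper functions below are illustrative sketches:

```python
from collections import Counter

def gc_content(dna: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna)

def aa_composition(protein: str) -> dict:
    """Fraction of each amino acid in a protein sequence."""
    counts = Counter(protein.upper())
    return {aa: count / len(protein) for aa, count in counts.items()}

print(gc_content("ATGCGCGTTA"))      # 0.5
print(aa_composition("MKTAYIAKQR"))
```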

Beyond the main categories above, there are some other unconventional ways to represent sequences. Chaos Game Representation (CGR) [44] maps DNA sequences to points in a 2D plane, creating distinctive image patterns for downstream analysis.

Structural Representations of Biomacromolecules

A protein’s complex structure determines its functions and specificities [28]. Simply knowing the linear sequence of residues is often not enough to fully understand a biomolecule’s function or mechanism (the so-called sequence-structure gap).

Structures tend to be more conserved than sequences [28, 45]. Two proteins might have very divergent sequences but still fold into highly similar 3D structures [46]. Solving the structure of a biomolecule can give insights that we wouldn’t get just from the sequence alone.

Granularity and Dimensionality Control

A single biomolecule may contain on the order of 10³–10⁵ atoms (or even more). Encoding every atom and bond explicitly into numerical form produces prohibitively high-dimensional, sparse representations.

Adding dimensions to the representation can quickly run into the curse of dimensionality. As we increase the dimensionality of our data, the “space” we’re asking our model to cover grows exponentially. Data points become sparser relative to that space (it’s like having a few needles in an ever-expanding haystack). This sparsity means a model might need vastly more training examples to find reliable patterns. Meanwhile, the computational cost of processing the data often grows polynomially or worse with dimensionality.

Not every atom is equally important for the question we care about, so we often adjust the granularity of our representation or reduce dimensionality in smart ways (such data often has a lower-dimensional effective representation that can describe the system without significant performance loss [47]):

  • For proteins, each amino acid can be represented by the coordinates of just its alpha carbon (Cα); see the extraction sketch after this list. For nucleic acids, one might take each nucleotide and represent it by the position of its phosphate group or by the center of its base or sugar ring.
  • Another example of controlled granularity comes from how AlphaFold [49] represents proteins using backbone rigid groups (or frames). Essentially, for each amino acid, a small set of main-chain atoms, typically N, Cα, C (and sometimes O), is treated as a unit. The relative geometry of these atoms is almost fixed (covalent bond lengths and angles don’t vary significantly), so the unit can be treated as a rigid block. Instead of tracking each atom separately, the model tracks the position and orientation of that entire block in space, reducing the risks associated with excessive degrees of freedom [28] (i.e., errors from the internal movement of atoms within a residue).
Heavy atoms in protein backbone with dihedral angles. Image derived from [28].
  • If we have a large set of protein structures (or a long molecular dynamics trajectory), it can be useful to cluster those conformations into a few representative states. This is often done when building Markov state models: by clustering continuous states into a finite set of discrete “metastable” states, we can simplify a complex energy landscape into a network of a few states connected by transition probabilities.

Many coarse-grained molecular dynamics force fields, such as MARTINI [50] and UNRES [51], have been developed to represent structural details using fewer particles.

  • To capture side-chain effects without modelling all internal atoms or adding excessive degrees of freedom, a common approach is to represent each side-chain with a single point, typically its center of mass [52]. Such side-chain centroid models are often used in conjunction with backbone models.
  • The 3Di alphabet introduced by Foldseek [53] defines a 3D interaction “alphabet” of 20 states that describe protein tertiary interactions. A protein’s 3D structure can thus be converted into a sequence over this 20-letter alphabet, and two structures can be aligned by aligning their 3Di sequences.
  • We may spatially crop or focus on just part of a biomolecule. For instance, if we’re studying how a small drug molecule binds to a protein (say, in a dataset like PDBBind [54], which is full of protein-ligand complexes), we may feed only the binding pockets and the ligands into our model.
  • Combining different granularities or modalities of data.
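As a concrete example of the Cα coarse-graining mentioned in the first bullet above, here is a sketch using Biopython’s PDB parser (the file name is hypothetical): each residue is reduced to a single 3D point.

```python
import numpy as np
from Bio.PDB import PDBParser  # Biopython

# Parse a structure file and keep one point per residue: the alpha-carbon coordinate
parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "my_protein.pdb")  # hypothetical local file

ca_coords = []
for residue in structure.get_residues():
    if residue.has_id("CA"):                   # skip waters, ligands, and residues without a Ca
        ca_coords.append(residue["CA"].get_coord())

ca_coords = np.array(ca_coords)
print(ca_coords.shape)  # (num_residues, 3): thousands of atoms reduced to one point per residue
```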

Point Cloud

We could model a biomacromolecule as a massive 3D point cloud of every atom (or residue). As noted earlier, the same limitations apply.

Distance Matrix

A distance matrix records all pairwise distances between certain key atoms (for proteins, commonly the Cα of each amino acid), and is inherently invariant to rotation and translation, since pairwise distances are unchanged by rigid motions. A contact map simplifies this further by indicating only which pairs of residues are “close enough” to be in contact. However, both representations lose orientation information (a structure and its mirror image, for instance, give identical distances), so not all structural details can be recovered from them alone.
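Given an array of Cα coordinates (such as the one extracted in the sketch above), both representations are one NumPy expression away; the 8 Å cutoff below is a common but illustrative choice.

```python
import numpy as np

# ca_coords: (num_residues, 3) array of alpha-carbon coordinates; stand-in random data here
ca_coords = np.random.default_rng(0).normal(scale=10.0, size=(50, 3))

# Pairwise Ca-Ca distance matrix: symmetric, and unchanged by rotating/translating the structure
dist = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)

# Contact map: 1 where two residues are within the cutoff, 0 otherwise
contact_map = (dist < 8.0).astype(int)
print(dist.shape, contact_map.sum())
```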

Graph

Just as with small molecules, we can use graphs for macromolecular structures [55,56]. Instead of atoms, each node might represent a larger unit (see Granularity and Dimensionality Control). To improve interpretability, additional knowledge, like residue descriptors and known interaction networks within a protein, may also be incorporated into nodes and edges. The graph representation for biomacromolecules inherits many of the advantages we discussed for small molecules.

For macromolecules, edges are often pruned to keep the graph sparse and manageable in size: the representation focuses on local substructures, while far-apart relationships are treated as background context.

General dimensionality reduction methods such as PCA, t-SNE and UMAP are also widely used to analyze the high-dimensional structural data of macromolecules. While they don’t give us representations for computation in the same sense as the others we’ve discussed, they help project complex data into lower dimensions (e.g., for visualization or insights).

Latent Space

When we train a model (especially a generative model), it often learns to encode data into a compressed internal representation. This internal representation lives in some lower-dimensional space, known as the latent space. Think of London’s dense, intricate urban layout: the latent space is like a map that captures its essence in a simplified form.

Latent spaces are usually not directly interpretable, but we can explore them by seeing how changes in latent variables map to changes in the output. In molecular generation, if a model maps molecules into a latent space, we can take two molecules (say, as two points in that space) and generate a path between them. Ochiai et al. [57] did this by taking two known molecules as endpoints, interpolating between their latent representations, and decoding the intermediate points. The result was a set of new molecules that blended features of both originals: hybrids that might have mixed properties of the two.
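The interpolation itself is straightforward once a model provides encode/decode functions; this sketch assumes a hypothetical trained autoencoder and simply walks along the straight line between two latent points:

```python
import numpy as np

def interpolate_latent(encode, decode, mol_a, mol_b, steps=5):
    """Decode evenly spaced points on the line between two molecules' latent vectors.

    `encode` and `decode` stand for a hypothetical trained model's encoder/decoder;
    the interpolation logic itself is independent of the specific model.
    """
    z_a, z_b = encode(mol_a), encode(mol_b)
    candidates = []
    for t in np.linspace(0.0, 1.0, steps):
        z = (1 - t) * z_a + t * z_b   # walk along the straight line between the two latent points
        candidates.append(decode(z))   # each intermediate point decodes to a candidate molecule
    return candidates
```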


—— About Author ——

Tianyuan Zheng
Computational Biology, Bioinformatics, Artificial Intelligence

Department of Computer Science and Technology
Department of Applied Mathematics and Theoretical Physics
University of Cambridge


References

  1. Patterson DA, Hennessy JL. Computer organization and design ARM edition: the hardware software interface. Morgan Kaufmann; 2016 May 6.
  2. Harris S, Harris D. Digital Design and Computer Architecture, RISC-V Edition. Morgan Kaufmann; 2021 Jul 12.
  3. Kosslyn SM, Koenig O. Wet mind: The new cognitive neuroscience. Simon and Schuster; 1992.
  4. Piaget J, Cook M. The origins of intelligence in children. New York: International universities press; 1952.
  5. Bergen D. The role of pretend play in children’s cognitive development. Early Childhood Research & Practice. 2002;4(1):n1.
  6. Bengio Y, Courville A, Vincent P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence. 2013 Mar 7;35(8):1798-828.
  7. Zheng T, Mitchell JB, Dobson S. Revisiting the application of machine learning approaches in predicting aqueous solubility. ACS omega. 2024 Jul 31;9(32):35209-22.
  8. Mayer RE. Multimedia learning. In Psychology of learning and motivation 2002 Jan 1 (Vol. 41, pp. 85-139). Academic Press.
  9. Miller GA. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological review. 1956 Mar;63(2):81.
  10. Chase WG, Simon HA. Perception in chess. Cognitive psychology. 1973 Jan 1;4(1):55-81.
  11. Simon HA. How Big Is a Chunk? By combining data from several experiments, a basic human memory unit can be identified and measured. Science. 1974 Feb 8;183(4124):482-8.
  12. Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, Li Q, Shoemaker BA, Thiessen PA, Yu B, Zaslavsky L. PubChem 2025 update. Nucleic acids research. 2025 Jan 6;53(D1):D1516-25.
  13. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The protein data bank. Nucleic acids research. 2000 Jan 1;28(1):235-42.
  14. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of chemical information and computer sciences. 1988 Feb 1;28(1):31-6.
  15. Heller S, McNaught A, Stein S, Tchekhovskoi D, Pletnev I. InChI-the worldwide chemical structure identifier standard. Journal of cheminformatics. 2013 Jan 24;5(1):7.
  16. Todeschini R, Consonni V. Molecular descriptors for chemoinformatics: volume I: alphabetical listing/volume II: appendices, references. John Wiley & Sons; 2009 Oct 30.
  17. Moriwaki H, Tian YS, Kawashita N, Takagi T. Mordred: a molecular descriptor calculator. Journal of cheminformatics. 2018 Feb 6;10(1):4.
  18. Jaganathan K, Tayara H, Chong KT. An explainable supervised machine learning model for predicting respiratory toxicity of chemicals using optimal molecular descriptors. Pharmaceutics. 2022 Apr 11;14(4):832.
  19. RDKit: Open-source cheminformatics. https://www.rdkit.org
  20. Willighagen EL, Mayfield JW, Alvarsson J, Berg A, Carlsson L, Jeliazkova N, Kuhn S, Pluskal T, Rojas-Chertó M, Spjuth O, Torrance G. The Chemistry Development Kit (CDK) v2. 0: atom typing, depiction, molecular formulas, and substructure searching. Journal of cheminformatics. 2017 Jun 6;9(1):33.
  21. Sharma V, Goswami R, Madan AK. Eccentric connectivity index: A novel highly discriminating topological descriptor for structure–property and structure–activity studies. Journal of chemical information and computer sciences. 1997 Mar 24;37(2):273-82.
  22. Stanton DT, Jurs PC. Development and use of charged partial surface area structural descriptors in computer-assisted quantitative structure-property relationship studies. Analytical Chemistry. 1990 Nov 1;62(21):2323-9.
  23. Boldini D, Ballabio D, Consonni V, Todeschini R, Grisoni F, Sieber SA. Effectiveness of molecular fingerprints for exploring the chemical space of natural products. Journal of Cheminformatics. 2024 Mar 25;16(1):35.
  24. Durant JL, Leland BA, Henry DR, Nourse JG. Reoptimization of MDL keys for use in drug discovery. Journal of chemical information and computer sciences. 2002 Nov 25;42(6):1273-80.
  25. Kitchen DB, Decornez H, Furr JR, Bajorath J. Docking and scoring in virtual screening for drug discovery: methods and applications. Nature reviews Drug discovery. 2004 Nov 1;3(11):935-49.
  26. Friedrich NO, Meyder A, de Bruyn Kops C, Sommer K, Flachsenberg F, Rarey M, Kirchmair J. High-quality dataset of protein-bound ligand conformations and its application to benchmarking conformer ensemble generators. Journal of chemical information and modeling. 2017 Mar 27;57(3):529-39.
  27. Karplus M, McCammon JA. Molecular dynamics simulations of biomolecules. Nature structural biology. 2002 Sep 1;9(9):646-52.
  28. Zheng T, Rondina A, Micklem G, Lio P. Challenges and Guidelines in Deep Generative Protein Design: Four Case Studies.
  29. Duvenaud DK, Maclaurin D, Iparraguirre J, Bombarell R, Hirzel T, Aspuru-Guzik A, Adams RP. Convolutional networks on graphs for learning molecular fingerprints. Advances in neural information processing systems. 2015;28.
  30. Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE. Neural message passing for quantum chemistry. In International Conference on Machine Learning 2017 Jul 17 (pp. 1263-1272). PMLR.
  31. Wu Z, Pan S, Chen F, Long G, Zhang C, Yu PS. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems. 2020 Mar 24;32(1):4-24.
  32. Jin W, Coley C, Barzilay R, Jaakkola T. Predicting organic reaction outcomes with weisfeiler-lehman network. Advances in neural information processing systems. 2017;30.
  33. Shi C, Xu M, Guo H, Zhang M, Tang J. A graph to graphs framework for retrosynthesis prediction. In International Conference on Machine Learning 2020 Nov 21 (pp. 8818-8827). PMLR.
  34. Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Leswing K, Pande V. MoleculeNet: a benchmark for molecular machine learning. Chemical science. 2018;9(2):513-30.
  35. Lim J, Ryu S, Kim JW, Kim WY. Molecular generative model based on conditional variational autoencoder for de novo molecular design. Journal of cheminformatics. 2018 Jul 11;10(1):31.
  36. Maziarka Ł, Pocha A, Kaczmarczyk J, Rataj K, Danel T, Warchoł M. Mol-CycleGAN: a generative model for molecular optimization. Journal of Cheminformatics. 2020 Jan 8;12(1):2.
  37. Wang H, Lian D, Zhang Y, Qin L, Lin X. Gognn: Graph of graphs neural network for predicting structured entity interactions. arXiv preprint arXiv:2005.05537. 2020 May 12.
  38. Jiang D, Hsieh CY, Wu Z, Kang Y, Wang J, Wang E, Liao B, Shen C, Xu L, Wu J, Cao D. InteractionGraphNet: a novel and efficient deep graph representation learning framework for accurate protein–ligand interaction predictions. Journal of medicinal chemistry. 2021 Dec 8;64(24):18209-32.
  39. Asgari E, Mofrad MR. Continuous distributed representation of biological sequences for deep proteomics and genomics. PloS one. 2015 Nov 10;10(11):e0141287.
  40. Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, Guo D, Ott M, Zitnick CL, Ma J, Fergus R. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences. 2021 Apr 13;118(15):e2016239118.
  41. Rao R, Meier J, Sercu T, Ovchinnikov S, Rives A. Transformer protein language models are unsupervised structure learners. Biorxiv. 2020 Dec 15:2020-12.
  42. Rao RM, Liu J, Verkuil R, Meier J, Canny J, Abbeel P, Sercu T, Rives A. MSA transformer. In International Conference on Machine Learning 2021 Jul 1 (pp. 8844-8856). PMLR.
  43. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021 Aug 1;37(15):2112-20.
  44. Jeffrey HJ. Chaos game representation of gene structure. Nucleic acids research. 1990 Apr 25;18(8):2163-70.
  45. Illergård K, Ardell DH, Elofsson A. Structure is three to ten times more conserved than sequence—a study of structural response in protein cores. Proteins: Structure, Function, and Bioinformatics. 2009 Nov 15;77(3):499-508.
  46. Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. The EMBO journal. 1986 Apr 1;5(4):823-6.
  47. Roel-Touris J, Don CG, V. Honorato R, Rodrigues JP, Bonvin AM. Less is more: coarse-grained integrative modeling of large biomolecular assemblies with HADDOCK. Journal of chemical theory and computation. 2019 Sep 20;15(11):6358-67.
  48. Duong VT, Diessner EM, Grazioli G, Martin RW, Butts CT. Neural Upscaling from Residue-Level Protein Structure Networks to Atomistic Structures. Biomolecules. 2021 Nov 30;11(12):1788.
  49. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Aug 26;596(7873):583-9.
  50. Marrink SJ, Risselada HJ, Yefimov S, Tieleman DP, De Vries AH. The MARTINI force field: coarse grained model for biomolecular simulations. The journal of physical chemistry B. 2007 Jul 12;111(27):7812-24.
  51. Liwo A, Baranowski M, Czaplewski C, Gołaś E, He Y, Jagieła D, Krupa P, Maciejczyk M, Makowski M, Mozolewska MA, Niadzvedtski A. A unified coarse-grained model of biological macromolecules based on mean-field multipole–multipole interactions. Journal of molecular modeling. 2014 Aug;20(8):2306.
  52. Cao F, von Bülow S, Tesei G, Lindorff‐Larsen K. A coarse‐grained model for disordered and multi‐domain proteins. Protein Science. 2024 Nov;33(11):e5172.
  53. Van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CL, Söding J, Steinegger M. Fast and accurate protein structure search with Foldseek. Nature biotechnology. 2024 Feb;42(2):243-6.
  54. Wang R, Fang X, Lu Y, Yang CY, Wang S. The PDBbind database: methodologies and updates. Journal of medicinal chemistry. 2005 Jun 16;48(12):4111-9.
  55. Ingraham J, Garg V, Barzilay R, Jaakkola T. Generative models for graph-based protein design. Advances in neural information processing systems. 2019;32.
  56. Jing B, Eismann S, Suriana P, Townshend RJ, Dror R. Learning from protein structure with geometric vector perceptrons. arXiv preprint arXiv:2009.01411. 2020 Sep 3.
  57. Ochiai T, Inukai T, Akiyama M, Furui K, Ohue M, Matsumori N, Inuki S, Uesugi M, Sunazuka T, Kikuchi K, Kakeya H. Variational autoencoder-based chemical latent space for large molecular structures with 3D complexity. Communications Chemistry. 2023 Nov 16;6(1):249.
