Global analysis of publicly available safety data for 9,801 substances registered under REACH from 2008-2014

Thomas Luechtefeld, Alexandra Maertens, Daniel P. Russo, Costanza Rovida, Hao Zhu, Thomas Hartung

Research output: Contribution to journal, Article

Abstract

The European Chemicals Agency (ECHA) warehouses the largest public dataset of in vivo and in vitro toxicity tests. In December 2014 these data were converted into a structured, machine-readable and searchable database using natural language processing. The database contains data for 9,801 unique substances, 3,609 unique study descriptions and 816,048 study documents, which allows toxicological data to be explored on a scale far larger than previously possible. Substances were mapped to PubChem, and substance similarity analysis was used to determine how substances cluster with respect to hazards. Similarity was measured using PubChem 2D conformational substructure fingerprints, which were compared via the Tanimoto metric. Following K-Core filtration, the Blondel et al. (2008) module recognition algorithm was used to identify chemical modules, that is, clusters of substances in use within the chemical universe. The Globally Harmonized System of Classification and Labelling (GHS) provides a valuable information source for hazard analysis. The most prevalent hazards are H317 "May cause an allergic skin reaction" (20% of substances positive) and H318 "Causes serious eye damage" (17% of substances positive). Such prevalences, obtained here for all hazards, are key to the design of integrated testing strategies. The data also allowed animal use to be estimated. The database covers about 20% of the substances in the high-throughput biological assay database Tox21 (1,737 substances) and has a 917-substance overlap with the Comparative Toxicogenomics Database (~7% of CTD). The biological data available in these datasets, combined with ECHA in vivo endpoints, have enormous modeling potential. A case is made that REACH should systematically open regulatory data for research purposes.
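The similarity and filtration steps named in the abstract can be illustrated with a small sketch. This is not the authors' code: the fingerprints below are toy sets of "on" bit positions rather than real PubChem fingerprints, and the similarity threshold (0.5) and core order (k = 2) are hypothetical values chosen only so that the example is easy to follow. The function names `tanimoto`, `similarity_graph` and `k_core` are likewise made up for this illustration.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_graph(fingerprints: dict, threshold: float) -> dict:
    """Link every pair of substances whose Tanimoto similarity reaches the threshold."""
    graph = {name: set() for name in fingerprints}
    for u, v in combinations(fingerprints, 2):
        if tanimoto(fingerprints[u], fingerprints[v]) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def k_core(graph: dict, k: int) -> dict:
    """Repeatedly drop nodes with fewer than k neighbours until none remain."""
    core = {node: set(nbrs) for node, nbrs in graph.items()}
    while True:
        weak = [node for node, nbrs in core.items() if len(nbrs) < k]
        if not weak:
            return core
        for node in weak:
            for nbr in core.pop(node):
                if nbr in core:
                    core[nbr].discard(node)


# Toy fingerprints (assumed data, not taken from PubChem or ECHA).
fingerprints = {
    "A": frozenset({1, 2, 3, 4}),
    "B": frozenset({1, 2, 3, 5}),
    "C": frozenset({1, 2, 3, 6}),
    "D": frozenset({1, 7, 8, 9}),
    "E": frozenset({1, 2, 3, 4, 10, 11, 12}),
}

graph = similarity_graph(fingerprints, threshold=0.5)
core = k_core(graph, k=2)
print(sorted(core))
```

In this toy graph, A, B and C are mutually similar, E is linked only to A, and D is linked to nothing, so the 2-core keeps only A, B and C. A module recognition step in the spirit of Blondel et al. (2008) would then be run on the filtered graph, for example with a library routine such as NetworkX's `louvain_communities`; whether the authors used that or another implementation is not stated in the abstract.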

Original language: English (US)
Pages (from-to): 95-109
Number of pages: 15
Journal: Altex
Volume: 33
Issue number: 2
DOI: 10.14573/altex.1510052
State: Published - 2016

Keywords

  • Animal testing
  • Chemical toxicity
  • Computational toxicology
  • Database
  • In silico

ASJC Scopus subject areas

  • Medical Laboratory Technology
  • Pharmacology

Cite this

Global analysis of publicly available safety data for 9,801 substances registered under REACH from 2008-2014. / Luechtefeld, Thomas; Maertens, Alexandra; Russo, Daniel P.; Rovida, Costanza; Zhu, Hao; Hartung, Thomas.

In: Altex, Vol. 33, No. 2, 2016, p. 95-109.

@article{0b17312347cc452f8c96467682b1ff3e,
title = "Global analysis of publicly available safety data for 9,801 substances registered under REACH from 2008-2014",
keywords = "Animal testing, Chemical toxicity, Computational toxicology, Database, In silico",
author = "Thomas Luechtefeld and Alexandra Maertens and Russo, {Daniel P.} and Costanza Rovida and Hao Zhu and Thomas Hartung",
year = "2016",
doi = "10.14573/altex.1510052",
language = "English (US)",
volume = "33",
pages = "95--109",
journal = "ALTEX : Alternativen zu Tierexperimenten",
issn = "1868-596X",
publisher = "Elsevier GmbH",
number = "2",

}
