Recent Changes

Monday, December 8

  1. 8:45 am

Sunday, October 13

  1. 2:20 pm

Sunday, April 28

  1. page Workflow for update to ChEMBL_15 edited ... Indexes: "compound_pkey" PRIMARY KEY, btree ("CID") So we add a column …
    ...
    Indexes:
    "compound_pkey" PRIMARY KEY, btree ("CID")
    So we add a column to the chembl_15_compound_records table, which contains compound
    ...
    within ChEMBL_15.
    chord=> alter table chembl_14_compound_records add column pubchem_cid integer ;
    There are various types of IDs for various sources:
     src_id |                     src_description                     |  src_short_name  | cpd_count
    --------+---------------------------------------------------------+------------------+-----------
          1 | Scientific Literature                                   | LITERATURE       |    897374
          2 | GSK Malaria Screening                                   | GSK_TCMDC        |     13533
          3 | Novartis Malaria Screening                              | NOVARTIS         |     10119
          4 | St Jude Malaria Screening                               | ST_JUDE          |      1524
          5 | Sanger Institute Genomics of Drug Sensitivity in Cancer | SANGER           |        17
          7 | PubChem BioAssays                                       | PUBCHEM_BIOASSAY |    482196
          8 | Clinical Candidates                                     | CANDIDATES       |       676
          9 | Orange Book                                             | ORANGE_BOOK      |      2003
         10 | Guide to Receptors and Channels                         | GRAC             |       570
         11 | Open TG-GATEs                                           | TG_GATES         |       524
         12 | Manually Added Drugs                                    | DRUGS            |       114
         13 | USP Dictionary of USAN and International Drug Names     | USP/USAN         |     10568
         14 | Drugs for Neglected Diseases Initiative (DNDi)          | DNDI             |      6820
         15 | DrugMatrix in vitro pharmacology assays                 | DRUGMATRIX       |       871
         16 | GSK Published Kinase Inhibitor Set                      | GSK_PKIS         |       734
         17 | MMV Malaria Box                                         | MMV_MBOX         |       799
         18 | TP-search Transporter Database                          | TP_TRANSPORTER   |      4383
         19 | Harvard Malaria Screening                               | HARVARD          |        37
         20 | WHO-TDR Malaria Screening                               | WHO_TDR          |       740
         21 | Deposited Supplementary Data                            | SUPPLEMENTARY    |        54
         22 | GSK Tuberculosis Screening                              | GSK_TB           |       776
    (21 rows)

    10:55 am
  2. page Workflow for update to ChEMBL_15 edited ... chord=> \d+ public.chembl_15_compound_structures Table "public.chembl_15_compound_str…
    ...
    chord=> \d+ public.chembl_15_compound_structures
    Table "public.chembl_15_compound_structures"
    ...
    | Description
    --------------------+-------------------------+-----------+-------------
    ...
    null |
    molfile | text | |
    standard_inchi | character
    ...
    | |
    standard_inchi_key | character
    ...
    null |
    canonical_smiles | character
    ...
    | |
    molformula | character
    ...
    | |
    Indexes:
    "chembl_15_compound_structures_pk" UNIQUE, btree (molregno)
    ...
    Indexes:
    "compound_pkey" PRIMARY KEY, btree ("CID")
    ...
    to the chembl_15_compound_records table, which
    ...
    IDs within ChEMBL_15. Note that chembl_14_compound_structures.canonical_smiles is a stereoisomeric canonical SMILES. We must be careful to distinguish stereoisomeric canonical SMILES from non-stereo canonical SMILES. So we try this update command:
    chord=> alter table chembl_14_compound_records add column pubchem_cid integer ;
    chord=> UPDATE
    chembl_14_compound_records
    SET
    pubchem_cid=c2b2r_compound."CID"
    FROM
    c2b2r_compound,
    chembl_14_compound_structures
    WHERE
    gnova.cansmiles(chembl_14_compound_structures.canonical_smiles)=gnova.cansmiles(c2b2r_compound.openeye_can_smiles)
    AND chembl_14_compound_structures.molregno=chembl_14_compound_records.molregno
    ;
    However, although this could work in principle, it could be very slow, and in practice we run into a fatal Chord error:
    ERROR: can't make smiles of type 64 from smiles 'CC(C)c1ccc2C=[N+]3N=C(N)[S+]4[Pt]56[S+](C(=N[N+]5=Cc7ccc(c[c-]67)C(C)C)N)[Pt]89[S+](C(=N[N+]8=Cc%10ccc(c[c-]9%10)C(C)C)N)[Pt]%11%12[S+](C(=N[N+]%11=Cc%13ccc(c[c-]%12%13)C(C)C)N)[Pt]34[c-]2c1'.
    So it looks like we need to assign CIDs outside of the database and import them. (This was going to be true for some compounds anyway, since there can be new, unknown compounds in chembl_14 or any update.) This can be done with the PubChem PUG REST API. First create a file with all chembl_14 molregnos and SMILES.
    \pset format unaligned
    \pset footer off
    \pset fieldsep ' '
    \o chembl_14_dump_cpds.smi
    --
    SELECT
    canonical_smiles,
    molregno
    FROM
    chembl_14_compound_structures
    ORDER BY molregno
    ;
    Then we can access PubChem as follows:
    pug_rest_mols2ids.py \
    --v \
    --firstonly \
    --i data/chembl_14_dump_cpds.smi \
    --o data/chembl_14_dump_cpds_wCIDs.smi
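The script pug_rest_mols2ids.py is not reproduced on this page. As a rough sketch of the kind of lookup involved, assuming the PUG REST structure-to-CID endpoint is queried by POST (which avoids URL problems with characters such as "/" and "#" in SMILES), something like the following could be used. The function names build_request, parse_cids and lookup_first_cid are hypothetical, and treating a returned CID of 0 as "no match" is an assumption.

```python
import urllib.parse
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_request(structure, namespace="smiles"):
    """Return (url, body) for a POST that asks PubChem for matching CIDs.

    namespace is "smiles" or "inchi"; the structure goes in the form body.
    """
    if namespace not in ("smiles", "inchi"):
        raise ValueError("namespace must be 'smiles' or 'inchi'")
    url = f"{PUG_REST}/compound/{namespace}/cids/TXT"
    body = urllib.parse.urlencode({namespace: structure}).encode("ascii")
    return url, body


def parse_cids(text):
    """Parse a TXT response (one CID per line); a CID of 0 is assumed to mean no match."""
    cids = []
    for line in text.splitlines():
        line = line.strip()
        if line.isdigit() and int(line) > 0:
            cids.append(int(line))
    return cids


def lookup_first_cid(structure, namespace="smiles", timeout=30):
    """Query PubChem over the network; return the first CID or None."""
    url, body = build_request(structure, namespace)
    request = urllib.request.Request(url, data=body)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        cids = parse_cids(response.read().decode("utf-8"))
    return cids[0] if cids else None
```

Returning only the first CID mirrors the --firstonly option above. One request per molecule is simple but pays the full round-trip cost for every structure.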
    Unfortunately the PubChem lookup is slow too, possibly because PubChem must canonicalize each SMILES: 426000 mols took 15 days 1 hr. Using InChI should be faster, since an InChI is already unique, but in practice it was not any faster: 328000 mols took 12 days 21 hr.
    pug_rest_mols2ids.py \
    --v \
    --inchi \
    --firstonly \
    --i data/chembl_14_dump_cpds.inchi \
    --o data/chembl_14_dump_cpds_wCIDs.inchi
    This job takes a long time: about 43 days. Finally we have the full output:
    cheminfov$ wc -l chembl_14_dump_cpds_from-inchi_wCIDs.inchi
    1211654 chembl_14_dump_cpds_from-inchi_wCIDs.inchi
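To put the run times quoted above in perspective, the short calculation below converts them to lookups per hour. It assumes the 43-day job covered all 1211654 output lines; the function name mols_per_hour is only illustrative.

```python
def mols_per_hour(n_mols, days, hours=0):
    """Average throughput for a job that took the given wall-clock time."""
    return n_mols / (days * 24 + hours)


runs = {
    "SMILES, partial run": (426000, 15, 1),
    "InChI, partial run": (328000, 12, 21),
    "InChI, full run": (1211654, 43, 0),
}

for label, (n_mols, days, hours) in runs.items():
    rate = mols_per_hour(n_mols, days, hours)
    print(f"{label}: {rate:.0f} mols/hr, {3600 / rate:.1f} s per mol")
```

All three figures come out between roughly 1000 and 1200 mols per hour, about three seconds per molecule, which is consistent with the observation that InChI input was no faster than SMILES input.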
    After this job runs and all CIDs are obtained, they must be used to link the ChEMBL_14 compounds. We create a new table to link CIDs and ChEMBL MOLREGNOs. This will link
    chembl_14_compound_structures and chembl_14_compound_records to public.c2b2r_compound.
    CREATE TABLE chembl_14_compound_molregno2cid (
    molregno INTEGER,
    cid INTEGER
    );
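As an illustration of how such a linking table connects the ChEMBL tables to public.c2b2r_compound, here is a small self-contained sketch. It uses an in-memory SQLite database as a stand-in for chord, and the rows and the src_id column are made up for the example; only the table names, molregno, cid, "CID" and openeye_can_smiles come from this page.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL database "chord".
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chembl_14_compound_records (molregno INTEGER, src_id INTEGER);
CREATE TABLE chembl_14_compound_molregno2cid (molregno INTEGER, cid INTEGER);
CREATE TABLE c2b2r_compound ("CID" INTEGER PRIMARY KEY, openeye_can_smiles TEXT);

-- Illustrative rows only: molregno 1 has a CID, molregno 2 does not.
INSERT INTO chembl_14_compound_records VALUES (1, 1), (2, 7);
INSERT INTO chembl_14_compound_molregno2cid VALUES (1, 702);
INSERT INTO c2b2r_compound VALUES (702, 'CCO');
""")

# Follow molregno -> cid -> "CID" to reach the Chem2Bio2RDF compound row.
linked = conn.execute("""
SELECT r.molregno, m.cid, c.openeye_can_smiles
FROM chembl_14_compound_records AS r
JOIN chembl_14_compound_molregno2cid AS m ON m.molregno = r.molregno
JOIN c2b2r_compound AS c ON c."CID" = m.cid
ORDER BY r.molregno
""").fetchall()

print(linked)
```

Records without an entry in the linking table (molregno 2 here) simply drop out of the inner join, which is one way to spot compounds that still lack a CID.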
    The table chembl_14_compound_records already has a column "pubchem_cid", but unfortunately it is populated for very few rows:
    chord=> select count(pubchem_cid) from chembl_14_compound_records ;
    count
    -------
    144
    (1 row)
    chord=> select count(molregno) from chembl_14_compound_records ;
    count
    ---------
    1376469
    (1 row)
    We could simply use this column to store the CIDs, or we can update it from chembl_14_compound_molregno2cid.
    With the CIDs obtained, we generate SQL to update the database:
    ./chembl_CID2sql.py \
    --i data/chembl_14_dump_cpds_wCIDs.inchi \
    --o data/chembl_14_dump_cpds_wCIDs.sql
    chembl_CID2sql.py: lines in: 1211654 ; converted to sql: 1163276
    chembl_CID2sql.py: errors: 2
    chembl_CID2sql.py: missing CIDs: 48376
    psql -U ***** chord < data/chembl_14_dump_cpds_wCIDs.sql
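The script chembl_CID2sql.py is not shown here. The following is only a sketch of how such a conversion might look, under the assumption that each input line holds space-separated fields "structure molregno cid", with the cid absent (or 0) when PubChem returned no match. The function name cid_lines_to_sql and the choice to emit INSERT statements into chembl_14_compound_molregno2cid are also assumptions.

```python
def cid_lines_to_sql(lines, table="chembl_14_compound_molregno2cid"):
    """Turn 'structure molregno [cid]' lines into INSERT statements.

    Returns (statements, n_missing, n_errors). Blank lines are skipped.
    """
    statements = []
    n_missing = 0
    n_errors = 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        # A usable line needs at least a structure and an integer molregno.
        if len(fields) < 2 or not fields[1].isdigit():
            n_errors += 1
            continue
        molregno = int(fields[1])
        if len(fields) < 3 or fields[2] == "0":
            n_missing += 1  # no CID was found for this molecule
            continue
        if not fields[2].isdigit():
            n_errors += 1
            continue
        cid = int(fields[2])
        # Both values are integers, so plain formatting is safe here.
        statements.append(
            f"INSERT INTO {table} (molregno, cid) VALUES ({molregno}, {cid});"
        )
    return statements, n_missing, n_errors


sample = [
    "InChI=1S/CH4/h1H4 1 297",
    "InChI=1S/H2O/h1H2 2",
    "garbage",
]
statements, n_missing, n_errors = cid_lines_to_sql(sample)
print(f"converted to sql: {len(statements)} ; errors: {n_errors} ; missing CIDs: {n_missing}")
```

The three counters correspond to the summary lines printed above, where converted, error and missing lines together account for all 1211654 input lines.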
    ...
    Now new compounds not in public.c2b2r_compound must be added. The CID field is NOT NULL, and in this sense Chem2Bio2RDF is somewhat PubChem-centric. So for each new CID a row is added, regardless of whether the InChI or cansmiles is new. For an existing CID we can either assume the InChI and cansmiles are correct or take the opportunity to validate them (which could also be done separately).
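A minimal sketch of that last step, again using in-memory SQLite as a stand-in for chord and made-up rows: one row is added per CID that is not yet in c2b2r_compound. Copying the ChEMBL canonical_smiles into openeye_can_smiles is only a placeholder assumption for the demo; in the real workflow the OpenEye canonical SMILES would have to be generated separately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database
conn.executescript("""
CREATE TABLE c2b2r_compound ("CID" INTEGER PRIMARY KEY NOT NULL, openeye_can_smiles TEXT);
CREATE TABLE chembl_14_compound_structures (molregno INTEGER PRIMARY KEY, canonical_smiles TEXT);
CREATE TABLE chembl_14_compound_molregno2cid (molregno INTEGER, cid INTEGER);

-- Illustrative rows: CID 702 already exists, CID 241 is new and is shared
-- by two molregnos.
INSERT INTO c2b2r_compound VALUES (702, 'CCO');
INSERT INTO chembl_14_compound_structures VALUES (1, 'CCO'), (2, 'c1ccccc1'), (3, 'c1ccccc1');
INSERT INTO chembl_14_compound_molregno2cid VALUES (1, 702), (2, 241), (3, 241);
""")

# GROUP BY collapses several molregnos with the same CID into one new row,
# so the primary key on "CID" is not violated.
conn.execute("""
INSERT INTO c2b2r_compound ("CID", openeye_can_smiles)
SELECT m.cid, MIN(s.canonical_smiles)
FROM chembl_14_compound_molregno2cid AS m
JOIN chembl_14_compound_structures AS s ON s.molregno = m.molregno
WHERE m.cid NOT IN (SELECT "CID" FROM c2b2r_compound)
GROUP BY m.cid
""")

cids_after = conn.execute('SELECT "CID" FROM c2b2r_compound ORDER BY "CID"').fetchall()
print(cids_after)
```

Existing CIDs are left untouched here, which corresponds to the "assume the InChI and cansmiles are correct" option; validation would be a separate pass.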

    9:38 am
