Genomics
Oct 16, 2023
Nature volume 616, pages 543–552 (2023)
Intratumour heterogeneity (ITH) fuels lung cancer evolution, which leads to immune evasion and resistance to therapy1. Here, using paired whole-exome and RNA sequencing data, we investigate intratumour transcriptomic diversity in 354 non-small cell lung cancer tumours from 347 of the first 421 patients prospectively recruited into the TRACERx study2,3. Analyses of 947 tumour regions, representing both primary and metastatic disease, alongside 96 tumour-adjacent normal tissue samples implicate the transcriptome as a major source of phenotypic variation. Gene expression levels and ITH relate to patterns of positive and negative selection during tumour evolution. We observe frequent copy number-independent allele-specific expression that is linked to epigenomic dysfunction. Allele-specific expression can also result in genomic–transcriptomic parallel evolution, which converges on cancer gene disruption. We extract signatures of RNA single-base substitutions and link their aetiology to the activity of the RNA-editing enzymes ADAR and APOBEC3A, thereby revealing otherwise undetected ongoing APOBEC activity in tumours. Characterizing the transcriptomes of primary–metastatic tumour pairs, we combine multiple machine-learning approaches that leverage genomic and transcriptomic variables to link metastasis-seeding potential to the evolutionary context of mutations and increased proliferation within primary tumour regions. These results highlight the interplay between the genome and transcriptome in influencing ITH, lung cancer evolution and metastasis.
Understanding the causes of cell-to-cell variation in cancer is essential for understanding tumour evolution. Recent work has emphasized that much of this variation is transcriptomic, arising from diverse mechanisms that either relate to or are independent of genomic variation4. In mouse models of non-small cell lung cancer (NSCLC), transcriptomic plasticity has been shown to underlie ITH5. Whereas genomic variation reflects the remnants of past somatic events acquired during the evolutionary history of a tumour, transcriptomic variation may provide an accurate approximation of the phenotypic state of a tumour at the time of sampling1. To date, most studies of tumour evolution in humans have focused on the impact of genomic alterations on cancer. Transcriptomic studies that leverage bulk tumour RNA sequencing (RNA-seq) data tend to focus on the magnitude of gene expression in a single biopsy taken at a single time point. This approach may fail to capture poorly understood transcriptomic processes, including allele-specific expression (ASE) and RNA editing, that may exert important effects on cancer evolution1,4.
Here we leverage multiregion sequencing data from patients recruited into the TRACERx study2 to better understand the impact of multiple transcriptomic features, and their interplay with genomic and phenotypic diversity, on NSCLC evolution at different spatial and temporal scales.
We analysed RNA-seq and whole-exome sequencing data from 347 patients recruited into the prospective TRACERx study (the TRACERx 421 cohort). Samples from the cohort comprised 947 tumour regions from 354 NSCLC tumours (6 patients presented with multiple primary tumours at diagnosis), as well as 96 tumour-adjacent normal lung tissue regions (see the Consolidated Standards of Reporting Trials (CONSORT) diagram in the Supplementary Information)6,7. Of these patients, 344 had 886 primary tumour regions, 21 also had 29 metastatic lymph node (LN) regions sampled at the time of surgical resection of the primary tumour, and 24 patients had 30 metastatic tumour regions sampled at recurrence or progression. In total, 168 primary tumour regions and 4 LN regions from 64 patients in this cohort had previously been described in the TRACERx 100 cohort8. The cohort of paired primary–metastatic regions analysed here (and reported in a companion article6) comprises 61 metastatic regions, including LN regions and intrapulmonary metastases resected during surgery (hereafter termed primary LN/satellite lesions) and LN and metastatic regions at recurrence or progression.
[…] >1) was most readily observed within truncating mutations in genes in the highest expression tertile. Notably, within non-cancer genes, signals of negative selection (dN/dS ± 95% confidence intervals of <1) were identified within truncating mutations in genes within the highest expression tertile only (242 truncating mutations, relative to 3,932 observed truncating mutations, were estimated to have been lost through negative selection in these genes). Similar patterns were observed when dividing the data by different expression quantiles (Extended Data Fig. 1i).
[…] >8 reads (Methods). It was possible to evaluate ASE in a total of 16,378 different genes across all samples within the cohort at an average of 3,809 (s.d. ± 885) and 4,064 (s.d. ± 485) genes per tumour and normal tissue sample, respectively.
[…]>G substitutions, in keeping with ADAR-linked RNA editing, which deaminates adenosine to inosine, a nucleotide that is then read as guanosine by the translation machinery26 and sequencing platforms. Of these substitutions, 65% were present in the REDIportal database27 of known A>G editing events in human tissues. C>T substitutions28 represented 11.8% of the total substitutions detected. Of all the RNA substitutions detected, 67% were tumour specific (not present within a TRACERx panel of samples of normal tissue), and of these, 29.4% were shared between two or more tumours.
[…]>G transitions, whereas RNA-SBS2 consisted mainly of C>T transitions. RNA-SBS3 consisted mainly of A>G and T>C transitions, RNA-SBS4 of G>A transitions and RNA-SBS5 of G>T transversions. RNA-SBS1 and RNA-SBS3 were identified in most tumours (RNA-SBS1 in 98% and RNA-SBS3 in 85%).
RNA-SBS1 exhibited the lowest ITH and was detected within all regions of 87.4% of multiregion tumours.
[…]>G sites from REDIportal was highly similar to RNA-SBS1 (cosine similarity = 0.97), consistent with the A>G activity of ADAR underpinning RNA-SBS1.
[…]>T transitions at TpC sites (67%), a motif consistent with the RNA editing activity of APOBEC3A (ref. 30). In keeping with this, an unbiased analysis showed that RNA-SBS2 correlated more strongly with APOBEC3A expression than with any other gene in the transcriptome (Pearson's r = 0.73, FDR = 4.7 × 10⁻¹⁰⁸; Fig. 3d). A multiple linear regression considering all APOBEC enzymes revealed that the expression of APOBEC3A was the strongest independent predictor of RNA-SBS2 activity, although APOBEC3F was also significant (P = 2.6 × 10⁻⁵⁷ and P = 0.01 for APOBEC3A and APOBEC3F, respectively, linear mixed-effects model). Investigating the link between RNA-SBS2 and C>T enrichment at APOBEC3A-specific motifs30,31 further confirmed that RNA-SBS2 was strongly influenced by APOBEC3A expression (Extended Data Fig. 3c,d). Associations between gene expression or genomic features and the activity of the three remaining RNA-SBS signatures did not produce any obvious explanations for their aetiology.
[…] >40% of all genes with zero counts (estimated using the QoRTS output Genes_WithZeroCounts) were excluded. Additionally, samples with <20% of reads mapping to a genomic area covered by exactly one gene in a coding sequence genomic region (estimated using the QoRTS output ReadPairs_UniqueGene_CDS) were excluded. Next, RNA coverage was calculated for single nucleotide variants (SNVs) detected in matched whole-exome sequencing data per tumour region using SAMtools (v.1.9)61 mpileup. Mutation expression was used to further quality check the mapping of RNA reads.
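The two read-level exclusion rules above can be sketched as a simple filter. This is a minimal illustration, not the study's pipeline code: the function name, argument names and example numbers are hypothetical, and because the start of the first rule is truncated in this text, the sketch assumes that the 40% threshold applies per sample, like the 20% threshold that follows it.

```python
def passes_read_level_qc(genes_with_zero_counts, total_genes,
                         unique_gene_cds_read_pairs, total_read_pairs,
                         max_zero_fraction=0.40, min_unique_cds_fraction=0.20):
    """Return True if a sample survives both read-level exclusion rules.

    Assumption: inputs are raw counts, so fractions are derived here.
    """
    zero_fraction = genes_with_zero_counts / total_genes
    unique_cds_fraction = unique_gene_cds_read_pairs / total_read_pairs
    # Exclude samples with >40% zero-count genes.
    if zero_fraction > max_zero_fraction:
        return False
    # Exclude samples with <20% of reads in a unique-gene CDS region.
    if unique_cds_fraction < min_unique_cds_fraction:
        return False
    return True


# Hypothetical QC metrics for three samples (illustrative values only).
samples = {
    "region_A": dict(genes_with_zero_counts=6000, total_genes=20000,
                     unique_gene_cds_read_pairs=30, total_read_pairs=100),
    "region_B": dict(genes_with_zero_counts=9000, total_genes=20000,
                     unique_gene_cds_read_pairs=30, total_read_pairs=100),
    "region_C": dict(genes_with_zero_counts=5000, total_genes=20000,
                     unique_gene_cds_read_pairs=15, total_read_pairs=100),
}

kept = [name for name, metrics in samples.items()
        if passes_read_level_qc(**metrics)]
print(kept)
```

In this toy input, region_B fails the zero-count rule and region_C fails the coding-sequence rule, so only region_A is retained.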
The expression of SNVs exclusive to a given tumour region was used to detect potential instances of within-patient mislabelling of RNA–DNA matched tumour regions as well as to exclude normal adjacent lung tissue regions that expressed mutations present in paired tumour regions. A similar approach was applied to germline SNPs to further assess potential sample swaps based on patterns of CN variation from matched DNA per tumour region. Tumour regions in which fewer than 10 mutations, or fewer than 25% of the total mutation count, had evidence of expression, and/or less than 10% of SNPs had evidence of biallelic expression, were excluded. Finally, tumour regions clustering with tumour-adjacent normal tissue regions (see the section ‘UMAP clustering’) and tumour regions with a low purity were also excluded from further analyses. To ensure the reproducibility and portability of the above pipeline, all steps described were implemented through the Nextflow (v.20.07.1)62 pipeline manager.
[…] >0) were evaluated for an enrichment in driver mutations more commonly associated with LUADs.
[…] >0.5 as not significantly ASE. In the case of CN-dependent ASE, genes were required to show no significant ASE, irrespective of CN, to be categorized as not significantly ASE. Genes with no phasing information were not tested for ASE.
[…] >0.2). For each of these, we computed the number of CpGs that were significantly hypomethylated and hypermethylated in tumour samples compared to the normal samples, taking only loci that had coverage in all samples (min_normal = 10, min_tumour = 3). We then calculated the fraction of differentially methylated positions that were hypomethylated. Using a linear mixed effects model, with tumour identity as random effect, we then compared this metric to the percentage of genes showing evidence of CN-independent ASE per sample (separately for LUAD and LUSC).
[…]>T events at known RNA-editing APOBEC motifs.
APOBEC enzymes typically edit C>T variants at the fourth position of 4-nucleotide-long RNA hairpin loops. In particular, APOBEC3A favours the CAT[C>T] motif30,31.
[…]>T variant site, a Fisher's test was performed to test whether C>T changes within 20 upstream or downstream nucleotides occurred more than expected by chance at specific motifs (CAT[C>T]) in either strand.
[…] >0.2 CCF were considered as seeding for this analysis. In total, 516 primary tumour regions from 206 tumours for which seeding status could be established and for which all metrics tested could be measured (307 non-seeding regions, 209 seeding) were analysed. The following features were also considered for the classifier: […]
[…] >0.75, n = 11). We one-hot-encoded categorical features using get_dummies from Pandas (v1.3.3)106 and then split the data into training and test datasets (75/25 split). After encoding, we had a total of 60 features. We scaled the continuous features using MinMaxScaler from sklearn.preprocessing (v.0.0)107 and used SMOTENC from imblearn.over_sampling (v.0.8.0)105 to improve the balance of the dataset in terms of numbers of seeding and non-seeding regions. Finally, we used the sklearn (v.0.0)105 framework to perform additional variable selection before training using a LinearSVC model (penalty = "l1"), keeping those features with importance ≥0.015. This threshold removed 15 out of 60 features. Following this initial pre-processing, we generated different subsets of the dataset depending on the source of the input features, thus downstream processes within the pipeline operated on three datasets: (1) genomic only features, (2) transcriptomic only features, and (3) all features.
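The first pre-processing steps described above (one-hot encoding, a 75/25 split and min-max scaling) can be illustrated with a dependency-free sketch. It is not the study's code: the study used get_dummies from Pandas and MinMaxScaler from sklearn, whereas the helpers below are plain-Python stand-ins with hypothetical names, and the feature names and values are invented for illustration. Fitting the scaling ranges on the training split only is an assumption, and the SMOTENC oversampling and LinearSVC feature-selection steps are omitted.

```python
import random


def one_hot(rows, categorical):
    """Replace each categorical column with 0/1 indicator columns."""
    levels = {c: sorted({r[c] for r in rows}) for c in categorical}
    encoded = []
    for r in rows:
        out = {k: v for k, v in r.items() if k not in categorical}
        for c in categorical:
            for level in levels[c]:
                out[f"{c}_{level}"] = 1 if r[c] == level else 0
        encoded.append(out)
    return encoded


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle reproducibly, then hold out test_fraction of the rows."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_min_max(train_rows, columns):
    """Learn (min, max) per continuous column from the training rows."""
    return {c: (min(r[c] for r in train_rows), max(r[c] for r in train_rows))
            for c in columns}


def apply_min_max(rows, ranges):
    """Rescale columns to [0, 1] relative to the fitted ranges."""
    scaled = []
    for r in rows:
        out = dict(r)
        for c, (lo, hi) in ranges.items():
            out[c] = 0.0 if hi == lo else (r[c] - lo) / (hi - lo)
        scaled.append(out)
    return scaled


# Hypothetical tumour regions: feature names and values are illustrative.
regions = [
    {"histology": "LUAD", "proliferation": 0.20, "seeding": 0},
    {"histology": "LUSC", "proliferation": 0.90, "seeding": 1},
    {"histology": "LUAD", "proliferation": 0.35, "seeding": 0},
    {"histology": "LUSC", "proliferation": 0.75, "seeding": 1},
    {"histology": "LUAD", "proliferation": 0.50, "seeding": 0},
    {"histology": "LUAD", "proliferation": 0.60, "seeding": 1},
    {"histology": "LUSC", "proliferation": 0.10, "seeding": 0},
    {"histology": "LUSC", "proliferation": 0.80, "seeding": 1},
]

encoded = one_hot(regions, ["histology"])
train, test = split_train_test(encoded)
ranges = fit_min_max(train, ["proliferation"])
train_scaled = apply_min_max(train, ranges)
test_scaled = apply_min_max(test, ranges)
print(len(train_scaled), len(test_scaled))
```

Because the ranges are learned from the training rows, scaled test values can fall slightly outside [0, 1]; that is the expected behaviour when the scaler never sees the held-out data.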