4 File Formats
This section describes in detail how to obtain the input files and the formats of the input and output files.
4.1 Files of Core Workflow
Input files
File1 Protein Sequences in Fasta Format
- One of the input files of genetribe is a fasta file that stores protein sequences. You can obtain the fasta file from Ensembl or NCBI. The file name is generally name.pep.all.fa.gz.
- Decompress it with the command gunzip name.pep.all.fa.gz and rename the result to name.fa. genetribe extracts the longest transcript per gene as the representative sequence.
- Repeat the above steps for the other genome.
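The decompress-and-rename step can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name decompress_to is hypothetical and not part of genetribe. Unlike gunzip, it leaves the original .gz file in place.

```python
import gzip
import shutil
from pathlib import Path


def decompress_to(src_gz, dest):
    """Decompress a gzip file to dest and return dest as a Path.

    Hypothetical helper: the source .gz file is kept, whereas gunzip
    would remove it.
    """
    with gzip.open(src_gz, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return Path(dest)
```

For example, decompress_to("name.pep.all.fa.gz", "name.fa") writes name.fa next to the downloaded archive.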
example:
>AT2G27490.4 pep chromosome:TAIR10:2:11748087:11749153:-1 gene:AT2G27490 transcript:AT2G27490.4 gene_biotype:protein_coding
MRIVGLTGGIASGKSTVSNLFKASGIPVVDADVVARDVLKKGSGGWKRVVAAFGEEILLP
SGEVDRPKLGQIVFSSDSKRQLLNKLMAPYISSGIFWEILKQWASGAKVIVVDIPLLFEV
KMDKWTKPIVVVWVSQETQLKRLMERDGLSEEDARNRVMAQMPLDSKRSKADVVIDNNGS
LDDLHQQFEKVLIEIRRPLTWIEFWRSRQGAFSVLGSVILGLSVCKQLKIGS
Note:
In this example, genetribe uses the header (i.e. >AT2G27490.4 pep chromosome:TAIR10:2:11748087:11749153:-1 gene:AT2G27490 transcript:AT2G27490.4) to distinguish sequences and to extract the longest transcript per gene. genetribe takes the gene ID from the field that starts with gene: (i.e. AT2G27490). If this field does not exist, genetribe splits the transcript ID (i.e. AT2G27490.4) at the separator . to obtain the gene ID. The separator distinguishes the gene ID from the transcript ID and is specified with the -s parameter of genetribe core.
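The ID rule described in the note can be illustrated with a short sketch. This is not genetribe's own code: the function names are hypothetical, and splitting at the last occurrence of the separator is an assumption, since the text does not say which occurrence is used.

```python
import re


def gene_id_from_header(header, sep="."):
    """Return the gene ID for one FASTA header line."""
    # Preferred source: the "gene:" field, as in Ensembl-style headers.
    match = re.search(r"\bgene:(\S+)", header)
    if match:
        return match.group(1)
    # Fallback: strip the transcript suffix from the first token.
    # Assumption: the suffix follows the LAST occurrence of sep.
    transcript_id = header.lstrip(">").split()[0]
    return transcript_id.rsplit(sep, 1)[0]


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_gene(lines, sep="."):
    """Map each gene ID to the (header, sequence) of its longest protein."""
    best = {}
    for header, seq in read_fasta(lines):
        gene = gene_id_from_header(header, sep)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best
```

Passing an open file handle of name.fa to longest_per_gene returns one representative record per gene; the sep argument plays the role of the -s parameter.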
File2 Annotation File in Bed Format
- One of the input files of genetribe is a bed file that stores annotation information. First, obtain the gff file from Ensembl or NCBI. The file name is generally name.gff3.gz.
- Decompress it with the command gunzip name.gff3.gz to obtain name.gff3.
- Convert the gff file to a bed file that contains only gene information (i.e. rows whose eighth column in the gff2bed output is "gene"). Two possible ways are shown here (there are others):
MCscan + sed:
python -m jcvi.formats.gff bed --type=gene --key=ID name.gff3 -o name.bed
sed -i 's/gene://g' name.bed
gff2bed + gawk + sed:
gff2bed < name.gff3 | gawk -vOFS="\t" '{if($8=="gene")print $1,$2,$3,$4,$5,$6}' | sed 's/gene://g' > name.bed
example:
1 3630 5899 AT1G01010 0 +
1 6787 9130 AT1G01020 0 -
1 11648 13714 AT1G01030 0 -
1 23120 31227 AT1G01040 0 +
1 31169 33171 AT1G01050 0 -
1 33364 37871 AT1G01060 0 -
1 38443 41017 AT1G01070 0 -
1 44969 47059 AT1G01080 0 -
1 47233 49304 AT1G01090 0 -
Each column is chromosome, start location, end location, gene ID, score, and strand.
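For readers without jcvi or gff2bed installed, the conversion can be sketched in plain Python. This is an illustrative stand-in, not a replacement for either tool: the function name is hypothetical, the score column is written as 0 to match the example above, and only a leading gene: prefix is stripped from the ID.

```python
def gff3_genes_to_bed(gff_lines):
    """Yield six-column BED rows for the gene features of a GFF3 file."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # keep only well-formed rows of type "gene"
        chrom, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        # Column 9 holds key=value attributes separated by ";".
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        gene_id = attrs.get("ID", "")
        if gene_id.startswith("gene:"):
            gene_id = gene_id[len("gene:"):]
        # GFF3 coordinates are 1-based inclusive; BED is 0-based half-open,
        # so only the start shifts by one.
        yield "\t".join([chrom, str(start - 1), str(end), gene_id, "0", strand])
```

A GFF3 gene row with start 3631 and end 5899 therefore becomes a BED row with start 3630 and end 5899.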
Note:
For some annotations, the fourth column of the bed file contains prefix strings such as gene: or ID:, which should be deleted. Numbers other than the chromosome number should not appear in the chromosome name (S1.chr1 → chr1).
File3 Chromosome Group Information
This file specifies the chromosome naming pattern of each subgenome of the species genome. Name it name.chrlist.
example:
Triticum aestivum IWGSC RefSeqv1 (chr1A chr2A chr1B chr2B chr1D chr2D)
cat Triticum_aestivum.chrlist
chrNA,chrNB,chrND
Triticum urartu Tu2.0 (Tu1 Tu2)
cat Triticum_urartu.chrlist
TuN
Oryza sativa Japonica IRGSP v1 (1 2 ...)
cat Oryza_sativa_Japonica.chrlist
N
Triticum aestivum TGACv1 (TGACv1_scaffold_000099_1AL TGACv1_scaffold_732229_3B)
NA,NB,ND
Note:
The chromosome name in .chrlist must be contained in the chromosome name in the bed file.
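One way to read the examples above is that N stands for the chromosome number, so that chrNA covers chr1A, chr2A, and so on. The sketch below checks a chromosome name against a .chrlist line under that assumption. Both the interpretation of N and the function names are assumptions made for illustration; genetribe's actual matching logic may differ.

```python
import re


def pattern_to_regex(pattern):
    """Turn a .chrlist pattern such as 'chrNA' into a regex.

    Assumption: every 'N' in the pattern is a placeholder for one or
    more digits; all other characters are matched literally.
    """
    parts = [r"\d+" if ch == "N" else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts))


def chromosome_group(chrom, patterns):
    """Return the first pattern contained in chrom, or None."""
    for pattern in patterns:
        if pattern_to_regex(pattern).search(chrom):
            return pattern
    return None


# A .chrlist line is comma-separated, e.g. the IWGSC RefSeqv1 example.
wheat_patterns = "chrNA,chrNB,chrND".split(",")
```

With these definitions, chromosome_group("chr1A", wheat_patterns) gives "chrNA", while a name that matches no pattern gives None, which is a quick way to spot bed chromosome names that the .chrlist does not cover.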
File4 Gene Annotation Confidence (Optional)
This file is needed only when the -c parameter of genetribe core is specified.
- The confidence (HC / high-confidence, LC / low-confidence) can be found in the gff file or in the names of the fasta files (name.HC.fasta, name.LC.fasta) downloaded from the database.
- Extract two columns of information, gene ID and confidence, and name the file name.confidence.
example:
TraesCS1A02G000100 HC
TraesCS1A02G000200 HC
TraesCS1A02G000300 HC
TraesCS1A02G000100LC LC
TraesCS1A02G000200LC LC
TraesCS1A02G000300LC LC
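Before passing name.confidence to genetribe, it can help to check that every row has exactly two columns and a valid label. The following is a minimal sketch; the function name is hypothetical, and accepting any whitespace as the column separator is an assumption.

```python
def read_confidence(lines):
    """Parse 'gene_id confidence' rows into a dict, validating each row."""
    confidence = {}
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2 or fields[1] not in ("HC", "LC"):
            raise ValueError(
                f"line {number}: expected 'gene_id HC|LC', got {line!r}"
            )
        confidence[fields[0]] = fields[1]
    return confidence
```

Calling read_confidence on an open file handle returns a mapping from gene ID to HC or LC, and raises ValueError at the first malformed row.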
Output Files
Output files | Contents |
---|---|
.one2many | All homologous gene pairs. Three columns: the gene in genome 1, the gene in genome 2, and the homolog match score (HMS) |
.one2one | RBH+SBH. Four columns: the gene in genome 1, the gene in genome 2, the homologous type (RBH/SBH), and the chromosome group of the gene in genome 2 |
.RBH | Gene pairs belonging to the Reciprocal Best Hits |
.SBH | Gene pairs belonging to the Single-side Best Hits, where no RBH is found but a best-matching gene is found |
.singleton | Genes with no homologous genes |
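The four columns of a .one2one file can be loaded into named records for downstream counting. This is a sketch under the assumption that the columns are whitespace-separated and appear in the order listed in the table; the names Pair, read_one2one and count_types are hypothetical.

```python
from collections import Counter, namedtuple

# Column order follows the .one2one description in the table above.
Pair = namedtuple("Pair", "gene1 gene2 hom_type chrom_group")


def read_one2one(lines):
    """Return a list of Pair records, skipping rows without four columns."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) == 4:
            pairs.append(Pair(*fields))
    return pairs


def count_types(pairs):
    """Count how many pairs are RBH and how many are SBH."""
    return Counter(pair.hom_type for pair in pairs)
```

Applied to an open .one2one file, count_types(read_one2one(handle)) gives the number of RBH and SBH pairs.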
4.2 Files of Sameassembly
Homolog inference for the same assembly
Input files
File1 Annotation File in Bed Format
The file is the same as the bed file in core.
File2 Length of Genes
We can obtain it from the bed file using the following command (there are other ways as well):
cat name.bed | gawk -vOFS="\t" '{print $4,$3-$2}' > name.genelength
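The same computation can be written in Python, which makes the column arithmetic explicit. This is a sketch equivalent in intent to the gawk command above; the function name is hypothetical.

```python
def gene_lengths(bed_lines):
    """Yield (gene_id, length) pairs from six-column BED rows."""
    for line in bed_lines:
        cols = line.split()
        if len(cols) < 4:
            continue  # skip blank or truncated rows
        # Column 4 is the gene ID; length is end minus start, which is
        # the feature length for 0-based half-open BED coordinates.
        yield cols[3], int(cols[2]) - int(cols[1])
```

Writing each pair as a tab-separated line produces the content of name.genelength.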
Output Files
Output files | Contents |
---|---|
.one2many | All gene pairs with overlapping positions. Three columns: the gene in genome 1, the gene in genome 2, and the homolog match score (i.e. overlapping ratio) |
.one2one | RBH+SBH. Three columns: the gene in genome 1, the gene in genome 2, and the homologous type (RBH/SBH) |