4 File Formats

This section describes in detail how to obtain the input files and the formats of the input and output files.

4.1 Files of Core Workflow

Input files

File1 Protein Sequences in Fasta Format

  1. One of the input files of genetribe is a fasta file that stores protein sequences. You can obtain it from Ensembl or NCBI; the file name is generally name.pep.all.fa.gz.
  2. Decompress the file with the command gunzip name.pep.all.fa.gz and rename it to name.fa. genetribe extracts the longest transcript of each gene as the representative sequence.
  3. Repeat the above steps for the other genome.
    example:
    >AT2G27490.4 pep chromosome:TAIR10:2:11748087:11749153:-1 gene:AT2G27490 transcript:AT2G27490.4 gene_biotype:protein_coding
    MRIVGLTGGIASGKSTVSNLFKASGIPVVDADVVARDVLKKGSGGWKRVVAAFGEEILLP
    SGEVDRPKLGQIVFSSDSKRQLLNKLMAPYISSGIFWEILKQWASGAKVIVVDIPLLFEV
    KMDKWTKPIVVVWVSQETQLKRLMERDGLSEEDARNRVMAQMPLDSKRSKADVVIDNNGS
    LDDLHQQFEKVLIEIRRPLTWIEFWRSRQGAFSVLGSVILGLSVCKQLKIGS
    
    Note:
    In this example, genetribe uses the headers (e.g. >AT2G27490.4 pep chromosome:TAIR10:2:11748087:11749153:-1 gene:AT2G27490 transcript:AT2G27490.4) to distinguish sequences and to extract the longest transcript of each gene. genetribe takes the gene ID from the field that starts with gene: (here AT2G27490). If this field does not exist, genetribe splits the transcript ID (here AT2G27490.4) at the separator . to obtain the gene ID. The separator distinguishes the gene ID from the transcript ID and is specified with the -s parameter of genetribe core.
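The header rule described in the note can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not genetribe's own implementation: the function names gene_id_from_header and longest_per_gene are hypothetical, cutting the transcript ID at the last separator is an assumption, and so is keeping the first transcript when two have the same length.

```python
import re

def gene_id_from_header(header, sep="."):
    """Return the gene ID for one FASTA header line (hypothetical helper)."""
    # Prefer an explicit "gene:" field; "gene_biotype:" does not match this pattern.
    match = re.search(r"\bgene:(\S+)", header)
    if match:
        return match.group(1)
    # Otherwise fall back to the transcript ID, cut at the last separator (assumption).
    transcript_id = header.lstrip(">").split()[0]
    return transcript_id.rsplit(sep, 1)[0]

def longest_per_gene(lines, sep="."):
    """Map gene ID -> (header, sequence) of its longest protein sequence."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    best = {}
    for header, seq in records:
        gene = gene_id_from_header(header, sep)
        # Only a strictly longer sequence replaces the current one,
        # so the first transcript seen is kept on ties (assumption).
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best
```

Passing the lines of name.fa to longest_per_gene would give one representative sequence per gene under these assumptions.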

File2 Annotation File in Bed Format

  1. One of the input files of genetribe is a bed file that stores annotation information. First, obtain the gff file from Ensembl or NCBI; the file name is generally name.gff3.gz.
  2. Decompress the file with the command gunzip name.gff3.gz and rename it to name.gff3.
  3. Convert the gff file to a bed file that contains only gene features (i.e. features of type "gene", which is the eighth column of the gff2bed output). Two possible ways are shown here (others also work):
    MCscan + sed:
    python -m jcvi.formats.gff bed --type=gene --key=ID name.gff3 -o name.bed
    sed -i 's/gene://g' name.bed
    
    gff2bed + gawk + sed:
    gff2bed < name.gff3 | gawk -vOFS="\t" '{if($8=="gene")print $1,$2,$3,$4,$5,$6}' | sed 's/gene://g' > name.bed
    
    example:
    1       3630    5899    AT1G01010       0       +
    1       6787    9130    AT1G01020       0       -
    1       11648   13714   AT1G01030       0       -
    1       23120   31227   AT1G01040       0       +
    1       31169   33171   AT1G01050       0       -
    1       33364   37871   AT1G01060       0       -
    1       38443   41017   AT1G01070       0       -
    1       44969   47059   AT1G01080       0       -
    1       47233   49304   AT1G01090       0       -
    
    The columns are chromosome, start position, end position, gene ID, score, and strand.
    Note:
    In some annotations, the fourth column of the bed file contains prefixes such as gene: or ID:, which should be deleted. Chromosome names should not contain numbers other than the chromosome number (e.g. S1.chr1 → chr1).
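If neither MCscan nor gff2bed is available, the same conversion can be approximated in plain Python. The sketch below is not part of genetribe; the function name gff3_genes_to_bed and the id_prefixes parameter are hypothetical. It assumes a standard nine-column GFF3 file in which the feature type is in the third column and the gene ID is stored in the ID attribute.

```python
def gff3_genes_to_bed(lines, id_prefixes=("gene:",)):
    """Yield BED6 rows (as tuples) for GFF3 features whose type is 'gene'."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        chrom, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        # Parse the key=value attribute column and take the ID attribute.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        gene_id = attrs.get("ID", "")
        # Strip prefixes such as "gene:" from the ID.
        for prefix in id_prefixes:
            if gene_id.startswith(prefix):
                gene_id = gene_id[len(prefix):]
        # GFF3 coordinates are 1-based inclusive; BED is 0-based half-open.
        yield (chrom, start - 1, end, gene_id, 0, strand)
```

Each tuple can be written to name.bed as one tab-separated line.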

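File3 Chromosome Group Information

This file specifies the chromosome naming pattern of each subgenome of the species and is named name.chrlist. As the examples show, the patterns of different subgenomes are separated by commas, and N stands in for the chromosome number.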
    example:
    Triticum aestivum IWGSC RefSeqv1 (chr1A chr2A chr1B chr2B chr1D chr2D)
    cat Triticum_aestivum.chrlist
    chrNA,chrNB,chrND
    
    Triticum urartu Tu2.0 (Tu1 Tu2)
    cat Triticum_urartu.chrlist
    TuN
    
    Oryza sativa Japonica IRGSP v1 (1 2 ...)
    cat Oryza_sativa_Japonica.chrlist
    N
    
    Triticum aestivum TGACv1 (TGACv1_scaffold_000099_1AL TGACv1_scaffold_732229_3B)
    NA,NB,ND
    
    Note:
    Each chromosome name pattern in .chrlist must be contained in the corresponding chromosome names in the bed file.
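One way to check a .chrlist file against the chromosome names in the bed file is sketched below. The matching rule is an assumption inferred from the examples above (each N is treated as a placeholder for a run of digits, and the resulting pattern must occur somewhere in the chromosome name); genetribe itself may match names differently. All function names are hypothetical.

```python
import re

def load_chrlist(text):
    """Split the comma-separated content of a .chrlist file into patterns."""
    return [p.strip() for p in text.strip().split(",") if p.strip()]

def pattern_to_regex(pattern):
    # ASSUMPTION: every "N" in a pattern stands for the chromosome number.
    parts = [re.escape(part) for part in pattern.split("N")]
    return re.compile(r"\d+".join(parts))

def chromosome_group(chrom, patterns):
    """Return the first pattern contained in the chromosome name, or None."""
    for pattern in patterns:
        if pattern_to_regex(pattern).search(chrom):
            return pattern
    return None
```

A chromosome name from the bed file for which chromosome_group returns None would, under this assumption, belong to no chromosome group.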

File4 Gene Annotation Confidence (Optional)

This file is needed only when the -c parameter of genetribe core is specified.

  1. The confidence (HC, high-confidence; LC, low-confidence) can be found in the gff file or in the names of the fasta files (name.HC.fasta, name.LC.fasta) downloaded from the database.
  2. Extract two columns, gene ID and confidence, and save them as name.confidence.
    example:
    TraesCS1A02G000100      HC
    TraesCS1A02G000200      HC
    TraesCS1A02G000300      HC
    TraesCS1A02G000100LC    LC
    TraesCS1A02G000200LC    LC
    TraesCS1A02G000300LC    LC
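When the confidence is encoded in the fasta file names, the two-column table can be built from the sequence headers. The following is a sketch under assumptions, not a genetribe utility: the function name confidence_rows is hypothetical, and it assumes that each header starts with a transcript ID whose gene ID is everything before the last separator.

```python
def confidence_rows(header_lines, label, sep="."):
    """Yield (gene_id, label) once per gene found in FASTA header lines."""
    seen = set()
    for line in header_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line[1:].split()
        if not fields:
            continue
        # ASSUMPTION: a transcript suffix such as ".1" follows the last separator.
        gene_id = fields[0].rsplit(sep, 1)[0]
        if gene_id not in seen:
            seen.add(gene_id)
            yield gene_id, label
```

Calling it once on the lines of name.HC.fasta with the label "HC" and once on name.LC.fasta with "LC", then writing each pair as a tab-separated line, would produce name.confidence.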
    

Output Files

Table 1. Output files of the core workflow
Output file   Contents
.one2many     All homologous gene pairs. Three columns: the gene in genome 1, the gene in genome 2, and the homolog match score (HMS).
.one2one      RBH and SBH pairs. Four columns: the gene in genome 1, the gene in genome 2, the homologous type (RBH/SBH), and the chromosome group of the gene in genome 2.
.RBH          Gene pairs that are Reciprocal Best Hits.
.SBH          Gene pairs that are Single-side Best Hits, where no RBH is found but a best-matching gene is found.
.singleton    Genes with no homologous genes.
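The distinction between RBH, SBH, and singleton in Table 1 can be illustrated with a generic best-hit classification. This sketch covers only the textbook definitions; how genetribe computes the homolog match score and which additional criteria it applies are not described in this section, so the score dictionaries and the function names best_hits and classify_pairs are hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring target.

    scores: dict mapping (query, target) -> similarity score.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}

def classify_pairs(genes1, scores_1to2, scores_2to1):
    """Label each genome 1 gene as RBH, SBH, or singleton."""
    fwd = best_hits(scores_1to2)  # genome 1 -> genome 2
    rev = best_hits(scores_2to1)  # genome 2 -> genome 1
    rbh, sbh, singleton = [], [], []
    for gene in genes1:
        target = fwd.get(gene)
        if target is None:
            singleton.append(gene)        # no hit at all
        elif rev.get(target) == gene:
            rbh.append((gene, target))    # best hits agree in both directions
        else:
            sbh.append((gene, target))    # best hit in one direction only
    return rbh, sbh, singleton
```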

4.2 Files of Sameassembly

Homolog inference for the same assembly

Input files

File1 Annotation File in Bed Format

This file is the same as the bed file of the core workflow (File2 in Section 4.1).

File2 Length of Genes

It can be obtained from the bed file with the following command (other approaches also work):

cat name.bed | gawk -vOFS="\t" '{print $4,$3-$2}' > name.genelength
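For environments without gawk, the same table can be computed in Python. This sketch mirrors the command above (gene ID from the fourth column, length as end minus start); the function name gene_lengths is hypothetical.

```python
def gene_lengths(bed_lines):
    """Yield (gene_id, length) for each BED row, with length = end - start."""
    for line in bed_lines:
        cols = line.split()  # whitespace-separated, like gawk's default
        if len(cols) < 4:
            continue  # skip blank or malformed lines
        yield cols[3], int(cols[2]) - int(cols[1])
```

Writing each pair as a tab-separated line gives the content of name.genelength.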

Output Files

Table 2. Output files of genetribe sameassembly
Output file   Contents
.one2many     All gene pairs with overlapping positions. Three columns: the gene in genome 1, the gene in genome 2, and the homolog match score (i.e. the overlapping ratio).
.one2one      RBH and SBH pairs. Three columns: the gene in genome 1, the gene in genome 2, and the homologous type (RBH/SBH).