Explanation of Data

Each page provides information on Borrelia burgdorferi, along with the results of a recent RNA-Seq differential gene expression experiment.

The data and results were set up as follows.

The annotation data were sorted and matched using a series of Python scripts, which are available upon request. Briefly, a GFF file was downloaded from PATRIC, and a gene ID based on the locus_tag was then assigned to the ID field in the attributes column (column 8 when counting from zero, i.e. the ninth and last column of the GFF file). In several cases the locus_tag was missing or the identifier conflicted with another entry, so the gene ID was created with the following priority:

1) If a locus_tag existed and was not duplicated, the locus_tag was assigned to the ID field.
* Note that the protein_product field was ignored in this case
2) If a locus_tag existed and was duplicated, and the protein product was a hypothetical protein, then the locus_tag was used, with an 'h' appended to it. To preserve unique identifiers, an integer was added that was incremented for each new hypothetical-type gene product.
* Example: BB_L35 and BB_L35_h4 ; '4' was used because it was the fourth hypothetical protein encountered *in the entire annotation file*
3) If no locus_tag or other identifier was present, OR the entry was specifically identified as a pseudogene, the previously listed locus_tag in the annotation file was used, with a 'p' appended to it. As in priority 2, to preserve unique identifiers, an integer was added that was incremented for each new pseudogene-type gene product.
* Example: BB_Q15_p1 ; '1' was used because it was the first pseudogene-type gene product encountered *in the entire annotation file*.
* Also note that an annotation entry can carry both 'p' and 'h' (for example, BB_Q03_p1_h15) *if* the entry before it was a pseudogene-type gene product.
4) If a locus_tag existed and was identified as one that encoded a tRNA, then the locus_tag was used, with a 't' appended to it. To preserve unique identifiers, an integer was added that was incremented for each new tRNA gene product.
* Example: BB0615_t14 ; '14' was used because it was the 14th tRNA *in the entire annotation file*
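
The priority rules above can be sketched in Python roughly as follows. This is a minimal illustration, not the original scripts: the function name `assign_gene_ids`, the dictionary keys (`locus_tag`, `product`, `is_pseudo`, `is_trna`), and the order in which the rules are tested are all assumptions made for the example. It also assumes the first entry in the file has a locus_tag, and it does not attempt to reproduce the combined 'p' and 'h' case.

```python
from collections import Counter


def assign_gene_ids(entries):
    """Assign an ID to each annotation entry, in file order.

    `entries` is a list of dicts with hypothetical keys:
    'locus_tag' (str or None), 'product' (str),
    'is_pseudo' (bool) and 'is_trna' (bool).
    """
    # Count how often each locus_tag occurs so duplicates can be detected.
    tag_counts = Counter(e["locus_tag"] for e in entries if e.get("locus_tag"))
    # One counter per suffix, shared across the entire annotation file.
    counters = {"h": 0, "p": 0, "t": 0}
    previous_tag = None  # most recently listed locus_tag
    ids = []
    for e in entries:
        tag = e.get("locus_tag")
        product = e.get("product", "")
        if tag is None or e.get("is_pseudo", False):
            # Priority 3: reuse the previously listed locus_tag plus '_p<n>'.
            counters["p"] += 1
            gene_id = f"{previous_tag}_p{counters['p']}"
        elif e.get("is_trna", False):
            # Priority 4: tRNA entries get '_t<n>'.
            counters["t"] += 1
            gene_id = f"{tag}_t{counters['t']}"
        elif tag_counts[tag] > 1 and "hypothetical protein" in product.lower():
            # Priority 2: duplicated tag on a hypothetical protein gets '_h<n>'.
            counters["h"] += 1
            gene_id = f"{tag}_h{counters['h']}"
        else:
            # Priority 1, plus any case the rules leave unspecified
            # (e.g. a duplicated tag on a non-hypothetical product).
            gene_id = tag
        ids.append(gene_id)
        if tag is not None:
            previous_tag = tag
    return ids
```

Because the counters are shared across the whole file, the integer in a suffix reflects how many entries of that type have been seen so far, not how many share the same locus_tag.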

After the list of updated gene IDs was created, the gene IDs were matched to the differential gene expression output using exact name matches and regular expressions. The gene IDs were also checked manually by a researcher for accuracy.
* If a gene matched a PATRIC gene ID exactly, it was reported as 'MATCH', and the gene information from PATRIC was added.
* If it matched only partly, it was reported as 'PARTIAL', and the gene information from PATRIC was added.
* If the script could not match it, it was manually checked again for accuracy, and 'NA' was reported for the gene information.
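
As a rough sketch of the matching step, the three outcomes could be produced like this. The function name `match_gene` is hypothetical, and the regular expression is only an assumed example of a partial match (the expression name followed by one or more 'h', 'p', or 't' suffixes); the patterns actually used are not described here.

```python
import re


def match_gene(name, reference_ids):
    """Return (status, matched_id) for one gene name from the expression output."""
    # Exact name match.
    if name in reference_ids:
        return "MATCH", name
    # Assumed partial rule: the reference ID is the name plus suffixes
    # such as '_p1' or '_h15'.
    pattern = re.compile(rf"^{re.escape(name)}(_[hpt]\d+)+$")
    for ref in reference_ids:
        if pattern.match(ref):
            return "PARTIAL", ref
    # No match found; gene information is reported as 'NA'.
    return "NA", None
```

For example, with the reference IDs `["BB_L35", "BB_Q15_p1"]`, the name `BB_L35` is an exact match, while `BB_Q15` matches `BB_Q15_p1` only partially.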

For each gene, the protein product was taken from the protein_product field in the attributes column of the GFF file (column 8 when counting from zero), with the following priority:
1) If a locus_tag existed and was not duplicated (i.e. it fulfilled priority 1 above), then the value from the protein_product field was assigned.
2) Otherwise, the value 'NA' was assigned as the protein product.
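
The two rules above could be sketched as follows. This assumes the attributes column uses the usual GFF `key=value;key=value` layout and that the keys are literally named `locus_tag` and `protein_product`; the helper names are hypothetical, and URL-encoded characters in values are not decoded.

```python
from collections import Counter


def parse_attributes(attr_field):
    """Split a GFF attributes string ('key=value;key=value') into a dict."""
    attrs = {}
    for pair in attr_field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def assign_products(attr_fields):
    """Return the protein product for each entry, or 'NA' per the rules above."""
    parsed = [parse_attributes(field) for field in attr_fields]
    # A locus_tag is "not duplicated" if it occurs exactly once in the file.
    tag_counts = Counter(a["locus_tag"] for a in parsed if "locus_tag" in a)
    products = []
    for attrs in parsed:
        tag = attrs.get("locus_tag")
        if tag is not None and tag_counts[tag] == 1:
            # Rule 1: unique locus_tag, keep the annotated product.
            products.append(attrs.get("protein_product", "NA"))
        else:
            # Rule 2: missing or duplicated locus_tag.
            products.append("NA")
    return products
```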

Borrelia burgdorferi: