Getting started with uniprotREST

This document covers the basics of uniprotREST. The package uses httr2 to wrap the latest UniProt REST API, which was updated in June 2022. I wrote it as an easy-to-use interface to the API for R users who need to download information from UniProt regularly and reproducibly.

uniprotREST has 3 main functions to use:

uniprot_map() to map to or from UniProt accessions.
uniprot_search() to perform text search queries.
uniprot_single() to get detailed information for a single entry.

library(uniprotREST)

1. ID mapping with `uniprot_map`

This is by far the most frequently used tool. Suppose you have been given a list of UniProt accessions, but you have no idea which proteins they refer to or what properties those proteins have. You can use uniprot_map() to find out.

# Accessions of interest
aoi <- c("A0A8I6AN81", "A0A0N4SVP8", "Q9H6R0")

Default settings

Here we just use the default settings, which will map the IDs from UniProtKB_AC-ID to UniProtKB, and output a dataframe.

result1 <- uniprot_map(ids = aoi)
## Running job: 30b29a9ff6a8afbb177c3cd24d2860825a383978 
## Checking job status...
## Job complete!
## 
 Downloading: page 1 of 1

The job ID is automatically printed (stop printing by setting the verbosity argument to 0). Job IDs and the job data are kept by UniProt for approximately 7 days, and are then deleted.

# All 3 proteins are RNA helicases
head(result1)
##         From      Entry       Entry.Name   Reviewed
## 1 A0A8I6AN81 A0A8I6AN81   A0A8I6AN81_RAT unreviewed
## 2 A0A0N4SVP8 A0A0N4SVP8 A0A0N4SVP8_MOUSE unreviewed
## 3     Q9H6R0     Q9H6R0      DHX33_HUMAN   reviewed
##                                                          Protein.names
## 1                                           RNA helicase (EC 3.6.4.13)
## 2                                           RNA helicase (EC 3.6.4.13)
## 3 ATP-dependent RNA helicase DHX33 (EC 3.6.4.13) (DEAH box protein 33)
##    Gene.Names                Organism Length
## 1        Rig1 Rattus norvegicus (Rat)    881
## 2    Eif4a3l2    Mus musculus (Mouse)    411
## 3 DHX33 DDX33    Homo sapiens (Human)    707

By default, the output will be a dataframe with 8 columns:

From = accessions used to map from
Entry = UniProtKB accessions they were mapped to
Entry.Name = UniProtKB entry name
Reviewed = is the protein in Swiss-Prot?
Protein.names = name of protein in UniProtKB
Gene.Names = gene names associated with this protein (can be multiple)
Organism = name of organism the protein is from
Length = amino acid length

The number of rows depends on:

How many IDs were successfully mapped
Whether the mapping was 1:1 or not
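Both effects can be checked by comparing the submitted IDs against the From column of the result. The following is a minimal sketch in base R; the IDs, accessions, and the `result` data frame are made-up placeholders standing in for real uniprot_map() output.

```r
# Placeholder inputs and result; these IDs and accessions are invented
# purely for illustration and do not refer to real entries.
ids <- c("ID1", "ID2", "ID3")
result <- data.frame(
  From  = c("ID1", "ID2", "ID2"),
  Entry = c("ACC1", "ACC2", "ACC3"),
  stringsAsFactors = FALSE
)

# IDs that were submitted but are absent from the result (not mapped)
unmapped <- setdiff(ids, result$From)

# IDs that appear more than once (one-to-many mappings)
hits <- table(result$From)
multi_mapped <- names(hits)[hits > 1]

unmapped
multi_mapped
```

Here `unmapped` holds the IDs with no match and `multi_mapped` holds the IDs responsible for extra rows.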

The output columns can be customised with the fields argument.

Return fields

UniProt has a lot of metadata available for each protein. You can access this in the results by requesting different columns or ‘return fields’ using the fields argument.

Here we will request some different return fields. See Return Fields - UniProt and Return Fields - Other for lists of all available fields.

# Jobs are stored for 7 days
# so subsequent queries will be faster
result2 <- uniprot_map(
  ids = aoi,
  fields = c(
    "gene_primary",
    "organism_name",
    "length",
    "mass"
  )
)
## Running job: 30b29a9ff6a8afbb177c3cd24d2860825a383978 
## Checking job status...
## Job complete!
## 
 Downloading: page 1 of 1

head(result2)
##         From Gene.Names..primary.                Organism Length   Mass
## 1 A0A8I6AN81                 Rig1 Rattus norvegicus (Rat)    881 101151
## 2 A0A0N4SVP8             Eif4a3l2    Mus musculus (Mouse)    411  46959
## 3     Q9H6R0                DHX33    Homo sapiens (Human)    707  78874
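The dotted column names such as Gene.Names..primary. look like the result of R converting the TSV header titles into syntactic names. The sketch below reproduces that conversion with base R's make.names(); the raw header titles are an assumption inferred from the output above, not taken from the package source.

```r
# Column titles as they might appear in the raw TSV header (assumed)
raw_headers <- c("From", "Gene Names (primary)", "Organism", "Length", "Mass")

# make.names() replaces characters that are invalid in R names with "."
r_names <- make.names(raw_headers)
r_names
```

Knowing this helps when selecting columns programmatically, for example with result2[["Gene.Names..primary."]].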

From/to database

uniprot_map() can be used to map IDs from other databases to UniProt IDs, and vice versa. See Databases for a list of databases available for mapping, and From/to Rules for the rules on which databases can be mapped to which.

Here we’ll map some Ensembl gene IDs to reviewed UniProtKB accessions.

# Genes of interest
goi <- c("ENSG00000088247", "ENSG00000162613")

The fields argument only works when mapping to a UniProtKB, UniRef, or UniParc database.

result3 <- uniprot_map(
  ids = goi,
  from = "Ensembl",
  to = "UniProtKB-Swiss-Prot",
  fields = c("accession", "gene_primary")
)
## Running job: fcc9728d59b6c0e3f618d5cf59445fc83d359a9a 
## Checking job status...
## Job complete!
## 
 Downloading: page 1 of 1

head(result3)
##              From  Entry Gene.Names..primary.
## 1 ENSG00000088247 Q92945                KHSRP
## 2 ENSG00000162613 Q96AE4                FUBP1
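Because fields only applies when mapping to a UniProtKB, UniRef, or UniParc database, a script that maps to many targets may want to check the target first. The helper below is hypothetical (it is not part of uniprotREST), and the prefix check is an assumption about how the target database names are spelled.

```r
# Hypothetical helper: TRUE if the target database name starts with
# UniProtKB, UniRef, or UniParc (assumed naming convention).
supports_fields <- function(to) {
  grepl("^(UniProtKB|UniRef|UniParc)", to)
}

ok <- supports_fields(c("UniProtKB-Swiss-Prot", "UniRef90", "Ensembl"))
ok
```

The fields argument could then be passed only for targets where the check returns TRUE.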

Format

The UniProt REST API can deliver results in different data formats. The formats available depend on the database being accessed and the uniprotREST function being used. See Formats for a full list of available formats.

The uniprotREST wrapper functions do not support all formats yet. Each tool currently supports the following formats:

uniprot_map() = tsv, fasta
uniprot_search() = tsv, fasta
uniprot_single() = tsv, fasta, json

Here we’ll re-use the result3 job above, but request the FASTA protein sequences instead. If the Biostrings package is installed (highly recommended), the output will be a Biostrings::AAStringSet; otherwise it will be a named character vector.

result4 <- uniprot_map(
  ids = goi,
  from = "Ensembl",
  to = "UniProtKB-Swiss-Prot",
  format = "fasta"
)
## Running job: fcc9728d59b6c0e3f618d5cf59445fc83d359a9a 
## Checking job status...
## Job complete!
## 
 Downloading: page 1 of 1

result4
## AAStringSet object of length 2:
##     width seq                                               names               
## [1]   711 MSDYSTGGPPPGPPPPAGGGGGA...YGQTPGPGGPQPPPTQQGQQQAQ sp|Q92945|FUBP2_H...
## [2]   644 MADYSTVPPPSSGSAGGGGGGGG...QAAYYAQTSPQGMPQHPPAPQGQ sp|Q96AE4|FUBP1_H...
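The sequence names carry the FASTA header, which starts with a pipe-delimited `db|accession|entry name` triple. As a rough sketch, the accession can be pulled out of the names with base R. The named character vector below is a placeholder with invented headers and sequences; names() should work the same way on an AAStringSet.

```r
# Placeholder FASTA result as a named character vector; the headers and
# sequences are invented and only mimic the "db|accession|name" layout.
seqs <- c(
  "sp|ACC001|NAME1_HUMAN Example protein one" = "MSDYSTGG",
  "sp|ACC002|NAME2_HUMAN Example protein two" = "MADYSTVP"
)

# Split each header on "|" and keep the second piece (the accession)
parts <- strsplit(names(seqs), "|", fixed = TRUE)
accessions <- vapply(parts, function(p) p[2], character(1))
accessions
```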

Path

The previous examples all save the data from UniProt into an object in memory. However, you can also save the data to a file on disk. To do this, just specify a file path with the correct extension. The file must not already exist; otherwise an error is thrown.

# Get temp path for this example (and delete when done)
tmp <- tempfile(fileext = ".tsv")
on.exit(unlink(tmp))

# Save results to a tsv file
uniprot_map(
  ids = goi,
  from = "Ensembl",
  to = "UniProtKB-Swiss-Prot",
  fields = c("accession", "gene_primary"),
  format = "tsv",
  path = tmp
)
## Running job: fcc9728d59b6c0e3f618d5cf59445fc83d359a9a 
## Checking job status...
## Job complete!
## 
 Downloading: page 1 of 1

# Check file contents
read.delim(tmp)
##              From  Entry Gene.Names..primary.
## 1 ENSG00000088247 Q92945                KHSRP
## 2 ENSG00000162613 Q96AE4                FUBP1
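Since an existing file at path causes an error, scripts that are re-run often may want to check the path up front. The helper below is a hypothetical sketch in base R, not a uniprotREST function; it either stops early with a clear message or, if overwrite = TRUE, removes the old file first.

```r
# Hypothetical helper: make sure `path` is free before downloading to it.
check_new_path <- function(path, overwrite = FALSE) {
  if (file.exists(path)) {
    if (!overwrite) {
      stop("File already exists: ", path, call. = FALSE)
    }
    unlink(path)  # remove the old file so the download can proceed
  }
  invisible(path)
}

out <- check_new_path(tempfile(fileext = ".tsv"))
# `out` could then be passed on, e.g. uniprot_map(ids = goi, ..., path = out)
```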

Other arguments

The other arguments in uniprot_map() are as follows:

Isoform

By default, the UniProt APIs will only provide results with a protein’s canonical sequence. If you set isoform = TRUE, then isoform sequences will be included as well. This is typically only relevant when format = "fasta", although I have run into some exceptions.

Here we get the canonical and isoform sequences for human GAPDH.

result5 <- uniprot_map(
  ids = "P04406",
  format = "fasta",
  isoform = TRUE
)
## Running job: 5feb5aff87ab072e97ab3ab620e270f19b36e103 
## Checking job status...
## Job complete!
## 
 Downloading: page 1 of 1

result5
## AAStringSet object of length 2:
##     width seq                                               names               
## [1]   335 MGKVKVGVNGFGRIGRLVTRAAF...WYDNEFGYSNRVVDLMAHMASKE sp|P04406|G3P_HUM...
## [2]   293 MVYMFQYDSTHGKFHGTVKAENG...WYDNEFGYSNRVVDLMAHMASKE sp|P04406-2|G3P_H...

Method and page_size

The UniProt API provides results via 2 endpoints, stream and pagination, which you can choose between with the method argument.

By default, uniprot_map() and uniprot_search() use method = "paged", which is more robust but slightly slower, with the recommended default page_size of 500. uniprot_single(), by contrast, only uses the stream endpoint.

Paged endpoint:

Slightly slower.
Processes results in chunks, so it is much more resilient to connection issues.
Can theoretically handle more than 10,000,000 results.

Stream endpoint:

Slightly faster.
Expensive for the API, uses a lot of memory.
Can return a 429 status error if it currently has too many requests.
Up to 10,000,000 results can be fetched.
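If you do use the stream endpoint, an occasional 429 error can often be handled by waiting and trying again. The wrapper below is a generic sketch in base R, not part of uniprotREST: `with_retry()` is a hypothetical helper that retries on any error (it does not inspect the HTTP status) and doubles the wait between attempts.

```r
# Hypothetical helper: call fn() up to max_tries times, waiting longer
# after each failure (wait, 2 * wait, 4 * wait, ... seconds).
with_retry <- function(fn, max_tries = 3, wait = 1) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (i < max_tries) {
      Sys.sleep(wait * 2^(i - 1))
    }
  }
  stop(result)  # re-signal the last error once all attempts are used
}

# Stand-in for a request that fails twice before succeeding
attempts <- 0
flaky <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("HTTP 429 Too Many Requests")
  "ok"
}

out <- with_retry(flaky, max_tries = 5, wait = 0)
```

In practice the function passed in might be something like function() uniprot_map(ids = aoi, method = "stream"), with a non-zero wait.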

Compressed

Should gzipped data be requested? This is FALSE by default, and it is only used if method = "stream" and path is specified.

For example:

# Get temp path for this example (and delete when done)
tmp <- tempfile(fileext = ".fasta.gz")
on.exit(unlink(tmp))

# Save results to a gzipped FASTA file
uniprot_map(
  ids = "P04406",
  format = "fasta",
  isoform = TRUE,
  method = "stream",
  path = tmp,
  compressed = TRUE
)
## Running job: 5feb5aff87ab072e97ab3ab620e270f19b36e103 
## Checking job status...
## Job complete!
## 
Downloading: 420 B

# Check file contents
Biostrings::readAAStringSet(tmp)
## AAStringSet object of length 2:
##     width seq                                               names               
## [1]   335 MGKVKVGVNGFGRIGRLVTRAAF...WYDNEFGYSNRVVDLMAHMASKE sp|P04406|G3P_HUM...
## [2]   293 MVYMFQYDSTHGKFHGTVKAENG...WYDNEFGYSNRVVDLMAHMASKE sp|P04406-2|G3P_H...

Verbosity

Controls the amount of information to print:

Use verbosity = 0 to not print anything.
Use verbosity = 1, 2, or 3 to print increasing amounts of information about the HTTP requests made to the UniProt API (typically for debugging purposes).

Dry_run

If TRUE, performs the request locally with httr2::req_dry_run() instead of actually sending it to the UniProt REST API. This is useful for debugging if you are getting 400 - Bad request status errors.

2. Querying UniProt with `uniprot_search`

This function is used to perform text searches against UniProt, akin to using the search bar on their website. The different databases available from the search bar are also available via uniprot_search() (see Databases).

It’s very important that the search string is constructed correctly; see this page for help building queries. If you get a 400 - Bad request error, it’s likely that your search string is not formatted correctly.
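One way to reduce formatting mistakes is to assemble the query from individual terms rather than typing the whole string by hand. The helper below is a hypothetical sketch in base R (it is not part of uniprotREST); it wraps each term in parentheses and joins them with AND, which is the shape of the query used in the example that follows.

```r
# Hypothetical helper: wrap each term in parentheses and join with AND
build_query <- function(...) {
  terms <- c(...)
  paste0("(", terms, ")", collapse = " AND ")
}

query <- build_query("proteome:UP000005640", "keyword:KW-0325", "length<100")
query
```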

Here we’ll search the human reference proteome (UP000005640) for proteins annotated with the glycoprotein keyword (KW-0325) that are shorter than 100 amino acids.

result6 <- uniprot_search(
  query = "(proteome:UP000005640) AND (keyword:KW-0325) AND (length<100)",
  database = "uniprotkb",
  format = "tsv",
  fields = c("accession", "gene_primary")
)
## 
 Downloading: page 1 of 1

head(result6)
##    Entry Gene.Names..primary.
## 1 P06028                 GYPB
## 2 P80098                 CCL7
## 3 Q16627                CCL14
## 4 P0DMC3                APELA
## 5 P25063                 CD24
## 6 P31358                 CD52

UniProt databases other than UniProtKB can be queried as well. In this example we’ll search the proteomes database for the word ‘dog’.

result7 <- uniprot_search(
  "dog",
  database = "proteomes",
  format = "tsv",
  fields = c("upid", "organism")
)
## 
 Downloading: page 1 of 1

head(result7)
##   Proteome.Id
## 1 UP000252519
## 2 UP000645828
## 3 UP000805418
## 4 UP000029752
## 5 UP000201396
## 6 UP000277561
##                                                                                            Organism
## 1                                                                Ancylostoma caninum (Dog hookworm)
## 2                                       Nyctereutes procyonoides (Raccoon dog) (Canis procyonoides)
## 3                                                   Canis lupus familiaris (Dog) (Canis familiaris)
## 4 Cadicivirus A (isolate Dog/Hong Kong/209/2008) (CaPdV-1) (Canine picodicistrovirus (isolate 209))
## 5                                                                             Raccoon dog amdovirus
## 6                                                                  Human associated gemykibivirus 2

3. Retrieving an entry with `uniprot_single`

uniprot_single() is used to quickly retrieve information about a single entry in UniProt. By default, the json format is requested and parsed into a list containing all the information available for that entry. All other arguments work the same as in uniprot_search().
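The parsed list is nested, so individual values are reached with `$` or `[[`. The sketch below uses a small mock list in place of a real result; the element names (primaryAccession, organism, sequence) are assumptions modelled on UniProtKB JSON and the values are invented, so inspect your own result with str() to confirm what is present.

```r
# Mock entry standing in for the list returned by uniprot_single();
# element names are assumed and the values are placeholders.
entry <- list(
  primaryAccession = "ACC001",
  organism = list(scientificName = "Homo sapiens"),
  sequence = list(value = "MGKVKV", length = 6L)
)

# Get an overview of the top-level elements
str(entry, max.level = 1)

# Drill down into nested elements
accession <- entry$primaryAccession
species <- entry$organism$scientificName
seq_length <- entry[["sequence"]][["length"]]
```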