This document will show you the basics of uniprotREST. This package uses httr2 to wrap the latest UniProt REST API, which was updated in June 2022. I wrote this package as an easy-to-use interface to the API for R users who need to regularly and reproducibly download information from UniProt.
uniprotREST has 3 main functions:
- uniprot_map() to map to or from UniProt accessions.
- uniprot_search() to perform text search queries.
- uniprot_single() to get detailed information for a single entry.
1. ID mapping with uniprot_map
This is by far the most frequently used tool. Say hypothetically, you have been given a list of UniProt accessions. You have no clue what proteins they refer to, or what properties these proteins have. You can use uniprot_map() to find this out.
# Accessions of interest
aoi <- c("A0A8I6AN81", "A0A0N4SVP8", "Q9H6R0")
Default settings
Here we just use the default settings, which will map the IDs from UniProtKB_AC-ID to UniProtKB, and output a dataframe.
result1 <- uniprot_map(ids = aoi)
## Running job: 30b29a9ff6a8afbb177c3cd24d2860825a383978
## Checking job status...
## Job complete!
## Downloading: page 1 of 1
The job ID is automatically printed (stop printing by setting the verbosity argument to 0). Job IDs and the job data are kept by UniProt for approximately 7 days, and are then deleted.
# All 3 proteins are RNA helicases
head(result1)
## From Entry Entry.Name Reviewed
## 1 A0A8I6AN81 A0A8I6AN81 A0A8I6AN81_RAT unreviewed
## 2 A0A0N4SVP8 A0A0N4SVP8 A0A0N4SVP8_MOUSE unreviewed
## 3 Q9H6R0 Q9H6R0 DHX33_HUMAN reviewed
## Protein.names
## 1 RNA helicase (EC 3.6.4.13)
## 2 RNA helicase (EC 3.6.4.13)
## 3 ATP-dependent RNA helicase DHX33 (EC 3.6.4.13) (DEAH box protein 33)
## Gene.Names Organism Length
## 1 Rig1 Rattus norvegicus (Rat) 881
## 2 Eif4a3l2 Mus musculus (Mouse) 411
## 3 DHX33 DDX33 Homo sapiens (Human) 707
By default, the output will be a dataframe with 8 columns:
- From = accessions used to map from
- Entry = accessions they were mapped to
- Entry.Name = UniProtKB entry name
- Reviewed = is the protein in Swiss-Prot?
- Protein.names = name of protein in UniProtKB
- Gene.Names = gene names associated with this protein (can be multiple)
- Organism = name of organism the protein is from
- Length = amino acid length
The number of rows depends on:
- How many IDs were successfully mapped
- Whether the mapping was 1:1 or not
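Both cases can be checked with base R once a result is in hand. The sketch below uses a small hand-made data frame (`mock_result`, hypothetical and not real UniProt output) in place of a real uniprot_map() result. The IDs are invented, and the only thing assumed about the result is that it has a `From` column, as in the output above.

```r
# Hypothetical input IDs and a hand-made stand-in for a uniprot_map() result.
ids <- c("ID_A", "ID_B", "ID_C")
mock_result <- data.frame(
  From  = c("ID_A", "ID_A", "ID_C"),
  Entry = c("X00001", "X00002", "X00003")
)

# IDs that were not mapped at all (absent from the From column)
unmapped <- setdiff(ids, mock_result$From)

# IDs that mapped to more than one entry (mapping is not 1:1)
hits <- table(mock_result$From)
multi_mapped <- names(hits)[hits > 1]

unmapped
multi_mapped
```

Here `ID_B` is reported as unmapped and `ID_A` as mapping to more than one entry, which is why the number of rows need not equal the number of input IDs.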
The output columns can be customised with the fields argument.
Return fields
UniProt has a lot of metadata available for each protein. You can access this in the results by requesting different columns or ‘return fields’ using the fields argument.
Here we will request some different return fields. See Return Fields - UniProt and Return Fields - Other for lists of all available fields.
# Jobs are stored for 7 days
# so subsequent queries will be faster
result2 <- uniprot_map(
  ids = aoi,
  fields = c(
    "gene_primary",
    "organism_name",
    "length",
    "mass"
  )
)
## Running job: 30b29a9ff6a8afbb177c3cd24d2860825a383978
## Checking job status...
## Job complete!
## Downloading: page 1 of 1
head(result2)
## From Gene.Names..primary. Organism Length Mass
## 1 A0A8I6AN81 Rig1 Rattus norvegicus (Rat) 881 101151
## 2 A0A0N4SVP8 Eif4a3l2 Mus musculus (Mouse) 411 46959
## 3 Q9H6R0 DHX33 Homo sapiens (Human) 707 78874
From/to database
uniprot_map() can be used to map IDs from other databases to UniProt IDs, and vice-versa. See Databases for a list of databases available for mapping, and From/to Rules for the rules of which databases can be mapped to what.
Here we’ll map some Ensembl gene IDs to reviewed UniProtKB accessions.
# Genes of interest
goi <- c("ENSG00000088247", "ENSG00000162613")
The fields argument only works when mapping to a UniProtKB, UniRef, or UniParc database.
result3 <- uniprot_map(
  ids = goi,
  from = "Ensembl",
  to = "UniProtKB-Swiss-Prot",
  fields = c("accession", "gene_primary")
)
## Running job: fcc9728d59b6c0e3f618d5cf59445fc83d359a9a
## Checking job status...
## Job complete!
## Downloading: page 1 of 1
head(result3)
## From Entry Gene.Names..primary.
## 1 ENSG00000088247 Q92945 KHSRP
## 2 ENSG00000162613 Q96AE4 FUBP1
Format
The UniProt REST API can deliver results in different data formats. The formats available depend on the database being accessed and the uniprotREST function being used. See Formats for a full list of available formats.
The uniprotREST wrapper functions do not support all formats yet. Each tool currently supports the following formats:
- uniprot_map() = tsv, fasta
- uniprot_search() = tsv, fasta
- uniprot_single() = tsv, fasta, json
Here we’ll re-use the result3 job above, but request the FASTA protein sequences instead. If the Biostrings package is installed (highly recommended) the output will be a Biostrings::AAStringSet, or otherwise a named character.
result4 <- uniprot_map(
  ids = goi,
  from = "Ensembl",
  to = "UniProtKB-Swiss-Prot",
  format = "fasta"
)
## Running job: fcc9728d59b6c0e3f618d5cf59445fc83d359a9a
## Checking job status...
## Job complete!
## Downloading: page 1 of 1
result4
## AAStringSet object of length 2:
## width seq names
## [1] 711 MSDYSTGGPPPGPPPPAGGGGGA...YGQTPGPGGPQPPPTQQGQQQAQ sp|Q92945|FUBP2_H...
## [2] 644 MADYSTVPPPSSGSAGGGGGGGG...QAAYYAQTSPQGMPQHPPAPQGQ sp|Q96AE4|FUBP1_H...
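Because the return type depends on whether Biostrings is installed, downstream code may want to treat both cases the same way. The following is a minimal sketch, assuming only what is stated above (an AAStringSet or a named character vector). The sequences and names in `seqs` are made-up stand-ins, not real UniProt data.

```r
# Stand-in for a FASTA result when Biostrings is not installed:
# a named character vector (hypothetical names and sequences).
seqs <- c(protein_a = "MSDYST", protein_b = "MADYSTVP")

# as.character() returns plain strings for a character vector and also
# converts an AAStringSet. It drops names from a character vector, so the
# names are restored explicitly to behave the same in both cases.
seqs_chr <- as.character(seqs)
names(seqs_chr) <- names(seqs)

# Sequence lengths, named by FASTA header
seq_lengths <- nchar(seqs_chr)
seq_lengths
```

Normalising to a named character vector early keeps later steps independent of whether Biostrings happens to be available on a given machine.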
Path
The previous examples all save the data from UniProt into an object in memory. However, you can also save the data to a file on disk. To do this, just specify a file path with the correct extension. The file must not already exist, otherwise an error is thrown.
# Get temp path for this example (and delete when done)
tmp <- tempfile(fileext = ".tsv")
on.exit(unlink(tmp))
# Save results to a tsv file
uniprot_map(
  ids = goi,
  from = "Ensembl",
  to = "UniProtKB-Swiss-Prot",
  fields = c("accession", "gene_primary"),
  format = "tsv",
  path = tmp
)
## Running job: fcc9728d59b6c0e3f618d5cf59445fc83d359a9a
## Checking job status...
## Job complete!
## Downloading: page 1 of 1
# Check file contents
read.delim(tmp)
## From Entry Gene.Names..primary.
## 1 ENSG00000088247 Q92945 KHSRP
## 2 ENSG00000162613 Q96AE4 FUBP1
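Since an existing file at the target path causes an error, it can be convenient to check the path before starting a job. The helper below (`check_new_path`, a hypothetical name that is not part of uniprotREST) is one possible sketch of such a guard in base R.

```r
# Hypothetical helper: stop early, with a clear message, if the target exists.
check_new_path <- function(path) {
  if (file.exists(path)) {
    stop("File already exists: ", path, call. = FALSE)
  }
  invisible(path)
}

# tempfile() only returns a path; it does not create the file
out <- tempfile(fileext = ".tsv")
ok_before <- tryCatch({ check_new_path(out); TRUE }, error = function(e) FALSE)

# Once the file exists, the same check fails
file.create(out)
ok_after <- tryCatch({ check_new_path(out); TRUE }, error = function(e) FALSE)

unlink(out)
```

In a script that is re-run regularly, the same check could instead trigger a rename or a timestamped file name rather than an error.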
Other arguments
The other arguments in uniprot_map() are as follows:
Isoform
By default, the UniProt APIs will only provide results with a protein’s canonical sequence. If you set isoform = TRUE, then isoform sequences will be included as well. This is typically only relevant when format = "fasta", although I have run into some exceptions.
Here we get the canonical and isoform sequences for human GAPDH.
result5 <- uniprot_map(
  ids = "P04406",
  format = "fasta",
  isoform = TRUE
)
## Running job: 5feb5aff87ab072e97ab3ab620e270f19b36e103
## Checking job status...
## Job complete!
## Downloading: page 1 of 1
result5
## AAStringSet object of length 2:
## width seq names
## [1] 335 MGKVKVGVNGFGRIGRLVTRAAF...WYDNEFGYSNRVVDLMAHMASKE sp|P04406|G3P_HUM...
## [2] 293 MVYMFQYDSTHGKFHGTVKAENG...WYDNEFGYSNRVVDLMAHMASKE sp|P04406-2|G3P_H...
Method and page_size
The UniProt API provides results via 2 endpoints, stream and pagination, which you can choose via the method argument.
By default, uniprot_map() and uniprot_search() use method = "paged", which is more robust but slightly slower, with the default recommended page_size of 500. uniprot_single(), by contrast, only uses the stream endpoint.
Paged endpoint:
- Slightly slower.
- Processes results in chunks, so much more resilient to connection issues.
- Can theoretically handle more than 10,000,000 results.
Stream endpoint:
- Slightly faster.
- Expensive for the API, uses a lot of memory.
- Can return a 429 status error if it currently has too many requests.
- Up to 10,000,000 results can be fetched.
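Given these trade-offs, one option is to try the stream endpoint first and fall back to the paged endpoint if the request fails. The sketch below shows the pattern with a generic helper (`with_fallback`, a hypothetical name that is not part of uniprotREST). It assumes a failed request surfaces as an ordinary R error; how uniprotREST actually reports a 429 is not confirmed here. The two stand-in functions simulate the calls so the example runs offline.

```r
# Hypothetical helper: run `primary`; if it signals an error, run `fallback`.
with_fallback <- function(primary, fallback) {
  tryCatch(
    primary(),
    error = function(e) {
      message("Primary request failed (", conditionMessage(e), "); falling back.")
      fallback()
    }
  )
}

# Stand-ins for real calls. In practice these might wrap, for example,
# uniprot_map(ids, method = "stream") and uniprot_map(ids, method = "paged").
stream_call <- function() stop("HTTP 429 Too Many Requests")
paged_call  <- function() "paged result"

res <- with_fallback(stream_call, paged_call)
res
```

Passing the calls as functions delays their evaluation, so the paged request is only made when the stream request has actually failed.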
Compressed
Should gzipped data be requested? This is FALSE by default, and it is only used if method = "stream" and path is specified.
For example:
# Get temp path for this example (and delete when done)
tmp <- tempfile(fileext = ".fasta.gz")
on.exit(unlink(tmp))
# Save results to a gzipped FASTA file
uniprot_map(
  ids = "P04406",
  format = "fasta",
  isoform = TRUE,
  method = "stream",
  path = tmp,
  compressed = TRUE
)
## Running job: 5feb5aff87ab072e97ab3ab620e270f19b36e103
## Checking job status...
## Job complete!
## Downloading: 420 B
# Check file contents
Biostrings::readAAStringSet(tmp)
## AAStringSet object of length 2:
## width seq names
## [1] 335 MGKVKVGVNGFGRIGRLVTRAAF...WYDNEFGYSNRVVDLMAHMASKE sp|P04406|G3P_HUM...
## [2] 293 MVYMFQYDSTHGKFHGTVKAENG...WYDNEFGYSNRVVDLMAHMASKE sp|P04406-2|G3P_H...
Verbosity
Controls the amount of information to print:
- Use verbosity = 0 to not print anything.
- Use verbosity = 1, 2, or 3 to print increasing amounts of information about the HTTP requests made to the UniProt API (typically for debugging purposes).
Dry_run
If TRUE, performs the request locally with httr2::req_dry_run() instead of actually sending it to the UniProt REST API. This is useful for debugging purposes if you are getting 400 - Bad request status errors.
2. Querying UniProt with uniprot_search
This function is used to perform text searches against UniProt, akin to using the search bar on their website. The different databases available from the search bar are also available via uniprot_search() (see Databases).
It’s very important that the search string is constructed correctly, see this page for help building queries. If you get a 400 - Bad request error, it’s likely your search string is not formatted correctly.
Here we’ll search the human reference proteome (UP000005640) for proteins annotated with the glycoprotein keyword (KW-0325) that are shorter than 100 amino acids.
result6 <- uniprot_search(
  query = "(proteome:UP000005640) AND (keyword:KW-0325) AND (length<100)",
  database = "uniprotkb",
  format = "tsv",
  fields = c("accession", "gene_primary")
)
## Downloading: page 1 of 1
head(result6)
## Entry Gene.Names..primary.
## 1 P06028 GYPB
## 2 P80098 CCL7
## 3 Q16627 CCL14
## 4 P0DMC3 APELA
## 5 P25063 CD24
## 6 P31358 CD52
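Query strings like the one above are easy to get wrong when typed by hand, for example by leaving a parenthesis unbalanced. As a rough sketch, simple `field:value` clauses can be assembled programmatically. `build_query` is a hypothetical helper, not part of uniprotREST, and it only covers plain `field:value` terms joined by a single operator.

```r
# Hypothetical helper: turn a named vector of field = value pairs into
# "(field:value) AND (field:value)".
build_query <- function(terms, op = "AND") {
  stopifnot(!is.null(names(terms)), all(nzchar(names(terms))))
  clauses <- sprintf("(%s:%s)", names(terms), terms)
  paste(clauses, collapse = paste0(" ", op, " "))
}

query <- build_query(c(proteome = "UP000005640", keyword = "KW-0325"))
query
# → "(proteome:UP000005640) AND (keyword:KW-0325)"
```

The resulting string can be passed as the query argument of uniprot_search(). Range or comparison clauses still need to be written by hand following the UniProt query documentation.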
UniProt databases other than UniProtKB are available to query as well. In this example we’ll look for proteomes that match the word ‘dog’.
result7 <- uniprot_search(
  "dog",
  database = "proteomes",
  format = "tsv",
  fields = c("upid", "organism")
)
## Downloading: page 1 of 1
head(result7)
## Proteome.Id
## 1 UP000252519
## 2 UP000645828
## 3 UP000805418
## 4 UP000029752
## 5 UP000201396
## 6 UP000277561
## Organism
## 1 Ancylostoma caninum (Dog hookworm)
## 2 Nyctereutes procyonoides (Raccoon dog) (Canis procyonoides)
## 3 Canis lupus familiaris (Dog) (Canis familiaris)
## 4 Cadicivirus A (isolate Dog/Hong Kong/209/2008) (CaPdV-1) (Canine picodicistrovirus (isolate 209))
## 5 Raccoon dog amdovirus
## 6 Human associated gemykibivirus 2
3. Retrieving an entry with uniprot_single
uniprot_single() is used to quickly retrieve information about a single entry in UniProt. By default the json format is requested and is parsed into a list which contains all information available for that particular entry. All other arguments work the same as uniprot_search().
For example:
result8 <- uniprot_single(
id = "P99999",
verbosity = 0
)
str(result8, max.level = 1)
## List of 17
## $ entryType : chr "UniProtKB reviewed (Swiss-Prot)"
## $ primaryAccession : chr "P99999"
## $ secondaryAccessions :List of 6
## $ uniProtkbId : chr "CYC_HUMAN"
## $ entryAudit :List of 5
## $ annotationScore : num 5
## $ organism :List of 4
## $ proteinExistence : chr "1: Evidence at protein level"
## $ proteinDescription :List of 1
## $ genes :List of 1
## $ comments :List of 10
## $ features :List of 36
## $ keywords :List of 14
## $ references :List of 19
## $ uniProtKBCrossReferences:List of 179
## $ sequence :List of 5
## $ extraAttributes :List of 3
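The parsed result is an ordinary R list, so individual pieces can be pulled out with `$` or `[[`. The sketch below uses a small hand-made list (`entry`, hypothetical) that copies a few of the top-level character and numeric elements shown in the output above, so that it runs without a network request. The nested elements are left out because their internal structure is not shown here.

```r
# Hand-made stand-in mirroring a few top-level elements of the output above.
entry <- list(
  entryType = "UniProtKB reviewed (Swiss-Prot)",
  primaryAccession = "P99999",
  uniProtkbId = "CYC_HUMAN",
  annotationScore = 5
)

# Pull out scalar elements by name
entry$primaryAccession
entry[["uniProtkbId"]]

# Flag reviewed entries from the entryType string
is_reviewed <- grepl("Swiss-Prot", entry$entryType, fixed = TRUE)

# Collect the scalar (length-1, atomic) fields into a one-row data frame
scalars <- Filter(function(x) is.atomic(x) && length(x) == 1, entry)
summary_df <- as.data.frame(scalars, stringsAsFactors = FALSE)
summary_df
```

Filtering for length-1 atomic elements is a simple way to get an overview row per entry while ignoring the nested lists.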
Again, there are other UniProt databases available apart from UniProtKB (see Databases).
For example UniParc:
result9 <- uniprot_single(
id = "UPI0001C61C61",
database = "uniparc",
verbosity = 0
)
str(result9, max.level = 1)
## List of 6
## $ uniParcId : chr "UPI0001C61C61"
## $ uniParcCrossReferences :List of 4
## $ sequence :List of 5
## $ sequenceFeatures :List of 7
## $ oldestCrossRefCreated : chr "2010-03-03"
## $ mostRecentCrossRefUpdated: chr "2023-06-28"