Programmatic access to IMPC data relies on the [SOLR query syntax]. The syntax is flexible and powerful, but that flexibility brings complex features and behaviors that are not always easy to understand. Following common data access patterns helps limit the complexity and obtain answers to specific queries.
The examples below show URLs that query a data core called ‘genotype-phenotype’. These examples serve as a guide to constructing a query URL. All the example URLs can be pasted into a browser address bar.
Size of output
One of the most important settings to control when using the API is the size of the output, i.e. the number of records returned from the server. This is set by appending the parameter ‘rows’ to the query.
Note. – If rows is not specified, the server returns 10 records.
Using the ‘genotype-phenotype’ core as an example, the following URLs provide two small subsets of the available data.
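As a minimal sketch of the ‘rows’ setting, the Python snippet below builds two query URLs that differ only in the number of records requested. The base URL is an assumption about where the ‘genotype-phenotype’ core is served and should be checked against the IMPC documentation; `rows_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed location of the 'genotype-phenotype' core; verify before use.
BASE_URL = "https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select"


def rows_url(rows):
    """Build a match-all query URL that asks for at most `rows` records."""
    params = {"q": "*:*", "rows": rows}
    # safe="*:" keeps the match-all query readable instead of percent-encoding it.
    return f"{BASE_URL}?{urlencode(params, safe='*:')}"


# Two small subsets of the available data.
print(rows_url(5))
print(rows_url(20))
```

The printed URLs can be pasted into a browser address bar in the same way as hand-written ones.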
Two common output formats are json and csv, which are selected via the argument ‘wt’. With this argument, the queries above become as follows.
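The ‘wt’ argument can be added to a query URL in the same way as any other parameter. The sketch below is illustrative only: the base URL is an assumption, and `format_url` is a hypothetical helper that accepts just the two formats discussed here.

```python
from urllib.parse import urlencode

# Assumed location of the 'genotype-phenotype' core; verify before use.
BASE_URL = "https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select"

# Only the two formats discussed in the text; the server may support others.
FORMATS = ("json", "csv")


def format_url(wt, rows=5):
    """Build a match-all query URL that requests output in format `wt`."""
    if wt not in FORMATS:
        raise ValueError(f"expected one of {FORMATS}, got {wt!r}")
    params = {"q": "*:*", "rows": rows, "wt": wt}
    return f"{BASE_URL}?{urlencode(params, safe='*:')}"


for wt in FORMATS:
    print(format_url(wt))
```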
Depending on browser settings, one or the other may appear more readable in certain situations. The csv format is convenient for use with spreadsheet programs. Both formats are compatible with programmatic processing in R, Python, or any other data science framework.
The default behavior for each endpoint is to return all fields available in a data store, akin to returning all columns from a large table. It is possible to limit the output by specifying the desired fields via an argument ‘fl’.
Note that some records in the output may appear identical. They only appear so because the fields that distinguish them are not among those requested.
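To illustrate the ‘fl’ argument, the following sketch joins a list of field names into the comma-separated form that Solr expects. The base URL and the field names `marker_symbol` and `mp_term_name` are assumptions about the core, and `fields_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed location of the 'genotype-phenotype' core; verify before use.
BASE_URL = "https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select"


def fields_url(fields, rows=5):
    """Build a match-all query URL that returns only the listed fields."""
    params = {"q": "*:*", "rows": rows, "fl": ",".join(fields)}
    # Commas are kept literal so the field list stays readable in the URL.
    return f"{BASE_URL}?{urlencode(params, safe='*:,')}"


# Field names are assumed for illustration; check them against the core.
print(fields_url(["marker_symbol", "mp_term_name"]))
```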
The queries can be set to return data on a subset of records of interest by replacing the text ‘q=*:*’ in the previous queries. Any of the available fields can be used in a filter. Common patterns include filtering by gene symbol, procedure, or phenotype.
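A filter replaces the match-all query with a `field:value` expression. The sketch below shows one way to build such an expression safely, quoting the value so that spaces are handled. The base URL, the field names `marker_symbol` and `procedure_name`, and the example values are assumptions chosen for illustration; `field_filter` and `filter_url` are hypothetical helpers.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed location of the 'genotype-phenotype' core; verify before use.
BASE_URL = "https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select"


def field_filter(field, value):
    """Return a Solr expression matching `value` as a phrase in `field`."""
    # Escape backslashes and double quotes so the value stays inside the phrase.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def filter_url(field, value, rows=5):
    """Build a query URL restricted to records where `field` matches `value`."""
    params = {"q": field_filter(field, value), "rows": rows}
    # Default encoding percent-encodes quotes and turns spaces into '+'.
    return f"{BASE_URL}?{urlencode(params)}"


# Example values are arbitrary; field names are assumed.
for field, value in [("marker_symbol", "Car4"), ("procedure_name", "Body Weight")]:
    url = filter_url(field, value)
    # Decode the URL again to show the query the server would receive.
    print(url, "->", parse_qs(urlsplit(url).query)["q"][0])
```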
Combined with the other techniques, filtering provides a direct mechanism to answer very specific queries. The following fetches all significant phenotypes for a gene symbol.
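The techniques above can be combined in a single request. The following is a sketch under stated assumptions, not a definitive recipe: the base URL, the field names `marker_symbol`, `mp_term_name`, and `p_value`, and the gene symbol used in the usage note are assumed, and both helper functions are hypothetical. The sketch filters only by gene symbol; whether a further condition is needed to restrict the output to significant phenotypes depends on how the core is populated.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed location of the 'genotype-phenotype' core; verify before use.
BASE_URL = "https://www.ebi.ac.uk/mi/impc/solr/genotype-phenotype/select"


def gene_phenotype_params(symbol, rows=20):
    """Combine a gene-symbol filter with field selection, size, and format."""
    return {
        "q": f'marker_symbol:"{symbol}"',             # filter (assumed field name)
        "fl": "marker_symbol,mp_term_name,p_value",   # fields (assumed names)
        "rows": rows,                                 # maximum number of records
        "wt": "json",                                 # output format
    }


def fetch_docs(params, timeout=30):
    """Send the query and return the list of records from the JSON response."""
    url = f"{BASE_URL}?{urlencode(params)}"
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    # Standard Solr JSON layout: records are listed under response.docs.
    return payload["response"]["docs"]
```

Calling `fetch_docs(gene_phenotype_params("Car4"))` would perform the request over the network; the gene symbol here is an arbitrary example.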
Note that the query requests 20 records, but the server returns a smaller number. This is an indication that the output contains all the data that satisfy the filter, i.e. none have been left out.
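The same check can be made programmatically. The sketch below assumes the JSON output follows the standard Solr response layout, in which `response.numFound` gives the total number of matching records and `response.docs` holds the records actually returned. The `is_complete` helper is hypothetical and the sample payload is made up for illustration.

```python
def is_complete(payload):
    """Return True if the returned records reach the end of the matching set."""
    response = payload["response"]
    # 'start' is the offset of the first returned record (0 unless paging).
    returned_up_to = response.get("start", 0) + len(response["docs"])
    return returned_up_to >= response["numFound"]


# Made-up payload: 20 rows were requested, but only 3 records match the filter.
sample = {"response": {"numFound": 3, "start": 0, "docs": [{}, {}, {}]}}
print(is_complete(sample))  # True
```

If the check returns False, increasing ‘rows’ is one way to retrieve the remaining records.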