CollecTor
Descriptor archives are available from CollecTor. If you need Tor’s topology at a prior point in time this is the place to go!
With CollecTor you can either read descriptors directly…
import datetime
import stem.descriptor.collector

yesterday = datetime.datetime.utcnow() - datetime.timedelta(days = 1)

# provide yesterday's exits

exits = {}

for desc in stem.descriptor.collector.get_server_descriptors(start = yesterday):
  if desc.exit_policy.is_exiting_allowed():
    exits[desc.fingerprint] = desc

print('%i relays published an exiting policy today...\n' % len(exits))

for fingerprint, desc in exits.items():
  print(' %s (%s)' % (desc.nickname, fingerprint))
… or download the descriptors to disk and read them later.
import datetime
import stem.descriptor
import stem.descriptor.collector

yesterday = datetime.datetime.utcnow() - datetime.timedelta(days = 1)
cache_dir = '~/descriptor_cache/server_desc_today'

collector = stem.descriptor.collector.CollecTor()

for f in collector.files('server-descriptor', start = yesterday):
  f.download(cache_dir)

# then later...

for f in collector.files('server-descriptor', start = yesterday):
  for desc in f.read(cache_dir):
    if desc.exit_policy.is_exiting_allowed():
      print(' %s (%s)' % (desc.nickname, desc.fingerprint))
get_instance - Provides a singleton CollecTor used for...
  |- get_server_descriptors - published server descriptors
  |- get_extrainfo_descriptors - published extrainfo descriptors
  |- get_microdescriptors - published microdescriptors
  |- get_consensus - published router status entries
  |
  |- get_key_certificates - authority key certificates
  |- get_bandwidth_files - bandwidth authority heuristics
  +- get_exit_lists - TorDNSEL exit list

File - Individual file residing within CollecTor
  |- read - provides descriptors from this file
  +- download - download this file to disk

CollecTor - Downloader for descriptors from CollecTor
  |- get_server_descriptors - published server descriptors
  |- get_extrainfo_descriptors - published extrainfo descriptors
  |- get_microdescriptors - published microdescriptors
  |- get_consensus - published router status entries
  |
  |- get_key_certificates - authority key certificates
  |- get_bandwidth_files - bandwidth authority heuristics
  |- get_exit_lists - TorDNSEL exit list
  |
  |- index - metadata for content available from CollecTor
  +- files - files available from CollecTor
New in version 1.8.0.

stem.descriptor.collector.get_instance()
Provides the singleton CollecTor used for this module’s shorthand functions.
- Returns
singleton CollecTor instance

stem.descriptor.collector.get_server_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)
Shorthand for CollecTor.get_server_descriptors() on our singleton instance.

stem.descriptor.collector.get_extrainfo_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)
Shorthand for CollecTor.get_extrainfo_descriptors() on our singleton instance.

stem.descriptor.collector.get_microdescriptors(start=None, end=None, cache_to=None, timeout=None, retries=3)
Shorthand for CollecTor.get_microdescriptors() on our singleton instance.

stem.descriptor.collector.get_consensus(start=None, end=None, cache_to=None, document_handler='ENTRIES', version=3, microdescriptor=False, bridge=False, timeout=None, retries=3)
Shorthand for CollecTor.get_consensus() on our singleton instance.

stem.descriptor.collector.get_key_certificates(start=None, end=None, cache_to=None, timeout=None, retries=3)
Shorthand for CollecTor.get_key_certificates() on our singleton instance.

stem.descriptor.collector.get_bandwidth_files(start=None, end=None, cache_to=None, timeout=None, retries=3)
Shorthand for CollecTor.get_bandwidth_files() on our singleton instance.

stem.descriptor.collector.get_exit_lists(start=None, end=None, cache_to=None, timeout=None, retries=3)
Shorthand for CollecTor.get_exit_lists() on our singleton instance.

class stem.descriptor.collector.File(path, types, size, sha256, first_published, last_published, last_modified)
Bases: object
File within CollecTor.
- Variables
path (str) – file path within CollecTor
types (tuple) – descriptor types contained within this file
compression (stem.descriptor.Compression) – file compression, None if this cannot be determined
size (int) – size of the file
sha256 (str) – file’s sha256 checksum
start (datetime) – first publication within the file, None if this cannot be determined
end (datetime) – last publication within the file, None if this cannot be determined
last_modified (datetime) – when the file was last modified

read(directory=None, descriptor_type=None, start=None, end=None, document_handler='ENTRIES', timeout=None, retries=3)
Provides descriptors from this archive. Descriptors are downloaded or read from disk as follows…
If this file has already been downloaded through download(), these descriptors are read from disk.
If a directory argument is provided and the file is already present, these descriptors are read from disk.
If a directory argument is provided and the file is not present, the file is downloaded to this location and then read.
If the file has not been downloaded and no directory argument is provided, the file is downloaded to a temporary directory that’s deleted after it is read.
- Parameters
directory (str) – destination to download into
descriptor_type (str) – descriptor type, this is guessed if not provided
start (datetime.datetime) – publication time to begin with
end (datetime.datetime) – publication time to end with
document_handler (stem.descriptor.DocumentHandler) – method in which to parse a NetworkStatusDocument
timeout (int) – timeout when connection becomes idle, no timeout applied if None
retries (int) – maximum attempts to impose
- Returns
iterator for Descriptor instances in the file
- Raises
ValueError if unable to determine the descriptor type
TypeError if we cannot parse this descriptor type
DownloadFailed if the download fails
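
The lookup order described for read() can be sketched in plain Python. This only illustrates the documented rules and is not stem's implementation: `resolve_read_source`, its arguments, and the action labels it returns are all hypothetical names chosen for this example.

```python
import os
import tempfile


def resolve_read_source(filename, downloaded_path = None, directory = None):
  """
  Hypothetical helper mirroring the documented lookup order of File.read().
  Returns an (action, path) tuple. The action is 'read' when the file can be
  read from disk, 'download' when it must first be fetched into the given
  directory, and 'temporary' when a throwaway directory would be used.
  """

  # 1. the file was already downloaded earlier
  if downloaded_path and os.path.exists(downloaded_path):
    return ('read', downloaded_path)

  if directory:
    path = os.path.join(os.path.expanduser(directory), filename)

    # 2. directory provided and the file is present, or 3. it is absent
    return ('read', path) if os.path.exists(path) else ('download', path)

  # 4. nothing on disk and nowhere to cache, so use a temporary directory
  return ('temporary', None)


# Demonstration against a scratch directory holding one cached file.
with tempfile.TemporaryDirectory() as scratch:
  cached = os.path.join(scratch, 'cached-file')
  open(cached, 'w').close()

  print(resolve_read_source('cached-file', directory = scratch)[0])
  print(resolve_read_source('missing-file', directory = scratch)[0])
  print(resolve_read_source('missing-file')[0])
```

In practice this means passing the same directory to read() on every run keeps repeated reads from downloading again.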

download(directory, decompress=True, timeout=None, retries=3, overwrite=False)
Downloads this file to the given location. If a file already exists this is a no-op.
- Parameters
directory (str) – destination to download into
decompress (bool) – decompress written file
timeout (int) – timeout when connection becomes idle, no timeout applied if None
retries (int) – maximum attempts to impose
overwrite (bool) – if this file exists but mismatches CollecTor’s checksum then overwrites if True, otherwise raises an exception
- Returns
str with the path we downloaded to
- Raises
DownloadFailed if the download fails
IOError if a mismatching file exists and overwrite is False
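
The interplay between an existing file, its checksum, and the overwrite flag can be sketched as below. This is a minimal illustration and not stem's code: `check_existing_file` is a hypothetical helper, and it assumes the expected checksum is available as a hex digest (how CollecTor encodes the checksum is not stated here).

```python
import hashlib
import os
import tempfile


def check_existing_file(path, expected_sha256_hex, overwrite = False):
  """
  Hypothetical pre-download check. Returns True if the download can be
  skipped, False if the file should be (re)downloaded, and raises IOError
  when a mismatching file exists and overwrite is False.
  """

  if not os.path.exists(path):
    return False  # nothing on disk yet, so download

  with open(path, 'rb') as existing_file:
    actual = hashlib.sha256(existing_file.read()).hexdigest()

  if actual == expected_sha256_hex:
    return True  # matching file already present, downloading is a no-op
  elif overwrite:
    return False  # mismatch, but the caller asked us to replace the file
  else:
    raise IOError('%s exists but mismatches the expected checksum' % path)


# Demonstration with a scratch file of known content.
with tempfile.TemporaryDirectory() as scratch:
  path = os.path.join(scratch, 'example')

  with open(path, 'wb') as example_file:
    example_file.write(b'descriptor content')

  expected = hashlib.sha256(b'descriptor content').hexdigest()

  print(check_existing_file(path, expected))                      # file matches
  print(check_existing_file(path, '0' * 64, overwrite = True))    # replace it
```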

class stem.descriptor.collector.CollecTor(retries=2, timeout=None)
Bases: object
Downloader for descriptors from CollecTor. The contents of CollecTor are provided in an index that’s fetched as required.
- Variables
retries (int) – number of times to attempt the request if downloading it fails
timeout (float) – duration before we’ll time out our request

get_server_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)
Provides server descriptors published during the given time range, sorted oldest to newest.
- Parameters
start (datetime.datetime) – publication time to begin with
end (datetime.datetime) – publication time to end with
cache_to (str) – directory to cache archives into, if an archive is available here it is not downloaded
bridge (bool) – standard descriptors if False, bridge if True
timeout (int) – timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
retries (int) – maximum attempts to impose on a per-archive basis
- Returns
iterator of ServerDescriptor for the given time range
- Raises
DownloadFailed if the download fails

get_extrainfo_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)
Provides extrainfo descriptors published during the given time range, sorted oldest to newest.
- Parameters
start (datetime.datetime) – publication time to begin with
end (datetime.datetime) – publication time to end with
cache_to (str) – directory to cache archives into, if an archive is available here it is not downloaded
bridge (bool) – standard descriptors if False, bridge if True
timeout (int) – timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
retries (int) – maximum attempts to impose on a per-archive basis
- Returns
iterator of RelayExtraInfoDescriptor for the given time range
- Raises
DownloadFailed if the download fails

get_microdescriptors(start=None, end=None, cache_to=None, timeout=None, retries=3)
Provides microdescriptors estimated to be published during the given time range, sorted oldest to newest. Unlike server/extrainfo descriptors, microdescriptors change very infrequently…
"Microdescriptors are expected to be relatively static and only change about once per week." -dir-spec section 3.3
CollecTor archives only contain microdescriptors that change, so hourly tarballs often contain very few. Microdescriptors also do not contain their publication timestamp, so this is estimated.
- Parameters
start (datetime.datetime) – publication time to begin with
end (datetime.datetime) – publication time to end with
cache_to (str) – directory to cache archives into, if an archive is available here it is not downloaded
timeout (int) – timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
retries (int) – maximum attempts to impose on a per-archive basis
- Returns
iterator of Microdescriptor for the given time range
- Raises
DownloadFailed if the download fails

get_consensus(start=None, end=None, cache_to=None, document_handler='ENTRIES', version=3, microdescriptor=False, bridge=False, timeout=None, retries=3)
Provides consensus router status entries published during the given time range, sorted oldest to newest.
- Parameters
start (datetime.datetime) – publication time to begin with
end (datetime.datetime) – publication time to end with
cache_to (str) – directory to cache archives into, if an archive is available here it is not downloaded
document_handler (stem.descriptor.DocumentHandler) – method in which to parse a NetworkStatusDocument
version (int) – consensus variant to retrieve (versions 2 or 3)
microdescriptor (bool) – provides the microdescriptor consensus if True, standard consensus otherwise
bridge (bool) – standard descriptors if False, bridge if True
timeout (int) – timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
retries (int) – maximum attempts to impose on a per-archive basis
- Returns
iterator of RouterStatusEntry for the given time range
- Raises
DownloadFailed if the download fails
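
By default get_consensus() yields individual router status entries. The sketch below shows one way the document_handler argument might be used to work with whole consensus documents instead. It assumes stem.descriptor.DocumentHandler.DOCUMENT causes full documents to be provided, and that those documents expose valid_after and routers attributes as stem's v3 network status documents do. `fetch_consensuses` and `relay_counts` are hypothetical helper names.

```python
import datetime
import types


def fetch_consensuses(start, end, cache_to = None):
  """
  Provides whole consensus documents, rather than router status entries,
  published within the given range. This performs network requests.
  """

  # imported here so relay_counts() below is usable without stem installed
  import stem.descriptor
  import stem.descriptor.collector

  return stem.descriptor.collector.get_consensus(
    start = start,
    end = end,
    cache_to = cache_to,
    document_handler = stem.descriptor.DocumentHandler.DOCUMENT,
  )


def relay_counts(consensuses):
  """
  Maps each consensus' valid_after time to the number of relays it lists.
  """

  return {c.valid_after: len(c.routers) for c in consensuses}


# Offline demonstration using stand-in objects rather than real documents.
sample = [
  types.SimpleNamespace(valid_after = datetime.datetime(2020, 1, 1, 0), routers = {'A': None, 'B': None}),
  types.SimpleNamespace(valid_after = datetime.datetime(2020, 1, 1, 1), routers = {'A': None}),
]

for valid_after, count in sorted(relay_counts(sample).items()):
  print('%s: %i relays' % (valid_after, count))
```

With stem installed and network access, `relay_counts(fetch_consensuses(start, end))` would apply the same summary to archived consensuses.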

get_key_certificates(start=None, end=None, cache_to=None, timeout=None, retries=3)
Directory authority key certificates for the given time range, sorted oldest to newest.
- Parameters
start (datetime.datetime) – publication time to begin with
end (datetime.datetime) – publication time to end with
cache_to (str) – directory to cache archives into, if an archive is available here it is not downloaded
timeout (int) – timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
retries (int) – maximum attempts to impose on a per-archive basis
- Returns
iterator of KeyCertificate for the given time range
- Raises
DownloadFailed if the download fails

get_bandwidth_files(start=None, end=None, cache_to=None, timeout=None, retries=3)
Bandwidth authority heuristics for the given time range, sorted oldest to newest.
- Parameters
start (datetime.datetime) – publication time to begin with
end (datetime.datetime) – publication time to end with
cache_to (str) – directory to cache archives into, if an archive is available here it is not downloaded
timeout (int) – timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
retries (int) – maximum attempts to impose on a per-archive basis
- Returns
iterator of BandwidthFile for the given time range
- Raises
DownloadFailed if the download fails

get_exit_lists(start=None, end=None, cache_to=None, timeout=None, retries=3)
TorDNSEL exit lists for the given time range, sorted oldest to newest.
- Parameters
start (datetime.datetime) – publication time to begin with
end (datetime.datetime) – publication time to end with
cache_to (str) – directory to cache archives into, if an archive is available here it is not downloaded
timeout (int) – timeout for downloading each individual archive when the connection becomes idle, no timeout applied if None
retries (int) – maximum attempts to impose on a per-archive basis
- Returns
iterator of TorDNSEL for the given time range
- Raises
DownloadFailed if the download fails

index(compression='best')
Provides the archives available in CollecTor.
- Parameters
compression (stem.descriptor.Compression) – compression type to download from, if undefined we’ll use the best decompression available
- Returns
dict with the archive contents
- Raises
If unable to retrieve the index this provides…
ValueError if JSON is malformed
IOError if unable to decompress
DownloadFailed if the download fails
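
Since index() returns a plain dict, it can be walked with ordinary Python. The sketch below assumes a nested layout in which each level may hold 'directories' and 'files' lists whose entries carry a 'path' key. That layout is an assumption rather than something this page specifies, and both the sample data and the `list_index_paths` helper are made up for illustration.

```python
def list_index_paths(entry, prefix = ''):
  """
  Recursively collects file paths from an index-like dict. The 'directories',
  'files', and 'path' keys are assumptions about the index layout.
  """

  paths = []

  for file_entry in entry.get('files', []):
    paths.append(prefix + file_entry['path'])

  for directory in entry.get('directories', []):
    paths.extend(list_index_paths(directory, prefix + directory['path'] + '/'))

  return paths


# Made-up sample, not real CollecTor content.
sample_index = {
  'directories': [{
    'path': 'archive',
    'directories': [{
      'path': 'relay-descriptors',
      'files': [{'path': 'example-a.tar.xz'}, {'path': 'example-b.tar.xz'}],
    }],
  }, {
    'path': 'recent',
    'files': [{'path': 'example-c'}],
  }],
}

for path in list_index_paths(sample_index):
  print(path)
```

If the real index follows this shape, the same helper could be given the dict returned by `CollecTor().index()`.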

files(descriptor_type=None, start=None, end=None)
Provides files CollecTor presently has, sorted oldest to newest.
- Parameters
descriptor_type (str) – descriptor type or prefix to retrieve
start (datetime.datetime) – publication time to begin with
end (datetime.datetime) – publication time to end with
- Returns
list of File
- Raises
If unable to retrieve the index this provides…
ValueError if JSON is malformed
IOError if unable to decompress
DownloadFailed if the download fails
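
The kind of selection files() performs (matching a descriptor type or prefix, limiting to a time range, and sorting oldest to newest) can be illustrated roughly as follows. This is a sketch under assumptions, not stem's filtering logic: `ArchiveFile` is a stand-in for File, `select_files` is a hypothetical helper, and keeping files whose publication times are unknown is a choice made for this example.

```python
import collections
import datetime

# Stand-in for File with only the attributes this sketch needs.
ArchiveFile = collections.namedtuple('ArchiveFile', ['path', 'types', 'start', 'end'])


def select_files(files, descriptor_type = None, start = None, end = None):
  """
  Hypothetical filter that keeps files containing a matching descriptor type
  (exact or by prefix) and whose publication span overlaps the requested
  range, sorted oldest to newest.
  """

  matches = []

  for f in files:
    if descriptor_type and not any(t.startswith(descriptor_type) for t in f.types):
      continue  # no matching descriptor type
    elif start and f.end and f.end < start:
      continue  # everything was published before our range
    elif end and f.start and f.start > end:
      continue  # everything was published after our range

    matches.append(f)

  return sorted(matches, key = lambda f: f.start or datetime.datetime.min)


# Made-up sample entries.
jan = ArchiveFile('jan', ('server-descriptor 1.0',), datetime.datetime(2020, 1, 1), datetime.datetime(2020, 1, 31))
feb = ArchiveFile('feb', ('server-descriptor 1.0',), datetime.datetime(2020, 2, 1), datetime.datetime(2020, 2, 29))
exits = ArchiveFile('exits', ('tordnsel 1.0',), datetime.datetime(2020, 2, 1), datetime.datetime(2020, 2, 29))

selected = select_files([feb, exits, jan], 'server-descriptor', start = datetime.datetime(2020, 2, 10))
print([f.path for f in selected])
```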