Methodology
Overview
Knovolo assesses event plausibility using geospatial context, multi-source evidence, and temporal dynamics. This page documents how our intelligence infrastructure works, not to showcase complexity, but because systematic analysis requires transparent methodology.
Core Architecture
Three-Layer Processing Pipeline
Sources Layer
- 200+ RSS/Atom news feeds
- News aggregation APIs (GDELT, NewsAPI)
- Social media signals (Twitter/X, Bluesky, Mastodon)
- Public records and government filings
Analysis Layer
- H3 geospatial clustering (resolution 7)
- Bayesian plausibility scoring
- Entity extraction (Named Entity Recognition)
- Gazetteer-based entity extraction
- Temporal context modeling
Intelligence Layer
- Verified events database
- Map view
- Entity relationship graphs
- Coverage pattern analysis
- Query interface (Knovolo Query Language)
Geospatial Foundation
H3 Hexagonal Indexing
Knovolo uses Uber's H3 hierarchical hexagonal indexing system as its primary spatial reference framework. All geospatial data is normalized to H3 resolution 7 (average hexagon area ≈ 5.16 km²), providing a consistent global spatial unit that balances geographic precision with computational efficiency.
Hexagonal indexing is used to avoid directional bias, edge effects, and scale inconsistencies inherent in square-grid or administrative-boundary-based systems.
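As a minimal sketch of this normalization step, assuming the open-source h3o crate (a Rust implementation of H3; Knovolo's actual ingestion code is not shown here), mapping a coordinate to its resolution-7 cell looks like this:

```rust
// Sketch: normalize a coordinate to an H3 resolution-7 cell, assuming
// the `h3o` crate. The resulting cell index would serve as the
// spatial key for all hex-level data.
use h3o::{LatLng, Resolution};

fn main() {
    // Brandenburg Gate, Berlin (illustrative coordinates).
    let coord = LatLng::new(52.5163, 13.3777).expect("valid lat/lng");
    let cell = coord.to_cell(Resolution::Seven);
    println!("H3 res-7 cell: {cell}");
}
```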
Hex-Level Data Integration
Each H3 hexagon functions as a spatial container for heterogeneous public datasets. Data is aggregated, normalized, and stored at the hex level, enabling direct comparison and composability across sources.
Indexed attributes may include:
- Population and settlement density (e.g. WorldPop, census-derived datasets)
- Infrastructure presence and classification (e.g. OpenStreetMap)
- Economic activity proxies (e.g. night-time light intensity, land use)
- Historical event occurrence by category and frequency
- Terrain and physical geography (elevation, slope, land cover)
- Climate and environmental conditions (temperature anomalies, precipitation, vegetation indices)
- Socioeconomic indicators and contextual risk factors derived from public datasets
All attributes are stored with explicit temporal references where applicable.
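A hex record might look like the following sketch; the field names and values are hypothetical illustrations, not Knovolo's actual schema:

```rust
use std::collections::HashMap;

/// Hypothetical hex-level record. Each attribute carries an explicit
/// temporal reference, as described above.
struct TimestampedValue {
    value: f64,
    observed_at: &'static str, // ISO-8601 timestamp of the observation
}

fn main() {
    // Attributes keyed by name; the record itself would be keyed by H3 cell.
    let mut attributes: HashMap<&str, TimestampedValue> = HashMap::new();
    attributes.insert(
        "population_density", // e.g. WorldPop-derived, persons per km²
        TimestampedValue { value: 4_100.0, observed_at: "2025-07-01T00:00:00Z" },
    );
    attributes.insert(
        "nightlight_intensity", // economic-activity proxy, normalized
        TimestampedValue { value: 0.82, observed_at: "2025-12-15T00:00:00Z" },
    );
    println!("{} attributes stored for this hex", attributes.len());
}
```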
Regional Aggregation
H3 hexagons aggregate to administrative regions:
Hexagons → Neighborhoods → Districts → Cities → States → Countries
For any region (e.g., 'Berlin'), we:
- Identify all intersecting H3 hexagons
- Calculate overlap percentages
- Aggregate hex-level data weighted by overlap
- Generate region-specific priors for each event category
Example: A 'political violence' prior aggregates infrastructure density, government presence, and historical patterns across all of the region's constituent hexagons.
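A minimal sketch of the overlap-weighted step, with illustrative names and values (the real pipeline aggregates many attributes per category):

```rust
/// Overlap-weighted aggregation: each hexagon contributes its attribute
/// value in proportion to the fraction of its area inside the region.
struct HexOverlap {
    value: f64,   // hex-level attribute (e.g. infrastructure density)
    overlap: f64, // fraction of the hexagon inside the region, 0.0..=1.0
}

fn region_weighted_mean(hexes: &[HexOverlap]) -> f64 {
    let weight_sum: f64 = hexes.iter().map(|h| h.overlap).sum();
    let value_sum: f64 = hexes.iter().map(|h| h.value * h.overlap).sum();
    if weight_sum > 0.0 { value_sum / weight_sum } else { 0.0 }
}

fn main() {
    let hexes = [
        HexOverlap { value: 0.8, overlap: 1.0 },  // fully inside the region
        HexOverlap { value: 0.2, overlap: 0.25 }, // mostly outside
    ];
    println!("regional prior input: {:.3}", region_weighted_mean(&hexes));
}
```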
Plausibility Scoring Model
Bayesian Framework
Event plausibility combines multiple evidence sources:
$\log P(\text{Event}_t \mid \text{Location}) = z_{\text{cat}}(E \mid h) + \sum_k \delta_k \cdot e^{-\lambda_k (t - t_e)} + \sum_k W_k \log P(\text{signals}_k \mid \text{Event}) + \log P_{\text{HMM}}(\text{state}_t \mid \text{history})$
Component Breakdown
1. Geospatial Prior: $z_{\text{cat}}(E \mid h)$
Base probability for each event category at this location, derived from category-specific data. Examples:
- Protests: high in capitals, urban centers with government buildings
- Natural disasters: based on terrain, climate, historical patterns
- Infrastructure failures: age of systems, maintenance indicators
- Political violence: security infrastructure, historical conflict data
Historical baseline integration:
$z_{\text{cat}}(E \mid h) = z_{\text{geo}}(E \mid h) + \alpha \cdot \max(0,\, z_{\text{hist}}(E \mid h))$
Where:
- $z_{\text{geo}}$: Current geospatial characteristics
- $z_{\text{hist}}$: Historical event frequency (NASA PEND GDIS, EM-DAT)
- $\alpha$: Small weight (~0.1-0.2) ensuring history informs but doesn't constrain
- $\max(0, \ldots)$: History can only increase the prior, never decrease it
Principle: Novel events aren't penalized by lack of precedent.
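As a one-function sketch of this clamping behavior (all values illustrative):

```rust
/// z_cat = z_geo + alpha * max(0, z_hist): history can raise the prior
/// but never lower it, so unprecedented events are not suppressed.
fn category_prior(z_geo: f64, z_hist: f64, alpha: f64) -> f64 {
    z_geo + alpha * z_hist.max(0.0)
}

fn main() {
    // alpha = 0.15, within the ~0.1-0.2 range above; inputs illustrative.
    println!("{:.3}", category_prior(-1.2, 0.5, 0.15));  // history adds a little
    println!("{:.3}", category_prior(-1.2, -3.0, 0.15)); // clamped: no penalty
}
```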
2. Temporal Context: $\sum_k \delta_k \cdot e^{-\lambda_k (t - t_e)}$
Ongoing situations that modify current event likelihood. Context modifiers:
- Active heat waves (increases wildfire plausibility)
- Drought conditions (water conflict, agricultural stress)
- Existing civil unrest (escalation probability)
- Weather patterns (floods, storms)
- Seasonal factors (fire season, monsoon season)
Exponential decay: Recent contexts weigh more heavily than distant ones. A heat wave from 2 days ago has more impact than one from 2 months ago.
Example: A Berlin heat wave that started 5 days ago contributes $\delta_{\text{fire}} \cdot e^{-0.1 \cdot 5} = \delta_{\text{fire}} \cdot 0.61$, so wildfire signals receive 61% of this context's initial boost.
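The decay term is simple enough to verify directly; this sketch reproduces the Berlin figure:

```rust
/// delta_k * e^(-lambda_k * (t - t_e)): a context's contribution decays
/// exponentially with the time elapsed since its onset.
fn context_weight(delta: f64, lambda: f64, days_since_onset: f64) -> f64 {
    delta * (-lambda * days_since_onset).exp()
}

fn main() {
    // The Berlin heat-wave example: lambda = 0.1, onset 5 days ago.
    println!("retained weight: {:.2}", context_weight(1.0, 0.1, 5.0)); // ~0.61
}
```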
3. Multi-Source Evidence: $\sum_k W_k \log P(\text{signals}_k \mid \text{Event})$
Weighted signals from multiple source types. Source weights ($W_k$) are calibrated by:
- Historical reliability
- Geographic specificity
- Temporal precision
- Account diversity (for social signals)
Source types:
Social Media ($W_{\text{social}}$):
- Signal volume in geographic cluster
- Account diversity (not bot-like patterns)
- Temporal clustering (reports within narrow time window)
- Geographic specificity of reports
News Outlets ($W_{\text{news}}$):
- Outlet credibility score
- Geographic proximity to event
- Original vs. aggregated reporting
- Citation of sources
Official Sources ($W_{\text{official}}$):
- Government statements
- Emergency service dispatches
- Institutional announcements
- Regulatory filings
Signal likelihood calculation:
$P(\text{signal} \mid \text{Event}) = f(\text{geographic clustering},\ \text{temporal clustering},\ \text{source diversity},\ \text{content consistency})$
Example: 15 social media signals + 2 news alerts + 1 official statement → weighted evidence score → higher confidence than 100 bot-like social signals alone.
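The following sketch illustrates why a diverse signal mix outweighs a bot swarm under source weighting. The numbers and the log-likelihood-ratio convention (positive values support the event) are illustrative assumptions, not Knovolo's calibrated weights:

```rust
/// Weighted evidence sum: sum of W_k * LLR_k, where LLR_k is the
/// log-likelihood ratio of source type k's signals (positive = supports
/// the event). All weights and values below are illustrative only.
fn weighted_evidence(terms: &[(f64, f64)]) -> f64 {
    terms.iter().map(|&(w, llr)| w * llr).sum()
}

fn main() {
    // 15 diverse social signals + 2 news alerts + 1 official statement.
    let diverse_mix = weighted_evidence(&[(0.3, 1.2), (0.5, 2.0), (0.9, 3.5)]);
    // 100 bot-like social signals: strong raw volume, heavily down-weighted
    // because account diversity and geographic specificity are poor.
    let bot_swarm = weighted_evidence(&[(0.05, 2.5)]);
    println!("diverse mix: {diverse_mix:.2}, bot swarm: {bot_swarm:.2}");
}
```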
4. HMM State Tracking: $\log P_{\text{HMM}}(\text{state}_t \mid \text{history})$
Hidden Markov Model tracking situational evolution. State space examples:
- Peaceful → Tension → Active conflict → De-escalation
- Clear weather → Storm warning → Active storm → Recovery
- Normal operations → Service degradation → Outage → Restoration
Transition probabilities: Learned from historical sequences of how situations typically evolve in specific contexts.
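A toy transition matrix makes the idea concrete; the states follow the first example above, and all probabilities are invented for illustration rather than learned values:

```rust
const STATES: [&str; 4] = ["peaceful", "tension", "active_conflict", "de_escalation"];

// TRANSITION[i][j] = P(state j at time t | state i at t-1); rows sum to 1.
// Illustrative numbers only; real values are learned per region.
const TRANSITION: [[f64; 4]; 4] = [
    [0.90, 0.08, 0.01, 0.01],
    [0.20, 0.60, 0.18, 0.02],
    [0.02, 0.18, 0.60, 0.20],
    [0.30, 0.10, 0.10, 0.50],
];

fn main() {
    // Escalating to active conflict is far more plausible from "tension"
    // than from "peaceful", exactly the distinction described below.
    for from in [0, 1] {
        println!(
            "{} -> {}: log-prob {:.2}",
            STATES[from],
            STATES[2],
            TRANSITION[from][2].ln()
        );
    }
}
```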
Why this matters: A 'riot' signal has different plausibility if current state is:
- Peaceful protest (lower transition probability)
- Tense standoff with police (higher transition probability)
- Active clashes (much higher; escalation already underway)
Regional variation: State transition probabilities vary by location. Protest → police response → escalation has different probabilities in different countries/cities based on historical patterns.
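Putting the four terms together, here is a compact sketch of the full score; all inputs are illustrative placeholders that the real system derives from the components above:

```rust
/// Sketch of the full score: the sum of geospatial prior, decayed
/// temporal contexts, weighted evidence, and the HMM state term.
fn log_plausibility(
    z_cat: f64,                   // term 1: geospatial prior
    contexts: &[(f64, f64, f64)], // term 2: (delta_k, lambda_k, days since onset)
    evidence: &[(f64, f64)],      // term 3: (W_k, log-likelihood term)
    log_hmm: f64,                 // term 4: log P_HMM(state_t | history)
) -> f64 {
    let temporal: f64 = contexts.iter().map(|&(d, l, t)| d * (-l * t).exp()).sum();
    let weighted: f64 = evidence.iter().map(|&(w, ll)| w * ll).sum();
    z_cat + temporal + weighted + log_hmm
}

fn main() {
    let score = log_plausibility(
        -2.0,                      // modest base rate for this category here
        &[(0.5, 0.1, 5.0)],        // one active context, 5 days old
        &[(0.5, 1.5), (0.9, 2.0)], // two supporting source types
        -0.7,                      // plausible state transition
    );
    println!("log-plausibility: {score:.2}");
}
```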
Verification Pipeline
Four-Stage Assessment
Events progress through increasing confidence thresholds:
Speculative (SPEC-EVT)
- Threshold: 3+ signals clustering geographically
- Confidence: 20-40%
- Status: 'Signals detected, monitoring'
- Action: Automated tracking initiated
Plausible (PLAUS-EVT)
- Threshold: Multiple independent signals + temporal clustering
- Confidence: 40-70%
- Status: 'Plausible event, awaiting verification'
- Action: Enhanced monitoring across source types
Confirmed (CONF-EVT)
- Threshold: 3+ independent credible sources
- Confidence: 70-90%
- Status: 'Event confirmed'
- Action: Entity extraction, relationship mapping
Supported (SUPP-EVT)
- Threshold: Multiple mainstream sources + official statements
- Confidence: 90%+
- Status: 'Widely confirmed'
- Action: Full coverage analysis, archival
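The confidence bands map to stages roughly as in this sketch; it captures only the numeric thresholds, while actual promotion also requires the independent signal- and source-count criteria listed above:

```rust
/// The four-stage ladder as a confidence-to-stage mapping.
#[derive(Debug)]
enum Stage {
    Speculative, // SPEC-EVT,  20-40%
    Plausible,   // PLAUS-EVT, 40-70%
    Confirmed,   // CONF-EVT,  70-90%
    Supported,   // SUPP-EVT,  90%+
}

fn stage_for(confidence: f64) -> Option<Stage> {
    match confidence {
        c if c >= 0.90 => Some(Stage::Supported),
        c if c >= 0.70 => Some(Stage::Confirmed),
        c if c >= 0.40 => Some(Stage::Plausible),
        c if c >= 0.20 => Some(Stage::Speculative),
        _ => None, // below the tracking threshold
    }
}

fn main() {
    println!("{:?}", stage_for(0.75)); // Some(Confirmed)
}
```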
Entity Intelligence
Event-Scoped Extraction
Knovolo distinguishes between public entities and private individuals in its entity handling methodology. Public entities, including governments, corporations, NGOs, international bodies, and recognized public officeholders, may be represented through entity cards when they can be resolved against authoritative public records (such as Wikidata or comparable structured public registries). This allows aggregation and relationship mapping across events at an institutional level.
Private individuals are never represented as identifiable entities; references to non-public persons are retained only in anonymized, non-identifying form (e.g. '50 casualties,' 'local residents') for event context. These anonymized references may persist as aggregate descriptors but are not searchable, not linkable across events, and not attributable to specific individuals. Relationship mapping is strictly limited to event-derived, non-personal associations and explicitly excludes personal relationships, social networks, attendance patterns, or behavioral profiling, in line with data minimization and purpose limitation principles.
Entity types:
Public Entities & Organizations (tracked IF a Wikidata entity or authoritative public record exists):
- Governments, corporations, NGOs, international bodies, politicians, CEOs, high-ranking officials, other recognized officeholders
- Persistent tracking across events
- Relationship graphs showing institutional connections
Private Individuals (minimal tracking):
- Not searchable as primary entities
- Extracted for event context only
- No cross-event tracking
- Auto-expire after event significance drops
- Non-identifiable references may persist in aggregate form (e.g., '50 casualties')
Relationship Mapping
Relationship types:
- mentioned_in_event
- issued_statement_about
- organization_involved_in
- official_responded_to
NOT tracked:
- Personal relationships
- Social networks
- Attendance patterns
- Behavioral profiles
Principle: Institutional accountability, not individual surveillance.
Coverage Analysis
Outlet Monitoring
We track which sources cover which events and how they report them. Metrics:
- Political leaning and reporting tendencies (moving beyond simple left/center/right, capturing nuanced editorial positions)
- Coverage volume (article count per event)
- Temporal patterns (who reports first, who follows)
- Narrative framing differences (tone, focus, and context)
- Under-reported vs. over-reported events
Bias scoring:
- Not a label of 'biased vs unbiased'
- Measures divergence from aggregate coverage patterns
- Highlights framing differences, omission patterns, and selective emphasis
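One way to operationalize divergence from aggregate coverage patterns is a distance between coverage distributions. The metric in this sketch (total variation distance) is an illustrative assumption, since the text does not specify the measure:

```rust
/// Hypothetical divergence measure: total variation distance between an
/// outlet's coverage distribution over event categories and the aggregate
/// distribution across all outlets. Lies in [0, 1] for probability vectors.
fn coverage_divergence(outlet: &[f64], aggregate: &[f64]) -> f64 {
    outlet
        .iter()
        .zip(aggregate)
        .map(|(o, a)| (o - a).abs())
        .sum::<f64>()
        / 2.0
}

fn main() {
    let outlet = [0.60, 0.30, 0.10]; // heavy emphasis on category 0
    let aggregate = [0.40, 0.35, 0.25];
    println!("divergence: {:.2}", coverage_divergence(&outlet, &aggregate));
}
```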
Pattern detection:
- Stories covered by some outlets but not others
- High-confidence events with low mainstream coverage
- Narrative divergence for the same event across multiple sources
Query Interface
Query Interface – Knovolo Query Language (KQL)
Purpose: Structured syntax for filtering events, entities, and relationships. Designed to scale to advanced analytics over time, but initially limited to core operations.
Core Query Commands (MVP Phase):
- FIND – Search events or entities by text or metadata filters.
- GRAPH – Traverse relationships in the knowledge graph.
- RELATE – Identify connections between entities across events.
Notes:
- Advanced commands such as TREND, DIVERGE, IMPACT, GAP, SYNTHESIZE are part of the future roadmap.
- Initial focus is on FIND, GRAPH, and RELATE, providing simple yet flexible exploration of events and entities.
- KQL is designed to be extensible, supporting additional filters, graph traversals, and predictive queries as the platform grows.
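Illustrative queries for the three MVP commands follow. The syntax is hypothetical, reconstructed from the command names above for illustration only:

```
FIND events WHERE category = "protest" AND region = "Berlin" SINCE 7d
GRAPH entity:Q64 DEPTH 2        -- traverse relationships two hops out
RELATE entity:Q64 entity:Q30    -- connections between two public entities
```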
Extensive documentation on these commands is available with a subscription to the Knovolo Intelligence Terminal.
Data Sources
News & Media
RSS/Atom Feeds (~200 outlets): Major international news organizations, regional outlets, and specialized publications.
News APIs:
- GDELT Project: Global event database with structured coverage of political, social, and conflict events.
- NewsAPI: Aggregated news from multiple sources for broader event coverage.
Social Media:
- Twitter/X API: Geo-tagged signals and trending topics.
- Bluesky: Emerging decentralized platform.
- Mastodon: Federated network monitoring.
Future expansions: Reddit, YouTube, Telegram, and dark web monitoring.
Public Records
Government & Regulatory:
- SEC EDGAR (corporate filings)
- EU Transparency Register
- Municipal building permits
- Court documents (public filings)
Official Statements:
- Government press releases
- Emergency service announcements
- Diplomatic communications
Geospatial Data
Satellite Imagery:
- Sentinel-2 (ESA, 5-day revisit)
- Landsat (NASA/USGS, 16-day revisit)
Infrastructure:
- OpenStreetMap (roads, buildings, points of interest)
- Administrative boundaries (countries, states, districts)
Context Data:
- WorldPop (population density)
- Climate/weather data (NOAA, OpenWeather)
Additional geospatial indicators will be added later.
Privacy & Ethics
Design Principles
- Institutional Focus: Track governments, corporations, organizations, and high-ranking officials acting in institutional roles, not private individuals.
- Public Interest: We collect and analyze data strictly in support of public-interest activities, focusing on contextualized event intelligence rather than personal tracking. All algorithms and data pipelines are designed to surface events, entities, and patterns relevant to societal, economic, and environmental outcomes, while respecting privacy and legal frameworks.
- Event-Scoped Intelligence: Entities linked to specific events, not persistent dossiers across time.
- Public Data Only: No hacking, data breaches, or private communications; only publicly available information, preferably open source.
- No Demographic Profiling: We don't track race, religion, political affiliation, or personal behavior patterns.
- Transparent Methodology: All scoring shows calculation basis. No black-box algorithms.
What We Don't Build
- ❌ Individual surveillance tools
- ❌ Social relationship tracking
- ❌ Behavioral prediction models
- ❌ Demographic profiling systems
- ❌ Tools for harassment or doxing
GDPR Compliance
- EU data residency (AWS Europe)
- Right to be forgotten (entity removal upon request: compliance@knovolo.com)
- Data minimization (only what's needed for analysis)
- Purpose limitation (intelligence analysis only)
- Transparent processing (methodology documented)
Limitations & Biases
Known Limitations
Geographic Bias:
- Better coverage in English-speaking regions
- Urban areas over-represented vs. rural
- Wealthy countries have more data sources
Source Bias:
- Dependent on available public data
- News outlet selection affects coverage patterns
- Social media skews younger, more connected populations
Event Type Bias:
- Dramatic events (attacks, disasters) detected faster
- Slow-developing situations (droughts, economic shifts) harder to detect
- 'Newsworthy' events over-represented
Temporal Lag:
- RSS feeds: 5-30 minute delay
- News APIs: 10-60 minute delay
- Official sources: Often hours after event
- Social media: Fastest but least reliable
What We're Working On
- Expanding non-English source coverage
- Improving rural/developing region monitoring
- Better slow-onset event detection
- Reducing temporal lag in verification
Validation & Accuracy
How We Test
Historical Event Validation: Test model on known past events to measure:
- Detection speed (time to first signal)
- False positive rate
- Confidence score calibration
- Coverage completeness
Novel Event Testing: Ensure model doesn't suppress unprecedented events (climate shifts, new technologies, emerging threats).
Bias Auditing: Regular analysis of coverage patterns to detect and correct systematic biases.
Confidence Score Calibration
We calibrate confidence scores so that:
- 70% confidence events are actually true ~70% of the time
- 90% confidence events are actually true ~90% of the time
Validated against historical data and continuously adjusted.
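A minimal sketch of such a calibration check, binning past predictions by reported confidence (toy data; function and field names are hypothetical):

```rust
/// Bucket past events by reported confidence and compare each bucket's
/// mean confidence to the share that actually occurred. Well-calibrated
/// scores sit near the diagonal (mean confidence ~ observed rate).
fn calibration_by_bucket(preds: &[(f64, bool)], buckets: usize) -> Vec<(f64, f64)> {
    let mut out = Vec::new();
    for b in 0..buckets {
        let (lo, hi) = (b as f64 / buckets as f64, (b + 1) as f64 / buckets as f64);
        let in_bucket: Vec<(f64, bool)> = preds
            .iter()
            .copied()
            .filter(|(c, _)| *c >= lo && *c < hi)
            .collect();
        if in_bucket.is_empty() {
            continue;
        }
        let mean_conf: f64 =
            in_bucket.iter().map(|(c, _)| *c).sum::<f64>() / in_bucket.len() as f64;
        let hit_rate = in_bucket.iter().filter(|(_, held)| *held).count() as f64
            / in_bucket.len() as f64;
        out.push((mean_conf, hit_rate));
    }
    out
}

fn main() {
    // (reported confidence, did the event hold up): toy data.
    let preds = [(0.72, true), (0.71, true), (0.74, false), (0.91, true)];
    for (conf, rate) in calibration_by_bucket(&preds, 10) {
        println!("mean confidence {conf:.2} -> observed rate {rate:.2}");
    }
}
```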
Future Development
Planned Enhancements
Data Sources (2026):
- Traffic & Mobility: Traffic camera APIs, flight tracking (ADS-B), shipping activity (AIS)
- News & Media: RSS/Atom feeds (~200 outlets), GDELT, NewsAPI, expanded non-English sources
- Social Media Signals: Twitter/X, Bluesky, Mastodon (Reddit/YouTube/Telegram/Dark Web monitoring potentially later)
- H3 Geospatial Layers: 151+ layers including population, infrastructure, environmental, conflict/security, predictive models, satellite imagery, socioeconomic indices, and risk factors
- Other Public Data: World Bank, UNDP, WHO, NASA, NOAA, OpenStreetMap, and similar open-access datasets
Analysis Capabilities:
- Improved entity disambiguation
- Better slow-onset event detection
- Predictive modeling (infrastructure failure risk)
- Enhanced causal relationship detection
Historical Integration:
- NASA PEND GDIS (disasters 1900-present)
- ACLED (armed conflict data)
- More comprehensive baseline modeling
Technical Details
Infrastructure
Backend: Rust (performance, safety)
Database: PostgreSQL + PostGIS (geospatial queries)
Search: Meilisearch + pgvector (semantic search)
Processing: Kafka (stream processing)
Storage: S3-compatible object storage (satellite imagery, archives)
Scale:
- ~200 sources monitored continuously
- ~10,000 events processed daily
- ~1M articles indexed
- Sub-second query response times
Questions?
This methodology is a living document. As we improve our systems, we'll update this page to reflect current capabilities.
For technical questions or clarifications: luis@knovolo.com
For access to full technical documentation: Request Intelligence Terminal access
Last updated: January 3, 2026