When R² Turns Negative — And What It Really Means

Negative R Squared

We’re often told that the R² value in a regression model ranges between 0 and 1. A score close to 1? Great model! A score near 0? Not so great.

But here’s a question most textbooks skip:
Can R² be negative?

Yes. And when it is, it’s a big red flag.
Let me walk you through why — with code, plots, and a surprising mathematical twist.

When Guessing Outperforms Your Model

Imagine you’re trying to predict house prices.
But instead of using real-world features like square footage or number of rooms…
You train your model using just the row index of the dataset (0, 1, 2, …).

You know it won’t perform well — but how badly?

Let’s simulate the effect with a tiny toy example: a model deliberately fit to the wrong target values, then scored against the real ones:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

# Actual data: decreasing trend
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([10, 8, 6, 4, 2])

# Bad model: predict constant (wrong!) values
y_bad = np.array([5, 5, 5, 5, 5])
model = LinearRegression()
model.fit(X, y_bad)  # Fitting on wrong y

# Predict on true y values
y_pred = model.predict(X)
r2 = r2_score(y, y_pred)

# Output R²
print("R² Score:", r2)

# Plot
plt.scatter(X, y, color='blue', label='Actual data')
plt.plot(X, y_pred, color='red', label='Bad model')
plt.axhline(np.mean(y), color='green', linestyle='--', label='Mean of y')
plt.title(f"Bad Model Fit (R² = {r2:.2f})")
plt.legend()
plt.grid(True)
plt.show()

The model is worse than predicting the average, which gives you a negative R². That’s not a bug — that’s your model admitting defeat.

So:

  • R² is not a square in the arithmetic sense.
  • Squaring a real number (like x²) always gives a non-negative result; R² carries no such guarantee.
  • R² measures how well your model captures variability, and if it performs poorly, R² will tell you, minus sign and all.

Bottom line:

If you get a negative R², it’s your model’s polite way of saying:

“You’d be better off guessing the mean.”

What Does Negative R² Really Mean?

Mathematically, R² is defined as:

    $$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $$
  • SS_res: residual sum of squares (errors of the model)
  • SS_tot: total sum of squares (errors of the mean-only model)

If your model’s error is larger than that of a “dumb” average-based prediction,
then SS_res > SS_tot → R² becomes negative.

So, in plain English:

A negative R² means your model performs worse than no model at all.
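
To make the definition concrete, here is a small sketch (reusing the toy data from the example above) that computes SS_res and SS_tot by hand and checks the result against scikit-learn's r2_score:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10, 8, 6, 4, 2])   # actual values (same toy data as above)
y_pred = np.array([5, 5, 5, 5, 5])    # the "bad" constant prediction

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares (mean-only model)

r2_manual = 1 - ss_res / ss_tot
print("SS_res:", ss_res)                          # 45
print("SS_tot:", ss_tot)                          # 40
print("Manual R²:", r2_manual)                    # -0.125
print("sklearn R²:", r2_score(y_true, y_pred))    # -0.125

Because SS_res (45) exceeds SS_tot (40), the ratio is greater than 1 and R² drops below zero.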

Let’s Clear Up a Common Misconception: “Squared” Doesn’t Mean Power

The name R-squared misleads many into thinking it must be non-negative, like the square of a number.

But here’s the catch:

  • In ordinary least squares with an intercept, R² does equal the square of a correlation coefficient, and in that setting it really is non-negative.
  • The general definition used by libraries like scikit-learn is 1 − SS_res/SS_tot, a ratio built from squared residuals. Nothing in that ratio forbids a negative value when the predictions are worse than the mean.
  • In other words, R² is a statistical measure of fit, not a literal power-of-two operation applied to a single number.

In fact, real squaring behaves differently:

x = 3
print(x**2)  # 9
print((-3)**2)  # 9

Always non-negative.

But when you move into complex numbers, things get spicier:

z = complex(0, 1)  # the imaginary unit i
print(z**2)  # Output: (-1+0j)

Yes — i² = -1.
So, squaring in complex math can result in negative values.

Key Takeaways for Readers

  1. R² < 0 means your model is worse than guessing.
    Always check your features and assumptions.
  2. “Squared” in R² doesn’t mean it behaves like math powers.
    It’s a statistical ratio, not a square root or power-of-two operation.
  3. In complex numbers, squared values can be negative — just like R².
    That’s a fun coincidence, not a cause — but it’s a helpful analogy.
  4. Always use R² with other metrics: residual plots, p-values, and domain intuition.

Modeling the Unthinkable: The Mathematics of Rare Event Prediction

Rare Event Modeling

Rare events—massive earthquakes, financial crashes, pandemics, cyberattacks—are statistically unlikely but carry enormous consequences. Modeling such events is critical for informed decision-making, risk management, and public safety. While conventional statistical models often fail due to data sparsity and class imbalance, mathematical tools such as Bayesian inference, Poisson processes, Extreme Value Theory, and anomaly detection provide more robust solutions.


1. What Are Rare Events?

A rare event is defined by its low probability of occurrence. Formally:

    $$ P(X \geq x) \ll 1 $$

where X is a random variable representing the magnitude or frequency of the event. Rare events may occur only a few times over centuries but can cause outsized harm. For example:

Example: Earthquakes

Suppose a region experiences a magnitude 8.0 earthquake roughly once every 500 years. The annual probability is:

    $$ P(\text { earthquake in a given year })=\frac{1}{500}=0.002 $$

Despite the low probability, this risk cannot be ignored due to the potential devastation. Policymakers, urban planners, and insurers rely on accurate rare event modeling to mitigate damage and allocate resources effectively.
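
As a quick illustration, that annual probability can be converted into risk over a planning horizon. A minimal sketch, assuming independent years and a constant annual probability of 1/500:

# Probability of at least one magnitude-8.0 earthquake over different horizons,
# assuming independent years with a constant annual probability of 1/500.
p_annual = 1 / 500

for years in (50, 100, 500):
    p_at_least_one = 1 - (1 - p_annual) ** years
    print(f"P(at least one in {years} years) = {p_at_least_one:.3f}")

# Roughly 0.095 over 50 years, 0.181 over 100 years, and 0.632 over 500 years.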


2. Statistical Tools for Modeling Rare Events

2.1 Poisson Distribution

Used to model the number of rare events in a fixed interval of time or space:

    \begin{equation*} P(X=k)=\frac{\lambda^k e^{-\lambda}}{k!} \end{equation*}

where \lambda is the average rate of occurrence. For example, it can model the number of earthquakes per century.
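
As a brief sketch, scipy.stats.poisson can evaluate these probabilities directly; the rate λ = 0.2 below (one magnitude-8 event per 500 years, i.e. 0.2 per century) is an assumed value for illustration:

from scipy.stats import poisson

lam = 0.2  # assumed average number of large earthquakes per century (1 per 500 years)

for k in range(3):
    print(f"P(X = {k} earthquakes in a century) = {poisson.pmf(k, lam):.4f}")

print(f"P(X >= 1 in a century) = {poisson.sf(0, lam):.4f}")  # 1 - P(X = 0)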

2.2 Extreme Value Theory (EVT)

EVT focuses on modeling the extremes, such as maximum temperature or largest earthquake magnitude:

    $$ F(x)=\exp \left(-\left(1+\xi \frac{x-\mu}{\sigma}\right)^{-1 / \xi}\right) $$

Here, \mu, \sigma, and \xi represent the location, scale, and shape of the distribution, respectively. EVT is widely used in engineering, hydrology, and seismology.
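
A small sketch of this CDF using scipy.stats.genextreme; note that SciPy's shape parameter c is the negative of the ξ used above, and the location, scale, and shape values here are assumed purely for illustration:

from scipy.stats import genextreme

# Illustrative (assumed) GEV parameters for the annual maximum earthquake magnitude
mu, sigma, xi = 6.0, 0.5, 0.1

# SciPy's genextreme uses the shape convention c = -xi relative to the formula above
gev = genextreme(c=-xi, loc=mu, scale=sigma)

# Probability that the annual maximum magnitude exceeds 8.0
print("P(annual max > 8.0):", 1 - gev.cdf(8.0))

# 100-year return level: the magnitude exceeded with probability 1/100 in any year
print("100-year return level:", gev.ppf(1 - 1 / 100))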

2.3 Data Challenges
  • Imbalanced Classes: Rare events constitute a tiny portion of total data.
  • High Variance: Sparse data leads to high uncertainty in estimations.
  • Misclassification: Standard machine learning models often default to predicting the majority class.

3. Bayesian Inference and the Beta Distribution

Why Bayesian?

Frequentist methods rely solely on observed data, which is problematic when events are rare. Bayesian inference incorporates prior beliefs and updates them with data:

    $$ P(\theta \mid X)=\frac{P(X \mid \theta) P(\theta)}{P(X)} $$

This is especially useful for estimating probabilities with limited observations.

The Role of the Beta Distribution

The Beta distribution is the conjugate prior for the Binomial distribution. It models probability values between 0 and 1 and updates easily with new data:

    $$ P(\theta)=\frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a, b)} $$

Why Use It?

  • Natural model for probability values
  • Mathematical convenience (conjugacy)
  • Flexibility in representing different prior beliefs

Earthquake Example with Beta Distribution

Suppose historical data shows that 2 magnitude 7+ earthquakes have occurred in a region over 200 years. We want to estimate the annual probability of such an earthquake.

Start with an uninformative prior \operatorname{Beta}(1, 1):

  • Observed successes (earthquakes): k = 2
  • Failures (no earthquake): n - k = 198

Update the distribution:

    $$ \text { Posterior }=\operatorname{Beta}(1+2,1+198)=\operatorname{Beta}(3,199) $$

Expected probability:

    $$ E[\theta]=\frac{3}{3+199}=\frac{3}{202} \approx 0.0148 $$

This means the best estimate for the annual probability of a large earthquake is about 1.48%.

Why Add 1 in Beta(1 + k, 1 + n − k)?

In the absence of prior data, we start from the uninformative prior \operatorname{Beta}(1, 1):

  • Equivalent to assuming 1 prior success and 1 prior failure
  • Prevents the model from being overly confident in absence of data
  • Ensures mathematical tractability (conjugacy)

In our example:

  • Prior: \operatorname{Beta}(1, 1)
  • Data: 2 events in 200 years.

Posterior becomes:

$\operatorname{Beta}(1+2,1+198)=\operatorname{Beta}(3,199)$

This smooths the estimate and allows for consistent updating if new data arrives.

If new geological studies suggest the fault line stress has increased, we can incorporate that information into a new prior and update again.
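
As a short sketch of that workflow, suppose (hypothetically) one more magnitude 7+ earthquake were observed over the next 50 years. The current posterior becomes the new prior, and the update is simply an addition of counts:

from scipy.stats import beta

# Current posterior after 2 events in 200 years (starting from a Beta(1, 1) prior)
a, b = 3, 199

# Hypothetical new data: 1 event in the next 50 years (49 event-free years)
new_events, new_years = 1, 50
a_new = a + new_events
b_new = b + (new_years - new_events)

print(f"Mean before new data: {beta(a, b).mean():.4f}")          # ~0.0149
print(f"Mean after new data:  {beta(a_new, b_new).mean():.4f}")  # 4/252 ~ 0.0159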


4. Anomaly Detection Techniques

When labeled data is scarce, unsupervised anomaly detection is effective:

4.1 Kernel Density Estimation (KDE)

Estimates the probability density function of data:

    $$ \hat{f}(x)=\frac{1}{n h} \sum_{i=1}^n K\left(\frac{x-X_i}{h}\right) $$

Anomalies are values with low estimated density.
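
As an illustration, here is a minimal sketch using scikit-learn's KernelDensity to flag low-density points; the data, bandwidth, and threshold are made-up choices for demonstration:

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Mostly "normal" observations around 100, plus one extreme value
X = np.concatenate([rng.normal(100, 10, size=500), [400]]).reshape(-1, 1)

# Fit a Gaussian KDE (the bandwidth h = 5 is an assumed tuning choice)
kde = KernelDensity(kernel="gaussian", bandwidth=5.0).fit(X)

# score_samples returns the log-density; very low values indicate anomalies
log_density = kde.score_samples(X)
threshold = np.percentile(log_density, 1)   # flag the lowest 1% as anomalies
print("Flagged anomalies:", X[log_density < threshold].ravel())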

4.2 One-Class SVM

Learns a boundary around normal data. Points outside are flagged as anomalies:

    $$ \min_{w, \xi, \rho} \frac{1}{2}\|w\|^2 + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i - \rho $$

subject to $w \cdot \phi\left(x_i\right) \geq \rho - \xi_i, \quad \xi_i \geq 0.$

Fraud Detection Example

A bank detects fraud by modeling normal transaction patterns. If a user who typically spends $100 suddenly initiates a $10,000 transaction, KDE will assign this a near-zero density, flagging it as anomalous.
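
A minimal sketch of that idea with scikit-learn's OneClassSVM; the transaction amounts and the nu parameter are assumptions chosen only for illustration:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# "Normal" behaviour: transaction amounts centred around $100
normal_transactions = rng.normal(100, 20, size=(1000, 1))

# nu bounds the fraction of training points treated as outliers (assumed 1%)
clf = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale").fit(normal_transactions)

# New transactions: two typical amounts and one suspicious $10,000 transfer
new_transactions = np.array([[95.0], [120.0], [10_000.0]])
print(clf.predict(new_transactions))   # +1 = normal, -1 = flagged as anomalous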

5. Conclusion

Rare events may be statistically improbable, but their impacts are immense. Accurate modeling requires blending statistical rigor with domain knowledge. Techniques like Bayesian inference with Beta priors, Poisson processes, EVT, and anomaly detection allow us to quantify uncertainty and make informed decisions even when data is sparse.

Understanding and planning for rare events is not just about numbers—it’s about resilience.

What rare event risks are you modeling in your industry? Let us know in the comments or reach out to discuss!

Python Example:

Here’s a Python code snippet that demonstrates:

  • Modeling rare event probability using the Beta distribution
  • Updating it using Bayesian inference
  • Visualizing the prior and posterior distributions

We’ll use the earthquake example: 2 events in 200 years.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# Prior: Uniform prior Beta(1, 1)
a_prior, b_prior = 1, 1

# Data: 2 earthquakes in 200 years
k = 2           # number of rare events
n = 200         # number of observations (years)
a_post = a_prior + k
b_post = b_prior + (n - k)

# Probability range for plotting
theta = np.linspace(0, 0.05, 1000)

# Prior and Posterior Beta distributions
prior_dist = beta(a_prior, b_prior)
posterior_dist = beta(a_post, b_post)

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(theta, prior_dist.pdf(theta), label=f'Prior: Beta({a_prior}, {b_prior})', linestyle='--')
plt.plot(theta, posterior_dist.pdf(theta), label=f'Posterior: Beta({a_post}, {b_post})', linewidth=2)
plt.axvline(posterior_dist.mean(), color='red', linestyle=':', label=f'Posterior Mean: {posterior_dist.mean():.4f}')
plt.title('Bayesian Update of Earthquake Probability (Beta Distribution)', fontsize=14)
plt.xlabel('Annual Earthquake Probability')
plt.ylabel('Density')
plt.legend()
plt.grid(True)
plt.show()

Rare event modeling with the Beta distribution.

Output Summary:

  • Posterior Mean: ~0.01485, or 1.48% annual probability
  • You’ll see how the belief about the earthquake probability shifts from a uniform prior to a sharper peak after observing two events in 200 years.

You can extend this to simulate further updates (for example, a new earthquake in year 201) or to feed the results into a web app or dashboard.

What is a p-value?


A p-value is a concept used in statistics to help us decide whether the results of an experiment or study are meaningful or just happened by chance.

Imagine This Scenario: Suppose you have a coin, and you want to test if it’s fair (has an equal chance of landing heads or tails). You flip it 100 times, and it comes up heads 60 times. You might wonder: Is this coin actually unfair, or did I just get an unusual result by chance?

What Exactly is a P-Value? The p-value tells you the probability of getting results at least as extreme as the ones you observed, assuming that the null hypothesis is true.

  • The Null Hypothesis (H₀) : This is the default assumption.
  • The Alternative Hypothesis (H₁) : This is what you want to prove.

Mathematical Formula

The p-value can be calculated using various statistical tests. A common formula for the p-value in hypothesis testing is:

    $$ p\text{-value} = P\left(X \geq x \mid H_0\right) $$

This formula represents the probability of observing a test statistic as extreme as $x$ (or more extreme) under the assumption that the null hypothesis $H_0$ is true.
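
Returning to the coin example above, here is a short sketch of this calculation with SciPy, testing whether 60 heads in 100 flips is surprising under a fair-coin null hypothesis:

from scipy.stats import binom

n, k = 100, 60        # 100 flips, 60 heads observed
p_null = 0.5          # null hypothesis: the coin is fair

# One-sided p-value: P(X >= 60 | H0), via the binomial survival function at k - 1
p_value = binom.sf(k - 1, n, p_null)
print(f"P(X >= {k} heads out of {n}) = {p_value:.4f}")   # ~0.0284

With a p-value of roughly 0.028, getting 60 or more heads would be fairly unusual for a fair coin, so the result counts as statistically significant at the 0.05 level.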

(Figure in the original post: the relationship between probability, statistical significance, and the p-value in the context of a hypothesis test.)

The figure's point is that if an observed result falls far from the value expected under the null hypothesis, out in the shaded tail region whose area is the p-value, it is considered statistically significant: such a result is unlikely to have occurred by random chance.

Key Takeaways:

  • P-Value helps determine the significance of your results.
  • Low P-Value (< 0.05): Strong evidence against the null hypothesis.
  • High P-Value (> 0.05): Weak evidence against the null hypothesis; results could be due to chance.
  • Always consider p-values in the context of your study and alongside other statistical measures.

Remember: P-values don’t tell you the probability that the null hypothesis is true; they tell you how compatible your data is with the null hypothesis.

Let’s say you are testing a new drug that you think lowers blood pressure more effectively than an existing drug. Here’s how the p-value and statistical significance would play a role:

Scenario: Testing a New Drug

  • Null Hypothesis (H₀):
    The new drug has no effect on blood pressure compared to the existing drug (i.e., the difference in effectiveness is zero).
  • Alternative Hypothesis (H₁):
    The new drug is more effective at lowering blood pressure than the existing drug (i.e., there is a difference in effectiveness).
  • Experiment:
    You conduct a study with two groups: one takes the new drug, and the other takes the existing drug.
    After the study, you calculate the average reduction in blood pressure for both groups.
  • Observed Result:
    Suppose the group taking the new drug shows a reduction of 8 mmHg in blood pressure, while the group taking the existing drug shows a reduction of 5 mmHg.
  • Statistical Test:
    You perform a statistical test (e.g., a t-test) to compare the blood pressure reductions between the two groups.
    The test generates a p-value, which represents the probability of observing a difference as extreme as 3 mmHg (8 mmHg – 5 mmHg) or more, assuming the null hypothesis is true (i.e., assuming the drugs are equally effective).

Interpreting the P-Value:

  • If the p-value is low (e.g., p < 0.05): This suggests that the observed difference is unlikely to have occurred by chance. In this case, you might reject the null hypothesis and conclude that the new drug is likely more effective than the existing drug.
  • If the p-value is high (e.g., p > 0.05): This suggests that the observed difference could have occurred by chance. In this case, you would not have enough evidence to reject the null hypothesis, and you might conclude that the new drug is not significantly more effective than the existing drug.

Example Outcome:

  • Suppose the p-value calculated is 0.03.
  • Since 0.03 is less than the common significance level of 0.05, you would consider this result statistically significant. This means there is a strong indication that the new drug has a different (likely better) effect on blood pressure than the existing drug.

Conclusion: In this case, because the p-value is low, you might reject the null hypothesis and conclude that the new drug is more effective in reducing blood pressure. This helps support the decision to potentially use the new drug over the existing one.

This example demonstrates how the p-value helps determine whether an observed effect in an experiment is likely due to a real effect or just random chance.
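
To make the scenario concrete, here is a minimal sketch using simulated (made-up) blood pressure reductions and SciPy's independent-samples t-test; the group sizes, means, and spread are assumptions chosen so the numbers resemble the example above:

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)

# Simulated reductions in blood pressure (mmHg); group size and spread are assumed
new_drug = rng.normal(loc=8, scale=6, size=50)        # new drug group
existing_drug = rng.normal(loc=5, scale=6, size=50)   # existing drug group

# One-sided test: H1 says the new drug gives a larger reduction
t_stat, p_value = ttest_ind(new_drug, existing_drug, alternative="greater")
print(f"Mean difference: {new_drug.mean() - existing_drug.mean():.2f} mmHg")
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")

if p_value < 0.05:
    print("Statistically significant at the 0.05 level: reject H0.")
else:
    print("Not statistically significant: fail to reject H0.")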

Notes:

The p-value is highly dependent on the sample size. With very large samples, even trivial differences can become statistically significant (low p-value), while small samples may fail to detect meaningful differences.

The conventional cutoff for statistical significance is often set at p < 0.05. However, this threshold is arbitrary and may not be appropriate in all contexts.

Focusing solely on p-values can cause researchers to ignore other important aspects of data, such as confidence intervals, effect sizes, and the study’s context. Combine p-values with other statistical measures and the broader context of the research to draw more accurate and meaningful conclusions.

YoloV9 Code for Object Detection + Segmentation and Tracking

YOLOv9 tracking + object tracing + segmentation

First, see the demo video below, then the code that performs object detection, segmentation, and tracking on it.

(Video: YOLOv9 object detection, segmentation, and tracking.)

Now, here is the Python code for the video above:

import numpy as np
import supervision as sv
from ultralytics import YOLO

# Load the YOLOv9 segmentation model and set up the tracker and annotators
model = YOLO("yolov9e-seg.pt")
tracker = sv.ByteTrack()
mask_annotator = sv.MaskAnnotator()
label_annotator = sv.LabelAnnotator(text_color=sv.Color.BLACK)
trace_annotator = sv.TraceAnnotator()

def callback(frame: np.ndarray, _: int) -> np.ndarray:
    # Run detection + segmentation on the frame, then update the tracker
    results = model(frame)[0]
    detections = sv.Detections.from_ultralytics(results)
    detections = tracker.update_with_detections(detections)

    # Label each object with its tracker ID and class name
    labels = [
        f"#{tracker_id} {results.names[class_id]}"
        for class_id, tracker_id
        in zip(detections.class_id, detections.tracker_id)
    ]

    # Draw segmentation masks, labels, and motion traces
    annotated_frame = mask_annotator.annotate(
        frame.copy(), detections=detections)
    annotated_frame = label_annotator.annotate(
        annotated_frame, detections=detections, labels=labels)
    return trace_annotator.annotate(
        annotated_frame, detections=detections)

sv.process_video(
    source_path="1.mp4",
    target_path="1-result_2.mp4",
    callback=callback
)