Using Python in Power BI: The Complete, No-Nonsense Guide

Final Exam & Certification

Complete this tutorial and pass the 19-question final exam to earn a downloadable certificate of completion.


Power BI handles dashboards and drag-and-drop visualizations with ease. Python handles everything else. When you bring them together, you get a business intelligence environment that can ingest data from virtually any source, run statistical models and machine learning algorithms, and render visualizations that Power BI's native tooling simply cannot produce on its own.

This is not a surface-level overview. We are going to walk through exactly how the integration works under the hood — including the CSV-based data exchange mechanism that most documentation glosses over — where Python fits into real workflows, what Python Enhancement Proposals (PEPs) make the whole thing possible at a language level, and what the current state of affairs looks like in 2026, including significant deprecation changes, the PBIR format shift, and how Copilot is reshaping where Python fits in the broader analytics stack.

What's in this Python Tutorial

Why Python Belongs in Your Power BI Workflow

  • 3 integration modes in Power BI Desktop
  • 30 min: max script timeout in Power Query
  • 1 min: visual timeout in the Service
  • 72 DPI: locked by the wrapper, cannot be overridden
  • May 2026: deprecation of Python visuals in embedded / Publish to web
  • 3.11: Python version in the Power BI Service runtime

Power BI speaks DAX and M (Power Query). These are powerful within their domain, but they have hard ceilings. DAX does not do regression analysis. M does not train a clustering model. Neither of them can reach into an obscure legacy API, parse a nested JSON response, and reshape it into a clean tabular format in ten lines of code.

Python can.
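To make the contrast concrete, here is a minimal sketch of the kind of statistics DAX has no native function for: an ordinary least-squares trend line fitted with NumPy. The monthly figures below are invented purely for illustration.

```python
import numpy as np

# Hypothetical monthly revenue figures (illustrative only)
months = np.array([1, 2, 3, 4, 5, 6])
revenue = np.array([100.0, 120.0, 119.0, 143.0, 158.0, 172.0])

# One call fits a least-squares regression line --
# something neither DAX nor M can express natively
slope, intercept = np.polyfit(months, revenue, deg=1)

print(f"Trend: {slope:.1f} per month, baseline near {intercept:.1f}")
```

In a Power BI transformation step or visual, the fitted values would typically be added back to the DataFrame as a new column.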

In a 2020 Dropbox Blog profile, Python creator Guido van Rossum offered his theory on why the language took over data science so completely. He argued that once Python reached a critical mass of adoption in scientific computing, network effects locked it in: it became easier for every new practitioner to join the Python ecosystem than to bet on an alternative. That critical mass was built on the back of one architectural decision — Python was designed from the ground up to support deep integration with independently developed third-party libraries. Libraries like NumPy, pandas, and scikit-learn were not built by the Python core team. They were built by the scientific computing community, in Python, because Python made that possible. That same extensibility is precisely what makes the Python-Power BI integration work.

"At the top, Python continues to cement its overall dominance" — IEEE Spectrum, ranking Python first in the Spectrum and Trending categories in its 2024 annual report, the most recent at the time of publication.

When you write a Python script inside Power BI, you are not using some watered-down subset of the language. You are running actual Python on your local machine (or in a Fabric notebook in the cloud), with full access to the ecosystem. The bridge between the two systems is the pandas DataFrame — and that bridge rests on foundational Python specifications that are worth understanding.

The PEP Foundations That Make It Work

If you want to understand why Python integrates so cleanly with Power BI, you need to look at a few key PEPs (Python Enhancement Proposals) that established the standards underpinning the integration.

PEP 249: The Python Database API Specification v2.0

PEP 249 is arguably the most directly relevant specification. It defines a common interface for Python packages that connect to relational databases. Every time you use pyodbc, sqlite3, psycopg2, or any other database connector in a Python script inside Power BI, you are relying on a module that implements this specification.

PEP 249 standardizes how connection objects are created, how cursors work, how queries are executed, and how results are fetched. It defines the constructor connect(), which returns a Connection object, and lays out the full exception hierarchy — from InterfaceError for problems with the API layer to ProgrammingError for issues like a malformed SQL statement.
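Because sqlite3 ships with Python and implements PEP 249, you can see the whole pattern without installing anything. This sketch walks through the connect() constructor, cursor creation, execution, fetching, and one of the spec-defined exceptions; the table and data are invented for illustration:

```python
import sqlite3

# sqlite3 is a PEP 249-compliant module in the standard library
conn = sqlite3.connect(":memory:")   # connect() returns a Connection object
cur = conn.cursor()                  # Connection.cursor() returns a Cursor

cur.execute("CREATE TABLE orders (id INTEGER, revenue REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 1200.0), (2, 850.5), (3, 950.75)])

cur.execute("SELECT COUNT(*), SUM(revenue) FROM orders")
count, total = cur.fetchone()        # Cursor.fetchone() per PEP 249

# The exception hierarchy is part of the spec, too: supplying the wrong
# number of parameter bindings raises ProgrammingError
binding_error_caught = False
try:
    cur.execute("SELECT ?", ())      # 1 placeholder, 0 values supplied
except sqlite3.ProgrammingError:
    binding_error_caught = True

conn.close()
```

Swap sqlite3 for pyodbc or psycopg2 and the shape of the code stays the same, which is exactly the point of the specification.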

Why does this matter for Power BI? Because one of the primary use cases for Python in Power BI is as a data source connector. Power BI natively supports dozens of connectors, but it does not support everything. If your organization stores data in an obscure database system, a legacy flat-file format, or a proprietary API, you can write a Python script that uses a PEP 249-compliant module to connect, query, and return the data as a pandas DataFrame. Power BI consumes that DataFrame as if it came from any native connector. Here is a practical example:

Why This Matters

Every database connector you will ever use in a Power BI Python script — pyodbc, psycopg2, sqlite3, cx_Oracle, pymysql — is an implementation of PEP 249. Understanding the spec means you understand all of them. The connect(), cursor(), and execute() pattern is identical across every compliant driver. If you learn it once, you have learned it everywhere.

python
import pandas as pd
import pyodbc

conn_string = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=your_server.database.windows.net;"
    "DATABASE=SalesDB;"
    "UID=analyst;"
    "PWD=your_password;"
    "Encrypt=yes;"
    "TrustServerCertificate=no;"
    "Connection Timeout=30;"
)

query = """
    SELECT
        OrderDate,
        ProductCategory,
        SUM(Revenue)            AS TotalRevenue,
        COUNT(DISTINCT CustomerID) AS UniqueCustomers
    FROM Sales.Orders
    WHERE OrderDate >= '2025-01-01'
    GROUP BY OrderDate, ProductCategory
"""

# Note: pyodbc's connection context manager commits or rolls back on exit,
# but it does not close the connection -- close it explicitly when done
conn = pyodbc.connect(conn_string)
try:
    df = pd.read_sql_query(query, conn)
finally:
    conn.close()

When you paste this into Power BI's "Get Data > Python script" dialog and click OK, Power BI executes it, finds the df variable (a pandas DataFrame), and presents it for loading into the data model. The PEP 249-compliant pyodbc module handles the database handshake, and pandas handles the translation into a tabular format that Power BI understands.

PEP 484: Type Hints

PEP 484, authored by Guido van Rossum, Jukka Lehtosalo, and Lukasz Langa, was drafted in 2014 and formally accepted in May 2015. It introduced a standard syntax for type annotations and shipped with Python 3.5 later that year; the typing module became non-provisional in Python 3.7. While type hints do not directly affect Power BI's runtime behavior, they have had a massive indirect impact on code quality in data science workflows.

When you write Python scripts for Power BI — especially complex transformation or machine learning scripts — type hints make your code dramatically more maintainable. Consider the difference between these two function signatures:

python
# Without type hints
def prepare_forecast(data, periods, freq):
    ...

# With type hints
def prepare_forecast(
    data: pd.DataFrame,
    periods: int,
    freq: str = "M"
) -> pd.DataFrame:
    ...

The second version communicates exactly what goes in, what comes out, and what the default behavior is. When your Power BI report depends on a Python transformation script, and someone else on your team needs to modify it six months from now, type hints are the difference between clarity and confusion.

PEP 484 built on PEP 3107, which introduced the syntax for function annotations back in 2006 but deliberately left the semantics undefined. PEP 484 gave those annotations a concrete meaning. Later, PEP 526 extended the concept to variable annotations, and PEP 563 proposed postponed evaluation of annotations to resolve forward reference issues. Together, these PEPs created the type hinting ecosystem that tools like mypy, Pylance, and modern IDEs rely on — the same tools you should be using when developing Python scripts destined for Power BI.
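The annotations these PEPs standardize are ordinary runtime metadata, which is exactly what mypy, Pylance, and IDE tooling inspect. A small self-contained sketch, using a toy helper invented for this example:

```python
from typing import get_type_hints

def classify_revenue(revenue: float, threshold: float = 1000.0) -> str:
    """Toy transformation helper with PEP 484 annotations."""
    return "High" if revenue >= threshold else "Low"

# Annotations are stored as metadata, not enforced at runtime;
# static checkers and IDEs read them through this same interface
hints = get_type_hints(classify_revenue)
# hints == {'revenue': float, 'threshold': float, 'return': str}
```

Nothing stops you from calling classify_revenue("oops") at runtime; catching that mistake before the refresh fails is the job of a static checker run during development.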

PEP 20: The Zen of Python

This one is philosophical, but it matters. PEP 20 — Tim Peters' collection of guiding principles — includes the aphorism: "There should be one — and preferably only one — obvious way to do it." When you are writing Python for Power BI, this principle is your guardrail against over-engineering. Power BI already has DAX. It already has M. You should reach for Python when there is no clean way to accomplish the task with native tools — not as a replacement for them.

Three Modes of Python Integration in Power BI Desktop

Power BI Desktop supports three distinct modes of Python integration, each serving a different stage of the analytics workflow. This section walks you through each one with enough detail to follow along and produce a working result — not just an explanation of what each mode does, but how to actually use it.

Before starting any of these exercises, confirm your environment is configured. Open Power BI Desktop, go to File > Options and settings > Options, then select Python scripting in the left panel. The "Detected Python home directories" dropdown should show your virtual environment path, or you can type it manually. If this setting is blank, Power BI will refuse to run any Python. Set it now and click OK before continuing.

Mode 1: Python as a Data Source — Guided Exercise

In this exercise you will load a small dataset into Power BI using nothing but a Python script. No CSV files, no database connection, no native connector — just Python producing a DataFrame that Power BI treats as a table.

Step 1. Open Power BI Desktop with a blank report. On the Home ribbon, click Get Data. In the dialog that opens, scroll to the bottom of the left panel and select Other. In the right panel, find Python script and click Connect.

Step 2. A script editor dialog opens with a blank text area labeled "Script". Paste the following script exactly as written:

python
import pandas as pd

# A small synthetic sales table — no external dependencies needed
df_sales = pd.DataFrame({
    "OrderID":   [1001, 1002, 1003, 1004, 1005],
    "Product":   ["Widget A", "Widget B", "Widget A", "Widget C", "Widget B"],
    "Region":    ["North", "South", "North", "East", "West"],
    "Revenue":   [1200.00, 850.50, 950.75, 2100.00, 670.25],
    "OrderDate": pd.to_datetime(["2025-01-15", "2025-01-22",
                                 "2025-02-03", "2025-02-14", "2025-03-01"])
})

# A second table — Power BI will offer both for selection
df_regions = pd.DataFrame({
    "Region":  ["North", "South", "East", "West"],
    "Manager": ["Alice", "Bob", "Carol", "David"],
    "Target":  [50000, 45000, 60000, 40000]
})

Step 3. Click OK. Power BI executes the script. After a brief pause, a Navigator dialog appears. You should see two items in the left panel: df_regions and df_sales. This is Power BI listing every DataFrame variable it found in the script's global namespace. Click df_sales and you will see a preview of the five rows in the right panel.

Step 4. Check the checkbox next to df_sales and click Load. The table loads into your data model. It will appear in the Fields pane on the right side of the report canvas, ready to use in any visual.

What You Just Did

Power BI executed your Python script, scanned its namespace for pandas DataFrames, and imported the result as a native table — exactly as if you had loaded a CSV or connected to a SQL database. The script runs again every time you click Refresh. In production, this script would contain real data access logic: API calls, database queries, file parsing. The mechanism is identical regardless of how complex the source is.
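In a production data source script, the DataFrame often comes from a nested API response rather than literals. A hedged sketch of that pattern using pandas.json_normalize, with a hard-coded payload standing in for a requests call and field names invented for illustration:

```python
import pandas as pd

# Stand-in for a response from requests.get(...).json();
# the endpoint shape and field names here are invented
payload = {
    "orders": [
        {"id": 1001, "customer": {"name": "Acme", "region": "North"},
         "lines": [{"sku": "A", "amount": 700.0}, {"sku": "B", "amount": 500.0}]},
        {"id": 1002, "customer": {"name": "Globex", "region": "South"},
         "lines": [{"sku": "A", "amount": 850.5}]},
    ]
}

# Flatten the nested JSON: one row per order line, with parent
# fields carried along as prefixed columns
df_orders = pd.json_normalize(
    payload["orders"],
    record_path="lines",
    meta=["id", ["customer", "name"], ["customer", "region"]],
)
# Columns: sku, amount, id, customer.name, customer.region
```

Because df_orders sits at module level, Power BI's Navigator would offer it for loading just like the synthetic tables above.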

The DataFrame Variable Name Rule

Power BI only recognizes pandas.DataFrame objects assigned to module-level variables. If your DataFrame is created inside a function and the return value is not assigned at the top level, Power BI will not find it. Every DataFrame you want Power BI to see must be assigned to a variable at the outermost scope of the script — not inside a function, class, or conditional block.
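You can approximate the detection step yourself: after the script runs, Power BI essentially scans the script's global namespace for pandas.DataFrame instances. The sketch below simulates that scan with a plain dict standing in for the namespace; it is a simplification for illustration, not Power BI's actual code:

```python
import pandas as pd

# Simulated script namespace: what exists after your script finishes
script_namespace = {
    "df_sales": pd.DataFrame({"OrderID": [1001, 1002],
                              "Revenue": [1200.0, 850.5]}),
    "df_regions": pd.DataFrame({"Region": ["North", "South"]}),
    "conn_string": "DRIVER=...;SERVER=...",   # plain string: ignored
    "row_limit": 150_000,                     # int: ignored
}

# Only pandas.DataFrame objects are offered in the Navigator dialog
offered = sorted(name for name, value in script_namespace.items()
                 if isinstance(value, pd.DataFrame))
# offered == ['df_regions', 'df_sales']
```

A DataFrame built inside a function never lands in this namespace unless its return value is assigned at the top level, which is exactly why the rule above matters.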

Mode 2: Python as a Transformation Step — Guided Exercise

In this exercise you will take a loaded table, run it through a Python script that adds a derived column, and load the enriched result back into the data model. This is the mode you will reach for when DAX or M cannot cleanly handle what you need — complex conditional logic, NLP scoring, fuzzy matching, statistical transformations.

Step 1. Using the df_sales table you just loaded (or any table you have), go to Home > Transform data. This opens the Power Query Editor. Your query appears in the Queries panel on the left. Click it to select it so its preview is visible.

Step 2. On the Transform ribbon, click Run Python script. A script editor dialog opens. Unlike the data source dialog, this one already has your data — Power BI has serialized the current query table to a temporary CSV and will inject it into your script as a variable called dataset. You do not need to load anything yourself.

Step 3. Paste the following script. It adds a revenue tier label to each row — the kind of conditional multi-label derivation that is verbose in M and straightforward in Python:

python
import pandas as pd

# 'dataset' is injected by Power BI — it contains the current query table
df = dataset.copy()

# Add a revenue tier column
def classify_revenue(rev):
    if rev >= 2000:
        return "High"
    elif rev >= 1000:
        return "Medium"
    else:
        return "Low"

df["RevenueTier"] = df["Revenue"].apply(classify_revenue)

# IMPORTANT: always reassign back to 'dataset'
# Power BI looks for this variable name when collecting output
dataset = df

Step 4. Click OK. Power BI runs the script, then shows a Navigator dialog. You will see a variable called dataset in the left panel — this is the enriched DataFrame your script returned. Click it. You should see all the original columns plus the new RevenueTier column in the preview on the right.

Step 5. Click OK in the Navigator. Power Query adds a new applied step to your query pipeline, labeled "Run Python script", visible in the Applied Steps panel on the right. Click that step and the preview updates to show the enriched table with RevenueTier present.

Step 6. Rename the step: right-click "Run Python script" in the Applied Steps panel, choose Rename, and type something like "Python: Add RevenueTier". Self-documenting step names make the pipeline readable for anyone who opens the report later.

Step 7. Click Home > Close & Apply. The query runs, and the enriched table loads into your data model. In the Fields pane you will now see RevenueTier alongside the original columns, usable in visuals or DAX measures.

What Happens Under the Hood

Every time this query refreshes, Power BI serializes the table to a temporary CSV, launches a Python subprocess, waits for Python to write its output to a second CSV, then reads that output back into the pipeline. The Python step sits between other M transformation steps and behaves exactly like any other step — you can add M steps before or after it, and each one sees the output of the previous step as its input.
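The practical consequence of that CSV hop is worth seeing: types CSV cannot represent natively, such as datetimes, survive only as text and must be re-inferred on the way back. This self-contained simulation mimics the round-trip in memory; Power BI's actual temp-file handling differs in detail:

```python
import io
import pandas as pd

df_in = pd.DataFrame({
    "OrderDate": pd.to_datetime(["2025-01-15", "2025-02-03"]),
    "Revenue": [1200.0, 950.75],
})

# Simulate Power BI handing the table to Python: serialize to CSV...
buffer = io.StringIO()
df_in.to_csv(buffer, index=False)
buffer.seek(0)

# ...and reading the script's output CSV back into the pipeline
df_out = pd.read_csv(buffer)

# Without explicit parsing, the datetime column comes back as plain text
# (dtype 'object'), so re-apply pd.to_datetime in the script if you need it
```

This is why a Python transformation step that depends on datetime arithmetic should convert its date columns explicitly rather than trusting the dtypes it receives.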

Mode 3: Python Visuals — Guided Exercise

In this exercise you will create a Python visual on the report canvas. You need the df_sales table in your data model from the Mode 1 exercise above, or any table with at least a text column and a numeric column.

Step 1. Make sure you are on the report canvas (not in Power Query Editor). In the Visualizations pane, scroll to the bottom and click the Py icon — it looks like a stylized Python logo. A placeholder visual appears on the canvas with a message: "To create a Python visual, you must add fields from the Data pane."

Step 2. With the Python visual selected, go to the Fields pane on the right. Drag Product into the Values well, then drag Revenue into the Values well. You will see the placeholder update to say "Script is not complete." and the Python script editor panel will open at the bottom of the screen.

Step 3. The editor contains auto-generated preamble code that looks like this:

python
# The following code to create a dataframe and remove duplicated rows
# is always executed and acts as a preamble for your script:

# dataset = pandas.DataFrame(Product, Revenue)
# dataset = dataset.drop_duplicates()

# Paste or type your script code here:

The preamble tells you exactly what dataset will contain: the two columns you dragged into the Values well, deduplicated. You cannot edit it. Your own script code goes below the last comment line.

Step 4. Click in the editor below the preamble comment and paste the following script:

python
import matplotlib.pyplot as plt

# 'dataset' contains the columns dragged into the Values well
# Power BI already ran drop_duplicates() in the preamble above

fig, ax = plt.subplots(figsize=(6, 4))

colors = ['#4b8bbe' if r < 1000 else '#FFD43B' if r < 2000 else '#98c379'
          for r in dataset["Revenue"]]

bars = ax.bar(dataset["Product"], dataset["Revenue"],
              color=colors, edgecolor='none')

ax.set_title("Revenue by Product", fontsize=13, fontweight='bold', pad=12)
ax.set_ylabel("Revenue (USD)")
ax.spines[['top', 'right']].set_visible(False)
ax.tick_params(axis='x', rotation=15)

# Add value labels above each bar
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width() / 2., height + 20,
            f'${height:,.0f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()   # Do not remove — Power BI intercepts this to save the PNG

Step 5. Click the Run script button (the triangle/play icon at the top-right of the script editor panel). Power BI executes the script. After a few seconds the placeholder on the canvas is replaced by the bar chart. You will see three differently coloured bars — blue for low revenue, yellow for medium, green for high — with dollar labels on top.

Step 6. To see filter responsiveness in action: insert a Slicer visual on the canvas, bind it to the Region column, and select different regions. Each time you change the slicer, Power BI re-executes the Python script with the filtered dataset and the chart updates. Then try clicking one of the bars in the Python visual itself — you will notice it does not cross-filter any other visuals. That is the static-image limitation of all Python visuals; it is not something you can fix in the script.

Why plt.show() Must Stay In

Do not remove or comment out plt.show(). The PythonScriptWrapper.PY monkey-patches this call — it replaces it with a plt.savefig() that writes the figure to a UUID-named PNG in a temp folder. Power BI then reads that PNG and renders it as the visual. Without plt.show(), nothing is saved and the visual stays blank with no error message.
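Conceptually, the wrapper's patch looks something like the sketch below. This is an illustration of the monkey-patching idea only, not Power BI's actual PythonScriptWrapper.PY code:

```python
import os
import tempfile
import uuid
import matplotlib
matplotlib.use("Agg")          # headless backend, as in Power BI's subprocess
import matplotlib.pyplot as plt

out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, f"{uuid.uuid4()}.png")

# Replace plt.show with a save: roughly the trick the wrapper performs
def _show_as_save(*args, **kwargs):
    plt.savefig(png_path)

plt.show = _show_as_save

# Your visual script then runs unchanged...
fig, ax = plt.subplots()
ax.bar(["A", "B"], [3, 5])
plt.show()                     # ...but now writes a PNG instead of displaying

# png_path now points at the image Power BI would render on the canvas
```

Seen this way, the blank-visual failure mode is obvious: if your script never calls plt.show(), the patched function never runs and no PNG is ever written.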

You have now completed all three modes. You loaded data with Mode 1, transformed it with Mode 2, and visualized it with Mode 3. The sections that follow cover production-grade patterns built on top of these foundations.


Python vs R in Power BI: Which Should You Use?

Power BI supports both Python and R in all three integration modes. If you are deciding which to invest in, the answer depends on your team's existing skills and your specific use cases, but there are meaningful differences in how each language behaves in practice.

Python is the stronger choice for data engineering tasks: connecting to APIs, transforming nested JSON, working with dataclass-style data structures, and building anything that needs to integrate with external systems or a broader automation pipeline. The ecosystem is also larger — more packages, more Stack Overflow answers, more maintained libraries. In the Power BI Service, Python runs on Python 3.11 and has a substantial supported package list covering data science, NLP, and machine learning.

R has an advantage in statistical depth. CRAN's statistical modeling packages — particularly those for mixed-effects models, survival analysis, and certain econometric methods — have no direct Python equivalents of comparable maturity. If your team is already fluent in the tidyverse and ggplot2, and your primary use case is statistical visualization, R may be the faster path.

In the Power BI Service, both languages face the same constraints: a curated, fixed package list, no outbound networking, and a 1-minute visual timeout. Both are affected by the May 2026 deprecation of visuals in embedded and Publish to web scenarios. Both communicate with Power BI through the same CSV-based data exchange mechanism.

Practical Guidance

For most teams starting from scratch, Python is the better investment. Its data engineering capabilities, broader package ecosystem, and deeper integration with Microsoft Fabric's notebook environment (which is Python-first) make it the more future-proof choice. R remains a strong option for teams with existing R expertise and a statistical analysis focus. Nothing prevents a team from using both — Python for data source connections and transformation steps, R for specialized statistical visuals — since the two operate independently.

Licensing, Limits, and Constraints You Need to Know

Python in Power BI is not without guardrails. Here are the hard limits documented by Microsoft:

  • Python data source and transformation scripts in Power Query time out after 30 minutes. Python visuals time out after 5 minutes in Power BI Desktop and 1 minute in the Power BI Service — a distinction the official documentation makes clearly, and one that catches teams off guard when they develop in Desktop and then publish.
  • Python visuals in the Power BI Service are subject to a 30MB payload limit covering both compressed input data and the script itself. In Power BI Desktop, the input data limit for Python visuals is 250MB.
  • The input data for Python visuals is capped at 150,000 rows; anything beyond that is silently truncated.
  • In Python visual input datasets, string values longer than 32,766 characters in a single cell are truncated. This is also the general truncation limit when Power Query loads text into the data model.
  • Python visuals in the Service require a Fabric, Pro, Premium Per User (PPU), or Premium capacity license and are only supported in regions where Fabric is available.
  • Refresh of reports containing Python scripts in the Power BI Service is only supported through a personal gateway — the enterprise/standard gateway is not supported.
  • Only packages published on PyPI with free and open-source licenses are supported in the Power BI Service. Networking is blocked, meaning packages that make client-server queries over the web will not function in the cloud.
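Given the 150,000-row cap and its silent truncation, a defensive pattern in visual scripts is to check and downsample explicitly rather than let Power BI drop rows arbitrarily. A sketch; the constant mirrors the documented cap, but the sampling strategy is a suggestion, not a Microsoft recommendation:

```python
import pandas as pd

ROW_LIMIT = 150_000   # documented input cap for Python visuals

def cap_rows(df: pd.DataFrame, limit: int = ROW_LIMIT) -> pd.DataFrame:
    """Downsample deterministically instead of letting truncation pick rows."""
    if len(df) <= limit:
        return df
    # random_state makes successive refreshes reproducible
    return df.sample(n=limit, random_state=0).sort_index()

# Illustration with a small frame and an artificially low limit
demo = pd.DataFrame({"v": range(10)})
capped = cap_rows(demo, limit=4)
```

Calling cap_rows(dataset) as the first line of a visual script guarantees the chart is drawn from a representative sample instead of whatever the first 150,000 rows happened to be.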

The Deprecation You Cannot Ignore: Embedded Python Visuals

Deprecation Warning

Starting May 2026, Power BI will end support for embedding reports containing Python (and R) visuals in the "Embed for your customers" scenario (also called "app owns data") and Publish to web scenarios. After the deprecation date, embedded reports containing Python visuals will still load, but the Python charts will render as blank.

Microsoft announced the change in its November 2025 Power BI Feature Summary. Reports without Python visuals will be unaffected. The deprecation does not impact "Embed for your organization" (user owns data) scenarios, nor does it affect secure embedding to SharePoint, websites, or portals.

But if your organization uses the "app owns data" embedding pattern and any of those reports contain Python visuals, you have until May 2026 to migrate those visuals to native Power BI alternatives or DAX-based analytics. Microsoft's recommendation is to consider leveraging Fabric Notebooks for sophisticated or technical visualizations going forward.

The On-Premises Gateway and Scheduled Refresh

This is one of the most common production surprises for teams that have Python-sourced reports working perfectly in Desktop: publishing to the Service and enabling scheduled refresh requires a gateway, and the gateway situation for Python is specific.

When a Power BI report in the Service contains a Python data source script, the Service cannot execute that script directly — it runs in a sandboxed container with no networking. To enable scheduled refresh for a report that uses a Python data source, you must install and configure a personal mode on-premises data gateway on the same machine where the Python environment is set up. When the Service triggers a refresh, it routes the Python script execution through that gateway to your local machine.

There is a critical operational constraint here that catches teams off guard: the personal mode gateway cannot run as a background Windows service. Unlike the standard enterprise gateway, it runs only as an interactive application under your logged-in user account. This means the machine must be powered on, a user must be actively logged in, and the gateway application must be open and signed in at the time the scheduled refresh runs. A machine restart, a user logout, or a gateway sign-out will silently break all scheduled refreshes until someone manually relaunches the application. This is why the personal gateway is inherently unsuitable for unattended production environments and why the Fabric Lakehouse architecture is the correct path for any refresh scenario that needs to be reliable overnight or over weekends.

Enterprise Gateway Is Not Supported

The standard on-premises data gateway (enterprise mode) does not support Python data sources. Only the personal mode gateway is supported for Python script refresh. This is a hard constraint from Microsoft, not a configuration option. If your organization's data governance policy does not permit personal gateways, the workaround is to move the Python data extraction into a Fabric notebook and have the notebook write its output to a Lakehouse, then connect Power BI to the Lakehouse table using a standard connector.

The personal gateway is bound to a single machine and a single user session. Because it cannot run as a background service, it offers no path to high availability, clustering, or load balancing — features that the standard enterprise gateway provides but which do not extend to Python data sources. For mission-critical scheduled refresh, the Fabric notebook approach is the correct architecture: move the Python work to Fabric, which runs on Microsoft's cloud infrastructure independently of any local machine or logged-in user.

To configure the gateway for a Python data source: install the personal gateway from the Microsoft download page, sign in with the same account used for the Power BI Service, then in the Power BI Service navigate to the dataset settings and configure the data source credentials. Once the gateway is online and the data source is configured, scheduled refresh will route through it.

Incremental Refresh with Python Data Sources

Incremental refresh is one of the most valuable performance features in Power BI, but its interaction with Python data sources requires careful handling. The short version is that native incremental refresh — which relies on Power BI folding a date range filter into the query at the connector level — does not work the same way with Python data sources as it does with SQL or OData connectors.

Power BI's incremental refresh uses two reserved parameters, RangeStart and RangeEnd (both of type DateTime), to communicate the target date range to the data source. For SQL connectors, Power BI folds these parameters into the WHERE clause at the database level. For Python scripts, Power BI passes these parameters as accessible objects in the Power Query environment, but the Python script itself must be written to use them — they are not automatically injected into the script namespace.

The supported pattern is to reference RangeStart and RangeEnd as query parameters in Power Query, convert them to strings, and then pass them as variables into your Python script via parameter substitution in the M formula that calls the script. This is verbose and easy to get wrong, which is why many teams instead implement a simpler manual approach: the Python script fetches a rolling window (for example, the last 90 days) on every refresh, and a separate historically cached table — loaded without a Python step — holds older records. The two tables are unioned in the data model using Power Query Append.
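The rolling-window half of that manual approach is simple to express in the Python data source script. A hedged sketch: the window length and column names are placeholders, and the fetch is simulated with an in-memory frame standing in for a real API call or query:

```python
import pandas as pd

WINDOW_DAYS = 90   # refresh only the most recent 90 days each run

# Stand-in for the real fetch (API call, database query, file parse)
all_rows = pd.DataFrame({
    "OrderDate": pd.to_datetime(["2024-01-10", "2025-11-01", "2025-12-15"]),
    "Revenue": [100.0, 200.0, 300.0],
})

# Fixed date so the example is deterministic;
# in production use pd.Timestamp.now()
as_of = pd.Timestamp("2026-01-01")
cutoff = as_of - pd.Timedelta(days=WINDOW_DAYS)

# 'df' becomes the hot partition Power BI reloads on every refresh
df = all_rows[all_rows["OrderDate"] >= cutoff]
```

A separate query without a Python step holds the older records, and the two tables are unioned with Power Query's Append step, as described above.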

The Practical Recommendation

If you need true incremental refresh at scale with Python data sources, the cleanest architecture is to move the Python data extraction into a Fabric notebook, write the output partitioned by date to a Fabric Lakehouse, and then connect Power BI to the Lakehouse table using the standard Lakehouse connector — which supports full incremental refresh through Direct Lake or import mode. This pattern removes the gateway dependency, supports high availability, and removes the need to implement RangeStart/RangeEnd parameter threading through M and Python simultaneously.


Microsoft Fabric and Semantic Link: Now Generally Available

Python's role in the Microsoft data stack is expanding, but it is shifting from embedded-in-the-report scripting toward a more robust, cloud-native notebook experience within Microsoft Fabric. This is not a future direction — it is the current architecture.

Python Notebooks in Fabric are pure Python environments (no Spark required) running on lightweight 2-vCore/16GB compute nodes. They support Python 3.10 and 3.11 (with 3.11 as the default), native IPython features, and direct integration with Fabric Lakehouses and Data Warehouses.

The key capability is Semantic Link — which reached General Availability at the Microsoft Fabric Conference in March 2024, and has continued to expand with new capabilities since. It connects Power BI semantic models directly to Fabric notebooks via the SemPy Python library. SemPy introduces a FabricDataFrame object that behaves like a pandas DataFrame but carries metadata from Power BI semantic models — relationships, hierarchies, and column descriptions. You can retrieve and evaluate DAX measures from your Power BI models directly in Python, without reimplementing the business logic.

One important architectural note: the Semantic Link Spark native connector — which allowed PySpark, Spark SQL, R, and Scala to access Power BI tables directly — is approaching retirement and is unsupported on Spark runtime 4.0 or later. Microsoft's current guidance is to migrate Spark-based Semantic Link usage to the Semantic Link Python SDK, which provides continued support and new features. If you are building new Fabric workflows, use the SemPy Python library rather than the Spark native connector.

Here is what Semantic Link looks like in practice. In a Fabric notebook, you can read data from a Power BI semantic model and evaluate DAX measures against it with a few lines:

python
import sempy.fabric as fabric

# Read a table from your Power BI semantic model directly into a FabricDataFrame
# A FabricDataFrame behaves like a pandas DataFrame but carries PBI metadata
df_orders = fabric.read_table(
    dataset="SalesDataset",       # your semantic model name
    table="Orders"                # table name within the model
)

# Evaluate a DAX measure defined in the semantic model
# No need to reimplement the business logic in Python
revenue_by_region = fabric.evaluate_measure(
    dataset="SalesDataset",
    measure="Total Revenue",          # measure name, without DAX brackets
    groupby_columns=["Orders[Region]", "Orders[OrderYear]"]
)

# revenue_by_region is a pandas DataFrame — use it with any Python library
import matplotlib.pyplot as plt

pivot = revenue_by_region.pivot(
    index="Orders[OrderYear]",
    columns="Orders[Region]",
    values="Total Revenue"            # output column is named after the measure
)

fig, ax = plt.subplots(figsize=(10, 5))
pivot.plot(kind="bar", ax=ax)
ax.set_title("Revenue by Region and Year")
ax.set_ylabel("Total Revenue")
plt.tight_layout()
plt.show()

The fabric.evaluate_measure() call executes the DAX measure as defined in the semantic model — including any row-level security filters, calculation groups, or complex CALCULATE context your DAX implements. This is fundamentally different from re-querying the underlying data: it evaluates the business logic exactly as Power BI would, which means your Python analysis and your Power BI report are guaranteed to agree on the numbers.

To use Semantic Link, your Fabric workspace must have a capacity license (F or P SKU) and the semantic model must be in the same workspace as your notebook, or you must have access configured cross-workspace. For Fabric Runtime 1.2 (Spark 3.4) and above, Semantic Link is available in the default runtime and does not need to be installed separately. The sempy library is pre-installed in Fabric notebooks — no pip install required for standard usage, though you can run %pip install -U semantic-link to update to the latest version.

At FabCon 2026 in Atlanta (March 2026), Microsoft announced Direct Lake on OneLake reaching General Availability — enabling Power BI semantic models to query data directly from OneLake without import or DirectQuery overhead. This is directly relevant to Python users because it is the recommended destination for Python-processed data written from Fabric notebooks: write the output to a Lakehouse delta table, and Power BI reads it via Direct Lake with no refresh latency. This architecture — Python in Fabric notebooks writing to OneLake, Power BI reading via Direct Lake — is now fully production-ready, not a preview pattern. Rather than running Python scripts inside Power BI Desktop with CSV-based data exchange, the architecture is moving toward running Python in Fabric Notebooks with direct API access to semantic models, lakehouse data, and machine learning infrastructure, then surfacing the results in Power BI reports.

"Semantic Link is the connective tissue between analytical code and business logic" — Microsoft engineering blog, describing how SemPy bridges Fabric notebooks and Power BI semantic models without duplicating DAX measure definitions.

A Real-World Workflow: Predictive Churn Scoring

To tie this together, here is a realistic end-to-end workflow that demonstrates where Python adds value that Power BI alone cannot.

The goal: A telecom company wants to predict which customers are likely to churn in the next 90 days and surface those predictions in an executive Power BI dashboard.

Step 1 — Data Ingestion (Python as Data Source): A Python script in Power BI connects to the company's PostgreSQL customer database using psycopg2 (a PEP 249-compliant module), pulls customer demographics, usage patterns, and support ticket history, and returns a consolidated DataFrame. For financial use cases, the same pattern applies — see analyzing financial data with Python for how this looks with market and accounting data sources.
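The consolidation at the end of Step 1 can be sketched with pandas alone. This is a minimal sketch: the psycopg2 connection and `pd.read_sql()` calls are omitted, and the table and column names are hypothetical stand-ins for what those queries would return.

```python
import pandas as pd

# Stand-ins for the three query results pulled via psycopg2 —
# in the real script these come from pd.read_sql() against PostgreSQL.
demographics = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "tenure_months": [24, 6, 48],
})
usage = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "avg_daily_minutes": [32.5, 8.1, 55.0],
})
tickets = pd.DataFrame({
    "customer_id": [101, 101, 103],
    "ticket_id": [1, 2, 3],
})

# Aggregate support tickets to one row per customer before joining
ticket_counts = (
    tickets.groupby("customer_id")
    .size()
    .rename("support_tickets_90d")
    .reset_index()
)

# Left-join so customers with zero tickets are kept
customers = (
    demographics
    .merge(usage, on="customer_id", how="left")
    .merge(ticket_counts, on="customer_id", how="left")
)
customers["support_tickets_90d"] = (
    customers["support_tickets_90d"].fillna(0).astype(int)
)
```

The `customers` DataFrame is what Power BI would pick up as the query output in Mode 1.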

Step 2 — Feature Engineering (Python Transformation): A second Python script in Power Query applies feature engineering that is impractical in M: calculating rolling 30-day averages of usage metrics, encoding categorical variables, and normalizing numeric features using scikit-learn's StandardScaler.
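A sketch of the Step 2 operations on synthetic data. Column names here are hypothetical; in the real transformation step the input arrives as the injected `dataset` variable rather than being constructed inline.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic daily usage for two customers — stands in for the injected table
usage = pd.DataFrame({
    "customer_id": [101] * 40 + [102] * 40,
    "usage_date": list(pd.date_range("2026-01-01", periods=40)) * 2,
    "daily_minutes": list(range(40)) + list(range(40, 0, -1)),
    "contract_type": ["monthly"] * 40 + ["annual"] * 40,
})

# Rolling 30-day average of usage, computed per customer
usage = usage.sort_values(["customer_id", "usage_date"])
usage["minutes_30d_avg"] = (
    usage.groupby("customer_id")["daily_minutes"]
    .transform(lambda s: s.rolling(window=30, min_periods=1).mean())
)

# One-hot encode the categorical column
usage = pd.get_dummies(usage, columns=["contract_type"], prefix="contract")

# Normalize numeric features to zero mean / unit variance
scaler = StandardScaler()
usage[["daily_minutes", "minutes_30d_avg"]] = scaler.fit_transform(
    usage[["daily_minutes", "minutes_30d_avg"]]
)
```

Each of these operations is one or two lines in pandas; the M equivalents require custom functions or multiple applied steps.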

Step 3 — Model Scoring (Python Transformation): A pre-trained GradientBoostingClassifier pipeline (serialized with joblib as churn_pipeline.pkl) is loaded and used to score each customer with a churn probability. The output DataFrame includes the original customer data plus a new churn_risk_score column.
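The mechanics of Step 3 can be sketched with a tiny pipeline trained and serialized inline. In the real workflow the training happens offline and only the `joblib.load` and scoring lines run inside Power BI; the features and file path here are illustrative stand-ins.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in training run — in the real workflow this happened offline
train = pd.DataFrame({
    "tenure_months": [1, 3, 40, 50, 2, 45],
    "support_calls_90d": [5, 4, 0, 1, 6, 0],
})
labels = [1, 1, 0, 0, 1, 0]   # 1 = churned
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=42)),
])
pipeline.fit(train, labels)

path = os.path.join(tempfile.mkdtemp(), "churn_pipeline.pkl")
joblib.dump(pipeline, path)

# ── What the Power BI transformation step does at refresh time ──────────────
scoring = joblib.load(path)
customers = pd.DataFrame({
    "CustomerID": ["A-1", "A-2"],
    "tenure_months": [2, 48],
    "support_calls_90d": [5, 0],
})
features = customers.drop(columns=["CustomerID"])
# predict_proba column 1 is P(churn = 1)
customers["churn_risk_score"] = scoring.predict_proba(features)[:, 1]
```

Because the scaler is inside the pipeline, the refresh-time script never needs to know how the features were normalized during training.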

Step 4 — Visualization (Native Power BI + Python Visual): The dashboard uses native Power BI visuals for KPI cards, slicers, and bar charts. A Python visual provides a SHAP (SHapley Additive exPlanations) waterfall plot showing the top features driving churn risk for a selected customer segment — a visualization type that Power BI does not offer natively.

python
import matplotlib.pyplot as plt
import shap
import joblib
import pandas as pd

# Load the pre-trained pipeline (scaler + GradientBoostingClassifier)
pipeline = joblib.load(r"C:\Models\churn_pipeline.pkl")

# 'dataset' provided by Power BI
X = dataset.drop(columns=["CustomerID", "churn_risk_score"])

# shap.TreeExplainer requires the underlying estimator, not the Pipeline wrapper.
# Extract the final step (the classifier) from the fitted pipeline.
# The scaler is already baked into the pipeline — apply it first so the
# features TreeExplainer sees match what the model was trained on.
classifier = pipeline.named_steps["model"]
X_scaled = pipeline[:-1].transform(X)   # apply all steps except the final estimator

# TreeExplainer uses tree structure directly — no background dataset required
# for GradientBoostingClassifier (tree_path_dependent by default).
# With sklearn GradientBoostingClassifier, model_output defaults to "raw"
# (log-odds), which returns shap_values of shape (n_samples, n_features).
# Use model_output="probability" to get probability-space SHAP values instead.
explainer = shap.TreeExplainer(classifier, model_output="raw")
shap_values = explainer(X_scaled)

# shap_values is an Explanation object of shape (n_samples, n_features).
# GradientBoostingClassifier in raw mode is single-output, so no class indexing.
# shap.plots.bar() creates its own figure internally — do not call plt.subplots()
# before it. Pass show=False so you can add a title before plt.show() hands
# the figure to Power BI's PythonScriptWrapper to save as PNG.
shap.plots.bar(shap_values, show=False)
plt.title("Top Features Driving Churn Risk")
plt.tight_layout()
plt.show()

Use Case Walkthrough: Pulling Data from a REST API Power BI Cannot Reach

Power BI ships with dozens of native connectors, but they stop well short of every API an organization might need. A SaaS product with a custom authentication scheme, an internal microservice that returns nested JSON, a government data portal that requires session cookies — none of these have a native connector, and building a custom one in M is painful. Python solves this with a short script and the requests library.

This walkthrough uses a realistic scenario: pulling paginated incident data from a security operations platform via REST API and loading it into a Power BI table for dashboard analysis. The same pattern works for any JSON-returning API.

Before You Start

This use case uses Python as a data source (Mode 1). The script runs once during data refresh. The requests library works in Power BI Desktop because the script executes locally on your machine. In the Power BI Service, Python data sources can only refresh through an on-premises data gateway running in personal mode, and Python visuals in the Service cannot access the network at all. If you need this to refresh in the cloud without a gateway, move the data extraction to a Fabric notebook and publish the results to a lakehouse that Power BI reads natively.

Step 1 — Install the required library

In your Power BI virtual environment, install requests if it is not already present. You only need to do this once.

bash
# Activate your Power BI virtual environment first
C:\Envs\powerbi-env\Scripts\activate

# Then install
pip install requests

Step 2 — Write and test the script outside Power BI first

Never paste untested code into Power BI's script editor. The editor has no debugger and error messages are minimal. Build the script in VS Code or a Jupyter notebook against your actual API endpoint, confirm it produces a clean DataFrame, then move it into Power BI.

python
import pandas as pd
import requests

API_BASE = "https://api.yoursoc.example.com/v2"
API_KEY  = "YOUR_API_KEY_HERE"

HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Accept": "application/json",
}

def fetch_all_incidents() -> pd.DataFrame:
    """Paginate through the incidents endpoint and return a flat DataFrame."""
    records = []
    page = 1

    while True:
        response = requests.get(
            f"{API_BASE}/incidents",
            headers=HEADERS,
            params={"page": page, "per_page": 200, "status": "all"},
            timeout=30,
        )
        response.raise_for_status()
        payload = response.json()

        records.extend(payload["incidents"])

        # Stop when there are no more pages
        if page >= payload["meta"]["total_pages"]:
            break
        page += 1

    df = pd.json_normalize(records)

    # Rename nested keys to clean column names
    df = df.rename(columns={
        "id":                    "IncidentID",
        "title":                 "Title",
        "severity":              "Severity",
        "status":                "Status",
        "created_at":            "CreatedAt",
        "resolved_at":           "ResolvedAt",
        "assignee.name":         "AssigneeName",
        "affected_assets":       "AffectedAssets",
    })

    # Parse timestamps
    df["CreatedAt"]   = pd.to_datetime(df["CreatedAt"],  utc=True, errors="coerce")
    df["ResolvedAt"]  = pd.to_datetime(df["ResolvedAt"], utc=True, errors="coerce")

    # Calculate time-to-resolution in hours
    df["ResolutionHours"] = (
        (df["ResolvedAt"] - df["CreatedAt"])
        .dt.total_seconds()
        .div(3600)
        .round(2)
    )

    return df[["IncidentID", "Title", "Severity", "Status",
               "CreatedAt", "ResolvedAt", "ResolutionHours",
               "AssigneeName", "AffectedAssets"]]


# In your test environment, call it and inspect
incidents = fetch_all_incidents()
print(incidents.shape)
print(incidents.dtypes)
print(incidents.head())

Step 3 — Adapt the script for Power BI

Power BI scans the script's namespace for pandas DataFrame variables. The function definition above is fine — you just need to make sure the final DataFrame is assigned to a module-level variable name. Remove the print calls, which are meaningless inside Power BI and add execution overhead.

python
import pandas as pd
import requests

API_BASE = "https://api.yoursoc.example.com/v2"
API_KEY  = "YOUR_API_KEY_HERE"

HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Accept": "application/json",
}

records = []
page = 1

while True:
    response = requests.get(
        f"{API_BASE}/incidents",
        headers=HEADERS,
        params={"page": page, "per_page": 200, "status": "all"},
        timeout=30,
    )
    response.raise_for_status()
    payload = response.json()
    records.extend(payload["incidents"])

    if page >= payload["meta"]["total_pages"]:
        break
    page += 1

incidents = pd.json_normalize(records)

incidents = incidents.rename(columns={
    "id":             "IncidentID",
    "title":          "Title",
    "severity":       "Severity",
    "status":         "Status",
    "created_at":     "CreatedAt",
    "resolved_at":    "ResolvedAt",
    "assignee.name":  "AssigneeName",
    "affected_assets":"AffectedAssets",
})

incidents["CreatedAt"]  = pd.to_datetime(incidents["CreatedAt"],  utc=True, errors="coerce")
incidents["ResolvedAt"] = pd.to_datetime(incidents["ResolvedAt"], utc=True, errors="coerce")

incidents["ResolutionHours"] = (
    (incidents["ResolvedAt"] - incidents["CreatedAt"])
    .dt.total_seconds()
    .div(3600)
    .round(2)
)

incidents = incidents[[
    "IncidentID", "Title", "Severity", "Status",
    "CreatedAt", "ResolvedAt", "ResolutionHours",
    "AssigneeName", "AffectedAssets"
]]

Step 4 — Load it into Power BI

In Power BI Desktop, go to Home > Get Data > More > Other > Python script. Paste the adapted script and click OK. Power BI executes it and presents the incidents DataFrame for selection. Select it and click Load. The table lands in your data model exactly as if it came from a native connector.

Do Not Hard-Code API Keys in Production

The script above places the API key directly in the source for clarity. In a real deployment, store secrets in environment variables and read them with os.environ.get("SOC_API_KEY"). Anyone who can open the .pbix file or view it in the PBIR folder structure can read a hard-coded key. Environment variables are set at the OS level and are not serialized into the report file.
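A minimal sketch of the environment-variable pattern, using the SOC_API_KEY name from the callout (set it once at the OS level on the machine that runs Desktop or the gateway):

```python
import os

# Set once at the OS level (System Properties > Environment Variables,
# or `setx SOC_API_KEY "your-real-key"` in a terminal). The script reads
# it at refresh time — nothing secret is serialized into the .pbix or
# the PBIR folder structure.
API_KEY = os.environ.get("SOC_API_KEY", "")

HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Accept": "application/json",
}
```

If the variable is missing, the request will fail with a 401 rather than leaking a placeholder key into the report file.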

Step 5 — Handle refresh

When you click Refresh in Power BI Desktop, the script re-executes from scratch. Every refresh is a full re-run: new API call, full pagination, full DataFrame rebuild. For large datasets or slow APIs, consider adding a date filter parameter to your API request so each refresh only fetches recent records, then use Power Query's append operations to merge with historically cached data.
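One way to sketch the date-filter idea: compute a cutoff at refresh time and pass it as a request parameter. Note that created_after is a hypothetical parameter name; check what your API actually supports.

```python
from datetime import datetime, timedelta, timezone

# Only fetch incidents created in the last 35 days — the extra 5 days of
# overlap beyond a 30-day window guards against records created right
# around the previous refresh.
cutoff = datetime.now(timezone.utc) - timedelta(days=35)

params = {
    "page": 1,
    "per_page": 200,
    "status": "all",
    "created_after": cutoff.strftime("%Y-%m-%dT%H:%M:%SZ"),  # hypothetical filter
}
```

The params dict then replaces the static one in the pagination loop, so each refresh fetches only recent records.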

If you publish this report and need scheduled refresh in the Power BI Service, you will hit the networking block described earlier. The supported path at that point is to move the data extraction into a Fabric notebook, write the result to a Fabric Lakehouse, and point Power BI at the Lakehouse table instead of the Python script.

Use Case Walkthrough: Cleaning Messy Tabular Data in Power Query

Every analyst has encountered the spreadsheet from hell: inconsistent date formats across rows, whitespace embedded in category names, numeric values mixed with text error strings, and duplicate records that are technically distinct because one has a trailing space. M handles straightforward transformations cleanly, but complex multi-step cleaning logic — especially anything involving conditional logic across multiple columns simultaneously — is where Python earns its place in the transformation pipeline.

This walkthrough uses Python as a transformation step (Mode 2) to clean an exported sales CSV that contains real-world data quality issues before loading it into the Power BI data model.

Mode 2 Mechanics

In a transformation step, Power BI injects the current query table into your script as a variable named dataset. Whatever DataFrame you return as dataset at the end of the script becomes the output of that transformation step. You do not need to load data yourself — Power BI handles the handoff. The script receives the data exactly as it existed at that point in the Power Query pipeline.
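The contract in miniature: dataset in, dataset out. In this sketch a hand-built DataFrame stands in for the table Power BI injects; inside a real transformation step you would delete that first assignment.

```python
import pandas as pd

# Stand-in for the table Power BI injects into the step's namespace.
# In Power BI, 'dataset' already exists when your script starts.
dataset = pd.DataFrame({"Region": [" north ", "South"], "Sales": [100, 200]})

# Transform a copy, then hand the result back under the same name
df = dataset.copy()
df["Region"] = df["Region"].str.strip().str.title()

dataset = df   # Power BI reads this variable as the step's output
```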

Step 1 — Load the raw data with any standard connector

In Power BI Desktop, use Home > Get Data > Text/CSV to load your messy source file. Power BI brings it in as-is. Do not apply any M transformations yet. Navigate to Transform Data to open the Power Query Editor. You should see the raw, uncleaned table.

Step 2 — Add a Python transformation step

With the query selected in the Power Query Editor, go to Transform > Run Python script. This opens the script editor with the note that the current data is available as dataset.

Step 3 — Write the cleaning script

The following script addresses six common data quality problems in a single transformation step. In M, accomplishing all of this would require six separate applied steps and would be substantially harder to read, modify, or peer review.

python
import pandas as pd

# 'dataset' is provided by Power BI — do not reassign it at the top
df = dataset.copy()

# ── 1. Standardize column names ─────────────────────────────────────────────
# Strip whitespace, lowercase, replace spaces with underscores
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(r"\s+", "_", regex=True)
    .str.replace(r"[^a-z0-9_]", "", regex=True)
)

# ── 2. Strip leading/trailing whitespace from all string columns ─────────────
# include both "object" (pandas 2.x) and "string" (pandas 3.0+ default)
str_cols = df.select_dtypes(include=["object", "string"]).columns
df[str_cols] = df[str_cols].apply(lambda col: col.str.strip())

# ── 3. Normalize category values to title case ───────────────────────────────
# Catches variants like "NORTH WEST", "north west", "North west"
for col in ["region", "product_category", "sales_channel"]:
    if col in df.columns:
        df[col] = df[col].str.title()

# ── 4. Parse mixed-format dates ──────────────────────────────────────────────
# The source file has dates as "2025-03-14", "14/03/2025", and "Mar 14 2025"
if "order_date" in df.columns:
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce", format="mixed")

# ── 5. Coerce revenue to numeric, replacing error strings with NaN ───────────
# Common error strings: "#ERROR", "N/A", "—", empty string.
# pd.to_numeric with errors="coerce" converts every non-numeric string to NaN
# in a single pass — no replace() pre-pass needed.
if "revenue" in df.columns:
    df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# ── 6. Remove true duplicates based on business key ─────────────────────────
# A real duplicate is a row with the same order_id AND line_item_number.
# We keep the last occurrence, which is typically the most recent correction.
if {"order_id", "line_item_number"}.issubset(df.columns):
    before = len(df)
    df = df.drop_duplicates(subset=["order_id", "line_item_number"], keep="last")
    dupes_removed = before - len(df)

# Return the cleaned DataFrame under the expected variable name
dataset = df
Why You Must Return dataset

In a Python transformation step, Power BI looks for the variable named dataset in the script's namespace when it collects the output. If you rename the variable at the end — returning df instead of dataset = df — Power BI will either show an empty output or throw a navigator error. Always reassign your final cleaned DataFrame back to dataset before the script ends.

Step 4 — Select the output and close the editor

After you click OK, Power BI shows a navigator with one or more DataFrames found in the script's namespace. Select dataset and click OK. Power BI adds this as a new applied step in the query pipeline, labeled "Run Python script." You can rename it by right-clicking the step and choosing Rename — call it something like "Python: Clean and normalize" so the pipeline is self-documenting.

Step 5 — Add downstream M steps as normal

The cleaned DataFrame is now a standard Power Query table. You can continue adding M transformation steps after the Python step — changing column types, removing columns, applying filters. The Python step sits in the pipeline like any other step; it just happens to have run a Python subprocess to produce its output.

One practical note: each time you edit an applied step before the Python step, Power BI will re-execute the Python script to propagate the change through the pipeline. If your script is slow, you will notice this. The mitigation is to apply all upstream M transformations first — reduce the row count, drop columns the script does not need — so the CSV that Power BI hands to Python is as small as possible.

Combine Multiple Steps Into One Script

Because every Python transformation step triggers a CSV serialization roundtrip, running three separate Python steps on the same query costs three roundtrips. If you have multiple Python cleaning operations, consolidate them into a single script. The performance difference on a 100,000-row table is measurable.

Use Case Walkthrough: Fuzzy Entity Resolution Across Two Tables

This problem appears in nearly every organization that assembles data from more than one system: a CRM table and a billing table both contain company names, but they were entered by different people across years of inconsistent data entry. "Microsoft Corporation," "Microsoft Corp.," "MICROSOFT CORP," and "Microsoft" are the same entity. DAX has no join condition that can handle this. M has no fuzzy join on arbitrary string similarity. Power BI's native fuzzy merge feature exists but is limited to a single threshold with no visibility into confidence scores, and it silently drops rows that fall below the threshold.

Python's thefuzz library solves this properly. It computes Levenshtein-based similarity ratios between strings and lets you return the match key and the confidence score as separate columns. The analyst then controls the threshold in Power BI using a slicer — a genuinely useful capability that no native tooling provides.

This is a Mode 2 transformation, but with a twist: the script needs access to two tables simultaneously. The technique is to load the lookup table as a second Power Query query, then reference it inside the Python script using pandas.

Performance Note

thefuzz with process.extractOne() runs a pairwise comparison for each row. On a 10,000-row table against a 500-name lookup, this is fast enough. Since version 0.19.0, thefuzz uses rapidfuzz as its backend, so both libraries share the same C++ implementation. On very large tables (500,000+ rows), using rapidfuzz directly rather than through the thefuzz compatibility wrapper removes a small amount of overhead from the wrapping layer. The APIs are close enough that the script below works with either library by changing the import; note that rapidfuzz's process.extractOne() returns a (match, score, index) 3-tuple where thefuzz returns (match, score), which the script tolerates because it only indexes result[0] and result[1].

Step 1 — Install the library

bash
C:\Envs\powerbi-env\Scripts\activate
pip install thefuzz

Since thefuzz 0.19.0 (current version 0.22.1), the library depends directly on rapidfuzz and no longer requires python-Levenshtein as a separate install; rapidfuzz is pulled in automatically. If you prefer to skip the thefuzz compatibility wrapper entirely, you can use rapidfuzz directly: swap from thefuzz import process for from rapidfuzz import process. The call signatures match, though rapidfuzz's extractOne() returns a 3-tuple that also includes the match index. Both libraries carry MIT licenses.

Step 2 — Load both tables into Power Query as separate queries

Load your primary table (for example, a CRM export) using any standard connector. In a separate query, load the canonical lookup table — the authoritative list of company names you want to match against (for example, a billing system export or a manually curated reference list). Give the lookup query a clear name: CompanyLookup.

Do not try to merge them yet with Power BI's native Merge Queries. You will handle the join inside the Python script.

Step 3 — Add a Python transformation step to the primary table

Select the primary table query. Go to Transform > Run Python script. The script below receives the primary table as dataset. Power BI injects only one table into a Python step, so the lookup cannot arrive through the same dataset mechanism. The most reliable production pattern is to load the lookup table directly inside the Python script from its original source, keeping the step fully self-contained.

python
import pandas as pd
from thefuzz import process

# 'dataset' is the primary CRM table — injected by Power BI
df = dataset.copy()

# ── Load the canonical lookup table ─────────────────────────────────────────
# In production, load from the same source as your CompanyLookup query.
# Using an absolute path here because Power BI changes the working directory.
lookup = pd.read_csv(r"C:\Data\canonical_companies.csv")
canonical_names = lookup["CanonicalName"].dropna().unique().tolist()

# ── Run fuzzy matching ───────────────────────────────────────────────────────
# For each company name in the CRM, find the closest canonical match.
# extractOne returns (best_match_string, score) — score is 0-100.

def resolve_entity(name: str) -> tuple:
    """Return (best_match, confidence_score) for a given name string."""
    if pd.isna(name) or str(name).strip() == "":
        return (None, 0)
    result = process.extractOne(
        str(name).strip(),
        canonical_names,
        score_cutoff=0   # return even low-confidence matches; let analyst filter
    )
    if result is None:
        return (None, 0)
    return (result[0], result[1])

# Apply to the company name column — adjust column name to match your data
matches = df["company_name"].apply(resolve_entity)

df["matched_canonical_name"] = matches.apply(lambda x: x[0])
df["match_confidence"]        = matches.apply(lambda x: x[1])

# Return enriched DataFrame
dataset = df

Step 4 — Build the confidence slicer in Power BI

After loading the enriched table into your data model, create a slicer visual bound to the match_confidence column. Set it as a "Between" range slicer. Now the analyst controls the threshold: setting it to 85 shows only high-confidence matches, setting it to 60 reveals borderline cases for manual review.

In a separate table visual, add both company_name and matched_canonical_name side by side with match_confidence. This gives the analyst a live review queue that updates with the slicer. Rows below whatever confidence threshold the team agrees on can be flagged for manual correction in the source system — a workflow that is completely impossible with either Power BI's native fuzzy merge or any M transformation.

Why score_cutoff=0 Instead of a Hard Threshold

Setting score_cutoff=0 means every row gets a match attempt — even rows with very low confidence. Returning all scores to Power BI and letting the analyst filter is intentional: different business decisions tolerate different levels of ambiguity. A revenue attribution model might require 90+ confidence; a marketing segment assignment might accept 70. Hard-coding the threshold in the script removes that flexibility. Surface the score, let the data tell you where the natural clusters are, then agree on a threshold with the business owner.

Use Case Walkthrough: NLP Sentiment Scoring That Feeds Native Power BI Visuals

The instinctive approach to text analytics in Power BI is to build a Python visual that renders a word cloud or a sentiment bar chart. That instinct produces the wrong result. Python visuals are static images, cannot cross-filter other visuals, and hit a 1-minute execution timeout in the Service. The correct pattern is to use Python as a transformation step that adds sentiment columns to your data model, then build the visuals entirely in native Power BI — where they respond to every slicer, filter, and cross-highlight interaction on the page.

This walkthrough adds VADER sentiment scoring to a customer support ticket table. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based model specifically tuned for short, informal text. It requires no training data, no GPU, and runs in milliseconds per row. vaderSentiment is not on Microsoft's current Power BI Service supported package list, which means this transformation step runs correctly in Power BI Desktop but will fail with a ModuleNotFoundError in the Service. To deploy the sentiment scoring to the Service, move the Python transformation into a Fabric notebook, compute the scores there, write the enriched table to a Lakehouse, and point Power BI at the Lakehouse table.

Step 1 — Install vaderSentiment

bash
C:\Envs\powerbi-env\Scripts\activate
pip install vaderSentiment

Step 2 — Load your ticket table

Load the support ticket export using whatever connector fits your source — SQL, CSV, REST API. The table needs at minimum a ticket ID column, a text column (the ticket body or customer comment), and ideally a date column. Navigate to Transform Data and select the query.

Before adding the Python step, use M to drop any columns the sentiment script does not need. The fewer columns Power BI serializes into the temp CSV, the faster the Python step runs. A 100,000-row table with 40 columns costs significantly more I/O than the same table with 4 columns.

Step 3 — Add the Python transformation step

python
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

df = dataset.copy()

# Initialize the analyzer once, outside the per-row function — not inside it
analyzer = SentimentIntensityAnalyzer()

def score_text(text) -> dict[str, float]:
    """Return VADER polarity scores for a single text string."""
    if pd.isna(text) or str(text).strip() == "":
        return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}
    return analyzer.polarity_scores(str(text))

# Apply once, expand the dict results into columns in a single operation
# This avoids iterating over the series four times separately
scores_df = pd.DataFrame(
    df["ticket_body"].apply(score_text).tolist(),
    index=df.index
)

# The compound score is the single most useful value:
# +1.0 = maximally positive, -1.0 = maximally negative
# VADER's documented thresholds: >= 0.05 positive, <= -0.05 negative
df["sentiment_compound"] = scores_df["compound"]
df["sentiment_positive"]  = scores_df["pos"]
df["sentiment_negative"]  = scores_df["neg"]
df["sentiment_neutral"]   = scores_df["neu"]

df["sentiment_label"] = df["sentiment_compound"].apply(
    lambda score: "Positive" if score >= 0.05
                  else ("Negative" if score <= -0.05 else "Neutral")
)

dataset = df

Step 4 — Build native Power BI visuals on the sentiment columns

Once the enriched table is in the data model, sentiment_compound, sentiment_label, and the component scores are regular columns. This is where the pattern pays off compared to a Python visual:

A clustered bar chart with sentiment_label on the axis and count on the value shows the distribution across Positive, Neutral, and Negative. A line chart with date on the axis and average sentiment_compound on the value shows sentiment trend over time. A slicer on sentiment_label lets the user filter the entire report page to negative tickets only — and every other visual on the page responds to that filter, including ticket counts, agent performance cards, and category breakdowns. None of this works with a Python visual because Python visuals do not participate in cross-filtering.

A table visual showing ticket_id, ticket_body, sentiment_compound, and sentiment_label — sorted descending by sentiment_compound with a conditional format on the score column — gives supervisors a direct queue of the most negative open tickets. That is an actionable dashboard that would take ten minutes to configure in native Power BI once the Python transformation step is in place.

Why the Compound Score

VADER returns four values: pos, neg, neu (proportions that sum to 1.0), and compound (a normalized score from -1.0 to +1.0). The compound score is computed by summing the valence scores of each word, adjusted for rules like capitalization, punctuation, and degree modifiers, then normalizing to the -1 to +1 range. It is the single most useful value for downstream analytics because it is a continuous variable — you can average it, trend it, bin it, and compare it across segments. The three proportion scores are useful for understanding the mix of sentiment signals in a given text but are harder to build KPIs around.
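For example, averaging and binning the compound score per segment with pandas. The scores and segment column here are synthetic; the sentiment_compound and sentiment_label names match the walkthrough, and the bin edges approximate VADER's documented thresholds.

```python
import pandas as pd

tickets = pd.DataFrame({
    "product_line": ["Broadband", "Broadband", "Mobile", "Mobile"],
    "sentiment_compound": [-0.62, 0.10, 0.48, 0.71],
})

# Average compound score per segment — a direct KPI candidate
avg_by_segment = tickets.groupby("product_line")["sentiment_compound"].mean()

# Bin the continuous score into labels (edges follow the documented
# +/-0.05 thresholds; exact boundary handling differs slightly at the edge)
tickets["sentiment_label"] = pd.cut(
    tickets["sentiment_compound"],
    bins=[-1.0, -0.05, 0.05, 1.0],
    labels=["Negative", "Neutral", "Positive"],
)
```

The proportion scores do not support this kind of arithmetic cleanly, which is why the compound score is the one to model KPIs on.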

Step 5 — Consider batching for very large tables

VADER is fast, but Power BI's CSV serialization is the bottleneck, not the sentiment scoring itself. If your table has more than 150,000 rows, you will hit Power BI's documented row limit for Python visual inputs — but the transformation step mode has no documented row ceiling beyond the 30-minute timeout. For very large tables, add an upstream M filter that limits the Python step to tickets from the last 90 days, then use a separate incremental refresh strategy to maintain historical scores in the data model as a separate table loaded without re-running the Python script.

Use Case Walkthrough: Scikit-Learn Model Inference as a Transformation Step

Scikit-learn appears in the Power BI Service supported package list, yet almost no tutorials show it being used the way it should be used: not for training inside Power BI, but for running inference on a pre-trained model that was serialized outside Power BI and loaded at refresh time. Training belongs in a Fabric notebook or a dedicated ML pipeline. Power BI's Python step is the deployment surface, not the training surface.

The pattern is this: you train and serialize your model externally, commit the .pkl file to a location the gateway machine can reach, and write a transformation script that loads the model at refresh time, scores the current data, and returns a DataFrame with a new prediction column. The column then becomes a first-class field in your data model — available in DAX measures, slicers, and native visuals with no Python execution at report view time.

This walkthrough uses a customer churn risk model as the example. The model is a GradientBoostingClassifier trained on historical subscription data. The transformation step scores the active customer table on every refresh, adding a churn_probability column and a tiered churn_risk_label.

Step 1 — Train and serialize the model outside Power BI

The training script runs once in your development environment and produces a .pkl file. The important detail is to serialize both the model and the preprocessing pipeline together in a single Pipeline object. If you serialize them separately and the preprocessing step changes, the inference script breaks silently. Serializing the full pipeline means the same transformations always apply to inference data.

python
# train_churn_model.py — run once outside Power BI
import pandas as pd
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

# Load historical labeled data
df = pd.read_csv(r"C:\Data\churn_training.csv")
FEATURES = ["tenure_months", "monthly_charge", "support_calls_90d",
            "contract_type_encoded", "payment_method_encoded"]
TARGET   = "churned"

X = df[FEATURES]
y = df[TARGET]

# Bundle scaler + model into a single Pipeline.
# This ensures the same scaling logic is always applied at inference time.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model",  GradientBoostingClassifier(
        n_estimators=400,
        learning_rate=0.05,
        max_depth=4,
        subsample=0.8,
        random_state=42
    ))
])

pipeline.fit(X, y)

# Serialize the full pipeline using joblib — the recommended approach for
# scikit-learn objects. joblib handles large numpy arrays more efficiently
# than pickle and is included in the Power BI Service supported package list.
joblib.dump(pipeline, r"C:\Models\churn_pipeline.pkl", compress=3)

print("Pipeline serialized.")

Step 2 — Write the Power BI transformation script

This script runs as a Mode 2 transformation step. Power BI injects the active customer table as dataset. The script loads the pipeline, validates the required columns, scores the data, and returns the enriched DataFrame.

python
import pandas as pd
import joblib

# Load the full pipeline — absolute path because PythonScriptWrapper
# changes the working directory to a temp folder before your script runs
MODEL_PATH = r"C:\Models\churn_pipeline.pkl"
FEATURES   = ["tenure_months", "monthly_charge", "support_calls_90d",
              "contract_type_encoded", "payment_method_encoded"]

df = dataset.copy()

# ── Column existence guard ────────────────────────────────────────────────────
missing = set(FEATURES) - set(df.columns)
if missing:
    raise ValueError(
        f"Churn scoring script: expected feature columns not found: {missing}. "
        f"Available columns: {list(df.columns)}"
    )

# ── Empty DataFrame guard ─────────────────────────────────────────────────────
if df.empty:
    df["churn_probability"] = pd.Series(dtype=float)
    df["churn_risk_label"]  = pd.Series(dtype=str)
    dataset = df
else:
    pipeline = joblib.load(MODEL_PATH)

    X = df[FEATURES].copy()

    # Fill any nulls in feature columns with column medians
    # so a missing value in one row does not crash the entire refresh
    X = X.fillna(X.median(numeric_only=True))

    # predict_proba returns shape (n_rows, 2); column 1 is P(churn=True)
    df["churn_probability"] = pipeline.predict_proba(X)[:, 1]

    df["churn_risk_label"] = df["churn_probability"].apply(
        lambda p: "High" if p >= 0.70
                  else ("Medium" if p >= 0.35 else "Low")
    )

    dataset = df

Step 3 — Build the Power BI report layer

With churn_probability and churn_risk_label in the data model, you build the report entirely in native Power BI. A card visual showing average churn probability for the currently filtered segment, a clustered bar showing customer counts by risk tier, and a table sorted descending by probability with conditional formatting on the score column together give a complete at-risk customer dashboard. Every visual responds to slicers and cross-filters. No Python is executed at view time.

One DAX measure that pairs well with this pattern is a conditional KPI that flags when average churn probability for a filtered segment exceeds your operational threshold:

dax
-- DAX measure: segment churn alert
Churn Alert =
VAR AvgProb = AVERAGE(Customers[churn_probability])
RETURN
    IF(AvgProb >= 0.70, "HIGH RISK", IF(AvgProb >= 0.35, "MODERATE", "STABLE"))
Why Serialize the Full Pipeline, Not Just the Model

A GradientBoostingClassifier — or any scikit-learn model trained on scaled features — must receive the same scaling at inference time as at training time. If you serialize only the model and apply scaling separately in the inference script, any discrepancy between the two scaling operations (different data range, a column added or removed, different fit data) will corrupt the probability estimates silently. The model will produce numbers, but they will be wrong. Serializing the full Pipeline object means the same scaler fitted on training data is always applied before the model, guaranteed, on every refresh. Use joblib rather than pickle for sklearn pipelines: joblib is scikit-learn's own recommended serialization tool, handles large numpy arrays more efficiently, and is included in the Power BI Service supported package list.

Use Case Walkthrough: Rolling Cohort Retention and Unevenly Spaced Time Series

This is a category of problem that DAX handles poorly at scale and M cannot handle at all: cohort retention analysis over unevenly spaced event data, and rolling aggregations over time series where intervals are not uniform. These problems appear constantly in SaaS analytics (monthly active user cohorts), subscription businesses (renewal retention by signup month), and operational reporting (rolling 30/60/90-day averages where some days have no events).

DAX can compute cohort retention with complex CALCULATE and FILTER expressions, but the performance degrades badly on large tables because it cannot materialize an intermediate cohort assignment efficiently. pandas solves this in a few operations using merge and pivot, and the result — a complete cohort matrix — loads into the data model as a standard table that a matrix visual can render instantly.

Part A: Cohort retention matrix

The input is a table of customer subscription events with at minimum a customer ID, an event date, and an event type. The script computes each customer's cohort month (the month of their first subscription event), then calculates monthly retention for each cohort.

python
import pandas as pd

df = dataset.copy()

# ── Validate and parse ────────────────────────────────────────────────────────
required = {"customer_id", "event_date"}
missing = required - set(df.columns)
if missing:
    raise ValueError(f"Cohort script: missing columns {missing}")

df["event_date"] = pd.to_datetime(df["event_date"], errors="coerce")
df = df.dropna(subset=["event_date"])

# Cohort month = the calendar month of each customer's earliest event
df["cohort_month"] = (
    df.groupby("customer_id", observed=True)["event_date"]
      .transform("min")
      .dt.to_period("M")
)
df["event_month"] = df["event_date"].dt.to_period("M")

# ── Build cohort-period counts ────────────────────────────────────────────────
cohort_data = (
    df.groupby(["cohort_month", "event_month"], observed=True)["customer_id"]
      .nunique()
      .reset_index(name="active_customers")
)

# Cohort size = number of unique customers in month 0 of each cohort
cohort_sizes = (
    cohort_data[cohort_data["cohort_month"] == cohort_data["event_month"]]
    .set_index("cohort_month")["active_customers"]
    .rename("cohort_size")
)
cohort_data = cohort_data.join(cohort_sizes, on="cohort_month")

# Period index: how many months after cohort month is this event month?
cohort_data["period_number"] = (
    cohort_data["event_month"] - cohort_data["cohort_month"]
).apply(lambda x: x.n)

# Retention rate for each cell
cohort_data["retention_rate"] = (
    cohort_data["active_customers"] / cohort_data["cohort_size"]
).round(4)

# Convert Period objects to strings so Power BI can load the columns cleanly
cohort_data["cohort_month"] = cohort_data["cohort_month"].astype(str)
cohort_data["event_month"]  = cohort_data["event_month"].astype(str)

dataset = cohort_data[
    ["cohort_month", "event_month", "period_number",
     "cohort_size", "active_customers", "retention_rate"]
]

Once this table is in the data model, you build a matrix visual with cohort_month on rows, period_number on columns, and retention_rate on values. Conditional formatting on the values produces the standard heatmap retention matrix. This is a result DAX could approximate with CALCULATE-heavy measures, but the query time on a 2-million-row event table is orders of magnitude faster coming from a pre-materialized Python-computed table.
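If you would rather materialize the wide matrix in Python instead of relying on the matrix visual's pivot, one pivot call does it. A sketch using a hand-built long table in the same shape the script returns:

```python
import pandas as pd

# Long-format output like the cohort script produces
cohort_data = pd.DataFrame({
    "cohort_month":   ["2025-01", "2025-01", "2025-01", "2025-02", "2025-02"],
    "period_number":  [0, 1, 2, 0, 1],
    "retention_rate": [1.0, 0.42, 0.31, 1.0, 0.55],
})

# Pivot into the wide heatmap shape: cohorts on rows, periods on columns.
# Cells with no data (a cohort too young to have that period) become NaN.
matrix = cohort_data.pivot(
    index="cohort_month", columns="period_number", values="retention_rate"
)
```

The long format is usually the better choice for Power BI, though — a single retention_rate column slices and filters naturally, while a wide table hardcodes the number of period columns.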

Part B: Rolling aggregations on unevenly spaced time series

Power BI's built-in time intelligence functions assume a continuous calendar. When your data has gaps — days with no transactions, sensors that go offline, event logs with irregular intervals — DAX's rolling window functions give incorrect averages because they count empty periods as zeros or skip them unpredictably. pandas rolling() with a time offset handles this correctly:

python
import pandas as pd

df = dataset.copy()
df["event_date"] = pd.to_datetime(df["event_date"], errors="coerce")
df = df.dropna(subset=["event_date"]).sort_values("event_date")
df = df.set_index("event_date")

# Rolling 30-day sum and 30-day mean, computed on actual timestamps
# not on row count — so gaps are handled correctly
df["rolling_30d_revenue"] = (
    df["revenue"]
    .rolling("30D")    # "30D" = 30-day window based on index timestamps
    .sum()
    .round(2)
)

df["rolling_30d_avg_revenue"] = (
    df["revenue"]
    .rolling("30D")
    .mean()
    .round(2)
)

# Reset index so event_date becomes a column again for Power BI
df = df.reset_index()
dataset = df
Time Offset vs Row Count in pandas rolling()

Passing an integer to rolling() — like rolling(30) — means "the 30 preceding rows." On unevenly spaced data, those 30 rows might span 3 days or 6 months depending on how the data is distributed. Passing a time offset string like "30D" means "all rows within the 30 calendar days preceding each timestamp," which is the semantically correct interpretation for a rolling monthly average. The index must be a DatetimeIndex for offset-based rolling to work — which is why the script converts and sets the date column as the index before calling rolling().
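The difference is easy to demonstrate on a toy series with a five-month gap:

```python
import pandas as pd

# Three events, unevenly spaced: two in January, one in June
s = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.to_datetime(["2025-01-01", "2025-01-02", "2025-06-01"]),
)

# Row-count window: always the 3 preceding rows, regardless of time span —
# the June value absorbs both January values into its "rolling" sum
by_rows = s.rolling(3, min_periods=1).sum()

# Time-offset window: only rows within the 3 calendar days before each
# timestamp — the June value correctly stands alone
by_time = s.rolling("3D").sum()
```

The row-count version reports 60.0 for June 1 (summing values five months old); the offset version reports 30.0, which is the semantically correct 3-day sum.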

Use Case Walkthrough: Paginated API Ingestion with OAuth Token Refresh

The REST API use case earlier in this guide covers the basic pattern. This section covers what that one deliberately deferred: API ingestion that requires OAuth 2.0 authentication, and APIs that return paginated results where the total number of pages is not known in advance. These are the two most common complications in real API integrations, and both are straightforward to handle in Python while being impractical or impossible in M.

OAuth 2.0 client credentials flow is the standard for machine-to-machine authentication against APIs like Microsoft Graph, Salesforce, ServiceNow, and many SaaS platforms. The pattern is: exchange a client ID and client secret for a bearer token, then include that token in each API request. Tokens expire (typically after 3,600 seconds), so any long-running script that makes multiple API calls needs to handle token refresh mid-run.

Credential Storage

Never hardcode credentials in a Power BI Python script. The .pbix file is not encrypted storage. Store client IDs and secrets in environment variables on the gateway machine and read them with os.environ.get(). Even better, store them in Windows Credential Manager and access them via the keyring library. The script below uses environment variables. Rotate credentials periodically and ensure the gateway machine has least-privilege access to the API.
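A hedged helper that prefers Windows Credential Manager via keyring but falls back to environment variables — the service name powerbi-refresh is an invented convention, and the broad exception handling covers both a missing keyring package and a machine with no OS credential backend:

```python
import os

def get_secret(name: str) -> str:
    """Fetch a secret from Windows Credential Manager via keyring when
    available, falling back to an environment variable of the same name.
    Raises immediately if neither source has the value, so the refresh
    fails loudly instead of making unauthenticated API calls."""
    try:
        import keyring  # optional dependency; reads Windows Credential Manager
        value = keyring.get_password("powerbi-refresh", name)
        if value:
            return value
    except Exception:
        # keyring missing or no OS backend available — fall back to env vars
        pass
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Credential '{name}' not found in Credential Manager or environment."
        )
    return value
```

With this in place, the script below would call get_secret("POWERBI_API_CLIENT_ID") instead of reading os.environ directly, and gains the Credential Manager path for free.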

python
import os
import time
import pandas as pd
import requests

# ── OAuth 2.0 client credentials flow ────────────────────────────────────────
# Credentials are read from environment variables set on the gateway machine.
# Never hardcode secrets in a Power BI script.
TOKEN_URL     = "https://auth.yoursaas.example.com/oauth2/token"
API_BASE      = "https://api.yoursaas.example.com/v3"
CLIENT_ID     = os.environ.get("POWERBI_API_CLIENT_ID")
CLIENT_SECRET = os.environ.get("POWERBI_API_CLIENT_SECRET")

if not CLIENT_ID or not CLIENT_SECRET:
    raise RuntimeError(
        "POWERBI_API_CLIENT_ID and POWERBI_API_CLIENT_SECRET must be set "
        "as environment variables on the gateway machine."
    )

def fetch_token() -> tuple[str, float]:
    """Return (access_token, expiry_epoch) using client credentials flow."""
    resp = requests.post(
        TOKEN_URL,
        data={
            "grant_type":    "client_credentials",
            "client_id":     CLIENT_ID,
            "client_secret": CLIENT_SECRET,
            "scope":         "api.read",
        },
        timeout=30,
    )
    resp.raise_for_status()
    payload    = resp.json()
    token      = payload["access_token"]
    expires_in = payload.get("expires_in", 3600)   # seconds until expiry
    # Store the absolute expiry time; refresh 60 seconds early to be safe
    expiry     = time.time() + expires_in - 60
    return token, expiry

def get_headers(token: str) -> dict:
    return {"Authorization": f"Bearer {token}", "Accept": "application/json"}

# ── Paginated ingestion ───────────────────────────────────────────────────────
# Many APIs return a cursor or next_page URL instead of a total page count.
# This pattern handles both: cursor-based and offset/limit pagination.

token, token_expiry = fetch_token()
all_records: list[dict] = []
page = 1
page_size = 200

while True:
    # Refresh the token if it is close to expiry
    if time.time() >= token_expiry:
        token, token_expiry = fetch_token()

    resp = requests.get(
        f"{API_BASE}/incidents",
        headers=get_headers(token),
        params={"page": page, "per_page": page_size, "status": "all"},
        timeout=60,
    )
    resp.raise_for_status()
    payload = resp.json()

    records = payload.get("data", [])
    if not records:
        break   # empty page means we have consumed all results

    all_records.extend(records)

    # Some APIs signal the last page with a flag; others just return empty
    if not payload.get("has_more", True):
        break

    page += 1

# ── Flatten nested JSON into a DataFrame ─────────────────────────────────────
df = pd.json_normalize(all_records)

# Ensure the DataFrame is never empty even if the API returned nothing,
# so Power BI does not throw a "no tables found" navigator error
if df.empty:
    df = pd.DataFrame(columns=["id", "created_at", "severity", "status"])
Why pd.json_normalize Instead of pd.DataFrame

REST API responses typically return nested JSON where a single record might contain a nested object like {"id": 1, "assignee": {"name": "Alice", "team": "SOC"}}. Passing this directly to pd.DataFrame() creates a column called assignee containing dicts, which Power BI cannot load — it expects scalar values in each cell. pd.json_normalize() flattens nested objects into dot-notated columns: assignee.name and assignee.team. For deeply nested APIs, the record_path and meta parameters give fine-grained control over which levels are flattened and which become separate repeated columns.
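A minimal illustration of the difference, using the assignee example above:

```python
import pandas as pd

records = [
    {"id": 1, "assignee": {"name": "Alice", "team": "SOC"}},
    {"id": 2, "assignee": {"name": "Bob",   "team": "IR"}},
]

# pd.DataFrame leaves `assignee` as a column of dicts — Power BI cannot
# load non-scalar cells
raw = pd.DataFrame(records)

# json_normalize flattens nested objects into dot-notated scalar columns:
# id, assignee.name, assignee.team
flat = pd.json_normalize(records)
```

Every cell in flat is a scalar, so it loads cleanly into the data model; raw would fail or render the dict column as useless text.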

Use Case Walkthrough: Schema Drift Detection and Type Coercion

The column existence guard shown in the error handling section is a minimum viable schema check. Production pipelines need more: the ability to detect when an upstream source adds, removes, or renames columns; when a previously numeric column starts arriving as a string; and when date formats change mid-stream. These failures are the most common cause of silent data corruption in Power BI Python transformation steps, because pandas will often perform an operation that appears to succeed while operating on the wrong type or against the wrong column.

This section gives a reusable schema validation function that goes beyond checking for column presence. It validates column types against an expected schema, attempts safe coercions where the type is wrong but the data is recoverable, and raises an informative error — with the problematic column name and the actual vs expected type — when coercion is not possible.

python
import pandas as pd
from typing import Literal

# ── Schema specification ──────────────────────────────────────────────────────
# Define the columns this script depends on and their expected pandas dtypes.
# "numeric" and "datetime" are handled as type groups for coercion;
# all other values are matched exactly via dtype.kind.
EXPECTED_SCHEMA: dict[str, Literal["numeric", "datetime", "string", "bool"]] = {
    "order_id":       "string",
    "order_date":     "datetime",
    "revenue":        "numeric",
    "customer_tier":  "string",
    "is_renewal":     "bool",
}

def validate_and_coerce(
    df: pd.DataFrame,
    schema: dict[str, str]
) -> pd.DataFrame:
    """
    Validate df against schema. Attempt safe coercions for type mismatches.
    Raises ValueError with an informative message if any column cannot be coerced.
    """
    df = df.copy()

    # Check for missing columns first — a missing column is fatal
    missing = set(schema.keys()) - set(df.columns)
    if missing:
        raise ValueError(
            f"Schema validation failed: expected columns not present: {sorted(missing)}. "
            f"Available columns: {sorted(df.columns.tolist())}"
        )

    coercion_log: list[str] = []

    for col, expected_type in schema.items():
        actual_dtype = df[col].dtype

        if expected_type == "numeric":
            if not pd.api.types.is_numeric_dtype(actual_dtype):
                coercion_log.append(f"{col}: {actual_dtype} -> numeric (coerced)")
                df[col] = pd.to_numeric(df[col], errors="coerce")
                null_count = df[col].isna().sum()
                if null_count > 0:
                    coercion_log.append(
                        f"  WARNING: {null_count} values in '{col}' could not be "
                        f"converted to numeric and were set to NaN"
                    )

        elif expected_type == "datetime":
            if not pd.api.types.is_datetime64_any_dtype(actual_dtype):
                coercion_log.append(f"{col}: {actual_dtype} -> datetime (coerced)")
                df[col] = pd.to_datetime(df[col], errors="coerce", format="mixed")
                null_count = df[col].isna().sum()
                if null_count > 0:
                    coercion_log.append(
                        f"  WARNING: {null_count} values in '{col}' could not be "
                        f"parsed as datetime and were set to NaT"
                    )

        elif expected_type == "string":
            if not pd.api.types.is_string_dtype(actual_dtype):
                coercion_log.append(f"{col}: {actual_dtype} -> string (coerced)")
                df[col] = df[col].astype(str)

        elif expected_type == "bool":
            if not pd.api.types.is_bool_dtype(actual_dtype):
                # Handle common boolean representations: 1/0, "true"/"false", "yes"/"no"
                bool_map = {
                    "true": True, "false": False,
                    "yes": True,  "no": False,
                    "1": True,    "0": False,
                    1: True,      0: False,
                }
                try:
                    # Preserve original values before mapping so the error message
                    # can show the unrecognized values rather than NaN
                    original_col = df[col].copy()
                    mapped = df[col].map(bool_map)
                    if mapped.isna().any():
                        unrecognized = original_col.loc[mapped.isna()].unique().tolist()
                        raise ValueError(
                            f"Column '{col}': expected bool, found '{actual_dtype}' "
                            f"with values that could not be mapped to True/False. "
                            f"Unrecognized values: {unrecognized}"
                        )
                    df[col] = mapped
                    coercion_log.append(f"{col}: {actual_dtype} -> bool (coerced)")
                except ValueError:
                    raise
                except Exception as exc:
                    raise ValueError(str(exc)) from exc

    if coercion_log:
        # Surface coercions as a non-fatal Python warning. warnings.warn
        # does not raise, so the refresh proceeds — but the coercion log
        # is recorded for whoever inspects the step's script output
        import warnings
        warnings.warn(
            "Schema coercions applied:\n" + "\n".join(coercion_log),
            stacklevel=2
        )

    return df


# ── Apply to the injected dataset ─────────────────────────────────────────────
df = validate_and_coerce(dataset, EXPECTED_SCHEMA)

# Continue with transformation logic on the validated DataFrame
# ...

dataset = df

The value of this pattern over a bare column check is that it surfaces the problem at the point of schema change, not three visuals downstream when a measure returns blank or a chart renders unexpected results. The coercion log in the warning message gives whoever debugs the refresh a precise account of what changed and what was fixed automatically versus what requires attention in the source system.

Why Schema Drift Hits Power BI Python Steps Harder Than SQL Queries

When a SQL source changes a column type, Power BI's native SQL connector throws a type error at load time and the refresh fails visibly. When a Python transformation step receives a column with an unexpected type, pandas will often silently succeed — applying a string operation to a numeric column, for example, or treating a date column as an object and producing string-concatenated "dates." The failure mode is not a crash; it is wrong data that passes refresh and corrupts the data model. Schema validation in the Python step is the only layer that catches this before it reaches the data model.
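The classic instance of this failure mode takes two lines to reproduce — a numeric column that starts arriving as strings:

```python
import pandas as pd

# A "revenue" column that upstream began delivering as strings
revenue = pd.Series(["10", "2", "5"])

# No error, no warning — pandas concatenates the strings instead of adding
assert revenue.sum() == "1025"

# The same data after explicit coercion behaves correctly
assert pd.to_numeric(revenue).sum() == 17
```

A SQL connector would have refused to load "10" into a numeric column; the Python step happily produced "1025" and would have passed it downstream.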

What Actually Happens When Power BI Runs Your Python Script

There is a detail about the Python integration that almost no documentation spells out clearly, and it has real implications for both performance and security: Power BI communicates with Python through temporary CSV files on disk. When you run a Python transformation step, Power BI serializes the current query's data to a temporary CSV in a system temp directory, starts a Python subprocess, passes the file path as an environment variable, and waits for Python to write its output back to another CSV. Power BI then reads that output and continues the query pipeline.

This means several things that most tutorials skip over. First, your data leaves Power BI's memory and lands on disk in plaintext, even temporarily. If you are working with sensitive PII or regulated data, that temporary file path is accessible on the local machine for the duration of script execution. This matters for compliance teams evaluating whether Python scripts can be used in a production Power BI Desktop workflow on machines with shared access or endpoint monitoring. The temp files are deleted after execution, but this should be documented in your data handling policies.

"Data at rest, even briefly, is data at risk" — a principle consistently reinforced in NIST SP 800-53 guidance on media protection, and one that applies directly to Power BI's temp-file data exchange mechanism.

Second, this serialization roundtrip is why large DataFrames cause disproportionate performance degradation. CSV serialization is slow, and for a 500MB dataset, the write-to-disk and read-from-disk operations can dwarf the actual computation time inside the Python script. This is not a Python performance problem — it is an I/O problem baked into the integration architecture. The practical ceiling for workable performance in this mode is roughly 100,000 to 150,000 rows with a moderate column count, depending on disk speed.
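You can approximate the cost of the exchange mechanism with an in-memory micro-benchmark. Absolute numbers vary with hardware, and a StringIO buffer is faster than Power BI's actual disk writes, so this understates the real overhead — it only illustrates the ratio:

```python
import io
import time

import numpy as np
import pandas as pd

# A synthetic frame standing in for a query's data
df = pd.DataFrame(
    np.random.default_rng(0).normal(size=(100_000, 8)),
    columns=[f"col_{i}" for i in range(8)],
)

# Approximate the exchange: serialize to CSV text, then parse it back
buf = io.StringIO()
t0 = time.perf_counter()
df.to_csv(buf, index=False)
buf.seek(0)
roundtrip = pd.read_csv(buf)
csv_seconds = time.perf_counter() - t0

# The "actual work" for comparison: a vectorized aggregation
t0 = time.perf_counter()
_ = df.mean()
compute_seconds = time.perf_counter() - t0

print(f"CSV roundtrip: {csv_seconds:.3f}s  vs  compute: {compute_seconds:.4f}s")
```

Even without touching a disk, the serialize-and-parse step dwarfs a typical vectorized computation — which is exactly why the script's own runtime is rarely the bottleneck.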

Third, this architecture explains why the Power BI Service imposes stricter limits than Desktop. In the Service, the temp files are written to an isolated compute container, which has lower I/O throughput and a tighter execution sandbox than a local developer machine. The 1-minute visual timeout in the Service is not arbitrary — it reflects the resource constraints of that containerized environment.

Pro Tip

If your organization has data loss prevention (DLP) policies that monitor temp directory writes, coordinate with your security team before deploying Python-in-Power BI workflows. Document the temp file behavior in your data flow diagrams. The files are ephemeral, but the write events are real.

What the Script Wrapper Actually Does to Your Code

When you click Run in a Python visual editor, Power BI does not execute your script directly. It wraps it inside a generated file called PythonScriptWrapper.PY and executes that. Understanding the exact contents of that wrapper explains several behaviors that appear mysterious until you see them in context.

The auto-generated prolog sets the matplotlib backend to Agg by calling matplotlib.use('Agg') before your code runs. This is Power BI forcing a non-interactive, file-based rendering mode — it does not render to a screen. Your plt.show() call is then monkey-patched: the wrapper replaces it with a lambda that calls matplotlib.pyplot.savefig(str(uuid.uuid1())), writing the output as a PNG file named with a UUID in the temp folder. Power BI then reads that PNG file and renders it as the visual. This is the mechanism behind the static image behavior — the figure is literally written to disk and read back as a file, not streamed.

The wrapper also sets a fixed figsize in inches and locks the DPI to 72. The figsize is calculated from the canvas dimensions of the visual in the report layout. This means any figsize or dpi argument you pass in your own plt.figure() call is overridden by the wrapper. If your matplotlib script sets dpi=150 to increase sharpness, that setting has no effect — the wrapper's 72 DPI lock takes precedence. To get a larger, higher-fidelity visual, you must resize the visual tile in the Power BI canvas; there is no way to increase resolution from inside the script.

The wrapper also changes the working directory to a UUID-named temp folder using os.chdir(). This is executed before your script runs, which means any relative file path in your script resolves to that ephemeral temp directory, not to the directory where your .pbix file lives or where your Python environment is installed. Loading a model with a relative path like open("model.pkl", "rb") will silently fail. All file references in Python scripts embedded in Power BI visuals must use absolute paths.
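A cheap way to make this failure loud instead of silent is to validate paths up front. A sketch using PureWindowsPath so the check itself runs anywhere — the helper name is my own:

```python
from pathlib import PureWindowsPath

def assert_absolute_windows_path(path_str: str) -> PureWindowsPath:
    """Reject relative paths before any file I/O happens.

    Because the wrapper has already os.chdir()'d into a UUID-named temp
    directory, a relative path would resolve somewhere your file will
    never be — failing here produces a clear error instead of a silent miss.
    """
    path = PureWindowsPath(path_str)
    if not path.is_absolute():
        raise ValueError(
            f"Path must be absolute for Power BI Python scripts, got: {path}. "
            f"Relative paths resolve to the wrapper's temp working directory."
        )
    return path
```

Calling this on every file reference at the top of a script turns the most common wrapper-related bug into an immediate, self-explanatory refresh error.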

The preamble also includes two lines that are always executed regardless of what you write:

python
# This preamble runs before your code — you cannot suppress it
dataset = pandas.DataFrame(YourField1, YourField2)
dataset = dataset.drop_duplicates()

The dataset.drop_duplicates() call is not optional and is not documented prominently. Power BI silently removes duplicate rows from the dataset before your script sees it. If your chart is missing rows that you know exist in the data model — particularly in scenarios with repeated values across dimensions — this is why. The documented fix is to include an index field from your data model in the visual's Values well. That index column makes every row unique from Power BI's perspective, so drop_duplicates() leaves all rows intact.
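Both the silent row loss and the index-column fix are easy to verify in plain pandas:

```python
import pandas as pd

# Two genuinely distinct sales rows that happen to share all field values
df = pd.DataFrame({"region": ["West", "West"], "amount": [100, 100]})

# What the wrapper's preamble does before your script runs:
# one of the two rows silently disappears
assert len(df.drop_duplicates()) == 1

# The documented fix: an index column makes every row unique,
# so drop_duplicates() leaves all rows intact
df_with_index = df.assign(row_index=range(len(df)))
assert len(df_with_index.drop_duplicates()) == 2
```

In Power BI the index column comes from the data model (an index added in Power Query, for example) and goes into the visual's Values well alongside the real fields.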

The Four Things You Cannot Override

The wrapper sets four things before your script runs that you cannot undo from inside the script: (1) the matplotlib backend is forced to Agg via matplotlib.use('Agg') — any plt.switch_backend() call in your own script has no effect because the backend is already locked before your code runs, (2) dpi is locked at 72 — any dpi argument in plt.figure() is overridden, (3) dataset.drop_duplicates() has already run — your script never sees the original duplicate rows, and (4) the working directory has been changed to a UUID-named temp folder — all relative paths in your script resolve to that temp directory, not to any project folder. Any script that tries to fight these will either silently fail or produce unexpected results. Work with the wrapper, not against it.

One additional constraint that catches teams off guard: Python visuals do not support renaming input columns. If you add a field to the visual's Values well, bind it in your script by its original column name, and then rename that column in the data model, the visual will throw an error on the next refresh. The column name your script references must always match the column name exactly as it exists in the data model at the time the script runs.

The Service Runtime: Python 3.11, a Fixed Package Set, and No Networking

The Power BI Service runs Python visuals in an isolated compute container on Python 3.11 for workspaces backed by Premium or Fabric capacity. The supported package list is a fixed, versioned set managed by Microsoft — you cannot install additional packages or upgrade versions in the Service. Microsoft's supported packages page was last updated in June 2025 and has not changed since; as of April 2026, that list includes pandas 2.2.2, numpy 2.0.0, matplotlib 3.8.4, seaborn 0.13.2, scikit-learn 1.5.0, scipy 1.13.1, statsmodels 0.14.2, xgboost 2.0.3, and Pillow 10.4.0, among others. Always cross-check against the official supported packages list before deploying, as Microsoft may update the list after this article was written.

Notice what is not on that list. shap is not supported in the Service, which means the SHAP waterfall plot in the churn scoring workflow described earlier will render correctly in Power BI Desktop but produce an error when the report is published. requests, psycopg2, sqlalchemy, and every other package that makes outbound network connections are also blocked — not just absent from the supported list, but actively blocked at the network layer. A script that queries an external API or database will fail with a connection refused error in the Service even if the package itself were available. Always validate your Python visual scripts against the current supported package list before deploying to production.
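A pre-deployment check can compare your local environment against the pinned Service versions quoted above. The helper and the subset of packages checked are illustrative; swap in the current official list before relying on it:

```python
from importlib import metadata

# A subset of the Service-pinned versions quoted in this section
SERVICE_VERSIONS = {
    "pandas": "2.2.2",
    "numpy": "2.0.0",
    "matplotlib": "3.8.4",
    "scikit-learn": "1.5.0",
}

def find_mismatches(expected: dict[str, str]) -> list[str]:
    """Compare locally installed package versions against pinned Service
    versions. Returns human-readable mismatch descriptions; an empty list
    means Desktop and Service environments are aligned."""
    problems: list[str] = []
    for package, want in expected.items():
        try:
            have = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed locally (Service pins {want})")
            continue
        if have != want:
            problems.append(f"{package}: local {have} != Service {want}")
    return problems
```

Running `find_mismatches(SERVICE_VERSIONS)` before publishing catches the "works in Desktop, breaks in the Service" class of failure at development time instead of after deployment.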

Conda / Anaconda Warning

Microsoft's official documentation explicitly recommends against using Conda or Anaconda distributions with Power BI Desktop. Conda environments require an activation step that Power BI's subprocess launch does not perform, which causes import failures even for packages that are correctly installed. If you need to use a specific package that is only available through Conda, the only supported workaround is to launch Power BI Desktop from an already-activated Conda prompt — a fragile arrangement that does not survive a normal Windows desktop shortcut. For production use, standard Python from python.org with a venv-based virtual environment is the reliable option.

Check Your Understanding
The Script Wrapper and CSV Mechanism
Question 1 of 3
out of 3 correct

Setting Up a Proper Python Environment for Power BI

Windows Only

Power BI Desktop is a Windows application. All local Python setup and virtual environment configuration in this section uses Windows paths and commands. If you are on macOS or Linux, you can still write and test Python scripts that are destined for Power BI — your development machine runs the script, but the .pbix file and Power BI Desktop itself must run on a Windows machine. The virtual environment commands on macOS/Linux use forward slashes and source env/bin/activate instead of \Scripts\activate, but the package pinning strategy and library choices are identical.

Power BI Desktop uses whatever Python executable it finds at the path you configure in File > Options and settings > Options > Python scripting. If you point it at your system Python, you will eventually break something: a global package upgrade for one project will silently change the behavior of a Power BI report that depends on an older version of the same package.

The correct approach is to create a dedicated virtual environment for Power BI and point Desktop at that environment's Python executable. This is not optional if you take report reproducibility seriously. Here is how to set one up properly:

bash
# Create a dedicated environment for Power BI scripts
python -m venv C:\Envs\powerbi-env

# Activate it
C:\Envs\powerbi-env\Scripts\activate

# Pin to the versions the Power BI Service runs in its cloud container.
# This ensures scripts that work in Desktop also work after publishing.
# Service-locked versions (as of April 2026): pandas 2.2.2, numpy 2.0.0,
# matplotlib 3.8.4, seaborn 0.13.2, scikit-learn 1.5.0, scipy 1.13.1,
# statsmodels 0.14.2, xgboost 2.0.3.
# shap is NOT on the Service supported list — Desktop only.
# shap 0.51.0 (current) requires Python >=3.11, which matches the Service
# runtime — so this venv is compatible for local Desktop use.
pip install pandas==2.2.2 numpy==2.0.0 matplotlib==3.8.4 seaborn==0.13.2 ^
            scikit-learn==1.5.0 scipy==1.13.1 statsmodels==0.14.2 xgboost==2.0.3 ^
            shap==0.51.0

# Freeze the exact environment for documentation and reproducibility
pip freeze > requirements-powerbi.txt

Then in Power BI Desktop, set the Python home directory to C:\Envs\powerbi-env. Power BI will use the python.exe inside that environment's Scripts folder. From this point, any changes you make to your system Python or other projects have zero impact on your Power BI reports.

Keep the requirements-powerbi.txt file in version control alongside your .pbix file. When another team member needs to run or modify the report, they can recreate the exact environment with pip install -r requirements-powerbi.txt.
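A small script can verify that a teammate's recreated environment actually matches the Service pins before they touch the report. This is a sketch; the `SERVICE_PINS` dict mirrors the versions listed in the install block above and needs updating whenever Microsoft revises the Service list.

```python
from importlib.metadata import version, PackageNotFoundError

# Mirrors the Service-locked versions listed above (April 2026).
SERVICE_PINS = {
    "pandas": "2.2.2", "numpy": "2.0.0", "matplotlib": "3.8.4",
    "seaborn": "0.13.2", "scikit-learn": "1.5.0", "scipy": "1.13.1",
    "statsmodels": "0.14.2", "xgboost": "2.0.3",
}

def check_env(pins: dict = SERVICE_PINS) -> list:
    """Return (package, expected, found) for every pin that does not match.
    found is None when the package is not installed at all."""
    mismatches = []
    for pkg, expected in pins.items():
        try:
            found = version(pkg)
        except PackageNotFoundError:
            found = None
        if found != expected:
            mismatches.append((pkg, expected, found))
    return mismatches

if __name__ == "__main__":
    for pkg, expected, found in check_env():
        print(f"{pkg}: expected {expected}, found {found}")
```

An empty result means the local environment matches the Service pins; anything printed is a drift that can produce Desktop-works-Service-fails surprises.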

Compatibility Warning

The Power BI Service does not use your local virtual environment. When you publish a report, Python visuals run in Microsoft's cloud environment using their supported package list. Your local virtual environment governs Desktop behavior only. Always validate published Python visuals against the official supported packages list before deploying to production.

Spot the Bug
Something is wrong with this Power BI Python environment setup

A developer sets up a virtual environment for Power BI and reports that Power BI keeps using the wrong Python installation despite pointing the settings at the new environment. Can you identify the mistake in their setup script?

bash
# Step 1: Create the environment
python -m venv C:\Envs\powerbi-env

# Step 2: Install packages
pip install pandas==2.2.2 matplotlib==3.8.4 seaborn==0.13.2

# Step 3: In Power BI Options > Python scripting
# Home directory set to: C:\Envs\powerbi-env\Scripts

# Step 4: Verify
# File > Options > Python scripting shows the path is saved

What is the bug?

The PBIR Format Change and What It Means for Python Reports

On January 25, 2026, Microsoft began rolling out the Power BI Enhanced Report Format (PBIR) as the default for all new reports created in the Power BI Service, with full deployment expected to complete by the end of April 2026. Power BI Desktop followed with the March 2026 release. Existing reports are automatically converted to PBIR when edited and saved — as of March 2026, automatic upgrades apply to reports with fewer than 100 visuals; larger reports are being rolled out gradually. PBIR remains technically in Preview during this default rollout, with General Availability planned for Q3 2026, at which point PBIR will become the only supported report format. This is relevant to Python users for reasons that most coverage of this change ignores.

PBIR stores report definitions as a folder-based structure rather than a single .pbix binary. Each visual's definition, including the Python script content for Python visuals, is stored as a readable JSON file inside the report folder. This has a meaningful implication: Python scripts inside your Power BI reports are now diffable, reviewable, and version-controllable in Git at the individual script level. Teams that previously had to diff binary .pbix files to audit Python script changes can now do so with standard text tooling.

This also means that script injection — where a malicious actor modifies a Python script inside a .pbix file to exfiltrate data or execute arbitrary code on the analyst's machine — becomes more visible in a proper CI/CD pipeline with PBIR. It is still a theoretical risk worth including in your threat model, but the PBIR format makes code review of embedded Python scripts a real, practical option rather than an afterthought.

Python, Copilot, and the Emerging Hybrid Workflow

Power BI's integration of Copilot — which has been expanding steadily through 2025 and into 2026, including the announcement at FabCon 2025 that Copilot would be available across all Fabric SKUs — creates a new dynamic for Python users. Copilot can now generate DAX and M code, answer natural language questions about your data, and summarize report insights. But it cannot write or debug the Python scripts embedded in your reports. This creates a clear functional boundary: Copilot handles the conversational BI layer, Python handles the programmatic analytics layer, and Semantic Link in Fabric handles the bridge between them.

The emerging pattern for advanced analytics teams is a three-tier workflow. In the Fabric notebook layer, Python handles model training, complex transformations, and data preparation, surfacing results to a lakehouse. In the Power BI layer, DAX and native visuals handle interactive exploration, with Copilot accelerating measure creation and insight summarization. Python visuals in Power BI fill the narrow remaining gap: specialized chart types (SHAP plots, custom statistical distributions, bespoke geospatial renders) that neither native Power BI visuals nor Copilot-generated content can produce.

Understanding this three-tier model is important for deciding when to reach for Python visuals versus when to push work into a Fabric notebook. The rule of thumb: if the Python output is a visualization for consumption in an existing Power BI report, use a Python visual. If the Python work is analytical — training, scoring, complex feature engineering, or anything that benefits from a full notebook execution environment — it belongs in Fabric.

Interactive Decision Tool
Where Does Your Python Work Belong?
What is the primary output of your Python work?
Recommendation → Python Visual in Power BI
Your output is a chart that will be consumed inside an existing Power BI report. Use a Python visual (Mode 3). Keep the script focused on rendering only — pass pre-aggregated data from your data model rather than doing heavy computation inside the visual script. Remember the 5-minute Desktop timeout and 1-minute Service timeout.
One more question
Is this transformation logic that runs every time the report refreshes, or is it a one-time or scheduled data preparation job?
Recommendation → Power Query Python Transformation (Mode 2)
Your transformation belongs in a Power Query Python step. It runs as part of the report refresh pipeline and feeds the data model directly. Keep the script under 30 minutes and pre-filter the dataset variable upstream before it reaches Python. Remember that networking is blocked in the Service — your script cannot call external APIs.
Recommendation → Fabric Notebook + Lakehouse
Scheduled or one-time data preparation is better handled in a Fabric Notebook where you have full compute resources, no networking restrictions, and proper scheduling via Fabric pipelines. Store the output in a Lakehouse and connect Power BI to it via DirectLake or import mode.
Recommendation → Fabric Notebook
Model training and batch scoring belong in a Fabric Notebook, not embedded in Power BI. Use the notebook for training, scoring, and writing results to a Lakehouse. Then surface the predictions in Power BI as a standard dataset column — no Python visual needed. Semantic Link lets you pull Power BI semantic model data directly into the notebook for feature engineering.

Mocking the Dataset Variable for Local Development

The best practices section of this guide advises writing and testing scripts in an external editor before pasting them into Power BI. But there is a specific obstacle: Power BI transformation scripts depend on a dataset variable that is injected by Power BI at runtime. If you open the script in VS Code or a Jupyter notebook, that variable does not exist, and the script fails immediately.

The solution is to mock the dataset variable in your local environment. Add a small block at the top of your development file that loads a representative sample of the data from a CSV, then assign it to dataset. Wrap the block so it only runs when Power BI is not present:

python
import pandas as pd

# ── Local development mock ───────────────────────────────────────────────────
# Power BI injects 'dataset' at runtime. When running outside Power BI
# (in VS Code, Jupyter, etc.), load a sample CSV to stand in for it.
# Remove or comment out this block before pasting into Power BI.
try:
    dataset
except NameError:
    dataset = pd.read_csv(r"C:\Dev\sample_data\sales_sample.csv")
# ── End mock ─────────────────────────────────────────────────────────────────

# Your actual script starts here — identical to what you paste into Power BI
df = dataset.copy()

# ... your transformation logic ...

dataset = df

The try/except NameError pattern checks whether dataset already exists in the namespace. If Power BI has injected it, the block does nothing. If you are running locally and the variable does not exist, it loads from the CSV. This means the exact same file runs correctly in both environments without modification.

The sample CSV should be a realistic slice of the production data — not a toy dataset with two rows, but enough rows and enough edge cases (nulls, mixed formats, duplicate keys) to surface real bugs before the script ever runs inside Power BI. Export 1,000 to 5,000 rows from the actual data source and commit that file alongside the script in version control.
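One way to build such a sample is to keep every row that contains a null (the rows most likely to break a script) and top up with a random slice of the rest. This is a sketch, assuming the production extract is already loaded in a DataFrame; the `make_dev_sample` helper and the output path are illustrative.

```python
import pandas as pd

def make_dev_sample(df: pd.DataFrame, n: int = 2000, seed: int = 42) -> pd.DataFrame:
    """Sample about n rows, keeping all rows with nulls so edge cases survive."""
    edge = df[df.isna().any(axis=1)]           # rows with at least one null
    rest = df.drop(index=edge.index)
    k = max(0, min(n - len(edge), len(rest)))  # top up to n with clean rows
    return pd.concat([edge, rest.sample(k, random_state=seed)]).sort_index()

# make_dev_sample(full_df).to_csv(r"C:\Dev\sample_data\sales_sample.csv", index=False)
```

Deterministic sampling (the fixed seed) means two teammates regenerating the sample from the same extract get the same file, which keeps the committed CSV stable in version control.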

Absolute Paths, Not Relative Paths

Because the PythonScriptWrapper.PY changes the working directory to a temp folder before your script runs, any relative file path in your script will resolve to that temp folder — not to where your .pbix file or development files are. Always use absolute paths in Power BI Python scripts. This applies to model files loaded from pickle, reference CSVs, lookup tables, or any other file reference.
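A cheap defensive habit is to fail fast on any relative path before it silently resolves against the wrapper's temp folder. This is a sketch, not a Power BI feature; the `require_absolute` helper and the commented path are illustrative.

```python
from pathlib import Path

def require_absolute(path_str: str) -> Path:
    """Reject relative paths, which would resolve against the wrapper's
    temp working directory instead of your project folder."""
    p = Path(path_str)
    if not p.is_absolute():
        raise ValueError(f"Power BI script paths must be absolute, got: {path_str}")
    return p

# MODEL_PATH = require_absolute(r"C:\Dev\models\churn_model.pkl")
```

A ValueError raised here surfaces in Power BI's error output with a clear message, which is far easier to diagnose than a FileNotFoundError pointing at a GUID-named temp folder.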

Error Handling Patterns for Production Scripts

When a Python script inside Power BI fails — unhandled exception, missing column, a package that is not installed — Power BI's error surface is minimal. In a data source script, you get a generic "DataSource.Error" with a truncated message that often obscures the actual Python traceback. In a transformation step, the query step shows as failing without surfacing the underlying exception. In a visual, the chart placeholder displays "Script error" and offers no further detail; diagnosing it means opening the visual's script and inspecting it manually.

Building error handling into your scripts at the right level prevents these silent failures from becoming support incidents in production reports.

Column existence guards. The cleaning and transformation scripts in this guide already demonstrate this pattern — checking whether a column exists before operating on it with if "column_name" in df.columns:. This prevents the script from throwing a KeyError when the upstream data changes shape, which is the most common cause of Python step failures in production Power BI reports.

python
import pandas as pd

df = dataset.copy()

# Guard against schema changes in the upstream data source
required_columns = {"order_id", "revenue", "order_date"}
missing = required_columns - set(df.columns)
if missing:
    raise ValueError(
        f"Power BI Python script: expected columns not found: {missing}. "
        f"Available columns: {list(df.columns)}"
    )

# Proceed knowing the required columns exist
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

dataset = df

Raising a ValueError with a descriptive message is better than letting a KeyError propagate — the message you write will appear in Power BI's error output, making the cause immediately clear to whoever is debugging the report.

Empty DataFrame handling. If the upstream data source returns no rows — because a filter was applied that excludes everything, or because the API returned an empty response — your script may fail on operations that assume at least one row exists. Check for emptiness early:

python
import pandas as pd

df = dataset.copy()

if df.empty:
    # Return an empty DataFrame with the expected schema
    # so downstream Power BI steps do not fail on a missing table
    dataset = pd.DataFrame(columns=["order_id", "revenue", "sentiment_label"])
else:
    # Normal processing
    # ...
    dataset = df

Returning an empty DataFrame with the correct column schema is almost always better than letting the script error. Power BI loads the empty table, visuals show no data, and the report stays functional. When data returns on the next refresh, everything recovers automatically.

Network timeout handling for data source scripts. If your Python data source script calls an external API and the API is slow or unavailable, the script will hang until the 30-minute Power Query timeout is reached. Set explicit timeouts on every network call and wrap them in try/except:

python
import pandas as pd
import requests

try:
    response = requests.get(
        "https://api.example.com/data",
        headers={"Authorization": "Bearer YOUR_TOKEN"},
        timeout=30,        # fail after 30 seconds, not 30 minutes
    )
    response.raise_for_status()
    data = response.json()
    df = pd.json_normalize(data["results"])
except requests.exceptions.Timeout:
    raise RuntimeError(
        "API request timed out after 30 seconds. "
        "Check API availability or increase the timeout value."
    )
except requests.exceptions.HTTPError as e:
    raise RuntimeError(f"API returned HTTP {e.response.status_code}: {e}")
except Exception as e:
    raise RuntimeError(f"Data source script failed: {type(e).__name__}: {e}")
Spot the Bug
Something is wrong with this Power BI Python transformation script

A developer writes a Mode 2 transformation script that loads sentiment scores and returns the enriched table. Reports using this script work fine in Power BI Desktop but fail on every cloud refresh with a generic DataSource.Error. Can you identify the bug?

python
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import requests

# Fetch company name lookup from internal API
lookup_resp = requests.get(
    "https://internal-api.corp.example.com/v1/company-names",
    headers={"Authorization": "Bearer TOKEN"},
    timeout=30,
)
lookup_resp.raise_for_status()
canonical_names = lookup_resp.json()["names"]

df = dataset.copy()
analyser = SentimentIntensityAnalyzer()
df["sentiment_compound"] = df["ticket_body"].apply(
    lambda t: analyser.polarity_scores(str(t))["compound"]
)
dataset = df

What is the bug?

Troubleshooting Common Errors

These are the errors that come up most often in Power BI Python workflows, and what each one actually means.

"Python home directory is not set" / "No Python environment found." Power BI Desktop has not been pointed at a Python installation. Go to File > Options and settings > Options > Python scripting and set the home directory. Make sure you are pointing at the environment root (for example, C:\Envs\powerbi-env), not the \Scripts subfolder. Relaunch Power BI Desktop after saving.

ModuleNotFoundError for a package you know is installed. The most common cause: the package is installed in your system Python or a different virtual environment, but Power BI is pointing at a different Python. Confirm which Python Power BI is actually using — open the script editor, run import sys; print(sys.executable) as a test script, and verify the path matches your intended environment. A second common cause is installing the package while the venv is not activated, which installs it globally instead.
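A throwaway diagnostic makes that check concrete. Paste something like this sketch as a temporary Python data source and load the resulting table to see exactly which interpreter Power BI is using:

```python
import sys
import pandas as pd

# One fact per row: which interpreter and pandas version Power BI actually runs.
env_info = pd.DataFrame({
    "setting": ["executable", "python_version", "pandas_version"],
    "value": [sys.executable, sys.version.split()[0], pd.__version__],
})
```

If the executable path does not point inside your dedicated venv, the home directory setting is wrong, regardless of what the Options dialog appears to show.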

Data source script returns nothing / navigator shows no tables. Power BI only recognizes variables in the script's global namespace that are pandas DataFrames. If your DataFrame is created inside a function and the function is called but its return value is not assigned to a module-level variable, Power BI will find nothing. Make sure the final DataFrame is assigned to a variable at the top level of the script — not inside a function, class, or conditional block.

Transformation step output is empty / "Run Python script" step shows no rows. Two common causes: either the script raised an exception that was swallowed (check the applied step for an error icon), or the script did not reassign the processed data back to dataset. In a transformation step, Power BI specifically looks for the variable named dataset when collecting output. Leaving the final result in df without the closing dataset = df assignment produces an empty table.

Python visual is missing rows I know are in the data. The PythonScriptWrapper.PY always runs dataset = dataset.drop_duplicates() before your script. Add an index column from your data model to the visual's Values well. That index makes every row unique, so drop_duplicates() leaves all rows intact.
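The effect is easy to reproduce outside Power BI. This sketch mimics the wrapper's deduplication and shows why an index column preserves the rows:

```python
import pandas as pd

df = pd.DataFrame({"category": ["A", "A", "B"], "value": [10, 10, 7]})

# What the wrapper does to the visual's input: identical rows collapse.
deduped = df.drop_duplicates()    # 2 rows: one of the (A, 10) pair is gone

# Adding an index column makes every row unique, so nothing is dropped.
df["row_id"] = range(len(df))
survived = df.drop_duplicates()   # all 3 rows intact
```

In Power BI, the equivalent fix is an index column created in Power Query (Add Column > Index Column) and dragged into the visual's Values well.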

Script works in Desktop but fails after publishing to the Service. Two distinct causes. First, the script imports a package that is not on the Service supported list — check every import against Microsoft's current supported packages page. Second, the script makes an outbound network call (requests.get, a database connection, anything that opens a socket) — all outbound networking is blocked in the Service container. Both of these must be resolved for the published report to work.

Scheduled refresh fails in the Service for a Python data source report. Scheduled refresh for Python data sources requires a personal mode on-premises data gateway running on the machine where the Python environment is set up. If no gateway is configured, the Service cannot execute the Python script and refresh will fail with a gateway error. The standard enterprise gateway does not support Python data sources. See the gateway section above for the full explanation and alternative architecture.

"Script error" in a published Python visual / visual renders blank. In the Service, Python visual scripts are limited to 1 minute of execution time. If your script is doing heavy computation inside the visual (model inference, large data joins, complex feature engineering), it will time out. Move the computation upstream into a Power Query transformation step, which has a 30-minute limit, and pass only pre-computed values to the visual script. If the visual is in an "Embed for your customers" or "Publish to web" report, it will blank out after May 2026 regardless of script content — see the deprecation section.

Conda packages fail to import despite being installed. Conda environments require an activation step that Power BI's subprocess launch skips. Import errors for packages in a Conda environment are expected. Use a standard CPython venv instead. If you must use a Conda-managed package, the only workaround is launching Power BI Desktop from an already-activated Conda command prompt — not the desktop shortcut.

Precise Technical Facts Even Experienced Users Often Miss

These are specific, documented facts about Power BI's Python integration that rarely appear in tutorials and are not surfaced by the UI. Each one has caused real production problems for practitioners who did not know it.

The PythonScriptWrapper.PY prolog implicitly imports pandas — you do not need to

Every time Power BI executes a Python script, it generates a temporary PythonScriptWrapper.PY file inside a GUID-named folder under your system's temp directory. The auto-generated prolog of that wrapper already contains import pandas and sets up the matplotlib Agg backend. This means that pandas is available in the namespace before your script body runs, even if you do not write an explicit import pandas statement. You should still write your own imports for clarity and maintainability, but this explains a specific behavior: if you debug by examining the wrapper file, you will find imports you did not write.

The wrapper's epilog iterates over all variables in your script's global namespace, checks whether each one is of type pandas.DataFrame, and writes a CSV file named after the variable. This is how Power BI discovers every DataFrame your script produces — it is a namespace scan, not a return value. Variables created inside a function that are not exposed at the top level are invisible to this scan.
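Conceptually, the epilog behaves like the sketch below. This is an illustration of the mechanism, not the wrapper's actual code:

```python
import pandas as pd

def find_output_frames(namespace: dict) -> dict:
    """Mimic the wrapper epilog: collect every top-level DataFrame."""
    return {name: obj for name, obj in namespace.items()
            if isinstance(obj, pd.DataFrame)}

result = pd.DataFrame({"x": [1, 2]})
row_count = 2   # not a DataFrame, so invisible to the scan

frames = find_output_frames({"result": result, "row_count": row_count})
# frames contains only "result"; each entry would become one CSV on disk
```

This also explains the earlier troubleshooting entry: a DataFrame built inside a function never appears in the module namespace, so the scan cannot find it.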

The Desktop timeout for Python visuals is 5 minutes — not 1 minute

The 1-minute execution timeout for Python visuals applies in the Power BI Service. In Power BI Desktop, the timeout is 5 minutes (300,000 milliseconds, as documented in the PythonScriptTimeoutException stack trace). This asymmetry means a visual that renders fine locally can time out immediately after publishing. If your Python visual script takes between 1 and 5 minutes locally, it will never work in the Service. The practical implication: benchmark your visual script in Desktop and treat anything above 20–30 seconds with suspicion — Service execution has more overhead due to containerization.

Python visual input data is silently truncated at two separate limits

Power BI applies two independent truncation rules to data passed into a Python visual, and both happen silently — no error is raised, no warning is shown in the report. First, if the input dataset exceeds 150,000 rows, only the top 150,000 rows are passed to the script; the rest are discarded. Second, if any string value in the input data exceeds 32,766 characters, that value is silently cut off at exactly 32,766 characters. This is not specific to Python visuals — it is the maximum character length that the Power Query engine can load into a single cell. The 32,766 figure is not arbitrary: it corresponds to the maximum signed 16-bit integer value (32,767) minus one, a constraint inherited from the underlying ADO.NET data pipeline. For NLP scripts that operate on long document text, this truncation can silently corrupt your analysis without raising any exception.
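A defensive check can at least make the truncation visible before an NLP step runs. This is a sketch; the constant comes from the limit described above, and the `flag_truncated` helper is illustrative:

```python
import pandas as pd

POWER_QUERY_MAX_CHARS = 32_766

def flag_truncated(df: pd.DataFrame, text_col: str) -> pd.DataFrame:
    """Mark rows whose text sits at the Power Query cell limit.
    Such values were almost certainly cut off upstream."""
    out = df.copy()
    out["possibly_truncated"] = (
        out[text_col].astype(str).str.len() >= POWER_QUERY_MAX_CHARS
    )
    return out
```

Surfacing the flag as a column lets the report itself show how many documents were clipped, rather than silently scoring partial text.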

All Python visuals are rendered at exactly 72 DPI regardless of display settings

The PythonScriptWrapper.PY prolog hard-codes the matplotlib figure call as matplotlib.pyplot.figure(figsize=(width_inches, height_inches), dpi=72), where the width and height are calculated from the current visual container size in Power BI. You cannot override the 72 DPI constraint from inside your script — Power BI generates this call before your script body runs. Calling plt.figure(dpi=150) inside your script does not change the output resolution because the wrapper already initialized the figure. On high-density displays, Python visuals will appear noticeably less sharp than native Power BI visuals. This is a platform constraint, not a configuration issue.
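The arithmetic is worth making explicit. Given a container size in inches (as the wrapper computes it), the output bitmap is locked to 72 pixels per inch; the 10-by-6-inch figure here is a hypothetical container size:

```python
def rendered_pixels(width_inches: float, height_inches: float) -> tuple:
    """Output resolution of a Python visual: container size times fixed 72 DPI."""
    DPI = 72  # hard-coded by the wrapper; cannot be overridden in-script
    return round(width_inches * DPI), round(height_inches * DPI)

print(rendered_pixels(10, 6))   # → (720, 432)
```

On a 2x high-density display that 720x432 image is stretched across roughly 1440x864 physical pixels, which is why Python visuals look softer than native ones.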

The Power BI Service Python runtime is Python 3.11, not the version you use locally

For reports in Premium or Fabric-backed workspaces, the Service runs Python 3.11. The supported package list (last updated by Microsoft in June 2025, unchanged as of April 2026) specifies this runtime explicitly. The exact pinned package versions available in the Service are: pandas 2.2.2, numpy 2.0.0, matplotlib 3.8.4, seaborn 0.13.2, scikit-learn 1.5.0, scipy 1.13.1, statsmodels 0.14.2, and xgboost 2.0.3. These are the only packages available. There is no requests, no sqlalchemy, no thefuzz, no vaderSentiment — and no mechanism to install additional packages in the Service runtime. If your script imports anything not on that list, it will fail with a ModuleNotFoundError after publishing, even if it works perfectly in Desktop.

xgboost is on the Service supported list — scikit-learn's RandomForestClassifier is not the only option

A common assumption is that the Service only supports basic statistical packages. In fact, xgboost 2.0.3 is on the current supported list, which means gradient-boosted tree models can run in published Python visuals without a gateway. The package is available in Premium/Fabric workspaces. This is not widely known and is not called out in the main Power BI Python documentation pages.

The Service payload limit for Python visuals is 30 MB — not the input data size

The 30 MB limit documented for Python visuals in the Service applies to the total compressed payload, which includes both the input data and the Python script itself. This is distinct from the 150,000-row limit, which applies first. A dataset with 50,000 rows of wide, dense numeric data can exceed 30 MB after serialization even though it is well below the row limit. The limit is checked after compression, so the actual uncompressed size of your input data can be higher, but there is no formula to predict the exact compressed size without measuring. Always test your published visual with real production data volumes before relying on it in a deployed report.

Each Python DataFrame you create in a transformation script gets its own temporary CSV on disk

When Power BI's wrapper epilog scans your script's namespace after execution, it writes every pandas.DataFrame it finds to a separate CSV file in the temporary wrapper folder. If your script creates intermediate DataFrames — even ones you intended as local working variables — each one is written to disk. A script that creates three intermediate DataFrames and one final output DataFrame writes four CSV files, even if only one is relevant. This is a measurable I/O cost on large datasets. The mitigation is to use del intermediate_df after you are done with intermediate variables, so the namespace scan does not find them. This is one of the few cases in normal Python code where explicitly deleting a local variable has a concrete performance effect.
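For example, a Mode 2 script with two intermediates might end like this. The sample data here is a stand-in for the injected dataset variable so the sketch runs outside Power BI:

```python
import pandas as pd

dataset = pd.DataFrame({"region": ["E", "E", "W"], "sales": [1, 2, 3]})  # stand-in

df = dataset.copy()
totals = df.groupby("region", as_index=False)["sales"].sum()   # intermediate
df = df.merge(totals, on="region", suffixes=("", "_total"))    # intermediate
dataset = df

# Without this, the epilog scan would also write totals and df
# to their own temp CSVs: pure wasted I/O on large data.
del totals, df
```

After the del, the namespace scan finds only dataset, so exactly one CSV is written back to Power BI.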

Renaming a column in the Power BI Fields pane breaks the Python visual script

Python visuals receive column names exactly as they exist in the data model at the time the visual runs — not as you have renamed them in the visual's field configuration. If you drag a column called OrderDate into the visual's Values well and then rename it to Date using the field rename option in the visual's formatting pane, the script still receives OrderDate as the column name. Conversely, if someone later renames the underlying column in the data model from OrderDate to order_date, the script breaks because dataset["OrderDate"] no longer exists. Python visual scripts are tightly coupled to the physical column names in the model and must be updated when those names change.
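One way to soften that coupling is a tolerant lookup that accepts the known historical names of a column. This is a sketch, not a Power BI feature; the `find_column` helper and candidate names are illustrative:

```python
import pandas as pd

def find_column(df: pd.DataFrame, candidates: list) -> str:
    """Return the first candidate column name present in the DataFrame."""
    for name in candidates:
        if name in df.columns:
            return name
    raise KeyError(f"None of {candidates} found; columns: {list(df.columns)}")

# date_col = find_column(dataset, ["OrderDate", "order_date", "Order Date"])
```

The script then uses `dataset[date_col]` throughout, so a model-side rename only requires adding one entry to the candidate list instead of editing every reference.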

Sovereign cloud tenants and tenants in regions without Fabric availability cannot use Python visuals in the Service

Python visuals in the Power BI Service are only supported for tenants in regions where Microsoft Fabric is available. Sovereign cloud customers — including US Government Community Cloud (GCC), GCC High, DoD, and China cloud tenants — cannot run Python visuals in the Service at all, regardless of license type. They can still use Python in Power BI Desktop and through personal gateways, but the cloud execution container that Python visuals require is not provisioned in sovereign regions. This is distinct from the May 2026 deprecation, which affects specific embedding scenarios; sovereign cloud exclusion is a permanent architectural constraint.

Best Practices for Python in Power BI

Write and test your scripts in an external editor first. Power BI's built-in Python editor has no syntax highlighting, no autocompletion, and no debugger. Use VS Code, PyCharm, or a Jupyter notebook to develop and test your scripts before pasting them into Power BI. Use the dataset mocking pattern described in the local development section to replicate Power BI's data injection outside the tool.

Keep your scripts focused. Do not try to replicate your entire data pipeline in a single Power BI Python script. Use Python for the specific tasks where it adds value — connecting to unsupported sources, applying statistical transformations, or rendering advanced visuals — and let Power BI handle everything else.

Always use absolute file paths. The PythonScriptWrapper.PY changes the working directory to a temporary folder before your script runs. Any relative path in your script resolves to that temp directory, not to your .pbix file location. Use absolute paths for every file reference: model pickle files, lookup CSVs, configuration files.

Use type hints in your functions (per PEP 484). When your Power BI reports depend on Python scripts that other team members may need to maintain, type-annotated functions are dramatically easier to understand and debug.

Add column existence guards and empty DataFrame handling. Production scripts that assume a fixed schema will fail silently when the upstream data changes. See the error handling section for the specific patterns.

Mind the data volume. The CSV-based data exchange mechanism means that performance degrades with large datasets. Filter your data upstream before it reaches the Python script, and avoid loading more rows than the script actually needs.

Pin your Python version and package versions. Power BI uses the local Python installation, and different Python or package versions can produce different results. Document which Python version and package versions your reports depend on, and consider using virtual environments.

Be aware of the Service constraints. A script that works in Power BI Desktop may not work when the report is published to the Power BI Service. Networking is blocked, only supported packages are available, and the execution timeout is stricter. Validate every import against the current supported packages list before deploying.

Prefer Python as an analytical preprocessing layer over Python as a renderer. Python visuals are static, cannot cross-filter other visuals, and carry a 1-minute Service timeout. The more powerful pattern is using Python as a transformation step to compute new columns — sentiment scores, anomaly flags, match confidence values, risk tiers — and then surfacing those columns through native Power BI visuals, which are fully interactive, cross-filter correctly, and work without any Python execution at report view time. Every use case where you are tempted to build a Python visual is worth asking: could I compute this in a transformation step and display it natively instead?

Plan for the gateway before you publish. If the report uses a Python data source and will be refreshed on a schedule in the Service, you need a personal mode gateway running on the same machine as the Python environment. Plan this before you commit to the architecture, not after you encounter a failing scheduled refresh. See the gateway section for the constraints and alternatives.

Step-by-Step Setup
How to Set Up and Use Python in Power BI Desktop
8 steps  ·  ~45 min total
  1. Install Python from python.org

    Download and install the standard CPython distribution from python.org. Do not use Conda or Anaconda — Power BI does not perform the activation step those environments require, which causes import failures even for correctly installed packages.

  2. Create a dedicated virtual environment

    Run python -m venv C:\Envs\powerbi-env. A dedicated venv ensures that package upgrades for other projects never silently change the behavior of your Power BI reports.

  3. Install and pin packages to Service-matching versions

    Activate the environment, then run: pip install pandas==2.2.2 numpy==2.0.0 matplotlib==3.8.4 seaborn==0.13.2 scikit-learn==1.5.0 scipy==1.13.1 statsmodels==0.14.2 xgboost==2.0.3. These match what the Power BI Service runs in its cloud container. Then freeze: pip freeze > requirements-powerbi.txt and commit that file to version control.

  4. Point Power BI Desktop at the environment root

    Open File > Options and settings > Options > Python scripting. Set the home directory to C:\Envs\powerbi-env — the root folder, not the \Scripts subfolder. Power BI resolves python.exe from the root automatically.

  5. Use Python as a data source (Mode 1)

    Select Get Data > Other > Python script. Write a script that assigns one or more pandas.DataFrame variables. Power BI scans the namespace and presents each DataFrame as a loadable table. Use this mode for APIs, legacy databases, or custom file formats that have no native connector.

  6. Use Python as a Power Query transformation step (Mode 2)

    In Power Query Editor, select Transform > Run Python script. The current query result arrives as the dataset variable. Modify it and assign the result back to dataset. Use this mode for fuzzy matching, NLP scoring, feature engineering — anything that runs during the refresh pipeline and feeds the data model.

  7. Use Python as a visual renderer (Mode 3)

    In the report canvas, add a Python visual from the Visualizations pane. Drop fields into the Values well. Fields arrive as columns in the dataset DataFrame. Write matplotlib or seaborn rendering code. Call plt.show() — the PythonScriptWrapper.PY redirects this to save a PNG at 72 DPI. Remember: visuals are static images; they do not cross-filter other visuals.

  8. Validate against the Service package list before publishing

    The Power BI Service uses a fixed, versioned package set — you cannot install additional libraries. Cross-reference your imports against Microsoft's supported packages list before publishing. Scripts that import requests, psycopg2, shap, or sqlalchemy will fail in the Service.
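Step 5's data source mode (Mode 1) can be sketched standalone. The JSON payload, file-free parsing, and column names below are invented for illustration; the only contract that matters is that the script leaves one or more pandas DataFrames in its namespace.

```python
import pandas as pd

# Hypothetical nested payload standing in for an API or legacy-format
# response; in a real script this would come from your own parsing logic.
raw = {
    "orders": [
        {"id": 1, "customer": {"name": "Acme"}, "total": 120.5},
        {"id": 2, "customer": {"name": "Globex"}, "total": 89.0},
    ]
}

# Flatten the nested structure into a tabular DataFrame.
orders = pd.json_normalize(raw["orders"])

# No explicit return needed: Power BI scans the namespace after execution
# and offers every pandas.DataFrame variable (here, `orders`) as a table.
```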
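Step 6's transformation mode (Mode 2) follows the modify-and-assign-back pattern. In Power BI the `dataset` variable is injected for you; this standalone sketch builds a stand-in with an assumed `amount` column and derives an anomaly flag, the kind of column that native visuals can then slice and cross-filter.

```python
import pandas as pd

# Stand-in for the injected `dataset` (column name is illustrative).
dataset = pd.DataFrame({"amount": [10.0, 12.0, 11.0, 250.0, 9.0]})

# Flag rows more than 1.5 standard deviations from the mean.
mean, std = dataset["amount"].mean(), dataset["amount"].std()
dataset["is_anomaly"] = (dataset["amount"] - mean).abs() > 1.5 * std

# Assign back to `dataset`; Power BI reads this variable as the step output.
```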
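Step 7's visual mode (Mode 3) can be sketched as follows. Inside Power BI you would end with `plt.show()` and let the wrapper intercept it; to keep this sketch runnable outside Power BI, it uses the headless Agg backend and saves the PNG explicitly. The field names are invented.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend, for this standalone sketch only
import matplotlib.pyplot as plt

# In a Python visual, `dataset` holds the fields dropped into the Values
# well; this stand-in assumes `category` and `sales` fields.
dataset = pd.DataFrame({"category": ["A", "B", "C"], "sales": [120, 75, 180]})

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(dataset["category"], dataset["sales"])
ax.set_title("Sales by category")

# Inside Power BI you would call plt.show() here; the wrapper redirects it
# to save a 72 DPI PNG. We save explicitly to demonstrate the output.
fig.savefig("visual.png", dpi=72)
```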

FAQ

Frequently Asked Questions

What Python object type must a Power BI data source script return?

Power BI only recognizes pandas.DataFrame objects. If your script produces a NumPy array, a list of dictionaries, or any other structure, you must convert it to a DataFrame before Power BI can use it. Power BI scans the script namespace after execution and presents every DataFrame variable it finds as a loadable table.
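For instance, converting a parsed API response held as a list of dictionaries into a loadable table is a one-line wrap (the records below are invented):

```python
import pandas as pd

# A list of dictionaries is not loadable by Power BI on its own;
# wrap it in a DataFrame first.
records = [{"region": "West", "units": 42}, {"region": "East", "units": 31}]
sales = pd.DataFrame(records)

# `sales` is now a pandas.DataFrame, so Power BI's namespace scan will
# offer it as a loadable table.
```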

What Python libraries are available in the Power BI Service?

The Power BI Service supports a fixed, curated list of open-source packages running on Python 3.11 for Premium and Fabric-backed workspaces. The current set includes pandas 2.2.2, numpy 2.0.0, matplotlib 3.8.4, seaborn 0.13.2, scikit-learn 1.5.0, scipy 1.13.1, statsmodels 0.14.2, and xgboost 2.0.3, among others.

Outbound networking is blocked in the cloud container, so packages that make external HTTP or database connections — including requests, psycopg2, and sqlalchemy — will not function. shap is also absent from the supported list, meaning SHAP visualizations that work in Desktop will error after publishing.

Will Python visuals still work after the May 2026 deprecation?

Python visuals in the Embed for your customers (app owns data) and Publish to web scenarios will stop rendering after May 1, 2026 — the charts will display as blank. Python visuals in standard reports and Embed for your organization scenarios are unaffected. If your organization uses the app-owns-data embedding pattern with any Python visuals, migration to native Power BI visuals or DAX-based analytics must be completed before that date.

How does Power BI actually pass data to a Python script?

Power BI communicates with Python through temporary CSV files on disk. When a Python transformation or visual script runs, Power BI serializes the current query data to a temp CSV, launches a Python subprocess, waits for execution to finish, then reads the output CSV back into the query pipeline. The temp files are deleted after execution.

This architecture has two practical consequences: large DataFrames cause disproportionate slowdowns because CSV serialization is slow, and the data lands on disk in plaintext during the script's execution window — a compliance consideration for sensitive datasets.
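A third, subtler consequence of the CSV exchange is worth knowing when debugging: CSV round trips lose type fidelity, so for example datetimes come back as plain strings unless re-parsed. A small standalone simulation (file name invented):

```python
import pandas as pd

# Simulate the exchange: serialize a frame to CSV, then read it back,
# roughly what Power BI does around every Python script execution.
frame = pd.DataFrame({"ts": pd.to_datetime(["2026-01-01"]), "value": [1.5]})
frame.to_csv("exchange.csv", index=False)
back = pd.read_csv("exchange.csv")

# back["ts"] is now a column of strings (object dtype), not datetime64.
```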

Why do rows disappear silently in a Power BI Python visual?

Power BI wraps every Python visual script inside an auto-generated file called PythonScriptWrapper.PY. That wrapper always executes dataset = dataset.drop_duplicates() before your script code runs. This behavior is not prominently documented. If your chart is missing rows that you know exist in the data model, this is why.

The fix is to add an index field from your data model to the visual's Values well. That index makes every row unique from Power BI's perspective, so drop_duplicates() leaves all rows intact.
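Both the silent row loss and the fix can be reproduced outside Power BI (the values below are invented):

```python
import pandas as pd

# Two genuinely different transactions whose visible field values happen
# to be identical.
dataset = pd.DataFrame({"store": ["A", "A"], "amount": [9.99, 9.99]})

# What the wrapper effectively does before your script runs:
deduped = dataset.drop_duplicates()   # one row silently dropped

# Adding a unique index column makes each row distinct:
dataset["row_id"] = [1, 2]
kept = dataset.drop_duplicates()      # both rows survive
```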

Can you increase the resolution of a Python visual beyond 72 DPI?

No. The PythonScriptWrapper.PY locks the DPI at 72 and overrides plt.show() with a call to plt.savefig() at that fixed resolution before your script runs. Any dpi argument you pass in your own plt.figure() call is ignored. The only way to get a larger output is to resize the visual tile on the Power BI report canvas.

Why does Conda fail with Power BI Desktop?

Conda environments require an activation step that Power BI's subprocess launch does not perform. This causes import failures even for packages that are correctly installed in the Conda environment. Microsoft's documentation explicitly recommends against using Conda or Anaconda with Power BI Desktop. The supported approach is standard CPython from python.org with a venv-based virtual environment pointing at the environment root (not the \Scripts subfolder).

How do you do fuzzy entity matching between two Power BI tables using Python?

Use Python as a Power Query transformation step (Mode 2). Install the thefuzz library in your Power BI virtual environment. Your script receives one table as the dataset variable. Load the second table in a separate Power Query query, then merge the two DataFrames in your script using thefuzz.process.extractOne() to compute a match key and confidence score for each row. Return a DataFrame with the original columns plus match_key and match_confidence. Power BI merges the two queries natively on match_key, and match_confidence becomes a slicer so analysts can set their own threshold.
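A standalone sketch of the matching step. The answer above uses thefuzz.process.extractOne; this version substitutes the standard library's difflib.SequenceMatcher so it runs without third-party installs, and all table values and column names are invented. The output shape, the original columns plus match_key and match_confidence, is the same.

```python
import difflib
import pandas as pd

# One table arrives as `dataset`; the reference list stands in for the
# second query's key column.
dataset = pd.DataFrame({"vendor": ["Acme Corp.", "Globex Inc"]})
reference = ["ACME Corporation", "Globex Incorporated", "Initech"]

def best_match(name):
    # Score each candidate 0-100 (thefuzz's scale) and keep the best.
    scored = [
        (ref, 100 * difflib.SequenceMatcher(None, name.lower(), ref.lower()).ratio())
        for ref in reference
    ]
    return max(scored, key=lambda pair: pair[1])

matches = dataset["vendor"].apply(best_match)
dataset["match_key"] = matches.str[0]
dataset["match_confidence"] = matches.str[1]
```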

How do you run NLP sentiment analysis so the results work with native Power BI visuals?

Use Python as a transformation step (Mode 2) rather than a Python visual. Apply VADER (vaderSentiment) or TextBlob to the text column in the dataset DataFrame. Return the DataFrame with two new columns: sentiment_compound (a float from -1.0 to 1.0) and sentiment_label (Positive, Neutral, or Negative).

Power BI loads those columns into the data model as regular fields, so native bar charts, KPI cards, time-series lines, and slicers all work correctly and participate in cross-filtering — something a Python visual cannot do.
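The labeling step can be sketched standalone. The compound scores below are invented stand-ins for VADER's output, and the plus-or-minus 0.05 neutral band is the convention commonly used with VADER, an assumption you should tune for your own data.

```python
import pandas as pd

# Assume `sentiment_compound` has already been computed (VADER's analyzer
# would supply it); these scores are made up for illustration.
dataset = pd.DataFrame({"sentiment_compound": [0.72, 0.01, -0.55]})

def to_label(score):
    # Map the compound score to a categorical label for native visuals.
    if score >= 0.05:
        return "Positive"
    if score <= -0.05:
        return "Negative"
    return "Neutral"

dataset["sentiment_label"] = dataset["sentiment_compound"].apply(to_label)
```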

What is Semantic Link in Microsoft Fabric?

Semantic Link connects Power BI semantic models directly to Fabric notebooks via the SemPy Python library. It introduces FabricDataFrame — a pandas-compatible object that carries Power BI metadata including relationships, hierarchies, and column descriptions. You can retrieve and evaluate DAX measures from your Power BI models directly in Python, without reimplementing the business logic in the notebook. This is the forward direction for Python analytics in the Microsoft ecosystem: Fabric notebooks for compute, Power BI for visualization and distribution.

Why does scheduled refresh fail for reports with Python data sources?

Scheduled refresh for Python data sources in the Power BI Service requires a personal mode on-premises data gateway running on the same machine as your Python environment. The standard enterprise gateway does not support Python data sources — this is a hard constraint, not a configuration option. An important operational detail: the personal mode gateway cannot run as a background Windows service — it is an interactive application only, which means a user must be logged in and the application must be open on the machine at refresh time. Without a personal gateway online at refresh time, the Service cannot execute the Python script and the refresh will fail with a gateway error. If your organization does not permit personal gateways, the alternative is to move the Python data extraction into a Fabric notebook, write the results to a Lakehouse, and connect Power BI to the Lakehouse table using a standard connector.

Does Power BI support incremental refresh with Python data sources?

Native Power BI incremental refresh — which folds a date range filter into the connector query — does not work automatically with Python data sources the way it does with SQL connectors. Power BI's RangeStart and RangeEnd parameters must be explicitly threaded through M into your Python script via parameter substitution, which is complex and fragile. The practical alternative for many teams is a rolling window approach: the Python script fetches only recent records on each refresh, while older historical data is maintained as a separate cached table loaded without a Python step. For full incremental refresh at scale, the recommended architecture is to move the Python extraction into a Fabric notebook writing to a Lakehouse, which supports proper incremental refresh through the standard Lakehouse connector.
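The rolling-window filter can be sketched as follows; dates, column names, and the 30-day window are invented, and a fixed reference date keeps the sketch deterministic where a real script would use pd.Timestamp.now().

```python
import pandas as pd

# Stand-in for the freshly fetched data (in practice, the API response).
dataset = pd.DataFrame({
    "order_date": pd.to_datetime(["2026-01-05", "2026-03-20", "2026-04-01"]),
    "amount": [100.0, 50.0, 75.0],
})

# Keep only the last 30 days; older history lives in a separate cached
# table loaded without a Python step.
cutoff = pd.Timestamp("2026-04-05") - pd.Timedelta(days=30)
dataset = dataset[dataset["order_date"] >= cutoff]
```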

How do you test a Power BI Python transformation script outside of Power BI?

Use the dataset mocking pattern. Add a try/except NameError block at the top of your script that loads a representative CSV sample and assigns it to dataset if the variable does not already exist. When Power BI executes the script, dataset is already injected and the mock block is skipped. When you run the script locally in VS Code or Jupyter, the variable does not exist and the CSV is loaded instead. The same script runs correctly in both environments without modification. Always use absolute paths in the mock, and make sure your sample data includes enough edge cases — nulls, mixed formats, boundary values — to surface real bugs before the script goes into Power BI.
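A runnable version of the pattern; the first statement exists only so this sketch can create its own sample file, and in practice the mock would point at an absolute path to a real extract.

```python
import pandas as pd

# Setup for this standalone sketch only: write a tiny sample file standing
# in for your representative extract.
pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]}).to_csv(
    "orders_sample.csv", index=False)

# The mock pattern itself. When Power BI runs the script, `dataset` is
# already injected and the except branch never executes; when run locally,
# the NameError fires and the CSV sample loads instead.
try:
    dataset
except NameError:
    dataset = pd.read_csv("orders_sample.csv")

# ... transformation logic continues here, identical in both environments.
dataset["amount_with_tax"] = dataset["amount"] * 1.1
```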

What should you do when a Python script in Power BI fails with a generic error?

Power BI's error surface for Python scripts is minimal. The most useful debugging steps, in order: first, check that the script runs correctly outside Power BI using the dataset mock pattern — if it fails locally, you have a reproducible error with a proper Python traceback. Second, confirm which Python environment Power BI is actually launching — run import sys; print(sys.executable) in a test script — and verify that every import resolves in that environment. Third, add column existence guards and explicit error messages at the top of the script so that schema mismatches produce readable error text rather than a KeyError. Fourth, if the script works in Desktop but not in the Service, the cause is almost always either an unsupported package import or an outbound network call that is blocked in the Service container.
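A sketch of such a column-existence guard (the column names are invented):

```python
import pandas as pd

# Stand-in for the injected `dataset`.
dataset = pd.DataFrame({"order_id": [1], "amount": [10.0]})

def require_columns(df, required):
    # Fail fast with a readable message instead of a KeyError mid-script.
    missing = set(required) - set(df.columns)
    if missing:
        raise ValueError(f"Input is missing expected columns: {sorted(missing)}")

require_columns(dataset, {"order_id", "amount"})  # passes silently

try:
    require_columns(dataset, {"order_id", "amount", "region"})
except ValueError as err:
    print(err)  # readable schema error instead of a bare KeyError
```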

How does Python in Power BI compare to R for data science use cases?

Both languages are supported in all three integration modes (data source, transformation, visual) and face the same Service constraints — fixed package list, no networking, 1-minute visual timeout, and the May 2026 embedded deprecation. Python has broader package coverage for data engineering, NLP, and machine learning, and is the primary language for Microsoft Fabric's notebook environment. R has deeper statistical modeling packages (particularly for mixed-effects models, survival analysis, and econometrics) and may be the better fit for teams already fluent in the tidyverse. For teams starting fresh with no existing preference, Python is the more future-aligned choice given its Fabric integration. Nothing prevents a team from using both — each operates independently within Power BI.

How do you handle a Python script that works in Desktop but fails after publishing?

Two causes account for the large majority of publish failures. First, the script imports a package that is not on the Power BI Service supported list — cross-reference every import statement against Microsoft's current supported packages page. Second, the script makes an outbound network call (requests.get, a database connection, any socket operation) — all outbound networking is blocked in the Service container. Packages like requests, psycopg2, and sqlalchemy are absent from the supported list specifically because they make network connections. For data source scripts that require networking, move the extraction to a Fabric notebook and read the results from a Lakehouse.

Where Things Stand

Python's integration with Power BI has been available since 2018, and in the years since, it has gone from a niche feature to a core part of how advanced analytics teams operate. The combination gives you the analytical depth of Python's scientific computing ecosystem with the polished, interactive, enterprise-grade visualization and distribution layer of Power BI.

The landscape is shifting in several directions simultaneously. Microsoft Fabric's Python Notebooks and Semantic Link are moving the heavier Python workloads out of the report and into a proper cloud-native compute environment. The deprecation of Python visuals in embedded scenarios signals that Microsoft sees the future of Python-in-BI as a first-class notebook experience rather than inline scripting. Copilot is handling more of the conversational analytics layer that once required custom tooling. And the PBIR format — which became the default in the Power BI Service in January 2026 and in Power BI Desktop with the March 2026 release — has made Python scripts inside Power BI reports reviewable, diffable, and auditable in version control for the first time.

What has not changed is the core contract. Power BI still consumes pandas DataFrames. Data connections still go through PEP 249-compliant modules. Scripts still benefit from PEP 484 type hints for maintainability. Python still solves the problems that DAX and M cannot. And the temporary CSV mechanism that powers the integration is still on disk, still subject to the same I/O constraints, and still worth understanding at that level of detail if you are serious about production deployments.

The question to ask about every Python script in your Power BI workflow is no longer just "does this work?" — it is "does this belong here, or does it belong in a Fabric notebook?" Getting that decision right is where the real leverage is in 2026.

That is real understanding — not a copy-paste tutorial. Now go build something with it.
Certificate of Completion
Final Exam
Pass mark: 80% · Score 80% or higher to receive your certificate

Enter your name as you want it to appear on your certificate, then start the exam. Your name is used only to generate your certificate and is never transmitted or stored anywhere.
