LangChain DocumentLoaders

Use LangChain to get text from any data source.

LangChain is a library that provides a kitchen sink of tools for LLMs, particularly for integrating LLMs with other tools.

One underrated feature of LangChain is DocumentLoaders, which allow you to acquire text data from nearly any source. This is super useful even if you aren’t using LLMs at all! (You can also hijack these loaders to acquire data for fine-tuning.)

For example, if you are trying to get data from a website as text, here are some useful DocumentLoaders:

  1. RecursiveUrlLoader
  2. SeleniumURLLoader
  3. SitemapLoader: this is explored below.

I find it useful to combine LangChain DocumentLoaders with HuggingFace datasets, because this lets you save and version your data, and do other fun things like semantic search with FAISS.

As of this writing, there are over 125 different kinds of DocumentLoaders. So far, every time I have needed to quickly acquire data, I have found a loader for it.

Sitemap Loader

Sitemaps are a nice way to see a listing of all pages on a site. This is useful for acquiring all of the text from a large site that might contain many pages. Below, I use the SitemapLoader to get all of the text from https://quarto.org.
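
To make the idea concrete, here is a minimal sketch of what a sitemap contains and how its entries can be listed with the standard library alone. The sample XML and the helper name list_sitemap_entries are made up for illustration; this is not how SitemapLoader is implemented, only a look at the kind of file it reads.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up sitemap in the standard sitemaps.org format (illustration only).
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/about.html</loc>
    <lastmod>2023-07-05T19:35:15.135Z</lastmod>
  </url>
  <url>
    <loc>https://example.com/docs/index.html</loc>
  </url>
</urlset>
"""

# Sitemap elements live in this XML namespace, so lookups must be qualified.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap_entries(xml_text: str) -> list:
    """Return one dict per <url> entry with its loc and (optional) lastmod."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            # lastmod is optional in a sitemap, so this may be None
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
        })
    return entries


entries = list_sitemap_entries(SAMPLE_SITEMAP)
for entry in entries:
    print(entry["loc"], entry["lastmod"])
```

The loc and lastmod fields are the same ones that show up in the metadata of each loaded page.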

Warning

There is currently a bug in LangChain: this commit broke the SitemapLoader, so I had to install the version released right before it. I downgraded to v0.0.202 via pip install langchain==0.0.202

import nest_asyncio
nest_asyncio.apply() # you don't need this line outside notebooks
from langchain.document_loaders.sitemap import SitemapLoader
sitemap_loader = SitemapLoader(web_path="https://quarto.org/sitemap.xml")
sitemap_loader.requests_per_second = 4
docs = sitemap_loader.load()
Fetching pages: 100%|####################################| 269/269 [00:16<00:00, 16.21it/s]
print(f'There are {len(docs)} pages')
There are 269 pages

Let’s look at the content of one page:

example = docs[0]
example.dict()
{'page_content': '\n\n\n\n\nQuarto - About Quarto\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nOverview\n\n\n\nGet Started\n\n\n\nGuide\n\n\n\nExtensions\n\n\n\nReference\n\n\n\nGallery\n\n\n\nBlog\n\n\n\nHelp\n\n\n\n\n\nReport a Bug\n\n\n\n\nAsk a Question\n\n\n\n\nFAQ\n\n\n\n\n \n\n\n\n\n\n \n\n\n\n\n\n\n\n\nOn this page\n\nGoals\nProject\nContribute\n\nEdit this pageReport an issue\n\n\n\n\n\nAbout Quarto\nOpen source tools for scientific and technical publishing\n\n\n\n\n\nGoals\nThe overarching goal of Quarto is to make the process of creating and collaborating on scientific and technical documents dramatically better. We hope to do this in several dimensions:\n\nCreate a writing and publishing environment with great integrated tools for technical content. We want to make authoring with embedded code, equations, figures, complex diagrams, interactive widgets, citations, cross references, and the myriad other special requirements of scientific discourse straightforward and productive for everyone.\nHelp authors take full advantage of the web as a connected, interactive platform for communications, while still providing the ability to create excellent printed output from the same document source. Researchers shouldn’t need to choose between LaTeX, MS Word, and HTML but rather be able to author documents that target all of them at the same time.\nMake reproducible research and publications the norm rather than the exception. Reproducibility requires that the code and data required to create a manuscript are an integrated part of it. However, this isn’t often straightforward in practice—Quarto aims to make it easier to adopt a reproducible workflow than not.\n\nQuarto is open source software licensed under the GNU GPL v2. We believe that it’s better for everyone if the tools used for research and science are free and open. 
Reproducibility, widespread sharing of knowledge and techniques, and the leveling of the playing field by eliminating cost barriers are but a few of the shared benefits of free software in science.\n\n\nProject\nAt the core of Quarto is Pandoc, a powerful and flexible document processing tool. Quarto adds a number of facilities to Pandoc aimed at scientific and technical publishing, including:\n\nEmbedding code and output from Python, R, and JavaScript via integration with Jupyter, Knitr, and Observable.\nA variety of extensions to Pandoc markdown useful for technical writing including cross-references, sub-figures, layout panels, hoverable citations and footnotes, callouts, and more.\nA project system for rendering groups of documents at once, sharing options across documents, and producing aggregate output like websites and books.\n\nDevelopment of Quarto is sponsored by Posit, PBC, where we previously created a similar system (R Markdown) that shared the same goals, but was targeted principally at users of the R language. The same core team works on both Quarto and R Markdown:\n\nJ.J. Allaire (@jjallaire)\nChristophe Dervieux (@cderv)\nCarlos Scheidegger (@cscheid)\nCharles Teague (@dragonstyle)\nYihui Xie (@yihui)\n\nWith Quarto, we are hoping to bring these tools to a much wider audience.\nQuarto is a registered trademark of Posit. Please see our trademark policy for guidelines on usage of the Quarto trademark.\n\n\nContribute\nYou can contribute to Quarto in many ways:\n\nBy opening issues to provide feedback and share ideas.\nBy submitting Pull Request (PR) to fix opened issues\nBy submitting Pull Request (PR) to suggest new features (it is considered good practice to open an issue for discussion before working on a pull request for a new feature).\n\nPlease be mindful of our code of conduct as you interact with other community members.\n\nPull Requests\nPull requests are very welcome! 
Here’s how to contribute via PR:\n\nFork the repository, clone it locally, and make your changes in a new branch specific to the PR. For example:\n\n\nTerminal\n\n# clone your fork\n$ git clone https://github.com/<username>/quarto-cli\n\n# configure for your platform (./configure.sh or ./configure.cmd for windows)\n$ cd quarto-cli\n$ ./configure.sh\n\n# checkout a new branch\n$ git checkout -b feature/newthing\n\nFor significant changes (e.g more than small bug fixes), ensure that you have signed the individual or corporate contributor agreement as appropriate. You can send the signed copy to jj@rstudio.com.\nSubmit the pull request. It is ok to submit as draft in your are still working on it but would like some feedback from us. It always good to share in the open that you are working on it.\n\nWe’ll try to be as responsive as possible in reviewing and accepting pull requests.\n\n\n \n\n\n \n\n\n\nProudly supported by \n\n\n\n\n\nAbout\n\n\n\n\nFAQ\n\n\n\n\nLicense\n\n\n\n\nTrademark\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',
 'metadata': {'source': 'https://quarto.org/about.html',
  'loc': 'https://quarto.org/about.html',
  'lastmod': '2023-07-05T19:35:15.135Z'}}

Clean the data

When we look at this page, we can see a bunch of unwanted text: the navbar and the sidenav are showing up in page_content. We can pass a custom parsing function to the loader to strip them out:

from bs4 import BeautifulSoup


def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    # Drop navigation, footer, and header markup so only the page body remains.
    exclude = content.find_all(["nav", "footer", "header", "head"])
    for element in exclude:
        element.decompose()

    # get_text() already returns a str, so no extra conversion is needed.
    return content.get_text().strip()


sitemap_loader = SitemapLoader(web_path="https://quarto.org/sitemap.xml",
                              parsing_function=remove_nav_and_header_elements)
sitemap_loader.requests_per_second = 4
docs = sitemap_loader.load()
Fetching pages: 100%|####################################| 269/269 [00:05<00:00, 52.00it/s]
example = docs[0]
example.dict()
{'page_content': 'Goals\nThe overarching goal of Quarto is to make the process of creating and collaborating on scientific and technical documents dramatically better. We hope to do this in several dimensions:\n\nCreate a writing and publishing environment with great integrated tools for technical content. We want to make authoring with embedded code, equations, figures, complex diagrams, interactive widgets, citations, cross references, and the myriad other special requirements of scientific discourse straightforward and productive for everyone.\nHelp authors take full advantage of the web as a connected, interactive platform for communications, while still providing the ability to create excellent printed output from the same document source. Researchers shouldn’t need to choose between LaTeX, MS Word, and HTML but rather be able to author documents that target all of them at the same time.\nMake reproducible research and publications the norm rather than the exception. Reproducibility requires that the code and data required to create a manuscript are an integrated part of it. However, this isn’t often straightforward in practice—Quarto aims to make it easier to adopt a reproducible workflow than not.\n\nQuarto is open source software licensed under the GNU GPL v2. We believe that it’s better for everyone if the tools used for research and science are free and open. Reproducibility, widespread sharing of knowledge and techniques, and the leveling of the playing field by eliminating cost barriers are but a few of the shared benefits of free software in science.\n\n\nProject\nAt the core of Quarto is Pandoc, a powerful and flexible document processing tool. 
Quarto adds a number of facilities to Pandoc aimed at scientific and technical publishing, including:\n\nEmbedding code and output from Python, R, and JavaScript via integration with Jupyter, Knitr, and Observable.\nA variety of extensions to Pandoc markdown useful for technical writing including cross-references, sub-figures, layout panels, hoverable citations and footnotes, callouts, and more.\nA project system for rendering groups of documents at once, sharing options across documents, and producing aggregate output like websites and books.\n\nDevelopment of Quarto is sponsored by Posit, PBC, where we previously created a similar system (R Markdown) that shared the same goals, but was targeted principally at users of the R language. The same core team works on both Quarto and R Markdown:\n\nJ.J. Allaire (@jjallaire)\nChristophe Dervieux (@cderv)\nCarlos Scheidegger (@cscheid)\nCharles Teague (@dragonstyle)\nYihui Xie (@yihui)\n\nWith Quarto, we are hoping to bring these tools to a much wider audience.\nQuarto is a registered trademark of Posit. Please see our trademark policy for guidelines on usage of the Quarto trademark.\n\n\nContribute\nYou can contribute to Quarto in many ways:\n\nBy opening issues to provide feedback and share ideas.\nBy submitting Pull Request (PR) to fix opened issues\nBy submitting Pull Request (PR) to suggest new features (it is considered good practice to open an issue for discussion before working on a pull request for a new feature).\n\nPlease be mindful of our code of conduct as you interact with other community members.\n\nPull Requests\nPull requests are very welcome! Here’s how to contribute via PR:\n\nFork the repository, clone it locally, and make your changes in a new branch specific to the PR. 
For example:\n\n\nTerminal\n\n# clone your fork\n$ git clone https://github.com/<username>/quarto-cli\n\n# configure for your platform (./configure.sh or ./configure.cmd for windows)\n$ cd quarto-cli\n$ ./configure.sh\n\n# checkout a new branch\n$ git checkout -b feature/newthing\n\nFor significant changes (e.g more than small bug fixes), ensure that you have signed the individual or corporate contributor agreement as appropriate. You can send the signed copy to jj@rstudio.com.\nSubmit the pull request. It is ok to submit as draft in your are still working on it but would like some feedback from us. It always good to share in the open that you are working on it.\n\nWe’ll try to be as responsive as possible in reviewing and accepting pull requests.',
 'metadata': {'source': 'https://quarto.org/about.html',
  'loc': 'https://quarto.org/about.html',
  'lastmod': '2023-07-05T19:35:15.135Z'}}
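
The cleaned text above still contains runs of blank lines (for example, right before "Project"). If that matters for your use case, one option is to normalize whitespace as a post-processing step. The sketch below uses only the standard library; the helper name collapse_whitespace is my own, and the sample string is a shortened stand-in for a real page.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Trim each line and squeeze runs of blank lines down to a single blank line."""
    lines = [line.strip() for line in text.splitlines()]
    squeezed = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return squeezed.strip()


# Shortened stand-in for the raw page_content shown earlier.
raw = "\n\n\nQuarto - About Quarto\n\n\n\n\n \n\nGoals\nThe overarching goal...\n\n\n"
cleaned = collapse_whitespace(raw)
print(repr(cleaned))
```

A function like this could be called on the string returned by the parsing function, or applied later to each page_content value.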

Create a HF Dataset

We can use the from_list method to load that sitemap data into a HF Dataset.

from datasets import Dataset 
repo_name = 'hamel/quarto'
quarto_data = Dataset.from_list([d.dict() for d in docs])
quarto_data
Dataset({
    features: ['page_content', 'metadata'],
    num_rows: 269
})
quarto_data.push_to_hub(repo_name)
Updating downloaded metadata with the new split.
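
Each record passed to from_list is a plain dict with a nested metadata dict. If you would rather have source, loc, and lastmod as top-level columns, you could flatten the records first. This is only a sketch: flatten_record is a hypothetical helper, and the sample record is abbreviated from the output shown earlier.

```python
def flatten_record(record: dict) -> dict:
    """Lift the nested metadata keys to the top level, next to page_content."""
    flat = {"page_content": record["page_content"]}
    flat.update(record.get("metadata", {}))
    return flat


# Abbreviated stand-in for the output of d.dict() on one loaded page.
records = [
    {
        "page_content": "Goals\nThe overarching goal of Quarto is ...",
        "metadata": {
            "source": "https://quarto.org/about.html",
            "loc": "https://quarto.org/about.html",
            "lastmod": "2023-07-05T19:35:15.135Z",
        },
    }
]

rows = [flatten_record(r) for r in records]
print(sorted(rows[0]))
```

The resulting list of flat dicts can be handed to Dataset.from_list in the same way as the nested version.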

Download the data

You can download the data from the HuggingFace Hub like this:

from datasets import load_dataset
remote_data = load_dataset(repo_name)
Using custom data configuration hamel--quarto-b88699e31e28f953
Downloading and preparing dataset None/None (download: Unknown size, generated: 1.81 MiB, post-processed: Unknown size, total: 1.81 MiB) to /Users/hamel/.cache/huggingface/datasets/hamel___parquet/hamel--quarto-b88699e31e28f953/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
Dataset parquet downloaded and prepared to /Users/hamel/.cache/huggingface/datasets/hamel___parquet/hamel--quarto-b88699e31e28f953/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.
remote_data['train'][0]
{'page_content': 'Goals\nThe overarching goal of Quarto is to make the process of creating and collaborating on scientific and technical documents dramatically better. We hope to do this in several dimensions:\n\nCreate a writing and publishing environment with great integrated tools for technical content. We want to make authoring with embedded code, equations, figures, complex diagrams, interactive widgets, citations, cross references, and the myriad other special requirements of scientific discourse straightforward and productive for everyone.\nHelp authors take full advantage of the web as a connected, interactive platform for communications, while still providing the ability to create excellent printed output from the same document source. Researchers shouldn’t need to choose between LaTeX, MS Word, and HTML but rather be able to author documents that target all of them at the same time.\nMake reproducible research and publications the norm rather than the exception. Reproducibility requires that the code and data required to create a manuscript are an integrated part of it. However, this isn’t often straightforward in practice—Quarto aims to make it easier to adopt a reproducible workflow than not.\n\nQuarto is open source software licensed under the GNU GPL v2. We believe that it’s better for everyone if the tools used for research and science are free and open. Reproducibility, widespread sharing of knowledge and techniques, and the leveling of the playing field by eliminating cost barriers are but a few of the shared benefits of free software in science.\n\n\nProject\nAt the core of Quarto is Pandoc, a powerful and flexible document processing tool. 
Quarto adds a number of facilities to Pandoc aimed at scientific and technical publishing, including:\n\nEmbedding code and output from Python, R, and JavaScript via integration with Jupyter, Knitr, and Observable.\nA variety of extensions to Pandoc markdown useful for technical writing including cross-references, sub-figures, layout panels, hoverable citations and footnotes, callouts, and more.\nA project system for rendering groups of documents at once, sharing options across documents, and producing aggregate output like websites and books.\n\nDevelopment of Quarto is sponsored by Posit, PBC, where we previously created a similar system (R Markdown) that shared the same goals, but was targeted principally at users of the R language. The same core team works on both Quarto and R Markdown:\n\nJ.J. Allaire (@jjallaire)\nChristophe Dervieux (@cderv)\nCarlos Scheidegger (@cscheid)\nCharles Teague (@dragonstyle)\nYihui Xie (@yihui)\n\nWith Quarto, we are hoping to bring these tools to a much wider audience.\nQuarto is a registered trademark of Posit. Please see our trademark policy for guidelines on usage of the Quarto trademark.\n\n\nContribute\nYou can contribute to Quarto in many ways:\n\nBy opening issues to provide feedback and share ideas.\nBy submitting Pull Request (PR) to fix opened issues\nBy submitting Pull Request (PR) to suggest new features (it is considered good practice to open an issue for discussion before working on a pull request for a new feature).\n\nPlease be mindful of our code of conduct as you interact with other community members.\n\nPull Requests\nPull requests are very welcome! Here’s how to contribute via PR:\n\nFork the repository, clone it locally, and make your changes in a new branch specific to the PR. 
For example:\n\n\nTerminal\n\n# clone your fork\n$ git clone https://github.com/<username>/quarto-cli\n\n# configure for your platform (./configure.sh or ./configure.cmd for windows)\n$ cd quarto-cli\n$ ./configure.sh\n\n# checkout a new branch\n$ git checkout -b feature/newthing\n\nFor significant changes (e.g more than small bug fixes), ensure that you have signed the individual or corporate contributor agreement as appropriate. You can send the signed copy to jj@rstudio.com.\nSubmit the pull request. It is ok to submit as draft in your are still working on it but would like some feedback from us. It always good to share in the open that you are working on it.\n\nWe’ll try to be as responsive as possible in reviewing and accepting pull requests.',
 'metadata': {'lastmod': '2023-07-05T19:35:15.135Z',
  'loc': 'https://quarto.org/about.html',
  'source': 'https://quarto.org/about.html'}}

GitHub Issues

We can use the GitHubIssuesLoader to get all of the issues from a GitHub repo.

from langchain.document_loaders import GitHubIssuesLoader

This assumes you have set GITHUB_PERSONAL_ACCESS_TOKEN as an environment variable.
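
If you want to fail early with a clear message when the variable is missing, a small check like the following can help. This is a sketch of my own, not part of LangChain; require_github_token is a hypothetical helper, and it accepts an optional mapping so it can be tried without touching the real environment.

```python
import os

TOKEN_VAR = "GITHUB_PERSONAL_ACCESS_TOKEN"


def require_github_token(env=None) -> str:
    """Return the token, or raise a clear error if it is missing or empty."""
    env = os.environ if env is None else env
    token = env.get(TOKEN_VAR, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {TOKEN_VAR} environment variable before loading issues."
        )
    return token


# Demonstration with a fake mapping instead of the real environment.
print(require_github_token({TOKEN_VAR: "dummy-token"}))
```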

loader = GitHubIssuesLoader(
    repo="quarto-dev/quarto-cli",
    state='all',  # get both open and closed issues
    include_prs=False,
)
quarto_issues = loader.load()
len(quarto_issues)
2841

Wow, that’s a lot of issues! Let’s take a look at one. Note that the issue below doesn’t include comments. We would have to get those separately with further API calls, but this is a good start!

quarto_issues[0]
Document(page_content="### Bug description\n\nI am running Quarto to produce a report in which I mix python with R (using 95% R). When I started trying to use python, I believe that RStudio is not recognizing my python installation and packages:\r\n\r\n```\r\nError in py_call_impl(callable, call_args$unnamed, call_args$named) : \r\n  ModuleNotFoundError: No module named 'pandas'\r\n```\r\n\r\nAny ideas?\n\n### Steps to reproduce\n\n---\r\ntitle: 'Nota Técnica - Índice Socioeconômico dos Estudantes'\r\nauthor: ''\r\nformat: \r\n  html:\r\n    encoding: 'UTF-8'\r\n    theme: style.css\r\neditor: visual\r\nlang: pt\r\nexecute:\r\n  echo: false\r\n  warning: false\r\n---\r\n\r\n```{python}\r\n#| label: load-py-pckgs\r\nimport pandas as pd\r\nimport seaborn as sns\r\n```\n\n### Expected behavior\n\nI expected it to load python packages and modules with no problems.\n\n### Actual behavior\n\nError when loading any python package.\n\n### Your environment\n\nIDE: RStudio 2022.07.2 Build 576\r\nR version 4.1.0 (2021-05-18) (can't update because of internal packages built with this version)\r\nOS: Windows 11\n\n### Quarto check output\n\n`quarto check` gives me:\r\nC:\\Users\\joao.freire\\Documents>quarto check\r\n\r\n[>] Checking versions of quarto binary dependencies...\r\n      Pandoc version 3.1.1: OK\r\n      Dart Sass version 1.55.0: OK\r\n[>] Checking versions of quarto dependencies......OK\r\n[>] Checking Quarto installation......OK\r\n      Version: 1.3.433\r\n      Path: C:\\Users\\joao.freire\\AppData\\Local\\Programs\\Quarto\\bin\r\n      CodePage: 1252\r\n\r\n[>] Checking basic markdown render....OK\r\n\r\n[>] Checking Python 3 installation....OK\r\n      Version: 3.11.4\r\n      Path: C:/Program Files/Python311/python.exe\r\n      Jupyter: 5.3.1\r\n      Kernels: python3\r\n\r\n(/) Checking Jupyter engine render....0.00s - Debugger warning: It seems that frozen modules are being used, which may\r\n0.00s - make the debugger miss breakpoints. 
Please pass -Xfrozen_modules=off\r\n0.00s - to python to disable frozen modules.\r\n0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this\r\nvalidation.\r\n0.00s - Debugger warning: It seems that frozen modules are being used, which may\r\n0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off\r\n0.00s - to python to disable frozen modules.\r\n0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this\r\nvalidation.\r\n[>] Checking Jupyter engine render....OK\r\n\r\n[>] Checking R installation...........OK\r\n      Version: 4.1.0\r\n      Path: C:/Users/joao.freire/Documents/R/R-4.1.0\r\n      LibPaths:\r\n        - C:/Users/joao.freire/Documents/R/R-4.1.0/library\r\n      knitr: 1.42\r\n      rmarkdown: 2.23\r\n\r\n[>] Checking Knitr engine render......OK\r\n\r\n\r\nC:\\Users\\joao.freire\\Documents>quarto check\r\n\r\n[>] Checking versions of quarto binary dependencies...\r\n      Pandoc version 3.1.1: OK\r\n      Dart Sass version 1.55.0: OK\r\n[>] Checking versions of quarto dependencies......OK\r\n[>] Checking Quarto installation......OK\r\n      Version: 1.3.433\r\n      Path: C:\\Users\\joao.freire\\AppData\\Local\\Programs\\Quarto\\bin\r\n      CodePage: 1252\r\n\r\n[>] Checking basic markdown render....OK\r\n\r\n[>] Checking Python 3 installation....OK\r\n      Version: 3.11.4\r\n      Path: C:/Program Files/Python311/python.exe\r\n      Jupyter: 5.3.1\r\n      Kernels: python3\r\n\r\n```\r\n(\\) Checking Jupyter engine render....0.00s - Debugger warning: It seems that frozen modules are being used, which may\r\n0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off\r\n0.00s - to python to disable frozen modules.\r\n0.00s - Note: Debugging will proceed. 
Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this\r\nvalidation.\r\n0.00s - Debugger warning: It seems that frozen modules are being used, which may\r\n0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off\r\n0.00s - to python to disable frozen modules.\r\n0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this\r\nvalidation.\r\n[>] Checking Jupyter engine render....OK\r\n\r\n[>] Checking R installation...........OK\r\n      Version: 4.1.0\r\n      Path: C:/Users/joao.freire/Documents/R/R-4.1.0\r\n      LibPaths:\r\n        - C:/Users/joao.freire/Documents/R/R-4.1.0/library\r\n      knitr: 1.42\r\n      rmarkdown: 2.23\r\n\r\n[>] Checking Knitr engine render......OK\r\n```", metadata={'url': 'https://github.com/quarto-dev/quarto-cli/issues/6113', 'title': 'Quarto not recognizing python packages', 'creator': 'joaoaugustofrei', 'created_at': '2023-07-05T21:11:31Z', 'comments': 1, 'state': 'open', 'labels': ['bug'], 'assignee': None, 'milestone': None, 'locked': False, 'number': 6113, 'is_pull_request': False})
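
Since the loaded issue does not carry its comments, here is a rough sketch of how they could be fetched separately. It maps the url field in the metadata to the GitHub REST endpoint for issue comments and uses only the standard library. The helper names are hypothetical, fetch_comments makes a network call and is not run here, and it only returns the first page of results.

```python
import json
import urllib.request
from urllib.parse import urlparse


def comments_api_url(issue_url: str) -> str:
    """Map an issue's HTML URL to the GitHub REST endpoint that lists its comments."""
    # Expected path shape: /<owner>/<repo>/issues/<number>
    owner, repo, kind, number = urlparse(issue_url).path.strip("/").split("/")
    if kind != "issues":
        raise ValueError(f"Not an issue URL: {issue_url}")
    return f"https://api.github.com/repos/{owner}/{repo}/issues/{number}/comments"


def fetch_comments(issue_url: str, token: str) -> list:
    """Fetch the first page of comments for one issue (network call; not run here)."""
    request = urllib.request.Request(
        comments_api_url(issue_url),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


print(comments_api_url("https://github.com/quarto-dev/quarto-cli/issues/6113"))
```

With thousands of issues, this means one extra request per issue, so rate limits are worth keeping in mind.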

Upload to the Hub

We can upload these issues to the Hub like so. The dataset will be available at https://huggingface.co/datasets/hamel/quarto-issues

ds = Dataset.from_list([x.dict() for x in quarto_issues])
ds.push_to_hub('hamel/quarto-issues')
from datasets import load_dataset
remote_data = load_dataset('hamel/quarto-issues')
Using custom data configuration hamel--quarto-issues-52921768ee5c97fb
Downloading and preparing dataset None/None (download: 1.99 MiB, generated: 4.78 MiB, post-processed: Unknown size, total: 6.77 MiB) to /Users/hamel/.cache/huggingface/datasets/hamel___parquet/hamel--quarto-issues-52921768ee5c97fb/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
Dataset parquet downloaded and prepared to /Users/hamel/.cache/huggingface/datasets/hamel___parquet/hamel--quarto-issues-52921768ee5c97fb/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.
remote_data['train'][0]
{'page_content': "### Bug description\n\nI am running Quarto to produce a report in which I mix python with R (using 95% R). When I started trying to use python, I believe that RStudio is not recognizing my python installation and packages:\r\n\r\n```\r\nError in py_call_impl(callable, call_args$unnamed, call_args$named) : \r\n  ModuleNotFoundError: No module named 'pandas'\r\n```\r\n\r\nAny ideas?\n\n### Steps to reproduce\n\n---\r\ntitle: 'Nota Técnica - Índice Socioeconômico dos Estudantes'\r\nauthor: ''\r\nformat: \r\n  html:\r\n    encoding: 'UTF-8'\r\n    theme: style.css\r\neditor: visual\r\nlang: pt\r\nexecute:\r\n  echo: false\r\n  warning: false\r\n---\r\n\r\n```{python}\r\n#| label: load-py-pckgs\r\nimport pandas as pd\r\nimport seaborn as sns\r\n```\n\n### Expected behavior\n\nI expected it to load python packages and modules with no problems.\n\n### Actual behavior\n\nError when loading any python package.\n\n### Your environment\n\nIDE: RStudio 2022.07.2 Build 576\r\nR version 4.1.0 (2021-05-18) (can't update because of internal packages built with this version)\r\nOS: Windows 11\n\n### Quarto check output\n\n`quarto check` gives me:\r\nC:\\Users\\joao.freire\\Documents>quarto check\r\n\r\n[>] Checking versions of quarto binary dependencies...\r\n      Pandoc version 3.1.1: OK\r\n      Dart Sass version 1.55.0: OK\r\n[>] Checking versions of quarto dependencies......OK\r\n[>] Checking Quarto installation......OK\r\n      Version: 1.3.433\r\n      Path: C:\\Users\\joao.freire\\AppData\\Local\\Programs\\Quarto\\bin\r\n      CodePage: 1252\r\n\r\n[>] Checking basic markdown render....OK\r\n\r\n[>] Checking Python 3 installation....OK\r\n      Version: 3.11.4\r\n      Path: C:/Program Files/Python311/python.exe\r\n      Jupyter: 5.3.1\r\n      Kernels: python3\r\n\r\n(/) Checking Jupyter engine render....0.00s - Debugger warning: It seems that frozen modules are being used, which may\r\n0.00s - make the debugger miss breakpoints. 
Please pass -Xfrozen_modules=off\r\n0.00s - to python to disable frozen modules.\r\n0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this\r\nvalidation.\r\n0.00s - Debugger warning: It seems that frozen modules are being used, which may\r\n0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off\r\n0.00s - to python to disable frozen modules.\r\n0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this\r\nvalidation.\r\n[>] Checking Jupyter engine render....OK\r\n\r\n[>] Checking R installation...........OK\r\n      Version: 4.1.0\r\n      Path: C:/Users/joao.freire/Documents/R/R-4.1.0\r\n      LibPaths:\r\n        - C:/Users/joao.freire/Documents/R/R-4.1.0/library\r\n      knitr: 1.42\r\n      rmarkdown: 2.23\r\n\r\n[>] Checking Knitr engine render......OK\r\n\r\n\r\nC:\\Users\\joao.freire\\Documents>quarto check\r\n\r\n[>] Checking versions of quarto binary dependencies...\r\n      Pandoc version 3.1.1: OK\r\n      Dart Sass version 1.55.0: OK\r\n[>] Checking versions of quarto dependencies......OK\r\n[>] Checking Quarto installation......OK\r\n      Version: 1.3.433\r\n      Path: C:\\Users\\joao.freire\\AppData\\Local\\Programs\\Quarto\\bin\r\n      CodePage: 1252\r\n\r\n[>] Checking basic markdown render....OK\r\n\r\n[>] Checking Python 3 installation....OK\r\n      Version: 3.11.4\r\n      Path: C:/Program Files/Python311/python.exe\r\n      Jupyter: 5.3.1\r\n      Kernels: python3\r\n\r\n```\r\n(\\) Checking Jupyter engine render....0.00s - Debugger warning: It seems that frozen modules are being used, which may\r\n0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off\r\n0.00s - to python to disable frozen modules.\r\n0.00s - Note: Debugging will proceed. 
Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this\r\nvalidation.\r\n0.00s - Debugger warning: It seems that frozen modules are being used, which may\r\n0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off\r\n0.00s - to python to disable frozen modules.\r\n0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this\r\nvalidation.\r\n[>] Checking Jupyter engine render....OK\r\n\r\n[>] Checking R installation...........OK\r\n      Version: 4.1.0\r\n      Path: C:/Users/joao.freire/Documents/R/R-4.1.0\r\n      LibPaths:\r\n        - C:/Users/joao.freire/Documents/R/R-4.1.0/library\r\n      knitr: 1.42\r\n      rmarkdown: 2.23\r\n\r\n[>] Checking Knitr engine render......OK\r\n```",
 'metadata': {'assignee': None,
  'comments': 1,
  'created_at': '2023-07-05T21:11:31Z',
  'creator': 'joaoaugustofrei',
  'is_pull_request': False,
  'labels': ['bug'],
  'locked': False,
  'milestone': None,
  'number': 6113,
  'state': 'open',
  'title': 'Quarto not recognizing python packages',
  'url': 'https://github.com/quarto-dev/quarto-cli/issues/6113'}}
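
Because every record keeps its metadata, it is easy to summarize the issues once they are downloaded. The sketch below counts states and labels with the standard library. The rows are made up and only mimic the shape of the records above (the "docs" label is hypothetical), so the counts say nothing about the real dataset.

```python
from collections import Counter

# Made-up rows shaped like the records above (page_content plus a metadata dict).
rows = [
    {"page_content": "...", "metadata": {"state": "open", "labels": ["bug"]}},
    {"page_content": "...", "metadata": {"state": "closed", "labels": ["bug", "docs"]}},
    {"page_content": "...", "metadata": {"state": "closed", "labels": []}},
]

# Count how many issues are in each state.
state_counts = Counter(row["metadata"]["state"] for row in rows)

# Count how often each label appears across all issues.
label_counts = Counter(
    label for row in rows for label in row["metadata"]["labels"]
)

print(state_counts.most_common())
print(label_counts.most_common())
```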