CSVLoader, JSONLoader

CSVLoader

student.csv

name,age,gender
Bill,10,1
Mary,15,0
Jello,12,1
Boka,11,0
from langchain_community.document_loaders.csv_loader import CSVLoader

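loader = CSVLoader(
    file_path="./data/student.csv",
    # csv_args={"delimiter": ","},  # optional settings for the CSV reader
    encoding="utf-8",
)
# documents = loader.load()  # loads every row into a list at once
# lazy_load() yields one Document per CSV row
for document in loader.lazy_load():
    print(type(document), document)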
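
To make the result of the loop above easier to picture, here is a standard-library sketch of the row-to-document idea. It assumes each row becomes one document whose text is a list of "column: value" lines and whose metadata records the source and the row index. The names rows_to_records and CSV_TEXT are illustrative and not part of LangChain.

```python
import csv
import io

# Same content as student.csv, inlined so the sketch runs without any file.
CSV_TEXT = """name,age,gender
Bill,10,1
Mary,15,0
Jello,12,1
Boka,11,0
"""


def rows_to_records(text, source="student.csv"):
    """Yield one (page_content, metadata) pair per CSV row (illustrative helper)."""
    reader = csv.DictReader(io.StringIO(text))
    for row_index, row in enumerate(reader):
        # Each column becomes a "key: value" line in the content.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        yield content, {"source": source, "row": row_index}


records = list(rows_to_records(CSV_TEXT))
for content, metadata in records:
    print(metadata, repr(content))
```

Because rows_to_records is a generator, rows are produced one at a time, which is the same reason lazy_load() is convenient for large files.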

JSONLoader

Install jq

pip3 install jq

JSON

In the structure below, name is a string, age is an int, hobby is a list, and info is a nested JSON object.
student0.json

{
  "name": "Mary",
  "age": 10,
  "gender": 0,
  "hobby": [
    "singing",
    "swimming",
    "cooking"
  ],
  "info": {
    "tel": "03-111111",
    "address": "Taiwan, Taoyuan City"
  }
}
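
A quick way to confirm these types is to parse the data with Python's built-in json module. The sketch below inlines the content of student0.json so that it runs on its own; note that Python reports the nested JSON object as a dict.

```python
import json

# Content of student0.json, inlined for a self-contained example.
STUDENT0 = """
{
  "name": "Mary",
  "age": 10,
  "gender": 0,
  "hobby": ["singing", "swimming", "cooking"],
  "info": {"tel": "03-111111", "address": "Taiwan, Taoyuan City"}
}
"""

student = json.loads(STUDENT0)

# Map every top-level key to the name of its Python type.
kinds = {key: type(value).__name__ for key, value in student.items()}
print(kinds)
```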

Get all keys and values

from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path="./data/student0.json",  # path to the JSON file
    jq_schema=".",  # . is the root, so every key and value is returned
    text_content=False,
)
documents = loader.load()
print(documents)

In the output below, the content is in page_content, source is the path of the source file, and seq_num is the sequence number of each record.

[Document(metadata={'source': '/Users/cici/PythonProject/AIProject/data/student0.json', 'seq_num': 1}, page_content='{"name": "Mary", "age": 10, "gender": 0, "hobby": ["singing", "swimming", "cooking"], "info": {"tel": "03-111111", "address": "Taiwan, Taoyuan City"}}')]

Get the value of a specific key

from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path="./data/student0.json",  # path to the JSON file
    jq_schema=".name",  # select the value of the key name
    text_content=False,
)
documents = loader.load()
print(documents)

In the output below, page_content holds only the name, source is the path of the source file, and seq_num is the sequence number of each record.

[Document(metadata={'source': '/Users/cici/PythonProject/AIProject/data/student0.json', 'seq_num': 1}, page_content='Mary')]

Get a value through a nested key

from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path="./data/student0.json",  # path to the JSON file
    jq_schema=".info.tel",  # go into info, then select the key tel of that nested object
    text_content=False,
)
documents = loader.load()
print(documents)

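In the output below, page_content holds only the telephone number, source is the path of the source file, and seq_num is the sequence number of each record.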

[Document(metadata={'source': '/Users/cici/PythonProject/AIProject/data/student0.json', 'seq_num': 1}, page_content='03-111111')]
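
The dotted expressions used so far (".", ".name", ".info.tel") can be read as a walk through nested dictionaries. The following is a rough plain-Python analogy and not how jq itself is implemented; get_path is a hypothetical helper that only understands dotted keys.

```python
from functools import reduce

# A trimmed copy of the student0.json data.
student = {
    "name": "Mary",
    "age": 10,
    "info": {"tel": "03-111111", "address": "Taiwan, Taoyuan City"},
}


def get_path(data, path):
    """Follow a dotted path such as '.info.tel' through nested dicts."""
    keys = [key for key in path.split(".") if key]  # "." yields no keys
    return reduce(lambda node, key: node[key], keys, data)


whole = get_path(student, ".")        # the root object itself
name = get_path(student, ".name")     # one top-level key
tel = get_path(student, ".info.tel")  # a key inside a nested object
print(name, tel)
```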

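Get all elements of a list

This selects the whole hobby list. Each element becomes its own Document object, so the result contains three Document objects, with the data shown in page_content.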

from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path="./data/student0.json",  # path to the JSON file
    jq_schema=".hobby[]",  # [] iterates over every element of the list
    text_content=False,
)
documents = loader.load()
print(documents)
[Document(metadata={'source': '/Users/cici/PythonProject/AIProject/data/student0.json', 'seq_num': 1}, page_content='singing'), Document(metadata={'source': '/Users/cici/PythonProject/AIProject/data/student0.json', 'seq_num': 2}, page_content='swimming'), Document(metadata={'source': '/Users/cici/PythonProject/AIProject/data/student0.json', 'seq_num': 3}, page_content='cooking')]
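
The seq_num values in this output can be pictured as a simple enumeration that starts at 1. The sketch below uses plain dictionaries in place of Document objects, and to_records is a made-up helper for illustration only.

```python
def to_records(items, source):
    """Wrap each list element in a dict shaped like the Documents shown above."""
    return [
        {"page_content": item, "metadata": {"source": source, "seq_num": number}}
        for number, item in enumerate(items, start=1)  # numbering starts at 1
    ]


hobbies = ["singing", "swimming", "cooking"]
records = to_records(hobbies, "student0.json")
for record in records:
    print(record)
```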

Get an element by index

Get hobby[1], the second element, because indexes start at 0.

from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path="./data/student0.json",  # path to the JSON file
    jq_schema=".hobby[1]",  # [index] selects a single element by its index
    text_content=False,
)
documents = loader.load()
print(documents)
[Document(metadata={'source': '/Users/cici/PythonProject/AIProject/data/student0.json', 'seq_num': 1}, page_content='swimming')]

Each line is a JSON object

student.json

{"name": "Mary", "age": 10, "gender": 0}
{"name": "Bill", "age": 11, "gender": 1}
{"name": "Jello", "age": 10, "gender": 0}

Set json_lines to True

Each line is then treated as an independent JSON document.

from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path="./data/student.json",  # path to the JSON file
    jq_schema=".",  # start from the root
    text_content=False,  # the extracted content is not a plain string
    json_lines=True,  # each line is an independent JSON document
)
documents = loader.load()
print(documents)

In the result below, the content is in page_content, source is the path of the source file, and seq_num is the sequence number of each line.

[Document(metadata={'source': '/Users/cici/PythonProject/AIProject/data/student.json', 'seq_num': 1}, page_content='{"name": "Mary", "age": 10, "gender": 0}'), Document(metadata={'source': '/Users/cici/PythonProject/AIProject/data/student.json', 'seq_num': 2}, page_content='{"name": "Bill", "age": 11, "gender": 1}'), Document(metadata={'source': '/Users/cici/PythonProject/AIProject/data/student.json', 'seq_num': 3}, page_content='{"name": "Jello", "age": 10, "gender": 0}')]
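
Why the file has to be read line by line can be seen with the standard library alone. In this sketch the content of student.json is inlined as JSONL_TEXT, and parse_json_lines is an illustrative helper: parsing the whole text as a single JSON document fails, while parsing each line separately works.

```python
import json

# Content of student.json: one JSON object per line.
JSONL_TEXT = """{"name": "Mary", "age": 10, "gender": 0}
{"name": "Bill", "age": 11, "gender": 1}
{"name": "Jello", "age": 10, "gender": 0}
"""


def parse_json_lines(text):
    """Parse every non-empty line as its own JSON document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Treating the whole file as one JSON document raises a decode error.
try:
    json.loads(JSONL_TEXT)
    whole_file_ok = True
except json.JSONDecodeError:
    whole_file_ok = False

students = parse_json_lines(JSONL_TEXT)
names = [student["name"] for student in students]
print(whole_file_ok, names)
```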

Get only the value of the key name

from langchain_community.document_loaders import JSONLoader

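loader = JSONLoader(
    file_path="./data/student.json",  # path to the JSON file
    jq_schema=".name",  # select only the value of the key name
    text_content=False,  # the extracted content is not required to be a string
    json_lines=True,  # each line is an independent JSON document
)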
documents = loader.load()
print(documents)

In the result below, the content is in page_content, source is the path of the source file, and seq_num is the sequence number of each line.

[Document(metadata={'source': '/Users/cici/PythonProject/AIProject/data/student.json', 'seq_num': 1}, page_content='Mary'), Document(metadata={'source': '/Users/cici/PythonProject/AIProject/data/student.json', 'seq_num': 2}, page_content='Bill'), Document(metadata={'source': '/Users/cici/PythonProject/AIProject/data/student.json', 'seq_num': 3}, page_content='Jello')]

Multiple JSON objects in a list

student2.json, where the JSON objects are stored in a list.

[
  {"name": "Mary", "age": 10, "gender": 0},
  {"name": "Bill", "age": 11, "gender": 1},
  {"name": "Jello", "age": 10, "gender": 0}
]

Get the value of the key name from every JSON object in the list

from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path= "./data/student2.json",
    jq_schema=".[].name",
    text_content= False
)

documents = loader.load()
print(documents)
[Document(metadata={'source': '/Users/cici/PythonProject/AIProject/data/student2.json', 'seq_num': 1}, page_content='Mary'), Document(metadata={'source': '/Users/cici/PythonProject/AIProject/data/student2.json', 'seq_num': 2}, page_content='Bill'), Document(metadata={'source': '/Users/cici/PythonProject/AIProject/data/student2.json', 'seq_num': 3}, page_content='Jello')]

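Summary

  • json_lines=True: every line of the file is a separate, independent JSON document
  • text_content: states whether the extracted content is a string, and True means it is. Pass text_content=False when the jq result is not a string, as with jq_schema="." above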
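
The effect of text_content can be sketched without LangChain. The helper below is hypothetical and only approximates the behavior seen in the outputs above: strings pass through unchanged, and other values are serialized to a JSON string when text_content is False. Raising an error when text_content is True and the value is not a string is an assumption about the loader, so treat that branch with caution.

```python
import json


def to_page_content(value, text_content):
    """Hypothetical helper: turn an extracted value into page_content text."""
    if isinstance(value, str):
        return value  # strings are used as they are
    if text_content:
        # Assumption: a non-string value is rejected when a string was promised.
        raise ValueError(f"expected a string, got {type(value).__name__}")
    return json.dumps(value)  # dicts, lists and numbers become JSON text


plain = to_page_content("Mary", text_content=True)
serialized = to_page_content({"name": "Mary", "age": 10, "gender": 0}, text_content=False)

try:
    to_page_content({"name": "Mary"}, text_content=True)
    rejected = False
except ValueError:
    rejected = True

print(plain, serialized, rejected)
```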

JSON

  • .name selects only the value of the key name
  • .info.tel selects the value of the key tel inside the nested info object

list:

  • [] selects every element of the list
  • [].name selects the value of the key name from every element of the list
  • [1] selects one specific element by its index
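
For readers who think in Python, the three list expressions correspond roughly to iteration, a list comprehension, and indexing. This is only an analogy on inlined sample data (the STUDENTS and HOBBIES names are illustrative), not a description of how jq evaluates expressions.

```python
# Data shaped like student2.json and the hobby list of student0.json.
STUDENTS = [
    {"name": "Mary", "age": 10, "gender": 0},
    {"name": "Bill", "age": 11, "gender": 1},
    {"name": "Jello", "age": 10, "gender": 0},
]
HOBBIES = ["singing", "swimming", "cooking"]

all_students = list(STUDENTS)                      # like .[]      (every element)
names = [student["name"] for student in STUDENTS]  # like .[].name (one key per element)
second_hobby = HOBBIES[1]                          # like .hobby[1] (index starts at 0)

print(len(all_students), names, second_hobby)
```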
